Remote Direct Data Placement                                  C. Bestler
Working group                                                   Broadcom
Internet-Draft                                                  L. Coene
Expires: March 30, 2006                                          Siemens
                                                      September 26, 2005

 Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
                          Data Placement (DDP)
                  draft-ietf-rddp-applicability-03.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.
   Note that other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 30, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document describes the applicability of Remote Direct Memory
   Access Protocol (RDMAP) and the Direct Data Placement Protocol (DDP).
   It compares and contrasts the different transport options over IP
   that DDP can use, provides guidance to ULP developers on choosing
   between available transports and/or how to be indifferent to the
   specific transport layer used, compares use of DDP with direct use of
   the supporting transports, and compares DDP over IP transports with
   non-IP transports that support RDMA functionality.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Direct Placement
     3.1.  Fewer Required ULP Interactions
     3.2.  Direct Placement using only the LLP
   4.  Tagged Messages
     4.1.  Order Independent Reception
     4.2.  Reduced ULP Notifications
     4.3.  Simplified ULP Exchanges
     4.4.  Order Independent Sending
     4.5.  Tagged Buffers as ULP Credits
   5.  RDMA Read
   6.  LLP Comparisons
     6.1.  Multistreaming Implications
     6.2.  Out of Order Reception Implications
     6.3.  Header and Marker Overhead
     6.4.  Middlebox Support
     6.5.  Processing Overhead
     6.6.  Data Integrity Implications
       6.6.1.  MPA/TCP Specifics
       6.6.2.  SCTP Specifics
     6.7.  Non-IP Transports
       6.7.1.  No RDMA Layer Ack
     6.8.  Other IP Transports
     6.9.  LLP Independent Session Establishment
       6.9.1.  RDMA-only Session Establishment
       6.9.2.  RDMA-Conditional Session Establishment
   7.  Local Interface Implications
   8.  Security considerations
     8.1.  Connection/Association Setup
     8.2.  Tagged Buffer Exposure
     8.3.  Impact of Encrypted Transports
   9.  References
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Remote Direct Memory Access Protocol (RDMAP) and Direct Data
   Placement (DDP) work together to provide application independent
   efficient placement of application payload directly into buffers
   specified by the Upper Layer Protocol (ULP).

   The DDP protocol is responsible for direct placement of received
   payload into ULP specified buffers. The RDMAP protocol provides
   completion notifications to the ULP and support for Data Sink
   initiated fetch of advertised buffers (RDMA Reads).

   DDP and RDMAP are both application independent protocols which allow
   the ULP to perform remote direct data placement. DDP can use
   multiple standard IP transports including SCTP and TCP.

   By clarifying the situations where the functionality of these
   protocols is applicable, this document can guide implementers and
   application and protocol designers in selecting which protocols to
   use.

   The applicability of RDMAP/DDP is driven by their unique
   capabilities:

   o  The existence of an application independent protocol allows common
      solutions to be implemented in hardware and/or the kernel. This
      document will discuss when common data placement procedures are of
      the greatest benefit to applications as contrasted with
      application specific solutions built on top of direct use of the
      underlying transport.

   o  DDP supports both untagged and tagged buffers. Tagged buffers
      allow the Data Sink ULP to be indifferent to what order (or in
      what packets) the Data Source sent the data, or what order they
      are received in. This document will discuss when Data Source
      flexibility is of benefit to applications.

   o  RDMAP consolidates ULP notifications, thereby minimizing the
      number of required ULP interactions.

   o  RDMAP defines RDMA Reads, which allow remote access to advertised
      buffers. This document will review the advantages of using RDMA
      Reads as contrasted to alternate solutions.

   Some non-IP transports, such as InfiniBand, directly integrate RDMA
   features. This document will review the applicability of providing
   RDMA services over ubiquitous IP transports as opposed to the use of
   customized transport protocols.
   Because DDP is defined cleanly as a layer over existing IP
   transports, DDP has simpler ordering rules than some prior RDMA
   protocols. This may have some implications for application
   designers.

   The capabilities of DDP and RDMAP can only be fully realized by
   applications that are designed to exploit them. The co-existence of
   RDMAP/DDP aware local interfaces with traditional socket interfaces
   will also be explored.

   Finally, DDP support is defined for at least two IP transports: SCTP
   and TCP. The rationale for supporting both transports is reviewed,
   as well as when each would be the appropriate selection.

2.  Definitions

   Advertisement - The act of informing a Remote Peer that a local RDMA
      Buffer is available to it. A Node makes available an RDMA Buffer
      for incoming RDMA Read or RDMA Write access by informing its RDMA/
      DDP peer of the Tagged Buffer identifiers (STag, base address, and
      buffer length). This advertisement of Tagged Buffer information
      is not defined by RDMA/DDP and is left to the ULP. A typical
      method would be for the Local Peer to embed the Tagged Buffer's
      Steering Tag, base address, and length in a Send Message destined
      for the Remote Peer.

   Data Sink - The peer receiving a data payload. Note that the Data
      Sink can be required to both send and receive RDMA/DDP Messages to
      transfer a data payload.

   Data Source - The peer sending a data payload. Note that the Data
      Source can be required to both send and receive RDMA/DDP Messages
      to transfer a data payload.

   Lower Layer Protocol (LLP) - The transport protocol that provides
      services to DDP. This is an IP transport with any required
      adaptation layer. Adaptation layers are defined for SCTP and TCP.

   Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
      valid as defined within a protocol specification.

   Tagged Message - A DDP message that is directed to a ULP specified
      buffer based upon embedded addressing information. In the
      immediate sense, the destination buffer is specified by the
      message sender.

   Untagged Message - A DDP message that is directed to a ULP specified
      buffer based upon a Message Sequence Number being matched with a
      receiver supplied buffer. The destination buffer is specified by
      the message receiver.

   Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
      This may be an application, or a middleware layer such as Sockets
      Direct Protocol (SDP) or Remote Procedure Calls (RPC).

3.  Direct Placement

   Direct Data Placement optimizes the placement of ULP payload into the
   correct destination buffers, typically eliminating intermediate
   copying. Placement is enabled without regard to order of arrival or
   order of transmission, and without requiring per-placement
   interaction with the ULP.

   RDMAP minimizes the required ULP interactions. This capability is
   most valuable for applications that require multiple transport layer
   packets for each required ULP interaction.

3.1.  Fewer Required ULP Interactions

   While reducing the number of required ULP interactions is in itself
   desirable, it is critical for high speed connections. The burst
   packet rate for a high speed interface could easily exceed the host
   system's ability to switch ULP contexts.

   Content access applications are primary examples of applications with
   both high bandwidth and high ratios of content to required ULP
   interactions. These applications include file access protocols
   (NAS), storage access (SAN), database access and other application
   specific forms of content access such as HTTP, XML and email.

3.2.  Direct Placement using only the LLP

   Direct data placement can be achieved without RDMA.
   Pre-posting of receive buffers could allow a non-RDMA network stack
   to place data directly to user buffers.

   The degree to which DDP optimizes depends on which transport it is
   being compared with, and on the nature of the local interface.
   Without RDMAP/DDP, pre-posting buffers requires the receiving side to
   accurately predict the required buffers and their sizes. This is not
   feasible for all ULPs. By contrast, DDP only requires the ULP to
   predict the sequence and size of incoming untagged messages.

   An application that could predict incoming messages and required
   nothing more than direct placement into buffers might be able to do
   so with a properly designed local interface to SCTP or TCP. Doing so
   for TCP requires making predictions at a byte level rather than a
   message level.

   The main benefit of DDP for such an application would be that
   pre-posting of receive buffers is a mandated local interface
   capability, and that predictions can be made on a per-message basis
   (not per byte).

   The LLP can also be used directly if ULP specific knowledge is built
   into the protocol stack to allow "parse and place" handling of
   received packets. Such a solution requires either interaction with
   the ULP, or that the protocol stack have knowledge of ULP specific
   syntax rules.

   DDP achieves the benefits of directly placing incoming payload
   without requiring tight coupling between the ULP and the protocol
   stack. However, "parse and place" capabilities can certainly provide
   equivalent services to a limited number of ULPs.

4.  Tagged Messages

   This section covers the major benefits from the use of Tagged
   Messages.

   A more critical advantage of DDP is the ability of the Data Source to
   use tagged buffers. Tagging messages allows the Data Source to
   choose the ordering and packetization of its payload deliveries.
   With direct data placement based solely upon pre-posted receives, the
   packetization and delivery of payload must be agreed upon by the ULP
   peers in advance. Even if there is an encoding of what is being
   transferred, as is common with middleware solutions, this information
   is not understood at the application independent layers. The
   directions on where to place the incoming data cannot be accessed
   without switching to the ULP first. DDP provides a standardized
   'packing list' which can be interpreted without requiring ULP
   interaction. Indeed, it is designed to be implementable in hardware.

4.1.  Order Independent Reception

   Tagged messages are directed to a buffer based on an included
   Steering Tag. Additionally, no notice is provided to the ULP for each
   individual Tagged Message's arrival. Together these allow tagged
   messages received out-of-order to be processed without intermediate
   buffering or additional notifications to the ULP.

4.2.  Reduced ULP Notifications

   RDMAP further reduces required ULP interactions by consolidating
   completion notifications of tagged messages with the completion
   notification of a trailing untagged message. For most ULPs this
   radically reduces the number of required ULP interactions even
   further.

   While RDMAP consolidation of notices is beneficial to most
   applications, it may be detrimental to some applications that benefit
   from streamed delivery to enable ULP processing of received data as
   promptly as possible. A ULP that uses RDMAP cannot begin processing
   any portion of an exchange until it receives notification that the
   entire exchange has been placed. An "exchange" here is a set of zero
   or more tagged messages and a single terminating untagged message.
   An application that would prefer to begin work on the received
   payload as soon as possible, no matter what order it arrived in,
   might prefer to work directly with the LLP.
   RDMAP is optimized for applications that are more concerned with when
   the entire exchange is complete.

   An application that benefits from being able to begin processing of
   each received packet as quickly as possible may find RDMAP interferes
   with that goal.

   Such an application might be able to retain most of the benefits of
   RDMAP by using the DDP layer directly. However, in addition to
   taking on the responsibilities of the RDMAP layer, the application
   would likely have more difficulty finding support for a DDP-only API.
   Many hardware implementations may choose to tightly couple RDMAP and
   DDP, and might not provide an API directly to DDP services.

   These features minimize the required interactions with the ULP. This
   can be extremely beneficial for applications that use multiple
   transport layer packets to accomplish what is a single ULP
   interaction.

4.3.  Simplified ULP Exchanges

   The notification rules for Tagged Messages allow ULPs to create
   multi-message "exchanges" consisting of zero or more tagged messages
   followed by a single untagged message, which together represent a
   single step in the ULP interaction. The receiving ULP is notified
   that the untagged message has arrived, and implicitly of any
   associated tagged messages.

   A ULP where all exchanges would naturally be only the untagged
   message would derive virtually no benefit from the use of RDMAP/DDP
   as opposed to SCTP. But while tagged buffers are the justification
   for RDMAP/DDP, untagged buffers are still necessary. Without
   untagged buffers the only method to exchange buffer advertisements
   would involve out-of-band communications and/or sharing of compile
   time constants. Most RDMA-aware ULPs use untagged buffers for
   requests and responses. Buffer advertisements are typically done
   within these untagged messages.
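
   As an illustration of a buffer advertisement carried inside an
   untagged request, the following Python sketch packs the three Tagged
   Buffer identifiers named in the Definitions section (STag, base
   address, and buffer length) into a request body. Everything about
   the wire layout is an assumption made for this example: RDMAP/DDP
   leave the advertisement format to the ULP, and the field widths, the
   one-byte opcode and count header, and the names BufferAdvertisement,
   build_request and parse_request are all hypothetical.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout for one advertisement: 32-bit STag, 64-bit base
# address, 32-bit length, in network byte order. Illustration only; the
# real layout is whatever the ULP specifies.
_ADVERT = struct.Struct("!IQI")


@dataclass(frozen=True)
class BufferAdvertisement:
    stag: int          # Steering Tag identifying the Tagged Buffer
    base_address: int  # base address of the advertised buffer
    length: int        # buffer length in bytes

    def pack(self) -> bytes:
        return _ADVERT.pack(self.stag, self.base_address, self.length)

    @classmethod
    def unpack(cls, data: bytes) -> "BufferAdvertisement":
        return cls(*_ADVERT.unpack(data))


def build_request(opcode: int, adverts: list) -> bytes:
    """Build an untagged request body: opcode, count, then advertisements."""
    header = struct.pack("!BB", opcode, len(adverts))
    return header + b"".join(a.pack() for a in adverts)


def parse_request(body: bytes):
    """Recover the opcode and the advertisements from a request body."""
    opcode, count = struct.unpack_from("!BB", body, 0)
    adverts = []
    for i in range(count):
        start = 2 + i * _ADVERT.size
        adverts.append(BufferAdvertisement.unpack(body[start:start + _ADVERT.size]))
    return opcode, adverts
```

   The receiver of such a request learns where it may place results
   without any out-of-band agreement, which is the role the paragraph
   above assigns to untagged messages.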

   Limiting use of untagged buffers to requests and responses by moving
   all bulk data using tagged transfers can greatly reduce the amount
   of prediction that the Data Sink must perform in pre-posting receive
   buffers. For example, a typical RDMA enabled interaction would
   consist of the following:

      Client sends a transaction request to the server as an untagged
      message.

      This message includes buffer advertisements for the buffers where
      the results are to be placed.

      The Server sends multiple tagged messages to the advertised
      buffers.

      The Server sends a transaction reply as an untagged message to the
      client.

      Client receives a single notification, indicating completion of
      the interaction.

   With this type of exchange the pacing and required size of untagged
   buffers is highly predictable. The variability of response sizes is
   absorbed by tagged transfers.

4.4.  Order Independent Sending

   Use of tagged messages is especially applicable when the Data Sink
   does not know the actual size, structure or location of the content
   it is requesting (or updating).

   For example, suppose the Data Sink ULP needs to fetch four related
   pieces of data into four separate buffers. With SCTP the Data Sink
   ULP could receive four messages into four separate buffers, only
   having to predict the maximum size of each. However, it would have
   to dictate the order in which the Data Source supplied the separate
   pieces. If the Data Source found it advantageous to fetch them in a
   different order, it would have to use intermediate buffering to
   re-order the pieces into the expected order even though the
   application only required that all four be delivered and did not
   truly have an ordering requirement.

   Techniques such as RAID striping and mirroring represent this same
   problem, but one step further.
   What appears to be a single resource to the Data Sink is actually
   stored in separate locations by the Data Source. Non-RDMA protocols
   would either require the Data Source to fetch the material in the
   desired order or force the Data Source to use its own holding buffers
   to assemble an image of the destination buffer.

   While sometimes referred to as a "buffer-to-buffer" solution, RDMA
   more fundamentally enables remote buffer access. The ULP is free to
   work with larger remote buffers than it has locally. This reduces
   buffering requirements and the number of times the data must be
   copied in an end-to-end transfer.

   There are numerous reasons why the Data Sink would not know the true
   order or location of the requested data. The causes include data
   that differs for each client, different records selected and/or
   different sort orders, RAID striping, file fragmentation, volume
   fragmentation, volume mirroring and server-side dynamic compositing
   of content (such as server side includes for HTTP).

   In all of these cases the Data Source is free to assemble the desired
   data in the Data Sink's buffer in whatever order the component data
   becomes available to it. It is not constrained on ordering. It does
   not have to assemble an image in its own memory before creating it in
   the Data Sink's buffers.

   Note that while DDP enables use of tagged messages for bulk transfer,
   there are some application scenarios where untagged messages would
   still be used for bulk transfer. For example, under the Direct
   Access File Server (DAFS) protocol the file server does not expose
   its own memory to its clients. A client wishing to write may
   advertise a buffer upon which the server will issue RDMA Reads.
   However, when performing a small write it may be preferable to
   include the data in the untagged message rather than incurring an
   additional round trip with the RDMA Read and its response.

4.5.  Tagged Buffers as ULP Credits

   The handling of end-to-end buffer credits differs considerably
   between DDP and direct use of either TCP or SCTP by the ULP.

   With both TCP and SCTP, buffer credits are based upon the receiver
   granting transmit permission based on the total number of bytes.
   These credits reflect system buffering resources and/or simple flow
   control. They do not represent ULP resources.

   DDP defines no standard flow control, but presumes the existence of a
   ULP mechanism. The presumed mechanism is that the Data Sink ULP has
   issued credits to the Data Source allowing the Data Source to send a
   specific number of untagged messages.

   The ULP peers must ensure that the sender is aware of the maximum
   size that can be sent to any specific target buffer. One method of
   doing so is to use a standard size for all untagged buffers within a
   given connection. For example, DAFS specifies an initial size
   requirement for session establishment, during which the untagged
   buffer size for the remainder of the session is negotiated.

   Tagged buffers are ULP resources advertised directly from ULP to ULP.
   A DDP put to a known tagged buffer is constrained only by transport
   level flow control, not by available system buffering.

   Either tagged or untagged buffers allow bypassing of system buffer
   resources. Use of tagged buffers additionally allows the Data Source
   to choose what order to exercise the credits in.

   To the extent allowed by the ULP, tagged buffers are also divisible
   resources. The Data Sink can advertise a single 100 KB buffer, and
   then receive notifications from its peer that it had written 50 KB,
   20 KB and 30 KB to that buffer in three successive transactions.

   ULP-management of tagged buffer resources, independent of transport
   and DDP layer credits, is an additional benefit of RDMA protocols.
   Large bulk transfers cannot be blocked by limited general purpose
   buffering capacity. Applications can flow control based upon higher
   level abstractions, such as number of outstanding requests,
   independent of the amount of data that must be transferred.

   However, use of system buffering, as offered by direct use of the
   underlying transports, can be preferable under certain circumstances.

   One example would be when the number of target ULP buffers is
   sufficiently large, and the rate at which any writes arrive is
   sufficiently low, that pinning all the target ULP buffers in memory
   would be undesirable. The maximum transfer rate, and hence the
   maximum amount of system buffering required, may be more stable and
   predictable than the total ULP buffer exposure.

   Another would be when the Data Sink wishes to receive a stream of
   data at a predictable rate, but does not know in advance what the
   size of each data packet will be. This is common with streaming
   media that has been encoded with a variable bit rate. With DDP the
   Data Sink would either have to use untagged buffers large enough for
   the largest packet, or advertise a circular buffer. If for security
   or other reasons the Data Sink did not want the size of its buffer to
   be publicly known, using the underlying SCTP transport directly may
   be preferable because of its byte-oriented credits.

5.  RDMA Read

   RDMA Reads are a further service provided by RDMAP. RDMA Reads allow
   the Data Sink to fetch exactly the portion of the peer ULP buffer
   required on a "just in time" basis. This can be done without
   requiring per-fetch support from the Data Source ULP.

   Storage servers may wish to limit the maximum write buffer allocated
   to any single session.
   The storage server may be a very minimal layer between the client and
   the disk storage media, or the server may merely wish to limit the
   total resources that would be required if all clients could push the
   entire payload they wished written at their own convenience.

   In either case, there is little benefit in transferring data from the
   Data Source far in advance of when it will be written to the
   persistent storage media. RDMA Reads allow the Storage Server to
   fetch the payload on a "just in time" basis. In this fashion a
   relatively small number of block sized buffers can be used to execute
   a single transaction that specified writing a large file, or a
   Storage Server with numerous clients can fetch buffers from the
   individual clients in the order that is most convenient to the
   server.

   This same capability can be used when the desired portion of the
   advertised buffer is not known in advance. For example, the
   advertised buffer could contain performance statistics. The Data
   Sink could request the portions of the data it required, without
   requiring an interaction with the Data Source ULP.

   This is applicable for many applications that publish semi-volatile
   data that does not require transactional validity checking (i.e.,
   authorized users have read access to the entire set of data). It is
   less applicable when there are ULP consistency checks that must be
   performed upon the data. Such applications would be better served by
   having the client send a request, and having the server use RDMA
   Writes to publish the requested data. Neither RDMAP nor DDP provides
   mechanisms for bundling multiple disjoint updates into an atomic
   operation. Therefore use of an advertised buffer as a data resource
   is subject to the same caveats as any randomly updated data resource,
   such as flat files, that does not enforce its own consistency.

6.  LLP Comparisons

   Normally the choice of underlying IP transport is irrelevant to the
   ULP. RDMAP and DDP provide the same services over either. There
   may be performance impacts of the choice, however. It is the
   responsibility of the ULP to determine which IP transport is best
   suited to its needs.

   SCTP provides for preservation of message boundaries. Each DDP
   segment will be delivered within a single SCTP packet. The
   equivalent services are only available with TCP through the use of
   the MPA adaptation layer.

6.1.  Multistreaming Implications

   SCTP also provides multi-streaming. When the same pair of hosts has
   need for multiple DDP streams this can be a major advantage. A
   single SCTP association carries multiple DDP streams, consolidating
   connection setup, congestion control and acknowledgements.

   Completions are controlled by the DDP Source Sequence Number
   (DDP-SSN) on a per stream basis. Therefore combining multiple DDP
   Streams into a single SCTP association cannot result in a dropped
   packet carrying data for one stream delaying completions on others.

6.2.  Out of Order Reception Implications

   The use of unordered Data Chunks with SCTP guarantees that the DDP
   layer will be able to perform placements when IP datagrams are
   received out of order.

   Placement of out-of-order DDP Segments carried over MPA/TCP is not
   guaranteed, but is certainly allowed. The ability of the MPA
   receiver to process out-of-order DDP Segments may be impaired when
   alignment of TCP segments and MPA FPDUs is lost. Using SCTP, each
   DDP Segment is encoded in a single Data Chunk and never spread over
   multiple IP datagrams.

6.3.  Header and Marker Overhead

   MPA and TCP headers together are smaller than the headers used by
   SCTP and its adaptation layer. However, this advantage can be
   considerably reduced by the insertion of MPA markers.
In any event the difference in ULP payload per IP Datagram is not likely to be a significant factor.

6.4. Middlebox Support

Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP will appear to all network middleboxes as a normal TCP connection. In many environments there may be a requirement to use only TCP connections to satisfy existing network elements and/or to facilitate monitoring and control of connections. While SCTP is certainly just as monitorable and controllable as TCP, there is no guarantee that the network management infrastructure has the required support for both.

6.5. Processing Overhead

A DDP stream delivered via MPA/TCP will require more processing effort than one delivered over SCTP. However, this extra work may be justified for many deployments where full SCTP support is unavailable in the endpoints of the network, or where middleboxes impair the usability of SCTP.

6.6. Data Integrity Implications

Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c protection against data corruption, or its equivalent.

A ULP that requires a greater degree of protection may add its own. However, DDP and RDMAP headers will only be guaranteed to have the equivalent of end-to-end CRC32c protection. A ULP that requires data integrity checking more thorough than an end-to-end CRC32c should first invalidate all STags that reference a buffer before applying its own integrity check.

6.6.1. MPA/TCP Specifics

It is mandatory for MPA/TCP implementations to implement CRC32c, but it is NOT mandatory to use the CRC32c during an RDMA connection. Activating or deactivating the CRC in MPA/TCP is an administrative configuration operation at the local and remote ends. The administration of the CRC (ON/OFF) is invisible to the ULP.
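
As an illustration of the check that this option enables or disables, the following is a minimal Python sketch of a bitwise CRC32c computation (the Castagnoli polynomial, in its reflected form 0x82F63B78). The function name crc32c and its chaining parameter are assumptions of this sketch. It does not reproduce the MPA FPDU layout or any particular implementation, which would normally use a table-driven or hardware-assisted CRC.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c (Castagnoli). Pass a previous result as `crc` to chain."""
    crc ^= 0xFFFFFFFF                      # undo the final XOR of a prior call
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Reflected algorithm: shift right, fold in the polynomial on carry.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def payload_intact(payload: bytes, received_crc: int) -> bool:
    """Receiver-side check: recompute and compare with the value carried."""
    return crc32c(payload) == received_crc


check_value = crc32c(b"123456789")         # commonly published test input
```

For the ASCII string "123456789" the function returns 0xE3069283, the commonly published check value for CRC32c, and payload_intact() returns False if any single bit of the payload is flipped in transit.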

Applications SHOULD trust that the administrative option to disable the CRC will only be used when the end-to-end protection is at least as effective as a transport layer CRC32c. Applications SHOULD NOT apply additional protection as a guard against this administrative option being turned on inadvertently.

Administrators MUST NOT enable CRC32c suppression unless the end-to-end protection is truly equivalent.

If the CRC is active/used for one direction/end, then the use of the CRC is mandatory in both directions/ends.

If both ends have been configured NOT to use the CRC, then this is allowed as long as an equivalent protection (comparable to or better than the CRC) from undetected errors on the connection is provided.

6.6.2. SCTP Specifics

SCTP provides CRC32c protection automatically. The adaptation to SCTP provides no option to suppress SCTP CRC32c protection.

6.7. Non-IP Transports

DDP is defined to operate over ubiquitous IP transports such as SCTP and TCP. This enables a new DDP-enabled node to be added anywhere to an IP network. No DDP-specific support from middleboxes is required.

There are non-IP transport fabrics offering RDMA capabilities. Because these capabilities are integrated with the transport protocol they have some technical advantages when compared to RDMA over IP. For example, fencing of RDMA operations can be based upon transport level acks. Because DDP is cleanly layered over an IP transport, any explicit RDMA layer ack must be separate from the transport layer ack.

There may be deployments where the benefits of RDMA/transport integration outweigh the benefits of being on an IP network.

6.7.1. No RDMA Layer Ack

DDP does not provide for its own acknowledgements. The only form of ack provided at the RDMAP layer is an RDMA Read Response. DDP and RDMAP rely almost entirely upon other layers for flow control and pacing.
The LLP is relied upon to guarantee delivery and avoid network congestion, and ULP level acking is relied upon for ULP pacing and to avoid ULP buffer overruns.

Previous RDMA protocols, such as InfiniBand, have been able to use their integration with the transport layer to provide stronger ordering guarantees. It is important that application designers who require such guarantees provide them through ULP interaction.

Specifically:

   There is no ability for a local interface to "fence" outbound messages to guarantee that prior tagged messages have been placed prior to sending a tagged message. The only guarantees available from the other side would be an RDMA Read Response (coming from the RDMAP layer) or a response from the ULP layer. Remember that the normal ordering rules only guarantee when the Data Sink ULP will be notified of untagged messages; they do not control when data is placed into receive buffers.

   Re-use of tagged buffers must be done with extreme care. The fact that an untagged message indicates that all prior tagged messages have been placed does not guarantee that no later tagged message has. The best strategy is to change the state of any given advertised buffer only with untagged messages.

   As covered elsewhere in this document, flow control of untagged messages MUST be provided by the ULP itself.

6.8. Other IP Transports

Both TCP and SCTP provide DDP with reliable transport with TCP-friendly rate control. As currently defined, DDP works over reliable transports and implicitly relies upon some form of rate control.

DDP is fully compatible with a non-reliable protocol. Out-of-order placement is obviously not dependent on whether the other DDP Segments ever actually arrive.

However, RDMAP requires the LLP to provide reliable service.
An alternate completion handling protocol would be required if DDP were to be deployed over an unreliable IP transport.

As noted in the prior section on tagged buffers as ULP credits, neither RDMAP nor DDP provides any flow control for tagged messages. If no transport layer flow control is provided, an RDMAP/DDP application would be limited only by the link layer rate, almost inevitably resulting in severe network congestion.

RDMAP encourages applications to be ignorant of the underlying transport PMTU. The ULP is only notified when all messages ending in a single untagged message have completed. The ULP is not aware of the granularity or ordering of the underlying messages. This approach assumes that the ULP is only interested in the complete set of messages, and has no use for a subset of them.

6.9. LLP Independent Session Establishment

For an RDMAP/DDP application, a pair of SCTP Streams and a TCP connection provide the same transport service (reliable delivery of DDP Segments between two connected RDMAP/DDP endpoints).

6.9.1. RDMA-only Session Establishment

It is also possible to allow for transport neutral establishment of RDMAP/DDP sessions between endpoints. Combined, these two features would allow most applications to be unconcerned as to which LLP was actually in use.

Specifically, the procedures for DDP Stream Session establishment discussed in section 3 of the SCTP mapping, and section 13.3 of the MPA/TCP mapping, both allow for the exchange of ULP specific data ("Private Data") before enabling the exchange of DDP Segments. This delay can allow for proper selection and/or configuration of the endpoints based upon the exchanged data. For example, each DDP Stream Session associated with a single client session might be assigned to the same DDP Protection Domain.
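
The Private Data exchange described above can be sketched as follows. This is a hypothetical Python illustration only: the blob layout (a 4-byte client session identifier followed by opaque ULP options), the names pack_private_data, unpack_private_data and ProtectionDomainMap, and the integer Protection Domain handles are all assumptions of this sketch and are not defined by either mapping. The 512-byte ceiling is the Private Data limit this document states for both SCTP and MPA.

```python
import struct

MAX_PRIVATE_DATA = 512  # limit this document gives for both SCTP and MPA


def pack_private_data(client_session_id: int, ulp_options: bytes = b"") -> bytes:
    """Build a Private Data blob (hypothetical layout, not from any mapping)."""
    blob = struct.pack("!I", client_session_id) + ulp_options
    if len(blob) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 bytes")
    return blob


def unpack_private_data(blob: bytes):
    """Return (client_session_id, ulp_options) from a received blob."""
    if not 4 <= len(blob) <= MAX_PRIVATE_DATA:
        raise ValueError("malformed Private Data")
    (client_session_id,) = struct.unpack("!I", blob[:4])
    return client_session_id, blob[4:]


class ProtectionDomainMap:
    """Give every DDP Stream Session of one client session the same PD."""

    def __init__(self):
        self._pd_by_session = {}

    def pd_for(self, blob: bytes) -> int:
        session_id, _options = unpack_private_data(blob)
        if session_id not in self._pd_by_session:
            # Allocate a new (hypothetical) Protection Domain handle.
            self._pd_by_session[session_id] = len(self._pd_by_session) + 1
        return self._pd_by_session[session_id]
```

Because the blob is opaque to both LLPs, the same ULP code can run over either mapping; only the call that carries the blob during session establishment differs.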

To be transport neutral, the applications should exchange Private Data as part of session establishment messages to determine how the RDMA endpoints are to be configured. One side must be the Initiator, and the other the Responder.

With SCTP, a pair of SCTP streams can be used for sequential sessions. With MPA/TCP each connection can be used for at most one session. However, the same source/destination pair of ports can be re-used sequentially subject to normal TCP rules.

Both SCTP and MPA limit the private data size to a maximum of 512 bytes.

MPA/TCP requires the end of the TCP connection that initiated the conversion to MPA mode to send the first DDP Segment. SCTP does not have this requirement. ULPs which wish to be transport neutral should require the initiating end to send the first message. A zero-length RDMA Write can be used for this purpose if the ULP logic itself does not naturally support this restriction.

6.9.2. RDMA-Conditional Session Establishment

It is sometimes desirable for the active side of a session to connect with the passive side before knowing whether the passive side supports RDMA.

This style of session establishment can be supported with either TCP or SCTP, but not as transparently as for RDMA-only sessions. Pre-existing non-RDMA servers are also far more likely to be using TCP than SCTP.

With TCP, a normal TCP connection is established. It is then used by the ULP to determine whether or not to convert to MPA mode and use RDMA. This will typically be integral with other session establishment negotiations.

With SCTP, the establishment of an association tests whether RDMA is supported. If not supported, the application simply requests the association without the RDMA adaptation indication.

The key difference is that with SCTP the determination as to whether the peer can support RDMA is made before the transport layer association/connection is established, while with TCP the established connection itself is used to determine whether RDMA is supported.

7. Local Interface Implications

Full utilization of DDP and RDMAP capabilities requires a local interface that explicitly requests these services. Protocols such as Sockets Direct Protocol (SDP) can allow applications to keep their traditional byte-stream or message-stream interface and still enjoy many of the benefits of the optimized wire level protocols.

8. Security Considerations

8.1. Connection/Association Setup

Both the SCTP and TCP adaptations allow for existing procedures to be followed for the establishment of the SCTP association or TCP connection. Use of DDP does not impair the use of any security measures to filter, validate and/or log the remote end of an association/connection.

8.2. Tagged Buffer Exposure

DDP only exposes ULP memory to the extent explicitly allowed by ULP actions. These include posting of receive operations and enabling of Steering Tags.

Neither RDMAP nor DDP places requirements on how ULPs advertise buffers. A ULP may use a single Steering Tag for multiple buffer advertisements. However, the ULP should be aware that enforcement on STag usage is likely limited to the overall range that is enabled. If the remote peer writes into the 'wrong' advertised buffer, neither the DDP nor the RDMAP layer will be aware of this. Nor is there any report to the ULP on how the remote peer specifically used tagged buffers.
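
The range-only enforcement described above can be illustrated with a toy model. The following Python sketch is not an RNIC or verbs interface: the STagWindow class, its fields, and the 4096-byte buffer sizes are hypothetical, and real enforcement details are implementation specific. It only shows why a write to the 'wrong' buffer goes unnoticed when one STag covers several advertisements, and how a distinct STag that is invalidated after use narrows that exposure.

```python
from dataclasses import dataclass


@dataclass
class STagWindow:
    """Toy model of an enabled STag: only the overall range is enforced."""
    length: int
    valid: bool = True

    def permits(self, offset: int, size: int) -> bool:
        # The check knows nothing about which advertised buffer was intended.
        return (self.valid and offset >= 0 and size >= 0
                and offset + size <= self.length)

    def invalidate(self) -> None:
        self.valid = False


# One STag shared by two advertised 4096-byte buffers: A at 0, B at 4096.
shared = STagWindow(length=8192)
# The peer was asked to fill buffer A but targets buffer B instead.
wrong_write_accepted = shared.permits(offset=4096, size=4096)

# A distinct STag per interaction confines the peer to the one buffer.
only_a = STagWindow(length=4096)
stray_write_accepted = only_a.permits(offset=4096, size=4096)

# Once the interaction is over, invalidation closes the window entirely.
only_a.invalidate()
late_write_accepted = only_a.permits(offset=0, size=4096)
```

In this model the misdirected write is accepted under the shared STag, rejected under the per-interaction STag, and any write after invalidation is rejected.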

Unless the ULP peers have an adequate basis for mutual trust, the receiving ULP might be well advised to use a distinct STag for each interaction, and either to invalidate it after each use or to require its peer to use the RDMAP option to invalidate the STag with its responding untagged message.

8.3. Impact of Encrypted Transports

While DDP is cleanly layered over the LLP, its maximum benefit may be limited when the LLP Stream is secured with a streaming cypher, such as Transport Layer Security (TLS). If the LLP must decrypt in order, it cannot provide out-of-order DDP Segments to the DDP layer for placement purposes. IPsec tunnel mode encrypts entire IP Datagrams. IPsec transport mode encrypts TCP Segments or SCTP packets. In neither case should IPsec preclude providing out-of-order DDP Segments to the DDP layer for placement.

Note that end-to-end use of IPsec cryptographic integrity protection may allow suppression of MPA CRC generation and checking under certain circumstances. This is one example where the LLP may be judged to have "or equivalent" protection to an end-to-end CRC32c.

9. References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[2] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0", RFC 2246, January 1999.

[3] Kent, S. and R. Atkinson, "IP Encapsulating Security Payload (ESP)", RFC 2406, November 1998.

[4] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, "Stream Control Transmission Protocol", RFC 2960, October 2000.

[5] Coene, L., "Stream Control Transmission Protocol Applicability Statement", RFC 3257, April 2002.

[6] Recio, R., "An RDMA Protocol Specification", draft-ietf-rddp-rdmap-05 (work in progress), July 2005.

[7] Shah, H., "Direct Data Placement over Reliable Transports", draft-ietf-rddp-ddp-05 (work in progress), July 2005.

[8] Stewart, R., "Stream Control Transmission Protocol (SCTP) Remote Direct Memory Access (RDMA) Direct Data Placement (DDP) Adaptation", draft-ietf-rddp-sctp-02 (work in progress), August 2005.

[9] Culley, P., "Marker PDU Aligned Framing for TCP Specification", draft-ietf-rddp-mpa-02 (work in progress), February 2005.

[10] "Direct Access File System version 1.0", September 2001.

[11] Pinkerton, J., "Sockets Direct Protocol (SDP) for iWARP over TCP 1.0", October 2003.

Authors' Addresses

Caitlin Bestler
Broadcom
49 Discovery
Irvine, CA 92618
USA

Phone: 949-926-6383
Email: caitlinb@broadcom.com

Lode Coene
Siemens
Atealaan 26
Herentals, 2200
Belgium

Phone: +32-14-252081
Email: lode.coene@siemens.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.