idnits 2.17.1 

draft-ietf-rddp-applicability-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 976.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 953.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 960.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 966.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (December 5, 2005) is 6716 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: '1' is defined on line 886, but no explicit reference
     was found in the text

  == Unused Reference: '2' is defined on line 889, but no explicit reference
     was found in the text

  == Unused Reference: '3' is defined on line 892, but no explicit reference
     was found in the text

  == Unused Reference: '4' is defined on line 895, but no explicit reference
     was found in the text

  == Unused Reference: '5' is defined on line 899, but no explicit reference
     was found in the text

  == Unused Reference: '6' is defined on line 902, but no explicit reference
     was found in the text

  == Unused Reference: '7' is defined on line 905, but no explicit reference
     was found in the text

  == Unused Reference: '8' is defined on line 908, but no explicit reference
     was found in the text

  == Unused Reference: '9' is defined on line 913, but no explicit reference
     was found in the text

  ** Obsolete normative reference: RFC 2246 (ref. '2') (Obsoleted by RFC 4346)

  ** Obsolete normative reference: RFC 2406 (ref. '3') (Obsoleted by RFC
     4303, RFC 4305)

  ** Obsolete normative reference: RFC 2960 (ref. '4') (Obsoleted by RFC 4960)

  ** Downref: Normative reference to an Informational RFC: RFC 3257 (ref. '5')

  == Outdated reference: A later version (-07) exists of
     draft-ietf-rddp-rdmap-05

  == Outdated reference: A later version (-07) exists of
     draft-ietf-rddp-ddp-05

  == Outdated reference: A later version (-07) exists of
     draft-ietf-rddp-sctp-02

  == Outdated reference: A later version (-08) exists of
     draft-ietf-rddp-mpa-02

  == Outdated reference: A later version (-08) exists of
     draft-ietf-nfsv4-nfsdirect-02


     Summary: 7 errors (**), 0 flaws (~~), 17 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Remote Direct Data Placement                                  C. Bestler
3	Working group                                                   Broadcom
4	Internet-Draft                                                  L. Coene
5	Expires: June 8, 2006                                            Siemens
6	                                                        December 5, 2005

8	Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
9	                          Data Placement (DDP)
10	                  draft-ietf-rddp-applicability-05.txt

12	Status of this Memo

14	   By submitting this Internet-Draft, each author represents that any
15	   applicable patent or other IPR claims of which he or she is aware
16	   have been or will be disclosed, and any of which he or she becomes
17	   aware will be disclosed, in accordance with Section 6 of BCP 79.

19	   Internet-Drafts are working documents of the Internet Engineering
20	   Task Force (IETF), its areas, and its working groups.  Note that
21	   other groups may also distribute working documents as Internet-
22	   Drafts.

24	   Internet-Drafts are draft documents valid for a maximum of six months
25	   and may be updated, replaced, or obsoleted by other documents at any
26	   time.  It is inappropriate to use Internet-Drafts as reference
27	   material or to cite them other than as "work in progress."

29	   The list of current Internet-Drafts can be accessed at
30	   http://www.ietf.org/ietf/1id-abstracts.txt.

32	   The list of Internet-Draft Shadow Directories can be accessed at
33	   http://www.ietf.org/shadow.html.

35	   This Internet-Draft will expire on June 8, 2006.

37	Copyright Notice

39	   Copyright (C) The Internet Society (2005).

41	Abstract

43	   This document describes the applicability of Remote Direct Memory
44	   Access Protocol (RDMAP) and the Direct Data Placement Protocol (DDP).
45	   It compares and contrasts the different transport options over IP
46	   that DDP can use, provides guidance to ULP developers on choosing
47	   between available transports and/or how to be indifferent to the
48	   specific transport layer used, compares use of DDP with direct use of
49	   the supporting transports, and compares DDP over IP transports with
50	   non-IP transports that support RDMA functionality.

52	Table of Contents

54	   1.  Introduction
55	   2.  Definitions
56	   3.  Direct Placement
57	     3.1.  Fewer Required ULP Interactions
58	     3.2.  Direct Placement using only the LLP
59	   4.  Tagged Messages
60	     4.1.  Order Independent Reception
61	     4.2.  Reduced ULP Notifications
62	     4.3.  Simplified ULP Exchanges
63	     4.4.  Order Independent Sending
64	     4.5.  Untagged Messages and Tagged Buffers as ULP Credits
65	   5.  RDMA Read
66	   6.  LLP Comparisons
67	     6.1.  Multistreaming Implications
68	     6.2.  Out of Order Reception Implications
69	     6.3.  Header and Marker Overhead
70	     6.4.  Middlebox Support
71	     6.5.  Processing Overhead
72	     6.6.  Data Integrity Implications
73	       6.6.1.  MPA/TCP Specifics
74	       6.6.2.  SCTP Specifics
75	     6.7.  Non-IP Transports
76	       6.7.1.  No RDMA Layer Ack
77	     6.8.  Other IP Transports
78	     6.9.  LLP Independent Session Establishment
79	       6.9.1.  RDMA-only Session Establishment
80	       6.9.2.  RDMA-Conditional Session Establishment
81	   7.  Local Interface Implications
82	   8.  IANA Considerations
83	   9.  Security considerations
84	     9.1.  Connection/Association Setup
85	     9.2.  Tagged Buffer Exposure
86	     9.3.  Impact of Encrypted Transports
87	   10. References
88	     10.1. Normative references
89	     10.2. Informative References
90	   Authors' Addresses
91	   Intellectual Property and Copyright Statements

93	1.  Introduction

95	   Remote Direct Memory Access Protocol (RDMAP) and Direct Data
96	   Placement (DDP) work together to provide application independent
97	   efficient placement of application payload directly into buffers
98	   specified by the Upper Layer Protocol (ULP).

100	   The DDP protocol is responsible for direct placement of received
101	   payload into ULP specified buffers.  The RDMAP protocol provides
102	   completion notifications to the ULP and support for Data Sink
103	   initiated fetch of advertised buffers (RDMA Reads).

105	   DDP and RDMAP are both application independent protocols which allow
106	   the ULP to perform remote direct data placement.  DDP can use
107	   multiple standard IP transports including SCTP and TCP.

109	   By clarifying the situations where the functionality of these
110	   protocols is applicable, this document can guide implementers,
111	   application and protocol designers in selecting which protocols to
112	   use.

114	   The applicability of RDMAP/DDP is driven by their unique
115	   capabilities:

117	   o  The existence of an application independent protocol allows common
118	      solutions to be implemented in hardware and/or the kernel.  This
119	      document will discuss when common data placement procedures are of
120	      the greatest benefit to applications as contrasted with
121	      application specific solutions built on top of direct use of the
122	      underlying transport.

124	   o  DDP supports both untagged and tagged buffers.  Tagged buffers
125	      allow the Data Sink ULP to be indifferent to what order (or in
126	      what messages) the Data Source sent the data, or what order
127	      packets are received in.  Typically tagged data can be used for
128	      payload transfer, while untagged is best used for control
129	      messages.  However each upper layer protocol can determine the
130	      optimal use of tagged and untagged messages for itself.  This
131	      document will discuss when Data Source flexibility is of benefit
132	      to applications.

134	   o  RDMAP consolidates ULP notifications, thereby minimizing the
135	      number of required ULP interactions.

137	   o  RDMAP defines RDMA Reads, which allow remote access to advertised
138	      buffers.  This document will review the advantages of using RDMA
139	      Reads as contrasted to alternate solutions.

141	   Some non-IP transports, such as InfiniBand, directly integrate RDMA
142	   features.  This document will review the applicability of providing
143	   RDMA services over ubiquitous IP transports as opposed to the use of
144	   customized transport protocols.  Because DDP is defined cleanly as
145	   a layer over existing IP transports, DDP has simpler ordering rules
146	   than some prior RDMA protocols.  This may have some implications
147	   for application designers.

149	   The capabilities of DDP and RDMAP can only be fully realized by
150	   applications that are designed to exploit them.  The co-existence of
151	   RDMAP/DDP aware local interfaces with traditional socket interfaces
152	   will also be explored.

154	   Finally, DDP support is defined for at least two IP transports: SCTP
155	   and TCP.  The rationale for supporting both transports is reviewed,
156	   as well as when each would be the appropriate selection.

158	2.  Definitions

160	   Advertisement - the act of informing a Remote Peer that a local RDMA
161	      Buffer is available to it.  A Node makes available an RDMA Buffer
162	      for incoming RDMA Read or RDMA Write access by informing its RDMA/
163	      DDP peer of the Tagged Buffer identifiers (STag, base address, and
164	      buffer length).  This advertisement of Tagged Buffer information
165	      is not defined by RDMA/DDP and is left to the ULP.  A typical
166	      method would be for the Local Peer to embed the Tagged Buffer's
167	      Steering Tag, base address, and length in a Send Message destined
168	      for the Remote Peer.

170	   Data Sink - The peer receiving a data payload.  Note that the Data
171	      Sink can be required to both send and receive RDMA/DDP Messages to
172	      transfer a data payload.

174	   Data Source - The peer sending a data payload.  Note that the Data
175	      Source can be required to both send and receive RDMA/DDP Messages
176	      to transfer a data payload.

178	   Lower Layer Protocol (LLP) - The transport protocol that provides
179	      services to DDP.  This is an IP transport with any required
180	      adaptation layer.  Adaptation layers are defined for SCTP and TCP.

182	   Steering Tag (STag) - An identifier of a Tagged Buffer on a Node,
183	      valid as defined within a protocol specification.

185	   Tagged Message - A DDP message that is directed to a ULP specified
186	      buffer based upon embedded addressing information.  In the
187	      immediate sense, the destination buffer is specified by the
188	      message sender.  The message receiver is given no independent
189	      indication that a tagged message has been received.

191	   Untagged Message - A DDP message that is directed to a ULP specified
192	      buffer based upon a Message Sequence Number being matched with a
193	      receiver supplied buffer.  The destination buffer is specified by
194	      the message receiver.  The message receiver is notified by some
195	      mechanism that an untagged message has been received.

197	   Upper Layer Protocol (ULP) - The direct user of RDMAP/DDP services.
198	      In addition to protocols such as iSER [10] and NFSv4 over RDMA
199	      [11], the ULP may be embedded in an application, or a middleware
200	      layer as is often the case for the Sockets Direct Protocol (SDP)
201	      and Remote Procedure Call (RPC) protocols.

203	3.  Direct Placement

205	   Direct Data Placement optimizes the placement of ULP payload into the
206	   correct destination buffers, typically eliminating intermediate
207	   copying.  Placement is enabled without regard to order of arrival,
208	   order of transmission or requiring per-placement interaction with the
209	   ULP.

211	   RDMAP minimizes the required ULP interactions.  This capability is
212	   most valuable for applications that require multiple transport layer
213	   packets for each required ULP interaction.

215	3.1.  Fewer Required ULP Interactions

217	   While reducing the number of required ULP interactions is in itself
218	   desirable, it is critical for high speed connections.  The burst
219	   packet rate for a high speed interface could easily exceed the host
220	   system's ability to switch ULP contexts.

222	   Content access applications are primary examples of applications with
223	   both high bandwidth and high content to required ULP interaction
224	   ratios.  These applications include file access protocols (NAS),
225	   storage access (SAN), database access and other application specific
226	   forms of content access such as HTTP, XML and email.

228	3.2.  Direct Placement using only the LLP

230	   Direct data placement can be achieved without RDMA.  Pre-posting of
231	   receive buffers could allow a non-RDMA network stack to place data
232	   directly to user buffers.

234	   The degree to which DDP optimizes depends on which transport is being
235	   compared with, and on the nature of the local interface.  Without
236	   RDMAP/DDP, pre-posting buffers requires the receiving side to
237	   accurately predict the required buffers and their sizes.  This is not
238	   feasible for all ULPs.  By contrast, DDP only requires the ULP to
239	   predict the sequence and size of incoming untagged messages.

241	   An application that could predict incoming messages and required
242	   nothing more than direct placement into buffers might be able to do
243	   so with a properly designed local interface to SCTP or TCP.  Doing so
244	   for TCP requires making predictions at a byte level rather than a
245	   message level.

247	   The main benefit of DDP for such an application would be that pre-
248	   posting of receive buffers is a mandated local interface capability,
249	   and that predictions can be made on a per-message basis (not per
250	   byte).
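
   As an illustration only, per-message prediction can be sketched in
   Python.  The class and method names (UntaggedQueue, post_receive,
   deliver) are hypothetical and are not defined by DDP.  The sketch
   assumes that each untagged message carries a Message Sequence Number
   (MSN) that selects the matching pre-posted buffer, and that
   numbering starts at 1.

```python
class UntaggedQueue:
    """Hypothetical model of pre-posted buffers for untagged messages."""

    def __init__(self):
        self.posted = {}       # MSN -> buffer posted by the receiving ULP
        self.next_msn = 1      # assumption: sequence numbers start at 1

    def post_receive(self, size):
        """Pre-post one buffer; it matches the next message in sequence."""
        msn = self.next_msn
        self.posted[msn] = bytearray(size)
        self.next_msn += 1
        return msn

    def deliver(self, msn, payload):
        """Place an arriving untagged message into the buffer for its MSN."""
        if msn not in self.posted:
            raise LookupError("no buffer posted for MSN %d" % msn)
        buf = self.posted[msn]
        if len(payload) > len(buf):
            raise ValueError("message exceeds the posted buffer size")
        buf[:len(payload)] = payload
        return buf


queue = UntaggedQueue()
queue.post_receive(16)
queue.post_receive(16)
second = queue.deliver(2, b"reply-2")   # arrives first on the wire
first = queue.deliver(1, b"reply-1")
```

   The receiver predicts only how many messages will arrive and a size
   bound for each.  It never has to know where one message ends and
   the next begins within a byte stream.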

252	   The Lower Layer Protocol, LLP, can also be used directly if ULP
253	   specific knowledge is built into the protocol stack to allow "parse
254	   and place" handling of received packets.  Such a solution either
255	   requires interaction with the ULP, or that the protocol stack have
256	   knowledge of ULP specific syntax rules.

258	   DDP achieves the benefits of directly placing incoming payload
259	   without requiring tight coupling between the ULP and the protocol
260	   stack.  However, "parse and place" capabilities can certainly provide
261	   equivalent services to a limited number of ULPs.

263	4.  Tagged Messages

265	   This section covers the major benefits from the use of Tagged
266	   Messages.

268	   A more critical advantage of DDP is the ability of the Data Source to
269	   use tagged buffers.  Tagging messages allows the Data Source to
270	   choose the ordering and packetization of its payload deliveries.
271	   With direct data placement based solely upon pre-posted receives, the
272	   packetization and delivery of payload must be agreed by the ULP peers
273	   in advance.

275	   The Upper Layer Protocol can allocate content between untagged and/or
276	   tagged messages to maximize the potential optimizations.  Placing
277	   content within an untagged message can deliver the content in the
278	   same packet that signals completion to the receiver.  This can
279	   improve latency.  It can even eliminate round trips.  But it requires
280	   making larger anonymous buffers available.

282	   Some examples of data that typically belongs in the untagged message
283	   include: short fixed-size control data that is inherently part of
284	   the control message; relatively short payload that is almost always
285	   needed, especially when including it would eliminate a round trip to
286	   fetch the data (for example, the initial data on a write request);
287	   and, of course, advertisements of tagged buffers that specify the
288	   location of data not in the untagged message.

291	   Tagged messages standardize direct placement of data without per-
292	   packet interaction with the upper layers.  Even if there is an upper
293	   layer protocol encoding of what is being transferred, as is common
294	   with middleware solutions, this information is not understood at the
295	   application independent layers.  The directions on where to place the
296	   incoming data cannot be accessed without switching to the ULP first.
297	   DDP provides a standardized 'packing list' which can be interpreted
298	   without requiring ULP interaction.  Indeed, it is designed to be
299	   implementable in hardware.

301	4.1.  Order Independent Reception

303	   Tagged messages are directed to a buffer based on an included
304	   Steering Tag. Additionally, no notice is provided to the ULP for each
305	   individual Tagged Message's arrival.  Together these allow tagged
306	   messages received out-of-order to be processed without intermediate
307	   buffering or additional notifications to the ULP.
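
   The following Python sketch illustrates order independent placement
   under simplifying assumptions.  TaggedBufferTable, register and
   place are hypothetical names, and each arriving segment is assumed
   to carry a Steering Tag, an offset into the tagged buffer and its
   payload.

```python
class TaggedBufferTable:
    """Hypothetical receive-side table of tagged buffers keyed by STag."""

    def __init__(self):
        self.buffers = {}

    def register(self, stag, length):
        self.buffers[stag] = bytearray(length)

    def place(self, stag, offset, payload):
        """Copy a segment to its final location; no ULP notification."""
        buf = self.buffers[stag]
        end = offset + len(payload)
        if offset < 0 or end > len(buf):
            raise ValueError("segment falls outside the tagged buffer")
        buf[offset:end] = payload


table = TaggedBufferTable()
table.register(stag=0x1234, length=12)
# Segments arrive in an order unrelated to their position in the buffer.
for offset, segment in [(8, b"IJKL"), (0, b"ABCD"), (4, b"EFGH")]:
    table.place(0x1234, offset, segment)
result = bytes(table.buffers[0x1234])
```

   Because every segment names its own destination, no reassembly
   queue or intermediate copy is needed, whatever the arrival order.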

309	4.2.  Reduced ULP Notifications

311	   RDMAP offers both tagged and untagged messages.  No receiving side
312	   ULP interactions are required for tagged messages.  By optimally
313	   dividing traffic between tagged and untagged messages the ULP can
314	   limit the number of events that must be dealt with at the ULP layer.
315	   This typically reduces the number of context switches required and
316	   improves performance.

318	   RDMAP further reduces required ULP interactions by consolidating
319	   completion notifications of tagged messages with the completion
320	   notification of a trailing untagged message.  For most ULPs this
321	   radically reduces the number of required ULP interactions even
322	   further.

324	   While RDMAP consolidation of notices is beneficial to most
325	   applications, it may be detrimental to some applications that benefit
326	   from streamed delivery to enable ULP processing of received data as
327	   promptly as possible.  A ULP that uses RDMAP cannot begin processing
328	   any portion of an exchange until it receives notification that the
329	   entire exchange has been placed.  An "exchange" here is a set of zero
330	   or more tagged messages and a single terminating untagged message.
331	   An application that would prefer to begin work on the received
332	   payload, no matter what order it arrived in, as soon as possible
333	   might prefer to work directly with the LLP.  RDMAP is optimized for
334	   applications that are more concerned with when the entire exchange
335	   is complete.

337	   An application that benefits from being able to begin processing of
338	   each received packet as quickly as possible may find RDMAP interferes
339	   with that goal.

341	   Such an application might be able to retain most of the benefits of
342	   RDMAP by using the DDP layer directly.  However, in addition to
343	   taking on the responsibilities of the RDMAP layer, the application
344	   would likely have more difficulty finding support for a DDP-only API.
345	   Many hardware implementations may choose to tightly couple RDMAP and
346	   DDP, and might not provide an API directly to DDP services.

348	   These features minimize the required interactions with the ULP.  This
349	   can be extremely beneficial for applications that use multiple
350	   transport layer packets to accomplish what is a single ULP
351	   interaction.

353	4.3.  Simplified ULP Exchanges

355	   The notification rules for Tagged Messages allow ULPs to create
356	   multi-message "exchanges" of zero or more tagged messages plus one
357	   untagged message, together representing a single step in the ULP
358	   interaction.  The receiving ULP is notified that the untagged message
359	   has arrived, and implicitly of any associated tagged messages.

361	   A ULP where all exchanges would naturally be only the untagged
362	   message would derive virtually no benefit from the use of RDMAP/DDP
363	   as opposed to SCTP.  But while tagged buffers are the justification
364	   for RDMAP/DDP, untagged buffers are still necessary.  Without
365	   untagged buffers the only method to exchange buffer advertisements
366	   would involve out-of-band communications and/or sharing of compile
367	   time constants.  Most RDMA-aware ULPs use untagged buffers for
368	   requests and responses.  Buffer advertisements are typically done
369	   within these untagged messages.

371	   More importantly there would be no reliable method for the upper
372	   layer peers to synchronize.  The absence of any guarantees about
373	   ordering within or between tagged messages is fundamental to allowing
374	   the DDP layer to optimize transfer of tagged payload.

376	   So no ULP can be defined entirely in terms of tagged messages.
377	   Eventually a notification that confirms delivery must be generated
378	   from the RDMAP/DDP layer.

380	   Limiting use of untagged buffers to requests and responses by moving
381	   all bulk data using tagged transfers can greatly simplify the amount
382	   of prediction that the Data Sink must perform in pre-posting receive
383	   buffers.  For example, a typical RDMA enabled interaction would
384	   consist of the following:

386	      The Client sends a transaction request to the Server as an
387	      untagged message.

389	      This message includes buffer advertisements for the buffers where
390	      the results are to be placed.

392	      The Server sends multiple tagged messages to the advertised
393	      buffers.

395	      The Server sends a transaction reply as an untagged message to
396	      the Client.

398	      The Client receives a single notification, indicating completion
399	      of the interaction.

401	   With this type of exchange the pacing and required size of untagged
402	   buffers is highly predictable.  The variability of response sizes is
403	   absorbed by tagged transfers.
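
   The exchange above can be sketched in Python as an illustration
   only.  ClientSink and serve_request are hypothetical names, and
   direct method calls stand in for the network.  Passing the client
   object to the server stands in for the buffer advertisement carried
   in the request.

```python
class ClientSink:
    """Hypothetical client side of one exchange (the Data Sink)."""

    def __init__(self, result_size):
        self.result = bytearray(result_size)  # advertised tagged buffer
        self.notifications = []               # what the ULP actually sees

    def tagged_arrival(self, offset, payload):
        # Placed directly; the ULP is not notified.
        self.result[offset:offset + len(payload)] = payload

    def untagged_arrival(self, reply):
        # The only event delivered to the ULP for this exchange.
        self.notifications.append(reply)


def serve_request(client, request):
    """Hypothetical server: fill the advertised buffer, then reply."""
    data = request["pattern"] * request["count"]
    half = len(data) // 2
    # Tagged messages may be sent in whatever order suits the server.
    client.tagged_arrival(half, data[half:])
    client.tagged_arrival(0, data[:half])
    client.untagged_arrival({"status": "ok", "length": len(data)})


client = ClientSink(result_size=8)
serve_request(client, {"pattern": b"ab", "count": 4})
```

   However many tagged messages the server sends, the client ULP sees
   one notification, and only the untagged reply needs a pre-posted
   buffer of predictable size.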

405	4.4.  Order Independent Sending

407	   Use of tagged messages is especially applicable when the Data Sink
408	   does not know the actual size, structure or location of the content
409	   it is requesting (or updating).

411	   For example, suppose the Data Sink ULP needs to fetch four related
412	   pieces of data into four separate buffers.  With SCTP the Data Sink
413	   ULP could receive four messages into four separate buffers, only
414	   having to predict the maximum size of each.  However it would have to
415	   dictate the order in which the Data Source supplied the separate
416	   pieces.  If the Data Source found it advantageous to fetch them in a
417	   different order it would have to use intermediate buffering to re-
418	   order the pieces into the expected order even though the application
419	   only required that all four be delivered and did not truly have an
420	   ordering requirement.

422	   Techniques such as RAID striping and mirroring represent this same
423	   problem, but one step further.  What appears to be a single resource
424	   to the Data Sink is actually stored in separate locations by the Data
425	   Source.  Non-RDMA protocols would either require the Data Source to
426	   fetch the material in the desired order or force the Data Source to
427	   use its own holding buffers to assemble an image of the destination
428	   buffer.

430	   While sometimes referred to as a "buffer-to-buffer" solution, RDMA
431	   more fundamentally enables remote buffer access.  The ULP is free to
432	   work with larger remote buffers than it has locally.  This reduces
433	   buffering requirements and the number of times the data must be
434	   copied in an end-to-end transfer.

436	   There are numerous reasons why the Data Sink would not know the true
437	   order or location of the requested data.  These include per-client
438	   differences, different record selections and/or sort orders, RAID
439	   striping, file fragmentation, volume fragmentation, volume
440	   mirroring and server-side dynamic compositing of content (such as
441	   server side includes for HTTP).

443	   In all of these cases the Data Source is free to assemble the desired
444	   data in the Data Sink's buffer in whatever order the component data
445	   becomes available to it.  It is not constrained on ordering.  It does
446	   not have to assemble an image in its own memory before creating it in
447	   the Data Sink's buffers.

449	   Note that while DDP enables use of tagged messages for bulk transfer,
450	   there are some application scenarios where untagged messages would
451	   still be used for bulk transfer.  For example, a file server may not
452	   expose its own memory to its clients.  A client wishing to write may
453	   advertise a buffer which the server will issue RDMA Reads upon.
454	   However, when performing a small write it may be preferable to
455	   include the data in the untagged message rather than incurring an
456	   additional round trip with the RDMA Read and its response.

458	   Generally, the best use of an untagged message is to synchronize and
459	   to deliver data that is naturally tied to the same message as the
460	   synchronization.  For initial data transfers this has the additional
461	   benefit of avoiding the need to advertise specific tagged buffers for
462	   indefinite time periods.  Instead anonymous buffers can be used for
463	   initial data reception.  Because anonymous buffers do not need to be
464	   tied to specific messages in advance this can be a major benefit.

466	4.5.  Untagged Messages and Tagged Buffers as ULP Credits

468	   The handling of end-to-end buffer credits differs considerably
469	   between DDP and direct ULP use of either TCP or SCTP.

471	   With both TCP and SCTP, buffer credits are based upon the receiver
472	   granting transmit permission based on the total number of bytes.
473	   These credits reflect system buffering resources and/or simple flow
474	   control.  They do not represent ULP resources.

476	   DDP defines no standard flow control, but presumes the existence of a
477	   ULP mechanism.  The presumed mechanism is that the Data Sink ULP has
478	   issued credits to the Data Source allowing the Data Source to send a
479	   specific number of untagged messages.

481	   The ULP peers must ensure that the sender is aware of the maximum
482	   size that can be sent to any specific target buffer.  One method of
483	   doing so is to use a standard size for all untagged buffers within a
484	   given connection.  For example, a ULP may specify an initial untagged
485	   buffer size to be used immediately after session establishment, and
486	   then optionally specify mechanisms for negotiating changes.
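
   A ULP-level credit scheme of this kind might look like the following
   Python sketch.  DDP defines no such mechanism, so the UntaggedCredits
   class and its methods are hypothetical.  The sketch assumes one
   credit per untagged message and a single agreed buffer size for the
   connection.

```python
class UntaggedCredits:
    """Hypothetical sender-side view of ULP-level untagged credits."""

    def __init__(self, credits, buffer_size):
        self.credits = credits          # untagged messages we may still send
        self.buffer_size = buffer_size  # agreed size of each untagged buffer

    def grant(self, count):
        """Peer reports that it has posted more receive buffers."""
        self.credits += count

    def consume(self, payload):
        """Account for sending one untagged message; raise if not allowed."""
        if len(payload) > self.buffer_size:
            raise ValueError("payload exceeds the agreed untagged buffer size")
        if self.credits == 0:
            raise RuntimeError("no untagged credits remaining")
        self.credits -= 1


sender = UntaggedCredits(credits=2, buffer_size=512)
sender.consume(b"request 1")
sender.consume(b"request 2")
blocked = False
try:
    sender.consume(b"request 3")     # no credit left: must wait
except RuntimeError:
    blocked = True
sender.grant(1)                      # peer posted another buffer
sender.consume(b"request 3")
```

   Credits here count messages rather than bytes, which is the main
   contrast with the byte-based windows of TCP and SCTP.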

488	   Tagged buffers are ULP resources advertised directly from ULP to ULP.
489	   A DDP put to a known tagged buffer is constrained only by transport
490	   level flow control, not by available system buffering.

492	   Either tagged or untagged buffers allow bypassing of system buffer
493	   resources.  Use of tagged buffers additionally allows the Data Source
494	   to choose what order to exercise the credits in.

496	   To the extent allowed by the ULP, tagged buffers are also divisible
497	   resources.  The Data Sink can advertise a single 100 KB buffer, and
498	   then receive notifications from its peer that it had written 50 KB,
499	   20 KB and 30 KB to that buffer in three successive transactions.
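
   The 100 KB example can be traced with a short Python sketch.
   AdvertisedBuffer and record_write are hypothetical names.  The
   sketch assumes, purely for illustration, that the peer fills the
   buffer sequentially; a ULP could equally allow writes at arbitrary
   offsets.

```python
KB = 1024


class AdvertisedBuffer:
    """Hypothetical Data Sink bookkeeping for one advertised tagged buffer."""

    def __init__(self, length):
        self.length = length
        self.written = 0

    def record_write(self, nbytes):
        """Record a peer notification; return the offset the write began at."""
        if nbytes < 0 or nbytes > self.length - self.written:
            raise ValueError("write does not fit in the remaining buffer")
        start = self.written
        self.written += nbytes
        return start

    @property
    def remaining(self):
        return self.length - self.written


buf = AdvertisedBuffer(100 * KB)
# Three successive transactions reported by the peer: 50 KB, 20 KB, 30 KB.
offsets = [buf.record_write(n * KB) for n in (50, 20, 30)]
```

   A single advertisement serves all three transactions, and the Data
   Sink only needs to track how much of the buffer has been consumed.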

501	   ULP-management of tagged buffer resources, independent of transport
502	   and DDP layer credits, is an additional benefit of RDMA protocols.
503	   Large bulk transfers cannot be blocked by limited general purpose
504	   buffering capacity.  Applications can flow control based upon higher
505	   level abstractions, such as number of outstanding requests,
506	   independent of the amount of data that must be transferred.

508	   However, use of system buffering, as offered by direct use of the
509	   underlying transports, can be preferable under certain circumstances.

511	   One example would be when the number of target ULP buffers is
512	   sufficiently large, and the rate at which any writes arrive is
513	   sufficiently low, that pinning all the target ULP buffers in memory
514	   would be undesirable.  The maximum transfer rate, and hence the
515	   maximum amount of system buffering required, may be more stable and
516	   predictable than the total ULP buffer exposure.

518	   Another would be when the Data Sink wishes to receive a stream of
519	   data at a predictable rate, but does not know in advance what the
520	   size of each data packet will be.  This is common for streaming
521	   media that has been encoded with a variable bit rate.  With DDP the
522	   Data Sink would either have to use untagged buffers large enough for
523	   the largest packet, or advertise a circular buffer.  If for security
524	   or other reasons the Data Sink did not want the size of its buffer to
525	   be publicly known, using the underlying SCTP transport directly may
526	   be preferable because of its byte-oriented credits.

528	5.  RDMA Read

530	   RDMA Reads are a further service provided by RDMAP.  RDMA Reads allow
531	   the Data Sink to fetch exactly the portion of the peer ULP buffer
532	   required on a "just in time" basis.  This can be done without
533	   requiring per-fetch support from the Data Source ULP.

535	   Storage servers may wish to limit the maximum write buffer allocated
536	   to any single session.  The storage server may be a very minimal
537	   layer between the client and the disk storage media, or the server
538	   may merely wish to limit the total resources that would be required
539	   if all clients could push the entire payload they wished written at
540	   their own convenience.

542	   In either case, there is little benefit in transferring data from the
543	   Data Source far in advance of when it will be written to the
544	   persistent storage media.  RDMA Reads allow the Storage Server to
545	   fetch the payload on a "just in time" basis.  In this fashion a
546	   relatively small number of block sized buffers can be used to execute
547	   a single transaction that specified writing a large file, or a
548	   Storage Server with numerous clients can fetch buffers from the
549	   individual clients in the order that is most convenient to the
550	   server.
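
	   The "just in time" pattern can be sketched as follows.  The
	   rdma_read and write_block callables are hypothetical stand-ins
	   for the local RDMA interface and the storage back end; neither
	   is defined by RDMAP.

```python
def plan_reads(total_len, block_size):
    """Yield (offset, length) pairs covering total_len in block-sized pieces."""
    offset = 0
    while offset < total_len:
        length = min(block_size, total_len - offset)
        yield offset, length
        offset += length


def write_large_payload(total_len, block_size, rdma_read, write_block):
    """Pull a large advertised payload one block at a time.

    Only one block-sized buffer is live at any moment, regardless of
    how large the client's payload is.
    """
    for offset, length in plan_reads(total_len, block_size):
        data = rdma_read(offset, length)  # fetch only when ready to write
        write_block(offset, data)
```

	   Because the server chooses each offset and length, it can also
	   interleave fetches from many clients in whatever order suits
	   it.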

552	   This same capability can be used when the desired portion of the
553	   advertised buffer is not known in advance.  For example, the
554	   advertised buffer could contain performance statistics.  The Data
555	   Sink could request the portions of the data it required, without
556	   requiring an interaction with the Data Source ULP.

558	   This is applicable for many applications that publish semi-volatile
559	   data that does not require transactional validity checking (i.e.,
560	   authorized users have read access to the entire set of data).  It is
561	   less applicable when there are ULP consistency checks that must be
562	   performed upon the data.  Such applications would be better served by
563	   having the client send a request, and having the server use RDMA
564	   Writes to publish the requested data.  Neither RDMAP nor DDP provides
565	   mechanisms for bundling multiple disjoint updates into an atomic
566	   operation.  Therefore use of an advertised buffer as a data resource
567	   is subject to the same caveats as any randomly updated data resource,
568	   such as flat files, that do not enforce their own consistency.

570	6.  LLP Comparisons

572	   Normally the choice of underlying IP transport is irrelevant to the
573	   ULP.  RDMAP and DDP provide the same services over either.  There
574	   may be performance impacts of the choice, however.  It is the
575	   responsibility of the ULP to determine which IP transport is best
576	   suited to its needs.

578	   SCTP provides for preservation of message boundaries.  Each DDP
579	   segment will be delivered within a single SCTP packet.  The
580	   equivalent services are only available with TCP through the use of
581	   the MPA (Marker PDU Alignment) adaptation layer.

583	6.1.  Multistreaming Implications

585	   SCTP also provides multi-streaming.  When the same pair of hosts have
586	   need for multiple DDP streams this can be a major advantage.  A
587	   single SCTP association carries multiple DDP streams, consolidating
588	   connection setup, congestion control and acknowledgements.

590	   Completions are controlled by the DDP Source Sequence Number (DDP-
591	   SSN) on a per stream basis.  Therefore combining multiple DDP Streams
592	   into a single SCTP association cannot result in a dropped packet
593	   carrying data for one stream delaying completions on others.

595	6.2.  Out of Order Reception Implications

597	   The use of unordered Data Chunks with SCTP guarantees that the DDP
598	   layer will be able to perform placements when IP datagrams are
599	   received out of order.

601	   Placement of out-of-order DDP Segments carried over MPA/TCP is not
602	   guaranteed, but certainly allowed.  The ability of the MPA receiver
603	   to process out-of-order DDP Segments may be impaired when alignment
604	   of TCP segments and MPA FPDUs is lost.  Using SCTP, each DDP Segment
605	   is encoded in a single Data Chunk and never spread over multiple IP
606	   datagrams.

608	6.3.  Header and Marker Overhead

610	   MPA and TCP headers together are smaller than the headers used by
611	   SCTP and its adaptation layer.  However, this advantage can be
612	   reduced by the insertion of MPA markers.  The difference in ULP
613	   payload per IP Datagram is not likely to be a significant factor.

615	6.4.  Middlebox Support

617	   Even with the MPA adaptation layer, DDP traffic carried over MPA/TCP
618	   will appear to all network middleboxes as a normal TCP connection.
619	   In many environments there may be a requirement to use only TCP
620	   connections to satisfy existing network elements and/or to facilitate
621	   monitoring and control of connections.  While SCTP is certainly just
622	   as monitorable and controllable as TCP, there is no guarantee that
623	   the network management infrastructure has the required support for
624	   both.

626	6.5.  Processing Overhead

628	   A DDP stream delivered via MPA/TCP will require more processing
629	   effort than one delivered over SCTP.  However, this extra work may be
630	   justified for many deployments where full SCTP support is unavailable
631	   in the endpoints of the network, or where middleboxes impair the
632	   usability of SCTP.

634	6.6.  Data Integrity Implications

636	   Both the SCTP and MPA/TCP adaptations provide end-to-end CRC32c
637	   protection against data corruption, or its equivalent.

639	   A ULP that requires a greater degree of protection may add its own.
640	   However, DDP and RDMAP headers will only be guaranteed to have the
641	   equivalent of end-to-end CRC32c protection.  A ULP that requires data
642	   integrity checking more thorough than an end-to-end CRC32c should
643	   first invalidate all STags that reference a buffer before applying
644	   its own integrity check.

646	6.6.1.  MPA/TCP Specifics

648	   It is mandatory for MPA/TCP implementations to implement CRC32c, but
649	   it is NOT mandatory to use the CRC32c during an RDMA connection.  The
650	   activating or deactivating of the CRC in MPA/TCP is an administrative
651	   configuration operation at the local and remote end.  The
652	   administration of the CRC(ON/OFF) is invisible to the ULP.

654	   Applications SHOULD trust that this administrative option will only
655	   be used when the end-to-end protection is at least as effective as a
656	   transport layer CRC32c.  Applications SHOULD NOT apply additional
657	   protection as a guard against this administrative option being turned
658	   on inadvertently.

660	   Administrators MUST NOT enable CRC32c suppression unless the end-to-
661	   end protection is truly equivalent.

663	   If the CRC is active/used for one direction/end, then the use of the
664	   CRC is mandatory in both directions/ends.

666	   If both ends have been configured NOT to use the CRC, then this is
667	   allowed as long as an equivalent protection (comparable or better
668	   than/to CRC) from undetected errors on the connection is provided.
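
	   The rule that the CRC is used in both directions whenever
	   either end wants it can be sketched together with a bitwise
	   CRC32c.  The function names are hypothetical, and the bit-at-a-
	   time loop is chosen for clarity rather than speed.

```python
def crc32c(data):
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell off.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc_in_use(local_wants_crc, remote_wants_crc):
    """The CRC is suppressed only when BOTH ends are configured not to use it."""
    return local_wants_crc or remote_wants_crc
```

	   The standard CRC32c check value for the ASCII string
	   "123456789" is 0xE3069283, which is a convenient self-test for
	   any implementation.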

670	6.6.2.  SCTP Specifics

672	   SCTP provides CRC32c protection automatically.  The adaptation to
673	   SCTP provides for no option to suppress SCTP CRC32c protection.

675	6.7.  Non-IP Transports

677	   DDP is defined to operate over ubiquitous IP transports such as SCTP
678	   and TCP.  This enables a new DDP-enabled node to be added anywhere to
679	   an IP network.  No DDP-specific support from middle-boxes is
680	   required.

682	   There are non-IP transport fabrics offering RDMA capabilities.
683	   Because these capabilities are integrated with the transport protocol
684	   they have some technical advantages when compared to RDMA over IP.
685	   For example, fencing of RDMA operations can be based upon transport
686	   level acks.  Because DDP is cleanly layered over an IP transport, any
687	   explicit RDMA layer ack must be separate from the transport layer
688	   ack.

690	   There may be deployments where the benefits of RDMA/transport
691	   integration outweigh the benefits of being on an IP network.

693	6.7.1.  No RDMA Layer Ack

695	   DDP does not provide for its own acknowledgements.  The only form of
696	   ack provided at the RDMAP layer is an RDMA Read Response.  DDP and
697	   RDMAP rely almost entirely upon other layers for flow control and
698	   pacing.  The LLP is relied upon to guarantee delivery and avoid
699	   network congestion, and ULP level acking is relied upon for ULP
700	   pacing and to avoid ULP buffer overruns.

702	   Previous RDMA protocols, such as InfiniBand, have been able to use
703	   their integration with the transport layer to provide stronger
704	   ordering guarantees.  It is important that application designers who
705	   require such guarantees provide them through ULP interaction.

707	   Specifically:

709	      There is no ability for a local interface to "fence" outbound
710	      messages to guarantee that prior tagged messages have been placed
711	      prior to sending a tagged message.  The only guarantees available
712	      from the other side would be an RDMA Read Response (coming from
713	      the RDMAP layer) or a response from the ULP layer.  Remember that
714	      the normal ordering rules only guarantee when the Data Sink ULP
715	      will be notified of untagged messages; they do not control when
716	      data is placed into receive buffers.

718	      Re-use of tagged buffers must be done with extreme care.  The fact
719	      that an untagged message indicates that all prior tagged messages
720	      have been placed does not guarantee that no later tagged messages
721	      have.  The best strategy is to only change the state of any given
722	      advertised buffer with untagged messages.

724	      As covered elsewhere in this document, flow control of untagged
725	      messages MUST be provided by the ULP itself.
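
	   The buffer re-use strategy above can be sketched as a small
	   state record on the Data Sink.  The class is hypothetical; the
	   point is only that state changes are driven by untagged
	   messages, never inferred from tagged placements.

```python
class TaggedBufferState:
    """Hypothetical Data Sink record for one advertised tagged buffer."""

    ADVERTISED = "advertised"  # peer may still place data at any time
    RELEASED = "released"      # peer has said, via untagged message, it is done

    def __init__(self):
        self.state = self.ADVERTISED

    def on_untagged_message(self, releases_buffer):
        # Only an untagged ULP message may change the buffer's state.
        if releases_buffer:
            self.state = self.RELEASED

    def may_reuse(self):
        return self.state == self.RELEASED
```

	   A buffer that has only seen tagged placements stays in the
	   advertised state, so the Data Sink does not re-use it while
	   later tagged messages might still arrive.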

727	6.8.  Other IP Transports

729	   Both TCP and SCTP provide DDP with reliable transport with TCP
730	   friendly rate control.  As currently defined, DDP works over
731	   reliable transports and implicitly relies upon some form of rate
732	   control.

734	   DDP is fully compatible with a non-reliable protocol.  Out-of-order
735	   placement is obviously not dependent on whether the other DDP
736	   Segments ever actually arrive.

738	   However, RDMAP requires the LLP to provide reliable service.  An
739	   alternate completion handling protocol would be required if DDP were
740	   to be deployed over an unreliable IP transport.

742	   As noted in the prior section on tagged buffers as ULP credits,
743	   neither RDMAP nor DDP provides any flow control for tagged messages.
744	   If no transport layer flow control is provided, an RDMAP/DDP
745	   application would be only limited by the link layer rate, almost
746	   inevitably resulting in severe network congestion.

748	   RDMAP encourages applications to be ignorant of the underlying
749	   transport PMTU.  The ULP is only notified when all messages ending in
750	   a single untagged message have completed.  The ULP is not aware of
751	   the granularity or ordering of the underlying message.  This approach
752	   assumes that the ULP is only interested in the complete set of
753	   messages, and has no use for a subset of them.

755	6.9.  LLP Independent Session Establishment

757	   For an RDMAP/DDP application, the transport services provided by a
758	   pair of SCTP Streams and by a TCP connection both provide the same
759	   service (reliable delivery of DDP Segments between two connected
760	   RDMAP/DDP endpoints).

762	6.9.1.  RDMA-only Session Establishment

764	   It is also possible to allow for transport neutral establishment of
765	   RDMAP/DDP sessions between endpoints.  Combined, these two features
766	   would allow most applications to be unconcerned as to which LLP was
767	   actually in use.

769	   Specifically, the procedures for DDP Stream Session establishment
770	   discussed in section 3 of the SCTP mapping, and section 13.3 of the
771	   MPA/TCP mapping, both allow for the exchange of ULP specific data
772	   ("Private Data") before enabling the exchange of DDP Segments.  This
773	   delay can allow for proper selection and/or configuration of the
774	   endpoints based upon the exchanged data.  For example, each DDP
775	   Stream Session associated with a single client session might be
776	   assigned to the same DDP Protection Domain.

778	   To be transport neutral, the applications should exchange Private
779	   Data as part of session establishment messages to determine how the
780	   RDMA endpoints are to be configured.  One side must be the Initiator,
781	   and the other the Responder.

783	   With SCTP, a pair of SCTP streams can be used for sequential
784	   sessions.  With MPA/TCP each connection can be used for at most one
785	   session.  However, the same source/destination pair of ports can be
786	   re-used sequentially subject to normal TCP rules.

788	   Both SCTP and MPA limit the private data size to a maximum of 512
789	   bytes.

791	   MPA/TCP requires the end of the TCP connection that initiated the
792	   conversion to MPA mode to send the first DDP Segment.  SCTP does not
793	   have this requirement.  ULPs which wish to be transport neutral
794	   should require the initiating end to send the first message.  A zero-
795	   length RDMA Write can be used for this purpose if the ULP logic
796	   itself does not naturally support this restriction.
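
	   The transport-neutral rules in this section can be sketched as
	   two checks.  The function names and the tuple returned by
	   first_message are hypothetical conventions for illustration,
	   not part of either mapping.

```python
MAX_PRIVATE_DATA = 512  # byte limit stated above for both SCTP and MPA


def check_private_data(private_data):
    """Reject Private Data that neither LLP mapping could carry."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 bytes")
    return private_data


def first_message(is_initiator, pending_ulp_message=None):
    """Return what this end sends first, or None if it must wait.

    A transport-neutral ULP always lets the initiating end go first,
    which satisfies MPA/TCP and is harmless over SCTP.
    """
    if not is_initiator:
        return None
    if pending_ulp_message is not None:
        return ("ulp", pending_ulp_message)
    # Nothing natural to send yet: use a zero-length RDMA Write.
    return ("rdma_write", b"")
```

	   A Responder therefore never sends first, and an Initiator with
	   no ULP message ready falls back to the zero-length RDMA Write.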

798	6.9.2.  RDMA-Conditional Session Establishment

800	   It is sometimes desirable for the active side of a session to connect
801	   with the passive side before knowing whether the passive side
802	   supports RDMA.

804	   This style of session establishment can be supported with either TCP
805	   or SCTP, but not as transparently as for RDMA-only sessions.  Pre-
806	   existing non-RDMA servers are also far more likely to be using TCP
807	   than SCTP.

809	   With TCP, a normal TCP connection is established.  It is then used by
810	   the ULP to determine whether or not to convert to MPA mode and use
811	   RDMA.  This will typically be integral with other session
812	   establishment negotiations.

814	   With SCTP, the establishment of an association tests whether RDMA is
815	   supported.  If not supported, the application simply requests the
816	   association without the RDMA adaptation indication.

818	   The key difference is that with SCTP the determination as to whether
819	   the peer can support RDMA is made before the transport layer
820	   association/connection is established while with TCP the established
821	   connection itself is used to determine whether RDMA is supported.

823	7.  Local Interface Implications

825	   Full utilization of DDP and RDMAP capabilities requires a local
826	   interface that explicitly requests these services.  Protocols such as
827	   Sockets Direct Protocol (SDP) can allow applications to keep their
828	   traditional byte-stream or message-stream interface and still enjoy
829	   many of the benefits of the optimized wire level protocols.

831	8.  IANA Considerations

833	   There are no IANA considerations in this document.

835	9.  Security Considerations

837	9.1.  Connection/Association Setup

839	   Both the SCTP and TCP adaptations allow for existing procedures to be
840	   followed for the establishment of the SCTP association or TCP
841	   connection.  Use of DDP does not impair the use of any security
842	   measures to filter, validate and/or log the remote end of an
843	   association/connection.

845	9.2.  Tagged Buffer Exposure

847	   DDP only exposes ULP memory to the extent explicitly allowed by ULP
848	   actions.  These include posting of receive operations and enabling of
849	   Steering Tags.

851	   Neither RDMAP nor DDP places requirements on how ULPs advertise
852	   buffers.  A ULP may use a single Steering Tag for multiple buffer
853	   advertisements.  However, the ULP should be aware that enforcement on
854	   STag usage is likely limited to the overall range that is enabled.
855	   If the remote peer writes into the 'wrong' advertised buffer, neither
856	   the DDP nor RDMAP layer will be aware of this.  Nor is there any
857	   report to the ULP on how the remote peer specifically used tagged
858	   buffers.

860	   Unless the ULP peers have an adequate basis for mutual trust, the
861	   receiving ULP might be well advised to use a distinct STag for each
862	   interaction, and to invalidate it after each use or to require its
863	   peer to use the RDMAP option to invalidate the STag with its
864	   responding untagged message.
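
	   The one-STag-per-interaction advice can be sketched with a
	   software stand-in for the local interface.  STagRegistry and
	   its methods are hypothetical; real STags are allocated and
	   enforced by the RDMA-capable interface, not by application
	   code.

```python
import secrets


class STagRegistry:
    """Hypothetical model of STag validation at the Data Sink."""

    def __init__(self):
        self._valid = {}  # STag -> buffer it currently exposes

    def advertise(self, buffer):
        # Use a fresh STag for every interaction.
        stag = secrets.randbits(32)
        while stag in self._valid:
            stag = secrets.randbits(32)
        self._valid[stag] = buffer
        return stag

    def place(self, stag, offset, data):
        buffer = self._valid.get(stag)
        if buffer is None:
            raise PermissionError("STag is not valid")
        if offset < 0 or offset + len(data) > len(buffer):
            raise ValueError("placement outside the advertised range")
        buffer[offset:offset + len(data)] = data

    def invalidate(self, stag):
        # After invalidation the peer can no longer touch the buffer.
        self._valid.pop(stag, None)
```

	   Once invalidate() has been called, a late or malicious tagged
	   message carrying the old STag is rejected rather than placed.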

866	9.3.  Impact of Encrypted Transports

868	   While DDP is cleanly layered over the LLP, its maximum benefit may be
869	   limited when the LLP Stream is secured with a streaming cipher, such
870	   as Transport Layer Security (TLS).  If the LLP must decrypt in order,
871	   it cannot provide out-of-order DDP Segments to the DDP layer for
872	   placement purposes.  IPsec tunnel mode encrypts entire IP Datagrams.
873	   IPsec transport mode encrypts TCP Segments or SCTP packets.  In
874	   neither case should IPsec preclude providing out-of-order DDP
875	   Segments to the DDP layer for placement.

877	   Note that end-to-end use of IPsec cryptographic integrity protection
878	   may allow suppression of MPA CRC generation and checking under
879	   certain circumstances.  This is one example where the LLP may be
880	   judged to have "or equivalent" protection to an end-to-end CRC32c.

882	10.  References

884	10.1.  Normative References

886	   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
887	        Levels", BCP 14, RFC 2119, March 1997.

889	   [2]  Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
890	        RFC 2246, January 1999.

892	   [3]  Kent, S. and R. Atkinson, "IP Encapsulating Security Payload
893	        (ESP)", RFC 2406, November 1998.

895	   [4]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer,
896	        H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson,
897	        "Stream Control Transmission Protocol", RFC 2960, October 2000.

899	   [5]  Coene, L., "Stream Control Transmission Protocol Applicability
900	        Statement", RFC 3257, April 2002.

902	   [6]  Recio, R., "An RDMA Protocol Specification",
903	        draft-ietf-rddp-rdmap-05 (work in progress), July 2005.

905	   [7]  Shah, H., "Direct Data Placement over Reliable Transports",
906	        draft-ietf-rddp-ddp-05 (work in progress), July 2005.

908	   [8]  Stewart, R., "Stream Control Transmission Protocol (SCTP) Remote
909	        Direct Memory Access (RDMA) Direct Data Placement (DDP)
910	        Adaptation", draft-ietf-rddp-sctp-02 (work in progress),
911	        August 2005.

913	   [9]  Culley, P., "Marker PDU Aligned Framing for TCP Specification",
914	        draft-ietf-rddp-mpa-02 (work in progress), February 2005.

916	10.2.  Informative References

918	   [10]  Ko, M., "iSCSI Extensions for RDMA Specification",
919	         October 2005.

921	   [11]  Callaghan, B. and T. Talpey, "NFS Direct Data Placement",
922	         draft-ietf-nfsv4-nfsdirect-02 (work in progress), October 2005.

924	Authors' Addresses

926	   Caitlin Bestler
927	   Broadcom
928	   49 Discovery
929	   Irvine, CA  92618
930	   USA

932	   Phone: 949-926-6383
933	   Email: caitlinb@broadcom.com

935	   Lode Coene
936	   Siemens
937	   Atealaan 26
938	   Herentals,   2200
939	   Belgium

941	   Phone: +32-14-252081
942	   Email: lode.coene@siemens.com

944	Intellectual Property Statement

946	   The IETF takes no position regarding the validity or scope of any
947	   Intellectual Property Rights or other rights that might be claimed to
948	   pertain to the implementation or use of the technology described in
949	   this document or the extent to which any license under such rights
950	   might or might not be available; nor does it represent that it has
951	   made any independent effort to identify any such rights.  Information
952	   on the procedures with respect to rights in RFC documents can be
953	   found in BCP 78 and BCP 79.

955	   Copies of IPR disclosures made to the IETF Secretariat and any
956	   assurances of licenses to be made available, or the result of an
957	   attempt made to obtain a general license or permission for the use of
958	   such proprietary rights by implementers or users of this
959	   specification can be obtained from the IETF on-line IPR repository at
960	   http://www.ietf.org/ipr.

962	   The IETF invites any interested party to bring to its attention any
963	   copyrights, patents or patent applications, or other proprietary
964	   rights that may cover technology that may be required to implement
965	   this standard.  Please address the information to the IETF at
966	   ietf-ipr@ietf.org.

968	Disclaimer of Validity

970	   This document and the information contained herein are provided on an
971	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
972	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
973	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
974	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
975	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
976	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

978	Copyright Statement

980	   Copyright (C) The Internet Society (2005).  This document is subject
981	   to the rights, licenses and restrictions contained in BCP 78, and
982	   except as set forth therein, the authors retain all their rights.

984	Acknowledgment

986	   Funding for the RFC Editor function is currently provided by the
987	   Internet Society.