Internet-Draft                                Stephen Bailey (Sandburst)
Expires: April 2005                                  Tom Talpey (NetApp)

             The Architecture of Direct Data Placement (DDP)
                 and Remote Direct Memory Access (RDMA)
                          on Internet Protocols
                         draft-ietf-rddp-arch-06

Status of this Memo

   By submitting this Internet-Draft, I certify that any applicable
   patent or other IPR claims of which I am aware have been disclosed,
   or will be disclosed, and any of which I become aware will be
   disclosed, in accordance with RFC 3668.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.
   Note that other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2004).  All Rights Reserved.

Abstract

   This document defines an abstract architecture for Direct Data
   Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to
   run on Internet Protocol-suite transports.  This architecture does
   not necessarily reflect the proper way to implement such protocols,
   but is, rather, a descriptive tool for defining and understanding
   the protocols.  DDP allows the efficient placement of data into
   buffers designated by Upper Layer Protocols (e.g. RDMA).  RDMA
   provides the semantics to enable Remote Direct Memory Access
   between peers in a way consistent with application requirements.

Table of Contents

   1.      Introduction
   1.1.    Terminology
   1.2.    DDP and RDMA Protocols
   2.      Architecture
   2.1.    Direct Data Placement (DDP) Protocol Architecture
   2.1.1.  Transport Operations
   2.1.2.  DDP Operations
   2.1.3.  Transport Characteristics in DDP
   2.2.    Remote Direct Memory Access Protocol Architecture
   2.2.1.  RDMA Operations
   2.2.2.  Transport Characteristics in RDMA
   3.      Security Considerations
   4.      IANA Considerations
   5.      Acknowledgements
           Informative References
           Authors' Addresses
           Full Copyright Statement

1. Introduction

   This document defines an abstract architecture for Direct Data
   Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to
   run on Internet Protocol-suite transports.  This architecture does
   not necessarily reflect the proper way to implement such protocols,
   but is, rather, a descriptive tool for defining and understanding
   the protocols.  This document uses C language notation as a
   shorthand to describe the architectural elements of DDP and RDMA
   protocols.  The choice of C notation is not intended to describe
   concrete protocols or programming interfaces.

   The first part of the document describes the architecture of DDP
   protocols, including what assumptions are made about the transports
   on which DDP is built.  The second part describes the architecture
   of RDMA protocols layered on top of DDP.

1.1. Terminology

   Before introducing the protocols, certain definitions will be
   useful to guide discussion:

   o  Placement - writing to a data buffer.

   o  Operation - a protocol message, or sequence of messages, which
      provides an architectural semantic, such as reading or writing
      of a data buffer.

   o  Delivery - informing any Upper Layer or application that a
      particular message is available for use.  Delivery therefore
      may be viewed as the "control" signal associated with a unit
      of data.  Note that the order of delivery is defined more
      strictly than it is for placement.
   o  Completion - informing any Upper Layer or application that a
      particular operation has finished.  A completion, for instance,
      may require the delivery of several messages, or it may also
      reflect that some local processing has finished.

   o  Data Sink - the peer on which any placement occurs.

   o  Data Source - the peer from which the placed data originates.

   o  Steering Tag - a "handle" used to identify memory which is the
      target of placement.  A "tagged" message is one which references
      such a handle.

   o  RDMA Write - an Operation which places data from a local data
      buffer to a remote data buffer specified by a Steering Tag.

   o  RDMA Read - an Operation which places data to a local data
      buffer specified by a Steering Tag from a remote data buffer
      specified by another Steering Tag.

   o  Send - an Operation which places data from a local data buffer
      to a remote data buffer of the data sink's choice.  Sends are
      therefore "untagged".

1.2. DDP and RDMA Protocols

   The goal of the DDP protocol is to allow the efficient placement of
   data into buffers designated by protocols layered above DDP (e.g.
   RDMA).  This is described in detail in [ROM].  Efficiency may be
   characterized by the minimization of the number of transfers of the
   data over the receiver's system buses.

   The goal of the RDMA protocol is to provide the semantics to enable
   Remote Direct Memory Access between peers in a way consistent with
   application requirements.  The RDMA protocol provides facilities
   immediately useful to existing and future networking, storage, and
   other application protocols [DAFS, FCVI, IB, MYR, SDP, SRVNET, VI].

   The DDP and RDMA protocols work together to achieve their
   respective goals.  DDP provides facilities to safely steer payloads
   to specific buffers at the Data Sink.
   RDMA provides facilities to Upper Layers for identifying these
   buffers, controlling the transfer of data between peers' buffers,
   supporting authorized bidirectional transfer between buffers, and
   signalling completion.  Upper Layer Protocols that do not require
   the features of RDMA may be layered directly on top of DDP.

   The DDP and RDMA protocols are transport independent.  The
   following figure shows the relationship between RDMA, DDP, Upper
   Layer Protocols and Transport.

      +--------------------------------------------------+
      |               Upper Layer Protocol               |
      +---------+------------+---------------------------+
      |         |            |           RDMA            |
      |         |            +---------------------------+
      |         |                  DDP                   |
      |         +----------------------------------------+
      |                   Transport                      |
      +--------------------------------------------------+

2. Architecture

   The Architecture section is presented in two parts: Direct Data
   Placement Protocol architecture and Remote Direct Memory Access
   Protocol architecture.

2.1. Direct Data Placement (DDP) Protocol Architecture

   The central idea of general-purpose DDP is that a data sender will
   supplement the data it sends with placement information that allows
   the receiver's network interface to place the data directly at its
   final destination without any copying.  DDP can be used to steer
   received data to its final destination, without requiring layer-
   specific behavior for each different layer.  Data sent with such
   DDP information is said to be `tagged'.

   The central component of the DDP architecture is the `buffer',
   which is an object with beginning and ending addresses, and a
   method (set()) to set the value of an octet at an address.  In many
   cases, a buffer corresponds directly to a portion of host user
   memory.  However, DDP does not depend on this -- a buffer could be
   a disk file, or anything else that can be viewed as an addressable
   collection of octets.
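   Before giving the abstract interface, a hypothetical host-memory
   realization of such a buffer may help fix ideas.  This is a minimal
   C sketch, not part of the architecture: all names are illustrative,
   and set() is rendered as a plain function taking the buffer as an
   explicit argument, with bounds checking against the beginning and
   ending addresses:

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t  address_t;  /* a reference into local memory */
typedef uint8_t data_t;     /* an octet data value */

/* A host-memory buffer: an addressable collection of octets
 * delimited by beginning and ending addresses. */
typedef struct {
    address_t start;
    address_t end;      /* one past the last valid address */
    data_t   *mem;      /* backing store for [start, end) */
} ddp_buffer_t;

/* set(): write octet v at address a.  Returns 0 on success,
 * -1 if a falls outside the buffer. */
int ddp_buffer_set(ddp_buffer_t *b, address_t a, data_t v)
{
    if (a < b->start || a >= b->end)
        return -1;
    b->mem[a - b->start] = v;
    return 0;
}
```

   A disk file, or any other addressable object, could back the same
   interface by substituting a different set() implementation.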
   Abstractly, a buffer provides the interface:

      typedef struct {
          const address_t start;
          const address_t end;
          void set(address_t a, data_t v);
      } ddp_buffer_t;

      address_t

         a reference to local memory

      data_t

         an octet data value.

   The protocol layering and in-line data flow of DDP is:

                     DDP Client Protocol
              (e.g. RDMA or Upper Layer Protocol)
                        |      ^
      untagged messages |      | untagged message delivery
      tagged messages   |      | tagged message delivery
                        v      |
                       DDP+---> data placement
                        ^
                        | transport messages
                        v
                    Transport
           (e.g. SCTP, DCCP, framed TCP)
                        ^
                        | IP datagrams
                        v
                       . . .

   In addition to in-line data flow, the client protocol registers
   buffers with DDP, and DDP performs buffer update (set()) operations
   as a result of receiving tagged messages.

   DDP messages may be split into multiple, smaller DDP messages, each
   in a separate transport message.  However, if the transport is
   unreliable or unordered, messages split across transport messages
   may or may not provide useful behavior, in the same way as
   splitting arbitrary Upper Layer messages across unreliable or
   unordered transport messages may or may not provide useful
   behavior.  In other words, the same considerations apply to
   building client protocols on different types of transports with or
   without the use of DDP.

   A DDP message split across transport messages looks like:

      DDP message:               Transport messages:

      stag=s, offset=o,          message 1:
      notify=y, id=i             |type=ddp  |
      message=                   |stag=s    |
      |aabbccddee|-------.       |offset=o  |
      ~   ...    ~----.   \      |notify=n  |
      |vvwwxxyyzz|-.   \   \     |id=?      |
                   |    \   `--->|aabbccddee|
                   |     \       ~   ...    ~
                   |      +----->|iijjkkllmm|
                   |      |
                   +      |      message 2:
                    \     |      |type=ddp  |
                     \    |      |stag=s    |
                      \   +      |offset=o+n|
                       \   \     |notify=y  |
                        \   \    |id=i      |
                         \   `-->|nnooppqqrr|
                          \      ~   ...    ~
                           `---->|vvwwxxyyzz|
   Although this picture suggests that DDP information is carried in-
   line with the message payload, components of the DDP information
   may also be in transport-specific fields, or derived from
   transport-specific control information if the transport permits.

2.1.1. Transport Operations

   For the purposes of this architecture, the transport provides:

      void xpt_send(socket_t s, message_t m);
      message_t xpt_recv(socket_t s);
      msize_t xpt_max_msize(socket_t s);

      socket_t

         a transport address, including IP addresses, ports and other
         transport-specific identifiers.

      message_t

         a string of octets.

      msize_t (scalar)

         a message size.

      xpt_send(socket_t s, message_t m)

         send a transport message.

      xpt_recv(socket_t s)

         receive a transport message.

      xpt_max_msize(socket_t s)

         get the current maximum transport message size.  Corresponds,
         roughly, to the current path Maximum Transfer Unit (PMTU),
         adjusted by underlying protocol overheads.

   Real implementations of xpt_send() and xpt_recv() typically return
   error indications, but that is not relevant to this architecture.

2.1.2. DDP Operations

   The DDP layer provides:

      void ddp_send(socket_t s, message_t m);
      void ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d,
                        ddp_notify_t n);
      void ddp_post_recv(socket_t s, bdesc_t b);
      ddp_ind_t ddp_recv(socket_t s);
      bdesc_t ddp_register(socket_t s, ddp_buffer_t b);
      void ddp_deregister(bhand_t bh);
      msizes_t ddp_max_msizes(socket_t s);

      ddp_addr_t

         the buffer address portion of a tagged message:

         typedef struct {
             stag_t stag;
             address_t offset;
         } ddp_addr_t;

      stag_t (scalar)

         a Steering Tag.  A stag_t identifies the destination buffer
         for tagged messages.  stag_ts are generated when the buffer
         is registered, communicated to the sender by some client
         protocol convention and inserted in DDP messages.
         stag_t values in this DDP architecture are assumed to be
         completely opaque to the client protocol, and implementation-
         dependent.  However, particular implementations, such as DDP
         on a multicast transport (see below), may provide the buffer
         holder some control in selecting stag_ts.

      ddp_notify_t

         the notification portion of a DDP message, used to signal
         that the message represents the final fragment of a multi-
         segmented DDP message:

         typedef struct {
             boolean_t notify;
             ddp_msg_id_t i;
         } ddp_notify_t;

      ddp_msg_id_t (scalar)

         a DDP message identifier.  msg_id_ts are chosen by the DDP
         message receiver (buffer holder), communicated to the sender
         by some client protocol convention and inserted in DDP
         messages.  Whether a message reception indication is
         requested for a DDP message is a matter of client protocol
         convention.  Unlike stag_ts, the structure of msg_id_ts is
         opaque to DDP, and therefore, completely in the hands of the
         client protocol.

      bdesc_t

         a description of a registered buffer:

         typedef struct {
             bhand_t bh;
             ddp_addr_t a;
         } bdesc_t;

         `a.offset' is the starting offset of the registered buffer,
         which may have no relationship to the `start' or `end'
         addresses of that buffer.  However, particular
         implementations, such as DDP on a multicast transport (see
         below), may allow some client protocol control over the
         starting offset.

      bhand_t

         an opaque buffer handle used to deregister a buffer.
      recv_message_t

         a description of a completed untagged receive buffer:

         typedef struct {
             bdesc_t b;
             length_t l;
         } recv_message_t;

      ddp_ind_t

         an untagged message, a tagged message reception indication,
         or a tagged message reception error:

         typedef union {
             recv_message_t m;
             ddp_msg_id_t i;
             ddp_err_t e;
         } ddp_ind_t;

      ddp_err_t

         indicates an error while receiving a tagged message,
         typically `offset' out of bounds, or `stag' is not registered
         to the socket.

      msizes_t

         the maximum untagged and tagged messages that fit in a single
         transport message:

         typedef struct {
             msize_t max_untagged;
             msize_t max_tagged;
         } msizes_t;

      ddp_send(socket_t s, message_t m)

         send an untagged message.

      ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d,
                   ddp_notify_t n)

         send a tagged message to remote buffer address d.

      ddp_post_recv(socket_t s, bdesc_t b)

         post a registered buffer to accept a single received untagged
         message.  Each buffer is returned to the caller in a
         ddp_recv() untagged message reception indication, in the
         order in which it was posted.  The same buffer may be enabled
         on multiple sockets; receipt of an untagged message into the
         buffer from any of these sockets unposts the buffer from all
         sockets.

      ddp_recv(socket_t s)

         get the next received untagged message, tagged message
         reception indication, or tagged message error.

      ddp_register(socket_t s, ddp_buffer_t b)

         register a buffer for DDP on a socket.  The same buffer may
         be registered multiple times on the same or different
         sockets.  The same buffer registered on different sockets may
         result in a common registration.  Different buffers may also
         refer to portions of the same underlying addressable object
         (buffer aliasing).

      ddp_deregister(bhand_t bh)

         remove a registration from a buffer.
      ddp_max_msizes(socket_t s)

         get the current maximum untagged and tagged message sizes
         that will fit in a single transport message.

2.1.3. Transport Characteristics in DDP

   Certain characteristics of the transport on which DDP is mapped
   determine the nature of the service provided to client protocols.
   Fundamentally, the characteristics of the transport will not be
   changed by the presence of DDP.  The choice of transport is
   therefore driven not by DDP, but by the requirements of the Upper
   Layer employing the DDP service.

   Specifically, transports are:

   o  reliable or unreliable,

   o  ordered or unordered,

   o  single source or multisource,

   o  single destination or multidestination (multicast or anycast).

   Some transports support several combinations of these
   characteristics.  For example, SCTP [SCTP] is reliable, single
   source, single destination (point-to-point) and supports both
   ordered and unordered modes.

   DDP messages carried by transport are framed for processing by the
   receiver, and may be further protected for integrity or privacy in
   accordance with the transport capabilities.  DDP does not provide
   such functions.

   In general, transport characteristics equally affect transport and
   DDP message delivery.  However, there are several issues specific
   to DDP messages.

   A key component of DDP is how the following operations on the
   receiving side are ordered among themselves, and how they relate to
   corresponding operations on the sending side:

   o  set()s,

   o  untagged message reception indications, and

   o  tagged message reception indications.

   These relationships depend upon the characteristics of the
   underlying transport in a way which is defined by the DDP protocol.
   For example, if the transport is unreliable and unordered, the DDP
   protocol might specify that the client protocol is subject to the
   consequences of transport messages being lost or duplicated, rather
   than requiring different characteristics be presented to the client
   protocol.

   Multidestination data delivery is the other transport
   characteristic which may require specific consideration in a DDP
   protocol.  As mentioned above, the basic DDP model assumes that
   buffer address values returned by ddp_register() are opaque to the
   client protocol, and can be implementation dependent.  The most
   natural way to map DDP to a multidestination transport is to
   require that all receivers produce the same buffer address when
   registering a multidestination destination buffer.  Restriction of
   the DDP model to accommodate multiple destinations involves
   engineering tradeoffs comparable to those of providing non-DDP
   multidestination transport capability.

   A registered buffer is identified within DDP by its stag_t, which
   in turn is associated with a socket.  This registration therefore
   grants a capability to the DDP peer, and the socket (using the
   underlying properties of its chosen transport and possible
   security) identifies the peer and authenticates the stag_t.

   The same buffer may be enabled by ddp_post_recv() on multiple
   sockets.  In this case any ddp_recv() untagged message reception
   indication may be provided on a different socket from that on which
   the buffer was posted.  Such indications are not ordered among
   multiple DDP sockets.

   When multiple sockets reference an untagged message reception
   buffer, local interfaces are responsible for managing the
   mechanisms of allocating posted buffers to received untagged
   messages, the handling of received untagged messages when no buffer
   is available, and resource management among multiple sockets.
   Where underprovisioning of buffers on multiple sockets is allowed,
   mechanisms should be provided to manage buffer consumption on a
   per-socket basis, or for a group of related sockets.

   Architecturally, therefore, DDP is a flexible and general paradigm
   which may be applied to any variety of transports.  Implementations
   of DDP may, however, adapt themselves to these differences in ways
   appropriate to each transport.  In all cases the layering of DDP
   must continue to express the transport's underlying
   characteristics.

2.2. Remote Direct Memory Access (RDMA) Protocol Architecture

   Remote Direct Memory Access (RDMA) extends the capabilities of DDP
   with two primary functions.

   First, it adds the ability to read from buffers registered to a
   socket (RDMA Read).  This allows a client protocol to perform
   arbitrary, bidirectional data movement without involving the remote
   client.  When RDMA is implemented in hardware, arbitrary data
   movement can be performed without involving the remote host CPU at
   all.

   In addition, RDMA specifies a transport-independent untagged
   message service (Send) with characteristics which are both very
   efficient to implement in hardware, and convenient for client
   protocols.

   The RDMA architecture is patterned after the traditional model for
   device programming, where the client requests an operation using
   Send-like actions (programmed I/O), the server performs the
   necessary data transfers for the operation (DMA reads and writes),
   and notifies the client of completion.  The programmed I/O+DMA
   model efficiently supports a high degree of concurrency and
   flexibility for both the client and server, even when operations
   have a wide range of intrinsic latencies.
   RDMA is layered as a client protocol on top of DDP:

                         Client Protocol
                            |      ^
                      Sends |      | Send reception indications
         RDMA Read Requests |      | RDMA Read Completion indications
                RDMA Writes |      | RDMA Write Completion indications
                            v      |
                           RDMA
                            |      ^
          untagged messages |      | untagged message delivery
          tagged messages   |      | tagged message delivery
                            v      |
                           DDP+---> data placement
                            ^
                            | transport messages
                            v
                           . . .

   In addition to in-line data flow, read (get()) and update (set())
   operations are performed on buffers registered with RDMA as a
   result of RDMA Read Requests and RDMA Writes, respectively.

   An RDMA `buffer' extends a DDP buffer with a get() operation that
   retrieves the value of the octet at address `a':

      typedef struct {
          const address_t start;
          const address_t end;
          void set(address_t a, data_t v);
          data_t get(address_t a);
      } rdma_buffer_t;

2.2.1. RDMA Operations

   The RDMA layer provides:

      void rdma_send(socket_t s, message_t m);
      void rdma_write(socket_t s, message_t m, ddp_addr_t d,
                      rdma_notify_t n);
      void rdma_read(socket_t s, ddp_addr_t src, length_t l,
                     ddp_addr_t dst);
      void rdma_post_recv(socket_t s, bdesc_t b);
      rdma_ind_t rdma_recv(socket_t s);
      bdesc_t rdma_register(socket_t s, rdma_buffer_t b,
                            bmode_t mode);
      void rdma_deregister(bhand_t bh);
      msizes_t rdma_max_msizes(socket_t s);

   Although, for clarity, these data transfer interfaces are
   synchronous, rdma_read() and possibly rdma_send() (in the presence
   of Send flow control) can require an arbitrary amount of time to
   complete.  To express the full concurrency and interleaving of RDMA
   data transfer, these interfaces should also be reentrant.  For
   example, a client protocol may perform an rdma_send() while an
   rdma_read() operation is in progress.
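   The data movement behind rdma_read() can be sketched with a small C
   fragment showing what a responder might do with a locally
   registered buffer: copy l octets starting at the requested address
   into the reply payload via the buffer's get() operation, failing if
   the range is not covered by the registration.  The representation
   and helper names are assumptions for illustration only, not part of
   the architecture:

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t  address_t;
typedef uint8_t data_t;
typedef size_t  length_t;

/* A registered RDMA buffer backed by host memory. */
typedef struct {
    address_t start;
    address_t end;     /* one past the last valid address */
    data_t   *mem;     /* backing store for [start, end) */
} rdma_buffer_t;

/* get(): read the octet at address a (bounds pre-checked). */
static data_t rdma_buffer_get(const rdma_buffer_t *b, address_t a)
{
    return b->mem[a - b->start];
}

/* Responder side of an RDMA Read: copy l octets beginning at
 * address src into the reply payload out[].  Returns 0 on
 * success, -1 if [src, src + l) lies outside the registration. */
int rdma_read_respond(const rdma_buffer_t *b, address_t src,
                      length_t l, data_t *out)
{
    if (src < b->start || src > b->end || l > b->end - src)
        return -1;                      /* addressing error */
    for (length_t i = 0; i < l; i++)
        out[i] = rdma_buffer_get(b, src + i);
    return 0;
}
```

   A failed range check here corresponds to the buffer addressing
   errors reported through rdma_err_t below.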
      rdma_notify_t

         RDMA Write notification information, used to signal that the
         message represents the final fragment of a multi-segmented
         RDMA message:

         typedef struct {
             boolean_t notify;
             rdma_write_id_t i;
         } rdma_notify_t;

         identical in function to ddp_notify_t, except that the type
         rdma_write_id_t may not be equivalent to ddp_msg_id_t.

      rdma_write_id_t (scalar)

         an RDMA Write identifier.

      rdma_ind_t

         a Send message, or an RDMA error:

         typedef union {
             recv_message_t m;
             rdma_err_t e;
         } rdma_ind_t;

      rdma_err_t

         an RDMA protocol error indication.  RDMA errors include
         buffer addressing errors corresponding to ddp_err_ts, and
         buffer protection violations (e.g. RDMA Writing a buffer only
         registered for reading).

      bmode_t

         buffer registration mode (permissions).  Any combination of
         permitting RDMA Read (BMODE_READ) and RDMA Write
         (BMODE_WRITE) operations.

      rdma_send(socket_t s, message_t m)

         send a message, delivering it to the next untagged RDMA
         buffer at the remote peer.

      rdma_write(socket_t s, message_t m, ddp_addr_t d,
                 rdma_notify_t n)

         RDMA Write to remote buffer address d.

      rdma_read(socket_t s, ddp_addr_t src, length_t l, ddp_addr_t dst)

         RDMA Read l octets from remote buffer address src to local
         buffer address dst.

      rdma_post_recv(socket_t s, bdesc_t b)

         post a registered buffer to accept a single Send message, to
         be filled and returned in-order to a subsequent caller of
         rdma_recv().  As with DDP, buffers may be enabled on multiple
         sockets, in which case ordering guarantees are relaxed.  Also
         as with DDP, local interfaces must manage the mechanisms of
         allocation and management of buffers posted to multiple
         sockets.

      rdma_recv(socket_t s)

         get the next received Send message, RDMA Write completion
         identifier, or RDMA error.
      rdma_register(socket_t s, rdma_buffer_t b, bmode_t mode)

         register a buffer for RDMA on a socket (for read access,
         write access or both).  As with DDP, the same buffer may be
         registered multiple times on the same or different sockets,
         and different buffers may refer to portions of the same
         underlying addressable object.

      rdma_deregister(bhand_t bh)

         remove a registration from a buffer.

      rdma_max_msizes(socket_t s)

         get the current maximum Send (max_untagged) and RDMA Read or
         Write (max_tagged) operation sizes that will fit in a single
         transport message.  The values returned by rdma_max_msizes()
         are closely related to the values returned by
         ddp_max_msizes(), but may not be equal.

2.2.2. Transport Characteristics in RDMA

   As with DDP, RDMA can be used on transports with a variety of
   different characteristics that manifest themselves directly in the
   service provided by RDMA.  Also as with DDP, the fundamental
   characteristics of the transport will not be changed by the
   presence of RDMA.

   Like DDP, an RDMA protocol must specify how:

   o  set()s,

   o  get()s,

   o  Send messages, and

   o  RDMA Read completions

   are ordered among themselves and how they relate to corresponding
   operations on the remote peer(s).  These relationships are likely
   to be a function of the underlying transport characteristics.

   There are some additional characteristics of RDMA which may
   translate poorly to unreliable or multipoint transports due to
   attendant complexities in managing endpoint state:

   o  Send flow control

   o  RDMA Read

   These difficulties can be overcome by placing restrictions on the
   service provided by RDMA.  However, many RDMA clients, especially
   those that separate data transfer and application logic concerns,
   are likely to depend upon capabilities only provided by RDMA on a
   point-to-point, reliable transport.
   In other words, many potential Upper Layers which might avail
   themselves of RDMA services are naturally already biased toward
   these transport classes.

3. Security Considerations

   Fundamentally, the DDP and RDMA protocols should not introduce
   additional vulnerabilities.  They are intermediate protocols and so
   should not perform or require functions such as authorization,
   which are the domain of Upper Layers.  However, the DDP and RDMA
   protocols should allow mapping by strict Upper Layers which are not
   permissive of new vulnerabilities -- DDP and RDMAP implementations
   should be prohibited from `cutting corners' that create new
   vulnerabilities.  Implementations must ensure that only `supplied'
   resources (i.e. buffers) can be manipulated by DDP or RDMAP
   messages.

   System integrity must be maintained in any RDMA solution.
   Mechanisms must be specified to prevent RDMA or DDP operations from
   impairing system integrity.  For example, threats can include
   potential buffer reuse or buffer overflow, and are not merely a
   security issue.  Even trusted peers must not be allowed to damage
   local integrity.  Any DDP and RDMA protocol must address the issue
   of giving end-systems and applications the capabilities to offer
   protection from such compromises.

   Because a Steering Tag exports access to a memory region, one
   critical aspect of security is the scope of this access.  It must
   be possible to individually control specific attributes of the
   access provided by a Steering Tag on the endpoint (socket) on which
   it was registered, including remote read access, remote write
   access, and others that might be identified.  DDP and RDMA
   specifications must provide both implementation requirements
   relevant to this issue, and guidelines to assist implementors in
   making the appropriate design decisions.
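   The per-Steering-Tag access control described above can be sketched
   in C.  In this hypothetical fragment (the representation and names
   are assumptions for illustration, not drawn from any
   specification), each registration records the socket it belongs to
   and a bmode_t-style permission mask, and every remote operation is
   checked against both before any placement or read-back occurs:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t stag_t;
typedef int      socket_id_t;

/* Registration modes, as in bmode_t: any combination of
 * permitting RDMA Read and RDMA Write. */
#define BMODE_READ  0x1u
#define BMODE_WRITE 0x2u

typedef struct {
    stag_t      stag;
    socket_id_t sock;   /* socket the stag was registered on */
    unsigned    mode;   /* BMODE_* permission bits */
} registration_t;

/* Check whether operation `op' (BMODE_READ or BMODE_WRITE),
 * arriving on socket `sock' and naming `stag', is authorized.
 * A stag is valid only on the socket on which it was registered,
 * and only for the registered modes.  Returns 1 if allowed,
 * 0 otherwise. */
int stag_access_ok(const registration_t *regs, size_t nregs,
                   socket_id_t sock, stag_t stag, unsigned op)
{
    for (size_t i = 0; i < nregs; i++) {
        if (regs[i].stag == stag && regs[i].sock == sock)
            return (regs[i].mode & op) == op;
    }
    return 0;   /* unknown stag: reject before touching data */
}
```

   Under this model, a write naming a stag registered only for
   reading, or any reference to a stag on a socket other than the one
   on which it was registered, is refused before data is touched.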
For example, it must not be possible for DDP to enable evasion of
memory consistency checks at the recipient.  The DDP and RDMA
specifications must allow the recipient to rely on its consistent
memory contents by explicitly controlling peer access to memory
regions at appropriate times.

Peer connections which do not pass authentication and authorization
checks by upper layers must not be permitted to begin processing in
RDMA mode with an inappropriate endpoint.  Once associated, peer
accesses to memory regions must be authenticated and made subject to
authorization checks in the context of the association and endpoint
(socket) on which they are to be performed, prior to any transfer
operation or data being accessed.  The RDMA protocols must ensure
that these region protections be under strict application control.

The use of DDP and RDMA on a transport connection may interact with
any security mechanism, and vice versa.  For example, if the
security mechanism is implemented above the transport layer, the DDP
and RDMA headers may not be protected.  Such a layering may
therefore be inappropriate, depending on requirements.

IPsec, operating to secure the connection on a packet-by-packet
basis, seems to be a natural fit for securing RDMA placement, which
operates in conjunction with transport.  Because RDMA enables an
implementation to avoid buffering, it is preferable to perform all
applicable security protection prior to processing of each segment
by the transport and RDMA layers.  Such a layering enables the most
efficient secure RDMA implementation.

The TLS record protocol, on the other hand, is layered on top of
reliable transports and cannot provide such security assurance until
an entire record is available, which may require the buffering
and/or assembly of several distinct messages prior to TLS
processing.
This defers RDMA processing and introduces overheads that RDMA is
designed to avoid.  TLS therefore is viewed as potentially a less
natural fit for protecting the RDMA protocols.

Resource issues leading to denial-of-service attacks, overwrites and
other concurrent operations, the ordering of completions as required
by the RDMA protocol, and the granularity of transfer are all within
the required scope of any security analysis of RDMA and DDP.

The RDMA operations require checking of what is essentially user
information, explicitly including addressing information and
operation type (read or write), and implicitly including protection
and attributes.  The semantics associated with each class of error
resulting from possible failure of such checks must be clearly
defined, and the expected action to be taken by the protocols in
each case must be specified.

In some cases, this will result in a catastrophic error on the RDMA
association; however, in others, a local or remote error may be
signalled.  Certain of these errors may require consideration of
abstract local semantics.  The result of the error on the RDMA
association must be carefully specified so as to provide useful
behavior, while not constraining the implementation.

4. IANA Considerations

IANA considerations are not addressed by this document.  Any IANA
considerations resulting from the use of DDP or RDMA must be
addressed in the relevant standards.

5. Acknowledgements

The authors wish to acknowledge the valuable contributions of
Caitlin Bestler, David Black, Jeff Mogul and Allyn Romanow.

6. Informative References

[DAFS]   DAFS Collaborative, "Direct Access File System
         Specification v1.0", September 2001, available from
         http://www.dafscollaborative.org

[FCVI]   ANSI Technical Committee T11, "Fibre Channel Standard
         Virtual Interface Architecture Mapping", ANSI/NCITS
         357-2001, March 2001, available from
         http://www.t11.org/t11/stat.nsf/fcproj

[IB]     InfiniBand Trade Association, "InfiniBand Architecture
         Specification Volumes 1 and 2", Release 1.1, November 2002,
         available from http://www.infinibandta.org/specs

[MYR]    VMEbus International Trade Association, "Myrinet on VME
         Protocol Specification", ANSI/VITA 26-1998, August 1998,
         available from http://www.myri.com/open-specs

[ROM]    A. Romanow, J. Mogul, T. Talpey and S. Bailey, "RDMA over
         IP Problem Statement", draft-ietf-rddp-problem-statement-05,
         Work in Progress, October 2004

[SCTP]   R. Stewart et al., "Stream Control Transmission Protocol",
         RFC 2960, Standards Track

[SDP]    InfiniBand Trade Association, "Sockets Direct Protocol
         v1.0", Annex A of InfiniBand Architecture Specification
         Volume 1, Release 1.1, November 2002, available from
         http://www.infinibandta.org/specs

[SRVNET] R. Horst, "TNet: A reliable system area network", IEEE
         Micro, pp. 37-45, February 1995

[VI]     Compaq Computer Corp., Intel Corporation and Microsoft
         Corporation, "Virtual Interface Architecture Specification
         Version 1.0", December 1997, available from
         http://www.vidf.org/info/04standards.html

Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA

Phone: +1 978 689 1614
Email: steph@sandburst.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA

Phone: +1 781 768 5329
Email: thomas.talpey@netapp.com

Full Copyright Statement

Copyright (C) The Internet Society (2004).
This document is subject to the rights, licenses and restrictions
contained in BCP 78 and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on
an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed
to pertain to the implementation or use of the technology described
in this document or the extent to which any license under such
rights might or might not be available; nor does it represent that
it has made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC
documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use
of such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository
at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.