| draft-bailey-roi-ddp-rdma-arch-00.txt | draft-bailey-roi-ddp-rdma-arch-01.txt | |||
|---|---|---|---|---|
| S. Bailey (Sandburst) | Internet-Draft Stephen Bailey (Sandburst) | |||
| Internet-draft Expires: July 2002 | Expires: May 2003 Tom Talpey (NetApp) | |||
| The Architecture of Direct Data Placement (DDP) | The Architecture of Direct Data Placement (DDP) | |||
| And Remote Direct Memory Access (RDMA) | And Remote Direct Memory Access (RDMA) | |||
| On Internet Protocols | On Internet Protocols | |||
| draft-bailey-roi-ddp-rdma-arch-00 | draft-bailey-roi-ddp-rdma-arch-01 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 35 | skipping to change at page 1, line 35 | |||
| progress." | progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2001). All Rights Reserved. | Copyright (C) The Internet Society (2002). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines an abstract architecture for Direct Data | This document defines an abstract architecture for Direct Data | |||
| Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to | Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to | |||
| run on Internet Protocol-suite transport protocols. This | run on Internet Protocol-suite transports. This architecture does | |||
| architecture does not necessarily reflect the proper way to | not necessarily reflect the proper way to implement such protocols, | |||
| implement such protocols, but is, rather, a descriptive tool for | but is, rather, a descriptive tool for defining and understanding | |||
| defining and understanding the protocols. | the protocols. | |||
| Table Of Contents | Table Of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 2. Direct Data Placement (DDP) Architecture . . . . . . . . . 2 | 2. Architecture . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2.1. Transport Operations . . . . . . . . . . . . . . . . . . . 4 | 2.1. Direct Data Placement (DDP) Protocol Architecture . . . 3 | |||
| 2.2. DDP Operations . . . . . . . . . . . . . . . . . . . . . . 5 | 2.1.1. Transport Operations . . . . . . . . . . . . . . . . . . 5 | |||
| 2.3. Transport Characterstics In DDP . . . . . . . . . . . . . 8 | 2.1.2. DDP Operations . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 3. Remote Direct Memory Access (RDMA) Protocol Architecture . 9 | 2.1.3. Transport Characteristics in DDP . . . . . . . . . . . . 9 | |||
| 3.1. RDMA Operations . . . . . . . . . . . . . . . . . . . . . 10 | 2.2. Remote Direct Memory Access Protocol Architecture . . . 10 | |||
| 3.2. Transport Characterstics In RDMA . . . . . . . . . . . . . 12 | 2.2.1. RDMA Operations . . . . . . . . . . . . . . . . . . . . 11 | |||
| 4. Security Considerations . . . . . . . . . . . . . . . . . 13 | 2.2.2. Transport Characteristics in RDMA . . . . . . . . . . . 14 | |||
| 5. IANA Considerations . . . . . . . . . . . . . . . . . . . 13 | 3. Security Considerations . . . . . . . . . . . . . . . . 14 | |||
| Author's Address . . . . . . . . . . . . . . . . . . . . . 13 | 4. IANA Considerations . . . . . . . . . . . . . . . . . . 15 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . 14 | 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . 15 | |||
| | References . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| | Authors' Addresses . . . . . . . . . . . . . . . . . . . 16 | |||
| | Full Copyright Statement . . . . . . . . . . . . . . . . 17 | |||
| 1. Introduction | 1. Introduction | |||
| This document defines an abstract architecture for Direct Data | This document defines an abstract architecture for Direct Data | |||
| Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to | Placement (DDP) and Remote Direct Memory Access (RDMA) protocols to | |||
| run on Internet Protocol-suite transport protocols. This | run on Internet Protocol-suite transports [RDDP, ROM]. This | |||
| architecture does not necessarily reflect the proper way to | architecture does not necessarily reflect the proper way to | |||
| implement such protocols, but is, rather, a descriptive tool for | implement such protocols, but is, rather, a descriptive tool for | |||
| defining and understanding the protocols. | defining and understanding the protocols. | |||
| The first section describes the architecture of DDP protocols, | The first part of the document describes the architecture of DDP | |||
| including assumptions of the transports on which DDP is built. The | protocols, including what assumptions are made about the transports | |||
| second section describes the architecture of RDMA protocols layered | on which DDP is built. The second part describes the architecture | |||
| on top of DDP. | of RDMA protocols layered on top of DDP. | |||
| 2. Direct Data Placement (DDP) Architecture | Before introducing the protocols, three definitions will be useful | |||
| | to guide discussion: | |||
| | o Placement - writing to a data buffer. | |||
| | o Delivery - informing the Upper Layer Protocol (ULP) (e.g. | |||
| | RDMA) that a particular message is available for use. | |||
| | Delivery therefore may be viewed as the "control" signal | |||
| | associated with a unit of data. Note that the order of | |||
| | delivery is defined more strictly than it is for placement. | |||
| | o Completion - informing the ULP or application that a | |||
| | particular RDMA operation has finished. A completion, for | |||
| | instance, may require the delivery of several messages, or it | |||
| | may also reflect that some local processing has finished. | |||
| | The goal of the DDP protocol is to allow the efficient placement of | |||
| | data into buffers designated by Upper Layer Protocols (e.g. RDMA). | |||
| | This is described in detail in [ROM]. Efficiency may be | |||
| | characterized by the minimization of the number of transfers of the | |||
| | data over the receiver's system buses. | |||
| | The goal of the RDMA protocol is to provide the semantics to enable | |||
| | Remote Direct Memory Access between peers in a way consistent with | |||
| | application requirements. The RDMA protocol provides facilities | |||
| | immediately useful to existing and future networking, storage, and | |||
| | other application protocols. [DAFS, FIBRE, IB, MYR, SDP, SRVNET, | |||
| | VI] | |||
| | The DDP and RDMA protocols work together to achieve their | |||
| | respective goals. RDMA provides facilities to a ULP for | |||
| | identifying buffers, controlling the transfer of data between ULP | |||
| | peers, and providing completion notifications to the ULP. RDMA | |||
| | uses the features of DDP to steer payloads to specific buffers at | |||
| | the Data Sink. ULPs that do not require the features of RDMA may | |||
| | be layered directly on top of DDP. | |||
| | The DDP and RDMA protocols are transport independent. The | |||
| | following figure shows the relationship between RDMA, DDP, Upper | |||
| | Layer Protocols and Transport. | |||
| | +---------------------------------------------------+ | |||
| | | ULP | | |||
| | +---------+------------+----------------------------+ | |||
| | | | | RDMA | | |||
| | | | +----------------------------+ | |||
| | | | DDP | | |||
| | | +-----------------------------------------+ | |||
| | | Transport | | |||
| | +---------------------------------------------------+ | |||
| | 2. Architecture | |||
| | The Architecture section is presented in two parts: Direct Data | |||
| | Placement Protocol architecture and Remote Direct Memory Access | |||
| | Protocol architecture. | |||
| | 2.1. Direct Data Placement (DDP) Protocol Architecture | |||
| The central idea of general-purpose DDP is that a data sender will | The central idea of general-purpose DDP is that a data sender will | |||
| supplement the data it sends with placement information that allows | supplement the data it sends with placement information that allows | |||
| the receiver's network interface (NI) to place the data directly at | the receiver's network interface to place the data directly at its | |||
| its final destination without any copying. DDP can be used to | final destination without any copying. DDP can be used to steer | |||
| steer received data to its final destination for any ULP without | received data to its final destination, without requiring layer- | |||
| requiring ULP-specific behavior in the NI for each different ULP. | specific behavior for each different layer. Data sent with such | |||
| Data sent with DDP information is said to be `DDP-decorated'. | DDP information is said to be `tagged'. | |||
| The central component of the DDP architecture is the `buffer', | The central component of the DDP architecture is the `buffer', | |||
| which is an object with beginning and ending addresses, and a | which is an object with beginning and ending addresses, and a | |||
| method (set()) to set the value of an octet at an address. In many | method (set()) to set the value of an octet at an address. In many | |||
| cases, a buffer corresponds directly to a portion of host memory. | cases, a buffer corresponds directly to a portion of host user | |||
| However, DDP does not depend on this---a buffer could be a disk | memory. However, DDP does not depend on this---a buffer could be a | |||
| file, or anything else that can be viewed as an addressable | disk file, or anything else that can be viewed as an addressable | |||
| collection of octets. Abstractly, a buffer provides the interface: | collection of octets. Abstractly, a buffer provides the interface: | |||
| typedef struct { | typedef struct { | |||
| const address_t start; | const address_t start; | |||
| const address_t end; | const address_t end; | |||
| void set(address_t a, uint8_t v); | void set(address_t a, data_t v); | |||
| } buffer_t; | } ddp_buffer_t; | |||
| | address_t | |||
| | a reference to local memory | |||
| | data_t | |||
| | an octet data value. | |||
| The protocol layering and in-line data flow of DDP is: | The protocol layering and in-line data flow of DDP is: | |||
| Client Protocol | Client Protocol | |||
| (e.g. ULP or RDMA) | (e.g. ULP or RDMA) | |||
| | ^ | | ^ | |||
| undecorated messages | | undecorated messages | untagged messages | | untagged message delivery | |||
| DDP-decorated messages | | DDP-decorated message reception | tagged messages | | tagged message delivery | |||
| v | indications | v | | |||
| DDP | DDP+---> data placement | |||
| ^ | ^ | |||
| | transport messages | | transport messages | |||
| v | v | |||
| Transport | Transport | |||
| (e.g. SCTP, DCP) | (e.g. SCTP, DCP, framed TCP) | |||
| ^ | ^ | |||
| | IP datagrams | | IP datagrams | |||
| v | v | |||
| . . . | . . . | |||
| In addition to in-line data flow, the client protocol registers | In addition to in-line data flow, the client protocol registers | |||
| buffers with DDP, and DDP performs buffer update (set()) operations | buffers with DDP, and DDP performs buffer update (set()) operations | |||
| as a result of receiving DDP-decorated messages. | as a result of receiving tagged messages. | |||
| Undecorated messages correspond directly to messages of the | ||||
| underlying transport, but must still be distinguished from DDP- | ||||
| decorated messages in some way. | ||||
| DDP-decorated messages may be split into multiple, smaller DDP- | DDP messages may be split into multiple, smaller DDP messages, each | |||
| decorated messages each in a separate transport message. However, | in a separate transport message. However, if the transport is | |||
| if the transport is unreliable or unordered, DDP-decorated messages | unreliable or unordered, messages split across transport messages | |||
| split across transport messages may or may not provide useful | may or may not provide useful behavior, in the same way as | |||
| behavior, in the same way as splitting regular, undecorated | splitting arbitrary upper layer messages across unreliable or | |||
| messages across unreliable or unordered transport messages may or | unordered transport messages may or may not provide useful | |||
| may not provide useful behavior. In other words, the same | behavior. In other words, the same considerations apply to | |||
| considerations apply to building client protocols on different | building client protocols on different types of transports with or | |||
| types of transports with or without the use of DDP. | without the use of DDP. | |||
| A DDP-decorated message split across transport messages looks like: | A DDP message split across transport messages looks like: | |||
| DDP-decorated message: Transport messages: | DDP message: Transport messages: | |||
| stag=s, offset=o, message 1: | stag=s, offset=o, message 1: | |||
| notify=y, id=i |type=ddp | | notify=y, id=i |type=ddp | | |||
| message= |stag=s | | message= |stag=s | | |||
| |aabbccddee|-------. |offset=o | | |aabbccddee|-------. |offset=o | | |||
| ~ ... ~----. \ |notify=n | | ~ ... ~----. \ |notify=n | | |||
| |vvwwxxyyzz|-. \ \ |id=? | | |vvwwxxyyzz|-. \ \ |id=? | | |||
| | \ `--->|aabbccddee| | | \ `--->|aabbccddee| | |||
| | \ ~ ... ~ | | \ ~ ... ~ | |||
| | +----->|iijjkkllmm| | | +----->|iijjkkllmm| | |||
| skipping to change at page 4, line 27 | skipping to change at page 5, line 40 | |||
| + | message 2: | + | message 2: | |||
| \ | |type=ddp | | \ | |type=ddp | | |||
| \ | |stag=s | | \ | |stag=s | | |||
| \ + |offset=o+n| | \ + |offset=o+n| | |||
| \ \ |notify=y | | \ \ |notify=y | | |||
| \ \ |id=i | | \ \ |id=i | | |||
| \ `-->|nnooppqqrr| | \ `-->|nnooppqqrr| | |||
| \ ~ ... ~ | \ ~ ... ~ | |||
| `---->|vvwwxxyyzz| | `---->|vvwwxxyyzz| | |||
| Although this picture suggests that DDP decoration information is | Although this picture suggests that DDP information is carried in- | |||
| carried in-line with the message payload, components of the DDP | line with the message payload, components of the DDP information | |||
| decoration may also be in transport-specific fields, or derived | may also be in transport-specific fields, or derived from | |||
| from transport-specific control information if the transport | transport-specific control information if the transport permits. | |||
| permits. | ||||
| 2.1. Transport Operations | 2.1.1. Transport Operations | |||
| For the purposes of this architecture, the transport provides: | For the purposes of this architecture, the transport provides: | |||
| void xpt_send(socket_t s, message_t m); | void xpt_send(socket_t s, message_t m); | |||
| message_t xpt_recv(socket_t s); | message_t xpt_recv(socket_t s); | |||
| msize_t xpt_max_msize(socket_t s); | msize_t xpt_max_msize(socket_t s); | |||
| socket_t | socket_t | |||
| a transport address, including IP addresses, ports and other | a transport address, including IP addresses, ports and other | |||
| transport-specific identifiers. | transport-specific identifiers. | |||
| message_t | message_t | |||
| a string of octets. | a string of octets. | |||
| msize_t (unsigned integer) | msize_t (scalar) | |||
| a message size. | a message size. | |||
| xpt_send(socket_t s, message_t m) | xpt_send(socket_t s, message_t m) | |||
| send a transport message. | send a transport message. | |||
| xpt_recv(socket_t s) | xpt_recv(socket_t s) | |||
| receive a transport message. | receive a transport message. | |||
| xpt_max_msize(socket_t s) | xpt_max_msize(socket_t s) | |||
| get the current maximum transport message size. Corresponds, | get the current maximum transport message size. Corresponds, | |||
| roughly, to the current path Maximum Transfer Unit (PMTU), | roughly, to the current path Maximum Transfer Unit (PMTU), | |||
| adjusted by underlying protocol overheads. | adjusted by underlying protocol overheads. | |||
| Real implementations of xpt_send() and xpt_recv() typically return | Real implementations of xpt_send() and xpt_recv() typically return | |||
| error indications, but that is not relevant to this architecture. | error indications, but that is not relevant to this architecture. | |||
| 2.2. DDP Operations | 2.1.2. DDP Operations | |||
| The DDP layer provides: | The DDP layer provides: | |||
| void ddp_send(socket_t s, message_t m); | void ddp_send(socket_t s, message_t m); | |||
| void ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d, | void ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d, | |||
| ddp_notify_t n); | ddp_notify_t n); | |||
| ddp_recv_t ddp_recv(socket_t s); | ddp_recv_t ddp_recv(socket_t s); | |||
| bdesc_t ddp_register(socket_t s, buffer_t b); | bdesc_t ddp_register(socket_t s, ddp_buffer_t b); | |||
| void ddp_deregister(bhand_t bh); | void ddp_deregister(bhand_t bh); | |||
| msizes_t ddp_max_msizes(socket_t s); | msizes_t ddp_max_msizes(socket_t s); | |||
| ddp_addr_t | ddp_addr_t | |||
| the buffer address portion of a DDP-decoration: | the buffer address portion of a tagged message: | |||
| typedef struct { | typedef struct { | |||
| stag_t stag; | stag_t stag; | |||
| address_t offset; | address_t offset; | |||
| } ddp_addr_t; | } ddp_addr_t; | |||
| stag_t (unsigned integer) | stag_t (scalar) | |||
| a steering tag. A stag_t identifies the destination buffer | a Steering Tag. A stag_t identifies the destination buffer | |||
| for DDP-decorated messages. stag_ts are generated when the | for tagged messages. stag_ts are generated when the buffer is | |||
| buffer is registered, communicated to the sender by some | registered, communicated to the sender by some client protocol | |||
| client protocol convention and inserted in DDP-decorated | convention and inserted in DDP messages. stag_t values in | |||
| messages. stag_t values in this DDP architecture are assumed | this DDP architecture are assumed to be completely opaque to | |||
| to be completely opaque to the client protocol, and | the client protocol, and implementation-dependent. However, | |||
| implementation-dependent. However, particular | particular implementations, such as DDP on a multicast | |||
| implementations, such as DDP on a multicast transport (see | transport (see below), may provide the buffer holder some | |||
| below), may provide the buffer holder some control in | control in selecting stag_ts. | |||
| selecting stag_ts. | ||||
| ddp_notify_t | ddp_notify_t | |||
| the notification portion of a DDP-decoration: | the notification portion of a DDP message, used to signal that | |||
| | the message represents the final fragment of a multi-segmented | |||
| | DDP message: | |||
| typedef struct { | typedef struct { | |||
| bool notify; | boolean_t notify; | |||
| ddp_msg_id_t i; | ddp_msg_id_t i; | |||
| } ddp_notify_t; | } ddp_notify_t; | |||
| ddp_msg_id_t (unsigned integer) | ddp_msg_id_t (scalar) | |||
| a DDP-decorated message identifier. msg_id_ts are chosen by | a DDP message identifier. msg_id_ts are chosen by the DDP | |||
| the DDP-decorated message receiver (buffer holder), | message receiver (buffer holder), communicated to the sender | |||
| communicated to the sender by some client protocol convention | by some client protocol convention and inserted in DDP | |||
| and inserted in DDP-decorated messages. Whether a message | messages. Whether a message reception indication is requested | |||
| reception indication is requested for a DDP-decorated message | for a DDP message is a matter of client protocol convention. | |||
| is a matter of client protocol convention. Unlike stag_ts, | Unlike stag_ts, the structure of msg_id_ts is opaque to DDP, | |||
| the structure of msg_id_ts is opaque to DDP, and therefore, | and therefore, completely in the hands of the client protocol. | |||
| completely in the hands of the client protocol. | ||||
| bdesc_t | bdesc_t | |||
| a description of a registered buffer: | a description of a registered buffer: | |||
| typedef struct { | typedef struct { | |||
| bhand_t bh; | bhand_t bh; | |||
| ddp_addr_t a; | ddp_addr_t a; | |||
| } bdesc_t; | } bdesc_t; | |||
| `a.offset' is the starting offset of the registered buffer, | `a.offset' is the starting offset of the registered buffer, | |||
| which may have no relationship to the `start' or `end' | which may have no relationship to the `start' or `end' | |||
| addresses of that buffer. However, particular implemenations, | addresses of that buffer. However, particular | |||
| such as DDP on a multicast transport (see below), may allow | implementations, such as DDP on a multicast transport (see | |||
| some client protocol control over the starting offset. | below), may allow some client protocol control over the | |||
| | starting offset. | |||
| bhand_t | bhand_t | |||
| an opaque buffer handle used to unregister a buffer. | an opaque buffer handle used to deregister a buffer. | |||
| ddp_recv_t | ddp_recv_t | |||
| an undecorated message, a DDP-decorated message reception | ||||
| indication, or a DDP-decorated message reception error: | an untagged message, a tagged message reception indication, or | |||
| | a tagged message reception error: | |||
| typedef union { | typedef union { | |||
| message_t m; | message_t m; | |||
| ddp_msg_id_t i; | ddp_msg_id_t i; | |||
| ddp_err_t e; | ddp_err_t e; | |||
| } ddp_recv_t; | } ddp_recv_t; | |||
| ddp_err_t | ddp_err_t | |||
| indicates an error while receiving a DDP-decorated message, | indicates an error while receiving a tagged message, typically | |||
| typically `offset' out of bounds, or `stag' is not registered | `offset' out of bounds, or `stag' is not registered to the | |||
| to the socket. | socket. | |||
| msizes_t | msizes_t | |||
| The maximum undecorated and DDP-decorated messages that fit in | The maximum untagged and tagged messages that fit in a single | |||
| a single transport message: | transport message: | |||
| typedef struct { | typedef struct { | |||
| msize_t max_undec; | msize_t max_untagged; | |||
| msize_t max_dec; | msize_t max_tagged; | |||
| } msizes_t; | } msizes_t; | |||
| ddp_send(socket_t s, message_t m) | ddp_send(socket_t s, message_t m) | |||
| | send an untagged message. | |||
| send an undecorated message. | ||||
| ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d, ddp_notify_t n) | ddp_send_ddp(socket_t s, message_t m, ddp_addr_t d, ddp_notify_t n) | |||
| send a DDP-decorated message. | send a tagged message. | |||
| ddp_recv(socket_t s) | ddp_recv(socket_t s) | |||
| get the next received undecorated message, DDP-decorated | get the next received untagged message, tagged message | |||
| message reception indication, or DDP-decorated message error. | reception indication, or tagged message error. | |||
| ddp_register(socket_t s, buffer_t b) | ddp_register(socket_t s, ddp_buffer_t b) | |||
| register a buffer for DDP on a socket. The same buffer may be | register a buffer for DDP on a socket. The same buffer may be | |||
| registered multiple times on the same or different sockets. | registered multiple times on the same or different sockets. | |||
| Different buffers may also refer to portions of the same | Different buffers may also refer to portions of the same | |||
| underlying addressable object (buffer aliasing). | underlying addressable object (buffer aliasing). | |||
| ddp_deregister(bhand_t bh) | ddp_deregister(bhand_t bh) | |||
| unregister a buffer from a socket. | ||||
| | remove a registration from a buffer. | |||
| ddp_max_msizes(socket_t s) | ddp_max_msizes(socket_t s) | |||
| get the current maximum undecorated and DDP-decorated message | get the current maximum untagged and tagged message sizes that | |||
| sizes that will fit in a single transport message. | will fit in a single transport message. | |||
| 2.3. Transport Characterstics In DDP | 2.1.3. Transport Characteristics In DDP | |||
| Certain characteristics of the transport on which DDP is mapped | Certain characteristics of the transport on which DDP is mapped | |||
| determine the nature of the service provided to client protocols. | determine the nature of the service provided to client protocols. | |||
| Specifically, transports are: | Specifically, transports are: | |||
| o reliable or unreliable, | o reliable or unreliable, | |||
| o ordered or unordered, | o ordered or unordered, | |||
| o single source or multisource, | o single source or multisource, | |||
| o single destination or multidestination (multicast or anycast). | o single destination or multidestination (multicast or anycast). | |||
| Some transports support several combinations of these | Some transports support several combinations of these | |||
| characteristics. For example, SCTP is reliable, single source, | characteristics. For example, SCTP [SCTP] is reliable, single | |||
| single destination (point-to-point) and supports both ordered and | source, single destination (point-to-point) and supports both | |||
| unordered modes. | ordered and unordered modes. | |||
| In general, these transport characteristics equally affect | In general, these transport characteristics equally affect | |||
| transport and DDP-decorated message delivery. However, there are | transport and DDP message delivery. However, there are several | |||
| several issues specific to DDP-decorated messages. | issues specific to DDP messages. | |||
| A key component of DDP, is how operations on the receiving side: | A key component of DDP is how the following operations on the | |||
| | receiving side are ordered among themselves, and how they relate to | |||
| | corresponding operations on the sending side: | |||
| o set()s, | o set()s, | |||
| o undecorated messages, and | o untagged message reception indications, and | |||
| o DDP-decorated message reception indications | o tagged message reception indications. | |||
| are ordered among themselves, and how they relate to corresponding | These relationships depend upon the characteristics of the | |||
| operations on the sending side. These relationships depend upon | underlying transport in a way which is defined by the DDP protocol. | |||
| the characteristics of the underlying transport in a way which is | For example, if the transport is unreliable and unordered, the DDP | |||
| defined by the DDP protocol. For example, if the transport is | protocol might specify that the client protocol is subject to the | |||
| unreliable and unordered, the DDP protocol might specify that the | consequences of transport messages being lost or duplicated, rather | |||
| client protocol is subject to the consequences of transport | than requiring different characteristics be presented to the client | |||
| messages being lost or duplicated, rather than requiring different | protocol. | |||
| characteristics be presented to the client protocol. | ||||
| Multidestination data delivery is the other transport | Multidestination data delivery is the other transport | |||
| characteristic which may require specific consideration in a DDP | characteristic which may require specific consideration in a DDP | |||
| protocol. As mentioned above, the basic DDP model assumes that | protocol. As mentioned above, the basic DDP model assumes that | |||
| buffer address values returned by ddp_register() are opaque to the | buffer address values returned by ddp_register() are opaque to the | |||
| client protocol, and can be implementation dependent. The most | client protocol, and can be implementation dependent. The most | |||
| natural way to map DDP to a multidestination transport is to | natural way to map DDP to a multidestination transport is to | |||
| require all receivers produce the same buffer address when | require all receivers produce the same buffer address when | |||
| registering a multidestination destination buffer. Restriction of | registering a multidestination destination buffer. Restriction of | |||
| the DDP model to accomodate multiple destinations involves | the DDP model to accommodate multiple destinations involves | |||
| engineering tradeoffs comparable to those of providing non-DDP | engineering tradeoffs comparable to those of providing non-DDP | |||
| multidestination transport capability. | multidestination transport capability. | |||
| 3. Remote Direct Memory Access (RDMA) Protocol Architecture | 2.2. Remote Direct Memory Access (RDMA) Protocol Architecture | |||
| Remote Direct Memory Access (RDMA) extends the capabilities of DDP | Remote Direct Memory Access (RDMA) extends the capabilities of DDP | |||
| with the ability to read from buffers registered to a socket (RDMA | with the ability to read from buffers registered to a socket (RDMA | |||
| Read). This allows a client protocol to perform arbitrary, | Read). This allows a client protocol to perform arbitrary, | |||
| bidirectional data movement without involving the remote client | bidirectional data movement without involving the remote client. | |||
| protocol. When RDMA is implemented in the NI, arbitrary data | When RDMA is implemented in hardware, arbitrary data movement can | |||
| movement can be performed without involving the remote host CPU at | be performed without involving the remote host CPU at all. | |||
| all. | ||||
| In addition, RDMA protocols usually specify a transport-independent | In addition, RDMA protocols usually specify a transport-independent | |||
| undecorated message service (Send) with characteristics which are | untagged message service (Send) with characteristics which are both | |||
| both very efficient to implement in an NI, and convenient for | very efficient to implement in hardware, and convenient for client | |||
| client protocols. | protocols. | |||
| The RDMA architecture is patterned after the traditional model for | The RDMA architecture is patterned after the traditional model for | |||
| device programming, where the client requests an operation using | device programming, where the client requests an operation using | |||
| Send-like actions (programmed I/O), the server performs the | Send-like actions (programmed I/O), the server performs the | |||
| necessary data transfers for the operation (DMA reads and writes), | necessary data transfers for the operation (DMA reads and writes), | |||
| and notifies the client of completion. The programmed I/O+DMA | and notifies the client of completion. The programmed I/O+DMA | |||
| model efficiently supports a high degree of concurrency and | model efficiently supports a high degree of concurrency and | |||
| flexibility for both the client and server, even when operations | flexibility for both the client and server, even when operations | |||
| have a wide range of intrinsic latencies. | have a wide range of intrinsic latencies. | |||
| RDMA is implemented as a client protocol on top of DDP: | RDMA is layered as a client protocol on top of DDP: | |||
| Client Protocol | Client Protocol | |||
| | ^ | | ^ | |||
| Sends | | Sends | Sends | | Send reception indications | |||
| RDMA Read Requests | | RDMA Read Completion indications | RDMA Read Requests | | RDMA Read Completion indications | |||
| RDMA Writes v | RDMA Write Completion indications | RDMA Writes | | RDMA Write Completion indications | |||
| v | | ||||
| RDMA | RDMA | |||
| | ^ | | ^ | |||
| undecorated messages | | undecorated messages | untagged messages | | untagged message delivery | |||
| DDP-decorated messages | | DDP-decorated message reception | tagged messages | | tagged message delivery | |||
| v | indications | v | | |||
| DDP | DDP+---> data placement | |||
| ^ | ^ | |||
| | transport messages | | transport messages | |||
| v | v | |||
| . . . | . . . | |||
| In addition to in-line data flow, read (get()) and update (set()) | In addition to in-line data flow, read (get()) and update (set()) | |||
| operations are performed on buffers registered with RDMA as a | operations are performed on buffers registered with RDMA as a | |||
| result of RDMA Read Requests and RDMA Writes, respectively. | result of RDMA Read Requests and RDMA Writes, respectively. | |||
| An RDMA `buffer' extends a DDP buffer with a get() operation that | An RDMA `buffer' extends a DDP buffer with a get() operation that | |||
| retrieves the value of the octet at address `a': | retrieves the value of the octet at address `a': | |||
| typedef struct { | typedef struct { | |||
| const address_t start; | const address_t start; | |||
| const address_t end; | const address_t end; | |||
| void set(address_t a, uint8_t v); | void set(address_t a, data_t v); | |||
| uint8_t get(address_t a); | data_t get(address_t a); | |||
| } buffer_t; | } rdma_buffer_t; | |||
| 3.1. RDMA Operations | 2.2.1. RDMA Operations | |||
| The RDMA layer provides: | The RDMA layer provides: | |||
| void rdma_send(socket_t s, message_t m); | void rdma_send(socket_t s, message_t m); | |||
| void rdma_write(socket_t s, message_t m, ddp_addr_t d, | void rdma_write(socket_t s, message_t m, ddp_addr_t d, | |||
| rdma_notify_t n); | rdma_notify_t n); | |||
| void rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d); | void rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d); | |||
| rdma_recv_t rdma_recv(socket_t s); | rdma_recv_t rdma_recv(socket_t s); | |||
| bdesc_t rdma_register(socket_t s, buffer_t b, bmode_t mode); | bdesc_t rdma_register(socket_t s, rdma_buffer_t b, | |||
| | bmode_t mode); | |||
| void rdma_deregister(bhand_t bh); | void rdma_deregister(bhand_t bh); | |||
| msizes_t rdma_max_msizes(socket_t s); | msizes_t rdma_max_msizes(socket_t s); | |||
| Although, for clarity, these data transfer interfaces are | Although, for clarity, these data transfer interfaces are | |||
| synchronous, rdma_read() and possibly rdma_send() (in the presence | synchronous, rdma_read() and possibly rdma_send() (in the presence | |||
| of Send flow control), can require an arbitrary amount of time to | of Send flow control), can require an arbitrary amount of time to | |||
| complete. To express the full concurrency and interleaving of RDMA | complete. To express the full concurrency and interleaving of RDMA | |||
| data transfer, these interfaces are also defined to be | data transfer, these interfaces are also defined to be | |||
| multithreaded. For example, a client protocol may perform an | multithreaded. For example, a client protocol may perform an | |||
| rdma_send(), while an rdma_read() operation is in progress. | rdma_send(), while an rdma_read() operation is in progress. | |||
| rdma_notify_t | rdma_notify_t | |||
| RDMA Write notification information: | RDMA Write notification information, used to signal that the | |||
| | message represents the final fragment of a multi-segmented | |||
| | RDMA message: | |||
| typedef struct { | typedef struct { | |||
| bool notify; | boolean_t notify; | |||
| rdma_write_id_t i; | rdma_write_id_t i; | |||
| } rdma_notify_t; | } rdma_notify_t; | |||
| identical in function to ddp_notify_t, except that the type | identical in function to ddp_notify_t, except that the type | |||
| rdma_write_id_t may not be equivalent to ddp_msg_id_t. | rdma_write_id_t may not be equivalent to ddp_msg_id_t. | |||
| rdma_write_id_t (unsigned integer) | rdma_write_id_t (scalar) | |||
| an RDMA Write identifier. | an RDMA Write identifier. | |||
| rdma_recv_t | rdma_recv_t | |||
| a Send message, an RDMA Write completion identifier, or an | a Send message, an RDMA Write completion identifier, or an | |||
| RDMA error: | RDMA error: | |||
| typedef union { | typedef union { | |||
| message_t m; | message_t m; | |||
| skipping to change at page 12, line 4 ¶ | skipping to change at page 13, line 19 ¶ | |||
| protection violations (e.g. RDMA Writing a buffer only | protection violations (e.g. RDMA Writing a buffer only | |||
| registered for reading). | registered for reading). | |||
| bmode_t | bmode_t | |||
| buffer registration mode (permissions). Any combination of | buffer registration mode (permissions). Any combination of | |||
| permitting RDMA Read (BMODE_READ) and RDMA Write (BMODE_WRITE) | permitting RDMA Read (BMODE_READ) and RDMA Write (BMODE_WRITE) | |||
| operations. | operations. | |||
| rdma_send(socket_t s, message_t m) | rdma_send(socket_t s, message_t m) | |||
| Send a message. | | |||
| | send a message, delivering it to the next untagged RDMA buffer | |||
| | at the remote peer. | |||
| rdma_write(socket_t s, message_t m, ddp_addr_t d, rdma_notify_t n) | rdma_write(socket_t s, message_t m, ddp_addr_t d, rdma_notify_t n) | |||
| RDMA Write to remote buffer address d. | RDMA Write to remote buffer address d. | |||
| rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d) | rdma_read(socket_t s, ddp_addr_t s, ddp_addr_t d) | |||
| RDMA Read from remote buffer address s to local buffer address | RDMA Read from remote buffer address s to local buffer address | |||
| d. | d. | |||
| rdma_recv(socket_t s); | rdma_recv(socket_t s); | |||
| get the next received Send message, RDMA Write completion | get the next received Send message, RDMA Write completion | |||
| identifier, or RDMA error. | identifier, or RDMA error. | |||
| rdma_register(socket_t s, buffer_t b, bmode_t mode) | rdma_register(socket_t s, rdma_buffer_t b, bmode_t mode) | |||
| register a buffer for RDMA on a socket (for read access, write | register a buffer for RDMA on a socket (for read access, write | |||
| access or both). As with DDP, the same buffer may be | access or both). As with DDP, the same buffer may be | |||
| registered multiple times on the same or different sockets, | registered multiple times on the same or different sockets, | |||
| and different buffers may refer to portions of the same | and different buffers may refer to portions of the same | |||
| underlying addressable object. | underlying addressable object. | |||
| rdma_deregister(bhand_t bh) | rdma_deregister(bhand_t bh) | |||
| unregister a buffer from a socket. | remove a registration from a buffer. | |||
| rdma_max_msizes(socket_t s) | rdma_max_msizes(socket_t s) | |||
| get the current maximum Send (max_undec) and RDMA Read or | get the current maximum Send (max_untagged) and RDMA Read or | |||
| Write (max_dec) operations that will fit in a single transport | Write (max_tagged) operations that will fit in a single | |||
| message. The values returned by rdma_max_msizes() are closely | transport message. The values returned by rdma_max_msizes() | |||
| related to the values returned by ddp_max_msizes(), but may | are closely related to the values returned by | |||
| not be equal. | ddp_max_msizes(), but may not be equal. | |||
| 3.2. Transport Characterstics In RDMA | 2.2.2. Transport Characteristics In RDMA | |||
| As with DDP, RDMA can be used on transports with a variety of | As with DDP, RDMA can be used on transports with a variety of | |||
| different characteristics that manifest themselves directly in the | different characteristics that manifest themselves directly in the | |||
| service provided by RDMA. | service provided by RDMA. | |||
| Like DDP, an RDMA protocol must specify how: | Like DDP, an RDMA protocol must specify how: | |||
| o set()s, | o set()s, | |||
| o get()s, | o get()s, | |||
| o Send messages, and | ||||
| o RDMA Read completions | o Send messages, and | |||
| o RDMA Read completions | ||||
| are ordered among themselves and how they relate to corresponding | are ordered among themselves and how they relate to corresponding | |||
| operations on the remote peer(s). These relationships are likely | operations on the remote peer(s). These relationships are likely | |||
| to be a function of the underlying transport characteristics. | to be a function of the underlying transport characteristics. | |||
| There are some additional characteristics of RDMA which may | There are some additional characteristics of RDMA which may | |||
| translate poorly to unreliable or multipoint transports due to | translate poorly to unreliable or multipoint transports due to | |||
| attendent complexities in managing endpoint state: | attendant complexities in managing endpoint state: | |||
| o Send flow control | o Send flow control | |||
| o RDMA Read | o RDMA Read | |||
| These difficulties can be overcome by placing restrictions on the | These difficulties can be overcome by placing restrictions on the | |||
| service provided by RDMA. However, many RDMA clients, especially | service provided by RDMA. However, many RDMA clients, especially | |||
| those that separate data transfer and application logic concerns, | those that separate data transfer and application logic concerns, | |||
| are likely to depend upon capabilities only provided by RDMA on a | are likely to depend upon capabilities only provided by RDMA on a | |||
| point-to-point, reliable transport. | point-to-point, reliable transport. | |||
| 4. Security Considerations | 3. Security Considerations | |||
| Security considerations are not addressed in this document. Any | System integrity must be maintained in any RDMA solution. | |||
| security considerations resulting from the use of DDP or RDMA must | Mechanisms must be specified to prevent RDMA or DDP operations from | |||
| be addressed in the relevant standards. | impairing system integrity. For example, the threat caused by | |||
| | potential buffer overflow needs full examination, and prevention | |||
| | mechanisms must be spelled out. | |||
| 5. IANA Considerations | Because a Steering Tag exports access to a memory region, one | |||
| | critical aspect of security is the scope of this access. It must | |||
| | be possible to individually control specific attributes of the | |||
| | access provided by a Steering Tag, including remote read access, | |||
| | remote write access, and others that might be identified. A | |||
| | specification must provide both implementation requirements | |||
| | relevant to this issue, and guidelines to assist implementors in | |||
| | making the appropriate design decisions. | |||
| | A number of other potential attacks have been envisioned and must | |||
| | be addressed. Some such examples are outlined in [RDMACON]. | |||
| | Resource issues leading to denial-of-service attacks, overwrites | |||
| | and other concurrent operations, the ordering of completions as | |||
| | required by the RDMA protocol, and the granularity of transfer are | |||
| | all within the required scope of any security analysis of RDMA and | |||
| | DDP. | |||
| | 4. IANA Considerations | |||
| IANA considerations are not addressed by this document. Any | IANA considerations are not addressed by this document. Any | |||
| IANA considerations resulting from the use of DDP or DMA must be | IANA considerations resulting from the use of DDP or RDMA must be | |||
| addressed in the relevant standards. | addressed in the relevant standards. | |||
| Author's Address | 5. Acknowledgements | |||
| | The authors wish to acknowledge the valuable contributions of David | |||
| | Black, Jeff Mogul and Allyn Romanow. | |||
| | 6. References | |||
| | [DAFS] | |||
| | Direct Access File System http://www.dafscollaborative.org | |||
| | http://www.ietf.org/internet-drafts/draft-wittle-dafs-00.txt | |||
| | [FIBRE] | |||
| | Fibre Channel Standard | |||
| | http://www.fibrechannel.com/technology/index.master.html | |||
| | [IB] InfiniBand Architecture Specification, Volumes 1 and 2, | |||
| | Release 1.0.a. http://www.infinibandta.org | |||
| | [MYR] | |||
| | Myrinet, http://www.myricom.com | |||
| | [RDDP] | |||
| | Remote Direct Data Placement Working Group charter, | |||
| | http://www.ietf.org/html.charters/rddp-charter.html | |||
| | [RDMACON] | |||
| | D. Black, M. Speer, J. Wroclawski, "DDP and RDMA Concerns", | |||
| | http://www.ietf.org/internet-drafts/draft-black-rdma- | |||
| | concerns-00.txt, Work in Progress, June 2002 | |||
| | [ROM] | |||
| | A. Romanow, J. Mogul, T. Talpey, S. Bailey, "RDMA over IP | |||
| | Problem Statement", http://www.ietf.org/internet-drafts/draft- | |||
| | romanow-rdma-over-ip-problem-statement-01.txt, Work in | |||
| | Progress, November 2002 | |||
| | [SCTP] | |||
| | R. Stewart et al., "Stream Control Transmission Protocol", | |||
| | Standards Track RFC, http://www.ietf.org/rfc/rfc2960 | |||
| | [SDP] | |||
| | Sockets Direct Protocol v1.0 | |||
| | [SRVNET] | |||
| | Compaq Servernet, | |||
| | http://nonstop.compaq.com/view.asp?PAGE=ServerNet | |||
| | [VI] Virtual Interface Architecture Specification Version 1.0. | |||
| | http://www.viarch.org/html/collateral/san_10.pdf | |||
| | Authors' Addresses | |||
| Stephen Bailey | Stephen Bailey | |||
| Sandburst Corporation | Sandburst Corporation | |||
| 600 Federal Street | 600 Federal Street | |||
| Andover, MA 01810 | Andover, MA 01810 USA | |||
| USA | USA | |||
| | Phone: +1 978 689 1614 | |||
| Email: steph@sandburst.com | Email: steph@sandburst.com | |||
| | Tom Talpey | |||
| | Network Appliance | |||
| | 375 Totten Pond Road | |||
| | Waltham, MA 02451 USA | |||
| | Phone: +1 781 768 5329 | |||
| | Email: thomas.talpey@netapp.com | |||
| Full Copyright Statement | Full Copyright Statement | |||
| Copyright (C) The Internet Society (2001). All Rights Reserved. | Copyright (C) The Internet Society (2002). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain | others, and derivative works that comment on or otherwise explain | |||
| it or assist in its implementation may be prepared, copied, | it or assist in its implementation may be prepared, copied, | |||
| published and distributed, in whole or in part, without restriction | published and distributed, in whole or in part, without restriction | |||
| of any kind, provided that the above copyright notice and this | of any kind, provided that the above copyright notice and this | |||
| paragraph are included on all such copies and derivative works. | paragraph are included on all such copies and derivative works. | |||
| However, this document itself may not be modified in any way, such | However, this document itself may not be modified in any way, such | |||
| as by removing the copyright notice or references to the Internet | as by removing the copyright notice or references to the Internet | |||
| Society or other Internet organizations, except as needed for the | Society or other Internet organizations, except as needed for the | |||
| End of changes. 81 change blocks. | ||||
| 176 lines changed or deleted | 327 lines changed or added | |||