< draft-ietf-storm-mpa-peer-connect-08.txt   draft-ietf-storm-mpa-peer-connect-09.txt >
STORM A. Kanevsky, Ed. STORM A. Kanevsky, Ed.
Internet-Draft Dell Inc. Internet-Draft Dell Inc.
Updates: 5043, 5044 (if approved) C. Bestler, Ed. Updates: 5043, 5044 (if approved) C. Bestler, Ed.
Intended status: Standards Track Nexenta Systems Intended status: Standards Track Nexenta Systems
Expires: April 24, 2012 R. Sharp Expires: June 17, 2012 R. Sharp
Intel Intel
S. Wise S. Wise
Open Grid Computing Open Grid Computing
October 22, 2011 December 15, 2011
Enhanced RDMA Connection Establishment Enhanced RDMA Connection Establishment
draft-ietf-storm-mpa-peer-connect-08 draft-ietf-storm-mpa-peer-connect-09
Abstract Abstract
This document updates RFC 5043 and RFC 5044 by extending Marker This document updates RFC 5043 and RFC 5044 by extending Marker
Protocol Data Unit (PDU) Aligned Framing (MPA) negotiation for Remote Protocol Data Unit (PDU) Aligned Framing (MPA) negotiation for Remote
Direct Memory Access (RDMA) connection establishment. The first Direct Memory Access (RDMA) connection establishment. The first
enhancement extends RFC 5044, enabling peer-to-peer connection enhancement extends RFC 5044, enabling peer-to-peer connection
establishment over MPA/ Transmission Control Protocol (TCP). The establishment over MPA/ Transmission Control Protocol (TCP). The
second enhancement extends both RFC 5043 and RFC 5044, by providing second enhancement extends both RFC 5043 and RFC 5044, by providing
an option for standardized exchange of RDMA-layer connection an option for standardized exchange of RDMA-layer connection
skipping to change at page 1, line 42 skipping to change at page 1, line 42
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 24, 2012. This Internet-Draft will expire on June 17, 2012.
Copyright Notice Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 20 skipping to change at page 2, line 20
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Summary of changes affecting RFC 5044 . . . . . . . . . . 4 1.1. Summary of changes affecting RFC 5044 . . . . . . . . . . 4
1.2. Summary of changes affecting RFC 5043 . . . . . . . . . . 4 1.2. Summary of changes affecting RFC 5043 . . . . . . . . . . 4
2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4
3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4
4. Motivations . . . . . . . . . . . . . . . . . . . . . . . . . 6 4. Motivations . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1. Standardization of RDMA Read Parameter Configuration . . . 7 4.1. Standardization of RDMA Read Parameter Configuration . . . 7
4.2. Enabling MPA Mode . . . . . . . . . . . . . . . . . . . . 8 4.2. Enabling MPA Mode . . . . . . . . . . . . . . . . . . . . 9
4.3. Lack of Explicit RTR in MPA Request/Reply Exchange . . . . 9 4.3. Lack of Explicit RTR in MPA Request/Reply Exchange . . . . 9
4.4. Limitations on ULP Workaround . . . . . . . . . . . . . . 10 4.4. Limitations on ULP Workaround . . . . . . . . . . . . . . 10
4.4.1. Transport Neutral APIs . . . . . . . . . . . . . . . . 11 4.4.1. Transport Neutral APIs . . . . . . . . . . . . . . . . 11
4.4.2. Work/Completion Queue Accounting . . . . . . . . . . . 11 4.4.2. Work/Completion Queue Accounting . . . . . . . . . . . 11
4.4.3. Host-based Implementation of MPA Fencing . . . . . . . 12 4.4.3. Host-based Implementation of MPA Fencing . . . . . . . 12
5. Enhanced MPA Connection Establishment . . . . . . . . . . . . 12 5. Enhanced MPA Connection Establishment . . . . . . . . . . . . 12
6. Enhanced MPA Request/Reply Frames . . . . . . . . . . . . . . 13 6. Enhanced MPA Request/Reply Frames . . . . . . . . . . . . . . 13
7. Enhanced SCTP Session Control Chunks . . . . . . . . . . . . . 14 7. Enhanced SCTP Session Control Chunks . . . . . . . . . . . . . 14
8. MPA Error Reporting . . . . . . . . . . . . . . . . . . . . . 16 8. MPA Error Reporting . . . . . . . . . . . . . . . . . . . . . 16
9. Enhanced RDMA Connection Establishment Data . . . . . . . . . 16 9. Enhanced RDMA Connection Establishment Data . . . . . . . . . 16
9.1. IRD and ORD Negotiation . . . . . . . . . . . . . . . . . 17 9.1. IRD and ORD Negotiation . . . . . . . . . . . . . . . . . 17
9.2. Peer-to-Peer Connection Negotiation . . . . . . . . . . . 19 9.2. Peer-to-Peer Connection Negotiation . . . . . . . . . . . 19
9.3. Enhanced Connection Negotiation Flow . . . . . . . . . . . 20 9.3. Enhanced Connection Negotiation Flow . . . . . . . . . . . 20
10. Interoperability . . . . . . . . . . . . . . . . . . . . . . . 20 10. Interoperability . . . . . . . . . . . . . . . . . . . . . . . 21
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22
12. Security Considerations . . . . . . . . . . . . . . . . . . . 22 12. Security Considerations . . . . . . . . . . . . . . . . . . . 22
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22
14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22
14.1. Normative References . . . . . . . . . . . . . . . . . . . 22 14.1. Normative References . . . . . . . . . . . . . . . . . . . 22
14.2. Informative References . . . . . . . . . . . . . . . . . . 23 14.2. Informative References . . . . . . . . . . . . . . . . . . 23
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 23 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 24
1. Introduction 1. Introduction
When used over Transmission Control Protocol (TCP), the current When used over Transmission Control Protocol (TCP), the current
Remote Direct Data Placement (RDDP) [RFC5041] suite of protocols Remote Direct Data Placement (RDDP) [RFC5041] suite of protocols
relies on the MPA [RFC5044] protocol for both connection establishment relies on the MPA [RFC5044] protocol for both connection establishment
and for markers for TCP layering. and for markers for TCP layering.
A typical model for establishing an RDMA connection has the following A typical model for establishing an RDMA connection has the following
steps: steps:
skipping to change at page 4, line 9 skipping to change at page 4, line 9
negotiation of some of Remote Direct Memory Access Protocol (RDMAP) negotiation of some of Remote Direct Memory Access Protocol (RDMAP)
[RFC5040] specific parameters are left to ULP negotiation. Providing [RFC5040] specific parameters are left to ULP negotiation. Providing
an optional ULP-independent format for exchanging these parameters an optional ULP-independent format for exchanging these parameters
would be of benefit to transport neutral Remote Direct Memory Access would be of benefit to transport neutral Remote Direct Memory Access
(RDMA) applications. (RDMA) applications.
1.1. Summary of changes affecting RFC 5044 1.1. Summary of changes affecting RFC 5044
This draft enhances the [RFC5044] MPA connection setup protocol. First, This draft enhances the [RFC5044] MPA connection setup protocol. First,
it adds exchange and negotiation of the parameters necessary to it adds exchange and negotiation of the parameters necessary to
support RDMA Read Requests. Second, it adds a Ready to Receive (RTR) support RDMA Read Requests. Second, it adds a message that serves as
message from the initiator to the responder as the last message of a Ready to Receive (RTR) indication from the initiator to the
connection establishment and adds negotiation of an RTR message type responder as the last message of connection establishment and adds
into MPA request/reply frames. negotiation of which type of message to use to carry the RTR
indication into MPA request/reply frames.
1.2. Summary of changes affecting RFC 5043 1.2. Summary of changes affecting RFC 5043
This draft enhances [RFC5043] by adding new Enhanced Session Control This draft enhances [RFC5043] by adding new Enhanced Session Control
Chunks that extend the currently defined Chunks with the addition of Chunks that extend the currently defined Chunks with the addition of
Inbound RDMA Read Queue Depth (IRD) and Outbound RDMA Read Queue Inbound RDMA Read Queue Depth (IRD) and Outbound RDMA Read Queue
Depth (ORD) negotiation. Depth (ORD) negotiation.
2. Requirements Language 2. Requirements Language
skipping to change at page 6, line 9 skipping to change at page 6, line 14
Remote Peer: The MPA protocol implementation on the opposite end of Remote Peer: The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two describing protocol exchanges or other interactions between two
Nodes. See [RFC5044]. Nodes. See [RFC5044].
Responder: The connection endpoint that responds to an incoming MPA Responder: The connection endpoint that responds to an incoming MPA
connection request (the MPA Request Frame). Responder is the connection request (the MPA Request Frame). Responder is the
passive side of the connection establishment. See [RFC5044]. passive side of the connection establishment. See [RFC5044].
Ready to Receive (RTR): RTR is the last connection establishment Ready to Receive (RTR): RTR is an indication provided by the last
message sent from the initiator to the responder indicating that connection establishment message sent from the initiator to the
the initiator is ready to receive messages and that connection responder. An RTR indicates that the initiator is ready to
establishment is completed. See [IBTA]. receive messages and that connection establishment is completed.
Startup Phase: The initial exchanges of an MPA connection that Startup Phase: The initial exchanges of an MPA connection that
serves to more fully identify MPA endpoints to each other and pass serves to more fully identify MPA endpoints to each other and pass
connection specific setup information to each other. See connection specific setup information to each other. See
[RFC5044]. [RFC5044].
Shared Receive Queue (SRQ): A shared pool of Receive Work Requests Shared Receive Queue (SRQ): A shared pool of Receive Work Requests
posted by the Consumer that can be allocated by multiple RDMA posted by the Consumer that can be allocated by multiple RDMA
endpoints (Queue Pair). See [RDMAC]. endpoints (Queue Pair). See [RDMAC].
skipping to change at page 10, line 6 skipping to change at page 10, line 15
o Before the first MPA frame is transmitted, all pre-MPA mode TCP o Before the first MPA frame is transmitted, all pre-MPA mode TCP
payload will have been acknowledged by the peer. Therefore it is payload will have been acknowledged by the peer. Therefore it is
never necessary to generate a retransmission that mixes pre-MPA never necessary to generate a retransmission that mixes pre-MPA
and MPA payload. and MPA payload.
o Before MPA reception is enabled, all incoming pre-MPA mode TCP o Before MPA reception is enabled, all incoming pre-MPA mode TCP
payload will have been acknowledged. Therefore the host will payload will have been acknowledged. Therefore the host will
never receive a TCP segment that mixes pre-MPA and MPA payload. never receive a TCP segment that mixes pre-MPA and MPA payload.
The limitation of the current MPA Request/Reply exchange is that it The limitation of the current MPA Request/Reply exchange is that it
does not define a Ready to Receive (RTR) message that the active side does not define a Ready to Receive (RTR) indication that the active
would send, so that the passive side can know that the last non-MPA side would send, so that the passive side can know that the last non-
payload (the MPA Reply) had been received. MPA payload (the MPA Reply) had been received.
Instead, the role of an RTR message is piggy-backed on the first MPA Instead, the role of an RTR indication is piggy-backed on the first
FULPDU sent by the active side. This is actually a valuable MPA FULPDU sent by the active side. This is actually a valuable
optimization for all applications that fit the classic client/server optimization for all applications that fit the classic client/server
model. The client only initiates the connection when it has a model. The client only initiates the connection when it has a
request to send to the server, and the server has nothing to send request to send to the server, and the server has nothing to send
until it has received and processed the client request. until it has received and processed the client request.
Even applications where the server sends some configuration data Even applications where the server sends some configuration data
immediately can easily send the same information as application immediately can easily send the same information as application
private data in the MPA Reply. So the currently defined exchange private data in the MPA Reply. So the currently defined exchange
works for almost all applications. works for almost all applications.
skipping to change at page 10, line 34 skipping to change at page 10, line 43
[UsingMPI], or [RDS]), have no natural client or server roles [UsingMPI], or [RDS]), have no natural client or server roles
([PPMPI], [OpenMP]). Typically one member of the cluster is ([PPMPI], [OpenMP]). Typically one member of the cluster is
arbitrarily selected to initiate the connection when the distributed arbitrarily selected to initiate the connection when the distributed
task is launched, while the other accepts it. At startup time, task is launched, while the other accepts it. At startup time,
however, there is no way to predict which node will have the first however, there is no way to predict which node will have the first
message to actually send. Establishing the connections immediately, message to actually send. Establishing the connections immediately,
however, is valuable because it reduces latency once results are however, is valuable because it reduces latency once results are
ready to transmit and it validates connectivity throughout the ready to transmit and it validates connectivity throughout the
cluster. cluster.
The lack of an explicit RTR message in the MPA Request/Reply exchange The lack of an explicit RTR indication in the MPA Request/Reply
forces all applications to have a first message from the connection exchange forces all applications to have a first message from the
initiator, whether this matches the application communication model connection initiator, whether this matches the application
or not. communication model or not.
4.4. Limitations on ULP Workaround 4.4. Limitations on ULP Workaround
The requirement that the RDMA connection initiator sends the first The requirement that the RDMA connection initiator sends the first
message does not appear to be onerous on first examination. The message does not appear to be onerous on first examination. The
natural question is why the application layer would not simply natural question is why the application layer would not simply
generate a dummy message when there was no other message to submit. generate a dummy message when there was no other message to submit.
There are three factors that make this workaround unsuitable for many There are three factors that make this workaround unsuitable for many
peer-to-peer applications. peer-to-peer applications.
skipping to change at page 12, line 41 skipping to change at page 12, line 49
is to allow standard negotiation of ORD/IRD setting on both sides of is to allow standard negotiation of ORD/IRD setting on both sides of
the RDMA connection and/or to negotiate the initial data transfer the RDMA connection and/or to negotiate the initial data transfer
operation by the initiator when the existing 'client sends first' operation by the initiator when the existing 'client sends first'
rule does not match application requirements. rule does not match application requirements.
The RDMA connection initiator sends an MPA Request, as specified in The RDMA connection initiator sends an MPA Request, as specified in
[RFC5044]; the new format defined here allows for: [RFC5044]; the new format defined here allows for:
o Standardized negotiation of ORD and IRD. o Standardized negotiation of ORD and IRD.
o Negotiation of an RTR message. o Negotiation of RTR functionality and the RDMA message type to use
as the RTR indication.
The RDMA connection responder processes the MPA Request and generates The RDMA connection responder processes the MPA Request and generates
an MPA Reply, as specified in [RFC5044]; the new format completes the an MPA Reply, as specified in [RFC5044]; the new format completes the
negotiation. negotiation.
The local interface needs to provide a way for a ULP to request the The local interface needs to provide a way for a ULP to request the
use of explicit RTR messages on a per-application or per-connection basis use of an explicit RTR indication on a per-application or per-connection
when an explicit RTR message will be required. Piggy-backing the RTR basis when an explicit RTR indication will be required. Piggy-
on a Client's first message is a valuable optimization for most backing the RTR on a Client's first message is a valuable
connections. optimization for most connections.
The RDMA connection initiator MUST NOT allow any later FULPDUs to be The RDMA connection initiator MUST NOT allow any later FULPDUs to be
transmitted before the RTR message. One method to achieve that is to transmitted before the RTR indication. One method to achieve that is
delay notifying the ULP that the RDMA connection has been established to delay notifying the ULP that the RDMA connection has been
until after any required RTR Message has been transmitted. established until after any required RTR indication has been
transmitted.
All MPA exchanges are performed via TCP prior to RDMA establishment, All MPA exchanges are performed via TCP prior to RDMA establishment,
and are therefore signaled via TCP and not via RDMA completion. and are therefore signaled via TCP and not via RDMA completion.
6. Enhanced MPA Request/Reply Frames 6. Enhanced MPA Request/Reply Frames
Enhanced RDMA connection establishment uses an alternate format for Enhanced RDMA connection establishment uses an alternate format for
MPA Requests and Replies, as follows: MPA Requests and Replies, as follows:
0 1 2 3 0 1 2 3
skipping to change at page 14, line 27 skipping to change at page 14, line 38
higher. If no enhanced connection establishment features are higher. If no enhanced connection establishment features are
desired it MAY be set to one. A host accepting MPA connections desired it MAY be set to one. A host accepting MPA connections
MUST continue to accept MPA Requests with version one even if it MUST continue to accept MPA Requests with version one even if it
supports version two. supports version two.
PD_Length: Unchanged from [RFC5044]. This is the total length of PD_Length: Unchanged from [RFC5044]. This is the total length of
the Private Data field, including the enhanced RDMA connection the Private Data field, including the enhanced RDMA connection
establishment data if present. establishment data if present.
Private Data: Unchanged from [RFC5044]. However, if the 'S' flag is Private Data: Unchanged from [RFC5044]. However, if the 'S' flag is
set, Private Data begins with enhanced RDMA connection set, Private Data MUST begin with enhanced RDMA connection
establishment data. establishment data (see Section 9).
7. Enhanced SCTP Session Control Chunks 7. Enhanced SCTP Session Control Chunks
Enhanced RDMA Connection Establishment uses the first 32 bits of the Enhanced RDMA Connection Establishment uses the first 32 bits of the
Private data field for IRD and ORD negotiation in the "DDP Stream Private data field for IRD and ORD negotiation in the "DDP Stream
Session Initiate" and "DDP Stream Session Accept" SCTP Session Session Initiate" and "DDP Stream Session Accept" SCTP Session
Control Chunks. Control Chunks.
The type of the SCTP Session Control Chunk is defined by a Function The type of the SCTP Session Control Chunk is defined by a Function
Code (see [RFC4960]). [RFC5043] already defines codes for 'DDP Code (see [RFC4960]). [RFC5043] already defines codes for 'DDP
Stream Session Initiate' and 'DDP Stream Session Accept', which are Stream Session Initiate' and 'DDP Stream Session Accept', which are
equivalent to an MPA Request Frame and an accepting MPA Reply Frame. equivalent to an MPA Request Frame and an accepting MPA Reply Frame.
Enhanced RDMA connection establishment requires three additional Enhanced RDMA connection establishment requires three additional
Function codes listed below: Function codes listed below:
Enhanced DDP Stream Session Initiate: 0x005 Enhanced DDP Stream Session Initiate: 0x005
Enhanced DDP Stream Session Accept: 0x006 Enhanced DDP Stream Session Accept: 0x006
Enhanced DDP Stream Session Reject: 0x007 Enhanced DDP Stream Session Reject: 0x007
The Enhanced Reject function code MUST be used to indicate rejection The Enhanced Reject function code MUST be used to indicate rejection
of enhanced DDP stream session for a configuration that would have of enhanced DDP stream session for a configuration that would have
been accepted for unenhanced DDP Stream Session negotiation. been accepted for unenhanced DDP Stream Session negotiation.
The Enhanced DDP stream session establishment follows the same rules The Enhanced DDP stream session establishment follows the same rules
as the standard DDP stream session establishment as defined in as the standard DDP stream session establishment as defined in
[RFC5043]. ULP-supplied Private Data MUST be included for Enhanced [RFC5043]. ULP-supplied Private Data MUST be included for Enhanced
DDP Stream Session Initiate, Enhanced DDP Stream Session Accept, and DDP Stream Session Initiate, Enhanced DDP Stream Session Accept, and
Enhanced DDP Stream Session Reject messages. Enhanced DDP Stream Session Reject messages, and MUST follow the
enhanced RDMA connection establishment data in the DDP Stream Session
Initiate and the Enhanced DDP Stream Session Accept messages.
Private Data length MUST NOT exceed 512 bytes in any message, Private Data length MUST NOT exceed 512 bytes in any message,
including enhanced RDMA connection establishment data. including enhanced RDMA connection establishment data.
Private Data MUST NOT be included in the DDP Stream Session TERM Private Data MUST NOT be included in the DDP Stream Session TERM
message. message.
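The Private Data rules above can be illustrated with a minimal sketch. It assumes the -09 ordering, in which the 32 bits of enhanced RDMA connection establishment data come first and the ULP-supplied Private Data follows, and it applies the 512-byte limit to the combined field. The function and constant names are hypothetical and are not part of either draft.

```python
MAX_PRIVATE_DATA = 512       # limit from the text, enhanced data included
ENHANCED_DATA_LEN = 4        # the first 32 bits carry IRD/ORD negotiation

# Function codes listed in this section.
ENHANCED_INITIATE = 0x005
ENHANCED_ACCEPT = 0x006
ENHANCED_REJECT = 0x007


def build_private_data(enhanced_data: bytes, ulp_private_data: bytes) -> bytes:
    """Build the Private Data field for an Enhanced Initiate or Accept.

    Hypothetical helper: the enhanced data leads, the ULP data follows.
    """
    if len(enhanced_data) != ENHANCED_DATA_LEN:
        raise ValueError("enhanced data must be exactly 32 bits")
    private_data = enhanced_data + ulp_private_data
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 bytes")
    return private_data
```

With this layout a ULP has at most 508 bytes of its own Private Data available in an Enhanced Initiate or Accept message.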
Received Extended DDP Stream Session Control messages SHOULD be Received Extended DDP Stream Session Control messages SHOULD be
reported to the ULP. If reported, any supplied Private Data MUST be reported to the ULP. If reported, any supplied Private Data MUST be
available for the ULP to examine. For example, a received Extended available for the ULP to examine. For example, a received Extended
DDP Stream Session Control message is not reported to the ULP if none of DDP Stream Session Control message is not reported to the ULP if none of
the requested RTR message types are supported by the receiver. In this the requested RTR indication types are supported by the receiver. In
case, the Provider MAY generate a reject reply message indicating which RTR this case, the Provider MAY generate a reject reply message indicating
message types it supports. which RTR indication types it supports.
The enhanced DDP stream management MUST use the DDP stream session The enhanced DDP stream management MUST use the DDP stream session
termination function code to terminate a stream established using termination function code to terminate a stream established using
enhanced DDP stream session function codes. enhanced DDP stream session function codes.
[RFC5043] already supports either side sending the first DDP Message [RFC5043] already supports either side sending the first DDP Message
since the Payload Protocol Identifier (PPID) already distinguishes since the Payload Protocol Identifier (PPID) already distinguishes
between Session Establishment and DDP Segments. The enhanced RDMA between Session Establishment and DDP Segments. The enhanced RDMA
Connection Establishment provides to the ULP a transport independent Connection Establishment provides to the ULP a transport independent
way to support the peer-to-peer model. way to support the peer-to-peer model.
skipping to change at page 17, line 12 skipping to change at page 17, line 20
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0 |A|B| IRD |C|D| ORD | 0 |A|B| IRD |C|D| ORD |
4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
IRD: Inbound RDMA Read Queue Depth. IRD: Inbound RDMA Read Queue Depth.
ORD: Outbound RDMA Read Queue Depth. ORD: Outbound RDMA Read Queue Depth.
A: Control Flag for connection model. A: Control Flag for connection model.
B: Control Flag for zero length FULPDU (Send) RTR message. B: Control Flag for use of a zero length FULPDU (Send) RTR
indication.
C: Control Flag for zero length RDMA Write RTR message. C: Control Flag for use of a zero length RDMA Write RTR indication.
D: Control Flag for zero length RDMA Read RTR message. D: Control Flag for use of a zero length RDMA Read RTR indication.
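As an illustration of the 32-bit layout above, the following sketch packs and unpacks the four control flags and the two depth fields. The field widths are an assumption: the bit ruler is not shown in this excerpt, so 14 bits each for IRD and ORD is inferred from the 32-bit total and from the 0x3FFF value used later in the text. Network byte order with bit 0 as the most significant bit is also assumed, following the usual IETF diagram convention. The function names are hypothetical.

```python
import struct

MAX_DEPTH = 0x3FFF  # largest value that fits in an assumed 14-bit field


def pack_enhanced_data(a, b, ird, c, d, ord_):
    """Pack flags A, B, C, D plus IRD and ORD into 4 bytes."""
    for depth in (ird, ord_):
        if not 0 <= depth <= MAX_DEPTH:
            raise ValueError("IRD and ORD must fit in 14 bits")
    # First 16 bits: A, B, then IRD. Last 16 bits: C, D, then ORD.
    high = (int(bool(a)) << 15) | (int(bool(b)) << 14) | ird
    low = (int(bool(c)) << 15) | (int(bool(d)) << 14) | ord_
    return struct.pack("!HH", high, low)


def unpack_enhanced_data(data):
    """Inverse of pack_enhanced_data; reads the first 4 bytes of data."""
    high, low = struct.unpack("!HH", data[:4])
    return {
        "A": bool(high >> 15),
        "B": bool((high >> 14) & 1),
        "IRD": high & MAX_DEPTH,
        "C": bool(low >> 15),
        "D": bool((low >> 14) & 1),
        "ORD": low & MAX_DEPTH,
    }
```

For example, a peer-to-peer request (A set) that offers only the zero length RDMA Read option (D set), with IRD 16 and ORD 1, would encode as the bytes 80 10 40 01 under these assumptions.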
9.1. IRD and ORD Negotiation 9.1. IRD and ORD Negotiation
IRD and ORD are used for negotiation of Inbound RDMA Read Request IRD and ORD are used for negotiation of Inbound RDMA Read Request
Queue depths for both endpoints of the RDMA connection. IRD is used Queue depths for both endpoints of the RDMA connection. IRD is used
to configure the depth of the Inbound RDMA Read Request Queue (IRRQ) to configure the depth of the Inbound RDMA Read Request Queue (IRRQ)
on each endpoint. ORD is used to limit the number of simultaneous on each endpoint. ORD is used to limit the number of simultaneous
outbound RDMA Read Requests allowed at a given point in time in outbound RDMA Read Requests allowed at a given point in time in
order to avoid IRRQ overruns at the remote endpoint. In order to order to avoid IRRQ overruns at the remote endpoint. In order to
describe the negotiation of both local endpoint and remote endpoint describe the negotiation of both local endpoint and remote endpoint
skipping to change at page 18, line 29 skipping to change at page 18, line 38
responder ORD <= initiator IRD responder ORD <= initiator IRD
The responder and initiator MUST pass the peer's provided IRD and ORD The responder and initiator MUST pass the peer's provided IRD and ORD
values to the ULP, in addition to using the values as calculated by values to the ULP, in addition to using the values as calculated by
the preceding rules. the preceding rules.
Responder ORD SHOULD be set to a value less than or equal to Responder ORD SHOULD be set to a value less than or equal to
initiator IRD. If initiator ORD is insufficient to support the initiator IRD. If initiator ORD is insufficient to support the
selected connection model, responder IRD MAY be increased, for selected connection model, responder IRD MAY be increased, for
example if initiator ORD is 0 (RDMA Reads will not be used by the example if initiator ORD is 0 (RDMA Reads will not be used by the
ULP) and the responder supports a zero length RDMA Read RTR message, ULP) and the responder supports use of a zero length RDMA Read RTR
then responder IRD can be set to 1. The responder MUST set its ORD indication, then responder IRD can be set to 1. The responder MUST
at most to initiator IRD. The responder MAY reject the connection set its ORD at most to initiator IRD. The responder MAY reject the
request if initiator IRD is not sufficient for the ULP required ORD connection request if initiator IRD is not sufficient for the ULP
and specify the required ORD in the MPA Reject frame responder ORD. required ORD and specify the required ORD in the MPA Reject frame
Thus, the TERM message MUST contain Layer 2, Error Type 0, Error Code responder ORD. Thus, the TERM message MUST contain Layer 2, Error
6. Type 0, Error Code 6.
Upon receiving the MPA Accept frame from the responder, the initiator Upon receiving the MPA Accept frame from the responder, the initiator
MUST set its IRD at least to responder ORD and its ORD at most to MUST set its IRD at least to responder ORD and its ORD at most to
responder IRD. If the initiator does not have sufficient resources responder IRD. If the initiator does not have sufficient resources
for the required IRD, it MUST send a TERM message to the responder for the required IRD, it MUST send a TERM message to the responder
indicating insufficient resources, and terminate the connection due indicating insufficient resources, and terminate the connection due
to insufficient resources. Thus, the TERM message MUST contain Layer to insufficient resources. Thus, the TERM message MUST contain Layer
2, Error Type 0, Error Code 6. 2, Error Type 0, Error Code 6.
The initiator MUST pass the responder provided IRD and ORD to the ULP The initiator MUST pass the responder provided IRD and ORD to the ULP
skipping to change at page 19, line 19 skipping to change at page 19, line 28
value of 0x3FFF by leaving its local endpoint ORD value unchanged, value of 0x3FFF by leaving its local endpoint ORD value unchanged,
and setting ORD to 0x3FFF in its reply message. The initiator MUST and setting ORD to 0x3FFF in its reply message. The initiator MUST
leave its local endpoint IRD value unchanged upon receiving a leave its local endpoint IRD value unchanged upon receiving a
responder ORD value of 0x3FFF. responder ORD value of 0x3FFF.
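As a rough illustration of the initiator-side rules above, the following Python sketch adjusts the local IRD and ORD after an MPA Accept frame arrives. The function name, the max_ird resource limit, and the TermError exception are hypothetical and are not defined by the draft; only the value 0x3FFF and the TERM triple 2/0/6 are taken from the text.

```python
NO_CHANGE = 0x3FFF                 # special value discussed in the text
TERM_INSUFFICIENT_IRD = (2, 0, 6)  # Layer, Error Type, Error Code


class TermError(Exception):
    """Hypothetical signal that a TERM message must be sent."""

    def __init__(self, layer, error_type, error_code):
        super().__init__(f"TERM {layer}/{error_type}/{error_code}")
        self.triple = (layer, error_type, error_code)


def initiator_on_accept(local_ird, local_ord, max_ird, resp_ird, resp_ord):
    """Return the (IRD, ORD) pair the initiator uses after MPA Accept.

    max_ird is an assumed local resource limit, not a protocol field.
    """
    if resp_ord != NO_CHANGE:
        # IRD must be at least the responder ORD; fail if that is
        # beyond what the local endpoint can provide.
        if resp_ord > max_ird:
            raise TermError(*TERM_INSUFFICIENT_IRD)
        local_ird = max(local_ird, resp_ord)
    # ORD must be at most the responder IRD.
    local_ord = min(local_ord, resp_ird)
    return local_ird, local_ord
```

With a local IRD of 4, a local ORD of 8, and a responder reply of IRD 2 and ORD 6, the sketch raises the local IRD to 6 and lowers the local ORD to 2. A responder ORD of 0x3FFF leaves the local IRD at 4.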
9.2. Peer-to-Peer Connection Negotiation 9.2. Peer-to-Peer Connection Negotiation
Control Flag A value 1 indicates that a peer-to-peer connection model Control Flag A value 1 indicates that a peer-to-peer connection model
is being performed, and value 0 indicates a client-server model. is being performed, and value 0 indicates a client-server model.
Control Flag B value 1 indicates that a zero length FULPDU (Send) RTR Control Flag B value 1 indicates that a zero length FULPDU (Send) RTR
message is requested for the initiator and supported by the indication is requested for the initiator and supported by the
responder, respectively, 0 otherwise. Control Flag C value 1 responder, respectively, 0 otherwise. Control Flag C value 1
indicates that a zero length RDMA Write RTR message is requested for indicates that a zero length RDMA Write RTR indication is requested
the initiator and supported by the responder, respectively, 0 for the initiator and supported by the responder, respectively, 0
otherwise. Control Flag D value 1 indicates that a zero length RDMA otherwise. Control Flag D value 1 indicates that a zero length RDMA
Read RTR message is requested for the initiator and supported by the Read RTR indication is requested for the initiator and supported by
responder, respectively, 0 otherwise. The initiator MUST set Control the responder, respectively, 0 otherwise. The initiator MUST set
Flag A to 1 for peer-to-peer model. The initiator MUST set each Control Flag A to 1 for peer-to-peer model. The initiator MUST set
Control Flag B, C and D to 1 for each of the options it supports, if each Control Flag B, C and D to 1 for each of the options it
Control Flag A is set to 1. supports, if Control Flag A is set to 1.
The responder MUST support at least one RTR message option if it The responder MUST support at least one RTR indication option if it
supports Enhanced RDMA connection establishment. If Control Flag A supports Enhanced RDMA connection establishment. If Control Flag A
is 1 in the MPA request message then the responder MUST set Control is 1 in the MPA request message then the responder MUST set Control
Flag A to 1 in the MPA reply message. For each initiator supported Flag A to 1 in the MPA reply message. For each initiator supported
RTR message option the responder SHOULD set the corresponding Control RTR indication option the responder SHOULD set the corresponding
Flag if the responder can support that option in an MPA reply. The Control Flag if the responder can support that option in an MPA
responder is not required to specify all RTR message options it reply. The responder is not required to specify all RTR indication
supports. The responder MUST set at least one RTR message option if options it supports. The responder MUST set at least one RTR
it supports more than one initiator specified RTR message option. indication option if it supports more than one initiator specified
The responder MAY include additional RTR message options it supports, RTR indication option. The responder MAY include additional RTR
even if not requested by any initiator specified RTR message options. indication options it supports, even if not requested by any
If the responder does not support any of the initiator specified RTR initiator specified RTR indication options. If the responder does
message options then the responder MUST set at least one RTR message not support any of the initiator specified RTR indication options
type option it supports. then the responder MUST set at least one RTR indication type option
it supports.
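The responder rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the draft's wire encoding: the Rtr flag class and the responder_reply_flags function are hypothetical names, and the bit values assigned by auto() are arbitrary rather than the positions of Control Flags B, C, and D in the frame. The sketch advertises every option both sides support, which satisfies the SHOULD, and falls back to the responder's own options when nothing matches.

```python
from enum import Flag, auto


class Rtr(Flag):
    """RTR indication options (values are illustrative, not wire bits)."""
    SEND = auto()        # Control Flag B: zero length FULPDU (Send)
    RDMA_WRITE = auto()  # Control Flag C: zero length RDMA Write
    RDMA_READ = auto()   # Control Flag D: zero length RDMA Read


def responder_reply_flags(req_flag_a, req_options, supported):
    """Return (Control Flag A, RTR options) for the MPA reply."""
    if not req_flag_a:
        # Client-server model: A, B, C and D are all 0 in the reply.
        return False, Rtr(0)
    if not supported:
        raise ValueError("an enhanced responder must support one RTR option")
    common = req_options & supported
    # If no initiator option is supported, still advertise at least
    # one option the responder does support.
    return True, (common if common else supported)
```

When the fallback branch is taken, the reply contains no option the initiator requested, so the initiator is expected to end the exchange with the TERM message described in the next paragraph.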
Upon receiving the MPA accept frame with Control Flag A set to 1, the Upon receiving the MPA accept frame with Control Flag A set to 1, the
initiator MUST generate one of the negotiated RTR messages. If the initiator MUST generate one of the negotiated RTR indications. If
initiator is not able to generate any of the responder supported RTR the initiator is not able to generate any of the responder supported
messages, then it MUST send a TERM message to the responder RTR indications, then it MUST send a TERM message to the responder
indicating failure to negotiate a mutually compatible connection indicating failure to negotiate a mutually compatible connection
model or RTR option, and terminate the connection. Thus, the TERM model or RTR option, and terminate the connection. Thus, the TERM
message MUST contain Layer 2, Error Type 0, Error Code 7. The ULP message MUST contain Layer 2, Error Type 0, Error Code 7. The ULP
can negotiate a ULP level RTR message when a Provider level RTR can negotiate a ULP level RTR indication when a Provider level RTR
message cannot be negotiated. indication cannot be negotiated.
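A minimal sketch of the initiator's choice after the MPA accept frame, assuming the options are modelled as plain strings. The function name, the option strings, and the preference order are hypothetical; the text only requires that one of the negotiated options be generated, or that a TERM message with the triple 2/0/7 be sent when none can be.

```python
TERM_NO_MATCHING_RTR = (2, 0, 7)  # Layer, Error Type, Error Code


def initiator_select_rtr(reply_flag_a, reply_options, can_generate,
                         preference=("send", "rdma_write", "rdma_read")):
    """Decide what the initiator does after the MPA accept frame.

    reply_options: options the responder set in its reply.
    can_generate:  options the local initiator is able to generate.
    preference:    assumed local ordering, not mandated by the text.
    """
    if not reply_flag_a:
        # Client-server model: no RTR indication is negotiated.
        return ("none", None)
    for option in preference:
        if option in reply_options and option in can_generate:
            return ("rtr", option)
    # No mutually supported option: terminate with error code 7.
    return ("term", TERM_NO_MATCHING_RTR)
```

For example, if the responder advertises RDMA Write and RDMA Read while the initiator can generate Send and RDMA Read, the sketch selects RDMA Read.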
The initiator MUST set Control Flag A to 0 for client-server model. The initiator MUST set Control Flag A to 0 for client-server model.
The responder MUST set Control Flag A to 0 if Control Flag A is 0 in The responder MUST set Control Flag A to 0 if Control Flag A is 0 in
request. If Control Flag A is set to 0 then Control Flags B, C and D request. If Control Flag A is set to 0 then Control Flags B, C and D
MUST also be set to 0. On reception if Control Flag A is set to 0 MUST also be set to 0. On reception if Control Flag A is set to 0
then Control Flags B, C, and D MUST be ignored. then Control Flags B, C, and D MUST be ignored.
9.3. Enhanced Connection Negotiation Flow 9.3. Enhanced Connection Negotiation Flow
The RTR message type and ORD/IRD negotiation follows the following The RTR indication type and ORD/IRD negotiation follows the following
order: order:
initiator (MPA Request) --> Set Control Flag A to 1 to indicate initiator (MPA Request) --> Set Control Flag A to 1 to indicate
peer-to-peer connection model and initiator IRD, ORD setting on peer-to-peer connection model and initiator IRD, ORD setting on
local Endpoint of the connection. Set Control Flags B, C, and D local Endpoint of the connection. Set Control Flags B, C, and D
to 1 for each initiator-supported option of RTR message. to 1 for each initiator-supported option of RTR indication.
responder (MPA Reply) <-- Match the initiator Control Flag A value responder (MPA Reply) <-- Match the initiator Control Flag A value
and set ORD/IRD to the responder local endpoint values based upon and set ORD/IRD to the responder local endpoint values based upon
the initiator initial ORD/IRD values and the number of the initiator initial ORD/IRD values and the number of
simultaneous RDMA Read Requests required by the ULP. Sets Control simultaneous RDMA Read Requests required by the ULP. Sets Control
Flags B, C, and D to 1 for responder-supported options of RTR Flags B, C, and D to 1 for responder-supported options of RTR
message options for peer-to-peer connection model and sets the indication options for peer-to-peer connection model and sets the
responder IRD/ORD actual values. responder IRD/ORD actual values.
initiator (First RDMA Message) --> After the initiator modifies its initiator (First RDMA Message) --> After the initiator modifies its
ORD/IRD to match the responder's values as stated above, the ORD/IRD to match the responder's values as stated above, the
initiator sends the first message of negotiated RTR message initiator sends the first message of negotiated RTR indication
option. If no matching RTR message option exists then the option. If no matching RTR indication option exists then the
initiator sends a TERM message. initiator sends a TERM message.
The initiator or responder MUST generate the TERM message that The initiator or responder MUST generate the TERM message that
contains Layer 2, Error Type 0, Error Code 5 when it encounters any contains Layer 2, Error Type 0, Error Code 5 when it encounters any
error locally for which the special Error Code is not defined in error locally for which the special Error Code is not defined in
section Section 8 before resetting the connection. Section 8 before resetting the connection.
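The three TERM triples used in this section can be collected in a small lookup table. The sketch below assumes hypothetical condition names as dictionary keys; the numeric values are the Layer, Error Type, and Error Code given in the text, and any condition without its own code falls back to Error Code 5.

```python
# (Layer, Error Type, Error Code) triples named in this section.
MPA_TERM_CODES = {
    "local_catastrophic": (2, 0, 5),
    "insufficient_ird": (2, 0, 6),
    "no_matching_rtr": (2, 0, 7),
}


def term_triple(condition):
    """Map a local error condition to the TERM triple to send.

    Conditions without a dedicated code use the catch-all code 5.
    """
    return MPA_TERM_CODES.get(condition, MPA_TERM_CODES["local_catastrophic"])
```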
10. Interoperability 10. Interoperability
The initiator requests enhanced RDMA connection establishment by
sending an enhanced RDMA establishment request; an enhanced responder
is REQUIRED to respond with an enhanced RDMA connection establishment
response, whereas an unenhanced responder treats the enhanced request
as incorrectly formatted and closes the TCP connection. All
responders are REQUIRED to issue unenhanced RDMA connection
establishment responses in response to unenhanced RDMA connection
establishment requests.
The initiator MUST NOT use the enhanced RDMA connection establishment The initiator MUST NOT use the enhanced RDMA connection establishment
formats or function codes when no enhanced functionality is desired. formats or function codes when no enhanced functionality is desired.
The responder MUST continue to accept unenhanced connection requests. The responder MUST continue to accept unenhanced connection requests.
There are three initiator/responder cases that involve enhanced MPA: There are three initiator/responder cases that involve enhanced MPA:
both the initiator and responder, only the responder, and only the both the initiator and responder, only the responder, and only the
initiator. The enhanced MPA frame is defined by field 'S' set to 1. initiator. The enhanced MPA frame is defined by field 'S' set to 1.
Enhanced MPA initiator and responder: If the responder receives an Enhanced MPA initiator and responder: If the responder receives an
skipping to change at page 21, line 29 skipping to change at page 21, line 48
Thus, both the initiator and responder report TCP connection Thus, both the initiator and responder report TCP connection
termination to an application locally. In this case the initiator termination to an application locally. In this case the initiator
MAY attempt to establish an RDMA connection using the unenhanced MAY attempt to establish an RDMA connection using the unenhanced
MPA protocol as defined in [RFC5044] if this protocol is MPA protocol as defined in [RFC5044] if this protocol is
compatible with the application, and let ULP deal with ORD and compatible with the application, and let ULP deal with ORD and
IRD, and peer-to-peer negotiations. IRD, and peer-to-peer negotiations.
A note on potential future enhancements for connection A note on potential future enhancements for connection
establishment negotiation: It is possible to further extend establishment negotiation: It is possible to further extend
formatting of private data of the MPA Request and Reply frames and to formatting of private data of the MPA Request and Reply frames and to
use other bits from "Res" field to indicate that private data use other bits from "Res" field to indicate additional private data
formatting. formatting.
11. IANA Considerations 11. IANA Considerations
IANA is requested to add the following entries to the "SCTP Function IANA is requested to add the following entries to the "SCTP Function
Codes for DDP Session Control" registry created by Section 3.4 of Codes for DDP Session Control" registry created by Section 3.4 of
[IANA_RDDP_REGISTRY]: [IANA_RDDP_REGISTRY]:
0x0005, Enhanced DDP Stream Session Initiate, [RFCXXXX] 0x0005, Enhanced DDP Stream Session Initiate, [RFCXXXX]
skipping to change at page 22, line 4 skipping to change at page 22, line 21
0x0006, Enhanced DDP Stream Session Accept, [RFCXXXX] 0x0006, Enhanced DDP Stream Session Accept, [RFCXXXX]
0x0007, Enhanced DDP Stream Session Reject, [RFCXXXX] 0x0007, Enhanced DDP Stream Session Reject, [RFCXXXX]
IANA is requested to add the following entries to the "MPA Errors" IANA is requested to add the following entries to the "MPA Errors"
registry created by Section 3.3 of [IANA_RDDP_REGISTRY]: registry created by Section 3.3 of [IANA_RDDP_REGISTRY]:
0x2/0x0/0x05 - MPA Error / Local catastrophic error, [RFCXXXX] 0x2/0x0/0x05 - MPA Error / Local catastrophic error, [RFCXXXX]
0x2/0x0/0x06 - MPA Error / Insufficient IRD resources, [RFCXXXX] 0x2/0x0/0x06 - MPA Error / Insufficient IRD resources, [RFCXXXX]
0x2/0x0/0x07 - MPA Error / No matching RTR option, [RFCXXXX] 0x2/0x0/0x07 - MPA Error / No matching RTR option, [RFCXXXX]
RFC Editor: Please replace XXXX in the six instances of [RFCXXXX] RFC Editor: Please replace XXXX in the six instances of [RFCXXXX]
above with the RFC number of this document and remove this note. above with the RFC number of this document and remove this note.
12. Security Considerations 12. Security Considerations
The security considerations from RFC 5044 and RFC 5043 apply and the The security considerations from RFC 5044 and RFC 5043 apply and the
changes in this document do not introduce new security changes in this document do not introduce new security
considerations. However it is recommended that implementations do considerations. However it is recommended that implementations do
sanity checking for the input parameters, including ORD, IRD, and sanity checking for the input parameters, including ORD, IRD, and the
RTR. control flags used for RTR indication option negotiation.
13. Acknowledgements 13. Acknowledgements
The authors wish to thank Sean Hefty, Dave Minturn, Tom Talpey, David The authors wish to thank Sean Hefty, Dave Minturn, Tom Talpey, David
Black and David Harrington for their valuable contributions and Black and David Harrington for their valuable contributions and
reviews of this document. reviews of this document.
14. References 14. References
14.1. Normative References 14.1. Normative References
 End of changes. 40 change blocks. 
83 lines changed or deleted 101 lines changed or added
