| draft-ietf-storm-mpa-peer-connect-08.txt | draft-ietf-storm-mpa-peer-connect-09.txt | |||
|---|---|---|---|---|
| STORM A. Kanevsky, Ed. | STORM A. Kanevsky, Ed. | |||
| Internet-Draft Dell Inc. | Internet-Draft Dell Inc. | |||
| Updates: 5043, 5044 (if approved) C. Bestler, Ed. | Updates: 5043, 5044 (if approved) C. Bestler, Ed. | |||
| Intended status: Standards Track Nexenta Systems | Intended status: Standards Track Nexenta Systems | |||
| Expires: April 24, 2012 R. Sharp | Expires: June 17, 2012 R. Sharp | |||
| Intel | Intel | |||
| S. Wise | S. Wise | |||
| Open Grid Computing | Open Grid Computing | |||
| October 22, 2011 | December 15, 2011 | |||
| Enhanced RDMA Connection Establishment | Enhanced RDMA Connection Establishment | |||
| draft-ietf-storm-mpa-peer-connect-08 | draft-ietf-storm-mpa-peer-connect-09 | |||
| Abstract | Abstract | |||
| This document updates RFC 5043 and RFC 5044 by extending Marker | This document updates RFC 5043 and RFC 5044 by extending Marker | |||
| Protocol Data Unit (PDU) Aligned Framing (MPA) negotiation for Remote | Protocol Data Unit (PDU) Aligned Framing (MPA) negotiation for Remote | |||
| Direct Memory Access (RDMA) connection establishment. The first | Direct Memory Access (RDMA) connection establishment. The first | |||
| enhancement extends RFC 5044, enabling peer-to-peer connection | enhancement extends RFC 5044, enabling peer-to-peer connection | |||
| establishment over MPA/Transmission Control Protocol (TCP). The | establishment over MPA/Transmission Control Protocol (TCP). The | |||
| second enhancement extends both RFC 5043 and RFC 5044, by providing | second enhancement extends both RFC 5043 and RFC 5044, by providing | |||
| an option for standardized exchange of RDMA-layer connection | an option for standardized exchange of RDMA-layer connection | |||
| skipping to change at page 1, line 42 ¶ | skipping to change at page 1, line 42 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on April 24, 2012. | This Internet-Draft will expire on June 17, 2012. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2011 IETF Trust and the persons identified as the | Copyright (c) 2011 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| skipping to change at page 2, line 20 ¶ | skipping to change at page 2, line 20 ¶ | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.1. Summary of changes affecting RFC 5044 . . . . . . . . . . 4 | 1.1. Summary of changes affecting RFC 5044 . . . . . . . . . . 4 | |||
| 1.2. Summary of changes affecting RFC 5043 . . . . . . . . . . 4 | 1.2. Summary of changes affecting RFC 5043 . . . . . . . . . . 4 | |||
| 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 | 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 | |||
| 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 4. Motivations . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 4. Motivations . . . . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 4.1. Standardization of RDMA Read Parameter Configuration . . . 7 | 4.1. Standardization of RDMA Read Parameter Configuration . . . 7 | |||
| 4.2. Enabling MPA Mode . . . . . . . . . . . . . . . . . . . . 8 | 4.2. Enabling MPA Mode . . . . . . . . . . . . . . . . . . . . 9 | |||
| 4.3. Lack of Explicit RTR in MPA Request/Reply Exchange . . . . 9 | 4.3. Lack of Explicit RTR in MPA Request/Reply Exchange . . . . 9 | |||
| 4.4. Limitations on ULP Workaround . . . . . . . . . . . . . . 10 | 4.4. Limitations on ULP Workaround . . . . . . . . . . . . . . 10 | |||
| 4.4.1. Transport Neutral APIs . . . . . . . . . . . . . . . . 11 | 4.4.1. Transport Neutral APIs . . . . . . . . . . . . . . . . 11 | |||
| 4.4.2. Work/Completion Queue Accounting . . . . . . . . . . . 11 | 4.4.2. Work/Completion Queue Accounting . . . . . . . . . . . 11 | |||
| 4.4.3. Host-based Implementation of MPA Fencing . . . . . . . 12 | 4.4.3. Host-based Implementation of MPA Fencing . . . . . . . 12 | |||
| 5. Enhanced MPA Connection Establishment . . . . . . . . . . . . 12 | 5. Enhanced MPA Connection Establishment . . . . . . . . . . . . 12 | |||
| 6. Enhanced MPA Request/Reply Frames . . . . . . . . . . . . . . 13 | 6. Enhanced MPA Request/Reply Frames . . . . . . . . . . . . . . 13 | |||
| 7. Enhanced SCTP Session Control Chunks . . . . . . . . . . . . . 14 | 7. Enhanced SCTP Session Control Chunks . . . . . . . . . . . . . 14 | |||
| 8. MPA Error Reporting . . . . . . . . . . . . . . . . . . . . . 16 | 8. MPA Error Reporting . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 9. Enhanced RDMA Connection Establishment Data . . . . . . . . . 16 | 9. Enhanced RDMA Connection Establishment Data . . . . . . . . . 16 | |||
| 9.1. IRD and ORD Negotiation . . . . . . . . . . . . . . . . . 17 | 9.1. IRD and ORD Negotiation . . . . . . . . . . . . . . . . . 17 | |||
| 9.2. Peer-to-Peer Connection Negotiation . . . . . . . . . . . 19 | 9.2. Peer-to-Peer Connection Negotiation . . . . . . . . . . . 19 | |||
| 9.3. Enhanced Connection Negotiation Flow . . . . . . . . . . . 20 | 9.3. Enhanced Connection Negotiation Flow . . . . . . . . . . . 20 | |||
| 10. Interoperability . . . . . . . . . . . . . . . . . . . . . . . 20 | 10. Interoperability . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 | 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 12. Security Considerations . . . . . . . . . . . . . . . . . . . 22 | 12. Security Considerations . . . . . . . . . . . . . . . . . . . 22 | |||
| 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 | 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 14.1. Normative References . . . . . . . . . . . . . . . . . . . 22 | 14.1. Normative References . . . . . . . . . . . . . . . . . . . 22 | |||
| 14.2. Informative References . . . . . . . . . . . . . . . . . . 23 | 14.2. Informative References . . . . . . . . . . . . . . . . . . 23 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 23 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 1. Introduction | 1. Introduction | |||
| When used over Transmission Control Protocol (TCP), the current | When used over Transmission Control Protocol (TCP), the current | |||
| Remote Direct Data Placement (RDDP) [RFC5041] suite of protocols | Remote Direct Data Placement (RDDP) [RFC5041] suite of protocols | |||
| relies on the MPA [RFC5044] protocol for both connection establishment | relies on the MPA [RFC5044] protocol for both connection establishment | |||
| and for markers for TCP layering. | and for markers for TCP layering. | |||
| A typical model for establishing an RDMA connection has the following | A typical model for establishing an RDMA connection has the following | |||
| steps: | steps: | |||
| skipping to change at page 4, line 9 ¶ | skipping to change at page 4, line 9 ¶ | |||
| negotiation of some of Remote Direct Memory Access Protocol (RDMAP) | negotiation of some of Remote Direct Memory Access Protocol (RDMAP) | |||
| [RFC5040] specific parameters are left to ULP negotiation. Providing | [RFC5040] specific parameters are left to ULP negotiation. Providing | |||
| an optional ULP-independent format for exchanging these parameters | an optional ULP-independent format for exchanging these parameters | |||
| would be of benefit to transport neutral Remote Direct Memory Access | would be of benefit to transport neutral Remote Direct Memory Access | |||
| (RDMA) applications. | (RDMA) applications. | |||
| 1.1. Summary of changes affecting RFC 5044 | 1.1. Summary of changes affecting RFC 5044 | |||
| This draft enhances the [RFC5044] MPA connection setup protocol. First, | This draft enhances the [RFC5044] MPA connection setup protocol. First, | |||
| it adds exchange and negotiation of the parameters necessary to | it adds exchange and negotiation of the parameters necessary to | |||
| support RDMA Read Requests. Second, it adds a Ready to Receive (RTR) | support RDMA Read Requests. Second, it adds a message that serves as | |||
| message from the initiator to the responder as the last message of | a Ready to Receive (RTR) indication from the initiator to the | |||
| connection establishment and adds negotiation of an RTR message type | responder as the last message of connection establishment and adds | |||
| into MPA request/reply frames. | negotiation of which type of message to use to carry the RTR | |||
| indication into MPA request/reply frames. | ||||
| 1.2. Summary of changes affecting RFC 5043 | 1.2. Summary of changes affecting RFC 5043 | |||
| This draft enhances [RFC5043] by adding new Enhanced Session Control | This draft enhances [RFC5043] by adding new Enhanced Session Control | |||
| Chunks that extend the currently defined Chunks with the addition of | Chunks that extend the currently defined Chunks with the addition of | |||
| Inbound RDMA Read Queue Depth (IRD) and Outbound RDMA Read Queue | Inbound RDMA Read Queue Depth (IRD) and Outbound RDMA Read Queue | |||
| Depth (ORD) negotiation. | Depth (ORD) negotiation. | |||
| 2. Requirements Language | 2. Requirements Language | |||
| skipping to change at page 6, line 9 ¶ | skipping to change at page 6, line 14 ¶ | |||
| Remote Peer: The MPA protocol implementation on the opposite end of | Remote Peer: The MPA protocol implementation on the opposite end of | |||
| the connection. Used to refer to the remote entity when | the connection. Used to refer to the remote entity when | |||
| describing protocol exchanges or other interactions between two | describing protocol exchanges or other interactions between two | |||
| Nodes. See [RFC5044]. | Nodes. See [RFC5044]. | |||
| Responder: The connection endpoint that responds to an incoming MPA | Responder: The connection endpoint that responds to an incoming MPA | |||
| connection request (the MPA Request Frame). Responder is the | connection request (the MPA Request Frame). Responder is the | |||
| passive side of the connection establishment. See [RFC5044]. | passive side of the connection establishment. See [RFC5044]. | |||
| Ready to Receive (RTR): RTR is the last connection establishment | Ready to Receive (RTR): RTR is an indication provided by the last | |||
| message sent from the initiator to the responder indicating that | connection establishment message sent from the initiator to the | |||
| the initiator is ready to receive messages and that connection | responder. An RTR indicates that the initiator is ready to | |||
| establishment is completed. See [IBTA]. | receive messages and that connection establishment is completed. | |||
| Startup Phase: The initial exchanges of an MPA connection that | Startup Phase: The initial exchanges of an MPA connection that | |||
| serves to more fully identify MPA endpoints to each other and pass | serves to more fully identify MPA endpoints to each other and pass | |||
| connection specific setup information to each other. See | connection specific setup information to each other. See | |||
| [RFC5044]. | [RFC5044]. | |||
| Shared Receive Queue (SRQ): A shared pool of Receive Work Requests | Shared Receive Queue (SRQ): A shared pool of Receive Work Requests | |||
| posted by the Consumer that can be allocated by multiple RDMA | posted by the Consumer that can be allocated by multiple RDMA | |||
| endpoints (Queue Pair). See [RDMAC]. | endpoints (Queue Pair). See [RDMAC]. | |||
| skipping to change at page 10, line 6 ¶ | skipping to change at page 10, line 15 ¶ | |||
| o Before the first MPA frame is transmitted, all pre-MPA mode TCP | o Before the first MPA frame is transmitted, all pre-MPA mode TCP | |||
| payload will have been acknowledged by the peer. Therefore it is | payload will have been acknowledged by the peer. Therefore it is | |||
| never necessary to generate a retransmission that mixes pre-MPA | never necessary to generate a retransmission that mixes pre-MPA | |||
| and MPA payload. | and MPA payload. | |||
| o Before MPA reception is enabled, all incoming pre-MPA mode TCP | o Before MPA reception is enabled, all incoming pre-MPA mode TCP | |||
| payload will have been acknowledged. Therefore the host will | payload will have been acknowledged. Therefore the host will | |||
| never receive a TCP segment that mixes pre-MPA and MPA payload. | never receive a TCP segment that mixes pre-MPA and MPA payload. | |||
| The limitation of the current MPA Request/Reply exchange is that it | The limitation of the current MPA Request/Reply exchange is that it | |||
| does not define a Ready to Receive (RTR) message that the active side | does not define a Ready to Receive (RTR) indication that the active | |||
| would send, so that the passive side can know that the last non-MPA | side would send, so that the passive side can know that the last non- | |||
| payload (the MPA Reply) had been received. | MPA payload (the MPA Reply) had been received. | |||
| Instead, the role of an RTR message is piggy-backed on the first MPA | Instead, the role of an RTR indication is piggy-backed on the first | |||
| FULPDU sent by the active side. This is actually a valuable | MPA FULPDU sent by the active side. This is actually a valuable | |||
| optimization for all applications that fit the classic client/server | optimization for all applications that fit the classic client/server | |||
| model. The client only initiates the connection when it has a | model. The client only initiates the connection when it has a | |||
| request to send to the server, and the server has nothing to send | request to send to the server, and the server has nothing to send | |||
| until it has received and processed the client request. | until it has received and processed the client request. | |||
| Even applications where the server sends some configuration data | Even applications where the server sends some configuration data | |||
| immediately can easily send the same information as application | immediately can easily send the same information as application | |||
| private data in the MPA Reply. So the currently defined exchange | private data in the MPA Reply. So the currently defined exchange | |||
| works for almost all applications. | works for almost all applications. | |||
| skipping to change at page 10, line 34 ¶ | skipping to change at page 10, line 43 ¶ | |||
| [UsingMPI], or [RDS]), have no natural client or server roles | [UsingMPI], or [RDS]), have no natural client or server roles | |||
| ([PPMPI], [OpenMP]). Typically one member of the cluster is | ([PPMPI], [OpenMP]). Typically one member of the cluster is | |||
| arbitrarily selected to initiate the connection when the distributed | arbitrarily selected to initiate the connection when the distributed | |||
| task is launched, while the other accepts it. At startup time, | task is launched, while the other accepts it. At startup time, | |||
| however, there is no way to predict which node will have the first | however, there is no way to predict which node will have the first | |||
| message to actually send. Establishing the connections immediately, | message to actually send. Establishing the connections immediately, | |||
| however, is valuable because it reduces latency once results are | however, is valuable because it reduces latency once results are | |||
| ready to transmit and it validates connectivity throughout the | ready to transmit and it validates connectivity throughout the | |||
| cluster. | cluster. | |||
| The lack of an explicit RTR message in the MPA Request/Reply exchange | The lack of an explicit RTR indication in the MPA Request/Reply | |||
| forces all applications to have a first message from the connection | exchange forces all applications to have a first message from the | |||
| initiator, whether this matches the application communication model | connection initiator, whether this matches the application | |||
| or not. | communication model or not. | |||
| 4.4. Limitations on ULP Workaround | 4.4. Limitations on ULP Workaround | |||
| The requirement that the RDMA connection initiator sends the first | The requirement that the RDMA connection initiator sends the first | |||
| message does not appear to be onerous on first examination. The | message does not appear to be onerous on first examination. The | |||
| natural question is why the application layer would not simply | natural question is why the application layer would not simply | |||
| generate a dummy message when there was no other message to submit. | generate a dummy message when there was no other message to submit. | |||
| There are three factors that make this workaround unsuitable for many | There are three factors that make this workaround unsuitable for many | |||
| peer-to-peer applications. | peer-to-peer applications. | |||
| skipping to change at page 12, line 41 ¶ | skipping to change at page 12, line 49 ¶ | |||
| is to allow standard negotiation of ORD/IRD setting on both sides of | is to allow standard negotiation of ORD/IRD setting on both sides of | |||
| the RDMA connection and/or to negotiate the initial data transfer | the RDMA connection and/or to negotiate the initial data transfer | |||
| operation by the initiator when the existing 'client sends first' | operation by the initiator when the existing 'client sends first' | |||
| rule does not match application requirements. | rule does not match application requirements. | |||
| The RDMA connection initiator sends an MPA Request, as specified in | The RDMA connection initiator sends an MPA Request, as specified in | |||
| [RFC5044]; the new format defined here allows for: | [RFC5044]; the new format defined here allows for: | |||
| o Standardized negotiation of ORD and IRD. | o Standardized negotiation of ORD and IRD. | |||
| o Negotiation of an RTR message. | o Negotiation of RTR functionality and the RDMA message type to use | |||
| as the RTR indication. | ||||
| The RDMA connection responder processes the MPA Request and generates | The RDMA connection responder processes the MPA Request and generates | |||
| an MPA Reply, as specified in [RFC5044]; the new format completes the | an MPA Reply, as specified in [RFC5044]; the new format completes the | |||
| negotiation. | negotiation. | |||
| The local interface needs to provide a way for a ULP to request the | The local interface needs to provide a way for a ULP to request the | |||
| use of explicit RTR messages on a per-application or per-connection basis | use of an explicit RTR indication on a per-application or per-connection | |||
| when an explicit RTR message will be required. Piggy-backing the RTR | basis when an explicit RTR indication will be required. Piggy- | |||
| on a Client's first message is a valuable optimization for most | backing the RTR on a Client's first message is a valuable | |||
| connections. | optimization for most connections. | |||
| The RDMA connection initiator MUST NOT allow any later FULPDUs to be | The RDMA connection initiator MUST NOT allow any later FULPDUs to be | |||
| transmitted before the RTR message. One method to achieve that is to | transmitted before the RTR indication. One method to achieve that is | |||
| delay notifying the ULP that the RDMA connection has been established | to delay notifying the ULP that the RDMA connection has been | |||
| until after any required RTR Message has been transmitted. | established until after any required RTR indication has been | |||
| transmitted. | ||||
| All MPA exchanges are performed via TCP prior to RDMA establishment, | All MPA exchanges are performed via TCP prior to RDMA establishment, | |||
| and are therefore signaled via TCP and not via RDMA completion. | and are therefore signaled via TCP and not via RDMA completion. | |||
| 6. Enhanced MPA Request/Reply Frames | 6. Enhanced MPA Request/Reply Frames | |||
| Enhanced RDMA connection establishment uses an alternate format for | Enhanced RDMA connection establishment uses an alternate format for | |||
| MPA Requests and Replies, as follows: | MPA Requests and Replies, as follows: | |||
| 0 1 2 3 | 0 1 2 3 | |||
| skipping to change at page 14, line 27 ¶ | skipping to change at page 14, line 38 ¶ | |||
| higher. If no enhanced connection establishment features are | higher. If no enhanced connection establishment features are | |||
| desired it MAY be set to one. A host accepting MPA connections | desired it MAY be set to one. A host accepting MPA connections | |||
| MUST continue to accept MPA Requests with version one even if it | MUST continue to accept MPA Requests with version one even if it | |||
| supports version two. | supports version two. | |||
| PD_Length: Unchanged from [RFC5044]. This is the total length of | PD_Length: Unchanged from [RFC5044]. This is the total length of | |||
| the Private Data field, including the enhanced RDMA connection | the Private Data field, including the enhanced RDMA connection | |||
| establishment data if present. | establishment data if present. | |||
| Private Data: Unchanged from [RFC5044]. However, if the 'S' flag is | Private Data: Unchanged from [RFC5044]. However, if the 'S' flag is | |||
| set, Private Data begins with enhanced RDMA connection | set, Private Data MUST begin with enhanced RDMA connection | |||
| establishment data. | establishment data (see Section 9). | |||
| 7. Enhanced SCTP Session Control Chunks | 7. Enhanced SCTP Session Control Chunks | |||
| Enhanced RDMA Connection Establishment uses the first 32 bits of the | Enhanced RDMA Connection Establishment uses the first 32 bits of the | |||
| Private data field for IRD and ORD negotiation in the "DDP Stream | Private data field for IRD and ORD negotiation in the "DDP Stream | |||
| Session Initiate" and "DDP Stream Session Accept" SCTP Session | Session Initiate" and "DDP Stream Session Accept" SCTP Session | |||
| Control Chunks. | Control Chunks. | |||
| The type of the SCTP Session Control Chunk is defined by a Function | The type of the SCTP Session Control Chunk is defined by a Function | |||
| Code (see [RFC4960]). [RFC5043] already defines codes for 'DDP | Code (see [RFC4960]). [RFC5043] already defines codes for 'DDP | |||
| Stream Session Initiate' and 'DDP Stream Session Accept', which are | Stream Session Initiate' and 'DDP Stream Session Accept', which are | |||
| equivalent to a MPA Request Frame and an accepting MPA Reply Frame. | equivalent to a MPA Request Frame and an accepting MPA Reply Frame. | |||
| Enhanced RDMA connection establishment requires three additional | Enhanced RDMA connection establishment requires three additional | |||
| Function codes listed below: | Function codes listed below: | |||
| Enhanced DDP Stream Session Initiate: 0x005 | Enhanced DDP Stream Session Initiate: 0x005 | |||
| Enhanced DDP Stream Session Accept: 0x006 | Enhanced DDP Stream Session Accept: 0x006 | |||
| Enhanced DDP Stream Session Reject: 0x007 | Enhanced DDP Stream Session Reject: 0x007 | |||
| The Enhanced Reject function code MUST be used to indicate rejection | The Enhanced Reject function code MUST be used to indicate rejection | |||
| of enhanced DDP stream session for a configuration that would have | of enhanced DDP stream session for a configuration that would have | |||
| been accepted for unenhanced DDP Stream Session negotiation. | been accepted for unenhanced DDP Stream Session negotiation. | |||
| The Enhanced DDP stream session establishment follows the same rules | The Enhanced DDP stream session establishment follows the same rules | |||
| as the standard DDP stream session establishment as defined in | as the standard DDP stream session establishment as defined in | |||
| [RFC5043]. ULP-supplied Private Data MUST be included for Enhanced | [RFC5043]. ULP-supplied Private Data MUST be included for Enhanced | |||
| DDP Stream Session Initiate, Enhanced DDP Stream Session Accept, and | DDP Stream Session Initiate, Enhanced DDP Stream Session Accept, and | |||
| Enhanced DDP Stream Session Reject messages. | Enhanced DDP Stream Session Reject messages, and MUST follow the | |||
| enhanced RDMA connection establishment data in the DDP Stream Session | ||||
| Initiate and the Enhanced DDP Stream Session Accept messages. | ||||
| Private Data length MUST NOT exceed 512 bytes in any message, | Private Data length MUST NOT exceed 512 bytes in any message, | |||
| including enhanced RDMA connection establishment data. | including enhanced RDMA connection establishment data. | |||
| Private Data MUST NOT be included in the DDP Stream Session TERM | Private Data MUST NOT be included in the DDP Stream Session TERM | |||
| message. | message. | |||
| Received Extended DDP Stream Session Control messages SHOULD be | Received Extended DDP Stream Session Control messages SHOULD be | |||
| reported to the ULP. If reported, any supplied Private Data MUST be | reported to the ULP. If reported, any supplied Private Data MUST be | |||
| available for the ULP to examine. For example, a received Extended | available for the ULP to examine. For example, a received Extended | |||
| DDP Stream Session Control message is not reported to the ULP if none of | DDP Stream Session Control message is not reported to the ULP if none of | |||
| the requested RTR message types are supported by the receiver. In this | the requested RTR indication types are supported by the receiver. In | |||
| case, the Provider MAY generate a reject reply message indicating which RTR | this case, the Provider MAY generate a reject reply message indicating | |||
| message types it supports. | which RTR indication types it supports. | |||
| The enhanced DDP stream management MUST use the DDP stream session | The enhanced DDP stream management MUST use the DDP stream session | |||
| termination function code to terminate a stream established using | termination function code to terminate a stream established using | |||
| enhanced DDP stream session function codes. | enhanced DDP stream session function codes. | |||
| [RFC5043] already supports either side sending the first DDP Message | [RFC5043] already supports either side sending the first DDP Message | |||
| since the Payload Protocol Identifier (PPID) already distinguishes | since the Payload Protocol Identifier (PPID) already distinguishes | |||
| between Session Establishment and DDP Segments. The enhanced RDMA | between Session Establishment and DDP Segments. The enhanced RDMA | |||
| Connection Establishment provides to the ULP a transport independent | Connection Establishment provides to the ULP a transport independent | |||
| way to support the peer-to-peer model. | way to support the peer-to-peer model. | |||
| skipping to change at page 17, line 12 ¶ | skipping to change at page 17, line 20 ¶ | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| 0 |A|B| IRD |C|D| ORD | | 0 |A|B| IRD |C|D| ORD | | |||
| 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| IRD: Inbound RDMA Read Queue Depth. | IRD: Inbound RDMA Read Queue Depth. | |||
| ORD: Outbound RDMA Read Queue Depth. | ORD: Outbound RDMA Read Queue Depth. | |||
| A: Control Flag for connection model. | A: Control Flag for connection model. | |||
| B: Control Flag for zero length FULPDU (Send) RTR message. | B: Control Flag for use of a zero length FULPDU (Send) RTR | |||
| indication. | ||||
| C: Control Flag for zero length RDMA Write RTR message. | C: Control Flag for use of a zero length RDMA Write RTR indication. | |||
| D: Control Flag for zero length RDMA Read RTR message. | D: Control Flag for use of a zero length RDMA Read RTR indication. | |||
| 9.1. IRD and ORD Negotiation | 9.1. IRD and ORD Negotiation | |||
| IRD and ORD are used for negotiation of Inbound RDMA Read Request | IRD and ORD are used for negotiation of Inbound RDMA Read Request | |||
| Queue depths for both endpoints of the RDMA connection. IRD is used | Queue depths for both endpoints of the RDMA connection. IRD is used | |||
| to configure the depth of the Inbound RDMA Read Request Queue (IRRQ) | to configure the depth of the Inbound RDMA Read Request Queue (IRRQ) | |||
| on each endpoint. ORD is used to limit the number of simultaneous | on each endpoint. ORD is used to limit the number of simultaneous | |||
| outbound RDMA Read Requests allowed at a given point in time in | outbound RDMA Read Requests allowed at a given point in time in | |||
| order to avoid IRRQ overruns at the remote endpoint. In order to | order to avoid IRRQ overruns at the remote endpoint. In order to | |||
| describe the negotiation of both local endpoint and remote endpoint | describe the negotiation of both local endpoint and remote endpoint | |||
| skipping to change at page 18, line 29 ¶ | skipping to change at page 18, line 38 ¶ | |||
| responder ORD <= initiator IRD | responder ORD <= initiator IRD | |||
| The responder and initiator MUST pass the peer's provided IRD and ORD | The responder and initiator MUST pass the peer's provided IRD and ORD | |||
| values to the ULP, in addition to using the values as calculated by | values to the ULP, in addition to using the values as calculated by | |||
| the preceding rules. | the preceding rules. | |||
| Responder ORD SHOULD be set to a value less than or equal to | Responder ORD SHOULD be set to a value less than or equal to | |||
| initiator IRD. If initiator ORD is insufficient to support the | initiator IRD. If initiator ORD is insufficient to support the | |||
| selected connection model, responder IRD MAY be increased, for | selected connection model, responder IRD MAY be increased, for | |||
| example if initiator ORD is 0 (RDMA Reads will not be used by the | example if initiator ORD is 0 (RDMA Reads will not be used by the | |||
| ULP) and the responder supports a zero length RDMA Read RTR message, | ULP) and the responder supports use of a zero length RDMA Read RTR | |||
| then responder IRD can be set to 1. The responder MUST set its ORD | indication, then responder IRD can be set to 1. The responder MUST | |||
| at most to initiator IRD. The responder MAY reject the connection | set its ORD at most to initiator IRD. The responder MAY reject the | |||
| request if initiator IRD is not sufficient for the ULP required ORD | connection request if initiator IRD is not sufficient for the ULP | |||
| and specify the required ORD in the MPA Reject frame responder ORD. | required ORD and specify the required ORD in the MPA Reject frame | |||
| Thus, the TERM message MUST contain Layer 2, Error Type 0, Error Code | responder ORD. Thus, the TERM message MUST contain Layer 2, Error | |||
| 6. | Type 0, Error Code 6. | |||
| Upon receiving the MPA Accept frame from the responder, the initiator | Upon receiving the MPA Accept frame from the responder, the initiator | |||
| MUST set its IRD at least to responder ORD and its ORD at most to | MUST set its IRD at least to responder ORD and its ORD at most to | |||
| responder IRD. If the initiator does not have sufficient resources | responder IRD. If the initiator does not have sufficient resources | |||
| for the required IRD, it MUST send a TERM message to the responder | for the required IRD, it MUST send a TERM message to the responder | |||
| indicating insufficient resources, and terminate the connection due | indicating insufficient resources, and terminate the connection due | |||
| to insufficient resources. Thus, the TERM message MUST contain Layer | to insufficient resources. Thus, the TERM message MUST contain Layer | |||
| 2, Error Type 0, Error Code 6. | 2, Error Type 0, Error Code 6. | |||
| The initiator MUST pass the responder provided IRD and ORD to the ULP | The initiator MUST pass the responder provided IRD and ORD to the ULP | |||
| skipping to change at page 19, line 19 ¶ | skipping to change at page 19, line 28 ¶ | |||
| value of 0x3FFF by leaving its local endpoint ORD value unchanged, | value of 0x3FFF by leaving its local endpoint ORD value unchanged, | |||
| and setting ORD to 0x3FFF in its reply message. The initiator MUST | and setting ORD to 0x3FFF in its reply message. The initiator MUST | |||
| leave its local endpoint IRD value unchanged upon receiving a | leave its local endpoint IRD value unchanged upon receiving a | |||
| responder ORD value of 0x3FFF. | responder ORD value of 0x3FFF. | |||
| 9.2. Peer-to-Peer Connection Negotiation | 9.2. Peer-to-Peer Connection Negotiation | |||
| Control Flag A value 1 indicates that a peer-to-peer connection model | Control Flag A value 1 indicates that a peer-to-peer connection model | |||
| is being performed, and value 0 indicates a client-server model. | is being performed, and value 0 indicates a client-server model. | |||
| Control Flag B value 1 indicates that a zero length FULPDU (Send) RTR | Control Flag B value 1 indicates that a zero length FULPDU (Send) RTR | |||
| message is requested for the initiator and supported by the | indication is requested for the initiator and supported by the | |||
| responder, respectively, 0 otherwise. Control Flag C value 1 | responder, respectively, 0 otherwise. Control Flag C value 1 | |||
| indicates that a zero length RDMA Write RTR message is requested for | indicates that a zero length RDMA Write RTR indication is requested | |||
| the initiator and supported by the responder, respectively, 0 | for the initiator and supported by the responder, respectively, 0 | |||
| otherwise. Control Flag D value 1 indicates that a zero length RDMA | otherwise. Control Flag D value 1 indicates that a zero length RDMA | |||
| Read RTR message is requested for the initiator and supported by the | Read RTR indication is requested for the initiator and supported by | |||
| responder, respectively, 0 otherwise. The initiator MUST set Control | the responder, respectively, 0 otherwise. The initiator MUST set | |||
| Flag A to 1 for peer-to-peer model. The initiator MUST set each | Control Flag A to 1 for peer-to-peer model. The initiator MUST set | |||
| Control Flag B, C and D to 1 for each of the options it supports, if | each Control Flag B, C and D to 1 for each of the options it | |||
| Control Flag A is set to 1. | supports, if Control Flag A is set to 1. | |||
| The responder MUST support at least one RTR message option if it | The responder MUST support at least one RTR indication option if it | |||
| supports Enhanced RDMA connection establishment. If Control Flag A | supports Enhanced RDMA connection establishment. If Control Flag A | |||
| is 1 in the MPA request message then the responder MUST set Control | is 1 in the MPA request message then the responder MUST set Control | |||
| Flag A to 1 in the MPA reply message. For each initiator supported | Flag A to 1 in the MPA reply message. For each initiator supported | |||
| RTR message option the responder SHOULD set the corresponding Control | RTR indication option the responder SHOULD set the corresponding | |||
| Flag if the responder can support that option in an MPA reply. The | Control Flag if the responder can support that option in an MPA | |||
| responder is not required to specify all RTR message options it | reply. The responder is not required to specify all RTR indication | |||
| supports. The responder MUST set at least one RTR message option if | options it supports. The responder MUST set at least one RTR | |||
| it supports more than one initiator specified RTR message option. | indication option if it supports more than one initiator specified | |||
| The responder MAY include additional RTR message options it supports, | RTR indication option. The responder MAY include additional RTR | |||
| even if not requested by any initiator specified RTR message options. | indication options it supports, even if not requested by any | |||
| If the responder does not support any of the initiator specified RTR | initiator specified RTR indication options. If the responder does | |||
| message options then the responder MUST set at least one RTR message | not support any of the initiator specified RTR indication options | |||
| type option it supports. | then the responder MUST set at least one RTR indication type option | |||
| it supports. | ||||
| Upon receiving the MPA accept frame with Control Flag A set to 1, the | Upon receiving the MPA accept frame with Control Flag A set to 1, the | |||
| initiator MUST generate one of the negotiated RTR messages. If the | initiator MUST generate one of the negotiated RTR indications. If | |||
| initiator is not able to generate any of the responder supported RTR | the initiator is not able to generate any of the responder supported | |||
| messages, then it MUST send a TERM message to the responder | RTR indications, then it MUST send a TERM message to the responder | |||
| indicating failure to negotiate a mutually compatible connection | indicating failure to negotiate a mutually compatible connection | |||
| model or RTR option, and terminate the connection. Thus, the TERM | model or RTR option, and terminate the connection. Thus, the TERM | |||
| message MUST contain Layer 2, Error Type 0, Error Code 7. The ULP | message MUST contain Layer 2, Error Type 0, Error Code 7. The ULP | |||
| can negotiate a ULP level RTR message when a Provider level RTR | can negotiate a ULP level RTR indication when a Provider level RTR | |||
| message cannot be negotiated. | indication cannot be negotiated. | |||
| The initiator MUST set Control Flag A to 0 for client-server model. | The initiator MUST set Control Flag A to 0 for client-server model. | |||
| The responder MUST set Control Flag A to 0 if Control Flag A is 0 in | The responder MUST set Control Flag A to 0 if Control Flag A is 0 in | |||
| request. If Control Flag A is set to 0 then Control Flags B, C and D | request. If Control Flag A is set to 0 then Control Flags B, C and D | |||
| MUST also be set to 0. On reception if Control Flag A is set to 0 | MUST also be set to 0. On reception if Control Flag A is set to 0 | |||
| then Control Flags B, C, and D MUST be ignored. | then Control Flags B, C, and D MUST be ignored. | |||
| 9.3. Enhanced Connection Negotiation Flow | 9.3. Enhanced Connection Negotiation Flow | |||
| The RTR message type and ORD/IRD negotiation follows the following | The RTR indication type and ORD/IRD negotiation follows the following | |||
| order: | order: | |||
| initiator (MPA Request) --> Set Control Flag A to 1 to indicate | initiator (MPA Request) --> Set Control Flag A to 1 to indicate | |||
| peer-to-peer connection model and initiator IRD, ORD setting on | peer-to-peer connection model and initiator IRD, ORD setting on | |||
| local Endpoint of the connection. Set Control Flags B, C, and D | local Endpoint of the connection. Set Control Flags B, C, and D | |||
| to 1 for each initiator-supported option of RTR message. | to 1 for each initiator-supported option of RTR indication. | |||
| responder (MPA Reply) <-- Match the initiator Control Flag A value | responder (MPA Reply) <-- Match the initiator Control Flag A value | |||
| and set ORD/IRD to the responder local endpoint values based upon | and set ORD/IRD to the responder local endpoint values based upon | |||
| the initiator initial ORD/IRD values and the number of | the initiator initial ORD/IRD values and the number of | |||
| simultaneous RDMA Read Requests required by the ULP. Sets Control | simultaneous RDMA Read Requests required by the ULP. Sets Control | |||
| Flags B, C, and D to 1 for responder-supported options of RTR | Flags B, C, and D to 1 for responder-supported options of RTR | |||
| message options for peer-to-peer connection model and sets the | indication options for peer-to-peer connection model and sets the | |||
| responder IRD/ORD actual values. | responder IRD/ORD actual values. | |||
| initiator (First RDMA Message) --> After the initiator modifies its | initiator (First RDMA Message) --> After the initiator modifies its | |||
| ORD/IRD to match the responder's values as stated above, the | ORD/IRD to match the responder's values as stated above, the | |||
| initiator sends the first message of negotiated RTR message | initiator sends the first message of negotiated RTR indication | |||
| option. If no matching RTR message option exists then the | option. If no matching RTR indication option exists then the | |||
| initiator sends a TERM message. | initiator sends a TERM message. | |||
| The initiator or responder MUST generate the TERM message that | The initiator or responder MUST generate the TERM message that | |||
| contains Layer 2, Error Type 0, Error Code 5 when it encounters any | contains Layer 2, Error Type 0, Error Code 5 when it encounters any | |||
| error locally for which the special Error Code is not defined in | error locally for which the special Error Code is not defined in | |||
| section Section 8 before resetting the connection. | Section 8 before resetting the connection. | |||
| 10. Interoperability | 10. Interoperability | |||
| The initiator requests enhanced RDMA connection establishment by | ||||
| sending an enhanced RDMA establishment request; an enhanced responder | ||||
| is REQUIRED to respond with an enhanced RDMA connection establishment | ||||
| response, whereas an unenhanced responder treats the enhanced request | ||||
| as incorrectly formatted and closes the TCP connection. All | ||||
| responders are REQUIRED to issue unenhanced RDMA connection | ||||
| establishment responses in response to unenhanced RDMA connection | ||||
| establishment requests. | ||||
| The initiator MUST NOT use the enhanced RDMA connection establishment | The initiator MUST NOT use the enhanced RDMA connection establishment | |||
| formats or function codes when no enhanced functionality is desired. | formats or function codes when no enhanced functionality is desired. | |||
| The responder MUST continue to accept unenhanced connection requests. | The responder MUST continue to accept unenhanced connection requests. | |||
| There are three initiator/responder cases that involve enhanced MPA: | There are three initiator/responder cases that involve enhanced MPA: | |||
| both the initiator and responder, only the responder, and only the | both the initiator and responder, only the responder, and only the | |||
| initiator. The enhanced MPA frame is defined by field 'S' set to 1. | initiator. The enhanced MPA frame is defined by field 'S' set to 1. | |||
| Enhanced MPA initiator and responder: If the responder receives an | Enhanced MPA initiator and responder: If the responder receives an | |||
| skipping to change at page 21, line 29 ¶ | skipping to change at page 21, line 48 ¶ | |||
| Thus, both the initiator and responder report TCP connection | Thus, both the initiator and responder report TCP connection | |||
| termination to an application locally. In this case the initiator | termination to an application locally. In this case the initiator | |||
| MAY attempt to establish an RDMA connection using the unenhanced | MAY attempt to establish an RDMA connection using the unenhanced | |||
| MPA protocol as defined in [RFC5044] if this protocol is | MPA protocol as defined in [RFC5044] if this protocol is | |||
| compatible with the application, and let ULP deal with ORD and | compatible with the application, and let ULP deal with ORD and | |||
| IRD, and peer-to-peer negotiations. | IRD, and peer-to-peer negotiations. | |||
| A note on potential future enhancements for connection | A note on potential future enhancements for connection | |||
| establishment negotiation: It is possible to further extend | establishment negotiation: It is possible to further extend | |||
| formatting of private data of the MPA Request and Reply frames and to | formatting of private data of the MPA Request and Reply frames and to | |||
| use other bits from "Res" field to indicate that private data | use other bits from "Res" field to indicate additional private data | |||
| formatting. | formatting. | |||
| 11. IANA Considerations | 11. IANA Considerations | |||
| IANA is requested to add the following entries to the "SCTP Function | IANA is requested to add the following entries to the "SCTP Function | |||
| Codes for DDP Session Control" registry created by Section 3.4 of | Codes for DDP Session Control" registry created by Section 3.4 of | |||
| [IANA_RDDP_REGISTRY]: | [IANA_RDDP_REGISTRY]: | |||
| 0x0005, Enhanced DDP Stream Session Initiate, [RFCXXXX] | 0x0005, Enhanced DDP Stream Session Initiate, [RFCXXXX] | |||
| 0x0006, Enhanced DDP Stream Session Accept, [RFCXXXX] | 0x0006, Enhanced DDP Stream Session Accept, [RFCXXXX] | |||
| 0x0007, Enhanced DDP Stream Session Reject, [RFCXXXX] | 0x0007, Enhanced DDP Stream Session Reject, [RFCXXXX] | |||
| IANA is requested to add the following entries to the "MPA Errors" | IANA is requested to add the following entries to the "MPA Errors" | |||
| registry created by Section 3.3 of [IANA_RDDP_REGISTRY]: | registry created by Section 3.3 of [IANA_RDDP_REGISTRY]: | |||
| 0x2/0x0/0x05 - MPA Error / Local catastrophic error, [RFCXXXX] | 0x2/0x0/0x05 - MPA Error / Local catastrophic error, [RFCXXXX] | |||
| 0x2/0x0/0x06 - MPA Error / Insufficient IRD resources, [RFCXXXX] | 0x2/0x0/0x06 - MPA Error / Insufficient IRD resources, [RFCXXXX] | |||
| 0x2/0x0/0x07 - MPA Error / No matching RTR option, [RFCXXXX] | 0x2/0x0/0x07 - MPA Error / No matching RTR option, [RFCXXXX] | |||
| RFC Editor: Please replace XXXX in the six instances of [RFCXXXX] | RFC Editor: Please replace XXXX in the six instances of [RFCXXXX] | |||
| above with the RFC number of this document and remove this note. | above with the RFC number of this document and remove this note. | |||
| 12. Security Considerations | 12. Security Considerations | |||
| The security considerations from RFC 5044 and RFC 5043 apply and the | The security considerations from RFC 5044 and RFC 5043 apply and the | |||
| changes in this document do not introduce new security | changes in this document do not introduce new security | |||
| considerations. However it is recommended that implementations do | considerations. However it is recommended that implementations do | |||
| sanity checking for the input parameters, including ORD, IRD, and | sanity checking for the input parameters, including ORD, IRD, and the | |||
| RTR. | control flags used for RTR indication option negotiation. | |||
| 13. Acknowledgements | 13. Acknowledgements | |||
| The authors wish to thank Sean Hefty, Dave Minturn, Tom Talpey, David | The authors wish to thank Sean Hefty, Dave Minturn, Tom Talpey, David | |||
| Black and David Harrington for their valuable contributions and | Black and David Harrington for their valuable contributions and | |||
| reviews of this document. | reviews of this document. | |||
| 14. References | 14. References | |||
| 14.1. Normative References | 14.1. Normative References | |||
| End of changes. 40 change blocks. | ||||
| 83 lines changed or deleted | 101 lines changed or added | |||
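
The IRD/ORD rules visible in Section 9.1 of the diff above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the draft: the function names, the `MpaTerm` exception, and the `max_ird` resource limit are hypothetical, and the sketch covers only the rules shown in this excerpt (responder ORD capped by initiator IRD, the initiator's adjustment on MPA Accept, the 0x3FFF "leave unchanged" value for responder ORD, and the TERM code for insufficient IRD resources).

```python
# Sketch of the Section 9.1 rules shown in the excerpt. All names are
# hypothetical; only the numeric rules come from the text.

TERM_INSUFFICIENT_IRD = (2, 0, 6)  # Layer 2, Error Type 0, Error Code 6
ORD_UNCHANGED = 0x3FFF             # responder ORD value meaning "keep your IRD"


class MpaTerm(Exception):
    """Signals that a TERM message with the given codes must be sent."""

    def __init__(self, layer, error_type, error_code):
        super().__init__(
            f"TERM layer={layer} type={error_type} code={error_code}"
        )
        self.term = (layer, error_type, error_code)


def responder_ord(ulp_ord, initiator_ird):
    """The responder MUST set its ORD at most to initiator IRD."""
    return min(ulp_ord, initiator_ird)


def initiator_apply_accept(local_ird, local_ord, max_ird, resp_ird, resp_ord):
    """Adjust the initiator's (IRD, ORD) after receiving an MPA Accept.

    max_ird is an assumed local resource limit; the draft only says the
    initiator may lack sufficient resources for the required IRD.
    """
    if resp_ord == ORD_UNCHANGED:
        # Initiator MUST leave its local IRD unchanged in this case.
        new_ird = local_ird
    else:
        # IRD must be at least responder ORD; keep a larger local value.
        new_ird = max(local_ird, resp_ord)
        if new_ird > max_ird:
            raise MpaTerm(*TERM_INSUFFICIENT_IRD)
    # ORD must be at most responder IRD.
    new_ord = min(local_ord, resp_ird)
    return new_ird, new_ord
```

For example, an initiator with local IRD 4 and ORD 8 that receives responder IRD 2 and ORD 6 ends up with IRD 6 and ORD 2, provided its resource limit allows an IRD of 6. Keeping a local IRD that is already larger than responder ORD is a design choice of this sketch; the text only requires "at least".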
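
The Control Flag handling described in Section 9.2 of the diff above can be illustrated in a similar way. The following is a minimal sketch, not the draft's own procedure: RTR options are modeled as the letters "B", "C" and "D" rather than as bits (the bit layout is defined outside this excerpt), the function and exception names are hypothetical, and the preference order B, C, D used by the initiator is an assumption, since the text only says the initiator generates "one of" the negotiated indications.

```python
# Sketch of the Section 9.2 Control Flag rules. Options are modeled as
# letters, not wire bits; all names here are hypothetical.

RTR_OPTIONS = ("B", "C", "D")  # B: zero-length Send, C: RDMA Write, D: RDMA Read
TERM_NO_MATCHING_RTR = (2, 0, 7)  # Layer 2, Error Type 0, Error Code 7


class NoMatchingRtr(Exception):
    """Signals that a TERM message with Error Code 7 must be sent."""

    term = TERM_NO_MATCHING_RTR


def responder_reply_flags(request_a, request_opts, supported):
    """Return (flag_a, options) for the MPA reply."""
    if not request_a:
        # Client-server model: A is 0 and B, C, D MUST also be 0.
        return False, frozenset()
    supported = frozenset(supported)
    if not supported:
        raise ValueError("an enhanced responder MUST support one RTR option")
    common = frozenset(request_opts) & supported
    # With no overlap, the responder still MUST set at least one option
    # that it supports; this sketch advertises all of them.
    return True, (common or supported)


def initiator_pick_rtr(reply_a, reply_opts, can_generate):
    """Pick the RTR indication the initiator will send, or None."""
    if not reply_a:
        return None  # B, C, D are ignored when A is 0
    for option in RTR_OPTIONS:  # assumed preference order
        if option in reply_opts and option in can_generate:
            return option
    raise NoMatchingRtr("no mutually supported RTR indication")
```

With an initiator offering B and D and a responder supporting C and D, the reply carries A set to 1 with option D, and the initiator sends a zero length RDMA Read as its RTR indication.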