idnits 2.17.1 draft-ietf-rddp-mpa-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3667, Section 5.1 on line 21. -- Found old boilerplate from RFC 3978, Section 5.5 on line 2942. ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure Acknowledgement -- however, there's a paragraph with a matching beginning. Boilerplate error? ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure Invitation. ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate instead of verbatim RFC 3978 boilerplate. After 6 May 2005, submission of drafts without verbatim RFC 3978 boilerplate is not accepted. The following non-3978 patterns matched text found in the document. That text should be removed or replaced: By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, or will be disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. 
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([DDP], [ELZER-MPA]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 364: '...t recovery of out of order ULPDUs MUST...' RFC 2119 keyword, line 381: '...RC check, however CRCs MUST be enabled...' RFC 2119 keyword, line 452: '...P implementation MUST inform MPA when ...' RFC 2119 keyword, line 458: '... implementation SHOULD be enabled to:...' RFC 2119 keyword, line 462: '.... Multiple FPDUs MAY be packed into a...' (123 more instances...) 
Miscellaneous warnings: ---------------------------------------------------------------------------- == In addition to RFC 3978, Section 5.5 boilerplate, a section with a similar start was also found: This document and the information contained herein is provided on an "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION, EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION, NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY, AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: 9. MPA implementations MUST validate the PD_Length field. The buffer that receives the "Private Data" field MUST be large enough to receive that data; the amount of "Private Data" MUST not exceed the PD_Length, or the application buffer. If any of the above fails, the startup frame MUST be considered improperly formatted. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. 
Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: C: This bit declares an endpoint's preferred CRC usage. When this field is '0' in the "MPA Request Frame" and the "MPA Reply Frame", CRCs MUST not be checked and need not be generated by either endpoint. When this bit is '1' in either the "MPA Request Frame" or "MPA Reply Frame", CRCs MUST be generated and checked by both endpoints. Note that even when not in use, the CRC field remains present in the FPDU. When CRCs are not in use, the CRC field MUST be considered valid for FPDU checking regardless of its contents. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (February 2, 2004) is 7390 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'ELZER-MPA' is mentioned on line 211, but not defined == Missing Reference: 'RDMAP' is mentioned on line 1831, but not defined == Missing Reference: 'MPA' is mentioned on line 2310, but not defined == Missing Reference: 'S' is mentioned on line 2122, but not defined == Missing Reference: 'RFC0793' is mentioned on line 2407, but not defined ** Obsolete undefined reference: RFC 793 (Obsoleted by RFC 9293) == Unused Reference: 'RFC2026' is defined on line 1957, but no explicit reference was found in the text == Unused Reference: 'RFC3667' is defined on line 1960, but no explicit reference was found in the text == Unused Reference: 'RFC3668' is defined on line 1963, but no explicit reference was found in the text == Unused Reference: 'RDMASEC' is defined on line 1972, but no explicit reference was found in the text == Unused Reference: 'NagleDAck' is defined on line 1991, but no explicit reference was found in the text == Unused Reference: 'ELZUR-MPA' is defined on line 2011, but no explicit reference was found in the text ** Obsolete normative reference: RFC 3667 (Obsoleted by RFC 3978) ** Obsolete normative reference: RFC 3668 (Obsoleted by RFC 3979) ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) == Outdated reference: A later version (-10) exists of draft-ietf-rddp-security-06 == Outdated reference: A later version (-07) exists of draft-ietf-rddp-ddp-04 -- Obsolete informational reference (is this intentional?): RFC 2401 (Obsoleted by RFC 4301) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) == Outdated reference: A later version (-04) exists of draft-ietf-nfsv4-channel-bindings-02 == Outdated reference: A later version (-07) exists of 
draft-ietf-rddp-rdmap-03 -- Obsolete informational reference (is this intentional?): RFC 2960 (Obsoleted by RFC 4960) Summary: 14 errors (**), 0 flaws (~~), 21 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Remote Direct Data Placement Work Group P. Culley 2 INTERNET-DRAFT Hewlett-Packard Company 3 draft-ietf-rddp-mpa-02.txt U. Elzur 4 Broadcom Corporation 5 R. Recio 6 IBM Corporation 7 S. Bailey 8 Sandburst Corporation 9 J. Carrier 10 Adaptec 12 Expires: August 2005 February 2, 2004 14 Marker PDU Aligned Framing for TCP Specification 16 Status of this Memo 18 By submitting this Internet-Draft, I certify that any applicable 19 patent or other IPR claims of which I am aware have been disclosed, 20 or will be disclosed, and any of which I become aware will be 21 disclosed, in accordance with RFC 3668. 23 By submitting this Internet-Draft, I accept the provisions of Section 24 4 of RFC 3667. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF), its areas, and its working groups. Note that 28 other groups may also distribute working documents as Internet- 29 Drafts. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 The list of current Internet-Drafts can be accessed at 37 http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft 38 Shadow Directories can be accessed at http://www.ietf.org/shadow.html 40 Abstract 42 A framing protocol is defined for TCP that is fully compliant with 43 applicable TCP RFCs and fully interoperable with existing TCP 44 implementations. 
The framing mechanism is designed to work as an 45 "adaptation layer" between TCP and the Direct Data Placement [DDP] 46 protocol, preserving the reliable, in-order delivery of TCP, while 47 adding the preservation of higher-level protocol record boundaries 48 that DDP requires. 50 Table of Contents 52 Status of this Memo.................................................1 53 Abstract............................................................1 54 1 Introduction.................................................6 55 1.1 Motivation...................................................6 56 1.2 Protocol Overview............................................6 57 2 Glossary....................................................10 58 3 LLP and DDP requirements....................................12 59 3.1 TCP implementation Requirements to support MPA..............12 60 3.1.1 TCP Transmit side...........................................12 61 3.1.2 TCP Receive side............................................12 62 3.2 MPA's interactions with DDP.................................13 63 4 FPDU Formats................................................15 64 4.1 Marker Format...............................................16 65 5 Data Transfer Semantics.....................................17 66 5.1 MPA Markers.................................................17 67 5.2 CRC Calculation.............................................19 68 5.3 MPA on TCP Sender Segmentation..............................22 69 5.3.1 Effects of MPA on TCP Segmentation..........................22 70 5.3.2 FPDU Size Considerations....................................24 71 5.4 MPA Receiver FPDU Identification............................25 72 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....26 73 6 Connection Semantics........................................27 74 6.1 Connection setup............................................27 75 6.1.1 MPA Request and Reply Frame Format..........................31 76 6.1.2 
Example Delayed Startup sequence............................32 77 6.1.3 Use of "Private Data".......................................35 78 6.1.4 "Dual Stack" implementations................................38 79 6.2 Normal Connection Teardown..................................39 80 7 Error Semantics.............................................40 81 8 Security Considerations.....................................41 82 8.1 Protocol-specific Security Considerations...................41 83 8.1.1 Spoofing....................................................41 84 8.1.2 Eavesdropping...............................................42 85 8.2 Introduction to Security Options............................43 86 8.3 Using IPsec With MPA........................................43 87 8.4 Requirements for IPsec Encapsulation of DDP.................44 88 9 IANA Considerations.........................................45 89 10 References..................................................46 90 10.1 Normative References........................................46 91 10.2 Informative References......................................46 92 11 Appendix....................................................48 93 11.1 Analysis of MPA over TCP Operations.........................48 94 11.1.1 Assumptions...............................................48 95 11.1.2 The Value of Header Alignment.............................49 96 11.2 Receiver implementation.....................................57 97 11.2.1 Network Layer Reassembly Buffers..........................57 98 11.2.2 TCP Reassembly buffers....................................58 99 11.3 IETF RNIC Interoperability with RDMA Consortium Protocols...59 100 11.3.1 Negotiated Parameters.....................................59 101 11.3.2 RDMAC RNIC and Non-permissive IETF RNIC...................60 102 11.3.3 RDMAC RNIC and Permissive IETF RNIC.......................62 103 11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC.........63 104 12 Author's 
Addresses..........................................64 105 13 Acknowledgments.............................................65 106 14 Full Copyright Statement....................................68 108 Table of Figures 110 Figure 1 ULP MPA TCP Layering.......................................8 111 Figure 2 FPDU Format...............................................15 112 Figure 3 Marker Format.............................................16 113 Figure 4 Example FPDU Format with Marker...........................18 114 Figure 5 Annotated Hex Dump of an FPDU.............................21 115 Figure 6 Annotated Hex Dump of an FPDU with Marker.................21 116 Figure 7 "MPA Request/Reply Frame".................................31 117 Figure 8: Example Delayed Startup negotiation......................33 118 Figure 9: Example Immediate Startup negotiation....................36 119 Figure 10: Non-aligned FPDU freely placed in TCP octet stream......51 120 Figure 11: Aligned FPDU placed immediately after TCP header........53 121 Figure 12. Connection Parameters for the RNIC Types................60 122 Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive 123 IETF RNIC..........................................................61 124 Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive 125 IETF RNIC..........................................................62 126 Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a 127 Permissive IETF RNIC...............................................63 129 Revision history 131 [draft-ietf-rddp-mpa-02] workgroup draft with following changes: 133 Made IPSEC must implement, optional to use. 135 Updated Marker language to clarify that it points to ULPDU 136 Length even when marker precedes FPDU. 138 Clarified when to start markers use (in full operation mode). 140 Added informative text on interoperability with RDMAC RNICs. 142 Reduced "Private Data" to 512 octets max. 
144 Clarified CRC use description, must be used unless data is at 145 least as well protected by another means. 147 Clarified CRC disabled mode; CRC field is always valid. 149 Added Security text. 151 Changed DDP and RDMAP version numbers in hex dumps (Fig 5,6) and 152 adjusted CRC accordingly. 154 [draft-ietf-rddp-mpa-01] workgroup draft with following changes: 156 Added the "R" bit (Rejected) to the "MPA Reply Frame" and 157 described its semantics. 159 Added some comments on recent decisions regarding startup. 161 Updated RFC3667 boilerplate. 163 [draft-ietf-rddp-mpa-00] workgroup draft with following changes: 165 Changed "Start Key" to two separate startup frames to facilitate 166 identification of incorrect Active/Active startup. 168 Changed Active/Passive nomenclature to Initiator/Responder to 169 reduce confusion with TCP startup and verbs doc (which used 170 opposite sense). 172 Added "Private Data" to the startup key sequences. This also 173 required describing the motivation and expected usage models 174 along with some interface hints. Removed the "Private data" 175 stuff from appendix. 177 Added example "Immediate" startup with TCP and explanation. 179 [draft-culley-iwarp-mpa-03] 181 Add option to allow receivers to specify Marker use. 183 Add option that allows both sides to agree not to use CRC. 185 Added startup declaration "Start Key" with options and larger 186 MPA mode recognition "key". 188 Updated MPA/DDP connection startup rules and sequence to deal 189 with "Start Key". 191 Added Appendix that provides a more detailed analysis of the 192 effects of MPA on TCP data streams. 194 Added appendix that describes a mechanism to deal with "private 195 data" prior to full MPA/DDP operation. 197 [draft-culley-iwarp-mpa-02] 199 Enhanced descriptions of how MPA is used over an unmodified TCP. 201 Removed "No Packing" text. 203 Made MPA an adaptation layer for DDP, instead of a generalized 204 framing solution. 
206 Added clarifications of the MPA/TCP interaction for optimized 207 implementations and that any such optimizations are to be used 208 only when requested by MPA. 210 Note: a discussion of reasons for these changes can be found in 211 [ELZER-MPA]. 213 [draft-culley-iwarp-mpa-01] initial draft. 215 1 Introduction 217 This section discusses the reason for creating MPA on TCP and gives a 218 general overview of the protocol. Later sections show the MPA 219 headers (see section 4 on page 15), and detailed protocol 220 requirements and characteristics (see section 5 on page 17), as well 221 as Connection Semantics (section 6 on page 27), Error Semantics 222 (section 7 on page 40), and Security Considerations (section 8 on 223 page 41). 225 1.1 Motivation 227 The Direct Data Placement protocol [DDP], when used with TCP [RFC793], 228 requires a mechanism to detect record boundaries. The DDP records 229 are referred to as Upper Layer Protocol Data Units by this document. 230 The ability to locate the Upper Layer Protocol Data Unit (ULPDU) 231 boundary is useful to a hardware network adapter that uses DDP to 232 directly place the data in the application buffer based on the 233 control information carried in the ULPDU header. This may be done 234 without requiring that the packets arrive in order. Potential 235 benefits of this capability are the avoidance of the memory copy 236 overhead and a smaller memory requirement for handling out of order 237 or dropped packets. 239 Many approaches have been proposed for a generalized framing 240 mechanism. Some are probabilistic in nature and others are 241 deterministic. A probabilistic approach is characterized by a 242 detectable value embedded in the octet stream. It is probabilistic 243 because under some conditions the receiver may incorrectly interpret 244 application data as the detectable value. Under these conditions, 245 the protocol may fail with unacceptable frequency. 
A deterministic 246 approach is characterized by embedded controls at known locations in 247 the octet stream. Because the receiver can guarantee it will only 248 examine the data stream at locations that are known to contain the 249 embedded control, the protocol can never misinterpret application 250 data as being embedded control data. For unambiguous handling of an 251 out of order packet, the deterministic approach is preferred. 253 The MPA protocol provides a framing mechanism for DDP running over 254 TCP using the deterministic approach. It allows the location of the 255 ULPDU to be determined in the TCP stream even if the TCP segments 256 arrive out of order. 258 1.2 Protocol Overview 260 MPA is described as an extra layer above TCP and below DDP. The 261 operation sequence is: 263 1. A TCP connection is established by ULP action. This is done 264 using methods not described by this specification. The ULP may 265 exchange some amount of data in streaming mode prior to starting 266 MPA, but is not required to do so. 268 2. The Consumer negotiates the use of DDP and MPA at both ends of a 269 connection. The mechanisms to do this are not described in this 270 specification. The negotiation may be done in streaming mode, or 271 by some other mechanism (such as a pre-arranged port number). 273 3. The ULP activates MPA on each end in the "Startup Phase", either 274 as an "Initiator" or a "Responder", as determined by the ULP. 275 This mode verifies the usage of MPA, specifies the use of CRC and 276 Markers, and allows the ULP to communicate some additional data 277 via a "private data" exchange. See section 6.1 Connection setup 278 for more details on the startup process. 280 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into 281 full operation and begins sending DDP data as further described 282 below. In this document, DDP data chunks are called ULPDUs. For 283 a description of the DDP data, see [DDP]. 
285 Following is a description of data transfer when MPA is in full 286 operation. 288 1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA 289 for this value. MPA derives this information from TCP, when it 290 is available, or chooses a reasonable value. This information is 291 already supported on many TCP implementations, including all 292 modern flavors of BSD networking, through the TCP_MAXSEG socket 293 option. 295 2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to 296 MPA at the sender. 298 3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a 299 header, optionally inserting markers, and appending a CRC field 300 after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP. 302 4. The TCP sender puts the FPDUs into the TCP stream. If the TCP 303 Sender is MPA-aware, it segments the TCP stream in such a way 304 that a TCP Segment boundary is also the boundary of an FPDU. TCP 305 then passes each segment to the IP layer for transmission. 307 5. The TCP receiver may be MPA-aware or may not be MPA-aware. If it 308 is MPA-aware, it may separate passing the TCP payload to MPA from 309 passing the TCP payload ordering information to MPA. In either 310 case, RFC compliant TCP wire behavior is observed at both the 311 sender and receiver. 313 6. The MPA receiver locates and assembles complete FPDUs within the 314 stream, verifies their integrity, and removes MPA markers (when 315 present), ULPDU_Length, PAD and the CRC field. 317 7. MPA then provides the complete ULPDUs to DDP. MPA may also 318 separate passing MPA payload to DDP from passing the MPA payload 319 ordering information. 321 The layering of PDUs with MPA is shown in Figure 1, below. 323 MPA-aware TCP is a TCP layer which potentially contains some 324 additional semantics as defined in this document. MPA is implemented 325 as a data stream ULP for TCP and is therefore RFC compliant. MPA- 326 aware TCP is RFC compliant. 
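Step 1 of the data transfer description above (MPA deriving a segment size from TCP through the TCP_MAXSEG socket option) can be sketched as follows. This is a minimal illustration, not part of the specification. The helper names query_emss and mulpdu_from_emss and the mpa_overhead parameter are hypothetical. The 1460-octet fallback and the 128 to 64768 octet MULPDU range are taken from sections 3.1.1 and 3.2 of this document. On most systems TCP_MAXSEG only reflects the negotiated value once the connection is established.

```python
import socket

DEFAULT_EMSS = 1460   # fallback the draft permits when TCP cannot report EMSS
MULPDU_MIN = 128      # MULPDU bounds stated in section 3.2
MULPDU_MAX = 64768


def query_emss(sock, default: int = DEFAULT_EMSS) -> int:
    """Return the segment size TCP reports for this socket, or a fallback."""
    opt = getattr(socket, "TCP_MAXSEG", None)  # not defined on every platform
    if opt is None:
        return default
    try:
        mss = sock.getsockopt(socket.IPPROTO_TCP, opt)
    except OSError:
        return default
    return mss if mss > 0 else default


def mulpdu_from_emss(emss: int, mpa_overhead: int) -> int:
    """Hypothetical helper: ULPDU size whose FPDU should fit in one segment.

    mpa_overhead is an assumed per-FPDU framing cost supplied by the caller;
    the result is clamped to the MULPDU range allowed by section 3.2.
    """
    return max(MULPDU_MIN, min(MULPDU_MAX, emss - mpa_overhead))
```

A caller would pass a connected TCP socket to query_emss and then hand the clamped result to DDP as the current MULPDU.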
328 +------------------+ 329 | ULP client | 330 +------------------+ <- Consumer messages 331 | DDP | 332 +------------------+ <- ULPDUs 333 | MPA | 334 +------------------+ <- FPDUs (containing ULPDUs) 335 | TCP* | 336 +------------------+ <- TCP Segments (containing FPDUs) 337 | IP etc. | 338 +------------------+ 339 * TCP or MPA-aware TCP. 341 Figure 1 ULP MPA TCP Layering 343 An MPA-aware TCP sender is able to segment the data stream such that 344 TCP segments begin with FPDUs (FPDU Alignment). This has significant 345 advantages for receivers. When segments arrive with aligned FPDUs 346 the receiver usually need not buffer any portion of the segment, 347 allowing DDP to place it in its destination memory immediately, thus 348 avoiding copies from intermediate buffers (DDP's reason for 349 existence). 351 MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation 352 to recover ULPDUs that may be received out of order. This enables a 353 DDP on MPA implementation to save a significant amount of 354 intermediate storage by placing the ULPDUs in the right locations in 355 the application buffers when they arrive, rather than waiting until 356 full ordering can be restored. 358 The ability of a receiver to recover out of order ULPDUs is optional 359 and declared to the transmitter during startup. When the receiver 360 declares that it does not support out of order recovery, the 361 transmitter does not add the control information to the data stream 362 needed for out of order recovery. 364 MPA implementations that support recovery of out of order ULPDUs MUST 365 support a mechanism to indicate the ordering of ULPDUs as the sender 366 transmitted them and indicate when missing intermediate segments 367 arrive. These mechanisms allow DDP to reestablish record ordering 368 and report Delivery of complete messages (groups of records). 370 MPA also addresses enhanced data integrity. 
Many users of TCP have 371 noted that the TCP checksum is not as strong as could be desired 372 [CRCTCP]. Studies have shown that the TCP checksum indicates 373 segments in error at a much higher rate than the underlying link 374 characteristics would indicate. With these higher error rates, the 375 chance that an error will escape detection, when using only the TCP 376 checksum for data integrity, becomes a concern. A stronger integrity 377 check can reduce the chance of data errors being missed. 379 MPA includes a CRC check to increase the ULPDU data integrity to the 380 level provided by other modern protocols, such as SCTP [RFC2960]. It 381 is possible to disable this CRC check, however CRCs MUST be enabled 382 unless it is clear that the end to end connection through the network 383 has data integrity at least as good as an MPA with CRC enabled (for 384 example, when IPSEC is implemented end to end). DDP's ULP expects 385 this level of data integrity and therefore the ULP does not have to 386 provide its own duplicate data integrity and error recovery for lost 387 data. 389 2 Glossary 391 Consumer - the ULPs or applications that lie above MPA and DDP. The 392 Consumer is responsible for making TCP connections, starting MPA 393 and DDP connections, and generally controlling operations. 395 Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as 396 the process of informing DDP that a particular PDU is ordered for 397 use. This is specifically different from "passing the PDU to 398 DDP", which may generally occur in any order, while the order of 399 "Delivery" is strictly defined. 401 EMSS - Effective Maximum Segment Size. EMSS is the smaller of the 402 TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], 403 and the current path Maximum Transmission Unit (MTU) [RFC1191]. 405 FPDU - Framed Protocol Data Unit. The unit of data created by an 406 MPA sender. 408 FPDU Alignment - the property that a TCP segment begins with an FPDU. 
410 Header Alignment - the property that a TCP segment begins with an 411 FPDU and the TCP segment includes an integer number of FPDUs. 413 PDU - protocol data unit 415 MPA-aware TCP - a TCP implementation that is aware of the receiver 416 efficiencies of MPA Header Alignment and is capable of sending 417 TCP segments that begin with an FPDU. 419 MPA-enabled - MPA is enabled if the MPA protocol is visible on the 420 wire. When the sender is MPA-enabled, it is inserting framing 421 and markers. When the receiver is MPA-enabled, it is 422 interpreting framing and markers. 424 MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This 425 document defines the MPA protocol. 427 MULPDU - Maximum ULPDU. The current maximum size of the record that 428 is acceptable for DDP to pass to MPA for transmission. 430 Node - A computing device attached to one or more links of a Network. 431 A Node in this context does not refer to a specific application 432 or protocol instantiation running on the computer. A Node may 433 consist of one or more MPA on TCP devices installed in a host 434 computer. 436 Remote Peer - The MPA protocol implementation on the opposite end of 437 the connection. Used to refer to the remote entity when 438 describing protocol exchanges or other interactions between two 439 Nodes. 441 ULP - Upper Layer Protocol. The protocol layer above the protocol 442 layer currently being referenced. The ULP for MPA is DDP [DDP]. 444 ULPDU - Upper Layer Protocol Data Unit. The data record defined by 445 the layer above MPA (DDP). ULPDU corresponds to DDP's "DDP 446 Segment". 448 3 LLP and DDP requirements 450 3.1 TCP implementation Requirements to support MPA 452 The TCP implementation MUST inform MPA when the TCP connection is 453 closed or has begun closing the connection (e.g. received a FIN). 
455 3.1.1 TCP Transmit side 457 To provide optimum performance, an MPA-aware transmit side TCP 458 implementation SHOULD be enabled to: 460 * With an EMSS large enough to contain the FPDU(s), segment the 461 outgoing TCP stream such that the first octet of every TCP 462 Segment begins with an FPDU. Multiple FPDUs MAY be packed into a 463 single TCP segment as long as they are entirely contained in the 464 TCP segment. 466 * Report the current EMSS to the MPA transmit layer. 468 An MPA-aware TCP transmit side implementation MUST continue to use 469 the method of segmentation expected by non-MPA applications (and 470 described in TCP RFCs) when MPA is not enabled on the connection. 471 When MPA is enabled above an MPA-aware TCP, it SHOULD specifically 472 enable the segmentation rules described above for the DDP segments 473 (FPDUs) posted for transmission. 475 If the transmit side TCP implementation is not able to segment the 476 TCP stream as indicated above, MPA SHOULD make a best effort to 477 achieve that result. For example, using the TCP_NODELAY socket 478 option to disable the Nagle algorithm will usually result in many of 479 the segments starting with an FPDU. 481 If the transmit side TCP implementation is not able to report the 482 EMSS, MPA may assume that TCP will use 1460 octet segments in 483 creating FPDUs. If the implementation has reason to believe that the 484 TCP segment size is actually smaller than 1460, it may instead use a 485 536 octet FPDU. 487 3.1.2 TCP Receive side 489 When an MPA receive implementation and the MPA-aware receive side TCP 490 implementation support handling out of order ULPDUs, the TCP receive 491 implementation SHOULD be enabled to: 493 * Pass incoming TCP segments to MPA as soon as they have been 494 received and validated, even if not received in order. The TCP 495 layer MUST have committed to keeping each segment before it can 496 be passed to the MPA. 
This means that the segment must have 497 passed the TCP, IP, and lower layer data integrity validation 498 (i.e., checksum), must be in the receive window, must not be a 499 duplicate, must be part of the same epoch (if timestamps are used 500 to verify this), and must pass any other checks required by TCP RFCs. The 501 segment MUST NOT be passed to MPA more than once unless 502 explicitly requested (see Section 7). 504 This is not to imply that the data must be completely ordered 505 before use. An implementation may accept out of order segments, 506 SACK them [RFC2018], and pass them to DDP when the 507 segments needed to fill in the gaps arrive. Such an 508 implementation can "commit" to the data early on, and will not 509 overwrite it even if (or when) duplicate data arrives. MPA 510 expects to utilize this "commit" to allow the passing of ULPDUs 511 to DDP when they arrive, independent of ordering. 513 * Provide a mechanism to indicate the ordering of TCP segments as 514 the sender transmitted them. One possible mechanism might be 515 attaching the TCP sequence number to each segment. 517 * Provide a mechanism to indicate when a given TCP segment (and the 518 prior TCP stream) is complete. One possible mechanism might be 519 to utilize the leading (left) edge of the TCP Receive Window. 521 DDP on MPA MUST utilize these two mechanisms to establish the 522 Delivery semantics that DDP's consumers agree to. These 523 semantics are described fully in [DDP]. These include 524 requirements on DDP's consumer to respect ownership of buffers 525 prior to the time that DDP delivers them to the consumer. 527 An MPA-aware TCP receive side implementation MUST continue to buffer 528 TCP segments until completely ordered and then deliver them as 529 expected by non-MPA applications (and described in TCP RFCs) when MPA 530 is not enabled on the connection. 
When MPA is enabled above an MPA- 531 aware TCP, TCP SHOULD enable the in and out of order passing of data, 532 and the separate ordering information as described above. 534 When an MPA receive implementation is coupled with a TCP receive 535 implementation that does not support the preceding mechanisms, TCP 536 passes and Delivers incoming stream data to MPA in order. 538 3.2 MPA's interactions with DDP 540 DDP requires MPA to maintain DDP record boundaries from the sender to 541 the receiver. When using MPA on TCP to send data, DDP provides 542 records (ULPDUs) to MPA. MPA will use the reliable transmission 543 abilities of TCP to transmit the data, and will insert appropriate 544 additional information into the TCP stream to allow the MPA receiver 545 to locate the record boundary information. 547 As such, MPA accepts complete records (ULPDUs) from DDP at the sender 548 and returns them to DDP at the receiver. 550 MPA combined with an MPA-aware TCP can only ensure FPDU Alignment 551 with the TCP Header if the FPDU is less than or equal to TCP's EMSS. 552 Since FPDU alignment is generally desired by the receiver, DDP must 553 cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS 554 under normal conditions. This is done with the MULPDU mechanism. 556 MPA provides information to DDP on the current maximum size of the 557 record that is acceptable to send (MULPDU). DDP SHOULD limit each 558 record size to MULPDU. The range of MULPDU values MUST be between 559 128 octets and 64768 octets, inclusive. 561 The sending DDP MUST NOT post a ULPDU larger than 64768 octets to 562 MPA. DDP MAY post a ULPDU of any size between one and 64768 octets, 563 however MPA is NOT REQUIRED to support a "ULPDU Length" that is 564 greater than the current MULPDU. 566 While the maximum theoretical length supported by the MPA header 567 ULPDU_Length field is 65535, TCP over IP requires the IP datagram 568 maximum length to be 65535 octets. 
To enable MPA to support FPDU 569 Alignment, the maximum size of the FPDU must fit within an IP 570 datagram. Thus the ULPDU limit of 64768 octets was derived by taking 571 the maximum IP datagram length, subtracting from it the maximum total 572 length of the sum of the IPv4 header, TCP header, IPv4 options, TCP 573 options, and the worst case MPA overhead, and then rounding the 574 result down to a 128 octet boundary. 576 On receive, MPA MUST pass each ULPDU with its length to DDP when it 577 has been validated. 579 If an MPA implementation supports passing out of order ULPDUs to DDP, 580 the MPA implementation SHOULD: 582 * Pass each ULPDU with its length to DDP as soon as it has been 583 fully received and validated. 585 * Provide a mechanism to indicate the ordering of ULPDUs as the 586 sender transmitted them. One possible mechanism might be 587 providing the TCP sequence number for each ULPDU. 589 * Provide a mechanism to indicate when a given ULPDU (and prior 590 ULPDUs) are complete. One possible mechanism might be to allow 591 DDP to see the current outgoing TCP Ack sequence number. 593 * Provide an indication to DDP that the TCP has closed or has begun 594 to close the connection (e.g. received a FIN). 596 MPA MUST provide the protocol version negotiated with its peer to 597 DDP. DDP will use this version to set the version in its header and 598 to report the version to RDMAP. 600 4 FPDU Formats 602 MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown 603 below MUST be used for all MPA FPDUs. For purposes of clarity, 604 markers are not shown in Figure 2.
606     0                   1                   2                   3
607     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
608    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
609    |          ULPDU_Length         |                               |
610    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
611    |                                                               |
612    ~                                                               ~
613    ~                            ULPDU                              ~
614    |                                                               |
615    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
616    |                               |          PAD (0-3 octets)     |
617    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
618    |                             CRC                               |
619    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
620                        Figure 2 FPDU Format
622 ULPDU_Length: 16 bits (unsigned integer). This is the number of 623 octets of the contained ULPDU. It does not include the length of the 624 FPDU header itself, the pad, the CRC, or of any markers that fall 625 within the ULPDU. The 16-bit "ULPDU Length" field is large enough to 626 support the largest IP datagrams for IPv4 or IPv6. 628 PAD: The PAD field trails the ULPDU and contains between zero and 629 three octets of data. The pad data MUST be set to zero by the sender 630 and ignored by the receiver (except for CRC checking). The length of 631 the pad is set so as to make the size of the FPDU an integral 632 multiple of four. 634 CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C 635 check value, which is used to verify the entire contents of the FPDU. 636 See section 5.2 CRC Calculation on page 19. When CRCs 637 are not enabled, this field is still present, may contain any value, 638 and MUST NOT be checked. 640 The FPDU adds a minimum of 6 octets to the length of the ULPDU. In 641 addition, the total length of the FPDU will include the length of any 642 markers and from 0 to 3 pad octets added to round up the ULPDU size.
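The size rules above can be illustrated with a short, non-normative Python sketch. The constant and function names are assumptions made for this example only. The sketch assumes the caller already knows how many markers fall within (or immediately before) the FPDU.

```python
ULPDU_LENGTH_FIELD = 2  # octets in the 16-bit "ULPDU Length" header
CRC_FIELD = 4           # octets; present even when CRCs are not enabled
MARKER_SIZE = 4         # octets per marker


def pad_length(ulpdu_len: int) -> int:
    """Pad octets (0-3) that make header + ULPDU + pad a multiple of four."""
    return -(ULPDU_LENGTH_FIELD + ulpdu_len) % 4


def fpdu_length(ulpdu_len: int, markers: int = 0) -> int:
    """Octets occupied in the TCP stream by one FPDU and its markers."""
    if not 0 <= ulpdu_len <= 0xFFFF:
        raise ValueError("ULPDU length must fit in 16 bits")
    return (ULPDU_LENGTH_FIELD + ulpdu_len + pad_length(ulpdu_len)
            + CRC_FIELD + markers * MARKER_SIZE)
```

For example, a 16-octet ULPDU needs 2 pad octets and yields a 24-octet FPDU, which is the minimum overhead of 6 octets plus the pad.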
644 4.1 Marker Format 646 The format of a marker MUST be as specified in Figure 3:
648     0                   1                   2                   3
649     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
650    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
651    |           RESERVED            |            FPDUPTR            |
652    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
653                        Figure 3 Marker Format
655 RESERVED: The Reserved field MUST be set to zero on transmit and 656 ignored on receive (except for CRC calculation). 658 FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long, 659 interpreted as an unsigned integer, that indicates the number of 660 octets in the TCP stream from the beginning of the "ULPDU Length" 661 field to the first octet of the entire marker. 663 5 Data Transfer Semantics 665 This section discusses some characteristics and behavior of the MPA 666 protocol as well as implications of that protocol. 668 5.1 MPA Markers 670 MPA markers are used to identify the start of FPDUs when packets are 671 received out of order. This is done by locating the markers at fixed 672 intervals in the data stream (which is correlated to the TCP sequence 673 number) and using the marker value to locate the preceding FPDU 674 start. 676 The MPA receiver's ability to locate out of order FPDUs and pass the 677 ULPDUs to DDP is implementation dependent. MPA/DDP allows those 678 receivers that are able to deal with out of order FPDUs in this way 679 to require the insertion of markers in the data stream. When the 680 receiver cannot deal with out of order FPDUs in this way, it may 681 disable the insertion of markers at the sender. All MPA senders MUST 682 be able to generate markers when their use is declared by the 683 opposing receiver (see section 6.1 Connection setup on page 27). 685 When Markers are enabled, MPA senders MUST insert a marker into the 686 data stream at a 512 octet periodic interval in the TCP Sequence 687 Number Space.
The marker contains a 16 bit unsigned integer referred 688 to as the FPDUPTR (FPDU Pointer). 690 If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit 691 relative back-pointer. FPDUPTR MUST contain the number of octets in 692 the TCP stream from the beginning of the "ULPDU Length" field to the 693 first octet of the marker, unless the marker falls between FPDUs. 694 Thus the location of the first octet of the previous FPDU header can 695 be determined by subtracting the value of the given marker from the 696 current octet-stream sequence number (i.e. TCP sequence number) of 697 the first octet of the marker. Note that this computation must take 698 into account that the TCP sequence number could have wrapped between 699 the marker and the header. 701 An FPDUPTR value of 0x0000 is a special case - it is used when the 702 marker falls exactly between FPDUs (between the preceding FPDU CRC 703 field, and the next FPDU's "ULPDU Length" field). In this case, the 704 marker MUST be included in the CRC calculation of the FPDU following 705 the marker (if CRCs are being generated or checked). Thus an FPDUPTR 706 value of 0x0000 means that immediately following the marker is an 707 FPDU header (the "ULPDU Length" field). 709 Since all FPDUs are integral multiples of 4 octets, the bottom two 710 bits of the FPDUPTR as calculated by the sender are zero. MPA 711 reserves these bits so they MUST be treated as zero for computation 712 at the receiver. 714 When Markers are enabled (see section 6.1 Connection setup on page 715 27), the MPA markers MUST be inserted immediately preceding the first 716 FPDU of full operation phase, and at every 512th octet of the TCP 717 octet stream thereafter. As a result, the first marker has an 718 FPDUPTR value of 0x0000. 
If the first marker begins at octet 719 sequence number SeqStart, then markers are inserted such that the 720 first octet of the marker is at octet sequence number SeqNum if the 721 remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum 722 can wrap. 724 For example, if the TCP sequence number were used to calculate the 725 insertion point of the marker, the starting TCP sequence number is 726 unlikely to be zero, and 512 octet multiples are unlikely to fall on 727 a modulo 512 of zero. If the MPA connection is started at TCP 728 sequence number 11, then the 1st marker will begin at 11, and 729 subsequent markers will begin at 523, 1035, etc. 731 If an FPDU is large enough to contain multiple markers, they MUST all 732 point to the same point in the TCP stream: the first octet of the 733 "ULPDU Length" field for the FPDU. 735 If a marker interval contains multiple FPDUs (the FPDUs are small), 736 the marker MUST point to the start of the "ULPDU Length" field for 737 the FPDU containing the marker unless the marker falls between FPDUs, 738 in which case the marker MUST be zero. 740 The following example shows an FPDU containing a marker.
742     0                   1                   2                   3
743     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
744    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
745    |       ULPDU Length (0x0010)   |                               |
746    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
747    |                                                               |
748    +                                                               +
749    |                         ULPDU (octets 0-9)                    |
750    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
751    |            (0x0000)           |        FPDU ptr (0x000C)      |
752    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
753    |                        ULPDU (octets 10-15)                   |
754    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
755    |                               |          PAD (2 octets:0,0)   |
756    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
757    |                              CRC                              |
758    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
759              Figure 4 Example FPDU Format with Marker
761 MPA Receivers MUST preserve ULPDU boundaries when passing data to 762 DDP.
MPA Receivers MUST pass the ULPDU data and the "ULPDU Length" to 763 DDP and not the markers, headers, and CRC. 765 5.2 CRC Calculation 767 An MPA implementation MUST implement CRC support and MUST either: 769 (1) always use CRCs 771 or 773 (2) only negotiate the non-use of CRC on the explicit request of the 774 system administrator, via an interface not defined in this spec. 775 The default configuration for a connection MUST be to use CRCs. 777 (3) The MPA provider at either peer MAY ignore its administrator's 778 request that CRCs not be used. 780 The decision for one host to request CRC suppression MAY be made on 781 an administrative basis for any path that provides equivalent 782 protection from undetected errors as an end-to-end CRC32c. 784 The process MUST be invisible to the ULP. 786 After receipt of an MPA startup declaration indicating that its peer 787 requires CRCs, an MPA instance MUST continue generating and checking 788 CRCs until the connection terminates. If an MPA instance has 789 declared that it does not require CRCs, it MUST turn off CRC checking 790 immediately after receipt of an MPA mode declaration indicating that 791 its peer also does not require CRCs. It MAY continue generating 792 CRCs. See section 6.1 Connection setup on page 27 for details on the 793 MPA startup. 795 When sending an FPDU, the sender MUST include a CRC field. When CRCs 796 are enabled, the CRC field in the MPA FPDU MUST be computed using the 797 CRC32C polynomial in the manner described in the iSCSI Protocol 798 [iSCSI] document for Header and Data Digests. 800 The fields which MUST be included in the CRC calculation when sending 801 an FPDU are as follows: 803 1) If a marker does not immediately precede the "ULPDU Length" 804 field, the CRC-32c is calculated from the first octet of the 805 "ULPDU Length" field, through all the ULPDU and markers (if 806 present), to the last octet of the PAD (if present), inclusive. 
807 If there is a marker immediately following the PAD, the marker is 808 included in the CRC calculation for this FPDU. 810 2) If a marker immediately precedes the first octet of the "ULPDU 811 Length" field of the FPDU, (i.e. the marker fell between FPDUs, 812 and thus is required to be included in the second FPDU), the CRC- 813 32c is calculated from the first octet of the marker, through the 814 "ULPDU Length" header, through all the ULPDU and markers (if 815 present), to the last octet of the PAD (if present), inclusive. 817 3) After calculating the CRC-32c, the resultant value is placed into 818 the CRC field at the end of the FPDU. 820 When an FPDU is received, and CRC checking is enabled, the receiver 821 MUST first perform the following: 823 1) Calculate the CRC of the incoming FPDU in the same fashion as 824 defined above. 826 2) Verify that the calculated CRC-32c value is the same as the 827 received CRC-32c value found in the FPDU CRC field. If not, the 828 receiver MUST treat the FPDU as an invalid FPDU. 830 The procedure for handling invalid FPDUs is covered in the Error 831 Section (see section 7 on page 40) 833 The following is an annotated hex dump of an example FPDU sent as the 834 first FPDU on the stream. As such, it starts with a marker. The FPDU 835 contains 24 octets of the contained ULPDU, which are all zeros. The 836 CRC32c has been correctly calculated and can be used as a reference. 837 See the [DDP] and [RDMA] specification for definitions of the DDP 838 Control field, Queue, MSN, MO, and Send Data. 
840    Octet  Contents  Annotation
841    Count
843    0000   00 00     Marker: Reserved
844    0002   00 00     FPDUPTR
845    0004   00 2a     Length
846    0006   41 43     DDP Control Field, Send with Last flag set
847    0008   00 00     Reserved (STag position with no STag)
848    000a   00 00
849    000c   00 00     Queue = 0
850    000e   00 00
851    0010   00 00     MSN = 1
852    0012   00 01
853    0014   00 00     MO = 0
854    0016   00 00
855    0018   00 00
856                     Send Data (24 octets of zeros)
857    002e   00 00
858    0030   52 23     CRC32c
859    0032   99 83
860         Figure 5 Annotated Hex Dump of an FPDU
862 The following is an example sent as the second FPDU of the stream 863 where the first FPDU (which is not shown here) had a length of 492 864 octets and was also a Send to Queue 0 with Last Flag set. This 865 example contains a marker.
867    Octet  Contents  Annotation
868    Count
870    01ec   00 2a     Length
871    01ee   41 43     DDP Control Field: Send with Last Flag set
872    01f0   00 00     Reserved (STag position with no STag)
873    01f2   00 00
874    01f4   00 00     Queue = 0
875    01f6   00 00
876    01f8   00 00     MSN = 2
877    01fa   00 02
878    01fc   00 00     MO = 0
879    01fe   00 00
880    0200   00 00     Marker: Reserved
881    0202   00 14     FPDUPTR
882    0204   00 00
883                     Send Data (24 octets of zeros)
884    021a   00 00
885    021c   84 92     CRC32c
886    021e   58 98
887         Figure 6 Annotated Hex Dump of an FPDU with Marker
889 5.3 MPA on TCP Sender Segmentation 891 The various TCP RFCs allow considerable choice in segmenting a TCP 892 stream. In order to optimize FPDU recovery at the MPA receiver, MPA 893 specifies additional segmentation rules. 895 MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU 896 contained in one FPDU. 898 An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP 899 implementations that support this, and with an EMSS large enough to 900 contain at least one FPDU, segment the outbound TCP stream such that 901 each TCP segment begins with an FPDU, and fully contains all included 902 FPDUs.
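As a non-normative illustration of the segmentation rule above, the following Python sketch groups FPDUs into TCP segments so that every segment begins with an FPDU and fully contains each FPDU placed in it. The function name and the list-of-sizes interface are assumptions made for this example. A real MPA-aware TCP would perform this inside the stack, and would also have to handle the exceptions that arise when the EMSS or receive window shrinks.

```python
from typing import List


def pack_fpdus(fpdu_sizes: List[int], emss: int) -> List[List[int]]:
    """Group FPDU sizes (in octets) into segments of at most emss octets.

    No FPDU is split across segments, so each segment starts on an FPDU
    boundary. An FPDU larger than emss cannot satisfy the rule and is
    rejected here (a simplification of this sketch).
    """
    segments: List[List[int]] = []
    current: List[int] = []
    used = 0
    for size in fpdu_sizes:
        if size > emss:
            raise ValueError("FPDU does not fit in one segment")
        # Start a new segment when the next FPDU would not fit whole.
        if current and used + size > emss:
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments
```

With an EMSS of 1460 octets, FPDUs of 48, 48, 1400 and 100 octets are sent as three segments: the two small FPDUs packed together, then the 1400-octet FPDU, then the 100-octet FPDU.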
904 Implementation note: To achieve the previous segmentation rule, 905 TCP's Nagle [RFC0896] algorithm SHOULD be disabled. 907 There are exceptions to the above rule. Once an ULPDU is provided to 908 MPA, the MPA on TCP sender MUST transmit it or fail the connection; 909 it cannot be repudiated. As a result, during changes in MTU and 910 EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it 911 may be necessary to send FPDUs that do not conform to the 912 segmentation rule above. 914 A possible, but less desirable, alternative is to use IP 915 fragmentation on accepted FPDUs to deal with MTU reductions or 916 extremely small EMSS. 918 The sender MUST still format the FPDU according to FPDU format as 919 shown in Figure 2. 921 On a retransmission, TCP does not necessarily preserve original TCP 922 segmentation boundaries. This can lead to the loss of FPDU alignment 923 and containment within a TCP segment during TCP retransmissions. An 924 MPA-aware TCP sender SHOULD try to preserve original TCP segmentation 925 boundaries on a retransmission. 927 5.3.1 Effects of MPA on TCP Segmentation 929 Applications expected to see strong advantages from Direct Data 930 Placement include transaction-based applications and throughput 931 applications. Request/response protocols typically send one FPDU per 932 TCP segment and then wait for a response. Therefore, the application 933 is expected to set TCP parameters such that it can trade off latency 934 and wire efficiency. This is accomplished by setting the TCP_NODELAY 935 socket option. 937 When latency is not critical, and the application provides data in 938 chunks larger than EMSS at one time, the TCP implementation may 939 "pack" any available stream data into TCP segments so that the 940 segments are filled to the EMSS. 
If the amount of data available is 941 not enough to fill the TCP segment when it is prepared for 942 transmission, TCP can send the segment partly filled, or use the 943 Nagle algorithm to wait for the ULP to post more data (discussed 944 below). 946 DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU 947 when a DDP message is large enough. Since the DDP message may not 948 exactly fit into TCP segments, a "message tail" often occurs that 949 results in an FPDU that is smaller than a single TCP segment. If a 950 "message tail", small DDP messages, or the start of a larger DDP 951 message are available, MPA MAY "pack" the resulting FPDUs into TCP 952 segments. When this is done, the TCP segments can be more fully 953 utilized, but, due to the size constraints of FPDUs, segments may not 954 be filled to the EMSS. 956 Note that MPA receivers must do more processing of a TCP segment 957 that contains multiple FPDUs; this may affect the performance of 958 some receiver implementations. 960 TCP implementations often utilize the "Nagle" [RFC0896] algorithm to 961 ensure that segments are filled to the EMSS whenever the round trip 962 latency is large enough that the source stream can fully fill 963 segments before Acks arrive. The algorithm does this by delaying the 964 transmission of TCP segments until a ULP can fill a segment, or until 965 an ACK arrives from the far side. The algorithm thus allows for 966 smaller segments when latencies are shorter to keep the ULP's end to 967 end latency to reasonable levels. 969 Use of the Nagle algorithm is not mandatory [RFC1122]. 971 It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note 972 that many of the applications expected to take advantage of MPA/DDP 973 prefer to avoid the extra delays caused by Nagle.
In such scenarios 974 it is anticipated there will be minimal opportunity for packing at 975 the transmitter and receivers may choose to optimize their 976 performance for this anticipated behavior. 978 5.3.2 FPDU Size Considerations 980 MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as 981 the size of the largest ULPDU fitting in an FPDU. For an empty TCP 982 Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus 983 space for markers and pad octets. 985 The maximum ULPDU Length for a single ULPDU when markers are 986 present MUST be computed as: 988 MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4) 990 The formula above accounts for the worst-case number of markers. 992 The maximum ULPDU Length for a single ULPDU when markers are NOT 993 present MUST be computed as: 995 MULPDU = EMSS - (6 + EMSS mod 4) 997 As a further optimization of the wire efficiency an MPA 998 implementation MAY dynamically adjust the MULPDU (see section 7.3.1. 999 for latency and wire efficiency trade-offs). When one or more FPDUs 1000 are already packed into a TCP Segment, MULPDU MAY be reduced 1001 accordingly. 1003 DDP SHOULD provide ULPDUs that are as large as possible, but less 1004 than or equal to MULPDU. 1006 If the TCP implementation needs to adjust EMSS to support MTU 1007 changes, the MULPDU value is changed accordingly. 1009 In certain rare situations, the EMSS may shrink to very small sizes. 1010 If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU 1011 below 128 octets and is not required to follow the segmentation rules 1012 in Section 5.3 MPA on TCP Sender Segmentation on page 22. 1014 If one or more FPDUs are already packed into a TCP segment, such that 1015 the remaining room is less than 128 octets, MPA MUST NOT provide a 1016 MULPDU smaller than 128. 
In this case, MPA would typically provide a 1017 MULPDU for the next full sized segment, but may still pack the next 1018 FPDU into the small remaining room, provided that the next FPDU is 1019 small enough to fit. 1021 The value 128 is chosen so as to allow DDP designers room for the DDP 1022 Header and some user data. 1024 5.4 MPA Receiver FPDU Identification 1026 An MPA receiver MUST first verify the FPDU before passing the ULPDU 1027 to DDP. To do this, the receiver MUST: 1029 * locate the start of the FPDU unambiguously, 1031 * verify its CRC (if CRC checking is enabled). 1033 If the above conditions are true, the MPA receiver passes the ULPDU 1034 to DDP. 1036 To detect the start of the FPDU unambiguously one of the following 1037 MUST be used: 1039 1: In an ordered TCP stream, the "ULPDU Length" field in the current 1040 FPDU, when the FPDU has a valid CRC, can be used to identify the 1041 beginning of the next FPDU. 1043 2: For receivers that support out of order reception of FPDUs (see 1044 section 5.1 MPA Markers on page 17) a Marker can always be used 1045 to locate the beginning of an FPDU (in FPDUs with valid CRCs). 1046 Since the location of the marker is known in the octet stream 1047 (sequence number space), the marker can always be found. 1049 3: Having found an FPDU by means of a Marker, following contiguous 1050 FPDUs can be found by using the "ULPDU Length" fields (from FPDUs 1051 with valid CRCs) to establish the next FPDU boundary. 1053 The "ULPDU Length" field (see section 4) MUST be used to determine if 1054 the entire FPDU is present before forwarding the ULPDU to DDP. 1056 CRC calculation is discussed in section 5.2 on page 19 above. 1058 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 1060 Since MPA on MPA-aware TCP senders start FPDUs on TCP segment 1061 boundaries, a receiving DDP on MPA on TCP implementation may be able 1062 to optimize the reception of data in various ways.
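The FPDU location methods listed in section 5.4 can be sketched in Python as follows. This is a non-normative illustration. The helper names fpdu_start_from_marker and split_fpdus are assumptions made for this example, CRC validation is omitted for brevity, and split_fpdus assumes an in-order stream on which markers are not enabled.

```python
import struct
from typing import List, Tuple

SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def fpdu_start_from_marker(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the "ULPDU Length" field located by a marker.

    marker_seq is the sequence number of the marker's first octet. The two
    low-order FPDUPTR bits are treated as zero. An FPDUPTR of zero means
    the FPDU header immediately follows the 4-octet marker.
    """
    ptr = fpduptr & 0xFFFC
    if ptr == 0:
        return (marker_seq + 4) % SEQ_MOD
    # Modular subtraction handles a wrap between header and marker.
    return (marker_seq - ptr) % SEQ_MOD


def split_fpdus(stream: bytes) -> List[Tuple[int, bytes]]:
    """Walk contiguous FPDUs using each "ULPDU Length" field.

    Returns (offset, ULPDU) pairs and stops at the first incomplete FPDU.
    A real receiver would verify each CRC before trusting the length.
    """
    found = []
    pos = 0
    while pos + 2 <= len(stream):
        (ulpdu_len,) = struct.unpack_from("!H", stream, pos)
        pad = -(2 + ulpdu_len) % 4
        end = pos + 2 + ulpdu_len + pad + 4  # header, ULPDU, pad, CRC
        if end > len(stream):
            break  # entire FPDU not yet present; keep buffering
        found.append((pos, stream[pos + 2:pos + 2 + ulpdu_len]))
        pos = end
    return found
```

Applied to Figure 6, a marker at octet 0x0200 with FPDUPTR 0x0014 locates the "ULPDU Length" field at octet 0x01ec.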
1064 However, MPA receivers MUST NOT depend on FPDU Alignment on TCP 1065 segment boundaries. 1067 Some MPA senders may be unable to conform to the sender requirements 1068 because their implementation of TCP is not designed with MPA in mind. 1069 Even if the sender is MPA-aware, the network may contain "middle 1070 boxes" which modify the TCP stream by changing the segmentation. 1071 This is generally interoperable with TCP and its users and MPA must 1072 be no exception. 1074 The presence of markers in MPA (when enabled) allows an MPA receiver 1075 to recover the FPDUs despite these obstacles, although it may be 1076 necessary to utilize additional buffering at the receiver to do so. 1078 Some of the cases that a receiver may have to contend with are listed 1079 below as a reminder to the implementer: 1081 * A single Aligned and complete FPDU, either in order, or out of 1082 order: This can be passed to DDP as soon as validated, and 1083 Delivered when ordering is established. 1085 * Multiple FPDUs in a TCP segment, aligned and fully contained, 1086 either in order, or out of order: These can be passed to DDP as 1087 soon as validated, and Delivered when ordering is established. 1089 * Incomplete FPDU: The receiver should buffer until the remainder 1090 of the FPDU arrives. If the remainder of the FPDU is already 1091 available, this can be passed to DDP as soon as validated, and 1092 Delivered when ordering is established. 1094 * Unaligned FPDU start: The partial FPDU must be combined with its 1095 preceding portion(s). If the preceding parts are already 1096 available, and the whole FPDU is present, this can be passed to 1097 DDP as soon as validated, and Delivered when ordering is 1098 established. If the whole FPDU is not available, the receiver 1099 should buffer until the remainder of the FPDU arrives. 
1101 * Combinations of Unaligned or incomplete FPDUs (and potentially 1102 other complete FPDUs) in the same TCP segment: If any FPDU is 1103 present in its entirety, or can be completed with portions 1104 already available, it can be passed to DDP as soon as validated, 1105 and Delivered when ordering is established. 1107 6 Connection Semantics 1109 6.1 Connection setup 1111 MPA requires that the consumer MUST activate MPA, and any TCP 1112 enhancements for MPA, on a TCP half connection at the same location 1113 in the octet stream at both the sender and the receiver. This is 1114 required in order for the marker scheme to correctly locate the 1115 markers (if enabled) and to correctly locate the first FPDU. 1117 MPA, and any TCP enhancements for MPA are enabled by the ULP in both 1118 directions at once at an endpoint. 1120 This can be accomplished several ways, and is left up to DDP's ULP: 1122 * DDP's ULP MAY require DDP on MPA startup immediately after TCP 1123 connection setup. This has the advantage that no streaming mode 1124 negotiation is needed. An example of such a protocol is shown in 1125 Figure 9: Example Immediate Startup negotiation on page 36. 1127 This may be accomplished by using a well-known port, or a service 1128 locator protocol to locate an appropriate port on which DDP on 1129 MPA is expected to operate. 1131 * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a 1132 normal TCP startup, using TCP streaming data exchanges on the 1133 same connection. The exchange establishes that DDP on MPA (as 1134 well as other ULPs) will be used, and exactly locates the point 1135 in the octet stream where MPA is to begin operation. Note that 1136 such a negotiation protocol is outside the scope of this 1137 specification. A simplified example of such a protocol is shown 1138 in Figure 8: Example Delayed Startup negotiation on page 33. 1140 An MPA endpoint operates in two distinct phases. 
1142 The "Startup Phase" is used to verify correct MPA setup, exchange CRC 1143 and Marker configuration, and optionally pass "private data" between 1144 endpoints prior to completing a DDP connection. During this phase, 1145 specifically formatted frames are exchanged as TCP byte streams 1146 without using CRCs or Markers. During this phase a DDP endpoint need 1147 not be "bound" to the MPA connection. In fact, the choice of DDP 1148 endpoint and its operating parameters may not be known until the 1149 consumer supplied "private data" (if any) has been examined by the 1150 consumer. 1152 The second distinct phase is "Full operation" during which FPDUs are 1153 sent using all the rules that pertain (CRCs, Markers, MULPDU 1154 restrictions etc.). A DDP endpoint MUST be "bound" to the MPA 1155 connection at entry to this phase. 1157 When "private data" is passed between ULPs in the "Startup Phase", 1158 the ULP is responsible for interpreting that data, and then placing 1159 MPA into "Full operation". 1161 Note: The following text differentiates the two endpoints by calling 1162 them "Initiator" and "Responder". This is quite arbitrary and is 1163 NOT related to the TCP startup (SYN, SYN/ACK sequence). The 1164 Initiator is the side that sends first in the MPA startup 1165 sequence (the "MPA Request Frame"). 1167 Note: The possibility that both endpoints would be allowed to make a 1168 connection at the same time, sometimes called an "Active/Active" 1169 connection, was considered by the work group and rejected. There 1170 were several motivations for this decision. One was that 1171 applications needing this facility were few (none other than 1172 theoretical at the time of this draft). Another was that the 1173 facility created some implementation difficulties, particularly 1174 with the "Dual Stack" designs described later on. 
A last issue 1175 was that dealing with rejected connections at startup would have 1176 required at least an additional frame type, and more recovery 1177 actions, complicating the protocol. While none of these issues 1178 was overwhelming, the group and implementers were not motivated 1179 to do the work to resolve these issues. 1181 The ULP is responsible for determining which side is "Initiator" or 1182 "Responder". For "Client/Server" type ULPs this is easy. For 1183 peer-to-peer ULPs (which might utilize a TCP style "active/active" startup), 1184 some mechanism (not defined by this specification) must be 1185 established, or some streaming mode data exchanged prior to MPA 1186 startup to determine the side which starts in "Initiator" and which 1187 starts in "Responder" MPA mode. 1189 The following rules apply to the MPA connection startup phase: 1191 1. When MPA is started in the "Initiator" mode, the MPA 1192 implementation MUST send a valid "MPA Request Frame". The "MPA 1193 Request Frame" MAY include ULP supplied "Private Data". 1195 2. When MPA is started in the "Responder" mode, the MPA 1196 implementation MUST wait until an "MPA Request Frame" is received 1197 and validated before entering full MPA/DDP operation. 1199 If the "MPA Request Frame" is improperly formatted, the 1200 implementation MUST close the TCP connection and exit MPA. 1202 If the "MPA Request Frame" is properly formatted but the "Private 1203 Data" is not acceptable, the implementation SHOULD return an "MPA 1204 Reply Frame" with the "Rejected Connection" bit set to '1'; the 1205 "MPA Reply Frame" MAY include ULP supplied "Private Data"; the 1206 implementation MUST exit MPA, leaving the TCP connection open. 1207 The ULP may close TCP or use the connection for other purposes.
1209 If the "MPA Request Frame" is properly formatted and the "Private 1210 Data" is acceptable, the implementation SHOULD return an "MPA 1211 Reply Frame" with the "Rejected Connection" bit set to '0'; the 1212 "MPA Reply Frame" MAY include ULP supplied "Private Data"; and 1213 the responder SHOULD prepare to interpret any data received as 1214 FPDUs and pass any received ULPDUs to DDP. 1216 Note: Since the receiver's ability to deal with markers is 1217 unknown until the Request and Reply frames have been 1218 received, sending FPDUs before this occurs is not possible. 1220 Note: The requirement to wait on a Request Frame before sending a 1221 Reply frame is a design choice; it makes for a well-ordered 1222 sequence of events at each end, and avoids having to specify 1223 how to deal with situations where both ends start at the same 1224 time. 1226 3. MPA "Initiator" mode implementations MUST receive and validate an 1227 "MPA Reply Frame". 1229 If the "MPA Reply Frame" is improperly formatted, the 1230 implementation MUST close the TCP connection and exit MPA. 1232 If the "MPA Reply Frame" is properly formatted but the 1233 "Private Data" is not acceptable, or if the "Rejected Connection" 1234 bit is set to '1', the implementation MUST exit MPA, leaving the TCP 1235 connection open. The ULP may close TCP or use the connection for 1236 other purposes. 1238 If the "MPA Reply Frame" is properly formatted and the "Private 1239 Data" is acceptable, and the "Rejected Connection" bit is set to 1240 '0', the implementation SHOULD enter full MPA/DDP operation mode, 1241 interpreting any received data as FPDUs and sending DDP ULPDUs as 1242 FPDUs. 1244 4. MPA "Responder" mode implementations MUST receive and validate at 1245 least one FPDU before sending any FPDUs or markers. 1247 Note: this requirement is present to allow the Initiator time to 1248 get its receiver into full operation before an FPDU arrives, 1249 avoiding potential race conditions at the initiator.
            This was also subject to some debate in the work group
            before rough consensus was reached. Eliminating this
            requirement would allow faster startup in some types of
            applications. However, that would also make certain
            implementations (particularly "Dual Stack") much harder.

   5. If a received "Key" does not match the expected value (see
      section 6.1.1, MPA Request and Reply Frame Format, below), the
      TCP/DDP connection MUST be closed, and an error returned to the
      ULP.

   6. The received "Private Data" fields may be used by consumers at
      either end to further validate the connection, and set up DDP or
      other ULP parameters. The Initiator ULP MAY close the
      TCP/MPA/DDP connection as a result of validating the "Private
      Data" fields. The Responder SHOULD return a "MPA Reply Frame"
      with the "Rejected Connection" bit set to '1' if the "Private
      Data" is not acceptable to the ULP.

   7. When the first FPDU is to be sent, then if markers are enabled,
      the first octets sent are the special marker 0x00000000, followed
      by the start of the FPDU (the FPDU's "ULPDU Length" field). If
      markers are not enabled, the first octets sent are the start of
      the FPDU (the FPDU's "ULPDU Length" field).

   8. MPA implementations MUST use the difference between the "MPA
      Request Frame" and the "MPA Reply Frame" to check for incorrect
      "Initiator/Initiator" startups. Implementations SHOULD put a
      timeout on waiting for the "MPA Request Frame" when started in
      "Responder" mode, to detect incorrect "Responder/Responder"
      startups.

   9. MPA implementations MUST validate the PD_Length field. The
      buffer that receives the "Private Data" field MUST be large
      enough to receive that data; the amount of "Private Data" MUST
      NOT exceed the PD_Length, or the application buffer. If any of
      the above fails, the startup frame MUST be considered improperly
      formatted.

   10. MPA implementations SHOULD implement a reasonable timeout while
      waiting for the entire startup frames; this prevents certain
      denial of service attacks. ULPs SHOULD implement a reasonable
      timeout while waiting for FPDUs, ULPDUs and application level
      messages to guard against application failures and certain denial
      of service attacks.

6.1.1 MPA Request and Reply Frame Format

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   0  |                                                               |
      +         Key (16 bytes containing "MPA ID Req Frame")          +
   4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
   8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)       |
      +                                                               +
   12 |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   16 |M|C|R| Res     |     Rev       |          PD_Length            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      ~                                                               ~
      ~                         Private Data                          ~
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 7 "MPA Request/Reply Frame"

   Key: This field contains the "key" used to authenticate that the
      sender is an MPA sender. Initiator mode senders must set this
      field to the fixed value "MPA ID Req Frame" or (in byte order) 4D
      50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).
      Responder mode receivers MUST check this field for the same
      value, and close the connection and report an error locally if
      any other value is detected. Responder mode senders must set this
      field to the fixed value "MPA ID Rep Frame" or (in byte order) 4D
      50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).
      Initiator mode receivers MUST check this field for the same
      value, and close the connection and report an error locally if
      any other value is detected.

   M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply
      Frame", declares a receiver's requirement for Markers. When the
      value in a received "MPA Request Frame" or "MPA Reply Frame" is
      '0', markers MUST NOT be added to the data stream by the sender.
      When the value is '1', markers MUST be added as described in
      section 5.1, MPA Markers.

   C: This bit declares an endpoint's preferred CRC usage. When this
      field is '0' in the "MPA Request Frame" and the "MPA Reply
      Frame", CRCs MUST NOT be checked and need not be generated by
      either endpoint. When this bit is '1' in either the "MPA Request
      Frame" or "MPA Reply Frame", CRCs MUST be generated and checked
      by both endpoints. Note that even when not in use, the CRC field
      remains present in the FPDU. When CRCs are not in use, the CRC
      field MUST be considered valid for FPDU checking regardless of
      its contents.

   R: This bit is set to zero, and not checked on reception in the "MPA
      Request Frame". In the "MPA Reply Frame", this bit is the
      "Rejected Connection" bit, set by the Responder's ULP to indicate
      acceptance '0', or rejection '1', of the connection parameters
      provided in the "Private Data".

   Res: This field is reserved for future use. It must be set to zero
      when sending, and not checked on reception.

   Rev: This field contains the Revision of MPA. For this version of
      the specification senders MUST set this field to one. MPA
      receivers compliant with this version of the specification MUST
      check this field. If the MPA receiver cannot interoperate with
      the received version, then it MUST close the connection and
      report an error locally. Otherwise, the MPA receiver should
      report the received version to the ULP.

   PD_Length: This field MUST contain the length in Octets of the
      Private Data field. A value of zero indicates that there is no
      private data field present at all.
      If the receiver detects that the PD_Length field does not match
      the length of the "Private Data" field, or if the length of the
      "Private Data" field exceeds 512 octets, the receiver MUST close
      the connection and report an error locally. Otherwise, the MPA
      receiver should pass the PD_Length value and "Private Data" to
      the ULP.

   Private Data: This field may contain any value defined by ULPs or may
      not be present. The "Private Data" field MUST be between 0 and
      512 octets in length. ULPs define how to size, set, and validate
      this field within these limits.

6.1.2 Example Delayed Startup sequence

   A variety of startup sequences are possible when using MPA on TCP.
   Following is an example of an MPA/DDP startup that occurs after TCP
   has been running for a while and has exchanged some amount of
   streaming data. This example does not use any private data (an
   example that does is shown later in section 6.1.3.2, Example
   Immediate Startup using Private Data), although it is perfectly
   legal to include the private data. Note that since the example does
   not use any Private Data, there are no ULP interactions shown between
   receiving "Startup frames" and putting MPA into "Full operation".

             Initiator                              Responder

   +---------------------------+
   |ULP streaming mode         |
   |  <Hello> request to       |
   |  transition to DDP/MPA    |           +--------------------------+
   |  mode (optional)          | --------> |ULP gets request;         |
   +---------------------------+           |enables MPA Responder mode|
                                           |with last (optional)      |
                                           |streaming mode <Hello Ack>|
                                           |for MPA to send.          |
   +---------------------------+           |MPA waits for incoming    |
   |ULP receives streaming     | <-------- |  <MPA Request Frame>     |
   |  <Hello Ack>;             |           +--------------------------+
   |Enters MPA Initiator mode; |
   |MPA sends                  |
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>     |
                                           |Consumer binds DDP to MPA,|
                                           |MPA sends the             |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>        |           +--------------------------+
   |Consumer binds DDP to MPA, |
   |DDP/MPA begins full        |
   |operation.                 |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                 <======   |available).               |
                                           +--------------------------+

           Figure 8: Example Delayed Startup negotiation

   An example Delayed Startup sequence is described below:

   *  Active and passive sides start up a TCP connection in the usual
      fashion, probably using sockets APIs. They exchange some amount
      of streaming mode data. At some point one side (the MPA
      Initiator) sends streaming mode data that effectively says
      "Hello, let's go into MPA/DDP mode."

   *  When the remote side (the MPA Responder) gets this streaming mode
      message, the consumer would send a last streaming mode message
      that effectively says "I Acknowledge your Hello, and am now in
      MPA Responder Mode". The exchange of these messages establishes
      the exact point in the TCP stream where MPA is enabled. The
      Responding Consumer enables MPA in the Responder mode and waits
      for the initial MPA startup message.

   *  The Initiating Consumer would enable MPA startup in the Initiator
      mode which then sends the "MPA Request Frame".
      It is assumed that no "Private Data" messages are needed for this
      example, although it is possible to include them. The Initiating
      MPA (and Consumer) would also wait for the MPA connection to be
      accepted.

   *  The Responding MPA would receive the initial "MPA Request Frame"
      and would inform the consumer that this message arrived. The
      Consumer can then accept the MPA/DDP connection or close the TCP
      connection.

   *  To accept the connection request, the Responding Consumer would
      use an appropriate API to bind the TCP/MPA connections to a DDP
      endpoint, thus enabling MPA/DDP into full operation. In the
      process of going to full operation, MPA sends the "MPA Reply
      Frame". MPA/DDP waits for the first incoming FPDU before sending
      any FPDUs.

   *  If the initial TCP data was not a properly formatted "MPA Request
      Frame", MPA will close or reset the TCP connection immediately.

   *  The Initiating MPA would receive the "MPA Reply Frame" and would
      report this message to the Consumer. The Consumer can then accept
      the MPA/DDP connection, or close or reset the TCP connection to
      abort the process.

   *  On determining that the Connection is acceptable, the Initiating
      Consumer would use an appropriate API to bind the TCP/MPA
      connections to a DDP endpoint thus enabling MPA/DDP into full
      operation. MPA/DDP would begin sending DDP messages as MPA FPDUs.

6.1.3 Use of "Private Data"

   This section is advisory in nature, in that it suggests a method by
   which a ULP can deal with pre-DDP connection information exchange.

6.1.3.1 Motivation

   Prior RDMA protocols have been developed that provide "private data"
   via out of band mechanisms. As a result, many applications now
   expect some form of "private data" to be available for application
   use prior to setting up the DDP/RDMA connection.
   For example:

   An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
   and the [Verbs]) must be associated with a Protection Domain. No
   receive operations may be posted to the endpoint before it is
   associated with a Protection Domain. Indeed under both the
   InfiniBand and proposed iWARP verbs [Verbs] an endpoint/QP is created
   within a Protection Domain.

   There are some applications where the choice of Protection Domain is
   dependent upon the identity of the remote ULP client. For example, if
   a user session requires multiple connections, it is highly desirable
   for all of those connections to use a single Protection Domain.

   InfiniBand, the DAT APIs and the IT-API all provide for the active
   side ULP to provide "Private Data" when requesting a connection. This
   data is passed to the ULP to allow it to determine whether to accept
   the connection, and if so with which endpoint (and implicitly which
   Protection Domain).

   The Private Data can also be used to ensure that both ends of the
   connection have configured their RDMA endpoints compatibly on such
   matters as the RDMA Read capacity. Further ULP-specific uses are also
   presumed, such as establishing the identity of the client.

   Private Data is also allowed when accepting the connection, to allow
   completion of any negotiation on RDMA resources and for other ULP
   reasons.

   There are several potential ways to exchange this "Private Data".
   For example, the InfiniBand specification includes a connection
   management protocol that allows a small amount of "private data" to
   be exchanged using datagrams before actually starting the RDMA
   connection.

   This draft allows for small amounts of "Private Data" to be exchanged
   as part of the MPA startup sequence. The actual Private Data fields
   are carried in the "MPA Request Frame", and the "MPA Reply Frame".
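
   As an illustration of how the "Private Data" rides in the startup
   frames, the following is a minimal Python sketch of encoding and
   validating the frame laid out in Figure 7. It is not part of this
   specification. The helper names (build_frame, parse_frame), the
   sample private data value, and the choice to accept only Rev 1 are
   assumptions made for the example. The flag masks assume the usual
   convention that bit 0 in the figure is the most significant bit of
   the octet, and parse_frame assumes it is handed exactly one complete
   frame.

```python
import struct

# Fixed 16-byte Key values from Figure 7.
REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"
MPA_REVISION = 1          # "senders MUST set this field to one"
MAX_PRIVATE_DATA = 512    # upper bound on the "Private Data" field

# Key (16 octets), M|C|R|Res (1 octet), Rev (1 octet), PD_Length (2 octets),
# all in network byte order: 20 octets before the Private Data.
HEADER = struct.Struct("!16sBBH")


def build_frame(key, private_data=b"", markers=False, crc=False, rejected=False):
    """Encode a startup frame (hypothetical helper, not defined by the draft)."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    # Assumption: bit 0 of the figure is the most significant bit of the octet.
    # The five Res bits are left at zero.
    flags = (int(markers) << 7) | (int(crc) << 6) | (int(rejected) << 5)
    return HEADER.pack(key, flags, MPA_REVISION, len(private_data)) + private_data


def parse_frame(data, expected_key):
    """Decode one complete startup frame; ValueError means improperly formatted."""
    if len(data) < HEADER.size:
        raise ValueError("short frame")
    key, flags, rev, pd_length = HEADER.unpack_from(data)
    if key != expected_key:
        raise ValueError("unexpected Key")
    if rev != MPA_REVISION:  # simplification: this sketch only accepts Rev 1
        raise ValueError("unsupported Rev")
    private_data = data[HEADER.size:]
    if pd_length > MAX_PRIVATE_DATA or len(private_data) != pd_length:
        raise ValueError("bad PD_Length")
    return {
        "markers": bool(flags & 0x80),
        "crc": bool(flags & 0x40),
        "rejected": bool(flags & 0x20),   # only meaningful in a Reply Frame
        "private_data": private_data,
    }


# Usage: an Initiator builds a Request Frame carrying made-up private data,
# and a Responder validates it against the Request Key.
request = build_frame(REQ_KEY, b"session-42", markers=True)
info = parse_frame(request, REQ_KEY)
```

   In this sketch a ValueError stands in for "improperly formatted",
   the case in which the startup rules require the implementation to
   close the TCP connection. Whether well-formed "Private Data" is
   acceptable remains a ULP decision that the sketch does not model.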

   If larger amounts of private data or more negotiation is necessary,
   TCP streaming mode messages may be exchanged prior to enabling MPA.

6.1.3.2 Example Immediate Startup using Private Data

             Initiator                              Responder

   +---------------------------+
   |TCP SYN sent               |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  Sends SYN-Ack           |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  Sends Ack                |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Enters MPA Initiator mode; |           |Responder Mode, waits for |
   |MPA sends                  |           |  <MPA Request Frame>     |
   |  <MPA Request Frame>;     |           +--------------------------+
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>     |
                                           |Consumer examines "Private|
                                           |Data", provides MPA with  |
                                           |return "Private Data",    |
                                           |binds DDP to MPA, and     |
                                           |enables MPA to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>        |           +--------------------------+
   |Consumer examines "Private |
   |Data", binds DDP to MPA,   |
   |and enables DDP/MPA to     |
   |begin full operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                 <======   |available).               |
                                           +--------------------------+

           Figure 9: Example Immediate Startup negotiation

   Note: the exact order of when MPA is started in the TCP connection
      sequence is implementation dependent; the above diagram shows one
      possible sequence.
      Also, the Initiator "Ack" to the Responder's "SYN-Ack" may be
      combined into the same TCP segment containing the "MPA Request
      Frame" (as is allowed by TCP RFCs).

   The example immediate startup sequence is described below:

   *  The passive side (Responding Consumer) would listen on the TCP
      destination port, to indicate its readiness to accept a
      connection.

   *  The active side (Initiating Consumer) would request a connection
      from a TCP endpoint (that expected to upgrade to MPA/DDP/RDMA and
      expected the private data) to a destination address and port.

   *  The Initiating Consumer would initiate a TCP connection to the
      destination port. Acceptance/rejection of the connection would
      proceed as per normal TCP connection establishment.

   *  The passive side (Responding Consumer) would receive the TCP
      connection request as usual, allowing normal TCP gatekeepers, such
      as INETD and TCPserver, to exercise their normal
      safeguard/logging functions. On acceptance of the TCP
      connection, the Responding Consumer would enable MPA in the
      Responder mode and wait for the initial MPA startup message.

   *  The Initiating Consumer would enable MPA startup in the Initiator
      mode to send an initial "MPA Request Frame" with its included
      "Private Data" message. The Initiating MPA (and Consumer) would
      also wait for the MPA connection to be accepted, and any returned
      private data.

   *  The Responding MPA would receive the initial "MPA Request Frame"
      with the "Private Data" message and would pass the Private Data
      through to the consumer. The Consumer can then accept the
      MPA/DDP connection, close the TCP connection, or reject the MPA
      connection with a return message.

   *  To accept the connection request, the Responding Consumer would
      use an appropriate API to bind the TCP/MPA connections to a DDP
      endpoint, thus enabling MPA/DDP into full operation.
      In the process of going to full operation, MPA sends the "MPA
      Reply Frame" which includes the Consumer supplied "Private Data"
      containing any appropriate consumer response. MPA/DDP waits for
      the first incoming FPDU before sending any FPDUs.

   *  If the initial TCP data was not a properly formatted "MPA Request
      Frame", MPA will close or reset the TCP connection immediately.

   *  To reject the MPA connection request, the Responding Consumer
      would send an "MPA Reply Frame" with any ULP supplied "Private
      Data" (with reason for rejection), with the "Rejected Connection"
      bit set to '1', and may close the TCP connection.

   *  The Initiating MPA would receive the "MPA Reply Frame" with the
      "Private Data" message and would report this message to the
      Consumer, including the supplied Private Data.

      If the "Rejected Connection" bit is set to '1', MPA will close
      the TCP connection and exit.

      If the "Rejected Connection" bit is set to '0', and on
      determining from the "MPA Reply Frame" "Private Data" that the
      Connection is acceptable, the Initiating Consumer would use an
      appropriate API to bind the TCP/MPA connections to a DDP endpoint
      thus enabling MPA/DDP into full operation. MPA/DDP would begin
      sending DDP messages as MPA FPDUs.

6.1.4 "Dual Stack" implementations

   MPA/DDP implementations are commonly expected to be implemented as
   part of a "Dual stack" architecture. One "stack" is the traditional
   TCP stack, usually with a sockets interface API. The second stack is
   the MPA/DDP "stack" with its own API, and potentially separate code
   or hardware to deal with the MPA/DDP data. Of course,
   implementations may vary, so the following comments are of an
   advisory nature only.

   The use of the two "stacks" offers advantages:

      TCP connection setup is usually done with the TCP stack.
      This allows use of the usual naming and addressing mechanisms.
      It also means that any mechanisms used to "harden" the connection
      setup against security threats are also used when starting
      MPA/DDP.

      Some applications may have been originally designed for TCP, but
      are "enhanced" to utilize MPA/DDP after a negotiation reveals
      the capability to do so. The negotiation process takes place in
      TCP's streaming mode, using the usual TCP APIs.

      Some new applications, designed for RDMA or DDP, still need to
      exchange some data prior to starting MPA/DDP. This exchange can
      be of arbitrary length or complexity, but often consists of only
      a small amount of "private data", perhaps only a single message.
      Using the TCP streaming mode for this exchange allows this to be
      done using well understood methods.

   The main disadvantage of using two stacks is the conversion of an
   active TCP connection between them. This process must be done with
   care to prevent loss of data.

   To avoid some of the problems when using a "dual stack" architecture
   the following additional restrictions may be required by the
   implementation:

   1. Enabling the DDP/MPA stack SHOULD be done only when no incoming
      stream data is expected. This is typically managed by the ULP.
      When following the recommended startup sequence, the "Responder"
      side enters DDP/MPA mode, sends the last streaming mode data, and
      then waits for the "MPA Request Frame". No additional streaming
      mode data is expected. The "Initiator" side ULP receives the
      last streaming mode data, and then enters DDP/MPA mode. Again,
      no additional streaming mode data is expected.

   2. The DDP/MPA MAY provide the ability to send a "Last streaming
      message" as part of its "Responder" DDP/MPA enable function.
      This allows the DDP/MPA stack to more easily manage the
      conversion to DDP/MPA mode (and avoid problems with a very fast
      return of the "MPA Request Frame" from the Initiator side).

   Note: Regardless of the "stack" architecture used, TCP's rules must
      be followed. For example, if network data is lost, re-segmented
      or re-ordered, TCP must recover appropriately even when this
      occurs while switching stacks.

6.2 Normal Connection Teardown

   Each half connection of MPA terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
   that a graceful close of the LLP connection has been received by the
   LLP (e.g. FIN is received).

7 Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to DDP or other consumer:

   Code Error

   1    TCP connection closed, terminated or lost. This includes lost
        by timeout, too many retries, RST received or FIN received.

   2    Received MPA CRC does not match the calculated value for the
        FPDU.

   3    In the event that the CRC is valid, received MPA marker (if
        enabled) and "ULPDU Length" fields do not agree on the start
        of a FPDU. If the FPDU start determined from previous "ULPDU
        Length" fields does not match with the MPA marker position,
        MPA SHOULD deliver an error to DDP. It may not be possible to
        make this check as a segment arrives, but the check SHOULD be
        made when a gap creating an out of order sequence is closed
        and any time a marker points to an already identified FPDU.
        It is OPTIONAL for a receiver to check each marker, if
        multiple markers are present in an FPDU, or if the segment is
        received in order.

   4    Invalid MPA Request Frame or MPA Reply Frame received. In
        this case, the TCP connection MUST be immediately closed.
        DDP and other ULPs should treat this similarly to code 1,
        above.

   When conditions 2 or 3 above are detected, an MPA-aware TCP
   implementation MAY choose to silently drop the TCP segment rather
   than reporting the error to DDP. In this case, the sending TCP will
   retry the segment, usually correcting the error, unless the problem
   was at the source. In that case, the source will usually exceed the
   number of retries and terminate the connection.

   Once MPA delivers an error of any type, it MUST NOT pass or deliver
   any additional FPDUs on that half connection.

   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
   following a reported error. Closing the connection is the
   responsibility of DDP's ULP.

      Note that since MPA will not deliver any FPDUs on a half
      connection following an error detected on the receive side of
      that connection, DDP's ULP is expected to tear down the
      connection. This may not occur until after one or more last
      messages are transmitted on the opposite half connection. This
      allows a diagnostic error message to be sent.

8 Security Considerations

   This section discusses the security considerations for MPA.

8.1 Protocol-specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   any other protocol running over TCP. A third party, by sending
   packets into the network that are delivered to an MPA receiver, could
   launch a variety of attacks that take advantage of how MPA operates.
   For example, a third party could send random packets that are valid
   for TCP, but contain no FPDU headers. An MPA receiver reports an
   error to DDP when any packet arrives that cannot be validated as an
   FPDU when properly located on an FPDU boundary. A third party could
   also send packets that are valid for TCP, MPA, and DDP, but do not
   target valid buffers.
   These types of attacks ultimately result in loss of connection and
   thus become a type of DOS (Denial Of Service) attack. Communication
   security mechanisms such as IPsec [RFC2401] may be used to prevent
   such attacks.

   Independent of how MPA operates, a third party could use ICMP
   messages to reduce the path MTU to such a small size that performance
   would likewise be severely impacted. Range checking on path MTU
   sizes in ICMP packets may be used to prevent such attacks.

   [RDMA] and [DDP] are used to control, read and write data buffers
   over IP networks. Therefore, the control and the data packets of
   these protocols are vulnerable to the spoofing, tampering and
   information disclosure attacks listed below. In addition, connection
   to/from an unauthorized or unauthenticated endpoint is a potential
   problem with most applications using RDMA, DDP, and MPA.

8.1.1 Spoofing

   Spoofing attacks can be launched by the Remote Peer, or by a network
   based attacker. A network based spoofing attack applies to all Remote
   Peers. Because the MPA Stream requires a TCP Stream in the
   ESTABLISHED state, certain types of traditional forms of wire attacks
   do not apply -- an end-to-end handshake must have occurred to
   establish the MPA Stream. So, the only form of spoofing that applies
   is one in which a remote node can both send and receive packets. Yet
   even with this limitation the Stream is still exposed to the
   following spoofing attacks.

8.1.1.1 Impersonation

   A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
   (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
   Stream with the victim. End to end authentication (i.e. IPsec or ULP
   authentication) provides protection against this attack.

8.1.1.2 Stream Hijacking

   Stream hijacking happens when a network based attacker follows the
   Stream establishment phase, and waits until the authentication phase
   (if such a phase exists) is completed successfully. The attacker can
   then spoof the IP address and redirect the Stream from the victim to
   the attacker's own machine. For example, an attacker can wait until
   an iSCSI authentication is completed successfully, and hijack the
   iSCSI Stream.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec to prevent
   spoofing. Another option is to provide physical security. Discussion
   of physical security is out of scope for this document.

8.1.1.3 Man in the Middle Attack

   If a network based attacker has the ability to delete, inject,
   replay, or modify packets which will still be accepted by MPA (e.g.,
   TCP sequence number is correct, FPDU is valid etc.) then the Stream
   can be exposed to a man in the middle attack. The attacker could
   potentially use the services of [DDP] and [RDMA] to read the
   contents of the associated data buffer, modify the contents of the
   associated data buffer, or to disable further access to the buffer.
   The only countermeasure for this form of attack is to either secure
   the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to
   provide physical security to prevent man-in-the-middle type attacks.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing or tampering. If Stream or session level authentication and
   integrity protection are not used, then a man-in-the-middle attack
   can occur, enabling spoofing and tampering.

   Another approach is to restrict access to only the local subnet/link,
   and provide some mechanism to limit access, such as physical security
   or 802.1X.
   This model is an extremely limited deployment scenario, and will not
   be further examined here.

8.1.2 Eavesdropping

   Generally speaking, Stream confidentiality protects against
   eavesdropping. Stream and/or session authentication and integrity
   protection is a countermeasure against various spoofing and
   tampering attacks. The effectiveness of authentication and integrity
   against a specific attack depends on whether the authentication is
   machine level authentication (as the one provided by IPsec), or ULP
   authentication.

8.2 Introduction to Security Options

   The following security services can be applied to an MPA/DDP/RDMAP
   Stream:

   1. Session confidentiality - protects against eavesdropping.

   2. Per-packet data source authentication - protects against the
      following spoofing attacks: network based impersonation, Stream
      hijacking, and man in the middle.

   3. Per-packet integrity - protects against tampering done by
      network based modification of FPDUs (indirectly affecting buffer
      content through DDP services).

   4. Packet sequencing - protects against replay attacks, which is
      a special case of the above tampering attack.

   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
   or Stream hijacking attacks, it is recommended that the Stream be
   authenticated, integrity protected, and protected from replay
   attacks; it may use confidentiality protection to protect from
   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
   network).

   IPsec is capable of providing the above security services for IP and
   TCP traffic.

   ULP protocols may be able to provide part of the above security
   services. See [NFSv4CHANNEL] for additional information on a
   promising approach called "channel binding".
   From [NFSv4CHANNEL]:

      "The concept of channel bindings allows applications to prove
      that the end-points of two secure channels at different network
      layers are the same by binding authentication at one channel to
      the session protection at the other channel. The use of channel
      bindings allows applications to delegate session protection to
      lower layers, which may significantly improve performance for
      some applications."

8.3 Using IPsec With MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above. Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change. IPsec packets are
   processed (e.g., integrity checked and decrypted) in the order they
   are received, and an MPA receiver will process the decrypted FPDUs
   contained in these packets in the same manner as FPDUs contained in
   unsecured IP packets.

   MPA Implementations MUST implement IPsec. The use of IPsec is up to
   ULPs and administrators.

8.4 Requirements for IPsec Encapsulation of DDP

   The IP Storage working group has spent significant time and effort to
   define the normative IPsec requirements for IP Storage [RFC3723].
   Portions of that specification are applicable to a wide variety of
   protocols, including the RDDP protocol suite. In order to not
   replicate this effort, an RNIC implementation MUST follow the
   requirements defined in RFC 3723 Section 2.3 and Section 5, including
   the associated normative references for those sections.

   Additionally, since IPsec acceleration hardware may only be able to
   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
   messages may be sent for idle SAs, as a means of keeping the number
   of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
   delete message MUST NOT be interpreted as a reason for tearing down
   a DDP/RDMA Stream.
Rather, it is preferable to leave the Stream up,
and if additional traffic is sent on it, to bring up another IKE
Phase 2 SA to protect it. This avoids the potential for continually
bringing Streams up and down.

Note that there are serious security issues if IPsec is not
implemented end-to-end. For example, if IPsec is implemented as a
tunnel in the middle of the network, any hosts between the peer and
the IPsec tunneling device can freely attack the unprotected Stream.

9 IANA Considerations

If a well-known port is chosen as the mechanism to identify a DDP on
MPA on TCP, the well-known port must be registered with IANA.
Because the use of the port is DDP specific, registration of the port
with IANA is left to DDP.

10 References

10.1 Normative References

[iSCSI] Satran, J., "Internet Small Computer Systems Interface
    (iSCSI)", RFC 3720, April 2004.

[RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
    November 1990.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
    Selective Acknowledgment Options", RFC 2018, October 1996.

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision
    3", BCP 9, RFC 2026, October 1996.

[RFC3667] Bradner, S., "IETF Rights in Contributions", BCP 78, RFC
    3667, February 2004.

[RFC3668] Bradner, S., Ed., "Intellectual Property Rights in IETF
    Technology", BCP 79, RFC 3668, February 2004.

[RFC3723] Aboba, B., et al., "Securing Block Storage Protocols over
    IP", RFC 3723, April 2004.

[RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
    Program Protocol Specification", RFC 793, September 1981.

[RDMASEC] Pinkerton, J., Deleganes, E., Romanow, A., Bitan, S.,
    "DDP/RDMAP Security", draft-ietf-rddp-security-06.txt (work in
    progress), December 2004.

10.2 Informative References

[CRCTCP] Stone, J., Partridge, C., "When the CRC and TCP checksum
    disagree", ACM Sigcomm, Sept. 2000.

[DDP] Shah, H., et al., "Direct Data Placement over Reliable
    Transports", draft-ietf-rddp-ddp-04.txt (work in progress),
    February 2005.

[RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
    Internet Protocol", RFC 2401, November 1998.

[RFC0896] Nagle, J., "Congestion Control in IP/TCP Internetworks", RFC
    896, January 1984.

[NagleDAck] Minshall, G., Mogul, J., Saito, Y., Verghese, B.,
    "Application performance pitfalls and TCP's Nagle algorithm",
    Workshop on Internet Server Performance, May 1999.

[NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
    Secure Channels", Internet-Draft
    draft-ietf-nfsv4-channel-bindings-02.txt, July 2004.

[RDMA] Recio, R., et al., "RDMA Protocol Specification",
    draft-ietf-rddp-rdmap-03.txt, February 2005.

[RFC2960] Stewart, R., et al., "Stream Control Transmission Protocol",
    RFC 2960, October 2000.

[RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
    September 1981.

[RFC1122] Braden, R.T., "Requirements for Internet hosts -
    communication layers", RFC 1122, October 1989.

[ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations",
    draft-elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.

[Verbs] Hilland, J., et al., "RDMA Protocol Verbs Specification",
    draft-hilland-rddp-verbs-00.txt, April 2003.

11 Appendix

This appendix is for information only and is NOT part of the
standard.

11.1 Analysis of MPA over TCP Operations

This appendix analyzes the impact of MPA (Marker PDU Aligned Framing
for TCP [MPA]) on the TCP sender, receiver, and wire protocol.

One of MPA's high level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) buffer. Note that DDP separates the act of placing
data into a ULP buffer from that of notifying the ULP that the ULP
buffer is available for use. In DDP terminology, the former is
defined as "Placement", and the latter is defined as "Delivery". MPA
supports in-order delivery of the data to the ULP, including support
for Direct Data Placement in the final ULP buffer location when TCP
segments arrive out-of-order. Effectively, the goal is to use the
pre-posted ULP buffers as the TCP receive buffer, where the
reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
DDP) is done in place, in the ULP buffer, with no data copies.

This Appendix walks through the advantages and disadvantages of the
TCP sender modifications proposed by MPA:

1) that MPA require the TCP sender to do "Header Alignment", where a
   TCP segment is required to begin with an MPA Framing Protocol Data
   Unit (FPDU) (if there is payload present).

2) that there be an integral number of FPDUs in a TCP segment (under
   conditions where the Path MTU is not changing).

This Appendix concludes that the scaling advantages of Header
Alignment are strong, based primarily on fairly drastic TCP receive
buffer reduction requirements and simplified receive handling. The
analysis also shows that there is little effect on TCP wire behavior.

11.1.1 Assumptions

11.1.1.1 MPA is layered beneath DDP [DDP]

MPA is an adaptation layer between DDP and TCP. DDP requires
preservation of DDP segment boundaries and a CRC32C digest covering
the DDP header and data.
MPA adds these features to the TCP stream
so that DDP over TCP has the same basic properties as DDP over SCTP.

11.1.1.2 MPA preserves DDP message framing

MPA was designed as a framing layer specifically for DDP and was not
intended as a general-purpose framing layer for any other ULP using
TCP.

A framing layer allows ULPs using it to receive indications from the
transport layer only when complete ULPDUs are present. As a framing
layer, MPA is not aware of the content of the DDP PDU, only that it
has received and, if necessary, reassembled a complete PDU for
delivery to the DDP.

11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
         normal conditions

To make reception of a complete DDP PDU on every received segment
possible, DDP passes to MPA a PDU that is no larger than the EMSS of
the underlying fabric. Each FPDU that MPA creates contains sufficient
information for the receiver to directly place the ULP payload in the
correct location in the correct receive buffer.

Edge cases when this condition does not occur are dealt with, but do
not need to be on the fast path.

11.1.1.4 Out-of-order placement but NO out-of-order delivery

DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
information necessary to place its ULP payload directly in the
correct location in host memory.

Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer.

Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream.

11.1.2 The Value of Header Alignment

Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case.
The optimizations
allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a
"Flow-Through" receiver that enables TCP-based solutions to scale to
10G and beyond in an economical way. The optimizations are especially
relevant to hardware implementations of receivers that process
multiple protocol layers - Data Link Layer (e.g., Ethernet), Network
and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP
(e.g., MPA/DDP). As network speed increases, there is an increasing
desire to use a hardware based receiver in order to achieve an
efficient high performance solution.

A TCP receiver, under worst case conditions, has to allocate buffers
(BufferSizeTCP) whose capacities are a function of the
bandwidth-delay product. Thus:

   BufferSizeTCP = K * bandwidth [octets/s] * delay [s]

Where bandwidth is the end-to-end bandwidth of the connection, delay
is the round trip delay of the connection, and K is an implementation
dependent constant.

Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
buffers for a 10x increase in end-to-end bandwidth). As this
buffering approach may scale poorly for hardware or software
implementations alike, several approaches allow reduction in the
amount of buffering required for high-speed TCP communication.

The MPA/DDP approach is to enable the ULP's buffer to be used as the
TCP receive buffer. If the application pre-posts a sufficient amount
of buffering, and each TCP segment has sufficient information to
place the payload into the right application buffer, when an
out-of-order TCP segment arrives it could potentially be placed
directly in the ULP buffer.
However, placement can only be done when a complete
FPDU with the placement information is available to the receiver, and
the FPDU contents contain enough information to place the data into
the correct ULP buffer (e.g., there is a DDP header available).

For the case when the FPDU is not aligned with the TCP segment, it
may take, on average, 2 TCP segments to assemble one FPDU. Therefore,
the receiver has to allocate BufferSizeNAF (Buffer Size, Non-Aligned
FPDU) octets:

   BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

Where K1 and K2 are implementation dependent constants and EMSS is
the effective maximum segment size.

For example, a 1 Gbps link with 10,000 connections and an EMSS of
1500 octets would require 15 MB of memory (taking K1 = 1 and ignoring
the comparatively small K2 term). Often the number of connections
used scales with the network speed, aggravating the situation for
higher speeds.

A Header Aligned FPDU would allow the receiver to allocate
BufferSizeAF (Buffer Size, Aligned FPDU) octets:

   BufferSizeAF = K2 * EMSS

for the same conditions. A Header Aligned receiver may require memory
in the range of hundreds of KB, which is feasible for an on-chip
memory and enables a "Flow-Through" design, in which the data flows
through the NIC and is placed directly in the destination buffer.
Assuming most of the connections support Header Alignment, the
receiver buffers no longer scale with the number of connections.

Additional optimizations can be achieved in a balanced I/O sub-system
-- where the system interface of the network controller provides
ample bandwidth as compared with the network bandwidth.
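
As a rough illustration of the three buffer-size formulas above, the
following Python sketch evaluates them for the 10,000-connection
example. The function names and the values chosen for K, K1, and K2
are assumptions made for this illustration only; the text leaves these
constants implementation dependent.

```python
# Sketch of the buffer-sizing formulas. All constants are illustrative
# assumptions; the draft only says they are implementation dependent.

def buffer_size_tcp(k, bandwidth_octets_per_s, delay_s):
    """BufferSizeTCP = K * bandwidth [octets/s] * delay [s]."""
    return k * bandwidth_octets_per_s * delay_s

def buffer_size_naf(k1, k2, emss, number_of_connections):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * number_of_connections + k2 * emss

def buffer_size_af(k2, emss):
    """BufferSizeAF = K2 * EMSS (independent of the connection count)."""
    return k2 * emss

EMSS = 1500            # octets, as in the example in the text
CONNECTIONS = 10_000   # as in the example in the text
K1 = 1                 # assumed: one partial FPDU held per connection
K2 = 64                # assumed: segments in flight through the NIC

naf = buffer_size_naf(K1, K2, EMSS, CONNECTIONS)
af = buffer_size_af(K2, EMSS)
print(f"non-aligned: {naf} octets, aligned: {af} octets")
```

With these assumed constants the non-aligned receiver needs roughly
15 MB while the aligned receiver needs under 100 KB, and only the
former grows with the number of connections.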
Such a balance has held for almost twenty years and the trend is
expected to continue: while Ethernet speeds have scaled by 1000 (from
10 megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
PCI-X DDR). Under these conditions, the Header Aligned FPDU approach
allows BufferSizeAF to be indifferent to network speed. It is
primarily a function of the local processing time for a given frame.
Thus when the Header Aligned FPDU approach is used, receive buffering
is expected to scale gracefully (i.e., less than linear scaling) as
network speed is increased.

11.1.2.1 Impact of lack of Header Alignment on the receiver
         computational load and complexity

The receiver must perform IP and TCP processing, and then perform
FPDU CRC checks, before it can trust the FPDU header placement
information. For simplicity of the description, the assumption is
that an FPDU is carried in no more than 2 TCP segments. In reality,
with no Header Alignment, an FPDU can be carried by more than 2 TCP
segments (e.g., if the PMTU was reduced).

   ----++-----------------------------++-----------------------++-----
   +---||---------------+    +--------||--------+   +----------||----+
   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
   +---||---------------+    +--------||--------+   +----------||----+
   ----++-----------------------------++-----------------------++-----
                 FPDU #N-1                    FPDU #N

   Figure 10: Non-aligned FPDU freely placed in TCP octet stream

The receiver algorithm for processing TCP segments (e.g., TCP segment
#X in Figure 10) carrying non-aligned FPDUs (in-order or
out-of-order) includes:

1. Data Link Layer processing (whole frame) - typically including a
   CRC calculation.

2. Network Layer processing (assuming not an IP fragment, the
   whole Data Link Layer frame contains one IP datagram. IP
   fragments should be reassembled in a local buffer. This is not
   a performance optimization goal)

3. Transport Layer processing -- TCP protocol processing, header
   and checksum checks.

   a. Classify incoming TCP segment using the 5 tuple (IP SRC,
      IP DST, TCP SRC Port, TCP DST Port, protocol)

4. Find FPDU message boundaries.

   a. Get MPA state information for the connection

      If the TCP segment is in-order, use the receiver managed
      MPA state information to calculate where the previous
      FPDU message (#N-1) ends in the current TCP segment X.
      (Previously, when the MPA receiver processed the first
      part of FPDU #N-1, it calculated the number of bytes
      remaining to complete FPDU #N-1 by using the MPA
      Length field.)

         Get the stored partial CRC for FPDU #N-1

         Complete CRC calculation for FPDU #N-1 data (first
         portion of TCP segment #X)

         Check CRC calculation for FPDU #N-1

         If no FPDU CRC errors, placement is allowed

         Locate the local buffer for the first portion of
         FPDU #N-1, CopyData(local buffer of first portion
         of FPDU #N-1, host buffer address, length)

         Compute host buffer address for second portion of FPDU
         #N-1

         CopyData(local buffer of second portion of FPDU #N-1,
         host buffer address for second portion, length)

         Calculate the octet offset into the TCP segment for
         the next FPDU #N.

         Start calculation of CRC for available data for FPDU
         #N

         Store partial CRC results for FPDU #N

         Store local buffer address of first portion of FPDU #N

         No further action is possible on FPDU #N before it is
         completely received

      If the TCP segment is out-of-order, the receiver must buffer
      the data until at least one complete FPDU is received.
      Typically buffering for more than one TCP segment per
      connection is required.
      Use the MPA based Markers to calculate
      where FPDU boundaries are.

         When a complete FPDU is available, a similar procedure
         to the in-order algorithm above is used. There is
         additional complexity, though, because when the
         missing segment arrives, this TCP segment must be
         run through the CRC engine after the CRC is
         calculated for the missing segment.

If we assume Header Alignment, the following diagram and the
algorithm below apply. Note that when using MPA, the receiver is
assumed to actively detect presence or loss of Header Alignment for
every TCP segment received.

      +--------------------------+      +--------------------------+
   +--|--------------------------+   +--|--------------------------+
   |  |       TCP Seg X          |   |  |      TCP Seg X+1         |
   +--|--------------------------+   +--|--------------------------+
      +--------------------------+      +--------------------------+
                FPDU #N                          FPDU #N+1

   Figure 11: Aligned FPDU placed immediately after TCP header

The receiver algorithm for Header Aligned frames (in-order or
out-of-order) includes:

1) Data Link Layer processing (whole frame) - typically
   including a CRC calculation.

2) Network Layer processing (assuming not an IP fragment, the
   whole Data Link Layer frame contains one IP datagram. IP
   fragments should be reassembled in a local buffer. This is
   not a performance optimization goal)

3) Transport Layer processing -- TCP protocol processing, header
   and checksum checks.

   a. Classify incoming TCP segment using the 5 tuple (IP SRC,
      IP DST, TCP SRC Port, TCP DST Port, protocol)

4) Check for Header Alignment. (Described in detail in [MPA]
   section 7.4). Assuming Header Alignment for the rest of the
   algorithm below.

   a. If the header is not aligned, see the algorithm defined
      in the prior section.

5) Whether the TCP segment is in-order or out-of-order, the MPA
   header is at the beginning of the current TCP payload. Get the
   FPDU length from the FPDU header.

6) Calculate CRC over FPDU

7) Check CRC calculation for FPDU #N

8) If no FPDU CRC errors, placement is allowed

9) CopyData(TCP segment #X, host buffer address, length)

10) Loop to #5 until all the FPDUs in the TCP segment are
    consumed in order to handle FPDU packing.

Implementation note: In both cases the receiver has to classify the
incoming TCP segment and associate it with one of the flows it
maintains. In the case of no Header Alignment, the receiver is forced
to classify incoming traffic before it can calculate the FPDU CRC. In
the case of Header Alignment the order of operations is left to the
implementer.

The Header Aligned receiver algorithm is significantly simpler. There
is no need to locally buffer portions of FPDUs. Accessing state
information is also substantially simplified - the normal case does
not require retrieving information to find out where an FPDU starts
and ends or retrieval of a partial CRC before the CRC calculation can
commence. This avoids adding internal latencies, having multiple data
passes through the CRC machine, or scheduling multiple commands for
moving the data to the host buffer.

The aligned FPDU approach is useful for in-order and out-of-order
reception. The receiver can use the same mechanisms for data storage
in both cases, and only needs to account for when all the TCP
segments have arrived to enable delivery. The Header Alignment,
along with the high probability that at least one complete FPDU is
found with every TCP segment, allows the receiver to perform data
placement for out-of-order TCP segments with no need for intermediate
buffering.
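
Steps 5 through 10 of the Header Aligned algorithm can be sketched in
Python as follows. This is a minimal illustration, not a conforming
receiver: it assumes Markers are disabled, assumes an FPDU laid out as
a 2-octet ULPDU_Length, the ULPDU, padding to a 4-octet boundary, and a
4-octet CRC32C, and assumes the CRC is carried in network byte order.
The helper names (crc32c, build_fpdu, place_aligned_segment) and the
place callback standing in for CopyData are hypothetical.

```python
import struct

def _make_crc32c_table():
    poly = 0x82F63B78  # reflected CRC-32C (Castagnoli) polynomial
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC32C_TABLE = _make_crc32c_table()

def crc32c(data):
    """Table-driven CRC32C over a bytes-like object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def build_fpdu(ulpdu):
    """Sender-side helper for the example: length, ULPDU, pad, CRC."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)          # pad to 4-octet boundary
    return body + struct.pack("!I", crc32c(body))

def place_aligned_segment(tcp_payload, place):
    """Consume every FPDU in one Header Aligned TCP payload."""
    offset = 0
    while offset < len(tcp_payload):            # step 10: loop (packing)
        if len(tcp_payload) - offset < 2:
            raise ValueError("partial FPDU: Header Alignment lost")
        # Step 5: the FPDU header sits at the current offset.
        (ulpdu_len,) = struct.unpack_from("!H", tcp_payload, offset)
        body_len = 2 + ulpdu_len + (-(2 + ulpdu_len) % 4)
        end = offset + body_len + 4
        if end > len(tcp_payload):
            raise ValueError("partial FPDU: Header Alignment lost")
        body = tcp_payload[offset:offset + body_len]
        (wire_crc,) = struct.unpack_from("!I", tcp_payload, offset + body_len)
        # Steps 6-8: placement is allowed only if the CRC checks out.
        if crc32c(body) != wire_crc:
            raise ValueError("FPDU CRC error")
        # Step 9: hand the ULPDU to the placement function (CopyData).
        place(body[2:2 + ulpdu_len])
        offset = end

placed = []
segment = build_fpdu(b"hello") + build_fpdu(b"world!!")
place_aligned_segment(segment, placed.append)
```

Note that the loop needs no per-connection partial CRC or saved
partial-FPDU buffer; a segment that fails the length check would be
handed to the non-aligned path instead.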
Essentially the TCP receive buffer has been eliminated and
TCP reassembly is done in place within the ULP buffer.

In case Header Alignment is not found, the receiver should follow the
algorithm for non-aligned FPDU reception, which may be slower and
less efficient.

11.1.2.2 Header Alignment effects on TCP wire protocol

An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to
calculate its MULPDU, which it then exposes to DDP, its ULP. DDP
uses the MULPDU to segment its payload so that each FPDU sent by
MPA fits completely into one TCP segment. This has no impact on
wire protocol and exposing this information is already supported
on many TCP implementations, including all modern flavors of BSD
networking, through the TCP_MAXSEG socket option.

In the common case, the ULP (i.e., DDP over MPA) messages provided to
the TCP layer are segmented to MULPDU size. It is assumed that the
ULP message size is bounded by MULPDU, such that a single ULP message
can be encapsulated in a single TCP segment. Therefore, in the common
case, there is no increase in the number of TCP segments emitted. For
smaller ULP messages, the sender can also apply packing, i.e., the
sender packs as many complete FPDUs as possible into one TCP segment.
The requirement to always have a complete FPDU may increase the
number of TCP segments emitted. Typically, a ULP message size varies
from a few bytes to multiple EMSS (e.g., 64 Kbytes). In some cases the
ULP may post more than one message at a time for transmission, giving
the sender an opportunity for packing. In the case where more than
one FPDU is available for transmission and the FPDUs are encapsulated
into a TCP segment and there is no room in the TCP segment to include
the next complete FPDU, another TCP segment is sent. In this corner
case some of the TCP segments are not full size.
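
The packing rule just described can be sketched in Python. The
function name pack_fpdus and the EMSS value of 1460 octets are
assumptions for this illustration; the sketch only groups FPDU sizes
and does not model TCP itself.

```python
def pack_fpdus(fpdu_sizes, emss):
    """Greedily group complete FPDUs into TCP segments of at most emss
    octets; an FPDU is never split across two segments."""
    segments, current, used = [], [], 0
    for size in fpdu_sizes:
        if size > emss:
            raise ValueError("FPDU larger than EMSS")
        if used + size > emss:
            # No room for the next complete FPDU: emit a short segment.
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments

EMSS = 1460  # assumed value, for illustration only

# Seven 400-octet FPDUs: three fit per segment, so three segments go out
# and none of them is full size.
small = pack_fpdus([400] * 7, EMSS)

# FPDUs of EMSS/2 + 1 octets: only one fits per segment.
half_plus_one = pack_fpdus([EMSS // 2 + 1] * 100, EMSS)
print(len(small), len(half_plus_one))
```

In the second case 100 segments are emitted for 73,100 octets, where
an unconstrained byte stream would need 51 full-size segments.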
In the worst case
scenario, the ULP may choose an FPDU size that is EMSS/2 + 1 and have
multiple messages available for transmission. For this poor choice of
FPDU size, the average TCP segment size is therefore about 1/2 of the
EMSS and the number of TCP segments emitted approaches 2x of what
is possible without the requirement to encapsulate an integer number
of complete FPDUs in every TCP segment. This is a dynamic situation
that only lasts for the duration where the sender ULP has multiple
non-optimal messages for transmission and this causes a minor impact
on the wire utilization.

However, it is not expected that requiring Header Alignment will have
a measurable impact on wire behavior of most applications. Throughput
applications with large I/Os are expected to take full advantage of
the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction oriented applications are also optimal.

TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted
segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing
the window" and [RFC1122] section 4.2.2.15). In the unlikely event
that part of the original segment has been received and acknowledged
by the remote peer (e.g., a re-segmenting middle box, as documented
in Section 5.4.1, "Re-segmenting Middle boxes and non MPA-aware TCP
senders"), a better available bandwidth utilization may be possible
by re-transmitting only the missing octets. If an MPA-aware TCP
retransmits complete FPDUs, there may be some marginal bandwidth
loss.

Another area where a change in the number of TCP segments may have
impact is that of Slow Start and Congestion Avoidance.
Slow-start
exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start
exponential bandwidth growth for MPA-aware TCP is similar to any TCP
implementation. Congestion Avoidance allows for a linear growth in
available bandwidth when recovering after a packet drop. Similar to
the analysis for slow-start, MPA-aware TCP doesn't change the
behavior of the algorithm. Therefore the average size of the segment
versus EMSS is not a major factor in the assessment of the bandwidth
growth for a sender. Both Slow Start and Congestion Avoidance for an
MPA-aware TCP will behave similarly to any TCP sender and allow an
MPA-aware TCP to enjoy the theoretical performance limits of the
algorithms.

In summary, the ULP messages generated at the sender (e.g., the
number of messages grouped for every transmission request) and the
message size distribution have the most significant impact on the
number of TCP segments emitted. The worst case effect for certain
ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by
an increase of up to 2x in the number of TCP segments and
acknowledgments. In reality the effect is expected to be marginal.

11.2 Receiver implementation

Transport & Network Layer Reassembly Buffers:

The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out of order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU
Alignment is lost or if IP fragmentation occurs. This is because the
incoming out of order segment may not contain enough information for
MPA to process all of the FPDU.
For cases where a re-segmenting
middle box is present, or where the TCP sender is not MPA-aware, the
presence of markers significantly reduces the amount of buffering
needed.

Recovery from IP Fragmentation must be transparent to the MPA
Consumers.

11.2.1 Network Layer Reassembly Buffers

Most IP implementations set the IP Don't Fragment bit. Thus upon a
path MTU change, intermediate devices drop the IP datagram if it is
too large and reply with an ICMP message which tells the source TCP
that the path MTU has changed. This causes TCP to emit segments
conformant with the new path MTU size. Thus IP fragments under most
conditions should never occur at the receiver. But it is possible.

There are several options for implementation of network layer
reassembly buffers:

1. drop any IP fragments, and reply with an ICMP message according
   to [RFC792] (fragmentation needed and DF set) to tell the Remote
   Peer to resize its TCP segment

2. support an IP reassembly buffer, but have it of limited size
   (possibly the same size as the local link's MTU). The end Node
   would normally never advertise a path MTU larger than the local
   link MTU. It is recommended that a dropped IP fragment cause an
   ICMP message to be generated according to [RFC792].

3. multiple IP reassembly buffers, of effectively unlimited size.

4. support an IP reassembly buffer for the largest IP datagram (64
   KB).

5. support for a large IP reassembly buffer which could span
   multiple IP datagrams.

An implementation should support at least 2 or 3 above, to avoid
dropping packets that have traversed the entire fabric.

There is no end-to-end ACK for IP reassembly buffers, so there is no
flow control on the buffer. The only end-to-end ACK is a TCP ACK,
which can only occur when a complete IP datagram is delivered to TCP.
Because of this, under worst case, pathological scenarios, the
largest IP reassembly buffer is the TCP receive window (to buffer
multiple IP datagrams that have all been fragmented).

Note that if the Remote Peer does not implement re-segmentation of
the data stream upon receiving the ICMP reply updating the path MTU,
it is possible to halt forward progress because the opposite peer
would continue to retransmit using a transport segment size that is
too large. This deadlock scenario is no different than if the fabric
MTU (not last hop MTU) was reduced after connection setup, and the
remote Node's behavior is not compliant with [RFC1122].

11.2.2 TCP Reassembly buffers

A TCP reassembly buffer is also needed. TCP reassembly buffers are
needed if FPDU Alignment is lost when using TCP with MPA or when the
MPA FPDU spans multiple TCP segments. Buffers are also needed if
Markers are disabled and out of order packets arrive.

Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible advertised
TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST
be at least 140 octets long to support the minimum FPDU size. The
140 octets allows for the minimum MULPDU of 128, 2 octets of pad, 2
of ULPDU_Length, 4 of CRC, and space for a possible marker. As usual,
additional buffering may provide better performance.

Note that if the TCP segment were not stored, it is possible to
deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to
the new path MTU.
The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks.

When a complete FPDU is received, processing continues normally.

11.3 IETF RNIC Interoperability with RDMA Consortium Protocols

Without the exchange of MPA Request/Reply Frames, there is no
standard mechanism for enabling RDMAC RNICs to interoperate with IETF
RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
immediately in RDMA mode (i.e., without exchanging the MPA
Request/Reply messages), there is no reason to believe an IETF RNIC
will interoperate with an RDMAC RNIC because of the differences in
the version number in the DDP and RDMAP headers on the wire.

Therefore, the ULP or other supporting entity at the RDMAC RNIC must
implement MPA Request/Reply Frames on behalf of the RNIC in order to
negotiate the connection parameters. The following section describes
the results following the exchange of the MPA Request/Reply Frames
before the conversion from streaming to RDMA mode.

11.3.1 Negotiated Parameters

Three types of RNICs are considered:

Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
   has a ULP or other supporting entity that exchanges the MPA
   Request/Reply Frames in streaming mode before the conversion to
   RDMA mode.

Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
   which is not capable of implementing the RDMAC protocols. Such
   an RNIC can only interoperate with other IETF RNICs.

Permissive IETF RNIC - an RNIC implementing the IETF protocols which
   is capable of implementing the RDMAC protocols on a per
   connection basis.

The values used by these three RNIC types for the MPA, DDP, and RDMAP
versions as well as MPA markers and CRC are summarized in Figure 12.

   +----------------++-----------+-----------+-----------+-----------+
   | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
   |                ||  Version  | Revision  |  Markers  |    CRC    |
   +----------------++-----------+-----------+-----------+-----------+
   +----------------++-----------+-----------+-----------+-----------+
   | RDMAC          ||     0     |     0     |     1     |     1     |
   |                ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+
   | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
   | Non-permissive ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+
   | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
   | permissive     ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+

   Figure 12: Connection Parameters for the RNIC Types.
   For MPA markers and MPA CRC, enabled=1, disabled=0.

It is assumed there is no mixing of versions allowed between MPA, DDP
and RDMAP. The RNIC either generates the RDMAC protocols on the wire
(version is zero) or the IETF protocols (version is one).

During the exchange of the MPA Request/Reply Frames, each peer
provides its MPA Revision, Marker preference (M: 0=disabled,
1=enabled), and CRC preference. The MPA Revision provided in the MPA
Request Frame and the MPA Reply Frame may differ.

From the information in the MPA Request/Reply Frames, each side sets
the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
well as the state of the Markers for each half connection. Between
DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP
and RDMAP version MUST be identical in the two directions. The RNIC
either generates the RDMAC protocols on the wire (version is zero) or
the IETF protocols (version is one).
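
The version rule above can be sketched in Python from one side's point
of view. The table of supported versions is taken from Figure 12; the
function name select_version, the string keys, and the use of None to
mean "close the connection" are assumptions of this sketch and not
part of the protocol.

```python
RDMAC, IETF = 0, 1  # values of the Version field (V)

# DDP/RDMAP versions each RNIC type can generate, per Figure 12.
SUPPORTED_VERSIONS = {
    "RDMAC": {RDMAC},
    "IETF non-permissive": {IETF},
    "IETF permissive": {RDMAC, IETF},
}

def select_version(local_rnic_type, peer_mpa_rev):
    """Return the DDP/RDMAP version to use in both directions, or None
    if the local RNIC cannot generate the version implied by the MPA
    Revision the peer put in its Request or Reply Frame.

    Because no mixing of versions is allowed between MPA, DDP, and
    RDMAP, the peer's MPA Revision determines the only acceptable
    DDP/RDMAP version.
    """
    if peer_mpa_rev in SUPPORTED_VERSIONS[local_rnic_type]:
        return peer_mpa_rev
    return None

# A Non-permissive IETF RNIC facing an RDMAC peer (Rev = 0) has no
# usable version, while a Permissive IETF RNIC can follow the peer.
print(select_version("IETF non-permissive", 0))
print(select_version("IETF permissive", 0))
```

Marker state is negotiated separately for each half connection and is
not modeled here.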
In the following sections, the figures do not discuss CRC negotiation
because there is no interoperability issue for CRCs. Because the RDMAC
RNIC always requests CRC use, both peers MUST, according to the IETF
MPA specification, generate and check CRCs.

11.3.2 RDMAC RNIC and Non-permissive IETF RNIC

Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate
with an RDMAC RNIC, despite the fact that both peers exchange MPA
Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
negotiation has no effect on the DDP/RDMAP version and it is unable
to interoperate with the RDMAC RNIC.

The rows in the figure show the state of the Marker field in the MPA
Request Frame sent by the MPA Initiator. The columns show the state
of the Marker field in the MPA Reply Frame sent by the MPA Responder.
Each type of RNIC is shown as an initiator and a responder. The
connection results are shown in the lower right corner, at the
intersection of the different RNIC types, where V=0 is the RDMAC
DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
markers are disabled and M=1 means MPA markers are enabled. The
negotiated marker state is shown as X/Y, for the receive direction of
the initiator/responder.

+---------------------------++-----------------------+
|   MPA                     ||          MPA          |
| CONNECT                   ||       Responder       |
|  MODE   +-----------------++-------+---------------+
|         |   RNIC          || RDMAC |     IETF      |
|         |   TYPE          ||       | Non-permissive|
|         |          +------++-------+-------+-------+
|         |          |MARKER||  M=1  |  M=0  |  M=1  |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
|         |   RDMAC  | M=1  ||  V=0  | close | close |
|         |          |      || M=1/1 |       |       |
|         +----------+------++-------+-------+-------+
|   MPA   |          | M=0  || close |  V=1  |  V=1  |
|Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
|         |Non-perms.+------++-------+-------+-------+
|         |          | M=1  || close |  V=1  |  V=1  |
|         |          |      ||       | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+
Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
IETF RNIC.

11.3.2.1 RDMAC RNIC Initiator

If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
Frame with Rev field set to zero and the M and C bits set to one.
Because the Non-permissive IETF RNIC cannot dynamically downgrade the
version number it uses for DDP and RDMAP, it would send an MPA Reply
Frame with the Rev field equal to one and then gracefully close the
connection.

11.3.2.2 Non-Permissive IETF RNIC Initiator

If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
Request Frame with Rev field equal to one. The ULP or supporting
entity for the RDMAC RNIC responds with an MPA Reply Frame that has
the Rev field equal to zero and the M bit set to one. The
Non-permissive IETF RNIC will gracefully close the connection after it
reads the incompatible Rev field in the MPA Reply Frame.

11.3.3 RDMAC RNIC and Permissive IETF RNIC

Figure 14 shows that a Permissive IETF RNIC can interoperate with an
RDMAC RNIC regardless of its Marker preference.
The figure uses the same format as Figure 13.

+---------------------------++-----------------------+
|   MPA                     ||          MPA          |
| CONNECT                   ||       Responder       |
|  MODE   +-----------------++-------+---------------+
|         |   RNIC          || RDMAC |     IETF      |
|         |   TYPE          ||       |  Permissive   |
|         |          +------++-------+-------+-------+
|         |          |MARKER||  M=1  |  M=0  |  M=1  |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
|         |   RDMAC  | M=1  ||  V=0  |  N/A  |  V=0  |
|         |          |      || M=1/1 |       | M=1/1 |
|         +----------+------++-------+-------+-------+
|   MPA   |          | M=0  ||  V=0  |  V=1  |  V=1  |
|Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
|         |Permissive+------++-------+-------+-------+
|         |          | M=1  ||  V=0  |  V=1  |  V=1  |
|         |          |      || M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+
Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
IETF RNIC.

A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
Rev field of the MPA Request/Reply Frames and then adjust its receive
Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As
a result, as an MPA Responder, the Permissive IETF RNIC will never
return an MPA Reply Frame with the M bit set to zero. This case is
shown as not applicable (N/A) in Figure 14.

11.3.3.1 RDMAC RNIC Initiator

When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
entity prepares an MPA Request message and sets the revision to zero
and the M bit and C bit to one.

The Permissive IETF Responder receives the MPA Request message and
checks the revision field. Since it is capable of generating RDMAC
DDP/RDMAP headers, it sends an MPA Reply message with revision set to
zero and the M and C bits set to one. The Responder must inform its
ULP that it is generating version zero DDP/RDMAP messages.

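The outcomes in Figures 13 and 14 can be sketched as a small Python
function. This is a minimal illustration, not a normative algorithm.
The names Rnic, Outcome, and negotiate are hypothetical. The sketch
returns None where the figures show "close" and raises ValueError for
the N/A cell of Figure 14. For two IETF RNICs it assumes each peer's
M bit is simply carried through as X/Y.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Rnic(Enum):
    RDMAC = 0
    IETF_NON_PERMISSIVE = 1
    IETF_PERMISSIVE = 2


class Outcome(NamedTuple):
    version: int   # V: 0 = RDMAC DDP/RDMAP, 1 = IETF DDP/RDMAP
    markers: str   # "X/Y": receive direction of initiator/responder


def negotiate(initiator, init_m, responder, resp_m) -> Optional[Outcome]:
    """Return the negotiated Outcome, or None if the connection closes."""
    for rnic, m in ((initiator, init_m), (responder, resp_m)):
        if m not in (0, 1):
            raise ValueError("M bit must be 0 or 1")
        if rnic is Rnic.RDMAC and m != 1:
            raise ValueError("an RDMAC RNIC always sets M to one")

    peers = (initiator, responder)
    if Rnic.RDMAC in peers:
        # A Non-permissive IETF RNIC cannot downgrade to version zero,
        # so the connection is gracefully closed (Figure 13).
        if Rnic.IETF_NON_PERMISSIVE in peers:
            return None
        # A Permissive IETF Responder never replies with M=0 to an
        # RDMAC Initiator (the N/A cell of Figure 14).
        if responder is Rnic.IETF_PERMISSIVE and resp_m == 0:
            raise ValueError("not applicable: Permissive Responder sets M=1")
        # RDMAC protocols on the wire, markers enabled both ways.
        return Outcome(version=0, markers="1/1")

    # Both peers are IETF RNICs: version one, each side keeps its own
    # receive Marker preference.
    return Outcome(version=1, markers=f"{init_m}/{resp_m}")
```

For instance, negotiate(Rnic.IETF_PERMISSIVE, 0, Rnic.RDMAC, 1) gives
version 0 with markers "1/1", matching the lower-left cells of
Figure 14, whereas the same call with Rnic.IETF_NON_PERMISSIVE as the
Initiator returns None.
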
11.3.3.2 Permissive IETF RNIC Initiator

If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
Request Frame setting the Rev field to one. Regardless of the value
of the M bit in the MPA Request Frame, the ULP or other supporting
entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
equal to zero and the M bit set to one.

When the Initiator reads the Rev field of the MPA Reply Frame and
finds that its peer is an RDMAC RNIC, it must inform its ULP that it
should generate version zero DDP/RDMAP messages and enable MPA
markers and CRC.

11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC

For completeness, Figure 15 shows the results of MPA negotiation
between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The
important point from this figure is that an IETF RNIC cannot detect
whether its peer is a Permissive or Non-permissive RNIC.

+---------------------------++-------------------------------+
|   MPA                     ||              MPA              |
| CONNECT                   ||           Responder           |
|  MODE   +-----------------++---------------+---------------+
|         |   RNIC          ||     IETF      |     IETF      |
|         |   TYPE          || Non-permissive|  Permissive   |
|         |          +------++-------+-------+-------+-------+
|         |          |MARKER||  M=0  |  M=1  |  M=0  |  M=1  |
+---------+----------+------++-------+-------+-------+-------+
+---------+----------+------++-------+-------+-------+-------+
|         |          | M=0  ||  V=1  |  V=1  |  V=1  |  V=1  |
|         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
|         |Non-perms.+------++-------+-------+-------+-------+
|         |          | M=1  ||  V=1  |  V=1  |  V=1  |  V=1  |
|         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
|   MPA   +----------+------++-------+-------+-------+-------+
|Initiator|          | M=0  ||  V=1  |  V=1  |  V=1  |  V=1  |
|         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
|         |Permissive+------++-------+-------+-------+-------+
|         |          | M=1  ||  V=1  |  V=1  |  V=1  |  V=1  |
|         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+-------+
Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC.

12 Authors' Addresses

Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA
Phone: +1 978 689 1614
Email: steph@sandburst.com

Paul R. Culley
Hewlett-Packard Company
20555 SH 249
Houston, Tx. USA 77070-2698
Phone: 281-514-5543
Email: paul.culley@hp.com

Uri Elzur
Broadcom
16215 Alton Parkway
CA, 92618
Phone: 949.585.6432
Email: uri@broadcom.com

Renato J Recio
IBM
Internal Zip 9043
11400 Burnett Road
Austin, Texas 78759
Phone: 512-838-3685
Email: recio@us.ibm.com

John Carrier
Adaptec Inc.
691 South Milpitas Blvd.
Milpitas, CA 95035
Phone: 360-378-8526
Email: John_Carrier@adaptec.com

13 Acknowledgments

Dwight Barron
Hewlett-Packard Company
20555 SH 249
Houston, Tx. USA 77070-2698
Phone: 281-514-2769
Email: dwight.barron@hp.com

Jeff Chase
Department of Computer Science
Duke University
Durham, NC 27708-0129 USA
Phone: +1 919 660 6559
Email: chase@cs.duke.edu

Ted Compton
EMC Corporation
Research Triangle Park, NC 27709, USA
Phone: 919-248-6075
Email: compton_ted@emc.com

Dave Garcia
Hewlett-Packard Company
19333 Vallco Parkway
Cupertino, Ca. USA 95014
Phone: 408.285.6116
Email: dave.garcia@hp.com

Hari Ghadia
Adaptec, Inc.
691 S. Milpitas Blvd.,
Milpitas, CA 95035 USA
Phone: +1 (408) 957-5608
Email: hari_ghadia@adaptec.com

Howard C. Herbert
Intel Corporation
MS CH7-404
5000 West Chandler Blvd.
Chandler, Arizona 85226
Phone: 480-554-3116
Email: howard.c.herbert@intel.com

Jeff Hilland
Hewlett-Packard Company
20555 SH 249
Houston, Tx. USA 77070-2698
Phone: 281-514-9489
Email: jeff.hilland@hp.com

Mike Ko
IBM
650 Harry Rd.
San Jose, CA 95120
Phone: (408) 927-2085
Email: mako@us.ibm.com

Mike Krause
Hewlett-Packard Corporation, 43LN
19410 Homestead Road
Cupertino, CA 95014 USA
Phone: +1 (408) 447-3191
Email: krause@cup.hp.com

Dave Minturn
Intel Corporation
MS JF1-210
5200 North East Elam Young Parkway
Hillsboro, Oregon 97124
Phone: 503-712-4106
Email: dave.b.minturn@intel.com

Jim Pinkerton
Microsoft, Inc.
One Microsoft Way
Redmond, WA, USA 98052
Email: jpink@microsoft.com

Hemal Shah
Intel Corporation
MS PTL1
1501 South Mopac Expressway, #400
Austin, Texas 78746
Phone: 512-732-3963
Email: hemal.shah@intel.com

Allyn Romanow
Cisco Systems
170 W Tasman Drive
San Jose, CA 95134 USA
Phone: +1 408 525 8836
Email: allyn@cisco.com

Tom Talpey
Network Appliance
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 (781) 768-5329
Email: thomas.talpey@netapp.com

Patricia Thaler
Agilent Technologies, Inc.
1101 Creekside Ridge Drive, #100
M/S-RG10
Roseville, CA 95678
Phone: +1-916-788-5662
Email: pat_thaler@agilent.com

Jim Wendt
Hewlett Packard Corporation
8000 Foothills Boulevard MS 5668
Roseville, CA 95747-5668 USA
Phone: +1 916 785 5198
Email: jim_wendt@hp.com

Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740 USA
Phone: +1 978 779 7224
Email: jim.williams@emulex.com

14 Full Copyright Statement

This document and the information contained herein is provided on an
"AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright (C) The Internet Society (2005). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.