Remote Direct Data Placement Work Group                        P. Culley
INTERNET-DRAFT                                   Hewlett-Packard Company
draft-ietf-rddp-mpa-08.txt                                      U. Elzur
                                                    Broadcom Corporation
                                                                R. Recio
                                                         IBM Corporation
                                                               S. Bailey
                                                   Sandburst Corporation
                                                              J. Carrier
                                                               Cray Inc.
Expires: April 2007                                      October 7, 2006

            Marker PDU Aligned Framing for TCP Specification

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft
   Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Abstract

   MPA (Marker Protocol data unit Aligned framing) is designed to work
   as an "adaptation layer" between TCP and the Direct Data Placement
   [DDP] protocol, preserving the reliable, in-order delivery of TCP,
   while adding the preservation of higher-level protocol record
   boundaries that DDP requires. MPA is fully compliant with applicable
   TCP RFCs and can be utilized with existing TCP implementations. MPA
   also supports integrated implementations that combine TCP, MPA and
   DDP to reduce buffering requirements in the implementation and
   improve performance at the system level.

Table of Contents

   Status of this Memo
   Abstract
   1  Glossary
   2  Introduction
   2.1  Motivation
   2.2  Protocol Overview
   3  MPA's interactions with DDP
   4  MPA Full Operation Mode
   4.1  FPDU Format
   4.2  Marker Format
   4.3  MPA Markers
   4.4  CRC Calculation
   4.5  FPDU Size Considerations
   5  MPA's interactions with TCP
   5.1  MPA transmitters with a standard layered TCP
   5.2  MPA receivers with a standard layered TCP
   6  MPA Receiver FPDU Identification
   7  Connection Semantics
   7.1  Connection setup
   7.1.1  MPA Request and Reply Frame Format
   7.1.2  Connection Startup Rules
   7.1.3  Example Delayed Startup sequence
   7.1.4  Use of Private Data
   7.1.4.1  Motivation
   7.1.4.2  Example Immediate Startup using Private Data
   7.1.5  "Dual stack" implementations
   7.2  Normal Connection Teardown
   8  Error Semantics
   9  Security Considerations
   9.1  Protocol-specific Security Considerations
   9.1.1  Spoofing
   9.1.1.1  Impersonation
   9.1.1.2  Stream Hijacking
   9.1.1.3  Man in the Middle Attack
   9.1.2  Eavesdropping
   9.2  Introduction to Security Options
   9.3  Using IPsec With MPA
   9.4  Requirements for IPsec Encapsulation of MPA/DDP
   10  IANA Considerations
   A  Appendix. Optimized MPA-aware TCP implementations
   A.1  Optimized MPA/TCP transmitters
   A.2  Effects of Optimized MPA/TCP Segmentation
   A.3  Optimized MPA/TCP receivers
   A.4  Re-segmenting Middle boxes and non optimized MPA/TCP senders
   A.5  Receiver implementation
   A.5.1  Network Layer Reassembly Buffers
   A.5.2  TCP Reassembly buffers
   B  Appendix. Analysis of MPA over TCP Operations
   B.1  Assumptions
   B.1.1  MPA is layered beneath DDP [DDP]
   B.1.2  MPA preserves DDP message framing
   B.1.3  The size of the ULPDU passed to MPA is less than EMSS under
          normal conditions
   B.1.4  Out-of-order placement but NO out-of-order Delivery
   B.2  The Value of FPDU Alignment
   B.2.1  Impact of lack of FPDU Alignment on the receiver computational
          load and complexity
   B.2.2  FPDU Alignment effects on TCP wire protocol
   C  Appendix. IETF Implementation Interoperability with RDMA Consortium
      Protocols
   C.1  Negotiated Parameters
   C.2  RDMAC RNIC and Non-permissive IETF RNIC
   C.2.1  RDMAC RNIC Initiator
   C.2.2  Non-Permissive IETF RNIC Initiator
   C.2.3  RDMAC RNIC and Permissive IETF RNIC
   C.2.4  RDMAC RNIC Initiator
   C.2.5  Permissive IETF RNIC Initiator
   C.3  Non-Permissive IETF RNIC and Permissive IETF RNIC
   Normative References
   Informative References
   Author's Addresses
   Acknowledgments
   Full Copyright Statement
   Intellectual Property

Table of Figures

   Figure 1 ULP MPA TCP Layering
   Figure 2 FPDU Format
   Figure 3 Marker Format
   Figure 4 Example FPDU Format with Marker
   Figure 5 Annotated Hex Dump of an FPDU
   Figure 6 Annotated Hex Dump of an FPDU with Marker
   Figure 7 Fully layered implementation
   Figure 8 MPA Request/Reply Frame
   Figure 9: Example Delayed Startup negotiation
   Figure 10: Example Immediate Startup negotiation
   Figure 11 Optimized MPA/TCP implementation
   Figure 12: Non-aligned FPDU freely placed in TCP octet stream
   Figure 13: Aligned FPDU placed immediately after TCP header
   Figure 14. Connection Parameters for the RNIC Types.
   Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
      IETF RNIC.
   Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
      IETF RNIC.
   Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
      Permissive IETF RNIC.

Revision history [To be deleted prior to RFC publication]

   [draft-ietf-rddp-mpa-08] workgroup draft with following changes:

       Re-submission to correct conversion errors.

   [draft-ietf-rddp-mpa-07] workgroup draft with following changes:

       Minor clarifications; added CRC to glossary, made 2.1 discussion
       on probabilistic/deterministic a little less global. Added note
       that MULPDU is likely smaller than 64768, clarified 'M' bit
       description, added xref to private data discussion in field
       definition, removed LLP acronym, added sentence on DOS attack to
       "Man in Middle" in security.

   [draft-ietf-rddp-mpa-06] workgroup draft with following changes:

       Document restructuring to move descriptive information on
       implementing optimized MPA/TCP implementations to an appendix.
       All normative text was removed from the appendix. Paragraph
       added to security section explaining IPSEC version. Added
       informative references to architecture, applicability, and
       problem statement documents.

1  Glossary

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

   Consumer - the ULPs or applications that lie above MPA and DDP. The
       Consumer is responsible for making TCP connections, starting MPA
       and DDP connections, and generally controlling operations.

   CRC - Cyclic Redundancy Check.

   Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
       the process of informing DDP that a particular PDU is ordered for
       use.
       A PDU is Delivered in the exact order that it was sent by
       the original sender; MPA uses TCP's byte stream ordering to
       determine when Delivery is possible. This is specifically
       different from "passing the PDU to DDP", which may generally
       occur in any order, while the order of Delivery is strictly
       defined.

   EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
       TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
       and the current path Maximum Transfer Unit (MTU) [RFC1191].

   FPDU - Framed Protocol Data Unit. The unit of data created by an MPA
       sender.

   FPDU Alignment - the property that an FPDU is Header Aligned with the
       TCP segment, and the TCP segment includes an integer number of
       FPDUs. A TCP segment with FPDU Alignment allows immediate
       processing of the contained FPDUs without waiting on other TCP
       segments to arrive or combining with prior segments.

   FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate
       the beginning of an FPDU.

   Full Operation (Full Operation Phase) - After the completion of the
       Startup Phase MPA begins exchanging FPDUs.

   Header Alignment - the property that a TCP segment begins with an
       FPDU. The FPDU is Header Aligned when the FPDU header is exactly
       at the start of the TCP segment (right behind the TCP headers on
       the wire).

   Initiator - The endpoint of a connection that sends the MPA Request
       Frame, i.e. the first to actually send data (which may not be the
       one which sends the TCP SYN).

   Marker - A four octet field that is placed in the MPA data stream at
       fixed octet intervals (every 512 octets).

   MPA-aware TCP - a TCP implementation that is aware of the receiver
       efficiencies of MPA FPDU Alignment and is capable of sending TCP
       segments that begin with an FPDU.

   MPA-enabled - MPA is enabled if the MPA protocol is visible on the
       wire.
       When the sender is MPA-enabled, it is inserting framing
       and Markers. When the receiver is MPA-enabled, it is
       interpreting framing and Markers.

   MPA Request Frame - Data sent from the MPA Initiator to the MPA
       Responder during the Startup Phase.

   MPA Reply Frame - Data sent from the MPA Responder to the MPA
       Initiator during the Startup Phase.

   MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This
       document defines the MPA protocol.

   MULPDU - Maximum ULPDU. The current maximum size of the record that
       is acceptable for DDP to pass to MPA for transmission.

   Node - A computing device attached to one or more links of a Network.
       A Node in this context does not refer to a specific application
       or protocol instantiation running on the computer. A Node may
       consist of one or more MPA on TCP devices installed in a host
       computer.

   PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact
       modulo 4 size.

   PDU - protocol data unit

   Private Data - A block of data exchanged between MPA endpoints during
       initial connection setup.

   Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that
       ties the use of various endpoint resources (memory access etc.)
       to the specific RDMA/DDP/MPA connection.

   RDDP - a suite of protocols including MPA, [DDP], [RDMAP], an overall
       security document [RDMASEC], a problem statement [RFC4297], an
       architecture document [RFC4296], and an applicability document
       [APPL].

   RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
       to enable applications to transfer data directly from memory
       buffers. See [RDMAP].

   Remote Peer - The MPA protocol implementation on the opposite end of
       the connection. Used to refer to the remote entity when
       describing protocol exchanges or other interactions between two
       Nodes.

   Responder - The connection endpoint which responds to an incoming MPA
       connection request (the MPA Request Frame). This may not be the
       endpoint which awaited the TCP SYN.

   Startup Phase - The initial exchanges of an MPA connection which
       serve to more fully identify MPA endpoints to each other and
       pass connection specific setup information to each other.

   ULP - Upper Layer Protocol. The protocol layer above the protocol
       layer currently being referenced. The ULP for MPA is DDP [DDP].

   ULPDU - Upper Layer Protocol Data Unit. The data record defined by
       the layer above MPA (DDP). ULPDU corresponds to DDP's DDP
       segment.

   ULPDU_Length - a field in the FPDU describing the length of the
       included ULPDU.

2  Introduction

   This section discusses the reason for creating MPA on TCP and a
   general overview of the protocol.

2.1  Motivation

   The Direct Data Placement protocol [DDP], when used with TCP
   [RFC793], requires a mechanism to detect record boundaries. The DDP
   records are referred to as Upper Layer Protocol Data Units by this
   document. The ability to locate the Upper Layer Protocol Data Unit
   (ULPDU) boundary is useful to a hardware network adapter that uses
   DDP to directly place the data in the application buffer based on the
   control information carried in the ULPDU header. This may be done
   without requiring that the packets arrive in order. Potential
   benefits of this capability are the avoidance of the memory copy
   overhead and a smaller memory requirement for handling out of order
   or dropped packets.

   Many approaches have been proposed for a generalized framing
   mechanism. Some are probabilistic in nature and others are
   deterministic. An example probabilistic approach is characterized by
   a detectable value embedded in the octet stream, with no method of
   preventing that value from appearing elsewhere within user data.
   It is probabilistic because under some conditions the receiver may
   incorrectly interpret application data as the detectable value.
   Under these conditions, the protocol may fail with unacceptable
   frequency. One deterministic approach is characterized by embedded
   controls at known locations in the octet stream. Because the
   receiver can guarantee it will only examine the data stream at
   locations that are known to contain the embedded control, the
   protocol can never misinterpret application data as being embedded
   control data. For unambiguous handling of an out of order packet, a
   deterministic approach is preferred.

   The MPA protocol provides a framing mechanism for DDP running over
   TCP using the deterministic approach. It allows the location of the
   ULPDU to be determined in the TCP stream even if the TCP segments
   arrive out of order.

2.2  Protocol Overview

   The layering of PDUs with MPA is shown in Figure 1, below.

              +------------------+
              |    ULP client    |
              +------------------+  <- Consumer messages
              |       DDP        |
              +------------------+  <- ULPDUs
              |       MPA*       |
              +------------------+  <- FPDUs (containing ULPDUs)
              |       TCP*       |
              +------------------+  <- TCP Segments (containing FPDUs)
              |     IP etc.      |
              +------------------+
               * These may be fully layered or optimized together.

                     Figure 1 ULP MPA TCP Layering

   MPA is described as an extra layer above TCP and below DDP. The
   operation sequence is:

   1.  A TCP connection is established by ULP action. This is done
       using methods not described by this specification. The ULP may
       exchange some amount of data in streaming mode prior to starting
       MPA, but is not required to do so.

   2.  The Consumer negotiates the use of DDP and MPA at both ends of a
       connection. The mechanisms to do this are not described in this
       specification.
       The negotiation may be done in streaming mode, or
       by some other mechanism (such as a pre-arranged port number).

   3.  The ULP activates MPA on each end in the Startup Phase, either as
       an Initiator or a Responder, as determined by the ULP. This mode
       verifies the usage of MPA, specifies the use of CRC and Markers,
       and allows the ULP to communicate some additional data via a
       Private Data exchange. See section 7.1, Connection setup, for
       more details on the startup process.

   4.  At the end of the Startup Phase, the ULP puts MPA (and DDP) into
       Full Operation and begins sending DDP data as further described
       below. In this document, DDP data chunks are called ULPDUs. For
       a description of the DDP data, see [DDP].

   Following is a description of data transfer when MPA is in Full
   Operation.

   1.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
       for this value. MPA derives this information from TCP or IP,
       when it is available, or chooses a reasonable value.

   2.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
       MPA at the sender.

   3.  MPA creates a Framed Protocol Data Unit (FPDU) by prepending a
       header, optionally inserting Markers, and appending a CRC field
       after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.

   4.  The TCP sender puts the FPDUs into the TCP stream. If the sender
       is optimized MPA/TCP, it segments the TCP stream in such a way
       that a TCP Segment boundary is also the boundary of an FPDU. TCP
       then passes each segment to the IP layer for transmission.

   5.  The receiver may or may not be optimized. If it is optimized
       MPA/TCP, it may separate passing the TCP payload to MPA from
       passing the TCP payload ordering information to MPA. In either
       case, RFC compliant TCP wire behavior is observed at both the
       sender and receiver.

   6.  The MPA receiver locates and assembles complete FPDUs within the
       stream, verifies their integrity, and removes MPA Markers (when
       present), ULPDU_Length, PAD and the CRC field.

   7.  MPA then provides the complete ULPDUs to DDP. MPA may also
       separate passing MPA payload to DDP from passing the MPA payload
       ordering information.

   A fully layered MPA on TCP is implemented as a data stream ULP for
   TCP and is therefore RFC compliant.

   An optimized DDP/MPA/TCP uses a TCP layer which potentially contains
   some additional behaviors as suggested in this document. When
   DDP/MPA/TCP are cross-layer optimized, the behavior of TCP (esp.
   sender segmentation) may change from that of the un-optimized
   implementation, but the changes are within the bounds permitted by
   the TCP RFC specifications, and will interoperate with an
   un-optimized TCP. The additional behaviors are described in
   Appendix A and are not normative; they are described at a TCP
   interface layer as a convenience. Implementations may achieve the
   described functionality using any method, including cross layer
   optimizations between TCP, MPA and DDP.

   An optimized DDP/MPA/TCP sender is able to segment the data stream
   such that TCP segments begin with FPDUs (FPDU Alignment). This has
   significant advantages for receivers. When segments arrive with
   aligned FPDUs the receiver usually need not buffer any portion of the
   segment, allowing DDP to place it in its destination memory
   immediately, thus avoiding copies from intermediate buffers (DDP's
   reason for existence).

   An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation
   to locate the start of ULPDUs that may be received out of order. It
   also allows the implementation to determine if the entire ULPDU has
   been received. As a result, MPA can pass out of order ULPDUs to DDP
   for immediate use.
   This enables a DDP on MPA implementation to save
   a significant amount of intermediate storage by placing the ULPDUs in
   the right locations in the application buffers when they arrive,
   rather than waiting until full ordering can be restored.

   The ability of a receiver to recover out of order ULPDUs is optional
   and declared to the transmitter during startup. When the receiver
   declares that it does not support out of order recovery, the
   transmitter does not add the control information to the data stream
   needed for out of order recovery.

   If the receiver is fully layered, then MPA receives a strictly
   ordered stream of data and does not deal with out of order ULPDUs.
   In this case MPA passes each ULPDU to DDP when the last bytes arrive
   from TCP, along with the indication that they are in order.

   MPA implementations that support recovery of out of order ULPDUs MUST
   support a mechanism to indicate the ordering of ULPDUs as the sender
   transmitted them and indicate when missing intermediate segments
   arrive. These mechanisms allow DDP to reestablish record ordering
   and report Delivery of complete messages (groups of records).

   MPA also addresses enhanced data integrity. Some users of TCP have
   noted that the TCP checksum is not as strong as could be desired (see
   [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum
   indicates segments in error at a much higher rate than the underlying
   link characteristics would indicate. With these higher error rates,
   the chance that an error will escape detection, when using only the
   TCP checksum for data integrity, becomes a concern. A stronger
   integrity check can reduce the chance of data errors being missed.

   MPA includes a CRC check to increase the ULPDU data integrity to the
   level provided by other modern protocols, such as SCTP [RFC2960].
   It is possible to disable this CRC check; however, CRCs MUST be
   enabled unless it is clear that the end to end connection through the
   network has data integrity at least as good as an MPA with CRC
   enabled (for example when IPsec is implemented end to end). DDP's
   ULP expects this level of data integrity and therefore the ULP does
   not have to provide its own duplicate data integrity and error
   recovery for lost data.

3  MPA's interactions with DDP

   DDP requires MPA to maintain DDP record boundaries from the sender to
   the receiver. When using MPA on TCP to send data, DDP provides
   records (ULPDUs) to MPA. MPA will use the reliable transmission
   abilities of TCP to transmit the data, and will insert appropriate
   additional information into the TCP stream to allow the MPA receiver
   to locate the record boundary information.

   As such, MPA accepts complete records (ULPDUs) from DDP at the sender
   and returns them to DDP at the receiver.

   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
   contained in one FPDU.

   MPA over a standard TCP stack can usually provide FPDU Alignment with
   the TCP Header if the FPDU is equal to TCP's EMSS. An optimized
   MPA/TCP stack can also maintain alignment as long as the FPDU is less
   than or equal to TCP's EMSS. Since FPDU Alignment is generally
   desired by the receiver, DDP cooperates with MPA to ensure FPDUs'
   lengths do not exceed the EMSS under normal conditions. This is done
   with the MULPDU mechanism.

   MPA MUST provide information to DDP on the current maximum size of
   the record that is acceptable to send (MULPDU). DDP SHOULD limit
   each record size to MULPDU. The range of MULPDU values MUST be
   between 128 octets and 64768 octets, inclusive.

   The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
   MPA.
   DDP MAY post a ULPDU of any size between one and 64768 octets;
   however, MPA is not REQUIRED to support a ULPDU Length that is
   greater than the current MULPDU.

   While the maximum theoretical length supported by the MPA header
   ULPDU_Length field is 65535, TCP over IP limits the IP datagram to a
   maximum length of 65535 octets. To enable MPA to support FPDU
   Alignment, the maximum size of the FPDU must fit within an IP
   datagram. Thus the ULPDU limit of 64768 octets was derived by taking
   the maximum IP datagram length, subtracting from it the maximum total
   length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
   options, and the worst case MPA overhead, and then rounding the
   result down to a 128 octet boundary.

   Note that MULPDU will be significantly smaller than the theoretical
   maximum in most implementations for most circumstances, due to link
   MTUs, use of extra headers such as those required for IPsec, etc.

   On receive, MPA MUST pass each ULPDU with its length to DDP when it
   has been validated.

   If an MPA implementation supports passing out of order ULPDUs to DDP,
   the MPA implementation SHOULD:

   *   Pass each ULPDU with its length to DDP as soon as it has been
       fully received and validated.

   *   Provide a mechanism to indicate the ordering of ULPDUs as the
       sender transmitted them. One possible mechanism might be
       providing the TCP sequence number for each ULPDU.

   *   Provide a mechanism to indicate when a given ULPDU (and prior
       ULPDUs) are complete (Delivered to DDP). One possible mechanism
       might be to allow DDP to see the current outgoing TCP Ack
       sequence number.

   *   Provide an indication to DDP that the TCP has closed or has begun
       to close the connection (e.g. received a FIN).

   MPA MUST provide the protocol version negotiated with its peer to
   DDP. DDP will use this version to set the version in its header and
   to report the version to [RDMAP].
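
   The derivation of the 64768 octet limit given in this section can be
   checked with a few lines of arithmetic. The Python sketch below is
   illustrative only. The 60 octet maximum sizes for the IPv4 and TCP
   headers (fixed header plus options) are standard values, but the
   breakdown of the "worst case MPA overhead" into a 2 octet length
   field, 3 PAD octets, a 4 octet CRC, and one 4 octet Marker per 512
   octets of TCP payload is an assumption made for this example; this
   section does not itemize it. All names are hypothetical.

```python
import math

MAX_IP_DATAGRAM = 65535    # maximum IP datagram length, in octets
MAX_IPV4_HEADER = 20 + 40  # fixed IPv4 header plus maximum IPv4 options
MAX_TCP_HEADER = 20 + 40   # fixed TCP header plus maximum TCP options
MARKER_INTERVAL = 512      # Marker spacing in the TCP octet stream
MARKER_SIZE = 4            # each Marker is four octets
ROUNDING = 128             # limit is rounded down to a 128 octet boundary


def derive_max_ulpdu() -> int:
    """Reproduce the ULPDU limit from the recipe in the text.

    Assumption: worst case MPA overhead = length field (2) + PAD (3)
    + CRC (4) + one Marker for every started 512 octet interval.
    """
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    markers = math.ceil(tcp_payload / MARKER_INTERVAL) * MARKER_SIZE
    framing = 2 + 3 + 4
    available = tcp_payload - markers - framing
    return (available // ROUNDING) * ROUNDING


def is_valid_mulpdu(value: int) -> bool:
    """MULPDU values must lie between 128 and 64768 octets, inclusive."""
    return 128 <= value <= derive_max_ulpdu()
```

   Under these assumptions the computation yields 64768, matching the
   limit stated above. An implementation would normally report a much
   smaller MULPDU derived from the current EMSS.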

4  MPA Full Operation Mode

   The following sections describe the main semantics of the full
   operation mode of MPA.

4.1  FPDU Format

   MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown
   below MUST be used for all MPA FPDUs. For purposes of clarity,
   Markers are not shown in Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          ULPDU_Length         |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   ~                                                               ~
   ~                             ULPDU                             ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |        PAD (0-3 octets)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                          Figure 2 FPDU Format

   ULPDU_Length: 16 bits (unsigned integer). This is the number of
   octets of the contained ULPDU. It does not include the length of the
   FPDU header itself, the pad, the CRC, or of any Markers that fall
   within the ULPDU. The 16-bit ULPDU Length field is large enough to
   support the largest IP datagrams for IPv4 or IPv6.

   PAD: The PAD field trails the ULPDU and contains between zero and
   three octets of data. The pad data MUST be set to zero by the sender
   and ignored by the receiver (except for CRC checking). The length of
   the pad is set so as to make the size of the FPDU an integral
   multiple of four.

   CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C
   check value, which is used to verify the entire contents of the FPDU.
   See section 4.4, CRC Calculation. When CRCs are not enabled, this
   field is still present, may contain any value, and MUST NOT be
   checked.

   The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
   addition, the total length of the FPDU will include the length of any
   Markers and from 0 to 3 pad octets added to round up the ULPDU size.
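
   The framing rules of section 4.1 can be sketched in Python as
   follows. This is a minimal illustration, not a reference
   implementation. It omits Markers (as Figure 2 does), assumes the
   ULPDU_Length field is carried in network byte order, and leaves the
   CRC to a caller-supplied function because the CRC32C details are
   defined in section 4.4. The names pad_length, build_fpdu,
   parse_fpdu and crc_fn are hypothetical.

```python
import struct
from typing import Callable, Optional

# Hypothetical hook: maps the FPDU octets preceding the CRC field to
# the four octets placed in the CRC field.
CrcFn = Callable[[bytes], bytes]

MAX_ULPDU = 64768  # upper bound on the ULPDU size stated in section 3


def pad_length(ulpdu_len: int) -> int:
    """PAD octets needed so header + ULPDU + PAD is a multiple of 4."""
    return -(2 + ulpdu_len) % 4


def build_fpdu(ulpdu: bytes, crc_fn: Optional[CrcFn] = None) -> bytes:
    """Frame one ULPDU as one Marker-free FPDU."""
    if not 1 <= len(ulpdu) <= MAX_ULPDU:
        raise ValueError("ULPDU size must be between 1 and 64768 octets")
    # Assumption: 16-bit length in network (big-endian) byte order.
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += bytes(pad_length(len(ulpdu)))  # PAD octets are zero
    # With CRCs disabled the field is still present; zeros are used here.
    crc = crc_fn(body) if crc_fn is not None else bytes(4)
    if len(crc) != 4:
        raise ValueError("CRC field must be exactly 4 octets")
    return body + crc


def parse_fpdu(fpdu: bytes) -> bytes:
    """Return the ULPDU from a Marker-free FPDU; the CRC is not checked."""
    (length,) = struct.unpack_from("!H", fpdu)
    if len(fpdu) != 2 + length + pad_length(length) + 4:
        raise ValueError("FPDU size does not match ULPDU_Length")
    return fpdu[2:2 + length]
```

   For a 2 octet ULPDU the sketch produces an 8 octet FPDU, which is
   the minimum overhead of 6 octets noted above; a 3 octet ULPDU needs
   3 PAD octets and produces a 12 octet FPDU.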

4.2  Marker Format

   The format of a Marker MUST be as specified in Figure 3:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           RESERVED            |            FPDUPTR            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                         Figure 3 Marker Format

   RESERVED: The Reserved field MUST be set to zero on transmit and
   ignored on receive (except for CRC calculation).

   FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long,
   interpreted as an unsigned integer that indicates the number of
   octets in the TCP stream from the beginning of the ULPDU Length field
   to the first octet of the entire Marker. The least significant two
   bits MUST always be set to zero at the transmitter, and the receivers
   MUST always treat these as zero for calculations.

4.3  MPA Markers

   MPA Markers are used to identify the start of FPDUs when packets are
   received out of order. This is done by locating the Markers at fixed
   intervals in the data stream (which is correlated to the TCP sequence
   number) and using the Marker value to locate the preceding FPDU
   start.

   All MPA Markers are included in the containing FPDU CRC calculation
   (when both CRCs and Markers are in use).

   The MPA receiver's ability to locate out of order FPDUs and pass the
   ULPDUs to DDP is implementation dependent. MPA/DDP allows those
   receivers that are able to deal with out of order FPDUs in this way
   to require the insertion of Markers in the data stream. When the
   receiver cannot deal with out of order FPDUs in this way, it may
   disable the insertion of Markers at the sender. All MPA senders MUST
   be able to generate Markers when their use is declared by the
   opposing receiver (see section 7.1, Connection setup).
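
   The Marker layout of Figure 3 and the use of FPDUPTR as a
   back-pointer can be sketched in Python as follows. This is an
   illustration under stated assumptions: the two 16-bit fields are
   taken to be in network byte order, and the FPDU start is computed
   modulo 2^32 because TCP sequence numbers are 32 bits wide and can
   wrap. The function names are hypothetical.

```python
import struct

SEQ_MODULUS = 1 << 32  # TCP sequence numbers are 32 bits wide


def encode_marker(fpduptr: int) -> bytes:
    """Build a 4 octet Marker: RESERVED (zero) followed by FPDUPTR."""
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR must fit in 16 bits")
    if fpduptr & 0x3:
        raise ValueError("low two bits of FPDUPTR must be zero")
    # Assumption: both fields are in network (big-endian) byte order.
    return struct.pack("!HH", 0, fpduptr)


def decode_marker(marker: bytes) -> int:
    """Return FPDUPTR from a Marker.

    RESERVED is ignored, and the low two bits of FPDUPTR are treated
    as zero, as required of receivers.
    """
    _reserved, fpduptr = struct.unpack("!HH", marker)
    return fpduptr & 0xFFFC


def fpdu_start_seq(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the ULPDU_Length field the Marker points to.

    marker_seq is the sequence number of the first octet of the Marker.
    """
    return (marker_seq - fpduptr) % SEQ_MODULUS
```

   For example, a Marker whose first octet is at sequence number 1035
   and whose FPDUPTR is 512 points back to an FPDU header at sequence
   number 523. The modulo step keeps the result correct when the
   header precedes a sequence number wrap.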
   When Markers are enabled, MPA senders MUST insert a Marker into the
   data stream at a 512 octet periodic interval in the TCP Sequence
   Number Space. The Marker contains a 16 bit unsigned integer referred
   to as the FPDUPTR (FPDU Pointer).

   If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
   relative back-pointer. FPDUPTR MUST contain the number of octets in
   the TCP stream from the beginning of the ULPDU Length field to the
   first octet of the Marker, unless the Marker falls between FPDUs.
   Thus the location of the first octet of the previous FPDU header can
   be determined by subtracting the value of the given Marker from the
   current octet-stream sequence number (i.e., TCP sequence number) of
   the first octet of the Marker. Note that this computation MUST take
   into account that the TCP sequence number could have wrapped between
   the Marker and the header.

   An FPDUPTR value of 0x0000 is a special case: it is used when the
   Marker falls exactly between FPDUs (between the preceding FPDU CRC
   field, and the next FPDU's ULPDU Length field). In this case, the
   Marker is considered to be contained in the following FPDU; the
   Marker MUST be included in the CRC calculation of the FPDU following
   the Marker (if CRCs are being generated or checked). Thus an FPDUPTR
   value of 0x0000 means that immediately following the Marker is an
   FPDU header (the ULPDU Length field).

   Since all FPDUs are integral multiples of 4 octets, the bottom two
   bits of the FPDUPTR as calculated by the sender are zero. MPA
   reserves these bits so they MUST be treated as zero for computation
   at the receiver.

   When Markers are enabled (see section 7.1 Connection setup), the MPA
   Markers MUST be inserted immediately preceding the first FPDU of
   Full Operation phase, and at every 512th octet of the TCP octet
   stream thereafter.
   As a result, the first Marker has an FPDUPTR value of 0x0000. If the
   first Marker begins at octet sequence number SeqStart, then Markers
   are inserted such that the first octet of the Marker is at octet
   sequence number SeqNum if (SeqNum - SeqStart) mod 512 is zero. Note
   that SeqNum can wrap.

   For example, if the TCP sequence number were used to calculate the
   insertion point of the Marker, the starting TCP sequence number is
   unlikely to be zero, and 512 octet multiples are unlikely to fall on
   a modulo 512 of zero. If the MPA connection is started at TCP
   sequence number 11, then the first Marker will begin at 11, and
   subsequent Markers will begin at 523, 1035, etc.

   If an FPDU is large enough to contain multiple Markers, they MUST
   all point to the same point in the TCP stream: the first octet of
   the ULPDU Length field for the FPDU.

   If a Marker interval contains multiple FPDUs (the FPDUs are small),
   the Marker MUST point to the start of the ULPDU Length field for the
   FPDU containing the Marker unless the Marker falls between FPDUs, in
   which case the Marker MUST be zero.

   The following example shows an FPDU containing a Marker.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     ULPDU Length (0x0010)     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +                                                               +
   |                      ULPDU (octets 0-9)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            (0x0000)           |       FPDU ptr (0x000C)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     ULPDU (octets 10-15)                      |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |      PAD (2 octets:0,0)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
              Figure 4 Example FPDU Format with Marker

   MPA Receivers MUST preserve ULPDU boundaries when passing data to
   DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
   DDP and not the Markers, headers, and CRC.

4.4 CRC Calculation

   An MPA implementation MUST implement CRC support and MUST either:

   (1) always use CRCs; the MPA provider is not REQUIRED to support
       an administrator's request that CRCs not be used.

   or

   (2a) only indicate a preference to not use CRCs on the explicit
       request of the system administrator, via an interface not
       defined in this spec. The default configuration for a
       connection MUST be to use CRCs.

   (2b) disable CRC checking (and possibly generation) if both the
       local and remote endpoints indicate preference to not use CRCs.

   The decision for hosts to request CRC suppression MAY be made on an
   administrative basis for any path that provides equivalent
   protection from undetected errors as an end-to-end CRC32c.

   The process MUST be invisible to the ULP.

   After receipt of an MPA startup declaration indicating that its peer
   requires CRCs, an MPA instance MUST continue generating and checking
   CRCs until the connection terminates.
   If an MPA instance has declared that it does not require CRCs, it
   MUST turn off CRC checking immediately after receipt of an MPA mode
   declaration indicating that its peer also does not require CRCs. It
   MAY continue generating CRCs. See section 7.1 Connection setup for
   details on the MPA startup.

   When sending an FPDU, the sender MUST include a CRC field. When CRCs
   are enabled, the CRC field in the MPA FPDU MUST be computed using
   the CRC32C polynomial in the manner described in the iSCSI Protocol
   [iSCSI] document for Header and Data Digests.

   The fields which MUST be included in the CRC calculation when
   sending an FPDU are as follows:

   1) If a Marker does not immediately precede the ULPDU Length field,
      the CRC-32c is calculated from the first octet of the ULPDU
      Length field, through all the ULPDU and Markers (if present), to
      the last octet of the PAD (if present), inclusive. If there is a
      Marker immediately following the PAD, the Marker is included in
      the CRC calculation for this FPDU.

   2) If a Marker immediately precedes the first octet of the ULPDU
      Length field of the FPDU (i.e., the Marker fell between FPDUs,
      and thus is required to be included in the second FPDU), the
      CRC-32c is calculated from the first octet of the Marker, through
      the ULPDU Length header, through all the ULPDU and Markers (if
      present), to the last octet of the PAD (if present), inclusive.

   3) After calculating the CRC-32c, the resultant value is placed into
      the CRC field at the end of the FPDU.

   When an FPDU is received, and CRC checking is enabled, the receiver
   MUST first perform the following:

   1) Calculate the CRC of the incoming FPDU in the same fashion as
      defined above.

   2) Verify that the calculated CRC-32c value is the same as the
      received CRC-32c value found in the FPDU CRC field. If not, the
      receiver MUST treat the FPDU as an invalid FPDU.
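   The coverage rules above can be illustrated with a non-normative
   Python sketch. crc32c() is a plain table-driven CRC-32C
   (Castagnoli polynomial, bit-reflected, initial value and final XOR
   of 0xFFFFFFFF), which is the form generally used for iSCSI digests.
   fpdu_crc() and its parameters are hypothetical names introduced
   here; the octet order in which the result is written to the CRC
   field is not modeled.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected form of 0x1EDC6F41


def _build_table() -> list:
    """Precompute the 256-entry lookup table for byte-at-a-time CRC-32C."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc = _TABLE[(crc ^ octet) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def fpdu_crc(stream: bytes, ulpdu_length_offset: int,
             crc_field_offset: int, marker_precedes: bool) -> int:
    """CRC over one FPDU held in `stream` (hypothetical helper).

    Coverage starts at the ULPDU Length field, or 4 octets earlier when
    a Marker immediately precedes it, and runs up to (not including)
    the CRC field, so embedded Markers and the PAD are covered.
    """
    start = ulpdu_length_offset - 4 if marker_precedes else ulpdu_length_offset
    return crc32c(stream[start:crc_field_offset])
```

   The widely published check value for the ASCII string "123456789"
   is 0xE3069283, which is a convenient self-test before comparing
   against the reference dumps in this section.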
   The procedure for handling invalid FPDUs is covered in the Error
   Section (see section 8).

   The following is an annotated hex dump of an example FPDU sent as
   the first FPDU on the stream. As such, it starts with a Marker. The
   FPDU contains a 42 octet ULPDU (an example DDP segment), which in
   turn carries 24 octets of Send Data that are all zeros. The CRC32c
   has been correctly calculated and can be used as a reference. See
   the [DDP] and [RDMAP] specifications for definitions of the DDP
   Control field, Queue, MSN, MO, and Send Data.

   Octet Contents  Annotation
   Count

   0000    00      Marker: Reserved
   0001    00
   0002    00      Marker: FPDUPTR
   0003    00
   0004    00      ULPDU Length
   0005    2a
   0006    41      DDP Control Field, Send with Last flag set
   0007    43
   0008    00      Reserved (DDP STag position with no STag)
   0009    00
   000a    00
   000b    00
   000c    00      DDP Queue = 0
   000d    00
   000e    00
   000f    00
   0010    00      DDP MSN = 1
   0011    00
   0012    00
   0013    01
   0014    00      DDP MO = 0
   0015    00
   0016    00
   0017    00
   0018    00      DDP Send Data (24 octets of zeros)
   ...
   002f    00
   0030    52      CRC32c
   0031    23
   0032    99
   0033    83
              Figure 5 Annotated Hex Dump of an FPDU

   The following is an example sent as the second FPDU of the stream
   where the first FPDU (which is not shown here) had a length of 492
   octets and was also a Send to Queue 0 with Last Flag set. This
   example contains a Marker.
843 Octet Contents Annotation 844 Count 846 01ec 00 Length 847 01ed 2a 848 01ee 41 DDP Control Field: Send with Last Flag set 849 01ef 43 850 01f0 00 Reserved (DDP STag position with no STag) 851 01f1 00 852 01f2 00 853 01f3 00 854 01f4 00 DDP Queue = 0 855 01f5 00 856 01f6 00 857 01f7 00 858 01f8 00 DDP MSN = 2 859 01f9 00 860 01fa 00 861 01fb 02 862 01fc 00 DDP MO = 0 863 01fd 00 864 01fe 00 865 01ff 00 866 0200 00 Marker: Reserved 867 0201 00 868 0202 00 Marker: FPDUPTR 869 0203 14 870 0204 00 DDP Send Data (24 octets of zeros) 871 ... 872 021b 00 873 021c 84 CRC32c 874 021d 92 875 021e 58 876 021f 98 877 Figure 6 Annotated Hex Dump of an FPDU with Marker 879 4.5 FPDU Size Considerations 881 MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as 882 the size of the largest ULPDU fitting in an FPDU. For an empty TCP 883 Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus 884 space for Markers and pad octets. 886 The maximum ULPDU Length for a single ULPDU when Markers are 887 present MUST be computed as: 889 MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4) 891 The formula above accounts for the worst-case number of Markers. 893 The maximum ULPDU Length for a single ULPDU when Markers are NOT 894 present MUST be computed as: 896 MULPDU = EMSS - (6 + EMSS mod 4) 898 As a further optimization of the wire efficiency an MPA 899 implementation MAY dynamically adjust the MULPDU (see section 5 for 900 latency and wire efficiency trade-offs). When one or more FPDUs are 901 already packed into a TCP Segment, MULPDU MAY be reduced accordingly. 903 DDP SHOULD provide ULPDUs that are as large as possible, but less 904 than or equal to MULPDU. 906 If the TCP implementation needs to adjust EMSS to support MTU changes 907 or changing TCP options, the MULPDU value is changed accordingly. 909 In certain rare situations, the EMSS may shrink below 128 octets in 910 size. 
   If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU
   below 128 octets and is not required to follow the segmentation
   rules in section 5.1 and Appendix A.

   If one or more FPDUs are already packed into a TCP segment, such
   that the remaining room is less than 128 octets, MPA MUST NOT
   provide a MULPDU smaller than 128. In this case, MPA would typically
   provide a MULPDU for the next full sized segment, but may still pack
   the next FPDU into the small remaining room, provided that the next
   FPDU is small enough to fit.

   The value 128 is chosen to allow DDP designers room for the DDP
   Header and some user data.

5 MPA's interactions with TCP

   The following sections describe MPA's interactions with TCP. This
   section discusses using a standard layered TCP stack with MPA
   attached above a TCP socket. Discussion of using an optimized
   MPA-aware TCP with an MPA implementation that takes advantage of the
   extra optimizations is done in Appendix A.

       +-----------------------------------+
       | +-----+       +-----------------+ |
       | | MPA |       | Other Protocols | |
       | +-----+       +-----------------+ |
       |   ||                  ||          |
       |  ----- socket API --------------  |
       |                 ||                |
       |               +-----+             |
       |               | TCP |             |
       |               +-----+             |
       |                 ||                |
       |               +-----+             |
       |               | IP  |             |
       |               +-----+             |
       +-----------------------------------+

                Figure 7 Fully layered implementation

   The Fully layered implementation is described for completeness;
   however, the user is cautioned that the reduced probability of FPDU
   alignment when transmitting with this implementation will tend to
   introduce a higher overhead at optimized receivers. In addition, the
   lack of out-of-order receive processing will significantly reduce
   the value of DDP/MPA by imposing higher buffering and copying
   overhead in the local receiver.
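   Because the next section asks transmitters to calculate a MULPDU,
   the two formulas of section 4.5 are restated here as a
   non-normative Python sketch. The function name is hypothetical, and
   applying the 128 octet floor with max() is one reading of the "MUST
   NOT shrink the MULPDU below 128 octets" rule, not the only possible
   one.

```python
MIN_MULPDU = 128       # floor from section 4.5
FPDU_OVERHEAD = 6      # 2 octet ULPDU_Length + 4 octet CRC
MARKER_INTERVAL = 512  # one 4 octet Marker per 512 octets of stream


def mulpdu(emss: int, markers: bool) -> int:
    """Largest ULPDU that fits one FPDU in an otherwise empty segment."""
    overhead = FPDU_OVERHEAD + emss % 4
    if markers:
        # Worst-case Marker count: Ceiling(EMSS / 512), 4 octets each.
        overhead += 4 * -(-emss // MARKER_INTERVAL)
    return max(MIN_MULPDU, emss - overhead)
```

   With an EMSS of 1460 octets this gives 1442 octets when Markers are
   in use (6 + 4 * 3 + 0 octets of overhead) and 1454 octets when they
   are not.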
5.1 MPA transmitters with a standard layered TCP

   MPA transmitters SHOULD calculate a MULPDU as described in section
   4.5. If the TCP implementation allows EMSS to be determined by MPA,
   that value should be used. If the transmit side TCP implementation
   is not able to report the EMSS, MPA SHOULD use the current MTU value
   to establish a likely FPDU size, taking into account the various
   expected header sizes.

   MPA transmitters SHOULD also use whatever facilities the TCP stack
   presents to cause the TCP transmitter to start TCP segments at FPDU
   boundaries. Multiple FPDUs MAY be packed into a single TCP segment
   as determined by the EMSS calculation as long as they are entirely
   contained in the TCP segment.

   For example, passing FPDU buffers sized to the current EMSS to the
   TCP socket and using the TCP_NODELAY socket option to disable the
   Nagle [RFC0896] algorithm will usually result in many of the
   segments starting with an FPDU.

   It is recognized that various effects can cause FPDU alignment to be
   lost. Following are a few of the effects:

   * ULPDUs that are smaller than the MULPDU. If these are sent in a
     continuous stream, FPDU alignment will be lost. Note that careful
     use of a dynamic MULPDU can help in this case; the MULPDU for
     future FPDUs can be adjusted to re-establish alignment with the
     segments based on the current EMSS.

   * Sending enough data that the TCP receive window limit is reached.
     TCP may send a smaller segment to exactly fill the receive window.

   * Sending data when TCP is operating up against the congestion
     window. If TCP is not tracking the congestion window in segments,
     it may transmit a smaller segment to exactly fill the congestion
     window.

   * Changes in EMSS due to varying TCP options, or changes in MTU.
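   The TCP_NODELAY suggestion made earlier in this section can be
   sketched with the standard sockets API. This is a non-normative
   Python illustration; the function name is hypothetical, and it only
   shows how the Nagle algorithm is disabled, not how FPDU buffers are
   sized to the EMSS.

```python
import socket


def open_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled.

    With Nagle off, each FPDU buffer handed to send() is more likely to
    leave as its own segment, so segments tend to start with an FPDU.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

   Disabling Nagle only improves the odds of alignment; the effects
   listed above can still break it, which is why receivers must cope
   with unaligned FPDUs.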
   If FPDU alignment with TCP segments is lost for any reason, the
   alignment is regained after a break in transmission where the TCP
   send buffers are emptied. Many usage models for DDP/MPA will include
   such breaks.

   MPA receivers are REQUIRED to be able to operate correctly even if
   alignment is lost (see section 6).

5.2 MPA receivers with a standard layered TCP

   MPA receivers will get TCP data in the usual ordered stream. The
   receivers MUST identify FPDU boundaries by using the ULPDU_Length
   field, as described in section 6. Receivers MAY utilize Markers to
   check for FPDU boundary consistency, but they are not required to
   examine the Markers to determine the FPDU boundaries.

6 MPA Receiver FPDU Identification

   An MPA receiver MUST first verify the FPDU before passing the ULPDU
   to DDP. To do this, the receiver MUST:

   * locate the start of the FPDU unambiguously,

   * verify its CRC (if CRC checking is enabled).

   If the above conditions are true, the MPA receiver passes the ULPDU
   to DDP.

   To detect the start of the FPDU unambiguously one of the following
   MUST be used:

   1: In an ordered TCP stream, the ULPDU Length field in the current
      FPDU, when that FPDU has a valid CRC, can be used to identify the
      beginning of the next FPDU.

   2: For optimized MPA/TCP receivers that support out of order
      reception of FPDUs (see section 4.3 MPA Markers) a Marker can
      always be used to locate the beginning of an FPDU (in FPDUs with
      valid CRCs). Since the location of the Marker is known in the
      octet stream (sequence number space), the Marker can always be
      found.

   3: Having found an FPDU by means of a Marker, an optimized MPA/TCP
      receiver can find following contiguous FPDUs by using the ULPDU
      Length fields (from FPDUs with valid CRCs) to establish the next
      FPDU boundary.
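   Method 2 relies on Marker positions being computable from sequence
   numbers alone. The following non-normative Python sketch shows that
   arithmetic, including 32-bit sequence number wrap. The function
   names are hypothetical; seq_start stands for the sequence number of
   the first Marker (SeqStart in section 4.3).

```python
MARKER_INTERVAL = 512
SEQ_MOD = 1 << 32  # TCP sequence numbers are 32 bits and wrap


def next_marker_seq(seq: int, seq_start: int) -> int:
    """Sequence number of the first Marker at or after `seq`."""
    offset = (seq - seq_start) % SEQ_MOD   # distance from the first Marker
    gap = -offset % MARKER_INTERVAL        # octets until the next Marker
    return (seq + gap) % SEQ_MOD


def fpdu_header_seq(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the ULPDU Length field a Marker points at."""
    fpduptr &= 0xFFFC                      # low two bits are treated as zero
    if fpduptr == 0:
        # Marker fell between FPDUs: the header follows the 4 octet Marker.
        return (marker_seq + 4) % SEQ_MOD
    return (marker_seq - fpduptr) % SEQ_MOD
```

   With a connection started at sequence number 11, the Markers fall at
   11, 523, 1035, and so on, as in the example of section 4.3. For the
   second FPDU in the hex dump of section 4.4, a Marker at octet 0x200
   with FPDUPTR 0x0014 points back to the header at octet 0x1ec.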
1045 The ULPDU Length field (see section 4 on page 14) MUST be used to 1046 determine if the entire FPDU is present before forwarding the ULPDU 1047 to DDP. 1049 CRC calculation is discussed in section 4.4 on page 18 above. 1051 7 Connection Semantics 1053 7.1 Connection setup 1055 MPA requires that the Consumer MUST activate MPA, and any TCP 1056 enhancements for MPA, on a TCP half connection at the same location 1057 in the octet stream at both the sender and the receiver. This is 1058 required in order for the Marker scheme to correctly locate the 1059 Markers (if enabled) and to correctly locate the first FPDU. 1061 MPA, and any TCP enhancements for MPA are enabled by the ULP in both 1062 directions at once at an endpoint. 1064 This can be accomplished several ways, and is left up to DDP's ULP: 1066 * DDP's ULP MAY require DDP on MPA startup immediately after TCP 1067 connection setup. This has the advantage that no streaming mode 1068 negotiation is needed. An example of such a protocol is shown in 1069 Figure 10: Example Immediate Startup negotiation on page 36. 1071 This may be accomplished by using a well-known port, or a service 1072 locator protocol to locate an appropriate port on which DDP on 1073 MPA is expected to operate. 1075 * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a 1076 normal TCP startup, using TCP streaming data exchanges on the 1077 same connection. The exchange establishes that DDP on MPA (as 1078 well as other ULPs) will be used, and exactly locates the point 1079 in the octet stream where MPA is to begin operation. Note that 1080 such a negotiation protocol is outside the scope of this 1081 specification. A simplified example of such a protocol is shown 1082 in Figure 9: Example Delayed Startup negotiation on page 33. 1084 An MPA endpoint operates in two distinct phases. 
1086 The Startup Phase is used to verify correct MPA setup, exchange CRC 1087 and Marker configuration, and optionally pass Private Data between 1088 endpoints prior to completing a DDP connection. During this phase, 1089 specifically formatted frames are exchanged as TCP byte streams 1090 without using CRCs or Markers. During this phase a DDP endpoint need 1091 not be "bound" to the MPA connection. In fact, the choice of DDP 1092 endpoint and its operating parameters may not be known until the 1093 Consumer supplied Private Data (if any) has been examined by the 1094 Consumer. 1096 The second distinct phase is Full Operation during which FPDUs are 1097 sent using all the rules that pertain (CRCs, Markers, MULPDU 1098 restrictions etc.). A DDP endpoint MUST be "bound" to the MPA 1099 connection at entry to this phase. 1101 When Private Data is passed between ULPs in the Startup Phase, the 1102 ULP is responsible for interpreting that data, and then placing MPA 1103 into Full Operation. 1105 Note: The following text differentiates the two endpoints by calling 1106 them Initiator and Responder. This is quite arbitrary and is NOT 1107 related to the TCP startup (SYN, SYN/ACK sequence). The 1108 Initiator is the side that sends first in the MPA startup 1109 sequence (the MPA Request Frame). 1111 Note: The possibility that both endpoints would be allowed to make a 1112 connection at the same time, sometimes called an active/active 1113 connection, was considered by the work group and rejected. There 1114 were several motivations for this decision. One was that 1115 applications needing this facility were few (none other than 1116 theoretical at the time of this draft). Another was that the 1117 facility created some implementation difficulties, particularly 1118 with the "dual stack" designs described later on. 
A last issue 1119 was that dealing with rejected connections at startup would have 1120 required at least an additional frame type, and more recovery 1121 actions, complicating the protocol. While none of these issues 1122 was overwhelming, the group and implementers were not motivated 1123 to do the work to resolve these issues. The protocol includes a 1124 method of detecting these active/active startup attempts so that 1125 they can be rejected and an error reported. 1127 The ULP is responsible for determining which side is Initiator or 1128 Responder. For client/server type ULPs this is easy. For peer-peer 1129 ULPs (which might utilize a TCP style active/active startup), some 1130 mechanism (not defined by this specification) must be established, or 1131 some streaming mode data exchanged prior to MPA startup to determine 1132 the side which starts in Initiator and which starts in Responder MPA 1133 mode. 1135 7.1.1 MPA Request and Reply Frame Format 1137 0 1 2 3 1138 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1139 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1140 0 | | 1141 + Key (16 bytes containing "MPA ID Req Frame") + 1142 4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) | 1143 + Or (16 bytes containing "MPA ID Rep Frame") + 1144 8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) | 1145 + + 1146 12 | | 1147 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1148 16 |M|C|R| Res | Rev | PD_Length | 1149 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1150 | | 1151 ~ ~ 1152 ~ Private Data ~ 1153 | | 1154 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1155 | | 1156 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1157 Figure 8 MPA Request/Reply Frame 1159 Key: This field contains the "key" used to validate that the sender 1160 is an MPA sender. 
       Initiator mode senders MUST set this field to the fixed value
       "MPA ID Req Frame" or (in byte order) 4D 50 41 20 49 44 20 52 65
       71 20 46 72 61 6D 65 (in hexadecimal). Responder mode receivers
       MUST check this field for the same value, and close the
       connection and report an error locally if any other value is
       detected. Responder mode senders MUST set this field to the
       fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20 49
       44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
       mode receivers MUST check this field for the same value, and
       close the connection and report an error locally if any other
       value is detected.

   M: This bit declares an endpoint's REQUIRED Marker usage. When this
       bit is '1' in an MPA Request Frame, the Initiator declares that
       Markers are REQUIRED in FPDUs sent from the Responder. When set
       to '1' in an MPA Reply Frame, this bit declares that Markers are
       REQUIRED in FPDUs sent from the Initiator. When the bit is '0'
       in a received MPA Request Frame or MPA Reply Frame, the
       receiving endpoint MUST NOT add Markers to the data stream. When
       it is '1', Markers MUST be added as described in section 4.3 MPA
       Markers.

   C: This bit declares an endpoint's preferred CRC usage. When this
       bit is '0' in both the MPA Request Frame and the MPA Reply
       Frame, CRCs MUST NOT be checked and need not be generated by
       either endpoint. When this bit is '1' in either the MPA Request
       Frame or MPA Reply Frame, CRCs MUST be generated and checked by
       both endpoints. Note that even when not in use, the CRC field
       remains present in the FPDU. When CRCs are not in use, the CRC
       field MUST be considered valid for FPDU checking regardless of
       its contents.

   R: This bit is set to zero, and not checked on reception in the MPA
       Request Frame.
       In the MPA Reply Frame, this bit is the Rejected Connection bit,
       set by the Responder's ULP to indicate acceptance '0', or
       rejection '1', of the connection parameters provided in the
       Private Data.

   Res: This field is reserved for future use. It MUST be set to zero
       when sending, and not checked on reception.

   Rev: This field contains the Revision of MPA. For this version of
       the specification senders MUST set this field to one. MPA
       receivers compliant with this version of the specification MUST
       check this field. If the MPA receiver cannot interoperate with
       the received version, then it MUST close the connection and
       report an error locally. Otherwise, the MPA receiver should
       report the received version to the ULP.

   PD_Length: This field MUST contain the length in octets of the
       Private Data field. A value of zero indicates that there is no
       Private Data field present at all. If the receiver detects that
       the PD_Length field does not match the length of the Private
       Data field, or if the length of the Private Data field exceeds
       512 octets, the receiver MUST close the connection and report an
       error locally. Otherwise, the MPA receiver should pass the
       PD_Length value and Private Data to the ULP.

   Private Data: This field may contain any value defined by ULPs or
       may not be present. The Private Data field MUST be between 0 and
       512 octets in length. ULPs define how to size, set, and validate
       this field within these limits. Private Data usage is further
       discussed in section 7.1.4.

7.1.2 Connection Startup Rules

   The following rules apply to MPA connection Startup Phase:

   1. When MPA is started in the Initiator mode, the MPA implementation
       MUST send a valid MPA Request Frame. The MPA Request Frame MAY
       include ULP supplied Private Data.

   2.
       When MPA is started in the Responder mode, the MPA
       implementation MUST wait until an MPA Request Frame is received
       and validated before entering full MPA/DDP operation.

       If the MPA Request Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Request Frame is properly formatted but the Private
       Data is not acceptable, the implementation SHOULD return an MPA
       Reply Frame with the Rejected Connection bit set to '1'; the MPA
       Reply Frame MAY include ULP supplied Private Data; the
       implementation MUST exit MPA, leaving the TCP connection open.
       The ULP may close TCP or use the connection for other purposes.

       If the MPA Request Frame is properly formatted and the Private
       Data is acceptable, the implementation SHOULD return an MPA
       Reply Frame with the Rejected Connection bit set to '0'; the MPA
       Reply Frame MAY include ULP supplied Private Data; and the
       Responder SHOULD prepare to interpret any data received as FPDUs
       and pass any received ULPDUs to DDP.

       Note: Since the receiver's ability to deal with Markers is
          unknown until the Request and Reply frames have been
          received, sending FPDUs before this occurs is not possible.

       Note: The requirement to wait on a Request Frame before sending
          a Reply Frame is a design choice; it makes for a well-ordered
          sequence of events at each end, and avoids having to specify
          how to deal with situations where both ends start at the same
          time.

   3. MPA Initiator mode implementations MUST receive and validate an
       MPA Reply Frame.

       If the MPA Reply Frame is improperly formatted, the
       implementation MUST close the TCP connection and exit MPA.

       If the MPA Reply Frame is properly formatted but the Private
       Data is not acceptable, or if the Rejected Connection bit is set
       to '1', the implementation MUST exit MPA, leaving the TCP
       connection open.
       The ULP may close TCP or use the connection for other purposes.

       If the MPA Reply Frame is properly formatted and the Private
       Data is acceptable, and the Rejected Connection bit is set to
       '0', the implementation SHOULD enter full MPA/DDP operation
       mode, interpreting any received data as FPDUs and sending DDP
       ULPDUs as FPDUs.

   4. MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: this requirement is present to allow the Initiator time to
          get its receiver into Full Operation before an FPDU arrives,
          avoiding potential race conditions at the Initiator. This was
          also subject to some debate in the work group before rough
          consensus was reached. Eliminating this requirement would
          allow faster startup in some types of applications. However,
          that would also make certain implementations (particularly
          "dual stack") much harder.

   5. If a received "Key" does not match the expected value (see
       section 7.1.1 MPA Request and Reply Frame Format above), the
       TCP/DDP connection MUST be closed, and an error returned to the
       ULP.

   6. The received Private Data fields may be used by Consumers at
       either end to further validate the connection, and set up DDP or
       other ULP parameters. The Initiator ULP MAY close the
       TCP/MPA/DDP connection as a result of validating the Private
       Data fields. The Responder SHOULD return an MPA Reply Frame with
       the Rejected Connection bit set to '1' if the validation of the
       Private Data is not acceptable to the ULP.

   7. When the first FPDU is to be sent, then if Markers are enabled,
       the first octets sent are the special Marker 0x00000000,
       followed by the start of the FPDU (the FPDU's ULPDU Length
       field). If Markers are not enabled, the first octets sent are
       the start of the FPDU (the FPDU's ULPDU Length field).

   8.
       MPA implementations MUST use the difference between the MPA
       Request Frame and the MPA Reply Frame to check for incorrect
       "Initiator/Initiator" startups. Implementations SHOULD put a
       timeout on waiting for the MPA Request Frame when started in
       Responder mode, to detect incorrect "Responder/Responder"
       startups.

   9. MPA implementations MUST validate the PD_Length field. The
       buffer that receives the Private Data field MUST be large enough
       to receive that data; the amount of Private Data MUST NOT exceed
       the PD_Length, or the application buffer. If any of the above
       fails, the startup frame MUST be considered improperly
       formatted.

   10. MPA implementations SHOULD implement a reasonable timeout while
       waiting for entire startup frames; this prevents certain denial
       of service attacks. ULPs SHOULD implement a reasonable timeout
       while waiting for FPDUs, ULPDUs and application level messages
       to guard against application failures and certain denial of
       service attacks.

7.1.3 Example Delayed Startup sequence

   A variety of startup sequences are possible when using MPA on TCP.
   Following is an example of an MPA/DDP startup that occurs after TCP
   has been running for a while and has exchanged some amount of
   streaming data. This example does not use any Private Data (an
   example that does is shown later in section 7.1.4.2 Example
   Immediate Startup using Private Data), although it is perfectly
   legal to include the Private Data. Note that since the example does
   not use any Private Data, there are no ULP interactions shown
   between receiving "Startup frames" and putting MPA into Full
   Operation.
1343 Initiator Responder 1345 +---------------------------+ 1346 |ULP streaming mode | 1347 | request to | 1348 | transition to DDP/MPA | +--------------------------+ 1349 | mode (optional) | --------> |ULP gets request; | 1350 +---------------------------+ |enables MPA Responder mode| 1351 |with last (optional) | 1352 |streaming mode | 1353 |for MPA to send. | 1354 +---------------------------+ |MPA waits for incoming | 1355 |ULP receives streaming | <-------- | | 1356 | ; | +--------------------------+ 1357 |Enters MPA Initiator mode; | 1358 |MPA sends | 1359 | ; | 1360 |MPA waits for incoming | +--------------------------+ 1361 | |MPA receives | 1362 +---------------------------+ | | 1363 |Consumer binds DDP to MPA,| 1364 |MPA sends the | 1365 | . | 1366 |DDP/MPA enables FPDU | 1367 +---------------------------+ |decoding, but does not | 1368 |MPA receives the | < - - - - |send any FPDUs. | 1369 | | +--------------------------+ 1370 |Consumer binds DDP to MPA, | 1371 |DDP/MPA begins full | 1372 |operation. | 1373 |MPA sends first FPDU (as | +--------------------------+ 1374 |DDP ULPDUs become | ========> |MPA Receives first FPDU. | 1375 |available). | |MPA sends first FPDU (as | 1376 +---------------------------+ |DDP ULPDUs become | 1377 <====== |available. | 1378 +--------------------------+ 1379 Figure 9: Example Delayed Startup negotiation 1381 An example Delayed Startup sequence is described below: 1383 * Active and passive sides start up a TCP connection in the 1384 usual fashion, probably using sockets APIs. They exchange 1385 some amount of streaming mode data. At some point one side 1386 (the MPA Initiator) sends streaming mode data that 1387 effectively says "Hello, Lets go into MPA/DDP mode." 1389 * When the remote side (the MPA Responder) gets this streaming mode 1390 message, the Consumer would send a last streaming mode message 1391 that effectively says "I Acknowledge your Hello, and am now in 1392 MPA Responder Mode". 
      The exchange of these messages establishes
      the exact point in the TCP stream where MPA is enabled. The
      Responding Consumer enables MPA in the Responder mode and waits
      for the initial MPA startup message.

   *  The Initiating Consumer would enable MPA startup in the
      Initiator mode, which then sends the MPA Request Frame. It is
      assumed that no Private Data messages are needed for this
      example, although it is possible to include them. The Initiating
      MPA (and Consumer) would also wait for the MPA connection to
      be accepted.

   *  The Responding MPA would receive the initial MPA Request Frame
      and would inform the Consumer that this message arrived. The
      Consumer can then accept the MPA/DDP connection or close the TCP
      connection.

   *  To accept the connection request, the Responding Consumer would
      use an appropriate API to bind the TCP/MPA connections to a DDP
      endpoint, thus enabling MPA/DDP into Full Operation. In the
      process of going to Full Operation, MPA sends the MPA Reply
      Frame. MPA/DDP waits for the first incoming FPDU before sending
      any FPDUs.

   *  If the initial TCP data was not a properly formatted MPA Request
      Frame, MPA will close or reset the TCP connection immediately.

   *  The Initiating MPA would receive the MPA Reply Frame and
      would report this message to the Consumer. The Consumer can
      then accept the MPA/DDP connection, or close or reset the TCP
      connection to abort the process.

   *  On determining that the Connection is acceptable, the
      Initiating Consumer would use an appropriate API to bind the
      TCP/MPA connections to a DDP endpoint, thus enabling MPA/DDP
      into Full Operation. MPA/DDP would begin sending DDP
      messages as MPA FPDUs.

7.1.4 Use of Private Data

   This section is advisory in nature, in that it suggests a method by
   which a ULP can deal with pre-DDP connection information exchange.

7.1.4.1 Motivation

   Prior RDMA protocols have been developed that provide Private Data
   via out of band mechanisms. As a result, many applications now
   expect some form of Private Data to be available for application use
   prior to setting up the DDP/RDMA connection. Following are some
   examples of the use of Private Data.

   An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
   and the [VERBS]) must be associated with a Protection Domain. No
   receive operations may be posted to the endpoint before it is
   associated with a Protection Domain. Indeed under both the
   InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is
   created within a Protection Domain.

   There are some applications where the choice of Protection Domain is
   dependent upon the identity of the remote ULP client. For example,
   if a user session requires multiple connections, it is highly
   desirable for all of those connections to use a single Protection
   Domain. Note: use of Protection Domains is further discussed in
   [RDMASEC].

   InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for
   the active side ULP to provide Private Data when requesting a
   connection. This data is passed to the ULP to allow it to determine
   whether to accept the connection, and if so with which endpoint (and
   implicitly which Protection Domain).

   The Private Data can also be used to ensure that both ends of the
   connection have configured their RDMA endpoints compatibly on such
   matters as the RDMA Read capacity (see [RDMAP]). Further ULP-
   specific uses are also presumed, such as establishing the identity of
   the client.

   Private Data is also allowed for when accepting the connection, to
   allow completion of any negotiation on RDMA resources and for other
   ULP reasons.

   There are several potential ways to exchange this Private Data.
   For example, the InfiniBand specification includes a connection
   management protocol that allows a small amount of Private Data to be
   exchanged using datagrams before actually starting the RDMA
   connection.

   This draft allows for small amounts of Private Data to be exchanged
   as part of the MPA startup sequence. The actual Private Data fields
   are carried in the MPA Request Frame and the MPA Reply Frame.

   If larger amounts of Private Data or more negotiation is necessary,
   TCP streaming mode messages may be exchanged prior to enabling MPA.

7.1.4.2 Example Immediate Startup using Private Data

         Initiator                                 Responder

   +---------------------------+
   |TCP SYN sent               |           +--------------------------+
   +---------------------------+ --------> |TCP gets SYN packet;      |
   +---------------------------+           |  Sends SYN-Ack           |
   |TCP gets SYN-Ack           | <-------- +--------------------------+
   |  Sends Ack                |
   +---------------------------+ --------> +--------------------------+
   +---------------------------+           |Consumer enables MPA      |
   |Consumer enables MPA       |           |Responder Mode, waits for |
   |Initiator mode with        |           |  <MPA Request Frame>     |
   |Private Data; MPA sends    |           +--------------------------+
   |  <MPA Request Frame>;     |
   |MPA waits for incoming     |           +--------------------------+
   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
   +---------------------------+           |  <MPA Request Frame>     |
                                           |Consumer examines Private |
                                           |Data, provides MPA with   |
                                           |return Private Data,      |
                                           |binds DDP to MPA, and     |
                                           |enables MPA to send an    |
                                           |  <MPA Reply Frame>.      |
                                           |DDP/MPA enables FPDU      |
   +---------------------------+           |decoding, but does not    |
   |MPA receives the           | < - - - - |send any FPDUs.           |
   |  <MPA Reply Frame>        |           +--------------------------+
   |Consumer examines Private  |
   |Data, binds DDP to MPA,    |
   |and enables DDP/MPA to     |
   |begin Full Operation.      |
   |MPA sends first FPDU (as   |           +--------------------------+
   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
   |available).                |           |MPA sends first FPDU (as  |
   +---------------------------+           |DDP ULPDUs become         |
                                   <====== |available.                |
                                           +--------------------------+

          Figure 10: Example Immediate Startup negotiation

   Note: the exact order of when MPA is started in the TCP connection
      sequence is implementation dependent; the above diagram shows one
      possible sequence. Also, the Initiator "Ack" to the Responder's
      "SYN-Ack" may be combined into the same TCP segment containing
      the MPA Request Frame (as is allowed by TCP RFCs).

   The example immediate startup sequence is described below:

   *  The passive side (Responding Consumer) would listen on the TCP
      destination port, to indicate its readiness to accept a
      connection.

   *  The active side (Initiating Consumer) would request a
      connection from a TCP endpoint (that expected to upgrade to
      MPA/DDP/RDMA and expected the Private Data) to a destination
      address and port.

   *  The Initiating Consumer would initiate a TCP connection to
      the destination port. Acceptance/rejection of the connection
      would proceed as per normal TCP connection establishment.

   *  The passive side (Responding Consumer) would receive the TCP
      connection request as usual, allowing normal TCP gatekeepers, such
      as INETD and TCPserver, to exercise their normal
      safeguard/logging functions. On acceptance of the TCP
      connection, the Responding Consumer would enable MPA in the
      Responder mode and wait for the initial MPA startup message.

   *  The Initiating Consumer would enable MPA startup in the
      Initiator mode to send an initial MPA Request Frame that
      includes its Private Data message. The Initiating MPA
      (and Consumer) would also wait for the MPA connection to be
      accepted, and any returned Private Data.

   *  The Responding MPA would receive the initial MPA Request Frame
      with the Private Data message and would pass the Private Data
      through to the Consumer. The Consumer can then accept the
      MPA/DDP connection, close the TCP connection, or reject the MPA
      connection with a return message.

   *  To accept the connection request, the Responding Consumer would
      use an appropriate API to bind the TCP/MPA connections to a DDP
      endpoint, thus enabling MPA/DDP into Full Operation. In the
      process of going to Full Operation, MPA sends the MPA Reply Frame,
      which includes the Consumer supplied Private Data containing any
      appropriate Consumer response. MPA/DDP waits for the first
      incoming FPDU before sending any FPDUs.

   *  If the initial TCP data was not a properly formatted MPA Request
      Frame, MPA will close or reset the TCP connection immediately.

   *  To reject the MPA connection request, the Responding Consumer
      would send an MPA Reply Frame with any ULP supplied Private Data
      (with reason for rejection), with the "Rejected Connection" bit
      set to '1', and may close the TCP connection.

   *  The Initiating MPA would receive the MPA Reply Frame with the
      Private Data message and would report this message to the
      Consumer, including the supplied Private Data.

      If the "Rejected Connection" bit is set to '1', MPA will
      close the TCP connection and exit.

      If the "Rejected Connection" bit is set to '0', and on
      determining from the MPA Reply Frame Private Data that the
      Connection is acceptable, the Initiating Consumer would use
      an appropriate API to bind the TCP/MPA connections to a DDP
      endpoint, thus enabling MPA/DDP into Full Operation. MPA/DDP
      would begin sending DDP messages as MPA FPDUs.

7.1.5 "Dual stack" implementations

   MPA/DDP implementations are commonly expected to be implemented as
   part of a "dual stack" architecture.
   One "stack" is the traditional
   TCP stack, usually with a sockets interface API (Application
   Programming Interface). The second stack is the MPA/DDP "stack" with
   its own API, and potentially separate code or hardware to deal with
   the MPA/DDP data. Of course, implementations may vary, so the
   following comments are of an advisory nature only.

   The use of the two "stacks" offers advantages:

      TCP connection setup is usually done with the TCP stack. This
      allows use of the usual naming and addressing mechanisms. It
      also means that any mechanisms used to "harden" the connection
      setup against security threats are also used when starting
      MPA/DDP.

      Some applications may have been originally designed for TCP, but
      are "enhanced" to utilize MPA/DDP after a negotiation reveals
      the capability to do so. The negotiation process takes place in
      TCP's streaming mode, using the usual TCP APIs.

      Some new applications, designed for RDMA or DDP, still need to
      exchange some data prior to starting MPA/DDP. This exchange can
      be of arbitrary length or complexity, but often consists of only
      a small amount of Private Data, perhaps only a single message.
      Using the TCP streaming mode for this exchange allows this to be
      done using well understood methods.

   The main disadvantage of using two stacks is the conversion of an
   active TCP connection between them. This process must be done with
   care to prevent loss of data.

   To avoid some of the problems when using a "dual stack" architecture
   the following additional restrictions may be required by the
   implementation:

   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
       stream data is expected. This is typically managed by the ULP
       protocol.
       When following the recommended startup sequence, the
       Responder side enters DDP/MPA mode, sends the last streaming mode
       data, and then waits for the MPA Request Frame. No additional
       streaming mode data is expected. The Initiator side ULP receives
       the last streaming mode data, and then enters DDP/MPA mode.
       Again, no additional streaming mode data is expected.

   2.  The DDP/MPA MAY provide the ability to send a "last streaming
       message" as part of its Responder DDP/MPA enable function. This
       allows the DDP/MPA stack to more easily manage the conversion to
       DDP/MPA mode (and avoid problems with a very fast return of the
       MPA Request Frame from the Initiator side).

   Note: Regardless of the "stack" architecture used, TCP's rules MUST
      be followed. For example, if network data is lost, re-segmented
      or re-ordered, TCP MUST recover appropriately even when this
      occurs while switching stacks.

7.2 Normal Connection Teardown

   Each half connection of MPA terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
   that a graceful close of the TCP connection has been received by the
   TCP (e.g., FIN is received).

8 Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to DDP or other Consumer:

   Code Error

   1    TCP connection closed, terminated or lost. This includes lost
        by timeout, too many retries, RST received or FIN received.

   2    Received MPA CRC does not match the calculated value for the
        FPDU.

   3    In the event that the CRC is valid, received MPA Marker (if
        enabled) and ULPDU Length fields do not agree on the start of
        an FPDU. If the FPDU start determined from previous ULPDU
        Length fields does not match with the MPA Marker position, MPA
        SHOULD deliver an error to DDP.
        It may not be possible to
        make this check as a segment arrives, but the check SHOULD be
        made when a gap creating an out of order sequence is closed
        and any time a Marker points to an already identified FPDU.
        It is OPTIONAL for a receiver to check each Marker, if
        multiple Markers are present in an FPDU, or if the segment is
        received in order.

   4    Invalid MPA Request Frame or MPA Reply Frame received. In
        this case, the TCP connection MUST be immediately closed. DDP
        and other ULPs should treat this similarly to code 1, above.

   When conditions 2 or 3 above are detected, an optimized MPA/TCP
   implementation MAY choose to silently drop the TCP segment rather
   than reporting the error to DDP. In this case, the sending TCP will
   retry the segment, usually correcting the error, unless the problem
   was at the source. In that case, the source will usually exceed the
   number of retries and terminate the connection.

   Once MPA delivers an error of any type, it MUST NOT pass or deliver
   any additional FPDUs on that half connection.

   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
   following a reported error. Closing the connection is the
   responsibility of DDP's ULP.

      Note that since MPA will not Deliver any FPDUs on a half
      connection following an error detected on the receive side of
      that connection, DDP's ULP is expected to tear down the
      connection. This may not occur until after one or more last
      messages are transmitted on the opposite half connection. This
      allows a diagnostic error message to be sent.

9 Security Considerations

   This section discusses the security considerations for MPA.

9.1 Protocol-specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   those of any other protocol running over TCP.
   A third party, by sending
   packets into the network that are delivered to an MPA receiver, could
   launch a variety of attacks that take advantage of how MPA operates.
   For example, a third party could send random packets that are valid
   for TCP, but contain no FPDU headers. An MPA receiver reports an
   error to DDP when any packet arrives that cannot be validated as an
   FPDU when properly located on an FPDU boundary. A third party could
   also send packets that are valid for TCP, MPA, and DDP, but do not
   target valid buffers. These types of attacks ultimately result in
   loss of connection and thus become a type of DOS (Denial Of Service)
   attack. Communication security mechanisms such as IPsec [RFC2401]
   may be used to prevent such attacks.

   Independent of how MPA operates, a third party could use ICMP
   messages to reduce the path MTU to such a small size that performance
   would likewise be severely impacted. Range checking on path MTU
   sizes in ICMP packets may be used to prevent such attacks.

   [RDMAP] and [DDP] are used to control, read and write data buffers
   over IP networks. Therefore, the control and the data packets of
   these protocols are vulnerable to the spoofing, tampering and
   information disclosure attacks listed below. In addition, Connection
   to/from an unauthorized or unauthenticated endpoint is a potential
   problem with most applications using RDMA, DDP, and MPA.

9.1.1 Spoofing

   Spoofing attacks can be launched by the Remote Peer, or by a network
   based attacker. A network based spoofing attack applies to all
   Remote Peers. Because the MPA Stream requires a TCP Stream in the
   ESTABLISHED state, certain types of traditional forms of wire attacks
   do not apply -- an end-to-end handshake must have occurred to
   establish the MPA Stream.
   So, the only form of spoofing that applies
   is one in which a remote node can both send and receive packets. Yet
   even with this limitation the Stream is still exposed to the
   following spoofing attacks.

9.1.1.1 Impersonation

   A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
   (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
   Stream with the victim. End to end authentication (e.g., IPsec or ULP
   authentication) provides protection against this attack.

9.1.1.2 Stream Hijacking

   Stream hijacking happens when a network based attacker follows the
   Stream establishment phase, and waits until the authentication phase
   (if such a phase exists) is completed successfully. The attacker can
   then spoof the IP address and re-direct the Stream from the victim to
   the attacker's own machine. For example, an attacker can wait until
   an iSCSI authentication is completed successfully, and hijack the
   iSCSI Stream.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing. Another option is to provide physical security.
   Discussion of physical security is out of scope for this document.

9.1.1.3 Man in the Middle Attack

   If a network based attacker has the ability to delete, inject,
   replay, or modify packets which will still be accepted by MPA (e.g.,
   TCP sequence number is correct, FPDU is valid, etc.) then the Stream
   can be exposed to a man in the middle attack. The attacker could
   potentially use the services of [DDP] and [RDMAP] to read the
   contents of the associated data buffer, modify the contents of the
   associated data buffer, or to disable further access to the buffer.
   Other attacks on the connection setup sequence and even on TCP can be
   used to cause denial of service.
   The only countermeasure for this
   form of attack is to either secure the MPA/DDP/RDMAP Stream (i.e.,
   integrity protect it) or attempt to provide physical security to
   prevent man-in-the-middle type attacks.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing or tampering. If Stream or session level authentication and
   integrity protection are not used, then a man-in-the-middle attack
   can occur, enabling spoofing and tampering.

   Another approach is to restrict access to only the local subnet/link,
   and provide some mechanism to limit access, such as physical security
   or 802.1X. This model is an extremely limited deployment scenario,
   and will not be further examined here.

9.1.2 Eavesdropping

   Generally speaking, Stream confidentiality protects against
   eavesdropping. Stream and/or session authentication and integrity
   protection is a countermeasure against various spoofing and
   tampering attacks. The effectiveness of authentication and integrity
   against a specific attack depends on whether the authentication is
   machine level authentication (such as the one provided by IPsec), or
   ULP authentication.

9.2 Introduction to Security Options

   The following security services can be applied to an MPA/DDP/RDMAP
   Stream:

   1.  Session confidentiality - protects against eavesdropping.

   2.  Per-packet data source authentication - protects against the
       following spoofing attacks: network based impersonation, Stream
       hijacking, and man in the middle.

   3.  Per-packet integrity - protects against tampering done by
       network based modification of FPDUs (indirectly affecting buffer
       content through DDP services).

   4.  Packet sequencing - protects against replay attacks, which is
       a special case of the above tampering attack.

   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
   or Stream hijacking attacks, it is recommended that the Stream be
   authenticated, integrity protected, and protected from replay
   attacks; it may use confidentiality protection to protect from
   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
   network).

   IPsec is capable of providing the above security services for IP and
   TCP traffic.

   ULP protocols may be able to provide part of the above security
   services. See [NFSv4CHANNEL] for additional information on a
   promising approach called "channel binding". From [NFSv4CHANNEL]:

      "The concept of channel bindings allows applications to prove
      that the end-points of two secure channels at different network
      layers are the same by binding authentication at one channel to
      the session protection at the other channel. The use of channel
      bindings allows applications to delegate session protection to
      lower layers, which may significantly improve performance for
      some applications."

9.3 Using IPsec With MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above. Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change. IPsec packets are
   processed (e.g., integrity checked and decrypted) in the order they
   are received, and an MPA receiver will process the decrypted FPDUs
   contained in these packets in the same manner as FPDUs contained in
   unsecured IP packets.

   MPA Implementations MUST implement IPsec as described in Section 9.4
   below. The use of IPsec is up to ULPs and administrators.

9.4 Requirements for IPsec Encapsulation of MPA/DDP

   The IP Storage working group has spent significant time and effort to
   define the normative IPsec requirements for IP Storage [RFC3723].
   Portions of that specification are applicable to a wide variety of
   protocols, including the RDDP protocol suite. In order to not
   replicate this effort, an MPA on TCP implementation MUST follow the
   requirements defined in RFC3723 Section 2.3 and Section 5, including
   the associated normative references for those sections.

   Additionally, since IPsec acceleration hardware may only be able to
   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
   messages MAY be sent for idle SAs, as a means of keeping the number
   of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
   delete message MUST NOT be interpreted as a reason for tearing down
   a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
   and if additional traffic is sent on it, to bring up another IKE
   Phase 2 SA to protect it. This avoids the potential for continually
   bringing Streams up and down.

   The IPsec requirements for RDDP are based on the version of IPsec
   specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC
   3723 [RFC3723], despite the existence of a newer version of IPsec
   specified in RFC 4301 [RFC4301] and related RFCs. One of the
   important early applications of the RDDP protocols is their use with
   iSCSI [iSER]; RDDP's IPsec requirements follow those of IPsec in
   order to facilitate that usage by allowing a common profile of IPsec
   to be used with iSCSI and the RDDP protocols. In the future, RFC
   3723 may be updated to the newer version of IPsec; the IPsec security
   requirements of any such update should apply uniformly to iSCSI and
   the RDDP protocols.

   Note that there are serious security issues if IPsec is not
   implemented end-to-end. For example, if IPsec is implemented as a
   tunnel in the middle of the network, any hosts between the peer and
   the IPsec tunneling device can freely attack the unprotected Stream.

10 IANA Considerations

   No IANA actions are required by this document.

   If a well-known port is chosen as the mechanism to identify a DDP on
   MPA on TCP, the well-known port must be registered with IANA.
   Because the use of the port is DDP specific, registration of the port
   with IANA is left to DDP.

A Appendix.
  Optimized MPA-aware TCP implementations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix covers some Optimized MPA-aware TCP implementation
   guidance to implementers. It is intended for those implementations
   that want to send/receive as much traffic as possible in an aligned
   and zero-copy fashion.

   +-----------------------------------+
   | +-----------+ +-----------------+ |
   | | Optimized | | Other Protocols | |
   | |  MPA/TCP  | +-----------------+ |
   | +-----------+        ||           |
   |         \\     --- socket API --- |
   |          \\          ||           |
   |           \\      +-----+         |
   |            \\     | TCP |         |
   |             \\    +-----+         |
   |              \\    //             |
   |             +-------+             |
   |             |  IP   |             |
   |             +-------+             |
   +-----------------------------------+

        Figure 11 Optimized MPA/TCP implementation

   The diagram above shows a block diagram of a potential
   implementation. The network sub-system in the diagram can support
   traditional sockets based connections using the normal API as shown
   on the right side of the diagram. Connections for DDP/MPA/TCP are
   run using the facilities shown on the left side of the diagram.

   The DDP/MPA/TCP connections can be started using the facilities shown
   on the left side using some suitable API, or they can be initiated
   using the facilities shown on the right side and transitioned to the
   left side at the point in the connection setup where MPA goes to
   "full MPA/DDP operation mode" as described in Section 7.1.2.

   The optimized MPA/TCP implementations (left side of diagram and
   described below) are only applicable to MPA; all other TCP
   applications continue to use the standard TCP stacks and interfaces
   shown in the right side of the diagram.

A.1 Optimized MPA/TCP transmitters

   The various TCP RFCs allow considerable choice in segmenting a TCP
   stream. In order to optimize FPDU recovery at the MPA receiver, an
   optimized MPA/TCP implementation uses additional segmentation rules.

   To provide optimum performance, an optimized MPA/TCP transmit side
   implementation should be enabled to:

   *  With an EMSS large enough to contain the FPDU(s), segment the
      outgoing TCP stream such that the first octet of every TCP
      segment is the first octet of an FPDU. Multiple FPDUs may be
      packed into a single TCP segment as long as they are entirely
      contained in the TCP segment.

   *  Report the current EMSS from the TCP to the MPA transmit layer.

   There are exceptions to the above rule. Once a ULPDU is provided to
   MPA, the MPA/TCP sender transmits it or fails the connection; it
   cannot be repudiated. As a result, during changes in MTU and EMSS,
   or when TCP's Receive Window size (RWIN) becomes too small, it may be
   necessary to send FPDUs that do not conform to the segmentation rule
   above.

   A possible, but less desirable, alternative is to use IP
   fragmentation on accepted FPDUs to deal with MTU reductions or
   extremely small EMSS.

   Even when alignment with TCP segments is lost, the sender still
   formats the FPDU according to FPDU format as shown in Figure 2.

   On a retransmission, TCP does not necessarily preserve original TCP
   segmentation boundaries. This can lead to the loss of FPDU Alignment
   and containment within a TCP segment during TCP retransmissions. An
   optimized MPA/TCP sender should try to preserve original TCP
   segmentation boundaries on a retransmission.
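
   As an informal illustration of the segmentation rule above, the
   following Python sketch groups whole FPDUs into TCP segments so that
   each segment starts on an FPDU boundary. It is not part of the
   specification. The function names are hypothetical, the greedy
   packing policy is an assumption, and the size calculation assumes
   the FPDU layout of Figure 2 with Markers disabled (2-octet ULPDU
   Length, ULPDU, padding to a 4-octet boundary, 4-octet CRC).

```python
def fpdu_size(ulpdu_len: int) -> int:
    """Octets occupied by an FPDU carrying ulpdu_len octets (no Markers)."""
    header = 2                         # ULPDU Length field
    pad = -(header + ulpdu_len) % 4    # pad header + ULPDU to 4 octets
    crc = 4                            # CRC field, present even if unused
    return header + ulpdu_len + pad + crc


def pack_segments(fpdu_sizes, emss):
    """Greedily group whole FPDUs into segments of at most emss octets.

    Every returned group starts with the first octet of an FPDU and
    contains only complete FPDUs (hypothetical helper, not a TCP API).
    """
    segments, current, used = [], [], 0
    for size in fpdu_sizes:
        if size > emss:
            # The exception case in the text: alignment cannot be kept.
            raise ValueError("FPDU larger than EMSS")
        if used + size > emss:
            segments.append(current)   # close the segment, start a new one
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments
```

   With an EMSS of 1460, two small FPDUs between two full-size ones
   share a segment, while each full-size FPDU gets a segment of its
   own. A real sender would also have to handle the exception cases
   described above (EMSS changes, small receive windows) rather than
   raising an error.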

A.2 Effects of Optimized MPA/TCP Segmentation

   Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
   single FPDU when a DDP message is large enough. Since the DDP
   message may not exactly fit into TCP segments, a "message tail" often
   occurs that results in an FPDU that is smaller than a single TCP
   segment. Additionally some DDP messages may be considerably shorter
   than the EMSS. If a small FPDU is sent in a single TCP segment the
   result is a "short" TCP segment.

   Applications expected to see strong advantages from Direct Data
   Placement include transaction-based applications and throughput
   applications. Request/response protocols typically send one FPDU per
   TCP segment and then wait for a response. Under these conditions,
   these "short" TCP segments are an appropriate and expected effect of
   the segmentation.

   Another possibility is that the application might be sending multiple
   messages (FPDUs) to the same endpoint before waiting for a response.
   In this case, the segmentation policy would tend to reduce the
   available connection bandwidth by under-filling the TCP segments.

   Standard TCP implementations often utilize the Nagle [RFC0896]
   algorithm to ensure that segments are filled to the EMSS whenever the
   round trip latency is large enough that the source stream can fully
   fill segments before Acks arrive. The algorithm does this by
   delaying the transmission of TCP segments until a ULP can fill a
   segment, or until an ACK arrives from the far side. The algorithm
   thus allows for smaller segments when latencies are shorter to keep
   the ULP's end to end latency to reasonable levels.

   The Nagle algorithm is not mandatory to use [RFC1122].

   When used with optimized MPA/TCP stacks, Nagle and similar algorithms
   can result in the "packing" of multiple FPDUs into TCP segments.

   If a "message tail", small DDP messages, or the start of a larger DDP
   message are available, MPA may pack multiple FPDUs into TCP segments.
   When this is done, the TCP segments can be more fully utilized, but,
   due to the size constraints of FPDUs, segments may not be filled to
   the EMSS. A dynamic MULPDU that informs DDP of the size of the
   remaining TCP segment space makes filling the TCP segment more
   effective.

      Note that MPA receivers do more processing of a TCP segment that
      contains multiple FPDUs; this may affect the performance of some
      receiver implementations.

   It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
   that many of the applications expected to take advantage of MPA/DDP
   prefer to avoid the extra delays caused by Nagle. In such scenarios
   it is anticipated there will be minimal opportunity for packing at
   the transmitter, and receivers may choose to optimize their
   performance for this anticipated behavior.

   Therefore, the application is expected to set TCP parameters such
   that it can trade off latency and wire efficiency. Implementations
   should provide a connection option which disables Nagle for MPA/TCP
   similar to the way the TCP_NODELAY socket option is provided for a
   traditional sockets interface.

   When latency is not critical, the application is expected to leave
   Nagle enabled. In this case the TCP implementation may pack any
   available FPDUs into TCP segments so that the segments are filled to
   the EMSS. If the amount of data available is not enough to fill the
   TCP segment when it is prepared for transmission, TCP can send the
   segment partly filled, or use the Nagle algorithm to wait for the ULP
   to post more data.
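
   For comparison, the traditional sockets-interface option mentioned
   above can be set as in the following Python sketch. It shows only
   the ordinary TCP socket option; the function name is hypothetical,
   and how an MPA/TCP stack would expose an equivalent connection
   option is left to the implementation.

```python
import socket


def open_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks TCP to send small segments without waiting
    # for outstanding data to be acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

   A latency-sensitive application would set the option before sending
   its first request; an application that prefers wire efficiency would
   simply leave the default in place.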

A.3  Optimized MPA/TCP receivers

   When an MPA receive implementation and the MPA-aware receive side TCP
   implementation support handling out of order ULPDUs, the TCP receive
   implementation performs the following functions:

   1) The implementation passes incoming TCP segments to MPA as soon as
      they have been received and validated, even if not received in
      order. The TCP layer commits to keeping each segment before it
      can be passed to the MPA. This means that the segment must have
      passed the TCP, IP, and lower layer data integrity validation
      (i.e., checksum), must be in the receive window, must be part of
      the same epoch (if timestamps are used to verify this), and must
      pass any other checks required by TCP RFCs.

      This is not to imply that the data must be completely ordered
      before use. An implementation can accept out of order segments,
      SACK them [RFC2018], and pass them to MPA immediately, before the
      segments needed to fill in the gaps arrive. MPA expects to
      utilize these segments when they are complete FPDUs or can be
      combined into complete FPDUs to allow the passing of ULPDUs to
      DDP when they arrive, independent of ordering. DDP uses the
      passed ULPDU to "place" the DDP segments (see [DDP] for more
      details).

      Since MPA performs a CRC calculation and other checks on received
      FPDUs, the MPA/TCP implementation ensures that any TCP segments
      that duplicate data already received and processed (as can happen
      during TCP retries) do not overwrite already received and
      processed FPDUs. This avoids the possibility that duplicate data
      may corrupt already validated FPDUs.

   2) The implementation provides a mechanism to indicate the ordering
      of TCP segments as the sender transmitted them. One possible
      mechanism might be attaching the TCP sequence number to each
      segment.

   3) The implementation also provides a mechanism to indicate when a
      given TCP segment (and the prior TCP stream) is complete. One
      possible mechanism might be to utilize the leading (left) edge of
      the TCP Receive Window.

      MPA uses the ordering and completion indications to inform DDP
      when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses
      the indications to "deliver" its messages to the DDP consumer
      (see [DDP] for more details).

      DDP on MPA utilizes the above two mechanisms to establish the
      Delivery semantics that DDP's consumers agree to. These
      semantics are described fully in [DDP]. These include
      requirements on DDP's consumer to respect ownership of buffers
      prior to the time that DDP delivers them to the Consumer.

   The use of SACK [RFC2018] significantly improves network utilization
   and performance and is therefore recommended. When combined with the
   out-of-order passing of segments to MPA and DDP, significant
   buffering and copying of received data can be avoided.

A.4  Re-segmenting Middle boxes and non optimized MPA/TCP senders

   Since MPA senders often start FPDUs on TCP segment boundaries, a
   receiving optimized MPA/TCP implementation may be able to optimize
   the reception of data in various ways.

   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
   segment boundaries.

   Some MPA senders may be unable to conform to the sender requirements
   because their implementation of TCP is not designed with MPA in mind.
   Even for optimized MPA/TCP senders, the network may contain "middle
   boxes" which modify the TCP stream by changing the segmentation.
   This is generally interoperable with TCP and its users and MPA must
   be no exception.

   The presence of Markers in MPA (when enabled) allows an optimized
   MPA/TCP receiver to recover the FPDUs despite these obstacles,
   although it may be necessary to utilize additional buffering at the
   receiver to do so.

   Some of the cases that a receiver may have to contend with are listed
   below as a reminder to the implementer:

   *  A single Aligned and complete FPDU, either in order, or out of
      order: This can be passed to DDP as soon as validated, and
      Delivered when ordering is established.

   *  Multiple FPDUs in a TCP segment, aligned and fully contained,
      either in order, or out of order: These can be passed to DDP as
      soon as validated, and Delivered when ordering is established.

   *  Incomplete FPDU: The receiver should buffer until the remainder
      of the FPDU arrives. If the remainder of the FPDU is already
      available, this can be passed to DDP as soon as validated, and
      Delivered when ordering is established.

   *  Unaligned FPDU start: The partial FPDU must be combined with its
      preceding portion(s). If the preceding parts are already
      available, and the whole FPDU is present, this can be passed to
      DDP as soon as validated, and Delivered when ordering is
      established. If the whole FPDU is not available, the receiver
      should buffer until the remainder of the FPDU arrives.

   *  Combinations of Unaligned or incomplete FPDUs (and potentially
      other complete FPDUs) in the same TCP segment: If any FPDU is
      present in its entirety, or can be completed with portions
      already available, it can be passed to DDP as soon as validated,
      and Delivered when ordering is established.

A.5  Receiver implementation

   Transport & Network Layer Reassembly Buffers:

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent.
   When MPA is enabled, reassembly buffers are needed if out of order
   packets arrive and Markers are not enabled. Buffers are also needed
   if FPDU Alignment is lost or if IP fragmentation occurs. This is
   because the incoming out of order segment may not contain enough
   information for MPA to process all of the FPDU. For cases where a
   re-segmenting middle box is present, or where the TCP sender is not
   optimized, the presence of Markers significantly reduces the amount
   of buffering needed.

   Recovery from IP Fragmentation is transparent to the MPA Consumers.

A.5.1  Network Layer Reassembly Buffers

   The MPA/TCP implementation should set the IP Don't Fragment bit at
   the IP layer. Thus upon a path MTU change, intermediate devices drop
   the IP datagram if it is too large and reply with an ICMP message
   which tells the source TCP that the path MTU has changed. This
   causes TCP to emit segments conformant with the new path MTU size.
   Thus IP fragments should not occur at the receiver under most
   conditions, but they remain possible.

   There are several options for implementation of network layer
   reassembly buffers:

   1. drop any IP fragments, and reply with an ICMP message according
      to [RFC792] (fragmentation needed and DF set) to tell the Remote
      Peer to resize its TCP segment

   2. support an IP reassembly buffer, but have it of limited size
      (possibly the same size as the local link's MTU). The end Node
      would normally never advertise a path MTU larger than the local
      link MTU. It is recommended that a dropped IP fragment cause an
      ICMP message to be generated according to [RFC792].

   3. multiple IP reassembly buffers, of effectively unlimited size.

   4. support an IP reassembly buffer for the largest IP datagram (64
      KB).

   5. support for a large IP reassembly buffer which could span
      multiple IP datagrams.

   An implementation should support at least option 2 or 3 above, to
   avoid dropping packets that have traversed the entire fabric.

   There is no end-to-end ACK for IP reassembly buffers, so there is no
   flow control on the buffer. The only end-to-end ACK is a TCP ACK,
   which can only occur when a complete IP datagram is delivered to TCP.
   Because of this, under worst case, pathological scenarios, the
   largest IP reassembly buffer is the TCP receive window (to buffer
   multiple IP datagrams that have all been fragmented).

   Note that if the Remote Peer does not implement re-segmentation of
   the data stream upon receiving the ICMP reply updating the path MTU,
   it is possible to halt forward progress because the opposite peer
   would continue to retransmit using a transport segment size that is
   too large. This deadlock scenario is no different than if the fabric
   MTU (not last hop MTU) was reduced after connection setup, and the
   remote Node's behavior is not compliant with [RFC1122].

A.5.2  TCP Reassembly buffers

   A TCP reassembly buffer is also needed. TCP reassembly buffers are
   needed if FPDU Alignment is lost when using TCP with MPA or when the
   MPA FPDU spans multiple TCP segments. Buffers are also needed if
   Markers are disabled and out of order packets arrive.

   Since lost FPDU Alignment often means that FPDUs are incomplete, an
   MPA on TCP implementation must have a reassembly buffer large enough
   to recover an FPDU that is less than or equal to the MTU of the
   locally attached link (this should be the largest possible advertised
   TCP path MTU). If the MTU is smaller than 140 octets, a buffer at
   least 140 octets long is needed to support the minimum FPDU size.
   The 140 octets allow for the minimum MULPDU of 128 octets, 2 octets
   of pad, 2 of ULPDU_Length, 4 of CRC, and 4 for a possible Marker.
   As usual, additional buffering is likely to provide better
   performance.

   Note that if the TCP segment were not stored, it is possible to
   deadlock the MPA algorithm. If the path MTU is reduced, FPDU
   Alignment requires the source TCP to re-segment the data stream to
   the new path MTU. The source MPA will detect this condition and
   reduce the MPA segment size, but any FPDUs already posted to the
   source TCP will be re-segmented and lose FPDU Alignment. If the
   destination does not support a TCP reassembly buffer, these segments
   can never be successfully transmitted and the protocol deadlocks.

   When a complete FPDU is received, processing continues normally.

B  Appendix.
   Analysis of MPA over TCP Operations

   This appendix is for information only and is NOT part of the
   standard.

   This appendix is an analysis of MPA on TCP and why it is useful to
   integrate MPA with TCP (with modifications to typical TCP
   implementations) to reduce overall system buffering and overhead.

   One of MPA's high level goals is to provide enough information, when
   combined with the Direct Data Placement Protocol [DDP], to enable
   out-of-order placement of DDP payload into the final Upper Layer
   Protocol (ULP) buffer. Note that DDP separates the act of placing
   data into a ULP buffer from that of notifying the ULP that the ULP
   buffer is available for use. In DDP terminology, the former is
   defined as "Placement", and the latter is defined as "Delivery". MPA
   supports in-order Delivery of the data to the ULP, including support
   for Direct Data Placement in the final ULP buffer location when TCP
   segments arrive out-of-order. Effectively, the goal is to use the
   pre-posted ULP buffers as the TCP receive buffer, where the
   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
   DDP) is done in place, in the ULP buffer, with no data copies.
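
   The distinction between Placement and Delivery can be illustrated
   with a toy model. This is a sketch under simplifying assumptions and
   not part of MPA or DDP: the class PlacementBuffer and its fields are
   hypothetical, payloads are addressed by a plain byte offset, and CRC
   validation and duplicate filtering are omitted.

```python
class PlacementBuffer:
    """Toy ULP buffer: payload is placed on arrival, delivered in order."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)  # stands in for a pre-posted ULP buffer
        self.placed = []            # (start, end) ranges already placed
        self.delivered = 0          # octets [0, delivered) are Delivered

    def place(self, offset: int, payload: bytes) -> int:
        """Place payload at its final offset; return the Delivery point."""
        end = offset + len(payload)
        if offset < 0 or end > len(self.buf):
            raise ValueError("payload does not fit in the ULP buffer")
        # Placement: copy straight to the final location, in any order.
        self.buf[offset:end] = payload
        self.placed.append((offset, end))
        # Delivery: advance only across a contiguous prefix of the stream.
        advanced = True
        while advanced:
            advanced = False
            for start, stop in self.placed:
                if start <= self.delivered < stop:
                    self.delivered = stop
                    advanced = True
        return self.delivered
```

   Placing the second half of an 8-octet buffer first leaves the
   Delivery point at 0 even though the data already sits in its final
   location; placing the first half then moves the Delivery point to 8
   without any further copy.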

   This Appendix walks through the advantages and disadvantages of the
   TCP sender modifications proposed by MPA:

   1) that MPA prefers that the TCP sender do Header Alignment, where
      a TCP segment should begin with an MPA Framing Protocol Data Unit
      (FPDU) (if there is payload present).

   2) that there be an integral number of FPDUs in a TCP segment (under
      conditions where the Path MTU is not changing).

   This Appendix concludes that the scaling advantages of FPDU Alignment
   are strong, based primarily on fairly drastic TCP receive buffer
   reduction requirements and simplified receive handling. The analysis
   also shows that there is little effect on TCP wire behavior.

B.1  Assumptions

B.1.1  MPA is layered beneath DDP [DDP]

   MPA is an adaptation layer between DDP and TCP. DDP requires
   preservation of DDP segment boundaries and a CRC32C digest covering
   the DDP header and data. MPA adds these features to the TCP stream
   so that DDP over TCP has the same basic properties as DDP over SCTP.

B.1.2  MPA preserves DDP message framing

   MPA was designed as a framing layer specifically for DDP and was not
   intended as a general-purpose framing layer for any other ULP using
   TCP.

   A framing layer allows ULPs using it to receive indications from the
   transport layer only when complete ULPDUs are present. As a framing
   layer, MPA is not aware of the content of the DDP PDU, only that it
   has received and, if necessary, reassembled a complete PDU for
   Delivery to the DDP.

B.1.3  The size of the ULPDU passed to MPA is less than EMSS under
       normal conditions

   To make reception of a complete DDP PDU on every received segment
   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
   the underlying fabric.
   Each FPDU that MPA creates contains sufficient information for the
   receiver to directly place the ULP payload in the correct location in
   the correct receive buffer.

   Edge cases when this condition does not occur are dealt with, but do
   not need to be on the fast path.

B.1.4  Out-of-order placement but NO out-of-order Delivery

   DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
   information necessary to place its ULP payload directly in the
   correct location in host memory.

   Because each DDP segment is self-describing, it is possible for DDP
   segments received out of order to have their ULP payload placed
   immediately in the ULP receive buffer.

   Data delivery to the ULP is guaranteed to be in the order the data
   was sent. DDP only indicates data delivery to the ULP after TCP has
   acknowledged the complete byte stream.

B.2  The Value of FPDU Alignment

   Significant receiver optimizations can be achieved when Header
   Alignment and complete FPDUs are the common case. The optimizations
   allow utilizing significantly fewer buffers on the receiver and less
   computation per FPDU. The net effect is the ability to build a
   "flow-through" receiver that enables TCP-based solutions to scale to
   10G and beyond in an economical way. The optimizations are
   especially relevant to hardware implementations of receivers that
   process multiple protocol layers - Data Link Layer (e.g., Ethernet),
   Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
   of TCP (e.g., MPA/DDP). As network speed increases, there is an
   increasing desire to use a hardware based receiver in order to
   achieve an efficient high performance solution.

   A TCP receiver, under worst case conditions, has to allocate buffers
   (BufferSizeTCP) whose capacities are a function of the
   bandwidth-delay product.
   Thus:

      BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].

   Where bandwidth is the end-to-end bandwidth of the connection, delay
   is the round trip delay of the connection, and K is an implementation
   dependent constant.

   Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
   buffers for a 10x increase in end-to-end bandwidth). As this
   buffering approach may scale poorly for hardware or software
   implementations alike, several approaches allow reduction in the
   amount of buffering required for high-speed TCP communication.

   The MPA/DDP approach is to enable the ULP's buffer to be used as the
   TCP receive buffer. If the application pre-posts a sufficient amount
   of buffering, and each TCP segment has sufficient information to
   place the payload into the right application buffer, when an
   out-of-order TCP segment arrives it could potentially be placed
   directly in the ULP buffer. However, placement can only be done when
   a complete FPDU with the placement information is available to the
   receiver, and the FPDU contents contain enough information to place
   the data into the correct ULP buffer (e.g., there is a DDP header
   available).

   For the case when the FPDU is not aligned with the TCP segment, it
   may take, on average, 2 TCP segments to assemble one FPDU.
   Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
   Non-Aligned FPDU) octets:

      BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

   Where K1 and K2 are implementation dependent constants and EMSS is
   the effective maximum segment size.

   For example, a 1 Gbps link with 10,000 connections and an EMSS of
   1500B would require 15 MB of memory. Often the number of connections
   used scales with the network speed, aggravating the situation for
   higher speeds.
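
   The two sizing formulas above can be checked numerically. The
   following is a worked sketch only: the function names are
   hypothetical, and the constants are assumptions chosen for
   illustration (K = 1 and a 1 ms round trip for BufferSizeTCP; K1 = 1
   and K2 = 0 for BufferSizeNAF, which reproduces the 15 MB figure
   quoted in the example).

```python
def buffer_size_tcp(k: float, bandwidth_octets_per_s: float, delay_s: float) -> float:
    """BufferSizeTCP = K * bandwidth [octets/s] * delay [s]."""
    return k * bandwidth_octets_per_s * delay_s


def buffer_size_naf(k1: float, k2: float, emss: int, connections: int) -> float:
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * connections + k2 * emss


# 1 Gbps is 125,000,000 octets per second.
ONE_GBPS_OCTETS = 1_000_000_000 // 8

# Assumed K = 1 and an assumed 1 ms round trip delay.
tcp_octets = buffer_size_tcp(1, ONE_GBPS_OCTETS, 0.001)

# Assumed K1 = 1 and K2 = 0, with the EMSS and connection count of the example.
naf_octets = buffer_size_naf(1, 0, 1500, 10_000)
```

   With these assumed constants, the non-aligned case needs 15,000,000
   octets, and the requirement grows linearly with the number of
   connections, which is the scaling concern raised above.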

   FPDU Alignment would allow the receiver to allocate BufferSizeAF
   (Buffer Size, Aligned FPDU) octets:

      BufferSizeAF = K2 * EMSS

   for the same conditions. An FPDU Aligned receiver may require memory
   in the range of ~100s of KB - which is feasible for an on-chip memory
   and enables a "flow-through" design, in which the data flows through
   the NIC and is placed directly in the destination buffer. Assuming
   most of the connections support FPDU Alignment, the receiver buffers
   no longer scale with number of connections.

   Additional optimizations can be achieved in a balanced I/O sub-system
   -- where the system interface of the network controller provides
   ample bandwidth as compared with the network bandwidth. For almost
   twenty years this has been the case and the trend is expected to
   continue - while Ethernet speeds have scaled by 1000 (from 10
   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
   PCI-X DDR). Under these conditions, the FPDU Alignment approach
   allows BufferSizeAF to be indifferent to network speed. It is
   primarily a function of the local processing time for a given frame.
   Thus when the FPDU Alignment approach is used, receive buffering is
   expected to scale gracefully (i.e., less than linear scaling) as
   network speed is increased.

B.2.1  Impact of lack of FPDU Alignment on the receiver computational
       load and complexity

   The receiver must perform IP and TCP processing, and then perform
   FPDU CRC checks, before it can trust the FPDU header placement
   information. For simplicity of the description, the assumption is
   that an FPDU is carried in no more than 2 TCP segments. In reality,
   with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
   segments (e.g., if the PMTU was reduced).

   ----++-----------------------------++-----------------------++-----
   +---||---------------+    +--------||--------+   +----------||----+
   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
   +---||---------------+    +--------||--------+   +----------||----+
   ----++-----------------------------++-----------------------++-----
                   FPDU #N-1                    FPDU #N

      Figure 12: Non-aligned FPDU freely placed in TCP octet stream

   The receiver algorithm for processing TCP segments (e.g., TCP segment
   #X in Figure 12) carrying non-aligned FPDUs (in-order or
   out-of-order) includes:

   1.  Data Link Layer processing (whole frame) - typically including a
       CRC calculation.

   2.  Network Layer processing (assuming not an IP fragment, the
       whole Data Link Layer frame contains one IP datagram. IP
       fragments should be reassembled in a local buffer. This is
       not a performance optimization goal.)

   3.  Transport Layer processing -- TCP protocol processing, header
       and checksum checks.

       a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
           IP DST, TCP SRC Port, TCP DST Port, protocol)

   4.  Find FPDU message boundaries.

       a.  Get MPA state information for the connection

           If the TCP segment is in-order, use the receiver managed
           MPA state information to calculate where the previous
           FPDU message (#N-1) ends in the current TCP segment X.
           (Previously, when the MPA receiver processed the first
           part of FPDU #N-1, it calculated the number of bytes
           remaining to complete FPDU #N-1 by using the MPA
           Length field.)

              Get the stored partial CRC for FPDU #N-1

              Complete CRC calculation for FPDU #N-1 data (first
              portion of TCP segment #X)

              Check CRC calculation for FPDU #N-1

              If no FPDU CRC errors, placement is allowed

              Locate the local buffer for the first portion of
              FPDU #N-1, CopyData(local buffer of first portion
              of FPDU #N-1, host buffer address, length)

              Compute host buffer address for second portion of FPDU
              #N-1

              CopyData(local buffer of second portion of FPDU #N-1,
              host buffer address for second portion, length)

              Calculate the octet offset into the TCP segment for
              the next FPDU #N.

              Start Calculation of CRC for available data for FPDU
              #N

              Store partial CRC results for FPDU #N

              Store local buffer address of first portion of FPDU #N

              No further action is possible on FPDU #N, before it is
              completely received

           If the TCP segment is out-of-order, the receiver must
           buffer the data until at least one complete FPDU is
           received. Typically buffering for more than one TCP
           segment per connection is required. Use the MPA based
           Markers to calculate where FPDU boundaries are.

              When a complete FPDU is available, a similar procedure
              to the in-order algorithm above is used. There is
              additional complexity, though, because when the
              missing segment arrives, this TCP segment must be
              run through the CRC engine after the CRC is
              calculated for the missing segment.

   If we assume FPDU Alignment, the following diagram and the algorithm
   below apply. Note that when using MPA, the receiver is assumed to
   actively detect presence or loss of FPDU Alignment for every TCP
   segment received.

      +--------------------------+      +--------------------------+
   +--|--------------------------+   +--|--------------------------+
   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
   +--|--------------------------+   +--|--------------------------+
      +--------------------------+      +--------------------------+
                FPDU #N                          FPDU #N+1

       Figure 13: Aligned FPDU placed immediately after TCP header

   The receiver algorithm for FPDU Aligned frames (in-order or
   out-of-order) includes:

   1)  Data Link Layer processing (whole frame) - typically
       including a CRC calculation.

   2)  Network Layer processing (assuming not an IP fragment, the
       whole Data Link Layer frame contains one IP datagram. IP
       fragments should be reassembled in a local buffer. This is
       not a performance optimization goal.)

   3)  Transport Layer processing -- TCP protocol processing, header
       and checksum checks.

       a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
           IP DST, TCP SRC Port, TCP DST Port, protocol)

   4)  Check for Header Alignment (described in detail in Section
       6). Header Alignment is assumed for the rest of the algorithm
       below.

       a.  If the header is not aligned, see the algorithm defined
           in the prior section.

   5)  Whether TCP is in-order or out-of-order, the MPA header is at
       the beginning of the current TCP payload. Get the FPDU length
       from the FPDU header.

   6)  Calculate CRC over FPDU

   7)  Check CRC calculation for FPDU #N

   8)  If no FPDU CRC errors, placement is allowed

   9)  CopyData(TCP segment #X, host buffer address, length)

   10) Loop to #5 until all the FPDUs in the TCP segment are
       consumed in order to handle FPDU packing.

   Implementation note: In both cases the receiver has to classify the
   incoming TCP segment and associate it with one of the flows it
   maintains.
   In the case of no FPDU Alignment, the receiver is forced to classify
   incoming traffic before it can calculate the FPDU CRC. In the case
   of FPDU Alignment the order of operations is left to the implementer.

   The FPDU Aligned receiver algorithm is significantly simpler. There
   is no need to locally buffer portions of FPDUs. Accessing state
   information is also substantially simplified - the normal case does
   not require retrieving information to find out where an FPDU starts
   and ends or retrieval of a partial CRC before the CRC calculation can
   commence. This avoids adding internal latencies, having multiple
   data passes through the CRC machine, or scheduling multiple commands
   for moving the data to the host buffer.

   The aligned FPDU approach is useful for in-order and out-of-order
   reception. The receiver can use the same mechanisms for data storage
   in both cases, and only needs to account for when all the TCP
   segments have arrived to enable Delivery. The Header Alignment,
   along with the high probability that at least one complete FPDU is
   found with every TCP segment, allows the receiver to perform data
   placement for out-of-order TCP segments with no need for intermediate
   buffering. Essentially the TCP receive buffer has been eliminated
   and TCP reassembly is done in place within the ULP buffer.

   In case FPDU Alignment is not found, the receiver should follow the
   algorithm for non aligned FPDU reception which may be slower and less
   efficient.

B.2.2  FPDU Alignment effects on TCP wire protocol

   In an optimized MPA/TCP implementation, TCP exposes its EMSS to
   MPA. MPA uses the EMSS to calculate its MULPDU, which it then
   exposes to DDP, its ULP. DDP uses the MULPDU to segment its
   payload so that each FPDU sent by MPA fits completely into one
   TCP segment.
   This has no impact on wire protocol and exposing this information is
   already supported on many TCP implementations, including all modern
   flavors of BSD networking, through the TCP_MAXSEG socket option.

   In the common case, the ULP (i.e., DDP over MPA) messages provided to
   the TCP layer are segmented to MULPDU size. It is assumed that the
   ULP message size is bounded by MULPDU, such that a single ULP message
   can be encapsulated in a single TCP segment. Therefore, in the
   common case, there is no increase in the number of TCP segments
   emitted. For smaller ULP messages, the sender can also apply
   packing, i.e., the sender packs as many complete FPDUs as possible
   into one TCP segment. The requirement to always have a complete FPDU
   may increase the number of TCP segments emitted. Typically, a ULP
   message size varies from a few bytes to multiple EMSS (e.g., 64
   Kbytes). In some cases the ULP may post more than one message at a
   time for transmission, giving the sender an opportunity for packing.
   In the case where more than one FPDU is available for transmission
   and the FPDUs are encapsulated into a TCP segment and there is no
   room in the TCP segment to include the next complete FPDU, another
   TCP segment is sent. In this corner case some of the TCP segments
   are not full size. In the worst case scenario, the ULP may choose an
   FPDU size that is EMSS/2 + 1 and have multiple messages available for
   transmission. For this poor choice of FPDU size, the average TCP
   segment size is therefore about 1/2 of the EMSS and the number of TCP
   segments emitted approaches 2x of what is possible without the
   requirement to encapsulate an integer number of complete FPDUs in
   every TCP segment.
   This is a dynamic situation that only lasts for the duration where
   the sender ULP has multiple non-optimal messages for transmission and
   this causes a minor impact on the wire utilization.

   However, it is not expected that requiring FPDU Alignment will have a
   measurable impact on wire behavior of most applications. Throughput
   applications with large I/Os are expected to take full advantage of
   the EMSS. Another class of applications with many small outstanding
   buffers (as compared to EMSS) is expected to use packing when
   applicable. Transaction oriented applications are also optimal.

   TCP retransmission is another area that can affect sender behavior.
   TCP supports retransmission of the exact, originally transmitted
   segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
   window" and [RFC1122] section 4.2.2.15). In the unlikely event that
   part of the original segment has been received and acknowledged by
   the remote peer (e.g., a re-segmenting middle box, as documented in
   Appendix A.4, Re-segmenting Middle boxes and non optimized MPA/TCP
   senders), a better available bandwidth utilization may be possible by
   re-transmitting only the missing octets. If an optimized MPA/TCP
   retransmits complete FPDUs, there may be some marginal bandwidth
   loss.

   Another area where a change in the TCP segment number may have impact
   is that of Slow Start and Congestion Avoidance. Slow-start
   exponential increase is measured in segments per second, as the
   algorithm focuses on the overhead per segment at the source for
   congestion that eventually results in dropped segments. Slow-start
   exponential bandwidth growth for optimized MPA/TCP is similar to any
   TCP implementation. Congestion Avoidance allows for a linear growth
   in available bandwidth when recovering after a packet drop.
   Similar to the analysis for slow-start, optimized MPA/TCP doesn't
   change the behavior of the algorithm. Therefore the average size of
   the segment versus EMSS is not a major factor in the assessment of
   the bandwidth growth for a sender. Both Slow Start and Congestion
   Avoidance for an optimized MPA/TCP will behave similarly to any TCP
   sender and allow an optimized MPA/TCP to enjoy the theoretical
   performance limits of the algorithms.

   In summary, the ULP messages generated at the sender (e.g., the
   number of messages grouped for every transmission request) and the
   message size distribution have the most significant impact on the
   number of TCP segments emitted. The worst case effect for certain
   ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by
   an increase of up to 2x in the number of TCP segments and
   acknowledgments. In reality the effect is expected to be marginal.

C  Appendix.
   IETF Implementation Interoperability with RDMA Consortium
   Protocols

   This appendix is for information only and is NOT part of the
   standard.

   This appendix covers methods of making MPA implementations
   interoperate with both IETF and RDMA Consortium versions of the
   protocols.

   The RDMA Consortium created early specifications of the MPA/DDP/RDMA
   protocols and some manufacturers created implementations of those
   protocols before the IETF versions were finalized. These protocols
   are very similar to the IETF versions, making it possible for
   implementations to be created or modified to support either set of
   specifications.

   For those interested, the RDMA Consortium protocol documents
   (draft-culley-iwarp-mpa-v1.0.pdf, draft-shah-iwarp-ddp-v1.0.pdf, and
   draft-recio-iwarp-rdmac-v1.0.pdf) can be obtained at
   http://www.rdmaconsortium.org.

   In this section, implementations of MPA/DDP/RDMA that conform to the
   RDMAC specifications are called RDMAC RNICs. Implementations of
   MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.

   Without the exchange of MPA Request/Reply Frames, there is no
   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
   RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
   immediately in RDMA mode (i.e., without exchanging the MPA
   Request/Reply messages), there is no reason to believe an IETF RNIC
   will interoperate with an RDMAC RNIC because of the differences in
   the version number in the DDP and RDMAP headers on the wire.

   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
   implement MPA Request/Reply Frames on behalf of the RNIC in order to
   negotiate the connection parameters. The following section describes
   the results following the exchange of the MPA Request/Reply Frames
   before the conversion from streaming to RDMA mode.

C.1  Negotiated Parameters

   Three types of RNICs are considered:

   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
      has a ULP or other supporting entity that exchanges the MPA
      Request/Reply Frames in streaming mode before the conversion to
      RDMA mode.

   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
      which is not capable of implementing the RDMAC protocols. Such
      an RNIC can only interoperate with other IETF RNICs.

   Permissive IETF RNIC - an RNIC implementing the IETF protocols which
      is capable of implementing the RDMAC protocols on a per
      connection basis.

   The Permissive IETF RNIC is recommended for those implementers that
   want maximum interoperability with other RNIC implementations.

   The values used by these three RNIC types for the MPA, DDP, and RDMAP
   versions as well as MPA Markers and CRC are summarized in Figure 14.
2761 +----------------++-----------+-----------+-----------+-----------+
2762 | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
2763 |                ||  Version  | Revision  |  Markers  |    CRC    |
2764 +----------------++-----------+-----------+-----------+-----------+
2765 +----------------++-----------+-----------+-----------+-----------+
2766 | RDMAC          ||     0     |     0     |     1     |     1     |
2767 |                ||           |           |           |           |
2768 +----------------++-----------+-----------+-----------+-----------+
2769 | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
2770 | Non-permissive ||           |           |           |           |
2771 +----------------++-----------+-----------+-----------+-----------+
2772 | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
2773 | permissive     ||           |           |           |           |
2774 +----------------++-----------+-----------+-----------+-----------+
2775 Figure 14. Connection Parameters for the RNIC Types.
2776 For MPA Markers and MPA CRC, enabled=1, disabled=0.

2778 It is assumed there is no mixing of versions allowed between MPA, DDP
2779 and RDMAP. The RNIC either generates the RDMAC protocols on the wire
2780 (version is zero) or the IETF protocols (version is one).

2782 During the exchange of the MPA Request/Reply Frames, each peer
2783 provides its MPA Revision, Marker preference (M: 0=disabled,
2784 1=enabled), and CRC preference. The MPA Revision provided in the MPA
2785 Request Frame and the MPA Reply Frame may differ.

2787 From the information in the MPA Request/Reply Frames, each side sets
2788 the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
2789 well as the state of the Markers for each half connection. Between
2790 DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP
2791 and RDMAP version MUST be identical in the two directions. The RNIC
2792 either generates the RDMAC protocols on the wire (version is zero) or
2793 the IETF protocols (version is one).

2795 In the following sections, the figures do not discuss CRC negotiation
2796 because there is no interoperability issue for CRCs.
Since the RDMAC
2797 RNIC will always request CRC use, according to the IETF MPA
2798 specification, both peers MUST generate and check CRCs.

2800 C.2 RDMAC RNIC and Non-permissive IETF RNIC

2802 Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
2803 with an RDMAC RNIC, despite the fact that both peers exchange MPA
2804 Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
2805 negotiation has no effect on the DDP/RDMAP version and it is unable
2806 to interoperate with the RDMAC RNIC.

2808 The rows in the figure show the state of the Marker field in the MPA
2809 Request Frame sent by the MPA Initiator. The columns show the state
2810 of the Marker field in the MPA Reply Frame sent by the MPA Responder.
2811 Each type of RNIC is shown as an Initiator and a Responder. The
2812 connection results are shown in the lower right corner, at the
2813 intersection of the different RNIC types, where V=0 is the RDMAC
2814 DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
2815 Markers are disabled and M=1 means MPA Markers are enabled. The
2816 negotiated Marker state is shown as X/Y, for the receive direction of
2817 the Initiator/Responder.
2819 +---------------------------++-----------------------+
2820 |            MPA            ||          MPA          |
2821 |          CONNECT          ||       Responder       |
2822 |   MODE  +-----------------++-------+---------------+
2823 |         |       RNIC      || RDMAC |      IETF     |
2824 |         |       TYPE      ||       | Non-permissive|
2825 |         |          +------++-------+-------+-------+
2826 |         |          |MARKER||  M=1  |  M=0  |  M=1  |
2827 +---------+----------+------++-------+-------+-------+
2828 +---------+----------+------++-------+-------+-------+
2829 |         |   RDMAC  |  M=1 ||  V=0  | close | close |
2830 |         |          |      || M=1/1 |       |       |
2831 |         +----------+------++-------+-------+-------+
2832 |   MPA   |          |  M=0 || close |  V=1  |  V=1  |
2833 |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
2834 |         |Non-perms.+------++-------+-------+-------+
2835 |         |          |  M=1 || close |  V=1  |  V=1  |
2836 |         |          |      ||       | M=1/0 | M=1/1 |
2837 +---------+----------+------++-------+-------+-------+
2838 Figure 15: MPA negotiation between an RDMAC RNIC and a Non-permissive
2839 IETF RNIC.

2841 C.2.1 RDMAC RNIC Initiator

2843 If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
2844 Frame with Rev field set to zero and the M and C bits set to one.
2845 Because the Non-permissive IETF RNIC cannot dynamically downgrade the
2846 version number it uses for DDP and RDMAP, it would send an MPA Reply
2847 Frame with the Rev field equal to one and then gracefully close the
2848 connection.

2850 C.2.2 Non-Permissive IETF RNIC Initiator

2852 If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
2853 Request Frame with Rev field equal to one. The ULP or supporting
2854 entity for the RDMAC RNIC responds with an MPA Reply Frame that has
2855 the Rev field equal to zero and the M bit set to one. The Non-
2856 permissive IETF RNIC will gracefully close the connection after it
2857 reads the incompatible Rev field in the MPA Reply Frame.

2859 C.2.3 RDMAC RNIC and Permissive IETF RNIC

2861 Figure 16 shows that a Permissive IETF RNIC can interoperate with an
2862 RDMAC RNIC regardless of its Marker preference.
The figure uses the
2863 same format as shown with the Non-permissive IETF RNIC.

2865 +---------------------------++-----------------------+
2866 |            MPA            ||          MPA          |
2867 |          CONNECT          ||       Responder       |
2868 |   MODE  +-----------------++-------+---------------+
2869 |         |       RNIC      || RDMAC |      IETF     |
2870 |         |       TYPE      ||       |   Permissive  |
2871 |         |          +------++-------+-------+-------+
2872 |         |          |MARKER||  M=1  |  M=0  |  M=1  |
2873 +---------+----------+------++-------+-------+-------+
2874 +---------+----------+------++-------+-------+-------+
2875 |         |   RDMAC  |  M=1 ||  V=0  |  N/A  |  V=0  |
2876 |         |          |      || M=1/1 |       | M=1/1 |
2877 |         +----------+------++-------+-------+-------+
2878 |   MPA   |          |  M=0 ||  V=0  |  V=1  |  V=1  |
2879 |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
2880 |         |Permissive+------++-------+-------+-------+
2881 |         |          |  M=1 ||  V=0  |  V=1  |  V=1  |
2882 |         |          |      || M=1/1 | M=1/0 | M=1/1 |
2883 +---------+----------+------++-------+-------+-------+
2884 Figure 16: MPA negotiation between an RDMAC RNIC and a Permissive
2885 IETF RNIC.

2887 A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
2888 Rev field of the MPA Request/Reply Frames and then adjust its receive
2889 Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As
2890 a result, as an MPA Responder, the Permissive IETF RNIC will never
2891 return an MPA Reply Frame with the M bit set to zero. This case is
2892 shown as not applicable (N/A) in Figure 16.

2894 C.2.4 RDMAC RNIC Initiator

2896 When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
2897 entity prepares an MPA Request message and sets the revision to zero
2898 and the M bit and C bit to one.

2900 The Permissive IETF Responder receives the MPA Request message and
2901 checks the revision field. Since it is capable of generating RDMAC
2902 DDP/RDMAP headers, it sends an MPA Reply message with revision set to
2903 zero and the M and C bits set to one. The Responder must inform its
2904 ULP that it is generating version zero DDP/RDMAP messages.
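
The outcomes tabulated in Figures 15 and 16 can be sketched as a
small Python function. This is an illustrative sketch only and is
not part of the specification: the names RnicType, Outcome, negotiate
and format_outcome are hypothetical, and the function models just the
resulting DDP/RDMAP version and Marker state, not the wire format of
the MPA Request/Reply Frames or CRC negotiation.

```python
from enum import Enum
from typing import NamedTuple, Optional


class RnicType(Enum):
    RDMAC = "RDMAC"
    IETF_NON_PERMISSIVE = "IETF Non-permissive"
    IETF_PERMISSIVE = "IETF Permissive"


class Outcome(NamedTuple):
    version: int            # V: 0 = RDMAC DDP/RDMAP, 1 = IETF DDP/RDMAP
    initiator_markers: int  # M for the Initiator's receive direction
    responder_markers: int  # M for the Responder's receive direction


def negotiate(initiator: RnicType, initiator_m: int,
              responder: RnicType, responder_m: int) -> Optional[Outcome]:
    """Return the negotiated outcome, or None if the connection is closed.

    initiator_m and responder_m are the M bits each side would send as an
    IETF RNIC; they are ignored when an RDMAC RNIC is involved, because an
    RDMAC RNIC always uses Markers.
    """
    types = {initiator, responder}
    if RnicType.RDMAC in types:
        if RnicType.IETF_NON_PERMISSIVE in types:
            # A Non-permissive IETF RNIC cannot generate version zero
            # DDP/RDMAP headers, so it gracefully closes the connection.
            return None
        # RDMAC peer with an RDMAC or Permissive IETF peer: RDMAC
        # protocols on the wire, Markers enabled in both directions.
        return Outcome(0, 1, 1)
    # Two IETF RNICs: IETF protocols on the wire; each side's M bit
    # sets the Marker state for its own receive direction.
    return Outcome(1, initiator_m, responder_m)


def format_outcome(outcome: Optional[Outcome]) -> str:
    """Render an outcome in the notation used by the figures."""
    if outcome is None:
        return "close"
    return (f"V={outcome.version} "
            f"M={outcome.initiator_markers}/{outcome.responder_markers}")


if __name__ == "__main__":
    print(format_outcome(
        negotiate(RnicType.IETF_PERMISSIVE, 0, RnicType.RDMAC, 1)))
    print(format_outcome(
        negotiate(RnicType.RDMAC, 1, RnicType.IETF_NON_PERMISSIVE, 1)))
```

The sketch deliberately keys the result on the RNIC types rather than
on the Rev field, since the figures are organized the same way; a
real implementation would learn the peer's type from the Rev field of
the received frame.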
2906 C.2.5 Permissive IETF RNIC Initiator

2908 If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
2909 Request Frame setting the Rev field to one. Regardless of the value
2910 of the M bit in the MPA Request Frame, the ULP or other supporting
2911 entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
2912 equal to zero and the M bit set to one.

2914 When the Initiator reads the Rev field of the MPA Reply Frame and
2915 finds that its peer is an RDMAC RNIC, it must inform its ULP that it
2916 should generate version zero DDP/RDMAP messages and enable MPA
2917 Markers and CRC.

2919 C.3 Non-Permissive IETF RNIC and Permissive IETF RNIC

2921 For completeness, Figure 17 below shows the results of MPA
2922 negotiation between a Non-permissive IETF RNIC and a Permissive IETF
2923 RNIC. The important point from this figure is that an IETF RNIC
2924 cannot detect whether its peer is a Permissive or Non-permissive
2925 RNIC.

2927 +---------------------------++-------------------------------+
2928 |            MPA            ||              MPA              |
2929 |          CONNECT          ||           Responder           |
2930 |   MODE  +-----------------++---------------+---------------+
2931 |         |       RNIC      ||      IETF     |      IETF     |
2932 |         |       TYPE      || Non-permissive|   Permissive  |
2933 |         |          +------++-------+-------+-------+-------+
2934 |         |          |MARKER||  M=0  |  M=1  |  M=0  |  M=1  |
2935 +---------+----------+------++-------+-------+-------+-------+
2936 +---------+----------+------++-------+-------+-------+-------+
2937 |         |          |  M=0 ||  V=1  |  V=1  |  V=1  |  V=1  |
2938 |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2939 |         |Non-perms.+------++-------+-------+-------+-------+
2940 |         |          |  M=1 ||  V=1  |  V=1  |  V=1  |  V=1  |
2941 |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2942 |   MPA   +----------+------++-------+-------+-------+-------+
2943 |Initiator|          |  M=0 ||  V=1  |  V=1  |  V=1  |  V=1  |
2944 |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2945 |         |Permissive+------++-------+-------+-------+-------+
2946 |         |          |  M=1 ||  V=1  |  V=1  |  V=1  |  V=1  |
2947 |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2948 +---------+----------+------++-------+-------+-------+-------+
2949 Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
2950 Permissive IETF RNIC.

2952 Normative References

2954 [iSCSI] Satran, J., "Internet Small Computer Systems Interface
2955     (iSCSI)", RFC 3720, April 2004.

2957 [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
2958     November 1990.

2960 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
2961     Selective Acknowledgment Options", RFC 2018, October 1996.

2963 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2964     Requirement Levels", BCP 14, RFC 2119, March 1997.

2966 [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
2967     Internet Protocol", RFC 2401, November 1998.

2969 [RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
2970     IP", RFC 3723, April 2004.

2972 [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
2973     Program Protocol Specification", RFC 793, September 1981.

2975 [RDMASEC] Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP
2976     Security", draft-ietf-rddp-security-09.txt (work in progress),
2977     May 2006.

2979 Informative References

2981 [APPL] Bestler, C., "Applicability of Remote Direct Memory Access
2982     Protocol (RDMA) and Direct Data Placement (DDP)", draft-ietf-
2983     rddp-applicability-08.txt (Work in progress), June 2006.

2985 [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
2986     disagree", ACM Sigcomm, Sept. 2000.

2988 [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
2989     Library) and uDAPL (User Direct Access Programming Library)",
2990     http://www.datcollaborative.org.

2992 [DDP] H. Shah et al., "Direct Data Placement over Reliable
2993     Transports", draft-ietf-rddp-ddp-07.txt (Work in progress),
2994     September 2006.

2996 [iSER] Mike Ko et al., "iSCSI Extensions for RDMA Specification",
2997     draft-ietf-ips-iser-05.txt (Work in progress), October 2005.

2999 [IT-API] The Open Group, "Interconnect Transport API (IT-API)"
3000     Version 2.1, http://www.opengroup.org.

3002 [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
3003     Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
3004     bindings-02.txt, July 2004.

3006 [RDMAP] R. Recio et al., "RDMA Protocol Specification",
3007     draft-ietf-rddp-rdmap-07.txt, September 2006.

3009 [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
3010     September 1981.

3012 [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
3013     896, January 1984.

3015 [RFC1122] Braden, R.T., "Requirements for Internet hosts -
3016     communication layers", RFC 1122, October 1989.

3018 [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
3019     RFC 2960, October 2000.

3021 [RFC4296] Bailey, S., Talpey, T, "The Architecture of Direct Data
3022     Placement (DDP) and Remote Direct Memory Access (RDMA) on
3023     Internet Protocols", RFC 4296, December 2005.

3025 [RFC4297] Romanow, A., et al., "Remote Direct Memory Access (RDMA)
3026     over IP Problem Statement", RFC 4297, December 2005.

3028 [RFC4301] Kent, S., Seo, K., "Security Architecture for the Internet
3029     Protocol", RFC 4301, December 2005.

3031 [VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
3032     draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf, April 2003,
3033     http://www.rdmaconsortium.org.

3035 Authors' Addresses

3037 Stephen Bailey
3038 Sandburst Corporation
3039 600 Federal Street
3040 Andover, MA 01810 USA
3041 Phone: +1 978 689 1614
3042 Email: steph@sandburst.com

3044 Paul R. Culley
3045 Hewlett-Packard Company
3046 20555 SH 249
3047 Houston, Tx. USA 77070-2698
3048 Phone: 281-514-5543
3049 Email: paul.culley@hp.com

3051 Uri Elzur
3052 Broadcom
3053 16215 Alton Parkway
3054 CA, 92618
3055 Phone: 949.585.6432
3056 Email: uri@broadcom.com

3058 Renato J Recio
3059 IBM
3060 Internal Zip 9043
3061 11400 Burnett Road
3062 Austin, Texas 78759
3063 Phone: 512-838-3685
3064 Email: recio@us.ibm.com

3066 John Carrier
3067 Cray Inc.
3068 411 First Avenue S, Suite 600
3069 Seattle, WA 98104-2860
3070 Phone: 206-701-2090
3071 Email: carrier@cray.com

3073 Acknowledgments

3075 Dwight Barron
3076 Hewlett-Packard Company
3077 20555 SH 249
3078 Houston, Tx. USA 77070-2698
3079 Phone: 281-514-2769
3080 Email: dwight.barron@hp.com

3082 Jeff Chase
3083 Department of Computer Science
3084 Duke University
3085 Durham, NC 27708-0129 USA
3086 Phone: +1 919 660 6559
3087 Email: chase@cs.duke.edu

3089 Ted Compton
3090 EMC Corporation
3091 Research Triangle Park, NC 27709, USA
3092 Phone: 919-248-6075
3093 Email: compton_ted@emc.com

3095 Dave Garcia
3096 Hewlett-Packard Company
3097 19333 Vallco Parkway
3098 Cupertino, Ca. USA 95014
3099 Phone: 408.285.6116
3100 Email: dave.garcia@hp.com

3102 Hari Ghadia
3103 Adaptec, Inc.
3104 691 S. Milpitas Blvd.,
3105 Milpitas, CA 95035 USA
3106 Phone: +1 (408) 957-5608
3107 Email: hari_ghadia@adaptec.com

3109 Howard C. Herbert
3110 Intel Corporation
3111 MS CH7-404
3112 5000 West Chandler Blvd.
3113 Chandler, Arizona 85226
3114 Phone: 480-554-3116
3115 Email: howard.c.herbert@intel.com

3117 Jeff Hilland
3118 Hewlett-Packard Company
3119 20555 SH 249
3120 Houston, Tx. USA 77070-2698
3121 Phone: 281-514-9489
3122 Email: jeff.hilland@hp.com

3124 Mike Ko
3125 IBM
3126 650 Harry Rd.
3127 San Jose, CA 95120
3128 Phone: (408) 927-2085
3129 Email: mako@us.ibm.com

3131 Mike Krause
3132 Hewlett-Packard Corporation, 43LN
3133 19410 Homestead Road
3134 Cupertino, CA 95014 USA
3135 Phone: +1 (408) 447-3191
3136 Email: krause@cup.hp.com

3138 Dave Minturn
3139 Intel Corporation
3140 MS JF1-210
3141 5200 North East Elam Young Parkway
3142 Hillsboro, Oregon 97124
3143 Phone: 503-712-4106
3144 Email: dave.b.minturn@intel.com

3146 Jim Pinkerton
3147 Microsoft, Inc.
3148 One Microsoft Way
3149 Redmond, WA, USA 98052
3150 Email: jpink@microsoft.com

3152 Hemal Shah
3153 16215 Alton Parkway
3154 Irvine, California 92619-7013 USA
3155 Phone: +1 949 926-6941
3156 Email: hemal@broadcom.com

3158 Allyn Romanow
3159 Cisco Systems
3160 170 W Tasman Drive
3161 San Jose, CA 95134 USA
3162 Phone: +1 408 525 8836
3163 Email: allyn@cisco.com

3165 Tom Talpey
3166 Network Appliance
3167 375 Totten Pond Road
3168 Waltham, MA 02451 USA
3169 Phone: +1 (781) 768-5329
3170 Email: thomas.talpey@netapp.com

3172 Patricia Thaler
3173 Broadcom
3174 16215 Alton Parkway
3175 Irvine, CA 92618
3176 Phone: 916 570 2707
3177 Email: pthaler@broadcom.com

3179 Jim Wendt
3180 Hewlett Packard Corporation
3181 8000 Foothills Boulevard MS 5668
3182 Roseville, CA 95747-5668 USA
3183 Phone: +1 916 785 5198
3184 Email: jim_wendt@hp.com

3186 Jim Williams
3187 Emulex Corporation
3188 580 Main Street
3189 Bolton, MA 01740 USA
3190 Phone: +1 978 779 7224
3191 Email: jim.williams@emulex.com

3193 Full Copyright Statement

3195 This document and the information contained herein are provided on an
3196 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
3197 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
3198 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
3199 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
3200 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
3201 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

3203 Copyright (C) The Internet Society (2006). This document is subject
3204 to the rights, licenses and restrictions contained in BCP 78, and
3205 except as set forth therein, the authors retain all their rights.

3207 Intellectual Property

3209 The IETF takes no position regarding the validity or scope of any
3210 Intellectual Property Rights or other rights that might be claimed to
3211 pertain to the implementation or use of the technology described in
3212 this document or the extent to which any license under such rights
3213 might or might not be available; nor does it represent that it has
3214 made any independent effort to identify any such rights. Information
3215 on the procedures with respect to rights in RFC documents can be
3216 found in BCP 78 and BCP 79.

3218 Copies of IPR disclosures made to the IETF Secretariat and any
3219 assurances of licenses to be made available, or the result of an
3220 attempt made to obtain a general license or permission for the use of
3221 such proprietary rights by implementers or users of this
3222 specification can be obtained from the IETF on-line IPR repository at
3223 http://www.ietf.org/ipr.

3225 The IETF invites any interested party to bring to its attention any
3226 copyrights, patents or patent applications, or other proprietary
3227 rights that may cover technology that may be required to implement
3228 this standard. Please address the information to the IETF at
3229 ietf-ipr@ietf.org.