idnits 2.17.1 draft-ietf-rddp-mpa-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 21. -- Found old boilerplate from RFC 3978, Section 5.5 on line 3161. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 3176. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 3183. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 3189. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([DDP]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: C: This bit declares an endpoint's preferred CRC usage. When this field is '0' in the MPA Request Frame and the MPA Reply Frame, CRCs MUST not be checked and need not be generated by either endpoint. When this bit is '1' in either the MPA Request Frame or MPA Reply Frame, CRCs MUST be generated and checked by both endpoints. Note that even when not in use, the CRC field remains present in the FPDU. When CRCs are not in use, the CRC field MUST be considered valid for FPDU checking regardless of its contents. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: 9. MPA implementations MUST validate the PD_Length field. The buffer that receives the Private Data field MUST be large enough to receive that data; the amount of Private Data MUST not exceed the PD_Length, or the application buffer. If any of the above fails, the startup frame MUST be considered improperly formatted. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. 
(See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 30, 2006) is 6538 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'Seconds' is mentioned on line 2341, but not defined == Unused Reference: 'NagleDAck' is defined on line 2200, but no explicit reference was found in the text ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) == Outdated reference: A later version (-10) exists of draft-ietf-rddp-security-09 == Outdated reference: A later version (-07) exists of draft-ietf-rddp-ddp-06 -- Obsolete informational reference (is this intentional?): RFC 2401 (Obsoleted by RFC 4301) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) == Outdated reference: A later version (-04) exists of draft-ietf-nfsv4-channel-bindings-02 == Outdated reference: A later version (-07) exists of draft-ietf-rddp-rdmap-06 -- Obsolete informational reference (is this intentional?): RFC 2960 (Obsoleted by RFC 4960) -- No information found for draft-hilland-iwarp-verbs-v1 - is the name correct? Summary: 6 errors (**), 0 flaws (~~), 11 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Remote Direct Data Placement Work Group P. Culley 2 INTERNET-DRAFT Hewlett-Packard Company 3 draft-ietf-rddp-mpa-04.txt U. Elzur 4 Broadcom Corporation 5 R. Recio 6 IBM Corporation 7 S. Bailey 8 Sandburst Corporation 9 J. Carrier 10 Cray Inc. 
12 Expires: November 2006 May 30, 2006 14 Marker PDU Aligned Framing for TCP Specification 16 Status of this Memo 18 By submitting this Internet-Draft, each author represents that any 19 applicable patent or other IPR claims of which he or she is aware 20 have been or will be disclosed, and any of which he or she becomes 21 aware will be disclosed, in accordance with Section 6 of BCP 79. 23 Internet-Drafts are working documents of the Internet Engineering 24 Task Force (IETF), its areas, and its working groups. Note that 25 other groups may also distribute working documents as Internet- 26 Drafts. 28 Internet-Drafts are draft documents valid for a maximum of six months 29 and may be updated, replaced, or obsoleted by other documents at any 30 time. It is inappropriate to use Internet-Drafts as reference 31 material or to cite them other than as "work in progress." 33 The list of current Internet-Drafts can be accessed at 34 http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft 35 Shadow Directories can be accessed at http://www.ietf.org/shadow.html 37 Abstract 39 MPA (Marker Protocol data unit Aligned framing) is designed to work 40 as an "adaptation layer" between TCP and the Direct Data Placement 41 [DDP] protocol, preserving the reliable, in-order delivery of TCP, 42 while adding the preservation of higher-level protocol record 43 boundaries that DDP requires. MPA is fully compliant with applicable 44 TCP RFCs and can be utilized with existing TCP implementations. MPA 45 also supports integrated implementations that combine TCP, MPA and 46 DDP to reduce buffering requirements in the implementation and 47 improve performance at the system level. 
49 Table of Contents 51 Status of this Memo 1 52 Abstract 1 53 1 Glossary 7 54 2 Introduction 10 55 2.1 Motivation 10 56 2.2 Protocol Overview 10 57 3 LLP and DDP requirements 14 58 3.1 TCP implementation Requirements to support MPA 14 59 3.1.1 TCP Transmit side 14 60 3.1.2 TCP Receive side 14 61 3.2 MPA's interactions with DDP 16 62 4 FPDU Formats 18 63 4.1 Marker Format 19 64 5 Data Transfer Semantics 20 65 5.1 MPA Markers 20 66 5.2 CRC Calculation 23 67 5.3 MPA on TCP Sender Segmentation 26 68 5.3.1 Effects of MPA on TCP Segmentation 27 69 5.3.2 FPDU Size Considerations 29 70 5.4 MPA Receiver FPDU Identification 30 71 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 31 72 6 Connection Semantics 32 73 6.1 Connection setup 32 74 6.1.1 MPA Request and Reply Frame Format 34 75 6.1.2 Connection Startup Rules 35 76 6.1.3 Example Delayed Startup sequence 38 77 6.1.4 Use of Private Data 41 78 6.1.5 "Dual stack" implementations 44 79 6.2 Normal Connection Teardown 45 80 7 Error Semantics 46 81 8 Security Considerations 47 82 8.1 Protocol-specific Security Considerations 47 83 8.1.1 Spoofing 47 84 8.1.2 Eavesdropping 48 85 8.2 Introduction to Security Options 49 86 8.3 Using IPsec With MPA 49 87 8.4 Requirements for IPsec Encapsulation of MPA/DDP 50 88 9 IANA Considerations 51 89 10 References 52 90 10.1 Normative References 52 91 10.2 Informative References 52 92 11 Appendix 54 93 11.1 Analysis of MPA over TCP Operations 54 94 11.1.1 Assumptions 55 95 11.1.2 The Value of FPDU Alignment 56 96 11.2 Receiver implementation 63 97 11.2.1 Network Layer Reassembly Buffers 63 98 11.2.2 TCP Reassembly buffers 64 99 11.3 IETF Implementation Interoperability with RDMA Consortium 100 Protocols 65 101 11.3.1 Negotiated Parameters 65 102 11.3.2 RDMAC RNIC and Non-permissive IETF RNIC 66 103 11.3.3 RDMAC RNIC and Permissive IETF RNIC 68 104 11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC 69 105 12 Author's Addresses 70 106 13 Acknowledgments 71 107 Full Copyright 
Statement 74 108 Intellectual Property 74 110 Table of Figures 112 Figure 1 ULP MPA TCP Layering 11 113 Figure 2 FPDU Format 18 114 Figure 3 Marker Format 19 115 Figure 4 Example FPDU Format with Marker 21 116 Figure 5 Annotated Hex Dump of an FPDU 25 117 Figure 6 Annotated Hex Dump of an FPDU with Marker 26 118 Figure 7 MPA Request/Reply Frame 34 119 Figure 8: Example Delayed Startup negotiation 39 120 Figure 9: Example Immediate Startup negotiation 42 121 Figure 10: Non-aligned FPDU freely placed in TCP octet stream 58 122 Figure 11: Aligned FPDU placed immediately after TCP header 59 123 Figure 12. Connection Parameters for the RNIC Types. 66 124 Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive 125 IETF RNIC. 67 126 Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive 127 IETF RNIC. 68 128 Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a 129 Permissive IETF RNIC. 69 131 Revision history [To be deleted prior to RFC publication] 133 [draft-ietf-rddp-mpa-04] workgroup draft with following changes: 135 Numerous capitalization and "" adjustments, tried to make more 136 consistent. 138 Added some missing capitalized terms to glossary 140 Removed company specific "use as is" boilerplate paragraph 142 Fixed up some contact information and cross references. 144 Removed reference to expired draft-elzur-iwarp-mpa-tcp-analysis- 145 00.txt 147 Suggested MTU to be used to determine EMSS, when otherwise not 148 available; removed technology specific lengths per AD suggestion 149 Tweaked text around disabling Nagle so that it is no longer 150 implied that that is all that is necessary to achieve proper 151 segmentation behavior 153 Revamped section 5.3.1 for improved clarity 155 [draft-ietf-rddp-mpa-03] workgroup draft with following changes: 157 Tweaked abstract to give a bit more information. 
159 Tightened definition and usage of "deliver" 161 Cleaned up usage of terms "FPDU Alignment" and "Header 162 Alignment" 164 Rearranged overview sections with stack and glossary earlier 166 Mentioned how an non-MPA-Aware TCP MPA receiver deals with out 167 of order segments (it doesn't have to...) 169 Fixed description of out of order segment handling in section 170 3.1.1 172 Added text saying that ordering and completion indications are 173 used to deliver to DDP 175 Added redundant text indicating low two bits of FPDUPTR must 176 always be zero and treated as such in Section 4.1 178 Added redundant text indicating Markers are always included in a 179 CRC calculation 181 Removed indication saying that an implementation can "ignore" an 182 administrative input to not use CRCs; clarified that both ends 183 have to agree to not use CRC (as originally intended). 185 Changed example FPDU hex dump format for greater clarity 187 Clarified that EMSS shrinking below 128 bytes is the condition 188 (rather than "very small sizes") 190 Put connection startup rules after the start frame formats 192 Added Initiator Private Data to figure 9 194 Removed or Clarified use of RNIC term 196 Added intro to IETF/RDMAC interoperability appendix and gave a 197 web reference for docs; also recommended use of "permissive IETF 198 RNIC" 199 Numerous minor clarifications 201 Updated Boilerplates per current requirements 203 [draft-ietf-rddp-mpa-02] workgroup draft with following changes: 205 Made IPsec must implement, optional to use. 207 Updated Marker language to clarify that it points to ULPDU 208 Length even when Marker precedes FPDU. 210 Clarified when to start Markers use (in Full Operation mode). 212 Added informative text on interoperability with RDMAC RNICs. 214 Reduced Private Data to 512 octets max. 216 Clarified CRC use description, must be used unless data is at 217 least as well protected by another means. 219 Clarified CRC disabled mode; CRC field is always valid. 
221 Added Security text. 223 Changed DDP and RDMAP version numbers in hex dumps (Fig 5, 6) 224 and adjusted CRC accordingly. 226 [draft-ietf-rddp-mpa-01] workgroup draft with following changes: 228 Added the "R" bit (Rejected) to the MPA Reply Frame and 229 described its semantics. 231 Added some comments on recent decisions regarding startup. 233 Updated RFC3667 boilerplate. 235 [draft-ietf-rddp-mpa-00] workgroup draft with following changes: 237 Changed "Start Key" to two separate startup frames to facilitate 238 identification of incorrect active/active startup. 240 Changed Active/Passive nomenclature to Initiator/Responder to 241 reduce confusion with TCP startup and verbs doc (which used 242 opposite sense). 244 Added Private Data to the startup key sequences. This also 245 required describing the motivation and expected usage models 246 along with some interface hints. Removed the Private Data stuff 247 from appendix. 249 Added example "Immediate" startup with TCP and explanation. 251 [draft-culley-iwarp-mpa-03] 253 Add option to allow receivers to specify Marker use. 255 Add option that allows both sides to agree not to use CRC. 257 Added startup declaration "Start Key" with options and larger 258 MPA mode recognition "key". 260 Updated MPA/DDP connection startup rules and sequence to deal 261 with "Start Key". 263 Added Appendix that provides a more detailed analysis of the 264 effects of MPA on TCP data streams. 266 Added appendix that describes a mechanism to deal with "Private 267 Data" prior to full MPA/DDP operation. 269 [draft-culley-iwarp-mpa-02] 271 Enhanced descriptions of how MPA is used over an unmodified TCP. 273 Removed "No Packing" text. 275 Made MPA an adaptation layer for DDP, instead of a generalized 276 framing solution. 278 Added clarifications of the MPA/TCP interaction for optimized 279 implementations and that any such optimizations are to be used 280 only when requested by MPA. 282 [draft-culley-iwarp-mpa-01] initial draft. 
284 1 Glossary 286 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 287 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 288 this document are to be interpreted as described in RFC 2119. 290 Consumer - the ULPs or applications that lie above MPA and DDP. The 291 Consumer is responsible for making TCP connections, starting MPA 292 and DDP connections, and generally controlling operations. 294 Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as 295 the process of informing DDP that a particular PDU is ordered for 296 use. A PDU is Delivered in the exact order that it was sent by 297 the original sender; MPA uses TCP's byte stream ordering to 298 determine when Delivery is possible. This is specifically 299 different from "passing the PDU to DDP", which may generally 300 occur in any order, while the order of Delivery is strictly 301 defined. 303 EMSS - Effective Maximum Segment Size. EMSS is the smaller of the 304 TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], 305 and the current path Maximum Transmission Unit (MTU) [RFC1191]. 307 FPDU - Framed Protocol Data Unit. The unit of data created by an MPA 308 sender. 310 FPDU Alignment - the property that an FPDU is Header Aligned with the 311 TCP segment, and the TCP segment includes an integer number of 312 FPDUs. A TCP segment with FPDU Alignment allows immediate 313 processing of the contained FPDUs without waiting on other TCP 314 segments to arrive or combining with prior segments. 316 FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate 317 the beginning of an FPDU. 319 Full Operation (Full Operation Phase) - After the completion of the 320 Startup Phase, MPA begins exchanging FPDUs. 322 Header Alignment - the property that a TCP segment begins with an 323 FPDU. The FPDU is Header Aligned when the FPDU header is exactly 324 at the start of the TCP segment (right behind the TCP headers on 325 the wire).
327 Initiator - The endpoint of a connection that sends the MPA Request 328 Frame, i.e., the first to actually send data (which may not be the 329 one which sends the TCP SYN). 331 Marker - A four octet field that is placed in the MPA data stream at 332 fixed octet intervals (every 512 octets). 334 MPA-aware TCP - a TCP implementation that is aware of the receiver 335 efficiencies of MPA FPDU Alignment and is capable of sending TCP 336 segments that begin with an FPDU. 338 MPA-enabled - MPA is enabled if the MPA protocol is visible on the 339 wire. When the sender is MPA-enabled, it is inserting framing 340 and Markers. When the receiver is MPA-enabled, it is 341 interpreting framing and Markers. 343 MPA Request Frame - Data sent from the MPA Initiator to the MPA 344 Responder during the Startup Phase. 346 MPA Reply Frame - Data sent from the MPA Responder to the MPA 347 Initiator during the Startup Phase. 349 MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This 350 document defines the MPA protocol. 352 MULPDU - Maximum ULPDU. The current maximum size of the record that 353 is acceptable for DDP to pass to MPA for transmission. 355 Node - A computing device attached to one or more links of a Network. 356 A Node in this context does not refer to a specific application 357 or protocol instantiation running on the computer. A Node may 358 consist of one or more MPA on TCP devices installed in a host 359 computer. 361 PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact 362 modulo 4 size. 364 PDU - Protocol Data Unit. 366 Private Data - A block of data exchanged between MPA endpoints during 367 initial connection setup. 369 Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that 370 ties the use of various endpoint resources (memory access etc.) to the 371 specific RDMA/DDP/MPA connection.
373 RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA 374 to enable applications to transfer data directly from memory 375 buffers. See [RDMAP]. 377 Remote Peer - The MPA protocol implementation on the opposite end of 378 the connection. Used to refer to the remote entity when 379 describing protocol exchanges or other interactions between two 380 Nodes. 382 Responder - The connection endpoint which responds to an incoming MPA 383 connection request (the MPA Request Frame). This may not be the 384 endpoint which awaited the TCP SYN. 386 Startup Phase - The initial exchanges of an MPA connection which 387 serve to more fully identify MPA endpoints to each other and 388 pass connection specific setup information to each other. 390 ULP - Upper Layer Protocol. The protocol layer above the protocol 391 layer currently being referenced. The ULP for MPA is DDP [DDP]. 393 ULPDU - Upper Layer Protocol Data Unit. The data record defined by 394 the layer above MPA (DDP). A ULPDU corresponds to a DDP 395 segment. 397 ULPDU_Length - a field in the FPDU describing the length of the 398 included ULPDU. 400 2 Introduction 402 This section discusses the reason for creating MPA on TCP and gives a 403 general overview of the protocol. Later sections show the MPA 404 headers (see section 4 on page 18), and detailed protocol 405 requirements and characteristics (see section 5 on page 20), as well 406 as Connection Semantics (section 6 on page 32), Error Semantics 407 (section 7 on page 46), and Security Considerations (section 8 on 408 page 47). 410 2.1 Motivation 412 The Direct Data Placement protocol [DDP], when used with TCP [RFC793], 413 requires a mechanism to detect record boundaries. The DDP records 414 are referred to as Upper Layer Protocol Data Units by this document.
415 The ability to locate the Upper Layer Protocol Data Unit (ULPDU) 416 boundary is useful to a hardware network adapter that uses DDP to 417 directly place the data in the application buffer based on the 418 control information carried in the ULPDU header. This may be done 419 without requiring that the packets arrive in order. Potential 420 benefits of this capability are the avoidance of the memory copy 421 overhead and a smaller memory requirement for handling out of order 422 or dropped packets. 424 Many approaches have been proposed for a generalized framing 425 mechanism. Some are probabilistic in nature and others are 426 deterministic. A probabilistic approach is characterized by a 427 detectable value embedded in the octet stream. It is probabilistic 428 because under some conditions the receiver may incorrectly interpret 429 application data as the detectable value. Under these conditions, 430 the protocol may fail with unacceptable frequency. A deterministic 431 approach is characterized by embedded controls at known locations in 432 the octet stream. Because the receiver can guarantee it will only 433 examine the data stream at locations that are known to contain the 434 embedded control, the protocol can never misinterpret application 435 data as being embedded control data. For unambiguous handling of an 436 out of order packet, the deterministic approach is preferred. 438 The MPA protocol provides a framing mechanism for DDP running over 439 TCP using the deterministic approach. It allows the location of the 440 ULPDU to be determined in the TCP stream even if the TCP segments 441 arrive out of order. 443 2.2 Protocol Overview 445 The layering of PDUs with MPA is shown in Figure 1, below. 
447 +------------------+ 448 | ULP client | 449 +------------------+ <- Consumer messages 450 | DDP | 451 +------------------+ <- ULPDUs 452 | MPA | 453 +------------------+ <- FPDUs (containing ULPDUs) 454 | TCP* | 455 +------------------+ <- TCP Segments (containing FPDUs) 456 | IP etc. | 457 +------------------+ 458 * TCP or MPA-aware TCP. 460 Figure 1 ULP MPA TCP Layering 462 MPA is described as an extra layer above TCP and below DDP. The 463 operation sequence is: 465 1. A TCP connection is established by ULP action. This is done 466 using methods not described by this specification. The ULP may 467 exchange some amount of data in streaming mode prior to starting 468 MPA, but is not required to do so. 470 2. The Consumer negotiates the use of DDP and MPA at both ends of a 471 connection. The mechanisms to do this are not described in this 472 specification. The negotiation may be done in streaming mode, or 473 by some other mechanism (such as a pre-arranged port number). 475 3. The ULP activates MPA on each end in the Startup Phase, either as 476 an Initiator or a Responder, as determined by the ULP. This mode 477 verifies the usage of MPA, specifies the use of CRC and Markers, 478 and allows the ULP to communicate some additional data via a 479 Private Data exchange. See section 6.1 Connection setup for more 480 details on the startup process. 482 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into 483 Full Operation and begins sending DDP data as further described 484 below. In this document, DDP data chunks are called ULPDUs. For 485 a description of the DDP data, see [DDP]. 487 Following is a description of data transfer when MPA is in Full 488 Operation. 490 1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA 491 for this value. MPA derives this information from TCP or IP, 492 when it is available, or chooses a reasonable value. 494 2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to 495 MPA at the sender. 
497 3. MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a 498 header, optionally inserting Markers, and appending a CRC field 499 after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP. 501 4. The TCP sender puts the FPDUs into the TCP stream. If the TCP 502 Sender is MPA-aware, it segments the TCP stream in such a way 503 that a TCP Segment boundary is also the boundary of an FPDU. TCP 504 then passes each segment to the IP layer for transmission. 506 5. The TCP receiver may be MPA-aware or may not be MPA-aware. If it 507 is MPA-aware, it may separate passing the TCP payload to MPA from 508 passing the TCP payload ordering information to MPA. In either 509 case, RFC compliant TCP wire behavior is observed at both the 510 sender and receiver. 512 6. The MPA receiver locates and assembles complete FPDUs within the 513 stream, verifies their integrity, and removes MPA Markers (when 514 present), ULPDU_Length, PAD and the CRC field. 516 7. MPA then provides the complete ULPDUs to DDP. MPA may also 517 separate passing MPA payload to DDP from passing the MPA payload 518 ordering information. 520 MPA-aware TCP is a TCP layer which potentially contains some 521 additional semantics as defined in this document. MPA is implemented 522 as a data stream ULP for TCP and is therefore RFC compliant. MPA- 523 aware TCP is RFC compliant. 525 An MPA-aware TCP sender is able to segment the data stream such that 526 TCP segments begin with FPDUs (FPDU Alignment). This has significant 527 advantages for receivers. When segments arrive with aligned FPDUs 528 the receiver usually need not buffer any portion of the segment, 529 allowing DDP to place it in its destination memory immediately, thus 530 avoiding copies from intermediate buffers (DDP's reason for 531 existence). 533 MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation 534 to locate the start of ULPDUs that may be received out of order. 
It 535 also allows the implementation to determine if the entire ULPDU has 536 been received. As a result, MPA can pass out of order ULPDUs to DDP 537 for immediate use. This enables a DDP on MPA implementation to save 538 a significant amount of intermediate storage by placing the ULPDUs in 539 the right locations in the application buffers when they arrive, 540 rather than waiting until full ordering can be restored. 542 The ability of a receiver to recover out of order ULPDUs is optional 543 and declared to the transmitter during startup. When the receiver 544 declares that it does not support out of order recovery, the 545 transmitter does not add the control information to the data stream 546 needed for out of order recovery. 548 If TCP is not MPA-aware, then MPA receives a strictly ordered stream 549 of data and does not deal with out of order ULPDUs. In this case MPA 550 passes each ULPDU to DDP when the last bytes arrive from TCP, along 551 with the indication that they are in order. 553 MPA implementations that support recovery of out of order ULPDUs MUST 554 support a mechanism to indicate the ordering of ULPDUs as the sender 555 transmitted them and indicate when missing intermediate segments 556 arrive. These mechanisms allow DDP to reestablish record ordering 557 and report Delivery of complete messages (groups of records). 559 MPA also addresses enhanced data integrity. Some users of TCP have 560 noted that the TCP checksum is not as strong as could be desired (see 561 [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum 562 indicates segments in error at a much higher rate than the underlying 563 link characteristics would indicate. With these higher error rates, 564 the chance that an error will escape detection, when using only the 565 TCP checksum for data integrity, becomes a concern. A stronger 566 integrity check can reduce the chance of data errors being missed. 
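As a minimal illustration of why a stronger check helps, the Python sketch below compares a 16-bit ones' complement checksum (the style of sum the TCP checksum uses) with a CRC on a payload whose first two 16-bit words have been swapped. The helper name internet_checksum and the sample payload are hypothetical. zlib.crc32 is used only as a readily available stand-in from the standard library; it is not claimed to be the CRC that MPA specifies (see section 5.2 for that).

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, in the style used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload, chosen only for illustration.
original = bytes.fromhex("0001f203f4f5f6f7")
# Swap the first two 16-bit words. Addition is commutative, so the
# ones' complement sum cannot see this kind of corruption.
swapped = original[2:4] + original[0:2] + original[4:]

same_checksum = internet_checksum(original) == internet_checksum(swapped)
same_crc = zlib.crc32(original) == zlib.crc32(swapped)

print("checksum unchanged:", same_checksum)
print("CRC unchanged:", same_crc)
```

The reordered payload keeps the same ones' complement checksum, while the CRC changes because the corruption is confined to a 32-bit burst, which a 32-bit CRC always detects.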
568 MPA includes a CRC check to increase the ULPDU data integrity to the 569 level provided by other modern protocols, such as SCTP [RFC2960]. It 570 is possible to disable this CRC check; however, CRCs MUST be enabled 571 unless it is clear that the end to end connection through the network 572 has data integrity at least as good as an MPA connection with CRC enabled (for 573 example when IPsec is implemented end to end). DDP's ULP expects 574 this level of data integrity and therefore the ULP does not have to 575 provide its own duplicate data integrity and error recovery for lost 576 data. 578 3 LLP and DDP requirements 580 The following sections describe requirements on TCP and DDP to 581 utilize MPA. The DDP requirements enable the correct operation over 582 MPA and TCP (as opposed to DDP over SCTP or other LLPs). 584 The TCP requirements are mostly intended to support the MPA-aware TCP 585 variation, which allows implementations that require less buffer 586 memory and may provide better overall system performance. 588 3.1 TCP implementation Requirements to support MPA 590 The TCP implementation MUST inform MPA when the TCP connection is 591 closed or has begun closing (e.g., a FIN has been received). 593 3.1.1 TCP Transmit side 595 To provide optimum performance, an MPA-aware transmit side TCP 596 implementation SHOULD be enabled to: 598 * With an EMSS large enough to contain the FPDU(s), segment the 599 outgoing TCP stream such that every TCP 600 Segment begins with the first octet of an FPDU. Multiple FPDUs MAY be packed into a 601 single TCP segment as long as they are entirely contained in the 602 TCP segment. 604 * Report the current EMSS to the MPA transmit layer. 606 An MPA-aware TCP transmit side implementation MUST continue to use 607 the method of segmentation expected by non-MPA applications (and 608 described in TCP RFCs) when MPA is not enabled on the connection.
609 When MPA is enabled above an MPA-aware TCP, it SHOULD specifically 610 enable the segmentation rules described above for the DDP segments 611 (FPDUs) posted for transmission. 613 If the transmit side TCP implementation is not able to segment the 614 TCP stream as indicated above, MPA SHOULD make a best effort to 615 achieve that result. For example, using the TCP_NODELAY socket 616 option to disable the Nagle algorithm will usually result in many of 617 the segments starting with an FPDU. 619 If the transmit side TCP implementation is not able to report the 620 EMSS, MPA SHOULD use the current MTU value to establish a likely FPDU 621 size, taking into account the various expected header sizes. 623 3.1.2 TCP Receive side 625 When an MPA receive implementation and the MPA-aware receive side TCP 626 implementation support handling out of order ULPDUs, the TCP receive 627 implementation SHOULD be enabled to: 629 * Pass incoming TCP segments to MPA as soon as they have been 630 received and validated, even if not received in order. The TCP 631 layer MUST have committed to keeping each segment before it can 632 be passed to MPA. This means that the segment must have 633 passed the TCP, IP, and lower layer data integrity validation 634 (i.e., checksum), must be in the receive window, must not be a 635 duplicate, must be part of the same epoch (if timestamps are used 636 to verify this), and must pass any other checks required by TCP RFCs. The 637 segment MUST NOT be passed to MPA more than once unless 638 explicitly requested (see Section 7). 640 This is not to imply that the data must be completely ordered 641 before use. An implementation MAY accept out of order segments, 642 SACK them [RFC2018], and pass them to DDP immediately, before the 643 segments needed to fill in the gaps arrive. 644 Such an implementation MUST "commit" to the data early on, and 645 MUST NOT overwrite it even if (or when) duplicate data arrives.
646 MPA expects to utilize this "commit" to allow the passing of 647 ULPDUs to DDP when they arrive, independent of ordering. DDP 648 uses the passed ULPDU to "place" the DDP segments (see [DDP] for 649 more details). 651 * Provide a mechanism to indicate the ordering of TCP segments as 652 the sender transmitted them. One possible mechanism might be 653 attaching the TCP sequence number to each segment. 655 * Provide a mechanism to indicate when a given TCP segment (and the 656 prior TCP stream) is complete. One possible mechanism might be 657 to utilize the leading (left) edge of the TCP Receive Window. 659 MPA uses the ordering and completion indications to inform DDP 660 when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses 661 the indications to "deliver" its messages to the DDP consumer 662 (see [DDP] for more details). 664 DDP on MPA MUST utilize these two mechanisms to establish the 665 Delivery semantics that DDP's consumers agree to. These 666 semantics are described fully in [DDP]. These include 667 requirements on DDP's consumer to respect ownership of buffers 668 prior to the time that DDP delivers them to the Consumer. 670 An MPA-aware TCP receive side implementation MUST continue to buffer 671 TCP segments until completely ordered and then deliver them as 672 expected by non-MPA applications (and described in TCP RFCs) when MPA 673 is not enabled on the connection. When MPA is enabled above an MPA- 674 aware TCP, TCP SHOULD enable the in and out of order passing of data, 675 and the separate ordering information as described above. 677 When an MPA receive implementation is coupled with a TCP receive 678 implementation that does not support the preceding mechanisms, TCP 679 passes and Delivers incoming stream data to MPA in order. 681 3.2 MPA's interactions with DDP 683 DDP requires MPA to maintain DDP record boundaries from the sender to 684 the receiver. When using MPA on TCP to send data, DDP provides 685 records (ULPDUs) to MPA. 
MPA will use the reliable transmission 686 abilities of TCP to transmit the data, and will insert appropriate 687 additional information into the TCP stream to allow the MPA receiver 688 to locate the record boundary information. 690 As such, MPA accepts complete records (ULPDUs) from DDP at the sender 691 and returns them to DDP at the receiver. 693 MPA combined with an MPA-aware TCP can only ensure FPDU Alignment 694 with the TCP Header if the FPDU is less than or equal to TCP's EMSS. 695 Since FPDU Alignment is generally desired by the receiver, DDP must 696 cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS 697 under normal conditions. This is done with the MULPDU mechanism. 699 MPA provides information to DDP on the current maximum size of the 700 record that is acceptable to send (MULPDU). DDP SHOULD limit each 701 record size to MULPDU. The range of MULPDU values MUST be between 702 128 octets and 64768 octets, inclusive. 704 The sending DDP MUST NOT post a ULPDU larger than 64768 octets to 705 MPA. DDP MAY post a ULPDU of any size between one and 64768 octets, 706 however MPA is not REQUIRED to support a ULPDU Length that is greater 707 than the current MULPDU. 709 While the maximum theoretical length supported by the MPA header 710 ULPDU_Length field is 65535, TCP over IP requires the IP datagram 711 maximum length to be 65535 octets. To enable MPA to support FPDU 712 Alignment, the maximum size of the FPDU must fit within an IP 713 datagram. Thus the ULPDU limit of 64768 octets was derived by taking 714 the maximum IP datagram length, subtracting from it the maximum total 715 length of the sum of the IPv4 header, TCP header, IPv4 options, TCP 716 options, and the worst case MPA overhead, and then rounding the 717 result down to a 128 octet boundary. 719 On receive, MPA MUST pass each ULPDU with its length to DDP when it 720 has been validated. 
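The derivation of the 64768 octet limit described above can be checked with a short calculation. The following is a sketch only: the draft does not list the individual values, so the 60 octet maximum IPv4 and TCP header sizes (each including options) and the makeup of the "worst case MPA overhead" (6 fixed octets, 3 pad octets, and one 4 octet Marker per 512 octets) are assumptions, and all names are hypothetical.

```python
# Sketch of the ULPDU limit derivation; every constant below is an assumption
# about what the draft means by "maximum" header and overhead sizes.
MAX_IP_DATAGRAM = 65535      # maximum IP datagram length, in octets
MAX_IPV4_HEADER = 60         # assumed: 20 fixed octets + 40 octets of options
MAX_TCP_HEADER = 60          # assumed: 20 fixed octets + 40 octets of options
FPDU_FIXED_OVERHEAD = 6      # ULPDU_Length (2 octets) + CRC (4 octets)
MAX_PAD = 3                  # worst-case pad
MARKER_SIZE = 4
MARKER_INTERVAL = 512


def max_ulpdu_limit():
    """Return the largest ULPDU size, rounded down to a 128 octet boundary."""
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    # Ceiling division: worst-case number of Markers in the TCP payload.
    markers = -(-tcp_payload // MARKER_INTERVAL)
    mpa_overhead = FPDU_FIXED_OVERHEAD + MAX_PAD + MARKER_SIZE * markers
    return (tcp_payload - mpa_overhead) // 128 * 128


print(max_ulpdu_limit())  # 64768
```

Under these assumptions the calculation reproduces the limit stated in the text; other readings of the worst-case overhead may round to the same value.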
722 If an MPA implementation supports passing out of order ULPDUs to DDP, 723 the MPA implementation SHOULD: 725 * Pass each ULPDU with its length to DDP as soon as it has been 726 fully received and validated. 728 * Provide a mechanism to indicate the ordering of ULPDUs as the 729 sender transmitted them. One possible mechanism might be 730 providing the TCP sequence number for each ULPDU. 732 * Provide a mechanism to indicate when a given ULPDU (and prior 733 ULPDUs) are complete (Delivered to DDP). One possible mechanism 734 might be to allow DDP to see the current outgoing TCP Ack 735 sequence number. 737 * Provide an indication to DDP that the TCP has closed or has begun 738 to close the connection (e.g. received a FIN). 740 MPA MUST provide the protocol version negotiated with its peer to 741 DDP. DDP will use this version to set the version in its header and 742 to report the version to [RDMAP]. 744 4 FPDU Formats 746 MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown 747 below MUST be used for all MPA FPDUs. For purposes of clarity, 748 Markers are not shown in Figure 2. 750 0 1 2 3 751 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 752 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 753 | ULPDU_Length | | 754 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 755 | | 756 ~ ~ 757 ~ ULPDU ~ 758 | | 759 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 760 | | PAD (0-3 octets) | 761 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 762 | CRC | 763 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 764 Figure 2 FPDU Format 766 ULPDU_Length: 16 bits (unsigned integer). This is the number of 767 octets of the contained ULPDU. It does not include the length of the 768 FPDU header itself, the pad, the CRC, or of any Markers that fall 769 within the ULPDU. The 16-bit ULPDU Length field is large enough to 770 support the largest IP datagrams for IPv4 or IPv6. 
772 PAD: The PAD field trails the ULPDU and contains between zero and 773 three octets of data. The pad data MUST be set to zero by the sender 774 and ignored by the receiver (except for CRC checking). The length of 775 the pad is set so as to make the size of the FPDU an integral 776 multiple of four. 778 CRC: 32 bits. When CRCs are enabled, this field contains a CRC32C 779 check value, which is used to verify the entire contents of the FPDU. 780 See section 5.2 CRC Calculation on page 23. When CRCs 781 are not enabled, this field is still present, may contain any value, 782 and MUST NOT be checked. 784 The FPDU adds a minimum of 6 octets to the length of the ULPDU. In 785 addition, the total length of the FPDU will include the length of any 786 Markers and from 0 to 3 pad octets added to round up the ULPDU size. 788 4.1 Marker Format 790 The format of a Marker MUST be as specified in Figure 3: 792 0 1 2 3 793 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 794 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 795 | RESERVED | FPDUPTR | 796 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 797 Figure 3 Marker Format 799 RESERVED: The Reserved field MUST be set to zero on transmit and 800 ignored on receive (except for CRC calculation). 802 FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, 803 interpreted as an unsigned integer that indicates the number of 804 octets in the TCP stream from the beginning of the ULPDU Length field 805 to the first octet of the entire Marker. The least significant two 806 bits MUST always be set to zero at the transmitter, and the receivers 807 MUST always treat these as zero for calculations. 809 5 Data Transfer Semantics 811 This section discusses some characteristics and behavior of the MPA 812 protocol as well as implications of that protocol.
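As an illustration of the FPDU and Marker layouts in section 4, the following sketch computes the pad and total FPDU length for a Marker-free FPDU and encodes a Marker. It is not a definitive implementation: the function names are hypothetical, and the big-endian field encoding is an assumption inferred from the annotated hex dumps in this document.

```python
import struct

LENGTH_FIELD_SIZE = 2   # ULPDU_Length, 16 bits
CRC_FIELD_SIZE = 4      # CRC, 32 bits (always present)


def pad_length(ulpdu_len):
    """Pad octets (0 to 3) that make the FPDU a multiple of four octets."""
    return -(LENGTH_FIELD_SIZE + ulpdu_len) % 4


def fpdu_length(ulpdu_len):
    """Total FPDU length in octets, not counting any Markers."""
    return LENGTH_FIELD_SIZE + ulpdu_len + pad_length(ulpdu_len) + CRC_FIELD_SIZE


def build_fpdu_body(ulpdu):
    """ULPDU_Length + ULPDU + zeroed PAD; the CRC is appended separately."""
    header = struct.pack(">H", len(ulpdu))  # assumed network byte order
    return header + ulpdu + bytes(pad_length(len(ulpdu)))


def encode_marker(fpduptr):
    """Encode a Marker: 16 reserved zero bits followed by the FPDUPTR."""
    if not 0 <= fpduptr <= 0xFFFF or fpduptr & 0x3:
        raise ValueError("FPDUPTR must fit in 16 bits with the low two bits zero")
    return struct.pack(">HH", 0, fpduptr)
```

For example, a 16 octet ULPDU needs 2 pad octets and yields a 24 octet FPDU, while a 42 octet ULPDU needs no pad and yields a 48 octet FPDU.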
814 5.1 MPA Markers 816 MPA Markers are used to identify the start of FPDUs when packets are 817 received out of order. This is done by locating the Markers at fixed 818 intervals in the data stream (which is correlated to the TCP sequence 819 number) and using the Marker value to locate the preceding FPDU 820 start. 822 All MPA Markers are included in the containing FPDU CRC calculation 823 (when both CRCs and Markers are in use). 825 The MPA receiver's ability to locate out of order FPDUs and pass the 826 ULPDUs to DDP is implementation dependent. MPA/DDP allows those 827 receivers that are able to deal with out of order FPDUs in this way 828 to require the insertion of Markers in the data stream. When the 829 receiver cannot deal with out of order FPDUs in this way, it may 830 disable the insertion of Markers at the sender. All MPA senders MUST 831 be able to generate Markers when their use is declared by the 832 opposing receiver (see section 6.1 Connection setup on page 32). 834 When Markers are enabled, MPA senders MUST insert a Marker into the 835 data stream at a 512 octet periodic interval in the TCP Sequence 836 Number Space. The Marker contains a 16 bit unsigned integer referred 837 to as the FPDUPTR (FPDU Pointer). 839 If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit 840 relative back-pointer. FPDUPTR MUST contain the number of octets in 841 the TCP stream from the beginning of the ULPDU Length field to the 842 first octet of the Marker, unless the Marker falls between FPDUs. 843 Thus the location of the first octet of the previous FPDU header can 844 be determined by subtracting the value of the given Marker from the 845 current octet-stream sequence number (i.e. TCP sequence number) of 846 the first octet of the Marker. Note that this computation MUST take 847 into account that the TCP sequence number could have wrapped between 848 the Marker and the header. 
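The back-pointer computation described above, including sequence number wrap, can be sketched as follows. The function name is hypothetical, and the sketch assumes 32-bit TCP sequence numbers and a 4 octet Marker.

```python
SEQ_SPACE = 1 << 32   # TCP sequence numbers are 32 bits and wrap
MARKER_SIZE = 4


def fpdu_start_from_marker(marker_seq, fpduptr):
    """Return the sequence number of the first octet of the ULPDU Length field.

    marker_seq is the sequence number of the first octet of the Marker.
    """
    fpduptr &= 0xFFFC  # the low two bits are treated as zero by the receiver
    if fpduptr == 0:
        # Marker falls between FPDUs: the FPDU header follows the Marker.
        return (marker_seq + MARKER_SIZE) % SEQ_SPACE
    # Modular subtraction handles a wrap between the header and the Marker.
    return (marker_seq - fpduptr) % SEQ_SPACE


print(hex(fpdu_start_from_marker(0x200, 0x14)))  # 0x1ec
print(hex(fpdu_start_from_marker(8, 0x10)))      # 0xfffffff8
```

The second call shows the wrapped case: the Marker sits just after the sequence number wrapped, so the FPDU header is found near the top of the sequence space.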
850 An FPDUPTR value of 0x0000 is a special case - it is used when the 851 Marker falls exactly between FPDUs (between the preceding FPDU CRC 852 field, and the next FPDU's ULPDU Length field). In this case, the 853 Marker is considered to be contained in the following FPDU; the 854 Marker MUST be included in the CRC calculation of the FPDU following 855 the Marker (if CRCs are being generated or checked). Thus an FPDUPTR 856 value of 0x0000 means that immediately following the Marker is an 857 FPDU header (the ULPDU Length field). 859 Since all FPDUs are integral multiples of 4 octets, the bottom two 860 bits of the FPDUPTR as calculated by the sender are zero. MPA 861 reserves these bits so they MUST be treated as zero for computation 862 at the receiver. 864 When Markers are enabled (see section 6.1 Connection setup on page 865 32), the MPA Markers MUST be inserted immediately preceding the first 866 FPDU of Full Operation phase, and at every 512th octet of the TCP 867 octet stream thereafter. As a result, the first Marker has an 868 FPDUPTR value of 0x0000. If the first Marker begins at octet 869 sequence number SeqStart, then Markers are inserted such that the 870 first octet of the Marker is at octet sequence number SeqNum if the 871 remainder of (SeqNum - SeqStart) mod 512 is zero. Note that SeqNum 872 can wrap. 874 For example, if the TCP sequence number were used to calculate the 875 insertion point of the Marker, the starting TCP sequence number is 876 unlikely to be zero, and 512 octet multiples are unlikely to fall on 877 a modulo 512 of zero. If the MPA connection is started at TCP 878 sequence number 11, then the 1st Marker will begin at 11, and 879 subsequent Markers will begin at 523, 1035, etc. 881 If an FPDU is large enough to contain multiple Markers, they MUST all 882 point to the same point in the TCP stream: the first octet of the 883 ULPDU Length field for the FPDU. 
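The Marker placement rule above can be sketched as a small calculation over the sequence number space. The names are hypothetical; the sketch assumes 32-bit TCP sequence numbers, and relies on 2^32 being a multiple of 512 so that the modular test stays valid across a wrap.

```python
SEQ_SPACE = 1 << 32
MARKER_INTERVAL = 512


def is_marker_position(seq_num, seq_start):
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_SPACE) % MARKER_INTERVAL == 0


def marker_positions(seq_start, count):
    """Sequence numbers of the first `count` Markers, allowing for wrap."""
    return [(seq_start + i * MARKER_INTERVAL) % SEQ_SPACE for i in range(count)]


print(marker_positions(11, 3))  # [11, 523, 1035]
```

The printed positions match the example in the text for a connection whose MPA stream starts at TCP sequence number 11.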
885 If a Marker interval contains multiple FPDUs (the FPDUs are small), 886 the Marker MUST point to the start of the ULPDU Length field for the 887 FPDU containing the Marker unless the Marker falls between FPDUs, in 888 which case the FPDUPTR MUST be zero. 890 The following example shows an FPDU containing a Marker. 892 0 1 2 3 893 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 894 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 895 | ULPDU Length (0x0010) | | 896 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 897 | | 898 + + 899 | ULPDU (octets 0-9) | 900 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 901 | (0x0000) | FPDU ptr (0x000C) | 902 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 903 | ULPDU (octets 10-15) | 904 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 905 | | PAD (2 octets:0,0) | 906 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 907 | CRC | 908 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 909 Figure 4 Example FPDU Format with Marker 911 MPA Receivers MUST preserve ULPDU boundaries when passing data to 912 DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to 913 DDP and not the Markers, headers, and CRC. 915 5.2 CRC Calculation 917 An MPA implementation MUST implement CRC support and MUST either: 919 (1) always use CRCs; the MPA provider is not REQUIRED to support 920 an administrator's request that CRCs not be used. 922 or 924 (2a) only indicate a preference to not use CRCs on the explicit 925 request of the system administrator, via an interface not defined 926 in this spec. The default configuration for a connection MUST be 927 to use CRCs. 929 (2b) disable CRC checking (and possibly generation) if both the local 930 and remote endpoints indicate preference to not use CRCs.
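The rules above, together with the CRC32C check value named in section 4, can be sketched as follows. This is an illustration under stated assumptions, not the draft's definition: the table-driven routine uses the commonly published CRC32C (Castagnoli) parameters (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF), the function names are hypothetical, and the octet order in which the result is placed into the CRC field is not addressed here.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def _make_table():
    """Build the 256-entry lookup table for byte-at-a-time CRC32C."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Return the CRC32C of a bytes-like object as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc = _TABLE[(crc ^ octet) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crcs_in_use(local_prefers_crc, peer_prefers_crc):
    """CRCs may be dropped only when both endpoints prefer not to use them."""
    return local_prefers_crc or peer_prefers_crc


print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard CRC32C check value
```

The default for `local_prefers_crc` in a real configuration would be true, since the default configuration for a connection is to use CRCs.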
932 The decision for hosts to request CRC suppression MAY be made on an 933 administrative basis for any path that provides equivalent protection 934 from undetected errors as an end-to-end CRC32c. 936 The process MUST be invisible to the ULP. 938 After receipt of an MPA startup declaration indicating that its peer 939 requires CRCs, an MPA instance MUST continue generating and checking 940 CRCs until the connection terminates. If an MPA instance has 941 declared that it does not require CRCs, it MUST turn off CRC checking 942 immediately after receipt of an MPA mode declaration indicating that 943 its peer also does not require CRCs. It MAY continue generating 944 CRCs. See section 6.1 Connection setup on page 32 for details on the 945 MPA startup. 947 When sending an FPDU, the sender MUST include a CRC field. When CRCs 948 are enabled, the CRC field in the MPA FPDU MUST be computed using the 949 CRC32C polynomial in the manner described in the iSCSI Protocol 950 [iSCSI] document for Header and Data Digests. 952 The fields which MUST be included in the CRC calculation when sending 953 an FPDU are as follows: 955 1) If a Marker does not immediately precede the ULPDU Length field, 956 the CRC-32c is calculated from the first octet of the ULPDU 957 Length field, through all the ULPDU and Markers (if present), to 958 the last octet of the PAD (if present), inclusive. If there is a 959 Marker immediately following the PAD, the Marker is included in 960 the CRC calculation for this FPDU. 962 2) If a Marker immediately precedes the first octet of the ULPDU 963 Length field of the FPDU, (i.e. the Marker fell between FPDUs, 964 and thus is required to be included in the second FPDU), the CRC- 965 32c is calculated from the first octet of the Marker, through the 966 ULPDU Length header, through all the ULPDU and Markers (if 967 present), to the last octet of the PAD (if present), inclusive. 
969 3) After calculating the CRC-32c, the resultant value is placed into 970 the CRC field at the end of the FPDU. 972 When an FPDU is received, and CRC checking is enabled, the receiver 973 MUST first perform the following: 975 1) Calculate the CRC of the incoming FPDU in the same fashion as 976 defined above. 978 2) Verify that the calculated CRC-32c value is the same as the 979 received CRC-32c value found in the FPDU CRC field. If not, the 980 receiver MUST treat the FPDU as an invalid FPDU. 982 The procedure for handling invalid FPDUs is covered in the Error 983 Section (see section 7 on page 46). 985 The following is an annotated hex dump of an example FPDU sent as the 986 first FPDU on the stream. As such, it starts with a Marker. The 987 FPDU contains a 42 octet ULPDU (an example DDP segment), which in turn 988 carries a 24 octet data load that 989 is all zeros. The CRC32c has been correctly calculated and can be 990 used as a reference. See the [DDP] and [RDMAP] specifications for 991 definitions of the DDP Control field, Queue, MSN, MO, and Send Data. 993 Octet Contents Annotation 994 Count 996 0000 00 Marker: Reserved 997 0001 00 998 0002 00 Marker: FPDUPTR 999 0003 00 1000 0004 00 ULPDU Length 1001 0005 2a 1002 0006 41 DDP Control Field, Send with Last flag set 1003 0007 43 1004 0008 00 Reserved (DDP STag position with no STag) 1005 0009 00 1006 000a 00 1007 000b 00 1008 000c 00 DDP Queue = 0 1009 000d 00 1010 000e 00 1011 000f 00 1012 0010 00 DDP MSN = 1 1013 0011 00 1014 0012 00 1015 0013 01 1016 0014 00 DDP MO = 0 1017 0015 00 1018 0016 00 1019 0017 00 1020 0018 00 DDP Send Data (24 octets of zeros) 1021 ...
1022 002f 00 1023 0030 52 CRC32c 1024 0031 23 1025 0032 99 1026 0033 83 1027 Figure 5 Annotated Hex Dump of an FPDU 1029 The following is an example sent as the second FPDU of the stream 1030 where the first FPDU (which is not shown here) had a length of 492 1031 octets and was also a Send to Queue 0 with Last Flag set. This 1032 example contains a Marker. 1034 Octet Contents Annotation 1035 Count 1037 01ec 00 Length 1038 01ed 2a 1039 01ee 41 DDP Control Field: Send with Last Flag set 1040 01ef 43 1041 01f0 00 Reserved (DDP STag position with no STag) 1042 01f1 00 1043 01f2 00 1044 01f3 00 1045 01f4 00 DDP Queue = 0 1046 01f5 00 1047 01f6 00 1048 01f7 00 1049 01f8 00 DDP MSN = 2 1050 01f9 00 1051 01fa 00 1052 01fb 02 1053 01fc 00 DDP MO = 0 1054 01fd 00 1055 01fe 00 1056 01ff 00 1057 0200 00 Marker: Reserved 1058 0201 00 1059 0202 00 Marker: FPDUPTR 1060 0203 14 1061 0204 00 DDP Send Data (24 octets of zeros) 1062 ... 1063 021b 00 1064 021c 84 CRC32c 1065 021d 92 1066 021e 58 1067 021f 98 1068 Figure 6 Annotated Hex Dump of an FPDU with Marker 1070 5.3 MPA on TCP Sender Segmentation 1072 The various TCP RFCs allow considerable choice in segmenting a TCP 1073 stream. In order to optimize FPDU recovery at the MPA receiver, MPA 1074 specifies additional segmentation rules. 1076 MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU 1077 contained in one FPDU. 1079 An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP 1080 implementations that support this, and with an EMSS large enough to 1081 contain at least one FPDU, segment the outbound TCP stream such that 1082 each TCP segment begins with an FPDU, and fully contains all included 1083 FPDUs. 1085 Implementation note: To achieve the previous segmentation rule, 1086 an MPA-aware TCP sender implementation SHOULD disable TCP's 1087 Nagle [RFC0896] algorithm, communicate the FPDU boundaries to 1088 TCP, and make other minor changes such as the reporting of EMSS 1089 to MPA. 
1091 There are exceptions to the above rule. Once an ULPDU is provided to 1092 MPA, the MPA on TCP sender MUST transmit it or fail the connection; 1093 it cannot be repudiated. As a result, during changes in MTU and 1094 EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it 1095 may be necessary to send FPDUs that do not conform to the 1096 segmentation rule above. 1098 A possible, but less desirable, alternative is to use IP 1099 fragmentation on accepted FPDUs to deal with MTU reductions or 1100 extremely small EMSS. 1102 The sender MUST still format the FPDU according to FPDU format as 1103 shown in Figure 2. 1105 On a retransmission, TCP does not necessarily preserve original TCP 1106 segmentation boundaries. This can lead to the loss of FPDU Alignment 1107 and containment within a TCP segment during TCP retransmissions. An 1108 MPA-aware TCP sender SHOULD try to preserve original TCP segmentation 1109 boundaries on a retransmission. 1111 5.3.1 Effects of MPA on TCP Segmentation 1113 DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU 1114 when a DDP message is large enough. Since the DDP message may not 1115 exactly fit into TCP segments, a "message tail" often occurs that 1116 results in an FPDU that is smaller than a single TCP segment. 1117 Additionally some DDP messages may be considerably shorter than the 1118 EMSS. If a small FPDU is sent in a single TCP segment the result is 1119 a "short" TCP segment. 1121 Applications expected to see strong advantages from Direct Data 1122 Placement include transaction-based applications and throughput 1123 applications. Request/response protocols typically send one FPDU per 1124 TCP segment and then wait for a response. Under these conditions, 1125 these "short" TCP segments are an appropriate and expected effect of 1126 the segmentation. 1128 Another possibility is that the application might be sending multiple 1129 messages (FPDUs) to the same endpoint before waiting for a response. 
1131 In this case, the segmentation policy would tend to reduce the 1132 available connection bandwidth by under-filling the TCP segments. 1134 TCP implementations often utilize the Nagle [RFC0896] algorithm to 1135 ensure that segments are filled to the EMSS whenever the round trip 1136 latency is large enough that the source stream can fully fill 1137 segments before ACKs arrive. The algorithm does this by delaying the 1138 transmission of TCP segments until a ULP can fill a segment, or until 1139 an ACK arrives from the far side. The algorithm thus allows for 1140 smaller segments when latencies are shorter to keep the ULP's end-to- 1141 end latency to reasonable levels. 1143 Use of the Nagle algorithm is not mandatory [RFC1122]. 1145 If Nagle or other algorithms for detecting the availability of 1146 multiple FPDUs for transmission are used, "packing" of multiple FPDUs 1147 into TCP segments can occur. 1149 If a "message tail", small DDP messages, or the start of a larger DDP 1150 message are available, MPA MAY pack multiple FPDUs into TCP segments. 1151 When this is done, the TCP segments can be more fully utilized, but, 1152 due to the size constraints of FPDUs, segments may not be filled to 1153 the EMSS. 1155 Note that MPA receivers must do more processing of a TCP segment 1156 that contains multiple FPDUs; this may affect the performance of 1157 some receiver implementations. 1159 It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note 1160 that many of the applications expected to take advantage of MPA/DDP 1161 prefer to avoid the extra delays caused by Nagle. In such scenarios 1162 it is anticipated there will be minimal opportunity for packing at 1163 the transmitter, and receivers may choose to optimize their 1164 performance for this anticipated behavior. 1166 Therefore, the application is expected to set TCP parameters such 1167 that it can trade off latency and wire efficiency.
When latency is critical, this is 1168 accomplished by setting the TCP_NODELAY socket option (which disables 1169 Nagle). 1171 When latency is not critical, the application is expected to leave Nagle 1172 enabled. In this case the TCP implementation may pack any available 1173 stream data into TCP segments so that the segments are filled to the 1174 EMSS. If the amount of data available is not enough to fill the TCP 1175 segment when it is prepared for transmission, TCP can send the 1176 segment partly filled, or use the Nagle algorithm to wait for the ULP 1177 to post more data (discussed below). 1179 5.3.2 FPDU Size Considerations 1181 MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as 1182 the size of the largest ULPDU fitting in an FPDU. For an empty TCP 1183 Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus 1184 space for Markers and pad octets. 1186 The maximum ULPDU Length for a single ULPDU when Markers are 1187 present MUST be computed as: 1189 MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4) 1191 The formula above accounts for the worst-case number of Markers. 1193 The maximum ULPDU Length for a single ULPDU when Markers are NOT 1194 present MUST be computed as: 1196 MULPDU = EMSS - (6 + EMSS mod 4) 1198 As a further optimization of the wire efficiency, an MPA 1199 implementation MAY dynamically adjust the MULPDU (see section 5.3.1 1200 for latency and wire efficiency trade-offs). When one or more FPDUs 1201 are already packed into a TCP Segment, MULPDU MAY be reduced 1202 accordingly. 1204 DDP SHOULD provide ULPDUs that are as large as possible, but less 1205 than or equal to MULPDU. 1207 If the TCP implementation needs to adjust EMSS to support MTU 1208 changes, the MULPDU value is changed accordingly. 1210 In certain rare situations, the EMSS may shrink below 128 octets in 1211 size.
If this occurs, the MPA on TCP sender MUST NOT shrink the 1212 MULPDU below 128 octets and is not REQUIRED to follow the 1213 segmentation rules in Section 5.3 MPA on TCP Sender Segmentation on 1214 page 26. 1216 If one or more FPDUs are already packed into a TCP segment, such that 1217 the remaining room is less than 128 octets, MPA MUST NOT provide a 1218 MULPDU smaller than 128. In this case, MPA would typically provide a 1219 MULPDU for the next full sized segment, but may still pack the next 1220 FPDU into the small remaining room, provided that the next FPDU is 1221 small enough to fit. 1223 The value 128 is chosen so as to allow DDP designers room for the DDP 1224 Header and some user data. 1226 5.4 MPA Receiver FPDU Identification 1228 An MPA receiver MUST first verify the FPDU before passing the ULPDU 1229 to DDP. To do this, the receiver MUST: 1231 * locate the start of the FPDU unambiguously, 1233 * verify its CRC (if CRC checking is enabled). 1235 If the above conditions are true, the MPA receiver passes the ULPDU 1236 to DDP. 1238 To detect the start of the FPDU unambiguously, one of the following 1239 MUST be used: 1241 1: In an ordered TCP stream, the ULPDU Length field in the current 1242 FPDU, when the FPDU has a valid CRC, can be used to identify the 1243 beginning of the next FPDU. 1245 2: For receivers that support out of order reception of FPDUs (see 1246 section 5.1 MPA Markers on page 20), a Marker can always be used 1247 to locate the beginning of an FPDU (in FPDUs with valid CRCs). 1248 Since the location of the Marker is known in the octet stream 1249 (sequence number space), the Marker can always be found. 1251 3: Having found an FPDU by means of a Marker, following contiguous 1252 FPDUs can be found by using the ULPDU Length fields (from FPDUs 1253 with valid CRCs) to establish the next FPDU boundary. 1255 The ULPDU Length field (see section 4) MUST be used to determine if 1256 the entire FPDU is present before forwarding the ULPDU to DDP.
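Method 1 above (walking an in-order stream by ULPDU Length) can be sketched as follows. This is a minimal sketch under simplifying assumptions: Markers are not in use, the buffer begins at an FPDU boundary, and CRC validation is delegated to an optional caller-supplied function. The names `split_fpdus` and `check_crc` are hypothetical.

```python
import struct


def split_fpdus(stream, check_crc=None):
    """Extract complete ULPDUs from an in-order, Marker-free octet stream.

    Returns (ulpdus, consumed): the ULPDUs found and the number of octets
    consumed. Octets after `consumed` belong to an FPDU that is not yet
    complete and should be kept until more data arrives.
    """
    ulpdus = []
    offset = 0
    while offset + 2 <= len(stream):
        (ulpdu_len,) = struct.unpack_from(">H", stream, offset)
        pad = -(2 + ulpdu_len) % 4
        total = 2 + ulpdu_len + pad + 4      # header + ULPDU + PAD + CRC
        if offset + total > len(stream):
            break                            # entire FPDU not yet present
        fpdu = stream[offset:offset + total]
        if check_crc is not None and not check_crc(fpdu):
            raise ValueError("invalid FPDU: CRC check failed")
        ulpdus.append(fpdu[2:2 + ulpdu_len])
        offset += total                      # next FPDU starts here
    return ulpdus, offset
```

Because the next boundary is derived from the current ULPDU Length, a corrupted length would mislead every later step; this is why the text only trusts lengths from FPDUs with valid CRCs.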
1258 CRC calculation is discussed in section 5.2 on page 23 above. 1260 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders 1262 Since MPA on MPA-aware TCP senders start FPDUs on TCP segment 1263 boundaries, a receiving DDP on MPA on TCP implementation may be able 1264 to optimize the reception of data in various ways. 1266 However, MPA receivers MUST NOT depend on FPDU Alignment on TCP 1267 segment boundaries. 1269 Some MPA senders may be unable to conform to the sender requirements 1270 because their implementation of TCP is not designed with MPA in mind. 1271 Even if the sender is MPA-aware, the network may contain "middle 1272 boxes" which modify the TCP stream by changing the segmentation. 1273 This is generally interoperable with TCP and its users and MPA must 1274 be no exception. 1276 The presence of Markers in MPA (when enabled) allows an MPA receiver 1277 to recover the FPDUs despite these obstacles, although it may be 1278 necessary to utilize additional buffering at the receiver to do so. 1280 Some of the cases that a receiver may have to contend with are listed 1281 below as a reminder to the implementer: 1283 * A single Aligned and complete FPDU, either in order, or out of 1284 order: This can be passed to DDP as soon as validated, and 1285 Delivered when ordering is established. 1287 * Multiple FPDUs in a TCP segment, aligned and fully contained, 1288 either in order, or out of order: These can be passed to DDP as 1289 soon as validated, and Delivered when ordering is established. 1291 * Incomplete FPDU: The receiver should buffer until the remainder 1292 of the FPDU arrives. If the remainder of the FPDU is already 1293 available, this can be passed to DDP as soon as validated, and 1294 Delivered when ordering is established. 1296 * Unaligned FPDU start: The partial FPDU must be combined with its 1297 preceding portion(s). 
If the preceding parts are already 1298 available, and the whole FPDU is present, this can be passed to 1299 DDP as soon as validated, and Delivered when ordering is 1300 established. If the whole FPDU is not available, the receiver 1301 should buffer until the remainder of the FPDU arrives. 1303 * Combinations of Unaligned or incomplete FPDUs (and potentially 1304 other complete FPDUs) in the same TCP segment: If any FPDU is 1305 present in its entirety, or can be completed with portions 1306 already available, it can be passed to DDP as soon as validated, 1307 and Delivered when ordering is established. 1309 6 Connection Semantics 1311 6.1 Connection setup 1313 MPA requires that the Consumer MUST activate MPA, and any TCP 1314 enhancements for MPA, on a TCP half connection at the same location 1315 in the octet stream at both the sender and the receiver. This is 1316 required in order for the Marker scheme to correctly locate the 1317 Markers (if enabled) and to correctly locate the first FPDU. 1319 MPA, and any TCP enhancements for MPA are enabled by the ULP in both 1320 directions at once at an endpoint. 1322 This can be accomplished several ways, and is left up to DDP's ULP: 1324 * DDP's ULP MAY require DDP on MPA startup immediately after TCP 1325 connection setup. This has the advantage that no streaming mode 1326 negotiation is needed. An example of such a protocol is shown in 1327 Figure 9: Example Immediate Startup negotiation on page 42. 1329 This may be accomplished by using a well-known port, or a service 1330 locator protocol to locate an appropriate port on which DDP on 1331 MPA is expected to operate. 1333 * DDP's ULP MAY negotiate the start of DDP on MPA sometime after a 1334 normal TCP startup, using TCP streaming data exchanges on the 1335 same connection. The exchange establishes that DDP on MPA (as 1336 well as other ULPs) will be used, and exactly locates the point 1337 in the octet stream where MPA is to begin operation. 
Note that 1338 such a negotiation protocol is outside the scope of this 1339 specification. A simplified example of such a protocol is shown 1340 in Figure 8: Example Delayed Startup negotiation on page 39. 1342 An MPA endpoint operates in two distinct phases. 1344 The Startup Phase is used to verify correct MPA setup, exchange CRC 1345 and Marker configuration, and optionally pass Private Data between 1346 endpoints prior to completing a DDP connection. During this phase, 1347 specifically formatted frames are exchanged as TCP byte streams 1348 without using CRCs or Markers. During this phase a DDP endpoint need 1349 not be "bound" to the MPA connection. In fact, the choice of DDP 1350 endpoint and its operating parameters may not be known until the 1351 Consumer supplied Private Data (if any) has been examined by the 1352 Consumer. 1354 The second distinct phase is Full Operation during which FPDUs are 1355 sent using all the rules that pertain (CRCs, Markers, MULPDU 1356 restrictions etc.). A DDP endpoint MUST be "bound" to the MPA 1357 connection at entry to this phase. 1359 When Private Data is passed between ULPs in the Startup Phase, the 1360 ULP is responsible for interpreting that data, and then placing MPA 1361 into Full Operation. 1363 Note: The following text differentiates the two endpoints by calling 1364 them Initiator and Responder. This is quite arbitrary and is NOT 1365 related to the TCP startup (SYN, SYN/ACK sequence). The 1366 Initiator is the side that sends first in the MPA startup 1367 sequence (the MPA Request Frame). 1369 Note: The possibility that both endpoints would be allowed to make a 1370 connection at the same time, sometimes called an active/active 1371 connection, was considered by the work group and rejected. There 1372 were several motivations for this decision. One was that 1373 applications needing this facility were few (none other than 1374 theoretical at the time of this draft). 
Another was that the 1375 facility created some implementation difficulties, particularly 1376 with the "dual stack" designs described later on. A last issue 1377 was that dealing with rejected connections at startup would have 1378 required at least an additional frame type, and more recovery 1379 actions, complicating the protocol. While none of these issues 1380 was overwhelming, the group and implementers were not motivated 1381 to do the work to resolve these issues. The protocol includes a 1382 method of detecting these active/active startup attempts so that 1383 they can be rejected and an error reported. 1385 The ULP is responsible for determining which side is Initiator or 1386 Responder. For client/server type ULPs this is easy. For peer-peer 1387 ULPs (which might utilize a TCP style active/active startup), some 1388 mechanism (not defined by this specification) must be established, or 1389 some streaming mode data exchanged prior to MPA startup to determine 1390 the side which starts in Initiator and which starts in Responder MPA 1391 mode. 1393 6.1.1 MPA Request and Reply Frame Format 1395 0 1 2 3 1396 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1397 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1398 0 | | 1399 + Key (16 bytes containing "MPA ID Req Frame") + 1400 4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) | 1401 + Or (16 bytes containing "MPA ID Rep Frame") + 1402 8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) | 1403 + + 1404 12 | | 1405 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1406 16 |M|C|R| Res | Rev | PD_Length | 1407 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1408 | | 1409 ~ ~ 1410 ~ Private Data ~ 1411 | | 1412 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1413 | | 1414 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1415 Figure 7 MPA Request/Reply Frame 1417 Key: This field contains the "key" used to validate that the sender 1418 is an MPA sender. 
Initiator mode senders MUST set this field to 1419 the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20 1420 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder 1421 mode receivers MUST check this field for the same value, and 1422 close the connection and report an error locally if any other 1423 value is detected. Responder mode senders MUST set this field to 1424 the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20 1425 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator 1426 mode receivers MUST check this field for the same value, and 1427 close the connection and report an error locally if any other 1428 value is detected. 1430 M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame, 1431 declares a receiver's requirement for Markers. When the value in a 1432 received MPA Request Frame or MPA Reply Frame is 1433 '0', Markers MUST NOT be added to the data stream by the sender. 1434 When '1', Markers MUST be added as described in section 5.1 MPA 1435 Markers on page 20. 1437 C: This bit declares an endpoint's preferred CRC usage. When this 1438 bit is '0' in both the MPA Request Frame and the MPA Reply Frame, 1439 CRCs MUST NOT be checked and need not be generated by either 1440 endpoint. When this bit is '1' in either the MPA Request Frame 1441 or MPA Reply Frame, CRCs MUST be generated and checked by both 1442 endpoints. Note that even when not in use, the CRC field remains 1443 present in the FPDU. When CRCs are not in use, the CRC field 1444 MUST be considered valid for FPDU checking regardless of its 1445 contents. 1447 R: This bit is set to zero, and not checked on reception in the MPA 1448 Request Frame. In the MPA Reply Frame, this bit is the Rejected 1449 Connection bit, set by the Responder's ULP to indicate acceptance 1450 '0', or rejection '1', of the connection parameters provided in 1451 the Private Data. 1453 Res: This field is reserved for future use.
It MUST be set to zero 1454 when sending, and is not checked on reception. 1456 Rev: This field contains the Revision of MPA. For this version of 1457 the specification, senders MUST set this field to one. MPA 1458 receivers compliant with this version of the specification MUST 1459 check this field. If the MPA receiver cannot interoperate with 1460 the received version, then it MUST close the connection and 1461 report an error locally. Otherwise, the MPA receiver should 1462 report the received version to the ULP. 1464 PD_Length: This field MUST contain the length in Octets of the 1465 Private Data field. A value of zero indicates that there is no 1466 Private Data field present at all. If the receiver detects that 1467 the PD_Length field does not match the length of the Private Data 1468 field, or if the length of the Private Data field exceeds 512 1469 octets, the receiver MUST close the connection and report an 1470 error locally. Otherwise, the MPA receiver should pass the 1471 PD_Length value and Private Data to the ULP. 1473 Private Data: This field may contain any value defined by ULPs or may 1474 not be present. The Private Data field MUST be between 0 and 512 1475 octets in length. ULPs define how to size, set, and validate 1476 this field within these limits. 1478 6.1.2 Connection Startup Rules 1480 The following rules apply to the MPA connection Startup Phase: 1482 1. When MPA is started in the Initiator mode, the MPA implementation 1483 MUST send a valid MPA Request Frame. The MPA Request Frame MAY 1484 include ULP supplied Private Data. 1486 2. When MPA is started in the Responder mode, the MPA implementation 1487 MUST wait until an MPA Request Frame is received and validated 1488 before entering full MPA/DDP operation. 1490 If the MPA Request Frame is improperly formatted, the 1491 implementation MUST close the TCP connection and exit MPA.
1493 If the MPA Request Frame is properly formatted but the Private 1494 Data is not acceptable, the implementation SHOULD return an MPA 1495 Reply Frame with the Rejected Connection bit set to '1'; the MPA 1496 Reply Frame MAY include ULP supplied Private Data; the 1497 implementation MUST exit MPA, leaving the TCP connection open. 1498 The ULP may close TCP or use the connection for other purposes. 1500 If the MPA Request Frame is properly formatted and the Private 1501 Data is acceptable, the implementation SHOULD return an MPA Reply 1502 Frame with the Rejected Connection bit set to '0'; the MPA Reply 1503 Frame MAY include ULP supplied Private Data; and the Responder 1504 SHOULD prepare to interpret any data received as FPDUs and pass 1505 any received ULPDUs to DDP. 1507 Note: Since the receiver's ability to deal with Markers is 1508 unknown until the Request and Reply Frames have been 1509 received, sending FPDUs before this occurs is not possible. 1511 Note: The requirement to wait on a Request Frame before sending a 1512 Reply Frame is a design choice; it makes for a well-ordered 1513 sequence of events at each end and avoids having to specify 1514 how to deal with situations where both ends start at the same 1515 time. 1517 3. MPA Initiator mode implementations MUST receive and validate an 1518 MPA Reply Frame. 1520 If the MPA Reply Frame is improperly formatted, the 1521 implementation MUST close the TCP connection and exit MPA. 1523 If the MPA Reply Frame is properly formatted but the Private 1524 Data is not acceptable, or if the Rejected Connection bit is set to 1525 '1', the implementation MUST exit MPA, leaving the TCP connection 1526 open. The ULP may close TCP or use the connection for other 1527 purposes.
1529 If the MPA Reply Frame is properly formatted and the Private Data 1530 is acceptable, and the Rejected Connection bit is set to '0', the 1531 implementation SHOULD enter full MPA/DDP operation mode, 1532 interpreting any received data as FPDUs and sending DDP ULPDUs as 1533 FPDUs. 1535 4. MPA Responder mode implementations MUST receive and validate at 1536 least one FPDU before sending any FPDUs or Markers. 1538 Note: this requirement is present to allow the Initiator time to 1539 get its receiver into Full Operation before an FPDU arrives, 1540 avoiding potential race conditions at the Initiator. This 1541 was also subject to some debate in the work group before 1542 rough consensus was reached. Eliminating this requirement 1543 would allow faster startup in some types of applications. 1544 However, that would also make certain implementations 1545 (particularly "dual stack") much harder. 1547 5. If a received "Key" does not match the expected value (see 6.1.1 1548 MPA Request and Reply Frame Format above), the TCP/DDP connection 1549 MUST be closed, and an error returned to the ULP. 1551 6. The received Private Data fields may be used by Consumers at 1552 either end to further validate the connection, and set up DDP or 1553 other ULP parameters. The Initiator ULP MAY close the 1554 TCP/MPA/DDP connection as a result of validating the Private Data 1555 fields. The Responder SHOULD return an MPA Reply Frame with the 1556 "Rejected Connection" bit set to '1' if the 1557 Private Data is not acceptable to the ULP. 1559 7. When the first FPDU is to be sent, then if Markers are enabled, 1560 the first octets sent are the special Marker 0x00000000, followed 1561 by the start of the FPDU (the FPDU's ULPDU Length field). If 1562 Markers are not enabled, the first octets sent are the start of 1563 the FPDU (the FPDU's ULPDU Length field). 1565 8.
MPA implementations MUST use the difference between the MPA 1566 Request Frame and the MPA Reply Frame (their Key values differ; see 6.1.1) to check for incorrect 1567 "Initiator/Initiator" startups. Implementations SHOULD put a 1568 timeout on waiting for the MPA Request Frame when started in 1569 Responder mode, to detect incorrect "Responder/Responder" 1570 startups. 1572 9. MPA implementations MUST validate the PD_Length field. The 1573 buffer that receives the Private Data field MUST be large enough 1574 to receive that data; the amount of Private Data MUST NOT exceed 1575 the PD_Length or the size of the application buffer. If any of the above 1576 fails, the startup frame MUST be considered improperly formatted. 1578 10. MPA implementations SHOULD implement a reasonable timeout while 1579 waiting for complete startup frames; this prevents certain 1580 denial of service attacks. ULPs SHOULD implement a reasonable 1581 timeout while waiting for FPDUs, ULPDUs and application level 1582 messages to guard against application failures and certain denial 1583 of service attacks. 1585 6.1.3 Example Delayed Startup sequence 1587 A variety of startup sequences are possible when using MPA on TCP. 1588 Following is an example of an MPA/DDP startup that occurs after TCP 1589 has been running for a while and has exchanged some amount of 1590 streaming data. This example does not use any Private Data (an 1591 example that does is shown later in 6.1.4.2 Example Immediate Startup 1592 using Private Data on page 42), although it is perfectly legal to 1593 include Private Data. Note that since the example does not use 1594 any Private Data, there are no ULP interactions shown between 1595 receiving "Startup frames" and putting MPA into Full Operation.
1597 Initiator Responder 1599 +---------------------------+ 1600 |ULP streaming mode | 1601 | request to | 1602 | transition to DDP/MPA | +--------------------------+ 1603 | mode (optional) | --------> |ULP gets request; | 1604 +---------------------------+ |enables MPA Responder mode| 1605 |with last (optional) | 1606 |streaming mode | 1607 |for MPA to send. | 1608 +---------------------------+ |MPA waits for incoming | 1609 |ULP receives streaming | <-------- | | 1610 | ; | +--------------------------+ 1611 |Enters MPA Initiator mode; | 1612 |MPA sends | 1613 | ; | 1614 |MPA waits for incoming | +--------------------------+ 1615 | |MPA receives | 1616 +---------------------------+ | | 1617 |Consumer binds DDP to MPA,| 1618 |MPA sends the | 1619 | . | 1620 |DDP/MPA enables FPDU | 1621 +---------------------------+ |decoding, but does not | 1622 |MPA receives the | < - - - - |send any FPDUs. | 1623 | | +--------------------------+ 1624 |Consumer binds DDP to MPA, | 1625 |DDP/MPA begins full | 1626 |operation. | 1627 |MPA sends first FPDU (as | +--------------------------+ 1628 |DDP ULPDUs become | ========> |MPA Receives first FPDU. | 1629 |available). | |MPA sends first FPDU (as | 1630 +---------------------------+ |DDP ULPDUs become | 1631 <====== |available. | 1632 +--------------------------+ 1633 Figure 8: Example Delayed Startup negotiation 1635 An example Delayed Startup sequence is described below: 1637 * Active and passive sides start up a TCP connection in the 1638 usual fashion, probably using sockets APIs. They exchange 1639 some amount of streaming mode data. At some point one side 1640 (the MPA Initiator) sends streaming mode data that 1641 effectively says "Hello, Lets go into MPA/DDP mode." 1643 * When the remote side (the MPA Responder) gets this streaming mode 1644 message, the Consumer would send a last streaming mode message 1645 that effectively says "I Acknowledge your Hello, and am now in 1646 MPA Responder Mode". 
The exchange of these messages establishes 1647 the exact point in the TCP stream where MPA is enabled. The 1648 Responding Consumer enables MPA in the Responder mode and waits 1649 for the initial MPA startup message. 1651 * The Initiating Consumer would enable MPA startup in the 1652 Initiator mode, which then sends the MPA Request Frame. It is 1653 assumed that no Private Data messages are needed for this 1654 example, although it is possible to include them. The Initiating 1655 MPA (and Consumer) would also wait for the MPA connection to 1656 be accepted. 1658 * The Responding MPA would receive the initial MPA Request Frame 1659 and would inform the Consumer that this message arrived. The 1660 Consumer can then accept the MPA/DDP connection or close the TCP 1661 connection. 1663 * To accept the connection request, the Responding Consumer would 1664 use an appropriate API to bind the TCP/MPA connections to a DDP 1665 endpoint, thus placing MPA/DDP into Full Operation. In the 1666 process of going to Full Operation, MPA sends the MPA Reply 1667 Frame. MPA/DDP waits for the first incoming FPDU before sending 1668 any FPDUs. 1670 * If the initial TCP data was not a properly formatted MPA Request 1671 Frame, MPA will close or reset the TCP connection immediately. 1673 * The Initiating MPA would receive the MPA Reply Frame and 1674 would report this message to the Consumer. The Consumer can 1675 then accept the MPA/DDP connection, or close or reset the TCP 1676 connection to abort the process. 1678 * On determining that the Connection is acceptable, the 1679 Initiating Consumer would use an appropriate API to bind the 1680 TCP/MPA connections to a DDP endpoint, thus placing MPA/DDP 1681 into Full Operation. MPA/DDP would begin sending DDP 1682 messages as MPA FPDUs. 1684 6.1.4 Use of Private Data 1686 This section is advisory in nature, in that it suggests a method by which 1687 a ULP can handle pre-DDP connection information exchange.
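As an informal illustration of how Private Data is carried in the startup frames (see 6.1.1 MPA Request and Reply Frame Format, and rules 5 and 9 of 6.1.2), the following Python sketch builds and validates an MPA Request or Reply Frame. It is not part of this specification. The names build_frame, parse_frame and BadStartupFrame are hypothetical. The sketch assumes that PD_Length is carried in network byte order, that the caller hands parse_frame exactly one complete frame, and that the receiver interoperates only with Revision one.

```python
import struct

REQ_KEY = b"MPA ID Req Frame"  # 16-octet Key for the MPA Request Frame
REP_KEY = b"MPA ID Rep Frame"  # 16-octet Key for the MPA Reply Frame
MPA_REVISION = 1               # value of the Rev field in this version
MAX_PRIVATE_DATA = 512         # upper bound on the Private Data field
HEADER_LEN = 20                # Key (16) + flags (1) + Rev (1) + PD_Length (2)


class BadStartupFrame(Exception):
    """Hypothetical error: the frame is improperly formatted."""


def build_frame(key, private_data=b"", markers=False, crc=False, rejected=False):
    """Encode a startup frame; the five Res bits are always sent as zero."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data is limited to 512 octets")
    # M, C and R occupy the three most significant bits of the flags octet.
    flags = (markers << 7) | (crc << 6) | (rejected << 5)
    header = struct.pack("!BBH", flags, MPA_REVISION, len(private_data))
    return key + header + private_data


def parse_frame(data, expected_key):
    """Validate one complete startup frame and return its fields."""
    if len(data) < HEADER_LEN:
        raise BadStartupFrame("frame shorter than the fixed header")
    if data[:16] != expected_key:
        raise BadStartupFrame("unexpected Key")
    flags, rev, pd_length = struct.unpack("!BBH", data[16:HEADER_LEN])
    if rev != MPA_REVISION:  # assumption: only Revision one is supported
        raise BadStartupFrame("unsupported Rev")
    private_data = data[HEADER_LEN:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise BadStartupFrame("PD_Length does not match the Private Data")
    return {
        "markers": bool(flags & 0x80),   # M bit
        "crc": bool(flags & 0x40),       # C bit
        "rejected": bool(flags & 0x20),  # R bit; Res bits are not checked
        "rev": rev,
        "private_data": private_data,
    }
```

In this sketch a Responder would call parse_frame(data, REQ_KEY) and an Initiator would call parse_frame(data, REP_KEY). An Initiator that receives a Request Frame therefore fails the Key check, which is one way the "Initiator/Initiator" detection of rule 8 could be realized.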
1689 6.1.4.1 Motivation 1691 Prior RDMA protocols have been developed that provide Private Data 1692 via out of band mechanisms. As a result, many applications now 1693 expect some form of Private Data to be available for application use 1694 prior to setting up the DDP/RDMA connection. Following are some 1695 examples of the use of Private Data. 1697 An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand 1698 and the [VERBS]) must be associated with a Protection Domain. No 1699 receive operations may be posted to the endpoint before it is 1700 associated with a Protection Domain. Indeed under both the 1701 InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is 1702 created within a Protection Domain. 1704 There are some applications where the choice of Protection Domain is 1705 dependent upon the identity of the remote ULP client. For example, 1706 if a user session requires multiple connections, it is highly 1707 desirable for all of those connections to use a single Protection 1708 Domain. Note: use of Protection Domains is further discussed in 1709 [RDMASEC]. 1711 InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for 1712 the active side ULP to provide Private Data when requesting a 1713 connection. This data is passed to the ULP to allow it to determine 1714 whether to accept the connection, and if so with which endpoint (and 1715 implicitly which Protection Domain). 1717 The Private Data can also be used to ensure that both ends of the 1718 connection have configured their RDMA endpoints compatibly on such 1719 matters as the RDMA Read capacity (see [RDMAP]). Further ULP- 1720 specific uses are also presumed, such as establishing the identity of 1721 the client. 1723 Private Data is also allowed for when accepting the connection, to 1724 allow completion of any negotiation on RDMA resources and for other 1725 ULP reasons. 1727 There are several potential ways to exchange this Private Data. 
For 1728 example, the InfiniBand specification includes a connection 1729 management protocol that allows a small amount of Private Data to be 1730 exchanged using datagrams before actually starting the RDMA 1731 connection. 1733 This draft allows for small amounts of Private Data to be exchanged 1734 as part of the MPA startup sequence. The actual Private Data fields 1735 are carried in the MPA Request Frame, and the MPA Reply Frame. 1737 If larger amounts of Private Data or more negotiation is necessary, 1738 TCP streaming mode messages may be exchanged prior to enabling MPA. 1740 6.1.4.2 Example Immediate Startup using Private Data 1742 Initiator Responder 1744 +---------------------------+ 1745 |TCP SYN sent | +--------------------------+ 1746 +---------------------------+ --------> |TCP gets SYN packet; | 1747 +---------------------------+ | Sends SYN-Ack | 1748 |TCP gets SYN-Ack | <-------- +--------------------------+ 1749 | Sends Ack | 1750 +---------------------------+ --------> +--------------------------+ 1751 +---------------------------+ |Consumer enables MPA | 1752 |Consumer enables MPA | |Responder Mode, waits for | 1753 |Initiator mode with | | | 1754 |Private Data; MPA sends | +--------------------------+ 1755 | ; | 1756 |MPA waits for incoming | +--------------------------+ 1757 | |MPA receives | 1758 +---------------------------+ | | 1759 |Consumer examines Private | 1760 |Data, provides MPA with | 1761 |return Private Data, | 1762 |binds DDP to MPA, and | 1763 |enables MPA to send an | 1764 | . | 1765 |DDP/MPA enables FPDU | 1766 +---------------------------+ |decoding, but does not | 1767 |MPA receives the | < - - - - |send any FPDUs. | 1768 | | +--------------------------+ 1769 |Consumer examines Private | 1770 |Data, binds DDP to MPA, | 1771 |and enables DDP/MPA to | 1772 |begin Full Operation. | 1773 |MPA sends first FPDU (as | +--------------------------+ 1774 |DDP ULPDUs become | ========> |MPA Receives first FPDU. | 1775 |available). 
| |MPA sends first FPDU (as | 1776 +---------------------------+ |DDP ULPDUs become | 1777 <====== |available. | 1778 +--------------------------+ 1779 Figure 9: Example Immediate Startup negotiation 1781 Note: the exact order of when MPA is started in the TCP connection 1782 sequence is implementation dependent; the above diagram shows one 1783 possible sequence. Also, the Initiator "Ack" to the Responder's 1784 "SYN-Ack" may be combined into the same TCP segment containing 1785 the MPA Request Frame (as is allowed by TCP RFCs). 1787 The example immediate startup sequence is described below: 1789 * The passive side (Responding Consumer) would listen on the TCP 1790 destination port, to indicate its readiness to accept a 1791 connection. 1793 * The active side (Initiating Consumer) would request a 1794 connection from a TCP endpoint (that expected to upgrade to 1795 MPA/DDP/RDMA and expected the Private Data) to a destination 1796 address and port. 1798 * The Initiating Consumer would initiate a TCP connection to 1799 the destination port. Acceptance/rejection of the connection 1800 would proceed as per normal TCP connection establishment. 1802 * The passive side (Responding Consumer) would receive the TCP 1803 connection request as usual allowing normal TCP gatekeepers, such 1804 as INETD and TCPserver, to exercise their normal 1805 safeguard/logging functions. On acceptance of the TCP 1806 connection, the Responding Consumer would enable MPA in the 1807 Responder mode and wait for the initial MPA startup message. 1809 * The Initiating Consumer would enable MPA startup in the 1810 Initiator mode to send an initial MPA Request Frame with its 1811 included Private Data message to send. The Initiating MPA 1812 (and Consumer) would also wait for the MPA connection to be 1813 accepted, and any returned Private Data. 
1815 * The Responding MPA would receive the initial MPA Request Frame 1816 with the Private Data message and would pass the Private Data 1817 through to the Consumer. The Consumer can then accept the 1818 MPA/DDP connection, close the TCP connection, or reject the MPA 1819 connection with a return message. 1821 * To accept the connection request, the Responding Consumer would 1822 use an appropriate API to bind the TCP/MPA connections to a DDP 1823 endpoint, thus enabling MPA/DDP into Full Operation. In the 1824 process of going to Full Operation, MPA sends the MPA Reply Frame 1825 which includes the Consumer supplied Private Data containing any 1826 appropriate Consumer response. MPA/DDP waits for the first 1827 incoming FPDU before sending any FPDUs. 1829 * If the initial TCP data was not a properly formatted MPA Request 1830 Frame, MPA will close or reset the TCP connection immediately. 1832 * To reject the MPA connection request, the Responding Consumer 1833 would send an MPA Reply Frame with any ULP supplied Private Data 1834 (with reason for rejection), with the "Rejected Connection" bit 1835 set to '1', and may close the TCP connection. 1837 * The Initiating MPA would receive the MPA Reply Frame with the 1838 Private Data message and would report this message to the 1839 Consumer, including the supplied Private Data. 1841 If the "rejected Connection" bit is set to a '1', MPA will 1842 close the TCP connection and exit. 1844 If the "Rejected Connection" bit is set to a '0', and on 1845 determining from the MPA Reply Frame Private Data that the 1846 Connection is acceptable, the Initiating Consumer would use 1847 an appropriate API to bind the TCP/MPA connections to a DDP 1848 endpoint thus enabling MPA/DDP into Full Operation. MPA/DDP 1849 would begin sending DDP messages as MPA FPDUs. 1851 6.1.5 "Dual stack" implementations 1853 MPA/DDP implementations are commonly expected to be implemented as 1854 part of a "dual stack" architecture. 
One "stack" is the traditional 1855 TCP stack, usually with a sockets interface API (Application 1856 Programming Interface). The second stack is the MPA/DDP "stack" with 1857 its own API, and potentially separate code or hardware to deal with 1858 the MPA/DDP data. Of course, implementations may vary, so the 1859 following comments are of an advisory nature only. 1861 The use of the two "stacks" offers advantages: 1863 TCP connection setup is usually done with the TCP stack. This 1864 allows use of the usual naming and addressing mechanisms. It 1865 also means that any mechanisms used to "harden" the connection 1866 setup against security threats are also used when starting 1867 MPA/DDP. 1869 Some applications may have been originally designed for TCP, but 1870 are "enhanced" to utilize MPA/DDP after a negotiation reveals 1871 the capability to do so. The negotiation process takes place in 1872 TCP's streaming mode, using the usual TCP APIs. 1874 Some new applications, designed for RDMA or DDP, still need to 1875 exchange some data prior to starting MPA/DDP. This exchange can 1876 be of arbitrary length or complexity, but often consists of only 1877 a small amount of Private Data, perhaps only a single message. 1878 Using the TCP streaming mode for this exchange allows this to be 1879 done using well understood methods. 1881 The main disadvantage of using two stacks is the conversion of an 1882 active TCP connection between them. This process must be done with 1883 care to prevent loss of data. 1885 To avoid some of the problems when using a "dual stack" architecture 1886 the following additional restrictions may be required by the 1887 implementation: 1889 1. Enabling the DDP/MPA stack SHOULD be done only when no incoming 1890 stream data is expected. This is typically managed by the ULP 1891 protocol. 
When following the recommended startup sequence, the 1892 Responder side enters DDP/MPA mode, sends the last streaming mode 1893 data, and then waits for the MPA Request Frame. No additional 1894 streaming mode data is expected. The Initiator side ULP receives 1895 the last streaming mode data, and then enters DDP/MPA mode. 1896 Again, no additional streaming mode data is expected. 1898 2. The DDP/MPA MAY provide the ability to send a "last streaming 1899 message" as part of its Responder DDP/MPA enable function. This 1900 allows the DDP/MPA stack to more easily manage the conversion to 1901 DDP/MPA mode (and avoid problems with a very fast return of the 1902 MPA Request Frame from the Initiator side). 1904 Note: Regardless of the "stack" architecture used, TCP's rules MUST 1905 be followed. For example, if network data is lost, re-segmented 1906 or re-ordered, TCP MUST recover appropriately even when this 1907 occurs while switching stacks. 1909 6.2 Normal Connection Teardown 1911 Each half connection of MPA terminates when DDP closes the 1912 corresponding TCP half connection. 1914 A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware 1915 that a graceful close of the LLP connection has been received by the 1916 LLP (e.g. FIN is received). 1918 7 Error Semantics 1920 The following errors MUST be detected by MPA and the codes SHOULD be 1921 provided to DDP or other Consumer: 1923 Code Error 1925 1 TCP connection closed, terminated or lost. This includes lost 1926 by timeout, too many retries, RST received or FIN received. 1928 2 Received MPA CRC does not match the calculated value for the 1929 FPDU. 1931 3 In the event that the CRC is valid, received MPA Marker (if 1932 enabled) and ULPDU Length fields do not agree on the start of 1933 a FPDU. If the FPDU start determined from previous ULPDU 1934 Length fields does not match with the MPA Marker position, MPA 1935 SHOULD deliver an error to DDP. 
It may not be possible to 1936 make this check as a segment arrives, but the check SHOULD be 1937 made when a gap creating an out of order sequence is closed 1938 and any time a Marker points to an already identified FPDU. 1939 It is OPTIONAL for a receiver to check each Marker when 1940 multiple Markers are present in an FPDU or when the segment is 1941 received in order. 1943 4 Invalid MPA Request Frame or MPA Reply Frame received. In 1944 this case, the TCP connection MUST be immediately closed. DDP 1945 and other ULPs should treat this similarly to code 1 above. 1947 When condition 2 or 3 above is detected, an MPA-aware TCP 1948 implementation MAY choose to silently drop the TCP segment rather 1949 than reporting the error to DDP. In this case, the sending TCP will 1950 retry the segment, usually correcting the error, unless the problem 1951 was at the source. In that case, the source will usually exceed the 1952 number of retries and terminate the connection. 1954 Once MPA delivers an error of any type, it MUST NOT pass or deliver 1955 any additional FPDUs on that half connection. 1957 For Error codes 2 and 3, MPA MUST NOT close the TCP connection 1958 following a reported error. Closing the connection is the 1959 responsibility of DDP's ULP. 1961 Note that since MPA will not Deliver any FPDUs on a half 1962 connection following an error detected on the receive side of 1963 that connection, DDP's ULP is expected to tear down the 1964 connection. This may not occur until after one or more last 1965 messages are transmitted on the opposite half connection. This 1966 allows a diagnostic error message to be sent. 1968 8 Security Considerations 1970 This section discusses the security considerations for MPA. 1972 8.1 Protocol-specific Security Considerations 1974 The vulnerabilities of MPA to third-party attacks are no greater than 1975 those of any other protocol running over TCP.
A third party, by sending 1976 packets into the network that are delivered to an MPA receiver, could 1977 launch a variety of attacks that take advantage of how MPA operates. 1978 For example, a third party could send random packets that are valid 1979 for TCP, but contain no FPDU headers. An MPA receiver reports an 1980 error to DDP when any packet arrives that cannot be validated as an 1981 FPDU when properly located on an FPDU boundary. A third party could 1982 also send packets that are valid for TCP, MPA, and DDP, but do not 1983 target valid buffers. These types of attacks ultimately result in 1984 loss of connection and thus become a type of DOS (Denial Of Service) 1985 attack. Communication security mechanisms such as IPsec [RFC2401] 1986 may be used to prevent such attacks. 1988 Independent of how MPA operates, a third party could use ICMP 1989 messages to reduce the path MTU to such a small size that performance 1990 would likewise be severely impacted. Range checking on path MTU 1991 sizes in ICMP packets may be used to prevent such attacks. 1993 [RDMAP] and [DDP] are used to control, read and write data buffers 1994 over IP networks. Therefore, the control and the data packets of 1995 these protocols are vulnerable to the spoofing, tampering and 1996 information disclosure attacks listed below. In addition, Connection 1997 to/from an unauthorized or unauthenticated endpoint is a potential 1998 problem with most applications using RDMA, DDP, and MPA. 2000 8.1.1 Spoofing 2002 Spoofing attacks can be launched by the Remote Peer, or by a network 2003 based attacker. A network based spoofing attack applies to all 2004 Remote Peers. Because the MPA Stream requires a TCP Stream in the 2005 ESTABLISHED state, certain types of traditional forms of wire attacks 2006 do not apply -- an end-to-end handshake must have occurred to 2007 establish the MPA Stream. 
So, the only form of spoofing that applies 2008 is one in which a remote node can both send and receive packets. Yet 2009 even with this limitation the Stream is still exposed to the 2010 following spoofing attacks. 2012 8.1.1.1 Impersonation 2014 A network based attacker can impersonate a legal MPA/DDP/RDMAP peer 2015 (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP 2016 Stream with the victim. End to end authentication (i.e. IPsec or ULP 2017 authentication) provides protection against this attack. 2019 8.1.1.2 Stream Hijacking 2021 Stream hijacking happens when a network based attacker follows the 2022 Stream establishment phase, and waits until the authentication phase 2023 (if such a phase exists) is completed successfully. The attacker can then 2024 spoof the IP address and re-direct the Stream from the victim to the attacker's 2025 own machine. For example, an attacker can wait until an iSCSI 2026 authentication is completed successfully, and hijack the iSCSI 2027 Stream. 2029 The best protection against this form of attack is end-to-end 2030 integrity protection and authentication, such as IPsec, to prevent 2031 spoofing. Another option is to provide physical security. 2032 Discussion of physical security is out of scope for this document. 2034 8.1.1.3 Man in the Middle Attack 2036 If a network based attacker has the ability to delete, inject, replay, 2037 or modify packets which will still be accepted by MPA (e.g., TCP 2038 sequence number is correct, FPDU is valid, etc.), then the Stream can 2039 be exposed to a man in the middle attack. The attacker could 2040 potentially use the services of [DDP] and [RDMAP] to read the 2041 contents of the associated data buffer, modify the contents of the 2042 associated data buffer, or disable further access to the buffer. 2043 The only countermeasure for this form of attack is to either secure 2044 the MPA/DDP/RDMAP Stream (i.e.
   integrity protect) or attempt to provide physical security to
   prevent man-in-the-middle type attacks.

   The best protection against this form of attack is end-to-end
   integrity protection and authentication, such as IPsec, to prevent
   spoofing or tampering. If Stream or session level authentication and
   integrity protection are not used, then a man-in-the-middle attack
   can occur, enabling spoofing and tampering.

   Another approach is to restrict access to only the local subnet/link,
   and provide some mechanism to limit access, such as physical security
   or 802.1X. This model is an extremely limited deployment scenario,
   and will not be further examined here.

8.1.2 Eavesdropping

   Generally speaking, Stream confidentiality protects against
   eavesdropping. Stream and/or session authentication and integrity
   protection is a countermeasure against various spoofing and
   tampering attacks. The effectiveness of authentication and integrity
   against a specific attack depends on whether the authentication is
   machine level authentication (such as that provided by IPsec), or
   ULP authentication.

8.2 Introduction to Security Options

   The following security services can be applied to an MPA/DDP/RDMAP
   Stream:

   1. Session confidentiality - protects against eavesdropping.

   2. Per-packet data source authentication - protects against the
      following spoofing attacks: network based impersonation, Stream
      hijacking, and man in the middle.

   3. Per-packet integrity - protects against tampering done by
      network based modification of FPDUs (indirectly affecting buffer
      content through DDP services).

   4. Packet sequencing - protects against replay attacks, which are
      a special case of the above tampering attack.

   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
   or Stream hijacking attacks, it is recommended that the Stream be
   authenticated, integrity protected, and protected from replay
   attacks; it may use confidentiality protection to protect from
   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
   network).

   IPsec is capable of providing the above security services for IP and
   TCP traffic.

   ULPs may be able to provide part of the above security services.
   See [NFSv4CHANNEL] for additional information on a promising
   approach called "channel binding". From [NFSv4CHANNEL]:

      "The concept of channel bindings allows applications to prove
      that the end-points of two secure channels at different network
      layers are the same by binding authentication at one channel to
      the session protection at the other channel. The use of channel
      bindings allows applications to delegate session protection to
      lower layers, which may significantly improve performance for
      some applications."

8.3 Using IPsec With MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above. Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change. IPsec packets are
   processed (e.g., integrity checked and decrypted) in the order they
   are received, and an MPA receiver will process the decrypted FPDUs
   contained in these packets in the same manner as FPDUs contained in
   unsecured IP packets.

   MPA implementations MUST implement IPsec as described in Section 8.4
   below. The use of IPsec is up to ULPs and administrators.

8.4 Requirements for IPsec Encapsulation of MPA/DDP

   The IP Storage working group has spent significant time and effort to
   define the normative IPsec requirements for IP Storage [RFC3723].
   Portions of that specification are applicable to a wide variety of
   protocols, including the RDDP protocol suite. In order to not
   replicate this effort, an MPA on TCP implementation MUST follow the
   requirements defined in RFC 3723 Section 2.3 and Section 5, including
   the associated normative references for those sections.

   Additionally, since IPsec acceleration hardware may only be able to
   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
   messages MAY be sent for idle SAs, as a means of keeping the number
   of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
   delete message MUST NOT be interpreted as a reason for tearing down
   a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
   and if additional traffic is sent on it, to bring up another IKE
   Phase 2 SA to protect it. This avoids the potential for continually
   bringing Streams up and down.

   Note that there are serious security issues if IPsec is not
   implemented end-to-end. For example, if IPsec is implemented as a
   tunnel in the middle of the network, any hosts between the peer and
   the IPsec tunneling device can freely attack the unprotected Stream.

9 IANA Considerations

   No IANA actions are required by this document.

   If a well-known port is chosen as the mechanism to identify a DDP on
   MPA on TCP, the well-known port must be registered with IANA.
   Because the use of the port is DDP specific, registration of the port
   with IANA is left to DDP.

10 References

10.1 Normative References

   [iSCSI] Satran, J., "Internet Small Computer Systems Interface
        (iSCSI)", RFC 3720, April 2004.

   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
        November 1990.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
        Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC3723] Aboba, B., et al., "Securing Block Storage Protocols over
        IP", RFC 3723, April 2004.

   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
        Program Protocol Specification", RFC 793, September 1981.

   [RDMASEC] Pinkerton, J., Deleganes, E., Bitan, S., "DDP/RDMAP
        Security", draft-ietf-rddp-security-09.txt (work in progress),
        May 2006.

10.2 Informative References

   [CRCTCP] Stone, J., Partridge, C., "When the CRC and TCP checksum
        disagree", ACM Sigcomm, Sept. 2000.

   [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
        Library) and uDAPL (User Direct Access Programming Library)",
        http://www.datcollaborative.org.

   [DDP] Shah, H., et al., "Direct Data Placement over Reliable
        Transports", draft-ietf-rddp-ddp-06.txt (work in progress), May
        2006.

   [IT-API] The Open Group, "Interconnect Transport API (IT-API)",
        Version 2.1, http://www.opengroup.org.

   [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
        Internet Protocol", RFC 2401, November 1998.

   [RFC0896] Nagle, J., "Congestion Control in IP/TCP Internetworks",
        RFC 896, January 1984.

   [NagleDAck] Minshall, G., Mogul, J., Saito, Y., Verghese, B.,
        "Application performance pitfalls and TCP's Nagle algorithm",
        Workshop on Internet Server Performance, May 1999.

   [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
        Secure Channels", Internet-Draft
        draft-ietf-nfsv4-channel-bindings-02.txt, July 2004.

   [RDMAP] Recio, R., et al., "RDMA Protocol Specification",
        draft-ietf-rddp-rdmap-06.txt, May 2006.

   [RFC2960] Stewart, R., et al., "Stream Control Transmission
        Protocol", RFC 2960, October 2000.

   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
        September 1981.

   [RFC1122] Braden, R.T., "Requirements for Internet hosts -
        communication layers", RFC 1122, October 1989.

   [VERBS] Hilland, J., et al., "RDMA Protocol Verbs Specification",
        draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf, April 2003,
        http://www.rdmaconsortium.org.

11 Appendix

   This appendix is for information only and is NOT part of the
   standard.

   The appendix covers three topics:

   Section 11.1 is an analysis of MPA on TCP and why it is useful to
   integrate MPA with TCP (with modifications to typical TCP
   implementations) to reduce overall system buffering and overhead.

   Section 11.2 covers some MPA receiver implementation notes.

   Section 11.3 covers methods of making MPA implementations
   interoperate with both IETF and RDMA Consortium versions of the
   protocols.

11.1 Analysis of MPA over TCP Operations

   This appendix analyzes the impact of MPA on the TCP sender, receiver,
   and wire protocol.

   One of MPA's high level goals is to provide enough information, when
   combined with the Direct Data Placement Protocol [DDP], to enable
   out-of-order placement of DDP payload into the final Upper Layer
   Protocol (ULP) buffer. Note that DDP separates the act of placing
   data into a ULP buffer from that of notifying the ULP that the ULP
   buffer is available for use. In DDP terminology, the former is
   defined as "Placement", and the latter is defined as "Delivery". MPA
   supports in-order Delivery of the data to the ULP, including support
   for Direct Data Placement in the final ULP buffer location when TCP
   segments arrive out-of-order. Effectively, the goal is to use the
   pre-posted ULP buffers as the TCP receive buffer, where the
   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
   DDP) is done in place, in the ULP buffer, with no data copies.
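
   The distinction between Placement and Delivery can be illustrated
   with a small model. The sketch below is illustrative only and is not
   part of MPA or DDP: the class name PlacementReceiver, its methods,
   and the use of a plain byte offset as a stand-in for the placement
   information carried in the DDP header are all assumptions made for
   this example. Payload is written into the pre-posted buffer as soon
   as it arrives, while the amount reported as deliverable only advances
   over a contiguous prefix.

```python
class PlacementReceiver:
    """Toy model (hypothetical): out-of-order Placement, in-order Delivery."""

    def __init__(self, size):
        self.ulp_buffer = bytearray(size)  # stands in for a pre-posted ULP buffer
        self.placed = []                   # (start, end) ranges already placed
        self.delivered_upto = 0            # contiguous prefix reportable to the ULP

    def place(self, offset, payload):
        """Place payload at offset; return how many octets are deliverable."""
        end = offset + len(payload)
        if end > len(self.ulp_buffer):
            raise ValueError("payload does not fit the ULP buffer")
        # Placement: data goes straight to its final location, no staging copy.
        self.ulp_buffer[offset:end] = payload
        self.placed.append((offset, end))
        # Delivery: advance only while the placed ranges form a contiguous prefix.
        progressed = True
        while progressed:
            progressed = False
            for start, stop in self.placed:
                if start <= self.delivered_upto < stop:
                    self.delivered_upto = stop
                    progressed = True
        return self.delivered_upto


rx = PlacementReceiver(8)
print(rx.place(4, b"WXYZ"))  # 0: placed, but nothing deliverable yet
print(rx.place(0, b"abcd"))  # 8: the gap is filled, whole buffer deliverable
```

   In this model the second half of the buffer is filled before the
   first, yet the ULP would only be notified once both halves are
   present, which mirrors the behavior described above.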

   This Appendix walks through the advantages and disadvantages of the
   TCP sender modifications proposed by MPA:

   1) that the TCP sender perform Header Alignment, where a TCP segment
      should begin with an MPA Framing Protocol Data Unit (FPDU) (if
      there is payload present).

   2) that there be an integral number of FPDUs in a TCP segment (under
      conditions where the Path MTU is not changing).

   This Appendix concludes that the scaling advantages of FPDU Alignment
   are strong, based primarily on fairly drastic TCP receive buffer
   reduction requirements and simplified receive handling. The analysis
   also shows that there is little effect on TCP wire behavior.

11.1.1 Assumptions

11.1.1.1 MPA is layered beneath DDP [DDP]

   MPA is an adaptation layer between DDP and TCP. DDP requires
   preservation of DDP segment boundaries and a CRC32C digest covering
   the DDP header and data. MPA adds these features to the TCP stream
   so that DDP over TCP has the same basic properties as DDP over SCTP.

11.1.1.2 MPA preserves DDP message framing

   MPA was designed as a framing layer specifically for DDP and was not
   intended as a general-purpose framing layer for any other ULP using
   TCP.

   A framing layer allows ULPs using it to receive indications from the
   transport layer only when complete ULPDUs are present. As a framing
   layer, MPA is not aware of the content of the DDP PDU, only that it
   has received and, if necessary, reassembled a complete PDU for
   Delivery to the DDP.

11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
         normal conditions

   To make reception of a complete DDP PDU on every received segment
   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
   the underlying fabric.
   Each FPDU that MPA creates contains sufficient information for the
   receiver to directly place the ULP payload in the correct location
   in the correct receive buffer.

   Edge cases when this condition does not occur are dealt with, but do
   not need to be on the fast path.

11.1.1.4 Out-of-order placement but NO out-of-order Delivery

   DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
   information necessary to place its ULP payload directly in the
   correct location in host memory.

   Because each DDP segment is self-describing, it is possible for DDP
   segments received out of order to have their ULP payload placed
   immediately in the ULP receive buffer.

   Data delivery to the ULP is guaranteed to be in the order the data
   was sent. DDP only indicates data delivery to the ULP after TCP has
   acknowledged the complete byte stream.

11.1.2 The Value of FPDU Alignment

   Significant receiver optimizations can be achieved when Header
   Alignment and complete FPDUs are the common case. The optimizations
   allow utilizing significantly fewer buffers on the receiver and less
   computation per FPDU. The net effect is the ability to build a
   "flow-through" receiver that enables TCP-based solutions to scale to
   10G and beyond in an economical way. The optimizations are
   especially relevant to hardware implementations of receivers that
   process multiple protocol layers - Data Link Layer (e.g., Ethernet),
   Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
   of TCP (e.g., MPA/DDP). As network speed increases, there is an
   increasing desire to use a hardware based receiver in order to
   achieve an efficient high performance solution.

   A TCP receiver, under worst case conditions, has to allocate buffers
   (BufferSizeTCP) whose capacities are a function of the
   bandwidth-delay product.
   Thus:

      BufferSizeTCP = K * bandwidth [octets/second] * delay [seconds]

   where bandwidth is the end-to-end bandwidth of the connection, delay
   is the round trip delay of the connection, and K is an implementation
   dependent constant.

   Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
   buffers for a 10x increase in end-to-end bandwidth). As this
   buffering approach may scale poorly for hardware or software
   implementations alike, several approaches allow reduction in the
   amount of buffering required for high-speed TCP communication.

   The MPA/DDP approach is to enable the ULP's buffer to be used as the
   TCP receive buffer. If the application pre-posts a sufficient amount
   of buffering, and each TCP segment has sufficient information to
   place the payload into the right application buffer, when an
   out-of-order TCP segment arrives it could potentially be placed
   directly in the ULP buffer. However, placement can only be done when
   a complete FPDU with the placement information is available to the
   receiver, and the FPDU contents contain enough information to place
   the data into the correct ULP buffer (e.g., there is a DDP header
   available).

   For the case when the FPDU is not aligned with the TCP segment, it
   may take, on average, 2 TCP segments to assemble one FPDU.
   Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
   Non-Aligned FPDU) octets:

      BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

   where K1 and K2 are implementation dependent constants and EMSS is
   the effective maximum segment size.

   For example, a 1 Gbps link with 10,000 connections and an EMSS of
   1500B would require 15 MB of memory (taking K1 = 1 and neglecting
   the K2 term). Often the number of connections used scales with the
   network speed, aggravating the situation for higher speeds.
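
   The two sizing formulas above can be evaluated directly. The sketch
   below is a worked example only: the function names and the default
   values chosen for K, K1, and K2 are assumptions for illustration
   (the text leaves these constants implementation dependent), and the
   1 ms round trip delay is an arbitrary example value.

```python
def buffer_size_tcp(bandwidth_octets_per_s, delay_s, k=1.0):
    """BufferSizeTCP = K * bandwidth [octets/second] * delay [seconds]."""
    return k * bandwidth_octets_per_s * delay_s


def buffer_size_naf(emss, number_of_connections, k1=1.0, k2=0.0):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * number_of_connections + k2 * emss


# The example from the text: 10,000 connections, EMSS of 1500 octets.
# K1 = 1 and K2 = 0 are assumed here to reproduce the 15 MB figure.
naf_octets = buffer_size_naf(emss=1500, number_of_connections=10_000)
print(naf_octets / 1e6, "MB")  # 15.0 MB

# Bandwidth-delay product for a 1 Gbps path with an assumed 1 ms RTT.
tcp_octets = buffer_size_tcp(bandwidth_octets_per_s=1e9 / 8, delay_s=0.001)
print(round(tcp_octets), "octets per connection")
```

   Note that the non-aligned figure grows linearly with the number of
   connections, which is the scaling concern raised in the text.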

   FPDU Alignment would allow the receiver to allocate BufferSizeAF
   (Buffer Size, Aligned FPDU) octets:

      BufferSizeAF = K2 * EMSS

   for the same conditions. An FPDU Aligned receiver may require memory
   in the range of ~100s of KB - which is feasible for an on-chip memory
   and enables a "flow-through" design, in which the data flows through
   the NIC and is placed directly in the destination buffer. Assuming
   most of the connections support FPDU Alignment, the receiver buffers
   no longer scale with the number of connections.

   Additional optimizations can be achieved in a balanced I/O sub-system
   -- where the system interface of the network controller provides
   ample bandwidth as compared with the network bandwidth. For almost
   twenty years this has been the case and the trend is expected to
   continue - while Ethernet speeds have scaled by 1000 (from 10
   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
   PCI-X DDR). Under these conditions, the FPDU Alignment approach
   allows BufferSizeAF to be indifferent to network speed. It is
   primarily a function of the local processing time for a given frame.
   Thus when the FPDU Alignment approach is used, receive buffering is
   expected to scale gracefully (i.e., less than linear scaling) as
   network speed is increased.

11.1.2.1 Impact of lack of FPDU Alignment on the receiver computational
         load and complexity

   The receiver must perform IP and TCP processing, and then perform
   FPDU CRC checks, before it can trust the FPDU header placement
   information. For simplicity of the description, the assumption is
   that an FPDU is carried in no more than 2 TCP segments. In reality,
   with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
   segments (e.g., if the PMTU was reduced).

   ----++-----------------------------++-----------------------++-----
   +---||---------------+    +--------||--------+   +----------||----+
   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
   +---||---------------+    +--------||--------+   +----------||----+
   ----++-----------------------------++-----------------------++-----
                   FPDU #N-1                  FPDU #N

   Figure 10: Non-aligned FPDU freely placed in TCP octet stream

   The receiver algorithm for processing TCP segments (e.g., TCP segment
   #X in Figure 10) carrying non-aligned FPDUs (in-order or
   out-of-order) includes:

   1. Data Link Layer processing (whole frame) - typically including a
      CRC calculation.

   2. Network Layer processing (assuming not an IP fragment, the whole
      Data Link Layer frame contains one IP datagram. IP fragments
      should be reassembled in a local buffer. This is not a
      performance optimization goal).

   3. Transport Layer processing -- TCP protocol processing, header
      and checksum checks.

      a. Classify incoming TCP segment using the 5 tuple (IP SRC, IP
         DST, TCP SRC Port, TCP DST Port, protocol).

   4. Find FPDU message boundaries.

      a. Get MPA state information for the connection.

         If the TCP segment is in-order, use the receiver managed MPA
         state information to calculate where the previous FPDU
         message (#N-1) ends in the current TCP segment X.
         (Previously, when the MPA receiver processed the first part
         of FPDU #N-1, it calculated the number of bytes remaining to
         complete FPDU #N-1 by using the MPA Length field.)

            Get the stored partial CRC for FPDU #N-1.

            Complete CRC calculation for FPDU #N-1 data (first
            portion of TCP segment #X).

            Check CRC calculation for FPDU #N-1.

            If no FPDU CRC errors, placement is allowed.

            Locate the local buffer for the first portion of FPDU
            #N-1, CopyData(local buffer of first portion of FPDU
            #N-1, host buffer address, length).

            Compute host buffer address for second portion of FPDU
            #N-1.

            CopyData(local buffer of second portion of FPDU #N-1,
            host buffer address for second portion, length).

            Calculate the octet offset into the TCP segment for the
            next FPDU #N.

            Start calculation of CRC for available data for FPDU #N.

            Store partial CRC results for FPDU #N.

            Store local buffer address of first portion of FPDU #N.

            No further action is possible on FPDU #N before it is
            completely received.

         If the TCP segment is out-of-order, the receiver must buffer
         the data until at least one complete FPDU is received.
         Typically buffering for more than one TCP segment per
         connection is required. Use the MPA based Markers to
         calculate where FPDU boundaries are.

            When a complete FPDU is available, a similar procedure to
            the in-order algorithm above is used. There is additional
            complexity, though, because when the missing segment
            arrives, this TCP segment must be run through the CRC
            engine after the CRC is calculated for the missing
            segment.

   If we assume FPDU Alignment, the following diagram and the algorithm
   below apply. Note that when using MPA, the receiver is assumed to
   actively detect presence or loss of FPDU Alignment for every TCP
   segment received.

      +--------------------------+      +--------------------------+
   +--|--------------------------+   +--|--------------------------+
   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
   +--|--------------------------+   +--|--------------------------+
      +--------------------------+      +--------------------------+
                FPDU #N                          FPDU #N+1

   Figure 11: Aligned FPDU placed immediately after TCP header

   The receiver algorithm for FPDU Aligned frames (in-order or
   out-of-order) includes:

   1) Data Link Layer processing (whole frame) - typically including a
      CRC calculation.

   2) Network Layer processing (assuming not an IP fragment, the whole
      Data Link Layer frame contains one IP datagram. IP fragments
      should be reassembled in a local buffer. This is not a
      performance optimization goal).

   3) Transport Layer processing -- TCP protocol processing, header
      and checksum checks.

      a. Classify incoming TCP segment using the 5 tuple (IP SRC, IP
         DST, TCP SRC Port, TCP DST Port, protocol).

   4) Check for Header Alignment (described in detail in Section 5.4).
      Header Alignment is assumed for the rest of the algorithm below.

      a. If the header is not aligned, see the algorithm defined in
         the prior section.

   5) Whether the TCP segment is in-order or out-of-order, the MPA
      header is at the beginning of the current TCP payload. Get the
      FPDU length from the FPDU header.

   6) Calculate CRC over FPDU.

   7) Check CRC calculation for FPDU #N.

   8) If no FPDU CRC errors, placement is allowed.

   9) CopyData(TCP segment #X, host buffer address, length).

   10) Loop to #5 until all the FPDUs in the TCP segment are consumed,
       in order to handle FPDU packing.

   Implementation note: In both cases the receiver has to classify the
   incoming TCP segment and associate it with one of the flows it
   maintains.
   In the case of no FPDU Alignment, the receiver is forced to classify
   incoming traffic before it can calculate the FPDU CRC. In the case
   of FPDU Alignment, the order of operations is left to the
   implementer.

   The FPDU Aligned receiver algorithm is significantly simpler. There
   is no need to locally buffer portions of FPDUs. Accessing state
   information is also substantially simplified - the normal case does
   not require retrieving information to find out where an FPDU starts
   and ends or retrieval of a partial CRC before the CRC calculation can
   commence. This avoids adding internal latencies, having multiple
   data passes through the CRC machine, or scheduling multiple commands
   for moving the data to the host buffer.

   The aligned FPDU approach is useful for in-order and out-of-order
   reception. The receiver can use the same mechanisms for data storage
   in both cases, and only needs to account for when all the TCP
   segments have arrived to enable Delivery. The Header Alignment,
   along with the high probability that at least one complete FPDU is
   found with every TCP segment, allows the receiver to perform data
   placement for out-of-order TCP segments with no need for intermediate
   buffering. Essentially the TCP receive buffer has been eliminated
   and TCP reassembly is done in place within the ULP buffer.

   In case FPDU Alignment is not found, the receiver should follow the
   algorithm for non-aligned FPDU reception, which may be slower and
   less efficient.

11.1.2.2 FPDU Alignment effects on TCP wire protocol

   An MPA-aware TCP exposes its EMSS to MPA. MPA uses the EMSS to
   calculate its MULPDU, which it then exposes to DDP, its ULP. DDP
   uses the MULPDU to segment its payload so that each FPDU sent by
   MPA fits completely into one TCP segment.
   This has no impact on the wire protocol, and exposing this
   information is already supported on many TCP implementations,
   including all modern flavors of BSD networking, through the
   TCP_MAXSEG socket option.

   In the common case, the ULP (i.e., DDP over MPA) messages provided to
   the TCP layer are segmented to MULPDU size. It is assumed that the
   ULP message size is bounded by MULPDU, such that a single ULP message
   can be encapsulated in a single TCP segment. Therefore, in the
   common case, there is no increase in the number of TCP segments
   emitted. For smaller ULP messages, the sender can also apply
   packing, i.e., the sender packs as many complete FPDUs as possible
   into one TCP segment. The requirement to always have a complete FPDU
   may increase the number of TCP segments emitted. Typically, a ULP
   message size varies from a few bytes to multiple EMSS (e.g., 64
   Kbytes). In some cases the ULP may post more than one message at a
   time for transmission, giving the sender an opportunity for packing.
   In the case where more than one FPDU is available for transmission
   and the FPDUs are encapsulated into a TCP segment and there is no
   room in the TCP segment to include the next complete FPDU, another
   TCP segment is sent. In this corner case some of the TCP segments
   are not full size. In the worst case scenario, the ULP may choose an
   FPDU size that is EMSS/2 + 1 and have multiple messages available
   for transmission. For this poor choice of FPDU size, the average TCP
   segment size is therefore about 1/2 of the EMSS and the number of TCP
   segments emitted approaches 2x of what is possible without the
   requirement to encapsulate an integer number of complete FPDUs in
   every TCP segment.
   This is a dynamic situation that lasts only while the sender ULP has
   multiple non-optimal messages for transmission, and it causes only a
   minor impact on wire utilization.

   However, it is not expected that requiring FPDU Alignment will have a
   measurable impact on wire behavior of most applications. Throughput
   applications with large I/Os are expected to take full advantage of
   the EMSS. Another class of applications with many small outstanding
   buffers (as compared to EMSS) is expected to use packing when
   applicable. Transaction oriented applications are also optimal.

   TCP retransmission is another area that can affect sender behavior.
   TCP supports retransmission of the exact, originally transmitted
   segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
   window" and [RFC1122] section 4.2.2.15). In the unlikely event that
   part of the original segment has been received and acknowledged by
   the remote peer (e.g., a re-segmenting middle box, as documented in
   Section 5.4.1, Re-segmenting Middle boxes and non MPA-aware TCP
   senders), better bandwidth utilization may be possible by
   re-transmitting only the missing octets. If an MPA-aware TCP
   retransmits complete FPDUs, there may be some marginal bandwidth
   loss.

   Another area where a change in the number of TCP segments may have
   an impact is that of Slow Start and Congestion Avoidance. Slow-start
   exponential increase is measured in segments per second, as the
   algorithm focuses on the overhead per segment at the source for
   congestion that eventually results in dropped segments. Slow-start
   exponential bandwidth growth for MPA-aware TCP is similar to any TCP
   implementation. Congestion Avoidance allows for a linear growth in
   available bandwidth when recovering after a packet drop.
   Similar to the analysis for slow-start, MPA-aware TCP doesn't change
   the behavior of the algorithm. Therefore the average size of the
   segment versus EMSS is not a major factor in the assessment of the
   bandwidth growth for a sender. Both Slow Start and Congestion
   Avoidance for an MPA-aware TCP will behave similarly to any TCP
   sender and allow an MPA-aware TCP to enjoy the theoretical
   performance limits of the algorithms.

   In summary, the ULP messages generated at the sender (e.g., the
   number of messages grouped for every transmission request) and the
   message size distribution have the most significant impact on the
   number of TCP segments emitted. The worst case effect for certain
   ULPs (with average message size of EMSS/2 + 1 to EMSS) is bounded by
   an increase of up to 2x in the number of TCP segments and
   acknowledgments. In reality the effect is expected to be marginal.

11.2 Receiver implementation

   Transport & Network Layer Reassembly Buffers:

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent. When
   MPA is enabled, reassembly buffers are needed if out-of-order packets
   arrive and Markers are not enabled. Buffers are also needed if FPDU
   Alignment is lost or if IP fragmentation occurs. This is because the
   incoming out-of-order segment may not contain enough information for
   MPA to process all of the FPDU. For cases where a re-segmenting
   middle box is present, or where the TCP sender is not MPA-aware, the
   presence of Markers significantly reduces the amount of buffering
   needed.

   Recovery from IP Fragmentation must be transparent to the MPA
   Consumers.

11.2.1 Network Layer Reassembly Buffers

   Most IP implementations set the IP Don't Fragment bit.
   Thus upon a path MTU change, intermediate devices drop the IP
   datagram if it is too large and reply with an ICMP message which
   tells the source TCP that the path MTU has changed. This causes TCP
   to emit segments conformant with the new path MTU size. Thus IP
   fragments under most conditions should never occur at the receiver.
   However, it remains possible.

   There are several options for implementation of network layer
   reassembly buffers:

   1. drop any IP fragments, and reply with an ICMP message according
      to [RFC792] (fragmentation needed and DF set) to tell the Remote
      Peer to resize its TCP segment.

   2. support an IP reassembly buffer, but have it of limited size
      (possibly the same size as the local link's MTU). The end Node
      would normally never advertise a path MTU larger than the local
      link MTU. It is recommended that a dropped IP fragment cause an
      ICMP message to be generated according to [RFC792].

   3. multiple IP reassembly buffers, of effectively unlimited size.

   4. support an IP reassembly buffer for the largest IP datagram (64
      KB).

   5. support for a large IP reassembly buffer which could span
      multiple IP datagrams.

   An implementation should support at least option 2 or 3 above, to
   avoid dropping packets that have traversed the entire fabric.

   There is no end-to-end ACK for IP reassembly buffers, so there is no
   flow control on the buffer. The only end-to-end ACK is a TCP ACK,
   which can only occur when a complete IP datagram is delivered to TCP.
   Because of this, under worst case, pathological scenarios, the
   largest IP reassembly buffer is the TCP receive window (to buffer
   multiple IP datagrams that have all been fragmented).

   Note that if the Remote Peer does not implement re-segmentation of
   the data stream upon receiving the ICMP reply updating the path MTU,
   it is possible to halt forward progress because the opposite peer
   would continue to retransmit using a transport segment size that is
   too large. This deadlock scenario is no different than if the fabric
   MTU (not last hop MTU) was reduced after connection setup, and the
   remote Node's behavior is not compliant with [RFC1122].

11.2.2 TCP Reassembly buffers

   A TCP reassembly buffer is also needed. TCP reassembly buffers are
   needed if FPDU Alignment is lost when using TCP with MPA or when the
   MPA FPDU spans multiple TCP segments. Buffers are also needed if
   Markers are disabled and out-of-order packets arrive.

   Since lost FPDU Alignment often means that FPDUs are incomplete, an
   MPA on TCP implementation must have a reassembly buffer large enough
   to recover an FPDU that is less than or equal to the MTU of the
   locally attached link (this should be the largest possible advertised
   TCP path MTU). If the MTU is smaller than 140 octets, the buffer
   MUST be at least 140 octets long to support the minimum FPDU size.
   The 140 octets allows for the minimum MULPDU of 128, 2 octets of pad,
   2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As
   usual, additional buffering may provide better performance.

   Note that if the TCP segment were not stored, it is possible to
   deadlock the MPA algorithm. If the path MTU is reduced, FPDU
   Alignment requires the source TCP to re-segment the data stream to
   the new path MTU. The source MPA will detect this condition and
   reduce the MPA segment size, but any FPDUs already posted to the
   source TCP will be re-segmented and lose FPDU Alignment.
   If the destination does not support a TCP reassembly buffer, these
   segments can never be successfully transmitted and the protocol
   deadlocks.

   When a complete FPDU is received, processing continues normally.

11.3 IETF Implementation Interoperability with RDMA Consortium Protocols

   The RDMA Consortium created early specifications of the MPA/DDP/RDMA
   protocols and some manufacturers created implementations of those
   protocols before the IETF versions were finalized. These protocols
   are very similar to the IETF versions, making it possible for
   implementations to be created or modified to support either set of
   specifications. For those interested, the RDMA Consortium protocol
   documents can be obtained at http://www.rdmaconsortium.org.

   In this section, implementations of MPA/DDP/RDMA that conform to the
   RDMAC specifications are called RDMAC RNICs. Implementations of
   MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.

   Without the exchange of MPA Request/Reply Frames, there is no
   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
   RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
   immediately in RDMA mode (i.e., without exchanging the MPA
   Request/Reply messages), there is no reason to believe an IETF RNIC
   will interoperate with an RDMAC RNIC because of the differences in
   the version number in the DDP and RDMAP headers on the wire.

   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
   implement MPA Request/Reply Frames on behalf of the RNIC in order to
   negotiate the connection parameters. The following section describes
   the results following the exchange of the MPA Request/Reply Frames
   before the conversion from streaming to RDMA mode.

11.3.1 Negotiated Parameters

   Three types of RNICs are considered:

   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
      has a ULP or other supporting entity that exchanges the MPA
      Request/Reply Frames in streaming mode before the conversion to
      RDMA mode.

   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
      which is not capable of implementing the RDMAC protocols. Such
      an RNIC can only interoperate with other IETF RNICs.

   Permissive IETF RNIC - an RNIC implementing the IETF protocols which
      is capable of implementing the RDMAC protocols on a per
      connection basis.

   The Permissive IETF RNIC is recommended for those implementers that
   want maximum interoperability with other RNIC implementations.

   The values used by these three RNIC types for the MPA, DDP, and RDMAP
   versions as well as MPA Markers and CRC are summarized in Figure 12.

   +----------------++-----------+-----------+-----------+-----------+
   | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
   |                ||  Version  | Revision  |  Markers  |    CRC    |
   +----------------++-----------+-----------+-----------+-----------+
   +----------------++-----------+-----------+-----------+-----------+
   | RDMAC          ||     0     |     0     |     1     |     1     |
   |                ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+
   | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
   | Non-permissive ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+
   | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
   | permissive     ||           |           |           |           |
   +----------------++-----------+-----------+-----------+-----------+
   Figure 12. Connection Parameters for the RNIC Types.
      For MPA Markers and MPA CRC, enabled=1, disabled=0.

   It is assumed there is no mixing of versions allowed between MPA, DDP
   and RDMAP.
   The RNIC either generates the RDMAC protocols on the wire
   (version is zero) or the IETF protocols (version is one).

   During the exchange of the MPA Request/Reply Frames, each peer
   provides its MPA Revision, Marker preference (M: 0=disabled,
   1=enabled), and CRC preference. The MPA Revision provided in the MPA
   Request Frame and the MPA Reply Frame may differ.

   From the information in the MPA Request/Reply Frames, each side sets
   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
   well as the state of the Markers for each half connection. Between
   DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP
   and RDMAP version MUST be identical in the two directions. The RNIC
   either generates the RDMAC protocols on the wire (version is zero) or
   the IETF protocols (version is one).

   In the following sections, the figures do not discuss CRC negotiation
   because there is no interoperability issue for CRCs. Since the RDMAC
   RNIC will always request CRC use, then, according to the IETF MPA
   specification, both peers MUST generate and check CRCs.

11.3.2 RDMAC RNIC and Non-permissive IETF RNIC

   Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate
   with an RDMAC RNIC, despite the fact that both peers exchange MPA
   Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
   negotiation has no effect on the DDP/RDMAP version and it is unable
   to interoperate with the RDMAC RNIC.

   The rows in the figure show the state of the Marker field in the MPA
   Request Frame sent by the MPA Initiator. The columns show the state
   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
   Each type of RNIC is shown as an Initiator and a Responder.
   The connection results are shown in the lower right corner, at the
   intersection of the different RNIC types, where V=0 is the RDMAC
   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
   Markers are disabled and M=1 means MPA Markers are enabled. The
   negotiated Marker state is shown as X/Y, for the receive direction of
   the Initiator/Responder.

   +---------------------------++-----------------------+
   |   MPA                     ||          MPA          |
   | CONNECT                   ||       Responder       |
   |   MODE  +-----------------++-------+---------------+
   |         |   RNIC          || RDMAC |     IETF      |
   |         |   TYPE          ||       | Non-permissive|
   |         |          +------++-------+-------+-------+
   |         |          |MARKER|| M=1   | M=0   | M=1   |
   +---------+----------+------++-------+-------+-------+
   +---------+----------+------++-------+-------+-------+
   |         |   RDMAC  | M=1  || V=0   | close | close |
   |         |          |      || M=1/1 |       |       |
   |         +----------+------++-------+-------+-------+
   |   MPA   |          | M=0  || close | V=1   | V=1   |
   |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
   |         |Non-perms.+------++-------+-------+-------+
   |         |          | M=1  || close | V=1   | V=1   |
   |         |          |      ||       | M=1/0 | M=1/1 |
   +---------+----------+------++-------+-------+-------+
   Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
      IETF RNIC.

11.3.2.1 RDMAC RNIC Initiator

   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
   Frame with Rev field set to zero and the M and C bits set to one.
   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
   version number it uses for DDP and RDMAP, it would send an MPA Reply
   Frame with the Rev field equal to one and then gracefully close the
   connection.

11.3.2.2 Non-Permissive IETF RNIC Initiator

   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
   Request Frame with Rev field equal to one.
   The ULP or supporting entity for the RDMAC RNIC responds with an MPA
   Reply Frame that has the Rev field equal to zero and the M bit set to
   one. The Non-permissive IETF RNIC will gracefully close the
   connection after it reads the incompatible Rev field in the MPA Reply
   Frame.

11.3.3 RDMAC RNIC and Permissive IETF RNIC

   Figure 14 shows that a Permissive IETF RNIC can interoperate with an
   RDMAC RNIC regardless of its Marker preference. The figure uses the
   same format as shown with the Non-permissive IETF RNIC.

   +---------------------------++-----------------------+
   |   MPA                     ||          MPA          |
   | CONNECT                   ||       Responder       |
   |   MODE  +-----------------++-------+---------------+
   |         |   RNIC          || RDMAC |     IETF      |
   |         |   TYPE          ||       |  Permissive   |
   |         |          +------++-------+-------+-------+
   |         |          |MARKER|| M=1   | M=0   | M=1   |
   +---------+----------+------++-------+-------+-------+
   +---------+----------+------++-------+-------+-------+
   |         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
   |         |          |      || M=1/1 |       | M=1/1 |
   |         +----------+------++-------+-------+-------+
   |   MPA   |          | M=0  || V=0   | V=1   | V=1   |
   |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
   |         |Permissive+------++-------+-------+-------+
   |         |          | M=1  || V=0   | V=1   | V=1   |
   |         |          |      || M=1/1 | M=1/0 | M=1/1 |
   +---------+----------+------++-------+-------+-------+
   Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
      IETF RNIC.

   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
   Rev field of the MPA Request/Reply Frames and then adjust its receive
   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As
   a result, as an MPA Responder, the Permissive IETF RNIC will never
   return an MPA Reply Frame with the M bit set to zero. This case is
   shown as not applicable (N/A) in Figure 14.
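
   The outcomes in Figures 13 and 14 can be summarized as a small
   decision function. The following is an illustrative sketch only, not
   part of this specification. The names RnicType, Result, and
   negotiate() are hypothetical. The inputs are modeled as each peer's
   Marker preference, and a return value of None stands for a closed
   connection. An RDMAC RNIC always uses M=1, so its Marker argument is
   ignored. The N/A cell of Figure 14 is modeled by having the
   Permissive IETF RNIC override its own preference when the peer is an
   RDMAC RNIC.

```python
from enum import Enum
from typing import NamedTuple, Optional


class RnicType(Enum):
    RDMAC = "rdmac"
    IETF_NON_PERMISSIVE = "ietf-non-permissive"
    IETF_PERMISSIVE = "ietf-permissive"


class Result(NamedTuple):
    version: int       # DDP/RDMAP version: 0 = RDMAC, 1 = IETF
    m_initiator: int   # Markers in the Initiator's receive direction
    m_responder: int   # Markers in the Responder's receive direction


def negotiate(init_type: RnicType, init_m: int,
              resp_type: RnicType, resp_m: int) -> Optional[Result]:
    """Return the negotiated parameters, or None if the connection closes."""
    types = {init_type, resp_type}
    if RnicType.RDMAC in types:
        if RnicType.IETF_NON_PERMISSIVE in types:
            # A Non-permissive IETF RNIC cannot generate version 0 headers.
            return None
        # RDMAC with RDMAC, or with a Permissive IETF RNIC that adapts:
        # version 0 with Markers enabled in both directions.
        return Result(version=0, m_initiator=1, m_responder=1)
    # Both peers are IETF RNICs: version 1, each side keeps its own choice.
    return Result(version=1, m_initiator=init_m, m_responder=resp_m)
```

   For example, negotiate(RnicType.IETF_PERMISSIVE, 0, RnicType.RDMAC, 1)
   yields version 0 with Markers enabled in both directions, which
   matches the V=0, M=1/1 cell of Figure 14.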

11.3.3.1 RDMAC RNIC Initiator

   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
   entity prepares an MPA Request message and sets the revision to zero
   and the M bit and C bit to one.

   The Permissive IETF Responder receives the MPA Request message and
   checks the revision field. Since it is capable of generating RDMAC
   DDP/RDMAP headers, it sends an MPA Reply message with revision set to
   zero and the M and C bits set to one. The Responder must inform its
   ULP that it is generating version zero DDP/RDMAP messages.

11.3.3.2 Permissive IETF RNIC Initiator

   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
   Request Frame setting the Rev field to one. Regardless of the value
   of the M bit in the MPA Request Frame, the ULP or other supporting
   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
   equal to zero and the M bit set to one.

   When the Initiator reads the Rev field of the MPA Reply Frame and
   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
   should generate version zero DDP/RDMAP messages and enable MPA
   Markers and CRC.

11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC

   For completeness, Figure 15 shows the results of MPA negotiation
   between a Non-permissive IETF RNIC and a Permissive IETF RNIC. The
   important point from this figure is that an IETF RNIC cannot detect
   whether its peer is a Permissive or Non-permissive RNIC.

   +---------------------------++-------------------------------+
   |   MPA                     ||              MPA              |
   | CONNECT                   ||           Responder           |
   |   MODE  +-----------------++---------------+---------------+
   |         |   RNIC          ||     IETF      |     IETF      |
   |         |   TYPE          || Non-permissive|  Permissive   |
   |         |          +------++-------+-------+-------+-------+
   |         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
   +---------+----------+------++-------+-------+-------+-------+
   +---------+----------+------++-------+-------+-------+-------+
   |         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
   |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
   |         |Non-perms.+------++-------+-------+-------+-------+
   |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
   |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
   |   MPA   +----------+------++-------+-------+-------+-------+
   |Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
   |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
   |         |Permissive+------++-------+-------+-------+-------+
   |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
   |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
   +---------+----------+------++-------+-------+-------+-------+
   Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
      Permissive IETF RNIC.

12 Authors' Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA 01810 USA
   Phone: +1 978 689 1614
   Email: steph@sandburst.com

   Paul R. Culley
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-5543
   Email: paul.culley@hp.com

   Uri Elzur
   Broadcom
   16215 Alton Parkway
   CA, 92618
   Phone: 949.585.6432
   Email: uri@broadcom.com

   Renato J Recio
   IBM
   Internal Zip 9043
   11400 Burnett Road
   Austin, Texas 78759
   Phone: 512-838-3685
   Email: recio@us.ibm.com

   John Carrier
   Cray Inc.
   411 First Avenue S, Suite 600
   Seattle, WA 98104-2860
   Phone: 206-701-2090
   Email: carrier@cray.com

13 Acknowledgments

   Dwight Barron
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-2769
   Email: dwight.barron@hp.com

   Jeff Chase
   Department of Computer Science
   Duke University
   Durham, NC 27708-0129 USA
   Phone: +1 919 660 6559
   Email: chase@cs.duke.edu

   Ted Compton
   EMC Corporation
   Research Triangle Park, NC 27709, USA
   Phone: 919-248-6075
   Email: compton_ted@emc.com

   Dave Garcia
   Hewlett-Packard Company
   19333 Vallco Parkway
   Cupertino, Ca. USA 95014
   Phone: 408.285.6116
   Email: dave.garcia@hp.com

   Hari Ghadia
   Adaptec, Inc.
   691 S. Milpitas Blvd.,
   Milpitas, CA 95035 USA
   Phone: +1 (408) 957-5608
   Email: hari_ghadia@adaptec.com

   Howard C. Herbert
   Intel Corporation
   MS CH7-404
   5000 West Chandler Blvd.
   Chandler, Arizona 85226
   Phone: 480-554-3116
   Email: howard.c.herbert@intel.com

   Jeff Hilland
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-9489
   Email: jeff.hilland@hp.com

   Mike Ko
   IBM
   650 Harry Rd.
   San Jose, CA 95120
   Phone: (408) 927-2085
   Email: mako@us.ibm.com

   Mike Krause
   Hewlett-Packard Corporation, 43LN
   19410 Homestead Road
   Cupertino, CA 95014 USA
   Phone: +1 (408) 447-3191
   Email: krause@cup.hp.com

   Dave Minturn
   Intel Corporation
   MS JF1-210
   5200 North East Elam Young Parkway
   Hillsboro, Oregon 97124
   Phone: 503-712-4106
   Email: dave.b.minturn@intel.com

   Jim Pinkerton
   Microsoft, Inc.
   One Microsoft Way
   Redmond, WA, USA 98052
   Email: jpink@microsoft.com

   Hemal Shah
   16215 Alton Parkway
   Irvine, California 92619-7013 USA
   Phone: +1 949 926-6941
   Email: hemal@broadcom.com

   Allyn Romanow
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA 95134 USA
   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Tom Talpey
   Network Appliance
   375 Totten Pond Road
   Waltham, MA 02451 USA
   Phone: +1 (781) 768-5329
   Email: thomas.talpey@netapp.com

   Patricia Thaler
   Broadcom
   16215 Alton Parkway
   Irvine, CA 92618
   Phone: 916 570 2707
   Email: pthaler@broadcom.com

   Jim Wendt
   Hewlett Packard Corporation
   8000 Foothills Boulevard MS 5668
   Roseville, CA 95747-5668 USA
   Phone: +1 916 785 5198
   Email: jim_wendt@hp.com

   Jim Williams
   Emulex Corporation
   580 Main Street
   Bolton, MA 01740 USA
   Phone: +1 978 779 7224
   Email: jim.williams@emulex.com

Full Copyright Statement

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.