INTERNET-DRAFT                                                 P. Culley
draft-culley-iwarp-mpa-02.txt                    Hewlett-Packard Company
                                                                U. Elzur
                                                    Broadcom Corporation
                                                                R. Recio
                                                         IBM Corporation
                                                               S. Bailey
                                                   Sandburst Corporation
                                                              J. Carrier
                                                                 Adaptec

Expires: August 2003

           Marker PDU Aligned Framing for TCP Specification

1 Status of this Memo

This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html.  The list of Internet-Draft
Shadow Directories can be accessed at http://www.ietf.org/shadow.html

2 Abstract

A framing protocol is defined for TCP that is fully compliant with
applicable TCP RFCs and fully interoperable with existing TCP
implementations.  The framing mechanism is designed to work as an
"adaptation layer" between TCP and the Direct Data Placement [DDP]
protocol, preserving the reliable, in-order delivery of TCP, while
adding the preservation of higher-level protocol record boundaries
that DDP requires.

Table of Contents

1       Status of this Memo
2       Abstract
3       Introduction
3.1     Motivation
3.2     Protocol Overview
4       Glossary
5       LLP and DDP requirements
5.1     TCP implementation Requirements to support MPA
5.1.1   TCP Transmit side
5.1.2   TCP Receive side
5.2     MPA's interactions with DDP
6       FPDU Formats
6.1     Marker Format
7       Data Transfer Semantics
7.1     MPA Markers
7.2     CRC Calculation
7.3     MPA on TCP Sender Segmentation
7.3.1   Effects of MPA on TCP Segmentation
7.3.2   FPDU Size Considerations
7.4     MPA Receiver FPDU Identification
7.4.1   Re-segmenting Middle boxes and non MPA-aware TCP senders
8       Connection Semantics
8.1     Connection setup
8.2     Normal Connection Teardown
9       Error Semantics
10      Security Considerations
10.1    Protocol-specific Security Considerations
10.2    Using IPsec With MPA
11      IANA Considerations
12      References
12.1    Normative References
12.2    Informative References
13      Appendix
13.1    Receiver implementation
13.1.1  Transport & Network Layer Reassembly Buffers
14      Author's Addresses
15      Acknowledgments
16      Full Copyright Statement

Table of Figures

Figure 1  ULP MPA TCP Layering
Figure 2  FPDU Format
Figure 3  Marker Format
Figure 4  Example FPDU Format with Marker
Figure 5  Annotated Hex Dump of an FPDU
Figure 6  Annotated Hex Dump of an FPDU with Marker
Figure 7  Example Startup negotiation

Revision history

[02] Enhanced descriptions of how MPA is used over an unmodified TCP.

[02] Removed "No Packing" text.

[02] Made MPA an adaptation layer for DDP, instead of a generalized
     framing solution.

[02] Added clarifications of the MPA/TCP interaction for optimized
     implementations and that any such optimizations are to be used
     only when requested by MPA.

Note: a discussion of reasons for these changes can be found in
[ELZUR-MPA].

3 Introduction

This section discusses the reasons for creating MPA on TCP and gives a
general overview of the protocol.
Later sections show the MPA headers (see section 6) and detailed
protocol requirements and characteristics (see section 7), as well as
Connection Semantics (section 8), Error Semantics (section 9), and
Security Considerations (section 10).

3.1 Motivation

The Direct Data Placement protocol [DDP], when used with TCP
[RFC793], requires a mechanism to detect record boundaries.  The DDP
records are referred to as Upper Layer Protocol Data Units by this
document.  The ability to locate the Upper Layer Protocol Data Unit
(ULPDU) boundary is useful to a hardware network adapter that uses
DDP to directly place the data in the application buffer based on the
control information carried in the ULPDU header.  This may be done
without requiring that the packets arrive in order.  Potential
benefits of this capability are the avoidance of the memory copy
overhead and a smaller memory requirement for handling out of order
or dropped packets.

Many approaches have been proposed for a generalized framing
mechanism.  Some are probabilistic in nature and others are
deterministic.  A probabilistic approach is characterized by a
detectable value embedded in the octet stream.  It is probabilistic
because under some conditions the receiver may incorrectly interpret
application data as the detectable value.  Under these conditions,
the protocol may fail with unacceptable frequency.  A deterministic
approach is characterized by embedded controls at known locations in
the octet stream.  Because the receiver can guarantee it will only
examine the data stream at locations that are known to contain the
embedded control, the protocol can never misinterpret application
data as being embedded control data.  For unambiguous handling of an
out of order packet, the deterministic approach is preferred.

The MPA protocol provides a framing mechanism for DDP running over
TCP using the deterministic approach.  It allows the location of the
ULPDU to be determined in the TCP stream even if the TCP segments
arrive out of order.

3.2 Protocol Overview

MPA is described as an extra layer above TCP and below DDP.  The
end-to-end data flow is:

1.  The DDP's ULP negotiates the use of DDP and MPA at both ends of a
    connection.

2.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
    for this value.  MPA derives this information from TCP, when it
    is available, or chooses a reasonable value.  Many TCP
    implementations, including all modern flavors of BSD networking,
    already make this information available through the TCP_MAXSEG
    socket option.

3.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
    MPA at the sender.

4.  MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
    header, inserting markers, and appending a CRC after the ULPDU
    and PAD (if any).  MPA delivers the FPDU to TCP.

5.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
    Sender is MPA-aware, it segments the TCP stream in such a way
    that a TCP Segment boundary is also the boundary of an FPDU.  TCP
    then passes each segment to the IP layer for transmission.

6.  The TCP receiver may or may not be MPA-aware.  If it is
    MPA-aware, it may separate passing the TCP payload to MPA from
    passing the TCP payload ordering information to MPA.  In either
    case, RFC compliant TCP wire behavior is observed at both the
    sender and receiver.

7.  The MPA receiver locates and assembles complete FPDUs within the
    stream, verifies their integrity, and removes MPA markers,
    ULPDU_Length, PAD and CRC.

8.  MPA then provides the complete ULPDUs to DDP.  MPA may also
    separate passing MPA payload to DDP from passing the MPA payload
    ordering information.
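
Step 4 above can be illustrated with a short sketch.  It is a minimal
illustration under stated assumptions, not a reference
implementation: it assumes the field layout given in section 6 (a
16-bit ULPDU_Length in network byte order, zero padding to a 4 octet
multiple, and a 32-bit CRC32C), it assumes the FPDU does not span a
marker position, and the names crc32c and frame_ulpdu are
hypothetical.  The octet order in which the CRC is appended is also an
assumption and should be checked against [iSCSI].

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell out.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def frame_ulpdu(ulpdu: bytes) -> bytes:
    """Build a marker-free FPDU: ULPDU_Length, ULPDU, PAD, CRC."""
    if not 1 <= len(ulpdu) <= 64768:
        raise ValueError("ULPDU length outside the range allowed by MPA")
    header = struct.pack("!H", len(ulpdu))        # 16-bit length field
    pad = bytes(-(len(header) + len(ulpdu)) % 4)  # 0-3 zero octets
    body = header + ulpdu + pad
    # Assumption: CRC octets appended least significant octet first.
    return body + struct.pack("<I", crc32c(body))
```

For a 16 octet ULPDU this sketch yields 24 octets before any marker is
inserted: 2 octets of header, 16 of ULPDU, 2 of PAD, and 4 of CRC.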

The layering of PDUs with MPA is shown in Figure 1, below.

MPA-aware TCP is a TCP layer which potentially contains some
additional semantics as defined in this document.  MPA is implemented
as a data stream ULP for TCP and is therefore RFC compliant.
MPA-aware TCP is RFC compliant.

   +------------------+
   |    ULP client    |
   +------------------+  <- Consumer messages
   |       DDP        |
   +------------------+  <- ULPDUs
   |       MPA        |
   +------------------+  <- FPDUs (containing ULPDUs)
   |       TCP*       |
   +------------------+  <- TCP Segments (containing FPDUs)
   |     IP etc.      |
   +------------------+
    * TCP or MPA-aware TCP.

                 Figure 1 ULP MPA TCP Layering

An MPA-aware TCP sender is able to segment the data stream such that
TCP segments begin with FPDUs (FPDU Alignment).  This has significant
advantages for receivers.  When segments arrive with aligned FPDUs
the receiver usually need not buffer any portion of the segment,
allowing DDP to place it in its destination memory immediately, thus
avoiding copies from intermediate buffers (DDP's reason for
existence).

MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation
to recover ULPDUs that may be received out of order.  This enables a
DDP on MPA implementation to save a significant amount of
intermediate storage by placing the ULPDUs in the right locations in
the application buffers when they arrive, rather than waiting until
full ordering can be restored.

MPA implementations that support recovery of out of order ULPDUs MUST
support a mechanism to indicate the ordering of ULPDUs as the sender
transmitted them and indicate when missing intermediate segments
arrive.  These mechanisms allow DDP to reestablish record ordering
and report Delivery of complete messages (groups of records).

MPA also addresses enhanced data integrity.
Many users of TCP have noted that the TCP checksum is not as strong
as could be desired [CRCTCP].  Studies have shown that the TCP
checksum indicates segments in error at a much higher rate than the
underlying link characteristics would indicate.  With these higher
error rates, the chance that an error will escape detection, when
using only the TCP checksum for data integrity, becomes a concern.  A
stronger integrity check can reduce the chance of data errors being
missed.

MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC2960].

4 Glossary

Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
   the process of informing DDP that a particular PDU is ordered for
   use.  This is specifically different from "passing the PDU to
   DDP", which may generally occur in any order, while the order of
   "Delivery" is strictly defined.

EMSS - Effective Maximum Segment Size.  EMSS is the smaller of the
   TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
   and the current path Maximum Transfer Unit (MTU) [RFC1191].

FPDU - Framed Protocol Data Unit.  The unit of data created by an
   MPA sender.

FPDU Alignment - The property that a TCP segment begins with an FPDU.

PDU - Protocol Data Unit.

MPA - Marker-based ULP PDU Aligned Framing for TCP protocol.  This
   document defines the MPA protocol.

MULPDU - Maximum ULPDU.  The current maximum size of the record that
   is acceptable for DDP to pass to MPA for transmission.

Node - A computing device attached to one or more links of a Network.
   A Node in this context does not refer to a specific application
   or protocol instantiation running on the computer.  A Node may
   consist of one or more MPA on TCP devices installed in a host
   computer.

Remote Peer - The MPA protocol implementation on the opposite end of
   the connection.
   Used to refer to the remote entity when describing protocol
   exchanges or other interactions between two Nodes.

ULP - Upper Layer Protocol.  The protocol layer above the protocol
   layer currently being referenced.  The ULP for MPA is DDP [DDP].

ULPDU - Upper Layer Protocol Data Unit.  The data record defined by
   the layer above MPA (DDP).  ULPDU corresponds to DDP's "DDP
   Segment".

5 LLP and DDP requirements

5.1 TCP implementation Requirements to support MPA

The TCP implementation MUST inform MPA when the TCP connection is
closed or has begun closing the connection (e.g. received a FIN).

5.1.1 TCP Transmit side

To provide optimum performance, an MPA-aware transmit side TCP
implementation SHOULD be enabled to:

*   With an EMSS large enough to contain the FPDU(s), segment the
    outgoing TCP stream such that the first octet of every TCP
    Segment begins with an FPDU.  Multiple FPDUs MAY be packed into a
    single TCP segment as long as they are entirely contained in the
    TCP segment.

*   Report the current EMSS to the MPA transmit layer.

An MPA-aware TCP transmit side implementation MUST continue to use
the method of segmentation expected by non-MPA applications (and
described in TCP RFCs) when MPA is not enabled on the connection.
When MPA is enabled above an MPA-aware TCP, it SHOULD specifically
enable the segmentation rules described above for the DDP segments
(FPDUs) posted for transmission.

If the transmit side TCP implementation is not able to segment the
TCP stream as indicated above, MPA should make a best effort to
achieve that result.  For example, using the TCP_NODELAY socket
option to disable the Nagle algorithm will usually result in many of
the segments starting with an FPDU.

If the transmit side TCP implementation is not able to report the
EMSS, MPA may assume that TCP will use 1460 octet segments in
creating FPDUs.
If the implementation has reason to believe that the TCP segment size
is actually smaller than 1460, it may instead use a 536 octet FPDU.

5.1.2 TCP Receive side

When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out of order ULPDUs, the TCP receive
implementation SHOULD be enabled to:

*   Pass incoming TCP segments to MPA as soon as they have been
    received and validated, even if not received in order.  The TCP
    layer MUST have committed to keeping each segment before it can
    be passed to the MPA.  This means that the segment must have
    passed the TCP, IP, and lower layer data integrity validation
    (i.e., checksum), must be in the receive window, must not be a
    duplicate, must be part of the same epoch (if timestamps are used
    to verify this), and must pass any other checks required by TCP
    RFCs.  The segment MUST NOT be passed to MPA more than once
    unless explicitly requested (see Section 9).

    This is not to imply that the data must be completely ordered
    before use.  An implementation may accept out of order segments,
    SACK them [RFC2018], and pass them to DDP once the segments
    needed to fill in the gaps arrive.  Such an implementation can
    "commit" to the data early on, and will not overwrite it even if
    (or when) duplicate data arrives.  MPA expects to utilize this
    "commit" to allow the passing of ULPDUs to DDP when they arrive,
    independent of ordering.

*   Provide a mechanism to indicate the ordering of TCP segments as
    the sender transmitted them.  One possible mechanism might be
    attaching the TCP sequence number to each segment.

*   Provide a mechanism to indicate when a given TCP segment (and the
    prior TCP stream) is complete.  One possible mechanism might be
    to utilize the leading (left) edge of the TCP Receive Window.

    DDP on MPA MUST utilize these two mechanisms to establish the
    Delivery semantics that DDP's consumers agree to.  These
    semantics are described fully in [DDP].  These include
    requirements on DDP's consumer to respect ownership of buffers
    prior to the time that DDP delivers them to the consumer.

An MPA-aware TCP receive side implementation MUST continue to buffer
TCP segments until completely ordered and then deliver them as
expected by non-MPA applications (and described in TCP RFCs) when MPA
is not enabled on the connection.  When MPA is enabled above an
MPA-aware TCP, TCP SHOULD enable the in and out of order passing of
data, and the separate ordering information as described above.

When an MPA receive implementation is coupled with a TCP receive
implementation that does not support the preceding mechanisms, TCP
passes and Delivers incoming stream data to MPA in order.

5.2 MPA's interactions with DDP

DDP requires MPA to maintain DDP record boundaries from the sender to
the receiver.  When using MPA on TCP to send data, DDP provides
records (ULPDUs) to MPA.  MPA will use the reliable transmission
abilities of TCP to transmit the data, and will insert appropriate
additional information into the TCP stream to allow the MPA receiver
to locate the record boundary information.

As such, MPA accepts complete records (ULPDUs) from DDP at the sender
and returns them to DDP at the receiver.

MPA combined with an MPA-aware TCP can only ensure FPDU Alignment
with the TCP Header if the FPDU is less than or equal to TCP's EMSS.

Since FPDU alignment is generally desired by the receiver, DDP must
cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS
under normal conditions.  This is done with the MULPDU mechanism.

MPA provides information to DDP on the current maximum size of the
record that is acceptable to send (MULPDU).
DDP SHOULD limit each record size to MULPDU.  The range of MULPDU
values MUST be between 128 octets and 64768 octets, inclusive.

The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
MPA.  DDP MAY post a ULPDU of any size between one and 64768 octets;
however, MPA is NOT REQUIRED to support a ULPDU length that is
greater than the current MULPDU.

While the maximum theoretical length supported by the MPA header
ULPDU_Length field is 65535, TCP over IP requires the IP datagram
maximum length to be 65535 octets.  To enable MPA to support FPDU
Alignment, the maximum size of the FPDU must fit within an IP
datagram.  Thus the ULPDU limit of 64768 octets was derived by taking
the maximum IP datagram length, subtracting from it the maximum total
length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
options, and the worst case MPA overhead, and then rounding the
result down to a 128 octet boundary.

On receive, MPA MUST pass each ULPDU with its length to DDP when it
has been validated.

If an MPA implementation supports passing out of order ULPDUs to DDP,
the MPA implementation SHOULD:

*   Pass each ULPDU with its length to DDP as soon as it has been
    fully received and validated.

*   Provide a mechanism to indicate the ordering of ULPDUs as the
    sender transmitted them.  One possible mechanism might be
    providing the TCP sequence number for each ULPDU.

*   Provide a mechanism to indicate when a given ULPDU (and prior
    ULPDUs) are complete.  One possible mechanism might be to allow
    DDP to see the current outgoing TCP Ack sequence number.

*   Provide an indication to DDP that the TCP has closed or has begun
    to close the connection (e.g. received a FIN).

6 FPDU Formats

MPA senders create FPDUs out of ULPDUs.  The format of an FPDU shown
below MUST be used for all MPA FPDUs.  For purposes of clarity,
markers are not shown in Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          ULPDU_Length         |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   ~                                                               ~
   ~                             ULPDU                             ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |       PAD (0-3 octets)        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 2 FPDU Format

ULPDU_Length: 16 bits (unsigned integer).  This is the number of
octets of the contained ULPDU.  It does not include the length of the
FPDU header itself, the pad, the CRC, or of any markers that fall
within the ULPDU.  The 16-bit ULPDU_Length field is large enough to
support the largest IP datagrams for IPv4 or IPv6.

PAD: The PAD field trails the ULPDU and contains between zero and
three octets of data.  The pad data MUST be set to zero by the sender
and ignored by the receiver (except for CRC checking).  The length of
the pad is set so as to make the size of the FPDU an integral
multiple of four.

CRC: 32 bits.  This CRC is used to verify the entire contents of the
FPDU, using CRC32C.  See section 7.2, CRC Calculation.

The FPDU adds a minimum of 6 octets to the length of the ULPDU.  In
addition, the total length of the FPDU will include the length of any
markers and from 0 to 3 pad octets added to round up the ULPDU size.

6.1 Marker Format

The format of a marker MUST be as specified in Figure 3:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           RESERVED            |            FPDUPTR            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 3 Marker Format

RESERVED: The Reserved field MUST be set to zero on transmit and
ignored on receive (except for CRC calculation).

FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long,
interpreted as an unsigned integer, that indicates the number of
octets in the TCP stream from the beginning of the FPDU to the first
octet of the entire marker.

7 Data Transfer Semantics

This section discusses some characteristics and behavior of the MPA
protocol as well as implications of that protocol.

7.1 MPA Markers

MPA senders MUST insert a marker into the data stream at a 512 octet
periodic interval in the TCP Sequence Number Space.  The marker
contains a 16 bit unsigned integer referred to as the FPDUPTR (FPDU
Pointer).

If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
relative back-pointer.  FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the current FPDU to the first
octet of the marker, unless the marker falls between FPDUs.  Thus the
location of the first octet of the previous FPDU header can be
determined by subtracting the value of the given marker from the
current octet-stream sequence number (i.e. TCP sequence number) of
the first octet of the marker.  Note that this computation must take
into account that the TCP sequence number could have wrapped between
the marker and the header.

An FPDUPTR value of 0x0000 is a special case - it is used when the
marker falls exactly between FPDUs.  In this case, the marker MUST be
placed in the following FPDU and viewed as being part of that FPDU
(e.g. for CRC calculation).  Thus an FPDUPTR value of 0x0000 means
that immediately following the marker is an FPDU header.

Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero.  MPA
reserves these bits so they MUST be treated as zero for computation
at the receiver.
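
The back-pointer arithmetic above can be sketched as follows.  This is
an illustrative sketch only: the function name fpdu_header_seq is
hypothetical, TCP sequence numbers are modeled as integers modulo
2**32, and a marker is taken to be 4 octets long as in Figure 3.

```python
SEQ_SPACE = 1 << 32   # TCP sequence numbers wrap modulo 2**32
MARKER_LEN = 4        # 16-bit RESERVED plus 16-bit FPDUPTR


def fpdu_header_seq(marker_seq: int, fpduptr: int) -> int:
    """Return the sequence number of the ULPDU_Length header.

    marker_seq is the sequence number of the first octet of the marker.
    """
    # The bottom two bits are reserved and treated as zero on receive.
    ptr = fpduptr & 0xFFFC
    if ptr == 0:
        # Marker fell between FPDUs: the header follows the marker.
        return (marker_seq + MARKER_LEN) % SEQ_SPACE
    # Otherwise step back to the start of the current FPDU; the modulo
    # handles a sequence number wrap between header and marker.
    return (marker_seq - ptr) % SEQ_SPACE
```

For example, a marker whose first octet is at 0x0200 with an FPDUPTR
of 0x0014 resolves to a header at 0x01ec, and a marker at sequence
number 8 with the same FPDUPTR resolves to a header that lies 12
octets before the wrap point.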

The MPA markers MUST be inserted immediately following MPA connection
establishment, and at every 512th octet of the TCP octet stream
thereafter.  As a result, the first marker has an FPDUPTR value of
0x0000.  If the first marker begins at octet sequence number
SeqStart, then markers are inserted such that the first octet of the
marker is at octet sequence number SeqNum if (SeqNum - SeqStart) mod
512 is zero.  Note that SeqNum can wrap.

For example, if the TCP sequence number were used to calculate the
insertion point of the marker, the starting TCP sequence number is
unlikely to be zero, so markers are unlikely to fall on sequence
numbers that are multiples of 512.  If the MPA connection is started
at TCP sequence number 11, then the first marker will begin at 11,
and subsequent markers will begin at 523, 1035, etc.

If an FPDU is large enough to contain multiple markers, they MUST all
point to the same point in the TCP stream: the first octet of the
FPDU.

If a marker interval contains multiple FPDUs (the FPDUs are small),
the marker MUST point to the start of the FPDU containing the marker
unless the marker falls between FPDUs, in which case the marker MUST
be zero.

The following example shows an FPDU containing a marker.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     ULPDU Length (0x0010)     |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |
   +                                                               +
   |                      ULPDU (octets 0-9)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           (0x0000)            |       FPDU ptr (0x000C)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     ULPDU (octets 10-15)                      |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |      PAD (2 octets:0,0)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              CRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 4 Example FPDU Format with Marker

MPA Receivers MUST preserve ULPDU boundaries when passing data to
DDP.  MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
DDP and not the markers, headers, and CRC.

7.2 CRC Calculation

When sending an FPDU, the sender MUST include a valid CRC field.  The
CRC field in the MPA FPDU MUST be computed using the CRC32C
polynomial in the manner described in the iSCSI Protocol [iSCSI]
document for Header and Data Digests.

The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows:

1)  If the first octet of the FPDU is the "ULPDU Length" field, the
    CRC-32c is calculated from the first octet of the "ULPDU Length"
    header, through all the ULPDU and markers (if present), to the
    last octet of the PAD (if present), inclusive.  If there is a
    marker immediately following the PAD, the marker is included in
    the CRC calculation for this FPDU.

2)  If the first octet of the FPDU is a marker (i.e. the marker fell
    between FPDUs, and thus is required to be included in the second
    FPDU), the CRC-32c is calculated from the first octet of the
    marker, through the "ULPDU Length" header, through all the ULPDU
    and markers (if present), to the last octet of the PAD (if
    present), inclusive.

3)  After calculating the CRC-32c, the resultant value is placed into
    the CRC field at the end of the FPDU.

When an FPDU is received, the receiver MUST first perform the
following:

1)  Calculate the CRC of the incoming FPDU in the same fashion as
    defined above.

2)  Verify that the calculated CRC-32c value is the same as the
    received CRC-32c value found in the FPDU CRC field.  If not, the
    receiver MUST treat the FPDU as an invalid FPDU.

The procedure for handling invalid FPDUs is covered in the Error
Section (see section 9).

The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream.  As such, it starts with a marker.  The
FPDU contains a 42 octet ULPDU: a DDP header followed by 24 octets of
Send data, which are all zeros.  The CRC32c has been correctly
calculated and can be used as a reference.  See the [DDP] and [RDMA]
specifications for definitions of the DDP Control field, Queue, MSN,
MO, and Send Data.

   Octet Contents  Annotation
   Count

   0000  00 00     Marker: Reserved
   0002  00 00     FPDUPTR
   0004  00 2a     Length
   0006  40 03     DDP Control Field, Send with Last flag set
   0008  00 00     Reserved (STag position with no STag)
   000a  00 00
   000c  00 00     Queue = 0
   000e  00 00
   0010  00 00     MSN = 1
   0012  00 01
   0014  00 00     MO = 0
   0016  00 00
   0018  00 00
                   Send Data (24 octets of zeros)
   002e  00 00
   0030  4C 86     CRC32c
   0032  B3 84
              Figure 5 Annotated Hex Dump of an FPDU

   The following is an example sent as the second FPDU of the stream
   where the first FPDU (which is not shown here) had a length of 492
   octets and was also a Send to Queue 0 with Last Flag set. This
   example contains a marker.

   Octet Contents  Annotation
   Count

   01ec  00 2a     Length
   01ee  40 03     DDP Control Field: Send with Last Flag set
   01f0  00 00     Reserved (STag position with no STag)
   01f2  00 00
   01f4  00 00     Queue = 0
   01f6  00 00
   01f8  00 00     MSN = 2
   01fa  00 02
   01fc  00 00     MO = 0
   01fe  00 00
   0200  00 00     Marker: Reserved
   0202  00 14     FPDUPTR
   0204  00 00
                   Send Data (24 octets of zeros)
   021a  00 00
   021c  A1 9C     CRC32c
   021e  D1 03
         Figure 6 Annotated Hex Dump of an FPDU with Marker

7.3 MPA on TCP Sender Segmentation

   The various TCP RFCs allow considerable choice in segmenting a TCP
   stream. In order to optimize FPDU recovery at the MPA receiver, MPA
   specifies additional segmentation rules.

   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
   contained in one FPDU.

   An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
   implementations that support this, and with an EMSS large enough to
   contain at least one FPDU, segment the outbound TCP stream such that
   each TCP segment begins with an FPDU, and fully contains all included
   FPDUs.
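
   As an illustration of the segmentation rule above, the following
   Python sketch groups already-built FPDUs into TCP segments so that
   every segment begins with an FPDU and no FPDU straddles a segment
   boundary. The function name and the idea of passing FPDU sizes as a
   list are assumptions made for this example and are not part of MPA;
   each size is taken to include any markers, pad, and CRC.

```python
def pack_fpdus(fpdu_sizes, emss):
    """Group FPDU sizes (in octets) into TCP segments of at most emss octets.

    Hypothetical helper: each returned segment starts with an FPDU and
    fully contains every FPDU placed in it.
    """
    segments = []
    current = []
    used = 0
    for size in fpdu_sizes:
        if size > emss:
            # Such an FPDU cannot follow the rule; the exceptions (EMSS
            # changes, small receive window) are handled separately.
            raise ValueError("FPDU larger than EMSS")
        if used + size > emss:
            # Close the current segment so the next FPDU starts a new one.
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    print(pack_fpdus([52, 52, 1400, 100], 1460))
```

   Packing several FPDUs into one segment is optional; a sender that
   never packs would simply emit one FPDU per segment.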

      Implementation note: To achieve the previous segmentation rule,
      TCP's Nagle [RFC0896] algorithm SHOULD be disabled.

   There are exceptions to the above rule. Once a ULPDU is provided to
   MPA, the MPA on TCP sender MUST transmit it or fail the connection;
   it cannot be repudiated. As a result, during changes in MTU and
   EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it
   may be necessary to send FPDUs that do not conform to the
   segmentation rule above.

   A possible, but less desirable, alternative is to use IP
   fragmentation on accepted FPDUs to deal with MTU reductions or
   extremely small EMSS.

   The sender MUST still format the FPDU according to FPDU format as
   shown in Figure 2.

   On a retransmission, TCP does not necessarily preserve original TCP
   segmentation boundaries. This can lead to the loss of FPDU alignment
   and containment within a TCP segment during TCP retransmissions. An
   MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
   boundaries on a retransmission.

7.3.1 Effects of MPA on TCP Segmentation

   Applications expected to see strong advantages from Direct Data
   Placement include transaction-based applications and throughput
   applications. Request/response protocols typically send one FPDU per
   TCP segment and then wait for a response. Therefore, the application
   is expected to set TCP parameters such that it can trade off latency
   and wire efficiency. This is accomplished by setting the TCP_NODELAY
   socket option.

   When latency is not critical, and the application provides data in
   chunks larger than EMSS at one time, the TCP implementation may
   "pack" any available stream data into TCP segments so that the
   segments are filled to the EMSS.
   If the amount of data available is not enough to fill the TCP segment
   when it is prepared for transmission, TCP can send the segment partly
   filled, or use the Nagle algorithm to wait for the ULP to post more
   data (discussed below).

   DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU
   when a DDP message is large enough. Since the DDP message may not
   exactly fit into TCP segments, a "message tail" often occurs that
   results in an FPDU that is smaller than a single TCP segment. If a
   "message tail", small DDP messages, or the start of a larger DDP
   message are available, MPA MAY "pack" the resulting FPDUs into TCP
   segments. When this is done, the TCP segments can be more fully
   utilized, but, due to the size constraints of FPDUs, segments may not
   be filled to the EMSS.

      Note that MPA receivers must do more processing of a TCP segment
      that contains multiple FPDUs; this may affect the performance of
      some receiver implementations.

   TCP implementations often utilize the "Nagle" [RFC0896] algorithm to
   ensure that segments are filled to the EMSS whenever the round trip
   latency is large enough that the source stream can fully fill
   segments before ACKs arrive. The algorithm does this by delaying the
   transmission of TCP segments until a ULP can fill a segment, or until
   an ACK arrives from the far side. The algorithm thus allows for
   smaller segments when latencies are shorter to keep the ULP's end to
   end latency to reasonable levels.

   Use of the Nagle algorithm is not mandatory [RFC1122].

   It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
   that many of the applications expected to take advantage of MPA/DDP
   prefer to avoid the extra delays caused by Nagle.
   In such scenarios it is anticipated there will be minimal opportunity
   for packing at the transmitter and receivers may choose to optimize
   their performance for this anticipated behavior.

7.3.2 FPDU Size Considerations

   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
   the size of the largest ULPDU fitting in an FPDU. For an empty TCP
   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
   space for markers and pad octets.

   The maximum ULPDU Length for a single ULPDU MUST be computed as:

      MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

   The formula above accounts for the worst-case number of markers.

   As a further optimization of the wire efficiency an MPA
   implementation MAY dynamically adjust the MULPDU (see section 7.3.1
   for latency and wire efficiency trade-offs). When one or more FPDUs
   are already packed into a TCP Segment, MULPDU MAY be reduced
   accordingly.

   DDP SHOULD provide ULPDUs that are as large as possible, but less
   than or equal to MULPDU.

   If the TCP implementation needs to adjust EMSS to support MTU
   changes, the MULPDU value is changed accordingly.

   In certain rare situations, the EMSS may shrink to very small sizes.
   If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU
   below 128 octets and is not required to follow the segmentation rules
   in Section 7.3 MPA on TCP Sender Segmentation.

   If one or more FPDUs are already packed into a TCP segment, such that
   the remaining room is less than 128 octets, MPA MUST NOT provide a
   MULPDU smaller than 128. In this case, MPA would typically provide a
   MULPDU for the next full sized segment, but may still pack the next
   FPDU into the small remaining room, provided that the next FPDU is
   small enough to fit.

   The value 128 is chosen to allow DDP designers room for the DDP
   Header and some user data.
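
   The MULPDU formula and the 128-octet floor from this section can be
   worked through with a short Python sketch. The function and constant
   names are chosen for this example only; splitting the 6 octets of
   FPDU overhead into a 2-octet ULPDU Length and a 4-octet CRC follows
   the field sizes used elsewhere in this document.

```python
MIN_MULPDU = 128       # floor stated in this section
FPDU_OVERHEAD = 6      # 2-octet ULPDU Length + 4-octet CRC
MARKER_LEN = 4         # octets per marker
MARKER_SPACING = 512   # divisor used in the formula


def mulpdu(emss):
    """Return the MULPDU for an empty TCP segment with the given EMSS."""
    markers = -(-emss // MARKER_SPACING)   # Ceiling(EMSS / 512)
    pad = emss % 4                         # "EMSS mod 4" term
    value = emss - (FPDU_OVERHEAD + MARKER_LEN * markers + pad)
    # MPA MUST NOT shrink the MULPDU below 128 octets.
    return max(value, MIN_MULPDU)


if __name__ == "__main__":
    for emss in (100, 512, 1460, 1461, 8960):
        print(emss, mulpdu(emss))
```

   For example, an EMSS of 1460 gives 1460 - (6 + 4 * 3 + 0) = 1442,
   while an EMSS of 100 would give 90 and is therefore raised to 128.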

7.4 MPA Receiver FPDU Identification

   An MPA receiver MUST first verify the FPDU before passing the ULPDU
   to DDP. To do this, the receiver MUST:

   *  locate the start of the FPDU unambiguously,

   *  verify its CRC.

   If the above conditions are true, the MPA receiver passes the ULPDU
   to DDP.

   To detect the start of the FPDU unambiguously one of the following
   MUST be used:

   1: In an ordered TCP stream, the ULPDU Length field in the current
      FPDU, when the FPDU has a valid CRC, can be used to identify the
      beginning of the next FPDU.

   2: A Marker can always be used to locate the beginning of an FPDU
      (in FPDUs with valid CRCs). Since the location of the marker is
      known in the octet stream (sequence number space), the marker can
      always be found.

   3: Having found an FPDU by means of a Marker, following contiguous
      FPDUs can be found by using the ULPDU Lengths (from FPDUs with
      valid CRCs) to establish the next FPDU boundary.

   The ULPDU Length field (see section 6) MUST be used to determine if
   the entire FPDU is present before forwarding the ULPDU to DDP.

   CRC calculation is discussed in section 7.2 above.

7.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders

   Since MPA on MPA-aware TCP senders starts FPDUs on TCP segment
   boundaries, a receiving DDP on MPA on TCP implementation may be able
   to optimize the reception of data in various ways.

   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
   segment boundaries.

   Some MPA senders may be unable to conform to the sender requirements
   because their implementation of TCP is not designed with MPA in mind.
   Even if the sender is MPA-aware, the network may contain "middle
   boxes" which modify the TCP stream by changing the segmentation.
   This is generally interoperable with TCP and its users and MPA must
   be no exception.

   The presence of markers in MPA allows an MPA receiver to recover the
   FPDUs despite these obstacles, although it may be necessary to
   utilize additional buffering at the receiver to do so.

   Some of the cases that a receiver may have to contend with are listed
   below as a reminder to the implementer:

   *  A single Aligned and complete FPDU, either in order, or out of
      order: This can be passed to DDP as soon as validated, and
      Delivered when ordering is established.

   *  Multiple FPDUs in a TCP segment, aligned and fully contained,
      either in order, or out of order: These can be passed to DDP as
      soon as validated, and Delivered when ordering is established.

   *  Incomplete FPDU: The receiver should buffer until the remainder
      of the FPDU arrives. If the remainder of the FPDU is already
      available, this can be passed to DDP as soon as validated, and
      Delivered when ordering is established.

   *  Unaligned FPDU start: The partial FPDU must be combined with its
      preceding portion(s). If the preceding parts are already
      available, and the whole FPDU is present, this can be passed to
      DDP as soon as validated, and Delivered when ordering is
      established. If the whole FPDU is not available, the receiver
      should buffer until the remainder of the FPDU arrives.

   *  Combinations of Unaligned or incomplete FPDUs (and potentially
      other complete FPDUs) in the same TCP segment: If any FPDU is
      present in its entirety, or can be completed with portions
      already available, it can be passed to DDP as soon as validated,
      and Delivered when ordering is established.

8 Connection Semantics

8.1 Connection setup

   DDP on MPA requires that DDP's consumer MUST activate DDP, MPA, and
   any TCP enhancements for MPA, on a TCP half connection at the same
   location in the octet stream at both the sender and the receiver.
   This is required in order for the marker scheme to correctly locate
   the markers.
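
   To see why both ends must agree on where MPA starts, note that marker
   positions are counted from that starting point. The Python sketch
   below assumes, as Figures 5 and 6 suggest, that a 4-octet marker sits
   at every 512th octet counted from the start of MPA operation and that
   FPDUPTR holds the distance from the start of the containing FPDU to
   the marker. The function and constant names are made up for this
   example and are not defined by MPA.

```python
MARKER_SPACING = 512   # assumed distance between marker positions
MARKER_LEN = 4         # 2 reserved octets + 2-octet FPDUPTR
LENGTH_LEN = 2         # ULPDU Length field
CRC_LEN = 4


def fpdu_extent(start, ulpdu_len):
    """Walk one FPDU that begins at stream offset `start`.

    Returns (end_offset, markers), where markers is a list of
    (marker_offset, fpduptr) pairs for markers inside this FPDU.
    """
    pad = -(LENGTH_LEN + ulpdu_len) % 4          # pad to a 4-octet multiple
    remaining = LENGTH_LEN + ulpdu_len + pad + CRC_LEN
    pos = start
    markers = []
    while remaining > 0:
        if pos % MARKER_SPACING == 0:
            # A marker sits here; it points back to the FPDU start.
            markers.append((pos, pos - start))
            pos += MARKER_LEN
        room = MARKER_SPACING - pos % MARKER_SPACING
        step = min(remaining, room)
        pos += step
        remaining -= step
    return pos, markers


if __name__ == "__main__":
    print(fpdu_extent(0x0000, 0x2A))   # FPDU of Figure 5
    print(fpdu_extent(0x01EC, 0x2A))   # FPDU of Figure 6
```

   With the second call the layout of Figure 6 is reproduced: the FPDU
   starts at octet 0x1ec, the marker at 0x200 carries FPDUPTR 0x14, and
   the FPDU ends at 0x220. If the two ends disagreed on the starting
   octet, every computed marker position would be shifted.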

   DDP, MPA, and any TCP enhancements for MPA, MAY be started separately
   in each direction, or enabled in both directions at once.

   This can be accomplished several ways, and is left up to DDP's ULP:

   *  DDP's ULP MAY require DDP on MPA startup immediately after TCP
      connection setup. This has the advantage that no additional
      negotiation is needed (at least for MPA). In this case the
      marker MUST be the first four octets sent (this marker has the
      special value 0x0000, meaning it belongs to the FPDU that
      follows).

      This may be accomplished by using a well-known port, or a service
      locator protocol to locate an appropriate port on which DDP on
      MPA is expected to operate.

   *  DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
      normal TCP startup, using TCP streaming data exchanges on the
      same connection. The exchange establishes that DDP on MPA (as
      well as other ULPs) will be used, and exactly locates the point
      in the octet stream where MPA is to begin operation. Again, the
      marker is the first four octets sent when operation begins (this
      marker has the special value 0x0000, meaning it belongs to the
      FPDU that follows). Note that such a negotiation protocol is
      outside the scope of this specification. A simplified example of
      such a protocol is shown below.

   +-------------------------+
   |ULP streaming mode       |
   |  request to             |
   |  transition to DDP/MPA  |           +--------------------------+
   |  mode                   | --------> |ULP gets request;         |
   +-------------------------+           |sets its receiver to      |
                                         |DDP/MPA mode; sends       |
                                         |streaming mode DDP/MPA    |
   +-------------------------+           |                          |
   |ULP receives DDP/MPA     | <-------- |                          |
   |;                        |           +--------------------------+
   |Sets transmitter and     |
   |receiver to DDP/MPA mode;|
   |                         |
   |The First DDP/MPA message|           +--------------------------+
   |Is then sent.            | --------> |When the DDP/MPA mode     |
   +-------------------------+           |message arrives, the ULP  |
                                         |sets its Transmit side to |
                                         |DDP/MPA mode and begins   |
                                         |full operation.           |
                                         +--------------------------+
               Figure 7: Example Startup negotiation

8.2 Normal Connection Teardown

   Each half connection of MPA terminates when DDP closes the
   corresponding TCP half connection.

   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
   that a graceful close of the LLP connection has been received by the
   LLP (e.g. FIN is received).

9 Error Semantics

   The following errors MUST be detected by MPA and the codes SHOULD be
   provided to DDP:

   Code Error

   1    TCP connection closed, terminated or lost. This includes
        lost by timeout, too many retries, RST received or FIN
        received.

   2    Received MPA CRC does not match the calculated value for the
        FPDU.

   3    In the event that the CRC is valid, received MPA marker and
        'ULPDU Length' fields do not agree on the start of an FPDU.
        If the FPDU start determined from previous ULPDU Length
        fields does not match with the MPA marker position, MPA
        SHOULD deliver an error to DDP. It may not be possible to
        make this check as a segment arrives, but the check SHOULD
        be made when a gap creating an out of order sequence is
        closed and any time a marker points to an already identified
        FPDU. It is OPTIONAL for a receiver to check each marker,
        if multiple markers are present in an FPDU, or if the
        segment is received in order.

   When conditions 2 or 3 above are detected, an MPA-aware TCP
   implementation MAY choose to silently drop the TCP segment rather
   than reporting the error to DDP. In this case, the sending TCP will
   retry the segment, usually correcting the error, unless the problem
   was at the source. In that case, the source will usually exceed the
   number of retries and terminate the connection.
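
   The error codes above, together with the rule stated in the next
   paragraph that no further FPDUs are passed once an error has been
   delivered, can be modelled in a few lines of Python. The class and
   member names are invented for this sketch; only the numeric codes 1
   to 3 come from the table.

```python
from enum import IntEnum


class MpaError(IntEnum):
    CONNECTION_LOST = 1         # closed, terminated or lost (timeout, RST, FIN)
    CRC_MISMATCH = 2            # received CRC differs from calculated CRC
    MARKER_LENGTH_MISMATCH = 3  # marker and ULPDU Length disagree on FPDU start


def may_drop_silently(code):
    """Codes for which an MPA-aware TCP MAY drop the segment instead."""
    return code in (MpaError.CRC_MISMATCH, MpaError.MARKER_LENGTH_MISMATCH)


class HalfConnection:
    """Hypothetical receive side that latches the first reported error."""

    def __init__(self):
        self.error = None
        self.delivered = []

    def report(self, code):
        # Keep only the first error; later ones change nothing.
        if self.error is None:
            self.error = MpaError(code)

    def deliver(self, ulpdu):
        # After any reported error, nothing more is passed to DDP.
        if self.error is not None:
            return False
        self.delivered.append(ulpdu)
        return True


if __name__ == "__main__":
    hc = HalfConnection()
    hc.deliver(b"first")
    hc.report(MpaError.CRC_MISMATCH)
    print(hc.deliver(b"second"), hc.delivered)
```

   Whether a CRC or marker error is reported or handled by silently
   dropping the segment is an implementation choice; the sketch only
   shows the reporting path.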

   Once MPA delivers an error of any type, it MUST NOT pass or deliver
   any additional FPDUs on that half connection.

   MPA MUST NOT close the TCP connection following a reported error.
   Closing the connection is the responsibility of DDP's ULP.

      Note that since MPA will not deliver any FPDUs on a half
      connection following an error detected on the receive side of
      that connection, DDP's ULP is expected to tear down the
      connection. This may not occur until after one or more last
      messages are transmitted on the opposite half connection. This
      allows a diagnostic error message to be sent.

10 Security Considerations

   This section discusses the security considerations for MPA.

10.1 Protocol-specific Security Considerations

   The vulnerabilities of MPA to third-party attacks are no greater than
   those of any other protocol running over TCP. A third party, by
   sending packets into the network that are delivered to an MPA
   receiver, could launch a variety of attacks that take advantage of
   how MPA operates. For example, a third party could send random
   packets that are valid for TCP, but contain no FPDU headers. An MPA
   receiver reports an error to DDP when any packet arrives that cannot
   be validated as an FPDU when properly located on an FPDU boundary.
   This would have a severe impact on performance. Communication
   security mechanisms such as IPsec [RFC2401] may be used to prevent
   such attacks. Independent of how MPA operates, a third party could
   use ICMP messages to reduce the path MTU to such a small size that
   performance would likewise be severely impacted. Range checking on
   path MTU sizes in ICMP packets may be used to prevent such attacks.

10.2 Using IPsec With MPA

   IPsec can be used to protect against the packet injection attacks
   outlined above. Because IPsec is designed to secure individual IP
   packets, MPA can run above IPsec without change.
   IPsec packets are processed (e.g., integrity checked and decrypted)
   in the order they are received, and an MPA receiver will process the
   decrypted FPDUs contained in these packets in the same manner as
   FPDUs contained in unsecured IP packets.

11 IANA Considerations

   If a well-known port is chosen as the mechanism to identify a DDP on
   MPA on TCP, the well-known port must be registered with IANA.
   Because the use of the port is DDP specific, registration of the port
   with IANA is left to DDP.

12 References

12.1 Normative References

   [iSCSI] Satran, J., "iSCSI", draft-ietf-ips-iscsi-20.txt (work in
       progress), January 2003.

   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
       November 1990.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
       Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
       3", BCP 9, RFC 2026, October 1996.

   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
       Program Protocol Specification", RFC 793, September 1981.

12.2 Informative References

   [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
       disagree", ACM Sigcomm, Sept. 2000.

   [DDP] H. Shah et al., "Direct Data Placement over Reliable
       Transports", draft-shah-iwarp-ddp-00.txt (Work in progress),
       October 2002.

   [RFC2401] Atkinson, R., Kent, S., "Security Architecture for the
       Internet Protocol", RFC 2401, November 1998.

   [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
       896, January 1984.

   [NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B.,
       "Application performance pitfalls and TCP's Nagle algorithm",
       Workshop on Internet Server Performance, May 1999.

   [RDMA] R. Recio et al., "RDMA Protocol Specification",
       draft-recio-iwarp-rdmap-00.txt, October 2002.

   [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
       RFC 2960, October 2000.

   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
       September 1981.

   [RFC1122] Braden, R.T., "Requirements for Internet hosts -
       communication layers", RFC 1122, October 1989.

   [ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations",
       draft-elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.

13 Appendix

   This appendix is for information only and is NOT part of the
   standard.

13.1 Receiver implementation

13.1.1 Transport & Network Layer Reassembly Buffers

   The use of reassembly buffers (either TCP reassembly buffers or IP
   fragmentation reassembly buffers) is implementation dependent. When
   MPA is enabled, reassembly buffers are needed if FPDU Alignment is
   lost or if IP fragmentation occurs. This is because the incoming out
   of order segment may not contain enough information for MPA to
   process all of the FPDU. For cases where a re-segmenting middle box
   is present, or where the TCP sender is not MPA-aware, the presence of
   markers significantly reduces the amount of buffering needed.

   Recovery from IP Fragmentation must be transparent to the MPA
   Consumers.

13.1.1.1 Network Layer Reassembly Buffers

   Most IP implementations set the IP Don't Fragment bit. Thus upon a
   path MTU change, intermediate devices drop the IP datagram if it is
   too large and reply with an ICMP message which tells the source TCP
   that the path MTU has changed. This causes TCP to emit segments
   conformant with the new path MTU size. Thus IP fragments under most
   conditions should never occur at the receiver. But it is possible.

   There are several options for implementation of network layer
   reassembly buffers:

   1. drop any IP fragments, and reply with an ICMP message according
      to [RFC792] (fragmentation needed and DF set) to tell the Remote
      Peer to resize its TCP segment

   2. support an IP reassembly buffer, but have it of limited size
      (possibly the same size as the local link's MTU). The end Node
      would normally never advertise a path MTU larger than the local
      link MTU. It is recommended that a dropped IP fragment cause an
      ICMP message to be generated according to [RFC792].

   3. multiple IP reassembly buffers, of effectively unlimited size.

   4. support an IP reassembly buffer for the largest IP datagram (64
      KB).

   5. support for a large IP reassembly buffer which could span
      multiple IP datagrams.

   An implementation should support at least 2 or 3 above, to avoid
   dropping packets that have traversed the entire fabric.

   There is no end-to-end ACK for IP reassembly buffers, so there is no
   flow control on the buffer. The only end-to-end ACK is a TCP ACK,
   which can only occur when a complete IP datagram is delivered to TCP.
   Because of this, under worst case, pathological scenarios, the
   largest IP reassembly buffer is the TCP receive window (to buffer
   multiple IP datagrams that have all been fragmented).

   Note that if the Remote Peer does not implement re-segmentation of
   the data stream upon receiving the ICMP reply updating the path MTU,
   it is possible to halt forward progress because the opposite peer
   would continue to retransmit using a transport segment size that is
   too large. This deadlock scenario is no different than if the fabric
   MTU (not last hop MTU) was reduced after connection setup, and the
   remote Node's behavior is not compliant with [RFC1122].

13.1.1.2 TCP Reassembly buffers

   A TCP reassembly buffer is also needed.
   TCP reassembly buffers are needed if FPDU Alignment is lost when
   using TCP with MPA or when the MPA FPDU spans multiple TCP segments.

   Since lost FPDU Alignment often means that FPDUs are incomplete, an
   MPA on TCP implementation must have a reassembly buffer large enough
   to recover an FPDU that is less than or equal to the MTU of the
   locally attached link (this should be the largest possible advertised
   TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST
   be at least 140 octets long to support the minimum FPDU size. The
   140 octets allow for the minimum MULPDU of 128, 2 octets of pad, 2
   of ULPDU_Length, 4 of CRC, and space for a possible marker. As usual,
   additional buffering may provide better performance.

   Note that if the TCP segment were not stored, it is possible to
   deadlock the MPA algorithm. If the path MTU is reduced, FPDU
   Alignment requires the source TCP to re-segment the data stream to
   the new path MTU. The source MPA will detect this condition and
   reduce the MPA segment size, but any FPDUs already posted to the
   source TCP will be re-segmented and lose FPDU Alignment. If the
   destination does not support a TCP reassembly buffer, these segments
   can never be successfully transmitted and the protocol deadlocks.

   When a complete FPDU is received, processing continues normally.

14 Author's Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA 01810 USA
   Phone: +1 978 689 1614
   Email: steph@sandburst.com

   Paul R. Culley
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-5543
   Email: paul.culley@hp.com

   Uri Elzur
   Broadcom
   16215 Alton Parkway
   CA, 92618
   Phone: 949.585.6432
   Email: uri@broadcom.com

   Renato J Recio
   IBM
   Internal Zip 9043
   11400 Burnett Road
   Austin, Texas 78759
   Phone: 512-838-3685
   Email: recio@us.ibm.com

   John Carrier
   Adaptec Inc.
   691 South Milpitas Blvd.
   Milpitas, CA 95035
   Phone: 360-378-8526
   Email: John_Carrier@adaptec.com

15 Acknowledgments

   Dwight Barron
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-2769
   Email: dwight.barron@hp.com

   Jeff Chase
   Department of Computer Science
   Duke University
   Durham, NC 27708-0129 USA
   Phone: +1 919 660 6559
   Email: chase@cs.duke.edu

   Ted Compton
   EMC Corporation
   Research Triangle Park, NC 27709, USA
   Phone: 919-248-6075
   Email: compton_ted@emc.com

   Dave Garcia
   Hewlett-Packard Company
   19333 Vallco Parkway
   Cupertino, Ca. USA 95014
   Phone: 408.285.6116
   Email: dave.garcia@hp.com

   Hari Ghadia
   Adaptec, Inc.
   691 S. Milpitas Blvd.,
   Milpitas, CA 95035 USA
   Phone: +1 (408) 957-5608
   Email: hari_ghadia@adaptec.com

   Howard C. Herbert
   Intel Corporation
   MS CH7-404
   5000 West Chandler Blvd.
   Chandler, Arizona 85226
   Phone: 480-554-3116
   Email: howard.c.herbert@intel.com

   Jeff Hilland
   Hewlett-Packard Company
   20555 SH 249
   Houston, Tx. USA 77070-2698
   Phone: 281-514-9489
   Email: jeff.hilland@hp.com

   Mike Ko
   IBM
   650 Harry Rd.
   San Jose, CA 95120
   Phone: (408) 927-2085
   Email: mako@us.ibm.com

   Mike Krause
   Hewlett-Packard Corporation, 43LN
   19410 Homestead Road
   Cupertino, CA 95014 USA
   Phone: +1 (408) 447-3191
   Email: krause@cup.hp.com

   Dave Minturn
   Intel Corporation
   MS JF1-210
   5200 North East Elam Young Parkway
   Hillsboro, Oregon 97124
   Phone: 503-712-4106
   Email: dave.b.minturn@intel.com

   Jim Pinkerton
   Microsoft, Inc.
   One Microsoft Way
   Redmond, WA, USA 98052
   Email: jpink@microsoft.com

   Hemal Shah
   Intel Corporation
   MS PTL1
   1501 South Mopac Expressway, #400
   Austin, Texas 78746
   Phone: 512-732-3963
   Email: hemal.shah@intel.com

   Allyn Romanow
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA 95134 USA
   Phone: +1 408 525 8836
   Email: allyn@cisco.com

   Tom Talpey
   Network Appliance
   375 Totten Pond Road
   Waltham, MA 02451 USA
   Phone: +1 (781) 768-5329
   Email: thomas.talpey@netapp.com

   Patricia Thaler
   Agilent Technologies, Inc.
   1101 Creekside Ridge Drive, #100
   M/S-RG10
   Roseville, CA 95678
   Phone: +1-916-788-5662
   Email: pat_thaler@agilent.com

   Jim Wendt
   Hewlett Packard Corporation
   8000 Foothills Boulevard MS 5668
   Roseville, CA 95747-5668 USA
   Phone: +1 916 785 5198
   Email: jim_wendt@hp.com

   Jim Williams
   Emulex Corporation
   580 Main Street
   Bolton, MA 01740 USA
   Phone: +1 978 779 7224
   Email: jim.williams@emulex.com

16 Full Copyright Statement

   This document and the information contained herein is provided on an
   "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
   CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
   EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
   MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
   NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
   AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
   IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
   PURPOSE.

   Copyright (c) 2002 ADAPTEC INC., BROADCOM CORPORATION, CISCO SYSTEMS
   INC., EMC CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL
   BUSINESS MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT
   CORPORATION, NETWORK APPLIANCE INC., All Rights Reserved