idnits 2.17.1 

draft-culley-iwarp-mpa-02.txt:
  ** The Abstract section seems to be numbered


  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([DDP], [ELZUR-MPA], [02]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 233: '...t recovery of out of order ULPDUs MUST...'
     RFC 2119 keyword, line 298: '...P implementation MUST inform MPA when ...'
     RFC 2119 keyword, line 304: '...  implementation SHOULD be enabled to:...'
     RFC 2119 keyword, line 308: '....  Multiple FPDUs MAY be packed into a...'
     RFC 2119 keyword, line 314: '...de implementation MUST continue to use...'
     (63 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '02' on line 105

  == Unused Reference: 'RFC2026' is defined on line 1060, but no explicit
     reference was found in the text

  == Unused Reference: 'NagleDAck' is defined on line 1081, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-01) exists of
     draft-shah-iwarp-ddp-00

  -- Obsolete informational reference (is this intentional?): RFC 2401
     (Obsoleted by RFC 4301)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  -- No information found for draft-recio-iwarp-rdmap - is the name correct?

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)


     Summary: 5 errors (**), 0 flaws (~~), 4 warnings (==), 8 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	   INTERNET-DRAFT                            P. Culley
3	   draft-culley-iwarp-mpa-02.txt               Hewlett-Packard Company
4	                                             U. Elzur
5	                                               Broadcom Corporation
6	                                             R. Recio
7	                                               IBM Corporation
8	                                             S. Bailey
9	                                               Sandburst Corporation
10	                                             J. Carrier
11	                                               Adaptec

13	                                             Expires: August 2003

15	             Marker PDU Aligned Framing for TCP Specification

17	1  Status of this Memo

19	   This document is an Internet-Draft and is subject to all provisions
20	   of Section 10 of RFC2026.

22	   Internet-Drafts are working documents of the Internet Engineering
23	   Task Force (IETF), its areas, and its working groups.  Note that
24	   other groups may also distribute working documents as Internet-
25	   Drafts.

27	   Internet-Drafts are draft documents valid for a maximum of six months
28	   and may be updated, replaced, or obsoleted by other documents at any
29	   time.  It is inappropriate to use Internet-Drafts as reference
30	   material or to cite them other than as "work in progress."

32	   The list of current Internet-Drafts can be accessed at
33	   http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft
34	   Shadow Directories can be accessed at http://www.ietf.org/shadow.html

36	2  Abstract

38	   A framing protocol is defined for TCP that is fully compliant with
39	   applicable TCP RFCs and fully interoperable with existing TCP
40	   implementations. The framing mechanism is designed to work as an
41	   "adaptation layer" between TCP and the Direct Data Placement [DDP]
42	   protocol, preserving the reliable, in-order delivery of TCP, while
43	   adding the preservation of higher-level protocol record boundaries
44	   that DDP requires.

46	   Table of Contents

48	   1     Status of this Memo..........................................1
49	   2     Abstract.....................................................1
50	   3     Introduction.................................................4
51	   3.1   Motivation...................................................4
52	   3.2   Protocol Overview............................................5
53	   4     Glossary.....................................................7
54	   5     LLP and DDP requirements.....................................8
55	   5.1   TCP implementation Requirements to support MPA...............8
56	   5.1.1 TCP Transmit side............................................8
57	   5.1.2 TCP Receive side.............................................8
58	   5.2   MPA's interactions with DDP..................................9
59	   6     FPDU Formats................................................11
60	   6.1   Marker Format...............................................12
61	   7     Data Transfer Semantics.....................................13
62	   7.1   MPA Markers.................................................13
63	   7.2   CRC Calculation.............................................14
64	   7.3   MPA on TCP Sender Segmentation..............................17
65	   7.3.1 Effects of MPA on TCP Segmentation..........................17
66	   7.3.2 FPDU Size Considerations....................................18
67	   7.4   MPA Receiver FPDU Identification............................19
68	   7.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....20
69	   8     Connection Semantics........................................22
70	   8.1   Connection setup............................................22
71	   8.2   Normal Connection Teardown..................................23
72	   9     Error Semantics.............................................24
73	   10    Security Considerations.....................................25
74	   10.1  Protocol-specific Security Considerations...................25
75	   10.2  Using IPsec With MPA........................................25
76	   11    IANA Considerations.........................................26
77	   12    References..................................................27
78	   12.1  Normative References........................................27
79	   12.2  Informative References......................................27
80	   13    Appendix....................................................29
81	   13.1  Receiver implementation.....................................29
82	   13.1.1  Transport & Network Layer Reassembly Buffers..............29
83	   14    Author's Addresses..........................................31
84	   15    Acknowledgments.............................................32
85	   16    Full Copyright Statement....................................35
86	   Table of Figures

88	   Figure 1 ULP MPA TCP Layering.......................................6
89	   Figure 2 FPDU Format...............................................11
90	   Figure 3 Marker Format.............................................12
91	   Figure 4 Example FPDU Format with Marker...........................14
92	   Figure 5 Annotated Hex Dump of an FPDU.............................16
93	   Figure 6 Annotated Hex Dump of an FPDU with Marker.................16
94	   Figure 7: Example Startup negotiation..............................23

96	   Revision history

98	   [02] Enhanced descriptions of how MPA is used over an unmodified TCP.

100	   [02] Removed "No Packing" text.

102	   [02] Made MPA an adaptation layer for DDP, instead of a generalized
103	       framing solution.

105	   [02] Added clarifications of the MPA/TCP interaction for optimized
106	       implementations and that any such optimizations are to be used
107	       only when requested by MPA.

109	       Note: a discussion of reasons for these changes can be found in
110	       [ELZUR-MPA].

112	3  Introduction

114	   This section discusses the reason for creating MPA on TCP and a
115	   general overview of the protocol.  Later sections show the MPA
116	   headers (see section 6 on page 11), and detailed protocol
117	   requirements and characteristics (see section 7 on page 13), as well
118	   as Connection Semantics (section 8 on page 22), Error Semantics
119	   (section 9 on page 24), and Security Considerations (section 10 on
120	   page 25).

122	3.1  Motivation

124	   The Direct Data Placement protocol [DDP], when used with TCP [RFC793]
125	   requires a mechanism to detect record boundaries.  The DDP records
126	   are referred to as Upper Layer Protocol Data Units by this document.
127	   The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
128	   boundary is useful to a hardware network adapter that uses DDP to
129	   directly place the data in the application buffer based on the
130	   control information carried in the ULPDU header.  This may be done
131	   without requiring that the packets arrive in order.  Potential
132	   benefits of this capability are the avoidance of the memory copy
133	   overhead and a smaller memory requirement for handling out of order
134	   or dropped packets.

136	   Many approaches have been proposed for a generalized framing
137	   mechanism.  Some are probabilistic in nature and others are
138	   deterministic.  A probabilistic approach is characterized by a
139	   detectable value embedded in the octet stream.  It is probabilistic
140	   because under some conditions the receiver may incorrectly interpret
141	   application data as the detectable value.  Under these conditions,
142	   the protocol may fail with unacceptable frequency.  A deterministic
143	   approach is characterized by embedded controls at known locations in
144	   the octet stream.  Because the receiver can guarantee it will only
145	   examine the data stream at locations that are known to contain the
146	   embedded control, the protocol can never misinterpret application
147	   data as being embedded control data.  For unambiguous handling of an
148	   out of order packet, the deterministic approach is preferred.

150	   The MPA protocol provides a framing mechanism for DDP running over
151	   TCP using the deterministic approach.  It allows the location of the
152	   ULPDU to be determined in the TCP stream even if the TCP segments
153	   arrive out of order.

155	3.2  Protocol Overview

157	   MPA is described as an extra layer above TCP and below DDP.  The end-
158	   to-end data flow is:

160	   1.  The DDP's ULP negotiates the use of DDP and MPA at both ends of a
161	       connection.

163	   2.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
164	       for this value.  MPA derives this information from TCP, when it
165	       is available, or chooses a reasonable value.  This information is
166	       already supported on many TCP implementations, including all
167	       modern flavors of BSD networking, through the TCP_MAXSEG socket
168	       option.

170	   3.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
171	       MPA at the sender.

173	   4.  MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
174	       header, inserting markers, and appending a CRC after the ULPDU
175	       and PAD (if any).  MPA delivers the FPDU to TCP.

177	   5.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
178	       Sender is MPA-aware, it segments the TCP stream in such a way
179	       that a TCP Segment boundary is also the boundary of an FPDU.  TCP
180	       then passes each segment to the IP layer for transmission.

182	   6.  The TCP receiver may be MPA-aware or may not be MPA-aware. If it
183	       is MPA-aware, it may separate passing the TCP payload to MPA from
184	       passing the TCP payload ordering information to MPA. In either
185	       case, RFC compliant TCP wire behavior is observed at both the
186	       sender and receiver.

188	   7.  The MPA receiver locates and assembles complete FPDUs within the
189	       stream, verifies their integrity, and removes MPA markers,
190	       ULPDU_Length, PAD and CRC.

192	   8.  MPA then provides the complete ULPDUs to DDP.  MPA may also
193	       separate passing MPA payload to DDP from passing the MPA payload
194	       ordering information.

196	   The layering of PDUs with MPA is shown in Figure 1, below.

198	   MPA-aware TCP is a TCP layer which potentially contains some
199	   additional semantics as defined in this document.  MPA is implemented
200	   as a data stream ULP for TCP and is therefore RFC compliant.  MPA-
201	   aware TCP is RFC compliant.

203	               +------------------+
204	               |     ULP client   |
205	               +------------------+  <- Consumer messages
206	               |        DDP       |
207	               +------------------+  <- ULPDUs
208	               |        MPA       |
209	               +------------------+  <- FPDUs (containing ULPDUs)
210	               |        TCP*      |
211	               +------------------+  <- TCP Segments (containing FPDUs)
212	               |      IP etc.     |
213	               +------------------+
214	                                      * TCP or MPA-aware TCP.

216	                       Figure 1 ULP MPA TCP Layering

218	   An MPA-aware TCP sender is able to segment the data stream such that
219	   TCP segments begin with FPDUs (FPDU Alignment).  This has significant
220	   advantages for receivers.  When segments arrive with aligned FPDUs
221	   the receiver usually need not buffer any portion of the segment,
222	   allowing DDP to place it in its destination memory immediately, thus
223	   avoiding copies from intermediate buffers (DDP's reason for
224	   existence).

226	   MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation
227	   to recover ULPDUs that may be received out of order.  This enables a
228	   DDP on MPA implementation to save a significant amount of
229	   intermediate storage by placing the ULPDUs in the right locations in
230	   the application buffers when they arrive, rather than waiting until
231	   full ordering can be restored.

233	   MPA implementations that support recovery of out of order ULPDUs MUST
234	   support a mechanism to indicate the ordering of ULPDUs as the sender
235	   transmitted them and indicate when missing intermediate segments
236	   arrive.  These mechanisms allow DDP to reestablish record ordering
237	   and report Delivery of complete messages (groups of records).

239	   MPA also addresses enhanced data integrity.  Many users of TCP have
240	   noted that the TCP checksum is not as strong as could be desired
241	   [CRCTCP].  Studies have shown that the TCP checksum indicates
242	   segments in error at a much higher rate than the underlying link
243	   characteristics would indicate.  With these higher error rates, the
244	   chance that an error will escape detection, when using only the TCP
245	   checksum for data integrity, becomes a concern.  A stronger integrity
246	   check can reduce the chance of data errors being missed.

248	   MPA includes a CRC check to increase the ULPDU data integrity to the
249	   level provided by other modern protocols, such as SCTP [RFC2960].

251	4  Glossary

253	   Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
254	       the process of informing DDP that a particular PDU is ordered for
255	       use.  This is specifically different from "passing the PDU to
256	       DDP", which may generally occur in any order, while the order of
257	       "Delivery" is strictly defined.

259	   EMSS - Effective Maximum Segment Size.  EMSS is the smaller of the
260	       TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
261	       and the current path Maximum Transmission Unit (MTU) [RFC1191].

263	   FPDU - Framed Protocol Data Unit.  The unit of data created by an
264	       MPA sender.

266	   FPDU Alignment - the property that a TCP segment begins with an FPDU.

268	   PDU - Protocol Data Unit.

270	   MPA - Marker-based ULP PDU Aligned Framing for TCP protocol.  This
271	       document defines the MPA protocol.

273	   MULPDU - Maximum ULPDU. The current maximum size of the record that
274	       is acceptable for DDP to pass to MPA for transmission.

276	   Node - A computing device attached to one or more links of a Network.
277	       A Node in this context does not refer to a specific application
278	       or protocol instantiation running on the computer. A Node may
279	       consist of one or more MPA on TCP devices installed in a host
280	       computer.

282	   Remote Peer - The MPA protocol implementation on the opposite end of
283	       the connection. Used to refer to the remote entity when
284	       describing protocol exchanges or other interactions between two
285	       Nodes.

287	   ULP - Upper Layer Protocol. The protocol layer above the protocol
288	       layer currently being referenced. The ULP for MPA is DDP [DDP].

290	   ULPDU - Upper Layer Protocol Data Unit.  The data record defined by
291	      the layer above MPA (DDP).  ULPDU corresponds to DDP's "DDP
292	      Segment".

294	5  LLP and DDP requirements

296	5.1  TCP implementation Requirements to support MPA

298	   The TCP implementation MUST inform MPA when the TCP connection is
299	   closed or has begun closing the connection (e.g. received a FIN).

301	5.1.1  TCP Transmit side

303	   To provide optimum performance, an MPA-aware transmit side TCP
304	   implementation SHOULD be enabled to:

306	   *   With an EMSS large enough to contain the FPDU(s), segment the
307	       outgoing TCP stream such that every TCP Segment begins with the
308	       first octet of an FPDU.  Multiple FPDUs MAY be packed into a
309	       single TCP segment as long as they are entirely contained in the
310	       TCP segment.

312	   *   Report the current EMSS to the MPA transmit layer.

314	   An MPA-aware TCP transmit side implementation MUST continue to use
315	   the method of segmentation expected by non-MPA applications (and
316	   described in TCP RFCs) when MPA is not enabled on the connection.
317	   When MPA is enabled above an MPA-aware TCP, it SHOULD specifically
318	   enable the segmentation rules described above for the DDP segments
319	   (FPDUs) posted for transmission.
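
As an illustration of the segmentation rule above, the following Python
sketch greedily packs whole FPDUs into TCP segments so that every segment
starts on an FPDU boundary and no FPDU is split.  The function name
pack_fpdus and the greedy policy are assumptions made for this example;
this document does not define them.  An FPDU longer than the EMSS is
rejected here, because alignment cannot be preserved for it.

```python
def pack_fpdus(fpdu_lengths, emss):
    """Group FPDU lengths into TCP segments without splitting any FPDU.

    Hypothetical helper: returns a list of segments, each represented as
    the list of FPDU lengths carried by that segment.
    """
    segments = []
    current, used = [], 0
    for length in fpdu_lengths:
        if length > emss:
            # Alignment cannot be kept for an FPDU larger than one segment.
            raise ValueError("FPDU exceeds EMSS; alignment cannot be kept")
        if used + length > emss:
            # The next FPDU does not fit; close the segment on an FPDU boundary.
            segments.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        segments.append(current)
    return segments
```

With an EMSS of 1460 octets, FPDUs of 100, 200, 1200, and 300 octets would
be carried in three segments: the first two FPDUs together, then one FPDU
in each of the remaining segments.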

321	   If the transmit side TCP implementation is not able to segment the
322	   TCP stream as indicated above, MPA should make a best effort to
323	   achieve that result.  For example, using the TCP_NODELAY socket
324	   option to disable the Nagle algorithm will usually result in many of
325	   the segments starting with an FPDU.

327	   If the transmit side TCP implementation is not able to report the
328	   EMSS, MPA may assume that TCP will use 1460 octet segments in
329	   creating FPDUs.  If the implementation has reason to believe that the
330	   TCP segment size is actually smaller than 1460, it may instead use a
331	   536 octet FPDU.
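
The two fallbacks above might look as follows for an MPA layer built on an
ordinary sockets API.  This is a minimal Python sketch, not a required
implementation: the name prepare_mpa_socket is hypothetical, and the value
returned by TCP_MAXSEG is only an approximation of the EMSS, since it does
not necessarily reflect the current path MTU.

```python
import socket

DEFAULT_EMSS = 1460  # fallback named in the text when TCP cannot report EMSS


def prepare_mpa_socket(sock):
    """Disable Nagle and return a best-effort segment size for FPDU sizing."""
    # Best effort at FPDU Alignment on a TCP that is not MPA-aware.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        mss = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
    except (AttributeError, OSError):
        # TCP_MAXSEG is missing on this platform or cannot be read.
        return DEFAULT_EMSS
    return mss if mss > 0 else DEFAULT_EMSS
```

The returned value would then be used by MPA as its working estimate when
sizing FPDUs.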

333	5.1.2  TCP Receive side

335	   When an MPA receive implementation and the MPA-aware receive side TCP
336	   implementation support handling out of order ULPDUs, the TCP receive
337	   implementation SHOULD be enabled to:

339	   *   Pass incoming TCP segments to MPA as soon as they have been
340	       received and validated, even if not received in order.  The TCP
341	       layer MUST have committed to keeping each segment before it can
342	       be passed to the MPA.  This means that the segment must have
343	       passed the TCP, IP, and lower layer data integrity validation
344	       (i.e., checksum), must be in the receive window, must not be a
345	       duplicate, must be part of the same epoch (if timestamps are used
346	       to verify this) and any other checks required by TCP RFCs.  The
347	       segment MUST NOT be passed to MPA more than once unless
348	       explicitly requested (see Section 9).

350	       This is not to imply that the data must be completely ordered
351	       before use.  An implementation may accept out of order segments,
352	       SACK them [RFC2018], and pass them to DDP once the segments
353	       needed to fill in the gaps have arrived.  Such an
354	       implementation can "commit" to the data early on, and will not
355	       overwrite it even if (or when) duplicate data arrives.  MPA
356	       expects to utilize this "commit" to allow the passing of ULPDUs
357	       to DDP when they arrive, independent of ordering.

359	   *   Provide a mechanism to indicate the ordering of TCP segments as
360	       the sender transmitted them.  One possible mechanism might be
361	       attaching the TCP sequence number to each segment.

363	   *   Provide a mechanism to indicate when a given TCP segment (and the
364	       prior TCP stream) is complete.  One possible mechanism might be
365	       to utilize the leading (left) edge of the TCP Receive Window.

367	       DDP on MPA MUST utilize these two mechanisms to establish the
368	       Delivery semantics that DDP's consumers agree to.  These
369	       semantics are described fully in [DDP]. These include
370	       requirements on DDP's consumer to respect ownership of buffers
371	       prior to the time that DDP delivers them to the consumer.

373	   An MPA-aware TCP receive side implementation MUST continue to buffer
374	   TCP segments until completely ordered and then deliver them as
375	   expected by non-MPA applications (and described in TCP RFCs) when MPA
376	   is not enabled on the connection.  When MPA is enabled above an MPA-
377	   aware TCP, TCP SHOULD enable the in and out of order passing of data,
378	   and the separate ordering information as described above.

380	   When an MPA receive implementation is coupled with a TCP receive
381	   implementation that does not support the preceding mechanisms, TCP
382	   passes and Delivers incoming stream data to MPA in order.
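
The difference between passing a segment and Delivering it, as described
in this section, can be sketched in Python.  The class below is a
hypothetical illustration: it uses the TCP sequence number as the ordering
mechanism and an advancing left edge as the completion mechanism, and it
assumes that TCP has already discarded duplicate and overlapping segments.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap modulo 2^32


class DeliveryTracker:
    """Track which passed segments have become Deliverable in order."""

    def __init__(self, initial_seq):
        self.left_edge = initial_seq  # next in-order octet expected
        self.pending = {}             # start seq -> length, passed but not Delivered

    def on_segment(self, seq, length):
        """Record a validated segment; return the (seq, length) pairs that
        are now complete, in the order the sender transmitted them."""
        self.pending[seq] = length
        delivered = []
        # Advance the left edge across every contiguous pending segment.
        while self.left_edge in self.pending:
            n = self.pending.pop(self.left_edge)
            delivered.append((self.left_edge, n))
            self.left_edge = (self.left_edge + n) % SEQ_SPACE
        return delivered
```

A segment that arrives ahead of a gap is passed at once but yields no
Delivery; once the missing segment arrives, both are reported in
transmission order.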

384	5.2  MPA's interactions with DDP

386	   DDP requires MPA to maintain DDP record boundaries from the sender to
387	   the receiver.  When using MPA on TCP to send data, DDP provides
388	   records (ULPDUs) to MPA.  MPA will use the reliable transmission
389	   abilities of TCP to transmit the data, and will insert appropriate
390	   additional information into the TCP stream to allow the MPA receiver
391	   to locate the record boundary information.

393	   As such, MPA accepts complete records (ULPDUs) from DDP at the sender
394	   and returns them to DDP at the receiver.

396	   MPA combined with an MPA-aware TCP can ensure FPDU Alignment with
397	   the TCP Header only if the FPDU length does not exceed TCP's EMSS.

399	   Since FPDU alignment is generally desired by the receiver, DDP must
400	   cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS
401	   under normal conditions.  This is done with the MULPDU mechanism.

403	   MPA provides information to DDP on the current maximum size of the
404	   record that is acceptable to send (MULPDU).  DDP SHOULD limit each
405	   record size to MULPDU.  The range of MULPDU values MUST be between
406	   128 octets and 64768 octets, inclusive.

408	   The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
409	   MPA. DDP MAY post a ULPDU of any size between one and 64768 octets,
410	   however MPA is NOT REQUIRED to support a ULPDU length that is greater
411	   than the current MULPDU.

413	   While the maximum theoretical length supported by the MPA header
414	   ULPDU_Length field is 65535, TCP over IP requires the IP datagram
415	   maximum length to be 65535 octets. To enable MPA to support FPDU
416	   Alignment, the maximum size of the FPDU must fit within an IP
417	   datagram. Thus the ULPDU limit of 64768 octets was derived by taking
418	   the maximum IP datagram length, subtracting from it the maximum total
419	   length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
420	   options, and the worst case MPA overhead, and then rounding the
421	   result down to a 128 octet boundary.
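
The derivation above does not itemize the overhead.  The Python sketch
below shows one accounting that reproduces the 64768 octet limit.  The
60 octet maximum IPv4 and TCP header sizes follow from their 4-bit header
length fields; counting one marker per 512 octets of TCP payload, plus the
maximum 3 octet pad, is an assumption made for this example.

```python
MAX_IP_DATAGRAM = 65535
MAX_IPV4_HEADER = 60    # 20 octet header plus 40 octets of options
MAX_TCP_HEADER = 60     # 20 octet header plus 40 octets of options
MARKER_INTERVAL = 512
MARKER_SIZE = 4
ULPDU_LENGTH_SIZE = 2
CRC_SIZE = 4
MAX_PAD = 3


def max_ulpdu_limit():
    """Return the largest ULPDU size, rounded down to a 128 octet boundary."""
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    # Assumed worst case: a marker in every 512 octet interval of the payload.
    markers = -(-tcp_payload // MARKER_INTERVAL)  # ceiling division
    mpa_overhead = (ULPDU_LENGTH_SIZE + CRC_SIZE + MAX_PAD
                    + markers * MARKER_SIZE)
    raw = tcp_payload - mpa_overhead
    return raw - raw % 128
```

Under these assumptions the TCP payload is 65415 octets, the MPA overhead
is 521 octets, and the remaining 64894 octets round down to 64768.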

423	   On receive, MPA MUST pass each ULPDU with its length to DDP when it
424	   has been validated.

426	   If an MPA implementation supports passing out of order ULPDUs to DDP,
427	   the MPA implementation SHOULD:

429	   *   Pass each ULPDU with its length to DDP as soon as it has been
430	       fully received and validated.

432	   *   Provide a mechanism to indicate the ordering of ULPDUs as the
433	       sender transmitted them.  One possible mechanism might be
434	       providing the TCP sequence number for each ULPDU.

436	   *   Provide a mechanism to indicate when a given ULPDU (and prior
437	       ULPDUs) is complete.  One possible mechanism might be to allow
438	       DDP to see the current outgoing TCP Ack sequence number.

440	   *   Provide an indication to DDP that the TCP has closed or has begun
441	       to close the connection (e.g. received a FIN).

443	6  FPDU Formats

445	   MPA senders create FPDUs out of ULPDUs.  The format of an FPDU shown
446	   below MUST be used for all MPA FPDUs.  For purposes of clarity,
447	   markers are not shown in Figure 2.

449	       0                   1                   2                   3
450	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
451	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
452	      |          ULPDU_Length         |                               |
453	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
454	      |                                                               |
455	      ~                                                               ~
456	      ~                            ULPDU                              ~
457	      |                                                               |
458	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
459	      |                               |          PAD (0-3 octets)     |
460	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
461	      |                             CRC                               |
462	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
463	                           Figure 2 FPDU Format

465	   ULPDU_Length: 16 bits (unsigned integer).  This is the number of
466	   octets of the contained ULPDU.  It does not include the length of the
467	   FPDU header itself, the pad, the CRC, or of any markers that fall
468	   within the ULPDU. The 16-bit ULPDU_Length field is large enough to
469	   support the largest IP datagrams for IPv4 or IPv6.

471	   PAD: The PAD field trails the ULPDU and contains between zero and
472	   three octets of data.  The pad data MUST be set to zero by the sender
473	   and ignored by the receiver (except for CRC checking).  The length of
474	   the pad is set so as to make the size of the FPDU an integral
475	   multiple of four.

477	   CRC: 32 bits.  This CRC is used to verify the entire contents of the
478	   FPDU, using CRC32C.  See section 7.2, CRC Calculation, on page 14.

480	   The FPDU adds a minimum of 6 octets to the length of the ULPDU.  In
481	   addition, the total length of the FPDU will include the length of any
482	   markers and from 0 to 3 pad octets added to round up the FPDU size.
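
The FPDU layout described in this section can be sketched in Python as
follows, with markers omitted.  The name build_fpdu and the crc_fn
parameter are hypothetical: crc_fn stands in for a CRC32C routine that
returns the four CRC octets in wire order, and encoding ULPDU_Length in
network byte order is an assumption of this example.

```python
import struct


def build_fpdu(ulpdu, crc_fn):
    """Frame one ULPDU as ULPDU_Length + ULPDU + PAD + CRC (no markers)."""
    if not 1 <= len(ulpdu) <= 64768:
        raise ValueError("ULPDU length out of range")
    # 16-bit unsigned length of the ULPDU only (assumed big-endian).
    header = struct.pack("!H", len(ulpdu))
    # Zero pad so that the FPDU is an integral multiple of four octets.
    pad = b"\x00" * (-(len(header) + len(ulpdu)) % 4)
    body = header + ulpdu + pad
    # The CRC covers everything that precedes it in the FPDU.
    return body + crc_fn(body)
```

A 10 octet ULPDU needs no pad and yields a 16 octet FPDU; a 16 octet ULPDU
needs two pad octets and yields a 24 octet FPDU.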

484	6.1  Marker Format

486	   The format of a marker MUST be as specified in Figure 3:

488	       0                   1                   2                   3
489	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
490	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
491	      |           RESERVED            |            FPDUPTR            |
492	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
493	                          Figure 3 Marker Format

495	   RESERVED: The Reserved field MUST be set to zero on transmit and
496	   ignored on receive (except for CRC calculation).

498	   FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long,
499	   interpreted as an unsigned integer, that indicates the number of
500	   octets in the TCP stream from the beginning of the FPDU to the first
501	   octet of the entire marker.
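
As a small illustration of the marker layout, the Python sketch below
encodes and decodes the two 16-bit fields.  The helper names are
hypothetical, and network byte order for both fields is an assumption of
this example.

```python
import struct

MARKER_SIZE = 4  # two 16-bit fields: RESERVED and FPDUPTR


def pack_marker(fpduptr):
    """Encode a marker with RESERVED set to zero, as required on transmit."""
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR must fit in 16 bits")
    return struct.pack("!HH", 0, fpduptr)


def parse_marker(data):
    """Return FPDUPTR from a marker; RESERVED is ignored on receive."""
    _reserved, fpduptr = struct.unpack("!HH", data[:MARKER_SIZE])
    return fpduptr
```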

503	7  Data Transfer Semantics

505	   This section discusses some characteristics and behavior of the MPA
506	   protocol as well as implications of that protocol.

508	7.1  MPA Markers

510	   MPA senders MUST insert a marker into the data stream at a 512 octet
511	   periodic interval in the TCP Sequence Number Space. The marker
512	   contains a 16 bit unsigned integer referred to as the FPDUPTR (FPDU
513	   Pointer).

515	   If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
516	   relative back-pointer. FPDUPTR MUST contain the number of octets in
517	   the TCP stream from the beginning of the current FPDU to the first
518	   octet of the marker, unless the marker falls between FPDUs. Thus the
519	   location of the first octet of the previous FPDU header can be
520	   determined by subtracting the value of the given marker from the
521	   current octet-stream sequence number (i.e. TCP sequence number) of
522	   the first octet of the marker. Note that this computation must take
523	   into account that the TCP sequence number could have wrapped between
524	   the marker and the header.

526	   An FPDUPTR value of 0x0000 is a special case - it is used when the
527	   marker falls exactly between FPDUs.  In this case, the marker MUST be
528	   placed in the following FPDU and viewed as being part of that FPDU
529	   (e.g. for CRC calculation). Thus an FPDUPTR value of 0x0000 means
530	   that immediately following the marker is an FPDU header.

532	   Since all FPDUs are integral multiples of 4 octets, the bottom two
533	   bits of the FPDUPTR as calculated by the sender are zero.  MPA
534	   reserves these bits so they MUST be treated as zero for computation
535	   at the receiver.
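
   As an illustration of the back-pointer arithmetic described above,
   the following minimal sketch recovers the start of an FPDU from a
   marker.  The function name and the constant are hypothetical and
   are not defined by MPA.  The sketch assumes the receiver already
   knows the TCP sequence number of the first octet of the marker.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers are 32 bits wide and wrap


def fpdu_start_from_marker(marker_seq: int, fpduptr: int) -> int:
    """Return the sequence number of the first octet of the FPDU.

    marker_seq is the sequence number of the first octet of the marker;
    fpduptr is the 16-bit FPDUPTR value read from that marker.
    """
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR is a 16-bit unsigned value")
    # The two low-order bits are reserved and are treated as zero.
    offset = fpduptr & 0xFFFC
    # Subtract modulo 2**32 so that a sequence number wrap between the
    # FPDU header and the marker is handled.
    return (marker_seq - offset) % SEQ_MOD


# Marker at 0x200 with FPDUPTR 0x0014: the FPDU starts at 0x1ec.
print(hex(fpdu_start_from_marker(0x200, 0x0014)))
# A wrap between header and marker: the FPDU starts at 0xfffffff8.
print(hex(fpdu_start_from_marker(0x00000004, 0x000C)))
```

   An FPDUPTR of 0x0000 yields the marker's own position, which
   matches the rule that such a marker is the first octet of the
   following FPDU.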

537	   The MPA markers MUST be inserted immediately following MPA connection
538	   establishment, and at every 512th octet of the TCP octet stream
539	   thereafter.  As a result, the first marker has an FPDUPTR value of
540	   0x0000.  If the first marker begins at octet sequence number
541	   SeqStart, then markers are inserted such that the first octet of the
542	   marker is at octet sequence number SeqNum if the remainder of (SeqNum
543	   - SeqStart) mod 512 is zero.  Note that SeqNum can wrap.

545	   For example, if the TCP sequence number were used to calculate the
546	   insertion point of the marker, the starting TCP sequence number is
547	   unlikely to be zero, and 512 octet multiples are unlikely to fall on
548	   a modulo 512 of zero. If the MPA connection is started at TCP
549	   sequence number 11, then the 1st marker will begin at 11, and
550	   subsequent markers will begin at 523, 1035, etc.
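
   The placement rule can be sketched as follows.  This is an
   illustration only; the helper names are hypothetical.  Because
   2**32 is a multiple of 512, the marker positions stay consistent
   when the sequence number wraps.

```python
from itertools import islice

SEQ_MOD = 1 << 32       # size of the TCP sequence number space
MARKER_INTERVAL = 512   # octets between the starts of successive markers


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a marker begins at seq_num, given the first marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_MOD) % MARKER_INTERVAL == 0


def marker_positions(seq_start: int):
    """Yield the sequence number of each marker, starting with the first."""
    seq = seq_start % SEQ_MOD
    while True:
        yield seq
        seq = (seq + MARKER_INTERVAL) % SEQ_MOD


# The example from the text: MPA started at TCP sequence number 11.
print(list(islice(marker_positions(11), 3)))  # [11, 523, 1035]
```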

552	   If an FPDU is large enough to contain multiple markers, they MUST all
553	   point to the same point in the TCP stream: the first octet of the
554	   FPDU.

556	   If a marker interval contains multiple FPDUs (the FPDUs are small),
557	   the marker MUST point to the start of the FPDU containing the marker
558	   unless the marker falls between FPDUs, in which case the marker MUST
559	   be zero.

561	   The following example shows an FPDU containing a marker.

563	       0                   1                   2                   3
564	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
565	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
566	      |       ULPDU Length (0x0010)   |                               |
567	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
568	      |                                                               |
569	      +                                                               +
570	      |                         ULPDU (octets 0-9)                    |
571	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
572	      |            (0x0000)           |        FPDU ptr (0x000C)      |
573	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
574	      |                        ULPDU (octets 10-15)                   |
575	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
576	      |                               |          PAD (2 octets:0,0)   |
577	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
578	      |                              CRC                              |
579	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
580	                 Figure 4 Example FPDU Format with Marker
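
   The layout in Figure 4 can be reproduced with a short sender-side
   sketch.  It builds the part of an FPDU that precedes the CRC field
   (ULPDU Length, ULPDU, PAD and any markers).  The function name and
   the stream_offset parameter (the offset of the FPDU's first octet
   from the first marker) are assumptions made for the illustration,
   and the CRC itself is left out.

```python
import struct

MARKER_INTERVAL = 512


def _marker(fpduptr: int) -> bytes:
    """Encode a marker: 16 reserved bits (zero) followed by the FPDUPTR."""
    if fpduptr > 0xFFFF:
        raise ValueError("FPDU too long for a 16-bit FPDUPTR")
    return struct.pack("!HH", 0, fpduptr)


def frame_with_markers(ulpdu: bytes, stream_offset: int) -> bytes:
    """Return ULPDU Length + ULPDU + PAD with markers inserted.

    stream_offset counts octets from the first marker of the connection
    to the first octet of this FPDU; it is always a multiple of 4.
    """
    if stream_offset % 4:
        raise ValueError("FPDUs start on 4-octet boundaries")
    if len(ulpdu) > 0xFFFF:
        raise ValueError("ULPDU Length is a 16-bit field")
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)        # PAD to a multiple of 4
    out = bytearray()
    pos = stream_offset
    for i in range(0, len(body), 4):
        if pos % MARKER_INTERVAL == 0:        # a marker is due here
            out += _marker(pos - stream_offset)
            pos += 4
        out += body[i:i + 4]
        pos += 4
    if pos % MARKER_INTERVAL == 0:            # marker right after the PAD
        out += _marker(pos - stream_offset)
    return bytes(out)


# Figure 4: a 16-octet ULPDU whose FPDU starts 12 octets before a marker.
framed = frame_with_markers(bytes(range(16)), 500)
print(framed.hex(" "))
```

   In this run the marker occupies octets 12 through 15 of the result
   and carries an FPDUPTR of 0x000C, as in the figure.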

582	   MPA Receivers MUST preserve ULPDU boundaries when passing data to
583	   DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
584	   DDP and not the markers, headers, and CRC.

586	7.2  CRC Calculation

588	   When sending an FPDU, the sender MUST include a valid CRC field.  The
589	   CRC field in the MPA FPDU MUST be computed using the CRC32C
590	   polynomial in the manner described in the iSCSI Protocol [iSCSI]
591	   document for Header and Data Digests.

593	   The fields which MUST be included in the CRC calculation when sending
594	   an FPDU are as follows:

596	   1)  If the first octet of the FPDU is the "ULPDU Length" field, the
597	       CRC-32c is calculated from the first octet of the "ULPDU Length"
598	       header, through all the ULPDU and markers (if present), to the
599	       last octet of the PAD (if present), inclusive. If there is a
600	       marker immediately following the PAD, the marker is included in
601	       the CRC calculation for this FPDU.

603	   2)  If the first octet of the FPDU is a marker, (i.e. the marker fell
604	       between FPDUs, and thus is required to be included in the second
605	       FPDU), the CRC-32c is calculated from the first octet of the
606	       marker, through the "ULPDU Length" header, through all the ULPDU
607	       and markers (if present), to the last octet of the PAD (if
608	       present), inclusive.

610	   3)  After calculating the CRC-32c, the resultant value is placed into
611	       the CRC field at the end of the FPDU.

613	   When an FPDU is received, the receiver MUST first perform the
614	   following:

616	   1)  Calculate the CRC of the incoming FPDU in the same fashion as
617	       defined above.

619	   2)  Verify that the calculated CRC-32c value is the same as the
620	       received CRC-32c value found in the FPDU CRC field.  If not, the
621	       receiver MUST treat the FPDU as an invalid FPDU.

623	   The procedure for handling invalid FPDUs is covered in the Error
624	   Section (see section 9 on page 24).
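
   The following sketch shows one way to compute and check the CRC.
   It is a plain bitwise CRC-32C, not an optimized implementation, and
   the helper names are hypothetical.  The order in which the four CRC
   octets are placed in the FPDU (least significant octet first) is an
   assumption based on the published iSCSI digest examples and should
   be confirmed against [iSCSI].  The input is the span of octets
   listed above, with any markers already in place.

```python
import struct

CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected form of 0x1EDC6F41


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C: initial value and final XOR are both 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def append_crc(covered: bytes) -> bytes:
    """Append the CRC field to the octets covered by the CRC.

    ASSUMPTION: least significant octet first (see the lead-in).
    """
    return covered + struct.pack("<I", crc32c(covered))


def crc_is_valid(fpdu: bytes) -> bool:
    """Recompute the CRC over everything before the 4-octet CRC field."""
    if len(fpdu) < 4:
        return False
    covered, received = fpdu[:-4], fpdu[-4:]
    return struct.pack("<I", crc32c(covered)) == received


fpdu = append_crc(bytes(8))
print(crc_is_valid(fpdu))                      # True
print(crc_is_valid(b"\x01" + fpdu[1:]))        # False: first octet altered
```

   A receiver that gets False from such a check treats the FPDU as
   invalid, as required above.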

626	   The following is an annotated hex dump of an example FPDU sent as the
627	   first FPDU on the stream.  As such, it starts with a marker. The
628	   ULPDU is 42 octets long and ends with 24 octets of Send Data, which
629	   are all zeros. The CRC32c has been correctly calculated and can be
630	   used as a reference. See the [DDP] and [RDMA] specifications for
631	   definitions of the DDP Control field, Queue, MSN, MO, and Send Data.

633	       Octet Contents  Annotation
634	       Count

636	       0000    00 00   Marker: Reserved
637	       0002    00 00           FPDUPTR
638	       0004    00 2a   Length
639	       0006    40 03   DDP Control Field, Send with Last flag set
640	       0008    00 00   Reserved (STag position with no STag)
641	       000a    00 00
642	       000c    00 00   Queue = 0
643	       000e    00 00
644	       0010    00 00   MSN = 1
645	       0012    00 01
646	       0014    00 00   MO = 0
647	       0016    00 00
648	       0018    00 00
649	                       Send Data (24 octets of zeros)
650	       002e    00 00
651	       0030    4C 86   CRC32c
652	       0032    B3 84
653	                  Figure 5 Annotated Hex Dump of an FPDU

655	   The following is an example sent as the second FPDU of the stream
656	   where the first FPDU (which is not shown here) had a length of 492
657	   octets and was also a Send to Queue 0 with Last Flag set.  This
658	   example contains a marker.

660	       Octet Contents  Annotation
661	       Count

663	       01ec    00 2a   Length
664	       01ee    40 03   DDP Control Field: Send with Last Flag set
665	       01f0    00 00   Reserved (STag position with no STag)
666	       01f2    00 00
667	       01f4    00 00   Queue = 0
668	       01f6    00 00
669	       01f8    00 00   MSN = 2
670	       01fa    00 02
671	       01fc    00 00   MO = 0
672	       01fe    00 00
673	       0200    00 00   Marker: Reserved
674	       0202    00 14           FPDUPTR
675	       0204    00 00
676	                       Send Data (24 octets of zeros)
677	       021a    00 00
678	       021c    A1 9C   CRC32c
679	       021e    D1 03
680	            Figure 6 Annotated Hex Dump of an FPDU with Marker

682	7.3  MPA on TCP Sender Segmentation

684	   The various TCP RFCs allow considerable choice in segmenting a TCP
685	   stream.  In order to optimize FPDU recovery at the MPA receiver, MPA
686	   specifies additional segmentation rules.

688	   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
689	   contained in one FPDU.

691	   An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
692	   implementations that support this, and with an EMSS large enough to
693	   contain at least one FPDU, segment the outbound TCP stream such that
694	   each TCP segment begins with an FPDU, and fully contains all included
695	   FPDUs.

697	        Implementation note: To achieve the previous segmentation rule,
698	        TCP's Nagle [RFC0896] algorithm SHOULD be disabled.

700	   There are exceptions to the above rule.  Once a ULPDU is provided to
701	   MPA, the MPA on TCP sender MUST transmit it or fail the connection;
702	   it cannot be repudiated.  As a result, during changes in MTU and
703	   EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it
704	   may be necessary to send FPDUs that do not conform to the
705	   segmentation rule above.

707	   A possible, but less desirable, alternative is to use IP
708	   fragmentation on accepted FPDUs to deal with MTU reductions or
709	   extremely small EMSS.

711	   The sender MUST still format the FPDU according to FPDU format as
712	   shown in Figure 2.

714	   On a retransmission, TCP does not necessarily preserve original TCP
715	   segmentation boundaries. This can lead to the loss of FPDU alignment
716	   and containment within a TCP segment during TCP retransmissions. An
717	   MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
718	   boundaries on a retransmission.

720	7.3.1  Effects of MPA on TCP Segmentation

722	   Applications expected to see strong advantages from Direct Data
723	   Placement include transaction-based applications and throughput
724	   applications. Request/response protocols typically send one FPDU per
725	   TCP segment and then wait for a response. Therefore, the application
726	   is expected to set TCP parameters such that it can trade off latency
727	   and wire efficiency. This is accomplished by setting the TCP_NODELAY
728	   socket option.
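
   On a sockets-based host stack, setting this option might look like
   the sketch below.  It uses the Python standard library socket
   module; the helper name is hypothetical, and an MPA implementation
   inside an offload engine would use its own mechanism instead.

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Turn off the Nagle algorithm on a TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    disable_nagle(sock)
    # A non-zero value means TCP_NODELAY is set.
    print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
```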

730	   When latency is not critical, and the application provides data in
731	   chunks larger than EMSS at one time,  the TCP implementation may
732	   "pack" any available stream data into TCP segments so that the
733	   segments are filled to the EMSS.  If the amount of data available is
734	   not enough to fill the TCP segment when it is prepared for
735	   transmission, TCP can send the segment partly filled, or use the
736	   Nagle algorithm to wait for the ULP to post more data (discussed
737	   below).

739	   DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU
740	   when a DDP message is large enough.  Since the DDP message may not
741	   exactly fit into TCP segments, a "message tail" often occurs that
742	   results in an FPDU that is smaller than a single TCP segment.  If a
743	   "message tail", small DDP messages, or the start of a larger DDP
744	   message are available, MPA MAY "pack" the resulting FPDUs into TCP
745	   segments.  When this is done, the TCP segments can be more fully
746	   utilized, but, due to the size constraints of FPDUs, segments may not
747	   be filled to the EMSS.

749	        Note that MPA receivers must do more processing of a TCP segment
750	        that contains multiple FPDUs; this may affect the performance of
751	        some receiver implementations.

753	   TCP implementations often utilize the "Nagle" [RFC0896] algorithm to
754	   ensure that segments are filled to the EMSS whenever the round trip
755	   latency is large enough that the source stream can fully fill
756	   segments before Acks arrive.  The algorithm does this by delaying the
757	   transmission of TCP segments until a ULP can fill a segment, or until
758	   an ACK arrives from the far side.  The algorithm thus allows for
759	   smaller segments when latencies are shorter to keep the ULP's end to
760	   end latency to reasonable levels.

762	   The Nagle algorithm is not mandatory to use [RFC1122].

764	   It is up to the ULP to decide if Nagle is useful with DDP/MPA.  Note
765	   that many of the applications expected to take advantage of MPA/DDP
766	   prefer to avoid the extra delays caused by Nagle. In such scenarios
767	   it is anticipated there will be minimal opportunity for packing at
768	   the transmitter and receivers may choose to optimize their
769	   performance for this anticipated behavior.

771	7.3.2  FPDU Size Considerations

773	   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
774	   the size of the largest ULPDU fitting in an FPDU.  For an empty TCP
775	   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
776	   space for markers and pad octets.

778	     The maximum ULPDU Length for a single ULPDU MUST be computed as:

780	        MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

782	   The formula above accounts for the worst-case number of markers.
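
   As a worked illustration of the formula, the sketch below evaluates
   it for a few EMSS values.  The function names are hypothetical.
   The 128-octet floor applied by the second helper is the lower bound
   on MULPDU that this section requires when EMSS becomes very small.

```python
MARKER_INTERVAL = 512
FPDU_OVERHEAD = 6    # 2 octets of ULPDU Length plus 4 octets of CRC
MIN_MULPDU = 128     # MULPDU is never reduced below this value


def mulpdu(emss: int) -> int:
    """MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)."""
    markers = -(-emss // MARKER_INTERVAL)      # integer ceiling
    return emss - (FPDU_OVERHEAD + 4 * markers + emss % 4)


def advertised_mulpdu(emss: int) -> int:
    """The formula result, but never below the 128-octet floor."""
    return max(mulpdu(emss), MIN_MULPDU)


for emss in (1460, 512, 100):
    print(emss, mulpdu(emss), advertised_mulpdu(emss))
# 1460 1442 1442
# 512 502 502
# 100 90 128
```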

784	   As a further optimization of the wire efficiency an MPA
785	   implementation MAY dynamically adjust the MULPDU (see section 7.3.1
787	   for latency and wire efficiency trade-offs). When one or more FPDUs
788	   are already packed into a TCP Segment, MULPDU MAY be reduced
789	   accordingly.

791	   DDP SHOULD provide ULPDUs that are as large as possible, but less
792	   than or equal to MULPDU.

794	   If the TCP implementation needs to adjust EMSS to support MTU
795	   changes, the MULPDU value is changed accordingly.

797	   In certain rare situations, the EMSS may shrink to very small sizes.
798	   If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU
799	   below 128 octets and is not required to follow the segmentation rules
800	   in Section 7.3 MPA on TCP Sender Segmentation on page 17.

802	   If one or more FPDUs are already packed into a TCP segment, such that
803	   the remaining room is less than 128 octets, MPA MUST NOT provide a
804	   MULPDU smaller than 128.  In this case, MPA would typically provide a
805	   MULPDU for the next full sized segment, but may still pack the next
806	   FPDU into the small remaining room, provided that the next FPDU is
807	   small enough to fit.

809	   The value 128 is chosen so as to allow DDP designers room for the
810	   DDP Header and some user data.

812	7.4  MPA Receiver FPDU Identification

814	   An MPA receiver MUST first verify the FPDU before passing the ULPDU
815	   to DDP.  To do this, the receiver MUST:

817	   *   locate the start of the FPDU unambiguously,

819	   *   verify its CRC.

821	   If the above conditions are true, the MPA receiver passes the ULPDU
822	   to DDP.

824	   To detect the start of the FPDU unambiguously one of the following
825	   MUST be used:

827	   1:  In an ordered TCP stream, the ULPDU Length field in the current
828	       FPDU, when the FPDU has a valid CRC, can be used to identify the
829	       beginning of the next FPDU.

831	   2:  A Marker can always be used to locate the beginning of an FPDU
832	       (in FPDUs with valid CRCs).  Since the location of the marker is
833	       known in the octet stream (sequence number space), the marker can
834	       always be found.

836	   3:  Having found an FPDU by means of a Marker, following contiguous
837	       FPDUs can be found by using the ULPDU Lengths (from FPDUs with
838	       valid CRCs) to establish the next FPDU boundary.

840	   The ULPDU Length field (see section 6) MUST be used to determine if
841	   the entire FPDU is present before forwarding the ULPDU to DDP.

843	   CRC calculation is discussed in section 7.2 on page 14 above.
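
   Methods 1 and 3 above amount to stepping from one FPDU to the next
   using the ULPDU Length.  The sketch below shows that step while
   accounting for markers that fall inside the FPDU.  The function
   name is hypothetical, offsets are counted from the first marker of
   the connection, and the CRC of the current FPDU is assumed to have
   been validated already.

```python
MARKER_INTERVAL = 512


def next_fpdu_offset(fpdu_offset: int, ulpdu_length: int) -> int:
    """Return the stream offset at which the following FPDU begins.

    fpdu_offset is the offset of the current FPDU's first octet from
    the first marker; ulpdu_length is its ULPDU Length field.
    """
    unmarked = 2 + ulpdu_length          # ULPDU Length field + ULPDU
    unmarked += -unmarked % 4            # PAD to a multiple of 4
    unmarked += 4                        # CRC
    pos = fpdu_offset
    remaining = unmarked
    while remaining > 0:
        if pos % MARKER_INTERVAL == 0:
            pos += 4                     # skip a marker inside this FPDU
        step = min(remaining, MARKER_INTERVAL - pos % MARKER_INTERVAL)
        pos += step
        remaining -= step
    # If pos is now a multiple of 512, the marker there belongs to the
    # next FPDU and is that FPDU's first octet.
    return pos


print(hex(next_fpdu_offset(0x000, 0x2A)))   # 0x34, first FPDU of Figure 5
print(hex(next_fpdu_offset(0x1EC, 0x2A)))   # 0x220, FPDU of Figure 6
```

   Both results agree with the octet counts in Figures 5 and 6, where
   the CRC fields end at 0x33 and 0x21f respectively.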

845	7.4.1  Re-segmenting Middle boxes and non MPA-aware TCP senders

847	   Since MPA-aware TCP senders start FPDUs on TCP segment
848	   boundaries, a receiving DDP on MPA on TCP implementation may be able
849	   to optimize the reception of data in various ways.

851	   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
852	   segment boundaries.

854	   Some MPA senders may be unable to conform to the sender requirements
855	   because their implementation of TCP is not designed with MPA in mind.
856	   Even if the sender is MPA-aware, the network may contain "middle
857	   boxes" which modify the TCP stream by changing the segmentation.
858	   This is generally interoperable with TCP and its users and MPA must
859	   be no exception.

861	   The presence of markers in MPA allows an MPA receiver to recover the
862	   FPDUs despite these obstacles, although it may be necessary to
863	   utilize additional buffering at the receiver to do so.

865	   Some of the cases that a receiver may have to contend with are listed
866	   below as a reminder to the implementer:

868	   *   A single Aligned and complete FPDU, either in order, or out of
869	       order:  This can be passed to DDP as soon as validated, and
870	       Delivered when ordering is established.

872	   *   Multiple FPDUs in a TCP segment, aligned and fully contained,
873	       either in order, or out of order:  These can be passed to DDP as
874	       soon as validated, and Delivered when ordering is established.

876	   *   Incomplete FPDU: The receiver should buffer until the remainder
877	       of the FPDU arrives.  If the remainder of the FPDU is already
878	       available, this can be passed to DDP as soon as validated, and
879	       Delivered when ordering is established.

881	   *   Unaligned FPDU start: The partial FPDU must be combined with its
882	       preceding portion(s).  If the preceding parts are already
883	       available, and the whole FPDU is present, this can be passed to
884	       DDP as soon as validated, and Delivered when ordering is
885	       established.  If the whole FPDU is not available, the receiver
886	       should buffer until the remainder of the FPDU arrives.

888	   *   Combinations of Unaligned or incomplete FPDUs (and potentially
889	       other complete FPDUs) in the same TCP segment:  If any FPDU is
890	       present in its entirety, or can be completed with portions
891	       already available, it can be passed to DDP as soon as validated,
892	       and Delivered when ordering is established.

894	8  Connection Semantics

896	8.1  Connection setup

898	   DDP on MPA requires that DDP's consumer MUST activate DDP, MPA, and
899	   any TCP enhancements for MPA, on a TCP half connection at the same
900	   location in the octet stream at both the sender and the receiver.
901	   This is required in order for the marker scheme to correctly locate
902	   the markers.

904	   DDP, MPA, and any TCP enhancements for MPA, MAY be started separately
905	   in each direction, or enabled in both directions at once.

907	   This can be accomplished several ways, and is left up to DDP's ULP:

909	   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
910	       connection setup.  This has the advantage that no additional
911	       negotiation is needed (at least for MPA).  In this case the
912	       marker MUST be the first four octets sent (this marker has the
913	       special value 0x0000, meaning it belongs to the FPDU that
914	       follows).

916	       This may be accomplished by using a well-known port, or a service
917	       locator protocol to locate an appropriate port on which DDP on
918	       MPA is expected to operate.

920	   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
921	       normal TCP startup, using TCP streaming data exchanges on the
922	       same connection.  The exchange establishes that DDP on MPA (as
923	       well as other ULPs) will be used, and exactly locates the point
924	       in the octet stream where MPA is to begin operation.  Again, the
925	       marker is the first four octets sent when operation begins (this
926	       marker has the special value 0x0000, meaning it belongs to the
927	       FPDU that follows).  Note that such a negotiation protocol is
928	       outside the scope of this specification.  A simplified example of
929	       such a protocol is shown below.

931	     +-------------------------+
932	     |ULP streaming mode       |
933	     | <Hello> request to      |
934	     | transition to DDP/MPA   |           +--------------------------+
935	     | mode                    | --------> |ULP gets request;         |
936	     +-------------------------+           |sets its receiver to      |
937	                                           |DDP/MPA mode; sends       |
938	                                           |streaming mode DDP/MPA    |
939	     +-------------------------+           |<Hello Acknowledgement>   |
940	     |ULP receives DDP/MPA     | <-------- |                          |
941	     |<Hello Acknowledgement>; |           +--------------------------+
942	     |Sets transmitter and     |
943	     |receiver to DDP/MPA mode;|
944	     |                         |
945	     |The First DDP/MPA message|           +--------------------------+
946	     |Is then sent.            | --------> |When the DDP/MPA mode     |
947	     +-------------------------+           |message arrives, the ULP  |
948	                                           |sets its Transmit side to |
949	                                           |DDP/MPA mode and begins   |
950	                                           |full operation.           |
951	                                           +--------------------------+
952	                   Figure 7: Example Startup negotiation

954	8.2  Normal Connection Teardown

956	   Each half connection of MPA terminates when DDP closes the
957	   corresponding TCP half connection.

959	   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
960	   that a graceful close of the LLP connection has been received by the
961	   LLP (e.g. FIN is received).

963	9  Error Semantics

965	   The following errors MUST be detected by MPA and the codes SHOULD be
966	   provided to DDP:

968	       Code Error

970	       1   TCP connection closed, terminated or lost.  This includes
971	           lost by timeout, too many retries, RST received or FIN
972	           received.

974	       2   Received MPA CRC does not match the calculated value for the
975	           FPDU.

977	       3   In the event that the CRC is valid, received MPA marker and
978	           'ULPDU Length' fields do not agree on the start of an FPDU.
979	           If the FPDU start determined from previous ULPDU Length
980	           fields does not match with the MPA marker position, MPA
981	           SHOULD deliver an error to DDP.  It may not be possible to
982	           make this check as a segment arrives, but the check SHOULD
983	           be made when a gap creating an out of order sequence is
984	           closed and any time a marker points to an already identified
985	           FPDU.  It is OPTIONAL for a receiver to check each marker,
986	           if multiple markers are present in an FPDU, or if the
987	           segment is received in order.

989	   When conditions 2 or 3 above are detected, an MPA-aware TCP
990	   implementation MAY choose to silently drop the TCP segment rather
991	   than reporting the error to DDP.  In this case, the sending TCP will
992	   retry the segment, usually correcting the error, unless the problem
993	   was at the source.  In that case, the source will usually exceed the
994	   number of retries and terminate the connection.

996	   Once MPA delivers an error of any type, it MUST NOT pass or deliver
997	   any additional FPDUs on that half connection.
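
   The rule above can be modelled as a latch on the receive side of a
   half connection, as in the sketch below.  The error code values
   come from the table in this section; the class, method and member
   names are hypothetical, and the option of silently dropping a
   segment instead of reporting conditions 2 or 3 is not modelled.

```python
from enum import IntEnum


class MpaError(IntEnum):
    CONNECTION_LOST = 1          # closed, terminated or lost
    CRC_MISMATCH = 2             # received CRC differs from calculated CRC
    MARKER_LENGTH_MISMATCH = 3   # marker and ULPDU Length disagree


class ReceiveHalfConnection:
    """Passes ULPDUs to DDP until the first error is delivered."""

    def __init__(self, deliver_ulpdu, report_error):
        self._deliver = deliver_ulpdu
        self._report = report_error
        self.failed = False

    def on_fpdu(self, ulpdu: bytes, crc_ok: bool, marker_ok: bool = True) -> bool:
        """Return True if the ULPDU was passed to DDP."""
        if self.failed:
            return False                 # nothing more after an error
        if not crc_ok:
            return self._fail(MpaError.CRC_MISMATCH)
        if not marker_ok:
            return self._fail(MpaError.MARKER_LENGTH_MISMATCH)
        self._deliver(ulpdu)
        return True

    def _fail(self, code: MpaError) -> bool:
        self.failed = True
        self._report(code)
        return False


delivered, errors = [], []
rx = ReceiveHalfConnection(delivered.append, errors.append)
rx.on_fpdu(b"first", crc_ok=True)
rx.on_fpdu(b"second", crc_ok=False)
rx.on_fpdu(b"third", crc_ok=True)        # ignored: an error was delivered
print(delivered, [e.name for e in errors])
```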

999	   MPA MUST NOT close the TCP connection following a reported error.
1000	   Closing the connection is the responsibility of DDP's ULP.

1002	        Note that since MPA will not deliver any FPDUs on a half
1003	        connection following an error detected on the receive side of
1004	        that connection, DDP's ULP is expected to tear down the
1005	        connection.  This may not occur until after one or more last
1006	        messages are transmitted on the opposite half connection.  This
1007	        allows a diagnostic error message to be sent.

1009	10 Security Considerations

1011	   This section discusses the security considerations for MPA.

1013	10.1 Protocol-specific Security Considerations

1015	   MPA is no more vulnerable to third-party attacks than any other
1016	   protocol running over TCP.  A third party, by sending
1017	   packets into the network that are delivered to an MPA receiver, could
1018	   launch a variety of attacks that take advantage of how MPA operates.
1019	   For example, a third party could send random packets that are valid
1020	   for TCP, but contain no FPDU headers.  An MPA receiver reports an
1021	   error to DDP when any packet arrives that cannot be validated as an
1022	   FPDU when properly located on an FPDU boundary.  This would have a
1023	   severe impact on performance.  Communication security mechanisms such
1024	   as IPsec [RFC2401] may be used to prevent such attacks.  Independent
1025	   of how MPA operates, a third party could use ICMP messages to reduce
1026	   the path MTU to such a small size that performance would likewise be
1027	   severely impacted.  Range checking on path MTU sizes in ICMP packets
1028	   may be used to prevent such attacks.

1030	10.2 Using IPsec With MPA

1032	   IPsec can be used to protect against the packet injection attacks
1033	   outlined above.  Because IPsec is designed to secure individual IP
1034	   packets, MPA can run above IPsec without change.  IPsec packets are
1035	   processed (e.g., integrity checked and decrypted) in the order they
1036	   are received, and an MPA receiver will process the decrypted FPDUs
1037	   contained in these packets in the same manner as FPDUs contained in
1038	   unsecured IP packets.

1040	11 IANA Considerations

1042	   If a well-known port is chosen as the mechanism to identify a DDP on
1043	   MPA on TCP, the well-known port must be registered with IANA.
1044	   Because the use of the port is DDP specific, registration of the port
1045	   with IANA is left to DDP.

1047	12 References

1049	12.1 Normative References

1051	   [iSCSI] Satran, J., "iSCSI", draft-ietf-ips-iscsi-20.txt (work in
1052	       progress), January 2003.

1054	   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
1055	       November 1990.

1057	   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
1058	       Selective Acknowledgment Options", RFC 2018, October 1996.

1060	   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
1061	       3", BCP 9, RFC 2026, October 1996.

1063	   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
1064	       Program Protocol Specification", RFC 793, September 1981.

1066	12.2 Informative References

1068	   [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
1069	       disagree", ACM Sigcomm, Sept. 2000.

1071	   [DDP] H. Shah et al., "Direct Data Placement over Reliable
1072	       Transports", draft-shah-iwarp-ddp-00.txt (Work in progress),
1073	       October 2002.

1075	   [RFC2401]  Atkinson, R., Kent, S., "Security Architecture for the
1076	       Internet Protocol", RFC 2401, November 1998.

1078	   [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
1079	       896, January 1984.

1081	   [NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B.,
1082	       "Application performance pitfalls and TCP's Nagle algorithm",
1083	       Workshop on Internet Server Performance, May 1999.

1085	   [RDMA] R. Recio et al., "RDMA Protocol Specification",
1086	       draft-recio-iwarp-rdmap-00.txt, October 2002.

1088	   [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
1089	       RFC 2960, October 2000.

1091	   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
1092	       September 1981.

1094	   [RFC1122] Braden, R.T., "Requirements for Internet Hosts -
1095	       Communication Layers", RFC 1122, October 1989.

1097	   [ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations" draft-
1098	       elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.

1100	13 Appendix

1102	   This appendix is for information only and is NOT part of the
1103	   standard.

1105	13.1 Receiver implementation

1107	13.1.1 Transport & Network Layer Reassembly Buffers

1109	   The use of reassembly buffers (either TCP reassembly buffers or IP
1110	   fragmentation reassembly buffers) is implementation dependent. When
1111	   MPA is enabled, reassembly buffers are needed if FPDU Alignment is
1112	   lost or if IP fragmentation occurs. This is because the incoming out
1113	   of order segment may not contain enough information for MPA to
1114	   process all of the FPDU. For cases where a re-segmenting middle box
1115	   is present, or where the TCP sender is not MPA-aware, the presence of
1116	   markers significantly reduces the amount of buffering needed.

1118	   Recovery from IP Fragmentation must be transparent to the MPA
1119	   Consumers.

1121	13.1.1.1 Network Layer Reassembly Buffers

1123	   Most IP implementations set the IP Don't Fragment bit. Thus upon a
1124	   path MTU change, intermediate devices drop the IP datagram if it is
1125	   too large and reply with an ICMP message which tells the source TCP
1126	   that the path MTU has changed. This causes TCP to emit segments
1127	   conformant with the new path MTU size. Thus IP fragments should
1128	   rarely occur at the receiver, but they remain possible.

1130	   There are several options for implementation of network layer
1131	   reassembly buffers:

1133	   1.  drop any IP fragments, and reply with an ICMP message according
1134	       to [RFC792] (fragmentation needed and DF set) to tell the Remote
1135	       Peer to resize its TCP segment

1137	   2.  support an IP reassembly buffer, but have it of limited size
1138	       (possibly the same size as the local link's MTU). The end Node
1139	       would normally never advertise a path MTU larger than the local
1140	       link MTU. It is recommended that a dropped IP fragment cause an
1141	       ICMP message to be generated according to [RFC792].

1143	   3.  multiple IP reassembly buffers, of effectively unlimited size.

1145	   4.  support an IP reassembly buffer for the largest IP datagram (64
1146	       KB).

1148	   5.  support for a large IP reassembly buffer which could span
1149	       multiple IP datagrams.

1151	   An implementation should support at least 2 or 3 above, to avoid
1152	   dropping packets that have traversed the entire fabric.

1154	   There is no end-to-end ACK for IP reassembly buffers, so there is no
1155	   flow control on the buffer. The only end-to-end ACK is a TCP ACK,
1156	   which can only occur when a complete IP datagram is delivered to TCP.
1157	   Because of this, under worst case, pathological scenarios, the
1158	   largest IP reassembly buffer is the TCP receive window (to buffer
1159	   multiple IP datagrams that have all been fragmented).

1161	   Note that if the Remote Peer does not implement re-segmentation of
1162	   the data stream upon receiving the ICMP reply updating the path MTU,
1163	   it is possible to halt forward progress because the opposite peer
1164	   would continue to retransmit using a transport segment size that is
1165	   too large. This deadlock scenario is no different than if the fabric
1166	   MTU (not last hop MTU) was reduced after connection setup, and the
1167	   remote Node's behavior is not compliant with [RFC1122].

1169	13.1.1.2 TCP Reassembly buffers

1171	   A TCP reassembly buffer is also needed. TCP reassembly buffers are
1172	   needed if FPDU Alignment is lost when using TCP with MPA or when the
1173	   MPA FPDU spans multiple TCP segments.

1175	   Since lost FPDU Alignment often means that FPDUs are incomplete, an
1176	   MPA on TCP implementation must have a reassembly buffer large enough
1177	   to recover an FPDU that is less than or equal to the MTU of the
1178	   locally attached link (this should be the largest possible advertised
1179	   TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST
1180	   be at least 140 octets long to support the minimum FPDU size.  The
1181	   140 octets allows for the minimum MULPDU of 128, 2 octets of pad, 2
1182	   of ULPDU_Length, 4 of CRC, and space for a possible marker. As usual,
1183	   additional buffering may provide better performance.

1185	   Note that if these TCP segments were not stored, the MPA algorithm
1186	   could deadlock. If the path MTU is reduced, FPDU Alignment requires
1187	   the source TCP to re-segment the data stream to the new path MTU.
1188	   The source MPA will detect this condition and reduce the MPA segment
1189	   size, but any FPDUs already posted to the source TCP will be
1190	   re-segmented and lose FPDU Alignment. If the destination does not
1191	   support a TCP reassembly buffer, these segments can never be
1192	   successfully transmitted and the protocol deadlocks.

1194	   When a complete FPDU is received, processing continues normally.

1196	14 Authors' Addresses

1198	   Stephen Bailey
1199	       Sandburst Corporation
1200	       600 Federal Street
1201	       Andover, MA  01810 USA
1202	       Phone: +1 978 689 1614
1203	       Email: steph@sandburst.com

1205	   Paul R. Culley
1206	       Hewlett-Packard Company
1207	       20555 SH 249
1208	       Houston, Tx. USA 77070-2698
1209	       Phone:  281-514-5543
1210	       Email:  paul.culley@hp.com

1212	   Uri Elzur
1213	       Broadcom
1214	       16215 Alton Parkway
1215	       Irvine, CA 92618
1216	       Phone: 949.585.6432
1217	       Email:  uri@broadcom.com

1219	   Renato J Recio
1220	       IBM
1221	       Internal Zip 9043
1222	       11400 Burnett Road
1223	       Austin,  Texas  78759
1224	       Phone:  512-838-3685
1225	       Email:  recio@us.ibm.com

1227	   John Carrier
1228	       Adaptec Inc.
1229	       691 South Milpitas Blvd.
1230	       Milpitas, CA 95035
1231	       Phone:  360-378-8526
1232	       Email:  John_Carrier@adaptec.com

1234	15 Acknowledgments

1236	   Dwight Barron
1237	       Hewlett-Packard Company
1238	       20555 SH 249
1239	       Houston, Tx. USA 77070-2698
1240	       Phone: 281-514-2769
1241	       Email: dwight.barron@hp.com

1243	   Jeff Chase
1244	       Department of Computer Science
1245	       Duke University
1246	       Durham, NC 27708-0129 USA
1247	       Phone: +1 919 660 6559
1248	       Email: chase@cs.duke.edu

1250	   Ted Compton
1251	       EMC Corporation
1252	       Research Triangle Park, NC 27709, USA
1253	       Phone: 919-248-6075
1254	       Email: compton_ted@emc.com

1256	   Dave Garcia
1257	       Hewlett-Packard Company
1258	       19333 Vallco Parkway
1259	       Cupertino, Ca. USA 95014
1260	       Phone: 408.285.6116
1261	       Email: dave.garcia@hp.com

1263	   Hari Ghadia
1264	       Adaptec, Inc.
1265	       691 S. Milpitas Blvd.,
1266	       Milpitas, CA 95035  USA
1267	       Phone: +1 (408) 957-5608
1268	       Email: hari_ghadia@adaptec.com

1270	   Howard C. Herbert
1271	       Intel Corporation
1272	       MS CH7-404
1273	       5000 West Chandler Blvd.
1274	       Chandler, Arizona 85226
1275	       Phone: 480-554-3116
1276	       Email: howard.c.herbert@intel.com

1278	   Jeff Hilland
1279	       Hewlett-Packard Company
1280	       20555 SH 249
1281	       Houston, Tx. USA 77070-2698
1282	       Phone: 281-514-9489
1283	       Email: jeff.hilland@hp.com

1285	   Mike Ko
1286	       IBM
1287	       650 Harry Rd.
1288	       San Jose, CA 95120
1289	       Phone: (408) 927-2085
1290	       Email: mako@us.ibm.com

1292	   Mike Krause
1293	       Hewlett-Packard Corporation, 43LN
1294	       19410 Homestead Road
1295	       Cupertino, CA 95014 USA
1296	       Phone: +1 (408) 447-3191
1297	       Email: krause@cup.hp.com

1299	   Dave Minturn
1300	       Intel Corporation
1301	       MS JF1-210
1302	       5200 North East Elam Young Parkway
1303	       Hillsboro, Oregon  97124
1304	       Phone: 503-712-4106
1305	       Email: dave.b.minturn@intel.com

1307	   Jim Pinkerton
1308	       Microsoft, Inc.
1309	       One Microsoft Way
1310	       Redmond, WA, USA 98052
1311	       Email: jpink@microsoft.com

1313	   Hemal Shah
1314	       Intel Corporation
1315	       MS PTL1
1316	       1501 South Mopac Expressway, #400
1317	       Austin, Texas  78746
1318	       Phone: 512-732-3963
1319	       Email: hemal.shah@intel.com

1321	   Allyn Romanow
1322	       Cisco Systems
1323	       170 W Tasman Drive
1324	       San Jose, CA 95134 USA
1325	       Phone: +1 408 525 8836
1326	       Email: allyn@cisco.com

1328	   Tom Talpey
1329	       Network Appliance
1330	       375 Totten Pond Road
1331	       Waltham, MA 02451 USA
1332	       Phone: +1 (781) 768-5329
1333	       Email: thomas.talpey@netapp.com

1335	   Patricia Thaler
1336	       Agilent Technologies, Inc.
1337	       1101 Creekside Ridge Drive, #100
1338	       M/S-RG10
1339	       Roseville, CA 95678
1340	       Phone: +1-916-788-5662
1341	       Email: pat_thaler@agilent.com

1343	   Jim Wendt
1344	       Hewlett Packard Corporation
1345	       8000 Foothills Boulevard MS 5668
1346	       Roseville, CA 95747-5668 USA
1347	       Phone: +1 916 785 5198
1348	       Email: jim_wendt@hp.com

1350	   Jim Williams
1351	       Emulex Corporation
1352	       580 Main Street
1353	       Bolton, MA 01740 USA
1354	       Phone: +1 978 779 7224
1355	       Email: jim.williams@emulex.com

1357	16 Full Copyright Statement

1359	   This document and the information contained herein is provided on an
1360	   "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
1361	   CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
1362	   EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
1363	   MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
1364	   NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
1365	   AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
1366	   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
1367	   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
1368	   IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
1369	   PURPOSE.

1371	   Copyright (c) 2002 ADAPTEC INC., BROADCOM CORPORATION, CISCO SYSTEMS
1372	   INC., EMC CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL
1373	   BUSINESS MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT
1374	   CORPORATION, NETWORK APPLIANCE INC., All Rights Reserved