idnits 2.17.1 

draft-ietf-rddp-mpa-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3667, Section 5.1 on line 21.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 2942.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure
     Invitation. 

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        By submitting this Internet-Draft, I certify that any applicable patent
        or other IPR claims of which I am aware have been disclosed, or
        will be disclosed, and any of which I become aware will be
        disclosed, in accordance with RFC 3668.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([DDP], [ELZER-MPA]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 364: '...t recovery of out of order ULPDUs MUST...'
     RFC 2119 keyword, line 381: '...RC check, however CRCs MUST be enabled...'
     RFC 2119 keyword, line 452: '...P implementation MUST inform MPA when ...'
     RFC 2119 keyword, line 458: '...  implementation SHOULD be enabled to:...'
     RFC 2119 keyword, line 462: '....  Multiple FPDUs MAY be packed into a...'
     (123 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == In addition to RFC 3978, Section 5.5 boilerplate, a section with a
     similar start was also found:


        This document and the information contained herein is provided on an "AS
        IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
        CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC
        CORPORATION, EMULEX CORPORATION, HEWLETT-PACKARD COMPANY,
        INTERNATIONAL BUSINESS MACHINES CORPORATION, INTEL CORPORATION,
        MICROSOFT CORPORATION, NETWORK APPLIANCE INC., SANDBURST
        CORPORATION, THE INTERNET SOCIETY, AND THE INTERNET ENGINEERING
        TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
        BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
        HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
        MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     9.  MPA implementations MUST validate the PD_Length field.  The
     buffer that receives the "Private Data" field MUST be large enough to
     receive that data; the amount of "Private Data" MUST not exceed the
     PD_Length, or the application buffer.  If any of the above fails, the
     startup frame MUST be considered improperly formatted.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     C: This bit declares an endpoint's preferred CRC usage.  When this
     field is '0' in the "MPA Request Frame" and the "MPA Reply Frame", CRCs
     MUST not be checked and need not be generated by either endpoint.  When
     this bit is '1' in either the "MPA Request Frame" or "MPA Reply Frame",
     CRCs MUST be generated and checked by both endpoints.  Note that even
     when not in use, the CRC field remains present in the FPDU.  When CRCs
     are not in use, the CRC field MUST be considered valid for FPDU checking
     regardless of its contents.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 2, 2004) is 7390 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'ELZER-MPA' is mentioned on line 211, but not defined

  == Missing Reference: 'RDMAP' is mentioned on line 1831, but not defined

  == Missing Reference: 'MPA' is mentioned on line 2310, but not defined

  == Missing Reference: 'S' is mentioned on line 2122, but not defined

  == Missing Reference: 'RFC0793' is mentioned on line 2407, but not defined

  ** Obsolete undefined reference: RFC  793 (Obsoleted by RFC 9293)

  == Unused Reference: 'RFC2026' is defined on line 1957, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3667' is defined on line 1960, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3668' is defined on line 1963, but no explicit
     reference was found in the text

  == Unused Reference: 'RDMASEC' is defined on line 1972, but no explicit
     reference was found in the text

  == Unused Reference: 'NagleDAck' is defined on line 1991, but no explicit
     reference was found in the text

  == Unused Reference: 'ELZUR-MPA' is defined on line 2011, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 3667 (Obsoleted by RFC 3978)

  ** Obsolete normative reference: RFC 3668 (Obsoleted by RFC 3979)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-10) exists of
     draft-ietf-rddp-security-06

  == Outdated reference: A later version (-07) exists of
     draft-ietf-rddp-ddp-04

  -- Obsolete informational reference (is this intentional?): RFC 2401
     (Obsoleted by RFC 4301)

  -- Obsolete informational reference (is this intentional?): RFC  896
     (Obsoleted by RFC 7805)

  == Outdated reference: A later version (-04) exists of
     draft-ietf-nfsv4-channel-bindings-02

  == Outdated reference: A later version (-07) exists of
     draft-ietf-rddp-rdmap-03

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)


     Summary: 14 errors (**), 0 flaws (~~), 21 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	   Remote Direct Data Placement Work Group   P. Culley
2	   INTERNET-DRAFT                              Hewlett-Packard Company
3	   draft-ietf-rddp-mpa-02.txt                U. Elzur
4	                                               Broadcom Corporation
5	                                             R. Recio
6	                                               IBM Corporation
7	                                             S. Bailey
8	                                               Sandburst Corporation
9	                                             J. Carrier
10	                                               Adaptec

12	   Expires: August 2005                      February 2, 2004

14	             Marker PDU Aligned Framing for TCP Specification

16	Status of this Memo

18	   By submitting this Internet-Draft, I certify that any applicable
19	   patent or other IPR claims of which I am aware have been disclosed,
20	   or will be disclosed, and any of which I become aware will be
21	   disclosed, in accordance with RFC 3668.

23	   By submitting this Internet-Draft, I accept the provisions of Section
24	   4 of RFC 3667.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF), its areas, and its working groups.  Note that
28	   other groups may also distribute working documents as Internet-
29	   Drafts.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   The list of current Internet-Drafts can be accessed at
37	   http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft
38	   Shadow Directories can be accessed at http://www.ietf.org/shadow.html

40	Abstract

42	   A framing protocol is defined for TCP that is fully compliant with
43	   applicable TCP RFCs and fully interoperable with existing TCP
44	   implementations. The framing mechanism is designed to work as an
45	   "adaptation layer" between TCP and the Direct Data Placement [DDP]
46	   protocol, preserving the reliable, in-order delivery of TCP, while
47	   adding the preservation of higher-level protocol record boundaries
48	   that DDP requires.

50	   Table of Contents

52	   Status of this Memo.................................................1
53	   Abstract............................................................1
54	   1     Introduction.................................................6
55	   1.1   Motivation...................................................6
56	   1.2   Protocol Overview............................................6
57	   2     Glossary....................................................10
58	   3     LLP and DDP requirements....................................12
59	   3.1   TCP implementation Requirements to support MPA..............12
60	   3.1.1 TCP Transmit side...........................................12
61	   3.1.2 TCP Receive side............................................12
62	   3.2   MPA's interactions with DDP.................................13
63	   4     FPDU Formats................................................15
64	   4.1   Marker Format...............................................16
65	   5     Data Transfer Semantics.....................................17
66	   5.1   MPA Markers.................................................17
67	   5.2   CRC Calculation.............................................19
68	   5.3   MPA on TCP Sender Segmentation..............................22
69	   5.3.1 Effects of MPA on TCP Segmentation..........................22
70	   5.3.2 FPDU Size Considerations....................................24
71	   5.4   MPA Receiver FPDU Identification............................25
72	   5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders....26
73	   6     Connection Semantics........................................27
74	   6.1   Connection setup............................................27
75	   6.1.1 MPA Request and Reply Frame Format..........................31
76	   6.1.2 Example Delayed Startup sequence............................32
77	   6.1.3 Use of "Private Data".......................................35
78	   6.1.4 "Dual Stack" implementations................................38
79	   6.2   Normal Connection Teardown..................................39
80	   7     Error Semantics.............................................40
81	   8     Security Considerations.....................................41
82	   8.1   Protocol-specific Security Considerations...................41
83	   8.1.1 Spoofing....................................................41
84	   8.1.2 Eavesdropping...............................................42
85	   8.2   Introduction to Security Options............................43
86	   8.3   Using IPsec With MPA........................................43
87	   8.4   Requirements for IPsec Encapsulation of DDP.................44
88	   9     IANA Considerations.........................................45
89	   10    References..................................................46
90	   10.1  Normative References........................................46
91	   10.2  Informative References......................................46
92	   11    Appendix....................................................48
93	   11.1  Analysis of MPA over TCP Operations.........................48
94	   11.1.1  Assumptions...............................................48
95	   11.1.2  The Value of Header Alignment.............................49
96	   11.2  Receiver implementation.....................................57
97	   11.2.1  Network Layer Reassembly Buffers..........................57
98	   11.2.2  TCP Reassembly buffers....................................58
99	   11.3  IETF RNIC Interoperability with RDMA Consortium Protocols...59
100	   11.3.1  Negotiated Parameters.....................................59
101	   11.3.2  RDMAC RNIC and Non-permissive IETF RNIC...................60
102	   11.3.3  RDMAC RNIC and Permissive IETF RNIC.......................62
103	   11.3.4  Non-Permissive IETF RNIC and Permissive IETF RNIC.........63
104	   12    Author's Addresses..........................................64
105	   13    Acknowledgments.............................................65
106	   14    Full Copyright Statement....................................68

108	   Table of Figures

110	   Figure 1 ULP MPA TCP Layering.......................................8
111	   Figure 2 FPDU Format...............................................15
112	   Figure 3 Marker Format.............................................16
113	   Figure 4 Example FPDU Format with Marker...........................18
114	   Figure 5 Annotated Hex Dump of an FPDU.............................21
115	   Figure 6 Annotated Hex Dump of an FPDU with Marker.................21
116	   Figure 7 "MPA Request/Reply Frame".................................31
117	   Figure 8: Example Delayed Startup negotiation......................33
118	   Figure 9: Example Immediate Startup negotiation....................36
119	   Figure 10: Non-aligned FPDU freely placed in TCP octet stream......51
120	   Figure 11: Aligned FPDU placed immediately after TCP header........53
121	   Figure 12. Connection Parameters for the RNIC Types................60
122	   Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
123	   IETF RNIC..........................................................61
124	   Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
125	   IETF RNIC..........................................................62
126	   Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
127	   Permissive IETF RNIC...............................................63

129	   Revision history

131	   [draft-ietf-rddp-mpa-02] workgroup draft with following changes:

133	        Made IPsec mandatory to implement, optional to use.

135	        Updated Marker language to clarify that it points to ULPDU
136	        Length even when marker precedes FPDU.

138	        Clarified when Marker use starts (in full operation mode).

140	        Added informative text on interoperability with RDMAC RNICs.

142	        Reduced "Private Data" to 512 octets max.

144	        Clarified CRC use description, must be used unless data is at
145	        least as well protected by another means.

147	        Clarified CRC disabled mode; CRC field is always valid.

149	        Added Security text.

151	        Changed DDP and RDMAP version numbers in hex dumps (Fig 5,6) and
152	        adjusted CRC accordingly.

154	   [draft-ietf-rddp-mpa-01] workgroup draft with following changes:

156	        Added the "R" bit (Rejected) to the "MPA Reply Frame" and
157	        described its semantics.

159	        Added some comments on recent decisions regarding startup.

161	        Updated RFC3667 boilerplate.

163	   [draft-ietf-rddp-mpa-00] workgroup draft with following changes:

165	        Changed "Start Key" to two separate startup frames to facilitate
166	        identification of incorrect Active/Active startup.

168	        Changed Active/Passive nomenclature to Initiator/Responder to
169	        reduce confusion with TCP startup and verbs doc (which used
170	        opposite sense).

172	        Added "Private Data" to the startup key sequences.  This also
173	        required describing the motivation and expected usage models
174	        along with some interface hints.  Removed the "Private data"
175	        material from the appendix.

177	        Added example "Immediate" startup with TCP and explanation.

179	   [draft-culley-iwarp-mpa-03]

181	        Added option to allow receivers to specify Marker use.

183	        Added option that allows both sides to agree not to use CRC.

185	        Added startup declaration "Start Key" with options and larger
186	        MPA mode recognition "key".

188	        Updated MPA/DDP connection startup rules and sequence to deal
189	        with "Start Key".

191	        Added Appendix that provides a more detailed analysis of the
192	        effects of MPA on TCP data streams.

194	        Added appendix that describes a mechanism to deal with "private
195	        data" prior to full MPA/DDP operation.

197	   [draft-culley-iwarp-mpa-02]

199	        Enhanced descriptions of how MPA is used over an unmodified TCP.

201	        Removed "No Packing" text.

203	        Made MPA an adaptation layer for DDP, instead of a generalized
204	        framing solution.

206	        Added clarifications of the MPA/TCP interaction for optimized
207	        implementations and that any such optimizations are to be used
208	        only when requested by MPA.

210	        Note: a discussion of reasons for these changes can be found in
211	        [ELZUR-MPA].

213	   [draft-culley-iwarp-mpa-01] initial draft.

215	1  Introduction

217	   This section discusses the reason for creating MPA on TCP and a
218	   general overview of the protocol.  Later sections show the MPA
219	   headers (see section 4 on page 15), and detailed protocol
220	   requirements and characteristics (see section 5 on page 17), as well
221	   as Connection Semantics (section 6 on page 27), Error Semantics
222	   (section 7 on page 40), and Security Considerations (section 8 on
223	   page 41).

225	1.1  Motivation

227	   The Direct Data Placement protocol [DDP], when used with TCP [RFC793],
228	   requires a mechanism to detect record boundaries.  The DDP records
229	   are referred to as Upper Layer Protocol Data Units by this document.
230	   The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
231	   boundary is useful to a hardware network adapter that uses DDP to
232	   directly place the data in the application buffer based on the
233	   control information carried in the ULPDU header.  This may be done
234	   without requiring that the packets arrive in order.  Potential
235	   benefits of this capability are the avoidance of the memory copy
236	   overhead and a smaller memory requirement for handling out of order
237	   or dropped packets.

239	   Many approaches have been proposed for a generalized framing
240	   mechanism.  Some are probabilistic in nature and others are
241	   deterministic.  A probabilistic approach is characterized by a
242	   detectable value embedded in the octet stream.  It is probabilistic
243	   because under some conditions the receiver may incorrectly interpret
244	   application data as the detectable value.  Under these conditions,
245	   the protocol may fail with unacceptable frequency.  A deterministic
246	   approach is characterized by embedded controls at known locations in
247	   the octet stream.  Because the receiver can guarantee it will only
248	   examine the data stream at locations that are known to contain the
249	   embedded control, the protocol can never misinterpret application
250	   data as being embedded control data.  For unambiguous handling of an
251	   out of order packet, the deterministic approach is preferred.
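   The difference between the two approaches can be illustrated with a
   small sketch.  It is not MPA itself: the two-octet sentinel value and
   the two-octet length prefix are hypothetical stand-ins chosen for the
   example, and the length prefix is only the simplest form of "control
   information at a known location".

```python
import struct

SENTINEL = b"\xde\xad"  # hypothetical detectable value (probabilistic scheme)


def frame_with_sentinel(records):
    """Probabilistic framing: each record is preceded by a detectable value."""
    return b"".join(SENTINEL + r for r in records)


def parse_with_sentinel(stream):
    """Split wherever the detectable value appears; payload octets can fool it."""
    return stream.split(SENTINEL)[1:]


def frame_with_length(records):
    """Deterministic framing: a 2-octet length sits at a known stream offset."""
    return b"".join(struct.pack("!H", len(r)) + r for r in records)


def parse_with_length(stream):
    """Only ever interpret octets at positions known to hold a length field."""
    records, pos = [], 0
    while pos < len(stream):
        (length,) = struct.unpack_from("!H", stream, pos)
        records.append(stream[pos + 2 : pos + 2 + length])
        pos += 2 + length
    return records


# The second record happens to contain the sentinel value as application data.
records = [b"hello", b"ab\xde\xadcd", b"world"]
sentinel_result = parse_with_sentinel(frame_with_sentinel(records))
length_result = parse_with_length(frame_with_length(records))
```

   In this sketch the sentinel parser splits the second record in two,
   while the length-based parser returns the three original records.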

253	   The MPA protocol provides a framing mechanism for DDP running over
254	   TCP using the deterministic approach.  It allows the location of the
255	   ULPDU to be determined in the TCP stream even if the TCP segments
256	   arrive out of order.

258	1.2  Protocol Overview

260	   MPA is described as an extra layer above TCP and below DDP.  The
261	   operation sequence is:

263	   1.  A TCP connection is established by ULP action.  This is done
264	       using methods not described by this specification.  The ULP may
265	       exchange some amount of data in streaming mode prior to starting
266	       MPA, but is not required to do so.

268	   2.  The Consumer negotiates the use of DDP and MPA at both ends of a
269	       connection.  The mechanisms to do this are not described in this
270	       specification.  The negotiation may be done in streaming mode, or
271	       by some other mechanism (such as a pre-arranged port number).

273	   3.  The ULP activates MPA on each end in the "Startup Phase", either
274	       as an "Initiator" or a "Responder", as determined by the ULP.
275	       This mode verifies the usage of MPA, specifies the use of CRC and
276	       Markers, and allows the ULP to communicate some additional data
277	       via a "private data" exchange.  See section 6.1 Connection setup
278	       for more details on the startup process.

280	   4.  At the end of the Startup Phase, the ULP puts MPA (and DDP) into
281	       full operation and begins sending DDP data as further described
282	       below.  In this document, DDP data chunks are called ULPDUs.  For
283	       a description of the DDP data, see [DDP].

285	   Following is a description of data transfer when MPA is in full
286	   operation.

288	   1.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
289	       for this value.  MPA derives this information from TCP, when it
290	       is available, or chooses a reasonable value.  This information is
291	       already exposed by many TCP implementations, including all
292	       modern flavors of BSD networking, through the TCP_MAXSEG socket
293	       option.

295	   2.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
296	       MPA at the sender.

298	   3.  MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
299	       header, optionally inserting markers, and appending a CRC field
300	       after the ULPDU and PAD (if any).  MPA delivers the FPDU to TCP.

302	   4.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
303	       Sender is MPA-aware, it segments the TCP stream in such a way
304	       that a TCP Segment boundary is also the boundary of an FPDU.  TCP
305	       then passes each segment to the IP layer for transmission.

307	   5.  The TCP receiver may be MPA-aware or may not be MPA-aware. If it
308	       is MPA-aware, it may separate passing the TCP payload to MPA from
309	       passing the TCP payload ordering information to MPA. In either
310	       case, RFC compliant TCP wire behavior is observed at both the
311	       sender and receiver.

313	   6.  The MPA receiver locates and assembles complete FPDUs within the
314	       stream, verifies their integrity, and removes MPA markers (when
315	       present), ULPDU_Length, PAD and the CRC field.

317	   7.  MPA then provides the complete ULPDUs to DDP.  MPA may also
318	       separate passing MPA payload to DDP from passing the MPA payload
319	       ordering information.
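   Steps 3 and 6 above can be sketched as follows.  This is a minimal
   illustration, not the normative format: it assumes a 2-octet
   ULPDU_Length header in network byte order, padding to a 4-octet
   boundary, and a 4-octet CRC32c trailer, and it omits markers
   entirely.  The function names and the byte order used for the CRC
   field are choices made for the example; section 4 and section 5
   define the actual layout.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Sender side: header + ULPDU + PAD + CRC (assumed field sizes)."""
    header = struct.pack("!H", len(ulpdu))
    pad = b"\x00" * (-(2 + len(ulpdu)) % 4)   # pad to a 4-octet boundary
    body = header + ulpdu + pad
    return body + struct.pack("!I", crc32c(body))


def parse_fpdu(stream: bytes, offset: int = 0):
    """Receiver side: return (ulpdu, next_offset); raise on CRC mismatch."""
    (length,) = struct.unpack_from("!H", stream, offset)
    body_len = 2 + length + (-(2 + length) % 4)
    body = stream[offset : offset + body_len]
    (crc,) = struct.unpack_from("!I", stream, offset + body_len)
    if crc != crc32c(body):
        raise ValueError("FPDU CRC mismatch")
    return body[2 : 2 + length], offset + body_len + 4


# Two FPDUs placed back to back in the octet stream, then recovered in turn.
stream = build_fpdu(b"abc") + build_fpdu(b"defgh")
first, next_offset = parse_fpdu(stream)
second, _ = parse_fpdu(stream, next_offset)
```

   Because each FPDU carries its own length, the receiver can walk from
   one FPDU to the next once it knows where any one of them starts.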

321	   The layering of PDUs with MPA is shown in Figure 1, below.

323	   MPA-aware TCP is a TCP layer which potentially contains some
324	   additional semantics as defined in this document.  MPA is implemented
325	   as a data stream ULP for TCP and is therefore RFC compliant.  MPA-
326	   aware TCP is RFC compliant.

328	               +------------------+
329	               |     ULP client   |
330	               +------------------+  <- Consumer messages
331	               |        DDP       |
332	               +------------------+  <- ULPDUs
333	               |        MPA       |
334	               +------------------+  <- FPDUs (containing ULPDUs)
335	               |        TCP*      |
336	               +------------------+  <- TCP Segments (containing FPDUs)
337	               |      IP etc.     |
338	               +------------------+
339	                                      * TCP or MPA-aware TCP.

341	                       Figure 1 ULP MPA TCP Layering

343	   An MPA-aware TCP sender is able to segment the data stream such that
344	   TCP segments begin with FPDUs (FPDU Alignment).  This has significant
345	   advantages for receivers.  When segments arrive with aligned FPDUs
346	   the receiver usually need not buffer any portion of the segment,
347	   allowing DDP to place it in its destination memory immediately, thus
348	   avoiding copies from intermediate buffers (DDP's reason for
349	   existence).

351	   MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation
352	   to recover ULPDUs that may be received out of order.  This enables a
353	   DDP on MPA implementation to save a significant amount of
354	   intermediate storage by placing the ULPDUs in the right locations in
355	   the application buffers when they arrive, rather than waiting until
356	   full ordering can be restored.

358	   The ability of a receiver to recover out of order ULPDUs is optional
359	   and declared to the transmitter during startup.  When the receiver
360	   declares that it does not support out of order recovery, the
361	   transmitter does not add the control information to the data stream
362	   needed for out of order recovery.

364	   MPA implementations that support recovery of out of order ULPDUs MUST
365	   support a mechanism to indicate the ordering of ULPDUs as the sender
366	   transmitted them and indicate when missing intermediate segments
367	   arrive.  These mechanisms allow DDP to reestablish record ordering
368	   and report Delivery of complete messages (groups of records).
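   The distinction between placing ULPDUs as they arrive and Delivering
   them in order can be sketched as below.  The per-ULPDU sequence
   number and buffer offset are hypothetical simplifications for the
   example; in a real implementation the ordering information comes
   from TCP and the placement information from the DDP header.

```python
class Receiver:
    """Places ULPDUs on arrival; reports Delivery only in sender order."""

    def __init__(self, size: int):
        self.buffer = bytearray(size)  # application buffer
        self.pending = set()           # placed but not yet Delivered
        self.next_seq = 0              # next ULPDU expected in order
        self.delivered = []            # sequence numbers, in Delivery order

    def receive(self, seq: int, offset: int, data: bytes) -> None:
        # Placement may happen in any order, directly into the buffer.
        self.buffer[offset : offset + len(data)] = data
        self.pending.add(seq)
        # Delivery is reported strictly in the order the sender transmitted.
        while self.next_seq in self.pending:
            self.pending.remove(self.next_seq)
            self.delivered.append(self.next_seq)
            self.next_seq += 1


rx = Receiver(12)
rx.receive(2, 8, b"cccc")   # arrives early: placed, not yet Delivered
rx.receive(0, 0, b"aaaa")   # in order: placed and Delivered
rx.receive(1, 4, b"bbbb")   # fills the gap: 1 and 2 are now Delivered
```

   No intermediate reassembly buffer is needed in this model; only the
   small set of pending sequence numbers is kept until the gap closes.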

370	   MPA also addresses enhanced data integrity.  Many users of TCP have
371	   noted that the TCP checksum is not as strong as could be desired
372	   [CRCTCP].  Studies have shown that the TCP checksum indicates
373	   segments in error at a much higher rate than the underlying link
374	   characteristics would indicate.  With these higher error rates, the
375	   chance that an error will escape detection, when using only the TCP
376	   checksum for data integrity, becomes a concern.  A stronger integrity
377	   check can reduce the chance of data errors being missed.

379	   MPA includes a CRC check to increase the ULPDU data integrity to the
380	   level provided by other modern protocols, such as SCTP [RFC2960].  It
381	   is possible to disable this CRC check, however CRCs MUST be enabled
382	   unless it is clear that the end to end connection through the network
383	   has data integrity at least as good as a MPA with CRC enabled (for
384	   example when IPSEC is implemented end to end).  DDP's ULP expects
385	   this level of data integrity and therefore the ULP does not have to
386	   provide its own duplicate data integrity and error recovery for lost
387	   data.
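   One well-known weakness of a 16-bit ones' complement checksum, the
   kind TCP uses, is that it is insensitive to the order of the 16-bit
   words it sums.  The sketch below shows that case only; it is an
   illustration of why a CRC is stronger, not a model of the error
   patterns reported in [CRCTCP].  The CRC shown is CRC32c, assumed here
   to be the polynomial MPA uses (see section 5.2).

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


original = b"\x12\x34\x56\x78\x9a\xbc"
swapped = b"\x56\x78\x12\x34\x9a\xbc"      # first two 16-bit words exchanged

same_checksum = internet_checksum(original) == internet_checksum(swapped)
same_crc = crc32c(original) == crc32c(swapped)
```

   Here the checksum cannot tell the two payloads apart, while the CRC
   values differ.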

389	2  Glossary

391	   Consumer - the ULPs or applications that lie above MPA and DDP.  The
392	       Consumer is responsible for making TCP connections, starting MPA
393	       and DDP connections, and generally controlling operations.

395	   Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
396	       the process of informing DDP that a particular PDU is ordered for
397	       use.  This is specifically different from "passing the PDU to
398	       DDP", which may generally occur in any order, while the order of
399	       "Delivery" is strictly defined.

401	   EMSS - Effective Maximum Segment Size.  EMSS is the smaller of the
402	       TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
403	       and the current path Maximum Transmission Unit (MTU) [RFC1191].

405	   FPDU - Framed Protocol Data Unit.  The unit of data created by an
406	       MPA sender.

408	   FPDU Alignment - the property that a TCP segment begins with an FPDU.

410	   Header Alignment - the property that a TCP segment begins with an
411	       FPDU and the TCP segment includes an integer number of FPDUs.

413	   PDU - protocol data unit

415	   MPA-aware TCP - a TCP implementation that is aware of the receiver
416	       efficiencies of MPA Header Alignment and is capable of sending
417	       TCP segments that begin with an FPDU.

419	   MPA-enabled - MPA is enabled if the MPA protocol is visible on the
420	       wire.  When the sender is MPA-enabled, it is inserting framing
421	       and markers.  When the receiver is MPA-enabled, it is
422	       interpreting framing and markers.

424	   MPA - Marker-based ULP PDU Aligned Framing for TCP protocol.  This
425	       document defines the MPA protocol.

427	   MULPDU - Maximum ULPDU. The current maximum size of the record that
428	       is acceptable for DDP to pass to MPA for transmission.

430	   Node - A computing device attached to one or more links of a Network.
431	       A Node in this context does not refer to a specific application
432	       or protocol instantiation running on the computer. A Node may
433	       consist of one or more MPA on TCP devices installed in a host
434	       computer.

436	   Remote Peer - The MPA protocol implementation on the opposite end of
437	       the connection. Used to refer to the remote entity when
438	       describing protocol exchanges or other interactions between two
439	       Nodes.

441	   ULP - Upper Layer Protocol. The protocol layer above the protocol
442	       layer currently being referenced. The ULP for MPA is DDP [DDP].

444	   ULPDU - Upper Layer Protocol Data Unit.  The data record defined by
445	      the layer above MPA (DDP).  ULPDU corresponds to DDP's "DDP
446	      Segment".

448	3  LLP and DDP requirements

450	3.1  TCP implementation Requirements to support MPA

452	   The TCP implementation MUST inform MPA when the TCP connection is
453	   closed or has begun closing (e.g., a FIN has been received).

455	3.1.1  TCP Transmit side

457	   To provide optimum performance, an MPA-aware transmit side TCP
458	   implementation SHOULD be enabled to:

460	   *   With an EMSS large enough to contain the FPDU(s), segment the
461	       outgoing TCP stream such that every TCP segment begins with the
462	       first octet of an FPDU.  Multiple FPDUs MAY be packed into a
463	       single TCP segment as long as they are entirely contained in the
464	       TCP segment.

466	   *   Report the current EMSS to the MPA transmit layer.

468	   An MPA-aware TCP transmit side implementation MUST continue to use
469	   the method of segmentation expected by non-MPA applications (and
470	   described in TCP RFCs) when MPA is not enabled on the connection.
471	   When MPA is enabled above an MPA-aware TCP, it SHOULD specifically
472	   enable the segmentation rules described above for the DDP segments
473	   (FPDUs) posted for transmission.

475	   If the transmit side TCP implementation is not able to segment the
476	   TCP stream as indicated above, MPA SHOULD make a best effort to
477	   achieve that result.  For example, using the TCP_NODELAY socket
478	   option to disable the Nagle algorithm will usually result in many of
479	   the segments starting with an FPDU.

481	   If the transmit side TCP implementation is not able to report the
482	   EMSS, MPA may assume that TCP will use 1460 octet segments in
483	   creating FPDUs.  If the implementation has reason to believe that the
484	   TCP segment size is actually smaller than 1460, it may instead use a
485	   536 octet FPDU.

487	3.1.2  TCP Receive side

489	   When an MPA receive implementation and the MPA-aware receive side TCP
490	   implementation support handling out of order ULPDUs, the TCP receive
491	   implementation SHOULD be enabled to:

493	   *   Pass incoming TCP segments to MPA as soon as they have been
494	       received and validated, even if not received in order.  The TCP
495	       layer MUST have committed to keeping each segment before it can
496	       be passed to the MPA.  This means that the segment must have
497	       passed the TCP, IP, and lower layer data integrity validation
498	       (i.e., checksum), must be in the receive window, must not be a
499	       duplicate, must be part of the same epoch (if timestamps are used
500	       to verify this) and any other checks required by TCP RFCs.  The
501	       segment MUST NOT be passed to MPA more than once unless
502	       explicitly requested (see Section 7).

504	       This is not to imply that the data must be completely ordered
505	       before use.  An implementation may accept out of order segments,
506	       SACK them [RFC2018], and pass them to DDP when the segments
507	       needed to fill in the gaps arrive.  Such an
508	       implementation can "commit" to the data early on, and will not
509	       overwrite it even if (or when) duplicate data arrives.  MPA
510	       expects to utilize this "commit" to allow the passing of ULPDUs
511	       to DDP when they arrive, independent of ordering.

513	   *   Provide a mechanism to indicate the ordering of TCP segments as
514	       the sender transmitted them.  One possible mechanism might be
515	       attaching the TCP sequence number to each segment.

517	   *   Provide a mechanism to indicate when a given TCP segment (and the
518	       prior TCP stream) is complete.  One possible mechanism might be
519	       to utilize the leading (left) edge of the TCP Receive Window.

521	       DDP on MPA MUST utilize these two mechanisms to establish the
522	       Delivery semantics that DDP's consumers agree to.  These
523	       semantics are described fully in [DDP]. These include
524	       requirements on DDP's consumer to respect ownership of buffers
525	       prior to the time that DDP delivers them to the consumer.

527	   An MPA-aware TCP receive side implementation MUST continue to buffer
528	   TCP segments until completely ordered and then deliver them as
529	   expected by non-MPA applications (and described in TCP RFCs) when MPA
530	   is not enabled on the connection.  When MPA is enabled above an MPA-
531	   aware TCP, TCP SHOULD enable the in and out of order passing of data,
532	   and the separate ordering information as described above.

534	   When an MPA receive implementation is coupled with a TCP receive
535	   implementation that does not support the preceding mechanisms, TCP
536	   passes and Delivers incoming stream data to MPA in order.

538	3.2  MPA's interactions with DDP

540	   DDP requires MPA to maintain DDP record boundaries from the sender to
541	   the receiver.  When using MPA on TCP to send data, DDP provides
542	   records (ULPDUs) to MPA.  MPA will use the reliable transmission
543	   abilities of TCP to transmit the data, and will insert appropriate
544	   additional information into the TCP stream to allow the MPA receiver
545	   to locate the record boundary information.

547	   As such, MPA accepts complete records (ULPDUs) from DDP at the sender
548	   and returns them to DDP at the receiver.

550	   MPA combined with an MPA-aware TCP can only ensure FPDU Alignment
551	   with the TCP Header if the FPDU is less than or equal to TCP's EMSS.
552	   Since FPDU alignment is generally desired by the receiver, DDP must
553	   cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS
554	   under normal conditions.  This is done with the MULPDU mechanism.

556	   MPA provides information to DDP on the current maximum size of the
557	   record that is acceptable to send (MULPDU).  DDP SHOULD limit each
558	   record size to MULPDU.  The range of MULPDU values MUST be between
559	   128 octets and 64768 octets, inclusive.

561	   The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
562	   MPA. DDP MAY post a ULPDU of any size between one and 64768 octets,
563	   however MPA is NOT REQUIRED to support a "ULPDU Length" that is
564	   greater than the current MULPDU.

566	   While the maximum theoretical length supported by the MPA header
567	   ULPDU_Length field is 65535, TCP over IP requires the IP datagram
568	   maximum length to be 65535 octets. To enable MPA to support FPDU
569	   Alignment, the maximum size of the FPDU must fit within an IP
570	   datagram. Thus the ULPDU limit of 64768 octets was derived by taking
571	   the maximum IP datagram length, subtracting from it the maximum total
572	   length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
573	   options, and the worst case MPA overhead, and then rounding the
574	   result down to a 128 octet boundary.
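
   The derivation above can be checked with a short calculation.  This
   is only a sketch under stated assumptions: it takes the maximum IPv4
   header and the maximum TCP header (each including options) to be 60
   octets, and it assumes the worst case MPA overhead is the 6 octet
   FPDU overhead, one 4 octet marker per started 512 octet interval,
   and 3 pad octets.  The names used are illustrative and are not
   defined by this document.

```python
MAX_IP_DATAGRAM = 65535   # maximum IP datagram length, in octets
MAX_IPV4_HEADER = 60      # assumption: 20 octet header + 40 octets of options
MAX_TCP_HEADER = 60       # assumption: 20 octet header + 40 octets of options


def ulpdu_limit():
    """Recompute the ULPDU size limit from the derivation in the text."""
    room = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    markers = 4 * -(-room // 512)      # 4 octets per started 512 octet interval
    overhead = 6 + markers + 3         # length field + CRC, markers, worst pad
    return (room - overhead) // 128 * 128   # round down to a 128 octet boundary


print(ulpdu_limit())
```

   Under these assumptions the calculation yields 64768, which matches
   the limit stated above.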

576	   On receive, MPA MUST pass each ULPDU with its length to DDP when it
577	   has been validated.

579	   If an MPA implementation supports passing out of order ULPDUs to DDP,
580	   the MPA implementation SHOULD:

582	   *   Pass each ULPDU with its length to DDP as soon as it has been
583	       fully received and validated.

585	   *   Provide a mechanism to indicate the ordering of ULPDUs as the
586	       sender transmitted them.  One possible mechanism might be
587	       providing the TCP sequence number for each ULPDU.

589	   *   Provide a mechanism to indicate when a given ULPDU (and prior
590	       ULPDUs) is complete.  One possible mechanism might be to allow
591	       DDP to see the current outgoing TCP Ack sequence number.

593	   *   Provide an indication to DDP that the TCP has closed or has begun
594	       to close the connection (e.g. received a FIN).

596	   MPA MUST provide the protocol version negotiated with its peer to
597	   DDP.  DDP will use this version to set the version in its header and
598	   to report the version to RDMAP.

600	4  FPDU Formats

602	   MPA senders create FPDUs out of ULPDUs.  The format of an FPDU shown
603	   below MUST be used for all MPA FPDUs.  For purposes of clarity,
604	   markers are not shown in Figure 2.

606	       0                   1                   2                   3
607	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
608	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
609	      |          ULPDU_Length         |                               |
610	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
611	      |                                                               |
612	      ~                                                               ~
613	      ~                            ULPDU                              ~
614	      |                                                               |
615	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
616	      |                               |          PAD (0-3 octets)     |
617	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
618	      |                             CRC                               |
619	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
620	                           Figure 2 FPDU Format

622	   ULPDU_Length: 16 bits (unsigned integer).  This is the number of
623	   octets of the contained ULPDU.  It does not include the length of the
624	   FPDU header itself, the pad, the CRC, or of any markers that fall
625	   within the ULPDU. The 16-bit "ULPDU Length" field is large enough to
626	   support the largest IP datagrams for IPv4 or IPv6.

628	   PAD: The PAD field trails the ULPDU and contains between zero and
629	   three octets of data.  The pad data MUST be set to zero by the sender
630	   and ignored by the receiver (except for CRC checking).  The length of
631	   the pad is set so as to make the size of the FPDU an integral
632	   multiple of four.

634	   CRC: 32 bits.  When CRCs are enabled, this field contains a CRC32C
635	   check value, which is used to verify the entire contents of the FPDU.
636	   See section 5.2 CRC Calculation on page 19.  When CRCs
637	   are not enabled, this field is still present, may contain any value,
638	   and MUST NOT be checked.

640	   The FPDU adds a minimum of 6 octets to the length of the ULPDU.  In
641	   addition, the total length of the FPDU will include the length of any
642	   markers and from 0 to 3 pad octets added to round up the ULPDU size.
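
   As an illustration of the pad rule, the following sketch computes
   the pad length and the resulting FPDU length for a given ULPDU
   length, not counting any markers.  The function names are
   hypothetical and are used only for this example.

```python
def pad_length(ulpdu_len):
    """Pad octets (0-3) needed after the 2 octet length field and the ULPDU."""
    return -(2 + ulpdu_len) % 4


def fpdu_length(ulpdu_len):
    """FPDU size without markers: length field + ULPDU + pad + CRC."""
    return 2 + ulpdu_len + pad_length(ulpdu_len) + 4


for n in (1, 16, 42):
    print(n, pad_length(n), fpdu_length(n))
```

   For a 16 octet ULPDU this gives 2 pad octets and a 24 octet FPDU;
   for a 42 octet ULPDU no pad is needed and the FPDU is 48 octets.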

644	4.1  Marker Format

646	   The format of a marker MUST be as specified in Figure 3:

648	       0                   1                   2                   3
649	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
650	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
651	      |           RESERVED            |            FPDUPTR            |
652	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
653	                          Figure 3 Marker Format

655	   RESERVED: The Reserved field MUST be set to zero on transmit and
656	   ignored on receive (except for CRC calculation).

658	   FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long,
659	   interpreted as an unsigned integer, that indicates the number of
660	   octets in the TCP stream from the beginning of the "ULPDU Length"
661	   field to the first octet of the entire marker.
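
   A minimal sketch of encoding and decoding the marker in Figure 3
   follows.  It assumes network byte order (most significant octet
   first) for both 16-bit fields; the helper names are hypothetical.

```python
import struct


def pack_marker(fpduptr):
    """Build a 4 octet marker: RESERVED set to zero, then FPDUPTR."""
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR must fit in 16 bits")
    return struct.pack("!HH", 0, fpduptr)


def unpack_marker(octets):
    """Return FPDUPTR from a 4 octet marker, ignoring the RESERVED field."""
    _reserved, fpduptr = struct.unpack("!HH", octets[:4])
    return fpduptr


print(pack_marker(0x000C).hex())
```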

663	5  Data Transfer Semantics

665	   This section discusses some characteristics and behavior of the MPA
666	   protocol as well as implications of that protocol.

668	5.1  MPA Markers

670	   MPA markers are used to identify the start of FPDUs when packets are
671	   received out of order.  This is done by locating the markers at fixed
672	   intervals in the data stream (which is correlated to the TCP sequence
673	   number) and using the marker value to locate the preceding FPDU
674	   start.

676	   The MPA receiver's ability to locate out of order FPDUs and pass the
677	   ULPDUs to DDP is implementation dependent.  MPA/DDP allows those
678	   receivers that are able to deal with out of order FPDUs in this way
679	   to require the insertion of markers in the data stream.  When the
680	   receiver cannot deal with out of order FPDUs in this way, it may
681	   disable the insertion of markers at the sender.  All MPA senders MUST
682	   be able to generate markers when their use is declared by the
683	   opposing receiver (see section 6.1 Connection setup on page 27).

685	   When Markers are enabled, MPA senders MUST insert a marker into the
686	   data stream at a 512 octet periodic interval in the TCP Sequence
687	   Number Space. The marker contains a 16 bit unsigned integer referred
688	   to as the FPDUPTR (FPDU Pointer).

690	   If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
691	   relative back-pointer. FPDUPTR MUST contain the number of octets in
692	   the TCP stream from the beginning of the "ULPDU Length" field to the
693	   first octet of the marker, unless the marker falls between FPDUs.
694	   Thus the location of the first octet of the previous FPDU header can
695	   be determined by subtracting the value of the given marker from the
696	   current octet-stream sequence number (i.e. TCP sequence number) of
697	   the first octet of the marker. Note that this computation must take
698	   into account that the TCP sequence number could have wrapped between
699	   the marker and the header.

701	   An FPDUPTR value of 0x0000 is a special case - it is used when the
702	   marker falls exactly between FPDUs (between the preceding FPDU CRC
703	   field, and the next FPDU's "ULPDU Length" field).  In this case, the
704	   marker MUST be included in the CRC calculation of the FPDU following
705	   the marker (if CRCs are being generated or checked). Thus an FPDUPTR
706	   value of 0x0000 means that immediately following the marker is an
707	   FPDU header (the "ULPDU Length" field).

709	   Since all FPDUs are integral multiples of 4 octets, the bottom two
710	   bits of the FPDUPTR as calculated by the sender are zero.  MPA
711	   reserves these bits so they MUST be treated as zero for computation
712	   at the receiver.
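
   The back-pointer rules above can be sketched as follows.  The
   function is hypothetical; it assumes 32-bit TCP sequence numbers and
   returns the sequence number of the first octet of the "ULPDU Length"
   field located by a marker.

```python
SEQ_MOD = 1 << 32   # TCP sequence numbers wrap modulo 2**32


def ulpdu_length_field_seq(marker_seq, fpduptr):
    """Locate the "ULPDU Length" field from a marker's position and FPDUPTR."""
    ptr = fpduptr & 0xFFFC            # bottom two bits are treated as zero
    if ptr == 0:
        # Marker fell between FPDUs: the header follows the 4 octet marker.
        return (marker_seq + 4) % SEQ_MOD
    # Otherwise step back ptr octets, allowing for sequence number wrap.
    return (marker_seq - ptr) % SEQ_MOD


print(hex(ulpdu_length_field_seq(0x200, 0x14)))
```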

714	   When Markers are enabled (see section 6.1 Connection setup on page
715	   27), the MPA markers MUST be inserted immediately preceding the first
716	   FPDU of full operation phase, and at every 512th octet of the TCP
717	   octet stream thereafter.  As a result, the first marker has an
718	   FPDUPTR value of 0x0000.  If the first marker begins at octet
719	   sequence number SeqStart, then markers are inserted such that the
720	   first octet of the marker is at octet sequence number SeqNum if the
721	   remainder of (SeqNum - SeqStart) mod 512 is zero.  Note that SeqNum
722	   can wrap.

724	   For example, if the TCP sequence number were used to calculate the
725	   insertion point of the marker, the starting TCP sequence number is
726	   unlikely to be zero, so markers are unlikely to fall on sequence
727	   numbers that are zero modulo 512. If the MPA connection is started at TCP
728	   sequence number 11, then the 1st marker will begin at 11, and
729	   subsequent markers will begin at 523, 1035, etc.
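
   The placement rule can be expressed as a small sketch.  The function
   names are hypothetical, and 32-bit sequence numbers are assumed.
   Because 2**32 is a multiple of 512, the modulo arithmetic remains
   valid when SeqNum wraps.

```python
SEQ_MOD = 1 << 32   # TCP sequence numbers wrap modulo 2**32


def is_marker_start(seq_num, seq_start):
    """True if a marker begins at seq_num, given the first marker position."""
    return ((seq_num - seq_start) % SEQ_MOD) % 512 == 0


def next_marker_seq(seq_num, seq_start):
    """Sequence number of the first marker at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MOD
    return (seq_num + (-offset % 512)) % SEQ_MOD


print([s for s in range(0, 1100) if is_marker_start(s, 11)])
```

   With SeqStart set to 11, the positions found below 1100 are 11, 523,
   and 1035, as in the example above.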

731	   If an FPDU is large enough to contain multiple markers, they MUST all
732	   point to the same point in the TCP stream: the first octet of the
733	   "ULPDU Length" field for the FPDU.

735	   If a marker interval contains multiple FPDUs (the FPDUs are small),
736	   the marker MUST point to the start of the "ULPDU Length" field for
737	   the FPDU containing the marker unless the marker falls between FPDUs,
738	   in which case the marker MUST be zero.

740	   The following example shows an FPDU containing a marker.

742	       0                   1                   2                   3
743	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
744	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
745	      |       ULPDU Length (0x0010)   |                               |
746	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
747	      |                                                               |
748	      +                                                               +
749	      |                         ULPDU (octets 0-9)                    |
750	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
751	      |            (0x0000)           |        FPDU ptr (0x000C)      |
752	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
753	      |                        ULPDU (octets 10-15)                   |
754	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
755	      |                               |          PAD (2 octets:0,0)   |
756	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
757	      |                              CRC                              |
758	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
759	                 Figure 4 Example FPDU Format with Marker

761	   MPA Receivers MUST preserve ULPDU boundaries when passing data to
762	   DDP. MPA Receivers MUST pass the ULPDU data and the "ULPDU Length" to
763	   DDP and not the markers, headers, and CRC.

765	5.2  CRC Calculation

767	   An MPA implementation MUST implement CRC support and MUST either:

769	   (1) always use CRCs

771	       or

773	   (2) only negotiate the non-use of CRC on the explicit request of the
774	       system administrator, via an interface not defined in this spec.
775	       The default configuration for a connection MUST be to use CRCs.

777	   (3) The MPA provider at either peer MAY ignore its administrator's
778	       request that CRCs not be used.

780	   The decision for one host to request CRC suppression MAY be made on
781	   an administrative basis for any path that provides equivalent
782	   protection from undetected errors as an end-to-end CRC32c.

784	   The process MUST be invisible to the ULP.

786	   After receipt of an MPA startup declaration indicating that its peer
787	   requires CRCs, an MPA instance MUST continue generating and checking
788	   CRCs until the connection terminates.  If an MPA instance has
789	   declared that it does not require CRCs, it MUST turn off CRC checking
790	   immediately after receipt of an MPA mode declaration indicating that
791	   its peer also does not require CRCs.  It MAY continue generating
792	   CRCs.  See section 6.1 Connection setup on page 27 for details on the
793	   MPA startup.

795	   When sending an FPDU, the sender MUST include a CRC field.  When CRCs
796	   are enabled, the CRC field in the MPA FPDU MUST be computed using the
797	   CRC32C polynomial in the manner described in the iSCSI Protocol
798	   [iSCSI] document for Header and Data Digests.

800	   The fields which MUST be included in the CRC calculation when sending
801	   an FPDU are as follows:

803	   1)  If a marker does not immediately precede the "ULPDU Length"
804	       field, the CRC-32c is calculated from the first octet of the
805	       "ULPDU Length" field, through all the ULPDU and markers (if
806	       present), to the last octet of the PAD (if present), inclusive.
807	       If there is a marker immediately following the PAD, the marker is
808	       included in the CRC calculation for this FPDU.

810	   2)  If a marker immediately precedes the first octet of the "ULPDU
811	       Length" field of the FPDU, (i.e. the marker fell between FPDUs,
812	       and thus is required to be included in the second FPDU), the CRC-
813	       32c is calculated from the first octet of the marker, through the
814	       "ULPDU Length" header, through all the ULPDU and markers (if
815	       present), to the last octet of the PAD (if present), inclusive.

817	   3)  After calculating the CRC-32c, the resultant value is placed into
818	       the CRC field at the end of the FPDU.

820	   When an FPDU is received, and CRC checking is enabled, the receiver
821	   MUST first perform the following:

823	   1)  Calculate the CRC of the incoming FPDU in the same fashion as
824	       defined above.

826	   2)  Verify that the calculated CRC-32c value is the same as the
827	       received CRC-32c value found in the FPDU CRC field.  If not, the
828	       receiver MUST treat the FPDU as an invalid FPDU.
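
   As an illustration only, the following sketch shows a bitwise CRC32C
   and the receiver comparison described above.  It assumes the
   commonly used CRC32C parameters (reflected polynomial 0x82F63B78,
   initial value and final XOR of 0xFFFFFFFF).  The octets covered by
   the CRC are assumed to have been gathered by the caller according to
   the rules above, and the bit and octet ordering of the CRC field on
   the wire is defined by [iSCSI] and is not reproduced here.  The
   function names are hypothetical.

```python
def crc32c(octets):
    """Bitwise CRC32C (Castagnoli) over a bytes-like object."""
    crc = 0xFFFFFFFF
    for octet in octets:
        crc ^= octet
        for _ in range(8):
            # Shift right one bit, applying the polynomial when a 1 falls out.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def fpdu_crc_matches(covered_octets, received_crc):
    """Compare the calculated CRC32C with the value taken from the FPDU."""
    return crc32c(covered_octets) == received_crc


print(hex(crc32c(b"123456789")))
```

   The final line prints the CRC32C of the ASCII string "123456789",
   whose widely published check value is 0xe3069283.  A table driven
   or hardware assisted implementation would normally be preferred
   where performance matters.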

830	   The procedure for handling invalid FPDUs is covered in the Error
831	   Section (see section 7 on page 40).

833	   The following is an annotated hex dump of an example FPDU sent as the
834	   first FPDU on the stream.  As such, it starts with a marker. The FPDU
835	   carries a 42 octet ULPDU whose 24 octets of Send Data are all zeros. The
836	   CRC32c has been correctly calculated and can be used as a reference.
837	   See the [DDP] and [RDMA] specification for definitions of the DDP
838	   Control field, Queue, MSN, MO, and Send Data.

840	       Octet Contents  Annotation
841	       Count

843	       0000    00 00   Marker: Reserved
844	       0002    00 00           FPDUPTR
845	       0004    00 2a   Length
846	       0006    41 43   DDP Control Field, Send with Last flag set
847	       0008    00 00   Reserved (STag position with no STag)
848	       000a    00 00
849	       000c    00 00   Queue = 0
850	       000e    00 00
851	       0010    00 00   MSN = 1
852	       0012    00 01
853	       0014    00 00   MO = 0
854	       0016    00 00
855	       0018    00 00
856	                       Send Data (24 octets of zeros)
857	       002e    00 00
858	       0030    52 23   CRC32c
859	       0032    99 83
860	                  Figure 5 Annotated Hex Dump of an FPDU

862	   The following is an example sent as the second FPDU of the stream
863	   where the first FPDU (which is not shown here) had a length of 492
864	   octets and was also a Send to Queue 0 with Last Flag set.  This
865	   example contains a marker.

867	       Octet Contents  Annotation
868	       Count

870	       01ec    00 2a   Length
871	       01ee    41 43   DDP Control Field: Send with Last Flag set
872	       01f0    00 00   Reserved (STag position with no STag)
873	       01f2    00 00
874	       01f4    00 00   Queue = 0
875	       01f6    00 00
876	       01f8    00 00   MSN = 2
877	       01fa    00 02
878	       01fc    00 00   MO = 0
879	       01fe    00 00
880	       0200    00 00   Marker: Reserved
881	       0202    00 14           FPDUPTR
882	       0204    00 00
883	                       Send Data (24 octets of zeros)
884	       021a    00 00
885	       021c    84 92   CRC32c
886	       021e    58 98
887	            Figure 6 Annotated Hex Dump of an FPDU with Marker

889	5.3  MPA on TCP Sender Segmentation

891	   The various TCP RFCs allow considerable choice in segmenting a TCP
892	   stream.  In order to optimize FPDU recovery at the MPA receiver, MPA
893	   specifies additional segmentation rules.

895	   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
896	   contained in one FPDU.

898	   An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
899	   implementations that support this, and with an EMSS large enough to
900	   contain at least one FPDU, segment the outbound TCP stream such that
901	   each TCP segment begins with an FPDU, and fully contains all included
902	   FPDUs.

904	        Implementation note: To achieve the previous segmentation rule,
905	        TCP's Nagle [RFC0896] algorithm SHOULD be disabled.

907	   There are exceptions to the above rule.  Once a ULPDU is provided to
908	   MPA, the MPA on TCP sender MUST transmit it or fail the connection;
909	   it cannot be repudiated.  As a result, during changes in MTU and
910	   EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it
911	   may be necessary to send FPDUs that do not conform to the
912	   segmentation rule above.

914	   A possible, but less desirable, alternative is to use IP
915	   fragmentation on accepted FPDUs to deal with MTU reductions or
916	   extremely small EMSS.

918	   The sender MUST still format the FPDU according to FPDU format as
919	   shown in Figure 2.

921	   On a retransmission, TCP does not necessarily preserve original TCP
922	   segmentation boundaries. This can lead to the loss of FPDU alignment
923	   and containment within a TCP segment during TCP retransmissions. An
924	   MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
925	   boundaries on a retransmission.

927	5.3.1  Effects of MPA on TCP Segmentation

929	   Applications expected to see strong advantages from Direct Data
930	   Placement include transaction-based applications and throughput
931	   applications. Request/response protocols typically send one FPDU per
932	   TCP segment and then wait for a response. Therefore, the application
933	   is expected to set TCP parameters such that it can trade off latency
934	   and wire efficiency. This is accomplished by setting the TCP_NODELAY
935	   socket option.

937	   When latency is not critical, and the application provides data in
938	   chunks larger than EMSS at one time, the TCP implementation may
939	   "pack" any available stream data into TCP segments so that the
940	   segments are filled to the EMSS.  If the amount of data available is
941	   not enough to fill the TCP segment when it is prepared for
942	   transmission, TCP can send the segment partly filled, or use the
943	   Nagle algorithm to wait for the ULP to post more data (discussed
944	   below).

946	   DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU
947	   when a DDP message is large enough.  Since the DDP message may not
948	   exactly fit into TCP segments, a "message tail" often occurs that
949	   results in an FPDU that is smaller than a single TCP segment.  If a
950	   "message tail", small DDP messages, or the start of a larger DDP
951	   message are available, MPA MAY "pack" the resulting FPDUs into TCP
952	   segments.  When this is done, the TCP segments can be more fully
953	   utilized, but, due to the size constraints of FPDUs, segments may not
954	   be filled to the EMSS.

956	        Note that MPA receivers must do more processing of a TCP segment
957	        that contains multiple FPDUs; this may affect the performance of
958	        some receiver implementations.

960	   TCP implementations often utilize the "Nagle" [RFC0896] algorithm to
961	   ensure that segments are filled to the EMSS whenever the round trip
962	   latency is large enough that the source stream can fully fill
963	   segments before Acks arrive.  The algorithm does this by delaying the
964	   transmission of TCP segments until a ULP can fill a segment, or until
965	   an ACK arrives from the far side.  The algorithm thus allows for
966	   smaller segments when latencies are shorter to keep the ULP's end to
967	   end latency to reasonable levels.

969	   The Nagle algorithm is not mandatory to use [RFC1122].

971	   It is up to the ULP to decide if Nagle is useful with DDP/MPA.  Note
972	   that many of the applications expected to take advantage of MPA/DDP
973	   prefer to avoid the extra delays caused by Nagle. In such scenarios
974	   it is anticipated there will be minimal opportunity for packing at
975	   the transmitter and receivers may choose to optimize their
976	   performance for this anticipated behavior.

978	5.3.2  FPDU Size Considerations

980	   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
981	   the size of the largest ULPDU fitting in an FPDU.  For an empty TCP
982	   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
983	   space for markers and pad octets.

985	        The maximum ULPDU Length for a single ULPDU when markers are
986	        present MUST be computed as:

988	        MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

990	   The formula above accounts for the worst-case number of markers.

992	        The maximum ULPDU Length for a single ULPDU when markers are NOT
993	        present MUST be computed as:

995	        MULPDU = EMSS - (6 + EMSS mod 4)
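
   The two formulas can be combined into one small function, shown here
   as a sketch.  The function name and the markers_enabled parameter
   are hypothetical.

```python
def mulpdu(emss, markers_enabled=True):
    """MULPDU for an empty TCP segment, per the two formulas above."""
    overhead = 6 + emss % 4                  # FPDU overhead plus pad allowance
    if markers_enabled:
        overhead += 4 * -(-emss // 512)      # 4 * Ceiling(EMSS / 512)
    return emss - overhead


for emss in (1460, 536):
    print(emss, mulpdu(emss), mulpdu(emss, markers_enabled=False))
```

   For an EMSS of 1460 octets this gives a MULPDU of 1442 octets with
   markers and 1454 octets without; for an EMSS of 536 octets it gives
   522 and 530 octets respectively.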

997	   As a further optimization of the wire efficiency an MPA
998	   implementation MAY dynamically adjust the MULPDU (see section 7.3.1
999	   for latency and wire efficiency trade-offs). When one or more FPDUs
1000	   are already packed into a TCP Segment, MULPDU MAY be reduced
1001	   accordingly.

1003	   DDP SHOULD provide ULPDUs that are as large as possible, but less
1004	   than or equal to MULPDU.

1006	   If the TCP implementation needs to adjust EMSS to support MTU
1007	   changes, the MULPDU value is changed accordingly.

1009	   In certain rare situations, the EMSS may shrink to very small sizes.
1010	   If this occurs, the MPA on TCP sender MUST NOT shrink the MULPDU
1011	   below 128 octets and is not required to follow the segmentation rules
1012	   in Section 5.3 MPA on TCP Sender Segmentation on page 22.

1014	   If one or more FPDUs are already packed into a TCP segment, such that
1015	   the remaining room is less than 128 octets, MPA MUST NOT provide a
1016	   MULPDU smaller than 128.  In this case, MPA would typically provide a
1017	   MULPDU for the next full sized segment, but may still pack the next
1018	   FPDU into the small remaining room, provided that the next FPDU is
1019	   small enough to fit.

1021	   The value 128 is chosen to allow DDP designers room for the DDP
1022	   Header and some user data.

1024	5.4  MPA Receiver FPDU Identification

1026	   An MPA receiver MUST first verify the FPDU before passing the ULPDU
1027	   to DDP.  To do this, the receiver MUST:

1029	   *   locate the start of the FPDU unambiguously,

1031	   *   verify its CRC (if CRC checking is enabled).

1033	   If the above conditions are true, the MPA receiver passes the ULPDU
1034	   to DDP.

1036	   To detect the start of the FPDU unambiguously one of the following
1037	   MUST be used:

1039	   1:  In an ordered TCP stream, the "ULPDU Length" field in the current
1040	       FPDU, when the FPDU has a valid CRC, can be used to identify the
1041	       beginning of the next FPDU.

1043	   2:  For receivers that support out of order reception of FPDUs (see
1044	       section 5.1 MPA Markers on page 17) a Marker can always be used
1045	       to locate the beginning of an FPDU (in FPDUs with valid CRCs).
1046	       Since the location of the marker is known in the octet stream
1047	       (sequence number space), the marker can always be found.

1049	   3:  Having found an FPDU by means of a Marker, following contiguous
1050	       FPDUs can be found by using the "ULPDU Length" fields (from FPDUs
1051	       with valid CRCs) to establish the next FPDU boundary.

1053	   The "ULPDU Length" field (see section 4) MUST be used to determine if
1054	   the entire FPDU is present before forwarding the ULPDU to DDP.
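
   For the in-order case (method 1 above), the walk from one FPDU to
   the next can be sketched as follows.  This is a simplified
   illustration that assumes markers are disabled, that the length
   field is in network byte order, and that CRC validation is supplied
   by the caller through the hypothetical crc_ok parameter.  The
   default accepts every FPDU, as when CRC checking is not enabled.

```python
import struct


def split_fpdus(stream, crc_ok=lambda fpdu: True):
    """Split ordered stream octets into ULPDUs.

    Returns (ulpdus, consumed); octets past 'consumed' belong to an FPDU
    that is not yet completely present.
    """
    ulpdus = []
    pos = 0
    while len(stream) - pos >= 2:
        (ulpdu_len,) = struct.unpack_from("!H", stream, pos)
        pad = -(2 + ulpdu_len) % 4
        total = 2 + ulpdu_len + pad + 4      # length field, ULPDU, pad, CRC
        if len(stream) - pos < total:
            break                            # incomplete FPDU: wait for more
        fpdu = stream[pos:pos + total]
        if not crc_ok(fpdu):
            raise ValueError("invalid FPDU")
        ulpdus.append(fpdu[2:2 + ulpdu_len])  # only the ULPDU goes to DDP
        pos += total
    return ulpdus, pos


first = b"\x00\x03abc" + b"\x00" * 3 + b"\x00" * 4   # 3 pad octets, dummy CRC
second = b"\x00\x02hi" + b"\x00" * 4                 # no pad, dummy CRC
print(split_fpdus(first + second + b"\x00\x05x"))
```

   The trailing three octets in the usage example start a third FPDU
   that is not yet complete, so they are left unconsumed.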

1056	   CRC calculation is discussed in section 5.2 on page 19 above.

1058	5.4.1  Re-segmenting Middle boxes and non MPA-aware TCP senders

1060	   Since MPA on MPA-aware TCP senders start FPDUs on TCP segment
1061	   boundaries, a receiving DDP on MPA on TCP implementation may be able
1062	   to optimize the reception of data in various ways.

1064	   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
1065	   segment boundaries.

1067	   Some MPA senders may be unable to conform to the sender requirements
1068	   because their implementation of TCP is not designed with MPA in mind.
1069	   Even if the sender is MPA-aware, the network may contain "middle
1070	   boxes" which modify the TCP stream by changing the segmentation.
1071	   This is generally interoperable with TCP and its users and MPA must
1072	   be no exception.

1074	   The presence of markers in MPA (when enabled) allows an MPA receiver
1075	   to recover the FPDUs despite these obstacles, although it may be
1076	   necessary to utilize additional buffering at the receiver to do so.

1078	   Some of the cases that a receiver may have to contend with are listed
1079	   below as a reminder to the implementer:

1081	   *   A single Aligned and complete FPDU, either in order, or out of
1082	       order:  This can be passed to DDP as soon as validated, and
1083	       Delivered when ordering is established.

1085	   *   Multiple FPDUs in a TCP segment, aligned and fully contained,
1086	       either in order, or out of order:  These can be passed to DDP as
1087	       soon as validated, and Delivered when ordering is established.

1089	   *   Incomplete FPDU: The receiver should buffer until the remainder
1090	       of the FPDU arrives.  If the remainder of the FPDU is already
1091	       available, this can be passed to DDP as soon as validated, and
1092	       Delivered when ordering is established.

1094	   *   Unaligned FPDU start: The partial FPDU must be combined with its
1095	       preceding portion(s).  If the preceding parts are already
1096	       available, and the whole FPDU is present, this can be passed to
1097	       DDP as soon as validated, and Delivered when ordering is
1098	       established.  If the whole FPDU is not available, the receiver
1099	       should buffer until the remainder of the FPDU arrives.

1101	   *   Combinations of Unaligned or incomplete FPDUs (and potentially
1102	       other complete FPDUs) in the same TCP segment:  If any FPDU is
1103	       present in its entirety, or can be completed with portions
1104	       already available, it can be passed to DDP as soon as validated,
1105	       and Delivered when ordering is established.
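
	   The buffering behaviour in the cases above can be sketched as
	   follows.  This is only an illustrative sketch under stated
	   assumptions: segments arrive in order, markers are not enabled,
	   CRC validation is omitted, and the FPDU layout is that of section
	   4.  The class name FpduReassembler is hypothetical.

```python
import struct


class FpduReassembler:
    """Buffers in-order TCP payload and returns complete FPDUs.

    Handles FPDUs that are unaligned, split across segments, or packed
    several to a segment. CRC checks and markers are out of scope here.
    """

    def __init__(self):
        self._buf = bytearray()

    def feed(self, segment):
        """Append one in-order segment payload; return complete FPDUs."""
        self._buf += segment
        fpdus = []
        while len(self._buf) >= 2:
            (ulpdu_length,) = struct.unpack_from("!H", self._buf, 0)
            unpadded = 2 + ulpdu_length
            total = unpadded + (-unpadded) % 4 + 4  # pad, then 32-bit CRC
            if len(self._buf) < total:
                break  # incomplete FPDU: keep buffering
            fpdus.append(bytes(self._buf[:total]))
            del self._buf[:total]
        return fpdus
```

	   Each call to feed() returns only the FPDUs completed by that
	   segment; any trailing partial FPDU stays buffered until the
	   remainder arrives.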

1107	6  Connection Semantics

1109	6.1  Connection setup

1111	   MPA requires that the consumer MUST activate MPA, and any TCP
1112	   enhancements for MPA, on a TCP half connection at the same location
1113	   in the octet stream at both the sender and the receiver. This is
1114	   required in order for the marker scheme to correctly locate the
1115	   markers (if enabled) and to correctly locate the first FPDU.

1117	   MPA, and any TCP enhancements for MPA, are enabled by the ULP in both
1118	   directions at once at an endpoint.

1120	   This can be accomplished several ways, and is left up to DDP's ULP:

1122	   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
1123	       connection setup.  This has the advantage that no streaming mode
1124	       negotiation is needed. An example of such a protocol is shown in
1125	       Figure 9: Example Immediate Startup negotiation on page 36.

1127	       This may be accomplished by using a well-known port, or a service
1128	       locator protocol to locate an appropriate port on which DDP on
1129	       MPA is expected to operate.

1131	   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
1132	       normal TCP startup, using TCP streaming data exchanges on the
1133	       same connection.  The exchange establishes that DDP on MPA (as
1134	       well as other ULPs) will be used, and exactly locates the point
1135	       in the octet stream where MPA is to begin operation.  Note that
1136	       such a negotiation protocol is outside the scope of this
1137	       specification.  A simplified example of such a protocol is shown
1138	       in Figure 8: Example Delayed Startup negotiation on page 33.

1140	   An MPA endpoint operates in two distinct phases.

1142	   The "Startup Phase" is used to verify correct MPA setup, exchange CRC
1143	   and Marker configuration, and optionally pass "private data" between
1144	   endpoints prior to completing a DDP connection.  During this phase,
1145	   specifically formatted frames are exchanged as TCP byte streams
1146	   without using CRCs or Markers.  During this phase a DDP endpoint need
1147	   not be "bound" to the MPA connection.  In fact, the choice of DDP
1148	   endpoint and its operating parameters may not be known until the
1149	   consumer supplied "private data" (if any) has been examined by the
1150	   consumer.

1152	   The second distinct phase is "Full operation" during which FPDUs are
1153	   sent using all the rules that pertain (CRCs, Markers, MULPDU
1154	   restrictions etc.).  A DDP endpoint MUST be "bound" to the MPA
1155	   connection at entry to this phase.

1157	   When "private data" is passed between ULPs in the "Startup Phase",
1158	   the ULP is responsible for interpreting that data, and then placing
1159	   MPA into "Full operation".

1161	   Note: The following text differentiates the two endpoints by calling
1162	       them "Initiator" and "Responder".  This is quite arbitrary and is
1163	       NOT related to the TCP startup (SYN, SYN/ACK sequence).  The
1164	       Initiator is the side that sends first in the MPA startup
1165	       sequence (the "MPA Request Frame").

1167	   Note: The possibility that both endpoints would be allowed to make a
1168	       connection at the same time, sometimes called an "Active/Active"
1169	       connection, was considered by the work group and rejected.  There
1170	       were several motivations for this decision.  One was that
1171	       applications needing this facility were few (none other than
1172	       theoretical at the time of this draft).  Another was that the
1173	       facility created some implementation difficulties, particularly
1174	       with the "Dual Stack" designs described later on. A last issue
1175	       was that dealing with rejected connections at startup would have
1176	       required at least an additional frame type, and more recovery
1177	       actions, complicating the protocol.  While none of these issues
1178	       was overwhelming, the group and implementers were not motivated
1179	       to do the work to resolve these issues.

1181	   The ULP is responsible for determining which side is "Initiator" or
1182	   "Responder".  For "Client/Server" type ULPs this is easy.  For peer-
1183	   peer ULPs (which might utilize a TCP style "active/active" startup),
1184	   some mechanism (not defined by this specification) must be
1185	   established, or some streaming mode data exchanged prior to MPA
1186	   startup to determine the side which starts in "Initiator" and which
1187	   starts in "Responder" MPA mode.

1189	   The following rules apply to MPA connection startup phase:

1191	   1.  When MPA is started in the "Initiator" mode, the MPA
1192	       implementation MUST send a valid "MPA Request Frame".  The "MPA
1193	       Request Frame" MAY include ULP supplied "Private Data".

1195	   2.  When MPA is started in the "Responder" mode, the MPA
1196	       implementation MUST wait until a "MPA Request Frame" is received
1197	       and validated before entering full MPA/DDP operation.

1199	       If the "MPA Request Frame" is improperly formatted, the
1200	       implementation MUST close the TCP connection and exit MPA.

1202	       If the "MPA Request Frame" is properly formatted but the "Private
1203	       Data" is not acceptable, the implementation SHOULD return an "MPA
1204	       Reply Frame" with the "Rejected Connection" bit set to '1'; the
1205	       "MPA Reply Frame" MAY include ULP supplied "Private Data"; the
1206	       implementation MUST exit MPA, leaving the TCP connection open.
1207	       The ULP may close TCP or use the connection for other purposes.

1209	       If the "MPA Request Frame" is properly formatted and the "Private
1210	       Data" is acceptable, the implementation SHOULD return an "MPA
1211	       Reply Frame" with the "Rejected Connection" bit set to '0'; the
1212	       "MPA Reply Frame" MAY include ULP supplied "Private Data"; and
1213	       the responder SHOULD prepare to interpret any data received as
1214	       FPDUs and pass any received ULPDUs to DDP.

1216	       Note: Since the receiver's ability to deal with markers is
1217	           unknown until the Request and Reply frames have been
1218	           received, sending FPDUs before this occurs is not possible.

1220	       Note: The requirement to wait on a Request Frame before sending a
1221	           Reply frame is a design choice; it makes for a well-ordered
1222	           sequence of events at each end, and avoids having to specify
1223	           how to deal with situations where both ends start at the same
1224	           time.

1226	   3.  MPA "Initiator" mode implementations MUST receive and validate a
1227	       "MPA Reply Frame".

1229	       If the "MPA Reply Frame" is improperly formatted, the
1230	       implementation MUST close the TCP connection and exit MPA.

1232	       If the "MPA Reply Frame" is properly formatted but the "Private
1233	       Data" is not acceptable, or if the "Rejected Connection" bit is
1234	       set to '1', the implementation MUST exit MPA, leaving the TCP
1235	       connection open.  The ULP may close TCP or use the connection for
1236	       other purposes.

1238	       If the "MPA Reply Frame" is properly formatted and the "Private
1239	       Data" is acceptable, and the "Rejected Connection" bit is set to
1240	       '0', the implementation SHOULD enter full MPA/DDP operation mode;
1241	       interpreting any received data as FPDUs and sending DDP ULPDUs as
1242	       FPDUs.

1244	   4.  MPA "Responder" mode implementations MUST receive and validate at
1245	       least one FPDU before sending any FPDUs or markers.

1247	       Note: this requirement is present to allow the Initiator time to
1248	           get its receiver into full operation before an FPDU arrives,
1249	           avoiding potential race conditions at the initiator.  This
1250	           was also subject to some debate in the work group before
1251	           rough consensus was reached.  Eliminating this requirement
1252	           would allow faster startup in some types of applications.
1253	           However, that would also make certain implementations
1254	           (particularly "Dual Stack") much harder.

1256	   5.  If a received "Key" does not match the expected value (see 6.1.1
1257	       MPA Request and Reply Frame Format below), the TCP/DDP connection
1258	       MUST be closed, and an error returned to the ULP.

1260	   6.  The received "Private Data" fields may be used by consumers at
1261	       either end to further validate the connection, and set up DDP or
1262	       other ULP parameters.  The Initiator ULP MAY close the
1263	       TCP/MPA/DDP connection as a result of validating the "Private
1264	       Data" fields.  The Responder SHOULD return a "MPA Reply Frame"
1265	       with the "Rejected Connection" bit set to '1' if the validation of
1266	       the "Private Data" is not acceptable to the ULP.

1268	   7.  When the first FPDU is to be sent, then if markers are enabled,
1269	       the first octets sent are the special marker 0x00000000, followed
1270	       by the start of the FPDU (the FPDU's "ULPDU Length" field).  If
1271	       markers are not enabled, the first octets sent are the start of
1272	       the FPDU (the FPDU's "ULPDU Length" field).

1274	   8.  MPA implementations MUST use the difference between the "MPA
1275	       Request Frame" and the "MPA Reply Frame" to check for incorrect
1276	       "Initiator/Initiator" startups.  Implementations SHOULD put a
1277	       timeout on waiting for the "MPA Request Frame" when started in
1278	       "Responder" mode, to detect incorrect "Responder/Responder"
1279	       startups.

1281	   9.  MPA implementations MUST validate the PD_Length field.  The
1282	       buffer that receives the "Private Data" field MUST be large
1283	       enough to receive that data; the amount of "Private Data" MUST
1284	       NOT exceed the PD_Length, or the application buffer.  If any of
1285	       the above fails, the startup frame MUST be considered improperly
1286	       formatted.

1288	   10. MPA implementations SHOULD implement a reasonable timeout while
1289	       waiting for the entire startup frames; this prevents certain
1290	       denial of service attacks.  ULPs SHOULD implement a reasonable
1291	       timeout while waiting for FPDUs, ULPDUs and application level
1292	       messages to guard against application failures and certain denial
1293	       of service attacks.
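
	   The decisions in rules 2 and 3 above can be summarized in a small
	   sketch.  It is illustrative only: the function and enumeration
	   names are hypothetical, frame validation and "Private Data"
	   acceptance are assumed to have been decided elsewhere, and the
	   sketch follows the recommended (SHOULD) behaviour where the rules
	   leave a choice.

```python
from enum import Enum


class Action(Enum):
    CLOSE_TCP = "close the TCP connection and exit MPA"
    REJECT = "send Reply with Rejected Connection '1'; exit MPA, TCP open"
    ACCEPT = "send Reply with Rejected Connection '0'; expect FPDUs"
    EXIT_MPA = "exit MPA, leaving the TCP connection open"
    FULL_OPERATION = "enter full MPA/DDP operation"


def responder_action(request_well_formed, private_data_ok):
    """Rule 2: what a Responder does with a received MPA Request Frame."""
    if not request_well_formed:
        return Action.CLOSE_TCP
    if not private_data_ok:
        return Action.REJECT
    return Action.ACCEPT


def initiator_action(reply_well_formed, private_data_ok, rejected_bit):
    """Rule 3: what an Initiator does with a received MPA Reply Frame."""
    if not reply_well_formed:
        return Action.CLOSE_TCP
    if rejected_bit or not private_data_ok:
        return Action.EXIT_MPA
    return Action.FULL_OPERATION
```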

1295	6.1.1  MPA Request and Reply Frame Format

1297	       0                   1                   2                   3
1298	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1299	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1300	   0  |                                                               |
1301	      +         Key (16 bytes containing "MPA ID Req Frame")          +
1302	   4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
1303	      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
1304	   8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
1305	      +                                                               +
1306	   12 |                                                               |
1307	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1308	   16 |M|C|R| Res     |     Rev       |          PD_Length            |
1309	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1310	      |                                                               |
1311	      ~                                                               ~
1312	      ~                   Private Data                                ~
1313	      |                                                               |
1314	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1315	      |                               |
1316	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1317	                    Figure 7 "MPA Request/Reply Frame"

1319	   Key: This field contains the "key" used to authenticate that the
1320	       sender is an MPA sender.  Initiator mode senders must set this
1321	       field to the fixed value "MPA ID Req Frame" or (in byte order) 4D
1322	       50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).
1323	       Responder mode receivers MUST check this field for the same
1324	       value, and close the connection and report an error locally if
1325	       any other value is detected. Responder mode senders must set this
1326	       field to the fixed value "MPA ID Rep Frame" or (in byte order) 4D
1327	       50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).
1328	       Initiator mode receivers MUST check this field for the same
1329	       value, and close the connection and report an error locally if
1330	       any other value is detected.

1332	   M: This bit, when sent in an "MPA Request Frame" or an "MPA Reply
1333	       Frame", declares a receiver's requirement for Markers.  When in a
1334	       received "MPA Request Frame" or "MPA Reply Frame" and the value
1335	       is '0', markers MUST NOT be added to the data stream by the
1336	       sender.  When '1' markers MUST be added as described in section
1337	       5.1 MPA Markers on page 17.

1339	   C: This bit declares an endpoint's preferred CRC usage.  When this
1340	       field is '0' in the "MPA Request Frame" and the "MPA Reply
1341	       Frame", CRCs MUST NOT be checked and need not be generated by
1342	       either endpoint.  When this bit is '1' in either the "MPA Request
1343	       Frame" or "MPA Reply Frame", CRCs MUST be generated and checked
1344	       by both endpoints.  Note that even when not in use, the CRC field
1345	       remains present in the FPDU.  When CRCs are not in use, the CRC
1346	       field MUST be considered valid for FPDU checking regardless of
1347	       its contents.

1349	   R: This bit is set to zero, and not checked on reception in the "MPA
1350	       Request Frame".  In the "MPA Reply Frame", this bit is the
1351	       "Rejected Connection" bit, set by the responder's ULP to indicate
1352	       acceptance '0', or rejection '1', of the connection parameters
1353	       provided in the "Private Data".

1355	   Res: This field is reserved for future use.  It must be set to zero
1356	       when sending, and not checked on reception.

1358	   Rev: This field contains the Revision of MPA.  For this version of
1359	       the specification senders MUST set this field to one.  MPA
1360	       receivers compliant with this version of the specification MUST
1361	       check this field.  If the MPA receiver cannot interoperate with
1362	       the received version, then it MUST close the connection and
1363	       report an error locally.  Otherwise, the MPA receiver should
1364	       report the received version to the ULP.

1366	   PD_Length: This field MUST contain the length in Octets of the
1367	       Private Data field.  A value of zero indicates that there is no
1368	       private data field present at all.  If the receiver detects that
1369	       the PD_Length field does not match the length of the "Private
1370	       Data" field, or if the length of the "Private Data" field exceeds
1371	       512 octets, the receiver MUST close the connection and report an
1372	       error locally.  Otherwise, the MPA receiver should pass the
1373	       PD_Length value and "Private Data" to the ULP.

1375	   Private Data: This field may contain any value defined by ULPs or may
1376	       not be present.  The "Private Data" field MUST be between 0 and 512
1377	       octets in length.  ULPs define how to size, set, and validate
1378	       this field within these limits.
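
	   The frame layout in Figure 7 can be illustrated with a short
	   sketch that builds and parses a startup frame.  It is not a
	   definitive implementation.  It assumes network byte order, takes
	   bit 0 of the figure as the most significant bit of the octet at
	   offset 16 (so M, C and R are 0x80, 0x40 and 0x20), assumes the
	   caller passes exactly one frame's octets, and treats Rev 1 as the
	   only version it can interoperate with.  The names build_frame and
	   parse_frame are hypothetical.

```python
import struct

REQ_KEY = b"MPA ID Req Frame"  # 4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65
REP_KEY = b"MPA ID Rep Frame"  # 4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65
MAX_PRIVATE_DATA = 512
HEADER = struct.Struct("!16sBBH")  # Key, M|C|R|Res, Rev, PD_Length


def build_frame(key, markers, crc, rejected=False, private_data=b"", rev=1):
    """Build a startup frame; Res bits are sent as zero."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (int(bool(markers)) << 7) | (int(bool(crc)) << 6) \
        | (int(bool(rejected)) << 5)
    return HEADER.pack(key, flags, rev, len(private_data)) + private_data


def parse_frame(data, expected_key):
    """Validate a startup frame; raise ValueError if improperly formatted."""
    if len(data) < HEADER.size:
        raise ValueError("truncated startup frame")
    key, flags, rev, pd_length = HEADER.unpack_from(data)
    if key != expected_key:
        raise ValueError("unexpected Key")
    if rev != 1:  # assumption: this sketch only interoperates with Rev 1
        raise ValueError("unsupported Rev")
    if pd_length > MAX_PRIVATE_DATA or len(data) - HEADER.size != pd_length:
        raise ValueError("PD_Length does not match Private Data")
    return {
        "markers": bool(flags & 0x80),
        "crc": bool(flags & 0x40),
        "rejected": bool(flags & 0x20),  # Res bits are not checked
        "rev": rev,
        "private_data": data[HEADER.size:],
    }
```

	   A Responder would call parse_frame() with the Request key and an
	   Initiator with the Reply key, so that an "Initiator/Initiator"
	   startup shows up as a Key mismatch.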

1380	6.1.2  Example Delayed Startup sequence

1382	   A variety of startup sequences are possible when using MPA on TCP.
1383	   Following is an example of an MPA/DDP startup that occurs after TCP
1384	   has been running for a while and has exchanged some amount of
1385	   streaming data.  This example does not use any private data (an
1386	   example that does is shown later in 6.1.3.2 Example Immediate Startup
1387	   using Private Data on page 36), although it is perfectly legal to
1388	   include the private data. Note that since the example does not use
1389	   any Private Data, there are no ULP interactions shown between
1390	   receiving "Startup frames" and putting MPA into "Full operation".

1392	          Initiator                                 Responder

1394	   +---------------------------+
1395	   |ULP streaming mode         |
1396	   | <Hello> request to        |
1397	   | transition to DDP/MPA     |           +--------------------------+
1398	   | mode (optional)           | --------> |ULP gets request;         |
1399	   +---------------------------+           |enables MPA Responder mode|
1400	                                           |with last (optional)      |
1401	                                           |streaming mode <Hello Ack>|
1402	                                           |for MPA to send.          |
1403	   +---------------------------+           |MPA waits for incoming    |
1404	   |ULP receives streaming     | <-------- |  <MPA Request frame>     |
1405	   | <Hello Ack>;              |           +--------------------------+
1406	   |Enters MPA Initiator mode; |
1407	   |MPA sends                  |
1408	   |  <MPA Request Frame>;     |
1409	   |MPA waits for incoming     |           +--------------------------+
1410	   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
1411	   +---------------------------+           |  <MPA Request Frame>     |
1412	                                           |Consumer binds DDP to MPA,|
1413	                                           |MPA sends the             |
1414	                                           |  <MPA Reply Frame>.      |
1415	                                           |DDP/MPA enables FPDU      |
1416	   +---------------------------+           |decoding, but does not    |
1417	   |MPA receives the           | < - - - - |send any FPDUs.           |
1418	   |  <MPA Reply Frame>        |           +--------------------------+
1419	   |Consumer binds DDP to MPA, |
1420	   |DDP/MPA begins full        |
1421	   |operation.                 |
1422	   |MPA sends first FPDU (as   |           +--------------------------+
1423	   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
1424	   |available).                |           |MPA sends first FPDU (as  |
1425	   +---------------------------+           |DDP ULPDUs become         |
1426	                                   <====== |available).               |
1427	                                           +--------------------------+
1428	               Figure 8: Example Delayed Startup negotiation

1430	   An example Delayed Startup sequence is described below:

1432	       *   Active and passive sides start up a TCP connection in the
1433	           usual fashion, probably using sockets APIs.  They exchange
1434	           some amount of streaming mode data.  At some point one side
1435	           (the MPA Initiator) sends streaming mode data that
1436	           effectively says "Hello, Let's go into MPA/DDP mode."

1438	   *   When the remote side (the MPA Responder) gets this streaming mode
1439	       message, the consumer would send a last streaming mode message
1440	       that effectively says "I Acknowledge your Hello, and am now in
1441	       MPA Responder Mode".   The exchange of these messages establishes
1442	       the exact point in the TCP stream where MPA is enabled.  The
1443	       Responding Consumer enables MPA in the Responder mode and waits
1444	       for the initial MPA startup message.

1446	       *   The Initiating Consumer would enable MPA startup in the
1447	           Initiator mode which then sends the "MPA Request Frame".  It
1448	           is assumed that no "Private Data" messages are needed for
1449	           this example, although it is possible to do so.  The
1450	           Initiating MPA (and Consumer) would also wait for the MPA
1451	           connection to be accepted.

1453	   *   The Responding MPA would receive the initial "MPA Request Frame"
1454	       and would inform the consumer that this message arrived.  The
1455	       Consumer can then accept the MPA/DDP connection or close the TCP
1456	       connection.

1458	   *   To accept the connection request, the Responding Consumer would
1459	       use an appropriate API to bind the TCP/MPA connections to a DDP
1460	       endpoint, thus enabling MPA/DDP into full operation. In the
1461	       process of going to full operation, MPA sends the "MPA Reply
1462	       Frame".  MPA/DDP waits for the first incoming FPDU before sending
1463	       any FPDUs.

1465	   *   If the initial TCP data was not a properly formatted "MPA Request
1466	       Frame", MPA will close or reset the TCP connection immediately.

1468	       *   The Initiating MPA would receive the "MPA Reply Frame" and
1469	           would report this message to the Consumer.  The Consumer can
1470	           then accept the MPA/DDP connection, or close or reset the TCP
1471	           connection to abort the process.

1473	       *   On determining that the Connection is acceptable, the
1474	           Initiating Consumer would use an appropriate API to bind the
1475	           TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
1476	           into full operation.  MPA/DDP would begin sending DDP
1477	           messages as MPA FPDUs.

1479	6.1.3  Use of "Private Data"

1481	   This section is advisory in nature, in that it suggests a method by
1482	   which a ULP can deal with pre-DDP connection information exchange.

1484	6.1.3.1  Motivation

1486	   Prior RDMA protocols have been developed that provide "private data"
1487	   via out of band mechanisms.  As a result, many applications now
1488	   expect some form of "private data" to be available for application
1489	   use prior to setting up the DDP/RDMA connection.  For example:

1491	   An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
1492	   and the [Verbs]) must be associated with a Protection Domain.  No
1493	   receive operations may be posted to the endpoint before it is
1494	   associated with a Protection Domain.  Indeed under both the
1495	   InfiniBand and proposed iWARP verbs [Verbs] an endpoint/QP is created
1496	   within a Protection Domain.

1498	   There are some applications where the choice of Protection Domain is
1499	   dependent upon the identity of the remote ULP client. For example, if
1500	   a user session requires multiple connections, it is highly desirable
1501	   for all of those connections to use a single Protection Domain.

1503	   InfiniBand, the DAT APIs and the IT-API all provide for the active
1504	   side ULP to provide "Private Data" when requesting a connection. This
1505	   data is passed to the ULP to allow it to determine whether to accept
1506	   the connection, and if so with which endpoint (and implicitly which
1507	   Protection Domain).

1509	   The Private Data can also be used to ensure that both ends of the
1510	   connection have configured their RDMA endpoints compatibly on such
1511	   matters as the RDMA Read capacity. Further ULP-specific uses are also
1512	   presumed, such as establishing the identity of the client.

1514	   Private Data is also allowed for when accepting the connection, to
1515	   allow completion of any negotiation on RDMA resources and for other
1516	   ULP reasons.

1518	   There are several potential ways to exchange this "Private Data".
1519	   For example, the InfiniBand specification includes a connection
1520	   management protocol that allows a small amount of "private data" to
1521	   be exchanged using datagrams before actually starting the RDMA
1522	   connection.

1524	   This draft allows for small amounts of "Private Data" to be exchanged
1525	   as part of the MPA startup sequence.  The actual Private Data fields
1526	   are carried in the "MPA Request Frame", and the "MPA Reply Frame".

1528	   If larger amounts of private data or more negotiation is necessary,
1529	   TCP streaming mode messages may be exchanged prior to enabling MPA.

1531	6.1.3.2  Example Immediate Startup using Private Data

1533	          Initiator                                 Responder

1535	   +---------------------------+
1536	   |TCP SYN sent               |           +--------------------------+
1537	   +---------------------------+ --------> |TCP gets SYN packet;      |
1538	   +---------------------------+           |  Sends SYN-Ack           |
1539	   |TCP gets SYN-Ack           | <-------- +--------------------------+
1540	   |  Sends Ack                |
1541	   +---------------------------+ --------> +--------------------------+
1542	   +---------------------------+           |Consumer enables MPA      |
1543	   |Enters MPA Initiator mode; |           |Responder Mode, waits for |
1544	   |MPA sends                  |           |  <MPA Request frame>     |
1545	   |  <MPA Request Frame>;     |           +--------------------------+
1546	   |MPA waits for incoming     |           +--------------------------+
1547	   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
1548	   +---------------------------+           |  <MPA Request Frame>     |
1549	                                           |Consumer examines "Private|
1550	                                           |Data", provides MPA with  |
1551	                                           |return "Private Data",    |
1552	                                           |binds DDP to MPA, and     |
1553	                                           |enables MPA to send an    |
1554	                                           |  <MPA Reply Frame>.      |
1555	                                           |DDP/MPA enables FPDU      |
1556	   +---------------------------+           |decoding, but does not    |
1557	   |MPA receives the           | < - - - - |send any FPDUs.           |
1558	   |  <MPA Reply Frame>        |           +--------------------------+
1559	   |Consumer examines "Private |
1560	   |Data", binds DDP to MPA,   |
1561	   |and enables DDP/MPA to     |
1562	   |begin full operation.      |
1563	   |MPA sends first FPDU (as   |           +--------------------------+
1564	   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
1565	   |available).                |           |MPA sends first FPDU (as  |
1566	   +---------------------------+           |DDP ULPDUs become         |
1567	                                   <====== |available).               |
1568	                                           +--------------------------+
1569	              Figure 9: Example Immediate Startup negotiation

1571	   Note: the exact order of when MPA is started in the TCP connection
1572	       sequence is implementation dependent; the above diagram shows one
1573	       possible sequence.  Also, the Initiator "Ack" to the Responder's
1574	       "SYN-Ack" may be combined into the same TCP segment containing
1575	       the "MPA Request Frame" (as is allowed by TCP RFCs).

1577	       The example immediate startup sequence is described below:

1579	   *   The passive side (Responding Consumer) would listen on the TCP
1580	       destination port, to indicate its readiness to accept a
1581	       connection.

1583	       *   The active side (Initiating Consumer) would request a
1584	           connection from a TCP endpoint (that expected to upgrade to
1585	           MPA/DDP/RDMA and expected the private data) to a destination
1586	           address and port.

1588	       *   The Initiating Consumer would initiate a TCP connection to
1589	           the destination port. Acceptance/rejection of the connection
1590	           would proceed as per normal TCP connection establishment.

1592	   *   The passive side (Responding Consumer) would receive the TCP
1593	       connection request as usual allowing normal TCP gatekeepers, such
1594	       as INETD and TCPserver, to exercise their normal
1595	       safeguard/logging functions.  On acceptance of the TCP
1596	       connection, the Responding consumer would enable MPA in the
1597	       Responder mode and wait for the initial MPA startup message.

1599	       *   The Initiating Consumer would enable MPA startup in the
1600	           Initiator mode to send an initial "MPA Request Frame" with
1601	           its included "Private Data" message.  The Initiating
1602	           MPA (and Consumer) would also wait for the MPA connection to
1603	           be accepted, and any returned private data.

1605	   *   The Responding MPA would receive the initial "MPA Request Frame"
1606	       with the "Private Data" message and would pass the Private Data
1607	       through to the consumer.  The Consumer can then accept the
1608	       MPA/DDP connection, close the TCP connection, or reject the MPA
1609	       connection with a return message.

1611	   *   To accept the connection request, the Responding Consumer would
1612	       use an appropriate API to bind the TCP/MPA connections to a DDP
1613	       endpoint, thus enabling MPA/DDP into full operation.  In the
1614	       process of going to full operation, MPA sends the "MPA Reply
1615	       Frame" which includes the Consumer supplied "Private Data"
1616	       containing any appropriate consumer response.  MPA/DDP waits for
1617	       the first incoming FPDU before sending any FPDUs.

1619	   *   If the initial TCP data was not a properly formatted "MPA Request
1620	       Frame", MPA will close or reset the TCP connection immediately.

1622	   *   To reject the MPA connection request, the Responding Consumer
1623	       would send an "MPA Reply Frame" with any ULP supplied "Private
1624	       Data" (with reason for rejection), with the "Rejected Connection"
1625	       bit set to '1', and may close the TCP connection.

1627	       *   The Initiating MPA would receive the "MPA Reply Frame" with
1628	           the "Private Data" message and would report this message to
1629	           the Consumer, including the supplied Private Data.

1631	           If the "Rejected Connection" bit is set to a '1', MPA will
1632	           close the TCP connection and exit.

1634	           If the "Rejected Connection" bit is set to a '0', and on
1635	           determining from the "MPA Reply Frame" "Private Data" that
1636	           the Connection is acceptable, the Initiating Consumer would
1637	           use an appropriate API to bind the TCP/MPA connections to a
1638	           DDP endpoint thus enabling MPA/DDP into full operation.
1639	           MPA/DDP would begin sending DDP messages as MPA FPDUs.
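
   The Initiator-side handling of the "MPA Reply Frame" described in
   the last steps above can be sketched as follows.  This is only an
   illustrative sketch: the names MpaReply, InitiatorAction and
   handle_mpa_reply are hypothetical and do not come from any real API,
   and the case where the bit is '0' but the Consumer finds the Private
   Data unacceptable is not specified by the text above, so closing the
   connection in that case is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Tuple


class InitiatorAction(Enum):
    CLOSE_TCP = "close the TCP connection and exit"
    BIND_TO_DDP = "bind the TCP/MPA connection to a DDP endpoint"


@dataclass
class MpaReply:
    rejected: bool       # the "Rejected Connection" bit
    private_data: bytes  # Consumer supplied "Private Data"


def handle_mpa_reply(
    reply: MpaReply,
    consumer_accepts: Callable[[bytes], bool],
) -> Tuple[InitiatorAction, bytes]:
    """Return the action taken and the Private Data reported upward."""
    # The Private Data is reported to the Consumer in every case.
    reported = reply.private_data
    if reply.rejected:
        # "Rejected Connection" bit is '1': close TCP and exit.
        return InitiatorAction.CLOSE_TCP, reported
    if consumer_accepts(reported):
        # Bit is '0' and the Consumer finds the connection acceptable.
        return InitiatorAction.BIND_TO_DDP, reported
    # Assumption (not stated in the text): an unacceptable reply with
    # the bit set to '0' is handled by closing the TCP connection.
    return InitiatorAction.CLOSE_TCP, reported
```

   The consumer_accepts callback stands in for whatever ULP-specific
   check the Initiating Consumer applies to the returned Private Data.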

1641	6.1.4  "Dual Stack" implementations

1643	   MPA/DDP implementations are commonly expected to be implemented as
1644	   part of a "Dual stack" architecture.  One "stack" is the traditional
1645	   TCP stack, usually with a sockets interface API.  The second stack is
1646	   the MPA/DDP "stack" with its own API, and potentially separate code
1647	   or hardware to deal with the MPA/DDP data.  Of course,
1648	   implementations may vary, so the following comments are of an
1649	   advisory nature only.

1651	   The use of the two "stacks" offers advantages:

1653	        TCP connection setup is usually done with the TCP stack. This
1654	        allows use of the usual naming and addressing mechanisms.  It
1655	        also means that any mechanisms used to "harden" the connection
1656	        setup against security threats are also used when starting
1657	        MPA/DDP.

1659	        Some applications may have been originally designed for TCP, but
1660	        are "enhanced" to utilize MPA/DDP after a negotiation reveals
1661	        the capability to do so.  The negotiation process takes place in
1662	        TCP's streaming mode, using the usual TCP APIs.

1664	        Some new applications, designed for RDMA or DDP, still need to
1665	        exchange some data prior to starting MPA/DDP.  This exchange can
1666	        be of arbitrary length or complexity, but often consists of only
1667	        a small amount of "private data", perhaps only a single message.
1668	        Using the TCP streaming mode for this exchange allows this to be
1669	        done using well understood methods.

1671	   The main disadvantage of using two stacks is the conversion of an
1672	   active TCP connection between them.  This process must be done with
1673	   care to prevent loss of data.

1675	   To avoid some of the problems when using a "dual stack" architecture
1676	   the following additional restrictions may be required by the
1677	   implementation:

1679	   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
1680	       stream data is expected.  This is typically managed by the ULP
1681	       protocol.  When following the recommended startup sequence, the
1682	       "Responder" side enters DDP/MPA mode, sends the last streaming
1683	       mode data, and then waits for the "MPA Request frame".  No
1684	       additional streaming mode data is expected.  The "Initiator" side
1685	       ULP receives the last streaming mode data, and then enters
1686	       DDP/MPA mode.  Again, no additional streaming mode data is
1687	       expected.

1689	   2.  The DDP/MPA MAY provide the ability to send a "Last streaming
1690	       message" as part of its "Responder" DDP/MPA enable function.
1691	       This allows the DDP/MPA stack to more easily manage the
1692	       conversion to DDP/MPA mode (and avoid problems with a very fast
1693	       return of the "MPA Request Frame" from the Initiator side).

1695	   Note: Regardless of the "stack" architecture used, TCP's rules must
1696	       be followed.  For example, if network data is lost, re-segmented
1697	       or re-ordered, TCP must recover appropriately even when this
1698	       occurs while switching stacks.

1700	6.2  Normal Connection Teardown

1702	   Each half connection of MPA terminates when DDP closes the
1703	   corresponding TCP half connection.

1705	   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
1706	   that a graceful close of the LLP connection has been received by the
1707	   LLP (e.g. FIN is received).

1709	7  Error Semantics

1711	   The following errors MUST be detected by MPA and the codes SHOULD be
1712	   provided to DDP or other consumer:

1714	    Code Error

1716	    1    TCP connection closed, terminated or lost.  This includes lost
1717	         by timeout, too many retries, RST received or FIN received.

1719	    2    Received MPA CRC does not match the calculated value for the
1720	         FPDU.

1722	    3    In the event that the CRC is valid, received MPA marker (if
1723	         enabled) and "ULPDU Length" fields do not agree on the start
1724	         of an FPDU.  If the FPDU start determined from previous "ULPDU
1725	         Length" fields does not match with the MPA marker position,
1726	         MPA SHOULD deliver an error to DDP.  It may not be possible to
1727	         make this check as a segment arrives, but the check SHOULD be
1728	         made when a gap creating an out of order sequence is closed
1729	         and any time a marker points to an already identified FPDU.
1730	         It is OPTIONAL for a receiver to check each marker, if
1731	         multiple markers are present in an FPDU, or if the segment is
1732	         received in order.

1734	    4    Invalid MPA Request Frame or MPA Reply Frame received.  In
1735	         this case, the TCP connection MUST be immediately closed.  DDP
1736	         and other ULPs should treat this similar to code 1, above.

1738	   When conditions 2 or 3 above are detected, an MPA-aware TCP
1739	   implementation MAY choose to silently drop the TCP segment rather
1740	   than reporting the error to DDP.  In this case, the sending TCP will
1741	   retry the segment, usually correcting the error, unless the problem
1742	   was at the source.  In that case, the source will usually exceed the
1743	   number of retries and terminate the connection.

1745	   Once MPA delivers an error of any type, it MUST NOT pass or deliver
1746	   any additional FPDUs on that half connection.

1748	   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
1749	   following a reported error.  Closing the connection is the
1750	   responsibility of DDP's ULP.

1752	        Note that since MPA will not deliver any FPDUs on a half
1753	        connection following an error detected on the receive side of
1754	        that connection, DDP's ULP is expected to tear down the
1755	        connection.  This may not occur until after one or more last
1756	        messages are transmitted on the opposite half connection.  This
1757	        allows a diagnostic error message to be sent.
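
   The rules above can be summarized in a small sketch.  The class and
   method names (MpaError, HalfConnection, report_error, deliver) are
   hypothetical and are used only to illustrate the rules: no FPDUs are
   delivered after any reported error, MPA does not itself close the
   TCP connection for codes 2 and 3, and code 4 causes an immediate
   close.

```python
from enum import IntEnum
from typing import List, Optional


class MpaError(IntEnum):
    TCP_CONNECTION_LOST = 1     # closed, terminated or lost
    CRC_MISMATCH = 2            # received CRC differs from calculated
    MARKER_LENGTH_MISMATCH = 3  # marker and "ULPDU Length" disagree
    INVALID_STARTUP_FRAME = 4   # bad MPA Request or Reply Frame


class HalfConnection:
    """Receive side of one MPA half connection (illustrative only)."""

    def __init__(self) -> None:
        self.error: Optional[MpaError] = None
        self.delivered: List[bytes] = []
        self.tcp_closed_by_mpa = False

    def report_error(self, code: int) -> MpaError:
        # Keep the first error; later errors do not change the state.
        if self.error is None:
            self.error = MpaError(code)
            if self.error is MpaError.INVALID_STARTUP_FRAME:
                # Code 4: the TCP connection MUST be closed at once.
                self.tcp_closed_by_mpa = True
            # Codes 2 and 3: MPA MUST NOT close the TCP connection;
            # that is left to DDP's ULP.
        return self.error

    def deliver(self, fpdu: bytes) -> bool:
        # Once any error was delivered, no further FPDUs are passed up.
        if self.error is not None:
            return False
        self.delivered.append(fpdu)
        return True
```

   Because the opposite half connection keeps its own state, a ULP
   using such a structure could still send a final diagnostic message
   before tearing the connection down.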

1759	8  Security Considerations

1761	   This section discusses the security considerations for MPA.

1763	8.1  Protocol-specific Security Considerations

1765	   The vulnerabilities of MPA to third-party attacks are no greater than
1766	   any other protocol running over TCP.  A third party, by sending
1767	   packets into the network that are delivered to an MPA receiver, could
1768	   launch a variety of attacks that take advantage of how MPA operates.
1769	   For example, a third party could send random packets that are valid
1770	   for TCP, but contain no FPDU headers.  An MPA receiver reports an
1771	   error to DDP when any packet arrives that cannot be validated as an
1772	   FPDU when properly located on an FPDU boundary.  A third party could
1773	   also send packets that are valid for TCP, MPA, and DDP, but do not
1774	   target valid buffers.  These types of attacks ultimately result in
1775	   loss of connection and thus become a type of DOS (Denial Of Service)
1776	   attack.  Communication security mechanisms such as IPsec [RFC2401]
1777	   may be used to prevent such attacks.

1779	   Independent of how MPA operates, a third party could use ICMP
1780	   messages to reduce the path MTU to such a small size that performance
1781	   would likewise be severely impacted.  Range checking on path MTU
1782	   sizes in ICMP packets may be used to prevent such attacks.
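
   One possible form of such a range check is sketched below.  The
   function name checked_pmtu and the default policy_floor of 576
   octets are assumptions made for illustration; only the 68-octet
   lower bound is taken from [RFC1191].  The sketch ignores ICMP
   reports that would raise the estimate or push it below the floor.

```python
IPV4_MIN_MTU = 68  # lowest Path MTU estimate permitted by RFC 1191


def checked_pmtu(current_pmtu: int, advertised_mtu: int,
                 policy_floor: int = 576) -> int:
    """Return the Path MTU to use after an ICMP report.

    policy_floor is a hypothetical local policy value, not something
    defined by MPA.
    """
    floor = max(IPV4_MIN_MTU, policy_floor)
    if advertised_mtu < floor:
        # Suspiciously small value: keep the current estimate.
        return current_pmtu
    if advertised_mtu >= current_pmtu:
        # These ICMP reports can only lower the estimate.
        return current_pmtu
    return advertised_mtu
```

   Ignoring a small value trades robustness against attack for the
   risk of not following a genuinely small path MTU; clamping to the
   floor instead of ignoring the report is an alternative policy.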

1784	   [RDMA] and [DDP] are used to control, read and write data buffers
1785	   over IP networks. Therefore, the control and the data packets of
1786	   these protocols are vulnerable to the spoofing, tampering and
1787	   information disclosure attacks listed below.  In addition, Connection
1788	   to/from an unauthorized or unauthenticated endpoint is a potential
1789	   problem with most applications using RDMA, DDP, and MPA.

1791	8.1.1  Spoofing

1793	   Spoofing attacks can be launched by the Remote Peer, or by a network
1794	   based attacker. A network based spoofing attack applies to all Remote
1795	   Peers. Because the MPA Stream requires a TCP Stream in the
1796	   ESTABLISHED state, certain types of traditional forms of wire attacks
1797	   do not apply -- an end-to-end handshake must have occurred to
1798	   establish the MPA Stream. So, the only form of spoofing that applies
1799	   is one when a remote node can both send and receive packets. Yet even
1800	   with this limitation the Stream is still exposed to the following
1801	   spoofing attacks.

1803	8.1.1.1  Impersonation

1805	   A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
1806	   (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
1807	   Stream with the victim. End to end authentication (i.e. IPsec or ULP
1808	   authentication) provides protection against this attack.

1810	8.1.1.2  Stream Hijacking

1812	   Stream hijacking happens when a network based attacker follows the
1813	   Stream establishment phase, and waits until the authentication phase
1814	   (if such a phase exists) is completed successfully. The attacker can
1815	   then spoof the IP address and re-direct the Stream from the victim to
1816	   the attacker's own machine. For example, an attacker can wait until
1817	   an iSCSI authentication is completed successfully, and hijack the
1818	   iSCSI Stream.

1820	   The best protection against this form of attack is end-to-end
1821	   integrity protection and authentication, such as IPsec to prevent
1822	   spoofing. Another option is to provide physical security. Discussion
1823	   of physical security is out of scope for this document.

1825	8.1.1.3  Man in the Middle Attack

1827	   If a network based attacker has the ability to delete, inject,
1828	   replay, or modify packets which will still be accepted by MPA (e.g.,
1829	   TCP sequence number is correct, FPDU is valid etc.) then the Stream
1830	   can be exposed to a man in the middle attack. The attacker could
1831	   potentially use the services of [DDP] and [RDMA] to read the
1832	   contents of the associated data buffer, modify the contents of the
1833	   associated data buffer, or to disable further access to the buffer.
1834	   The only countermeasure for this form of attack is to either secure
1835	   the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to
1836	   provide physical security to prevent man-in-the-middle type attacks.

1838	   The best protection against this form of attack is end-to-end
1839	   integrity protection and authentication, such as IPsec, to prevent
1840	   spoofing or tampering. If Stream or session level authentication and
1841	   integrity protection are not used, then a man-in-the-middle attack
1842	   can occur, enabling spoofing and tampering.

1844	   Another approach is to restrict access to only the local subnet/link,
1845	   and provide some mechanism to limit access, such as physical security
1846	   or 802.1X. This model is an extremely limited deployment scenario,
1847	   and will not be further examined here.

1849	8.1.2  Eavesdropping

1851	   Generally speaking, Stream confidentiality protects against
1852	   eavesdropping. Stream and/or session authentication and integrity
1853	   protection are countermeasures against various spoofing and
1854	   tampering attacks. The effectiveness of authentication and integrity
1855	   against a specific attack depends on whether the authentication is
1856	   machine level authentication (as provided by IPsec), or ULP
1857	   authentication.

1859	8.2  Introduction to Security Options

1861	   The following security services can be applied to an MPA/DDP/RDMAP
1862	   Stream:

1864	   1.  Session confidentiality - protects against eavesdropping.

1866	   2.  Per-packet data source authentication - protects against the
1867	   following spoofing attacks: network based impersonation, Stream
1868	   hijacking, and man in the middle.

1870	   3.  Per-packet integrity - protects against tampering done by
1871	   network based modification of FPDUs (indirectly affecting buffer
1872	   content through DDP services).

1874	   4.  Packet sequencing - protects against replay attacks, which is
1875	   a special case of the above tampering attack.

1877	   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
1878	   or Stream hijacking attacks, it is recommended that the Stream be
1879	   authenticated, integrity protected, and protected from replay
1880	   attacks; it may use confidentiality protection to protect from
1881	   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
1882	   network).

1884	   IPsec is capable of providing the above security services for IP and
1885	   TCP traffic.

1887	   ULP protocols may be able to provide part of the above security
1888	   services. See [NFSv4CHANNEL] for additional information on a
1889	   promising approach called "channel binding". From [NFSv4CHANNEL]:

1891	        "The concept of channel bindings allows applications to prove
1892	        that the end-points of two secure channels at different network
1893	        layers are the same by binding authentication at one channel to
1894	        the session protection at the other channel.  The use of channel
1895	        bindings allows applications to delegate session protection to
1896	        lower layers, which may significantly improve performance for
1897	        some applications."

1899	8.3  Using IPsec With MPA

1901	   IPsec can be used to protect against the packet injection attacks
1902	   outlined above.  Because IPsec is designed to secure individual IP
1903	   packets, MPA can run above IPsec without change.  IPsec packets are
1904	   processed (e.g., integrity checked and decrypted) in the order they
1905	   are received, and an MPA receiver will process the decrypted FPDUs
1906	   contained in these packets in the same manner as FPDUs contained in
1907	   unsecured IP packets.

1909	   MPA Implementations MUST implement IPsec.  The use of IPsec is up to
1910	   ULPs and administrators.

1912	8.4  Requirements for IPsec Encapsulation of DDP

1914	   The IP Storage working group has spent significant time and effort to
1915	   define the normative IPsec requirements for IP Storage [RFC3723].
1916	   Portions of that specification are applicable to a wide variety of
1917	   protocols, including the RDDP protocol suite. In order to not
1918	   replicate this effort, an RNIC implementation MUST follow the
1919	   requirements defined in RFC3723 Section 2.3 and Section 5, including
1920	   the associated normative references for those sections.

1922	   Additionally, since IPsec acceleration hardware may only be able to
1923	   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
1924	   messages may be sent for idle SAs, as a means of keeping the number
1925	   of active Phase 2 SAs to a minimum. The receipt of an IKE Phase 2
1926	   delete message MUST NOT be interpreted as a reason for tearing down
1927	   a DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
1928	   and if additional traffic is sent on it, to bring up another IKE
1929	   Phase 2 SA to protect it. This avoids the potential for continually
1930	   bringing Streams up and down.

1932	   Note that there are serious security issues if IPsec is not
1933	   implemented end-to-end. For example, if IPsec is implemented as a
1934	   tunnel in the middle of the network, any hosts between the peer and
1935	   the IPsec tunneling device can freely attack the unprotected Stream.

1937	9  IANA Considerations

1939	   If a well-known port is chosen as the mechanism to identify a DDP on
1940	   MPA on TCP, the well-known port must be registered with IANA.
1941	   Because the use of the port is DDP specific, registration of the port
1942	   with IANA is left to DDP.

1944	10 References

1946	10.1 Normative References

1948	   [iSCSI] Satran, J., "Internet Small Computer Systems Interface
1949	       (iSCSI)", RFC 3720, April 2004.

1951	   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
1952	       November 1990.

1954	   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
1955	       Selective Acknowledgment Options", RFC 2018, October 1996.

1957	   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
1958	       3", BCP 9, RFC 2026, October 1996.

1960	   [RFC3667] Bradner, S., "IETF Rights in Contributions", BCP 78, RFC
1961	       3667, February 2004.

1963	   [RFC3668] Bradner, S., Ed., "Intellectual Property Rights in IETF
1964	       Technology", BCP 79, RFC 3668, February 2004.

1966	   [RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
1967	       IP", RFC 3723, April 2004.

1969	   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
1970	       Program Protocol Specification", RFC 793, September 1981.

1972	   [RDMASEC]  Pinkerton J., Deleganes E., Romanow A., Bitan S.,
1973	       "DDP/RDMAP Security", draft-ietf-rddp-security-06.txt (work in
1974	       progress), December 2004.

1976	10.2 Informative References

1978	   [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
1979	       disagree", ACM Sigcomm, Sept. 2000.

1981	   [DDP] H. Shah et al., "Direct Data Placement over Reliable
1982	       Transports", draft-ietf-rddp-ddp-04.txt (Work in progress),
1983	       February 2005.

1985	   [RFC2401]  Atkinson, R., Kent, S., "Security Architecture for the
1986	       Internet Protocol", RFC 2401, November 1998.

1988	   [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
1989	       896, January 1984.

1991	   [NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B.,
1992	       "Application performance pitfalls and TCP's Nagle algorithm",
1993	       Workshop on Internet Server Performance, May 1999.

1995	   [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
1996	       Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
1997	       bindings-02.txt, July 2004.

1999	   [RDMA] R. Recio et al., "RDMA Protocol Specification",
2000	       draft-ietf-rddp-rdmap-03.txt, February 2005.

2002	   [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
2003	       RFC 2960, October 2000.

2005	   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
2006	       September 1981.

2008	   [RFC1122] Braden, R.T., "Requirements for Internet hosts -
2009	       communication layers", RFC 1122, October 1989.

2011	   [ELZUR-MPA] Elzur, U., "Analysis of MPA over TCP Operations" draft-
2012	       elzur-iwarp-mpa-tcp-analysis-00.txt, February 2003.

2014	   [Verbs] J. Hilland et al., "RDMA Protocol Verbs Specification" draft-
2015	       hilland-rddp-verbs-00.txt, April 2003.

2017	11 Appendix

2019	   This appendix is for information only and is NOT part of the
2020	   standard.

2022	11.1 Analysis of MPA over TCP Operations

2024	   This appendix analyzes the impact of MPA (Marker PDU Aligned Framing
2025	   for TCP [MPA]) on the TCP sender, receiver, and wire protocol.

2027	   One of MPA's high level goals is to provide enough information, when
2028	   combined with the Direct Data Placement Protocol [DDP], to enable
2029	   out-of-order placement of DDP payload into the final Upper Layer
2030	   Protocol (ULP) buffer. Note that DDP separates the act of placing
2031	   data into a ULP buffer from that of notifying the ULP that the ULP
2032	   buffer is available for use. In DDP terminology, the former is
2033	   defined as "Placement", and the latter is defined as "Delivery". MPA
2034	   supports in-order delivery of the data to the ULP, including support
2035	   for Direct Data Placement in the final ULP buffer location when TCP
2036	   segments arrive out-of-order. Effectively, the goal is to use the
2037	   pre-posted ULP buffers as the TCP receive buffer, where the
2038	   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
2039	   DDP) is done in place, in the ULP buffer, with no data copies.

2041	   This Appendix walks through the advantages and disadvantages of the
2042	   TCP sender modifications proposed by MPA:

2044	   1) that MPA require the TCP sender to do "Header Alignment", where a
2045	   TCP segment is required to begin with an MPA Framing Protocol Data
2046	   Unit (FPDU) (if there is payload present).

2048	   2) that there be an integral number of FPDUs in a TCP segment (under
2049	   conditions where the Path MTU is not changing).

2051	   This Appendix concludes that the scaling advantages of Header
2052	   Alignment are strong, based primarily on fairly drastic TCP receive
2053	   buffer reduction requirements and simplified receive handling. The
2054	   analysis also shows that there is little effect to TCP wire behavior.

2056	11.1.1 Assumptions

2058	11.1.1.1 MPA is layered beneath DDP [DDP]

2060	   MPA is an adaptation layer between DDP and TCP.  DDP requires
2061	   preservation of DDP segment boundaries and a CRC32C digest covering
2062	   the DDP header and data.   MPA adds these features to the TCP stream
2063	   so that DDP over TCP has the same basic properties as DDP over SCTP.

2065	11.1.1.2 MPA preserves DDP message framing

2067	   MPA was designed as a framing layer specifically for DDP and was not
2068	   intended as a general-purpose framing layer for any other ULP using
2069	   TCP.

2071	   A framing layer allows ULPs using it to receive indications from the
2072	   transport layer only when complete ULPDUs are present.  As a framing
2073	   layer, MPA is not aware of the content of the DDP PDU, only that it
2074	   has received and, if necessary, reassembled a complete PDU for
2075	   delivery to the DDP.

2077	11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
2078	      normal conditions

2080	   To make reception of a complete DDP PDU on every received segment
2081	   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
2082	   the underlying fabric. Each FPDU that MPA creates contains sufficient
2083	   information for the receiver to directly place the ULP payload in the
2084	   correct location in the correct receive buffer.

2086	   Edge cases when this condition does not occur are dealt with, but do
2087	   not need to be on the fast path.

2089	11.1.1.4 Out-of-order placement but NO out-of-order delivery

2091	   DDP receives complete DDP PDUs from MPA.  Each DDP PDU contains the
2092	   information necessary to place its ULP payload directly in the
2093	   correct location in host memory.

2095	   Because each DDP segment is self-describing, it is possible for DDP
2096	   segments received out of order to have their ULP payload placed
2097	   immediately in the ULP receive buffer.

2099	   Data delivery to the ULP is guaranteed to be in the order the data
2100	   was sent.  DDP only indicates data delivery to the ULP after TCP has
2101	   acknowledged the complete byte stream.

2103	11.1.2 The Value of Header Alignment

2105	   Significant receiver optimizations can be achieved when Header
2106	   Alignment and complete FPDUs are the common case. The optimizations
2107	   allow utilizing significantly fewer buffers on the receiver and less
2108	   computation per FPDU. The net effect is the ability to build a "Flow-
2109	   Through" receiver that enables TCP-based solutions to scale to 10G
2110	   and beyond in an economical way. The optimizations are especially
2111	   relevant to hardware implementations of receivers that process
2112	   multiple protocol layers - Data Link Layer (e.g., Ethernet), Network
2113	   and Transport Layer (e.g., TCP/IP), and even some ULP on top of TCP
2114	   (e.g., MPA/DDP). As network speed increases, there is an increasing
2115	   desire to use a hardware based receiver in order to achieve an
2116	   efficient high performance solution.

2118	   A TCP receiver, under worst case conditions, has to allocate buffers
2119	   (BufferSizeTCP) whose capacities are a function of the bandwidth-
2120	   delay product. Thus:

2122	        BufferSizeTCP = K * bandwidth [octets/s] * delay [s].

2124	   Where bandwidth is the end-to-end bandwidth of the connection, delay
2125	   is the round trip delay of the connection, and K is an implementation
2126	   dependent constant.

2128	   Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
2129	   buffers for a 10x increase in end-to-end bandwidth). As this
2130	   buffering approach may scale poorly for hardware or software
2131	   implementations alike, several approaches allow reduction in the
2132	   amount of buffering required for high-speed TCP communication.

2134	   The MPA/DDP approach is to enable the ULP's buffer to be used as the
2135	   TCP receive buffer. If the application pre-posts a sufficient amount
2136	   of buffering, and each TCP segment has sufficient information to
2137	   place the payload into the right application buffer, when an out-of-
2138	   order TCP segment arrives it could potentially be placed directly in
2139	   the ULP buffer. However, placement can only be done when a complete
2140	   FPDU with the placement information is available to the receiver, and
2141	   the FPDU contents contain enough information to place the data into
2142	   the correct ULP buffer (e.g., there is a DDP header available).

2144	   For the case when the FPDU is not aligned with the TCP segment, it
2145	   may take, on average, 2 TCP segments to assemble one FPDU. Therefore,
2146	   the receiver has to allocate BufferSizeNAF (Buffer Size, Non-Aligned
2147	   FPDU) octets:

2149	       BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

2151	   Where K1 and K2 are implementation dependent constants and EMSS is
2152	   the effective maximum segment size.

2154	   For example, a 1 Gbps link with 10,000 connections and an EMSS of
2155	   1500B would require 15 MB of memory. Often the number of connections
2156	   used scales with the network speed, aggravating the situation for
2157	   higher speeds.

2159	   A Header Aligned FPDU would allow the receiver to allocate
2160	   BufferSizeAF (Buffer Size, Aligned FPDU) octets:

2162	       BufferSizeAF = K2 * EMSS

2164	   for the same conditions. A Header Aligned receiver may require memory
2165	   in the range of ~100s of KB - which is feasible for an on-chip memory
2166	   and enables a "Flow-Through" design, in which the data flows through
2167	   the NIC and is placed directly in the destination buffer. Assuming
2168	   most of the connections support Header Alignment, the receiver
2169	   buffers no longer scale with number of connections.
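
   The three buffer-size formulas above can be compared numerically
   with a short sketch.  The function names are hypothetical, and the
   constants used in the usage note (K1 = 1, K2 = 0 or 100) are
   assumptions chosen only to reproduce the orders of magnitude quoted
   in the text; K, K1 and K2 are implementation dependent.

```python
def buffer_size_tcp(k: float, bandwidth_octets_per_s: float,
                    delay_s: float) -> float:
    """BufferSizeTCP = K * bandwidth [octets/s] * delay [s]."""
    return k * bandwidth_octets_per_s * delay_s


def buffer_size_naf(k1: int, k2: int, emss: int,
                    number_of_connections: int) -> int:
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * number_of_connections + k2 * emss


def buffer_size_af(k2: int, emss: int) -> int:
    """BufferSizeAF = K2 * EMSS (independent of connection count)."""
    return k2 * emss
```

   With K1 = 1 and K2 = 0, buffer_size_naf(1, 0, 1500, 10000) gives
   15,000,000 octets, matching the 15 MB example above.  With an
   assumed K2 of 100, buffer_size_af(100, 1500) gives 150,000 octets,
   which is in the "100s of KB" range and does not change as
   connections are added.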

2171	   Additional optimizations can be achieved in a balanced I/O sub-system
2172	   -- where the system interface of the network controller provides
2173	   ample bandwidth as compared with the network bandwidth. For almost
2174	   twenty years this has been the case and the trend is expected to
2175	   continue - while Ethernet speeds have scaled by 1000 (from 10
2176	   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
2177	   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
2178	   PCI-X DDR). Under these conditions, the Header Aligned FPDU approach
2179	   allows BufferSizeAF to be indifferent to network speed. It is
2180	   primarily a function of the local processing time for a given frame.
2181	   Thus when the Header Aligned FPDU approach is used, receive buffering
2182	   is expected to scale gracefully (i.e. less than linear scaling) as
2183	   network speed is increased.

2185	11.1.2.1 Impact of lack of Header Alignment on the receiver
2186	      computational load and complexity

2188	   The receiver must perform IP and TCP processing, and then perform
2189	   FPDU CRC checks, before it can trust the FPDU header placement
2190	   information. For simplicity of the description, the assumption is
2191	   that an FPDU is carried in no more than 2 TCP segments. In reality,
2192	   with no Header Alignment, an FPDU can be carried by more than 2 TCP
2193	   segments (e.g., if the PMTU was reduced).

2195	   ----++-----------------------------++-----------------------++-----
2196	   +---||---------------+    +--------||--------+   +----------||----+
2197	   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
2198	   +---||---------------+    +--------||--------+   +----------||----+
2199	   ----++-----------------------------++-----------------------++-----
2200	                   FPDU #N-1                  FPDU #N

2202	       Figure 10: Non-aligned FPDU freely placed in TCP octet stream

2204	   The receiver algorithm for processing TCP segments (e.g., TCP segment
2205	   #X in Figure 10: Non-aligned FPDU freely placed in TCP octet stream)
2206	   carrying non-aligned FPDUs (in-order or out-of-order) includes:

2208	      Data Link Layer processing (whole frame) - typically including a
2209	          CRC calculation.

2211	      1.  Network Layer processing (assuming not an IP fragment, the
2212	          whole Data Link Layer frame contains one IP datagram. IP
2213	          fragments should be reassembled in a local buffer. This is not
2214	          a performance optimization goal)

2216	      2.  Transport Layer processing -- TCP protocol processing, header
2217	          and checksum checks.

2219	          a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
2220	              IP DST, TCP SRC Port, TCP DST Port, protocol)

2222	      3.  Find FPDU message boundaries.

2224	          a.  Get MPA state information for the connection

2226	              If the TCP segment is in-order, use the receiver managed
2227	                  MPA state information to calculate where the previous
2228	                  FPDU message (#N-1) ends in the current TCP segment X.
2229	                  (previously, when the MPA receiver processed the first
2230	                  part of FPDU #N-1, it calculated the number of bytes
2231	                  remaining to complete FPDU #N-1 by using the MPA
2232	                  Length field).

2234	                  Get the stored partial CRC for FPDU #N-1

2236	                  Complete CRC calculation for FPDU #N-1 data (first
2237	                      portion of TCP segment #X)

2239	                  Check CRC calculation for FPDU #N-1

2241	                  If no FPDU CRC errors, placement is allowed

2243	                  Locate the local buffer for the first portion of
2244	                      FPDU#N-1, CopyData(local buffer of first portion
2245	                      of FPDU #N-1, host buffer address, length)

2247	                  Compute host buffer address for second portion of FPDU
2248	                      #N-1

2250	                  CopyData (local buffer of second portion of FPDU #N-1,
2251	                      host buffer address for second portion, length)

2253	                  Calculate the octet offset into the TCP segment for
2254	                      the next FPDU #N.

2256	                  Start Calculation of CRC for available data for FPDU
2257	                      #N

2259	                  Store partial CRC results for FPDU #N

2261	                  Store local buffer address of first portion of FPDU #N

2263	                  No further action is possible on FPDU #N, before it is
2264	                      completely received

2266	              If the TCP segment is out-of-order, the receiver must
2267	                  buffer the data until at least one complete FPDU is
2268	                  received. Typically, buffering for more than one TCP
2269	                  segment per connection is required. Use the MPA
2270	                  Markers to calculate where the FPDU boundaries are.

2272	                  When a complete FPDU is available, a similar procedure
2273	                      to the in-order algorithm above is used. There is
2274	                      additional complexity, though, because when the
2275	                      missing segment arrives, this TCP segment must be
2276	                      run through the CRC engine after the CRC is
2277	                      calculated for the missing segment.
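
The in-order branch of the non-aligned algorithm above can be sketched
as follows. This is a minimal illustration, not the draft's method: it
assumes an FPDU layout of a 2-octet ULPDU_Length, the ULPDU, padding to
a 4-octet boundary and a 4-octet CRC, with Markers disabled. It also
uses zlib.crc32 as a stand-in for the CRC that MPA specifies, because
the standard library supports continuing a CRC from a stored partial
value. The names NonAlignedReceiver, build_fpdu and placed are
hypothetical.

```python
import struct
import zlib


def build_fpdu(ulpdu: bytes) -> bytes:
    """Hypothetical helper: length + ULPDU + pad to 4 octets + CRC (no Markers)."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", zlib.crc32(body))


class NonAlignedReceiver:
    """In-order receiver that keeps per-connection MPA state between segments."""

    def __init__(self) -> None:
        self.local = bytearray()  # local buffer holding the partial FPDU
        self.body_len = None      # octets covered by the CRC, once known
        self.ulpdu_len = 0
        self.partial_crc = 0      # stored partial CRC for the current FPDU
        self.placed = []          # stands in for the host buffer

    def on_segment(self, payload: bytes) -> None:
        data = payload
        while data:
            if self.body_len is None:
                # Collect the 2-octet length field, which may itself be split.
                need = 2 - len(self.local)
                self.local += data[:need]
                data = data[need:]
                if len(self.local) < 2:
                    return
                (self.ulpdu_len,) = struct.unpack("!H", self.local)
                unpadded = 2 + self.ulpdu_len
                self.body_len = unpadded + (-unpadded % 4)
                self.partial_crc = zlib.crc32(bytes(self.local))
            total = self.body_len + 4
            take = data[: total - len(self.local)]
            data = data[len(take):]
            # Extend the stored partial CRC over the newly arrived body octets.
            body_room = max(0, self.body_len - len(self.local))
            self.partial_crc = zlib.crc32(take[:body_room], self.partial_crc)
            self.local += take
            if len(self.local) == total:
                (wire_crc,) = struct.unpack_from("!I", self.local, self.body_len)
                if wire_crc != self.partial_crc:
                    raise ValueError("FPDU CRC error: placement not allowed")
                # CopyData: place the completed FPDU payload.
                self.placed.append(bytes(self.local[2:2 + self.ulpdu_len]))
                self.local = bytearray()
                self.body_len = None
```

Feeding the receiver a stream cut at arbitrary octet offsets exercises
the stored length, partial CRC and local buffer on every segment, which
is the state access cost that the Header Aligned case avoids.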

2279	   If we assume Header Alignment, the following diagram and the
2280	   algorithm below apply. Note that when using MPA, the receiver is
2281	   assumed to actively detect presence or loss of Header Alignment for
2282	   every TCP segment received.

2284	      +--------------------------+      +--------------------------+
2285	   +--|--------------------------+   +--|--------------------------+
2286	   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
2287	   +--|--------------------------+   +--|--------------------------+
2288	      +--------------------------+      +--------------------------+
2289	                FPDU #N                          FPDU #N+1

2291	        Figure 11: Aligned FPDU placed immediately after TCP header

2293	   The receiver algorithm for Header Aligned frames (in-order or out-of-
2294	   order) includes:

2296	       1)  Data Link Layer processing (whole frame) - typically
2297	           including a CRC calculation.

2299	       2)  Network Layer processing (assuming not an IP fragment, the
2300	           whole Data Link Layer frame contains one IP datagram. IP
2301	           fragments should be reassembled in a local buffer. This is
2302	           not a performance optimization goal)

2304	       3)  Transport Layer processing -- TCP protocol processing, header
2305	           and checksum checks.

2307	           a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
2308	               IP DST, TCP SRC Port, TCP DST Port, protocol)

2310	       4)  Check for Header Alignment. (Described in detail in [MPA]
2311	           section 7.4). Assuming Header Alignment for the rest of the
2312	           algorithm below.

2314	           a.  If the header is not aligned, see the algorithm defined
2315	               in the prior section.

2317	       5)  Whether the TCP segment is in-order or out-of-order, the MPA
2318	           header is at the beginning of the current TCP payload. Get
2319	           the FPDU length from the FPDU header.

2321	       6)  Calculate CRC over FPDU

2323	       7)  Check CRC calculation for FPDU #N

2325	       8)  If no FPDU CRC errors, placement is allowed

2327	       9)  CopyData(TCP segment #X, host buffer address, length)

2329	       10) Loop to #5 until all the FPDUs in the TCP segment are
2330	           consumed in order to handle FPDU packing.
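
Steps 5 through 10 of the Header Aligned algorithm can be sketched as a
single pass over one TCP payload. As an assumption for illustration,
the FPDU layout is a 2-octet ULPDU_Length, the ULPDU, padding to a
4-octet boundary and a 4-octet CRC, with Markers disabled, and
zlib.crc32 stands in for the CRC that MPA specifies. The function name
place_aligned_segment and the helper build_fpdu are hypothetical.

```python
import struct
import zlib


def build_fpdu(ulpdu: bytes) -> bytes:
    """Hypothetical helper: length + ULPDU + pad to 4 octets + CRC (no Markers)."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", zlib.crc32(body))


def place_aligned_segment(payload: bytes) -> list:
    """Return the ULPDUs of every complete FPDU packed into one TCP payload."""
    placed = []
    offset = 0
    while offset < len(payload):
        # Step 5: the FPDU header sits at the current offset.
        if offset + 2 > len(payload):
            raise ValueError("Header Alignment lost: truncated length field")
        (ulpdu_len,) = struct.unpack_from("!H", payload, offset)
        body_len = 2 + ulpdu_len
        body_len += -body_len % 4
        end = offset + body_len + 4
        if end > len(payload):
            raise ValueError("Header Alignment lost: FPDU spans segments")
        # Steps 6 to 8: compute and check the CRC in one pass.
        body = payload[offset:offset + body_len]
        (wire_crc,) = struct.unpack_from("!I", payload, offset + body_len)
        if zlib.crc32(body) != wire_crc:
            raise ValueError("FPDU CRC error: placement not allowed")
        # Step 9: CopyData, modeled here as appending to a list.
        placed.append(body[2:2 + ulpdu_len])
        # Step 10: loop to handle FPDU packing.
        offset = end
    return placed
```

No state is carried from one segment to the next in this sketch. A
raised ValueError marks the point where a real receiver would fall back
to the non-aligned algorithm.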

2332	   Implementation note: In both cases the receiver has to classify the
2333	   incoming TCP segment and associate it with one of the flows it
2334	   maintains. In the case of no Header Alignment, the receiver is forced
2335	   to classify incoming traffic before it can calculate the FPDU CRC. In
2336	   the case of Header Alignment the operations order is left to the
2337	   implementer.

2339	   The Header Aligned receiver algorithm is significantly simpler. There
2340	   is no need to locally buffer portions of FPDUs. Accessing state
2341	   information is also substantially simplified - the normal case does
2342	   not require retrieving information to find out where an FPDU starts
2343	   and ends or retrieval of a partial CRC before the CRC calculation can
2344	   commence. This avoids adding internal latencies, having multiple data
2345	   passes through the CRC machine, or scheduling multiple commands for
2346	   moving the data to the host buffer.

2348	   The aligned FPDU approach is useful for in-order and out-of-order
2349	   reception. The receiver can use the same mechanisms for data storage
2350	   in both cases, and only needs to account for when all the TCP
2351	   segments have arrived to enable delivery. Header Alignment,
2352	   along with the high probability that at least one complete FPDU is
2353	   found with every TCP segment, allows the receiver to perform data
2354	   placement for out-of-order TCP segments with no need for intermediate
2355	   buffering. Essentially the TCP receive buffer has been eliminated and
2356	   TCP reassembly is done in place within the ULP buffer.

2358	   In case Header Alignment is not found, the receiver should follow the
2359	   algorithm for non-aligned FPDU reception, which may be slower and
2360	   less efficient.

2362	11.1.2.2 Header Alignment effects on TCP wire protocol

2364	      An MPA-aware TCP exposes its EMSS to MPA.  MPA uses the EMSS to
2365	      calculate its MULPDU, which it then exposes to DDP, its ULP.  DDP
2366	      uses the MULPDU to segment its payload so that each FPDU sent by
2367	      MPA fits completely into one TCP segment. This has no impact on
2368	      wire protocol and exposing this information is already supported
2369	      on many TCP implementations, including all modern flavors of BSD
2370	      networking, through the TCP_MAXSEG socket option.

2372	   In the common case, the ULP (i.e. DDP over MPA) messages provided to
2373	   the TCP layer are segmented to MULPDU size. It is assumed that the
2374	   ULP message size is bounded by MULPDU, such that a single ULP message
2375	   can be encapsulated in a single TCP segment. Therefore, in the common
2376	   case, there is no increase in the number of TCP segments emitted. For
2377	   smaller ULP messages, the sender can also apply packing, i.e. the
2378	   sender packs as many complete FPDUs as possible into one TCP segment.
2379	   The requirement to always send complete FPDUs may increase the
2380	   number of TCP segments emitted. Typically, a ULP message size varies
2381	   from a few bytes to multiple EMSS (e.g., 64 Kbytes). In some cases
2382	   the ULP may post more than one message at a time for transmission,
2383	   giving the sender an opportunity for packing. When more than one
2384	   FPDU is available for transmission and the TCP segment being built
2385	   has no room for the next complete FPDU, that FPDU is sent in
2386	   another TCP segment. In this corner case some of the TCP segments
2387	   are not full size. In the worst case scenario, the ULP may choose an
2388	   FPDU size of EMSS/2 + 1 and have multiple messages available for
2389	   transmission. For this poor choice of FPDU size, the average TCP
2390	   segment size is about 1/2 of the EMSS, and the number of TCP
2391	   segments emitted approaches 2x of what is possible without the
2392	   requirement to encapsulate an integer number of complete FPDUs in
2393	   every TCP segment. This is a dynamic situation that lasts only
2394	   while the sender ULP has multiple non-optimal messages for
2395	   transmission, and it has only a minor impact on wire
2396	   utilization.
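
The worst case described above can be checked with a short
calculation. The sketch below counts TCP segments for a run of
equal-size FPDUs, once with the requirement that every segment carries
only complete FPDUs and once treating the FPDUs as a plain octet
stream. The EMSS of 1460 octets and the function names are illustrative
assumptions.

```python
def segments_with_whole_fpdus(fpdu_size: int, count: int, emss: int) -> int:
    """Segments needed when each segment holds an integer number of FPDUs."""
    per_segment = emss // fpdu_size
    if per_segment == 0:
        raise ValueError("FPDU does not fit in one TCP segment")
    return -(-count // per_segment)  # ceiling division


def segments_as_octet_stream(fpdu_size: int, count: int, emss: int) -> int:
    """Segments needed when FPDUs may be split freely across segments."""
    return -(-(fpdu_size * count) // emss)


emss = 1460                 # illustrative value only
worst = emss // 2 + 1       # FPDU size of EMSS/2 + 1
aligned = segments_with_whole_fpdus(worst, 1000, emss)
stream = segments_as_octet_stream(worst, 1000, emss)
print(aligned, stream, aligned / stream)
```

With these numbers only one 731-octet FPDU fits in each segment, so
1000 FPDUs need 1000 segments instead of 501, a ratio just under 2x.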

2398	   However, it is not expected that requiring Header Alignment will have
2399	   a measurable impact on wire behavior of most applications. Throughput
2400	   applications with large I/Os are expected to take full advantage of
2401	   the EMSS.  Another class of applications with many small outstanding
2402	   buffers (as compared to EMSS) is expected to use packing when
2403	   applicable. Transaction oriented applications are also optimal.

2405	   TCP retransmission is another area that can affect sender behavior.
2406	   TCP supports retransmission of the exact, originally transmitted
2407	   segment (see [RFC0793] section 2.6, [RFC0793] section 3.7 "managing
2408	   the window" and [RFC1122] section 4.2.2.15 ). In the unlikely event
2409	   that part of the original segment has been received and acknowledged
2410	   by the remote peer (e.g., a re-segmenting middle box, as documented
2411	   in 5.4.1 Re-segmenting Middle boxes and non MPA-aware TCP senders on
2412	   page 26), a better available bandwidth utilization may be possible by
2413	   re-transmitting only the missing octets. If an MPA-aware TCP
2414	   retransmits complete FPDUs, there may be some marginal bandwidth
2415	   loss.

2417	   Another area where a change in the number of TCP segments may have
2418	   an impact is that of Slow Start and Congestion Avoidance. Slow-start
2419	   exponential increase is measured in segments per second, as the
2420	   algorithm focuses on the overhead per segment at the source for
2421	   congestion that eventually results in dropped segments. Slow-start
2422	   exponential bandwidth growth for MPA-aware TCP is similar to any TCP
2423	   implementation. Congestion Avoidance allows for a linear growth in
2424	   available bandwidth when recovering after a packet drop. Similar to
2425	   the analysis for slow-start, MPA-aware TCP doesn't change the
2426	   behavior of the algorithm. Therefore the average size of the segment
2427	   versus EMSS is not a major factor in the assessment of the bandwidth
2428	   growth for a sender. Both Slow Start and Congestion Avoidance for an
2429	   MPA-aware TCP will behave similarly to any TCP sender and allow an
2430	   MPA-aware TCP to enjoy the theoretical performance limits of the
2431	   algorithms.

2433	   In summary, the ULP messages generated at the sender (e.g., the
2434	   number of messages grouped for every transmission request) and the
2435	   message size distribution have the most significant impact on the
2436	   number of TCP segments emitted. The worst case effect for certain
2437	   ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by
2438	   an increase of up to 2x in the number of TCP segments and
2439	   acknowledgments.  In reality the effect is expected to be marginal.

2441	11.2 Receiver implementation

2443	   Transport & Network Layer Reassembly Buffers:

2445	   The use of reassembly buffers (either TCP reassembly buffers or IP
2446	   fragmentation reassembly buffers) is implementation dependent. When
2447	   MPA is enabled, reassembly buffers are needed if out of order packets
2448	   arrive and Markers are not enabled.  Buffers are also needed if FPDU
2449	   Alignment is lost or if IP fragmentation occurs. This is because the
2450	   incoming out of order segment may not contain enough information for
2451	   MPA to process all of the FPDU. For cases where a re-segmenting
2452	   middle box is present, or where the TCP sender is not MPA-aware, the
2453	   presence of markers significantly reduces the amount of buffering
2454	   needed.

2456	   Recovery from IP Fragmentation must be transparent to the MPA
2457	   Consumers.

2459	11.2.1 Network Layer Reassembly Buffers

2461	   Most IP implementations set the IP Don't Fragment bit. Thus upon a
2462	   path MTU change, intermediate devices drop the IP datagram if it is
2463	   too large and reply with an ICMP message which tells the source TCP
2464	   that the path MTU has changed. This causes TCP to emit segments
2465	   conformant with the new path MTU size. Thus IP fragments should not,
2466	   under most conditions, occur at the receiver, but they are possible.

2468	   There are several options for implementation of network layer
2469	   reassembly buffers:

2471	   1.  drop any IP fragments, and reply with an ICMP message according
2472	       to [RFC792] (fragmentation needed and DF set) to tell the Remote
2473	       Peer to resize its TCP segment

2475	   2.  support an IP reassembly buffer, but have it of limited size
2476	       (possibly the same size as the local link's MTU). The end Node
2477	       would normally never advertise a path MTU larger than the local
2478	       link MTU. It is recommended that a dropped IP fragment cause an
2479	       ICMP message to be generated according to [RFC792].

2481	   3.  multiple IP reassembly buffers, of effectively unlimited size.

2483	   4.  support an IP reassembly buffer for the largest IP datagram (64
2484	       KB).

2486	   5.  support for a large IP reassembly buffer which could span
2487	       multiple IP datagrams.

2489	   An implementation should support at least option 2 or 3 above, to
2490	   avoid dropping packets that have traversed the entire fabric.

2492	   There is no end-to-end ACK for IP reassembly buffers, so there is no
2493	   flow control on the buffer. The only end-to-end ACK is a TCP ACK,
2494	   which can only occur when a complete IP datagram is delivered to TCP.
2495	   Because of this, under worst case, pathological scenarios, the
2496	   largest IP reassembly buffer is the TCP receive window (to buffer
2497	   multiple IP datagrams that have all been fragmented).

2499	   Note that if the Remote Peer does not implement re-segmentation of
2500	   the data stream upon receiving the ICMP reply updating the path MTU,
2501	   it is possible to halt forward progress because the opposite peer
2502	   would continue to retransmit using a transport segment size that is
2503	   too large. This deadlock scenario is no different than if the fabric
2504	   MTU (not last hop MTU) was reduced after connection setup, and the
2505	   remote Node's behavior is not compliant with [RFC1122].

2507	11.2.2 TCP Reassembly buffers

2509	   A TCP reassembly buffer is also needed. TCP reassembly buffers are
2510	   needed if FPDU Alignment is lost when using TCP with MPA or when the
2511	   MPA FPDU spans multiple TCP segments.  Buffers are also needed if
2512	   Markers are disabled and out of order packets arrive.

2514	   Since lost FPDU Alignment often means that FPDUs are incomplete, an
2515	   MPA on TCP implementation must have a reassembly buffer large enough
2516	   to recover an FPDU that is less than or equal to the MTU of the
2517	   locally attached link (this should be the largest possible advertised
2518	   TCP path MTU). If the MTU is smaller than 140 octets, the buffer MUST
2519	   be at least 140 octets long to support the minimum FPDU size.  The
2520	   140 octets allows for the minimum MULPDU of 128, 2 octets of pad, 2
2521	   of ULPDU_Length, 4 of CRC, and space for a possible marker. As usual,
2522	   additional buffering may provide better performance.
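
The 140-octet minimum can be reproduced from the components listed
above. In this sketch the 4-octet Marker size is inferred from the
stated total, and the pad is derived by rounding the ULPDU_Length plus
MULPDU up to a 4-octet boundary, which yields the 2 octets of pad that
the text mentions. The constant and function names are hypothetical.

```python
MIN_MULPDU = 128          # minimum MULPDU, from the text above
ULPDU_LENGTH_OCTETS = 2
CRC_OCTETS = 4
MARKER_OCTETS = 4         # inferred: 140 - (128 + 2 + 2 + 4)


def min_fpdu_buffer(mulpdu: int = MIN_MULPDU) -> int:
    """Octets needed to hold one FPDU carrying a ULPDU of size mulpdu."""
    unpadded = ULPDU_LENGTH_OCTETS + mulpdu
    pad = -unpadded % 4
    return unpadded + pad + CRC_OCTETS + MARKER_OCTETS


def reassembly_buffer_size(link_mtu: int) -> int:
    """Buffer sized to the local link MTU, but never below the minimum."""
    return max(link_mtu, min_fpdu_buffer())


print(min_fpdu_buffer(), reassembly_buffer_size(1500), reassembly_buffer_size(100))
```

The second function only mirrors the sizing rule stated above. As the
text notes, additional buffering may provide better performance.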

2524	   Note that if the TCP segment were not stored, it is possible to
2525	   deadlock the MPA algorithm. If the path MTU is reduced, FPDU
2526	   Alignment requires the source TCP to re-segment the data stream to
2527	   the new path MTU. The source MPA will detect this condition and
2528	   reduce the MPA segment size, but any FPDUs already posted to the
2529	   source TCP will be re-segmented and lose FPDU Alignment. If the
2530	   destination does not support a TCP reassembly buffer, these segments
2531	   can never be successfully transmitted and the protocol deadlocks.

2533	   When a complete FPDU is received, processing continues normally.

2535	11.3 IETF RNIC Interoperability with RDMA Consortium Protocols

2537	   Without the exchange of MPA Request/Reply Frames, there is no
2538	   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
2539	   RNICs.  Even if a ULP uses a well-known port to start an IETF RNIC
2540	   immediately in RDMA mode (i.e., without exchanging the MPA
2541	   Request/Reply messages), there is no reason to believe an IETF RNIC
2542	   will interoperate with an RDMAC RNIC because of the differences in
2543	   the version number in the DDP and RDMAP headers on the wire.

2545	   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
2546	   implement MPA Request/Reply Frames on behalf of the RNIC in order to
2547	   negotiate the connection parameters.  The following section describes
2548	   the results following the exchange of the MPA Request/Reply Frames
2549	   before the conversion from streaming to RDMA mode.

2551	11.3.1 Negotiated Parameters

2553	   Three types of RNICs are considered:

2555	   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
2556	       has a ULP or other supporting entity that exchanges the MPA
2557	       Request/Reply Frames in streaming mode before the conversion to
2558	       RDMA mode.

2560	   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
2561	       which is not capable of implementing the RDMAC protocols.  Such
2562	       an RNIC can only interoperate with other IETF RNICs.

2564	   Permissive IETF RNIC - an RNIC implementing the IETF protocols which
2565	       is capable of implementing the RDMAC protocols on a per
2566	       connection basis.

2568	   The values used by these three RNIC types for the MPA, DDP, and RDMAP
2569	   versions as well as MPA markers and CRC are summarized in Figure 12.

2571	    +----------------++-----------+-----------+-----------+-----------+
2572	    | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
2573	    |                ||  Version  | Revision  |  Markers  |    CRC    |
2574	    +----------------++-----------+-----------+-----------+-----------+
2575	    +----------------++-----------+-----------+-----------+-----------+
2576	    | RDMAC          ||     0     |     0     |     1     |     1     |
2577	    |                ||           |           |           |           |
2578	    +----------------++-----------+-----------+-----------+-----------+
2579	    | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
2580	    | Non-permissive ||           |           |           |           |
2581	    +----------------++-----------+-----------+-----------+-----------+
2582	    | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
2583	    | permissive     ||           |           |           |           |
2584	    +----------------++-----------+-----------+-----------+-----------+
2585	           Figure 12. Connection Parameters for the RNIC Types.
2586	            For MPA markers and MPA CRC, enabled=1, disabled=0.

2588	   It is assumed there is no mixing of versions allowed between MPA, DDP
2589	   and RDMAP.  The RNIC either generates the RDMAC protocols on the wire
2590	   (version is zero) or the IETF protocols (version is one).

2592	   During the exchange of the MPA Request/Reply Frames, each peer
2593	   provides its MPA Revision, Marker preference (M: 0=disabled,
2594	   1=enabled), and CRC preference.  The MPA Revision provided in the MPA
2595	   Request Frame and the MPA Reply Frame may differ.

2597	   From the information in the MPA Request/Reply Frames, each side sets
2598	   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
2599	   well as the state of the Markers for each half connection.  Moreover,
2600	   the DDP and RDMAP version MUST be identical in the two directions.

2605	   In the following sections, the figures do not discuss CRC negotiation
2606	   because there is no interoperability issue for CRCs.  Since the RDMAC
2607	   RNIC will always request CRC use, then, according to the IETF MPA
2608	   specification, both peers MUST generate and check CRCs.

2610	11.3.2 RDMAC RNIC and Non-permissive IETF RNIC

2612	   Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate
2613	   with an RDMAC RNIC, despite the fact that both peers exchange MPA
2614	   Request/Reply Frames.  For a Non-permissive IETF RNIC, the MPA
2615	   negotiation has no effect on the DDP/RDMAP version and it is unable
2616	   to interoperate with the RDMAC RNIC.

2618	   The rows in the figure show the state of the Marker field in the MPA
2619	   Request Frame sent by the MPA Initiator.  The columns show the state
2620	   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
2621	   Each type of RNIC is shown as an initiator and a responder.  The
2622	   connection results are shown in the lower right corner, at the
2623	   intersection of the different RNIC types, where V=0 is the RDMAC
2624	   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
2625	   markers are disabled and M=1 means MPA markers are enabled. The
2626	   negotiated marker state is shown as X/Y, for the receive direction of
2627	   the initiator/responder.

2629	          +---------------------------++-----------------------+
2630	          |   MPA                     ||          MPA          |
2631	          | CONNECT                   ||       Responder       |
2632	          |   MODE  +-----------------++-------+---------------+
2633	          |         |   RNIC          || RDMAC |     IETF      |
2634	          |         |   TYPE          ||       | Non-permissive|
2635	          |         |          +------++-------+-------+-------+
2636	          |         |          |MARKER|| M=1   | M=0   |  M=1  |
2637	          +---------+----------+------++-------+-------+-------+
2638	          +---------+----------+------++-------+-------+-------+
2639	          |         |   RDMAC  | M=1  || V=0   | close | close |
2640	          |         |          |      || M=1/1 |       |       |
2641	          |         +----------+------++-------+-------+-------+
2642	          |   MPA   |          | M=0  || close | V=1   | V=1   |
2643	          |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
2644	          |         |Non-perms.+------++-------+-------+-------+
2645	          |         |          | M=1  || close | V=1   | V=1   |
2646	          |         |          |      ||       | M=1/0 | M=1/1 |
2647	          +---------+----------+------++-------+-------+-------+
2648	   Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
2649	                                IETF RNIC.

2651	11.3.2.1 RDMAC RNIC Initiator

2653	   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
2654	   Frame with Rev field set to zero and the M and C bits set to one.
2655	   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
2656	   version number it uses for DDP and RDMAP, it would send an MPA Reply
2657	   Frame with the Rev field equal to one and then gracefully close the
2658	   connection.

2660	11.3.2.2 Non-Permissive IETF RNIC Initiator

2662	   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
2663	   Request Frame with Rev field equal to one.  The ULP or supporting
2664	   entity for the RDMAC RNIC responds with an MPA Reply Frame that has
2665	   the Rev field equal to zero and the M bit set to one.  The Non-
2666	   permissive IETF RNIC will gracefully close the connection after it
2667	   reads the incompatible Rev field in the MPA Reply Frame.

2669	11.3.3 RDMAC RNIC and Permissive IETF RNIC

2671	   Figure 14 shows that a Permissive IETF RNIC can interoperate with an
2672	   RDMAC RNIC regardless of its Marker preference.  The figure uses the
2673	   same format as shown with the Non-permissive IETF RNIC.

2675	          +---------------------------++-----------------------+
2676	          |   MPA                     ||          MPA          |
2677	          | CONNECT                   ||       Responder       |
2678	          |   MODE  +-----------------++-------+---------------+
2679	          |         |   RNIC          || RDMAC |     IETF      |
2680	          |         |   TYPE          ||       |  Permissive   |
2681	          |         |          +------++-------+-------+-------+
2682	          |         |          |MARKER|| M=1   | M=0   | M=1   |
2683	          +---------+----------+------++-------+-------+-------+
2684	          +---------+----------+------++-------+-------+-------+
2685	          |         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
2686	          |         |          |      || M=1/1 |       | M=1/1 |
2687	          |         +----------+------++-------+-------+-------+
2688	          |   MPA   |          | M=0  || V=0   | V=1   | V=1   |
2689	          |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
2690	          |         |Permissive+------++-------+-------+-------+
2691	          |         |          | M=1  || V=0   | V=1   | V=1   |
2692	          |         |          |      || M=1/1 | M=1/0 | M=1/1 |
2693	          +---------+----------+------++-------+-------+-------+
2694	     Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
2695	                                IETF RNIC.

2697	   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
2698	   Rev field of the MPA Req/Rep Frames and then adjust its receive
2699	   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.  As
2700	   a result, as an MPA Responder, the Permissive IETF RNIC will never
2701	   return an MPA Reply Frame with the M bit set to zero.  This case is
2702	   shown as not applicable (N/A) in Figure 14.

2704	11.3.3.1 RDMAC RNIC Initiator

2706	   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
2707	   entity prepares an MPA Request message and sets the revision to zero
2708	   and the M bit and C bit to one.

2710	   The Permissive IETF Responder receives the MPA Request message and
2711	   checks the revision field.  Since it is capable of generating RDMAC
2712	   DDP/RDMAP headers, it sends an MPA Reply message with revision set to
2713	   zero and the M and C bits set to one.  The Responder must inform its
2714	   ULP that it is generating version zero DDP/RDMAP messages.

2716	11.3.3.2 Permissive IETF RNIC Initiator

2718	   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
2719	   Request Frame setting the Rev field to one.  Regardless of the value
2720	   of the M bit in the MPA Request Frame, the ULP or other supporting
2721	   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
2722	   equal to zero and the M bit set to one.

2724	   When the Initiator reads the Rev field of the MPA Reply Frame and
2725	   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
2726	   should generate version zero DDP/RDMAP messages and enable MPA
2727	   markers and CRC.

2729	11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC

2731	   For completeness, Figure 15 shows the results of MPA negotiation
2732	   between a Non-permissive IETF RNIC and a Permissive IETF RNIC.  The
2733	   important point from this figure is that an IETF RNIC cannot detect
2734	   whether its peer is a Permissive or Non-permissive RNIC.

2736	      +---------------------------++-------------------------------+
2737	      |   MPA                     ||              MPA              |
2738	      | CONNECT                   ||            Responder          |
2739	      |   MODE  +-----------------++---------------+---------------+
2740	      |         |   RNIC          ||     IETF      |     IETF      |
2741	      |         |   TYPE          || Non-permissive|  Permissive   |
2742	      |         |          +------++-------+-------+-------+-------+
2743	      |         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
2744	      +---------+----------+------++-------+-------+-------+-------+
2745	      +---------+----------+------++-------+-------+-------+-------+
2746	      |         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
2747	      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2748	      |         |Non-perms.+------++-------+-------+-------+-------+
2749	      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
2750	      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2751	      |   MPA   +----------+------++-------+-------+-------+-------+
2752	      |Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
2753	      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2754	      |         |Permissive+------++-------+-------+-------+-------+
2755	      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
2756	      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2757	      +---------+----------+------++-------+-------+-------+-------+
2758	    Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
2759	                           Permissive IETF RNIC.
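
The outcomes tabulated in Figures 13 through 15 can be summarized as a
small decision function. The sketch below is one reading of those
figures, not a normative procedure. The function name negotiate, the
type labels and the return convention (a tuple of V, initiator receive
Markers and responder receive Markers, or the strings "close" and
"N/A") are assumptions made for illustration.

```python
RDMAC = "RDMAC"
IETF_NONPERM = "IETF non-permissive"
IETF_PERM = "IETF permissive"


def negotiate(initiator: str, init_m: int, responder: str, resp_m: int):
    """Return (V, M initiator rx, M responder rx), "close", or "N/A"."""
    for kind, m in ((initiator, init_m), (responder, resp_m)):
        if kind not in (RDMAC, IETF_NONPERM, IETF_PERM):
            raise ValueError("unknown RNIC type: %r" % (kind,))
        if m not in (0, 1):
            raise ValueError("M must be 0 or 1")
        if kind == RDMAC and m != 1:
            raise ValueError("an RDMAC RNIC always sets M=1")

    kinds = {initiator, responder}
    if RDMAC in kinds:
        if IETF_NONPERM in kinds:
            # A Non-permissive IETF RNIC cannot downgrade to version zero.
            return "close"
        if initiator == RDMAC and responder == IETF_PERM and resp_m == 0:
            # A Permissive responder never answers an RDMAC peer with M=0.
            return "N/A"
        # RDMAC protocols on the wire, Markers enabled in both directions.
        return (0, 1, 1)
    # Both peers are IETF RNICs: version one, each side keeps its own M.
    return (1, init_m, resp_m)


print(negotiate(IETF_PERM, 0, RDMAC, 1))
print(negotiate(IETF_NONPERM, 1, IETF_PERM, 0))
```

CRC negotiation is left out of the sketch for the same reason it is
left out of the figures.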

2761	12 Authors' Addresses

2763	   Stephen Bailey
2764	       Sandburst Corporation
2765	       600 Federal Street
2766	       Andover, MA  01810 USA
2767	       Phone: +1 978 689 1614
2768	       Email: steph@sandburst.com

2770	   Paul R. Culley
2771	       Hewlett-Packard Company
2772	       20555 SH 249
2773	       Houston, Tx. USA 77070-2698
2774	       Phone:  281-514-5543
2775	       Email:  paul.culley@hp.com

2777	   Uri Elzur
2778	       Broadcom
2779	       16215 Alton Parkway
2780	       CA, 92618
2781	       Phone: 949.585.6432
2782	       Email:  uri@broadcom.com

2784	   Renato J Recio
2785	       IBM
2786	       Internal Zip 9043
2787	       11400 Burnett Road
2788	       Austin,  Texas  78759
2789	       Phone:  512-838-3685
2790	       Email:  recio@us.ibm.com

2792	   John Carrier
2793	       Adaptec Inc.
2794	       691 South Milpitas Blvd.
2795	       Milpitas, CA 95035
2796	       Phone:  360-378-8526
2797	       Email:  John_Carrier@adaptec.com

2799	13 Acknowledgments

2801	   Dwight Barron
2802	       Hewlett-Packard Company
2803	       20555 SH 249
2804	       Houston, Tx. USA 77070-2698
2805	       Phone: 281-514-2769
2806	       Email: dwight.barron@hp.com

2808	   Jeff Chase
2809	       Department of Computer Science
2810	       Duke University
2811	       Durham, NC 27708-0129 USA
2812	       Phone: +1 919 660 6559
2813	       Email: chase@cs.duke.edu

2815	   Ted Compton
2816	       EMC Corporation
2817	       Research Triangle Park, NC 27709, USA
2818	       Phone: 919-248-6075
2819	       Email: compton_ted@emc.com

2821	   Dave Garcia
2822	       Hewlett-Packard Company
2823	       19333 Vallco Parkway
2824	       Cupertino, Ca. USA 95014
2825	       Phone: 408.285.6116
2826	       Email: dave.garcia@hp.com

2828	   Hari Ghadia
2829	       Adaptec, Inc.
2830	       691 S. Milpitas Blvd.,
2831	       Milpitas, CA 95035  USA
2832	       Phone: +1 (408) 957-5608
2833	       Email: hari_ghadia@adaptec.com

2835	   Howard C. Herbert
2836	       Intel Corporation
2837	       MS CH7-404
2838	       5000 West Chandler Blvd.
2839	       Chandler, Arizona 85226
2840	       Phone: 480-554-3116
2841	       Email: howard.c.herbert@intel.com

2843	   Jeff Hilland
2844	       Hewlett-Packard Company
2845	       20555 SH 249
2846	       Houston, Tx. USA 77070-2698
2847	       Phone: 281-514-9489
2848	       Email: jeff.hilland@hp.com

2850	   Mike Ko
2851	       IBM
2852	       650 Harry Rd.
2853	       San Jose, CA 95120
2854	       Phone: (408) 927-2085
2855	       Email: mako@us.ibm.com

2857	   Mike Krause
2858	       Hewlett-Packard Corporation, 43LN
2859	       19410 Homestead Road
2860	       Cupertino, CA 95014 USA
2861	       Phone: +1 (408) 447-3191
2862	       Email: krause@cup.hp.com

2864	   Dave Minturn
2865	       Intel Corporation
2866	       MS JF1-210
2867	       5200 North East Elam Young Parkway
2868	       Hillsboro, Oregon  97124
2869	       Phone: 503-712-4106
2870	       Email: dave.b.minturn@intel.com

2872	   Jim Pinkerton
2873	       Microsoft, Inc.
2874	       One Microsoft Way
2875	       Redmond, WA, USA 98052
2876	       Email: jpink@microsoft.com

2878	   Hemal Shah
2879	       Intel Corporation
2880	       MS PTL1
2881	       1501 South Mopac Expressway, #400
2882	       Austin, Texas  78746
2883	       Phone: 512-732-3963
2884	       Email: hemal.shah@intel.com

2886	   Allyn Romanow
2887	       Cisco Systems
2888	       170 W Tasman Drive
2889	       San Jose, CA 95134 USA
2890	       Phone: +1 408 525 8836
2891	       Email: allyn@cisco.com

2893	   Tom Talpey
2894	       Network Appliance
2895	       375 Totten Pond Road
2896	       Waltham, MA 02451 USA
2897	       Phone: +1 (781) 768-5329
2898	       Email: thomas.talpey@netapp.com

2900	   Patricia Thaler
2901	       Agilent Technologies, Inc.
2902	       1101 Creekside Ridge Drive, #100
2903	       M/S-RG10
2904	       Roseville, CA 95678
2905	       Phone: +1-916-788-5662
2906	       Email: pat_thaler@agilent.com

2908	   Jim Wendt
2909	       Hewlett Packard Corporation
2910	       8000 Foothills Boulevard MS 5668
2911	       Roseville, CA 95747-5668 USA
2912	       Phone: +1 916 785 5198
2913	       Email: jim_wendt@hp.com

2915	   Jim Williams
2916	       Emulex Corporation
2917	       580 Main Street
2918	       Bolton, MA 01740 USA
2919	       Phone: +1 978 779 7224
2920	       Email: jim.williams@emulex.com

2922	14 Full Copyright Statement

2924	   This document and the information contained herein are provided on an
2925	   "AS IS" basis and ADAPTEC INC., AGILENT TECHNOLOGIES INC., BROADCOM
2926	   CORPORATION, CISCO SYSTEMS INC., DUKE UNIVERSITY, EMC CORPORATION,
2927	   EMULEX CORPORATION, HEWLETT-PACKARD COMPANY, INTERNATIONAL BUSINESS
2928	   MACHINES CORPORATION, INTEL CORPORATION, MICROSOFT CORPORATION,
2929	   NETWORK APPLIANCE INC., SANDBURST CORPORATION, THE INTERNET SOCIETY,
2930	   AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES,
2931	   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT
2932	   THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
2933	   IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
2934	   PURPOSE.

2936	   This document and the information contained herein are provided on an
2937	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2938	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2939	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2940	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2941	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2942	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2944	   Copyright (C) The Internet Society (2005).  This document is subject
2945	   to the rights, licenses and restrictions contained in BCP 78, and
2946	   except as set forth therein, the authors retain all their rights.