1	   Remote Direct Data Placement Work Group   P. Culley
2	   INTERNET-DRAFT                              Hewlett-Packard Company
3	   draft-ietf-rddp-mpa-04.txt                U. Elzur
4	                                               Broadcom Corporation
5	                                             R. Recio
6	                                               IBM Corporation
7	                                             S. Bailey
8	                                               Sandburst Corporation
9	                                             J. Carrier
10	                                               Cray Inc.

12	   Expires: November 2006                    May 30, 2006

14	             Marker PDU Aligned Framing for TCP Specification

16	Status of this Memo

18	   By submitting this Internet-Draft, each author represents that any
19	   applicable patent or other IPR claims of which he or she is aware
20	   have been or will be disclosed, and any of which he or she becomes
21	   aware will be disclosed, in accordance with Section 6 of BCP 79.

23	   Internet-Drafts are working documents of the Internet Engineering
24	   Task Force (IETF), its areas, and its working groups.  Note that
25	   other groups may also distribute working documents as Internet-
26	   Drafts.

28	   Internet-Drafts are draft documents valid for a maximum of six months
29	   and may be updated, replaced, or obsoleted by other documents at any
30	   time.  It is inappropriate to use Internet-Drafts as reference
31	   material or to cite them other than as "work in progress."

33	   The list of current Internet-Drafts can be accessed at
34	   http://www.ietf.org/1id-abstracts.html.  The list of Internet-Draft
35	   Shadow Directories can be accessed at http://www.ietf.org/shadow.html

37	Abstract

39	   MPA (Marker Protocol data unit Aligned framing) is designed to work
40	   as an "adaptation layer" between TCP and the Direct Data Placement
41	   [DDP] protocol, preserving the reliable, in-order delivery of TCP,
42	   while adding the preservation of higher-level protocol record
43	   boundaries that DDP requires.  MPA is fully compliant with applicable
44	   TCP RFCs and can be utilized with existing TCP implementations.  MPA
45	   also supports integrated implementations that combine TCP, MPA and
46	   DDP to reduce buffering requirements in the implementation and
47	   improve performance at the system level.

49	   Table of Contents

51	   Status of this Memo                                                 1
52	   Abstract                                                            1
53	   1      Glossary                                                     7
54	   2      Introduction                                                10
55	   2.1    Motivation                                                  10
56	   2.2    Protocol Overview                                           10
57	   3      LLP and DDP requirements                                    14
58	   3.1    TCP implementation Requirements to support MPA              14
59	   3.1.1  TCP Transmit side                                           14
60	   3.1.2  TCP Receive side                                            14
61	   3.2    MPA's interactions with DDP                                 16
62	   4      FPDU Formats                                                18
63	   4.1    Marker Format                                               19
64	   5      Data Transfer Semantics                                     20
65	   5.1    MPA Markers                                                 20
66	   5.2    CRC Calculation                                             23
67	   5.3    MPA on TCP Sender Segmentation                              26
68	   5.3.1  Effects of MPA on TCP Segmentation                          27
69	   5.3.2  FPDU Size Considerations                                    29
70	   5.4    MPA Receiver FPDU Identification                            30
71	   5.4.1  Re-segmenting Middle boxes and non MPA-aware TCP senders    31
72	   6      Connection Semantics                                        32
73	   6.1    Connection setup                                            32
74	   6.1.1  MPA Request and Reply Frame Format                          34
75	   6.1.2  Connection Startup Rules                                    35
76	   6.1.3  Example Delayed Startup sequence                            38
77	   6.1.4  Use of Private Data                                         41
78	   6.1.5  "Dual stack" implementations                                44
79	   6.2    Normal Connection Teardown                                  45
80	   7      Error Semantics                                             46
81	   8      Security Considerations                                     47
82	   8.1    Protocol-specific Security Considerations                   47
83	   8.1.1  Spoofing                                                    47
84	   8.1.2  Eavesdropping                                               48
85	   8.2    Introduction to Security Options                            49
86	   8.3    Using IPsec With MPA                                        49
87	   8.4    Requirements for IPsec Encapsulation of MPA/DDP             50
88	   9      IANA Considerations                                         51
89	   10     References                                                  52
90	   10.1   Normative References                                        52
91	   10.2   Informative References                                      52
92	   11     Appendix                                                    54
93	   11.1   Analysis of MPA over TCP Operations                         54
94	   11.1.1 Assumptions                                                 55
95	   11.1.2 The Value of FPDU Alignment                                 56
96	   11.2   Receiver implementation                                     63
97	   11.2.1 Network Layer Reassembly Buffers                            63
98	   11.2.2 TCP Reassembly buffers                                      64
99	   11.3   IETF Implementation Interoperability with RDMA Consortium
100	   Protocols                                                          65
101	   11.3.1 Negotiated Parameters                                       65
102	   11.3.2 RDMAC RNIC and Non-permissive IETF RNIC                     66
103	   11.3.3 RDMAC RNIC and Permissive IETF RNIC                         68
104	   11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC           69
105	   12     Author's Addresses                                          70
106	   13     Acknowledgments                                             71
107	   Full Copyright Statement                                           74
108	   Intellectual Property                                              74

110	   Table of Figures

112	   Figure 1 ULP MPA TCP Layering                                      11
113	   Figure 2 FPDU Format                                               18
114	   Figure 3 Marker Format                                             19
115	   Figure 4 Example FPDU Format with Marker                           21
116	   Figure 5 Annotated Hex Dump of an FPDU                             25
117	   Figure 6 Annotated Hex Dump of an FPDU with Marker                 26
118	   Figure 7 MPA Request/Reply Frame                                   34
119	   Figure 8: Example Delayed Startup negotiation                      39
120	   Figure 9: Example Immediate Startup negotiation                    42
121	   Figure 10: Non-aligned FPDU freely placed in TCP octet stream      58
122	   Figure 11: Aligned FPDU placed immediately after TCP header        59
123	   Figure 12.  Connection Parameters for the RNIC Types.              66
124	   Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
125	   IETF RNIC.                                                         67
126	   Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
127	   IETF RNIC.                                                         68
128	   Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
129	   Permissive IETF RNIC.                                              69

131	   Revision history [To be deleted prior to RFC publication]

133	   [draft-ietf-rddp-mpa-04] workgroup draft with following changes:

135	        Numerous capitalization and "" adjustments, tried to make more
136	        consistent.

138	        Added some missing capitalized terms to glossary

140	        Removed company specific "use as is" boilerplate paragraph

142	        Fixed up some contact information and cross references.

144	        Removed reference to expired draft-elzur-iwarp-mpa-tcp-analysis-
145	        00.txt

147	        Suggested MTU to be used to determine EMSS, when otherwise not
148	        available; removed technology specific lengths per AD suggestion
149	        Tweaked text around disabling Nagle so that it is no longer
150	        implied that that is all that is necessary to achieve proper
151	        segmentation behavior

153	        Revamped section 5.3.1 for improved clarity

155	   [draft-ietf-rddp-mpa-03] workgroup draft with following changes:

157	        Tweaked abstract to give a bit more information.

159	        Tightened definition and usage of "deliver"

161	        Cleaned up usage of terms "FPDU Alignment" and "Header
162	        Alignment"

164	        Rearranged overview sections with stack and glossary earlier

166	        Mentioned how a non-MPA-Aware TCP MPA receiver deals with out
167	        of order segments (it doesn't have to...)

169	        Fixed description of out of order segment handling in section
170	        3.1.1

172	        Added text saying that ordering and completion indications are
173	        used to deliver to DDP

175	        Added redundant text indicating low two bits of FPDUPTR must
176	        always be zero and treated as such in Section 4.1

178	        Added redundant text indicating Markers are always included in a
179	        CRC calculation

181	        Removed indication saying that an implementation can "ignore" an
182	        administrative input to not use CRCs; clarified that both ends
183	        have to agree to not use CRC (as originally intended).

185	        Changed example FPDU hex dump format for greater clarity

187	        Clarified that EMSS shrinking below 128 bytes is the condition
188	        (rather than "very small sizes")

190	        Put connection startup rules after the start frame formats

192	        Added Initiator Private Data to figure 9

194	        Removed or Clarified use of RNIC term

196	        Added intro to IETF/RDMAC interoperability appendix and gave a
197	        web reference for docs; also recommended use of "permissive IETF
198	        RNIC"

199	        Numerous minor clarifications

201	        Updated Boilerplates per current requirements

203	   [draft-ietf-rddp-mpa-02] workgroup draft with following changes:

205	        Made IPsec must implement, optional to use.

207	        Updated Marker language to clarify that it points to ULPDU
208	        Length even when Marker precedes FPDU.

210	        Clarified when to start Markers use (in Full Operation mode).

212	        Added informative text on interoperability with RDMAC RNICs.

214	        Reduced Private Data to 512 octets max.

216	        Clarified CRC use description, must be used unless data is at
217	        least as well protected by another means.

219	        Clarified CRC disabled mode; CRC field is always valid.

221	        Added Security text.

223	        Changed DDP and RDMAP version numbers in hex dumps (Fig 5, 6)
224	        and adjusted CRC accordingly.

226	   [draft-ietf-rddp-mpa-01] workgroup draft with following changes:

228	        Added the "R" bit (Rejected) to the MPA Reply Frame and
229	        described its semantics.

231	        Added some comments on recent decisions regarding startup.

233	        Updated RFC3667 boilerplate.

235	   [draft-ietf-rddp-mpa-00] workgroup draft with following changes:

237	        Changed "Start Key" to two separate startup frames to facilitate
238	        identification of incorrect active/active startup.

240	        Changed Active/Passive nomenclature to Initiator/Responder to
241	        reduce confusion with TCP startup and verbs doc (which used
242	        opposite sense).

244	        Added Private Data to the startup key sequences.  This also
245	        required describing the motivation and expected usage models
246	        along with some interface hints.  Removed the Private Data stuff
247	        from appendix.

249	        Added example "Immediate" startup with TCP and explanation.

251	   [draft-culley-iwarp-mpa-03]

253	        Add option to allow receivers to specify Marker use.

255	        Add option that allows both sides to agree not to use CRC.

257	        Added startup declaration "Start Key" with options and larger
258	        MPA mode recognition "key".

260	        Updated MPA/DDP connection startup rules and sequence to deal
261	        with "Start Key".

263	        Added Appendix that provides a more detailed analysis of the
264	        effects of MPA on TCP data streams.

266	        Added appendix that describes a mechanism to deal with "Private
267	        Data" prior to full MPA/DDP operation.

269	   [draft-culley-iwarp-mpa-02]

271	        Enhanced descriptions of how MPA is used over an unmodified TCP.

273	        Removed "No Packing" text.

275	        Made MPA an adaptation layer for DDP, instead of a generalized
276	        framing solution.

278	        Added clarifications of the MPA/TCP interaction for optimized
279	        implementations and that any such optimizations are to be used
280	        only when requested by MPA.

282	    [draft-culley-iwarp-mpa-01] initial draft.

284	1  Glossary

286	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
287	       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
288	       this document are to be interpreted as described in RFC 2119.

290	   Consumer - the ULPs or applications that lie above MPA and DDP.  The
291	       Consumer is responsible for making TCP connections, starting MPA
292	       and DDP connections, and generally controlling operations.

294	   Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
295	       the process of informing DDP that a particular PDU is ordered for
296	       use.  A PDU is Delivered in the exact order that it was sent by
297	       the original sender; MPA uses TCP's byte stream ordering to
298	       determine when Delivery is possible.  This is specifically
299	       different from "passing the PDU to DDP", which may generally
300	       occur in any order, while the order of Delivery is strictly
301	       defined.

303	   EMSS - Effective Maximum Segment Size.  EMSS is the smaller of the
304	       TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
305	       and the current path Maximum Transfer Unit (MTU) [RFC1191].

307	   FPDU - Framed Protocol Data Unit.  The unit of data created by an MPA
308	       sender.

310	   FPDU Alignment - the property that an FPDU is Header Aligned with the
311	       TCP segment, and the TCP segment includes an integer number of
312	       FPDUs.  A TCP segment with FPDU Alignment allows immediate
313	       processing of the contained FPDUs without waiting on other TCP
314	       segments to arrive or combining with prior segments.

316	   FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate
317	       the beginning of an FPDU.

319	   Full Operation (Full Operation Phase) - The phase that follows
320	       completion of the Startup Phase, in which MPA exchanges FPDUs.

322	   Header Alignment - the property that a TCP segment begins with an
323	       FPDU.  The FPDU is Header Aligned when the FPDU header is exactly
324	       at the start of the TCP segment (right behind the TCP headers on
325	       the wire).

327	   Initiator - The endpoint of a connection that sends the MPA Request
328	       Frame, i.e. the first to actually send data (which may not be the
329	       one which sends the TCP SYN).

331	   Marker - A four octet field that is placed in the MPA data stream at
332	       fixed octet intervals (every 512 octets).

334	   MPA-aware TCP - a TCP implementation that is aware of the receiver
335	       efficiencies of MPA FPDU Alignment and is capable of sending TCP
336	       segments that begin with an FPDU.

338	   MPA-enabled - MPA is enabled if the MPA protocol is visible on the
339	       wire.  When the sender is MPA-enabled, it is inserting framing
340	       and Markers.  When the receiver is MPA-enabled, it is
341	       interpreting framing and Markers.

343	   MPA Request Frame - Data sent from the MPA Initiator to the MPA
344	       Responder during the Startup Phase.

346	   MPA Reply Frame - Data sent from the MPA Responder to the MPA
347	       Initiator during the Startup Phase.

349	   MPA - Marker-based ULP PDU Aligned Framing for TCP protocol.  This
350	       document defines the MPA protocol.

352	   MULPDU - Maximum ULPDU.  The current maximum size of the record that
353	       is acceptable for DDP to pass to MPA for transmission.

355	   Node - A computing device attached to one or more links of a Network.
356	       A Node in this context does not refer to a specific application
357	       or protocol instantiation running on the computer.  A Node may
358	       consist of one or more MPA on TCP devices installed in a host
359	       computer.

361	   PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact
362	       modulo 4 size.

364	   PDU - protocol data unit

366	   Private Data - A block of data exchanged between MPA endpoints during
367	       initial connection setup.

369	   Protection Domain - An RDMA concept (see [VERBS] and [RDMASEC]) that
370	       ties use of various endpoint resources (memory access etc.) to the
371	       specific RDMA/DDP/MPA connection.

373	   RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
374	       to enable applications to transfer data directly from memory
375	       buffers.  See [RDMAP].

377	   Remote Peer - The MPA protocol implementation on the opposite end of
378	       the connection.  Used to refer to the remote entity when
379	       describing protocol exchanges or other interactions between two
380	       Nodes.

382	   Responder - The connection endpoint which responds to an incoming MPA
383	       connection request (the MPA Request Frame).  This may not be the
384	       endpoint which awaited the TCP SYN.

386	   Startup Phase - The initial exchanges of an MPA connection, which
387	       serve to more fully identify the MPA endpoints to each other and
388	       to pass connection specific setup information between them.

390	   ULP - Upper Layer Protocol.  The protocol layer above the protocol
391	       layer currently being referenced.  The ULP for MPA is DDP [DDP].

393	   ULPDU - Upper Layer Protocol Data Unit.  The data record defined by
394	      the layer above MPA (DDP).  ULPDU corresponds to DDP's DDP
395	      segment.

397	   ULPDU_Length - a field in the FPDU describing the length of the
398	      included ULPDU.
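
   The PAD and Marker entries above imply some simple arithmetic, which
   the following minimal Python sketch illustrates.  It is not part of
   the specification.  The function names are hypothetical, the two
   octet size of the ULPDU_Length field is an assumption taken from the
   FPDU format in section 4, and Marker offsets are assumed to count
   from the first octet sent in Full Operation, with the Marker octets
   themselves included in the count.

```python
MARKER_INTERVAL = 512   # octets between Markers (glossary: "every 512 octets")
LENGTH_FIELD_SIZE = 2   # assumed size of the ULPDU_Length field (section 4)


def pad_length(ulpdu_length: int) -> int:
    """Zero octets needed so that header + ULPDU + PAD is a multiple of 4."""
    return -(LENGTH_FIELD_SIZE + ulpdu_length) % 4


def marker_offsets(start: int, end: int) -> list:
    """Stream offsets in [start, end) at which a Marker begins.

    Assumption: offsets count from the start of Full Operation and the
    Marker octets are part of the counted stream.
    """
    first = -(-start // MARKER_INTERVAL) * MARKER_INTERVAL  # round up
    return list(range(first, end, MARKER_INTERVAL))
```

   Under these assumptions a 3 octet ULPDU needs 3 octets of PAD, a 2
   octet ULPDU needs none, and a span of the stream shorter than 512
   octets contains at most one Marker.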

400	2  Introduction

402	   This section discusses the reason for creating MPA on TCP and a
403	   general overview of the protocol.  Later sections show the MPA
404	   headers (see section 4 on page 18), and detailed protocol
405	   requirements and characteristics (see section 5 on page 20), as well
406	   as Connection Semantics (section 6 on page 32), Error Semantics
407	   (section 7 on page 46), and Security Considerations (section 8 on
408	   page 47).

410	2.1  Motivation

412	   The Direct Data Placement protocol [DDP], when used with TCP [RFC793],
413	   requires a mechanism to detect record boundaries.  The DDP records
414	   are referred to as Upper Layer Protocol Data Units by this document.
415	   The ability to locate the Upper Layer Protocol Data Unit (ULPDU)
416	   boundary is useful to a hardware network adapter that uses DDP to
417	   directly place the data in the application buffer based on the
418	   control information carried in the ULPDU header.  This may be done
419	   without requiring that the packets arrive in order.  Potential
420	   benefits of this capability are the avoidance of the memory copy
421	   overhead and a smaller memory requirement for handling out of order
422	   or dropped packets.

424	   Many approaches have been proposed for a generalized framing
425	   mechanism.  Some are probabilistic in nature and others are
426	   deterministic.  A probabilistic approach is characterized by a
427	   detectable value embedded in the octet stream.  It is probabilistic
428	   because under some conditions the receiver may incorrectly interpret
429	   application data as the detectable value.  Under these conditions,
430	   the protocol may fail with unacceptable frequency.  A deterministic
431	   approach is characterized by embedded controls at known locations in
432	   the octet stream.  Because the receiver can guarantee it will only
433	   examine the data stream at locations that are known to contain the
434	   embedded control, the protocol can never misinterpret application
435	   data as being embedded control data.  For unambiguous handling of an
436	   out of order packet, the deterministic approach is preferred.

438	   The MPA protocol provides a framing mechanism for DDP running over
439	   TCP using the deterministic approach.  It allows the location of the
440	   ULPDU to be determined in the TCP stream even if the TCP segments
441	   arrive out of order.

443	2.2  Protocol Overview

445	   The layering of PDUs with MPA is shown in Figure 1, below.

447	               +------------------+
448	               |     ULP client   |
449	               +------------------+  <- Consumer messages
450	               |        DDP       |
451	               +------------------+  <- ULPDUs
452	               |        MPA       |
453	               +------------------+  <- FPDUs (containing ULPDUs)
454	               |        TCP*      |
455	               +------------------+  <- TCP Segments (containing FPDUs)
456	               |      IP etc.     |
457	               +------------------+
458	                                      * TCP or MPA-aware TCP.

460	                       Figure 1 ULP MPA TCP Layering

462	   MPA is described as an extra layer above TCP and below DDP.  The
463	   operation sequence is:

465	   1.  A TCP connection is established by ULP action.  This is done
466	       using methods not described by this specification.  The ULP may
467	       exchange some amount of data in streaming mode prior to starting
468	       MPA, but is not required to do so.

470	   2.  The Consumer negotiates the use of DDP and MPA at both ends of a
471	       connection.  The mechanisms to do this are not described in this
472	       specification.  The negotiation may be done in streaming mode, or
473	       by some other mechanism (such as a pre-arranged port number).

475	   3.  The ULP activates MPA on each end in the Startup Phase, either as
476	       an Initiator or a Responder, as determined by the ULP.  This mode
477	       verifies the usage of MPA, specifies the use of CRC and Markers,
478	       and allows the ULP to communicate some additional data via a
479	       Private Data exchange.  See section 6.1 Connection setup for more
480	       details on the startup process.

482	   4.  At the end of the Startup Phase, the ULP puts MPA (and DDP) into
483	       Full Operation and begins sending DDP data as further described
484	       below.  In this document, DDP data chunks are called ULPDUs.  For
485	       a description of the DDP data, see [DDP].

487	   Following is a description of data transfer when MPA is in Full
488	   Operation.

490	   1.  DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
491	       for this value.  MPA derives this information from TCP or IP,
492	       when it is available, or chooses a reasonable value.

494	   2.  DDP creates ULPDUs of MULPDU size or smaller, and hands them to
495	       MPA at the sender.

497	   3.  MPA creates a Framed Protocol Data Unit (FPDU) by pre-pending a
498	       header, optionally inserting Markers, and appending a CRC field
499	       after the ULPDU and PAD (if any).  MPA delivers the FPDU to TCP.

501	   4.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
502	       Sender is MPA-aware, it segments the TCP stream in such a way
503	       that a TCP Segment boundary is also the boundary of an FPDU.  TCP
504	       then passes each segment to the IP layer for transmission.

506	   5.  The TCP receiver may or may not be MPA-aware.  If it is
507	       MPA-aware, it may separate passing the TCP payload to MPA from
508	       passing the TCP payload ordering information to MPA.  In either
509	       case, RFC compliant TCP wire behavior is observed at both the
510	       sender and receiver.

512	   6.  The MPA receiver locates and assembles complete FPDUs within the
513	       stream, verifies their integrity, and removes MPA Markers (when
514	       present), ULPDU_Length, PAD and the CRC field.

516	   7.  MPA then provides the complete ULPDUs to DDP.  MPA may also
517	       separate passing MPA payload to DDP from passing the MPA payload
518	       ordering information.
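
   Steps 3 and 6 above can be sketched as a matching pair of framing
   and de-framing routines.  This is a hedged illustration only, with
   Markers disabled.  The names build_fpdu and parse_fpdu are
   hypothetical.  The two octet ULPDU_Length field and the use of
   CRC32c are taken from sections 4 and 5.2, and placing the CRC on the
   wire in network byte order is an assumption of this sketch; section
   5.2 is authoritative.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Sender side: ULPDU_Length header + ULPDU + PAD + CRC (no Markers)."""
    header = struct.pack("!H", len(ulpdu))
    pad = b"\x00" * (-(len(header) + len(ulpdu)) % 4)
    body = header + ulpdu + pad
    # Assumption: CRC carried in network byte order.
    return body + struct.pack("!I", crc32c(body))


def parse_fpdu(stream: bytes):
    """Receiver side: return (ULPDU, remaining stream octets)."""
    (ulpdu_length,) = struct.unpack_from("!H", stream, 0)
    body_len = 2 + ulpdu_length + (-(2 + ulpdu_length) % 4)
    if len(stream) < body_len + 4:
        raise ValueError("incomplete FPDU")
    body = stream[:body_len]
    (crc,) = struct.unpack_from("!I", stream, body_len)
    if crc != crc32c(body):
        raise ValueError("CRC mismatch")
    # Strip ULPDU_Length, PAD and CRC; hand back only the ULPDU.
    return body[2:2 + ulpdu_length], stream[body_len + 4:]
```

   Calling parse_fpdu repeatedly on the remaining octets walks an
   in-order stream one FPDU at a time, which is all a receiver above a
   TCP that is not MPA-aware needs to do.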

520	   MPA-aware TCP is a TCP layer which potentially contains some
521	   additional semantics as defined in this document.  MPA is implemented
522	   as a data stream ULP for TCP and is therefore RFC compliant.  MPA-
523	   aware TCP is RFC compliant.

525	   An MPA-aware TCP sender is able to segment the data stream such that
526	   TCP segments begin with FPDUs (FPDU Alignment).  This has significant
527	   advantages for receivers.  When segments arrive with aligned FPDUs
528	   the receiver usually need not buffer any portion of the segment,
529	   allowing DDP to place it in its destination memory immediately, thus
530	   avoiding copies from intermediate buffers (DDP's reason for
531	   existence).

533	   MPA with an MPA-aware TCP receiver allows a DDP on MPA implementation
534	   to locate the start of ULPDUs that may be received out of order.  It
535	   also allows the implementation to determine if the entire ULPDU has
536	   been received.  As a result, MPA can pass out of order ULPDUs to DDP
537	   for immediate use.  This enables a DDP on MPA implementation to save
538	   a significant amount of intermediate storage by placing the ULPDUs in
539	   the right locations in the application buffers when they arrive,
540	   rather than waiting until full ordering can be restored.

542	   The ability of a receiver to recover out of order ULPDUs is optional
543	   and declared to the transmitter during startup.  When the receiver
544	   declares that it does not support out of order recovery, the
545	   transmitter does not add the control information to the data stream
546	   needed for out of order recovery.

548	   If TCP is not MPA-aware, then MPA receives a strictly ordered stream
549	   of data and does not deal with out of order ULPDUs.  In this case MPA
550	   passes each ULPDU to DDP when the last bytes arrive from TCP, along
551	   with the indication that they are in order.

553	   MPA implementations that support recovery of out of order ULPDUs MUST
554	   support a mechanism to indicate the ordering of ULPDUs as the sender
555	   transmitted them and indicate when missing intermediate segments
556	   arrive.  These mechanisms allow DDP to reestablish record ordering
557	   and report Delivery of complete messages (groups of records).

559	   MPA also addresses enhanced data integrity.  Some users of TCP have
560	   noted that the TCP checksum is not as strong as could be desired (see
561	   [CRCTCP]).  Studies such as [CRCTCP] have shown that the TCP checksum
562	   indicates segments in error at a much higher rate than the underlying
563	   link characteristics would indicate.  With these higher error rates,
564	   the chance that an error will escape detection, when using only the
565	   TCP checksum for data integrity, becomes a concern.  A stronger
566	   integrity check can reduce the chance of data errors being missed.

568	   MPA includes a CRC check to increase the ULPDU data integrity to the
569	   level provided by other modern protocols, such as SCTP [RFC2960].  It
570	   is possible to disable this CRC check; however, CRCs MUST be enabled
571	   unless it is clear that the end to end connection through the network
572	   has data integrity at least as good as an MPA with CRC enabled (for
573	   example when IPsec is implemented end to end).  DDP's ULP expects
574	   this level of data integrity and therefore the ULP does not have to
575	   provide its own duplicate data integrity and error recovery for lost
576	   data.

578	3  LLP and DDP requirements

580	   The following sections describe requirements on TCP and DDP to
581	   utilize MPA.  The DDP requirements enable the correct operation over
582	   MPA and TCP (as opposed to DDP over SCTP or other LLPs).

584	   The TCP requirements are mostly intended to support the MPA-aware TCP
585	   variation, which allows implementations that require less buffer
586	   memory and may provide better overall system performance.

588	3.1  TCP implementation Requirements to support MPA

590	   The TCP implementation MUST inform MPA when the TCP connection is
591	   closed or has begun closing (e.g. a FIN has been received).

593	3.1.1  TCP Transmit side

595	   To provide optimum performance, an MPA-aware transmit side TCP
596	   implementation SHOULD be enabled to:

598	   *   With an EMSS large enough to contain the FPDU(s), segment the
599	       outgoing TCP stream such that the first octet of every TCP
600	       Segment begins with an FPDU.  Multiple FPDUs MAY be packed into a
601	       single TCP segment as long as they are entirely contained in the
602	       TCP segment.

604	   *   Report the current EMSS to the MPA transmit layer.

606	   An MPA-aware TCP transmit side implementation MUST continue to use
607	   the method of segmentation expected by non-MPA applications (and
608	   described in TCP RFCs) when MPA is not enabled on the connection.
609	   When MPA is enabled above an MPA-aware TCP, it SHOULD specifically
610	   enable the segmentation rules described above for the DDP segments
611	   (FPDUs) posted for transmission.

613	   If the transmit side TCP implementation is not able to segment the
614	   TCP stream as indicated above, MPA SHOULD make a best effort to
615	   achieve that result.  For example, using the TCP_NODELAY socket
616	   option to disable the Nagle algorithm will usually result in many of
617	   the segments starting with an FPDU.

619	   If the transmit side TCP implementation is not able to report the
620	   EMSS, MPA SHOULD use the current MTU value to establish a likely FPDU
621	   size, taking into account the various expected header sizes.
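
   The MTU fallback above can be sketched as follows.  This is a minimal
   illustration, not part of the protocol: the function names are
   hypothetical, and the header sizes assume IPv4 without IP options and
   a caller-supplied allowance for TCP options.

```python
IPV4_HEADER = 20  # octets, assuming no IP options
TCP_HEADER = 20   # octets, before any TCP options


def estimate_emss(mtu: int, tcp_option_octets: int = 0) -> int:
    """Estimate the EMSS when TCP cannot report it (hypothetical helper)."""
    return mtu - IPV4_HEADER - TCP_HEADER - tcp_option_octets


def likely_fpdu_size(mtu: int, tcp_option_octets: int = 0) -> int:
    """Largest FPDU expected to fit one segment; FPDUs are multiples of 4."""
    return estimate_emss(mtu, tcp_option_octets) & ~3
```

   For an Ethernet MTU of 1500 octets and 12 octets of TCP options (a
   typical allowance when timestamps are in use), this yields a likely
   FPDU size of 1448 octets.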

623	3.1.2  TCP Receive side

625	   When an MPA receive implementation and the MPA-aware receive side TCP
626	   implementation support handling out of order ULPDUs, the TCP receive
627	   implementation SHOULD be enabled to:

629	   *   Pass incoming TCP segments to MPA as soon as they have been
630	       received and validated, even if not received in order.  The TCP
631	       layer MUST have committed to keeping each segment before it can
632	       be passed to the MPA.  This means that the segment must have
633	       passed the TCP, IP, and lower layer data integrity validation
634	       (i.e., checksum), must be in the receive window, must not be a
635	       duplicate, must be part of the same epoch (if timestamps are used
636	       to verify this) and any other checks required by TCP RFCs.  The
637	       segment MUST NOT be passed to MPA more than once unless
638	       explicitly requested (see Section 7).

640	       This is not to imply that the data must be completely ordered
641	       before use.  An implementation MAY accept out of order segments,
642	       SACK them [RFC2018], and pass them to DDP immediately, before the
643	       segments needed to fill in the gaps arrive.
644	       Such an implementation MUST "commit" to the data early on, and
645	       MUST NOT overwrite it even if (or when) duplicate data arrives.
646	       MPA expects to utilize this "commit" to allow the passing of
647	       ULPDUs to DDP when they arrive, independent of ordering.  DDP
648	       uses the passed ULPDU to "place" the DDP segments (see [DDP] for
649	       more details).

651	   *   Provide a mechanism to indicate the ordering of TCP segments as
652	       the sender transmitted them.  One possible mechanism might be
653	       attaching the TCP sequence number to each segment.

655	   *   Provide a mechanism to indicate when a given TCP segment (and the
656	       prior TCP stream) is complete.  One possible mechanism might be
657	       to utilize the leading (left) edge of the TCP Receive Window.

659	       MPA uses the ordering and completion indications to inform DDP
660	       when a ULPDU is complete; MPA Delivers the FPDU to DDP.  DDP uses
661	       the indications to "deliver" its messages to the DDP consumer
662	       (see [DDP] for more details).

664	       DDP on MPA MUST utilize these two mechanisms to establish the
665	       Delivery semantics that DDP's consumers agree to.  These
666	       semantics are described fully in [DDP].  These include
667	       requirements on DDP's consumer to respect ownership of buffers
668	       prior to the time that DDP delivers them to the Consumer.
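
   The ordering and completion indications described above might be
   combined as in the following sketch.  It is illustrative only: the
   class and method names are hypothetical, sequence number wrap is
   ignored for brevity, and the "left edge" is modeled as a plain
   integer supplied by TCP.

```python
class OutOfOrderTracker:
    """Tracks data placed out of order until TCP's left edge passes it."""

    def __init__(self, start_seq: int) -> None:
        self.left_edge = start_seq
        self.pending = {}  # start sequence number -> length in octets

    def place(self, seq: int, length: int) -> None:
        """Record a validated unit; it is passed up at most once."""
        if seq >= self.left_edge and seq not in self.pending:
            self.pending[seq] = length

    def advance(self, new_left_edge: int) -> list:
        """Return start sequence numbers now complete, in sender order."""
        self.left_edge = max(self.left_edge, new_left_edge)
        done = sorted(s for s, n in self.pending.items()
                      if s + n <= self.left_edge)
        for s in done:
            del self.pending[s]
        return done
```

   Data recorded with place() can be placed immediately, while
   advance() reports which units may be Delivered once the gaps in
   front of them have been filled.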

670	   An MPA-aware TCP receive side implementation MUST continue to buffer
671	   TCP segments until completely ordered and then deliver them as
672	   expected by non-MPA applications (and described in TCP RFCs) when MPA
673	   is not enabled on the connection.  When MPA is enabled above an MPA-
674	   aware TCP, TCP SHOULD enable the in and out of order passing of data,
675	   and the separate ordering information as described above.

677	   When an MPA receive implementation is coupled with a TCP receive
678	   implementation that does not support the preceding mechanisms, TCP
679	   passes and Delivers incoming stream data to MPA in order.

681	3.2  MPA's interactions with DDP

683	   DDP requires MPA to maintain DDP record boundaries from the sender to
684	   the receiver.  When using MPA on TCP to send data, DDP provides
685	   records (ULPDUs) to MPA.  MPA will use the reliable transmission
686	   abilities of TCP to transmit the data, and will insert appropriate
687	   additional information into the TCP stream to allow the MPA receiver
688	   to locate the record boundary information.

690	   As such, MPA accepts complete records (ULPDUs) from DDP at the sender
691	   and returns them to DDP at the receiver.

693	   MPA combined with an MPA-aware TCP can only ensure FPDU Alignment
694	   with the TCP Header if the FPDU is less than or equal to TCP's EMSS.
695	   Since FPDU Alignment is generally desired by the receiver, DDP must
696	   cooperate with MPA to ensure FPDUs' lengths do not exceed the EMSS
697	   under normal conditions.  This is done with the MULPDU mechanism.

699	   MPA provides information to DDP on the current maximum size of the
700	   record that is acceptable to send (MULPDU).  DDP SHOULD limit each
701	   record size to MULPDU.  The range of MULPDU values MUST be between
702	   128 octets and 64768 octets, inclusive.

704	   The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
705	   MPA.  DDP MAY post a ULPDU of any size between one and 64768 octets,
706	   however MPA is not REQUIRED to support a ULPDU Length that is greater
707	   than the current MULPDU.

709	   While the maximum theoretical length supported by the MPA header
710	   ULPDU_Length field is 65535, TCP over IP requires the IP datagram
711	   maximum length to be 65535 octets.  To enable MPA to support FPDU
712	   Alignment, the maximum size of the FPDU must fit within an IP
713	   datagram.  Thus the ULPDU limit of 64768 octets was derived by taking
714	   the maximum IP datagram length, subtracting from it the maximum total
715	   length of the sum of the IPv4 header, TCP header, IPv4 options, TCP
716	   options, and the worst case MPA overhead, and then rounding the
717	   result down to a 128 octet boundary.
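
   The derivation above can be reproduced with the arithmetic below.
   The individual worst-case figures are assumptions, since the text
   does not list them: 60 octets each for the IPv4 and TCP headers with
   options, and an MPA overhead of header, CRC, three pad octets, and
   one Marker per 512 octets of TCP payload.

```python
MAX_IP_DATAGRAM = 65535
MAX_IPV4_HEADER = 60   # 20-octet header plus 40 octets of options (assumed)
MAX_TCP_HEADER = 60    # 20-octet header plus 40 octets of options (assumed)
MPA_HEADER = 2
MPA_CRC = 4
MAX_PAD = 3
MARKER_SIZE = 4
MARKER_INTERVAL = 512


def max_ulpdu_limit() -> int:
    """Recompute the ULPDU limit under the assumptions listed above."""
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    markers = -(-tcp_payload // MARKER_INTERVAL)  # ceiling division
    overhead = MPA_HEADER + MPA_CRC + MAX_PAD + markers * MARKER_SIZE
    return (tcp_payload - overhead) // 128 * 128  # round down to 128
```

   Under these assumptions the result is 64768 octets, matching the
   limit stated above.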

719	   On receive, MPA MUST pass each ULPDU with its length to DDP when it
720	   has been validated.

722	   If an MPA implementation supports passing out of order ULPDUs to DDP,
723	   the MPA implementation SHOULD:

725	   *   Pass each ULPDU with its length to DDP as soon as it has been
726	       fully received and validated.

728	   *   Provide a mechanism to indicate the ordering of ULPDUs as the
729	       sender transmitted them.  One possible mechanism might be
730	       providing the TCP sequence number for each ULPDU.

732	   *   Provide a mechanism to indicate when a given ULPDU (and prior
733	       ULPDUs) are complete (Delivered to DDP).  One possible mechanism
734	       might be to allow DDP to see the current outgoing TCP Ack
735	       sequence number.

737	   *   Provide an indication to DDP that the TCP has closed or has begun
738	       to close the connection (e.g. received a FIN).

740	   MPA MUST provide the protocol version negotiated with its peer to
741	   DDP.  DDP will use this version to set the version in its header and
742	   to report the version to [RDMAP].

744	4  FPDU Formats

746	   MPA senders create FPDUs out of ULPDUs.  The format of an FPDU shown
747	   below MUST be used for all MPA FPDUs.  For purposes of clarity,
748	   Markers are not shown in Figure 2.

750	       0                   1                   2                   3
751	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
752	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
753	      |          ULPDU_Length         |                               |
754	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
755	      |                                                               |
756	      ~                                                               ~
757	      ~                            ULPDU                              ~
758	      |                                                               |
759	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
760	      |                               |          PAD (0-3 octets)     |
761	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
762	      |                             CRC                               |
763	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
764	                           Figure 2 FPDU Format

766	   ULPDU_Length: 16 bits (unsigned integer).  This is the number of
767	   octets of the contained ULPDU.  It does not include the length of the
768	   FPDU header itself, the pad, the CRC, or of any Markers that fall
769	   within the ULPDU.  The 16-bit ULPDU Length field is large enough to
770	   support the largest IP datagrams for IPv4 or IPv6.

772	   PAD: The PAD field trails the ULPDU and contains between zero and
773	   three octets of data.  The pad data MUST be set to zero by the sender
774	   and ignored by the receiver (except for CRC checking).  The length of
775	   the pad is set so as to make the size of the FPDU an integral
776	   multiple of four.

778	   CRC: 32 bits.  When CRCs are enabled, this field contains a CRC32C
779	   check value, which is used to verify the entire contents of the FPDU,
780	   using CRC32C.  See section 5.2 CRC Calculation on page 23.  When CRCs
781	   are not enabled, this field is still present, may contain any value,
782	   and MUST NOT be checked.

784	   The FPDU adds a minimum of 6 octets to the length of the ULPDU.  In
785	   addition, the total length of the FPDU will include the length of any
786	   Markers and from 0 to 3 pad octets added to round-up the ULPDU size.
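
   As a small worked illustration of the length rules above, the
   following sketch computes the pad and total FPDU length.  The
   function names are hypothetical; the Marker count is supplied by the
   caller because it depends on where the FPDU falls in the stream.

```python
def pad_length(ulpdu_len: int) -> int:
    """Pad octets needed so header + ULPDU + PAD is a multiple of four."""
    return -(2 + ulpdu_len) % 4


def fpdu_length(ulpdu_len: int, marker_count: int = 0) -> int:
    """Header (2) + ULPDU + PAD + CRC (4) + 4 octets per contained Marker."""
    return 2 + ulpdu_len + pad_length(ulpdu_len) + 4 + 4 * marker_count
```

   A 16 octet ULPDU needs 2 pad octets and, with one Marker, occupies
   28 octets; a 42 octet ULPDU needs no pad and occupies 48 octets
   without a Marker.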

788	4.1  Marker Format

790	   The format of a Marker MUST be as specified in Figure 3:

792	       0                   1                   2                   3
793	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
794	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
795	      |           RESERVED            |            FPDUPTR            |
796	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
797	                          Figure 3 Marker Format

799	   RESERVED: The Reserved field MUST be set to zero on transmit and
800	   ignored on receive (except for CRC calculation).

802	   FPDUPTR: The FPDU Pointer is a relative pointer, 16-bits long,
803	   interpreted as an unsigned integer that indicates the number of
804	   octets in the TCP stream from the beginning of the ULPDU Length field
805	   to the first octet of the entire Marker.  The least significant two
806	   bits MUST always be set to zero at the transmitter, and the receivers
807	   MUST always treat these as zero for calculations.

809	5  Data Transfer Semantics

811	   This section discusses some characteristics and behavior of the MPA
812	   protocol as well as implications of that protocol.

814	5.1  MPA Markers

816	   MPA Markers are used to identify the start of FPDUs when packets are
817	   received out of order.  This is done by locating the Markers at fixed
818	   intervals in the data stream (which is correlated to the TCP sequence
819	   number) and using the Marker value to locate the preceding FPDU
820	   start.

822	   All MPA Markers are included in the containing FPDU CRC calculation
823	   (when both CRCs and Markers are in use).

825	   The MPA receiver's ability to locate out of order FPDUs and pass the
826	   ULPDUs to DDP is implementation dependent.  MPA/DDP allows those
827	   receivers that are able to deal with out of order FPDUs in this way
828	   to require the insertion of Markers in the data stream.  When the
829	   receiver cannot deal with out of order FPDUs in this way, it may
830	   disable the insertion of Markers at the sender.  All MPA senders MUST
831	   be able to generate Markers when their use is declared by the
832	   opposing receiver (see section 6.1 Connection setup on page 32).

834	   When Markers are enabled, MPA senders MUST insert a Marker into the
835	   data stream at a 512 octet periodic interval in the TCP Sequence
836	   Number Space.  The Marker contains a 16 bit unsigned integer referred
837	   to as the FPDUPTR (FPDU Pointer).

839	   If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16 bit
840	   relative back-pointer.  FPDUPTR MUST contain the number of octets in
841	   the TCP stream from the beginning of the ULPDU Length field to the
842	   first octet of the Marker, unless the Marker falls between FPDUs.
843	   Thus the location of the first octet of the previous FPDU header can
844	   be determined by subtracting the value of the given Marker from the
845	   current octet-stream sequence number (i.e. TCP sequence number) of
846	   the first octet of the Marker.  Note that this computation MUST take
847	   into account that the TCP sequence number could have wrapped between
848	   the Marker and the header.

850	   An FPDUPTR value of 0x0000 is a special case - it is used when the
851	   Marker falls exactly between FPDUs (between the preceding FPDU CRC
852	   field, and the next FPDU's ULPDU Length field).  In this case, the
853	   Marker is considered to be contained in the following FPDU; the
854	   Marker MUST be included in the CRC calculation of the FPDU following
855	   the Marker (if CRCs are being generated or checked).  Thus an FPDUPTR
856	   value of 0x0000 means that immediately following the Marker is an
857	   FPDU header (the ULPDU Length field).

859	   Since all FPDUs are integral multiples of 4 octets, the bottom two
860	   bits of the FPDUPTR as calculated by the sender are zero.  MPA
861	   reserves these bits so they MUST be treated as zero for computation
862	   at the receiver.
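
   A receiver-side sketch of the back-pointer computation follows.  The
   function name is hypothetical; the logic follows the rules above:
   mask the low two bits, treat zero as "the header follows the
   Marker", and perform the subtraction modulo 2^32 so that a wrapped
   TCP sequence number is handled.

```python
SEQ_MOD = 1 << 32
MARKER_SIZE = 4


def fpdu_start_from_marker(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the ULPDU Length field the Marker refers to."""
    ptr = fpduptr & 0xFFFC  # low two bits are treated as zero
    if ptr == 0:
        # Marker fell between FPDUs: the header follows the Marker.
        return (marker_seq + MARKER_SIZE) % SEQ_MOD
    return (marker_seq - ptr) % SEQ_MOD
```

   For example, a Marker at stream offset 0x200 carrying FPDUPTR 0x0014
   points back to an FPDU header at offset 0x1ec.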

864	   When Markers are enabled (see section 6.1 Connection setup on page
865	   32), the MPA Markers MUST be inserted immediately preceding the first
866	   FPDU of Full Operation phase, and at every 512th octet of the TCP
867	   octet stream thereafter.  As a result, the first Marker has an
868	   FPDUPTR value of 0x0000.  If the first Marker begins at octet
869	   sequence number SeqStart, then Markers are inserted such that the
870	   first octet of the Marker is at octet sequence number SeqNum if the
871	   remainder of (SeqNum - SeqStart) mod 512 is zero.  Note that SeqNum
872	   can wrap.

874	   For example, if the TCP sequence number were used to calculate the
875	   insertion point of the Marker, the starting TCP sequence number is
876	   unlikely to be zero, and 512 octet multiples are unlikely to fall on
877	   a modulo 512 of zero.  If the MPA connection is started at TCP
878	   sequence number 11, then the 1st Marker will begin at 11, and
879	   subsequent Markers will begin at 523, 1035, etc.
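
   The placement rule can be sketched as below, assuming 32-bit TCP
   sequence arithmetic.  The function names are hypothetical.

```python
SEQ_MOD = 1 << 32
MARKER_INTERVAL = 512


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a Marker begins at seq_num, given the first Marker's seq."""
    return ((seq_num - seq_start) % SEQ_MOD) % MARKER_INTERVAL == 0


def marker_positions(seq_start: int, count: int) -> list:
    """Sequence numbers of the first `count` Markers, allowing for wrap."""
    return [(seq_start + i * MARKER_INTERVAL) % SEQ_MOD
            for i in range(count)]
```

   With a starting sequence number of 11, marker_positions(11, 3)
   yields 11, 523, and 1035, as in the example above.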

881	   If an FPDU is large enough to contain multiple Markers, they MUST all
882	   point to the same point in the TCP stream: the first octet of the
883	   ULPDU Length field for the FPDU.

885	   If a Marker interval contains multiple FPDUs (the FPDUs are small),
886	   the Marker MUST point to the start of the ULPDU Length field for the
887	   FPDU containing the Marker unless the Marker falls between FPDUs, in
888	   which case the Marker MUST be zero.

890	   The following example shows an FPDU containing a Marker.

892	       0                   1                   2                   3
893	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
894	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
895	      |       ULPDU Length (0x0010)   |                               |
896	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
897	      |                                                               |
898	      +                                                               +
899	      |                         ULPDU (octets 0-9)                    |
900	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
901	      |            (0x0000)           |        FPDU ptr (0x000C)      |
902	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
903	      |                        ULPDU (octets 10-15)                   |
904	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
905	      |                               |          PAD (2 octets:0,0)   |
906	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
907	      |                              CRC                              |
908	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
909	                 Figure 4 Example FPDU Format with Marker
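
   One way a sender might produce the layout of Figure 4 is sketched
   below.  This is an illustration under stated assumptions, not a
   reference implementation: insert_markers is a hypothetical helper,
   it operates on the header, ULPDU, and PAD before the CRC is
   appended, and start_offset is counted from the first Marker of the
   connection rather than taken from the TCP sequence number.

```python
import struct

MARKER_INTERVAL = 512


def insert_markers(unmarked: bytes, start_offset: int) -> bytes:
    """Insert Markers into header+ULPDU+PAD (the CRC is appended later)."""
    assert len(unmarked) % 4 == 0 and start_offset % 4 == 0
    out = bytearray()
    pos = start_offset
    header_pos = None
    for i in range(0, len(unmarked), 4):
        if pos % MARKER_INTERVAL == 0:
            # FPDUPTR is zero when the Marker precedes the FPDU header.
            ptr = 0 if header_pos is None else pos - header_pos
            out += struct.pack("!HH", 0, ptr)
            pos += 4
        if header_pos is None:
            header_pos = pos
        out += unmarked[i:i + 4]
        pos += 4
    if pos % MARKER_INTERVAL == 0:
        # A Marker directly after the PAD still belongs to this FPDU.
        out += struct.pack("!HH", 0, pos - header_pos)
    return bytes(out)
```

   Starting a 16 octet ULPDU at offset 500 places the Marker after the
   first 12 octets of the FPDU with an FPDUPTR of 0x000C, which is the
   arrangement shown in Figure 4.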

911	   MPA Receivers MUST preserve ULPDU boundaries when passing data to
912	   DDP.  MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
913	   DDP and not the Markers, headers, and CRC.

915	5.2  CRC Calculation

917	   An MPA implementation MUST implement CRC support and MUST either:

919	   (1) always use CRCs; the MPA provider is not REQUIRED to support
920	       an administrator's request that CRCs not be used.

922	       or

924	   (2a) only indicate a preference to not use CRCs on the explicit
925	       request of the system administrator, via an interface not defined
926	       in this spec.  The default configuration for a connection MUST be
927	       to use CRCs.

929	   (2b) disable CRC checking (and possibly generation) if both the local
930	       and remote endpoints indicate preference to not use CRCs.

932	   The decision for hosts to request CRC suppression MAY be made on an
933	   administrative basis for any path that provides equivalent protection
934	   from undetected errors as an end-to-end CRC32c.

936	   The process MUST be invisible to the ULP.

938	   After receipt of an MPA startup declaration indicating that its peer
939	   requires CRCs, an MPA instance MUST continue generating and checking
940	   CRCs until the connection terminates.  If an MPA instance has
941	   declared that it does not require CRCs, it MUST turn off CRC checking
942	   immediately after receipt of an MPA mode declaration indicating that
943	   its peer also does not require CRCs.  It MAY continue generating
944	   CRCs.  See section 6.1 Connection setup on page 32 for details on the
945	   MPA startup.
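
   The resulting rule for CRC checking reduces to a simple predicate,
   sketched here with hypothetical names.  It only models the outcome
   of the startup declarations; the declarations themselves are
   described in section 6.1.

```python
def crc_checking_enabled(local_requires_crc: bool,
                         peer_requires_crc: bool) -> bool:
    """CRC checking is turned off only when both ends decline CRCs."""
    return local_requires_crc or peer_requires_crc
```

   The default configuration corresponds to local_requires_crc being
   True, in which case checking stays enabled regardless of the peer.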

947	   When sending an FPDU, the sender MUST include a CRC field.  When CRCs
948	   are enabled, the CRC field in the MPA FPDU MUST be computed using the
949	   CRC32C polynomial in the manner described in the iSCSI Protocol
950	   [iSCSI] document for Header and Data Digests.

952	   The fields which MUST be included in the CRC calculation when sending
953	   an FPDU are as follows:

955	   1)  If a Marker does not immediately precede the ULPDU Length field,
956	       the CRC-32c is calculated from the first octet of the ULPDU
957	       Length field, through all the ULPDU and Markers (if present), to
958	       the last octet of the PAD (if present), inclusive.  If there is a
959	       Marker immediately following the PAD, the Marker is included in
960	       the CRC calculation for this FPDU.

962	   2)  If a Marker immediately precedes the first octet of the ULPDU
963	       Length field of the FPDU, (i.e. the Marker fell between FPDUs,
964	       and thus is required to be included in the second FPDU), the CRC-
965	       32c is calculated from the first octet of the Marker, through the
966	       ULPDU Length header, through all the ULPDU and Markers (if
967	       present), to the last octet of the PAD (if present), inclusive.

969	   3)  After calculating the CRC-32c, the resultant value is placed into
970	       the CRC field at the end of the FPDU.

972	   When an FPDU is received, and CRC checking is enabled, the receiver
973	   MUST first perform the following:

975	   1)  Calculate the CRC of the incoming FPDU in the same fashion as
976	       defined above.

978	   2)  Verify that the calculated CRC-32c value is the same as the
979	       received CRC-32c value found in the FPDU CRC field.  If not, the
980	       receiver MUST treat the FPDU as an invalid FPDU.
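
   The send and receive steps above can be sketched for an FPDU that
   contains no Markers.  The CRC32C routine is the usual table-driven
   form of the Castagnoli polynomial.  Two points are assumptions
   rather than statements from this document: the helper names, and
   the byte order in which the CRC value is written to the CRC field,
   which here follows the convention of the iSCSI digest examples.

```python
import struct


def _make_table():
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """CRC32C (Castagnoli), reflected, initial value and final XOR all ones."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """ULPDU_Length + ULPDU + zero PAD + CRC, with no Markers."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += bytes(-len(body) % 4)
    # Assumed byte order for the CRC field (iSCSI digest convention).
    return body + struct.pack("<I", crc32c(body))


def verify_fpdu(fpdu: bytes) -> bool:
    """Recompute the CRC over everything before the CRC field and compare."""
    body, received = fpdu[:-4], fpdu[-4:]
    return struct.pack("<I", crc32c(body)) == received
```

   A receiver that gets False from verify_fpdu would treat the FPDU as
   invalid, as required by step 2.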

982	   The procedure for handling invalid FPDUs is covered in the Error
983	   Section (see section 7 on page 46).

985	   The following is an annotated hex dump of an example FPDU sent as the
986	   first FPDU on the stream.  As such, it starts with a Marker.  The
987	   FPDU contains a 42 octet ULPDU (an example DDP segment) which in turn
988	   carries 24 octets of DDP Send Data, a data load that is all zeros.
989	   The CRC32c has been correctly calculated and can be used as a
990	   reference.  See the [DDP] and [RDMAP] specifications for
991	   definitions of the DDP Control field, Queue, MSN, MO, and Send Data.

993	       Octet Contents  Annotation
994	       Count

996	       0000    00      Marker: Reserved
997	       0001    00
998	       0002    00      Marker: FPDUPTR
999	       0003    00
1000	       0004    00      ULPDU Length
1001	       0005    2a
1002	       0006    41      DDP Control Field, Send with Last flag set
1003	       0007    43
1004	       0008    00      Reserved (DDP STag position with no STag)
1005	       0009    00
1006	       000a    00
1007	       000b    00
1008	       000c    00      DDP Queue = 0
1009	       000d    00
1010	       000e    00
1011	       000f    00
1012	       0010    00      DDP MSN = 1
1013	       0011    00
1014	       0012    00
1015	       0013    01
1016	       0014    00      DDP MO = 0
1017	       0015    00
1018	       0016    00
1019	       0017    00
1020	       0018    00      DDP Send Data (24 octets of zeros)
1021	       ...
1022	       002f    00
1023	       0030    52      CRC32c
1024	       0031    23
1025	       0032    99
1026	       0033    83
1027	                  Figure 5 Annotated Hex Dump of an FPDU

1029	   The following is an example sent as the second FPDU of the stream
1030	   where the first FPDU (which is not shown here) had a length of 492
1031	   octets and was also a Send to Queue 0 with Last Flag set.  This
1032	   example contains a Marker.

1034	       Octet Contents  Annotation
1035	       Count

1037	       01ec    00      Length
1038	       01ed    2a
1039	       01ee    41      DDP Control Field: Send with Last Flag set
1040	       01ef    43
1041	       01f0    00      Reserved (DDP STag position with no STag)
1042	       01f1    00
1043	       01f2    00
1044	       01f3    00
1045	       01f4    00      DDP Queue = 0
1046	       01f5    00
1047	       01f6    00
1048	       01f7    00
1049	       01f8    00      DDP MSN = 2
1050	       01f9    00
1051	       01fa    00
1052	       01fb    02
1053	       01fc    00      DDP MO = 0
1054	       01fd    00
1055	       01fe    00
1056	       01ff    00
1057	       0200    00      Marker: Reserved
1058	       0201    00
1059	       0202    00      Marker: FPDUPTR
1060	       0203    14
1061	       0204    00      DDP Send Data (24 octets of zeros)
1062	       ...
1063	       021b    00
1064	       021c    84      CRC32c
1065	       021d    92
1066	       021e    58
1067	       021f    98
1068	            Figure 6 Annotated Hex Dump of an FPDU with Marker

1070	5.3  MPA on TCP Sender Segmentation

1072	   The various TCP RFCs allow considerable choice in segmenting a TCP
1073	   stream.  In order to optimize FPDU recovery at the MPA receiver, MPA
1074	   specifies additional segmentation rules.

1076	   MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
1077	   contained in one FPDU.

1079	   An MPA-aware TCP sender SHOULD, when enabled for MPA, on TCP
1080	   implementations that support this, and with an EMSS large enough to
1081	   contain at least one FPDU, segment the outbound TCP stream such that
1082	   each TCP segment begins with an FPDU, and fully contains all included
1083	   FPDUs.

1085	        Implementation note: To achieve the previous segmentation rule,
1086	        an MPA-aware TCP sender implementation SHOULD disable TCP's
1087	        Nagle [RFC0896] algorithm, communicate the FPDU boundaries to
1088	        TCP, and make other minor changes such as the reporting of EMSS
1089	        to MPA.
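
   The segmentation rule can be illustrated with a greedy packer that
   works on FPDU lengths only.  This is a sketch with a hypothetical
   function name; it assumes every FPDU fits within the EMSS and does
   not model the exceptions to the rule (MTU changes, small receive
   windows, retransmission).

```python
def pack_fpdus(fpdu_lengths, emss):
    """Group FPDUs so each segment starts with, and fully holds, its FPDUs."""
    segments, current, used = [], [], 0
    for n in fpdu_lengths:
        if n > emss:
            raise ValueError("FPDU larger than EMSS cannot stay aligned")
        if current and used + n > emss:
            # Close the segment instead of splitting an FPDU across two.
            segments.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        segments.append(current)
    return segments
```

   With an EMSS of 1460 octets, FPDUs of 1000, 400, 200, and 1460
   octets are grouped as [1000, 400], [200], and [1460]; the second
   segment is "short" rather than carrying a partial FPDU.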

1091	   There are exceptions to the above rule.  Once an ULPDU is provided to
1092	   MPA, the MPA on TCP sender MUST transmit it or fail the connection;
1093	   it cannot be repudiated.  As a result, during changes in MTU and
1094	   EMSS, or when TCP's Receive Window size (RWIN) becomes too small, it
1095	   may be necessary to send FPDUs that do not conform to the
1096	   segmentation rule above.

1098	   A possible, but less desirable, alternative is to use IP
1099	   fragmentation on accepted FPDUs to deal with MTU reductions or
1100	   extremely small EMSS.

1102	   The sender MUST still format the FPDU according to FPDU format as
1103	   shown in Figure 2.

1105	   On a retransmission, TCP does not necessarily preserve original TCP
1106	   segmentation boundaries.  This can lead to the loss of FPDU Alignment
1107	   and containment within a TCP segment during TCP retransmissions.  An
1108	   MPA-aware TCP sender SHOULD try to preserve original TCP segmentation
1109	   boundaries on a retransmission.

1111	5.3.1  Effects of MPA on TCP Segmentation

1113	   DDP/MPA senders will fill TCP segments to the EMSS with a single FPDU
1114	   when a DDP message is large enough.  Since the DDP message may not
1115	   exactly fit into TCP segments, a "message tail" often occurs that
1116	   results in an FPDU that is smaller than a single TCP segment.
1117	   Additionally some DDP messages may be considerably shorter than the
1118	   EMSS.  If a small FPDU is sent in a single TCP segment the result is
1119	   a "short" TCP segment.

1121	   Applications expected to see strong advantages from Direct Data
1122	   Placement include transaction-based applications and throughput
1123	   applications.  Request/response protocols typically send one FPDU per
1124	   TCP segment and then wait for a response.  Under these conditions,
1125	   these "short" TCP segments are an appropriate and expected effect of
1126	   the segmentation.

1128	   Another possibility is that the application might be sending multiple
1129	   messages (FPDUs) to the same endpoint before waiting for a response.
1131	   In this case, the segmentation policy would tend to reduce the
1132	   available connection bandwidth by under-filling the TCP segments.

1134	   TCP implementations often utilize the Nagle [RFC0896] algorithm to
1135	   ensure that segments are filled to the EMSS whenever the round trip
1136	   latency is large enough that the source stream can fully fill
1137	   segments before Acks arrive.  The algorithm does this by delaying the
1138	   transmission of TCP segments until a ULP can fill a segment, or until
1139	   an ACK arrives from the far side.  The algorithm thus allows for
1140	   smaller segments when latencies are shorter to keep the ULP's end to
1141	   end latency to reasonable levels.

1143	   The Nagle algorithm is not mandatory to use [RFC1122].

1145	   If Nagle or other algorithms for detecting the availability of
1146	   multiple FPDUs for transmission are used, "packing" of multiple FPDUs
1147	   into TCP segments can occur.

1149	   If a "message tail", small DDP messages, or the start of a larger DDP
1150	   message are available, MPA MAY pack multiple FPDUs into TCP segments.
1151	   When this is done, the TCP segments can be more fully utilized, but,
1152	   due to the size constraints of FPDUs, segments may not be filled to
1153	   the EMSS.

1155	        Note that MPA receivers must do more processing of a TCP segment
1156	        that contains multiple FPDUs; this may affect the performance of
1157	        some receiver implementations.

1159	   It is up to the ULP to decide if Nagle is useful with DDP/MPA.  Note
1160	   that many of the applications expected to take advantage of MPA/DDP
1161	   prefer to avoid the extra delays caused by Nagle.  In such scenarios
1162	   it is anticipated there will be minimal opportunity for packing at
1163	   the transmitter and receivers may choose to optimize their
1164	   performance for this anticipated behavior.

1166	   Therefore, the application is expected to set TCP parameters such
1167	   that it can trade off latency and wire efficiency.  This is
1168	   accomplished by setting the TCP_NODELAY socket option (which disables
1169	   Nagle).
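
   As an illustration only, the following sketch shows how an
   application might toggle the Nagle algorithm through the sockets API.
   The helper names set_nagle and nagle_enabled are hypothetical and are
   not defined by this specification; only the TCP_NODELAY option itself
   comes from the text above.

```python
import socket


def set_nagle(sock, enabled):
    """Enable or disable the Nagle algorithm on a TCP socket.

    TCP_NODELAY = 1 disables Nagle (favoring latency);
    TCP_NODELAY = 0 leaves Nagle enabled (favoring wire efficiency).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if enabled else 1)


def nagle_enabled(sock):
    """Return True if Nagle is currently enabled on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    set_nagle(s, enabled=False)  # latency-sensitive request/response ULP
    print("Nagle enabled:", nagle_enabled(s))
    s.close()
```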

1171	   When latency is not critical, the application is expected to leave
1172	   Nagle enabled.  In this case, the TCP implementation may pack any
1173	   available stream data into TCP segments so that the segments are
1174	   filled to the EMSS.  If the amount of data available is not enough to
1175	   fill the TCP segment when it is prepared for transmission, TCP can
1176	   send the segment partly filled, or use the Nagle algorithm to wait
1177	   for the ULP to post more data (discussed below).

1179	5.3.2  FPDU Size Considerations

1181	   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
1182	   the size of the largest ULPDU fitting in an FPDU.  For an empty TCP
1183	   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
1184	   space for Markers and pad octets.

1186	        The maximum ULPDU Length for a single ULPDU when Markers are
1187	        present MUST be computed as:

1189	        MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)

1191	   The formula above accounts for the worst-case number of Markers.

1193	        The maximum ULPDU Length for a single ULPDU when Markers are NOT
1194	        present MUST be computed as:

1196	        MULPDU = EMSS - (6 + EMSS mod 4)

1198	   As a further optimization of the wire efficiency an MPA
1199	   implementation MAY dynamically adjust the MULPDU (see section 5.3.1
1200	   for latency and wire efficiency trade-offs).  When one or more FPDUs
1201	   are already packed into a TCP Segment, MULPDU MAY be reduced
1202	   accordingly.

1204	   DDP SHOULD provide ULPDUs that are as large as possible, but less
1205	   than or equal to MULPDU.

1207	   If the TCP implementation needs to adjust EMSS to support MTU
1208	   changes, the MULPDU value is changed accordingly.

1210	   In certain rare situations, the EMSS may shrink below 128 octets in
1211	   size.  If this occurs, the MPA on TCP sender MUST NOT shrink the
1212	   MULPDU below 128 octets and is not REQUIRED to follow the
1213	   segmentation rules in Section 5.3 MPA on TCP Sender Segmentation on
1214	   page 26.

1216	   If one or more FPDUs are already packed into a TCP segment, such that
1217	   the remaining room is less than 128 octets, MPA MUST NOT provide a
1218	   MULPDU smaller than 128.  In this case, MPA would typically provide a
1219	   MULPDU for the next full sized segment, but may still pack the next
1220	   FPDU into the small remaining room, provided that the next FPDU is
1221	   small enough to fit.

1223	   The value 128 is chosen to allow DDP designers room for the DDP
1224	   Header and some user data.
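
   The MULPDU rules of this section can be sketched as follows.  This
   is a minimal illustration for an empty TCP segment, not a normative
   implementation; the function name mulpdu and the constant names are
   hypothetical.  It applies the two formulas above and then the
   128-octet floor.

```python
FPDU_OVERHEAD = 6      # octets of FPDU overhead, as stated in this section
MARKER_SPACING = 512   # one Marker per 512 octets of stream
MARKER_SIZE = 4        # octets per Marker
MIN_MULPDU = 128       # MULPDU must not shrink below this value


def mulpdu(emss, markers):
    """Return the MULPDU for an empty TCP segment of size `emss` octets."""
    if markers:
        # Ceiling(EMSS / 512) computed with integer arithmetic.
        marker_octets = MARKER_SIZE * -(-emss // MARKER_SPACING)
    else:
        marker_octets = 0
    value = emss - (FPDU_OVERHEAD + marker_octets + emss % 4)
    # If the EMSS shrinks too far, the sender keeps MULPDU at 128 octets.
    return max(MIN_MULPDU, value)


if __name__ == "__main__":
    for emss in (1460, 1458, 100):
        print(emss, mulpdu(emss, markers=True), mulpdu(emss, markers=False))
```

   For example, with an EMSS of 1460 octets, three Markers are assumed
   in the worst case, so MULPDU is 1460 - (6 + 12 + 0) = 1442 octets
   with Markers and 1460 - 6 = 1454 octets without.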

1226	5.4  MPA Receiver FPDU Identification

1228	   An MPA receiver MUST first verify the FPDU before passing the ULPDU
1229	   to DDP.  To do this, the receiver MUST:

1231	   *   locate the start of the FPDU unambiguously,

1233	   *   verify its CRC (if CRC checking is enabled).

1235	   If the above conditions are true, the MPA receiver passes the ULPDU
1236	   to DDP.

1238	   To detect the start of the FPDU unambiguously one of the following
1239	   MUST be used:

1241	   1:  In an ordered TCP stream, the ULPDU Length field in the current
1242	       FPDU, when that FPDU has a valid CRC, can be used to identify
1243	       the beginning of the next FPDU.

1245	   2:  For receivers that support out of order reception of FPDUs (see
1246	       section 5.1 MPA Markers on page 20) a Marker can always be used
1247	       to locate the beginning of an FPDU (in FPDUs with valid CRCs).
1248	       Since the location of the Marker is known in the octet stream
1249	       (sequence number space), the Marker can always be found.

1251	   3:  Having found an FPDU by means of a Marker, following contiguous
1252	       FPDUs can be found by using the ULPDU Length fields (from FPDUs
1253	       with valid CRCs) to establish the next FPDU boundary.

1255	   The ULPDU Length field (see section 4) MUST be used to determine if
1256	   the entire FPDU is present before forwarding the ULPDU to DDP.

1258	   CRC calculation is discussed in section 5.2 on page 23 above.
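
   Method 1 above (walking an in-order stream by ULPDU Length) can be
   sketched as follows.  This is an illustration under stated
   assumptions, not a definitive receiver: it assumes Markers are
   disabled, and it assumes the FPDU layout from section 4 is a 2-octet
   ULPDU Length in network byte order, the ULPDU, 0 to 3 pad octets up
   to a 4-octet boundary, and a 4-octet CRC.  The names split_fpdus,
   fpdu_size, make_fpdu, and the crc_ok callback are hypothetical.

```python
import struct

LEN_SIZE = 2  # assumed size of the ULPDU Length field, in octets
CRC_SIZE = 4  # assumed size of the CRC field, in octets


def fpdu_size(ulpdu_len):
    """Total FPDU size: length field + ULPDU + pad to 4 octets + CRC."""
    pad = -(LEN_SIZE + ulpdu_len) % 4
    return LEN_SIZE + ulpdu_len + pad + CRC_SIZE


def split_fpdus(stream, crc_ok=None):
    """Split in-order stream bytes into ULPDUs.

    Returns (ulpdus, leftover), where leftover is a trailing partial
    FPDU that must be buffered until more data arrives.  `crc_ok` is a
    caller-supplied check; None models "CRC checking disabled".
    """
    ulpdus = []
    offset = 0
    while len(stream) - offset >= LEN_SIZE:
        (ulpdu_len,) = struct.unpack_from("!H", stream, offset)
        size = fpdu_size(ulpdu_len)
        if len(stream) - offset < size:
            break  # the entire FPDU is not yet present
        fpdu = stream[offset:offset + size]
        if crc_ok is not None and not crc_ok(fpdu):
            # A length from an FPDU with a bad CRC cannot be trusted.
            raise ValueError("FPDU failed CRC check")
        ulpdus.append(fpdu[LEN_SIZE:LEN_SIZE + ulpdu_len])
        offset += size
    return ulpdus, stream[offset:]


def make_fpdu(ulpdu):
    """Test helper: frame a ULPDU with zero pad and a zeroed CRC field."""
    pad = -(LEN_SIZE + len(ulpdu)) % 4
    return struct.pack("!H", len(ulpdu)) + ulpdu + bytes(pad) + bytes(CRC_SIZE)


if __name__ == "__main__":
    data = make_fpdu(b"hello") + make_fpdu(b"ab") + make_fpdu(b"xyz")[:3]
    print(split_fpdus(data))
```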

1260	5.4.1  Re-segmenting Middle boxes and non MPA-aware TCP senders

1262	   Since MPA on MPA-aware TCP senders start FPDUs on TCP segment
1263	   boundaries, a receiving DDP on MPA on TCP implementation may be able
1264	   to optimize the reception of data in various ways.

1266	   However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
1267	   segment boundaries.

1269	   Some MPA senders may be unable to conform to the sender requirements
1270	   because their implementation of TCP is not designed with MPA in mind.
1271	   Even if the sender is MPA-aware, the network may contain "middle
1272	   boxes" which modify the TCP stream by changing the segmentation.
1273	   This is generally interoperable with TCP and its users and MPA must
1274	   be no exception.

1276	   The presence of Markers in MPA (when enabled) allows an MPA receiver
1277	   to recover the FPDUs despite these obstacles, although it may be
1278	   necessary to utilize additional buffering at the receiver to do so.

1280	   Some of the cases that a receiver may have to contend with are listed
1281	   below as a reminder to the implementer:

1283	   *   A single Aligned and complete FPDU, either in order, or out of
1284	       order:  This can be passed to DDP as soon as validated, and
1285	       Delivered when ordering is established.

1287	   *   Multiple FPDUs in a TCP segment, aligned and fully contained,
1288	       either in order, or out of order:  These can be passed to DDP as
1289	       soon as validated, and Delivered when ordering is established.

1291	   *   Incomplete FPDU: The receiver should buffer until the remainder
1292	       of the FPDU arrives.  If the remainder of the FPDU is already
1293	       available, this can be passed to DDP as soon as validated, and
1294	       Delivered when ordering is established.

1296	   *   Unaligned FPDU start: The partial FPDU must be combined with its
1297	       preceding portion(s).  If the preceding parts are already
1298	       available, and the whole FPDU is present, this can be passed to
1299	       DDP as soon as validated, and Delivered when ordering is
1300	       established.  If the whole FPDU is not available, the receiver
1301	       should buffer until the remainder of the FPDU arrives.

1303	   *   Combinations of Unaligned or incomplete FPDUs (and potentially
1304	       other complete FPDUs) in the same TCP segment:  If any FPDU is
1305	       present in its entirety, or can be completed with portions
1306	       already available, it can be passed to DDP as soon as validated,
1307	       and Delivered when ordering is established.

1309	6  Connection Semantics

1311	6.1  Connection setup

1313	   MPA requires that the Consumer MUST activate MPA, and any TCP
1314	   enhancements for MPA, on a TCP half connection at the same location
1315	   in the octet stream at both the sender and the receiver.  This is
1316	   required in order for the Marker scheme to correctly locate the
1317	   Markers (if enabled) and to correctly locate the first FPDU.

1319	   MPA, and any TCP enhancements for MPA are enabled by the ULP in both
1320	   directions at once at an endpoint.

1322	   This can be accomplished several ways, and is left up to DDP's ULP:

1324	   *   DDP's ULP MAY require DDP on MPA startup immediately after TCP
1325	       connection setup.  This has the advantage that no streaming mode
1326	       negotiation is needed.  An example of such a protocol is shown in
1327	       Figure 9: Example Immediate Startup negotiation on page 42.

1329	       This may be accomplished by using a well-known port, or a service
1330	       locator protocol to locate an appropriate port on which DDP on
1331	       MPA is expected to operate.

1333	   *   DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
1334	       normal TCP startup, using TCP streaming data exchanges on the
1335	       same connection.  The exchange establishes that DDP on MPA (as
1336	       well as other ULPs) will be used, and exactly locates the point
1337	       in the octet stream where MPA is to begin operation.  Note that
1338	       such a negotiation protocol is outside the scope of this
1339	       specification.  A simplified example of such a protocol is shown
1340	       in Figure 8: Example Delayed Startup negotiation on page 39.

1342	   An MPA endpoint operates in two distinct phases.

1344	   The Startup Phase is used to verify correct MPA setup, exchange CRC
1345	   and Marker configuration, and optionally pass Private Data between
1346	   endpoints prior to completing a DDP connection.  During this phase,
1347	   specifically formatted frames are exchanged as TCP byte streams
1348	   without using CRCs or Markers.  During this phase a DDP endpoint need
1349	   not be "bound" to the MPA connection.  In fact, the choice of DDP
1350	   endpoint and its operating parameters may not be known until the
1351	   Consumer supplied Private Data (if any) has been examined by the
1352	   Consumer.

1354	   The second distinct phase is Full Operation during which FPDUs are
1355	   sent using all the rules that pertain (CRCs, Markers, MULPDU
1356	   restrictions etc.).  A DDP endpoint MUST be "bound" to the MPA
1357	   connection at entry to this phase.

1359	   When Private Data is passed between ULPs in the Startup Phase, the
1360	   ULP is responsible for interpreting that data, and then placing MPA
1361	   into Full Operation.

1363	   Note: The following text differentiates the two endpoints by calling
1364	       them Initiator and Responder.  This is quite arbitrary and is NOT
1365	       related to the TCP startup (SYN, SYN/ACK sequence).  The
1366	       Initiator is the side that sends first in the MPA startup
1367	       sequence (the MPA Request Frame).

1369	   Note: The possibility that both endpoints would be allowed to make a
1370	       connection at the same time, sometimes called an active/active
1371	       connection, was considered by the work group and rejected.  There
1372	       were several motivations for this decision.  One was that
1373	       applications needing this facility were few (none other than
1374	       theoretical at the time of this draft).  Another was that the
1375	       facility created some implementation difficulties, particularly
1376	       with the "dual stack" designs described later on.  A last issue
1377	       was that dealing with rejected connections at startup would have
1378	       required at least an additional frame type, and more recovery
1379	       actions, complicating the protocol.  While none of these issues
1380	       was overwhelming, the group and implementers were not motivated
1381	       to do the work to resolve these issues.  The protocol includes a
1382	       method of detecting these active/active startup attempts so that
1383	       they can be rejected and an error reported.

1385	   The ULP is responsible for determining which side is Initiator or
1386	   Responder.  For client/server type ULPs this is easy.  For peer-peer
1387	   ULPs (which might utilize a TCP style active/active startup), some
1388	   mechanism (not defined by this specification) must be established, or
1389	   some streaming mode data exchanged prior to MPA startup to determine
1390	   the side which starts in Initiator and which starts in Responder MPA
1391	   mode.

1393	6.1.1  MPA Request and Reply Frame Format

1395	       0                   1                   2                   3
1396	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1397	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1398	   0  |                                                               |
1399	      +         Key (16 bytes containing "MPA ID Req Frame")          +
1400	   4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
1401	      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
1402	   8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
1403	      +                                                               +
1404	   12 |                                                               |
1405	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1406	   16 |M|C|R| Res     |     Rev       |          PD_Length            |
1407	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1408	      |                                                               |
1409	      ~                                                               ~
1410	      ~                   Private Data                                ~
1411	      |                                                               |
1412	      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1413	      |                               |
1414	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1415	                     Figure 7 MPA Request/Reply Frame

1417	   Key: This field contains the "key" used to validate that the sender
1418	       is an MPA sender.  Initiator mode senders MUST set this field to
1419	       the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
1420	       49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal).  Responder
1421	       mode receivers MUST check this field for the same value, and
1422	       close the connection and report an error locally if any other
1423	       value is detected.  Responder mode senders MUST set this field to
1424	       the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
1425	       49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal).  Initiator
1426	       mode receivers MUST check this field for the same value, and
1427	       close the connection and report an error locally if any other
1428	       value is detected.

1430	   M: This bit, when sent in an MPA Request Frame or an MPA Reply Frame,
1431	       declares a receiver's requirement for Markers.  When the value
1432	       in a received MPA Request Frame or MPA Reply Frame is '0',
1433	       Markers MUST NOT be added to the data stream by the sender.
1434	       When the value is '1', Markers MUST be added as described in
1435	       section 5.1 MPA Markers on page 20.

1437	   C: This bit declares an endpoint's preferred CRC usage.  When this
1438	       field is '0' in the MPA Request Frame and the MPA Reply Frame,
1439	       CRCs MUST NOT be checked and need not be generated by either
1440	       endpoint.  When this bit is '1' in either the MPA Request Frame
1441	       or MPA Reply Frame, CRCs MUST be generated and checked by both
1442	       endpoints.  Note that even when not in use, the CRC field remains
1443	       present in the FPDU.  When CRCs are not in use, the CRC field
1444	       MUST be considered valid for FPDU checking regardless of its
1445	       contents.

1447	   R: This bit is set to zero, and not checked on reception in the MPA
1448	       Request Frame.  In the MPA Reply Frame, this bit is the Rejected
1449	       Connection bit, set by the Responder's ULP to indicate acceptance
1450	       '0', or rejection '1', of the connection parameters provided in
1451	       the Private Data.

1453	   Res: This field is reserved for future use.  It MUST be set to zero
1454	       when sending, and not checked on reception.

1456	   Rev: This field contains the Revision of MPA.  For this version of
1457	       the specification senders MUST set this field to one.  MPA
1458	       receivers compliant with this version of the specification MUST
1459	       check this field.  If the MPA receiver cannot interoperate with
1460	       the received version, then it MUST close the connection and
1461	       report an error locally.  Otherwise, the MPA receiver should
1462	       report the received version to the ULP.

1464	   PD_Length: This field MUST contain the length in Octets of the
1465	       Private Data field.  A value of zero indicates that there is no
1466	       Private Data field present at all.  If the receiver detects that
1467	       the PD_Length field does not match the length of the Private Data
1468	       field, or if the length of the Private Data field exceeds 512
1469	       octets, the receiver MUST close the connection and report an
1470	       error locally.  Otherwise, the MPA receiver should pass the
1471	       PD_Length value and Private Data to the ULP.

1473	   Private Data: This field may contain any value defined by ULPs or may
1474	       not be present.  The Private Data field MUST be between 0 and 512
1475	       octets in length.  ULPs define how to size, set, and validate
1476	       this field within these limits.
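
   The frame layout of Figure 7 can be sketched as follows.  This is a
   minimal illustration, not a definitive encoder: it assumes that bit 0
   in the figure is the most significant bit of its octet and that
   PD_Length is carried in network byte order.  It also treats revision
   1 as the only version the receiver can interoperate with.  The
   function and constant names are hypothetical.

```python
import struct

REQ_KEY = bytes.fromhex("4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65")
REP_KEY = bytes.fromhex("4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65")
MPA_REVISION = 1
MAX_PRIVATE_DATA = 512
FLAG_M, FLAG_C, FLAG_R = 0x80, 0x40, 0x20  # assumed bit positions
HEADER = struct.Struct("!16sBBH")          # Key, flags, Rev, PD_Length


def build_frame(key, markers, crc, rejected=False, private_data=b""):
    """Build an MPA Request or Reply Frame; Res bits are sent as zero."""
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = ((FLAG_M if markers else 0)
             | (FLAG_C if crc else 0)
             | (FLAG_R if rejected else 0))
    return HEADER.pack(key, flags, MPA_REVISION, len(private_data)) + private_data


def parse_frame(data, expected_key):
    """Validate a complete startup frame; raise ValueError if malformed."""
    if len(data) < HEADER.size:
        raise ValueError("frame shorter than fixed header")
    key, flags, rev, pd_length = HEADER.unpack_from(data)
    if key != expected_key:
        raise ValueError("unexpected Key")
    if rev != MPA_REVISION:
        raise ValueError("unsupported Revision")
    private_data = data[HEADER.size:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise ValueError("PD_Length does not match Private Data")
    return {
        "markers": bool(flags & FLAG_M),
        "crc": bool(flags & FLAG_C),
        "rejected": bool(flags & FLAG_R),  # meaningful only in a Reply Frame
        "private_data": private_data,
    }


if __name__ == "__main__":
    frame = build_frame(REQ_KEY, markers=True, crc=True, private_data=b"hi")
    print(parse_frame(frame, REQ_KEY))
```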

1478	6.1.2  Connection Startup Rules

1480	   The following rules apply to MPA connection Startup Phase:

1482	   1.  When MPA is started in the Initiator mode, the MPA implementation
1483	       MUST send a valid MPA Request Frame.  The MPA Request Frame MAY
1484	       include ULP supplied Private Data.

1486	   2.  When MPA is started in the Responder mode, the MPA implementation
1487	       MUST wait until an MPA Request Frame is received and validated
1488	       before entering full MPA/DDP operation.

1490	       If the MPA Request Frame is improperly formatted, the
1491	       implementation MUST close the TCP connection and exit MPA.

1493	       If the MPA Request Frame is properly formatted but the Private
1494	       Data is not acceptable, the implementation SHOULD return an MPA
1495	       Reply Frame with the Rejected Connection bit set to '1'; the MPA
1496	       Reply Frame MAY include ULP supplied Private Data; the
1497	       implementation MUST exit MPA, leaving the TCP connection open.
1498	       The ULP may close TCP or use the connection for other purposes.

1500	       If the MPA Request Frame is properly formatted and the Private
1501	       Data is acceptable, the implementation SHOULD return an MPA Reply
1502	       Frame with the Rejected Connection bit set to '0'; the MPA Reply
1503	       Frame MAY include ULP supplied Private Data; and the Responder
1504	       SHOULD prepare to interpret any data received as FPDUs and pass
1505	       any received ULPDUs to DDP.

1507	       Note: Since the receiver's ability to deal with Markers is
1508	           unknown until the Request and Reply frames have been
1509	           received, sending FPDUs before this occurs is not possible.

1511	       Note: The requirement to wait on a Request Frame before sending a
1512	           Reply frame is a design choice; it makes for a well-ordered
1513	           sequence of events at each end, and avoids having to specify
1514	           how to deal with situations where both ends start at the same
1515	           time.

1517	   3.  MPA Initiator mode implementations MUST receive and validate an
1518	       MPA Reply Frame.

1520	       If the MPA Reply Frame is improperly formatted, the
1521	       implementation MUST close the TCP connection and exit MPA.

1523	       If the MPA Reply Frame is properly formatted but the Private Data
1524	       is not acceptable, or if the Rejected Connection bit is set to
1525	       '1', the implementation MUST exit MPA, leaving the TCP connection
1526	       open.  The ULP may close TCP or use the connection for other
1527	       purposes.

1529	       If the MPA Reply Frame is properly formatted and the Private Data
1530	       is acceptable, and the Rejected Connection bit is set to '0', the
1531	       implementation SHOULD enter full MPA/DDP operation mode;
1532	       interpreting any received data as FPDUs and sending DDP ULPDUs as
1533	       FPDUs.

1535	   4.  MPA Responder mode implementations MUST receive and validate at
1536	       least one FPDU before sending any FPDUs or Markers.

1538	       Note: this requirement is present to allow the Initiator time to
1539	           get its receiver into Full Operation before an FPDU arrives,
1540	           avoiding potential race conditions at the Initiator.  This
1541	           was also subject to some debate in the work group before
1542	           rough consensus was reached.  Eliminating this requirement
1543	           would allow faster startup in some types of applications.
1544	           However, that would also make certain implementations
1545	           (particularly "dual stack") much harder.

1547	   5.  If a received "Key" does not match the expected value, (See 6.1.1
1548	       MPA Request and Reply Frame Format above) the TCP/DDP connection
1549	       MUST be closed, and an error returned to the ULP.

1551	   6.  The received Private Data fields may be used by Consumers at
1552	       either end to further validate the connection, and set up DDP or
1553	       other ULP parameters.  The Initiator ULP MAY close the
1554	       TCP/MPA/DDP connection as a result of validating the Private Data
1555	       fields.  The Responder SHOULD return an MPA Reply Frame with the
1556	       "Rejected Connection" bit set to '1' if the Private Data fails
1557	       validation by the ULP.

1559	   7.  When the first FPDU is to be sent, then if Markers are enabled,
1560	       the first octets sent are the special Marker 0x00000000, followed
1561	       by the start of the FPDU (the FPDU's ULPDU Length field).  If
1562	       Markers are not enabled, the first octets sent are the start of
1563	       the FPDU (the FPDU's ULPDU Length field).

1565	   8.  MPA implementations MUST use the difference between the MPA
1566	       Request Frame and the MPA Reply Frame to check for incorrect
1567	       "Initiator/Initiator" startups.  Implementations SHOULD put a
1568	       timeout on waiting for the MPA Request Frame when started in
1569	       Responder mode, to detect incorrect "Responder/Responder"
1570	       startups.

1572	   9.  MPA implementations MUST validate the PD_Length field.  The
1573	       buffer that receives the Private Data field MUST be large enough
1574	       to receive that data; the amount of Private Data MUST NOT exceed
1575	       the PD_Length, or the application buffer.  If any of the above
1576	       fails, the startup frame MUST be considered improperly formatted.

1578	   10. MPA implementations SHOULD implement a reasonable timeout while
1579	       waiting for the entire set of startup frames; this prevents
1580	       certain denial of service attacks.  ULPs SHOULD implement a
1581	       reasonable timeout while waiting for FPDUs, ULPDUs and
1582	       application level messages to guard against application failures
1583	       and certain denial of service attacks.
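
   Rules 2, 3, 5, and 8 above amount to a small decision table at each
   endpoint, which can be sketched as follows.  This is illustrative
   only: the Action names and the two functions are hypothetical, and
   the inputs (whether the frame is well formed, whether the ULP accepts
   the Private Data) are assumed to have been determined elsewhere.

```python
from enum import Enum

REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"


class Action(Enum):
    CLOSE_TCP = "close the TCP connection and exit MPA"
    REPLY_REJECT = "send Reply with R=1, exit MPA, leave TCP open"
    REPLY_ACCEPT = "send Reply with R=0, await first FPDU before sending"
    EXIT_MPA = "exit MPA, leave TCP open for the ULP"
    FULL_OPERATION = "enter full MPA/DDP operation"


def responder_action(key, well_formed, private_data_ok):
    """Responder handling of the first startup frame (rules 2 and 5)."""
    if key != REQ_KEY or not well_formed:
        return Action.CLOSE_TCP
    if not private_data_ok:
        return Action.REPLY_REJECT
    return Action.REPLY_ACCEPT


def initiator_action(key, well_formed, rejected, private_data_ok):
    """Initiator handling of the reply (rules 3, 5, and 8).

    Receiving a Request Key here indicates an "Initiator/Initiator"
    startup, which is treated like any other Key mismatch.
    """
    if key != REP_KEY or not well_formed:
        return Action.CLOSE_TCP
    if rejected or not private_data_ok:
        return Action.EXIT_MPA
    return Action.FULL_OPERATION


if __name__ == "__main__":
    print(responder_action(REQ_KEY, True, True))
    print(initiator_action(REQ_KEY, True, False, True))
```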

1585	6.1.3  Example Delayed Startup sequence

1587	   A variety of startup sequences are possible when using MPA on TCP.
1588	   Following is an example of an MPA/DDP startup that occurs after TCP
1589	   has been running for a while and has exchanged some amount of
1590	   streaming data.  This example does not use any Private Data (an
1591	   example that does is shown later in 6.1.4.2 Example Immediate Startup
1592	   using Private Data on page 42), although it is perfectly legal to
1593	   include the Private Data.  Note that since the example does not use
1594	   any Private Data, there are no ULP interactions shown between
1595	   receiving "Startup frames" and putting MPA into Full Operation.

1597	          Initiator                                 Responder

1599	   +---------------------------+
1600	   |ULP streaming mode         |
1601	   | <Hello> request to        |
1602	   | transition to DDP/MPA     |           +--------------------------+
1603	   | mode (optional)           | --------> |ULP gets request;         |
1604	   +---------------------------+           |enables MPA Responder mode|
1605	                                           |with last (optional)      |
1606	                                           |streaming mode <Hello Ack>|
1607	                                           |for MPA to send.          |
1608	   +---------------------------+           |MPA waits for incoming    |
1609	   |ULP receives streaming     | <-------- |  <MPA Request frame>     |
1610	   | <Hello Ack>;              |           +--------------------------+
1611	   |Enters MPA Initiator mode; |
1612	   |MPA sends                  |
1613	   |  <MPA Request Frame>;     |
1614	   |MPA waits for incoming     |           +--------------------------+
1615	   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
1616	   +---------------------------+           |  <MPA Request Frame>     |
1617	                                           |Consumer binds DDP to MPA,|
1618	                                           |MPA sends the             |
1619	                                           |  <MPA Reply Frame>.      |
1620	                                           |DDP/MPA enables FPDU      |
1621	   +---------------------------+           |decoding, but does not    |
1622	   |MPA receives the           | < - - - - |send any FPDUs.           |
1623	   |  <MPA Reply Frame>        |           +--------------------------+
1624	   |Consumer binds DDP to MPA, |
1625	   |DDP/MPA begins full        |
1626	   |operation.                 |
1627	   |MPA sends first FPDU (as   |           +--------------------------+
1628	   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
1629	   |available).                |           |MPA sends first FPDU (as  |
1630	   +---------------------------+           |DDP ULPDUs become         |
1631	                                   <====== |available).               |
1632	                                           +--------------------------+
1633	               Figure 8: Example Delayed Startup negotiation

1635	   An example Delayed Startup sequence is described below:

1637	   *   Active and passive sides start up a TCP connection in the
1638	       usual fashion, probably using sockets APIs.  They exchange
1639	       some amount of streaming mode data.  At some point one side
1640	       (the MPA Initiator) sends streaming mode data that
1641	       effectively says "Hello, let's go into MPA/DDP mode."

1643	   *   When the remote side (the MPA Responder) gets this streaming mode
1644	       message, the Consumer would send a last streaming mode message
1645	       that effectively says "I Acknowledge your Hello, and am now in
1646	       MPA Responder Mode".  The exchange of these messages establishes
1647	       the exact point in the TCP stream where MPA is enabled.  The
1648	       Responding Consumer enables MPA in the Responder mode and waits
1649	       for the initial MPA startup message.

1651	   *   The Initiating Consumer would enable MPA startup in the
1652	       Initiator mode, which then sends the MPA Request Frame.  It is
1653	       assumed that no Private Data messages are needed for this
1654	       example, although it is possible to include them.  The
1655	       Initiating MPA (and Consumer) would also wait for the MPA
1656	       connection to be accepted.

1658	   *   The Responding MPA would receive the initial MPA Request Frame
1659	       and would inform the Consumer that this message arrived.  The
1660	       Consumer can then accept the MPA/DDP connection or close the TCP
1661	       connection.

1663	   *   To accept the connection request, the Responding Consumer would
1664	       use an appropriate API to bind the TCP/MPA connections to a DDP
1665	       endpoint, thus enabling MPA/DDP into Full Operation.  In the
1666	       process of going to Full Operation, MPA sends the MPA Reply
1667	       Frame.  MPA/DDP waits for the first incoming FPDU before sending
1668	       any FPDUs.

1670	   *   If the initial TCP data was not a properly formatted MPA Request
1671	       Frame, MPA will close or reset the TCP connection immediately.

1673	   *   The Initiating MPA would receive the MPA Reply Frame and
1674	       would report this message to the Consumer.  The Consumer can
1675	       then accept the MPA/DDP connection, or close or reset the TCP
1676	       connection to abort the process.

1678	   *   On determining that the Connection is acceptable, the
1679	       Initiating Consumer would use an appropriate API to bind the
1680	       TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
1681	       into Full Operation.  MPA/DDP would begin sending DDP
1682	       messages as MPA FPDUs.

1684	6.1.4  Use of Private Data

1686	   This section is advisory in nature, in that it suggests a method by
1687	   which a ULP can deal with pre-DDP connection information exchange.

1689	6.1.4.1  Motivation

1691	   Prior RDMA protocols have been developed that provide Private Data
1692	   via out of band mechanisms.  As a result, many applications now
1693	   expect some form of Private Data to be available for application use
1694	   prior to setting up the DDP/RDMA connection.  Following are some
1695	   examples of the use of Private Data.

1697	   An RDMA Endpoint (referred to as a Queue Pair, or QP, in InfiniBand
1698	   and the [VERBS]) must be associated with a Protection Domain.  No
1699	   receive operations may be posted to the endpoint before it is
1700	   associated with a Protection Domain.  Indeed, under both the
1701	   InfiniBand and proposed RDMA/DDP verbs [VERBS] an endpoint/QP is
1702	   created within a Protection Domain.

1704	   There are some applications where the choice of Protection Domain is
1705	   dependent upon the identity of the remote ULP client.  For example,
1706	   if a user session requires multiple connections, it is highly
1707	   desirable for all of those connections to use a single Protection
1708	   Domain.  Note: use of Protection Domains is further discussed in
1709	   [RDMASEC].

1711	   InfiniBand, the DAT APIs [DAT-API] and the [IT-API] all provide for
1712	   the active side ULP to provide Private Data when requesting a
1713	   connection.  This data is passed to the ULP to allow it to determine
1714	   whether to accept the connection, and if so with which endpoint (and
1715	   implicitly which Protection Domain).

1717	   The Private Data can also be used to ensure that both ends of the
1718	   connection have configured their RDMA endpoints compatibly on such
1719	   matters as the RDMA Read capacity (see [RDMAP]).  Further ULP-
1720	   specific uses are also presumed, such as establishing the identity of
1721	   the client.

1723	   Private Data is also allowed for when accepting the connection, to
1724	   allow completion of any negotiation on RDMA resources and for other
1725	   ULP reasons.

1727	   There are several potential ways to exchange this Private Data.  For
1728	   example, the InfiniBand specification includes a connection
1729	   management protocol that allows a small amount of Private Data to be
1730	   exchanged using datagrams before actually starting the RDMA
1731	   connection.

1733	   This draft allows for small amounts of Private Data to be exchanged
1734	   as part of the MPA startup sequence.  The actual Private Data fields
1735	   are carried in the MPA Request Frame, and the MPA Reply Frame.

1737	   If larger amounts of Private Data or more negotiation is necessary,
1738	   TCP streaming mode messages may be exchanged prior to enabling MPA.

1740	6.1.4.2  Example Immediate Startup using Private Data

1742	          Initiator                                 Responder

1744	   +---------------------------+
1745	   |TCP SYN sent               |           +--------------------------+
1746	   +---------------------------+ --------> |TCP gets SYN packet;      |
1747	   +---------------------------+           |  Sends SYN-Ack           |
1748	   |TCP gets SYN-Ack           | <-------- +--------------------------+
1749	   |  Sends Ack                |
1750	   +---------------------------+ --------> +--------------------------+
1751	   +---------------------------+           |Consumer enables MPA      |
1752	   |Consumer enables MPA       |           |Responder Mode, waits for |
1753	   |Initiator mode with        |           |  <MPA Request Frame>     |
1754	   |Private Data; MPA sends    |           +--------------------------+
1755	   |  <MPA Request Frame>;     |
1756	   |MPA waits for incoming     |           +--------------------------+
1757	   |  <MPA Reply Frame>        | - - - - > |MPA receives              |
1758	   +---------------------------+           |  <MPA Request Frame>     |
1759	                                           |Consumer examines Private |
1760	                                           |Data, provides MPA with   |
1761	                                           |return Private Data,      |
1762	                                           |binds DDP to MPA, and     |
1763	                                           |enables MPA to send an    |
1764	                                           |  <MPA Reply Frame>.      |
1765	                                           |DDP/MPA enables FPDU      |
1766	   +---------------------------+           |decoding, but does not    |
1767	   |MPA receives the           | < - - - - |send any FPDUs.           |
1768	   |  <MPA Reply Frame>        |           +--------------------------+
1769	   |Consumer examines Private  |
1770	   |Data, binds DDP to MPA,    |
1771	   |and enables DDP/MPA to     |
1772	   |begin Full Operation.      |
1773	   |MPA sends first FPDU (as   |           +--------------------------+
1774	   |DDP ULPDUs become          | ========> |MPA Receives first FPDU.  |
1775	   |available).                |           |MPA sends first FPDU (as  |
1776	   +---------------------------+           |DDP ULPDUs become         |
1777	                                   <====== |available).               |
1778	                                           +--------------------------+
1779	              Figure 9: Example Immediate Startup negotiation

1781	   Note: the exact order of when MPA is started in the TCP connection
1782	       sequence is implementation dependent; the above diagram shows one
1783	       possible sequence.  Also, the Initiator "Ack" to the Responder's
1784	       "SYN-Ack" may be combined into the same TCP segment containing
1785	       the MPA Request Frame (as is allowed by TCP RFCs).

1787	   The example immediate startup sequence is described below:

1789	   *   The passive side (Responding Consumer) would listen on the TCP
1790	       destination port, to indicate its readiness to accept a
1791	       connection.

1793	       *   The active side (Initiating Consumer) would request a
1794	           connection from a TCP endpoint (that expected to upgrade to
1795	           MPA/DDP/RDMA and expected the Private Data) to a destination
1796	           address and port.

1798	       *   The Initiating Consumer would initiate a TCP connection to
1799	           the destination port.  Acceptance/rejection of the connection
1800	           would proceed as per normal TCP connection establishment.

1802	   *   The passive side (Responding Consumer) would receive the TCP
1803	       connection request as usual allowing normal TCP gatekeepers, such
1804	       as INETD and TCPserver, to exercise their normal
1805	       safeguard/logging functions.  On acceptance of the TCP
1806	       connection, the Responding Consumer would enable MPA in the
1807	       Responder mode and wait for the initial MPA startup message.

1809	       *   The Initiating Consumer would enable MPA startup in the
1810	           Initiator mode, which sends an initial MPA Request Frame
1811	           carrying its Private Data message.  The Initiating MPA
1812	           (and Consumer) would also wait for the MPA connection to be
1813	           accepted, and any returned Private Data.

1815	   *   The Responding MPA would receive the initial MPA Request Frame
1816	       with the Private Data message and would pass the Private Data
1817	       through to the Consumer.  The Consumer can then accept the
1818	       MPA/DDP connection, close the TCP connection, or reject the MPA
1819	       connection with a return message.

1821	   *   To accept the connection request, the Responding Consumer would
1822	       use an appropriate API to bind the TCP/MPA connections to a DDP
1823	       endpoint, thus enabling MPA/DDP into Full Operation.  In the
1824	       process of going to Full Operation, MPA sends the MPA Reply Frame
1825	       which includes the Consumer supplied Private Data containing any
1826	       appropriate Consumer response.  MPA/DDP waits for the first
1827	       incoming FPDU before sending any FPDUs.

1829	   *   If the initial TCP data was not a properly formatted MPA Request
1830	       Frame, MPA will close or reset the TCP connection immediately.

1832	   *   To reject the MPA connection request, the Responding Consumer
1833	       would send an MPA Reply Frame with any ULP supplied Private Data
1834	       (with reason for rejection), with the "Rejected Connection" bit
1835	       set to '1', and may close the TCP connection.

1837	       *   The Initiating MPA would receive the MPA Reply Frame with the
1838	           Private Data message and would report this message to the
1839	           Consumer, including the supplied Private Data.

1841	           If the "Rejected Connection" bit is set to a '1', MPA will
1842	           close the TCP connection and exit.

1844	           If the "Rejected Connection" bit is set to a '0', and on
1845	           determining from the MPA Reply Frame Private Data that the
1846	           Connection is acceptable, the Initiating Consumer would use
1847	           an appropriate API to bind the TCP/MPA connections to a DDP
1848	           endpoint thus enabling MPA/DDP into Full Operation.  MPA/DDP
1849	           would begin sending DDP messages as MPA FPDUs.
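
	   The Initiator's handling of the MPA Reply Frame described above
	   can be sketched as follows.  This is an illustrative Python sketch
	   only and is not part of the protocol.  The names MpaReply, Action,
	   initiator_on_reply and the consumer_accepts callback are
	   hypothetical.  Closing the connection when the Consumer finds the
	   Private Data unacceptable is an assumption, since the sequence
	   above does not cover that case.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    CLOSE_TCP = auto()   # close the TCP connection and exit
    BIND_DDP = auto()    # bind TCP/MPA to a DDP endpoint (Full Operation)


@dataclass
class MpaReply:
    """Hypothetical in-memory view of a received MPA Reply Frame."""
    rejected: bool       # the "Rejected Connection" bit
    private_data: bytes  # Private Data supplied by the Responder


def initiator_on_reply(reply, consumer_accepts):
    """Decide what the Initiator does once the MPA Reply Frame arrives.

    consumer_accepts is a hypothetical callback: it is given the Private
    Data and returns True if the Consumer finds the connection acceptable.
    """
    # The Private Data is reported to the Consumer in every case.
    acceptable = consumer_accepts(reply.private_data)
    if reply.rejected:
        # "Rejected Connection" bit set to '1': close TCP and exit.
        return Action.CLOSE_TCP
    if not acceptable:
        # Assumption: not specified in the text; this sketch closes TCP.
        return Action.CLOSE_TCP
    # Bit is '0' and the Consumer accepts: enter Full Operation.
    return Action.BIND_DDP
```

	   For example, a reply with the bit set to '1' yields CLOSE_TCP
	   regardless of what the Consumer thinks of the Private Data, while
	   an accepted reply with the bit set to '0' yields BIND_DDP.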

1851	6.1.5  "Dual stack" implementations

1853	   MPA/DDP implementations are commonly expected to be implemented as
1854	   part of a "dual stack" architecture.  One "stack" is the traditional
1855	   TCP stack, usually with a sockets interface API (Application
1856	   Programming Interface).  The second stack is the MPA/DDP "stack" with
1857	   its own API, and potentially separate code or hardware to deal with
1858	   the MPA/DDP data.  Of course, implementations may vary, so the
1859	   following comments are of an advisory nature only.

1861	   The use of the two "stacks" offers advantages:

1863	        TCP connection setup is usually done with the TCP stack.  This
1864	        allows use of the usual naming and addressing mechanisms.  It
1865	        also means that any mechanisms used to "harden" the connection
1866	        setup against security threats are also used when starting
1867	        MPA/DDP.

1869	        Some applications may have been originally designed for TCP, but
1870	        are "enhanced" to utilize MPA/DDP after a negotiation reveals
1871	        the capability to do so.  The negotiation process takes place in
1872	        TCP's streaming mode, using the usual TCP APIs.

1874	        Some new applications, designed for RDMA or DDP, still need to
1875	        exchange some data prior to starting MPA/DDP.  This exchange can
1876	        be of arbitrary length or complexity, but often consists of only
1877	        a small amount of Private Data, perhaps only a single message.
1878	        Using the TCP streaming mode for this exchange allows this to be
1879	        done using well understood methods.

1881	   The main disadvantage of using two stacks is the conversion of an
1882	   active TCP connection between them.  This process must be done with
1883	   care to prevent loss of data.

1885	   To avoid some of the problems when using a "dual stack" architecture
1886	   the following additional restrictions may be required by the
1887	   implementation:

1889	   1.  Enabling the DDP/MPA stack SHOULD be done only when no incoming
1890	       stream data is expected.  This is typically managed by the ULP
1891	       protocol.  When following the recommended startup sequence, the
1892	       Responder side enters DDP/MPA mode, sends the last streaming mode
1893	       data, and then waits for the MPA Request Frame.  No additional
1894	       streaming mode data is expected.  The Initiator side ULP receives
1895	       the last streaming mode data, and then enters DDP/MPA mode.
1896	       Again, no additional streaming mode data is expected.

1898	   2.  The DDP/MPA MAY provide the ability to send a "last streaming
1899	       message" as part of its Responder DDP/MPA enable function.  This
1900	       allows the DDP/MPA stack to more easily manage the conversion to
1901	       DDP/MPA mode (and avoid problems with a very fast return of the
1902	       MPA Request Frame from the Initiator side).

1904	   Note: Regardless of the "stack" architecture used, TCP's rules MUST
1905	       be followed.  For example, if network data is lost, re-segmented
1906	       or re-ordered, TCP MUST recover appropriately even when this
1907	       occurs while switching stacks.

1909	6.2  Normal Connection Teardown

1911	   Each half connection of MPA terminates when DDP closes the
1912	   corresponding TCP half connection.

1914	   A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
1915	   that a graceful close of the LLP connection has been received by the
1916	   LLP (e.g. FIN is received).

1918	7  Error Semantics

1920	   The following errors MUST be detected by MPA and the codes SHOULD be
1921	   provided to DDP or other Consumer:

1923	    Code Error

1925	    1    TCP connection closed, terminated or lost.  This includes lost
1926	         by timeout, too many retries, RST received or FIN received.

1928	    2    Received MPA CRC does not match the calculated value for the
1929	         FPDU.

1931	    3    In the event that the CRC is valid, received MPA Marker (if
1932	         enabled) and ULPDU Length fields do not agree on the start of
1933	         an FPDU.  If the FPDU start determined from previous ULPDU
1934	         Length fields does not match with the MPA Marker position, MPA
1935	         SHOULD deliver an error to DDP.  It may not be possible to
1936	         make this check as a segment arrives, but the check SHOULD be
1937	         made when a gap creating an out of order sequence is closed
1938	         and any time a Marker points to an already identified FPDU.
1939	         It is OPTIONAL for a receiver to check each Marker, if
1940	         multiple Markers are present in an FPDU, or if the segment is
1941	         received in order.

1943	    4    Invalid MPA Request Frame or MPA Reply Frame received.  In
1944	         this case, the TCP connection MUST be immediately closed.  DDP
1945	         and other ULPs should treat this similarly to code 1, above.

1947	   When conditions 2 or 3 above are detected, an MPA-aware TCP
1948	   implementation MAY choose to silently drop the TCP segment rather
1949	   than reporting the error to DDP.  In this case, the sending TCP will
1950	   retry the segment, usually correcting the error, unless the problem
1951	   was at the source.  In that case, the source will usually exceed the
1952	   number of retries and terminate the connection.

1954	   Once MPA delivers an error of any type, it MUST NOT pass or deliver
1955	   any additional FPDUs on that half connection.

1957	   For Error codes 2 and 3, MPA MUST NOT close the TCP connection
1958	   following a reported error.  Closing the connection is the
1959	   responsibility of DDP's ULP.

1961	        Note that since MPA will not Deliver any FPDUs on a half
1962	        connection following an error detected on the receive side of
1963	        that connection, DDP's ULP is expected to tear down the
1964	        connection.  This may not occur until after one or more last
1965	        messages are transmitted on the opposite half connection.  This
1966	        allows a diagnostic error message to be sent.
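
	   The rules in this section can be summarized in a short Python
	   sketch.  It is illustrative only.  The numeric codes are those of
	   the table above, but the symbolic names, the HalfConnection class
	   and its attributes are hypothetical and do not come from this
	   document.

```python
from enum import IntEnum


class MpaError(IntEnum):
    # Numeric values follow the table above; the names are hypothetical.
    TCP_CONNECTION_LOST = 1
    CRC_MISMATCH = 2
    MARKER_LENGTH_MISMATCH = 3
    INVALID_STARTUP_FRAME = 4


class HalfConnection:
    """Receive side of one MPA half connection (illustrative only)."""

    def __init__(self):
        self.error = None            # first error reported to DDP, if any
        self.delivered = []          # FPDUs handed to DDP so far
        self.mpa_closes_tcp = False  # True only when MPA itself must close

    def report_error(self, code):
        """Record an error and return the first error seen."""
        code = MpaError(code)
        if self.error is None:
            self.error = code
        # Code 4 requires MPA to close the TCP connection immediately.
        # For codes 2 and 3 MPA leaves the close to DDP's ULP.
        if code is MpaError.INVALID_STARTUP_FRAME:
            self.mpa_closes_tcp = True
        return self.error

    def deliver(self, fpdu):
        """Deliver an FPDU to DDP unless an error was already reported."""
        if self.error is not None:
            return False
        self.delivered.append(fpdu)
        return True
```

	   After report_error(2) or report_error(3), deliver() refuses any
	   further FPDUs and mpa_closes_tcp stays False, which leaves the
	   opposite half connection free to carry a diagnostic message.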

1968	8  Security Considerations

1970	   This section discusses the security considerations for MPA.

1972	8.1  Protocol-specific Security Considerations

1974	   The vulnerabilities of MPA to third-party attacks are no greater than
1975	   those of any other protocol running over TCP.  A third party, by
1976	   sending packets that are delivered to an MPA receiver, could launch a
1977	   variety of attacks that take advantage of how MPA operates.
1978	   For example, a third party could send random packets that are valid
1979	   for TCP, but contain no FPDU headers.  An MPA receiver reports an
1980	   error to DDP when any packet arrives that cannot be validated as an
1981	   FPDU when properly located on an FPDU boundary.  A third party could
1982	   also send packets that are valid for TCP, MPA, and DDP, but do not
1983	   target valid buffers.  These types of attacks ultimately result in
1984	   loss of connection and thus become a type of DOS (Denial Of Service)
1985	   attack.  Communication security mechanisms such as IPsec [RFC2401]
1986	   may be used to prevent such attacks.

1988	   Independent of how MPA operates, a third party could use ICMP
1989	   messages to reduce the path MTU to such a small size that performance
1990	   would likewise be severely impacted.  Range checking on path MTU
1991	   sizes in ICMP packets may be used to prevent such attacks.
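
	   One possible form of such a range check is sketched below in
	   Python.  It is illustrative only.  The function name and the
	   floor of 576 octets are assumptions made for this example, and an
	   implementation would choose its own lower bound and policy.

```python
def accept_path_mtu(current_mtu, advertised_mtu, floor_mtu=576):
    """Return the path MTU to use after an ICMP message advertises a value.

    floor_mtu is an assumed, locally configured lower bound.
    """
    if advertised_mtu >= current_mtu:
        # A message of this kind can only lower the path MTU.
        return current_mtu
    # Never go below the configured floor, and never above the current
    # value (which matters if the current MTU is already below the floor).
    return min(current_mtu, max(advertised_mtu, floor_mtu))
```

	   With a current path MTU of 1500, an advertised value of 1400 is
	   accepted as is, while an advertised value of 68 is raised to the
	   assumed floor of 576.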

1993	   [RDMAP] and [DDP] are used to control, read and write data buffers
1994	   over IP networks.  Therefore, the control and the data packets of
1995	   these protocols are vulnerable to the spoofing, tampering and
1996	   information disclosure attacks listed below.  In addition, Connection
1997	   to/from an unauthorized or unauthenticated endpoint is a potential
1998	   problem with most applications using RDMA, DDP, and MPA.

2000	8.1.1  Spoofing

2002	   Spoofing attacks can be launched by the Remote Peer, or by a network
2003	   based attacker.  A network based spoofing attack applies to all
2004	   Remote Peers.  Because the MPA Stream requires a TCP Stream in the
2005	   ESTABLISHED state, certain types of traditional forms of wire attacks
2006	   do not apply -- an end-to-end handshake must have occurred to
2007	   establish the MPA Stream.  So, the only form of spoofing that applies
2008	   is one when a remote node can both send and receive packets.  Yet
2009	   even with this limitation the Stream is still exposed to the
2010	   following spoofing attacks.

2012	8.1.1.1  Impersonation

2014	   A network based attacker can impersonate a legal MPA/DDP/RDMAP peer
2015	   (by spoofing a legal IP address), and establish an MPA/DDP/RDMAP
2016	   Stream with the victim.  End-to-end authentication (e.g., IPsec or ULP
2017	   authentication) provides protection against this attack.

2019	8.1.1.2  Stream Hijacking

2021	   Stream hijacking happens when a network based attacker follows the
2022	   Stream establishment phase, and waits until the authentication phase
2023	   (if such a phase exists) is completed successfully.  The attacker
2024	   can then spoof the IP address and redirect the Stream from the victim
2025	   to the attacker's own machine.  For example, an attacker can wait
2026	   until an iSCSI authentication is completed successfully, and hijack
2027	   the iSCSI Stream.

2029	   The best protection against this form of attack is end-to-end
2030	   integrity protection and authentication, such as IPsec to prevent
2031	   spoofing.  Another option is to provide physical security.
2032	   Discussion of physical security is out of scope for this document.

2034	8.1.1.3  Man in the Middle Attack

2036	   If a network based attacker has the ability to delete, inject, replay,
2037	   or modify packets which will still be accepted by MPA (e.g., TCP
2038	   sequence number is correct, FPDU is valid etc.) then the Stream can
2039	   be exposed to a man in the middle attack.  The attacker could
2040	   potentially use the services of [DDP] and [RDMAP] to read the
2041	   contents of the associated data buffer, modify the contents of the
2042	   associated data buffer, or to disable further access to the buffer.
2043	   The only countermeasure for this form of attack is to either secure
2044	   the MPA/DDP/RDMAP Stream (i.e. integrity protect) or attempt to
2045	   provide physical security to prevent man-in-the-middle type attacks.

2047	   The best protection against this form of attack is end-to-end
2048	   integrity protection and authentication, such as IPsec, to prevent
2049	   spoofing or tampering.  If Stream or session level authentication and
2050	   integrity protection are not used, then a man-in-the-middle attack
2051	   can occur, enabling spoofing and tampering.

2053	   Another approach is to restrict access to only the local subnet/link,
2054	   and provide some mechanism to limit access, such as physical security
2055	   or 802.1X.  This model is an extremely limited deployment scenario,
2056	   and will not be further examined here.

2058	8.1.2  Eavesdropping

2060	   Generally speaking, Stream confidentiality protects against
2061	   eavesdropping.  Stream and/or session authentication and integrity
2062	   protection is a countermeasure against various spoofing and
2063	   tampering attacks.  The effectiveness of authentication and integrity
2064	   against a specific attack depends on whether the authentication is
2065	   machine level authentication (such as that provided by IPsec), or ULP
2066	   authentication.

2068	8.2  Introduction to Security Options

2070	   The following security services can be applied to an MPA/DDP/RDMAP
2071	   Stream:

2073	   1.  Session confidentiality - protects against eavesdropping.

2075	   2.  Per-packet data source authentication - protects against the
2076	   following spoofing attacks: network based impersonation, Stream
2077	   hijacking, and man in the middle.

2079	   3.  Per-packet integrity - protects against tampering done by
2080	   network based modification of FPDUs (indirectly affecting buffer
2081	   content through DDP services).

2083	   4.  Packet sequencing - protects against replay attacks, which is
2084	   a special case of the above tampering attack.

2086	   If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
2087	   or Stream hijacking attacks, it is recommended that the Stream be
2088	   authenticated, integrity protected, and protected from replay
2089	   attacks; it may use confidentiality protection to protect from
2090	   eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
2091	   network).

2093	   IPsec is capable of providing the above security services for IP and
2094	   TCP traffic.

2096	   ULP protocols may be able to provide part of the above security
2097	   services.  See [NFSv4CHANNEL] for additional information on a
2098	   promising approach called "channel binding".  From [NFSv4CHANNEL]:

2100	        "The concept of channel bindings allows applications to prove
2101	        that the end-points of two secure channels at different network
2102	        layers are the same by binding authentication at one channel to
2103	        the session protection at the other channel.  The use of channel
2104	        bindings allows applications to delegate session protection to
2105	        lower layers, which may significantly improve performance for
2106	        some applications."

2108	8.3  Using IPsec With MPA

2110	   IPsec can be used to protect against the packet injection attacks
2111	   outlined above.  Because IPsec is designed to secure individual IP
2112	   packets, MPA can run above IPsec without change.  IPsec packets are
2113	   processed (e.g., integrity checked and decrypted) in the order they
2114	   are received, and an MPA receiver will process the decrypted FPDUs
2115	   contained in these packets in the same manner as FPDUs contained in
2116	   unsecured IP packets.

2118	   MPA Implementations MUST implement IPsec as described in Section 8.4
2119	   below.  The use of IPsec is up to ULPs and administrators.

2121	8.4  Requirements for IPsec Encapsulation of MPA/DDP

2123	   The IP Storage working group has spent significant time and effort to
2124	   define the normative IPsec requirements for IP Storage [RFC3723].
2125	   Portions of that specification are applicable to a wide variety of
2126	   protocols, including the RDDP protocol suite.  In order to not
2127	   replicate this effort, an MPA on TCP implementation MUST follow the
2128	   requirements defined in RFC3723 Section 2.3 and Section 5, including
2129	   the associated normative references for those sections.

2131	   Additionally, since IPsec acceleration hardware may only be able to
2132	   handle a limited number of active IKE Phase 2 SAs, Phase 2 delete
2133	   messages MAY be sent for idle SAs, as a means of keeping the number
2134	   of active Phase 2 SAs to a minimum.  The receipt of an IKE Phase 2
2135	   delete message MUST NOT be interpreted as a reason for tearing down
2136	   a DDP/RDMA Stream.  Rather, it is preferable to leave the Stream up,
2137	   and if additional traffic is sent on it, to bring up another IKE
2138	   Phase 2 SA to protect it.  This avoids the potential for continually
2139	   bringing Streams up and down.

2141	   Note that there are serious security issues if IPsec is not
2142	   implemented end-to-end.  For example, if IPsec is implemented as a
2143	   tunnel in the middle of the network, any hosts between the peer and
2144	   the IPsec tunneling device can freely attack the unprotected Stream.

2146	9  IANA Considerations

2148	   No IANA actions are required by this document.

2150	   If a well-known port is chosen as the mechanism to identify a DDP on
2151	   MPA on TCP, the well-known port must be registered with IANA.
2152	   Because the use of the port is DDP specific, registration of the port
2153	   with IANA is left to DDP.

2155	10 References

2157	10.1 Normative References

2159	   [iSCSI] Satran, J., "Internet Small Computer Systems Interface
2160	       (iSCSI)", RFC 3720, April 2004.

2162	   [RFC1191] Mogul, J., and Deering, S., "Path MTU Discovery", RFC 1191,
2163	       November 1990.

2165	   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., Romanow, A., "TCP
2166	       Selective Acknowledgment Options", RFC 2018, October 1996.

2168	   [RFC3723] Aboba B., et al, "Securing Block Storage Protocols over
2169	       IP", RFC3723, April 2004.

2171	   [RFC793] Postel, J., "Transmission Control Protocol - DARPA Internet
2172	       Program Protocol Specification", RFC 793, September 1981.

2174	   [RDMASEC]  Pinkerton J., Deleganes E., Bitan S., "DDP/RDMAP
2175	       Security", draft-ietf-rddp-security-09.txt (work in progress),
2176	       May 2006.

2178	10.2 Informative References

2180	   [CRCTCP] Stone J., Partridge, C., "When the CRC and TCP checksum
2181	       disagree", ACM Sigcomm, Sept. 2000.

2183	   [DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access Programming
2184	       Library) and uDAPL (User Direct Access Programming Library)",
2185	       http://www.datcollaborative.org.

2187	   [DDP] H. Shah et al., "Direct Data Placement over Reliable
2188	       Transports", draft-ietf-rddp-ddp-06.txt (Work in progress), May
2189	       2006.

2191	   [IT-API] The Open Group, "Interconnect Transport API (IT-API)"
2192	       Version 2.1, http://www.opengroup.org.

2194	   [RFC2401]  Atkinson, R., Kent, S., "Security Architecture for the
2195	       Internet Protocol", RFC 2401, November 1998.

2197	   [RFC0896] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC
2198	       896, January 1984.

2200	   [NagleDAck] Minshall G., Mogul, J., Saito, Y., Verghese, B.,
2201	       "Application performance pitfalls and TCP's Nagle algorithm",
2202	       Workshop on Internet Server Performance, May 1999.

2204	   [NFSv4CHANNEL] Williams, N., "On the Use of Channel Bindings to
2205	       Secure Channels", Internet-Draft draft-ietf-nfsv4-channel-
2206	       bindings-02.txt, July 2004.

2208	   [RDMAP] R. Recio et al., "RDMA Protocol Specification",
2209	       draft-ietf-rddp-rdmap-06.txt, May 2006.

2211	   [RFC2960] R. Stewart et al., "Stream Control Transmission Protocol",
2212	       RFC 2960, October 2000.

2214	   [RFC792] Postel, J., "Internet Control Message Protocol", RFC 792,
2215	       September 1981.

2217	   [RFC1122] Braden, R.T., "Requirements for Internet hosts -
2218	       communication layers", RFC 1122, October 1989.

2220	   [VERBS] J. Hilland et al., "RDMA Protocol Verbs Specification",
2221	       draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf April 2003,
2222	       http://www.rdmaconsortium.org.

2224	11 Appendix

2226	   This appendix is for information only and is NOT part of the
2227	   standard.

2229	   The appendix covers three topics:

2231	   Section 11.1 is an analysis of MPA on TCP and why it is useful to
2232	   integrate MPA with TCP (with modifications to typical TCP
2233	   implementations) to reduce overall system buffering and overhead.

2235	   Section 11.2 covers some MPA receiver implementation notes.

2237	   Section 11.3 covers methods of making MPA implementations
2238	   interoperate with both IETF and RDMA Consortium versions of the
2239	   protocols.

2241	11.1 Analysis of MPA over TCP Operations

2243	   This appendix analyzes the impact of MPA on the TCP sender, receiver,
2244	   and wire protocol.

2246	   One of MPA's high level goals is to provide enough information, when
2247	   combined with the Direct Data Placement Protocol [DDP], to enable
2248	   out-of-order placement of DDP payload into the final Upper Layer
2249	   Protocol (ULP) buffer.  Note that DDP separates the act of placing
2250	   data into a ULP buffer from that of notifying the ULP that the ULP
2251	   buffer is available for use.  In DDP terminology, the former is
2252	   defined as "Placement", and the latter is defined as "Delivery".  MPA
2253	   supports in-order Delivery of the data to the ULP, including support
2254	   for Direct Data Placement in the final ULP buffer location when TCP
2255	   segments arrive out-of-order.  Effectively, the goal is to use the
2256	   pre-posted ULP buffers as the TCP receive buffer, where the
2257	   reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
2258	   DDP) is done in place, in the ULP buffer, with no data copies.

2260	   This Appendix walks through the advantages and disadvantages of the
2261	   TCP sender modifications proposed by MPA:

2263	   1) that the TCP sender do Header Alignment, where a TCP segment
2264	   should begin with an MPA Framing Protocol Data Unit (FPDU) (if
2265	   there is payload present).

2267	   2) that there be an integral number of FPDUs in a TCP segment (under
2268	   conditions where the Path MTU is not changing).
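
	   The two sender behaviors above can be illustrated with a short
	   Python sketch that packs FPDUs into TCP segments so that every
	   segment begins with an FPDU and holds a whole number of FPDUs.
	   It is a sketch under stated assumptions, not a description of any
	   TCP implementation: the function name is hypothetical, each FPDU
	   is assumed to be no larger than the EMSS, and the Path MTU is
	   assumed not to change.

```python
def pack_fpdus(fpdu_lengths, emss):
    """Group FPDU lengths (in octets) into TCP segments.

    Every returned segment is a list of whole FPDU lengths whose sum does
    not exceed emss, so each segment starts on an FPDU boundary.
    """
    segments = []
    current, used = [], 0
    for length in fpdu_lengths:
        if length > emss:
            # Outside the assumption of this sketch (FPDU fits in EMSS).
            raise ValueError("FPDU larger than EMSS")
        if current and used + length > emss:
            # The next FPDU would straddle a segment boundary, so close
            # the current segment and start a new one with this FPDU.
            segments.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        segments.append(current)
    return segments
```

	   For instance, with an EMSS of 1460 octets, FPDUs of 600, 600, 600,
	   1400 and 100 octets are sent as four segments holding
	   [600, 600], [600], [1400] and [100] respectively.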

2270	   This Appendix concludes that the scaling advantages of FPDU Alignment
2271	   are strong, based primarily on fairly drastic TCP receive buffer
2272	   reduction requirements and simplified receive handling.  The analysis
2273	   also shows that there is little effect on TCP wire behavior.

2275	11.1.1 Assumptions

2277	11.1.1.1 MPA is layered beneath DDP [DDP]

2279	   MPA is an adaptation layer between DDP and TCP.  DDP requires
2280	   preservation of DDP segment boundaries and a CRC32C digest covering
2281	   the DDP header and data.   MPA adds these features to the TCP stream
2282	   so that DDP over TCP has the same basic properties as DDP over SCTP.
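
	   For readers unfamiliar with CRC32C, the following bitwise Python
	   sketch computes it over a byte string.  It assumes the usual
	   CRC32C (Castagnoli) parameters: reflected polynomial 0x82F63B78,
	   initial value 0xFFFFFFFF and a final inversion.  Which octets of
	   an FPDU are covered, and how the result is carried on the wire,
	   are defined elsewhere in this document and are not shown here.

```python
def crc32c(data: bytes) -> int:
    """Return the CRC32C (Castagnoli) checksum of data as an integer."""
    crc = 0xFFFFFFFF                       # initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):                 # process one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                # final inversion
```

	   The well-known check value for the ASCII string "123456789" is
	   0xE3069283.  A table-driven or hardware implementation would be
	   used in practice; the bitwise form is shown only for clarity.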

2284	11.1.1.2 MPA preserves DDP message framing

2286	   MPA was designed as a framing layer specifically for DDP and was not
2287	   intended as a general-purpose framing layer for any other ULP using
2288	   TCP.

2290	   A framing layer allows ULPs using it to receive indications from the
2291	   transport layer only when complete ULPDUs are present.  As a framing
2292	   layer, MPA is not aware of the content of the DDP PDU, only that it
2293	   has received and, if necessary, reassembled a complete PDU for
2294	   Delivery to the DDP.

2296	11.1.1.3 The size of the ULPDU passed to MPA is less than EMSS under
2297	      normal conditions

2299	   To make reception of a complete DDP PDU on every received segment
2300	   possible, DDP passes to MPA a PDU that is no larger than the EMSS of
2301	   the underlying fabric.  Each FPDU that MPA creates contains
2302	   sufficient information for the receiver to directly place the ULP
2303	   payload in the correct location in the correct receive buffer.

2305	   Edge cases when this condition does not occur are dealt with, but do
2306	   not need to be on the fast path.

2308	11.1.1.4 Out-of-order placement but NO out-of-order Delivery

2310	   DDP receives complete DDP PDUs from MPA.  Each DDP PDU contains the
2311	   information necessary to place its ULP payload directly in the
2312	   correct location in host memory.

2314	   Because each DDP segment is self-describing, it is possible for DDP
2315	   segments received out of order to have their ULP payload placed
2316	   immediately in the ULP receive buffer.

2318	   Data delivery to the ULP is guaranteed to be in the order the data
2319	   was sent.  DDP only indicates data delivery to the ULP after TCP has
2320	   acknowledged the complete byte stream.
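
	   The distinction between Placement and Delivery can be sketched in
	   Python as follows.  This is a simplified model, not an
	   implementation: the class and method names are hypothetical, each
	   segment is assumed to carry the offset of its payload in a single
	   ULP buffer, and the contiguous prefix of placed data stands in
	   for the portion of the byte stream that TCP has acknowledged.

```python
class PlacementBuffer:
    """Model of a ULP buffer with out-of-order placement."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.placed = []        # (start, end) ranges already placed
        self.delivered_to = 0   # everything below this offset is delivered

    def place(self, offset, payload):
        """Place payload at offset; return any newly deliverable data."""
        end = offset + len(payload)
        if offset < 0 or end > len(self.buf):
            raise ValueError("payload does not fit in the ULP buffer")
        self.buf[offset:end] = payload       # Placement: immediate
        self.placed.append((offset, end))
        return self._advance()

    def _advance(self):
        """Extend the delivered prefix over contiguous placed ranges."""
        start = self.delivered_to
        moved = True
        while moved:
            moved = False
            for s, e in self.placed:
                if s <= self.delivered_to < e:
                    self.delivered_to = e
                    moved = True
        return bytes(self.buf[start:self.delivered_to])  # Delivery: in order
```

	   If segments for offsets 4 and 8 arrive before the segment for
	   offset 0, their payload is placed at once but nothing is
	   delivered until the first segment fills the gap.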

2322	11.1.2 The Value of FPDU Alignment

2324	   Significant receiver optimizations can be achieved when Header
2325	   Alignment and complete FPDUs are the common case.  The optimizations
2326	   allow utilizing significantly fewer buffers on the receiver and less
2327	   computation per FPDU.  The net effect is the ability to build a
2328	   "flow-through" receiver that enables TCP-based solutions to scale to
2329	   10G and beyond in an economical way.  The optimizations are
2330	   especially relevant to hardware implementations of receivers that
2331	   process multiple protocol layers - Data Link Layer (e.g., Ethernet),
2332	   Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
2333	   of TCP (e.g., MPA/DDP).  As network speed increases, there is an
2334	   increasing desire to use a hardware based receiver in order to
2335	   achieve an efficient high performance solution.

2337	   A TCP receiver, under worst case conditions, has to allocate buffers
2338	   (BufferSizeTCP) whose capacities are a function of the bandwidth-
2339	   delay product.  Thus:

2341	       BufferSizeTCP = K * bandwidth [octets/Second] * Delay [Seconds].

2343	   Where bandwidth is the end-to-end bandwidth of the connection, delay
2344	   is the round trip delay of the connection, and K is an implementation
2345	   dependent constant.
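
	   As a worked illustration of the formula, the Python sketch below
	   evaluates BufferSizeTCP.  The function name, the default of
	   K = 1, and the 1 ms round trip delay used in the note that
	   follows are assumptions chosen for the example only.

```python
def buffer_size_tcp(bandwidth_bits_per_s, rtt_s, k=1.0):
    """Return BufferSizeTCP in octets.

    bandwidth_bits_per_s: end-to-end bandwidth in bits per second
    rtt_s:                round trip delay in seconds
    k:                    implementation dependent constant (assumed 1)
    """
    bandwidth_octets_per_s = bandwidth_bits_per_s / 8
    return k * bandwidth_octets_per_s * rtt_s
```

	   With these assumptions a 1 Gbps connection with a 1 ms round trip
	   delay needs about 125 KB of buffering, and a 10 Gbps connection
	   with the same delay needs about 1.25 MB.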

2347	   Thus BufferSizeTCP scales with the end-to-end bandwidth (10x more
2348	   buffers for a 10x increase in end-to-end bandwidth).  As this
2349	   buffering approach may scale poorly for hardware or software
2350	   implementations alike, several approaches allow reduction in the
2351	   amount of buffering required for high-speed TCP communication.

2353	   The MPA/DDP approach is to enable the ULP's buffer to be used as the
2354	   TCP receive buffer.  If the application pre-posts a sufficient amount
2355	   of buffering, and each TCP segment has sufficient information to
2356	   place the payload into the right application buffer, when an out-of-
2357	   order TCP segment arrives it could potentially be placed directly in
2358	   the ULP buffer.  However, placement can only be done when a complete
2359	   FPDU with the placement information is available to the receiver, and
2360	   the FPDU contents contain enough information to place the data into
2361	   the correct ULP buffer (e.g., there is a DDP header available).

2363	   For the case when the FPDU is not aligned with the TCP segment, it
2364	   may take, on average, 2 TCP segments to assemble one FPDU.
2365	   Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
2366	   Non-Aligned FPDU) octets:

2368	       BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS

2370	   Where K1 and K2 are implementation dependent constants and EMSS is
2371	   the effective maximum segment size.

2373	   For example, a 1 Gbps link with 10,000 connections and an EMSS of
2374	   1500B would require 15 MB of memory.  Often the number of connections
2375	   used scales with the network speed, aggravating the situation for
2376	   higher speeds.

2378	   FPDU Alignment would allow the receiver to allocate BufferSizeAF
2379	   (Buffer Size, Aligned FPDU) octets:

2381	       BufferSizeAF = K2 * EMSS

2383	   for the same conditions.  An FPDU Aligned receiver may require memory
2384	   in the range of ~100s of KB - which is feasible for an on-chip memory
2385	   and enables a "flow-through" design, in which the data flows through
2386	   the NIC and is placed directly in the destination buffer.  Assuming
2387	   most of the connections support FPDU Alignment, the receiver buffers
2388	   no longer scale with the number of connections.
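
   The two buffer formulas can be compared numerically.  The sketch
   below reuses the 10,000 connection, 1500 octet EMSS example from the
   text; K1 = 1 reproduces the 15 MB figure, while K2 = 100 is purely an
   assumed value picked so that BufferSizeAF lands in the "100s of KB"
   range mentioned above.

```python
def buffer_size_naf(emss, connections, k1=1, k2=1):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * connections + k2 * emss


def buffer_size_af(emss, k2=1):
    """BufferSizeAF = K2 * EMSS (independent of the connection count)."""
    return k2 * emss


EMSS = 1500           # octets, from the example in the text
CONNECTIONS = 10_000  # from the example in the text
K1, K2 = 1, 100       # assumed implementation constants

naf = buffer_size_naf(EMSS, CONNECTIONS, K1, K2)
af = buffer_size_af(EMSS, K2)
```

   With these assumed constants the non-aligned receiver needs roughly
   15 MB, while the aligned receiver needs 150,000 octets regardless of
   how many connections are open.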

2390	   Additional optimizations can be achieved in a balanced I/O sub-system
2391	   -- where the system interface of the network controller provides
2392	   ample bandwidth as compared with the network bandwidth.  For almost
2393	   twenty years this has been the case and the trend is expected to
2394	   continue - while Ethernet speeds have scaled by 1000 (from 10
2395	   megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU
2396	   architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to
2397	   PCI-X DDR).  Under these conditions, the FPDU Alignment approach
2398	   allows BufferSizeAF to be indifferent to network speed.  It is
2399	   primarily a function of the local processing time for a given frame.
2400	   Thus when the FPDU Alignment approach is used, receive buffering is
2401	   expected to scale gracefully (i.e. less than linear scaling) as
2402	   network speed is increased.

2404	11.1.2.1 Impact of lack of FPDU Alignment on the receiver computational
2405	      load and complexity

2407	   The receiver must perform IP and TCP processing, and then perform
2408	   FPDU CRC checks, before it can trust the FPDU header placement
2409	   information.  For simplicity of the description, the assumption is
2410	   that an FPDU is carried in no more than 2 TCP segments.  In reality,
2411	   with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
2412	   segments (e.g., if the PMTU was reduced).

2414	   ----++-----------------------------++-----------------------++-----
2415	   +---||---------------+    +--------||--------+   +----------||----+
2416	   |   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
2417	   +---||---------------+    +--------||--------+   +----------||----+
2418	   ----++-----------------------------++-----------------------++-----
2419	                   FPDU #N-1                  FPDU #N

2421	       Figure 10: Non-aligned FPDU freely placed in TCP octet stream

2423	   The receiver algorithm for processing TCP segments (e.g., TCP segment
2424	   #X in Figure 10) carrying non-aligned FPDUs (in-order or
2425	   out-of-order) includes:

2427	      1.  Data Link Layer processing (whole frame) - typically including
2428	          a CRC calculation.

2430	      2.  Network Layer processing (assuming not an IP fragment, the
2431	          whole Data Link Layer frame contains one IP datagram.  IP
2432	          fragments should be reassembled in a local buffer.  This is
2433	          not a performance optimization goal)

2435	      3.  Transport Layer processing -- TCP protocol processing, header
2436	          and checksum checks.

2438	          a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
2439	              IP DST, TCP SRC Port, TCP DST Port, protocol)

2441	      4.  Find FPDU message boundaries.

2443	          a.  Get MPA state information for the connection

2445	              If the TCP segment is in-order, use the receiver managed
2446	                  MPA state information to calculate where the previous
2447	                  FPDU message (#N-1) ends in the current TCP segment X.
2448	                  (previously, when the MPA receiver processed the first
2449	                  part of FPDU #N-1, it calculated the number of bytes
2450	                  remaining to complete FPDU #N-1 by using the MPA
2451	                  Length field).

2453	                  Get the stored partial CRC for FPDU #N-1

2455	                  Complete CRC calculation for FPDU #N-1 data (first
2456	                      portion of TCP segment #X)

2458	                  Check CRC calculation for FPDU #N-1

2460	                  If no FPDU CRC errors, placement is allowed
2461	                  Locate the local buffer for the first portion of
2462	                      FPDU#N-1, CopyData(local buffer of first portion
2463	                      of FPDU #N-1, host buffer address, length)

2465	                  Compute host buffer address for second portion of FPDU
2466	                      #N-1

2468	                  CopyData (local buffer of second portion of FPDU #N-1,
2469	                      host buffer address for second portion, length)

2471	                  Calculate the octet offset into the TCP segment for
2472	                      the next FPDU #N.

2474	                  Start Calculation of CRC for available data for FPDU
2475	                      #N

2477	                  Store partial CRC results for FPDU #N

2479	                  Store local buffer address of first portion of FPDU #N

2481	                  No further action is possible on FPDU #N, before it is
2482	                      completely received

2484	              If TCP out-of-order, receiver must buffer the data until
2485	                  at least one complete FPDU is received.  Typically
2486	                  buffering for more than one TCP segment per connection
2487	                  is required.  Use the MPA based Markers to calculate
2488	                  where FPDU boundaries are.

2490	                  When a complete FPDU is available, a similar procedure
2491	                      to the in-order algorithm above is used.  There is
2492	                      additional complexity, though, because when the
2493	                      missing segment arrives, this TCP segment must be
2494	                      run through the CRC engine after the CRC is
2495	                      calculated for the missing segment.
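
   The "store partial CRC, then complete it when the next TCP segment
   arrives" steps above can be sketched as follows.  This is only an
   illustration under stated assumptions: it assumes the FPDU CRC is a
   CRC-32C (Castagnoli polynomial), and the function names and the
   120/80 octet split of the example FPDU are hypothetical.

```python
CRC32C_POLY = 0x82F63B78  # reflected CRC-32C polynomial (assumed CRC type)
CRC_INIT = 0xFFFFFFFF     # running state before any octet is processed


def _build_table():
    """Precompute the 256-entry lookup table for byte-wise CRC updates."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc_update(state, data):
    """Fold more octets into a running CRC state and return the new state."""
    for octet in data:
        state = _TABLE[(state ^ octet) & 0xFF] ^ (state >> 8)
    return state


def crc_finish(state):
    """Apply the final inversion to obtain the CRC value."""
    return state ^ 0xFFFFFFFF


# Hypothetical FPDU #N-1 split across TCP segments X-1 and X.
fpdu = bytes(range(200))
first_part, second_part = fpdu[:120], fpdu[120:]

partial = crc_update(CRC_INIT, first_part)  # kept in per-connection state
resumed = crc_finish(crc_update(partial, second_part))
one_pass = crc_finish(crc_update(CRC_INIT, fpdu))
```

   The resumed result matches a single pass over the whole FPDU, which
   is why the receiver must keep the partial CRC in its per-connection
   state until the rest of the FPDU arrives.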

2497	   If we assume FPDU Alignment, the following diagram and the algorithm
2498	   below apply.  Note that when using MPA, the receiver is assumed to
2499	   actively detect presence or loss of FPDU Alignment for every TCP
2500	   segment received.

2502	      +--------------------------+      +--------------------------+
2503	   +--|--------------------------+   +--|--------------------------+
2504	   |  |       TCP Seg X          |   |  |         TCP Seg X+1      |
2505	   +--|--------------------------+   +--|--------------------------+
2506	      +--------------------------+      +--------------------------+
2507	                FPDU #N                          FPDU #N+1

2509	        Figure 11: Aligned FPDU placed immediately after TCP header

2511	   The receiver algorithm for FPDU Aligned frames (in-order or out-of-
2512	   order) includes:

2514	       1)  Data Link Layer processing (whole frame) - typically
2515	           including a CRC calculation.

2517	       2)  Network Layer processing (assuming not an IP fragment, the
2518	           whole Data Link Layer frame contains one IP datagram.  IP
2519	           fragments should be reassembled in a local buffer.  This is
2520	           not a performance optimization goal)

2522	       3)  Transport Layer processing -- TCP protocol processing, header
2523	           and checksum checks.

2525	           a.  Classify incoming TCP segment using the 5 tuple (IP SRC,
2526	               IP DST, TCP SRC Port, TCP DST Port, protocol)

2528	       4)  Check for Header Alignment. (Described in detail in Section
2529	           5.4).  Assuming Header Alignment for the rest of the
2530	           algorithm below.

2532	           a.  If the header is not aligned, see the algorithm defined
2533	               in the prior section.

2535	       5)  If TCP is in-order or out-of-order the MPA header is at the
2536	           beginning of the current TCP payload.  Get the FPDU length
2537	           from the FPDU header.

2539	       6)  Calculate CRC over FPDU

2541	       7)  Check CRC calculation for FPDU #N

2543	       8)  If no FPDU CRC errors, placement is allowed

2545	       9)  CopyData(TCP segment #X, host buffer address, length)

2547	       10) Loop to #5 until all the FPDUs in the TCP segment are
2548	           consumed in order to handle FPDU packing.
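
   Steps 5 through 10 above can be sketched as a loop over the TCP
   payload.  The sketch assumes an FPDU layout of a 2 octet
   ULPDU_Length, the ULPDU, padding to a 4 octet boundary, and a
   4 octet CRC, with Markers disabled; CRC verification (steps 6 to 8)
   is deliberately left out, and all function names are hypothetical.

```python
import struct

LEN_FIELD = 2  # octets of ULPDU_Length at the start of each FPDU
CRC_LEN = 4    # octets of CRC at the end of each FPDU


def fpdu_length(ulpdu_length):
    """Total FPDU size: length field + ULPDU + pad to 4 octets + CRC."""
    unpadded = LEN_FIELD + ulpdu_length
    pad = (-unpadded) % 4
    return unpadded + pad + CRC_LEN


def split_aligned_segment(payload):
    """Return the ULPDUs packed in one header-aligned TCP payload.

    Raises ValueError when the payload does not hold an integral number
    of FPDUs, i.e. when FPDU Alignment was lost for this segment.
    """
    ulpdus = []
    offset = 0
    while offset < len(payload):
        if offset + LEN_FIELD > len(payload):
            raise ValueError("truncated FPDU header; alignment lost")
        (ulpdu_length,) = struct.unpack_from("!H", payload, offset)
        total = fpdu_length(ulpdu_length)
        if offset + total > len(payload):
            raise ValueError("incomplete FPDU; alignment lost")
        start = offset + LEN_FIELD
        ulpdus.append(payload[start:start + ulpdu_length])
        offset += total  # loop to handle FPDU packing
    return ulpdus
```

   A real receiver would check the CRC of each FPDU before placing its
   ULPDU, and would fall back to the non-aligned algorithm when the
   ValueError case occurs.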

2550	   Implementation note: In both cases the receiver has to classify the
2551	   incoming TCP segment and associate it with one of the flows it
2552	   maintains.  In the case of no FPDU Alignment, the receiver is forced
2553	   to classify incoming traffic before it can calculate the FPDU CRC.
2554	   In the case of FPDU Alignment the operations order is left to the
2555	   implementer.

2557	   The FPDU Aligned receiver algorithm is significantly simpler.  There
2558	   is no need to locally buffer portions of FPDUs.  Accessing state
2559	   information is also substantially simplified - the normal case does
2560	   not require retrieving information to find out where an FPDU starts
2561	   and ends or retrieval of a partial CRC before the CRC calculation can
2562	   commence.  This avoids adding internal latencies, having multiple
2563	   data passes through the CRC machine, or scheduling multiple commands
2564	   for moving the data to the host buffer.

2566	   The aligned FPDU approach is useful for in-order and out-of-order
2567	   reception.  The receiver can use the same mechanisms for data storage
2568	   in both cases, and only needs to account for when all the TCP
2569	   segments have arrived to enable Delivery.  The Header Alignment,
2570	   along with the high probability that at least one complete FPDU is
2571	   found with every TCP segment, allows the receiver to perform data
2572	   placement for out-of-order TCP segments with no need for intermediate
2573	   buffering.  Essentially the TCP receive buffer has been eliminated
2574	   and TCP reassembly is done in place within the ULP buffer.

2576	   In case FPDU Alignment is not found, the receiver should follow the
2577	   algorithm for non-aligned FPDU reception, which may be slower and
2578	   less efficient.

2580	11.1.2.2 FPDU Alignment effects on TCP wire protocol

2582	      An MPA-aware TCP exposes its EMSS to MPA.  MPA uses the EMSS to
2583	      calculate its MULPDU, which it then exposes to DDP, its ULP.  DDP
2584	      uses the MULPDU to segment its payload so that each FPDU sent by
2585	      MPA fits completely into one TCP segment.  This has no impact on
2586	      wire protocol and exposing this information is already supported
2587	      on many TCP implementations, including all modern flavors of BSD
2588	      networking, through the TCP_MAXSEG socket option.
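
   As a small illustration of reading the segment size through the
   TCP_MAXSEG socket option, the sketch below uses Python's standard
   socket module.  The helper name is hypothetical, and the value
   returned by the operating system is only an approximation of the
   EMSS that an MPA-aware TCP would expose internally.

```python
import socket


def effective_mss(sock):
    """Query the maximum segment size the TCP stack reports for a socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
```

   The call is most meaningful on a connected TCP socket, where the
   reported value reflects what was negotiated for that connection.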

2590	   In the common case, the ULP (i.e. DDP over MPA) messages provided to
2591	   the TCP layer are segmented to MULPDU size.  It is assumed that the
2592	   ULP message size is bounded by MULPDU, such that a single ULP message
2593	   can be encapsulated in a single TCP segment.  Therefore, in the
2594	   common case, there is no increase in the number of TCP segments
2595	   emitted.  For smaller ULP messages, the sender can also apply
2596	   packing, i.e. the sender packs as many complete FPDUs as possible
2597	   into one TCP segment.  The requirement to always have a complete FPDU
2598	   may increase the number of TCP segments emitted.  Typically, a ULP
2599	   message size varies from a few bytes to multiple EMSS (e.g., 64
2600	   Kbytes).  In some cases the ULP may post more than one message at a
2601	   time for transmission, giving the sender an opportunity for packing.
2602	   In the case where more than one FPDU is available for transmission
2603	   and the FPDUs are encapsulated into a TCP segment and there is no
2604	   room in the TCP segment to include the next complete FPDU, another
2605	   TCP segment is sent.  In this corner case some of the TCP segments
2606	   are not full size.  In the worst case scenario, the ULP may choose an
2607	   FPDU size that is EMSS/2 + 1 and has multiple messages available for
2608	   transmission.  For this poor choice of FPDU size, the average TCP
2609	   segment size is therefore about 1/2 of the EMSS and the number of TCP
2610	   segments emitted is approaching 2x of what is possible without the
2611	   requirement to encapsulate an integer number of complete FPDUs in
2612	   every TCP segment.  This is a dynamic situation that only lasts for
2613	   the duration where the sender ULP has multiple non-optimal messages
2614	   for transmission and this causes a minor impact on the wire
2615	   utilization.
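
   The worst case described above can be checked with a little
   arithmetic.  The sketch below counts TCP segments with and without
   the requirement that every segment carries whole FPDUs; the EMSS of
   1460 octets and the count of 1000 FPDUs are assumed example values,
   and the function names are hypothetical.

```python
def segments_with_alignment(n_fpdus, fpdu_size, emss):
    """Segments needed when each segment holds only complete FPDUs."""
    per_segment = emss // fpdu_size
    if per_segment == 0:
        raise ValueError("FPDU larger than EMSS cannot be aligned")
    return -(-n_fpdus // per_segment)  # ceiling division


def segments_without_alignment(n_fpdus, fpdu_size, emss):
    """Segments needed when FPDUs may straddle segment boundaries."""
    return -(-(n_fpdus * fpdu_size) // emss)  # ceiling division


EMSS = 1460                # assumed effective maximum segment size
WORST_FPDU = EMSS // 2 + 1  # the EMSS/2 + 1 case from the text

aligned = segments_with_alignment(1000, WORST_FPDU, EMSS)
unaligned = segments_without_alignment(1000, WORST_FPDU, EMSS)
```

   For these values the aligned sender emits 1000 segments against 501
   for a byte-stream sender, which approaches the 2x bound; an FPDU
   size of EMSS/2 or smaller would let two FPDUs share each segment.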

2617	   However, it is not expected that requiring FPDU Alignment will have a
2618	   measurable impact on wire behavior of most applications.  Throughput
2619	   applications with large I/Os are expected to take full advantage of
2620	   the EMSS.  Another class of applications with many small outstanding
2621	   buffers (as compared to EMSS) is expected to use packing when
2622	   applicable.  Transaction oriented applications are also optimal.

2624	   TCP retransmission is another area that can affect sender behavior.
2625	   TCP supports retransmission of the exact, originally transmitted
2626	   segment (see [RFC793] section 2.6, [RFC793] section 3.7 "managing the
2627	   window" and [RFC1122] section 4.2.2.15).  In the unlikely event that
2628	   part of the original segment has been received and acknowledged by
2629	   the remote peer (e.g., a re-segmenting middle box, as documented in
2630	   Section 5.4.1, "Re-segmenting Middle boxes and non MPA-aware TCP
2631	   senders"), better available bandwidth utilization may be possible by
2632	   re-transmitting only the missing octets.  If an MPA-aware TCP
2633	   retransmits complete FPDUs, there may be some marginal bandwidth
2634	   loss.

2636	   Another area where a change in the TCP segment number may have impact
2637	   is that of Slow Start and Congestion Avoidance.  Slow-start
2638	   exponential increase is measured in segments per second, as the
2639	   algorithm focuses on the overhead per segment at the source for
2640	   congestion that eventually results in dropped segments.  Slow-start
2641	   exponential bandwidth growth for MPA-aware TCP is similar to any TCP
2642	   implementation.  Congestion Avoidance allows for a linear growth in
2643	   available bandwidth when recovering after a packet drop.  Similar to
2644	   the analysis for slow-start, MPA-aware TCP doesn't change the
2645	   behavior of the algorithm.  Therefore the average size of the segment
2646	   versus EMSS is not a major factor in the assessment of the bandwidth
2647	   growth for a sender.  Both Slow Start and Congestion Avoidance for an
2648	   MPA-aware TCP will behave similarly to any TCP sender and allow an
2649	   MPA-aware TCP to enjoy the theoretical performance limits of the
2650	   algorithms.

2652	   In summary, the ULP messages generated at the sender (e.g., the
2653	   number of messages grouped for every transmission request) and
2654	   the message size distribution have the most significant impact on the
2655	   number of TCP segments emitted.  The worst case effect for certain
2656	   ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by
2657	   an increase of up to 2x in the number of TCP segments and
2658	   acknowledgments.  In reality the effect is expected to be marginal.

2660	11.2 Receiver implementation

2662	   Transport & Network Layer Reassembly Buffers:

2664	   The use of reassembly buffers (either TCP reassembly buffers or IP
2665	   fragmentation reassembly buffers) is implementation dependent.  When
2666	   MPA is enabled, reassembly buffers are needed if out of order packets
2667	   arrive and Markers are not enabled.  Buffers are also needed if FPDU
2668	   Alignment is lost or if IP fragmentation occurs.  This is because the
2669	   incoming out of order segment may not contain enough information for
2670	   MPA to process all of the FPDU.  For cases where a re-segmenting
2671	   middle box is present, or where the TCP sender is not MPA-aware, the
2672	   presence of Markers significantly reduces the amount of buffering
2673	   needed.

2675	   Recovery from IP Fragmentation must be transparent to the MPA
2676	   Consumers.

2678	11.2.1 Network Layer Reassembly Buffers

2680	   Most IP implementations set the IP Don't Fragment bit.  Thus upon a
2681	   path MTU change, intermediate devices drop the IP datagram if it is
2682	   too large and reply with an ICMP message which tells the source TCP
2683	   that the path MTU has changed.  This causes TCP to emit segments
2684	   conformant with the new path MTU size.  Thus IP fragments under most
2685	   conditions should never occur at the receiver.  But it is possible.

2687	   There are several options for implementation of network layer
2688	   reassembly buffers:

2690	   1.  drop any IP fragments, and reply with an ICMP message according
2691	       to [RFC792] (fragmentation needed and DF set) to tell the Remote
2692	       Peer to resize its TCP segment

2694	   2.  support an IP reassembly buffer, but have it of limited size
2695	       (possibly the same size as the local link's MTU).  The end Node
2696	       would normally never advertise a path MTU larger than the local
2697	       link MTU.  It is recommended that a dropped IP fragment cause an
2698	       ICMP message to be generated according to [RFC792].

2700	   3.  multiple IP reassembly buffers, of effectively unlimited size.

2702	   4.  support an IP reassembly buffer for the largest IP datagram (64
2703	       KB).

2705	   5.  support for a large IP reassembly buffer which could span
2706	       multiple IP datagrams.

2708	   An implementation should support at least 2 or 3 above, to avoid
2709	   dropping packets that have traversed the entire fabric.

2711	   There is no end-to-end ACK for IP reassembly buffers, so there is no
2712	   flow control on the buffer.  The only end-to-end ACK is a TCP ACK,
2713	   which can only occur when a complete IP datagram is delivered to TCP.
2714	   Because of this, under worst case, pathological scenarios, the
2715	   largest IP reassembly buffer is the TCP receive window (to buffer
2716	   multiple IP datagrams that have all been fragmented).

2718	   Note that if the Remote Peer does not implement re-segmentation of
2719	   the data stream upon receiving the ICMP reply updating the path MTU,
2720	   it is possible to halt forward progress because the opposite peer
2721	   would continue to retransmit using a transport segment size that is
2722	   too large.  This deadlock scenario is no different than if the fabric
2723	   MTU (not last hop MTU) was reduced after connection setup, and the
2724	   remote Node's behavior is not compliant with [RFC1122].

2726	11.2.2 TCP Reassembly buffers

2728	   A TCP reassembly buffer is also needed.  TCP reassembly buffers are
2729	   needed if FPDU Alignment is lost when using TCP with MPA or when the
2730	   MPA FPDU spans multiple TCP segments.  Buffers are also needed if
2731	   Markers are disabled and out of order packets arrive.

2733	   Since lost FPDU Alignment often means that FPDUs are incomplete, an
2734	   MPA on TCP implementation must have a reassembly buffer large enough
2735	   to recover an FPDU that is less than or equal to the MTU of the
2736	   locally attached link (this should be the largest possible advertised
2737	   TCP path MTU).  If the MTU is smaller than 140 octets, the buffer
2738	   MUST be at least 140 octets long to support the minimum FPDU size.
2739	   The 140 octets allow for the minimum MULPDU of 128, 2 octets of pad,
2740	   2 of ULPDU_Length, 4 of CRC, and space for a possible Marker.  As
2741	   usual, additional buffering may provide better performance.
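
   The 140 octet minimum can be reproduced by adding up the components
   listed above.  In this sketch the 4 octet Marker size is inferred
   from the stated total rather than quoted from this section, and the
   constant and function names are hypothetical.

```python
MIN_MULPDU = 128          # minimum MULPDU, in octets
PAD_OCTETS = 2            # pad allowed for in the text
ULPDU_LENGTH_OCTETS = 2   # ULPDU_Length field
CRC_OCTETS = 4            # CRC field
MARKER_OCTETS = 4         # one possible Marker (inferred: 140 - 136)

MIN_FPDU_BUFFER = (MIN_MULPDU + PAD_OCTETS + ULPDU_LENGTH_OCTETS
                   + CRC_OCTETS + MARKER_OCTETS)


def min_reassembly_buffer(link_mtu):
    """Smallest TCP reassembly buffer for a given local link MTU."""
    return max(link_mtu, MIN_FPDU_BUFFER)
```

   For a typical 1500 octet link MTU the floor is irrelevant; it only
   matters on links whose MTU is below 140 octets.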

2743	   Note that if the TCP segment were not stored, it is possible to
2744	   deadlock the MPA algorithm.  If the path MTU is reduced, FPDU
2745	   Alignment requires the source TCP to re-segment the data stream to
2746	   the new path MTU.  The source MPA will detect this condition and
2747	   reduce the MPA segment size, but any FPDUs already posted to the
2748	   source TCP will be re-segmented and lose FPDU Alignment.  If the
2749	   destination does not support a TCP reassembly buffer, these segments
2750	   can never be successfully transmitted and the protocol deadlocks.

2752	   When a complete FPDU is received, processing continues normally.

2754	11.3 IETF Implementation Interoperability with RDMA Consortium Protocols

2756	   The RDMA Consortium created early specifications of the MPA/DDP/RDMA
2757	   protocols and some manufacturers created implementations of those
2758	   protocols before the IETF versions were finalized.  These protocols
2759	   are very similar to the IETF versions, making it possible for
2760	   implementations to be created or modified to support either set of
2761	   specifications.  For those interested, the RDMA Consortium protocol
2762	   documents can be obtained at http://www.rdmaconsortium.org.

2764	   In this section, implementations of MPA/DDP/RDMA that conform to the
2765	   RDMAC specifications are called RDMAC RNICs.  Implementations of
2766	   MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.

2768	   Without the exchange of MPA Request/Reply Frames, there is no
2769	   standard mechanism for enabling RDMAC RNICs to interoperate with IETF
2770	   RNICs.  Even if a ULP uses a well-known port to start an IETF RNIC
2771	   immediately in RDMA mode (i.e., without exchanging the MPA
2772	   Request/Reply messages), there is no reason to believe an IETF RNIC
2773	   will interoperate with an RDMAC RNIC because of the differences in
2774	   the version number in the DDP and RDMAP headers on the wire.

2776	   Therefore, the ULP or other supporting entity at the RDMAC RNIC must
2777	   implement MPA Request/Reply Frames on behalf of the RNIC in order to
2778	   negotiate the connection parameters.  The following section describes
2779	   the results following the exchange of the MPA Request/Reply Frames
2780	   before the conversion from streaming to RDMA mode.

2782	11.3.1 Negotiated Parameters

2784	   Three types of RNICs are considered:

2786	   Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols which
2787	       has a ULP or other supporting entity that exchanges the MPA
2788	       Request/Reply Frames in streaming mode before the conversion to
2789	       RDMA mode.

2791	   Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
2792	       which is not capable of implementing the RDMAC protocols.  Such
2793	       an RNIC can only interoperate with other IETF RNICs.

2795	   Permissive IETF RNIC - an RNIC implementing the IETF protocols which
2796	       is capable of implementing the RDMAC protocols on a per
2797	       connection basis.

2799	   The Permissive IETF RNIC is recommended for those implementers that
2800	   want maximum interoperability with other RNIC implementations.

2802	   The values used by these three RNIC types for the MPA, DDP, and RDMAP
2803	   versions as well as MPA Markers and CRC are summarized in Figure 12.

2805	    +----------------++-----------+-----------+-----------+-----------+
2806	    | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
2807	    |                ||  Version  | Revision  |  Markers  |    CRC    |
2808	    +----------------++-----------+-----------+-----------+-----------+
2809	    +----------------++-----------+-----------+-----------+-----------+
2810	    | RDMAC          ||     0     |     0     |     1     |     1     |
2811	    |                ||           |           |           |           |
2812	    +----------------++-----------+-----------+-----------+-----------+
2813	    | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
2814	    | Non-permissive ||           |           |           |           |
2815	    +----------------++-----------+-----------+-----------+-----------+
2816	    | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
2817	    | permissive     ||           |           |           |           |
2818	    +----------------++-----------+-----------+-----------+-----------+
2819	           Figure 12.  Connection Parameters for the RNIC Types.
2820	            For MPA Markers and MPA CRC, enabled=1, disabled=0.

2822	   It is assumed there is no mixing of versions allowed between MPA, DDP
2823	   and RDMAP.  The RNIC either generates the RDMAC protocols on the wire
2824	   (version is zero) or the IETF protocols (version is one).

2826	   During the exchange of the MPA Request/Reply Frames, each peer
2827	   provides its MPA Revision, Marker preference (M: 0=disabled,
2828	   1=enabled), and CRC preference.  The MPA Revision provided in the MPA
2829	   Request Frame and the MPA Reply Frame may differ.

2831	   From the information in the MPA Request/Reply Frames, each side sets
2832	   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
2833	   well as the state of the Markers for each half connection.  Between
2834	   DDP and RDMAP, no mixing of versions is allowed.  Moreover, the DDP
2835	   and RDMAP version MUST be identical in the two directions.  The RNIC
2836	   either generates the RDMAC protocols on the wire (version is zero) or
2837	   the IETF protocols (version is one).

2839	   In the following sections, the figures do not discuss CRC negotiation
2840	   because there is no interoperability issue for CRCs.  Since the RDMAC
2841	   RNIC will always request CRC use, then, according to the IETF MPA
2842	   specification, both peers MUST generate and check CRCs.

2844	11.3.2 RDMAC RNIC and Non-permissive IETF RNIC

2846	   Figure 13 shows that a Non-permissive IETF RNIC cannot interoperate
2847	   with an RDMAC RNIC, despite the fact that both peers exchange MPA
2848	   Request/Reply Frames.  For a Non-permissive IETF RNIC, the MPA
2849	   negotiation has no effect on the DDP/RDMAP version and it is unable
2850	   to interoperate with the RDMAC RNIC.

2852	   The rows in the figure show the state of the Marker field in the MPA
2853	   Request Frame sent by the MPA Initiator.  The columns show the state
2854	   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
2855	   Each type of RNIC is shown as an Initiator and a Responder.  The
2856	   connection results are shown in the lower right corner, at the
2857	   intersection of the different RNIC types, where V=0 is the RDMAC
2858	   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
2859	   Markers are disabled and M=1 means MPA Markers are enabled.  The
2860	   negotiated Marker state is shown as X/Y, for the receive direction of
2861	   the Initiator/Responder.

2863	          +---------------------------++-----------------------+
2864	          |   MPA                     ||          MPA          |
2865	          | CONNECT                   ||       Responder       |
2866	          |   MODE  +-----------------++-------+---------------+
2867	          |         |   RNIC          || RDMAC |     IETF      |
2868	          |         |   TYPE          ||       | Non-permissive|
2869	          |         |          +------++-------+-------+-------+
2870	          |         |          |MARKER|| M=1   | M=0   |  M=1  |
2871	          +---------+----------+------++-------+-------+-------+
2872	          +---------+----------+------++-------+-------+-------+
2873	          |         |   RDMAC  | M=1  || V=0   | close | close |
2874	          |         |          |      || M=1/1 |       |       |
2875	          |         +----------+------++-------+-------+-------+
2876	          |   MPA   |          | M=0  || close | V=1   | V=1   |
2877	          |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
2878	          |         |Non-perms.+------++-------+-------+-------+
2879	          |         |          | M=1  || close | V=1   | V=1   |
2880	          |         |          |      ||       | M=1/0 | M=1/1 |
2881	          +---------+----------+------++-------+-------+-------+
2882	   Figure 13: MPA negotiation between an RDMAC RNIC and a Non-permissive
2883	                                IETF RNIC.

2885	11.3.2.1 RDMAC RNIC Initiator

2887	   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
2888	   Frame with Rev field set to zero and the M and C bits set to one.
2889	   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
2890	   version number it uses for DDP and RDMAP, it would send an MPA Reply
2891	   Frame with the Rev field equal to one and then gracefully close the
2892	   connection.

2894	11.3.2.2 Non-Permissive IETF RNIC Initiator

2896	   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
2897	   Request Frame with Rev field equal to one.  The ULP or supporting
2898	   entity for the RDMAC RNIC responds with an MPA Reply Frame that has
2899	   the Rev field equal to zero and the M bit set to one.  The Non-
2900	   permissive IETF RNIC will gracefully close the connection after it
2901	   reads the incompatible Rev field in the MPA Reply Frame.

2903	11.3.3 RDMAC RNIC and Permissive IETF RNIC

2905	   Figure 14 shows that a Permissive IETF RNIC can interoperate with an
2906	   RDMAC RNIC regardless of its Marker preference.  The figure uses the
2907	   same format as shown with the Non-permissive IETF RNIC.

2909	          +---------------------------++-----------------------+
2910	          |   MPA                     ||          MPA          |
2911	          | CONNECT                   ||       Responder       |
2912	          |   MODE  +-----------------++-------+---------------+
2913	          |         |   RNIC          || RDMAC |     IETF      |
2914	          |         |   TYPE          ||       |  Permissive   |
2915	          |         |          +------++-------+-------+-------+
2916	          |         |          |MARKER|| M=1   | M=0   | M=1   |
2917	          +---------+----------+------++-------+-------+-------+
2918	          +---------+----------+------++-------+-------+-------+
2919	          |         |   RDMAC  | M=1  || V=0   | N/A   | V=0   |
2920	          |         |          |      || M=1/1 |       | M=1/1 |
2921	          |         +----------+------++-------+-------+-------+
2922	          |   MPA   |          | M=0  || V=0   | V=1   | V=1   |
2923	          |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
2924	          |         |Permissive+------++-------+-------+-------+
2925	          |         |          | M=1  || V=0   | V=1   | V=1   |
2926	          |         |          |      || M=1/1 | M=1/0 | M=1/1 |
2927	          +---------+----------+------++-------+-------+-------+
2928	     Figure 14: MPA negotiation between an RDMAC RNIC and a Permissive
2929	                                IETF RNIC.

2931	   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
2932	   Rev field of the MPA Req/Rep Frames and then adjust its receive
2933	   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.  As
2934	   a result, as an MPA Responder, the Permissive IETF RNIC will never
2935	   return an MPA Reply Frame with the M bit set to zero.  This case is
2936	   shown as not applicable (N/A) in Figure 14.

2938	11.3.3.1 RDMAC RNIC Initiator

2940	   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
2941	   entity prepares an MPA Request Frame and sets the Rev field to zero
2942	   and the M bit and C bit to one.

2944	   The Permissive IETF Responder receives the MPA Request Frame and
2945	   checks the Rev field.  Since it is capable of generating RDMAC
2946	   DDP/RDMAP headers, it sends an MPA Reply Frame with Rev set to
2947	   zero and the M and C bits set to one.  The Responder must inform its
2948	   ULP that it is generating version zero DDP/RDMAP messages.
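
   The Responder behavior described above might look like the sketch
   below.  It is illustrative only: MpaFrame and permissive_respond are
   hypothetical names, and the branch for an IETF peer (Rev equal to
   one) assumes the Responder simply replies with its own M and C
   preferences, which this section does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpaFrame:
    """Hypothetical view of an MPA Request/Reply Frame (Rev, M, C only)."""
    rev: int
    m: bool
    c: bool


def permissive_respond(request: MpaFrame, own_m: bool, own_c: bool):
    """Build the Reply of a Permissive IETF Responder.

    Returns (reply_frame, ddp_rdmap_version), where the version is what
    the Responder reports to its ULP.
    """
    if request.rev == 0:
        # RDMAC peer: mirror Rev=0 and force Markers and CRC on.
        return MpaFrame(rev=0, m=True, c=True), 0
    # IETF peer (assumed handling): keep Rev=1 and the local preferences.
    return MpaFrame(rev=1, m=own_m, c=own_c), 1
```

   The local Marker preference is ignored in the RDMAC branch, which
   matches the N/A column of Figure 14.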

2950	11.3.3.2 Permissive IETF RNIC Initiator

2952	   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
2953	   Request Frame setting the Rev field to one.  Regardless of the value
2954	   of the M bit in the MPA Request Frame, the ULP or other supporting
2955	   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
2956	   equal to zero and the M bit set to one.

2958	   When the Initiator reads the Rev field of the MPA Reply Frame and
2959	   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
2960	   should generate version zero DDP/RDMAP messages and enable MPA
2961	   Markers and CRC.
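
   A possible shape for the Initiator's handling of the MPA Reply Frame
   is sketched below.  The names UlpSettings and permissive_on_reply are
   hypothetical.  The Rev=0 branch follows the text above.  The Rev=1
   branch rests on two assumptions taken from the general MPA rules
   rather than from this section: each M bit states that side's receive
   requirement, and CRC is used if either side asked for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UlpSettings:
    """What the Initiator tells its ULP after negotiation (hypothetical)."""
    ddp_rdmap_version: int
    send_markers: bool
    recv_markers: bool
    crc: bool


def permissive_on_reply(reply_rev: int, reply_m: bool, reply_c: bool,
                        own_m: bool, own_c: bool) -> UlpSettings:
    """Derive ULP settings from the peer's MPA Reply Frame."""
    if reply_rev == 0:
        # RDMAC peer: version zero DDP/RDMAP, Markers and CRC enabled.
        return UlpSettings(ddp_rdmap_version=0, send_markers=True,
                           recv_markers=True, crc=True)
    # IETF peer (assumed rules, see the lead-in).
    return UlpSettings(ddp_rdmap_version=1, send_markers=reply_m,
                       recv_markers=own_m, crc=own_c or reply_c)
```

   The Initiator's own M preference has no effect in the RDMAC case,
   which is why both Permissive Initiator rows of Figure 14 show M=1/1
   under the RDMAC column.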

2963	11.3.4 Non-Permissive IETF RNIC and Permissive IETF RNIC

2965	   For completeness, Figure 15 shows the results of MPA negotiation
2966	   between a Non-permissive IETF RNIC and a Permissive IETF RNIC.  The
2967	   important point from this figure is that an IETF RNIC cannot detect
2968	   whether its peer is a Permissive or Non-permissive RNIC.

2970	      +---------------------------++-------------------------------+
2971	      |   MPA                     ||              MPA              |
2972	      | CONNECT                   ||            Responder          |
2973	      |   MODE  +-----------------++---------------+---------------+
2974	      |         |   RNIC          ||     IETF      |     IETF      |
2975	      |         |   TYPE          || Non-permissive|  Permissive   |
2976	      |         |          +------++-------+-------+-------+-------+
2977	      |         |          |MARKER|| M=0   | M=1   | M=0   | M=1   |
2978	      +---------+----------+------++-------+-------+-------+-------+
2979	      +---------+----------+------++-------+-------+-------+-------+
2980	      |         |          | M=0  || V=1   | V=1   | V=1   | V=1   |
2981	      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2982	      |         |Non-perms.+------++-------+-------+-------+-------+
2983	      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
2984	      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2985	      |   MPA   +----------+------++-------+-------+-------+-------+
2986	      |Initiator|          | M=0  || V=1   | V=1   | V=1   | V=1   |
2987	      |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
2988	      |         |Permissive+------++-------+-------+-------+-------+
2989	      |         |          | M=1  || V=1   | V=1   | V=1   | V=1   |
2990	      |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
2991	      +---------+----------+------++-------+-------+-------+-------+
2992	    Figure 15: MPA negotiation between a Non-permissive IETF RNIC and a
2993	                           Permissive IETF RNIC.
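
   The cells of Figure 15 can be regenerated with a few lines of code,
   which makes the point about indistinguishable peers concrete.  This
   is only a sketch: the function names are hypothetical, and it reads
   "M=a/b" as the Initiator's M bit followed by the Responder's M bit,
   which is an interpretation of the figure notation.

```python
from itertools import product

KINDS = ("Non-permissive", "Permissive")


def figure15_cell(initiator_m: int, responder_m: int) -> str:
    """Cell contents for two IETF RNICs: Rev is always 1."""
    return f"V=1 M={initiator_m}/{responder_m}"


def figure15_table() -> dict:
    """Map (initiator kind, initiator M, responder kind, responder M) to a cell."""
    table = {}
    for init_kind, init_m, resp_kind, resp_m in product(KINDS, (0, 1), KINDS, (0, 1)):
        # The RNIC kinds are part of the key but never influence the cell.
        table[(init_kind, init_m, resp_kind, resp_m)] = figure15_cell(init_m, resp_m)
    return table
```

   Because the cell depends only on the two M bits, swapping either
   RNIC kind leaves every entry unchanged.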

2995	12 Authors' Addresses

2997	   Stephen Bailey
2998	       Sandburst Corporation
2999	       600 Federal Street
3000	       Andover, MA  01810 USA
3001	       Phone: +1 978 689 1614
3002	       Email: steph@sandburst.com

3004	   Paul R. Culley
3005	       Hewlett-Packard Company
3006	       20555 SH 249
3007	       Houston, Tx. USA 77070-2698
3008	       Phone:  281-514-5543
3009	       Email:  paul.culley@hp.com

3011	   Uri Elzur
3012	       Broadcom
3013	       16215 Alton Parkway
3014	       Irvine, CA 92618
3015	       Phone: 949.585.6432
3016	       Email:  uri@broadcom.com

3018	   Renato J Recio
3019	       IBM
3020	       Internal Zip 9043
3021	       11400 Burnett Road
3022	       Austin,  Texas  78759
3023	       Phone:  512-838-3685
3024	       Email:  recio@us.ibm.com

3026	   John Carrier
3027	       Cray Inc.
3028	       411 First Avenue S, Suite 600
3029	       Seattle, WA 98104-2860
3030	       Phone: 206-701-2090
3031	       Email: carrier@cray.com

3033	13 Acknowledgments

3035	   Dwight Barron
3036	       Hewlett-Packard Company
3037	       20555 SH 249
3038	       Houston, Tx. USA 77070-2698
3039	       Phone: 281-514-2769
3040	       Email: dwight.barron@hp.com

3042	   Jeff Chase
3043	       Department of Computer Science
3044	       Duke University
3045	       Durham, NC 27708-0129 USA
3046	       Phone: +1 919 660 6559
3047	       Email: chase@cs.duke.edu

3049	   Ted Compton
3050	       EMC Corporation
3051	       Research Triangle Park, NC 27709, USA
3052	       Phone: 919-248-6075
3053	       Email: compton_ted@emc.com

3055	   Dave Garcia
3056	       Hewlett-Packard Company
3057	       19333 Vallco Parkway
3058	       Cupertino, Ca. USA 95014
3059	       Phone: 408.285.6116
3060	       Email: dave.garcia@hp.com

3062	   Hari Ghadia
3063	       Adaptec, Inc.
3064	       691 S. Milpitas Blvd.,
3065	       Milpitas, CA 95035  USA
3066	       Phone: +1 (408) 957-5608
3067	       Email: hari_ghadia@adaptec.com

3069	   Howard C. Herbert
3070	       Intel Corporation
3071	       MS CH7-404
3072	       5000 West Chandler Blvd.
3073	       Chandler, Arizona 85226
3074	       Phone: 480-554-3116
3075	       Email: howard.c.herbert@intel.com

3077	   Jeff Hilland
3078	       Hewlett-Packard Company
3079	       20555 SH 249
3080	       Houston, Tx. USA 77070-2698
3081	       Phone: 281-514-9489
3082	       Email: jeff.hilland@hp.com

3084	   Mike Ko
3085	       IBM
3086	       650 Harry Rd.
3087	       San Jose, CA 95120
3088	       Phone: (408) 927-2085
3089	       Email: mako@us.ibm.com

3091	   Mike Krause
3092	       Hewlett-Packard Corporation, 43LN
3093	       19410 Homestead Road
3094	       Cupertino, CA 95014 USA
3095	       Phone: +1 (408) 447-3191
3096	       Email: krause@cup.hp.com

3098	   Dave Minturn
3099	       Intel Corporation
3100	       MS JF1-210
3101	       5200 North East Elam Young Parkway
3102	       Hillsboro, Oregon  97124
3103	       Phone: 503-712-4106
3104	       Email: dave.b.minturn@intel.com

3106	   Jim Pinkerton
3107	       Microsoft, Inc.
3108	       One Microsoft Way
3109	       Redmond, WA, USA 98052
3110	       Email: jpink@microsoft.com

3112	   Hemal Shah
3113	       16215 Alton Parkway
3114	       Irvine, California 92619-7013 USA
3115	       Phone: +1 949 926-6941
3116	       Email: hemal@broadcom.com

3118	   Allyn Romanow
3119	       Cisco Systems
3120	       170 W Tasman Drive
3121	       San Jose, CA 95134 USA
3122	       Phone: +1 408 525 8836
3123	       Email: allyn@cisco.com

3125	   Tom Talpey
3126	       Network Appliance
3127	       375 Totten Pond Road
3128	       Waltham, MA 02451 USA
3129	       Phone: +1 (781) 768-5329
3130	       Email: thomas.talpey@netapp.com

3132	   Patricia Thaler
3133	       Broadcom
3134	       16215 Alton Parkway
3135	       Irvine, CA 92618
3136	       Phone: 916 570 2707
3137	       Email: pthaler@broadcom.com

3139	   Jim Wendt
3140	       Hewlett Packard Corporation
3141	       8000 Foothills Boulevard MS 5668
3142	       Roseville, CA 95747-5668 USA
3143	       Phone: +1 916 785 5198
3144	       Email: jim_wendt@hp.com

3146	   Jim Williams
3147	       Emulex Corporation
3148	       580 Main Street
3149	       Bolton, MA 01740 USA
3150	       Phone: +1 978 779 7224
3151	       Email: jim.williams@emulex.com

3153	Full Copyright Statement

3155	   This document and the information contained herein are provided on an
3156	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
3157	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
3158	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
3159	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
3160	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
3161	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

3163	   Copyright (C) The Internet Society (2006).  This document is subject
3164	   to the rights, licenses and restrictions contained in BCP 78, and
3165	   except as set forth therein, the authors retain all their rights.

3167	Intellectual Property

3169	   The IETF takes no position regarding the validity or scope of any
3170	   Intellectual Property Rights or other rights that might be claimed to
3171	   pertain to the implementation or use of the technology described in
3172	   this document or the extent to which any license under such rights
3173	   might or might not be available; nor does it represent that it has
3174	   made any independent effort to identify any such rights.  Information
3175	   on the procedures with respect to rights in RFC documents can be
3176	   found in BCP 78 and BCP 79.

3178	   Copies of IPR disclosures made to the IETF Secretariat and any
3179	   assurances of licenses to be made available, or the result of an
3180	   attempt made to obtain a general license or permission for the use of
3181	   such proprietary rights by implementers or users of this
3182	   specification can be obtained from the IETF on-line IPR repository at
3183	   http://www.ietf.org/ipr.

3185	   The IETF invites any interested party to bring to its attention any
3186	   copyrights, patents or patent applications, or other proprietary
3187	   rights that may cover technology that may be required to implement
3188	   this standard.  Please address the information to the IETF at
3189	   ietf-ipr@ietf.org.