idnits 2.17.1 

draft-ietf-mptcp-multiaddressed-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 22, 2012) is 4204 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: 'Data ACK' on line 474

  -- Looks like a reference, but probably isn't: 'Checksum' on line 475

  -- Looks like a reference, but probably isn't: 'Data FIN' on line 503

  -- Looks like a reference, but probably isn't: 'DFIN' on line 2871

  ** Obsolete normative reference: RFC  793 (ref. '1') (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-07) exists of
     draft-ietf-mptcp-api-05

  -- Obsolete informational reference (is this intentional?): RFC 1323 (ref.
     '15') (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 5226 (ref.
     '24') (Obsoleted by RFC 8126)


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                  A. Ford
3	Internet-Draft                                                     Cisco
4	Intended status: Experimental                                  C. Raiciu
5	Expires: April 25, 2013                        University Politehnica of
6	                                                               Bucharest
7	                                                              M. Handley
8	                                               University College London
9	                                                          O. Bonaventure
10	                                                Universite catholique de
11	                                                                 Louvain
12	                                                        October 22, 2012

14	     TCP Extensions for Multipath Operation with Multiple Addresses
15	                   draft-ietf-mptcp-multiaddressed-12

17	Abstract

19	   TCP/IP communication is currently restricted to a single path per
20	   connection, yet multiple paths often exist between peers.  The
21	   simultaneous use of these multiple paths for a TCP/IP session would
22	   improve resource usage within the network, and thus improve user
23	   experience through higher throughput and improved resilience to
24	   network failure.

26	   Multipath TCP provides the ability to simultaneously use multiple
27	   paths between peers.  This document presents a set of extensions to
28	   traditional TCP to support multipath operation.  The protocol offers
29	   the same type of service to applications as TCP (i.e. reliable
30	   bytestream), and provides the components necessary to establish and
31	   use multiple TCP flows across potentially disjoint paths.

33	Status of this Memo

35	   This Internet-Draft is submitted in full conformance with the
36	   provisions of BCP 78 and BCP 79.

38	   Internet-Drafts are working documents of the Internet Engineering
39	   Task Force (IETF).  Note that other groups may also distribute
40	   working documents as Internet-Drafts.  The list of current Internet-
41	   Drafts is at http://datatracker.ietf.org/drafts/current/.

43	   Internet-Drafts are draft documents valid for a maximum of six months
44	   and may be updated, replaced, or obsoleted by other documents at any
45	   time.  It is inappropriate to use Internet-Drafts as reference
46	   material or to cite them other than as "work in progress."

48	   This Internet-Draft will expire on April 25, 2013.

50	Copyright Notice

52	   Copyright (c) 2012 IETF Trust and the persons identified as the
53	   document authors.  All rights reserved.

55	   This document is subject to BCP 78 and the IETF Trust's Legal
56	   Provisions Relating to IETF Documents
57	   (http://trustee.ietf.org/license-info) in effect on the date of
58	   publication of this document.  Please review these documents
59	   carefully, as they describe your rights and restrictions with respect
60	   to this document.  Code Components extracted from this document must
61	   include Simplified BSD License text as described in Section 4.e of
62	   the Trust Legal Provisions and are provided without warranty as
63	   described in the Simplified BSD License.

65	Table of Contents

67	   1.  Introduction
68	     1.1.  Design Assumptions
69	     1.2.  Multipath TCP in the Networking Stack
70	     1.3.  Terminology
71	     1.4.  MPTCP Concept
72	     1.5.  Requirements Language
73	   2.  Operation Overview
74	     2.1.  Initiating an MPTCP connection
75	     2.2.  Associating a new subflow with an existing MPTCP
76	           connection
77	     2.3.  Informing the other Host about another potential
78	           address
79	     2.4.  Data transfer using MPTCP
80	     2.5.  Requesting a change in a path's priority
81	     2.6.  Closing an MPTCP connection
82	     2.7.  Notable features
83	   3.  MPTCP Protocol
84	     3.1.  Connection Initiation
85	     3.2.  Starting a New Subflow
86	     3.3.  General MPTCP Operation
87	       3.3.1.  Data Sequence Mapping
88	       3.3.2.  Data Acknowledgments
89	       3.3.3.  Closing a Connection
90	       3.3.4.  Receiver Considerations
91	       3.3.5.  Sender Considerations
92	       3.3.6.  Reliability and Retransmissions
93	       3.3.7.  Congestion Control Considerations
94	       3.3.8.  Subflow Policy
95	     3.4.  Address Knowledge Exchange (Path Management)
96	       3.4.1.  Address Advertisement
97	       3.4.2.  Remove Address
98	     3.5.  Fast Close
99	     3.6.  Fallback
100	     3.7.  Error Handling
101	     3.8.  Heuristics
102	       3.8.1.  Port Usage
103	       3.8.2.  Delayed Subflow Start
104	       3.8.3.  Failure Handling
105	   4.  Semantic Issues
106	   5.  Security Considerations
107	   6.  Interactions with Middleboxes
108	   7.  Acknowledgments
109	   8.  IANA Considerations
110	   9.  References
111	     9.1.  Normative References
112	     9.2.  Informative References
113	   Appendix A.  Notes on use of TCP Options
114	   Appendix B.  Control Blocks
115	     B.1.  MPTCP Control Block
116	       B.1.1.  Authentication and Metadata
117	       B.1.2.  Sending Side
118	       B.1.3.  Receiving Side
119	     B.2.  TCP Control Blocks
120	       B.2.1.  Sending Side
121	       B.2.2.  Receiving Side
122	   Appendix C.  Finite State Machine
123	   Authors' Addresses

125	1.  Introduction

127	   MPTCP is a set of extensions to regular TCP [1] to provide a
128	   Multipath TCP [2] service, which enables a transport connection to
129	   operate across multiple paths simultaneously.  This document presents
130	   the protocol changes required to add multipath capability to TCP;
131	   specifically, those for signaling and setting up multiple paths
132	   ("subflows"), managing these subflows, reassembly of data, and
133	   termination of sessions.  This is not the only information required
134	   to create a Multipath TCP implementation, however.  This document is
135	   complemented by three others:

137	   o  Architecture [2], which explains the motivations behind Multipath
138	      TCP, contains a discussion of high-level design decisions on which
139	      this design is based, and an explanation of a functional
140	      separation through which an extensible MPTCP implementation can be
141	      developed.

143	   o  Congestion Control [5], presenting a safe congestion control
144	      algorithm for coupling the behaviour of the multiple paths in
145	      order to "do no harm" to other network users.

147	   o  Application Considerations [6], discussing what impact MPTCP will
148	      have on applications, what applications will want to do with
149	      MPTCP, and as a consequence of these factors, what API extensions
150	      an MPTCP implementation should present.

152	1.1.  Design Assumptions

154	   In order to limit the potentially huge design space, the working
155	   group imposed two key constraints on the multipath TCP design
156	   presented in this document:

158	   o  It must be backwards-compatible with current, regular TCP, to
159	      increase its chances of deployment

161	   o  It can be assumed that one or both hosts are multihomed and
162	      multiaddressed

164	   To simplify the design we assume that the presence of multiple
165	   addresses at a host is sufficient to indicate the existence of
166	   multiple paths.  These paths need not be entirely disjoint: they may
167	   share one or many routers between them.  Even in such a situation
168	   making use of multiple paths is beneficial, improving resource
169	   utilisation and resilience to a subset of node failures.  The
170	   congestion control algorithms defined in [5] ensure this does not act
171	   detrimentally.  Furthermore, there may be some scenarios where
172	   different TCP ports on a single host can provide disjoint paths (such
173	   as through certain ECMP implementations [7]), and so the MPTCP design
174	   also supports the use of ports in path identifiers.

176	   There are three aspects to the backwards-compatibility listed above
177	   (discussed in more detail in [2]):

179	   External Constraints:  The protocol must function through the vast
180	      majority of existing middleboxes such as NATs, firewalls and
181	      proxies, and as such must resemble existing TCP as far as possible
182	      on the wire.  Furthermore, the protocol must not assume the
183	      segments it sends on the wire arrive unmodified at the
184	      destination: they may be split or coalesced; TCP options may be
185	      removed or duplicated.

187	   Application Constraints:  The protocol must be usable with no change
188	      to existing applications that use the common TCP API (although it
189	      is reasonable that not all features would be available to such
190	      legacy applications).  Furthermore, the protocol must provide the
191	      same service model as regular TCP to the application.

193	   Fall-back:  The protocol should be able to fall back to standard TCP
194	      with no interference from the user, to be able to communicate with
195	      legacy hosts.

197	   The complementary application considerations document [6] discusses
198	   the necessary features of an API to provide backwards-compatibility,
199	   as well as API extensions to convey the behaviour of MPTCP at a level
200	   of control and information equivalent to that available with regular,
201	   single-path TCP.

203	   Further discussion of the design constraints and associated design
204	   decisions is given in the MPTCP Architecture document [2].

206	1.2.  Multipath TCP in the Networking Stack

208	   MPTCP operates at the transport layer and aims to be transparent to
209	   both higher and lower layers.  It is a set of additional features on
210	   top of standard TCP; Figure 1 illustrates this layering.  MPTCP is
211	   designed to be usable by legacy applications with no changes;
212	   detailed discussion of its interactions with applications is given in
213	   [6].

215	                                   +-------------------------------+
216	                                   |           Application         |
217	      +---------------+            +-------------------------------+
218	      |  Application  |            |             MPTCP             |
219	      +---------------+            + - - - - - - - + - - - - - - - +
220	      |      TCP      |            | Subflow (TCP) | Subflow (TCP) |
221	      +---------------+            +-------------------------------+
222	      |      IP       |            |       IP      |      IP       |
223	      +---------------+            +-------------------------------+

225	      Figure 1: Comparison of Standard TCP and MPTCP Protocol Stacks

227	1.3.  Terminology

229	   This document makes use of a number of terms which are either MPTCP-
230	   specific, or have defined meaning in the context of MPTCP, as
231	   follows:

233	   Path:  A sequence of links between a sender and a receiver, defined
234	      in this context by a 4-tuple of source and destination address/
235	      port pairs.

237	   Subflow:  A flow of TCP segments operating over an individual path,
238	      which forms part of a larger MPTCP connection.  A subflow is
239	      started and terminated similarly to a regular TCP connection.

241	   (MPTCP) Connection:  A set of one or more subflows, over which an
242	      application can communicate between two hosts.  There is a one-to-
243	      one mapping between a connection and an application socket.

245	   Data-level:  The payload data is nominally transferred over a
246	      connection, which in turn is transported over subflows.  Thus the
247	      term "data-level" is synonymous with "connection level", in
248	      contrast to "subflow-level" which refers to properties of an
249	      individual subflow.

251	   Token:  A locally unique identifier given to a multipath connection
252	      by a host.  May also be referred to as a "Connection ID".

254	   Host:  An end host operating an MPTCP implementation, and either
255	      initiating or accepting an MPTCP connection.

257	   In addition to these terms, note that MPTCP's interpretation of, and
258	   effect on, regular single-path TCP semantics are discussed in
259	   Section 4.

261	1.4.  MPTCP Concept

263	   This section provides a high-level summary of normal operation of
264	   MPTCP, and is illustrated by the scenario shown in Figure 2.  A
265	   detailed description of operation is given in Section 3.

267	   o  To a non-MPTCP-aware application, MPTCP will behave the same as
268	      normal TCP.  Extended APIs could provide additional control to
269	      MPTCP-aware applications [6].  An application begins by opening a
270	      TCP socket in the normal way.  MPTCP signaling and operation are
271	      handled by the MPTCP implementation.

273	   o  An MPTCP connection begins similarly to a regular TCP connection.
274	      This is illustrated in Figure 2 where an MPTCP connection is
275	      established between addresses A1 and B1 on Hosts A and B
276	      respectively.

278	   o  If extra paths are available, additional TCP sessions (termed
279	      MPTCP "subflows") are created on these paths, and are combined
280	      with the existing session, which continues to appear as a single
281	      connection to the applications at both ends.  The creation of the
282	      additional TCP session is illustrated between Address A2 on Host A
283	      and Address B1 on Host B.

285	   o  MPTCP identifies multiple paths by the presence of multiple
286	      addresses at hosts.  Combinations of these multiple addresses
287	      equate to the additional paths.  In the example, other potential
288	      paths that could be set up are A1<->B2 and A2<->B2.  Although this
289	      additional session is shown as being initiated from A2, it could
290	      equally have been initiated from B1.

292	   o  The discovery and setup of additional subflows will be achieved
293	      through a path management method; this document describes a
294	      mechanism by which a host can initiate new subflows by using its
295	      own additional addresses, or by signaling its available addresses
296	      to the other host.

298	   o  MPTCP adds connection-level sequence numbers to allow the
299	      reassembly of segments arriving on multiple subflows with
300	      differing network delays.

302	   o  Subflows are terminated as regular TCP connections, with a
303	      four-way FIN handshake.  The MPTCP connection is terminated by a
304	      connection-level FIN.

306	               Host A                               Host B
307	      ------------------------             ------------------------
308	      Address A1    Address A2             Address B1    Address B2
309	      ----------    ----------             ----------    ----------
310	          |             |                      |             |
311	          |     (initial connection setup)     |             |
312	          |----------------------------------->|             |
313	          |<-----------------------------------|             |
314	          |             |                      |             |
315	          |            (additional subflow setup)            |
316	          |             |--------------------->|             |
317	          |             |<---------------------|             |
318	          |             |                      |             |
319	          |             |                      |             |

321	                  Figure 2: Example MPTCP Usage Scenario
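
   The way combinations of addresses equate to candidate paths can be
   sketched as follows.  This is only an illustration of the scenario
   in Figure 2: the function name candidate_paths and the string labels
   for addresses are assumptions made for the example, and a real
   implementation would also apply local policy before opening any
   subflow.

```python
from itertools import product


def candidate_paths(local_addrs, remote_addrs):
    """Return every (local, remote) address pair as a potential path."""
    return list(product(local_addrs, remote_addrs))


# Host A has addresses A1 and A2; Host B has addresses B1 and B2.
paths = candidate_paths(["A1", "A2"], ["B1", "B2"])

# The initial connection uses A1<->B1; the additional subflow in
# Figure 2 uses A2<->B1. The remaining pairs are still unused.
in_use = {("A1", "B1"), ("A2", "B1")}
unused = [p for p in paths if p not in in_use]
```

   Here unused holds the pairs A1<->B2 and A2<->B2, matching the other
   potential paths named in the list above.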

323	1.5.  Requirements Language

325	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
326	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
327	   document are to be interpreted as described in RFC 2119 [3].

329	2.  Operation Overview

331	   This section presents a single description of common MPTCP operation,
332	   with reference to the protocol operation.  This is a high-level
333	   overview of the key functions; the full specification follows in
334	   Section 3.  Extensibility and negotiated features are not discussed
335	   here.  Considerable reference is made to symbolic names of MPTCP
336	   options throughout this section - these are subtypes of the IANA-
337	   assigned MPTCP option (see Section 8), and their formats are defined
338	   in the detailed protocol specification which follows in Section 3.

340	   A Multipath TCP connection provides a bidirectional bytestream
341	   between two hosts communicating like normal TCP and thus does not
342	   require any change to the applications.  However, Multipath TCP
343	   enables the hosts to use different paths with different IP addresses
344	   to exchange packets belonging to the MPTCP connection.  A Multipath
345	   TCP connection appears like a normal TCP connection to an
346	   application.  However, to the network layer each MPTCP subflow looks
347	   like a regular TCP flow whose segments carry a new TCP option type.
348	   Multipath TCP manages the creation, removal and utilization of these
349	   subflows to send data.  The number of subflows that are managed
350	   within a Multipath TCP connection is not fixed and it can fluctuate
351	   during the lifetime of the Multipath TCP connection.

353	   All MPTCP operations are signaled with a TCP option - a single
354	   numerical type for MPTCP, with "sub-types" for each MPTCP message.
355	   What follows is a summary of the purpose and rationale of these
356	   messages.

358	2.1.  Initiating an MPTCP connection

360	   This is the same signaling as for initiating a normal TCP connection,
361	   but the SYN, SYN/ACK and ACK packets also carry the MP_CAPABLE
362	   option.  This is variable-length and serves multiple purposes.
363	   Firstly, it verifies whether the remote host supports Multipath TCP;
364	   and secondly, this option allows the hosts to exchange some
365	   information to authenticate the establishment of additional subflows.
366	   Further details are given in Section 3.1.

368	      Host-A                                  Host-B
369	      ------                                  ------
370	      MP_CAPABLE            ->
371	      [A's key, flags]
372	                            <-                MP_CAPABLE
373	                                              [B's key, flags]
374	      ACK + MP_CAPABLE      ->
375	      [A's key, B's key, flags]
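
   The exchange above can be sketched as follows.  This is a minimal
   illustration, not a definitive implementation: the dictionaries
   standing in for packets and the helper names new_key and
   token_from_key are hypothetical.  The 64-bit key size is taken from
   Section 3.1.  The token derivation shown (the most significant 32
   bits of a SHA-1 hash of the key) is not stated in this overview and
   should be read as an assumption to be checked against the detailed
   specification.

```python
import hashlib
import os


def new_key() -> bytes:
    """Generate a random 64-bit key for one MPTCP connection."""
    return os.urandom(8)


def token_from_key(key: bytes) -> int:
    """Assumed derivation: most significant 32 bits of SHA-1(key)."""
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big")


key_a, key_b = new_key(), new_key()

# Hypothetical packet representations; flags are omitted.
syn = {"option": "MP_CAPABLE", "keys": [key_a]}
syn_ack = {"option": "MP_CAPABLE", "keys": [key_b]}
ack = {"option": "MP_CAPABLE", "keys": [key_a, key_b]}

# Each host can now derive the token that identifies the connection
# at its peer, for use when joining further subflows.
token_b = token_from_key(key_b)
```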

377	2.2.  Associating a new subflow with an existing MPTCP connection

379	   The exchange of keys in the MP_CAPABLE handshake provides material
380	   that can be used to authenticate the endpoints when new subflows are
381	   set up.  Additional subflows begin in the same way as initiating a
382	   normal TCP connection, but the SYN, SYN/ACK and ACK packets also
383	   carry the MP_JOIN option.

385	   Host-A initiates a new subflow between one of its addresses and one
386	   of Host-B's addresses.  The token - generated from the key - is used
387	   to identify which MPTCP connection it is joining, and the HMAC is
388	   used for authentication.  The HMAC uses the keys exchanged in the
389	   MP_CAPABLE handshake, and the random numbers (nonces) exchanged in
390	   these MP_JOIN options.  MP_JOIN also contains flags and an Address ID
391	   that can be used to refer to the source address without the sender
392	   needing to know if it has been changed by a NAT.  Further details are
393	   in Section 3.2.

395	      Host-A                                  Host-B
396	      ------                                  ------
397	      MP_JOIN               ->
398	      [B's token, A's nonce,
399	       A's Address ID, flags]
400	                            <-                MP_JOIN
401	                                              [B's HMAC, B's nonce,
402	                                               B's Address ID, flags]
403	      ACK + MP_JOIN         ->
404	      [A's HMAC]

406	                            <-                ACK
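
   A minimal sketch of the authentication step in this handshake is
   given below.  It uses HMAC-SHA1, as named in Section 2.7, keyed with
   the two keys from the MP_CAPABLE exchange and computed over the two
   nonces.  The ordering of keys and nonces (sender's value first), the
   32-bit nonce size, and the function names are assumptions made for
   illustration; any truncation of the HMAC on the wire is left out.

```python
import hashlib
import hmac
import os


def join_hmac(sender_key, peer_key, sender_nonce, peer_nonce) -> bytes:
    """HMAC-SHA1 over both nonces, keyed with both keys.

    Assumed ordering: the sender's key and nonce come first.
    """
    return hmac.new(sender_key + peer_key,
                    sender_nonce + peer_nonce,
                    hashlib.sha1).digest()


def verify_join_hmac(received, sender_key, peer_key,
                     sender_nonce, peer_nonce) -> bool:
    """Recompute the sender's HMAC and compare in constant time."""
    expected = join_hmac(sender_key, peer_key, sender_nonce, peer_nonce)
    return hmac.compare_digest(received, expected)


key_a, key_b = os.urandom(8), os.urandom(8)      # from MP_CAPABLE
nonce_a, nonce_b = os.urandom(4), os.urandom(4)  # from MP_JOIN

# Host-B sends its HMAC in the SYN/ACK; Host-A verifies it.
hmac_b = join_hmac(key_b, key_a, nonce_b, nonce_a)
b_is_authentic = verify_join_hmac(hmac_b, key_b, key_a, nonce_b, nonce_a)

# Host-A sends its HMAC in the third ACK; Host-B verifies it.
hmac_a = join_hmac(key_a, key_b, nonce_a, nonce_b)
a_is_authentic = verify_join_hmac(hmac_a, key_a, key_b, nonce_a, nonce_b)
```

   Because fresh nonces are part of the HMAC input, a value captured on
   one subflow handshake is not valid for another.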

408	2.3.  Informing the other Host about another potential address

410	   The set of IP addresses associated with a multihomed host may change
411	   during the lifetime of an MPTCP connection.  MPTCP supports the
412	   addition and removal of addresses on a host both implicitly and
413	   explicitly.  If Host-A has established a subflow starting at address
414	   IP#-A1 and wants to open a second subflow starting at address IP#-A2,
415	   it simply initiates the establishment of the subflow as explained
416	   above.  The remote host will then be implicitly informed about the
417	   new address.

419	   In some circumstances, a host may want to advertise to the remote
420	   host the availability of an address without establishing a new
421	   subflow, for example when a NAT prevents setup in one direction.  In
422	   the example below, Host-A informs Host-B about its alternative IP
423	   address (IP#-A2).  Host-B may later send an MP_JOIN to this new
424	   address.  Due to the presence of middleboxes that may translate IP
425	   addresses, this option uses an address identifier to unambiguously
426	   identify an address on a host.  Further details are in Section 3.4.1.

428	      Host-A                                 Host-B
429	      ------                                 ------
430	      ADD_ADDR                  ->
431	      [IP#-A2,
432	       IP#-A2's Address ID]

434	   There is a corresponding signal for address removal, making use of
435	   the Address ID that is signalled in the add address handshake.
436	   Further details are in Section 3.4.2.

438	      Host-A                                 Host-B
439	      ------                                 ------
440	      REMOVE_ADDR               ->
441	      [IP#-A2's Address ID]
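
   One way a host might track the addresses its peer has advertised is
   sketched below.  The class RemoteAddressTable and its method names
   are hypothetical, and the addresses come from a documentation
   range.  The point of the sketch is that entries are keyed by Address
   ID rather than by IP address, so removal still works when a
   middlebox has rewritten the address seen on the wire.

```python
class RemoteAddressTable:
    """Addresses advertised by the peer, keyed by Address ID."""

    def __init__(self):
        self._by_id = {}

    def on_add_addr(self, address_id: int, ip: str) -> None:
        """Handle ADD_ADDR: remember (or update) an advertised address."""
        self._by_id[address_id] = ip

    def on_remove_addr(self, address_id: int):
        """Handle REMOVE_ADDR: forget the address, if it is known."""
        return self._by_id.pop(address_id, None)

    def known_addresses(self):
        """Return the advertised addresses as (Address ID, IP) pairs."""
        return sorted(self._by_id.items())


table = RemoteAddressTable()
table.on_add_addr(2, "192.0.2.2")   # Host-A advertises IP#-A2
before = table.known_addresses()
removed = table.on_remove_addr(2)   # Host-A withdraws it by ID only
after = table.known_addresses()
```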

443	2.4.  Data transfer using MPTCP

445	   To ensure reliable, in-order delivery of data over subflows that may
446	   appear and disappear at any time, MPTCP uses a 64-bit Data Sequence
447	   Number (DSN) to number all data sent over the MPTCP connection.  Each
448	   subflow has its own 32-bit sequence number space and an MPTCP option
449	   maps the subflow sequence space to the data sequence space.  In this
450	   way, data can be retransmitted on different subflows (mapped to the
451	   same DSN) in the event of failure.

453	   The "Data Sequence Signal" carries the "Data Sequence Mapping".  The
454	   Data Sequence Mapping consists of the subflow sequence number, data
455	   sequence number, and length for which this mapping is valid.  This
456	   option can also carry a connection-level acknowledgement (the "Data
457	   ACK") for the received DSN.

459	   With MPTCP, all subflows share the same receive buffer and advertise
460	   the same receive window.  There are two levels of acknowledgement in
461	   MPTCP.  Regular TCP acknowledgments are used on each subflow to
462	   acknowledge the reception of the segments sent over the subflow
463	   independently of their DSN.  In addition, there are connection-level
464	   acknowledgments for the data sequence space.  These acknowledgments
465	   track the advancement of the bytestream and slide the receiving
466	   window.

468	   Further details are in Section 3.3.

470	      Host-A                                 Host-B
471	      ------                                 ------
472	      DATA_SEQUENCE_SIGNAL      ->
473	      [Data Sequence Mapping]
474	      [Data ACK]
475	      [Checksum]
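
   The role of the Data Sequence Mapping can be sketched as follows.
   The mapping carries a subflow sequence number, a data sequence
   number and a length, as described above.  The class and function
   names are hypothetical, and the sketch ignores sequence number
   wrap-around, checksums and window handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSequenceMapping:
    subflow_seq: int  # first subflow sequence number covered
    dsn: int          # first data sequence number covered
    length: int       # number of bytes for which the mapping is valid

    def to_dsn(self, subflow_seq: int) -> int:
        """Translate a subflow sequence number into a DSN."""
        offset = subflow_seq - self.subflow_seq
        if not 0 <= offset < self.length:
            raise ValueError("sequence number outside this mapping")
        return self.dsn + offset


def reassemble(next_dsn: int, chunks) -> bytes:
    """Merge (dsn, data) chunks from any subflow into an in-order stream.

    Data already delivered (for example a retransmission on a second
    subflow) is dropped; delivery stops at the first gap.
    """
    out = bytearray()
    for dsn, data in sorted(chunks):
        end = dsn + len(data)
        if end <= next_dsn:
            continue              # entirely duplicate data
        if dsn > next_dsn:
            break                 # gap: wait for the missing bytes
        out += data[next_dsn - dsn:]
        next_dsn = end
    return bytes(out)


mapping = DataSequenceMapping(subflow_seq=1000, dsn=5000, length=100)
dsn_of_byte = mapping.to_dsn(1010)

# "hello" arrives twice (once per subflow); "world" arrives first.
stream = reassemble(0, [(5, b"world"), (0, b"hello"), (0, b"hello")])
```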

477	2.5.  Requesting a change in a path's priority

479	   Hosts can indicate at initial subflow setup whether they wish the
480	   subflow to be used as a regular or backup path - a backup path being
481	   only used if there are no regular paths available.  During a
482	   connection, Host-A can request a change in the priority of a subflow
483	   through the MP_PRIO signal to Host-B.  Further details are in
484	   Section 3.3.8.

486	      Host-A                                 Host-B
487	      ------                                 ------
488	      MP_PRIO                   ->
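
   The backup rule described above (a backup path is used only when no
   regular path is available) might be applied by a sender as in the
   following sketch.  The dictionary fields and the function name
   usable_subflows are assumptions for illustration, and scheduling
   among the selected subflows is out of scope.

```python
def usable_subflows(subflows):
    """Return the subflows a sender may use right now.

    Regular subflows are preferred; backup subflows are returned only
    when no regular subflow is available.
    """
    available = [s for s in subflows if s["available"]]
    regular = [s for s in available if not s["backup"]]
    return regular if regular else available


subflows = [
    {"name": "A1-B1", "backup": False, "available": True},
    {"name": "A2-B1", "backup": True, "available": True},
]
normal_choice = [s["name"] for s in usable_subflows(subflows)]

# The regular path fails, so the backup path takes over.
subflows[0]["available"] = False
failover_choice = [s["name"] for s in usable_subflows(subflows)]

# An MP_PRIO signal could later change the backup flag of a subflow.
subflows[1]["backup"] = False
```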

490	2.6.  Closing an MPTCP connection

492	   When Host-A wants to inform Host-B that it has no more data to send,
493	   it signals this "Data FIN" as part of the Data Sequence Signal (see
494	   above).  It has the same semantics and behaviour as a regular TCP
495	   FIN, but at the connection level.  Once all the data on the MPTCP
496	   connection has been successfully received, then this message is
497	   acknowledged at the connection level with a DATA_ACK.  Further
498	   details are in Section 3.3.3.

500	      Host-A                                 Host-B
501	      ------                                 ------
502	      DATA_SEQUENCE_SIGNAL      ->
503	      [Data FIN]

505	                                <-           (MPTCP DATA_ACK)

507	2.7.  Notable features

509	   It is worth highlighting that MPTCP's signaling has been designed
510	   with several key requirements in mind:

512	   o  To cope with NATs on the path, addresses are referred to by
513	      Address IDs, in case the IP packet's source address gets changed
514	      by a NAT.  Setting up a new TCP flow is not possible if the
515	      passive opener is behind a NAT; to allow subflows to be created
516	      when either end is behind a NAT, MPTCP uses the ADD_ADDR message.

518	   o  MPTCP falls back to ordinary TCP if MPTCP operation is not
519	      possible, for example if one host is not MPTCP capable or if a
520	      middlebox alters the payload.

522	   o  To meet the threats identified in [8], the following steps are
523	      taken: keys are sent in the clear in the MP_CAPABLE messages;
524	      MP_JOIN messages are secured with HMAC-SHA1 ([9], [4]) using those
525	      keys; and standard TCP validity checks are made on the other
526	      messages (ensuring sequence numbers are in-window).

528	3.  MPTCP Protocol

530	   This section describes the operation of the MPTCP protocol, and is
531	   subdivided into sections for each key part of the protocol operation.

533	   All MPTCP operations are signalled using optional TCP header fields.
534	   A single TCP option number ("Kind") will be assigned by IANA for
535	   MPTCP (see Section 8), and then individual messages will be
536	   determined by a "sub-type", the values of which will also be stored
537	   in an IANA registry (and are also listed in Section 8).

539	   Throughout this document, when reference is made to an MPTCP option
540	   by symbolic name, such as "MP_CAPABLE", this refers to a TCP option
541	   with the single MPTCP option type, and with the sub-type value of the
542	   symbolic name as defined in Section 8.  This sub-type is a four-bit
543	   field - the first four bits of the option payload, as shown in
544	   Figure 3.  The MPTCP messages are defined in the following sections.

546	                           1                   2                   3
547	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
548	      +---------------+---------------+-------+-----------------------+
549	      |     Kind      |    Length     |Subtype|                       |
550	      +---------------+---------------+-------+                       |
551	      |                     Subtype-specific data                     |
552	      |                       (variable length)                       |
553	      +---------------------------------------------------------------+

555	                       Figure 3: MPTCP option format
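
   A minimal sketch of parsing the common header in Figure 3 follows.
   The Kind value is passed in as a parameter because this document
   leaves its assignment to IANA; the value used in the example is an
   arbitrary placeholder, and the function name parse_mptcp_option is
   hypothetical.  The Length field is treated, as for other TCP
   options, as covering the whole option including Kind and Length.

```python
def parse_mptcp_option(option: bytes, mptcp_kind: int) -> dict:
    """Split one TCP option into Kind, Length, Subtype and data."""
    if len(option) < 3:
        raise ValueError("too short to hold Kind, Length and Subtype")
    kind, length = option[0], option[1]
    if kind != mptcp_kind:
        raise ValueError("not an MPTCP option")
    if length != len(option):
        raise ValueError("Length field does not match option size")
    return {
        "kind": kind,
        "length": length,
        # The subtype is the upper four bits of the third octet.
        "subtype": option[2] >> 4,
        # Subtype-specific data starts in the lower four bits of the
        # same octet, so that octet is kept in the returned data.
        "data": option[2:],
    }


PLACEHOLDER_KIND = 0xFE  # arbitrary value for the example only
raw = bytes([PLACEHOLDER_KIND, 4, 0x3A, 0x00])
parsed = parse_mptcp_option(raw, PLACEHOLDER_KIND)
```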

557	   Those MPTCP options associated with subflow initiation are used on
558	   packets with the SYN flag set.  Additionally, there is one MPTCP
559	   option for signaling metadata to ensure segmented data can be
560	   recombined for delivery to the application.

562	   The remaining options, however, are signals that do not need to be on
563	   a specific packet, such as those for signaling additional addresses.
564	   Whilst an implementation may desire to send MPTCP options as soon as
565	   possible, it may not be possible to combine all desired options (both
566	   those for MPTCP and for regular TCP, such as SACK [10]) on a single
567	   packet.  Therefore, an implementation may choose to send duplicate
568	   ACKs containing the additional signaling information.  This changes
569	   the semantics of a duplicate ACK, which is usually only sent as a
570	   signal of a lost segment [11] in regular TCP.  Therefore, an MPTCP
571	   implementation receiving a duplicate ACK which contains an MPTCP
572	   option MUST NOT treat it as a signal of congestion.  Additionally, an
573	   MPTCP implementation SHOULD NOT send more than two duplicate ACKs in
574	   a row for the purposes of sending MPTCP options alone, in order to
575	   ensure no middleboxes misinterpret this as a sign of congestion.

577	   Furthermore, standard TCP validity checks (such as ensuring the
578	   Sequence Number and Acknowledgement Number are within window) MUST be
579	   undertaken before processing any MPTCP signals, as described in [12].

581	3.1.  Connection Initiation

583	   Connection Initiation begins with a SYN, SYN/ACK, ACK exchange on a
584	   single path.  Each packet contains the Multipath Capable (MP_CAPABLE)
585	   TCP option (Figure 4).  This option declares its sender is capable of
586	   performing multipath TCP and wishes to do so on this particular
587	   connection.

589	   This option is used to declare the 64 bit key which the sender has
590	   generated for this MPTCP connection.  This key is used to
591	   authenticate the addition of future subflows to this connection.
592	   This is the only time the key will be sent in the clear on the wire
593	   (unless "fast close", Section 3.5, is used); all future subflows will
594	   identify the connection using a 32 bit "token".  This token is a
595	   cryptographic hash of this key.  The algorithm for this process is
596	   dependent on the authentication algorithm selected; the method of
597	   selection is defined later in this section.

599	   This key is generated by its sender, and its method of generation is
600	   implementation-specific.  The key MUST be hard to guess, and it MUST
601	   be unique for the sending host at any one time.  Recommendations for
602	   generating random numbers for use in keys are given in [13].
603	   Connections will be indexed at each host by the token (a one-way hash
604	   of the key).  Therefore, an implementation will require a mapping
605	   from each token to the corresponding connection, and in turn to the
606	   keys for the connection.

608	   There is a risk that two different keys will hash to the same token.
609	   The risk of hash collisions is usually small, unless the host is
610	   handling many tens of thousands of connections.  Therefore, an
611	   implementation SHOULD check its list of connection tokens to ensure
612	   there is not a collision before sending its key in the SYN/ACK.  This
613	   would, however, be costly for a server with thousands of connections.
614	   Even without this check, the subflow handshake mechanism
615	   (Section 3.2) ensures that new subflows only join the correct
616	   connection: it performs a cryptographic handshake, checks the
617	   connection tokens in both directions, and ensures that sequence
618	   numbers are in-window.  In the worst case, if there were a token
619	   collision, the new subflow would not succeed, but the MPTCP
620	   connection would continue to provide a regular TCP service.

622	   The MP_CAPABLE option is carried on the SYN, SYN/ACK, and ACK packets
623	   that start the first subflow of an MPTCP connection.  The data
624	   carried by each packet is as follows, where A = initiator and B =
625	   listener.

627	   o  SYN (A->B): A's Key for this connection.

629	   o  SYN/ACK (B->A): B's Key for this connection.

631	   o  ACK (A->B): A's Key followed by B's Key.

633	   The contents of the option are determined by the SYN and ACK flags of
634	   the packet, verified by the option's length field.  For the diagram
635	   shown in Figure 4, "sender" and "receiver" refer to the sender or
636	   receiver of the TCP packet (which can be either host).  If the SYN
637	   flag is set, a single key is included; if only an ACK flag is set,
638	   both keys are present.

640	   B's Key is echoed in the ACK in order to allow the listener (host B)
641	   to act statelessly until the TCP connection reaches the ESTABLISHED
642	   state.  If the listener acts in this way, however, it MUST generate
643	   its key in a way that would allow it to verify that it generated the
644	   key when it is echoed in the ACK.

646	   This exchange allows the safe passage of MPTCP options on SYN packets
647	   to be determined.  If any of these options are dropped, MPTCP will
648	   gracefully fall back to regular single-path TCP, as documented in
649	   Section 3.6.  Note that new subflows MUST NOT be established (using
650	   the process documented in Section 3.2) until a DSS option has been
651	   successfully received across the path (as documented in Section 3.3).

653	                           1                   2                   3
654	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
655	      +---------------+---------------+-------+-------+---------------+
656	      |     Kind      |    Length     |Subtype|Version|A|B|C|D|E|F|G|H|
657	      +---------------+---------------+-------+-------+---------------+
658	      |                   Option Sender's Key (64 bits)               |
659	      |                                                               |
660	      |                                                               |
661	      +---------------------------------------------------------------+
662	      |                  Option Receiver's Key (64 bits)              |
663	      |                     (if option Length == 20)                  |
664	      |                                                               |
665	      +---------------------------------------------------------------+

667	              Figure 4: Multipath Capable (MP_CAPABLE) option

669	   The first four bits of the first octet in the MP_CAPABLE option
670	   (Figure 4) define the MPTCP option subtype (see Section 8; for
671	   MP_CAPABLE, this is 0), and the remaining four bits of this octet
672	   specify the MPTCP version in use (for this specification, this is
673	   0).

675	   The second octet is reserved for flags, allocated as follows:

677	   A: The leftmost bit, labelled "A", SHOULD be set to 1 to indicate
678	      "Checksum Required", unless the system administrator has decided
679	      that checksums are not required (for example, if the environment
680	      is controlled and no middleboxes exist that might adjust the
681	      payload).

683	   B: The second bit, labelled "B", is an extensibility flag, and MUST
684	      be set to 0 for current implementations.  This will be used for an
685	      extensibility mechanism in a future specification, and the impact
686	      of this flag will be defined at a later date.  If a host receives a
687	      SYN with the "B" flag set to 1 and does not understand it, then
688	      this SYN MUST be silently ignored; the sender is expected to
689	      retry with a format compatible with this legacy specification.
690	      Note that the length of the MP_CAPABLE option, and the meanings of
691	      bits "C" through "H", may be altered by setting B=1.

693	   C through H:  The remaining bits, labelled "C" through "H", are used
694	      for crypto algorithm negotiation.  Currently only the rightmost
695	      bit, labelled "H", is assigned.  Bit "H" indicates the use of
696	      HMAC-SHA1 (as defined in Section 3.2).  An implementation that
697	      only supports this method MUST set bit "H" to 1, and bits "C"
698	      through "G" to 0.

700	   A crypto algorithm MUST be specified.  If flag bits C through H are
701	   all 0, the MP_CAPABLE option MUST be treated as invalid and ignored
702	   (that is, it must be treated as a regular TCP handshake).
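
   The field layout and flag rules above can be sketched in Python as
   follows.  This is an illustrative sketch only: the function and
   constant names are invented for the example, and the Kind value of
   30 is an assumption (this section defers the value to Section 8).
   The parser returns None where the text says the option is to be
   ignored; handling of the "B" flag is left to the caller.

```python
import struct
from typing import Optional

MPTCP_KIND = 30          # assumption for this sketch; see Section 8
SUBTYPE_MP_CAPABLE = 0
VERSION = 0
FLAG_A = 0x80            # "Checksum Required", leftmost flag bit
FLAG_B = 0x40            # extensibility flag
FLAG_H = 0x01            # HMAC-SHA1, rightmost flag bit
CRYPTO_MASK = 0x3F       # bits "C" through "H"


def build_mp_capable(sender_key: bytes,
                     receiver_key: Optional[bytes] = None,
                     checksum: bool = True) -> bytes:
    """Build the option with one key (SYN, SYN/ACK) or two keys (ACK)."""
    if len(sender_key) != 8:
        raise ValueError("sender key must be 8 octets")
    if receiver_key is not None and len(receiver_key) != 8:
        raise ValueError("receiver key must be 8 octets")
    keys = sender_key + (receiver_key or b"")
    flags = FLAG_H | (FLAG_A if checksum else 0)
    first = (SUBTYPE_MP_CAPABLE << 4) | VERSION
    return struct.pack("!BBBB", MPTCP_KIND, 4 + len(keys), first, flags) + keys


def parse_mp_capable(option: bytes) -> Optional[dict]:
    """Return the decoded fields, or None if the option is to be ignored."""
    if len(option) not in (12, 20) or option[1] != len(option):
        return None
    kind, _length, first, flags = struct.unpack_from("!BBBB", option, 0)
    if kind != MPTCP_KIND or (first >> 4) != SUBTYPE_MP_CAPABLE:
        return None
    if (flags & CRYPTO_MASK) == 0:
        return None  # no crypto algorithm specified: regular TCP handshake
    return {
        "version": first & 0x0F,
        "checksum_required": bool(flags & FLAG_A),
        "extension": bool(flags & FLAG_B),
        "crypto_bits": flags & CRYPTO_MASK,
        "sender_key": option[4:12],
        "receiver_key": option[12:20] if len(option) == 20 else None,
    }
```

   The two accepted sizes, 12 and 20 octets, correspond to the
   single-key and two-key forms shown in Figure 4.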

704	   The selection of the authentication algorithm also impacts the
705	   algorithm used to generate the token and the Initial Data Sequence
706	   Number.  In this specification, with only the SHA-1 algorithm (bit
707	   "H") specified and selected, the token MUST be a truncated (most
708	   significant 32 bits) SHA-1 hash ([4], [14]) of the key.  A different,
709	   64 bit truncation (the least significant 64 bits) of the SHA-1 hash
710	   of the key MUST be used as the Initial Data Sequence Number.  Note
711	   that the key MUST be hashed in network byte order.  Also note that
712	   the "least significant" bits MUST be the rightmost bits of the SHA-1
713	   digest, as per [4].  Future specifications of the use of the crypto
714	   bits may choose to specify different algorithms for token and IDSN
715	   generation.
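
   The derivation described above can be sketched in Python using the
   standard hashlib module.  The function name is hypothetical, and the
   key is assumed to be supplied as the 8 octets that appear on the
   wire (that is, already in network byte order).

```python
import hashlib
from typing import Tuple


def token_and_idsn(key: bytes) -> Tuple[int, int]:
    """Derive (token, IDSN) from a 64-bit key when bit "H" is selected."""
    if len(key) != 8:
        raise ValueError("key must be 8 octets")
    digest = hashlib.sha1(key).digest()         # 160 bits, 20 octets
    token = int.from_bytes(digest[:4], "big")   # most significant 32 bits
    idsn = int.from_bytes(digest[-8:], "big")   # least significant 64 bits
    return token, idsn
```

   Because both values are derived from the same digest, a host only
   needs to store the key (or the digest) to recompute either of them.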

717	   Both the crypto and checksum bits negotiate capabilities in similar
718	   ways.  For the Checksum Required bit (labelled "A"), if either host
719	   requires the use of checksums, checksums MUST be used.  In other
720	   words, the only way for checksums not to be used is if both hosts in
721	   their SYNs set A=0.  This decision is confirmed by the setting of the
722	   "A" bit in the third packet (the ACK) of the handshake.  For example,
723	   if the initiator sets A=0 in the SYN, but the responder sets A=1 in
724	   the SYN/ACK, checksums MUST be used in both directions, and the
725	   initiator will set A=1 in the ACK.  The decision whether to use
726	   checksums will be stored by an implementation in a per-connection
727	   binary state variable.

729	   For crypto negotiation, the responder has the choice.  The initiator
730	   creates a proposal setting a bit for each algorithm it supports to 1
731	   (in this version of the specification, there is only one proposal, so
732	   bit "H" will be always set to 1).  The responder responds with only
733	   one bit set - this is the chosen algorithm.  The rationale for this
734	   behaviour is that the responder will typically be a server with
735	   potentially many thousands of connections, so it may wish to choose
736	   an algorithm with minimal computational complexity, depending on the
737	   load.  If a responder does not support (or does not want to support)
738	   any of the initiator's proposals, it can respond without an
739	   MP_CAPABLE option, thus forcing a fall-back to regular TCP.
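
   The two negotiation rules above can be summarised in a short Python
   sketch.  The function names and the "preference" parameter are
   assumptions made for illustration; the specification does not
   prescribe how a responder orders its preferences.

```python
from typing import Optional, Sequence

CRYPTO_MASK = 0x3F   # flag bits "C" through "H"
HMAC_SHA1 = 0x01     # bit "H"


def use_checksums(initiator_a: bool, responder_a: bool) -> bool:
    """Checksums are disabled only if both hosts set A=0."""
    return initiator_a or responder_a


def choose_algorithm(proposal: int,
                     preference: Sequence[int] = (HMAC_SHA1,)) -> Optional[int]:
    """Responder side: select exactly one bit from the initiator's proposal.

    Returns None when no proposed algorithm is acceptable; the responder
    then replies without MP_CAPABLE, forcing a fall-back to regular TCP.
    """
    for bit in preference:
        if proposal & CRYPTO_MASK & bit:
            return bit
    return None
```

   With only bit "H" defined, choose_algorithm either returns 0x01 or
   signals fall-back.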

741	   The MP_CAPABLE option is only used in the first subflow of a
742	   connection, in order to identify the connection; all following
743	   subflows will use the "Join" option (see Section 3.2) to join the
744	   existing connection.

746	   If a SYN contains an MP_CAPABLE option but the SYN/ACK does not, it
747	   is assumed that the passive opener is not multipath capable and thus
748	   the MPTCP session MUST operate as a regular, single-path TCP.  If a
749	   SYN does not contain an MP_CAPABLE option, the SYN/ACK MUST NOT
750	   contain one in response.  If the third packet (the ACK) does not
751	   contain the MP_CAPABLE option, then the session MUST fall back to
752	   operating as a regular, single-path TCP.  This is to maintain
753	   compatibility with middleboxes on the path that drop some or all TCP
754	   options.  Note that an implementation MAY choose to attempt sending
755	   MPTCP options more than one time before making this decision to
756	   operate as regular TCP (see Section 3.8).

758	   If the SYN packets are unacknowledged, it is up to local policy to
759	   decide how to respond.  It is expected that a sender will eventually
760	   fall back to single-path TCP (i.e. without the MP_CAPABLE Option) in
761	   order to work around middleboxes that may drop packets with unknown
762	   options; however, the number of multipath-capable attempts that are
763	   made first will be up to local policy.  It is possible that MPTCP and
764	   non-MPTCP SYNs could get re-ordered in the network.  Therefore, the
765	   final state is inferred from the presence or absence of the
766	   MP_CAPABLE option in the third packet of the TCP handshake.  If this
767	   option is not present, the connection SHOULD fall back to regular
768	   TCP, as documented in Section 3.6.

770	   The Initial Data Sequence Number (IDSN) on an MPTCP connection is
771	   generated from the Key.  The algorithm for IDSN generation is also
772	   determined from the negotiated authentication algorithm.  In this
773	   specification, with only the SHA-1 algorithm specified and selected,
774	   the IDSN of a host MUST be the least significant 64 bits of the SHA-1
775	   hash of its key, i.e.  IDSN-A = Hash(Key-A) and IDSN-B = Hash(Key-B).
776	   This deterministic generation of the IDSN allows a receiver to ensure
777	   that there are no gaps in sequence space at the start of the
778	   connection.  The SYN with MP_CAPABLE occupies the first octet of Data
779	   Sequence Space, although this does not need to be acknowledged at the
780	   connection level until the first data is sent (see Section 3.3).

782	3.2.  Starting a New Subflow

784	   Once an MPTCP connection has begun with the MP_CAPABLE exchange,
785	   further subflows can be added to the connection.  Hosts have
786	   knowledge of their own address(es), and can become aware of the other
787	   host's addresses through signaling exchanges as described in
788	   Section 3.4.  Using this knowledge, a host can initiate a new subflow
789	   over a currently unused pair of addresses.  It is permitted for
790	   either host in a connection to initiate the creation of a new
791	   subflow, but it is expected that this will normally be the original
792	   connection initiator (see Section 3.8 for heuristics).

794	   A new subflow is started as a normal TCP SYN/ACK exchange.  The Join
795	   Connection (MP_JOIN) TCP option is used to identify the connection to
796	   be joined by the new subflow.  It uses keying material that was
797	   exchanged in the initial MP_CAPABLE handshake (Section 3.1), and that
798	   handshake also negotiates the crypto algorithm in use for the MP_JOIN
799	   handshake.

801	   This section specifies the behaviour of MP_JOIN using the HMAC-SHA1
802	   algorithm.  An MP_JOIN option is present in the SYN, SYN/ACK and ACK
803	   of the three-way handshake, although in each case with a different
804	   format.

806	   In the first MP_JOIN on the SYN packet, illustrated in Figure 5, the
807	   initiator sends a token, random number, and address ID.

809	   The token is used to identify the MPTCP connection and is a
810	   cryptographic hash of the receiver's key, as exchanged in the initial
811	   MP_CAPABLE handshake (Section 3.1).  In this specification, the
812	   tokens presented in this option are generated by the SHA-1 ([4],
813	   [14]) algorithm, truncated to the most significant 32 bits.  The
814	   token included in the MP_JOIN option is the token that the receiver
815	   of the packet uses to identify this connection, i.e.  Host A will
816	   send Token-B (which is generated from Key-B).  Note that the hash
817	   generation algorithm can be overridden by the choice of cryptographic
818	   handshake algorithm, as defined in Section 3.1.

820	   The MP_JOIN SYN not only sends the token (which is static for a
821	   connection) but also Random Numbers (nonces) that are used to prevent
822	   replay attacks on the authentication method.  Recommendations for the
823	   generation of random numbers for this purpose are given in [13].

825	   The MP_JOIN option includes an "Address ID".  This is an identifier
826	   that only has significance within a single connection, where it
827	   identifies the source address of this packet, even if the IP header
828	   has been changed in transit by a middlebox.  The Address ID allows
829	   address removal (Section 3.4.2) without needing to know what the
830	   source address at the receiver is, thus allowing address removal
831	   through NATs.  The Address ID also allows correlation between new
832	   subflow setup attempts and address signaling (Section 3.4.1), to
833	   prevent setting up duplicate subflows on the same path, if an MP_JOIN
834	   and ADD_ADDR are sent at the same time.

836	   The Address IDs of the subflow used in the initial SYN exchange of
837	   the first subflow in the connection are implicit, and have the value
838	   zero.  A host MUST store the mappings between Address IDs and
839	   addresses both for itself and the remote host.  An implementation
840	   will also need to know which local and remote Address IDs are
841	   associated with which established subflows, for when addresses are
842	   removed from a local or remote host.

844	   The MP_JOIN option on packets with the SYN flag set also includes 4
845	   bits of flags, 3 of which are currently reserved and MUST be set to
846	   zero by the sender.  The final bit, labelled 'B', indicates whether
847	   the sender of this option wishes this subflow to be used as a backup
848	   path (B=1) in the event of failure of other paths, or whether it
849	   wants it to be used as part of the connection immediately.  By
850	   setting B=1, the sender of the option is requesting the other host to
851	   only send data on this subflow if there are no available subflows
852	   where B=0.  Subflow policy is discussed in more detail in
853	   Section 3.3.8.

855	                           1                   2                   3
856	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
857	      +---------------+---------------+-------+-----+-+---------------+
858	      |     Kind      |  Length = 12  |Subtype|     |B|   Address ID  |
859	      +---------------+---------------+-------+-----+-+---------------+
860	      |                   Receiver's Token (32 bits)                  |
861	      +---------------------------------------------------------------+
862	      |                Sender's Random Number (32 bits)               |
863	      +---------------------------------------------------------------+

865	       Figure 5: Join Connection (MP_JOIN) option (for initial SYN)
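
   As a sketch of the layout in Figure 5, the following Python function
   builds the MP_JOIN option for an initial SYN.  The function name is
   hypothetical.  The Kind value of 30 is an assumption, and the
   MP_JOIN sub-type value of 1 is taken from the Section 8 registry
   rather than from this section.  The nonce is drawn from os.urandom
   here; [13] gives the actual recommendations.

```python
import os
import struct
from typing import Optional

MPTCP_KIND = 30        # assumption for this sketch; see Section 8
SUBTYPE_MP_JOIN = 1    # per the Section 8 registry, not defined in this section


def build_mp_join_syn(receiver_token: int, address_id: int,
                      backup: bool = False,
                      nonce: Optional[int] = None) -> bytes:
    """Build the 12-octet MP_JOIN option carried on a subflow SYN."""
    if nonce is None:
        nonce = int.from_bytes(os.urandom(4), "big")
    # Sub-type in the high four bits, three reserved zero bits, then "B".
    first = (SUBTYPE_MP_JOIN << 4) | (1 if backup else 0)
    return struct.pack("!BBBBII", MPTCP_KIND, 12, first,
                       address_id, receiver_token, nonce)
```

   Host A would pass Token-B as receiver_token, since the token sent is
   the one the receiver of the packet uses to identify the connection.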

867	   When receiving a SYN with an MP_JOIN option that contains a valid
868	   token for an existing MPTCP connection, the recipient SHOULD respond
869	   with a SYN/ACK also containing an MP_JOIN option containing a random
870	   number and a truncated (leftmost 64 bits) Hash-based Message
871	   Authentication Code (HMAC).  This version of the option is shown in
872	   Figure 6.  If the token is unknown, or the host wants to refuse
873	   subflow establishment (for example, due to a limit on the number of
874	   subflows it will permit), the receiver will send back an RST,
875	   analogous to an unknown port in TCP.  Although calculating an HMAC
876	   requires cryptographic operations, it is believed that the 32 bit
877	   token in the MP_JOIN SYN gives sufficient protection against blind
878	   state exhaustion attacks and therefore there is no need to provide
879	   mechanisms to allow a responder to operate statelessly at the MP_JOIN
880	   stage.

882	   An HMAC is sent by both hosts - by the initiator (Host A) in the
883	   third packet (the ACK) and by the responder (Host B) in the second
884	   packet (the SYN/ACK).  Doing the HMAC exchange at this stage allows
885	   both hosts to have first exchanged random data (in the first two SYN
886	   packets) that is used as the "message".  This specification defines
887	   that HMAC as defined in [9] is used, along with the SHA-1 hash
888	   algorithm [4] (potentially implemented as in [14]), thus generating a
889	   160-bit / 20 octet HMAC.  Due to option space limitations, the HMAC
890	   included in the SYN/ACK is truncated to the leftmost 64 bits, but
891	   this is acceptable since random numbers are used, and thus an
892	   attacker only has one chance to guess the HMAC correctly (if the HMAC
893	   is incorrect, the TCP connection is closed, so a new MP_JOIN
894	   negotiation with a new random number is required).

896	   The initiator's authentication information is sent in its first ACK
897	   (the third packet of the handshake), as shown in Figure 7.  This data
898	   needs to be sent reliably, since it is the only time this HMAC is
899	   sent and therefore receipt of this packet MUST trigger a regular TCP
900	   ACK in response, and the packet MUST be retransmitted if this ACK is
901	   not received.  In other words, sending the ACK/MP_JOIN packet places
902	   the subflow in the PRE_ESTABLISHED state, and it moves to the
903	   ESTABLISHED state only on receipt of an ACK from the receiver.  It is
904	   not permitted to send data while in the PRE_ESTABLISHED state.  The
905	   reserved bits in this option MUST be set to zero by the sender.

907	   The key for the HMAC algorithm, in the case of the message
908	   transmitted by Host A, will be Key-A followed by Key-B, and in the
909	   case of Host B, Key-B followed by Key-A.  These are the keys that
910	   were exchanged in the original MP_CAPABLE handshake.  The "message"
911	   for the HMAC algorithm in each case is the concatenation of the Random
912	   Numbers for each host (denoted by R): for Host A, R-A followed by R-B;
913	   and for Host B, R-B followed by R-A.

915	                           1                   2                   3
916	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
917	      +---------------+---------------+-------+-----+-+---------------+
918	      |     Kind      |  Length = 16  |Subtype|     |B|   Address ID  |
919	      +---------------+---------------+-------+-----+-+---------------+
920	      |                                                               |
921	      |                Sender's Truncated HMAC (64 bits)              |
922	      |                                                               |
923	      +---------------------------------------------------------------+
924	      |                Sender's Random Number (32 bits)               |
925	      +---------------------------------------------------------------+

927	    Figure 6: Join Connection (MP_JOIN) option (for responding SYN/ACK)

929	                           1                   2                   3
930	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
931	      +---------------+---------------+-------+-----------------------+
932	      |     Kind      |  Length = 24  |Subtype|      (reserved)       |
933	      +---------------+---------------+-------+-----------------------+
934	      |                                                               |
935	      |                                                               |
936	      |                   Sender's HMAC (160 bits)                    |
937	      |                                                               |
938	      |                                                               |
939	      +---------------------------------------------------------------+

941	        Figure 7: Join Connection (MP_JOIN) option (for third ACK)

943	   These various TCP options fit together to enable authenticated
944	   subflow setup as illustrated in Figure 8.

946	              Host A                                  Host B
947	     ------------------------                       ----------
948	     Address A1    Address A2                       Address B1
949	     ----------    ----------                       ----------
950	         |             |                                |
951	         |            SYN + MP_CAPABLE(Key-A)           |
952	         |--------------------------------------------->|
953	         |<---------------------------------------------|
954	         |          SYN/ACK + MP_CAPABLE(Key-B)         |
955	         |             |                                |
956	         |        ACK + MP_CAPABLE(Key-A, Key-B)        |
957	         |--------------------------------------------->|
958	         |             |                                |
959	         |             |   SYN + MP_JOIN(Token-B, R-A)  |
960	         |             |------------------------------->|
961	         |             |<-------------------------------|
962	         |             | SYN/ACK + MP_JOIN(HMAC-B, R-B) |
963	         |             |                                |
964	         |             |     ACK + MP_JOIN(HMAC-A)      |
965	         |             |------------------------------->|
966	         |             |<-------------------------------|
967	         |             |             ACK                |

969	   HMAC-A = HMAC(Key=(Key-A+Key-B), Msg=(R-A+R-B))
970	   HMAC-B = HMAC(Key=(Key-B+Key-A), Msg=(R-B+R-A))

972	               Figure 8: Example use of MPTCP Authentication
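
   The two HMAC formulas above can be sketched with Python's standard
   hmac and hashlib modules.  The function names are hypothetical, and
   keys and random numbers are assumed to be passed as the octets seen
   on the wire (8 octets per key, 4 octets per random number).

```python
import hashlib
import hmac


def join_hmac(local_key: bytes, remote_key: bytes,
              local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """Full 160-bit HMAC-SHA1 as computed by the host owning local_key."""
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha1).digest()


def join_hmac_truncated(local_key: bytes, remote_key: bytes,
                        local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """Leftmost 64 bits, as carried in the SYN/ACK (Figure 6)."""
    return join_hmac(local_key, remote_key, local_nonce, remote_nonce)[:8]


def verify(expected: bytes, received: bytes) -> bool:
    """Constant-time comparison of an expected and a received HMAC."""
    return hmac.compare_digest(expected, received)
```

   In this sketch, Host A would check the SYN/ACK by computing
   join_hmac_truncated(Key-B, Key-A, R-B, R-A) itself and comparing the
   result with the received value; a mismatch leads to the RST
   behaviour described below.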

974	   If the token received at Host B is unknown or local policy prohibits
975	   the acceptance of the new subflow, the recipient MUST respond with a
976	   TCP RST for the subflow.

978	   If the token is accepted at Host B, but the HMAC returned to Host A
979	   does not match the one expected, Host A MUST close the subflow with a
980	   TCP RST.

982	   If Host B does not receive the expected HMAC, or the MP_JOIN option
983	   is missing from the ACK, it MUST close the subflow with a TCP RST.

985	   If the HMACs are verified as correct, then both hosts have
986	   authenticated each other as being the same peers as existed at the
987	   start of the connection, and they have agreed which connection
988	   this subflow will join.

990	   If the SYN/ACK as received at Host A does not have an MP_JOIN option,
991	   Host A MUST close the subflow with an RST.

993	   This covers all cases of the loss of an MP_JOIN.  In more detail, if
994	   MP_JOIN is stripped from the SYN on the path from A to B, and Host B
995	   does not have a passive opener on the relevant port, it will respond
996	   with an RST in the normal way.  If in response to a SYN with an
997	   MP_JOIN option, a SYN/ACK is received without the MP_JOIN option
998	   (either since it was stripped on the return path, or it was stripped
999	   on the outgoing path but the passive opener on Host B responded as if
1000	   it were a new regular TCP session), then the subflow is unusable and
1001	   Host A MUST close it with an RST.

1003	   Note that additional subflows can be created between any pair of
1004	   ports (but see Section 3.8 for heuristics); no explicit application-
1005	   level accept calls or bind calls are required to open additional
1006	   subflows.  To associate a new subflow with an existing connection,
1007	   the token supplied in the subflow's SYN exchange is used for
1008	   demultiplexing.  This then binds the 5-tuple of the TCP subflow to
1009	   the local token of the connection.  A consequence is that it is
1010	   possible to allow any port pairs to be used for a connection.

1012	   Demultiplexing subflow SYNs MUST be done using the token; this is
1013	   unlike traditional TCP, where the destination port is used for
1014	   demultiplexing SYN packets.  Once a subflow is set up, demultiplexing
1015	   packets is done using the five-tuple, as in traditional TCP.  The
1016	   five-tuples will be mapped to the local connection identifier
1017	   (token).  Note that Host A will know its local token for the subflow
1018	   even though it is not sent on the wire - only the responder's token
1019	   is sent.
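
   One possible shape for the state implied by these rules is sketched
   below in Python.  The class, its method names, and the five-tuple
   representation are all assumptions made for illustration.  For
   brevity the sketch binds the five-tuple as soon as the SYN is seen;
   an implementation would complete the HMAC exchange before treating
   the subflow as part of the connection.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: (src addr, src port, dst addr, dst port, proto)
FiveTuple = Tuple[str, int, str, int, str]


class Demux:
    def __init__(self) -> None:
        self.by_token: Dict[int, object] = {}
        self.token_of_subflow: Dict[FiveTuple, int] = {}

    def register_connection(self, local_token: int, connection: object) -> bool:
        """Return False on a token collision so the caller can pick a new key."""
        if local_token in self.by_token:
            return False
        self.by_token[local_token] = connection
        return True

    def on_join_syn(self, token: int, five_tuple: FiveTuple) -> Optional[object]:
        """Demultiplex an MP_JOIN SYN by token; None means respond with RST."""
        connection = self.by_token.get(token)
        if connection is not None:
            self.token_of_subflow[five_tuple] = token
        return connection

    def on_segment(self, five_tuple: FiveTuple) -> Optional[object]:
        """Demultiplex a non-SYN segment by five-tuple, as in regular TCP."""
        token = self.token_of_subflow.get(five_tuple)
        return None if token is None else self.by_token.get(token)
```

   The register_connection check corresponds to the collision check
   recommended in Section 3.1 before a key is sent in the SYN/ACK.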

1021	3.3.  General MPTCP Operation

1023	   This section discusses operation of MPTCP for data transfer.  At a
1024	   high level, an MPTCP implementation will take one input data stream
1025	   from an application, and split it into one or more subflows, with
1026	   sufficient control information to allow it to be reassembled and
1027	   delivered reliably and in-order to the recipient application.  The
1028	   following subsections define this behaviour in detail.

1030	   The Data Sequence Mapping and the Data ACK are signalled in the Data
1031	   Sequence Signal (DSS) option.  Either or both can be signalled in one
1032	   DSS, dependent on the flags set.  The Data Sequence Mapping defines
1033	   how the sequence space on the subflow maps to the connection level,
1034	   and the Data ACK acknowledges receipt of data at the connection
1035	   level.  These functions are described in more detail in the following
1036	   two subsections.

1041	                          1                   2                   3
1042	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1043	     +---------------+---------------+-------+----------------------+
1044	     |     Kind      |    Length     |Subtype| (reserved) |F|m|M|a|A|
1045	     +---------------+---------------+-------+----------------------+
1046	     |           Data ACK (4 or 8 octets, depending on flags)       |
1047	     +--------------------------------------------------------------+
1048	     |   Data Sequence Number (4 or 8 octets, depending on flags)   |
1049	     +--------------------------------------------------------------+
1050	     |              Subflow Sequence Number (4 octets)              |
1051	     +-------------------------------+------------------------------+
1052	     |  Data-level Length (2 octets) |      Checksum (2 octets)     |
1053	     +-------------------------------+------------------------------+

1055	                Figure 9: Data Sequence Signal (DSS) option

1057	   The flags when set define the contents of this option, as follows:

1059	   o  A = Data ACK present

1061	   o  a = Data ACK is 8 octets (if not set, Data ACK is 4 octets)

1063	   o  M = Data Sequence Number, Subflow Sequence Number, Data-level
1064	      Length, and Checksum present

1066	   o  m = Data Sequence Number is 8 octets (if not set, DSN is 4 octets)

1068	   The flags 'a' and 'm' only have meaning if the corresponding 'A' or
1069	   'M' flags are set; otherwise, they will be ignored.  The maximum
1070	   length of this option, with all flags set, is 28 octets.

1072	   The 'F' flag indicates "DATA_FIN".  If present, this means that this
1073	   mapping covers the final data from the sender.  This is the
1074	   connection-level equivalent to the FIN flag in single-path TCP.  A
1075	   connection is not closed unless there has been a DATA_FIN exchange,
1076	   or a timeout.  The purpose of the DATA_FIN, along with the
1077	   interactions between this flag, the subflow-level FIN flag, and the
1078	   data sequence mapping, is described in Section 3.3.3.  The remaining
1079	   reserved bits MUST be set to zero by an implementation of this
1080	   specification.

1082	   Note that the Checksum is only present in this option if the use of
1083	   MPTCP checksumming has been negotiated at the MP_CAPABLE handshake
1084	   (see Section 3.1).  The presence of the checksum can be inferred from
1085	   the length of the option.  If a checksum is present, but its use had
1086	   not been negotiated in the MP_CAPABLE handshake, the checksum field
1087	   MUST be ignored.  If a checksum is not present when its use has been
1088	   negotiated, the receiver MUST close the subflow with an RST as it is
1089	   considered broken.
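
   The relationship between the flags, the checksum negotiation, and
   the option length can be made concrete with a small Python sketch.
   The function and parameter names are hypothetical; the field sizes
   are those shown in Figure 9.

```python
def dss_option_length(data_ack: bool, data_ack_8: bool,
                      mapping: bool, dsn_8: bool,
                      checksum_negotiated: bool) -> int:
    """Return the DSS option length in octets for a given flag combination.

    data_ack / data_ack_8 correspond to flags 'A' / 'a';
    mapping / dsn_8 correspond to flags 'M' / 'm'.
    """
    length = 4  # Kind, Length, Subtype plus reserved bits, flags
    if data_ack:
        length += 8 if data_ack_8 else 4
    if mapping:
        length += 8 if dsn_8 else 4   # Data Sequence Number
        length += 4                   # Subflow Sequence Number
        length += 2                   # Data-level Length
        if checksum_negotiated:
            length += 2               # Checksum
    return length
```

   With every flag set and checksums negotiated, the function returns
   28, matching the maximum length stated above.  A receiver can run
   the same calculation with and without the checksum to infer from the
   Length field whether a checksum is present.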

1091	3.3.1.  Data Sequence Mapping

1093	   The data stream as a whole can be reassembled through the use of the
1094	   Data Sequence Mapping components of the DSS option (Figure 9), which
1095	   define the mapping from the subflow sequence number to the data
1096	   sequence number.  This is used by the receiver to ensure in-order
1097	   delivery to the application layer.  Meanwhile, the subflow-level
1098	   sequence numbers (i.e. the regular sequence numbers in the TCP
1099	   header) have subflow-only relevance.  It is expected (but not
1100	   mandated) that SACK [10] is used at the subflow level to improve
1101	   efficiency.

1103	   The Data Sequence Mapping specifies a mapping from subflow sequence
1104	   space to data sequence space.  This is expressed in terms of starting
1105	   sequence numbers for the subflow and the data level, and a length of
1106	   bytes for which this mapping is valid.  This explicit mapping for a
1107	   range of data was chosen rather than per-packet signaling to assist
1108	   with compatibility with situations where TCP/IP segmentation or
1109	   coalescing is undertaken separately from the stack that is generating
1110	   the data flow (e.g. through the use of TCP segmentation offloading on
1111	   network interface cards, or by middleboxes such as performance
1112	   enhancing proxies).  It also allows a single mapping to cover many
1113	   packets, which may be useful in bulk transfer situations.

1115	   A mapping is fixed, in that the subflow sequence number is bound to
1116	   the data sequence number after the mapping has been processed.  A
1117	   sender MUST NOT change this mapping after it has been declared;
1118	   however, the same data sequence number can be mapped to by different
1119	   subflows for retransmission purposes (see Section 3.3.6).  This would
1120	   also permit the same data to be sent simultaneously on multiple
1121	   subflows for resilience or efficiency purposes, especially in the
1122	   case of lossy links.  Although the detailed specification of such
1123	   operation is outside the scope of this document, an implementation
1124	   SHOULD treat the first data that is received at a subflow for the
1125	   data sequence space as that which should be delivered to the
1126	   application, and ignore any later data for that sequence space.

1128	   The data sequence number is specified as an absolute value, whereas
1129	   the subflow sequence numbering is relative (the SYN at the start of
1130	   the subflow has relative subflow sequence number 0).  This is to
1131	   allow middleboxes to change the Initial Sequence Number of a subflow,
1132	   such as firewalls that undertake ISN randomization.

1134	   The data sequence mapping also contains a checksum of the data that
1135	   this mapping covers, if use of checksums has been negotiated at the
1136	   MP_CAPABLE exchange.  Checksums are used to detect if the payload has
1137	   been adjusted in any way by a non-MPTCP-aware middlebox.  If this
1138	   checksum fails, it will trigger a failure of the subflow, or a
1139	   fallback to regular TCP, as documented in Section 3.6, since MPTCP
1140	   can no longer reliably know the subflow sequence space at the
1141	   receiver to build data sequence mappings.

1143	   The checksum algorithm used is the standard TCP checksum [1],
1144	   operating over the data covered by this mapping, along with a pseudo-
1145	   header as shown in Figure 10.

1147	                          1                   2                   3
1148	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1149	     +--------------------------------------------------------------+
1150	     |                                                              |
1151	     |                Data Sequence Number (8 octets)               |
1152	     |                                                              |
1153	     +--------------------------------------------------------------+
1154	     |              Subflow Sequence Number (4 octets)              |
1155	     +-------------------------------+------------------------------+
1156	     |  Data-level Length (2 octets) |        Zeros (2 octets)      |
1157	     +-------------------------------+------------------------------+

1159	                 Figure 10: Pseudo-Header for DSS Checksum

1161	   Note that the Data Sequence Number used in the pseudo-header is
1162	   always the 64 bit value, irrespective of what length is used in the
1163	   DSS option itself.  The standard TCP checksum algorithm has been
1164	   chosen since it will be calculated anyway for the TCP subflow, and if
1165	   calculated first over the data before adding the pseudo-headers, it
1166	   only needs to be calculated once.  Furthermore, since the TCP
1167	   checksum is additive, the checksum for a DSN_MAP can be constructed
1168	   by simply adding together the checksums for the data of each
1169	   constituent TCP segment, and adding the checksum for the DSS pseudo-
1170	   header.
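
The following sketch shows one way the DSS checksum could be computed
under the rules above.  It is an illustration only: the function
names are hypothetical, the pseudo-header layout follows Figure 10,
and the 16-bit ones' complement sum is the standard TCP checksum
algorithm.  The "acc" parameter lets a partial sum be carried from
one even-length block to the next, which is the additive property
described above.

```python
import struct

def ones_complement_sum(data: bytes, acc: int = 0) -> int:
    """16-bit ones' complement sum of data, continuing from acc."""
    if len(data) % 2:
        data += b"\x00"                     # pad odd-length data with a zero octet
    for (word,) in struct.iter_unpack("!H", data):
        acc += word
    while acc >> 16:                        # fold carries back into the low 16 bits
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc

def dss_pseudo_header(dsn64: int, ssn: int, length: int) -> bytes:
    """Pseudo-header of Figure 10: 64-bit DSN, SSN, length, zeros."""
    return struct.pack("!QIHH", dsn64, ssn, length, 0)

def dss_checksum(dsn64: int, ssn: int, payload: bytes) -> int:
    """Checksum over the pseudo-header followed by the mapped data."""
    acc = ones_complement_sum(dss_pseudo_header(dsn64, ssn, len(payload)))
    acc = ones_complement_sum(payload, acc)
    return (~acc) & 0xFFFF
```

A receiver can verify a mapping by summing the pseudo-header, the
data, and the received checksum; a result of 0xFFFF indicates that
the data is consistent with the mapping.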

1172	   Note that checksumming relies on the TCP subflow containing
1173	   contiguous data, and therefore a TCP subflow MUST NOT use the Urgent
1174	   Pointer to interrupt an existing mapping.  Further note, however,
1175	   that if Urgent data is received on a subflow, it SHOULD be mapped to
1176	   the data sequence space and delivered to the application analogous to
1177	   Urgent data in regular TCP.

1179	   To avoid possible deadlock scenarios, subflow-level processing should
1180	   be undertaken separately from that at connection-level.  Therefore,
1181	   even if a mapping does not exist from the subflow space to the data-
1182	   level space, the data SHOULD still be ACKed at the subflow (if it is
1183	   in-window).  This data cannot, however, be acknowledged at the data
1184	   level (Section 3.3.2) because its data sequence numbers are unknown.

1186	   Implementations MAY hold onto such unmapped data for a short while in
1187	   the expectation that a mapping will arrive shortly.  Such unmapped
1188	   data cannot be counted as being within the connection-level receive
1189	   window because this is relative to the data sequence numbers, so if
1190	   the receiver runs out of memory to hold this data, it will have to be
1191	   discarded.  If a mapping for that subflow-level sequence space does
1192	   not arrive within a receive window of data, that subflow SHOULD be
1193	   treated as broken, closed with an RST, and any unmapped data silently
1194	   discarded.

1196	   Data sequence numbers are always 64 bit quantities, and MUST be
1197	   maintained as such in implementations.  If a connection is
1198	   progressing at a slow rate, so protection against wrapped sequence
1199	   numbers is not required, then it is permissible to include just the
1200	   lower 32 bits of the data sequence number in the Data Sequence
1201	   Mapping and/or Data ACK as an optimization, and an implementation can
1202	   make this choice independently for each packet.

1204	   An implementation MUST send the full 64 bit Data Sequence Number if
1205	   it is transmitting at a sufficiently high rate that the 32 bit value
1206	   could wrap within the Maximum Segment Lifetime (MSL) [15].  The
1207	   lengths of the DSNs used in these values (which may be different) are
1208	   declared with flags in the DSS option.  Implementations MUST accept a
1209	   32 bit DSN and implicitly promote it to a 64 bit quantity by
1210	   incrementing the upper 32 bits of sequence number each time the lower
1211	   32 bits wrap.  A sanity check MUST be implemented to ensure that a
1212	   wrap occurs at an expected time (e.g. the sequence number jumps from
1213	   a very high number to a very low number) and is not triggered by out-
1214	   of-order packets.
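
This document does not prescribe how the promotion and its sanity
check are to be implemented.  One possible approach, sketched below
under that caveat, is to pick the 64-bit value whose lower 32 bits
match the received DSN and which lies closest to the next expected
DSN.  The names are hypothetical, and wrap of the 64-bit space itself
is not handled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def promote_dsn(dsn32: int, expected64: int) -> int:
    """Promote a 32-bit DSN to 64 bits, relative to the next expected DSN."""
    # Start with the upper 32 bits of the expected value.
    candidate = (expected64 & ~MASK32 & MASK64) | (dsn32 & MASK32)
    half = 1 << 31
    if candidate + half < expected64:
        # Far below the expected value: the lower 32 bits have wrapped.
        candidate += 1 << 32
    elif candidate > expected64 + half and candidate > MASK32:
        # Far above: a late, out-of-order packet from before the wrap.
        candidate -= 1 << 32
    return candidate & MASK64
```

Choosing the nearest candidate means that a delayed packet carrying a
high 32-bit value just after a wrap is placed before the wrap rather
than a full 2^32 octets ahead.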

1216	   As with the standard TCP sequence number, the data sequence number
1217	   should not start at zero, but at a random value to make blind session
1218	   hijacking harder.  This specification requires setting the initial
1219	   data sequence number (IDSN) of each host to the least significant 64
1220	   bits of the SHA-1 hash of the host's key, as described in
1221	   Section 3.1.

1223	   A Data Sequence Mapping does not need to be included in every MPTCP
1224	   packet, as long as the subflow sequence space in that packet is
1225	   covered by a mapping known at the receiver.  This can be used to
1226	   reduce overhead in cases where the mapping is known in advance; one
1227	   such case is when there is a single subflow between the hosts,
1228	   another is when segments of data are scheduled in larger than packet-
1229	   sized chunks.

1231	   An "infinite" mapping can be used to fallback to regular TCP by
1232	   mapping the subflow-level data to the connection-level data for the
1233	   remainder of the connection (see Section 3.6).  This is achieved by
1234	   setting the Data-level Length field of the DSS option to the reserved
1235	   value of 0.  The checksum, in such a case, will also be set to zero.

1237	3.3.2.  Data Acknowledgments

1239	   To provide full end-to-end resilience, MPTCP provides a connection-
1240	   level acknowledgement, to act as a cumulative ACK for the connection
1241	   as a whole.  This is the "Data ACK" field of the DSS option
1242	   (Figure 9).  The Data ACK is analogous to the behaviour of the
1243	   standard TCP cumulative ACK - indicating how much data has been
1244	   successfully received (with no holes).  This is in comparison to the
1245	   subflow-level ACK, which acts analogously to TCP SACK, given that
1246	   there may still be holes in the data stream at the connection level.
1247	   The Data ACK specifies the next Data Sequence Number it expects to
1248	   receive.

1250	   The Data ACK, as for the DSN, can be sent as the full 64 bit value,
1251	   or as the lower 32 bits.  If data is received with a 64 bit DSN, it
1252	   MUST be acknowledged with a 64 bit Data ACK.  If the DSN received is
1253	   32 bits, it is valid for the implementation to choose whether to send
1254	   a 32 bit or 64 bit Data ACK.

1256	   The Data ACK proves that the data, and all required MPTCP signaling,
1257	   has been received and accepted by the remote end.  One key use of the
1258	   Data ACK signal is that it is used to indicate the left edge of the
1259	   advertised receive window.  As explained in Section 3.3.4, the
1260	   receive window is shared by all subflows and is relative to the Data
1261	   ACK.  Because of this, an implementation MUST NOT use the RCV.WND
1262	   field of a TCP segment at connection-level if it does not also carry
1263	   a DSS option with a Data ACK field.  Furthermore, separating the
1264	   connection-level acknowledgments from the subflow-level allows
1265	   processing to be done separately, and a receiver has the freedom to
1266	   drop segments after acknowledgement at the subflow level, for example
1267	   due to memory constraints when many segments arrive out-of-order.

1269	   An MPTCP sender MUST NOT free data from the send buffer until it has
1270	   been acknowledged by both a Data ACK received on any subflow and at
1271	   the subflow level by all subflows the data was sent on.  The former
1272	   condition ensures liveness of the connection and the latter condition
1273	   ensures liveness and self-consistency of a subflow when data needs to
1274	   be retransmitted.  Note, however, that if some data needs to be
1275	   retransmitted multiple times over a subflow, there is a risk of
1276	   blocking the sending window.  In this case, the MPTCP sender can
1277	   decide to terminate the subflow that is behaving badly by sending a
1278	   RST.
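
The two release conditions can be expressed as a simple predicate.
The sketch below is illustrative only; the function and parameter
names are hypothetical, and sequence numbers are assumed to be plain
integers that have not wrapped.

```python
def can_free(segment_end_dsn, data_ack, sent_on, subflow_acked):
    """Decide whether a segment may be released from the send buffer.

    segment_end_dsn: DSN just past the last octet of the segment
    data_ack:        highest Data ACK received on any subflow
    sent_on:         {subflow id: SSN just past the segment on that subflow}
    subflow_acked:   {subflow id: cumulative subflow-level ACK}
    """
    acked_at_data_level = data_ack >= segment_end_dsn
    acked_on_all_subflows = all(
        subflow_acked[sf] >= end_ssn for sf, end_ssn in sent_on.items()
    )
    return acked_at_data_level and acked_on_all_subflows
```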

1280	   The Data ACK MAY be included in all segments; however, optimisations
1281	   SHOULD be considered in more advanced implementations, where the Data
1282	   ACK is present in segments only when the Data ACK value advances, and
1283	   this behaviour MUST be treated as valid.  This behaviour ensures the
1284	   sender buffer is freed, while reducing overhead when the data
1285	   transfer is unidirectional.

1287	3.3.3.  Closing a Connection

1289	   In regular TCP a FIN announces to the receiver that the sender has no
1290	   more data to send.  In order to allow subflows to operate
1291	   independently and to keep the appearance of TCP over the wire, a FIN
1292	   in MPTCP only affects the subflow on which it is sent.  This allows
1293	   nodes to exercise considerable freedom over which paths are in use at
1294	   any one time.  The semantics of a FIN remain as for regular TCP, i.e.
1295	   it is not until both sides have ACKed each other's FINs that the
1296	   subflow is fully closed.

1298	   When an application calls close() on a socket, this indicates that it
1299	   has no more data to send, and for regular TCP this would result in a
1300	   FIN on the connection.  For MPTCP, an equivalent mechanism is needed,
1301	   and this is referred to as the DATA_FIN.

1303	   A DATA_FIN is an indication that the sender has no more data to send,
1304	   and as such can be used to verify that all data has been successfully
1305	   received.  A DATA_FIN, as with the FIN on a regular TCP connection,
1306	   is a unidirectional signal.

1308	   The DATA_FIN is signalled by setting the 'F' flag in the Data
1309	   Sequence Signal option (Figure 9) to 1.  A DATA_FIN occupies one
1310	   octet (the final octet) of the connection-level sequence space.  Note
1311	   that the DATA_FIN is included in the Data-Level Length, but not at
1312	   the subflow level: for example, a segment with DSN 80, and Data-Level
1313	   Length 11, with DATA_FIN set, would map 10 octets from the subflow
1314	   into data sequence space 80-89, the DATA_FIN is DSN 90, and therefore
1315	   this segment including DATA_FIN would be acknowledged with a DATA_ACK
1316	   of 91.
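
The arithmetic of this example can be checked with a short sketch.
The function name and return layout are hypothetical and serve only
to restate the rule that the DATA_FIN consumes the final octet of the
Data-Level Length.

```python
def map_with_data_fin(dsn, data_level_length, data_fin):
    """Return (payload octets, last data DSN, DATA_FIN DSN, expected Data ACK)."""
    # The DATA_FIN occupies one octet of the Data-Level Length but
    # carries no subflow payload.
    payload_octets = data_level_length - (1 if data_fin else 0)
    last_data_dsn = dsn + payload_octets - 1
    data_fin_dsn = (dsn + data_level_length - 1) if data_fin else None
    expected_data_ack = dsn + data_level_length
    return payload_octets, last_data_dsn, data_fin_dsn, expected_data_ack

# The example from the text: DSN 80, Data-Level Length 11, DATA_FIN set.
print(map_with_data_fin(80, 11, True))
```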

1318	   Note that when the DATA_FIN is not attached to a TCP segment
1319	   containing data, the Data Sequence Signal MUST have Subflow Sequence
1320	   Number of 0, a Data-Level Length of 1, and the Data Sequence Number
1321	   that corresponds with the DATA_FIN itself.  The checksum in this case
1322	   will only cover the pseudo-header.

1324	   A DATA_FIN has the same semantics and behaviour as a regular TCP FIN,
1325	   but at the connection level.  Notably, it is only DATA_ACKed once all
1326	   data has been successfully received at the connection level.  Note
1327	   therefore that a DATA_FIN is decoupled from a subflow FIN.  It is
1328	   only permissible to combine these signals on one subflow if there is
1329	   no data outstanding on other subflows.  Otherwise, it may be
1330	   necessary to retransmit data on different subflows.  Essentially, a
1331	   host MUST NOT close all functioning subflows unless it is safe to do
1332	   so, i.e. until all outstanding data has been DATA_ACKed, or until the
1333	   segment with the DATA_FIN flag set is the only outstanding segment.

1335	   Once a DATA_FIN has been acknowledged, all remaining subflows MUST be
1336	   closed with standard FIN exchanges.  Both hosts SHOULD send FINs on
1337	   all subflows, as a courtesy to allow middleboxes to clean up state
1338	   even if an individual subflow has failed.  It is also encouraged to
1339	   reduce the timeouts (Maximum Segment Life) on subflows at end hosts.
1340	   In particular, any subflows where there is still outstanding data
1341	   queued (which has been retransmitted on other subflows in order to
1342	   get the DATA_FIN acknowledged) MAY be closed with an RST.

1344	   A connection is considered closed once both hosts' DATA_FINs have
1345	   been acknowledged by DATA_ACKs.

1347	   As specified above, a standard TCP FIN on an individual subflow only
1348	   shuts down the subflow on which it was sent.  If all subflows have
1349	   been closed with a FIN exchange, but no DATA_FIN has been received
1350	   and acknowledged, the MPTCP connection is treated as closed only
1351	   after a timeout.  This implies that an implementation will have
1352	   TIME_WAIT states at both the subflow and connection levels (see
1353	   Appendix C).  This permits "break-before-make" scenarios where
1354	   connectivity is lost on all subflows before a new one can be re-
1355	   established.

1357	3.3.4.  Receiver Considerations

1359	   Regular TCP advertises a receive window in each packet, telling the
1360	   sender how much data the receiver is willing to accept past the
1361	   cumulative ack.  The receive window is used to implement flow
1362	   control, throttling down fast senders when receivers cannot keep up.

1364	   MPTCP also uses a single receive window, shared between the subflows.
1365	   The idea is to allow any subflow to send data as long as the receiver
1366	   is willing to accept it; the alternative, maintaining per subflow
1367	   receive windows, could end up stalling some subflows while others
1368	   would not use up their window.

1370	   The receive window is relative to the DATA_ACK.  As in TCP, a
1371	   receiver MUST NOT shrink the right edge of the receive window (i.e.
1372	   DATA_ACK + receive window).  The receiver will use the Data Sequence
1373	   Number to tell if a packet should be accepted at connection level.

1375	   When deciding to accept packets at subflow level, regular TCP checks
1376	   the sequence number in the packet against the allowed receive window.
1377	   With multipath, such a check is done using only the connection level
1378	   window.  A sanity check SHOULD be performed at subflow level to
1379	   ensure that the subflow and mapped sequence numbers meet the
1380	   following test: SSN - SUBFLOW_ACK <= DSN - DATA_ACK, where SSN is the
1381	   subflow sequence number of the received packet and SUBFLOW_ACK is the
1382	   RCV.NXT (next expected sequence number) of the subflow (with the
1383	   equivalent connection-level definitions for DSN and DATA_ACK).
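
A direct transcription of this test is sketched below.  The function
name is hypothetical, and the sequence numbers are assumed to have
been converted to plain integers without wrap; an implementation
working on raw 32-bit subflow sequence numbers would need modular
arithmetic.

```python
def subflow_mapping_sane(ssn, subflow_ack, dsn, data_ack):
    """Test SSN - SUBFLOW_ACK <= DSN - DATA_ACK for a received packet."""
    # The packet's offset within the subflow must not exceed its
    # offset within the connection-level sequence space.
    return (ssn - subflow_ack) <= (dsn - data_ack)
```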

1385	   In regular TCP, once a segment is deemed in-window, it is either put
1386	   in the in-order receive queue or in the out-of-order queue.  In
1387	   multipath TCP, the same happens but at connection-level: a segment is
1388	   placed in the connection level in-order or out-of-order queue if it
1389	   is in-window at both connection and subflow level.  The stack still
1390	   has to remember, for each subflow, which segments were received
1391	   successfully so that it can ACK them at subflow level appropriately.
1392	   Typically, this will be implemented by keeping per subflow out-of-
1393	   order queues (containing only message headers, not the payloads) and
1394	   remembering the value of the cumulative ACK.

1396	   It is important for implementers to understand how large a receiver
1397	   buffer is appropriate.  The lower bound for full network utilization
1398	   is the maximum bandwidth-delay product of any one of the paths.
1399	   However this might be insufficient when a packet is lost on a slower
1400	   subflow and needs to be retransmitted (see Section 3.3.6).  A tight
1401	   upper bound would be the maximum RTT of any path multiplied by the
1402	   total bandwidth available across all paths.  This permits all
1403	   subflows to continue at full speed while a packet is fast-
1404	   retransmitted on the maximum RTT path.  Even this might be
1405	   insufficient to maintain full performance in the event of a
1406	   retransmit timeout on the maximum RTT path.  It is for future study
1407	   to determine the relationship between retransmission strategies and
1408	   receive buffer sizing.
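
As a worked illustration of the two bounds, the sketch below computes
them for a set of paths.  The function name, the units (octets per
second and milliseconds) and the example path figures are assumptions
chosen for the illustration and are not taken from this document.

```python
def receive_buffer_bounds(paths):
    """Return (lower, upper) receive buffer bounds in octets.

    paths: list of (bandwidth in octets per second, RTT in milliseconds)
    """
    # Lower bound: the largest bandwidth-delay product of any one path.
    lower = max(bw * rtt_ms // 1000 for bw, rtt_ms in paths)
    # Upper bound: the maximum RTT multiplied by the total bandwidth.
    max_rtt_ms = max(rtt_ms for _, rtt_ms in paths)
    total_bw = sum(bw for bw, _ in paths)
    upper = max_rtt_ms * total_bw // 1000
    return lower, upper

# Hypothetical paths: 10 Mbit/s at 10 ms and 5 Mbit/s at 100 ms.
print(receive_buffer_bounds([(1_250_000, 10), (625_000, 100)]))
```

For these two hypothetical paths the lower bound is 62500 octets and
the upper bound is 187500 octets.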

1410	3.3.5.  Sender Considerations

1412	   The sender remembers receiver window advertisements from the
1413	   receiver.  It should only update its local receive window values when
1414	   the largest sequence number allowed (i.e.  DATA_ACK + receive window)
1415	   increases, on the receipt of a DATA_ACK.  This is important to allow
1416	   using paths with different RTTs, and thus different feedback loops.

1418	   MPTCP uses a single receive window across all subflows, and if the
1419	   receive window was guaranteed to be unchanged end-to-end, a host
1420	   could always read the most recent receive window value.  However,
1421	   some classes of middleboxes may alter the TCP-level receive window.
1422	   Typically these will shrink the offered window, although for short
1423	   periods of time it may be possible for the window to be larger
1424	   (however note that this would not continue for long periods since
1425	   ultimately the middlebox must keep up with delivering data to the
1426	   receiver).  Therefore, if receive window sizes differ on multiple
1427	   subflows, when sending data MPTCP SHOULD take the largest of the most
1428	   recent window sizes as the one to use in calculations.  This rule is
1429	   implicit in the requirement not to reduce the right edge of the
1430	   window.

1432	   The sender MUST also remember the receive windows advertised by each
1433	   subflow.  The allowed window for subflow i is (ack_i, ack_i +
1434	   rcv_wnd_i), where ack_i is the subflow-level cumulative ack of
1435	   subflow i.  This ensures data will not be sent to a middlebox unless
1436	   there is enough buffering for the data.

1438	   Putting the two rules together, we get the following: a sender is
1439	   allowed to send data segments with data-level sequence numbers
1440	   between (DATA_ACK, DATA_ACK + receive_window).  Each of these
1441	   segments will be mapped onto subflows, as long as subflow sequence
1442	   numbers are in the allowed windows for those subflows.  Note that
1443	   subflow sequence numbers do not generally affect flow control if the
1444	   same receive window is advertised across all subflows.  They will
1445	   perform flow control for those subflows with a smaller advertised
1446	   receive window.
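
The combined rule might be sketched as follows.  This is an
illustration under stated assumptions: the names are hypothetical,
sequence numbers are plain integers without wrap, and each window is
treated as the half-open range from the ACK value up to, but not
including, the ACK value plus the window.

```python
def may_send(dsn, ssn, length, data_ack, recent_windows,
             subflow_ack, subflow_window):
    """Check a segment against the connection and subflow windows.

    recent_windows: most recent receive window seen on each subflow
    subflow_window: window advertised on the subflow to be used
    """
    # Use the largest of the most recent windows at connection level.
    rcv_wnd = max(recent_windows)
    in_connection_window = (
        data_ack <= dsn and dsn + length <= data_ack + rcv_wnd
    )
    in_subflow_window = (
        subflow_ack <= ssn and ssn + length <= subflow_ack + subflow_window
    )
    return in_connection_window and in_subflow_window
```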

1448	   The send buffer MUST, at a minimum, be as big as the receive buffer,
1449	   to enable the sender to reach maximum throughput.

1451	3.3.6.  Reliability and Retransmissions

1453	   The data sequence mapping allows senders to re-send data with the
1454	   same data sequence number on a different subflow.  When doing this, a
1455	   host MUST still retransmit the original data on the original subflow,
1456	   in order to preserve the subflow integrity (middleboxes could replay
1457	   old data, and/or could reject holes in subflows), and a receiver will
1458	   ignore these retransmissions.  While this is clearly suboptimal, for
1459	   compatibility reasons this is sensible behaviour.  Optimisations
1460	   could be negotiated in future versions of this protocol.

1462	   This protocol specification does not mandate any mechanisms for
1463	   handling retransmissions, and much will be dependent upon local
1464	   policy (as discussed in Section 3.3.8).  One can imagine aggressive
1465	   connection level retransmissions policies where every packet lost at
1466	   subflow level is retransmitted on a different subflow (hence wasting
1467	   bandwidth but possibly reducing application-to-application delays),
1468	   or conservative retransmission policies where connection-level
1469	   retransmits are only used after a few subflow level retransmission
1470	   timeouts occur.

1472	   It is envisaged that a standard connection-level retransmission
1473	   mechanism would be implemented around a connection-level data queue:
1475	   all segments that haven't been DATA_ACKed are stored.  A timer is set
1476	   when the head of the connection-level queue is ACKed at subflow level
1477	   but its corresponding data is not ACKed at data level.  This timer
1478	   will guard against failures in retransmission by middleboxes that
1479	   proactively ACK data.

1481	   The sender MUST keep data in its send buffer as long as the data has
1482	   not been acknowledged at both connection level and on all subflows it
1483	   has been sent on.  In this way, the sender can always retransmit the
1484	   data if needed, on the same subflow or on a different one.  A special
1485	   case is when a subflow fails: the sender will typically resend the
1486	   data on other working subflows after a timeout, and will keep trying
1487	   to retransmit the data on the failed subflow too.  The sender will
1488	   declare the subflow failed after a predefined upper bound on
1489	   retransmissions is reached (which MAY be lower than the usual TCP
1490	   limits of the Maximum Segment Life), or on the receipt of an ICMP
1491	   error, and only then delete the outstanding data segments.

1493	   Multiple retransmissions are triggers that will indicate that a
1494	   subflow performs badly and could lead to a host resetting the subflow
1495	   with an RST.  However, additional research is required to understand
1496	   the heuristics of how and when to reset underperforming subflows.
1497	   For example, a highly asymmetric path may be mis-diagnosed as
1498	   underperforming.

1500	3.3.7.  Congestion Control Considerations

1502	   Different subflows in an MPTCP connection have different congestion
1503	   windows.  To achieve fairness at bottlenecks and resource pooling, it
1504	   is necessary to couple the congestion windows in use on each subflow,
1505	   in order to push most traffic to uncongested links.  One algorithm
1506	   for achieving this is presented in [5]; the algorithm does not
1507	   achieve perfect resource pooling but is "safe" in that it is readily
1508	   deployable in the current Internet.  By this, we mean that it does
1509	   not take up more capacity on any one path than if it was a single
1510	   path flow using only that route, so this ensures fair coexistence
1511	   with single-path TCP at shared bottlenecks.

1513	   It is foreseeable that different congestion controllers will be
1514	   implemented for MPTCP, each aiming to achieve different properties in
1515	   the resource pooling/fairness/stability design space, as well as
1516	   those for achieving different properties in quality of service,
1517	   reliability and resilience.

1519	   Regardless of the algorithm used, the design of the MPTCP protocol
1520	   aims to provide the congestion control implementations sufficient
1521	   information to take the right decisions; this information includes,
1522	   for each subflow, which packets were lost and when.

1524	3.3.8.  Subflow Policy

1526	   Within a local MPTCP implementation, a host may use any local policy
1527	   it wishes to decide how to share the traffic to be sent over the
1528	   available paths.

1530	   In the typical use case, where the goal is to maximise throughput,
1531	   all available paths will be used simultaneously for data transfer,
1532	   using coupled congestion control as described in [5].  It is
1533	   expected, however, that other use cases will appear.

1535	   For instance, a possibility is an 'all-or-nothing' approach, i.e.
1536	   have a second path ready for use in the event of failure of the first
1537	   path, but alternatives could include entirely saturating one path
1538	   before using an additional path (the 'overflow' case).  Such choices
1539	   would be most likely based on the monetary cost of links, but may
1540	   also be based on properties such as the delay or jitter of links,
1541	   where stability (of delay or bandwidth) is more important than
1542	   throughput.  Application requirements such as these are discussed in
1543	   detail in [6].

1545	   The ability to make effective choices at the sender requires full
1546	   knowledge of the path "cost", which is unlikely to be the case.  It
1547	   would be desirable for a receiver to be able to signal their own
1548	   preferences for paths, since they will often be the multihomed party,
1549	   and may have to pay for metered incoming bandwidth.

1551	   Whilst fine-grained control may be the most powerful solution, that
1552	   would require some mechanism such as overloading the ECN signal [16],
1553	   which is undesirable, and it is felt that there would not be
1554	   sufficient benefit to justify an entirely new signal.  Therefore the
1555	   MP_JOIN option (see Section 3.2) contains the 'B' bit, which allows a
1556	   host to indicate to its peer that this path should be treated as a
1557	   backup path to use only in the event of failure of other working
1558	   subflows (i.e. a subflow where the receiver has indicated B=1 SHOULD
1559	   NOT be used to send data unless there are no usable subflows where
1560	   B=0).

1562	   In the event that the available set of paths changes, a host may wish
1563	   to signal a change in priority of subflows to the peer (e.g. a
1564	   subflow that was previously set as backup should now take priority
1565	   over all remaining subflows).  Therefore, the MP_PRIO option, shown
1566	   in Figure 11, can be used to change the 'B' flag of the subflow on
1567	   which it is sent.

1569	                           1                   2                   3
1570	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1571	      +---------------+---------------+-------+-----+-+--------------+
1572	      |     Kind      |     Length    |Subtype|     |B| AddrID (opt) |
1573	      +---------------+---------------+-------+-----+-+--------------+

1575	                         Figure 11: MP_PRIO option

1577	   It should be noted that the backup flag is a request from a data
1578	   receiver to a data sender only, and the data sender SHOULD adhere to
1579	   these requests.  A host cannot assume that the data sender will do
1580	   so, however, since local policies - or technical difficulties - may
1581	   override MP_PRIO requests.  Note also that this signal applies to a
1582	   single direction, and so the sender of this option could choose to
1583	   continue using the subflow to send data even if it has signalled B=1
1584	   to the other host.

1586	   This option can also be applied to other subflows than the one on
1587	   which it is sent, by setting the optional Address ID field.  This
1588	   applies the given setting of B to all subflows in this connection
1589	   that use the address identified by the given Address ID.  The
1590	   presence of this field is determined by the option length; if
1591	   Length==4 then it is present, if Length==3 then it applies to the
1592	   current subflow only.  The use case of this is that a host can signal
1593	   to its peer that an address is temporarily unavailable (for example,
1594	   if it has radio coverage issues) and the peer should therefore drop
1595	   to backup state on all subflows using that Address ID.
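
A minimal sketch of encoding and decoding this option is given below.
The layout follows Figure 11.  The default Kind value of 30 and the
MP_PRIO subtype of 0x5 are not stated in this section; they are
assumed here from the values assigned to MPTCP elsewhere, and the
function names are hypothetical.

```python
def build_mp_prio(backup, addr_id=None, kind=30, subtype=0x5):
    """Encode an MP_PRIO option as bytes (Figure 11 layout)."""
    # Third octet: 4-bit subtype, 3 reserved bits (zero), then the B bit.
    third = ((subtype & 0xF) << 4) | (1 if backup else 0)
    if addr_id is None:
        return bytes([kind, 3, third])            # applies to this subflow
    return bytes([kind, 4, third, addr_id & 0xFF])

def parse_mp_prio(option):
    """Return (backup flag, Address ID or None) from an MP_PRIO option."""
    length = option[1]
    if length not in (3, 4) or len(option) != length:
        raise ValueError("unexpected MP_PRIO length")
    backup = bool(option[2] & 0x01)
    addr_id = option[3] if length == 4 else None
    return backup, addr_id
```

A Length of 3 yields an Address ID of None, meaning that the setting
applies only to the subflow on which the option arrived.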

1597	3.4.  Address Knowledge Exchange (Path Management)

1599	   We use the term "path management" to refer to the exchange of
1600	   information about additional paths between hosts, which in this
1601	   design is managed by multiple addresses at hosts.  For more detail of
1602	   the architectural thinking behind this design, see the separate
1603	   architecture document [2].

1605	   This design makes use of two methods of sharing such information, and
1606	   both can be used on a connection.  The first is the direct setup of
1607	   new subflows, already described in Section 3.2, where the initiator
1608	   has an additional address.  The second method, described in the
1609	   following subsections, signals addresses explicitly to the other host
1610	   to allow it to initiate new subflows.  The two mechanisms are
1611	   complementary: the first is implicit and simple, while the second is
1612	   explicit, more complex, but more robust.  Together, they allow
1613	   addresses to change in flight (and thus support operation through
1614	   NATs, since the source address need not be known), and also allow the
1615	   signaling of previously unknown addresses, and of addresses belonging
1616	   to other address families (e.g. both IPv4 and IPv6).

1618	   Here is an example of typical operation of the protocol:

1620	   o  An MPTCP connection is initially set up between address/port A1 of
1621	      host A and address/port B1 of host B. If host A is multihomed and
1622	      multi-addressed, it can start an additional subflow from its
1623	      address A2 to B1, by sending a SYN with a Join option from A2 to
1624	      B1, using B's previously declared token for this connection.
1625	      Alternatively, if B is multihomed, it can try to set up a new
1626	      subflow from B2 to A1, using A's previously declared token.  In
1627	      either case, the SYN will be sent to the port already in use for
1628	      the original subflow on the receiving host.

1630	   o  Simultaneously (or after a timeout), an ADD_ADDR option
1631	      (Section 3.4.1) is sent on an existing subflow, informing the
1632	      receiver of the sender's alternative address(es).  The recipient
1633	      can use this information to open a new subflow to the sender's
1634	      additional address.  In our example, A will send an ADD_ADDR
1635	      option informing B of address/port A2.  The mix of using the
1636	      SYN-based option and the ADD_ADDR option, including timeouts, is
1637	      implementation-specific and can be tailored to agree with local
1638	      policy.

1640	   o  If subflow A2-B1 is successfully set up, host B can use the Address
1641	      ID in the Join option to correlate this with the ADD_ADDR option
1642	      that will also arrive on an existing subflow; now B knows not to
1643	      open A2-B1, ignoring the ADD_ADDR.  Otherwise, if B has not
1644	      received the A2-B1 MP_JOIN SYN but received the ADD_ADDR, it can
1645	      try to initiate a new subflow from one or more of its addresses to
1646	      address A2.  This permits new sessions to be opened if one host is
1647	      behind a NAT.

1649	   Other ways of using the two signaling mechanisms are possible; for
1650	   instance, signaling addresses in other address families can only be
1651	   done explicitly using the Add Address option.

1653	3.4.1.  Address Advertisement

1655	   The Add Address (ADD_ADDR) TCP Option announces additional addresses
1656	   (and optionally, ports) on which a host can be reached (Figure 12).
1657	   Multiple instances of this TCP option can be added in a single
1658	   message if there is sufficient TCP option space, otherwise multiple
1659	   TCP messages containing this option will be sent.  This option can be
1660	   used at any time during a connection, depending on when the sender
1661	   wishes to enable multiple paths and/or when paths become available.
1662	   As with all MPTCP signals, the receiver MUST undertake standard TCP
1663	   validity checks before acting upon it.

1665	   Every address has an Address ID which can be used for uniquely
1666	   identifying the address within a connection, for address removal.
1667	   This is also used to identify MP_JOIN options (see Section 3.2)
1668	   relating to the same address, even when address translators are in
1669	   use.  The Address ID MUST uniquely identify the address to the sender
1670	   (within the scope of the connection), but the mechanism for
1671	   allocating such IDs is implementation-specific.

1673	   All address IDs learnt via either MP_JOIN or ADD_ADDR SHOULD be
1674	   stored by the receiver in a data structure that gathers all the
1675	   Address ID to address mappings for a connection (identified by a
1676	   token pair).  In this way there is a stored mapping between Address
1677	   ID, observed source address and token pair for future processing of
1678	   control information for a connection.  Note that an implementation
1679	   MAY discard incoming address advertisements at will, for example for
1680	   avoiding the required mapping state, or because advertised addresses
1681	   are of no use to it (for example, IPv6 addresses when it has IPv4
1682	   only).  Therefore, a host MUST treat address advertisements as soft
1683	   state, and MAY choose to refresh advertisements periodically.

1685	   This option is shown in Figure 12.  The illustration is sized for
1686	   IPv4 addresses (IPVer = 4).  For IPv6, the IPVer field will read 6,
1687	   and the length of the address will be 16 octets (instead of 4).

1689	   The final two octets, specifying the TCP port number to use, are
1690	   optional; their presence can be inferred from the length of the
1691	   option.  Although it is expected that the majority of use cases will
1692	   use the same port pairs as used for the initial subflow (e.g. port 80
1693	   remains port 80 on all subflows, as does the ephemeral port at the
1694	   client), there may be cases (such as port-based load balancing) where
1695	   the explicit specification of a different port is required.  If no
1696	   port is specified, MPTCP SHOULD attempt to connect to the specified
1697	   address on the same port as is already in use by the subflow on which
1698	   the ADD_ADDR signal was sent; this is discussed in more detail in
1699	   Section 3.8.

1701	                           1                   2                   3
1702	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1703	      +---------------+---------------+-------+-------+---------------+
1704	      |     Kind      |     Length    |Subtype| IPVer |  Address ID   |
1705	      +---------------+---------------+-------+-------+---------------+
1706	      |          Address (IPv4 - 4 octets / IPv6 - 16 octets)         |
1707	      +-------------------------------+-------------------------------+
1708	      |   Port (2 octets, optional)   |
1709	      +-------------------------------+

1711	                 Figure 12: Add Address (ADD_ADDR) option
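
   As an illustration only, the layout in Figure 12 could be encoded and
   parsed along the following lines.  This is a minimal Python sketch,
   not a reference implementation: the function names are hypothetical,
   and the Kind value (30) and the ADD_ADDR subtype (0x3) are
   assumptions, since neither value is stated in this section.  The
   sketch shows how the presence of the port is inferred purely from the
   option length.

```python
import ipaddress
import struct

MPTCP_KIND = 30          # assumption: TCP option Kind used for MPTCP
ADD_ADDR_SUBTYPE = 0x3   # assumption: subtype value for ADD_ADDR


def encode_add_addr(address, address_id, port=None):
    """Build an ADD_ADDR option laid out as in Figure 12."""
    ip = ipaddress.ip_address(address)
    body = ip.packed                      # 4 octets (IPv4) or 16 (IPv6)
    if port is not None:
        body += struct.pack("!H", port)   # optional 2-octet port
    length = 4 + len(body)
    subtype_ipver = (ADD_ADDR_SUBTYPE << 4) | ip.version
    header = struct.pack("!BBBB", MPTCP_KIND, length, subtype_ipver, address_id)
    return header + body


def parse_add_addr(option):
    """Return (address, address_id, port); port is None when absent."""
    if len(option) < 4:
        raise ValueError("truncated option")
    kind, length, subtype_ipver, address_id = struct.unpack("!BBBB", option[:4])
    if kind != MPTCP_KIND or subtype_ipver >> 4 != ADD_ADDR_SUBTYPE:
        raise ValueError("not an ADD_ADDR option")
    if length != len(option):
        raise ValueError("Length field does not match option size")
    addr_len = {4: 4, 6: 16}.get(subtype_ipver & 0x0F)
    if addr_len is None:
        raise ValueError("unknown IPVer")
    # The port is present only if two octets follow the address.
    if length == 4 + addr_len:
        port = None
    elif length == 4 + addr_len + 2:
        (port,) = struct.unpack("!H", option[4 + addr_len:])
    else:
        raise ValueError("Length inconsistent with IPVer")
    address = ipaddress.ip_address(option[4:4 + addr_len])
    return str(address), address_id, port
```

   Under these assumptions an IPv4 advertisement without a port occupies
   8 octets, and an IPv6 advertisement with a port occupies 22 octets.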

1713	   Due to the proliferation of NATs, it is reasonably likely that one
1714	   host may attempt to advertise private addresses [17].  It is not
1715	   desirable to prohibit this, since there may be cases where both hosts
1716	   have additional interfaces on the same private network, and a host
1717	   MAY want to advertise such addresses.  The MP_JOIN handshake to
1718	   create a new subflow (Section 3.2) provides mechanisms to minimise
1719	   security risks.  The MP_JOIN message contains a 32 bit token that
1720	   uniquely identifies the connection to the receiving host.  If the
1721	   token is unknown, the host will return with a RST.  In the unlikely
1722	   event that the token is known, subflow setup will continue, but the
1723	   HMAC exchange must occur for authentication.  This will fail, and
1724	   will provide sufficient protection against two unconnected hosts
1725	   accidentally setting up a new subflow upon the signal of a private
1726	   address.  Further security considerations around the issue of
1727	   ADD_ADDR messages that accidentally mis-direct, or maliciously
1728	   direct, new MP_JOIN attempts are discussed in Section 5.

1730	   Ideally, ADD_ADDR and REMOVE_ADDR options would be sent reliably, and
1731	   in order, to the other end.  This would ensure that this address
1732	   management does not unnecessarily cause an outage in the connection
1733	   when remove/add addresses are processed in reverse order, and also to
1734	   ensure that all possible paths are used.  Note, however, that losing
1735	   reliability and ordering will not break the multipath connections, it
1736	   will just reduce the opportunity to open additional paths and to
1737	   survive different patterns of path failures.

1739	   Therefore, implementing reliability signals for these TCP options is
1740	   not necessary.  In order to minimise the impact of the loss of these
1741	   options, however, it is RECOMMENDED that a sender send these
1742	   options on all available subflows.  If these options need to be
1743	   received in-order, an implementation SHOULD only send one ADD_ADDR/
1744	   REMOVE_ADDR option per RTT, to minimise the risk of misordering.

1746	   A host can send an ADD_ADDR message with an already assigned Address
1747	   ID, but the Address MUST be the same as previously assigned to this
1748	   Address ID, and the Port MUST be different to one already in use for
1749	   this Address ID.  If these conditions are not met, the receiver
1750	   SHOULD silently ignore the ADD_ADDR.  A host wishing to replace an
1751	   existing Address ID MUST first remove the existing one
1752	   (Section 3.4.2).
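
   The reuse rule above can be sketched as a small per-connection
   table.  This is a hypothetical illustration in Python (the class and
   method names are not part of this specification), and it models only
   the rule in the preceding paragraph: an advertisement that reuses an
   Address ID is accepted only if the address is unchanged and the port
   is new.  How an implementation treats a deliberate refresh of an
   identical advertisement is left out of the sketch.

```python
class AddressTable:
    """Soft-state map of Address ID -> (address, set of ports)."""

    def __init__(self):
        self._entries = {}

    def on_add_addr(self, address_id, address, port=None):
        """Return True if the advertisement is stored, False if ignored.

        A port of None stands for "no port specified in the option".
        """
        entry = self._entries.get(address_id)
        if entry is None:
            self._entries[address_id] = (address, {port})
            return True
        known_address, ports = entry
        # Reusing an ID requires the same address and a port not yet seen.
        if address != known_address or port in ports:
            return False          # silently ignore
        ports.add(port)
        return True

    def on_remove_addr(self, address_id):
        """Forget an ID; an unknown ID is silently ignored (returns False)."""
        return self._entries.pop(address_id, None) is not None
```

   Replacing the address bound to an ID therefore takes two steps in
   this model: a removal, followed by a fresh advertisement.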

1754	   A host that receives an ADD_ADDR but finds a connection setup to that
1755	   IP address and port number is unsuccessful SHOULD NOT perform further
1756	   connection attempts to this address/port combination for this
1757	   connection.  A sender that wants to trigger a new incoming connection
1758	   attempt on a previously advertised address/port combination can
1759	   therefore refresh ADD_ADDR information by sending the option again.

1761	   During normal MPTCP operation, it is unlikely that there will be
1762	   sufficient TCP option space for ADD_ADDR to be included along with
1763	   those for data sequence numbering (Section 3.3.1).  Therefore, it is
1764	   expected that an MPTCP implementation will send the ADD_ADDR option
1765	   on separate ACKs.  As discussed earlier, however, an MPTCP
1766	   implementation MUST NOT treat duplicate ACKs with any MPTCP option,
1767	   with the exception of the DSS option, as indications of congestion
1768	   [11], and an MPTCP implementation SHOULD NOT send more than two
1769	   duplicate ACKs in a row for signaling purposes.

1771	3.4.2.  Remove Address

1773	   If, during the lifetime of an MPTCP connection, a previously-
1774	   announced address becomes invalid (e.g. if the interface disappears),
1775	   the affected host SHOULD announce this so that the peer can remove
1776	   subflows related to this address.

1778	   This is achieved through the Remove Address (REMOVE_ADDR) option
1779	   (Figure 13), which will remove a previously-added address (or list of
1780	   addresses) from a connection and terminate any subflows currently
1781	   using that address.

1783	   For security purposes, if a host receives a REMOVE_ADDR option, it
1784	   must ensure the affected path(s) are no longer in use before it
1785	   instigates closure.  The receipt of REMOVE_ADDR SHOULD first trigger
1786	   the sending of a TCP Keepalive [18] on the path, and if a response is
1787	   received the path SHOULD NOT be removed.  Typical TCP validity tests
1788	   on the subflow (e.g. ensuring sequence and ack numbers are correct)
1789	   MUST also be undertaken.  An implementation can use indications of
1790	   these test failures as part of intrusion detection or error logging.

1792	   The sending and receipt (if no keepalive response was received) of
1793	   this message SHOULD trigger the sending of RSTs by both hosts on the
1794	   affected subflow(s) (if possible), as a courtesy to allow middleboxes
1795	   to clean up state, before cleaning up any local state.

1797	   Address removal is undertaken by ID, so as to permit the use of NATs
1798	   and other middleboxes that rewrite source addresses.  If there is no
1799	   address at the requested ID, the receiver will silently ignore the
1800	   request.

1802	   A subflow that is still functioning MUST be closed with a FIN
1803	   exchange as in regular TCP, rather than using this option.  For more
1804	   information, see Section 3.3.3.

1806	                        1                   2                   3
1807	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1808	   +---------------+---------------+-------+-------+---------------+
1809	   |     Kind      |  Length = 3+n |Subtype|(resvd)|   Address ID  | ...
1810	   +---------------+---------------+-------+-------+---------------+
1811	                              (followed by n-1 Address IDs, if required)

1813	              Figure 13: Remove Address (REMOVE_ADDR) option
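
   A minimal sketch of the Figure 13 layout follows, again in Python and
   for illustration only.  The function names are hypothetical, and the
   Kind (30) and REMOVE_ADDR subtype (0x4) values are assumptions not
   stated in this section.  The Length field is 3 + n, where n is the
   number of Address IDs carried.

```python
import struct

MPTCP_KIND = 30             # assumption: TCP option Kind used for MPTCP
REMOVE_ADDR_SUBTYPE = 0x4   # assumption: subtype value for REMOVE_ADDR


def encode_remove_addr(address_ids):
    """Build a REMOVE_ADDR option carrying one or more Address IDs."""
    ids = bytes(address_ids)          # each ID must fit in one octet
    if not ids:
        raise ValueError("at least one Address ID is required")
    # The subtype sits in the high nibble; the low nibble is reserved.
    header = struct.pack("!BBB", MPTCP_KIND, 3 + len(ids),
                         REMOVE_ADDR_SUBTYPE << 4)
    return header + ids


def parse_remove_addr(option):
    """Return the list of Address IDs carried in the option."""
    if len(option) < 4:
        raise ValueError("truncated option")
    kind, length, subtype_resvd = struct.unpack("!BBB", option[:3])
    if kind != MPTCP_KIND or subtype_resvd >> 4 != REMOVE_ADDR_SUBTYPE:
        raise ValueError("not a REMOVE_ADDR option")
    if length != len(option):
        raise ValueError("Length field does not match option size")
    return list(option[3:])
```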

1815	3.5.  Fast Close

1817	   Regular TCP has the means of sending a reset signal (RST) to abruptly
1818	   close a connection.  With MPTCP, the RST only has the scope of the
1819	   subflow and will only close the concerned subflow but not affect the
1820	   remaining subflows.  MPTCP's connection will stay alive at the data-
1821	   level, in order to permit break-before-make handover between
1822	   subflows.  It is therefore necessary to provide an MPTCP-level
1823	   "reset" to allow the abrupt closure of the whole MPTCP connection,
1824	   and this is the MP_FASTCLOSE option.

1826	   MP_FASTCLOSE is used to indicate to the peer that the connection will
1827	   be abruptly closed and no data will be accepted any more.  The
1828	   reasons for triggering an MP_FASTCLOSE are implementation-specific.
1829	   Regular TCP does not allow sending a RST while the connection is in a
1830	   synchronized state [1].  Nevertheless, implementations allow the
1831	   sending of a RST in this state, if for example the operating system
1832	   is running out of resources.  In these cases, MPTCP should send the
1833	   MP_FASTCLOSE.  This option is illustrated in Figure 14.

1835	                            1                   2                   3
1836	        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1837	       +---------------+---------------+-------+-----------------------+
1838	       |     Kind      |    Length     |Subtype|      (reserved)       |
1839	       +---------------+---------------+-------+-----------------------+
1840	       |                      Option Receiver's Key                    |
1841	       |                            (64 bits)                          |
1842	       |                                                               |
1843	       +---------------------------------------------------------------+

1845	                Figure 14: Fast Close (MP_FASTCLOSE) option

1847	   If Host A wants to force the closure of an MPTCP connection, the
1848	   MPTCP Fast Close procedure is as follows:

1850	   o  Host A sends an ACK containing the MP_FASTCLOSE option on one
1851	      subflow, containing the key of Host B as declared in the initial
1852	      connection handshake.  On all the other subflows, Host A sends a
1853	      regular TCP RST to close these subflows, and tears them down.

1855	      Host A now enters FASTCLOSE_WAIT state.

1857	   o  Upon receipt of an MP_FASTCLOSE, containing the valid key, host B
1858	      answers on the same subflow with a TCP RST and tears down all
1859	      subflows.  Host B can now close the whole MPTCP connection (it
1860	      transitions directly to CLOSED state).

1862	   o  As soon as Host A has received the TCP RST on the remaining
1863	      subflow, it can close this subflow and tear down the whole
1864	      connection (transition from FASTCLOSE_WAIT to CLOSED states).  If
1865	      Host A receives an MP_FASTCLOSE instead of a TCP RST, both hosts
1866	      attempted fast closure simultaneously.  Host A should reply with a
1867	      TCP RST and tear down the connection.

1869	   o  If host A does not receive a TCP RST in reply to its MP_FASTCLOSE
1870	      after one RTO (the RTO of the subflow where the MP_FASTCLOSE has
1871	      been sent), it SHOULD retransmit the MP_FASTCLOSE.  The number of
1872	      retransmissions SHOULD be limited to prevent this connection from
1873	      being retained for a long time, but this limit is implementation-
1874	      specific.  A RECOMMENDED number is 3.
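
   The initiator's side of the procedure above can be sketched as a
   small state machine.  This Python sketch is illustrative only: the
   class, method, and action names are hypothetical, and the decision to
   move to CLOSED once the retransmission limit is exhausted is an
   assumption, since the text above does not say what happens at that
   point.

```python
class FastCloseInitiator:
    """Host A's view of the Fast Close procedure."""

    MAX_RETRANSMISSIONS = 3   # the RECOMMENDED limit

    def __init__(self):
        self.state = "ESTABLISHED"
        self.retransmissions = 0
        self.actions = []     # log of what would be sent on the wire

    def start(self):
        self.actions.append("send MP_FASTCLOSE on one subflow")
        self.actions.append("send RST on all other subflows")
        self.state = "FASTCLOSE_WAIT"

    def on_rst(self):
        # The peer answered the MP_FASTCLOSE with a TCP RST.
        if self.state == "FASTCLOSE_WAIT":
            self.state = "CLOSED"

    def on_mp_fastclose(self):
        # Both hosts attempted fast closure simultaneously.
        if self.state == "FASTCLOSE_WAIT":
            self.actions.append("send RST")
            self.state = "CLOSED"

    def on_rto(self):
        # One RTO elapsed on the subflow that carried the MP_FASTCLOSE.
        if self.state != "FASTCLOSE_WAIT":
            return
        if self.retransmissions < self.MAX_RETRANSMISSIONS:
            self.retransmissions += 1
            self.actions.append("retransmit MP_FASTCLOSE")
        else:
            self.state = "CLOSED"   # assumption: give up after the limit
```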

1876	3.6.  Fallback

1878	   Sometimes, middleboxes will exist on a path that could prevent the
1879	   operation of MPTCP.  MPTCP has been designed in order to cope with
1880	   many middlebox modifications (see Section 6), but there are still
1881	   some cases where a subflow could fail to operate within the MPTCP
1882	   requirements.  These cases are notably: the loss of TCP options on a
1883	   path; and the modification of payload data.  If such an event occurs,
1884	   it is necessary to "fall back" to the previous, safe operation.  This
1885	   may either be falling back to regular TCP, or removing a problematic
1886	   subflow.

1888	   At the start of an MPTCP connection (i.e. the first subflow), it is
1889	   important to ensure that the path is fully MPTCP-capable and the
1890	   necessary TCP options can reach each host.  The handshake as
1891	   described in Section 3.1 SHOULD fall back to regular TCP if either of
1892	   the SYN messages does not have the MPTCP options: this is the same, and
1893	   desired, behaviour in the case where a host is not MPTCP capable, or
1894	   the path does not support the MPTCP options.  When attempting to join
1895	   an existing MPTCP connection (Section 3.2), if a path is not MPTCP
1896	   capable and the TCP options do not get through on the SYNs, the
1897	   subflow will be closed according to the MP_JOIN logic.

1899	   There is, however, another corner case which should be addressed.
1900	   That is one of MPTCP options getting through on the SYN, but not on
1901	   regular packets.  This can be resolved if the subflow is the first
1902	   subflow, and thus all data in flight is contiguous, using the
1903	   following rules.

1905	   A sender MUST include a DSS option with Data Sequence Mapping in
1906	   every segment until one of the sent segments has been acknowledged
1907	   with a DSS option containing a Data ACK.  Upon reception of the
1908	   acknowledgement, the sender has the confirmation that the DSS option
1909	   passes in both directions and may choose to send fewer DSS options
1910	   than once per segment.

1912	   If, however, an ACK is received for data (not just for the SYN)
1913	   without a DSS option containing a Data ACK, the sender determines the
1914	   path is not MPTCP capable.  In the case of this occurring on an
1915	   additional subflow (i.e. one started with MP_JOIN), the host MUST
1916	   close the subflow with an RST.  In the case of the first subflow
1917	   (i.e. that started with MP_CAPABLE), it MUST drop out of an MPTCP
1918	   mode back to regular TCP.  The sender will send one final Data
1919	   Sequence Mapping, with the Data-Level Length value of 0 indicating an
1920	   infinite mapping (in case the path drops options in one direction
1921	   only), and then revert to sending data on the single subflow without
1922	   any MPTCP options.
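
   The decision described in the two preceding paragraphs can be
   summarised in a few lines.  The following Python sketch is
   illustrative only; the function name and the returned labels are
   hypothetical.  It covers only the first ACK for data seen on a
   subflow, before the path has been confirmed as MPTCP-capable.

```python
def classify_first_data_ack(has_dss_data_ack, started_with):
    """Decide what to do on the first ACK that covers data.

    has_dss_data_ack: True if the ACK carried a DSS option with a Data ACK.
    started_with: "MP_CAPABLE" (first subflow) or "MP_JOIN" (additional).
    """
    if started_with not in ("MP_CAPABLE", "MP_JOIN"):
        raise ValueError("unknown subflow type")
    if has_dss_data_ack:
        # DSS passes in both directions; DSS may now be sent less often.
        return "path_confirmed"
    if started_with == "MP_JOIN":
        return "close_subflow_with_rst"
    # First subflow: send one final infinite mapping, then plain TCP.
    return "fall_back_to_regular_tcp"
```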

1924	   Note that this rule essentially prohibits the sending of data on the
1925	   third packet of an MP_CAPABLE or MP_JOIN handshake, since both that
1926	   option and a DSS cannot fit in TCP option space.  If the initiator is
1927	   to send first, another segment must be sent that contains the data
1928	   and DSS.  Note also that an additional subflow cannot be used until
1929	   the initial path has been verified as MPTCP-capable.

1931	   These rules should cover all cases where such a failure could happen:
1932	   whether it's on the forward or reverse path, and whether the server
1933	   or the client first sends data.  If lost options on data packets
1934	   occur on any other subflow apart from the initial subflow, it
1935	   should be treated as a standard path failure.  The data would not be
1936	   DATA_ACKed (since there is no mapping for the data), and the subflow
1937	   can be closed with an RST.

1939	   The case described above is a specialised case of fallback, for when
1940	   the lack of MPTCP support is detected before any data is acknowledged
1941	   at the connection level on a subflow.  More generally, fallback
1942	   (either closing a subflow, or to regular TCP) can become necessary at
1943	   any point during a connection if a non-MPTCP-aware middlebox changes
1944	   the data stream.

1946	   As described in Section 3.3, each portion of data for which there is
1947	   a mapping is protected by a checksum.  This mechanism is used to
1948	   detect if middleboxes have made any adjustments to the payload
1949	   (added, removed, or changed data).  A checksum will fail if the data
1950	   has been changed in any way.  This will also detect if the length of
1951	   data on the subflow is increased or decreased, and this means the
1952	   Data Sequence Mapping is no longer valid.  The sender no longer knows
1953	   what subflow-level sequence number the receiver is genuinely
1954	   operating at (the middlebox will be faking ACKs in return), and
1955	   cannot signal any further mappings.  Furthermore, in addition to the
1956	   possibility of payload modifications that are valid at the
1957	   application layer, there is the possibility that false-positives
1958	   could be hit across MPTCP segment boundaries, corrupting the data.
1959	   Therefore, all data from the start of the segment that failed the
1960	   checksum onwards is not trustworthy.

1962	   When multiple subflows are in use, the data in-flight on a subflow
1963	   will likely involve data that is not contiguously part of the
1964	   connection-level stream, since segments will be spread across the
1965	   multiple subflows.  Due to the problems identified above, it is not
1966	   possible to determine what the adjustment has done to the data
1967	   (notably, any changes to the subflow sequence numbering).  Therefore,
1968	   it is not possible to recover the subflow, and the affected subflow
1969	   must be immediately closed with an RST, featuring an MP_FAIL option
1970	   (Figure 15), which defines the Data Sequence Number at the start of
1971	   the segment (defined by the Data Sequence Mapping) which had the
1972	   checksum failure.  Note that the MP_FAIL option requires the use of
1973	   the full 64-bit sequence number, even if 32-bit sequence numbers are
1974	   normally in use in the DSS signals on the path.

1976	                           1                   2                   3
1977	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1978	      +---------------+---------------+-------+----------------------+
1979	      |     Kind      |   Length=12   |Subtype|      (reserved)      |
1980	      +---------------+---------------+-------+----------------------+
1981	      |                                                              |
1982	      |                 Data Sequence Number (8 octets)              |
1983	      |                                                              |
1984	      +--------------------------------------------------------------+

1986	                   Figure 15: Fallback (MP_FAIL) option
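
   For illustration, the fixed 12-octet layout of Figure 15 could be
   packed and unpacked as follows.  This is a Python sketch with
   hypothetical function names; the Kind (30) and MP_FAIL subtype (0x6)
   values are assumptions not stated in this section.  The Data Sequence
   Number is always carried as a full 64-bit value.

```python
import struct

MPTCP_KIND = 30         # assumption: TCP option Kind used for MPTCP
MP_FAIL_SUBTYPE = 0x6   # assumption: subtype value for MP_FAIL
MP_FAIL_LENGTH = 12


def encode_mp_fail(data_sequence_number):
    """Build an MP_FAIL option; the 12 reserved bits are sent as zero."""
    return struct.pack("!BBHQ", MPTCP_KIND, MP_FAIL_LENGTH,
                       MP_FAIL_SUBTYPE << 12, data_sequence_number)


def parse_mp_fail(option):
    """Return the 64-bit Data Sequence Number carried in the option."""
    if len(option) != MP_FAIL_LENGTH:
        raise ValueError("MP_FAIL must be exactly 12 octets")
    kind, length, subtype_resvd, dsn = struct.unpack("!BBHQ", option)
    if kind != MPTCP_KIND or length != MP_FAIL_LENGTH:
        raise ValueError("bad Kind or Length")
    if subtype_resvd >> 12 != MP_FAIL_SUBTYPE:
        raise ValueError("not an MP_FAIL option")
    return dsn
```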

1988	   The receiver MUST discard all data following the data sequence number
1989	   specified.  Failed data MUST NOT be DATA_ACKed and so will be re-
1990	   transmitted on other subflows (Section 3.3.6).

1992	   A special case is when there is a single subflow and it fails with a
1993	   checksum error.  If it is known that all unacknowledged data in
1994	   flight is contiguous (which will usually be the case with a single
1995	   subflow), an infinite mapping can be applied to the subflow without
1996	   the need to close it first, and essentially turn off all further
1997	   MPTCP signaling.  In this case, if a receiver identifies a checksum
1998	   failure when there is only one path, it will send back an MP_FAIL
1999	   option on the subflow-level ACK, referring to the data-level sequence
2000	   number of the start of the segment on which the checksum error was
2001	   detected.  The sender will receive this, and if all unacknowledged
2002	   data in flight is contiguous, will signal an infinite mapping.  This
2003	   infinite mapping will be a DSS option (Section 3.3) on the first new
2004	   packet, containing a Data Sequence Mapping that acts retroactively,
2005	   referring to the start of the subflow sequence number of the last
2006	   segment that was known to be delivered intact.  From that point
2007	   onwards data can be altered by a middlebox without affecting MPTCP,
2008	   as the data stream is equivalent to a regular, legacy TCP session.

2010	   In the rare case that the data is not contiguous (which could happen
2011	   when there is only one subflow but it is retransmitting data from a
2012	   subflow that has recently been uncleanly closed), the receiver MUST
2013	   close the subflow with an RST with MP_FAIL.  The receiver MUST
2014	   discard all data that follows the data sequence number specified.
2015	   The sender MAY attempt to create a new subflow belonging to the same
2016	   connection, and if it chooses to do so, SHOULD place the single
2017	   subflow immediately in single-path mode by setting an infinite data
2018	   sequence mapping.  This mapping will begin from the data-level
2019	   sequence number that was declared in the MP_FAIL.
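
   The cases above can be condensed into one decision.  The Python
   sketch below is illustrative only, with a hypothetical function name
   and return labels, and it assumes the host already knows whether all
   unacknowledged data in flight is contiguous.

```python
def on_checksum_failure(subflow_count, in_flight_contiguous):
    """Choose the reaction to a DSS checksum failure on a subflow."""
    if subflow_count < 1:
        raise ValueError("a connection has at least one subflow")
    if subflow_count > 1:
        # The subflow cannot be recovered; data is resent elsewhere.
        return "rst_with_mp_fail"
    if in_flight_contiguous:
        # Keep the subflow: MP_FAIL on the ACK, then an infinite mapping.
        return "ack_with_mp_fail_then_infinite_mapping"
    # Single subflow, but the data in flight is not contiguous.
    return "rst_with_mp_fail"
```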

2021	   After a sender signals an infinite mapping it MUST only use subflow
2022	   ACKs to clear its send buffer.  This is because Data ACKs may become
2023	   misaligned with the subflow ACKs when middleboxes insert or delete
2024	   data.  The receiver SHOULD stop generating Data ACKs after it receives
2025	   an infinite mapping.

2027	   When a connection has fallen back, only one subflow can send data,
2028	   otherwise the receiver would not know how to reorder the data.  In
2029	   practice, this means that all MPTCP subflows will have to be
2030	   terminated except one.  Once MPTCP falls back to regular TCP, it MUST
2031	   NOT revert to MPTCP later in the connection.

2033	   It should be emphasised that we are not attempting to prevent the use
2034	   of middleboxes that want to adjust the payload.  An MPTCP-aware
2035	   middlebox could provide such functionality by also rewriting
2036	   checksums.

2038	3.7.  Error Handling

2040	   In addition to the fallback mechanism as described above, the
2041	   standard classes of TCP errors may need to be handled in an MPTCP-
2042	   specific way.  Note that changing semantics - such as the relevance
2043	   of an RST - are covered in Section 4.  Where possible, we do not want
2044	   to deviate from regular TCP behaviour.

2046	   The following list covers possible errors and the appropriate MPTCP
2047	   behaviour:

2049	   o  Unknown token in MP_JOIN (or HMAC failure in MP_JOIN ACK, or
2050	      missing MP_JOIN in SYN/ACK response): send RST (analogous to TCP's
2051	      behaviour on an unknown port)

2053	   o  DSN out of Window (during normal operation): drop the data, do not
2054	      send Data ACKs.

2056	   o  Remove request for unknown address ID: silently ignore

2058	3.8.  Heuristics

2060	   There are a number of heuristics that are needed for performance or
2061	   deployment but which are not required for protocol correctness.  In
2062	   this section we detail such heuristics.  Note that discussion of
2063	   buffering and certain sender and receiver window behaviours is
2064	   presented in Section 3.3.4 and Section 3.3.5, as well as
2065	   retransmission in Section 3.3.6.

2067	3.8.1.  Port Usage

2069	   Under typical operation an MPTCP implementation SHOULD use the same
2070	   ports as already in use.  In other words, the destination port of a
2071	   SYN containing an MP_JOIN option SHOULD be the same as the remote
2072	   port of the first subflow in the connection.  The local port for such
2073	   SYNs SHOULD also be the same as for the first subflow (and as such,
2074	   an implementation SHOULD reserve ephemeral ports across all local IP
2075	   addresses), although there may be cases where this is infeasible.
2076	   This strategy is intended to maximize the probability of the SYN
2077	   being permitted by a firewall or NAT at the recipient and to avoid
2078	   confusing any network monitoring software.

2080	   There may also be cases, however, where the passive opener wishes to
2081	   signal to the other host that a specific port should be used, and
2082	   this facility is provided in the Add Address option as documented in
2083	   Section 3.4.1.  It is therefore feasible to allow multiple subflows
2084	   between the same two addresses but using different port pairs, and
2085	   such a facility could be used to allow load balancing within the
2086	   network based on 5-tuples (e.g. some ECMP implementations [7]).

2088	3.8.2.  Delayed Subflow Start

2090	   Many TCP connections are short-lived and consist only of a few
2091	   segments, and so the overheads of using MPTCP outweigh any benefits.
2092	   A heuristic is required, therefore, to decide when to start using
2093	   additional subflows in an MPTCP connection.  We expect that
2094	   experience gathered from deployments will provide further guidance on
2095	   this, and will be affected by particular application characteristics
2096	   (which are likely to change over time).  However, a suggested
2097	   general-purpose heuristic that an implementation MAY choose to employ
2098	   is as follows.  Results from experimental deployments are needed in
2099	   order to verify the correctness of this proposal.

2101	   If a host has data buffered for its peer (which implies that the
2102	   application has received a request for data), the host opens one
2103	   subflow for each initial window's worth of data that is buffered.

2105	   Consideration should also be given to limiting the rate of adding new
2106	   subflows, as well as limiting the total number of subflows open for a
2107	   particular connection.  A host may choose to vary these values based
2108	   on its load or knowledge of traffic and path characteristics.
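
   One possible reading of this heuristic, including a cap on the total
   number of subflows, is sketched below in Python.  It is an
   illustration under stated assumptions rather than a recommendation:
   the function and parameter names are hypothetical, the buffered data
   is taken to set a target for the total number of subflows (never
   fewer than the one already open), and rate limiting is not modelled.

```python
def additional_subflows_to_open(buffered_bytes, initial_window_bytes,
                                open_subflows, max_subflows):
    """Return how many new subflows to start under the heuristic."""
    if initial_window_bytes <= 0:
        raise ValueError("initial window must be positive")
    # One subflow per initial window's worth of buffered data.
    target = max(1, buffered_bytes // initial_window_bytes)
    # Respect the per-connection limit on the total number of subflows.
    target = min(target, max_subflows)
    return max(0, target - open_subflows)
```

   For example, with an assumed initial window of 14600 octets, 50000
   buffered octets, one open subflow, and a limit of four, the function
   returns 2.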

2110	   Note that this heuristic alone is probably insufficient.  Traffic for
2111	   many common applications, such as downloads, is highly asymmetric and
2112	   the host that is multihomed may well be the client which will never
2113	   fill its buffers, and thus never use MPTCP.  Advanced APIs that allow
2114	   an application to signal its traffic requirements would aid in these
2115	   decisions.

2117	   An additional time-based heuristic could be applied, opening
2118	   additional subflows after a given period of time has passed.  This
2119	   would alleviate the above issue, and also provide resilience for low-
2120	   bandwidth but long-lived applications.

2122	   This section has shown some of the considerations that an implementer
2123	   should give when developing MPTCP heuristics, but is not intended to
2124	   be prescriptive.

2126	3.8.3.  Failure Handling

2128	   Requirements for MPTCP's handling of unexpected signals have been
2129	   given in Section 3.7.  There are other failure cases, however, where
2130	   a host can choose appropriate behaviour.

2132	   For example, Section 3.1 suggests that a host SHOULD fall back to
2133	   trying regular TCP SYNs after one or more failures of MPTCP SYNs for
2134	   a connection.  A host may keep a system-wide cache of such
2135	   information, so that it can back off from using MPTCP, firstly for
2136	   that particular destination host, and eventually on a whole
2137	   interface, if MPTCP connections continue failing.

2139	   Another failure could occur when the MP_JOIN handshake fails.
2140	   Section 3.7 specifies that an incorrect handshake MUST lead to the
2141	   subflow being closed with a RST.  A host operating an active
2142	   intrusion detection system may choose to start blocking MP_JOIN
2143	   packets from the source host if multiple failed MP_JOIN attempts are
2144	   seen.  From the connection initiator's point of view, if an MP_JOIN
2145	   fails, it SHOULD NOT attempt to connect to the same IP address and
2146	   port during the lifetime of the connection, unless the other host
2147	   refreshes the information with another ADD_ADDR option.  Note that
2148	   the ADD_ADDR option is informational only, and does not guarantee the
2149	   other host will attempt a connection.

2151	   In addition, an implementation may learn over a number of connections
2152	   that certain interfaces or destination addresses consistently fail
2153	   and may default to not trying to use MPTCP for these.  Behaviour
2154	   could also be learnt for particularly badly performing subflows or
2155	   subflows that regularly fail during use, in order to temporarily
2156	   choose not to use these paths.

2158	4.  Semantic Issues

2160	   In order to support multipath operation, the semantics of some TCP
2161	   components have changed.  To aid clarity, this section collects these
2162	   semantic changes as a reference.

2164	   Sequence Number:  The (in-header) TCP sequence number is specific to
2165	      the subflow.  To allow the receiver to reorder application data,
2166	      an additional data-level sequence space is used.  In this data-
2167	      level sequence space, the initial SYN and the final DATA_FIN
2168	      occupy one octet of sequence space.  There is an explicit mapping
2169	      of data sequence space to subflow sequence space, which is
2170	      signalled through TCP options in data packets.

2172	   ACK:  The ACK field in the TCP header acknowledges only the subflow
2173	      sequence number, not the data-level sequence space.
2174	      Implementations SHOULD NOT attempt to infer a data-level
2175	      acknowledgement from the subflow ACKs.  This separates subflow-
2176	      and connection-level processing at an end host.

2178	   Duplicate ACK:  A duplicate ACK that includes any MPTCP signaling
2179	      (with the exception of the DSS option) MUST NOT be treated as a
2180	      signal of congestion.  To limit the chances of non-MPTCP-aware
2181	      entities mistakenly interpreting duplicate ACKs as a signal of
2182	      congestion, MPTCP SHOULD NOT send more than two duplicate ACKs
2183	      containing (non-DSS) MPTCP signals in a row.

2185	   Receive Window:  The receive window in the TCP header indicates the
2186	      amount of free buffer space for the whole data-level connection
2187	      (as opposed to for this subflow) that is available at the
2188	      receiver.  This is the same semantics as regular TCP, but to
2189	      maintain these semantics the receive window must be interpreted at
2190	      the sender as relative to the sequence number given in the
2191	      DATA_ACK rather than the subflow ACK in the TCP header.  In this
2192	      way the original flow control role is preserved.  Note that some
2193	      middleboxes may change the receive window, and so a host SHOULD
2194	      use the maximum value of those recently seen on the constituent
2195	      subflows for the connection-level receive window, and also needs
2196	      to maintain a subflow-level window for subflow-level processing.

2198	   FIN:  The FIN flag in the TCP header applies only to the subflow it
2199	      is sent on, not to the whole connection.  For connection-level FIN
2200	      semantics, the DATA_FIN option is used.

2202	   RST:  The RST flag in the TCP header applies only to the subflow it
2203	      is sent on, not to the whole connection.  The MP_FASTCLOSE option
2204	      provides the fast-close functionality of a RST at the MPTCP
2205	      connection level.

2207	   Address List:  Address list management (i.e. knowledge of the local
2208	      and remote hosts' lists of available IP addresses) is handled on a
2209	      per-connection basis (as opposed to per-subflow, per host, or per
2210	      pair of communicating hosts).  This permits the application of
2211	      per-connection local policy.  Adding an address to one connection
2212	      (either explicitly through an Add Address message, or implicitly
2213	      through a Join) has no implication for other connections between
2214	      the same pair of hosts.

2216	   5-tuple:  The 5-tuple (protocol, local address, local port, remote
2217	      address, remote port) presented by kernel APIs to the application
2218	      layer in a non-multipath-aware application is that of the first
2219	      subflow, even if the subflow has since been closed and removed
2220	      from the connection.  This decision, and other related API issues,
2221	      are discussed in more detail in [6].
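
   The Receive Window entry above can be illustrated with a short
   sketch.  It assumes the sender keeps the most recent window seen on
   each subflow in a dictionary and, for brevity, ignores sequence
   number wrap-around; the function names are hypothetical.

```python
def connection_window(recent_windows):
    """Largest receive window recently seen on any subflow, in octets."""
    return max(recent_windows.values())


def may_send(dsn, length, data_ack, recent_windows):
    """True if data [dsn, dsn + length) fits within the connection-level window.

    The window is taken relative to the DATA_ACK, not the subflow ACK.
    """
    right_edge = data_ack + connection_window(recent_windows)
    return dsn + length <= right_edge
```

   Taking the maximum guards against a middlebox that shrinks the
   window on one subflow; the subflow-level window still has to be
   tracked separately for subflow-level processing.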

2223	5.  Security Considerations

2225	   As identified in [8], the addition of multipath capability to TCP
2226	   will bring with it a number of new classes of threat.  In order to
2227	   prevent these, [2] presents a set of requirements for a security
2228	   solution for MPTCP.  The fundamental goal is for the security of
2229	   MPTCP to be "no worse" than regular TCP today, and the key security
2230	   requirements are:

2232	   o  Provide a mechanism to confirm that the parties in a subflow
2233	      handshake are the same as in the original connection setup.

2235	   o  Provide verification that the peer can receive traffic at a new
2236	      address before using it as part of a connection.

2238	   o  Provide replay protection, i.e. ensure that a request to add/
2239	      remove a subflow is 'fresh'.

2241	   In order to achieve these goals, MPTCP includes a hash-based
2242	   handshake algorithm documented in Section 3.1 and Section 3.2.

2244	   The security of the MPTCP connection hangs on the use of keys that
2245	   are shared once at the start of the first subflow, and are never sent
2246	   again over the network (unless used in the fast close mechanism,
2247	   Section 3.5).  To ease demultiplexing whilst not giving away any
2248	   cryptographic material, future subflows use a truncated cryptographic
2249	   hash of this key as the connection identification "token".  The keys
2250	   are concatenated and used as keys for creating Hash-based Message
2251	   Authentication Codes (HMAC) used on subflow setup, in order to verify
2252	   that the parties in the handshake are the same as in the original
2253	   connection setup.  This also provides verification that the peer can
2254	   receive traffic at this new address.  Replay attacks would still be
2255	   possible when only keys are used, and therefore the handshakes use
2256	   single-use random numbers (nonces) at both ends - this ensures the
2257	   HMAC will never be the same on two handshakes.  Guidance on
2258	   generating random numbers suitable for use as keys is given in [13]
2259	   and discussed in Section 3.1.
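
   As a non-normative sketch of the mechanism just described, the token
   and HMAC could be computed as shown below.  The sketch assumes the
   SHA-1-based construction of Section 3.1 and Section 3.2: the token
   is the most significant 32 bits of the SHA-1 hash of the key, and
   the HMAC is keyed with the two keys concatenated (local key first)
   over the two nonces concatenated (local nonce first).  The function
   names are hypothetical, and truncation of the HMAC for transmission
   is not shown.

```python
import hashlib
import hmac


def token_from_key(key: bytes) -> bytes:
    """Truncated hash of a 64-bit key, used as the connection token."""
    return hashlib.sha1(key).digest()[:4]


def join_hmac(local_key: bytes, remote_key: bytes,
              local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """HMAC-SHA1 sent by the local host in an MP_JOIN handshake."""
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha1).digest()
```

   Because the nonces form the message, two handshakes on the same
   connection produce different HMACs even though the keys are
   unchanged.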

2261	   The use of crypto capability bits in the initial connection handshake
2262	   to negotiate use of a particular algorithm allows the deployment of
2263	   additional crypto mechanisms in the future.  Note that this would be
2264	   susceptible to bid-down attacks only if the attacker was on-path (and
2265	   thus would be able to modify the data anyway).  The security
2266	   mechanism presented in this draft should therefore protect against
2267	   all forms of flooding and hijacking attacks discussed in [8].

2269	   During normal operation, regular TCP protection mechanisms (such as
2270	   ensuring sequence numbers are in-window) will provide the same level
2271	   of protection against attacks on individual TCP subflows as exists for
2272	   regular TCP today.  Implementations will introduce additional buffers
2273	   compared to regular TCP, to reassemble data at the connection level.
2274	   The application of window sizing will minimize the risk of denial-of-
2275	   service attacks consuming resources.

2277	   As discussed in Section 3.4.1, a host may advertise its private
2278	   addresses, but these might point to different hosts in the receiver's
2279	   network.  The MP_JOIN handshake (Section 3.2) will ensure that this
2280	   does not succeed in setting up a subflow to the incorrect host.
2281	   However, it could still create unwanted TCP handshake traffic.  This
2282	   feature of MPTCP could be a target for denial-of-service exploits,
2283	   with malicious participants in MPTCP connections encouraging the
2284	   recipient to target other hosts in the network.  Therefore,
2285	   implementations should consider heuristics (Section 3.8) at both the
2286	   sender and receiver to reduce the impact of this.

2288	   A small security risk could theoretically exist with key reuse, but
2289	   in order to accomplish a replay attack, both the sender and receiver
2290	   keys, and the sender and receiver random numbers, in the MP_JOIN
2291	   handshake (Section 3.2) would have to match.

2293	   This specification defines a "medium" security solution that meets
2294	   the criteria specified at the start of this section and the threat
2295	   analysis ([8]).  However, since attacks only ever get worse, it is
2296	   likely that a future standards-track version of MPTCP would need to
2297	   support stronger security.  There are several ways the security of
2298	   MPTCP could potentially be improved; some of these would be
2299	   compatible with MPTCP as defined in this document, whilst others may
2300	   not be.  For now, the best approach is to get experience with the
2301	   current approach, establish what might work and check that the threat
2302	   analysis is still accurate.

2304	   Possible ways of improving MPTCP security could include:

2306	   o  defining a new MPTCP cryptographic algorithm, as negotiated in
2307	      MP_CAPABLE.  A sub-case could be to include an additional
2308	      deployment assumption, such as stateful servers, in order to allow
2309	      a more powerful algorithm to be used.

2311	   o  defining how to secure data transfer with MPTCP, whilst not
2312	      changing the signalling part of the protocol.

2314	   o  defining security that requires more option space, perhaps in
2315	      conjunction with a "long options" proposal for extending the TCP
2316	      options space (such as those surveyed in [19]), or perhaps
2317	      building on the current approach with a second stage of MPTCP-
2318	      option-based security.

2320	   o  re-visiting the working group's decision to exclusively use TCP
2321	      options for MPTCP signalling, and instead look at also making use
2322	      of the TCP payloads.

2324	   MPTCP has been designed with several methods available to indicate a
2325	   new security mechanism, including:

2327	   o  available flags in MP_CAPABLE (Figure 4);

2329	   o  available subtypes in the MPTCP option (Figure 3);

2330	   o  the version field in MP_CAPABLE (Figure 4).

2332	6.  Interactions with Middleboxes

2334	   Multipath TCP was designed to be deployable in the present world.
2335	   Its design takes into account "reasonable" existing middlebox
2336	   behaviour.  In this section we outline a few representative
2337	   middlebox-related failure scenarios and show how multipath TCP
2338	   handles them.  We then list the design decisions made in MPTCP to
2339	   accommodate the different middleboxes.

2341	   A primary concern is our use of a new TCP option.  Middleboxes should
2342	   forward packets with unknown options unchanged, yet there are some
2343	   that don't.  These we expect will either strip options and pass the
2344	   data, drop packets with new options, copy the same option into
2345	   multiple segments (e.g. when doing segmentation) or drop options
2346	   during segment coalescing.

2348	   MPTCP uses a single new TCP option "Kind", and all message types are
2349	   defined by "subtype" values (see Section 8).  This should reduce the
2350	   chances of only some types of MPTCP options being passed; instead,
2351	   the characteristics most likely to affect treatment are the path
2352	   taken and the presence of the SYN flag.

2354	   MPTCP SYN packets on the first subflow of a connection contain the
2355	   MP_CAPABLE option (Section 3.1).  If this is dropped, MPTCP SHOULD
2356	   fall back to regular TCP.  If packets with the MP_JOIN option
2357	   (Section 3.2) are dropped, the paths will simply not be used.

2359	   If a middlebox strips options but otherwise passes the packets
2360	   unchanged, MPTCP will behave safely.  If an MP_CAPABLE option is
2361	   dropped on either the outgoing or the return path, the initiating
2362	   host can fall back to regular TCP, as illustrated in Figure 16 and
2363	   discussed in Section 3.1.

2365	   Subflow SYNs contain the MP_JOIN option.  If this option is stripped
2366	   on the outgoing path the SYN will appear to be a regular SYN to host
2367	   B.  Depending on whether there is a listening socket on the target
2368	   port, host B will reply either with SYN/ACK or RST (subflow
2369	   connection fails).  When host A receives the SYN/ACK it sends a RST
2370	   because the SYN/ACK does not contain the MP_JOIN option and its
2371	   token.  Either way, the subflow setup fails, but otherwise does not
2372	   affect the MPTCP connection as a whole.

2374	        Host A                             Host B
2375	         |              Middlebox M            |
2376	         |                   |                 |
2377	         |  SYN(MP_CAPABLE)  |        SYN      |
2378	         |-------------------|---------------->|
2379	         |                SYN/ACK              |
2380	         |<------------------------------------|
2381	     a) MP_CAPABLE option stripped on outgoing path

2383	       Host A                               Host B
2384	         |            SYN(MP_CAPABLE)          |
2385	         |------------------------------------>|
2386	         |             Middlebox M             |
2387	         |                 |                   |
2388	         |    SYN/ACK      |SYN/ACK(MP_CAPABLE)|
2389	         |<----------------|-------------------|
2390	     b) MP_CAPABLE option stripped on return path

2392	   Figure 16: Connection Setup with Middleboxes that Strip Options from
2393	                                  Packets

2395	   We now examine data flow with MPTCP, assuming the flow is correctly
2396	   set up, which implies the options in the SYN packets were allowed
2397	   through by the relevant middleboxes.  If options are allowed through
2398	   and there is no resegmentation or coalescing of TCP segments,
2399	   multipath TCP flows can proceed without problems.

2401	   The case when options get stripped on data packets has been discussed
2402	   in Section 3.6 (Fallback).  If a fraction of options are stripped,
2403	   behaviour is not deterministic.  If some Data Sequence Mappings are
2404	   lost, the connection can continue so long as mappings exist for the
2405	   subflow-level data (e.g. if multiple maps have been sent that
2406	   reinforce each other).  If some subflow-level space is left unmapped,
2407	   however, the subflow is treated as broken and is closed, through the
2408	   process described in Section 3.6.  MPTCP should survive with a loss
2409	   of some Data ACKs, but performance will degrade as the fraction of
2410	   stripped options increases.  We do not expect such cases to appear in
2411	   practice, though: most middleboxes will either strip all options or
2412	   let them all through.

2414	   We end this section with a list of middlebox classes, their behaviour
2415	   and the elements in the MPTCP design that allow operation through
2416	   such middleboxes.  Issues surrounding dropping packets with options
2417	   or stripping options were discussed above, and are not included here:

2419	   o  NATs [20] (Network Address (and Port) Translators) change the
2420	      source address (and often source port) of packets.  This means
2421	      that a host will not know its public-facing address for signaling
2422	      in MPTCP.  Therefore, MPTCP permits implicit address addition via
2423	      the MP_JOIN option, and the handshake mechanism ensures that
2424	      connection attempts to private addresses [17] do not cause
2425	      problems.  Explicit address removal is signalled by Address ID, so
2426	      that no knowledge of the source address is required.

2428	   o  Performance Enhancing Proxies (PEPs) [21] might pro-actively ACK
2429	      data to increase performance.  MPTCP, however, relies on accurate
2430	      congestion control signals from the end host, and non-MPTCP-aware
2431	      PEPs will not be able to provide such signals.  MPTCP will
2432	      therefore fall back to single-path TCP, or close the problematic
2433	      subflow (see Section 3.6).

2435	   o  Traffic Normalizers [22] may not allow holes in sequence numbers,
2436	      and may cache packets and retransmit the same data.  MPTCP looks
2437	      like standard TCP on the wire, and will not retransmit different
2438	      data on the same subflow sequence number.  In the event of a
2439	      retransmission, the same data will be retransmitted on the
2440	      original TCP subflow even if it is additionally retransmitted at
2441	      the connection-level on a different subflow.

2443	   o  Firewalls [23] might perform initial sequence number randomization
2444	      on TCP connections.  MPTCP uses relative sequence numbers in data
2445	      sequence mapping to cope with this.  Like NATs, firewalls will not
2446	      permit many incoming connections, so MPTCP supports address
2447	      signaling (ADD_ADDR) so that a multi-addressed host can invite its
2448	      peer behind the firewall/NAT to connect out to its additional
2449	      interface.

2451	   o  Intrusion Detection Systems look out for traffic patterns and
2452	      content that could threaten a network.  Multipath will mean that
2453	      such data is potentially spread, so it is more difficult for an
2454	      IDS to analyse the whole traffic, and potentially increases the
2455	      risk of false positives.  However, for an MPTCP-aware IDS, tokens
2456	      can be read by such systems to correlate multiple subflows and re-
2457	      assemble for analysis.

2459	   o  Application level middleboxes such as content-aware firewalls may
2460	      alter the payload within a subflow, such as re-writing URIs in
2461	      HTTP traffic.  MPTCP will detect these using the checksum and
2462	      close the affected subflow(s), if there are other subflows that
2463	      can be used.  If all subflows are affected, MPTCP will fall back
2464	      to regular TCP, allowing such middleboxes to change the payload.
2465	      MPTCP-aware middleboxes should be able to adjust the payload and
2466	      MPTCP metadata in order not to break the connection.

2468	   In addition, all classes of middleboxes may affect TCP traffic in the
2469	   following ways:

2471	   o  TCP Options may be removed, or packets with unknown options
2472	      dropped, by many classes of middleboxes.  It is intended that the
2473	      initial SYN exchange, with a TCP Option, will be sufficient to
2474	      identify the path capabilities.  If such a packet does not get
2475	      through, MPTCP will end up falling back to regular TCP.

2477	   o  Segmentation/Coalescing (e.g.  TCP segmentation offloading) might
2478	      copy options between packets and might strip some options.
2479	      MPTCP's data sequence mapping includes the relative subflow
2480	      sequence number instead of using the sequence number in the
2481	      segment.  In this way, the mapping is independent of the packets
2482	      that carry it.

2484	   o  The Receive Window may be shrunk by some middleboxes at the
2485	      subflow level.  MPTCP will use the maximum window at data-level,
2486	      but will also obey subflow specific windows.
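
   The use of relative subflow sequence numbers in the data sequence
   mapping, mentioned above for firewalls and for segmentation, can be
   sketched as follows.  The type and function names are hypothetical,
   and the sketch assumes that the relative subflow sequence number is
   counted from the subflow's initial sequence number as observed by
   the receiver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSequenceMapping:
    dsn: int           # data sequence number of the first mapped octet
    relative_ssn: int  # subflow sequence number relative to the subflow ISN
    length: int        # number of octets covered by the mapping


def dsn_for_octet(mapping, subflow_isn, absolute_seq, seq_modulus=2**32):
    """Data sequence number of the octet at `absolute_seq`, or None if unmapped."""
    relative = (absolute_seq - subflow_isn) % seq_modulus
    offset = relative - mapping.relative_ssn
    if not 0 <= offset < mapping.length:
        return None
    return mapping.dsn + offset
```

   Since only the offset from the initial sequence number is used, the
   result is the same whether or not a middlebox has rewritten the
   absolute sequence numbers, and it does not depend on how the octets
   are split across segments.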

2488	7.  Acknowledgments

2490	   The authors were originally supported by Trilogy
2491	   (http://www.trilogy-project.org), a research project (ICT-216372)
2492	   partially funded by the European Community under its Seventh
2493	   Framework Program.

2495	   Alan Ford was originally supported by Roke Manor Research.

2497	   The authors gratefully acknowledge significant input into this
2498	   document from Sebastien Barre, Christoph Paasch, and Andrew McDonald.

2500	   The authors also wish to acknowledge reviews and contributions from
2501	   Iljitsch van Beijnum, Lars Eggert, Marcelo Bagnulo, Robert Hancock,
2502	   Pasi Sarolahti, Toby Moncaster, Philip Eardley, Sergio Lembo,
2503	   Lawrence Conroy, Yoshifumi Nishida, Bob Briscoe, Stein Gjessing,
2504	   Andrew McGregor, Georg Hampel, Anumita Biswas, Wes Eddy, Alexey
2505	   Melnikov, Francis Dupont, Adrian Farrel, Barry Leiba, Robert Sparks,
2506	   Sean Turner, Stephen Farrell, and Martin Stiemerling.

2508	8.  IANA Considerations

2510	   This document defines a new TCP option for MPTCP, assigned a value of
2511	   30 (decimal) from the TCP Option space.  This value is the value of
2512	   "Kind" as seen in all MPTCP options in this document.  This value is
2513	   defined as:

2515	            +------+--------+---------------+-----------------+
2516	            | Kind | Length |    Meaning    |    Reference    |
2517	            +------+--------+---------------+-----------------+
2518	            |  30  |    N   | Multipath TCP | (This document) |
2519	            +------+--------+---------------+-----------------+

2521	                     Table 1: TCP Option Kind Numbers

2523	   This document also defines a four-bit subtype field, for which IANA
2524	   is to create and maintain a new sub-registry entitled "MPTCP option
2525	   subtype values" under the TCP Parameters registry.  Initial values
2526	   for the MPTCP option subtype registry are given below; future
2527	   assignments are to be defined by Standards Action as defined by [24].
2528	   Assignments consist of the MPTCP subtype's symbolic name and its
2529	   associated value, as per the following table.

2531	   +--------------+----------------------------+---------------+-------+
2532	   |    Symbol    |            Name            |   Reference   | Value |
2533	   +--------------+----------------------------+---------------+-------+
2534	   |  MP_CAPABLE  |      Multipath Capable     |  Section 3.1  |  0x0  |
2535	   |    MP_JOIN   |       Join Connection      |  Section 3.2  |  0x1  |
2536	   |      DSS     | Data Sequence Signal (Data |  Section 3.3  |  0x2  |
2537	   |              |    ACK and Data Sequence   |               |       |
2538	   |              |          Mapping)          |               |       |
2539	   |   ADD_ADDR   |         Add Address        | Section 3.4.1 |  0x3  |
2540	   |  REMOVE_ADDR |       Remove Address       | Section 3.4.2 |  0x4  |
2541	   |    MP_PRIO   |   Change Subflow Priority  | Section 3.3.8 |  0x5  |
2542	   |    MP_FAIL   |          Fallback          |  Section 3.6  |  0x6  |
2543	   | MP_FASTCLOSE |         Fast Close         |  Section 3.5  |  0x7  |
2544	   +--------------+----------------------------+---------------+-------+

2546	                      Table 2: MPTCP Option Subtypes

2548	   The value 0xf is reserved for Private Use within controlled testbeds.
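
   For illustration, a receiver might classify an MPTCP option using
   the values in Table 1 and Table 2 as sketched below.  The sketch
   assumes, following Figure 3, that the subtype occupies the upper
   four bits of the octet after Kind and Length; the function name is
   hypothetical.

```python
MPTCP_KIND = 30

SUBTYPES = {
    0x0: "MP_CAPABLE",
    0x1: "MP_JOIN",
    0x2: "DSS",
    0x3: "ADD_ADDR",
    0x4: "REMOVE_ADDR",
    0x5: "MP_PRIO",
    0x6: "MP_FAIL",
    0x7: "MP_FASTCLOSE",
    0xF: "Private Use",
}


def mptcp_subtype(option: bytes):
    """Symbolic subtype of one TCP option, or None if it is not a valid MPTCP option."""
    if len(option) < 3 or option[0] != MPTCP_KIND:
        return None
    if option[1] != len(option):
        return None  # Length field disagrees with the octets supplied
    return SUBTYPES.get(option[2] >> 4, "unassigned")
```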

2550	   This document also requests that IANA creates another sub-registry,
2551	   "MPTCP handshake algorithms" under the TCP Parameters registry, based
2552	   on the flags in MP_CAPABLE (Section 3.1).  The flags consist of eight
2553	   bits, labelled "A" through "H", and this document assigns the bits as
2554	   follows, where "(available)" means that the bit is available for
2555	   future assignment:

2557	       +----------+-------------------+----------------------------+
2558	       | Flag Bit |      Meaning      |          Reference         |
2559	       +----------+-------------------+----------------------------+
2560	       |     A    | Checksum required | This document, Section 3.1 |
2561	       |     B    |   Extensibility   | This document, Section 3.1 |
2562	       |     C    |    (available)    |                            |
2563	       |     D    |    (available)    |                            |
2564	       |     E    |    (available)    |                            |
2565	       |     F    |    (available)    |                            |
2566	       |     G    |    (available)    |                            |
2567	       |     H    |     HMAC-SHA1     | This document, Section 3.2 |
2568	       +----------+-------------------+----------------------------+

2570	                    Table 3: MPTCP Handshake Algorithms

2572	   Note that the meanings of bits C through H can be dependent upon bit
2573	   B, depending on how Extensibility is defined in future
2574	   specifications; see Section 3.1 for more information.

2576	   Future assignments in this registry are also to be defined by
2577	   Standards Action as defined by [24].  Assignments consist of the
2578	   value of the flags, a symbolic name for the algorithm, and a
2579	   reference to its specification.
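
   The flag assignments in Table 3 could be decoded as in the following
   sketch.  It assumes that bit "A" is the most significant bit of the
   flags octet and bit "H" the least significant, as described in
   Section 3.1; the function name is hypothetical.

```python
FLAG_MEANINGS = {
    "A": "Checksum required",
    "B": "Extensibility",
    "H": "HMAC-SHA1",
}


def decode_flags(flags: int) -> dict:
    """Map each flag label ("A" to "H") to True if that bit is set."""
    return {label: bool(flags & (0x80 >> position))
            for position, label in enumerate("ABCDEFGH")}
```

   For example, a flags octet of 0x81 has bits A and H set, indicating
   that checksums are required and that HMAC-SHA1 is offered.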

2581	9.  References

2583	9.1.  Normative References

2585	   [1]   Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
2586	         September 1981.

2588	   [2]   Ford, A., Raiciu, C., Handley, M., Barre, S., and J. Iyengar,
2589	         "Architectural Guidelines for Multipath TCP Development",
2590	         RFC 6182, March 2011.

2592	   [3]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
2593	         Levels", BCP 14, RFC 2119, March 1997.

2595	   [4]   National Institute of Standards and Technology, "Secure Hash
2596	         Standard", Federal Information Processing Standard
2597	         (FIPS) 180-3, October 2008, <http://csrc.nist.gov/publications/
2598	         fips/fips180-3/fips180-3_final.pdf>.

2600	9.2.  Informative References

2602	   [5]   Raiciu, C., Handley, M., and D. Wischik, "Coupled Congestion
2603	         Control for Multipath Transport Protocols", RFC 6356,
2604	         October 2011.

2606	   [6]   Scharf, M. and A. Ford, "MPTCP Application Interface
2607	         Considerations", draft-ietf-mptcp-api-05 (work in progress),
2608	         April 2012.

2610	   [7]   Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",
2611	         RFC 2992, November 2000.

2613	   [8]   Bagnulo, M., "Threat Analysis for TCP Extensions for Multipath
2614	         Operation with Multiple Addresses", RFC 6181, March 2011.

2616	   [9]   Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing
2617	         for Message Authentication", RFC 2104, February 1997.

2619	   [10]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
2620	         Selective Acknowledgment Options", RFC 2018, October 1996.

2622	   [11]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
2623	         Control", RFC 5681, September 2009.

2625	   [12]  Gont, F., "Survey of Security Hardening Methods for
2626	         Transmission Control Protocol (TCP) Implementations",
2627	         draft-ietf-tcpm-tcp-security-03 (work in progress), March 2012.

2629	   [13]  Eastlake, D., Schiller, J., and S. Crocker, "Randomness
2630	         Requirements for Security", BCP 106, RFC 4086, June 2005.

2632	   [14]  Eastlake, D. and T. Hansen, "US Secure Hash Algorithms (SHA and
2633	         SHA-based HMAC and HKDF)", RFC 6234, May 2011.

2635	   [15]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions for
2636	         High Performance", RFC 1323, May 1992.

2638	   [16]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
2639	         Explicit Congestion Notification (ECN) to IP", RFC 3168,
2640	         September 2001.

2642	   [17]  Rekhter, Y., Moskowitz, R., Karrenberg, D., Groot, G., and E.
2643	         Lear, "Address Allocation for Private Internets", BCP 5,
2644	         RFC 1918, February 1996.

2646	   [18]  Braden, R., "Requirements for Internet Hosts - Communication
2647	         Layers", STD 3, RFC 1122, October 1989.

2649	   [19]  Ramaiah, A., "TCP option space extension",
2650	         draft-ananth-tcpm-tcpoptext-00 (work in progress), March 2012.

2652	   [20]  Srisuresh, P. and K. Egevang, "Traditional IP Network Address
2653	         Translator (Traditional NAT)", RFC 3022, January 2001.

2655	   [21]  Border, J., Kojo, M., Griner, J., Montenegro, G., and Z.
2656	         Shelby, "Performance Enhancing Proxies Intended to Mitigate
2657	         Link-Related Degradations", RFC 3135, June 2001.

2659	   [22]  Handley, M., Paxson, V., and C. Kreibich, "Network Intrusion
2660	         Detection: Evasion, Traffic Normalization, and End-to-End
2661	         Protocol Semantics", Usenix Security 2001, 2001, <http://
2662	         www.usenix.org/events/sec01/full_papers/handley/handley.pdf>.

2664	   [23]  Freed, N., "Behavior of and Requirements for Internet
2665	         Firewalls", RFC 2979, October 2000.

2667	   [24]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
2668	         Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

2670	Appendix A.  Notes on use of TCP Options

2672	   The TCP option space is limited due to the length of the Data Offset
2673	   field in the TCP header (4 bits), which defines the TCP header length
2674	   in 32 bit words.  With the standard TCP header being 20 bytes, this
2675	   leaves a maximum of 40 bytes for options, and many of these may
2676	   already be used by options such as timestamp and SACK.

2678	   We have performed a brief study on the commonly used TCP options in
2679	   SYN, data, and pure ACK packets, and found that there is enough room
2680	   to fit all the options we propose using in this draft.

2682	   SYN packets typically include MSS (4 bytes), window scale (3 bytes),
2683	   SACK permitted (2 bytes) and timestamp (10 bytes) options.  Together
2684	   these sum to 19 bytes.  Some operating systems appear to pad each
2685	   option up to a word boundary, thus using 24 bytes (a brief survey
2686	   suggests Windows XP and Mac OS X do this, whereas Linux does not).
2687	   Optimistically, therefore, we have 21 bytes spare, or 16 if it has to
2688	   be word-aligned.  In either case, however, the SYN versions of
2689	   Multipath Capable (12 bytes) and Join (12 or 16 bytes) options will
2690	   fit in this remaining space.
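
   The arithmetic in the preceding paragraph can be reproduced with a
   few lines of code.  The option sizes are those quoted above; the
   variable and function names are chosen for this example only.

```python
# The 4-bit Data Offset allows at most 15 words of 4 octets; 20 are the fixed header.
MAX_OPTION_SPACE = 15 * 4 - 20

SYN_OPTIONS = {"MSS": 4, "window scale": 3, "SACK permitted": 2, "timestamp": 10}


def word_aligned(octets, word=4):
    """Round an option length up to the next 32-bit word boundary."""
    return (octets + word - 1) // word * word


used = sum(SYN_OPTIONS.values())
used_padded = sum(word_aligned(n) for n in SYN_OPTIONS.values())
spare = MAX_OPTION_SPACE - used
spare_padded = MAX_OPTION_SPACE - used_padded
```

   With these figures, `spare` is 21 and `spare_padded` is 16, so a
   12-octet or 16-octet MPTCP option fits in either case.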

2692	   TCP data packets typically carry timestamp options in every packet,
2693	   taking 10 bytes (or 12 with padding).  That leaves 30 bytes (or 28,
2694	   if word-aligned).  The Data Sequence Signal (DSS) option varies in
2695	   length depending on whether the Data Sequence Mapping and DATA_ACK
2696	   are included, and whether the sequence numbers in use are 4 or 8
2697	   octets.  The maximum size of the DSS option is 28 bytes, so even that
2698	   will fit in the available space.  But unless a connection is both bi-
2699	   directional and high-bandwidth, it is unlikely that all that option
2700	   space will be required on each DSS option.
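
   The range of DSS option lengths can be sketched as follows.  The
   field sizes are assumptions taken from the DSS layout in
   Section 3.3: four octets of Kind, Length, subtype and flags; a Data
   ACK of 4 or 8 octets; and a mapping made up of a data sequence
   number of 4 or 8 octets, a 4-octet subflow sequence number, a
   2-octet length, and an optional 2-octet checksum.

```python
def dss_length(data_ack_octets=0, dsn_octets=0, checksum=True):
    """Length in octets of a DSS option; 0 means the field is absent."""
    if data_ack_octets not in (0, 4, 8) or dsn_octets not in (0, 4, 8):
        raise ValueError("sequence numbers are 4 or 8 octets, or absent (0)")
    length = 4                         # Kind, Length, subtype and flags
    length += data_ack_octets          # Data ACK, if present
    if dsn_octets:                     # Data Sequence Mapping, if present
        length += dsn_octets + 4 + 2   # DSN, subflow sequence number, length
        if checksum:
            length += 2
    return length
```

   The largest case, `dss_length(8, 8, True)`, gives the 28 octets
   quoted above, and a DATA_ACK alone needs at most 12.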

2702	   Within the DSS option, it is not necessary to include the Data
2703	   Sequence Mapping and DATA_ACK in each packet, and in many cases it
2704	   may be possible to alternate their presence (so long as the mapping
2705	   covers the data being sent in the following packet).  It would also
2706	   be possible to alternate between 4 and 8 byte sequence numbers in
2707	   each option.

2709	   On subflow and connection setup, an MPTCP option is also set on the
2710	   third packet (an ACK).  These are 20 bytes (for Multipath Capable)
2711	   and 24 bytes (for Join), both of which will fit in the available
2712	   option space.

2714	   Pure ACKs in TCP typically contain only timestamps (10 bytes).  Here,
2715	   multipath TCP typically needs to encode only the DATA_ACK (maximum of
2716	   12 bytes).  Occasionally ACKs will contain SACK information.
2717	   Depending on the number of lost packets, SACK may utilize the entire
2718	   option space.  If a DATA_ACK had to be included, then it is probably
2719	   necessary to reduce the number of SACK blocks to accommodate the
2720	   DATA_ACK.  However, the presence of the DATA_ACK is unlikely to be
2721	   necessary in a case where SACK is in use, since until at least some
2722	   of the SACK blocks have been retransmitted, the cumulative data-level
2723	   ACK will not be moving forward (or if it does, due to retransmissions
2724	   on another path, then that path can also be used to transmit the new
2725	   DATA_ACK).

2727	   The ADD_ADDR option can be between 8 and 22 bytes, depending on
2728	   whether IPv4 or IPv6 is used, and whether the port number is present
2729	   or not.  It is unlikely that such signaling would fit in a data
2730	   packet (although if there is space, it is fine to include it).  It is
2731	   recommended to use duplicate ACKs with no other payload or options in
2732	   order to transmit these rare signals.  Note this is the reason for
2733	   mandating that duplicate ACKs with MPTCP options are not taken as a
2734	   signal of congestion.
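
   The 8-to-22-byte range can be reproduced with a short Python sketch.
   It assumes that the option has four fixed octets (kind, length,
   subtype/IP version, and address ID) ahead of the address, followed
   by an optional 2-octet port; add_addr_length is a hypothetical
   helper name used only for illustration.

```python
import ipaddress

ADD_ADDR_FIXED_LEN = 4  # assumed: kind, length, subtype/IPVer, address ID
PORT_LEN = 2            # optional port field


def add_addr_length(address: str, port=None) -> int:
    """Return the ADD_ADDR option length in bytes for the given address."""
    addr = ipaddress.ip_address(address)
    addr_len = 4 if addr.version == 4 else 16
    return ADD_ADDR_FIXED_LEN + addr_len + (PORT_LEN if port is not None else 0)
```

   For example, an IPv4 address without a port gives 8 bytes, and an
   IPv6 address with a port gives 22 bytes, which are the two extremes
   quoted above.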

2736	   Finally, there are issues with reliable delivery of options.  Because
2737	   options may be sent on pure ACKs, they are not reliably delivered.
2738	   This is not an issue for DATA_ACK due to their cumulative nature, but
2739	   may be an issue for ADD_ADDR/REMOVE_ADDR options.  Here, it is
2740	   recommended to send these options redundantly (whether on multiple
2741	   paths, or on the same path on a number of ACKs - but interspersed
2742	   with data in order to avoid interpretation as congestion).  The cases
2743	   where options are stripped by middleboxes are discussed in Section 6.

2745	Appendix B.  Control Blocks

2747	   Conceptually, an MPTCP connection can be represented as an MPTCP
2748	   control block that contains several variables that track the progress
2749	   and the state of the MPTCP connection and a set of linked TCP control
2750	   blocks that correspond to the subflows that have been established.

2752	   RFC793 [1] specifies several state variables.  Whenever possible, we
2753	   reuse the same terminology as RFC793 to describe the state variables
2754	   that are maintained by MPTCP.

2756	B.1.  MPTCP Control Block

2758	   The MPTCP control block contains the following variables for each
2759	   connection.

2761	B.1.1.  Authentication and Metadata

2763	   Local.Token (32 bits):  This is the token chosen by the local host on
2764	      this MPTCP connection.  The token MUST be unique among all
2765	      established MPTCP connections and is generated from the local key.

2767	   Local.Key (64 bits):  This is the key sent by the local host on this
2768	      MPTCP connection.

2770	   Remote.Token (32 bits):  This is the token chosen by the remote host
2771	      on this MPTCP connection, generated from the remote key.

2773	   Remote.Key (64 bits):  This is the key chosen by the remote host on
2774	      this MPTCP connection.

2776	   MPTCP.Checksum (flag):  This flag is set to true if at least one of
2777	      the hosts has set the C bit in the MP_CAPABLE options exchanged
2778	      during connection establishment, and is set to false otherwise.
2779	      If this flag is set, the checksum must be computed in all DSS
2780	      options.
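
   The relationship between keys and tokens can be sketched in Python
   as follows.  The sketch assumes the derivation defined earlier in
   this document for the default algorithm, namely that the token is
   the most significant 32 bits of a SHA-1 hash of the 64-bit key.  The
   helper names token_from_key and choose_local_key are hypothetical,
   and retrying with a fresh random key is only one possible way to
   satisfy the uniqueness requirement on Local.Token.

```python
import hashlib
import os


def token_from_key(key: bytes) -> int:
    """Derive a 32-bit token from a 64-bit key (assumed: truncated SHA-1)."""
    if len(key) != 8:
        raise ValueError("MPTCP keys are 64 bits (8 bytes)")
    digest = hashlib.sha1(key).digest()
    # Keep the most significant 32 bits of the 160-bit digest.
    return int.from_bytes(digest[:4], "big")


def choose_local_key(tokens_in_use: set) -> bytes:
    """Pick a random key whose token does not collide with existing ones."""
    while True:
        key = os.urandom(8)
        if token_from_key(key) not in tokens_in_use:
            return key
```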

2782	B.1.2.  Sending Side

2784	   SND.UNA (64 bits):  This is the Data Sequence Number of the next byte
2785	      to be acknowledged, at the MPTCP connection level.  This variable
2786	      is updated upon reception of a DSS option containing a DATA_ACK.

2788	   SND.NXT (64 bits):  This is the Data Sequence Number of the next byte
2789	      to be sent.  SND.NXT is used to determine the value of the DSN in
2790	      the DSS option.

2792	   SND.WND (32 bits with RFC1323, 16 bits without):  This is the sending
2793	      window.  MPTCP maintains the sending window at the MPTCP
2794	      connection level and the same window is shared by all subflows.
2795	      All subflows use the MPTCP connection level SND.WND to compute the
2796	      SEG.WND value which is sent in each transmitted segment.

2798	B.1.3.  Receiving Side

2800	   RCV.NXT (64 bits):  This is the Data Sequence Number of the next byte
2801	      which is expected on the MPTCP connection.  This state variable is
2802	      modified upon reception of in-order data.  The value of RCV.NXT is
2803	      used to specify the DATA_ACK which is sent in the DSS option on
2804	      all subflows.

2806	   RCV.WND (32 bits with RFC1323, 16 bits otherwise):  This is the
2807	      connection-level receive window, which is the maximum of the
2808	      RCV.WND on all the subflows.

2810	B.2.  TCP Control Blocks

2812	   The MPTCP control block also contains a list of the TCP control
2813	   blocks that are associated with the MPTCP connection.

2815	   Note that the TCP control block on the TCP subflows does not contain
2816	   the RCV.WND and SND.WND state variables as these are maintained at
2817	   the MPTCP connection level and not at the subflow level.

2819	   Inside each TCP control block, the following state variables are
2820	   defined:

2822	B.2.1.  Sending Side

2824	   SND.UNA (32 bits):  This is the sequence number of the next byte to
2825	      be acknowledged on the subflow.  This variable is updated upon
2826	      reception of each TCP acknowledgement on the subflow.

2828	   SND.NXT (32 bits):  This is the sequence number of the next byte to
2829	      be sent on the subflow.  SND.NXT is used to set the value of
2830	      SEG.SEQ upon transmission of the next segment.

2832	B.2.2.  Receiving Side

2834	   RCV.NXT (32 bits):  This is the sequence number of the next byte
2835	      which is expected on the subflow.  This state variable is modified
2836	      upon reception of in-order segments.  The value of RCV.NXT is
2837	      copied to the SEG.ACK field of the next segments transmitted on
2838	      the subflow.

2840	   RCV.WND (32 bits with RFC1323, 16 bits otherwise):  This is the
2841	      subflow-level receive window which is updated with the window
2842	      field from the segments received on this subflow.
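
   The state described in this appendix could be laid out as in the
   following Python sketch.  The class and field names are illustrative
   assumptions, not a required implementation.  The sketch follows
   B.2.2 in keeping a per-subflow RCV.WND and derives the
   connection-level RCV.WND as the maximum over all subflows, as stated
   in B.1.3.  The validity check in on_data_ack (accept only a DATA_ACK
   that advances SND.UNA without passing SND.NXT, modulo 2^64) is also
   an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List

DSN_MOD = 2 ** 64  # data sequence numbers are 64 bits wide


@dataclass
class SubflowTCB:
    snd_una: int = 0  # 32-bit subflow sequence number awaiting acknowledgement
    snd_nxt: int = 0  # 32-bit subflow sequence number to send next
    rcv_nxt: int = 0  # 32-bit subflow sequence number expected next
    rcv_wnd: int = 0  # window field last received on this subflow


@dataclass
class MPTCPControlBlock:
    local_key: int
    remote_key: int
    local_token: int
    remote_token: int
    checksum: bool = False  # MPTCP.Checksum flag
    snd_una: int = 0        # 64-bit DSN of the next byte to be acknowledged
    snd_nxt: int = 0        # 64-bit DSN of the next byte to be sent
    snd_wnd: int = 0        # sending window shared by all subflows
    rcv_nxt: int = 0        # 64-bit DSN expected next at connection level
    subflows: List[SubflowTCB] = field(default_factory=list)

    @property
    def rcv_wnd(self) -> int:
        """Connection-level RCV.WND: maximum of the subflow RCV.WND values."""
        return max((sf.rcv_wnd for sf in self.subflows), default=0)

    def on_data_ack(self, data_ack: int) -> None:
        """Update SND.UNA from a DATA_ACK received in a DSS option."""
        advance = (data_ack - self.snd_una) % DSN_MOD
        outstanding = (self.snd_nxt - self.snd_una) % DSN_MOD
        if 0 < advance <= outstanding:
            self.snd_una = data_ack
```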

2844	Appendix C.  Finite State Machine

2846	   The diagram in Figure 17 shows the Finite State Machine for
2847	   connection-level closure.  This illustrates how the DATA_FIN
2848	   connection-level signal (indicated as the DFIN flag on a DATA_ACK)
2849	   interacts with subflow-level FINs, and permits "break-before-make"
2850	   handover between subflows.

2852	                              +---------+
2853	                              | M_ESTAB |
2854	                              +---------+
2855	                     M_CLOSE    |     |    rcv DATA_FIN
2856	                      -------   |     |    -------
2857	 +---------+       snd DATA_FIN /       \ snd DATA_ACK[DFIN] +---------+
2858	 |  M_FIN  |<-----------------           ------------------->| M_CLOSE |
2859	 | WAIT-1  |---------------------------                      |   WAIT  |
2860	 +---------+               rcv DATA_FIN \                    +---------+
2861	   | rcv DATA_ACK[DFIN]         ------- |                   M_CLOSE |
2862	   | --------------        snd DATA_ACK |                   ------- |
2863	   | CLOSE all subflows                 |              snd DATA_FIN |
2864	   V                                    V                           V
2865	 +-----------+              +-----------+                  +-----------+
2866	 |M_FINWAIT-2|              | M_CLOSING |                  | M_LAST-ACK|
2867	 +-----------+              +-----------+                  +-----------+
2868	   |              rcv DATA_ACK[DFIN] |           rcv DATA_ACK[DFIN] |
2869	   | rcv DATA_FIN     -------------- |               -------------- |
2870	   |  -------     CLOSE all subflows |           CLOSE all subflows |
2871	   | snd DATA_ACK[DFIN]              V            delete MPTCP PCB  V
2872	   \                          +-----------+                  +---------+
2873	     ------------------------>|M_TIME WAIT|----------------->| M_CLOSED|
2874	                              +-----------+                  +---------+
2875	                                         All subflows in CLOSED
2876	                                             ------------
2877	                                         delete MPTCP PCB

2879	          Figure 17: Finite State Machine for Connection Closure
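
   The transitions in Figure 17 can also be written as a lookup table,
   as in the following Python sketch.  The state names are normalized
   spellings of those in the figure, and the event and action strings
   are copied from the figure's labels; the table layout and the step
   and run helpers are illustrative assumptions rather than part of the
   protocol.

```python
# (state, event) -> (next state, actions), transcribed from Figure 17.
TRANSITIONS = {
    ("M_ESTAB", "M_CLOSE"): ("M_FIN_WAIT_1", ("snd DATA_FIN",)),
    ("M_ESTAB", "rcv DATA_FIN"): ("M_CLOSE_WAIT", ("snd DATA_ACK[DFIN]",)),
    ("M_FIN_WAIT_1", "rcv DATA_ACK[DFIN]"):
        ("M_FIN_WAIT_2", ("CLOSE all subflows",)),
    ("M_FIN_WAIT_1", "rcv DATA_FIN"): ("M_CLOSING", ("snd DATA_ACK",)),
    ("M_CLOSE_WAIT", "M_CLOSE"): ("M_LAST_ACK", ("snd DATA_FIN",)),
    ("M_FIN_WAIT_2", "rcv DATA_FIN"): ("M_TIME_WAIT", ("snd DATA_ACK[DFIN]",)),
    ("M_CLOSING", "rcv DATA_ACK[DFIN]"):
        ("M_TIME_WAIT", ("CLOSE all subflows",)),
    ("M_LAST_ACK", "rcv DATA_ACK[DFIN]"):
        ("M_CLOSED", ("CLOSE all subflows", "delete MPTCP PCB")),
    ("M_TIME_WAIT", "all subflows CLOSED"):
        ("M_CLOSED", ("delete MPTCP PCB",)),
}


def step(state, event):
    """Return (next_state, actions) for one event, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}") from None


def run(events, state="M_ESTAB"):
    """Apply a sequence of events and collect the resulting actions."""
    actions = []
    for event in events:
        state, emitted = step(state, event)
        actions.extend(emitted)
    return state, actions
```

   For example, run(["rcv DATA_FIN", "M_CLOSE", "rcv DATA_ACK[DFIN]"])
   follows the passive-close path on the right-hand side of the figure
   and ends in M_CLOSED.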

2881	Authors' Addresses

2883	   Alan Ford
2884	   Cisco
2885	   Ruscombe Business Park
2886	   Ruscombe, Berkshire  RG10 9NN
2887	   UK

2889	   Email: alanford@cisco.com

2891	   Costin Raiciu
2892	   University Politehnica of Bucharest
2893	   Splaiul Independentei 313
2894	   Bucharest
2895	   Romania

2897	   Email: costin.raiciu@cs.pub.ro

2899	   Mark Handley
2900	   University College London
2901	   Gower Street
2902	   London  WC1E 6BT
2903	   UK

2905	   Email: m.handley@cs.ucl.ac.uk

2907	   Olivier Bonaventure
2908	   Universite catholique de Louvain
2909	   Pl. Ste Barbe, 2
2910	   Louvain-la-Neuve  1348
2911	   Belgium

2913	   Email: olivier.bonaventure@uclouvain.be