idnits 2.17.1 

draft-ford-mptcp-multiaddressed-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 10, 2009) is 5394 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Unused Reference: 'I-D.van-beijnum-1e-mp-tcp-00' is defined on line
     1048, but no explicit reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC  793
     (Obsoleted by RFC 9293)


     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                  A. Ford
3	Internet-Draft                                       Roke Manor Research
4	Intended status: Experimental                                  C. Raiciu
5	Expires: January 11, 2010                                     M. Handley
6	                                               University College London
7	                                                                S. Barre
8	                                                Universite catholique de
9	                                                                 Louvain
10	                                                           July 10, 2009

12	     TCP Extensions for Multipath Operation with Multiple Addresses
13	                   draft-ford-mptcp-multiaddressed-01

15	Status of this Memo

17	   This Internet-Draft is submitted to IETF in full conformance with the
18	   provisions of BCP 78 and BCP 79.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF), its areas, and its working groups.  Note that
22	   other groups may also distribute working documents as Internet-
23	   Drafts.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   The list of current Internet-Drafts can be accessed at
31	   http://www.ietf.org/ietf/1id-abstracts.txt.

33	   The list of Internet-Draft Shadow Directories can be accessed at
34	   http://www.ietf.org/shadow.html.

36	   This Internet-Draft will expire on January 11, 2010.

38	Copyright Notice

40	   Copyright (c) 2009 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents in effect on the date of
45	   publication of this document (http://trustee.ietf.org/license-info).
46	   Please review these documents carefully, as they describe your rights
47	   and restrictions with respect to this document.

49	Abstract

51	   Often endpoints are connected by multiple paths, but the nature of
52	   TCP/IP restricts communications to a single path per socket.
53	   Resource usage within the network would be more efficient were these
54	   multiple paths able to be used concurrently.  This should enhance
55	   user experience through higher throughput and improved resilience to
56	   network failure.  This document presents extensions to TCP in order
57	   to transparently provide this multi-path functionality at the
58	   transport layer, if at least one endpoint is multi-addressed.

60	Table of Contents

62	   1.  Introduction
63	     1.1.  Motivation
64	     1.2.  Design Assumptions
65	     1.3.  Layered Representation
66	     1.4.  Operation Summary
67	     1.5.  Open Issues
68	     1.6.  Requirements Language
69	   2.  Terminology
70	   3.  Semantic Issues
71	   4.  MPTCP Protocol
72	     4.1.  Session Initiation
73	     4.2.  Starting a New Subflow
74	     4.3.  Address Knowledge Exchange (Path Management)
75	       4.3.1.  Adding Addresses
76	       4.3.2.  Remove Address
77	     4.4.  General MPTCP Operation
78	       4.4.1.  Receive Window Considerations
79	       4.4.2.  Congestion Control Considerations
80	       4.4.3.  Subflow Policy
81	       4.4.4.  Retransmissions
82	     4.5.  Closing a Connection
83	     4.6.  Error Handling
84	   5.  Congestion Control Coupling for MPTCP
85	   6.  Security Considerations
86	   7.  Interactions with Middleboxes
87	   8.  Interfaces
88	   9.  Acknowledgements
89	   10. IANA Considerations
90	   11. References
91	     11.1. Normative References
92	     11.2. Informative References
93	   Appendix A.  Functional Separation
94	     A.1.  Motivations
95	     A.2.  TCP Performance
96	     A.3.  Architecture overview
97	     A.4.  PM/MPS interface
98	   Appendix B.  Notes on use of TCP Options
99	   Appendix C.  Resync Packet
100	   Authors' Addresses

102	1.  Introduction

104	   Multipath TCP is a set of extensions for regular TCP [RFC0793] to
105	   allow one TCP connection to be spread across multiple paths.  This
106	   section describes the motivation behind the design of Multipath TCP
107	   (henceforth referred to as MPTCP), and gives a summary of its
108	   operation.  The following sections describe in greater detail the
109	   proposed extensions and the operation of the resulting protocol.

111	1.1.  Motivation

113	   As the Internet evolves, demands on Internet resources are ever-
114	   increasing, but often these resources (in particular, bandwidth)
115	   cannot be fully utilised due to protocol constraints on both the
116	   end-systems and within the network.  By the application of resource
117	   pooling [WISCHIK], these resources can be 'pooled' such that they
118	   appear as a single logical resource to the user.  Multipath TCP
119	   achieves resource pooling by combining multiple TCP sessions running
120	   over multiple paths, and presenting them as a single TCP connection
121	   to the application.

123	   This form of resource pooling brings two key benefits:

125	   o  To increase the resilience of the connectivity by providing
126	      multiple paths, protecting end hosts from the failure of one.

128	   o  To increase the efficiency of the resource usage, and thus
129	      increase the network capacity available to end hosts.

131	   The protocol presented in this document follows the same service
132	   model as TCP [RFC0793]: byte-oriented, in-order, reliable delivery.
133	   To have a deployable protocol, we impose the following "do no harm"
134	   philosophy: multipath TCP should behave no worse (throughput-wise)
135	   than running a single TCP connection over any of its paths, and using
136	   multiple paths should not harm users using single path TCP at shared
137	   bottlenecks.

139	1.2.  Design Assumptions

141	   In order to limit the potentially huge design space, the authors
142	   imposed two key constraints on the multipath TCP design presented in
143	   this document:

145	   o  It must be backwards-compatible with current, regular TCP, to
146	      increase its chances of deployment

148	   o  It can be assumed that one or both endpoints are multihomed and
149	      multiaddressed

151	   To simplify the design we assume that the presence of multiple
152	   addresses at an endpoint is sufficient to indicate the existence of
153	   multiple paths.  These paths need not be entirely disjoint: they may
154	   share one or many routers between them.  Even in such a situation
155	   making use of multiple paths is beneficial, improving resource
156	   utilisation and resilience to a subset of node failures.

158	   There are three aspects to the backwards-compatibility listed above:

160	   External Constraints:  The protocol must function through the vast
161	      majority of existing middleboxes such as NATs, firewalls and
162	      proxies, and as such must resemble existing TCP as far as possible
163	      on the wire.  Furthermore, the protocol must not assume the
164	      segments it sends on the wire arrive unmodified at the
165	      destination: they may be split or coalesced; options may be
166	      removed or duplicated.

168	   Application Constraints:  The protocol must be usable with no change
169	      to existing applications that use the standard TCP API (although
170	      it is reasonable that not all features would be available to such
171	      legacy applications).

173	   Fall-back:  The protocol should be able to fall back to standard TCP
174	      with no interference from the user, to be able to communicate with
175	      legacy hosts.

177	   Areas for further study:

179	   o  In theory, since this is purely a TCP extension, it should be
180	      possible to use MPTCP with both IPv4 and IPv6 on dual-stack hosts,
181	      thus having the additional possible benefit of aiding transition.

183	   o  Some features of the design presented here could be extended to
184	      work with non-multi-addressed hosts by using packet marking or
185	      partial multipath.

187	   o  Some features of the design presented here could be combined with
188	      mechanisms such as shim6 [I-D.ietf-shim6-proto].

190	   This draft also suggests a safe way to couple congestion controllers
191	   in a way that achieves the "do no harm" philosophy.  This is for
192	   completeness of our arguments: we expect this description to evolve
193	   into a new companion Internet-Draft.

195	1.3.  Layered Representation

197	   MPTCP operates at the transport layer, and its existence aims to be
198	   transparent to both higher and lower layers.  It is a set of
199	   additional features on top of standard TCP, and as such MPTCP is
200	   designed to be usable by legacy applications with no changes.  A
201	   possible implementation would be for such a feature to be a system-
202	   wide setting: "Use multipath TCP by default?  Y/N".  Multipath-aware
203	   applications would be able to use an extended sockets API to have
204	   further influence on the behaviour of MPTCP.  Figure 1 illustrates
205	   this architecture.

207	                                      +-------------------------------+
208	                                      |           Application         |
209	      +---------------+               +-------------------------------+
210	      |  Application  |               |             MPTCP             |
211	      +---------------+               + - - - - - - - + - - - - - - - +
212	      |      TCP      |               | Subflow (TCP) | Subflow (TCP) |
213	      +---------------+               +-------------------------------+
214	      |      IP       |               |       IP      |      IP       |
215	      +---------------+               +-------------------------------+

217	      Figure 1: Comparison of Standard TCP and MPTCP Protocol Stacks

219	   Detailed discussion of an architecture for developing a multipath TCP
220	   implementation, especially regarding the functional separation by
221	   which different components should be developed, is given in
222	   Appendix A.

224	1.4.  Operation Summary

226	   This section provides a high-level summary of normal operation in
227	   MPTCP, and is illustrated by the scenario shown in Figure 2.  A
228	   detailed description of operation is given in Section 4.

230	   o  To a non-MPTCP-aware application, MPTCP will be indistinguishable
231	      from normal TCP.  All MPTCP operation is handled by the MPTCP
232	      implementation, although extended APIs could provide additional
233	      control.  An application begins by opening a TCP socket in the
234	      normal way.

236	   o  An MPTCP connection begins as a single TCP session.  This is
237	      illustrated in Figure 2 as being between Addresses A1 and B1 on
238	      Hosts A and B respectively.

240	   o  If extra paths are available, additional TCP sessions are created
241	      on these paths, and are combined with the existing session, which
242	      continues to appear as a single connection to the applications at
243	      both ends.  The creation of the additional TCP session is
244	      illustrated between Address A2 on Host A and Address B1 on Host B.

246	   o  MPTCP identifies multiple paths by the presence of multiple
247	      addresses at endpoints.  Combinations of these multiple addresses
248	      equate to the additional paths.  In the example, other potential
249	      paths that could be set up are A1<->B2 and A2<->B2.  Although this
250	      additional session is shown as being initiated from A2, it could
251	      equally have been initiated from B1.

253	   o  The discovery and setup of additional TCP sessions (termed
254	      'subflows') will be achieved through a path management method.
255	      This document describes a mechanism by which an endpoint can
256	      initiate new subflows by using its additional addresses, or by
257	      signalling to the other endpoint its available addresses.

259	   o  The exact properties of these TCP sessions that are logically
260	      bonded are dependent upon the congestion and flow control
261	      characteristics of the endpoints' MPTCP implementation.

263	   o  MPTCP adds connection-level sequence numbers in order to
264	      reassemble the data stream in-order from multiple subflows.
265	      Connections are terminated by connection-level FIN packets as well
266	      as those relating to the individual subflows.

268	               Host A                               Host B
269	      ------------------------             ------------------------
270	      Address A1    Address A2             Address B1    Address B2
271	      ----------    ----------             ----------    ----------
272	          |             |                      |             |
273	          |     (initial connection setup)     |             |
274	          |----------------------------------->|             |
275	          |<-----------------------------------|             |
276	          |             |                      |             |
277	          |            (additional subflow setup)            |
278	          |             |--------------------->|             |
279	          |             |<---------------------|             |
280	          |             |                      |             |
281	          |             |                      |             |

283	                  Figure 2: Example MPTCP Usage Scenario
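
As a rough illustration of how combinations of addresses equate to
paths, the following Python sketch enumerates the address pairs from
Figure 2 that do not yet carry a subflow.  The helper name
candidate_paths and the string labels are assumptions made for this
example only; they are not defined by this document.

```python
from itertools import product


def candidate_paths(local_addrs, remote_addrs, in_use=()):
    """Return (source, destination) pairs not yet carrying a subflow."""
    used = set(in_use)
    return [pair for pair in product(local_addrs, remote_addrs)
            if pair not in used]


# Host A's view of Figure 2 once the initial A1<->B1 session exists.
paths = candidate_paths(["A1", "A2"], ["B1", "B2"], in_use=[("A1", "B1")])
print(paths)
```

The remaining pairs are the A2<->B1 subflow shown in the figure plus
the two other potential paths named in the text.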

285	1.5.  Open Issues

287	   This specification is a work-in-progress, and as such there are many
288	   issues that are still to be resolved.  This section lists many of the
289	   key open issues within this specification; these are discussed in
290	   more detail in the appropriate sections throughout this document.

292	   o  Best handshake mechanisms.  This document contains a proposed
293	      scheme by which connections and subflows can be set up.  It is
294	      felt that, although this is "no worse than regular TCP", there
295	      could be opportunities for significant improvements in security
296	      that could be included (potentially optionally) within this
297	      protocol.

299	   o  Issues around simultaneous opens, where both ends attempt to
300	      create a new subflow simultaneously, need to be investigated and
301	      behaviour specified.

303	   o  Appropriate mechanisms for controlling policy of subflow usage.
304	      The ECN signal is currently proposed but other alternatives,
305	      including path property options, could be employed instead.

307	1.6.  Requirements Language

309	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
310	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
311	   document are to be interpreted as described in RFC 2119 [RFC2119].

313	2.  Terminology

315	   Path:  A sequence of links between a sender and a receiver, defined
316	      in this context by a source and destination address pair.

318	   Subflow:  A stream of TCP packets sent over a path.  A subflow is a
319	      component part of a connection between two endpoints.

321	   Connection:  A collection of one or more subflows, over which an
322	      application can communicate between two endpoints.  There is a
323	      one-to-one mapping between a connection and a socket.

325	   Token:  A unique identifier given to a multipath connection by an
326	      endpoint.  May also be referred to as a "Connection ID".

328	   Endpoint:  A host operating an MPTCP implementation, and either
329	      initiating or terminating an MPTCP connection.

331	3.  Semantic Issues

333	   In order to support multipath operation, the semantics of some TCP
334	   components have changed.  To aid clarity, this section collects these
335	   semantic changes as a reference.

337	   Sequence Number:  The TCP sequence number is subflow-specific, with a
338	      data sequence number used for reassembly at connection-level.

340	   FIN:  The FIN only applies to a subflow, not to a connection.  For a
341	      connection-level FIN, use the DATA FIN option.

343	   ACK:  The ACK acknowledges the subflow sequence number only, and the
344	      mapping to the data sequence number is handled out-of-band.

346	   RST:  The RST only applies to a subflow.  There is no connection-
347	      level RST, since it would be impossible to distinguish the two, as
348	      the link between a subflow and a connection is established at the
349	      SYN handshake.  A connection is considered reset if every subflow
350	      sends a RST in response.

352	   Length:  There is additionally an explicit length for each MPTCP
353	      segment in order to separate potential TCP/IP-layer segmentation
354	      from the MPTCP data flow.

356	   Address List:  The address management is handled per-connection to
357	      permit the application of per-connection local policy.

359	   5-tuple:  The 5-tuple (protocol, local IP, local port, remote IP,
360	      remote port) presented to the application layer in a non-
361	      multipath-aware application is that of the first subflow, even if
362	      the subflow has since been closed and removed from the connection.
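
To make the split between subflow-level and connection-level sequence
numbers more concrete, here is a minimal Python sketch of
connection-level reassembly.  It assumes that each arriving segment has
already been mapped to a data sequence number and that its explicit
length equals the payload length.  The class name Reassembler is
hypothetical, and the sketch ignores sequence number wrap-around and
partially overlapping segments.

```python
class Reassembler:
    """Deliver bytes in data-sequence order, whichever subflow carried them."""

    def __init__(self, initial_dsn):
        self.next_dsn = initial_dsn   # next in-order data sequence number
        self.pending = {}             # dsn -> payload held back for later
        self.delivered = bytearray()  # stands in for the socket buffer

    def receive(self, dsn, payload):
        if dsn < self.next_dsn or dsn in self.pending:
            return  # duplicate, e.g. the same data resent on another subflow
        self.pending[dsn] = payload
        # Release every segment that is now contiguous with the stream.
        while self.next_dsn in self.pending:
            chunk = self.pending.pop(self.next_dsn)
            self.delivered += chunk
            self.next_dsn += len(chunk)


r = Reassembler(initial_dsn=1000)
r.receive(1005, b"world")   # arrives first, on one subflow
r.receive(1000, b"hello")   # arrives later, on another subflow
```

The subflow sequence numbers never appear here: they are assumed to
have been acknowledged and stripped at the subflow level.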

364	4.  MPTCP Protocol

366	   This section describes the operation of the MPTCP protocol, and is
367	   subdivided into sections for each key part of the protocol operation.

369	   All MPTCP operations are signalled using optional TCP header fields.
370	   These TCP Options will have option numbers allocated by IANA, as
371	   discussed in Section 10, and are defined throughout the following
372	   subsections.

374	4.1.  Session Initiation

376	   Session Initiation begins with a SYN, SYN/ACK exchange on a single
377	   path.  Each of these packets will additionally feature the Multipath
378	   Capable TCP option (Figure 3), which declares the sender's locally
379	   unique 32-bit token for this connection, and a version field.

381	   The "Multipath Capable" option declares an endpoint to be capable of
382	   operating Multipath TCP (or rather, more accurately, a desire to
383	   operate Multipath TCP on this particular connection).  As well as
384	   this declaration, this field presents a token, which is used when
385	   adding additional subflows to this connection.

387	   This token is generated by the sender and has local meaning only; it
388	   MUST therefore be unique for the sender.  The token MUST be
389	   difficult for an attacker to guess, and thus it SHOULD be generated
390	   randomly.  (However, see further discussions
391	   about security in Section 6, including the possibility of a 64-bit
392	   token and an initial data sequence number.)
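
One possible way to satisfy both requirements (locally unique and hard
to guess) is sketched below in Python.  The function name new_token and
the use of an in-memory set to track tokens in use are assumptions of
this example; an implementation may track tokens differently.

```python
import secrets


def new_token(tokens_in_use):
    """Pick a random 32-bit token not already used by this host."""
    while True:
        token = secrets.randbits(32)   # cryptographically strong source
        if token not in tokens_in_use:
            tokens_in_use.add(token)
            return token


tokens = set()
t1 = new_token(tokens)
t2 = new_token(tokens)
```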

394	   This option is only present in packets with the SYN flag set.  It is
395	   only used in the first TCP session of a connection, in order to
396	   identify the connection; all following subflows will use path
397	   management techniques to join the existing connection.

399	                           1                   2                   3
400	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
401	      +---------------+---------------+-------------------------------+
402	      | Kind=OPT_MPC  |  Length = 7   |(resvd)|Version|  Sender Token :
403	      +---------------+---------------+-------------------------------+
404	      : Sender Token (continued - 4 octets total)     |
405	      +-----------------------------------------------+

407	                    Figure 3: Multipath Capable option

409	   The version field represents the version of MPTCP in use.  The
410	   version provided in this specification is 0.  The reserved bits may
411	   be used for connection-specific flags in later versions.
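
The layout in Figure 3 could be encoded and decoded as in the following
Python sketch.  The option kind has not been allocated, so the value
253 used here is only a placeholder chosen for this example; the
function names are likewise hypothetical.

```python
import struct

OPT_MPC = 253        # placeholder kind; the real value is left to IANA
MPTCP_VERSION = 0    # version defined by this specification


def pack_mpc(token, version=MPTCP_VERSION):
    # Third octet: reserved bits in the high nibble (sent as zero),
    # version in the low nibble.  The token follows as 4 octets.
    return struct.pack("!BBBI", OPT_MPC, 7, version & 0x0F, token)


def parse_mpc(option):
    kind, length, flags, token = struct.unpack("!BBBI", option)
    if kind != OPT_MPC or length != 7:
        raise ValueError("not a Multipath Capable option")
    return flags & 0x0F, token   # (version, sender token)


wire = pack_mpc(0x1A2B3C4D)
```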

413	   If a SYN contains a "multipath capable" option but the SYN/ACK does
414	   not, it is assumed that the recipient is not multipath capable and
415	   thus the MPTCP session will operate as regular, single-path TCP.  If
416	   a SYN does not contain a "multipath capable" option, the SYN/ACK MUST
417	   NOT contain one in response.
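
The two rules above can be summarised in a few lines of Python.  This
is only a sketch of the decision logic; the function names and the
"multipath"/"single-path" labels are assumptions made for illustration.

```python
def syn_ack_carries_mpc(syn_has_mpc, responder_is_capable):
    # A SYN/ACK MUST NOT carry the option unless the SYN did.
    return syn_has_mpc and responder_is_capable


def session_mode(syn_has_mpc, syn_ack_has_mpc):
    # Multipath operation needs the option in both directions.
    if syn_has_mpc and syn_ack_has_mpc:
        return "multipath"
    return "single-path"


modes = [session_mode(True, syn_ack_carries_mpc(True, capable))
         for capable in (True, False)]
```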

419	   If these packets are unacknowledged, it is up to local policy to
420	   decide how to respond.  It is expected that a sender will eventually
421	   fall back to single-path TCP (i.e. without the Multipath Capable
422	   Option), in order to work around middleboxes that may drop packets
423	   with unknown options; however, the number of multipath-capable
424	   attempts that are made first will be up to local policy.  In the case
425	   of out-of-order packets, i.e. if a multipath-capable SYN/ACK is
426	   received in response to a multipath-capable SYN, after a standard SYN
427	   has been sent, then once again it is up to the sender to choose how
428	   to behave.  For example, the sender could respond to new connections
429	   using the previously declared token, or it could simply drop any new
430	   multipath options within the flow.

432	   If an endpoint is known to be multiaddressed (e.g. through multiple
433	   addresses returned in a DNS lookup), alternative destination
434	   addresses should be tried first, before falling back to regular TCP.
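
Since the retry behaviour is left to local policy, the following Python
sketch shows just one possible ordering of connection attempts.  The
limit of two multipath-capable attempts per address, and the choice of
the first address for the final regular SYN, are arbitrary assumptions
of this example.

```python
def syn_attempts(dest_addrs, mp_attempts_per_addr=2):
    """Yield (address, with_mpc) pairs in the order they might be tried."""
    for addr in dest_addrs:
        for _ in range(mp_attempts_per_addr):
            yield addr, True       # SYN with the Multipath Capable option
    yield dest_addrs[0], False     # finally, a regular single-path SYN


# Two destination addresses, e.g. as returned by a DNS lookup.
plan = list(syn_attempts(["192.0.2.1", "198.51.100.1"]))
```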

436	   In addition to this option, a Data Sequence Number option (discussed
437	   in Section 4.4) is included to provide an initial data-level sequence
438	   number (and this initial SYN counts as one octet in this space, as
439	   for a regular SYN in single-path TCP).

441	4.2.  Starting a New Subflow

443	   Endpoints have knowledge of their own address(es), and can become
444	   aware of the other endpoint's addresses through a path management
445	   technique as described in Section 4.3.  Using this knowledge, an
446	   endpoint will initiate a new subflow over a currently unused pair of
447	   addresses.

449	   A new subflow is started as a normal TCP SYN/ACK exchange.  The
450	   following TCP option is used to identify the connection that the new
451	   subflow should join.  The token used is the locally unique
452	   token of the destination for the subflow, as defined by the Multipath
453	   Capable option received in the first SYN/ACK exchange.

455	   It should be noted that, in theory, additional subflows can exist
456	   between any pair of ports, and as such it is this token that is used
457	   for demuxing at the receiver.  A receiver must store some mapping
458	   state, of (source_addr, dest_addr, source_port, dest_port) to its
459	   token, using information from the initial SYN exchange, in order to
460	   enable this.  In practice, however, it is envisaged that most new
461	   subflows will connect to a port that is already in use as the source
462	   or destination port of an existing subflow, in order to have a
463	   greater chance of getting through firewalls and other middleboxes,
464	   and to support traffic engineering of the flows.

466	   This option includes an "Address ID".  This is an identifier, locally
467	   unique to the sender of this option, that identifies the source
468	   address of this packet.  This serves two purposes.  Firstly, if an
469	   address becomes unexpectedly unavailable on the sender, it can signal
470	   this to the receiver via a remove address option (Section 4.3.2)
471	   without needing to know what the source address actually is (thus
472	   allowing the use of NATs).  Secondly, it allows correlation between
473	   new connection attempts and address signalling (Section 4.3.1), to
474	   prevent duplicate subflow initiation.

476	   This option can only be present when the SYN flag is set.

478	                           1                   2                   3
479	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
480	      +---------------+---------------+-------------------------------+
481	      | Kind=OPT_JOIN |  Length = 7   |Receiver Token (4 octets total):
482	      +---------------+---------------+----------------+--------------+
483	      :  Receiver Token (continued)   |   Address ID   |
484	      +-------------------------------+----------------+

486	                     Figure 4: Join Connection option
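
The token-based demultiplexing described in this section might be
organised as in the Python sketch below.  The class SubflowDemux, its
method names, and the use of plain strings for connections and
addresses are all assumptions of this example.  What a receiver does
with an unknown token is not covered here; the sketch simply returns
None.

```python
class SubflowDemux:
    """Map incoming subflows to connections using the local token."""

    def __init__(self):
        self.by_token = {}    # local token -> connection
        self.by_4tuple = {}   # (src_addr, dst_addr, src_port, dst_port) -> token

    def register(self, token, conn, four_tuple):
        # Called after the initial SYN exchange of a connection.
        self.by_token[token] = conn
        self.by_4tuple[four_tuple] = token

    def on_join_syn(self, four_tuple, receiver_token):
        # A SYN carrying Join Connection: attach it if the token is known.
        conn = self.by_token.get(receiver_token)
        if conn is not None:
            self.by_4tuple[four_tuple] = receiver_token
        return conn

    def lookup(self, four_tuple):
        # Later segments on a subflow are matched by their 4-tuple.
        return self.by_token.get(self.by_4tuple.get(four_tuple))


demux = SubflowDemux()
demux.register(0xCAFE, "conn-1", ("A1", "B1", 5000, 80))
joined = demux.on_join_syn(("A2", "B1", 5001, 80), 0xCAFE)
unknown = demux.on_join_syn(("X", "B1", 6000, 80), 0xDEAD)
```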

488	4.3.  Address Knowledge Exchange (Path Management)

490	   We use the term "path management" to refer to the exchange of
491	   information about additional paths between endpoints, which in this
492	   design is managed by multiple addresses at endpoints.  For more
493	   detail of the architectural thinking behind this design, see the
494	   discussion of functional separation in Appendix A.

496	   This design makes use of two methods of sharing such information,
497	   used simultaneously.  The first is the direct setup of new subflows,
498	   already described in Section 4.2.  The second is described in the
499	   following subsections, whereby addresses are signalled explicitly to
500	   the endpoint to allow it to initiate new connections.  This approach
501	   has been chosen so as to allow addresses to change in flight, and
502	   thus the use of NATs, whilst also allowing the signalling of
503	   previously unknown addresses, such as those belonging to other
504	   address families.

506	   In more detail, an example of the typical operation is as follows,
507	   where an existing address is used at one endpoint:

509	   o  An endpoint that is multihomed starts an additional TCP session to
510	      an address/port pair that is already in use on the other endpoint,
511	      using a token to identify the flow (Section 4.2).  (A multihomed
512	      destination may open a new subflow from its new address to the
513	      source address and port, or a multihomed source may open a new
514	      subflow from its new address to the existing destination address
515	      and port).

517	   o  To expand upon this, say a connection is initiated from host "A"
518	      on (address, port) combination A1 to destination (address, port)
519	      B1 on host "B".  If host A is multihomed, it starts an additional
520	      connection from new (address, port) A2 to B1, using B's previously
521	      declared token.  Alternatively, if B is multihomed, it will try to
522	      set up a new TCP connection from B2 to A1, using A's previously
523	      declared token.

525	   o  Simultaneously (or near-simultaneously), an "Add Address" option
526	      (Section 4.3.1) is sent on an existing subflow, informing the
527	      receiver of the sender's alternative address(es).  The recipient
528	      can use this information to open a new subflow to the sender's
529	      additional address.  Using the previous notation, this would be an
530	      Add Address packet sent from A1 to B1, informing B of address A2.

532	   o  If host B successfully receives the first SYN, starting a new
533	      subflow, it can use the Address ID to correlate this with the Add
534	      Address option that will also arrive on an existing subflow, and
535	      it will respond to the SYN with a SYN/ACK.  Otherwise, if it does
536	      not receive such a SYN, it tries to initiate a new subflow from
537	      one or more of its addresses to address A2.  This is intended to
538	      permit new sessions to be opened if one endpoint is behind a NAT.
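
The correlation step performed by host B in the last bullet can be
sketched in Python as follows.  The class PathManager and its methods
are hypothetical.  The sketch also omits any timer: it decides
immediately when the Add Address option arrives, whereas a real
implementation would presumably wait a while for the SYN.

```python
class PathManager:
    """Host B's view: correlate Join SYNs with Add Address by Address ID."""

    def __init__(self):
        self.joined_ids = set()   # Address IDs seen in Join Connection SYNs
        self.announced = {}       # Address ID -> announced address

    def on_join_syn(self, address_id):
        self.joined_ids.add(address_id)

    def on_add_address(self, address_id, address):
        """Return an address to connect to, or None if the peer already
        opened a subflow from the address with this ID."""
        self.announced[address_id] = address
        if address_id in self.joined_ids:
            return None
        return address


pm = PathManager()
pm.on_join_syn(2)                          # SYN from A2 got through
already_joined = pm.on_add_address(2, "A2")
must_initiate = pm.on_add_address(3, "A3")  # no SYN seen, e.g. NAT in the way
```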

540	   Other scenarios are valid, however, such as those where entirely new
541	   addresses are signalled, e.g. to allow an IPv6 and an IPv4 path to be
542	   used simultaneously.

544	4.3.1.  Adding Addresses

546	   Announcing additional addresses that an endpoint can be reached on
547	   will be undertaken by the Add Address TCP Option (Figure 5), where an
548	   (ID, address) pair can be announced to the other endpoint.  Several
549	   addresses can be added if there is sufficient TCP option space,
550	   otherwise multiple TCP messages containing this option must be sent.
551	   This option can be used at any time during a connection.

553	   The Add Address option announces a list of alternative IP addresses,
554	   beyond the current one in use, that the sender can be contacted on.
555	   This option can be used multiple times until all available addresses
556	   have been announced, in order to get around TCP option space limits.
557	   It should be noted that every address has an ID which can be used for
558	   address removal, and therefore endpoints must cache the mapping
559	   between ID and address.  This is also used to identify Join
560	   Connection options (Section 4.2) relating to the same address, even
561	   when address translators are in use.  The ID must be unique to the
562	   sender, and although it may be a sequential counter, this is not
563	   mandated.

565	   This option is shown for IPv4.  For IPv6, the IPVer field will read
566	   6, and the length of the address will be 16 octets not 4, and thus
567	   the length of the option will be 2 + (18 * number_of_entries).
568	   Multiple addresses can be included, with an ID following on
569	   immediately from the previous address, and their existence can be
570	   inferred through the option length and version fields.

572	   NB: by having an IPVer field, we get four free reserved bits.  These
573	   could be used in later versions of this protocol, e.g. one bit for
574	   "use now" or similar, to differentiate between subflows for backup
575	   purposes and those for throughput.

577	                           1                   2                   3
578	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
579	      +---------------+---------------+---------------+-------+-------+
580	      | Kind=OPT_ADDR |     Length    |  Address ID   | IPVer |(resvd)|
581	      +---------------+---------------+---------------+-------+-------+
582	      |                   Address (IPv4 - 4 octets)                   |
583	      +---------------------------------------------------------------+
584	          ( ... further ID/Version/Address fields as required ... )

586	                  Figure 5: Add Address option (for IPv4)
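
	   As an illustration only, the layout in Figure 5 can be sketched
	   in Python.  The kind value 0xFD is a hypothetical placeholder
	   (this document lists OPT_ADDR as "(tbc)"), and the function names
	   are assumptions made for the example, not part of the protocol.

```python
import ipaddress
import struct

# Hypothetical placeholder: the draft lists the OPT_ADDR kind as "(tbc)".
OPT_ADDR = 0xFD


def encode_add_address(entries):
    """Build an Add Address option from (address_id, ip_string) pairs."""
    body = b""
    for address_id, ip_text in entries:
        ip = ipaddress.ip_address(ip_text)
        # IPVer sits in the high four bits; the low four are reserved (zero).
        body += struct.pack("!BB", address_id, ip.version << 4) + ip.packed
    # Kind and Length take two octets; each entry adds 6 (IPv4) or 18 (IPv6).
    return struct.pack("!BB", OPT_ADDR, 2 + len(body)) + body


def decode_add_address(option):
    """Return the list of (address_id, ip_string) pairs in an option."""
    kind, length = struct.unpack_from("!BB", option, 0)
    if kind != OPT_ADDR or length != len(option):
        raise ValueError("not a well-formed Add Address option")
    entries, offset = [], 2
    while offset < length:
        address_id, ver_byte = struct.unpack_from("!BB", option, offset)
        size = {4: 4, 6: 16}.get(ver_byte >> 4)
        if size is None:
            raise ValueError("unknown IPVer value")
        raw = option[offset + 2:offset + 2 + size]
        if len(raw) != size:
            raise ValueError("truncated address entry")
        entries.append((address_id, str(ipaddress.ip_address(raw))))
        offset += 2 + size
    return entries
```

	   The decoder walks the entries using only the option length and
	   each entry's IPVer field, which is how further entries are
	   inferred in the text above.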

588	4.3.2.  Remove Address

590	   If, during the lifetime of an MPTCP connection, a previously-announced
591	   address becomes invalid (e.g. if the interface disappears), the
592	   affected endpoint should announce this so that the other endpoint can
593	   remove subflows related to this address.

595	   This is achieved through the Remove Address option (Figure 6), which
596	   will remove a previously-added address (or list of addresses) from a
597	   connection and terminate any subflows currently using that address.

599	   The sending and receipt of this message should trigger the sending of
600	   FINs by both endpoints on the affected subflow(s) (if possible), as a
601	   courtesy to let middleboxes clean up state, but endpoints may clean
602	   up their internal state without a long timeout.

604	   If there is no address at the requested ID, the receiver will
605	   silently ignore the request.

607	   Address removal is undertaken by ID, so as to permit the use of NATs
608	   and other middleboxes, in the cases where new connections have been
609	   initiated but now want to be removed.

611	   The closure of a single subflow, rather than all using a particular
612	   address, is undertaken as normal with a FIN exchange on the subflow -
613	   for more information, see Section 4.5.

615	                           1                   2                   3
616	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
617	      +---------------+---------------+---------------+
618	      |Kind=OPT_REMADR|  Length = 2+n |  Address ID   | ...
619	      +---------------+---------------+---------------+

621	                      Figure 6: Remove Address option
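
	   A minimal sketch of how a receiver might process this option is
	   given below.  The kind value 0xFE is a hypothetical placeholder
	   (OPT_REMADR is "(tbc)"), and the two lookup tables are assumed
	   per-connection structures invented for the example.

```python
import struct

OPT_REMADR = 0xFE  # hypothetical placeholder; the draft value is "(tbc)"


def encode_remove_address(address_ids):
    """Build a Remove Address option; Length is 2 + n for n Address IDs."""
    ids = bytes(address_ids)
    return struct.pack("!BB", OPT_REMADR, 2 + len(ids)) + ids


def apply_remove_address(option, known_addresses, subflows_by_id):
    """Forget each listed ID and return the subflows that should be closed.

    known_addresses maps Address ID -> address; subflows_by_id maps
    Address ID -> list of subflow handles.  Both tables are hypothetical.
    IDs with no known address are silently ignored.
    """
    kind, length = struct.unpack_from("!BB", option, 0)
    if kind != OPT_REMADR or length != len(option):
        raise ValueError("not a well-formed Remove Address option")
    to_close = []
    for address_id in option[2:length]:
        if address_id not in known_addresses:
            continue
        del known_addresses[address_id]
        to_close.extend(subflows_by_id.pop(address_id, []))
    return to_close
```

	   The caller would then send FINs on the returned subflows where
	   possible, as described above.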

623	4.4.  General MPTCP Operation

625	   This section discusses operation of MPTCP for data transfer,
626	   independent of the path management mechanism used.

628	   At a high level, an MPTCP implementation will take one input data
629	   stream from an application, and split it into one or more subflows.
630	   The data stream as a whole can be reassembled through the use of the
631	   Data Sequence Number (Figure 7) option, which defines the sequence in
632	   the data stream of the first octet of the packet's payload, and this
633	   is used by the receiver to ensure in-order delivery to the application
634	   layers.  Meanwhile, the subflow-level sequence numbers (i.e. the
635	   regular sequence numbers in the TCP header) have subflow-only
636	   relevance.
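
	   To make the receiver's role concrete, the following sketch shows
	   one way to reorder payloads by data sequence number regardless of
	   the subflow they arrived on.  The class and method names are
	   assumptions for illustration, and sequence number wrap-around is
	   deliberately ignored.

```python
class Reassembler:
    """Deliver payload bytes in data-sequence order, whatever subflow
    they arrived on.  Sequence wrap-around is ignored for brevity."""

    def __init__(self, initial_dsn):
        self.next_dsn = initial_dsn   # next data sequence number expected
        self.pending = {}             # dsn of first octet -> payload

    def receive(self, dsn, payload):
        """Store one segment and return any bytes now deliverable."""
        if dsn + len(payload) <= self.next_dsn:
            return b""                # wholly duplicate data, ignore
        if dsn < self.next_dsn:       # partial overlap: keep the new tail
            payload = payload[self.next_dsn - dsn:]
            dsn = self.next_dsn
        if len(payload) > len(self.pending.get(dsn, b"")):
            self.pending[dsn] = payload
        out = bytearray()
        while self.pending:
            start = min(self.pending)
            if start > self.next_dsn:
                break                 # a hole remains; wait for more data
            chunk = self.pending.pop(start)
            skip = self.next_dsn - start
            if skip < len(chunk):
                out += chunk[skip:]
                self.next_dsn = start + len(chunk)
        return bytes(out)
```

	   Data that arrives ahead of a hole is held back, and everything is
	   released in order once the missing octets arrive.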

638	   The only acknowledgements are those at the subflow-level, so the
639	   sender must be able to map these acknowledgements to the data
640	   sequence numbers that were contained in the relevant packets.  The
641	   sender thus knows, if subflow data goes unacknowledged, which part of
642	   the original data stream this equates to, and thus what data must be
643	   retransmitted.  It is expected (but not mandated) that SACK [RFC2018]
644	   is used as an efficiency at the subflow level.  Each subflow will
645	   maintain its own congestion window.

647	                           1                   2                   3
648	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
649	      +---------------+---------------+------------------------------+
650	      | Kind=OPT_DSN  |    Length     |      Data Sequence Number... :
651	      +---------------+---------------+------------------------------+
652	      : ... ( (length-4) octets )     | Data-level Length (2 octets) |
653	      +-------------------------------+------------------------------+

655	                   Figure 7: Data Sequence Number option

657	   In addition to the Data Sequence Number, this option also includes a
658	   Data-level Length field.  The purpose of this field is to assist with
659	   compatibility with situations where TCP/IP segmentation is undertaken
660	   separately from the stack that is generating the data flow (e.g.
661	   through the use of TCP segmentation offloading on network interface
662	   cards, or by middleboxes).  This field declares what length of data
663	   this data sequence number is valid for, allowing a receiver to infer
664	   when it has received sufficient segments.  The primary motivation for
665	   this behaviour is the understanding that devices involved in re-
666	   segmentation typically repeat additional TCP options into every re-
667	   segmented packet.  The use of this length field will make it clear
668	   when all relevant segments have been received.  (It is FFS whether
669	   this is the optimal solution to this issue.)

671	   As a TCP option contains a length field, the length of the Data
672	   Sequence Number can be declared implicitly.  Although it is expected
673	   that initial implementations will use 32-bit sequence numbers (i.e. 4
674	   octets, so a length field of 8), setting the length field to 12 and
675	   including a 64-bit sequence number (of eight octets) MUST be
676	   considered valid and processed appropriately.  This may also
677	   have useful security implications, discussed in Section 6.
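
	   As a sketch of the two permitted widths, the option in Figure 7
	   could be encoded and decoded as follows.  The kind value 0xFC is
	   a hypothetical placeholder (OPT_DSN is "(tbc)"), and the function
	   names are assumptions made for the example.

```python
import struct

OPT_DSN = 0xFC  # hypothetical placeholder; the draft value is "(tbc)"


def encode_dsn(dsn, data_len, wide=False):
    """Build a Data Sequence Number option.

    Length is 8 with a 32-bit DSN or 12 with a 64-bit DSN: Kind (1),
    Length (1), DSN (4 or 8), Data-level Length (2).
    """
    fmt, length = ("!BBQH", 12) if wide else ("!BBIH", 8)
    return struct.pack(fmt, OPT_DSN, length, dsn, data_len)


def decode_dsn(option):
    """Return (dsn, data_len); the DSN width follows from Length - 4."""
    kind, length = struct.unpack_from("!BB", option, 0)
    if kind != OPT_DSN or length != len(option) or length not in (8, 12):
        raise ValueError("not a well-formed Data Sequence Number option")
    dsn = int.from_bytes(option[2:length - 2], "big")
    (data_len,) = struct.unpack_from("!H", option, length - 2)
    return dsn, data_len
```

	   The decoder never needs to be told the width in advance, since
	   the option length alone determines it.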

679	   As with the standard TCP sequence number, the data sequence number
680	   should not start at zero, but at a random value to make session
681	   hijacking harder.  This is done by including a Data Sequence Number
682	   option along with the Multipath Capable option in the initial SYN
683	   (which occupies one octet of data sequence space; see Section 4.1).

685	   The Data Sequence Number is included in every MPTCP packet that
686	   contains data (or a DATA FIN, see Section 4.5), even if only one path
687	   is in use, so long as the MPTCP handshake has been completed and the
688	   endpoints have therefore agreed to use MPTCP.

690	   The MPTCP data and subflow level sequence numbering could be seen to
691	   be analogous to that used in SACK, however there are subtle
692	   differences.  The key similarity is that it is possible to have
693	   temporary "holes" in the received data sequence space - later data
694	   may have arrived earlier (most likely on a different subflow), but
695	   does not need to be retransmitted.  The "holes" are later filled in.
696	   The key difference, however, is that while SACK can rely on the
697	   regular TCP cumulative acknowledgements to indicate how much data has
698	   been successfully received (with no holes), there is no similar
699	   method in MPTCP.  Instead, the sender must keep track of the
700	   acknowledgements to derive what data has been successfully received.
701	   This leads to some oddities especially with session termination (see
702	   Section 4.5).

704	4.4.1.  Receive Window Considerations

706	   Normal TCP advertises a receive window in each packet, telling the
707	   sender how much data the receiver is willing to accept past the
708	   cumulative ack.  The receive window is used to implement flow
709	   control, throttling down fast senders when receivers cannot keep up.

711	   MPTCP also uses a single receive window, shared between the subflows.
712	   The idea is to allow any subflow to send data as long as the receiver
713	   is willing to accept it; the alternative, maintaining per-subflow
714	   receive windows, could end up stalling some subflows while others
715	   would not use up their window.

717	4.4.2.  Congestion Control Considerations

719	   Different subflows in an MPTCP connection have different congestion
720	   windows.  To achieve resource pooling [WISCHIK], it is necessary to
721	   couple the congestion windows in use on each subflow, in order to
722	   push most traffic to uncongested links.  One algorithm for achieving
723	   this is presented in Section 5; the algorithm does not achieve
724	   perfect resource pooling but is "safe" in that it is readily
725	   deployable in the current Internet.

727	   It is foreseeable that different congestion controllers will be
728	   implemented for MPTCP, each aiming to achieve different properties in
729	   the resource pooling/fairness/stability design space.  Much research
730	   is expected in this area in the near future.

732	   Regardless of the algorithm used, the design of the MPTCP protocol
733	   aims to provide the congestion control implementations sufficient
734	   information to take the right decisions; this information includes,
735	   for each subflow, which packets were lost and when.

737	4.4.3.  Subflow Policy

739	   Within a local MPTCP implementation, a host may use any local policy
740	   it wishes to decide how to share the traffic to be sent over the
741	   available paths.

743	   In the typical use case, where the goal is to maximise throughput,
744	   all available paths will be used simultaneously for data transfer.
745	   It is expected, however, that other use cases will appear.

747	   For instance, a possibility is an 'all-or-nothing' approach, i.e.
748	   have a second path ready for use in the event of failure of the first
749	   path, but alternatives could include entirely saturating one path
750	   before using an additional path (the 'overflow' case).  Such choices
751	   would be most likely based on the monetary cost of links, but may
752	   also be based on properties such as delay or bandwidth, in cases
753	   where the additional paths are significantly worse and not worth
754	   including in the base operation.  Other metrics such as this could be
755	   wrapped into an overall "cost" metric for a link.

757	   The ability to make effective choices at the sender requires full
758	   knowledge of the path cost, which is unlikely to be the case.  There
759	   is no mechanism in MPTCP for a receiver to signal their own
760	   particular preferences for paths, but this is a necessary feature
761	   since receivers will often be the multihomed party, such as in the
762	   case of laptop computers with wired and wireless connectivity.
763	   Instead of incorporating complex signalling, it is proposed to use
764	   existing TCP features to signal priority implicitly.  If a receiver
765	   wishes to keep a path active as a backup but wishes to prevent data
766	   being sent on that path, this could be achieved by the receiver not
767	   sending ACKs for any data it receives on that path.  The sender would
768	   interpret this as severe congestion or a broken path and stop using
769	   it.  We do not advocate this method, however, since this is brutal,
770	   naive, and will result in unnecessary retransmissions.

772	   Therefore, it is proposed to use ECN [RFC3168] to provide fake
773	   congestion signals on paths that a receiver wishes to stop being used
774	   for data.  This has the benefit of causing the sender to back off
775	   without the need to retransmit data unnecessarily, as in the case of
776	   a lost ACK.  This should be sufficient to allow a receiver to express
777	   their policy, although it does not permit a rapid increase in throughput
778	   when switching to such a path.

780	4.4.4.  Retransmissions

782	   This protocol specification does not mandate any mechanisms for
783	   handling retransmissions in the event of path failures, and much will
784	   be dependent upon local policy (as discussed in Section 4.4.3).  The
785	   data sequence number, as given in a TCP option, is used to reassemble
786	   the incoming streams before presentation to the application layers,
787	   so a sender is free to re-send data with the same data sequence
788	   number on a different subflow.  When doing this, an endpoint must
789	   still retransmit the original data on the original subflow, in order
790	   to preserve the subflow integrity (middleboxes could replay old data,
791	   and/or could reject holes in subflows), and a receiver will ignore
792	   these retransmissions.  While this is clearly suboptimal, for
793	   compatibility reasons we feel this is the best behaviour.
794	   Optimisations could be negotiated in future versions of this
795	   protocol.

797	   Of course, retransmissions on alternative subflows will only occur if
798	   this is what local policy suggests.  Indeed, it may be equally valid
799	   to retransmit on the same subflow if alternative paths have
800	   considerably worse quality of service, or are only kept for backup
801	   purposes.  Additionally, it may be possible for some implementations
802	   to signal from lower layers if there are problems with the paths, and
803	   so more appropriate responses could occur.

805	4.5.  Closing a Connection

807	   Under single path TCP, a FIN signifies that the sender has no more
808	   data to send.  In order to allow subflows to operate independently,
809	   however, and with as little change from regular TCP as possible, a
810	   FIN in MPTCP will only affect the subflow on which it is sent.  This
811	   allows nodes to exercise considerable freedom over which paths are in
812	   use at any one time.  The semantics of a FIN remain as for regular
813	   TCP, i.e. it is not until both sides have ACKed each other's FINs
814	   that the subflow is fully closed.

816	   When an application calls close() on a socket, this indicates that it
817	   has no more data to send, and for regular TCP this would result in a
818	   FIN on the connection.  For MPTCP, an equivalent mechanism is needed,
819	   and this is the DATA FIN.  This option, shown in Figure 8, is
820	   attached to a packet carrying a regular FIN on a subflow.

822	   A DATA FIN is an indication that the sender has no more data to send,
823	   and as such can be used as a rapid indication of the end of data from
824	   a sender.  Therefore, it is an optimisation to clean up state
825	   associated with an MPTCP connection, especially when some subflows may
826	   have failed.  Specifically, when a DATA FIN has been received, if all
827	   data has been successfully received, timeouts on all subflows MAY be
828	   reduced.  Similarly, when sending a DATA FIN, once all data
829	   (including the DATA FIN) has been acknowledged, FINs must be sent on
830	   every subflow.  This applies to both endpoints, and is required in
831	   order to clean up state in middleboxes.

833	   There are complex interactions, however, between a DATA FIN and
834	   subflow properties:

836	   o  A DATA FIN can only be sent on a packet which also has the FIN
837	      flag set.

839	   o  A DATA FIN occupies one octet (the final octet) of Data Sequence
840	      Number space.  Therefore, even if there is no user data, a Data
841	      Sequence Number option must be added to a packet containing the
842	      DATA FIN option.  This allows the receiver to easily determine the
843	      last data sequence number that should have been received.

845	   o  There is a one-to-one mapping between the DATA FIN and the
846	      subflow's FIN flag (and its associated sequence space and thus its
847	      acknowledgement).  In other words, when a subflow's FIN flag has
848	      been acknowledged, the associated DATA FIN is also acknowledged.

850	   o  As such, the acknowledgement of a FIN and DATA FIN DOES NOT
851	      indicate that all data has been successfully received.  Because
852	      the data level ack is inferred from subflow acks, the endpoint can
853	      tell when all data before the DATA FIN has been received.

855	   It should be noted that an endpoint may also send a FIN on an
856	   individual subflow to shut it down, but this impact is limited to the
857	   subflow in question.  If all subflows have been closed with a FIN,
858	   that is equivalent to having closed the connection with a DATA FIN.

860	                           1
861	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
862	      +---------------+---------------+
863	      | Kind=OPT_DFIN |   Length = 2  |
864	      +---------------+---------------+

866	                         Figure 8: DATA FIN option

868	4.6.  Error Handling

870	   TBD

872	   Unknown token in MPTCP SYN should equate to an unknown port, e.g. a
873	   TCP reset?  We should make this as silent and tolerant as possible.
874	   Where possible, we should keep this close to the semantics of TCP.
875	   The amount of error handling required may also have an impact on the
876	   choice of path management schemes.  Issues may include odd cases
877	   where a data sequence number is missing from a subflow.  Will
878	   definitely need errors in those cases.

880	5.  Congestion Control Coupling for MPTCP

882	   Coupling congestion windows can achieve resource pooling, by pushing
883	   traffic to underutilized areas of the network.  Another effect of
884	   coupling is fairness at a bottleneck: when two subflows of an MPTCP
885	   connection share a common bottleneck, their combined throughput
886	   should not be more than that of a single TCP flow.

888	   To achieve perfect resource pooling, one must couple both increase
889	   and decrease of congestion windows across subflows.  Yet this tends
890	   to exhibit flappiness: when the two paths have similar levels of
891	   congestion, the controller will tend to allocate all the window to
892	   one or the other subflow, and perform random flips between the two
893	   equilibrium points.  This seems undesirable in general, and is
894	   particularly bad when the achieved rates depend on the RTT (as in the
895	   current Internet).

897	   By only coupling increases we remove flappiness but also reduce the
898	   extent of resource pooling the protocol achieves.  We now succinctly
899	   describe our protocol, assuming there are only two subflows (the
900	   general case is easy to derive, but is more difficult to understand).

902	   Let v_1 and v_2 be the congestion windows on the two subflows, and
903	   assume there is always data to send.  Let w = v_1 + v_2.  Let p_i,
904	   rtt_i be the drop probability and round trip time on path i.

906	   Our proposed algorithm is as follows:

908	   o  Increase v_i by a/w for each ack received on subflow i.

910	   o  Decrease v_i by v_i/2 for each drop on subflow i.

912	   "a" is a parameter of the algorithm, and we'll describe next how to
913	   pick proper values for it.

915	   This algorithm will allocate window to the two subflows such that p_1
916	   * v_1 = p_2 * v_2.  Thus, when the drop probabilities are equal, each
917	   subflow gets an equal window; when they are different, more and more
918	   window will be allocated to the flow with the lower drop probability.
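
	   The two rules can be sketched as follows, with "a" held fixed.
	   The class and function names are assumptions for illustration.
	   The equilibrium helper is not taken from this document: it
	   balances the expected increase (1 - p_i) * a/w against the
	   expected decrease p_i * v_i/2 and assumes small drop
	   probabilities, giving p_i * v_i = 2a/w on each subflow.

```python
import math


class CoupledWindows:
    """Two-subflow sketch: increases are coupled through w, decreases are not."""

    def __init__(self, v1, v2, a=1.0):
        self.v = [float(v1), float(v2)]
        self.a = a    # the "a" parameter, held fixed in this sketch

    def on_ack(self, i):
        """Increase v_i by a/w, where w is the sum of both windows."""
        w = self.v[0] + self.v[1]
        self.v[i] += self.a / w

    def on_drop(self, i):
        """Decrease v_i by v_i/2; the other subflow is untouched."""
        self.v[i] -= self.v[i] / 2.0


def equilibrium_windows(p1, p2, a=1.0):
    """Approximate balance point for small drop probabilities.

    Solves p_i * v_i = 2a/w with w = v_1 + v_2 (an approximation
    derived for this example, not a result stated in the draft).
    """
    w = math.sqrt(2.0 * a * (1.0 / p1 + 1.0 / p2))
    return 2.0 * a / (w * p1), 2.0 * a / (w * p2)
```

	   Under this approximation the path with the lower drop probability
	   ends up with the larger window, in line with the allocation
	   described above.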

920	   The total throughput of the algorithm depends on the drop
921	   probabilities and rtts of the two paths.  We require that the total
922	   throughput is no worse than the throughput a single TCP would get on
923	   the fastest path.  If we kept "a" constant regardless of path
924	   properties, this requirement would be violated.  However, if we
925	   increase "a" according to the difference in drop probabilities and
926	   rtts, it is always possible to match the throughput of the best path.

928	   The second requirement is that none of the subflows should be, on
929	   their own, more aggressive than a single TCP on the same path.
930	   Increasing "a" indefinitely, as required above, may create fairness
931	   issues in some scenarios.  In such cases, the "a" parameter is capped
932	   on the paths where the increase is too aggressive, and some traffic
933	   is pushed on the other paths.

935	   It is possible to achieve all this behavior (adjusting and capping "a")
936	   by only using estimates of the rtts and the current windows for the
937	   two subflows; explicit estimates of the drop probabilities are not
938	   needed.

940	   A full description of the congestion control algorithm is beyond the
941	   scope of this document.  The algorithm will be thoroughly described
942	   in a companion document, soon to be released.

944	6.  Security Considerations

946	   TBD
947	   (Token generation, handshake mechanisms, new subflow authentication,
948	   etc...)

950	   The development of a TCP extension such as this will bring with it
951	   many additional security concerns.  We have set out here to produce a
952	   solution that is "no worse" than current TCP, with the possibility
953	   that more secure extensions could be proposed later.

955	   The primary area of concern will be around the handshake to start new
956	   subflows which join existing connections.  The proposal set out in
957	   Section 4.1 and Section 4.2 is for the initiator of the new subflow
958	   to include the token of the other endpoint in the handshake.  The
959	   purpose of this is to indicate that the sender of this token was the
960	   same entity that received this token at the initial handshake.

962	   One area of concern is that the token could be simply brute-forced.
963	   The token must be hard to guess, and as such could be randomly
964	   generated.  This may still not be strong enough, however, and so the
965	   use of 64 bits for the token would alleviate this somewhat.

967	   Use of these tokens only provides an indication that the token is the
968	   same as at the initial handshake, and does not say anything about the
969	   current sender of the token.  Therefore, another approach would be to
970	   bring a new measure of freshness in to the handshake, so instead of
971	   using the initial token a sender could request a new token from the
972	   receiver to use in the next handshake.

974	   Yet another alternative would be for all SYN packets to include a
975	   data sequence number.  This could either be used as a passive
976	   identifier to indicate an awareness of the current data sequence
977	   number (although a reasonable window would have to be allowed for
978	   delays).  Or, the SYN could form part of the data sequence space -
979	   but this would cause issues in the event of lost SYNs (if a new
980	   subflow is never established), thus causing unnecessary delays for
981	   retransmissions.

983	7.  Interactions with Middleboxes

985	   TBD

987	   How we get around NATs, firewalls.  Problems with TCP proxies.  How
988	   to make an MPTCP-aware middlebox, ...

990	8.  Interfaces

992	   TBD
993	   Interface with applications, interface with TCP, interface with lower
994	   layers...

996	9.  Acknowledgements

998	   The authors are supported by Trilogy
999	   (http://www.trilogy-project.org), a research project (ICT-216372)
1000	   partially funded by the European Community under its Seventh
1001	   Framework Program.  The views expressed here are those of the
1002	   author(s) only.  The European Commission is not liable for any use
1003	   that may be made of the information in this document.

1005	   The authors gratefully acknowledge significant input into this
1006	   document from many members of the Trilogy project, notably Iljitsch
1007	   van Beijnum, Lars Eggert, Marcelo Bagnulo Braun, Robert Hancock, Pasi
1008	   Sarolahti, Olivier Bonaventure, Toby Moncaster, Philip Eardley and
1009	   Andrew McDonald.

1011	10.  IANA Considerations

1013	   This document will make a request to IANA to allocate new values for
1014	   TCP Option identifiers, as follows:

1016	       +------------+----------------------+---------------+-------+
1017	       |   Symbol   |         Name         |      Ref      | Value |
1018	       +------------+----------------------+---------------+-------+
1019	       |   OPT_MPC  |   Multipath Capable  |  Section 4.1  | (tbc) |
1020	       |  OPT_ADDR  |      Add Address     | Section 4.3.1 | (tbc) |
1021	       | OPT_REMADR |    Remove Address    | Section 4.3.2 | (tbc) |
1022	       |  OPT_JOIN  |    Join Connection   |  Section 4.2  | (tbc) |
1023	       |   OPT_DSN  | Data Sequence Number |  Section 4.4  | (tbc) |
1024	       |  OPT_DFIN  |       DATA FIN       |  Section 4.5  | (tbc) |
1025	       +------------+----------------------+---------------+-------+

1027	                      Table 1: TCP Options for MPTCP

1029	11.  References

1031	11.1.  Normative References

1033	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1034	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1036	11.2.  Informative References

1038	   [I-D.eddy-tcp-loo]
1039	              Eddy, W. and A. Langley, "Extending the Space Available
1040	              for TCP Options", draft-eddy-tcp-loo-04 (work in
1041	              progress), July 2008.

1043	   [I-D.ietf-shim6-proto]
1044	              Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming
1045	              Shim Protocol for IPv6", draft-ietf-shim6-proto-12 (work
1046	              in progress), February 2009.

1048	   [I-D.van-beijnum-1e-mp-tcp-00]
1049	              van Beijnum, I., "One-ended Multipath TCP",
1050	              draft-van-beijnum-1e-mp-tcp-00 (work in progress),
1051	              May 2009.

1053	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
1054	              RFC 793, September 1981.

1056	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
1057	              Selective Acknowledgment Options", RFC 2018, October 1996.

1059	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1060	              of Explicit Congestion Notification (ECN) to IP",
1061	              RFC 3168, September 2001.

1063	   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
1064	              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num.
1065	              5, pp. 47-52, October 2008,
1066	              <http://ccr.sigcomm.org/online/files/p47-handleyA4.pdf>.

1068	Appendix A.  Functional Separation

1070	   [Potential to move to separate architectural document]

1072	   This section describes the functional separation that drives the
1073	   design of the MPTCP protocol.  Its main goal is to separate MPTCP into
1074	   two parts that communicate through a well defined interface.  We
1075	   first provide the motivations for this functional separation, then we
1076	   describe in more detail the two main components of the MPTCP
1077	   architecture.

1079	A.1.  Motivations

1081	   The major goal behind MPTCP is to send data over different paths at
1082	   the same time.  This requires that an MPTCP implementation be
1083	   able to discover and use the multiple paths that connect two given
1084	   hosts, when they exist.  However, different mechanisms can be
1085	   envisioned for multipath discovery and use.  Examples are as follows:

1087	   Use multiple addresses:  This is the method currently proposed in
1088	      this document - if hosts are multi-addressed, different address
1089	      pairs may take different routes.

1091	   Use a path selector value:  An end-host might be able to tag packets
1092	      with a path selector value, or "colour".  If some network nodes
1093	      are able to read the colour and use it as a path selector, the
1094	      host can influence the outgoing path of the packet.

1096	   Next-hop selection:  In a network configuration where multiple next-
1097	      hops can offer to forward packets, a host may decide to send some
1098	      of its packets through one next-hop, and some through another.

1100	   The above list is not exhaustive, and could grow as new network
1101	   technologies are deployed.

1103	A.2.  TCP Performance

1105	   In addition to purely sending data over multiple paths, MPTCP must do
1106	   it in a way that will not affect TCP performance.  This raises the
1107	   need for an efficient multipath congestion control algorithm.  While
1108	   this specification does not mandate the use of any particular
1109	   algorithm for congestion control, it ensures that the protocol is
1110	   designed in such a way that any CC algorithm can be designed,
1111	   independently of the particular path management mechanism available
1112	   to the host.  Consequently our architecture for MPTCP decouples the
1113	   policy from the mechanism.  The policy is the decision of what path
1114	   to use for each packet to send.  It is mainly driven by the
1115	   implementation-dependent congestion control algorithm.  The mechanism
1116	   is the technology used to ensure that a packet will be sent on the
1117	   desired path.  This separation is intended to be relatively future-
1118	   proof by allowing these components to evolve at different speeds.

1120	A.3.  Architecture overview
1121	            Control plane    <--     |     -->    Data plane
1122	   +---------------------------------------------------------------+
1123	   |                     Multipath Scheduler (MPS)                 |
1124	   +---------------------------------------------------------------+
1125	                ^                    |          |
1126	                |                    |          |
1127	                |Announcing new      |   +-------------+
1128	                |paths. (referred    |   | Data packet |<--Path idx:3
1129	                |to as path indices) |   +-------------+   attached
1130	                |                    |          |          by MPS
1131	                |                    |          V
1132	   +--------------------------------------------\------------------+
1133	   |                         Path Manager (PM)   \__________zzzzz  |
1134	   +--------------------------------------------------------\------+
1135	      /                   \          |                       \
1136	     /---------------------\         |   /"\       /"\       /"\
1137	     | Path key    Action  |         |   | |       | |       | |
1138	     |     1        xxxxx  |         |   | |       | |       | |
1139	     |     2        yyyyy  |         |   \./       \./       \./
1140	     |     3        zzzzz  |         |  path1     path2     path3
1141	     +---------------------+

1143	                  Figure 9: Overview of MPTCP architecture

1145	   A general overview of the architecture is provided in Figure 9.  The
1146	   Multipath Scheduler (MPS) learns about the number of available paths
1147	   through notifications received from the Path Manager (PM).  From the
1148	   point of view of the Multipath Scheduler, a path is just a number,
1149	   called a Path Index.  Notifications from the PM to the MPS MAY
1150	   contain supporting information about the paths, if relevant, so that
1151	   the MPS can make more intelligent decisions about where to route
1152	   traffic.  When the Multipath Scheduler initiates a communication to a
1153	   new host, it can only send the packets to the default path.  But
1154	   since the Path Manager is layered below the MPS, it can detect that a
1155	   new communication is happening, and tell the MPS about the other
1156	   paths it knows about.

1158	   From then on, it is possible for the MPS to attach a Path Index to
1159	   the control structure of its packets (internal to the MTCP
1160	   implementation), so that the Path Manager can map this Path Index to
1161	   the corresponding action (see the table in the lower left part of
1162	   Figure 9).  The particular action depends on the network mechanism
1163	   used to select a path.  Examples are address rewriting, tunnelling or
1164	   setting a path selector value inside the packet.

1166	   The applicability of the architecture is not limited to the MTCP
1167	   protocol.  While we define in this document an MTCP MPS (MTCP
1168	   Multipath Scheduler), other Multipath Schedulers can be defined.  For
1169	   example, if an appropriate socket interface is designed, applications
1170	   could behave as a Multipath Scheduler and decide where to send any
1171	   particular data.  In this document we concentrate on the MTCP case,
1172	   however.

1174	   In this specification, we define the core protocol for Multipath TCP.
1175	   The core protocol is not dependent on the Path Management technique
1176	   that is chosen, and MUST be implemented in any MTCP MPS.  We also
1177	   provide a default Path Manager that is based on declaring IP
1178	   addresses, and carries control information in TCP options.  An
1179	   implementation of Multipath TCP can use any Path Manager, but it MUST
1180	   be able to fall back to the default PM in case the other end does not
1181	   support the custom PM.  Alternative Path Managers may be specified in
1182	   separate documents in the future.

1184	A.4.  PM/MPS interface

1186	   The minimal set of requirements for a Path Manager is as follows:

1188	   o  Outgoing untagged packets: Any outgoing packet flowing through the
1189	      Path Manager is either tagged or untagged (by the MPS) with a path
1190	      index.  If it is untagged, the packet is sent normally to the
1191	      Internet, as if no multi-path support were present.  Untagged
1192	      packets can be used to trigger a path discovery procedure; that
1193	      is, a Path Manager can listen to untagged packets and decide at
1194	      some point to check whether any path other than the default one
1195	      is usable for the corresponding host pair.  Note that any other
1196	      criteria could be used to decide when to start discovering
1197	      available paths.  Note also that MPS scheduling will not be
1198	      possible until the Path Manager has notified the available paths.
1199	      The PM is thus the first entity coming into action.

1201	   o  Outgoing tagged packets: The Path Manager maintains a table
1202	      mapping path indices to actions.  The action is the operation that
1203	      allows using a particular path.  Examples of possible actions are
1204	      route selection, interface selection or packet transformation.
1205	      When the PM sees a packet tagged with a path index, it looks up
1206	      its table to find the appropriate action for that packet.  The tag
1207	      is purely local.  It is removed before the packet is transmitted.

1209	   o  Incoming packets: A Path Manager MUST ensure that each incoming
1210	      path is mapped unambiguously to exactly one outgoing path.  Note
1211	      that this requirement implies that the same number of incoming and
1212	      outgoing paths must be established.  Moreover, a PM MUST tag any
1213	      incoming path with the same Path Index as the one used for the
1214	      corresponding outgoing path.  This is necessary for MTCP to know
1215	      which outgoing path is acknowledged by an incoming packet.

1217	   o  Module interface: A PM MUST be able to notify the MPS about the
1218	      number of available paths.  Such notifications MUST contain the
1219	      path indices that are legal for use by the MPS.  In case the PM
1220	      decides to stop providing service for one path, it MUST notify the
1221	      MPS about path deletion.  Additionally, a PM MAY provide
1222	      complementary path information when available, such as link
1223	      quality or preference level.
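
   The four requirements above can be sketched as a small in-memory
   model.  This is an illustration only and not part of the
   specification: the names Packet, PathManager, add_path, delete_path,
   incoming_id and the wire list are hypothetical, and an action is
   reduced to a function that labels a payload with the mechanism that
   would carry it (for example address rewriting or tunnelling).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Packet:
    payload: bytes
    path_index: Optional[int] = None  # local tag set by the MPS; None = untagged


# Hypothetical action type: maps a payload to what is put on the wire.
Action = Callable[[bytes], Tuple[str, bytes]]


class PathManager:
    def __init__(self,
                 notify_added: Callable[[int, dict], None],
                 notify_deleted: Callable[[int], None]) -> None:
        self._actions: Dict[int, Action] = {}   # path index -> action
        self._incoming: Dict[str, int] = {}     # incoming path id -> path index
        self._notify_added = notify_added       # module interface towards the MPS
        self._notify_deleted = notify_deleted
        self.wire: List[Tuple[str, bytes]] = []  # stands in for the network

    def add_path(self, index: int, action: Action, incoming_id: str,
                 info: Optional[dict] = None) -> None:
        # One incoming path maps to exactly one outgoing path.
        if index in self._actions or incoming_id in self._incoming:
            raise ValueError("path index and incoming path must be unique")
        self._actions[index] = action
        self._incoming[incoming_id] = index
        # Announce the now-legal index, with optional supporting information.
        self._notify_added(index, info or {})

    def delete_path(self, index: int) -> None:
        del self._actions[index]
        self._incoming = {k: v for k, v in self._incoming.items() if v != index}
        self._notify_deleted(index)

    def send(self, pkt: Packet) -> None:
        if pkt.path_index is None:
            # Untagged: sent as if no multipath support were present.
            self.wire.append(("default", pkt.payload))
        else:
            # Tagged: look up the action; the tag itself never reaches the wire.
            self.wire.append(self._actions[pkt.path_index](pkt.payload))

    def receive(self, incoming_id: str, payload: bytes) -> Packet:
        # Tag with the same index as the corresponding outgoing path.
        return Packet(payload, self._incoming[incoming_id])
```

   In this model an MPS would pass two callbacks to the constructor,
   keep the set of announced indices, and tag packets only with indices
   from that set.  A real Path Manager would also need a discovery
   procedure to decide when to call add_path, which is left out here.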

1225	Appendix B.  Notes on use of TCP Options

1227	   The TCP option space is limited due to the length of the Data Offset
1228	   field in the TCP header (4 bits), which defines the TCP header length
1229	   in 32-bit words.  With the standard TCP header being 20 bytes, this
1230	   leaves a maximum of 40 bytes for options, and many of these may
1231	   already be used by options such as timestamp and SACK.
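
   The 40-byte figure follows directly from the field sizes given
   above, as the short calculation below shows.  The function names are
   hypothetical, and the 12-byte figure in the usage note assumes a
   Timestamps option (10 bytes) padded to a 32-bit boundary.

```python
DATA_OFFSET_BITS = 4     # width of the Data Offset field
WORD_BYTES = 4           # Data Offset counts 32-bit words
BASE_HEADER_BYTES = 20   # TCP header without options


def max_option_bytes() -> int:
    """Largest option space the Data Offset field can describe."""
    max_header = (2 ** DATA_OFFSET_BITS - 1) * WORD_BYTES  # 15 words = 60 bytes
    return max_header - BASE_HEADER_BYTES


def remaining_option_bytes(used_by_other_options: int) -> int:
    """Option bytes left once other options have taken their share."""
    remaining = max_option_bytes() - used_by_other_options
    if remaining < 0:
        raise ValueError("other options already exceed the option space")
    return remaining
```

   Under that assumption, remaining_option_bytes(12) leaves 28 bytes
   for multipath signalling on a packet that also carries timestamps.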

1233	   As such, when doing address list manipulation, not all data may fit.
1234	   This can be mitigated in one of two ways:

1236	   o  Using an option to extend the option space, such as that proposed
1237	      in [I-D.eddy-tcp-loo], which proposes an option providing a 16-bit
1238	      header length field.  Such an option could only be used between
1239	      nodes that support it, however, and so long options could not be
1240	      used until a handshake is complete.

1242	   o  Alternatively, since at least one IP address option field should
1243	      be able to fit per packet, address list manipulation can be
1244	      undertaken with one address per packet.  One method could be to
1245	      wait for data to send, and then append one new address per packet.
1246	      This would seem reasonable if the TCP session begins rapidly, but
1247	      if it is required that the multipath session is ready before the
1248	      first data is to be sent, address list manipulation would be
1249	      required on empty data (signalling only) packets.  Issues may
1250	      arise regarding acknowledged delivery of signalling versus data -
1251	      this is discussed in Section 3 below.

1253	Appendix C.  Resync Packet

1255	   In earlier versions of this draft, we proposed the use of a "re-sync"
1256	   option that would be used in certain circumstances when a sender
1257	   needs to instruct the receiver to skip over certain subflow sequence
1258	   numbers (i.e. to treat the specified sequence space as having been
1259	   received and acknowledged).

1261	   The typical use of this option will be when packets are retransmitted
1262	   on different subflows, after failing to be acknowledged on the
1263	   original subflow.  In such a case, it becomes necessary to move
1264	   forward the original subflow's sequence numbering so as not to later
1265	   transmit different data with a previously used sequence number (i.e.
1266	   when more data comes to be transmitted on the original subflow, it
1267	   would be different data, and so must not be sent with previously-used
1268	   (but unacknowledged) sequence numbering).

1270	   The rationale for needing to do this is two-fold: firstly, when ACKs
1271	   are received they are for the subflow only, and the sender infers
1272	   from this the data that was sent - if the same sequence space could
1273	   be occupied by different data, the sender won't know whether the
1274	   intended data was received.  Secondly, certain classes of middleboxes
1275	   may cache data and not send the new data on a previously-seen
1276	   sequence number.

1278	   This option was dropped, however, since some middleboxes may get
1279	   confused when they meet a hole in the sequence space, and do not
1280	   understand the resync option.  It is therefore felt that the same
1281	   data must continue to be retransmitted on a subflow even if it is
1282	   already received after being retransmitted on another.  There should
1283	   not be a significant performance hit from this since the amount of
1284	   data involved and needing to be retransmitted multiple times will be
1285	   relatively small.

1287	   As originally proposed, the expected sequence numbering at the
1288	   receiving end of a subflow would be 're-synced' using the following
1289	   TCP option.  This packet declares a sequence number space (inclusive)
1290	   which the receiving node should skip over, i.e. if the receiver's
1291	   next expected sequence number was previously within the range
1292	   start_seq_num to end_seq_num, move it forward to end_seq_num + 1.

1294	   This option will be used on the first new packet on the subflow that
1295	   needs its sequence numbering re-synchronised.  It will continue to
1296	   be included on every packet sent on this subflow until a packet
1297	   containing this option has been acknowledged (i.e. if subflow
1298	   acknowledgements exist for packets beyond the end sequence number).
1299	   If the end sequence number is earlier than the current expected
1300	   sequence number (i.e. if a resync packet has already been received),
1301	   this option should be ignored.
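
   The receiver-side rule described above can be sketched as follows.
   The function name apply_resync is hypothetical, and the use of
   modulo-2^32 arithmetic to handle sequence number wrap-around is an
   assumption of this sketch; the text above does not discuss wrapping.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide


def apply_resync(expected: int, start_seq_num: int, end_seq_num: int) -> int:
    """Return the receiver's next expected subflow sequence number.

    If `expected` lies within [start_seq_num, end_seq_num] (inclusive),
    skip to end_seq_num + 1.  Otherwise the option is ignored, which
    covers the case where a resync has already been processed.
    """
    span = (end_seq_num - start_seq_num) % SEQ_MOD  # range length minus one
    offset = (expected - start_seq_num) % SEQ_MOD   # position relative to start
    if offset <= span:
        return (end_seq_num + 1) % SEQ_MOD
    return expected
```

   For example, with a declared range of 1000 to 1999, a receiver
   expecting 1500 moves to 2000, while a receiver already expecting
   2500 is left unchanged.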

1303	                           1                   2                   3
1304	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1305	      +---------------+---------------+------------------------------+
1306	      |Kind=OPT_RESYNC|  Length = 10  |     Start Sequence Number    :
1307	      +---------------+---------------+------------------------------+
1308	      :          (4 octets)           |      End Sequence Number     :
1309	      +---------------+---------------+------------------------------+
1310	      :          (4 octets)           |
1311	      +-------------------------------+

1313	                         Figure 10: Resync option
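
   The layout in Figure 10 can be encoded and parsed as sketched below.
   No numeric value is given for OPT_RESYNC in this document, so the
   value used here is a placeholder; the function names are likewise
   hypothetical.

```python
import struct

OPT_RESYNC = 253   # placeholder kind; not a value assigned by this document
RESYNC_LEN = 10    # kind (1) + length (1) + two 4-octet sequence numbers
_FORMAT = "!BBII"  # network byte order, no padding


def encode_resync(start_seq_num: int, end_seq_num: int) -> bytes:
    """Build the 10-octet resync option."""
    return struct.pack(_FORMAT, OPT_RESYNC, RESYNC_LEN,
                       start_seq_num, end_seq_num)


def decode_resync(option: bytes) -> tuple:
    """Parse a resync option and return (start_seq_num, end_seq_num)."""
    if len(option) != RESYNC_LEN:
        raise ValueError("resync option must be exactly 10 octets")
    kind, length, start_seq_num, end_seq_num = struct.unpack(_FORMAT, option)
    if kind != OPT_RESYNC or length != RESYNC_LEN:
        raise ValueError("not a resync option")
    return start_seq_num, end_seq_num
```

   Since the option is 10 octets long, a sender would need to pad it to
   a 32-bit boundary within the TCP header, taking 12 of the 40 bytes
   available for options.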

1315	Authors' Addresses

1317	   Alan Ford
1318	   Roke Manor Research
1319	   Old Salisbury Lane
1320	   Romsey, Hampshire  SO51 0ZN
1321	   UK

1323	   Phone: +44 1794 833 465
1324	   Email: alan.ford@roke.co.uk

1326	   Costin Raiciu
1327	   University College London
1328	   Gower Street
1329	   London  WC1E 6BT
1330	   UK

1332	   Email: c.raiciu@cs.ucl.ac.uk

1334	   Mark Handley
1335	   University College London
1336	   Gower Street
1337	   London  WC1E 6BT
1338	   UK

1340	   Email: m.handley@cs.ucl.ac.uk

1341	   Sebastien Barre
1342	   Universite catholique de Louvain
1343	   Pl. Ste Barbe, 2
1344	   Louvain-la-Neuve  1348
1345	   Belgium

1347	   Phone: +32 10 47 91 03
1348	   Email: sebastien.barre@uclouvain.be