2 Network Working Group S. Barre 3 Internet-Draft C. Paasch 4 Expires: September 8, 2011 O.
Bonaventure 5 UCLouvain, Belgium 6 March 7, 2011 8 MultiPath TCP - Guidelines for implementers 9 draft-barre-mptcp-impl-00 11 Abstract 13 Multipath TCP is a major extension to TCP that allows improving the 14 resource usage in the current Internet by transmitting data over 15 several TCP subflows, while still showing one single regular TCP 16 socket to the application. This document describes our experience in 17 writing a MultiPath TCP implementation in the Linux kernel and 18 discusses implementation guidelines that could be useful for other 19 developers who are planning to add MultiPath TCP to their networking 20 stack. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on September 8, 2011. 39 Copyright Notice 41 Copyright (c) 2011 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3 58 2. An architecture for Multipath transport . . . . . . . . . . . 5 59 2.1. MPTCP architecture . . . . . . . . . . . . . . . . . . . . 5 60 2.2. Structure of the Multipath Transport . . . . . . . . . . . 9 61 2.3. Structure of the Path Manager . . . . . . . . . . . . . . 9 62 3. MPTCP challenges for the OS . . . . . . . . . . . . . . . . . 12 63 3.1. Charging the application for its CPU cycles . . . . . . . 12 64 3.2. At connection/subflow establishment . . . . . . . . . . . 13 65 3.3. Subflow management . . . . . . . . . . . . . . . . . . . . 14 66 3.4. At the data sink . . . . . . . . . . . . . . . . . . . . . 14 67 3.4.1. Receive buffer tuning . . . . . . . . . . . . . . . . 15 68 3.4.2. Receive queue management . . . . . . . . . . . . . . . 15 69 3.4.3. Scheduling data ACKs . . . . . . . . . . . . . . . . . 16 70 3.5. At the data source . . . . . . . . . . . . . . . . . . . . 16 71 3.5.1. Send buffer tuning . . . . . . . . . . . . . . . . . . 17 72 3.5.2. Send queue management . . . . . . . . . . . . . . . . 17 73 3.5.3. Scheduling data . . . . . . . . . . . . . . . . . . . 20 74 3.5.3.1. The congestion controller . . . . . . . . . . . . 21 75 3.5.3.2. The Packet Scheduler . . . . . . . . . . . . . . . 22 76 3.6. At connection/subflow termination . . . . . . . . . . . . 23 77 4. Configuring the OS for MPTCP . . . . . . . . . 
. . . . . . . . 25 78 4.1. Source address based routing . . . . . . . . . . . . . . . 25 79 4.2. Buffer configuration . . . . . . . . . . . . . . . . . . . 27 80 5. Future work . . . . . . . . . . . . . . . . . . . . . . . . . 28 81 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 29 82 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 30 83 Appendix A. Design alternatives . . . . . . . . . . . . . . . . . 32 84 A.1. Another way to consider Path Management . . . . . . . . . 32 85 A.2. Implementing alternate Path Managers . . . . . . . . . . . 33 86 A.3. When to instantiate a new meta-socket ? . . . . . . . . . 34 87 A.4. Forcing more processing in user context . . . . . . . . . 34 88 A.5. Buffering data on a per-subflow basis . . . . . . . . . . 35 89 Appendix B. Ongoing discussions on implementation improvements . 39 90 B.1. Heuristics for subflow management . . . . . . . . . . . . 39 91 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 41 93 1. Introduction 95 The MultiPath TCP protocol [1] is a major TCP extension that allows 96 for simultaneous use of multiple paths, while being transparent to 97 the applications, fair to regular TCP flows [2] and deployable in the 98 current Internet. The MPTCP design goals and the protocol 99 architecture that allow reaching them are described in [3]. Besides 100 the protocol architecture, a number of non-trivial design choices 101 need to be made in order to extend an existing TCP implementation to 102 support MultiPath TCP. This document gathers a set of guidelines 103 that should help implementers writing an efficient and modular MPTCP 104 stack. The guidelines are expected to be applicable regardless of 105 the Operating System (although the MPTCP implementation described 106 here is done in Linux [4]). Another goal is to achieve the greatest 107 level of modularity without impacting efficiency, hence allowing 108 other multipath protocols to nicely co-exist in the same stack. In 109 order for the reader to clearly disambiguate "useful hints" from 110 "important requirements", we write the latter in their own 111 paragraphs, starting with the keyword "IMPORTANT". By important 112 requirements, we mean design options that, if not followed, would 113 lead to an under-performing MPTCP stack, maybe even slower than 114 regular TCP. 116 This draft presents implementation guidelines that are based on the 117 code which has been implemented in our MultiPath TCP aware Linux 118 kernel (the version covered here is 0.6) which is available from 119 http://inl.info.ucl.ac.be/mptcp. We also list configuration 120 guidelines that have proven to be useful in practice. In some cases, 121 we discuss some mechanisms that have not yet been implemented. These 122 mechanisms are clearly listed. During our work in implementing 123 MultiPath TCP, we evaluated other designs. Some of them are not used 124 anymore in our implementation. However, we explain in the appendix 125 the reason why these particular designs have not been considered 126 further. 128 This document is structured as follows. First we propose an 129 architecture that allows supporting MPTCP in a protocol stack 130 residing in an operating system. Then we consider a range of 131 problems that must be solved by an MPTCP stack (compared to a regular 132 TCP stack). In Section 4, we propose recommendations on how a system 133 administrator could correctly configure an MPTCP-enabled host. 
134 Finally, we discuss future work, in particular in the area of MPTCP 135 optimization. 137 1.1. Terminology 139 In this document we use the same terminology as in [3] and [1]. In 140 addition, we will use the following implementation-specific terms: 142 o Meta-socket: A socket structure used to reorder incoming data at 143 the connection level and schedule outgoing data to subflows. 145 o Master subsocket: The socket structure that is visible from the 146 application. If regular TCP is in use, this is the only active 147 socket structure. If MPTCP is used, this is the socket 148 corresponding to the first subflow. 150 o Slave subsocket: Any socket created by the kernel to provide an 151 additional subflow. Those sockets are not visible to the 152 application (unless a specific API [5] is used). The meta-socket, 153 master and slave subsocket are explained in more details in 154 Section 2.2. 156 o Endpoint id: Endpoint identifier. It is the tuple (saddr, sport, 157 daddr, dport) that identifies a particular subflow, hence a 158 particular subsocket. 160 o Fendpoint id: First Endpoint identifier. It is the endpoint 161 identifier of the Master subsocket. 163 o Connection id or token: It is a locally unique number, defined in 164 Section 2 of [1], that allows finding a connection during the 165 establishment of new subflows. 167 2. An architecture for Multipath transport 169 Section 4 of the MPTCP architecture document [3] describes the 170 functional decomposition of MPTCP. It lists four entities, namely 171 Path Management, Packet Scheduling, Subflow Interface and Congestion 172 Control. These entities can be further grouped based on the layer at 173 which they operate: 175 o Transport layer: This includes Packet Scheduling, Subflow 176 Interface and Congestion Control, and is grouped under the term 177 "Multipath Transport (MT)". From an implementation point of view, 178 they all will involve modifications to TCP. 180 o Any layer: Path Management. Path management can be done in the 181 transport layer, as is the case of the built-in path manager (PM) 182 described in [1]. That PM discovers paths through the exchange of 183 TCP options of type ADD_ADDR or the reception of a SYN on a new 184 address pair, and defines a path as an endpoint_id (saddr, sport, 185 daddr, dport). But, more generally, a PM could be any module able 186 to expose multiple paths to MPTCP, located either in kernel or 187 user space, and acting on any OSI layer (e.g. a bonding driver 188 that would expose its multiple links to the Multipath Transport). 190 Because of the fundamental independence of Path Management compared 191 to the three other entities, we draw a clear line between both, and 192 define a simple interface that allows MPTCP to benefit easily from 193 any appropriately interfaced multipath technology. In this document, 194 we stick to describing how the functional elements of MPTCP are 195 defined, using the built-in Path Manager described in [1], and we 196 leave for future separate documents the description of other path 197 managers. We describe in the first subsection the precise roles of 198 the Multipath Transport and the Path Manager. Then we detail how 199 they are interfaced with each other. 201 2.1. MPTCP architecture 203 Although, when using the built-in PM, MPTCP is fully contained in the 204 transport layer, it can still be organized as a Path Manager and a 205 Multipath Transport Layer as shown in Figure 1. 
The Path Manager announces to the MultiPath Transport what paths can be used, through path indices, for an MPTCP connection identified by the fendpoint_id (first endpoint id). The fendpoint_id is the tuple (saddr, sport, daddr, dport) seen by the application and uniquely identifies the MPTCP connection (an alternate way to identify the MPTCP connection being the conn_id, which is a token as described in Section 2 of [1]). The Path Manager maintains the mapping between the path_index and an endpoint_id. The endpoint_id is the tuple (saddr, sport, daddr, dport) that is to be used for the corresponding path index.

Note that the fendpoint_id itself represents a path and is thus a particular endpoint_id. By convention, the fendpoint_id is always represented as path index 1. As explained in [3], Section 5.6, it is not yet clear how an implementation should behave in the event of a failure in the first subflow. We expect, however, that the Master subsocket should be kept in use as an interface with the application, even if no data is transmitted anymore over it. This also allows the fendpoint_id to remain meaningful throughout the life of the connection. This behavior has yet to be tested and refined with Linux MPTCP.

Figure 1 shows an example sequence of MT-PM interactions happening at the beginning of an exchange. When the MT starts a new connection (through an application connect() or accept()), it can request the PM to be updated about possible alternate paths for this new connection. The PM can also spontaneously update the MT at any time (normally when the path set changes). This is step 1 in Figure 1. In the example, 4 paths can be used, hence 3 new ones. Based on the update, the MT can decide whether to establish new subflows, and how many of them. Here, the MT decides to establish one subflow only, and sends a request for an endpoint_id to the PM. This is step 2. In step 3, the answer is given: the endpoint_id to use for path index 2. Its source port is left unspecified, to allow the MT to ensure the uniqueness of the new endpoint_id thanks to the new_port() primitive (present in regular TCP as well). Note that messages 1, 2 and 3 need not be real messages and can be function calls instead (as is the case in Linux MPTCP).

                                Control plane
   +---------------------------------------------------------------+
   |                    Multipath Transport (MT)                   |
   +----------------------------------------------------|----------+
      ^                       |               ^         v
      |                       |               |  [Build new subsocket,
      | 1. For fendpt_id,     | 2. endpt_id   |   with the endpt_ids
      |    paths with indices |    for path   | 3. returned by the PM]
      |    1,2,3,4 can be     |    index 2 ?  |
      |    used.              |               |
      |                       v               |
   +---------------------------------------------------------------+
   |                       Path Manager (PM)                       |
   +---------------------------------------------------------------+
                         /               \
                  /---------------------------------------\
                  | mapping table:                        |
                  |      Subflow    <-->   endpoint_id    |
                  |     path index                        |
                  |                                       |
                  |          [see table below]            |
                  +---------------------------------------+

      Figure 1: Functional separation of MPTCP in the transport layer
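As a rough illustration of how messages 1, 2 and 3 of Figure 1 can be expressed as function calls, the following C sketch declares a minimal MT-PM interface. It is only a sketch under the assumptions of this section: all names (endpoint_id, mt_update_paths(), pm_get_endpoint()) are hypothetical and do not correspond to the actual Linux MPTCP symbols; IPv4 is used for brevity.

   /* Hypothetical sketch of the MT <-> PM interface (not the actual
    * Linux MPTCP code). */

   #include <stdint.h>
   #include <netinet/in.h>

   struct endpoint_id {                /* (saddr, sport, daddr, dport) */
           struct in_addr saddr;
           uint16_t       sport;       /* 0: let the MT pick a free port */
           struct in_addr daddr;
           uint16_t       dport;
   };

   /* Message 1: the PM tells the MT which path indices may be used
    * for the connection identified by 'token'. */
   void mt_update_paths(uint32_t token, const uint8_t *path_indices,
                        int nr_paths);

   /* Messages 2 and 3: the MT asks the PM for the endpoint_id to use
    * on a given path index; the PM fills in 'ep', leaving sport at 0
    * so that the MT can pick a locally unique port with new_port(). */
   int pm_get_endpoint(uint32_t token, uint8_t path_index,
                       struct endpoint_id *ep);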
The following options, described in [1], are managed by the Multipath Transport:

o MULTIPATH CAPABLE (MP_CAPABLE): Tells the peer that we support MPTCP and announces our local token.

o MP_JOIN/MP_AUTH: Initiates a new subflow (note that MP_AUTH is not yet part of our Linux implementation).

o DATA SEQUENCE NUMBER (DSN_MAP): Identifies the position of a set of bytes in the meta-flow.

o DATA_ACK: Acknowledges data at the connection level (subflow-level acknowledgments are contained in the normal TCP header).

o DATA FIN (DFIN): Terminates a connection.

o MP_PRIO: Asks the peer to revise the backup status of the subflow on which the option is sent. Although the option is sent by the Multipath Transport (because this allows using the TCP option space), it may be triggered by the Path Manager. This option is not yet supported by our MPTCP implementation.

o MP_FAIL: Checksum failed at connection level. Currently the Linux implementation does not implement the checksum in the DSN_MAP option, and hence does not implement the MP_FAIL option either.

The Path Manager applies a particular technology to give the MT the possibility to use several paths. The built-in MPTCP Path Manager uses multiple IPv4/v6 addresses as its means to influence the forwarding of packets through the Internet. When the MT starts a new connection, it chooses a token that will be used to identify the connection. This is necessary to allow future subflow-establishment SYNs (that is, those containing the MP_JOIN option) to be attached to the correct connection. An example mapping table is given hereafter:

   +---------+------------+----------------------+
   |  token  | path index |      Endpoint id     |
   +---------+------------+----------------------+
   | token_1 |     1      | (A1, pA1, B1, pB1)   |
   |         |            |                      |
   | token_1 |     2      | (A1, pA1', B2, pB2)  |
   |         |            |                      |
   | token_1 |     3      | (A2, pA2, B1, pB1)   |
   |         |            |                      |
   | token_1 |     4      | (A2, pA2', B2, pB2)  |
   |         |            |                      |
   | token_2 |     1      | (A1, pA3, B1, pB3)   |
   |         |            |                      |
   | token_2 |     2      | (A2, pA3', B1, pB3)  |
   +---------+------------+----------------------+

              Table 1: Example mapping table for built-in PM

Table 1 shows an example where two MPTCP connections are active. One is identified by token_1, the other by token_2. As per [1], the tokens must be locally unique. Since the endpoint identifier may change from one subflow to another, the attachment of incoming new subflows (identified by a SYN + MP_JOIN option) to the right connection is achieved thanks to the locally unique token. The following options (defined in [1]) are intended to be part of the built-in path manager:

o Add Address (ADD_ADDR): Announces a new address we own.

o Remove Address (REMOVE_ADDR): Withdraws a previously announced address.

Those options form the built-in MPTCP Path Manager, which is based on declaring IP addresses and carries its control information in TCP options. An implementation of Multipath TCP can use any Path Manager, but it must be able to fall back to the default PM in case the other end does not support the custom PM. Alternative Path Managers may be specified in separate documents in the future.

2.2. Structure of the Multipath Transport

The Multipath Transport handles three kinds of sockets. We define them here and use this notation throughout the entire document:

o Master subsocket: This is the first socket in use when a connection (TCP or MPTCP) starts. It is also the only one in use if we need to fall back to regular TCP. This socket is initiated by the application through the socket() system call.
Immediately 356 after a new master subsocket is created, MPTCP capability is 357 enabled by the creation of the meta-socket. 359 o Meta-socket: It holds the multipath control block, and acts as the 360 connection level socket. As data source, it holds the main send 361 buffer. As data sink, it holds the connection-level receive queue 362 and out-of-order queue (used for reordering). We represent it as 363 a normal (extended) socket structure in Linux MPTCP because this 364 allows reusing much of the existing TCP code with few 365 modifications. In particular, the regular socket structure 366 already holds pointers to SND.UNA, SND.NXT, SND.WND, RCV.NXT, 367 RCV.WND (as defined in [6]). It also holds all the necessary 368 queues for sending/receiving data. 370 o Slave subsocket: Any subflow created by MPTCP, in addition to the 371 first one (the master subsocket is always considered as a subflow 372 even though it may be in failed state at some point in the 373 communication). The slave subsockets are created by the kernel 374 (not visible from the application) The master subsocket and the 375 slave subsockets together form the pool of available subflows that 376 the MPTCP Packet Scheduler (called from the meta-socket) can use 377 to send packets. 379 2.3. Structure of the Path Manager 381 In contrast to the multipath transport, which is more complex and 382 divided in sub-entities (namely Packet Scheduler, Subflow Interface 383 and Congestion Control, see Section 2), the Path Manager just 384 maintains the mapping table and updates the Multipath Transport when 385 the mapping table changes. The mapping table has been described 386 above (Table 1). We detail in Table 2 the set of (event,action) 387 pairs that are implemented in the Linux MPTCP built-in path manager. 388 For reference, an earlier architecture for the Path Management is 389 discussed in Appendix A.1. Also, Appendix A.2 proposes a small 390 extension to this current architecture to allow supporting other path 391 managers. 393 +-------------------------+-----------------------------------------+ 394 | event | action | 395 +-------------------------+-----------------------------------------+ 396 | master_sk bound: This | Discovers the set of local addresses | 397 | event is triggered upon | and stores them in local_addr_table | 398 | either a bind(), | | 399 | connect(), or when a | | 400 | new server-side socket | | 401 | becomes established. | | 402 | | | 403 | ADD_ADDR option | Updates remote_addr_table | 404 | received or SYN+MP_JOIN | correspondingly | 405 | received on new address | | 406 | | | 407 | local/remote_addr_table | Updates mapping_table by adding any new | 408 | updated | address combinations, or removing the | 409 | | ones that have disappeared. Each | 410 | | address pair is given a path index. | 411 | | Once allocated to an address pair, a | 412 | | path index cannot be reallocated to | 413 | | another one, to ensure consistency of | 414 | | the mapping table. | 415 | | | 416 | Mapping_table updated | Sends notification to the Multipath | 417 | | Transport. The notification contains | 418 | | the new set of path indices that the MT | 419 | | is allowed to use. This is shown in | 420 | | Figure 1, msg 1. | 421 | | | 422 | Endpoint_id(path_index) | Retrieves the endpoint_ids for the | 423 | request received from | corresponding path index from the | 424 | MT (Figure 1, msg 2) | mapping table and returns them to the | 425 | | MT. One such request/response is | 426 | | illustrated in Figure 1, msg 3. 
Note | 427 | | that in that msg 3, the local port is | 428 | | set to zero. This is to let the | 429 | | operating system choose a unique local | 430 | | port for the new socket. | 431 +-------------------------+-----------------------------------------+ 433 Table 2: (event,action) pairs implemented in the built-in PM

3. MPTCP challenges for the OS

MPTCP is a major modification to the TCP stack. We have described above an architecture that separates Multipath Transport from Path Management. Path Management can be implemented rather simply. But Multipath Transport involves a set of new challenges that do not exist in regular TCP. We first describe how an MPTCP client or server can start a new connection, or a new subflow within a connection. Then we propose techniques (a concrete implementation of which is done in Linux MPTCP) to efficiently implement data reception (at the data sink) and data sending (at the data source).

3.1. Charging the application for its CPU cycles

As this document is about implementation, it is important not only to ensure that MPTCP is fast, but also that it is fair to other applications that share the same CPU. Otherwise one could have an extremely fast file transfer, while the rest of the system is just hanging. CPU fairness is ensured by the scheduler of the Operating System when things are implemented in user space. But in the kernel, we can choose to run code in "user context", that is, in a mode where each CPU cycle is charged to a particular application. Or we can (and must in some cases) run code in "interrupt context", that is, interrupting everything else until the task has finished. In Linux (and probably similarly in other systems), the arrival of a new packet on a NIC triggers a hardware interrupt, which in turn schedules a software interrupt that will pull the packet from the NIC and perform the initial processing. The challenge is to stop the processing of the incoming packet in the software interrupt as soon as it can be attached to a socket, and wake up the application. With TCP, an additional constraint is that incoming data should be acknowledged as soon as possible, which requires reordering. Van Jacobson has proposed a solution for this [7]: if an application is waiting on a recv() system call, incoming packets can be put into a special queue (called prequeue in Linux) and the application is woken up. Reordering and acknowledgement are then performed in user context. The execution path for outgoing packets is less critical from that point of view, because the vast majority of the processing can be done very easily in user context.

In this document, when discussing CPU fairness, we will use the following terms:

o User context: Execution environment that is under control of the OS scheduler. CPU cycles are charged to the associated application, which ensures fairness with other applications.

o Interrupt context: Execution environment that runs with higher priority than any process. Although it is impossible to completely avoid running code in interrupt context, it is important to minimize the amount of code running in such a context.

o VJ prequeues: This refers to Van Jacobson prequeues, as explained above [7].
3.2. At connection/subflow establishment

As described in [1], the establishment of an MPTCP connection is quite simple, being just a regular three-way exchange with additional options. As shown in Section 2.2, this is done in the master subsocket. Currently Linux MPTCP attaches a meta-socket to a socket as soon as it is created, that is, upon a socket() system call (client side), or when a server-side socket enters the ESTABLISHED state. An alternate solution is described in Appendix A.3.

An implementation can choose the best moment, maybe depending on the OS, to instantiate the meta-socket. However, if this meta-socket is needed to accept new subflows (as it is in Linux MPTCP), it should be attached at the latest when the MP_CAPABLE option is received. Otherwise incoming new subflow requests (SYN + MP_JOIN) may be lost, requiring retransmissions by the peer and delaying the subflow establishment.

The establishment of subflows, on the other hand, is trickier. The problem is that new SYNs (with the MP_JOIN option) must be accepted by a socket (the meta-socket in the proposed design) as if it were in LISTEN state, while its state is actually ESTABLISHED. The meta-socket has the following in common with a LISTEN socket:

o Temporary structure: Between the reception of the SYN and the final ACK, a mini-socket is used as a temporary structure.

o Queue of connection requests: The meta-socket, like a LISTEN socket, maintains a list of pending connection requests. There are two such lists. One contains mini-sockets, because the final ACK has not yet been received. The second list contains sockets in the ESTABLISHED state that have not yet been accepted. "Accepted" means, for regular TCP, returned to the application as a result of an accept() system call. For MPTCP it means that the new subflow has been integrated in the set of active subflows.

We can list the following differences from a normal LISTEN socket:

o Socket lookup for a SYN: When a SYN is received, the corresponding LISTEN socket is found by using the endpoint_id. This is not possible with MPTCP, since we can receive a SYN on any endpoint_id. Instead, the token must be used to retrieve the meta-socket to which the SYN must be attached. A new hashtable must be defined, with tokens as keys.

o Lookup for a connection request: In regular TCP, this lookup is quite similar to the previous one (in Linux at least). The 5-tuple is used, first to find the LISTEN socket, next to retrieve the corresponding mini-socket, stored in a private hashtable inside the LISTEN socket. With MPTCP, we cannot do that, because there is no way to retrieve the meta-socket from the final ACK. The 5-tuple can be anything, and the token is only present in the SYN. There is no token in the final ACK. Our Linux MPTCP implementation uses a global hashtable for pending connection requests, where the key is the 5-tuple of the connection request.

An implementation must carefully check the presence of the MP_JOIN option in incoming SYNs before performing the usual socket lookup. If it is present, only the token-based lookup must be done. If this lookup does not return a meta-socket, the SYN must be discarded. Failing to do that could lead to mistakenly attaching the incoming SYN to a LISTEN socket instead of attaching it to a meta-socket.
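The token-based lookup described above can be pictured with the following minimal C sketch. It is written in plain user-space C with hypothetical names (token_entry, meta_socket, mp_join_syn_lookup()), not the actual Linux MPTCP data structures.

   /* Sketch of a token-keyed hashtable used to demultiplex SYNs that
    * carry the MP_JOIN option (hypothetical names). */

   #include <stdint.h>
   #include <stddef.h>

   #define TOKEN_HASH_SIZE 256

   struct meta_socket;                      /* connection-level socket */

   struct token_entry {
           uint32_t            token;       /* locally unique token    */
           struct meta_socket *meta_sk;
           struct token_entry *next;        /* hash chaining           */
   };

   static struct token_entry *token_hash[TOKEN_HASH_SIZE];

   static struct meta_socket *lookup_by_token(uint32_t token)
   {
           struct token_entry *e;

           for (e = token_hash[token % TOKEN_HASH_SIZE]; e; e = e->next)
                   if (e->token == token)
                           return e->meta_sk;
           return NULL;
   }

   /* For a SYN carrying MP_JOIN, only the token lookup may be
    * performed.  A NULL result means the SYN must be discarded,
    * never handed to a LISTEN socket. */
   struct meta_socket *mp_join_syn_lookup(uint32_t token)
   {
           return lookup_by_token(token);
   }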
3.3. Subflow management

Further research is needed to define appropriate heuristics for deciding when to establish or terminate subflows. Initial thoughts are provided in Appendix B.1.

Currently, in a Linux MPTCP client, the Multipath Transport tries to open all subflows advertised by the Path Manager. On the other hand, the server only accepts new subflows, but does not try to establish new ones. The rationale for this is that the client is the connection initiator. New subflows are only established if the initiator requests them. This is subject to change in future releases of our MPTCP implementation.

3.4. At the data sink

There is a symmetry between the behavior of the data source and the data sink. Yet, the specific requirements are different. The data sink is described in this section while the data source is described in the next section.

3.4.1. Receive buffer tuning

The receive buffer required by MPTCP is larger than the sum of the buffers required by the individual subflows. The reason for this and proper values for the buffer are explained in [3], Section 5.3. Not following this could result in the MPTCP speed being capped at the bandwidth of the slowest subflow.

An interesting way to dynamically tune the receive buffer according to the bandwidth/delay product (BDP) of a path, for regular TCP, is described in [8] and implemented in recent Linux kernels. It uses the COPIED_SEQ sequence variable (sequence number of the next byte to copy to the app buffer) to count, every RTT, the number of bytes received during that RTT. This number of bytes is precisely the BDP. The accuracy of this technique is directly dependent on the accuracy of the RTT estimation. Unfortunately, the data sink does not have a reliable estimate of the SRTT. To solve this, [8] proposes two techniques:

1. Using the timestamp option (quite accurate).

2. Computing the time needed to receive one RCV.WND [6] worth of data. It is less precise and is used only to compute an upper bound on the required receive buffer.

As described in [1], Section 3.3.3, the MPTCP advertised receive window is shared by all subflows. Hence, no per-subflow information can be deduced from it, and the second technique from [8] cannot be used. [3] mentions that the allocated connection-level receive buffer should be 2*sum(BW_i)*RTT_max, where BW_i is the bandwidth seen by subflow i and RTT_max is the maximum RTT estimated among all the subflows. This is achieved in Linux MPTCP by slightly modifying the first tuning algorithm from [8] and disabling the second one. The modification consists in counting, on each subflow, every RTT_max, the number of bytes received during that time on this subflow. This gives the contribution of each subflow to the total receive buffer of the connection.
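The per-subflow accounting just described can be sketched as follows. This is a simplified user-space illustration with hypothetical names (subflow_rcv, rcvbuf_sample(), mptcp_rcvbuf()), not the actual Linux MPTCP code; time is assumed to be available in microseconds.

   /* Every RTT_max, record the bytes received on a subflow as its
    * contribution to the connection-level receive buffer. */

   #include <stdint.h>

   struct subflow_rcv {
           uint32_t copied_seq;        /* next byte to copy to the app */
           uint32_t last_copied_seq;   /* copied_seq at the last sample */
           uint64_t last_sample_us;    /* time of the last sample      */
           uint32_t contribution;      /* bytes received in one RTT_max */
   };

   /* Called when data is copied to the application on this subflow.
    * rtt_max_us is the largest smoothed RTT among all subflows. */
   void rcvbuf_sample(struct subflow_rcv *sf, uint64_t now_us,
                      uint64_t rtt_max_us)
   {
           if (now_us - sf->last_sample_us < rtt_max_us)
                   return;                  /* sample once per RTT_max */

           sf->contribution    = sf->copied_seq - sf->last_copied_seq;
           sf->last_copied_seq = sf->copied_seq;
           sf->last_sample_us  = now_us;
   }

   /* Connection-level buffer: twice the sum of the contributions,
    * approximating 2 * sum(BW_i) * RTT_max. */
   uint32_t mptcp_rcvbuf(const struct subflow_rcv *sf, int nr_subflows)
   {
           uint32_t total = 0;
           int i;

           for (i = 0; i < nr_subflows; i++)
                   total += sf[i].contribution;
           return 2 * total;
   }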
3.4.2. Receive queue management

As advised in [1], Section 3.3.1, "subflow-level processing should be undertaken separately from that at connection-level". This also has the side-effect of allowing much code reuse from the regular TCP stack. A regular TCP stack (in Linux at least) maintains a receive queue (for storing incoming segments until the application asks for them) and an out-of-order queue (to allow reordering). In Linux MPTCP, the subflow-level receive queue is not used. Incoming segments are reordered at the subflow level, just as if they were plain TCP data. But once the data is in order at the subflow level, it can be immediately handed to MPTCP (see Figure 7 of [3]) for connection-level reordering. The role of the subflow-level receive queue is now taken by the MPTCP-level receive queue. In order to maximize the CPU cycles spent in user context (see Section 3.1), VJ prequeues can be used just as in regular TCP (they are not yet supported in Linux MPTCP, though).

An alternate design, where the subflow-level receive queue is kept active and the MPTCP receive queue is not used, is discussed in Appendix A.4.

3.4.3. Scheduling data ACKs

As specified in [1], Section 3.3.2, data ACKs not only help the sender in having a consistent view of what data has been correctly received at the connection level. They are also used as the left edge of the advertised receive window.

In regular TCP, if a receive buffer becomes full, the receiver announces a zero receive window. When finally some bytes are given to the application, freeing space in the receive buffer, a duplicate ACK is sent to act as a window update, so that the sender knows it can transmit again. Likewise, when the MPTCP shared receive buffer becomes full, a zero window is advertised. When some bytes are delivered to the application, a duplicate DATA_ACK must be sent to act as a window update. Such an important DATA_ACK should be sent on all subflows, to maximize the probability that at least one of them reaches the peer. If, however, all DATA_ACKs are lost, there is no other option than relying on the window probes periodically sent by the data source, as in regular TCP.

In theory a DATA_ACK can be sent on any subflow, or even on all subflows simultaneously. As of version 0.5, Linux MPTCP simply adds the DATA_ACK option to any outgoing segment (regardless of whether it is data or a pure ACK). There is thus no particular DATA_ACK scheduling policy. The only exception is for a window update that follows a zero window. In this case, the behavior is as described in the previous paragraph.

3.5. At the data source

In this section we mirror the topics of the previous section, in the case of a data sender. The sender does not have the same view of the communication, because one has information that the other can only estimate. Also, the data source sends data and receives acknowledgements, while the data sink does the reverse. This results in a different set of problems to be dealt with by the data source.

3.5.1. Send buffer tuning

As explained in [3], end of Section 5.3, the send buffer should have the same size as the receive buffer. At the sender, we don't have the RTT estimation problem described in Section 3.4.1, because we can reuse the built-in TCP SRTT (smoothed RTT). Moreover, the sender has the congestion window, which is itself an estimate of the BDP, and is used in Linux to tune the send buffer of regular TCP. Unfortunately, we cannot use the congestion window with MPTCP, because the buffer equation does not involve the product BW_i*delay_i for the subflows (which is what the congestion window estimates), but it involves BW_i*delay_max, where delay_max is the maximum observed delay across all subflows. An obvious way to compute the contribution of each subflow to the receive buffer would be: 2*(cwnd_i/SRTT_i)*SRTT_max. However, some care is needed because of the variability of the SRTT (measurements show that, even smoothed, the SRTT is not quite stable). Currently Linux MPTCP estimates the bandwidth periodically by checking the sequence number progress. This however introduces new mechanisms in the kernel that could probably be avoided. Future experience will tell what is appropriate.
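The "obvious" per-subflow computation mentioned above, before the SRTT-variability caveat is addressed, can be illustrated with the following C sketch. The structure and function names are hypothetical; this is not the actual Linux MPTCP code.

   /* 2 * (cwnd_i / SRTT_i) * SRTT_max, i.e. 2 * BW_i * delay_max,
    * as a rough per-subflow buffer contribution (hypothetical names). */

   #include <stdint.h>

   struct subflow_snd {
           uint32_t cwnd_bytes;    /* congestion window, in bytes   */
           uint32_t srtt_us;       /* smoothed RTT, in microseconds */
   };

   uint64_t sndbuf_contribution(const struct subflow_snd *sf,
                                uint32_t srtt_max_us)
   {
           if (sf->srtt_us == 0)
                   return 0;                     /* no RTT sample yet */
           return 2ULL * sf->cwnd_bytes * srtt_max_us / sf->srtt_us;
   }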
3.5.2. Send queue management

As MultiPath TCP involves the use of several TCP subflows, a scheduler must be added to decide where to send each byte of data. Two possible places for the scheduler have been evaluated for Linux MPTCP. One option is to schedule data as soon as it arrives from the application buffer. This option, consisting in _pushing_ data to subflows as soon as it is available, was implemented in older versions of Linux MPTCP and is now abandoned. We keep a description of it (and why it has been abandoned) in Appendix A.5. Another option is to store all data centrally in the Multipath Transport, inside a shared send buffer (see Figure 2). Scheduling is then done at transmission time, whenever any subflow becomes ready to send more data (usually due to acknowledgements having opened space in the congestion window). In that scenario, the subflows _pull_ segments from the shared send queue whenever they are ready. Note that several subflows can become ready simultaneously, if an acknowledgement advertises a new receive window that opens more space in the shared send window. For that reason, when a subflow pulls data, the Packet Scheduler is run and other subflows may be fed by the Packet Scheduler at the same time.

                           Application
                                |
                                v
                              | * |
   Next segment to send (A) ->| * |
                              |---|   <- Shared send queue
   Sent, but not DATA-acked(B)->|_*_|
                                |
                                v
                         Packet Scheduler
                            /       \
                           /         \
                          |           |
                          v           v
   Sent, but not acked(B) -> |_|     |_| <- Subflow level congestion
                              |       |     window
                              v       v
                             NIC     NIC

                    Figure 2: Send queue configuration

This approach, similar to the one proposed in [9], presents several advantages:

o Each subflow can easily fill its pipe (as long as there is data to pull from the shared send buffer, and the scheduler is not applying a policy that restricts the subflow).

o If a subflow fails, it will no longer receive acknowledgements, and hence will naturally stop pulling from the shared send buffer. This removes the need for an explicit "failed state", to ensure that a failed subflow does not receive data (as opposed to e.g. SCTP-CMT, which needs an explicit marking of failed subflows by design, because it uses a single sequence number space [10]).

o Similarly, when a failed subflow becomes active again, the pending segments of its congestion window are finally acknowledged, allowing it to pull again from the shared send buffer. Note that in such a case, the acknowledged data is normally just dropped by the receiver, because the corresponding segments have been retransmitted on another subflow during the failure time.

Despite the adoption of that approach in Linux MPTCP, there are still two drawbacks:

o There is one single queue, in the Multipath Transport, from which all subflows pull segments.
In Linux, queue processing is optimized for handling segments, not bytes. This implies that the shared send queue must contain pre-built segments, hence requiring the _same_ MSS to be used for all subflows. We note however that today, the most commonly negotiated MSS is around 1380 bytes [4], so this approach sounds reasonable. Should this requirement become too constraining in the future, a more flexible approach could be devised (e.g., supporting a few Maximum Segment Sizes).

o Because the subflows pull data whenever they get new free space in their congestion window, the Packet Scheduler must run at that time. But that time most often corresponds to the reception of an acknowledgement, which happens in interrupt context (see Section 3.1). This is both unfair to other system processes, and slightly inefficient for high-speed communications. The problem is that the packet scheduler performs more operations than the usual "copy packet to NIC". One way to solve this problem would be to have a small subflow-specific send queue, which would actually lead to a hybrid architecture between the pull approach (described here) and the push approach (described in Appendix A.5). Doing that would require solving non-trivial problems, though, and requires further study.

As shown in Figure 2, a segment first enters the shared send queue, then, when reaching the bottom of that queue, it is pulled by some subflow. But to support failures, we need to be able to move segments from one subflow to another, so that the failure is invisible from the application. In Linux MPTCP, the segment data is kept in the shared send queue (B portion of the queue). When a subflow pulls a segment, it actually only copies the control structure (struct sk_buff), which Linux calls packet cloning, and increments its reference count. The following event/action table summarizes these operations:

   +-----------------+-------------------------------------------------+
   | event           | action                                          |
   +-----------------+-------------------------------------------------+
   | Segment         | Remove references to the segment from the       |
   | acknowledged at | subflow-level queue                             |
   | subflow level   |                                                 |
   |                 |                                                 |
   | Segment         | Remove references to the segment from the       |
   | acknowledged at | connection-level queue                          |
   | connection      |                                                 |
   | level           |                                                 |
   |                 |                                                 |
   | Timeout         | Push the segment to the best running subflow    |
   | (subflow-level) | (according to the Packet Scheduler).  If no     |
   |                 | subflow is available, push it to a temporary    |
   |                 | retransmit queue (not represented in Figure 2)  |
   |                 | for future pulling by an available subflow.     |
   |                 | The retransmit queue is parallel to the         |
   |                 | connection-level queue and is read with higher  |
   |                 | priority.                                       |
   |                 |                                                 |
   | Ready to put    | If the retransmit queue is not empty, first     |
   | new data on the | pull from there.  Otherwise, take new           |
   | wire (normally  | segment(s) from the connection-level send queue |
   | triggered by an | (A portion).  The pulling operation is a bit    |
   | incoming ack)   | special in that it can result in sending a      |
   |                 | segment over a different subflow than the one   |
   |                 | which initiated the pull.  This is because the  |
   |                 | Packet Scheduler is run as part of the pull,    |
   |                 | which can result in selecting any subflow.  In  |
   |                 | most cases, though, the subflow which           |
   |                 | originated the pull will get fresh data, given  |
   |                 | it has space for that in the congestion window. |
   |                 | Note that the subflows have no A portion in     |
   |                 | Figure 2, because they immediately send the     |
   |                 | data they pull.                                 |
   +-----------------+-------------------------------------------------+

        Table 3: (event,action) pairs implemented in the Multipath
                       Transport queue management
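The last row of Table 3, the pull operation, can be summarized in the following C sketch. All types and helper functions here are hypothetical placeholders, not the actual Linux MPTCP code; they only illustrate the control flow described above.

   /* Sketch of the pull path (hypothetical names).  Called whenever
    * a subflow gains free space in its congestion window, typically
    * from the ACK processing path. */

   struct segment;
   struct subflow;

   extern struct segment *peek_retransmit_queue(void);
   extern struct segment *peek_send_queue(void);      /* A portion */
   extern struct subflow *packet_scheduler(const struct segment *seg);
   extern int             cwnd_has_room(const struct subflow *sf);
   extern void            send_clone(struct subflow *sf,
                                     struct segment *seg);
   extern void            advance_queue(struct segment *seg);

   void subflow_pull(void)
   {
           struct segment *seg;
           struct subflow *target;

           for (;;) {
                   /* The retransmit queue is read with higher priority. */
                   seg = peek_retransmit_queue();
                   if (seg == NULL)
                           seg = peek_send_queue();
                   if (seg == NULL)
                           return;             /* nothing left to send */

                   /* The scheduler may select any subflow, not
                    * necessarily the one that initiated the pull. */
                   target = packet_scheduler(seg);
                   if (target == NULL || !cwnd_has_room(target))
                           return;    /* no subflow can send right now */

                   /* Only a clone (a reference) is handed to the
                    * subflow; the segment stays in the shared queue
                    * (B portion) until it is DATA_ACKed. */
                   send_clone(target, seg);
                   advance_queue(seg);
           }
   }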
IMPORTANT: A subflow can be stopped from transmitting by the congestion window, but also by the send window (that is, the receive window announced by the peer). Given that the receive window has a connection-level meaning, a DATA_ACK arriving on one subflow could unblock another subflow. Implementations should be aware of this to avoid stalling part of the subflows in such situations. In the case of Linux MPTCP, which follows the above architecture, this is ensured by running the Packet Scheduler at each pull operation. This is not completely optimal, though, and may be revised when more experience is gained.

3.5.3. Scheduling data

As several subflows may be used to transmit data, MPTCP must select a subflow on which to send each segment. First, we need to know which subflows are available for sending data. The mechanism that controls this is the congestion controller, which maintains a per-subflow congestion window. The aim of a Multipath congestion controller is to move data away from congested links, and ensure fairness when there is a shared bottleneck. The handling of the congestion window is explained in Section 3.5.3.1. Given a set of available subflows (according to the congestion window), one of these has to be selected by the Packet Scheduler. The role of the Packet Scheduler is to implement a particular policy, as will be explained in Section 3.5.3.2.

3.5.3.1. The congestion controller

The Coupled Congestion Control provided in Linux MPTCP implements the algorithm defined in [2]. Operating System kernels (Linux at least) do not support floating-point numbers for efficiency reasons. [2] makes extensive use of them, which must be worked around. Linux MPTCP solves that by performing fixed-point operations using a minimum number of fractions, and by scaling when divisions are necessary.

Linux already includes a work-around for floating-point operations in the Reno congestion avoidance implementation. Upon reception of an ack, the congestion window (counted in segments, not in bytes as [2] proposes) should be updated as cwnd+=1/cwnd. Instead, Linux increments the separate variable snd_cwnd_cnt, until snd_cwnd_cnt>=cwnd. When this happens, snd_cwnd_cnt is reset, and cwnd is incremented. Linux MPTCP reuses this to update the window in the CCC (Coupled Congestion Control) congestion avoidance phase: snd_cwnd_cnt is incremented as previously explained, and cwnd is incremented when snd_cwnd_cnt >= max(tot_cwnd / alpha, cwnd) (see [2]). Note that the bytes_acked variable, present in [2], is not included here because Linux MPTCP does not currently support ABC [11], but instead considers acknowledgements in MSS units. For ABC with Reno, Linux uses the bytes_acked variable instead of snd_cwnd_cnt. For Reno, cwnd is incremented by one if bytes_acked>=cwnd*MSS.
Hence, in the case of a CCC with ABC, one would increment cwnd when bytes_acked >= max(tot_cwnd*MSS / alpha, cwnd*MSS).

Unfortunately, the alpha parameter mentioned above involves many fractions. The current implementation of MPTCP uses a rewritten version of the alpha formula from [2]:

                             cwnd_max * scale_num
   alpha = tot_cwnd * ----------------------------------
                      / rtt_max * cwnd_i * scale_den \ 2
                      | sum  ---------------------------- |
                      \  i            rtt_i              /

This computation assumes that the MSS is shared by all subflows, which is true under the architecture described in Section 3.5.2 but implies that implementations choosing to support several MSS values cannot use the above simplified equation. The variables cwnd_max and rtt_max in the above equation are NOT, respectively, the maximum congestion window and RTT across all subflows. Instead, they are the values of the subflow i for which cwnd_i / rtt_i^2 is maximum. This corresponds to the numerator of the equation provided in [2].

scale_num and scale_den have to be selected in such a way that scale_num > scale_den^2. A good choice is to use scale_num=2^32 (using 64-bit arithmetic) and scale_den=2^10. In that case the final alpha value is scaled by 2^12, which gives a reasonable precision. Due to the scaling, it is also necessary to scale later in the formula that decides whether an increase of the congestion window is necessary or not: snd_cwnd_cnt >= max((tot_cwnd<<12) / alpha, cwnd).
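A minimal user-space sketch of this fixed-point computation is given below. It assumes hypothetical structure and function names (sf_cc, compute_alpha(), should_increase_cwnd()) and moderate window/RTT values so that the intermediate products fit in 64 bits; it is an illustration of the scaled formula above, not the actual Linux MPTCP code.

   /* Fixed-point alpha with scale_num = 2^32 and scale_den = 2^10,
    * so the returned alpha is scaled by 2^12 (hypothetical names). */

   #include <stdint.h>

   struct sf_cc {
           uint32_t cwnd;      /* congestion window, in segments     */
           uint32_t rtt;       /* smoothed RTT, in a common unit     */
   };

   #define ALPHA_SCALE_NUM (1ULL << 32)
   #define ALPHA_SCALE_DEN (1ULL << 10)

   uint64_t compute_alpha(const struct sf_cc *sf, int n)
   {
           uint64_t tot_cwnd = 0, denom = 0;
           int i, best = 0;

           /* Pick the subflow maximizing cwnd_i / rtt_i^2, using a
            * cross-multiplied comparison to avoid floating point. */
           for (i = 1; i < n; i++)
                   if ((uint64_t)sf[i].cwnd * sf[best].rtt * sf[best].rtt >
                       (uint64_t)sf[best].cwnd * sf[i].rtt * sf[i].rtt)
                           best = i;

           for (i = 0; i < n; i++) {
                   tot_cwnd += sf[i].cwnd;
                   /* rtt_max * cwnd_i * scale_den / rtt_i */
                   denom += (uint64_t)sf[best].rtt * sf[i].cwnd *
                            ALPHA_SCALE_DEN / sf[i].rtt;
           }
           if (denom == 0)
                   return 0;

           return tot_cwnd * sf[best].cwnd * ALPHA_SCALE_NUM /
                  (denom * denom);
   }

   /* Congestion avoidance check, with the matching 2^12 scaling:
    * snd_cwnd_cnt >= max((tot_cwnd << 12) / alpha, cwnd). */
   int should_increase_cwnd(uint64_t snd_cwnd_cnt, uint64_t tot_cwnd,
                            uint64_t alpha, uint32_t cwnd)
   {
           uint64_t threshold;

           if (alpha == 0)
                   return 0;
           threshold = (tot_cwnd << 12) / alpha;
           if (threshold < cwnd)
                   threshold = cwnd;
           return snd_cwnd_cnt >= threshold;
   }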
3.5.3.2. The Packet Scheduler

Whenever the Congestion Controller (described above) allows new data for at least one subflow, the Packet Scheduler is run. When only one subflow is available, the Packet Scheduler just decides which packet to pick from the A section of the shared send buffer (see Figure 2). Currently Linux MPTCP picks the bottom-most segment. If more than one subflow is available, there are three decisions to take:

o Which of the subflows to feed with fresh data: As the only Packet Scheduler currently supported in Linux MPTCP aims at filling all pipes, it always feeds data to all subflows as long as there is data to send.

o In what order to feed the selected subflows: When several subflows become available simultaneously, they are fed in order of time-distance to the client. We define the time-distance as the time needed for the packet to reach the peer if given to a particular subflow. This time depends on the RTT, bandwidth and queue size (in bytes), as follows: time_distance_i = queue_size_i/bw_i+RTT_i. Given that, with the architecture described in Section 3.5.2, the subflow-specific queue size cannot exceed a congestion window, the time_distance becomes time_distance_i~=RTT_i. This scheduling policy favors fast subflows for application-limited communications (where all subflows need not be used). However, for network-limited communications, this scheduling policy has little effect because all subflows will be used at some point, even the slow ones, to try to minimize the connection-level completion time.

o How much data to allocate to a single subflow: This question concerns the granularity of the allocation. Using big allocation units allows for better support of TCP Segmentation Offload (TSO). TSO allows the system to aggregate several times the MSS into one single segment, sparing memory and CPU cycles, by leaving the fragmentation task to the NIC. However, this is only possible if the large single segment is made of contiguous data, at the subflow level and the connection level (see also the important note below).

IMPORTANT: When scheduling data to subflows, an implementation must be careful that if two segments are contiguous at the subflow level, but non-contiguous at the connection level, they cannot be aggregated into one. As Linux (and probably other systems) merges segments when it is under memory pressure, it could easily decide to merge non-contiguous MPTCP segments, simply because they look contiguous from the subflow viewpoint. This must be avoided, because the DATA_SEQ mapping option would lose its meaning in such a case, leading to all possible kinds of misbehaviors.

3.6. At connection/subflow termination

In Linux MPTCP, subflows are terminated only when the whole connection terminates, because the heuristic for terminating subflows (without closing the connection) is not yet mature, as explained in Section 3.3.

At connection termination, an implementation must ensure that all subflows plus the meta-socket are cleanly removed. The obvious choice of propagating the close() system call on all subflows does not work. The problem is that a close() on a subflow appends a FIN at the end of the send queue. If we transpose this to the meta-socket, we would append a DATA_FIN on the shared send queue (see Section 3.5.2). That operation results in the shared send queue not accepting any more data from the application, which is correct. It also results in the subflow-specific queues not accepting any more data from the shared send queue. The shared send queue may however still be full of segments, which will never be sent because all gates are closed.

IMPORTANT: Upon a close() system call, an implementation must refrain from sending a FIN on all subflows, unless the implementation uses an architecture with no connection-level send queue (like the one described in Appendix A.5). Even in that case, it makes sense to keep all subflows open until the last byte is sent, to allow retransmission on any path, should any one of them fail.

Currently, upon a close() system call, Linux MPTCP appends a DATA_FIN to the connection-level send queue. Only when that DATA_FIN reaches the bottom of the send queue is the regular FIN sent on all subflows.

DISCUSSION: In the Linux MPTCP behavior described above, a connection could still stall near its end if one path fails while transmitting its last congestion window of data (because the maximum size of the subflow-specific send queue is cwnd). This can be avoided by waiting a bit longer before triggering the subflow-level FIN: instead of sending the FIN together with the DATA_FIN, send the DATA_FIN alone and wait for the corresponding DATA_ACK to trigger a FIN on all subflows. This however increases the duration of the overall connection termination by one RTT.
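The close() handling described above can be summarized with the following C sketch. The names (meta_sock, mptcp_close(), subflow_send_fin(), ...) are hypothetical helpers introduced for illustration; this is not the actual Linux MPTCP code.

   /* Sketch of connection termination (hypothetical names). */

   struct meta_sock;
   struct subflow;

   extern void queue_data_fin(struct meta_sock *msk); /* queue tail   */
   extern int  send_queue_only_data_fin(const struct meta_sock *msk);
   extern int  subflow_count(const struct meta_sock *msk);
   extern struct subflow *subflow_at(struct meta_sock *msk, int i);
   extern void subflow_send_fin(struct subflow *sf);

   /* Application close(): do NOT send a FIN on the subflows yet,
    * otherwise segments still sitting in the shared send queue
    * could never be transmitted. */
   void mptcp_close(struct meta_sock *msk)
   {
           queue_data_fin(msk);          /* DATA_FIN at the queue tail */
   }

   /* Called whenever the shared send queue drains: once the DATA_FIN
    * has reached the bottom of the queue, the regular FIN can be sent
    * on every subflow. */
   void mptcp_send_queue_drained(struct meta_sock *msk)
   {
           int i;

           if (!send_queue_only_data_fin(msk))
                   return;

           for (i = 0; i < subflow_count(msk); i++)
                   subflow_send_fin(subflow_at(msk, i));
   }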
4. Configuring the OS for MPTCP

Previous sections concentrated on implementation. In this section, we gather guidelines that help get the full potential from MPTCP through appropriate system configuration. Currently those guidelines apply especially to Linux, but the principles can be applied to other systems.

4.1. Source address based routing

As already pointed out by [12], the default behavior of most operating systems is not appropriate for the use of multiple interfaces. Most operating systems are typically configured to use at most one IP address at a time. It is more and more common to maintain several links in the up state (e.g. using the wired interface as the main link, but maintaining a ready-to-use wireless link in the background, to facilitate fallback when the wired link fails). But MPTCP is not about that. MPTCP is about _simultaneously_ using several interfaces (when available). It is expected that one of the most common MPTCP configurations will be through two or more NICs, each being assigned a different address. Another possible configuration would be to assign several IP addresses to the same interface, in which case the path diverges later in the network, based on the particular address that is used in the packet.

Usually an operating system has a single default route, with a single source IP address. If the host has several IP addresses and we want to do MultiPath TCP, it is necessary to configure source address based routing. This means that, based on the source address selected by the MultiPath TCP module in the operating system, the routing decision is based on a different routing table. Each of these routing tables defines a default route to the Internet. This is different from defining several default routes in the same routing table (which is also supported in Linux), because in that case only the first one is used. Any additional default route is considered as a fallback route, used only in case the main one fails.

It is easier to understand the necessary configuration by means of an example. Let a host have two interfaces, I1 and I2, both connected to the public Internet and assigned addresses A1 and A2, respectively. Such a host needs 3 routing tables. One of them is the classical default routing table, present in all systems. The default routing table is used to find a route based on the destination address only, when a segment is issued with the undetermined source address. The undetermined source address is typically used by applications that initiate a TCP connect() system call, specifying the destination address but letting the system choose the source address. In that case, after the default routing table has been consulted, an address is assigned to the socket by the system by applying [13]. The additional routing tables are used when the source address is specified. If the source address has no impact on the route that should be chosen, then the default routing table is sufficient. But this is a particular case (e.g., a host connected to one network only, but using two addresses to exploit ECMP paths later in the network). In most cases, a source address is attached to a specific interface, or at least a specific gateway. Both of those cases require defining a separate routing table, one per (gateway, outgoing interface) pair. To select the proper routing table based on the source address, an additional indirection level must be configured. It is called "policy routing" in Linux and is illustrated at the bottom of Figure 3.
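As an illustration only, routing tables and policy rules similar to those shown in Figure 3 below could be created on Linux with the iproute2 tool as sketched here; the interface names, addresses, gateways and subnets are placeholders matching Figure 3 and must be adapted to the actual setup.

   # Per-source routing tables (placeholder values).
   ip route add default via <Gateway-IP1> dev I1 table 1
   ip route add <Gateway1-Subnet> dev I1 scope link src A1 table 1

   ip route add default via <Gateway-IP2> dev I2 table 2
   ip route add <Gateway2-Subnet> dev I2 scope link src A2 table 2

   # Policy rules selecting a table based on the source address.
   ip rule add from A1 table 1
   ip rule add from A2 table 2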
4.2. Buffer configuration

Section 5.3 of [3] describes in detail the new, higher buffer requirements of MPTCP. Section 3.4.1 and Section 3.5.1 describe how the MPTCP buffers can be tuned dynamically. However, it is important to note that even the best tuning is capped by a maximum configured at the system level. When using MultiPath TCP, the maximum receive and send buffers should be configured to a higher value than for regular TCP. There is no universal guideline on what value is best. Instead, the most appropriate action for an administrator is probably to roughly estimate the maximum bandwidth and delay that can be observed on a particular connectivity setup, and to apply the equation from Section 5.3 of [3] to find a reasonable trade-off. This exercise could lead an administrator to disable MPTCP on some interfaces, because doing so consumes less memory while still achieving reasonable performance.
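As a back-of-the-envelope illustration of that estimation, and assuming the rule of [3], Section 5.3, as we recall it (a connection-level receive buffer of twice the sum of the subflow bandwidths multiplied by the largest RTT), the small program below computes the bound for a host with a 100 Mbps / 20 ms link and a 20 Mbps / 150 ms link. The figures are arbitrary examples:

   /* Buffer estimate assuming the rule from [3], Section 5.3:
    * buffer >= 2 * (sum of subflow bandwidths) * max RTT.
    * The interface figures below are arbitrary examples. */
   #include <stdio.h>

   int main(void)
   {
       double bw_bytes_per_s[] = { 12.5e6, 2.5e6 }; /* 100 Mbps, 20 Mbps */
       double rtt_s[]          = { 0.020, 0.150 };  /* 20 ms, 150 ms     */
       int n = 2;

       double bw_sum = 0, rtt_max = 0;
       for (int i = 0; i < n; i++) {
           bw_sum += bw_bytes_per_s[i];
           if (rtt_s[i] > rtt_max)
               rtt_max = rtt_s[i];
       }

       /* Prints 4500000 bytes (4.5 MB) for this example. */
       printf("suggested maximum buffer: %.0f bytes\n",
              2 * bw_sum * rtt_max);
       return 0;
   }

If such a value is considered too large, disabling MPTCP on the high-delay interface, as suggested above, brings the requirement back to that of regular TCP on the remaining path.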
5. Future work

A lot of work has yet to be done, and there is much room for improvement. In this section we assemble a list of future improvements that would complete these guidelines.

o  Today's host processors have more and more CPU cores. Given that Multipath TCP tries to exploit another form of parallelism, there is a challenge in finding how the two can work together optimally. An important question is how to work with hardware that behaves intelligently with TCP (e.g. flow-to-core affinity). This problem is discussed in more detail in [14].

o  An evaluation of Linux MPTCP exists [4], but many optimizations are still possible and should be evaluated. Examples are VJ prequeues (Section 3.1), an MPTCP fast path (that is, a translation of the existing TCP fast path to MPTCP) or DMA support. VJ prequeues, described in Section 3.1, are intended to defer segment processing until the application is awoken, when possible.

o  Currently, support for TCP Segmentation Offload remains a challenge because it interacts with the Maximum Segment Size. Linux MPTCP currently works with a single MSS across all subflows (see Section 3.5.2). Adding TSO support to MPTCP is certainly possible, but requires further work (Section 3.5.2). Also, support for Large Receive Offload has not been investigated yet.

o  There are ongoing discussions on heuristics that would be used to decide when to start new subflows. Those discussions are summarized in Appendix B.1, but none of the proposed heuristics have been evaluated yet.

6. Acknowledgements

Sebastien Barre, Christoph Paasch and Olivier Bonaventure are supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only. The European Commission is not liable for any use that may be made of the information in this document.

The authors gratefully acknowledge Costin Raiciu, who wrote a userland implementation of MPTCP and provided insight on implementation matters during several fruitful debates. Discussions with Janardhan Iyengar also helped in understanding the specificities of MPTCP compared to SCTP-CMT.

The authors would also like to thank the following people for useful discussions on the mailing list and/or reviews: Alan Ford, Bob Briscoe, Mark Handley, Michael Scharf.

7. References

[1]  Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for Multipath Operation with Multiple Addresses", draft-ietf-mptcp-multiaddressed-02 (work in progress), October 2010.

[2]  Raiciu, C., Handley, M., and D. Wischik, "Coupled Congestion Control for Multipath Transport Protocols", draft-ietf-mptcp-congestion-01 (work in progress), January 2011.

[3]  Ford, A., Raiciu, C., Handley, M., Barre, S., and J. Iyengar, "Architectural Guidelines for Multipath TCP Development", draft-ietf-mptcp-architecture-05 (work in progress), January 2011.

[4]  Barre, S., Paasch, C., and O. Bonaventure, "Multipath TCP: From Theory to Practice", IFIP Networking, Valencia, May 2011.

[5]  Scharf, M. and A. Ford, "MPTCP Application Interface Considerations", draft-ietf-mptcp-api-00 (work in progress), November 2010.

[6]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[7]  Jacobson, V., "Re: query about tcp header on tcp-ip", September 1993.

[8]  Fisk, M. and W. Feng, "Dynamic right-sizing in TCP", Los Alamos Computer Science Institute Symposium, 2001.

[9]  Hsieh, H. and R. Sivakumar, "pTCP: An End-to-End Transport Layer Protocol for Striped Connections", ICNP, 2002.

[10] Becke, M., Dreibholz, T., Iyengar, J., Natarajan, P., and M. Tuexen, "Load Sharing for the Stream Control Transmission Protocol (SCTP)", draft-tuexen-tsvwg-sctp-multipath-01 (work in progress), December 2010.

[11] Allman, M., "TCP Congestion Control with Appropriate Byte Counting (ABC)", RFC 3465, February 2003.
Seite, "Multiple Interfaces and 1223 Provisioning Domains Problem Statement", 1224 draft-ietf-mif-problem-statement-09 (work in progress), 1225 October 2010. 1227 [13] Draves, R., "Default Address Selection for Internet Protocol 1228 version 6 (IPv6)", RFC 3484, February 2003. 1230 [14] Watson, R., "Protocol stacks and multicore scalability", 1231 Presentation at Maastricht MPTCP workshop , Jul 2010, . 1235 Appendix A. Design alternatives 1237 In this appendix, we describe alternate designs that have been 1238 considered previously, and abandoned for various reasons (detailed as 1239 well). We keep them here for the archive and possible discussion. 1240 We also describe some potential designs that have not been explored 1241 yet but could reveal to be better in the future, in which case that 1242 would be moved to the draft body. 1244 A.1. Another way to consider Path Management 1246 In a previous implementation of MPTCP, it was proposed that the 1247 multipath transport had an even more abstract view of the paths in 1248 use than what is described in Section 2. In that design, the sub- 1249 sockets all shared the same tuple (saddr,sport,daddr,dport), and was 1250 disambiguated only by the path index. The advantage is that the 1251 Multipath Transport needs only to worry about how to efficiently 1252 spread data among multiple paths, without any knowledge about the 1253 addresses or ports used by each particular subflow. 1255 That design was particularly well suited for using Shim6 as a Path 1256 Manager, because Shim6 is already designed to work in the network 1257 layer and rewrite addresses. The first version of the Linux MPTCP 1258 implementation was using Shim6 as path manager. It looks also well 1259 suited to path managers that don't use addresses (e.g. path managers 1260 that write a label in the packet header, later interpreted by the 1261 network). Finally, it removes the need for the token in the 1262 multipath transport (connection identification is done naturally with 1263 the tuple, shared by all subflows). The token hence becomes specific 1264 to the built-in path manager, and can be just ignored with other path 1265 managers (the context tag plays a similar role in shim6, nothing is 1266 needed if the path manager just sets labels to the packets). 1268 However, this cleaner separation between Multipath Transport and Path 1269 Management suffers from three drawbacks: 1271 o It requires a heavy modification to the existing stacks, because 1272 it modifies the current way to identify sockets in the stack. 1273 They are currently unambiguously identified with the usual 1274 5-tuple. This architecture would require extending the 5-tuple 1275 with the path index, given all subflows would share the same 1276 5-tuple. 1278 o Although correctly implemented stacks could handle that new 1279 endpoint identifier (5-tuple+path index), having several flows 1280 with same 5-tuple could confuse middleboxes. 1282 o When the path manager involves using several addresses, forcing 1283 the same 5-tuple for all subflows at the Multipath Transport level 1284 implies that the Path Manager needs to rewrite the address fields 1285 of each packet. That rewriting operation is simply avoided if the 1286 sockets are bound to the addresses actually used to send the 1287 packets. Hence, this alternate design would involve avoidable 1288 costs for path managers that belong to the "multi-address" 1289 category. 1291 A.2. 
A.2. Implementing alternate Path Managers

In Section 2, the Path Manager is defined as an entity that maintains a (path_index <-> endpoint_id) mapping. This is enough in the case of the built-in path manager, because the segments are associated with a path within the socket itself, thanks to its endpoint_id. However, it is expected that most other path managers will need to apply a particular action, on a per-packet basis, to associate packets with a path. Example actions could be writing a number into a field of the segment, or choosing a gateway other than the default one in the routing table. In an earlier version of Linux MPTCP, based on a Shim6 Path Manager, the action consisted of rewriting the addresses of the packets.

To reflect the need for a per-packet action, the PM mapping table (an example of which is given in Table 1) only needs to be extended with an action field. As an illustration, we show hereafter a mapping table for a Path Manager that writes the path index into a field of the packets.

   +---------+------------+---------------+--------------------------+
   |  token  | path index |  Endpoint id  | Action (Write x in DSCP) |
   +---------+------------+---------------+--------------------------+
   | token_1 |     1      |               |            1             |
   |         |            |               |                          |
   | token_1 |     2      |               |            2             |
   |         |            |               |                          |
   | token_1 |     3      |               |            3             |
   |         |            |               |                          |
   | token_1 |     4      |               |            4             |
   |         |            |               |                          |
   | token_2 |     1      |               |            1             |
   |         |            |               |                          |
   | token_2 |     2      |               |            2             |
   +---------+------------+---------------+--------------------------+

          Table 4: Example mapping table for a label-based PM
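In code, the extension amounts to one additional field per mapping entry. The following sketch mirrors Table 4 for a DSCP-writing path manager; all names are illustrative and do not come from Linux MPTCP:

   /* Illustrative PM mapping table with a per-packet action field,
    * mirroring Table 4.  Names are examples, not Linux MPTCP symbols. */
   #include <stdint.h>
   #include <stdio.h>

   struct pm_mapping {
       uint32_t token;      /* identifies the MPTCP connection          */
       uint8_t  path_index; /* path chosen by the Multipath Transport   */
       /* endpoint id omitted, as in Table 4                            */
       uint8_t  dscp;       /* per-packet action: DSCP value to write   */
   };

   static const struct pm_mapping pm_table[] = {
       { .token = 1, .path_index = 1, .dscp = 1 },
       { .token = 1, .path_index = 2, .dscp = 2 },
       { .token = 1, .path_index = 3, .dscp = 3 },
       { .token = 1, .path_index = 4, .dscp = 4 },
       { .token = 2, .path_index = 1, .dscp = 1 },
       { .token = 2, .path_index = 2, .dscp = 2 },
   };

   /* Per-packet action: look up the entry and return the DSCP to write.
    * Returns -1 if the (token, path index) pair is unknown. */
   static int pm_action(uint32_t token, uint8_t path_index)
   {
       for (size_t i = 0; i < sizeof(pm_table) / sizeof(pm_table[0]); i++)
           if (pm_table[i].token == token &&
               pm_table[i].path_index == path_index)
               return pm_table[i].dscp;
       return -1;
   }

   int main(void)
   {
       printf("DSCP for (token_2, path 2): %d\n", pm_action(2, 2));
       return 0;
   }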
A.3. When to instantiate a new meta-socket?

The meta-socket is responsible only for MPTCP-related operations. This includes connection-level reordering for incoming data, scheduling for outgoing data, and subflow management. A natural choice would then be to instantiate a new meta-socket only when the peer has told us that it supports MPTCP. On the server side this is naturally the case, since the master subsocket is created upon reception of a SYN+MP_CAPABLE. The client, however, instantiates its master subsocket when the application issues a socket() system call, and needs to wait until the SYN+ACK to know whether its peer supports MPTCP. Yet, it must already provide its token in the SYN.

Linux MPTCP currently instantiates its client-side meta-socket when the master subsocket is created (just like the server side). The drawback is that if, after socket(), the application subsequently issues a listen(), we have built a useless meta-socket. The same happens if the peer's SYN+ACK does not carry the MP_CAPABLE option. To avoid that, one may want to instantiate the meta-socket upon reception of an MP_CAPABLE option. But this implies that the token (sent in the SYN) must be stored in some temporary place, or in the master subsocket, until the meta-socket is built.

A.4. Forcing more processing in user context

The implementation architecture proposed in this draft uses the following queue configuration:

o  Subflow level: out-of-order queue. Used for subflow-level reordering.

o  Connection level: out-of-order queue. Used for connection-level reordering.

o  Connection level: receive queue. Used for storing the ordered data until the application asks for it through a recvmsg() system call or similar.

In a previous version of Linux MPTCP, another queue configuration was examined:

o  Subflow level: out-of-order queue. Used for subflow-level reordering.

o  Subflow level: receive queue. Used for storing the data until the application asks for it through a recvmsg() system call or similar.

o  Connection level: out-of-order queue. Used for connection-level reordering.

In this alternate architecture, the connection-level data is lazily reordered as the application asks for it. The main goal was to ensure that as many CPU cycles as possible were spent in user context (see Section 3.1). VJ prequeues allow forcing user-context processing when the application is waiting on a recv() system call; otherwise the subflow-level reordering must be done in interrupt context. This remains true with MPTCP, because the subflow-level implementation is left unmodified where possible. With MPTCP, the question is: "Where do we perform connection-level reordering?". The answer of this alternate architecture is: "Always in user context", and this was its strength. Technically, the task of each subflow was to reorder its own segments and put them in its own receive queue, until the application asks for data. When the application wants to read more data, MPTCP searches all subflow-level receive queues for the next bytes to receive, and reorders them as appropriate using its own reordering queue. As soon as the requested number of bytes has been handed to the application buffer, the MPTCP reordering task finishes. A sketch of this lazy reordering loop is given after the drawbacks below.

Unfortunately, there are two major drawbacks to doing it that way:

o  The socket API supports the SO_RCVLOWAT option, which allows an application to ask not to be woken up until n bytes have been received. Counting those bytes requires reordering at least n bytes at the connection level in interrupt context.

o  The DATA_ACK [1] should report the latest byte received in order at the connection level. In this architecture, the best we can do is report the latest byte that has been copied to the application buffers, which would slightly change the DATA_ACK semantics described in Section 3.3.2 of [1]. This change could confuse peers that try to derive information from the received DATA_ACK.
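The sketch below illustrates the lazy, pull-based reordering described above, in plain user-space C with arrays standing in for the kernel queues. It is a simplification (the connection-level out-of-order queue is reduced to a scan of the subflow queue heads), and all names are illustrative:

   /* Lazy connection-level reordering, run only at recvmsg() time.
    * Purely illustrative: plain arrays instead of socket buffers. */
   #include <stdio.h>

   struct rcv_segment { unsigned data_seq; char byte; };

   struct subflow_queue {
       struct rcv_segment seg[16];   /* already in subflow order */
       int head, tail;
   };

   /* Pull up to 'len' in-order connection-level bytes out of the subflow
    * receive queues, starting at *next_seq.  Returns the number copied. */
   static int mptcp_recv(struct subflow_queue *sf, int nr_sf,
                         unsigned *next_seq, char *buf, int len)
   {
       int copied = 0;

       while (copied < len) {
           int progress = 0;
           for (int i = 0; i < nr_sf && copied < len; i++) {
               struct subflow_queue *q = &sf[i];
               /* Only the head of a subflow queue can be the next byte. */
               if (q->head < q->tail &&
                   q->seg[q->head].data_seq == *next_seq) {
                   buf[copied++] = q->seg[q->head++].byte;
                   (*next_seq)++;
                   progress = 1;
               }
           }
           if (!progress)    /* next byte not yet received: stop here */
               break;
       }
       return copied;
   }

   int main(void)
   {
       struct subflow_queue sf[2] = {
           { .seg = { {0, 'M'}, {2, 'T'}, {3, 'C'} }, .tail = 3 },
           { .seg = { {1, 'P'}, {4, 'P'} },           .tail = 2 },
       };
       unsigned next_seq = 0;
       char buf[8] = { 0 };
       int n = mptcp_recv(sf, 2, &next_seq, buf, 5);
       printf("%.*s\n", n, buf);    /* prints "MPTCP" */
       return 0;
   }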
A.5. Buffering data on a per-subflow basis

In previous versions of Linux MPTCP, the send queues were configured as shown in Figure 4.

                                Application
                                     |
                                     v
                              Packet Scheduler
                                  /     \
                                 /       \
                                |         |
                                v         v
                              | * |     |   |
   Next segment to send (A) ->| * |     | * |
                              |---|     |---|  <- Separate send queues
   Sent, but not acked (B) -> |_*_|     |_*_|
                                |         |
                                v         v
                               NIC       NIC

                    Figure 4: Send queue configuration

In contrast to the architecture presented in Section 3.5.2, there is no shared send queue. The Packet Scheduler is run each time data is produced by the application. Compared to the architecture of Section 3.5.2, the advantages and drawbacks are basically reversed. The advantages are:

o  This architecture supports subflow-specific Maximum Segment Sizes, because the subflow is selected before the segment is built.

o  The segments are stored in their final form in the subflow-specific send queues, and there is no need to run the Packet Scheduler at transmission time. The result is more fairness with other applications (because the Packet Scheduler runs in user context only) and faster data transmission when acknowledgements open the congestion window (because segments are buffered in their final form and no call to the Packet Scheduler is needed).

The drawback, which motivated the architecture change in Linux MPTCP, is the complexity of the data allocation (hence of the Packet Scheduler), and the computing cost involved. Given that there is no shared send buffer, the send buffer auto-tuning must be divided into its subflow contributions. This buffer size can easily be derived from Section 3.5.1. However, when scheduling a full send buffer of data in advance, we may be allocating a segment hundreds of milliseconds before it actually goes to the wire. The task of the Packet Scheduler is then complicated, because it must _predict_ the path properties. If the prediction is incorrect, two subflows may try to put on the wire segments that are very distant in terms of DATA_SEQ numbers. This can eventually stall some subflows, because the DATA_SEQ gap between two subflows exceeds the receive window announced by the receiver. The Packet Scheduler can compute a correct allocation of segments relatively easily if the path properties do not vary (simply because it is easy to predict a constant value), but the implementation was very sensitive to variations in delay or bandwidth. The previous implementation of Linux MPTCP solved this allocation problem by verifying, upon each failed transmission attempt, whether it was blocked by the receive window due to a gap in DATA_SEQ with other subflows. If that was the case, a full reallocation of segments was conducted. However, the cost of such a reallocation is very high, because it involves reconsidering the allocation of every single segment, and doing so for all the subflows. Worse, this costly reallocation sometimes needed to happen in interrupt context, which removed one of the advantages of this architecture.

Yet, under the assumption that the subflow-specific queue size is small, the above drawback almost disappears. For this reason the abandoned design described here could be used to feed a future hybrid architecture, as explained in Section 3.5.2. For the sake of comparison with Table 3, we provide hereafter the (event, action) table implemented by this architecture.

   +-----------------+-------------------------------------------------+
   | event           | action                                          |
   +-----------------+-------------------------------------------------+
   | Segment         | Remove references to it from the subflow-level  |
   | acknowledged at | queue.                                          |
   | subflow level   |                                                 |
   |                 |                                                 |
   | Segment         | No queue-related action.                        |
   | acknowledged at |                                                 |
   | connection      |                                                 |
   | level           |                                                 |
   |                 |                                                 |
   | Timeout         | Push the segment to the best subflow (according |
   | (subflow-level) | to the Packet Scheduler).  In contrast with the |
   |                 | solution of Section 3.5.2, there is no need for |
   |                 | a connection-level retransmit queue, because    |
   |                 | there is no requirement for a subflow to be     |
   |                 | immediately available to accept new data.       |
   |                 |                                                 |
   | Ready to put    | Just send the next segment from the A portion   |
   | new data on the | of the subflow-specific send queue, if any.     |
   | wire (normally  | Note that the "IMPORTANT" note from             |
   | triggered by an | Section 3.5.2 still applies with this           |
   | incoming ack)   | architecture.                                   |
   +-----------------+-------------------------------------------------+

     Table 5: (event, action) pairs implemented in a queue management
                      based on separate send queues
Appendix B. Ongoing discussions on implementation improvements

This appendix collects information on features that have not been implemented anywhere yet, but that can still be useful as hints for implementers to experiment with. Feedback from implementers will help converge on those topics and propose solid guidelines for future versions of this memo.

B.1. Heuristics for subflow management

Some heuristic should determine when it would be beneficial to add a new subflow. Linux MPTCP has no such heuristic at the moment, but the topic has been discussed on the MPTCP mailing list, so this section summarizes the input from many individuals. MPTCP is not useful for very short flows, so three questions arise:

o  How long is a "too short" flow?

o  How can we predict that a flow will be short?

o  When should we decide to add or remove subflows?

To answer the third question, it has been proposed to use hints from the application. On the other hand, experience shows that socket options are quite often used poorly or not at all, which motivates the parallel use of a good default heuristic. This default heuristic may be influenced in particular by the set of options that are enabled for MPTCP (e.g. an administrator can decide that some security mechanisms for subflow initiation are not needed in his environment and disable them, which changes the cost of establishing new subflows). The following elements have been proposed to feed the heuristic, none of them tested yet:

o  Check the size of the write operations from the application. Initiate a new subflow if the write size exceeds some threshold. This information can be taken only as a hint, because applications could send big chunks of data split into many small writes. A particular case of checking the size of write operations is when the application uses the sendfile() system call. In that situation MPTCP can know very precisely how many bytes will be transferred.

o  Check whether the flow is network limited or application limited. Initiate a new subflow only if it is network limited.

o  It may be useful to establish new subflows even for application-limited communications, to provide failure survivability. A way to do that would be to initiate a new subflow (if not done before by another trigger) after some time has elapsed, regardless of whether the communication is network or application limited.

o  Wait until slow start is done before establishing a new subflow. Measurements with Linux MPTCP suggest that slow start could be a reasonable tool for determining when it is worth starting a new subflow (without increasing the overall completion time). More analysis is needed in that area, however. Also, this should be taken as a hint only if slow start is actually progressing (otherwise a stalled subflow could prevent the establishment of another one, precisely when a new one would be useful).
o  Use information from the application-layer protocol. Some protocols (e.g. HTTP) carry flow length information in their headers, which can be used to decide how many subflows are useful.

o  Allow the administrator to configure subflow policies on a per-port basis. The host stack could also learn for which ports MPTCP turns out to be useful.

o  Check the underlying medium of each potential subflow. For example, if the initial subflow is initiated over 3G while WiFi is available, it probably makes sense to immediately negotiate an additional subflow over WiFi.

It is not only useful to determine when to start new subflows; an implementation should also sometimes decide to abandon some of them. An MPTCP implementation should be able to determine when removing a subflow would increase the aggregate bandwidth. This can happen, for example, when the subflow has a significantly higher delay than the other subflows and the maximum buffer size allowed by the administrator has been reached (Linux MPTCP has no such heuristic yet).

Authors' Addresses

Sebastien Barre
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

Email: sebastien.barre@uclouvain.be
URI:   http://inl.info.ucl.ac.be/sbarre

Christoph Paasch
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

Email: christoph.paasch@uclouvain.be
URI:   http://inl.info.ucl.ac.be/cpaasch

Olivier Bonaventure
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

URI:   http://inl.info.ucl.ac.be/obo