idnits 2.17.1 

draft-van-beijnum-1e-mp-tcp-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (May 6, 2009) is 5469 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 1323 (Obsoleted by RFC 7323)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Downref: Normative reference to an Informational RFC: RFC 2992

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)

  == Outdated reference: A later version (-12) exists of
     draft-ietf-shim6-proto-11


     Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network working group                                     I. van Beijnum
3	Internet-Draft                                            IMDEA Networks
4	Expires: November 7, 2009                                    May 6, 2009

6	                        One-ended multipath TCP
7	                     draft-van-beijnum-1e-mp-tcp-00

9	Status of this Memo

11	   This Internet-Draft is submitted to IETF in full conformance with the
12	   provisions of BCP 78 and BCP 79.

14	   Internet-Drafts are working documents of the Internet Engineering
15	   Task Force (IETF), its areas, and its working groups.  Note that
16	   other groups may also distribute working documents as Internet-
17	   Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time.  It is inappropriate to use Internet-Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt.

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html.

30	   This Internet-Draft will expire on November 7, 2009.

32	Copyright Notice

34	   Copyright (c) 2009 IETF Trust and the persons identified as the
35	   document authors.  All rights reserved.

37	   This document is subject to BCP 78 and the IETF Trust's Legal
38	   Provisions Relating to IETF Documents in effect on the date of
39	   publication of this document (http://trustee.ietf.org/license-info).
40	   Please review these documents carefully, as they describe your rights
41	   and restrictions with respect to this document.

43	Abstract

45	   Normal TCP/IP operation is for the routing system to select a best
46	   path that remains stable for some time, and for TCP to adjust to the
47	   properties of this path to optimize throughput.  A multipath TCP
48	   would be able to either use capacity on multiple paths, or
49	   dynamically find the best performing path, and therefore reach higher
50	   throughput.  By adapting to the properties of several paths through
51	   the usual congestion control algorithms, a multipath TCP shifts its
52	   traffic to less congested paths, leaving more capacity available for
53	   traffic that can't move to another path on more congested paths.  And
54	   when a path fails, this can be detected and worked around by TCP much
55	   more quickly than by waiting for the routing system to repair the
56	   failure.

58	   This memo specifies a multipath TCP that is implemented on the
59	   sending host only, without requiring modifications on the receiving
60	   host.

62	Table of Contents

64	   1.  Introduction
65	   2.  Notational Conventions
66	   3.  Congestion control
67	     3.1.  RTT measurements
68	     3.2.  Fast retransmit
69	     3.3.  Slow retransmit
70	     3.4.  SACK
71	     3.5.  Fairness and TCP friendliness
72	   4.  Path selection
73	     4.1.  The multipath IP layer
74	     4.2.  The path indication option
75	     4.3.  Timestamp integration option
76	     4.4.  Path for retransmissions
77	     4.5.  ECN
78	     4.6.  Path MTU discovery
79	   5.  Flow control and buffer sizes
80	   6.  Handling of RSTs
81	   7.  Middlebox considerations
82	   8.  Security considerations
83	   9.  IANA considerations
84	   10. Acknowledgements
85	   11. References
86	     11.1. Normative References
87	     11.2. Informational References
88	   Appendix A.  Document and discussion information
89	   Appendix B.  An implementation strategy
90	   Author's Address

92	1.  Introduction

94	   In order to achieve redundancy to protect against failures, network
95	   operators generally install more links than the minimum necessary to
96	   achieve reachability.  So there are often multiple paths between any
97	   two given hosts, even when paths not allowed by policy are removed.
98	   However, routing protocols usually select a single "best" path.  When
99	   multiple paths are used at the same time by the routing system, those
100	   tend to be parallel links between two routers or paths that are
101	   otherwise very similar.  As such, a lot of potentially usable network
102	   capacity is left unused.  A multipath transport protocol would be
103	   able to use more of that capacity by sending its data along multiple
104	   paths at the same time, or by switching to a path with more available
105	   capacity.

107	   As TCP [RFC0793] is used by the vast majority of all networked
108	   applications, and TCP is responsible for the vast majority of all
109	   data transmitted over the internet, the logical choice would be to
110	   make TCP capable of using multiple paths.  SCTP already has the
111	   ability to use multiple paths through the use of multiple addresses.
112	   However, using SCTP in this way requires significant application
113	   changes and deployment would be challenging because there is no
114	   obvious way for an application to know whether a service is available
115	   over SCTP rather than, or in addition to, TCP.  In addition, SCTP as
116	   defined today [RFC2960] does not accommodate the concurrent use of
117	   multiple paths.  Additional paths are purely used for backup
118	   purposes.

120	   This memo describes a one-ended multipath TCP, which only changes the
121	   behavior of the TCP sender, achieving multipath advantages when
122	   communicating with unmodified TCP receivers.  This means it is not
123	   possible to perform path selection by using different destination
124	   addresses.  However, other mechanisms that are transparent to the
125	   receiver are possible.  A simple one would be for the sender to send
126	   some packets to one router, and other packets to another router.  If
127	   these routers then make different routing decisions for the
128	   destination address in the TCP packets, the packets flow over
129	   different paths part of the way.  Other mechanisms to achieve the
130	   same goal are also possible.  However, with a single destination
131	   address, paths can't be completely disjoint.

133	   Using multiple paths at the same time brings up a number of
134	   challenges and questions:

136	   o  Naive scheduling (such as round robin) of transmissions over the
137	      different paths reduces performance of each path to that of the
138	      slowest path.

140	   o  Using multiple paths causes reordering, which triggers the fast
141	      retransmit algorithm, causing unnecessary retransmissions and
142	      reduced performance.

144	   o  TCP requires in-order delivery of data to the application, so when
145	      losses occur on one path, buffer capacity may run out and data
146	      can't be transmitted on unaffected paths until the lost data has
147	      been retransmitted.

149	   o  Using multiple paths with an instance of regular congestion
150	      control on each path for a single TCP session makes that session
151	      use network capacity more aggressively than single path sessions,
152	      which can be considered "unfair" and increases packet loss.

154	   This memo seeks to address the first two issues by running separate
155	   instances of TCP's congestion control algorithms for the subflows
156	   that flow over different paths.  Buffer issues are addressed by
157	   retransmitting packets before buffer space runs out, even if normal
158	   retransmission timers haven't fired yet.  The fairness issue is a
159	   topic of ongoing research; this specification simply limits the
160	   number of subflows to limit unfairness and increased loss.

162	   The one-ended multipath TCP takes advantage of the fact that TCP
163	   [RFC0793] congestion control [RFC2581] and flow control are performed
164	   by the sender.  With regard to flow control and congestion control,
165	   the role of the receiver is limited to sending back acknowledgments
166	   and advertising how much data it is prepared to receive.  Hence, it is
167	   possible for the sender to utilize different paths and modify the
168	   fast retransmit logic as long as the receiver recognizes the packets
169	   as belonging to the same session.  So a multipath TCP sender can
170	   distribute packets over multiple paths as long as this doesn't
171	   require incompatible modifications to the IP or TCP header contents,
172	   most notably the addresses.  A single-ended multipath TCP session
173	   must still be between a single source address and a single
174	   destination address, regardless of the path taken by packets.

176	   The subset of the packets belonging to a TCP session flowing over a
177	   given path is designated a subflow.

179	   In order to benefit from using multiple paths, it's necessary for the
180	   multipath TCP sender to execute separate TCP congestion control
181	   instances for the packets belonging to different subflows.  In the
182	   case where all packets are subject to the same congestion window,
183	   performance over a fast and a slow path will often be poorer than
184	   over just the fast path, defeating the purpose of using multiple
185	   paths.  For instance, in the case of a 10 Mbps and a 100 Mbps path
186	   with otherwise identical properties, a simple round robin
187	   distribution of the packets and the use of a single congestion window
188	   will limit performance to that of the slowest path multiplied by the
189	   number of paths, 20 Mbps in this case.
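
The arithmetic in the example above can be checked with a short
sketch.  The function names are illustrative only, and the second
figure is an idealized upper bound that assumes the two paths share no
bottleneck and ignores protocol overhead; it does not come from this
memo.

```python
def round_robin_shared_window_mbps(path_rates_mbps):
    """Aggregate rate when packets alternate over all paths under one window.

    Every path carries the same number of packets, so each path is held
    to the pace of the slowest one.
    """
    return min(path_rates_mbps) * len(path_rates_mbps)


def per_subflow_window_mbps(path_rates_mbps):
    """Idealized upper bound when each subflow fills its own path."""
    return sum(path_rates_mbps)


rates = [10, 100]  # the two paths from the example, in Mbps
shared = round_robin_shared_window_mbps(rates)
separate = per_subflow_window_mbps(rates)
print(shared, separate)
```

With the 10 Mbps and 100 Mbps paths, the shared-window case yields the
20 Mbps stated above, which is well below what the fast path could
deliver on its own.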

191	2.  Notational Conventions

193	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
194	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
195	   document are to be interpreted as described in [RFC2119].

197	3.  Congestion control

199	   A multipath TCP maintains instances of all congestion control related
200	   variables for each subflow.  This includes, but is not limited to,
201	   the congestion window, the ssthresh, the retransmission timeout
202	   (RTO), the user timeout and RTT measurements.  However, because TCP
203	   requires in-order delivery of data, there must be a single send
204	   buffer and a single receive buffer, thus flow control must happen
205	   session-wide.

207	   Per-subflow congestion control is performed by recording the path
208	   used to transmit each packet.  Acknowledgments are then attributed to
209	   the subflow the acknowledged packets were sent over and the
210	   congestion window and other congestion control variables for the
211	   relevant subflow are updated accordingly.
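
A minimal sketch of this bookkeeping follows.  It assumes
segment-granularity accounting, and the class names, field names and
initial values are hypothetical; loss handling is omitted and the
window growth rules are a simplified slow start and congestion
avoidance.

```python
from dataclasses import dataclass, field


@dataclass
class Subflow:
    cwnd: int = 2          # congestion window in segments (assumed initial value)
    ssthresh: int = 64     # slow start threshold in segments (assumed)
    in_flight: int = 0     # segments sent on this path and not yet acknowledged
    acked_in_ca: int = 0   # ACK counter used for congestion avoidance growth


@dataclass
class Sender:
    subflows: dict = field(default_factory=dict)   # path id -> Subflow
    sent_on: dict = field(default_factory=dict)    # segment number -> path id

    def can_send(self, path):
        """True while the subflow's own congestion window has room."""
        sf = self.subflows[path]
        return sf.in_flight < sf.cwnd

    def on_send(self, seg, path):
        """Remember which path carried the segment."""
        self.sent_on[seg] = path
        self.subflows[path].in_flight += 1

    def on_ack(self, seg):
        """Credit an acknowledged segment to the subflow that carried it."""
        path = self.sent_on.pop(seg, None)
        if path is None:
            return None            # duplicate or unknown acknowledgment
        sf = self.subflows[path]
        sf.in_flight -= 1
        if sf.cwnd < sf.ssthresh:
            sf.cwnd += 1           # slow start: one segment per ACKed segment
        else:
            sf.acked_in_ca += 1    # congestion avoidance: one segment per window
            if sf.acked_in_ca >= sf.cwnd:
                sf.cwnd += 1
                sf.acked_in_ca = 0
        return path
```

Because the path is stored at transmission time, an acknowledgment
only changes the state of the subflow that actually carried the data,
whatever order the acknowledgments arrive in.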

213	3.1.  RTT measurements

215	   Because a multipath TCP sender knows which packet it sent over which
216	   path, it can perform per-path round trip time measurements.  This
217	   only works if return packets are consistently sent over the same path
218	   (or a set of paths with the same latency).  If the receiver is not
219	   multipath-aware, this condition will generally hold: acknowledgments
220	   will flow from the receiver to the sender over a single path unless
221	   there is a topology change in the routing system or packets that
222	   belong to a single session are distributed over different paths by
223	   routers, which is rare.  To multipath-capable routers on the return
224	   path (if any), the non-multipath-aware host appears to select the
225	   default path for all of its packets.

227	   However, if, like the sender, the receiver is multipath-aware, then
228	   the return path that the receiver chooses to send ACKs over will
229	   influence the RTTs seen by the original sender.  The situation where
230	   the sender is unaware of the fact that the receiver selects different
231	   return paths with different latencies is suboptimal, even compared to
232	   consistently measuring the RTT over the slowest path, as this leads
233	   to higher variability in the RTT measurements and therefore a higher
234	   RTO.

236	   Having the receiver send ACKs over the same path mitigates the
237	   problem somewhat; but presumably, if the receiver is also multipath
238	   capable and has data to send, it will want to send this data over
239	   more than one path.  So RTT measurements may inadvertently end up
240	   measuring different return paths in that case.  A better solution is
241	   for the sender to include an indication in packets that allows the
242	   receiver to determine through which path the sender sent the packet.
243	   This information, along with the path initially chosen for the
244	   outgoing packet that is acknowledged, allows TCP to attribute each
245	   RTT measurement to a specific path.

247	   Because congestion control happens per path, there must also be a
248	   separate retransmission timeout (RTO) value for each path.
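
One way to keep a separate RTO per path is sketched below.  The
smoothing constants, the one second floor and the initial RTO are
taken from the standard estimator in RFC 6298, which this memo does
not cite; the class names and the clock granularity are assumptions.
Samples from retransmitted segments are not filtered out here.

```python
class PathRtt:
    """Smoothed RTT and RTO for one path (constants as in RFC 6298)."""

    ALPHA = 1 / 8
    BETA = 1 / 4
    MIN_RTO = 1.0          # seconds
    GRANULARITY = 0.001    # clock granularity in seconds (assumed)

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0     # initial RTO before any sample is taken

    def sample(self, rtt):
        """Fold one RTT measurement (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.MIN_RTO,
                       self.srtt + max(self.GRANULARITY, 4 * self.rttvar))
        return self.rto


class RttTracker:
    """Attributes each RTT sample to the path the segment was sent on."""

    def __init__(self):
        self.paths = {}    # path id -> PathRtt
        self.sent = {}     # segment number -> (path id, send time)

    def on_send(self, seg, path, now):
        self.sent[seg] = (path, now)

    def on_ack(self, seg, now):
        path, sent_at = self.sent.pop(seg)
        est = self.paths.setdefault(path, PathRtt())
        est.sample(now - sent_at)
        return path, est.rto
```

A slow path therefore ends up with a long RTO of its own without
inflating the RTO used on a faster path.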

250	3.2.  Fast retransmit

252	   Different paths will almost certainly have different RTTs, and even
253	   if the average RTT is the same, normal burstiness and differences in
254	   packet sizes will make packets routinely arrive through the different
255	   paths in a different order than the order in which they were
256	   transmitted.  Without modifications to the algorithm, this would
257	   trigger the fast retransmit algorithm unnecessarily.  To avoid this,
258	   fast retransmit is executed whenever, for packets belonging to the
259	   same subflow, more than two segments of newer data are ACKed with SACK
260	   after an unACKed packet or sequence of packets.  This means fast
261	   retransmit happens per subflow, and reordering between subflows no
262	   longer triggers fast retransmit.
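
The per-subflow test can be sketched as follows.  The function name
and data layout are hypothetical; the input is assumed to hold only
segments above the cumulative ACK point, in transmission order.

```python
def fast_retransmit_candidates(sent, sacked, threshold=2):
    """Return segments to fast retransmit, judged per subflow.

    sent:    list of (segment number, path id) in transmission order
    sacked:  set of segment numbers the receiver reported via SACK
    A segment qualifies when more than `threshold` later segments sent
    on the same path have been SACKed while it is still unacknowledged.
    """
    later_sacked = {}      # path id -> SACKed segments seen so far
    candidates = []
    for seg, path in reversed(sent):       # walk from newest to oldest
        if seg in sacked:
            later_sacked[path] = later_sacked.get(path, 0) + 1
        elif later_sacked.get(path, 0) > threshold:
            candidates.append(seg)
    return sorted(candidates)


# Segments alternate between path "A" and path "B".
sent = [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "A"), (6, "B"), (7, "A")]

# Only path B is delivering: reordering between subflows, no retransmit.
reordered = fast_retransmit_candidates(sent, {2, 4, 6})

# Later data on path A arrived, so segment 1 was probably lost on path A.
lost = fast_retransmit_candidates(sent, {3, 5, 7})
print(reordered, lost)
```

In the first call three newer segments have been SACKed, which would
trip a single-path fast retransmit, but none of them traveled over the
same path as the missing data.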

264	3.3.  Slow retransmit

266	   In multipath TCP, a per-path RTO is employed to recover from
267	   congestion events that fast retransmit can't handle.  Because the
268	   missing packets create holes in the data stream, subsequent packets
269	   received over other paths must be buffered in the receive buffer.
270	   Unless the receive buffer is extremely large, this means the entire
271	   session stalls when the receive buffer fills up.  This situation
272	   persists until the RTO expires for the congested or broken path so
273	   the missing packets can be retransmitted.  Should the path in
274	   question be completely broken, this will then lead to an almost
275	   immediate new stall, and the stall/RTO cycles will then continue
276	   until the user timeout / R2 timer [RFC1122] for the subflow expires.

278	   This is solved by taking unacknowledged packets transmitted over
279	   subflows that are stalled because they have exhausted their
280	   congestion window and are now waiting for the RTO to expire, and
281	   scheduling retransmissions of those packets over other paths before
282	   the RTO of the stalled subflow expires.  This should be done such
283	   that the missing packet arrives before it becomes necessary to stop
284	   sending data altogether because the receiver advertises a zero
285	   receive buffer.  Such retransmissions therefore happen as the receive
286	   buffer space advertised by the receiver reaches RTT * MSS for the
287	   path that will be used for the retransmission; presumably the path
288	   with the lowest RTT.  In essence, this creates a second level of fast
289	   retransmit that acts across subflows in addition to the normal fast
290	   retransmit that happens per subflow.  This mechanism is named "slow
291	   retransmit".
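
The decision could be sketched as below.  All names are hypothetical.
The threshold is passed in by the caller because this memo states it
as RTT * MSS for the retransmission path and the sketch does not
attempt to convert that expression into a byte count.

```python
def slow_retransmit_plan(unacked, stalled_paths, path_rtt,
                         advertised_window, threshold):
    """Pick segments to resend on another path before a stalled RTO fires.

    unacked:           dict of segment number -> path id it was sent on
    stalled_paths:     set of path ids with a full window, waiting for RTO
    path_rtt:          dict of path id -> smoothed RTT in seconds
    advertised_window: receive window currently advertised by the receiver
    threshold:         window level at or below which to act (caller's choice)
    Returns (path id to use, sorted segment numbers), or None.
    """
    if advertised_window > threshold:
        return None                        # receiver still has room
    healthy = [p for p in path_rtt if p not in stalled_paths]
    if not healthy:
        return None                        # no progressing path to borrow
    target = min(healthy, key=lambda p: path_rtt[p])   # lowest RTT path
    segs = sorted(s for s, p in unacked.items() if p in stalled_paths)
    if not segs:
        return None
    return target, segs


plan = slow_retransmit_plan(
    unacked={10: "A", 11: "B", 12: "A"},
    stalled_paths={"A"},
    path_rtt={"A": 0.2, "B": 0.05, "C": 0.08},
    advertised_window=1000,
    threshold=1460,
)
print(plan)
```

Only segments stranded on stalled subflows are selected, and they are
moved to the progressing path with the lowest RTT, as suggested above.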

293	   In the case of single path TCP, scheduling retransmissions before the
294	   RTO expires could be problematic because this would be more
295	   aggressive than standard (New)Reno congestion control.  But in the
296	   case of multipath TCP, the retransmission can happen over one of the
297	   other paths, which is still progressing.

299	   By scheduling a retransmission faster than an RTO, there is an
300	   increased risk that a packet that was still working its way through
301	   the network is retransmitted unnecessarily.  However, the alternative
302	   is allowing the progress of the session to stall (on all paths),
303	   reducing throughput significantly.

305	3.4.  SACK

307	   When packets (belonging to different subflows) arrive out of order,
308	   the receiver can't acknowledge the receipt of the out of order
309	   packets using TCP's normal cumulative acknowledgment.  However, the
310	   Selective Acknowledgment (SACK) mechanism [RFC2018] (also see
311	   [RFC1072]) is widely implemented.  SACK makes it possible for a
312	   receiver to indicate that three or four additional ranges of data
313	   were received in addition to what is acknowledged using a normal
314	   cumulative ACK.  When packets are sent over multiple paths and arrive
315	   out of order, the information in the SACK returned by the receiver
316	   can tell the sender how each subflow is progressing, so per-subflow
317	   congestion control can progress smoothly and unnecessary
318	   retransmissions are largely avoided.

320	   One-ended multipath TCP requires the use of SACK to be able to
321	   determine which subflows are progressing even if other subflows are
322	   stalled, and thus the normal TCP ACK isn't progressing.  If the
323	   remote host doesn't indicate the SACK capability during the three-way
324	   handshake, a multipath TCP implementation SHOULD limit itself to
325	   using only a single subflow, thus disabling multipath processing
326	   for the session in question.

328	3.5.  Fairness and TCP friendliness

330	   One of the goals of multipath TCP is increased performance over
331	   regular TCP.  However, it would be harmful to realize this benefit by
332	   taking more than a "fair" share of the available bandwidth.  One
333	   choice would be to execute normal NewReno congestion control
334	   independently on each subflow, so that each individual subflow
335	   competes with other TCPs on the same footing as a regular TCP
336	   session.  If all subflows use non-overlapping physical paths, other
337	   TCPs are no worse off than in the situation where the multipath TCP
338	   were a regular TCP sharing their path, so this could be considered
339	   fair even though the multipath TCP increases its bandwidth in direct
340	   relationship to the number of subflows used.  Note that in this case,
341	   although multipath TCP sends at the same rate as regular TCP on a
342	   given path, resource pooling [wischik08pooling] benefits are still
343	   realized because a given transmission completes faster so it uses up
344	   resources for a shorter amount of time.

346	   But if several logical paths share a physical path, multipath TCP
347	   takes a larger share of the bandwidth on that path.  This would only
348	   be acceptable as fair for a very small number of subflows.  The other
349	   end of the spectrum would be for multipath TCP to conform to exactly
350	   the same congestion window increase and decrease envelope that a
351	   regular TCP exhibits, being no more aggressive than a regular single
352	   path TCP session.  At this point in time we will assume that fairness
353	   is a tunable factor of the regular NewReno AIMD envelope.  A simple
354	   way to limit the amount of additional aggressiveness exhibited by
355	   multipath TCP is a limit on the number of subflows.  Until more
356	   analysis has been performed and/or there is more experience with
357	   multipath TCP, a multipath TCP implementation SHOULD limit itself to
358	   using no more than 3 subflows concurrently.
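
Taken together with the SACK requirement in Section 3.4, the subflow
limit could be applied as in the following sketch.  The function and
constant names are hypothetical, and the sketch assumes the default
path always counts as one usable path.

```python
MAX_SUBFLOWS = 3   # the SHOULD-level cap stated above


def allowed_subflows(sack_permitted, usable_paths):
    """Number of subflows a sender following both SHOULDs would use."""
    if not sack_permitted:
        return 1                   # no SACK: single path operation only
    return max(1, min(usable_paths, MAX_SUBFLOWS))


print(allowed_subflows(True, 5), allowed_subflows(False, 5))
```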

360	4.  Path selection

362	   Note that in order to gain multipath benefits, the multipath TCP
363	   layer must be able to determine the logical path followed by each
364	   packet so it can measure path properties and perform per-path
365	   congestion control.  In order to limit the number of packets flowing
366	   over each path to the amount allowed by the per path congestion
367	   window, the multipath TCP layer must be able to specify over which
368	   path a given packet is transmitted.

370	   The situation where routers distribute packets over different paths
371	   based on their own criteria makes it impossible for hosts to send
372	   less traffic over congested paths and more traffic over uncongested
373	   paths and is therefore incompatible with multipath TCP.  When routers
374	   distribute traffic belonging to the same flow (or, in the case of
375	   multipath TCP: subflow) over different paths this will also cause
376	   reordering and the associated performance impact on TCP.

378	4.1.  The multipath IP layer

380	   The one-ended multipath TCP is logically layered on a multipath IP
381	   layer, which is able to deliver packets to the same destination
382	   address through one or more logical paths, where the set of n logical
383	   paths share between one and m physical paths.  In some cases, the
384	   multipath IP layer will be able to determine that a logical path
385	   isn't working, or maps to the same physical path as a previous
386	   logical path.  For example, if the multipath TCP indicates that a
387	   packet should be sent over the third path, and the multipath IP is
388	   set up to use different next hop addresses for path selection, but
389	   only two next hop addresses are available, the multipath IP layer can
390	   provide feedback to the multipath TCP layer.  In other cases, packets
391	   simply won't be delivered, or will be delivered through the same
392	   physical path used by other logical paths.  This may for instance
393	   happen when multipath TCP selects path 1 and multipath IP puts a path
394	   selector with value "1" in the packet, but there are no multipath
395	   capable routers between the source and destination, so all packets,
396	   regardless of the presence and/or value of a path selector, are
397	   routed over the same physical path.

399	   It is up to the multipath TCP layer to handle each of these
400	   situations.

402	   For the purposes of this multipath TCP specification, the simplest
403	   possible interface to the multipath IP layer is assumed.  When TCP
404	   segments traveling down the stack from the TCP layer to the IP layer
405	   aren't accompanied by a path selector value, or the path selector
406	   value is zero, the IP layer delivers packets in the same way as for
407	   unmodified TCP and other existing transport protocols, i.e., over the
408	   default path.  Segments may also be accompanied by a path selector
409	   value higher than zero, which indicates the desired path.  If the
410	   desired logical path is available, or may be available, the multipath
411	   IP layer attempts to deliver the packet using that logical path.  If
412	   the desired logical path is known to be unavailable, the multipath IP
413	   layer drops the segment.
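
The interface described in this paragraph is small enough to sketch
directly.  The enumeration and function names are hypothetical; the
three availability states mirror the "available", "may be available"
and "known to be unavailable" cases in the text.

```python
from enum import Enum


class Availability(Enum):
    AVAILABLE = "available"
    UNKNOWN = "unknown"            # may be available; IP layer cannot tell
    UNAVAILABLE = "unavailable"    # known not to work


DEFAULT_PATH = 0


def dispatch(path_selector, availability):
    """Return the logical path to send on, or None to drop the segment.

    path_selector: None or 0 for the default path, else a positive integer
    availability:  dict of path id -> Availability known to the IP layer
    """
    if not path_selector:
        return DEFAULT_PATH        # same treatment as unmodified TCP
    state = availability.get(path_selector, Availability.UNKNOWN)
    if state is Availability.UNAVAILABLE:
        return None                # known to be unavailable: drop
    return path_selector           # available, or worth attempting


print(dispatch(None, {}), dispatch(3, {3: Availability.UNAVAILABLE}))
```

Dropping rather than falling back to another path keeps the TCP
layer's view of which packet used which path accurate.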

415	   It is assumed that paths as seen by the multipath IP layer are mapped
416	   to logical paths with increasing numbers roughly ordered in order of
417	   decreasing assumed performance or availability.  I.e., if path x
418	   doesn't work or has low performance, that doesn't necessarily mean
419	   that path x+1 doesn't work or has low performance, but if paths x,
420	   x+1 and x+2 don't work or have low performance, then it's highly
421	   likely that paths x+3 and beyond also don't work or have even lower
422	   performance.  Routers may have good next hop or even intra-domain
423	   link weight information and link congestion information, but they
424	   generally don't have information about the end-to-end path
425	   properties, so the ordering of paths from high to low availability/
426	   performance must be considered little more than a hint.

428	   The multipath IP layer may be implemented through a variety of
429	   mechanisms, including but not limited to:

431	   o  Using different outgoing interfaces on the host

433	   o  Directing packets towards different next hop routers

435	   o  Integration with shim6 [I-D.ietf-shim6-proto] so that packets can
436	      use different address pairs

438	   o  Manipulation of fields used in ECMP [RFC2992] (i.e., a different
439	      flow label)

441	   o  Type of service routing (such as [RFC4915])

443	   o  Different lower layer encapsulation, such as MPLS

445	   o  Tunneling through overlays

447	   o  Source routing

449	   o  An explicit path selector field in packets, acted upon by routers

451	   At this time, no choice is made between these different mechanisms.

453	4.2.  The path indication option

455	   Note that several of the fields discussed below are defined with
456	   future developments in mind; they are not necessarily immediately
457	   useful.

459	   In order to allow for accurate RTT measurements and to inform the IP
460	   layer of the selected path, a TCP option indicating the desired path
461	   is included in all segments that don't use the default path.  The
462	   format of this option is as follows:

464	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
465	   |   KIND=TBA    |  LENGTH = 3   |D|  MP |R|  SP |
466	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

468	   The length is 3.

470	   D is the "discard eligibility" flag (1 bit).  It is similar, but not
471	   identical, to the frame relay discard eligibility bit or the ATM cell
472	   loss priority bit.  Set to zero, no special behavior is requested.
473	   Set to one, this indicates that loss of the packet will be
474	   inconsequential.  This allows routers to drop packets with D=1 more
475	   readily than other packets under congested conditions, and also to
476	   completely block packets with D=1 on links that are considered long-
477	   term congested or expensive, even if there is no momentary
478	   congestion.

480	   Setting the D bit to 1 for some subflows (presumably, ones with a
481	   performance lower than the best performing subflow) allows multipath
482	   TCP to give way to regular TCP and other single path traffic on
483	   congested or expensive paths.  As long as the multipath TCP sets D to
484	   0 on the subflow with the best performance, multipath TCP should
485	   still perform better than regular TCP, but the reduction in bandwidth
486	   use on the other paths helps achieve resource pooling benefits.

488	   MP is a path selector that may be interpreted by multiple
489	   routers along the way (3 bits).  A value of 0 is the default path
490	   that is also taken by packets that don't contain a multipath option.
491	   Multipath TCP aware routers should take this value into account when
492	   performing ECMP [RFC2992].  Packets with any value for MP MUST be
493	   forwarded, even if the number of available paths is smaller than the
494	   value in MP.

496	   R (1 bit) is reserved for future use.  MUST be set to zero on
497	   transmission and ignored on reception.

499	   SP is a path selector that is interpreted only once by the local TCP
500	   stack or a router close to the sender (3 bits).  A value of 0 is the
501	   default path that is also taken by packets that don't contain a
502	   multipath option.  If the value in SP points to a path that isn't
503	   available, the packet SHOULD be silently dropped.  This behavior, as
504	   opposed to selecting an alternate path out of the available ones,
505	   helps avoid the use of duplicate paths.  As such, a router may only
506	   interpret SP rather than MP when it is known that the router is the
507	   only one acting on SP.  All other routers may only act on MP.

509	   It is not expected that routers will make routing decisions directly
510	   based on the path indication option, as this option occurs deep
511	   inside the packet and not in a fixed place.  However, a multipath IP
512	   layer or a middlebox may write a path selection value into a field in
513	   packets that is easily accessible to routers.  But conceptually, the
514	   routers act upon the values in SP and MP.

516	   The initial packets for each TCP session MUST use D, MP and SP values
517	   of zero.  If D, MP and SP are all zero, then the path selector option
518	   isn't included in the packet.  This makes sure that single path
519	   operation remains possible even if packets with the path selector
520	   option are filtered in the network or rejected by the receiver.  The
521	   packets that are part of the TCP three-way handshake SHOULD be sent
522	   over the default path, in which case they don't contain the path
523	   selector option; hence the ability to do multipath TCP isn't
524	   indicated to the correspondent at the beginning of the session as is
525	   usual for most other TCP extensions.

527	4.3.  Timestamp integration option

529	   As an optimization, hosts MAY borrow the four bits used by the path
530	   selector option from the timestamp option, and thus save one byte of
531	   option space, which means the path selector option can replace the
532	   padding necessary when the timestamp option is used and not increase
533	   header overhead.  In that case, the combined path selector and
534	   timestamp options MUST appear as follows:

536	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
537	   |   KIND=TBA    |  LENGTH = 2   |     KIND=8    |  LENGTH = 10  |
538	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
539	   |D|  MP |                 TS Value (TSval)                      |
540	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
541	   |                       TS Echo Reply (TSecr)                   |
542	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

544	   D and MP are the same as in the three-byte form of the path selector
545	   option.  R and SP do not occur in this form of the path selector
546	   option and are assumed to be zero.

548	   TSval is the locally generated timestamp (28 bits).  Because the
549	   timestamp is reduced to 28 bits, the minimum clock tick interval is
550	   increased from the 59 nanoseconds mandated by [RFC1323] to
551	   1 microsecond so the timestamp wraps in no less than 255 seconds.

553	   TSecr is the timestamp echoed back to the other side (32 bits).

555	   All hosts conforming to this specification MUST be able to recognize
556	   the integrated path selector and timestamp options, but they are not
557	   required to generate them.
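
As an illustration of the layout above, the combined options could be built and parsed as sketched below. The function names are hypothetical, and because the option kind is still TBA it is passed in as a parameter; the value 253 used in the usage note is only a placeholder from the experimental range.

```python
import struct

TS_KIND = 8                    # timestamp option kind, from the diagram
TSVAL_MASK = (1 << 28) - 1     # TSval is reduced to 28 bits


def pack_combined(ps_kind, d, mp, tsval, tsecr):
    """Build the 12-byte combined path selector and timestamp options."""
    if d not in (0, 1):
        raise ValueError("D is a 1-bit field")
    if not 0 <= mp <= 7:
        raise ValueError("MP is a 3-bit field")
    # D is the most significant bit, then MP, then the 28-bit TSval.
    word = (d << 31) | (mp << 28) | (tsval & TSVAL_MASK)
    return struct.pack("!BBBBII", ps_kind, 2, TS_KIND, 10,
                       word, tsecr & 0xFFFFFFFF)


def unpack_combined(data, ps_kind):
    """Parse the combined form; R and SP are implicitly zero."""
    kind, ps_len, ts_kind, ts_len, word, tsecr = struct.unpack(
        "!BBBBII", data)
    if (kind, ps_len, ts_kind, ts_len) != (ps_kind, 2, TS_KIND, 10):
        raise ValueError("not a combined path selector/timestamp option")
    return {"d": word >> 31, "mp": (word >> 28) & 0x7,
            "tsval": word & TSVAL_MASK, "tsecr": tsecr,
            "r": 0, "sp": 0}
```

For example, pack_combined(253, 1, 3, 0x1234567, 0xDEADBEEF) produces twelve bytes that unpack_combined() turns back into the same field values. At a tick of 1 microsecond, a 28-bit counter wraps after 2^28 microseconds, which is about 268 seconds.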

559	4.4.  Path for retransmissions

561	   A multipath TCP implementation MUST be capable of scheduling
562	   retransmissions over a path different from the path used to transmit
563	   the packet originally.  This includes packets subject to fast
564	   retransmit.

566	4.5.  ECN

568	   Explicit Congestion Notification works by routers setting a
569	   congestion indication in the IP header of packets rather than
570	   dropping those packets when they experience congestion.  The receiver
571	   echoes this information back to the sender, which then performs
572	   congestion control in exactly the same way as if a packet was lost.
573	   The ECN specification ([RFC3168]) is such that the receiver sets the
574	   ECN-Echo (ECE) flag in the TCP header for all subsequent packets that
575	   it sends back until the sender sets the Congestion Window Reduced
576	   (CWR) flag.  As the ECE flag is set in multiple ACKs, there is no
577	   obvious way to correlate the ECN indication in an ACK with a specific
578	   packet that experienced congestion, and subsequently, the path that
579	   is congested.

581	   At this time, a multipath TCP conforming to this specification SHOULD
582	   NOT use ECN.  ECN MAY be negotiated, but when more than a single path
583	   is used at a given time, packets SHOULD be sent with the ECN field
584	   set to Not-ECN (00), and incoming non-zero ECE flags SHOULD NOT be
585	   acted upon with regard to congestion control.

587	4.6.  Path MTU discovery

589	   Path MTU discovery [RFC1191] is performed for TCP by having TCP
590	   reduce its packet sizes whenever "packet too big but DF set" ICMP
591	   messages are received.  As the name suggests, the path MTU is
592	   dependent on the path used, so multipath TCP must maintain MTU
593	   information for each path, and adjust this information for each path
594	   individually based on the too big messages that it receives.

596	   The time between probing with a larger than previously discovered MTU
597	   must either be randomized or explicitly coordinated to avoid probing
598	   larger MTUs for multiple subflows at the same time, as probing larger
599	   MTUs is likely to lead to a lost packet, and having losses on
600	   multiple paths at the same time would be suboptimal.  For instance,
601	   rather than probe every t, in the case of 2 paths, after t*0.5 the
602	   first path is probed, after t the second and after t*1.5 the first is
603	   probed again.
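
The explicitly coordinated variant of this schedule can be sketched as below. The function name and its parameters are hypothetical; the sketch simply spreads the probes of all paths evenly over the interval t, which reproduces the two-path example above.

```python
def probe_schedule(num_paths, interval, count):
    """Return the first `count` probes as (time, path_index) tuples.

    Each path is still probed once per `interval`, but the probes of
    different paths are offset by interval / num_paths so that no two
    paths are probed at the same time.
    """
    if num_paths < 1:
        raise ValueError("need at least one path")
    step = interval / num_paths
    return [((i + 1) * step, i % num_paths) for i in range(count)]
```

With two paths and t = 10 seconds, probe_schedule(2, 10.0, 3) gives probes at 5, 10 and 15 seconds, alternating between the first and the second path.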

605	   Both the IPv4 and IPv6 versions of ICMP return enough of the original
606	   packet in a "packet too big" message to be able to recover the
607	   sequence number from the original packet, which makes it possible to
608	   correlate the too big message with the packet that caused it, and
609	   thus the path used to transmit the packet.

611	5.  Flow control and buffer sizes

613	   In order to accommodate the increased number of packets in flight,
614	   the send buffer must be increased in direct relationship with the
615	   number of paths being used.  Alternatively, the number of paths used
616	   concurrently should be limited to send buffer / avgRTT.

618	   Although under normal operation, the receive buffer doesn't fill up,
619	   there are two reasons the receive buffer must be the same size as the
620	   send buffer: it must be able to accommodate a round trip time plus
621	   two segments worth of data during fast retransmit, and the advertised
622	   receive window limits the amount of data the sender will transmit
623	   before waiting for acknowledgments.  So in practice, the receive
624	   buffer limits the maximum size of the send buffer, and therefore, the
625	   number of paths that can be supported concurrently.

627	   There is no simple rule of thumb to determine the number of paths
628	   that should be used, as the maximum number of paths that the receive
629	   window can accommodate depends both on the maximum receive window
630	   advertised by the receiver and on the RTTs of the paths.

632	6.  Handling of RSTs

634	   If an RST is received after enabling a new path, this could be a
635	   reaction to the presence of an unknown option.  So the optimal
636	   situation would be for an RST to reset just the path used to send the
637	   packet that generated the RST, not the entire session.  Only when the
638	   last path or the default path (on which packets don't include special
639	   options) receives an RST should the entire session be reset.
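
A minimal sketch of this rule follows. It assumes that the sender has already determined which path carried the packet that triggered the RST; the function name and the use of path index 0 for the default path are assumptions made for the example.

```python
def handle_rst(active_paths, rst_path, default_path=0):
    """Return (remaining_paths, reset_session) after an RST.

    The session is reset when the RST belongs to the default path or
    when no other path remains; otherwise only that path is removed.
    """
    remaining = set(active_paths) - {rst_path}
    if rst_path == default_path or not remaining:
        return set(), True
    return remaining, False
```

For example, an RST attributed to path 2 while paths 0, 1 and 2 are active only removes path 2, whereas an RST on path 0 resets the whole session.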

641	7.  Middlebox considerations

643	   NATs are designed to be transparent to TCP.  Because one-ended
644	   multipath TCP conforms to normal TCP semantics on the wire, multipath
645	   TCP should in principle also be compatible with NAT.  However, if
646	   different paths are served by different NATs that apply different
647	   translations, the receiver won't be able to determine that the
648	   different subflows through the different paths belong to the same TCP
649	   session.  So for NAT to work, the translation must either happen in a
650	   location that all paths flow through, or the different NATs on the
651	   different paths must act as a single, distributed NAT and apply the
652	   same translation to the different subflows.

654	   Middleboxes that only see traffic flowing over a subset of the paths
655	   used will see large numbers of gaps in the sequence number space.
656	   They may also observe only a partial three-way handshake, or not
657	   observe any ACKs.  As such, as with NATs, middleboxes that enforce
658	   conformance to known TCP behavior must be placed such that they
659	   observe all subflows.  For middleboxes that just check whether
660	   packets fall inside the TCP window, it may be sufficient for
661	   multipath TCP senders to make sure that all paths see at least one
662	   packet per window.  Middleboxes that enforce sequence number
663	   integrity will almost certainly also block TCP packets for which they
664	   didn't observe the three-way handshake.  A possible way to
665	   accommodate that behavior would be to send copies of all session
666	   establishment and tear down packets over all paths that the sender
667	   may use.  However, this strategy is still likely to fail unless the
668	   receiver does the same so the middleboxes may observe the signaling
669	   packets flowing in both directions.

671	   It's also possible that middleboxes (or perhaps hosts themselves)
672	   reject packets with the path indicator TCP option.  Since packets
673	   flowing over the default path don't carry the path indicator option,
674	   these packets should always be allowed through, so single path
675	   operation is always possible.  When a multipath TCP sender starts to
676	   send packets over alternative paths, those packets won't make it to
677	   the receiver because they contain the path indicator option.  The
678	   result is that a new subflow, which would use a congestion window of
679	   two maximum segment sizes, would send two packets and then
680	   experience a retransmission timeout.  Slow retransmit makes sure the
681	   packets are transmitted before the session stalls, so the impact of
682	   the lost packets is negligible.

684	8.  Security considerations

686	   None at this time.

688	9.  IANA considerations

690	   IANA is requested to provide a TCP option kind number for the path
691	   indication option.

693	10.  Acknowledgements

695	   The single ended multipath TCP was developed together with Marcelo
696	   Bagnulo and Arturo Azcorra.

698	   Members of the Trilogy project, especially Costin Raiciu, have
699	   contributed valuable insights.

701	   Iljitsch van Beijnum is supported by Trilogy
702	   (http://www.trilogy-project.org), a research project (ICT-216372)
703	   partially funded by the European Community under its Seventh
704	   Framework Program.  The views expressed here are those of the
705	   author(s) only.  The European Commission is not liable for any use
706	   that may be made of the information in this document.

708	11.  References

710	11.1.  Normative References

712	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
713	              RFC 793, September 1981.

715	   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
716	              November 1990.

718	   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
719	              for High Performance", RFC 1323, May 1992.

721	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
722	              Selective Acknowledgment Options", RFC 2018, October 1996.

724	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
725	              Requirement Levels", BCP 14, RFC 2119, March 1997.

727	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
728	              Control", RFC 2581, April 1999.

730	   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
731	              Algorithm", RFC 2992, November 2000.

733	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
734	              of Explicit Congestion Notification (ECN) to IP",
735	              RFC 3168, September 2001.

737	11.2.  Informational References

739	   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for long-delay
740	              paths", RFC 1072, October 1988.

742	   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
743	              Communication Layers", STD 3, RFC 1122, October 1989.

745	   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
746	              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
747	              Zhang, L., and V. Paxson, "Stream Control Transmission
748	              Protocol", RFC 2960, October 2000.

750	   [RFC4915]  Psenak, P., Mirtorabi, S., Roy, A., Nguyen, L., and P.
751	              Pillay-Esnault, "Multi-Topology (MT) Routing in OSPF",
752	              RFC 4915, June 2007.

754	   [wischik08pooling]
755	              Wischik, D., Handley, M., and M. Bagnulo Braun, "The
756	              resource pooling principle", Computer Communication
757	              Review 38, September 2008.

759	   [I-D.ietf-shim6-proto]
760	              Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming
761	              Shim Protocol for IPv6", draft-ietf-shim6-proto-11 (work
762	              in progress), December 2008.

764	Appendix A.  Document and discussion information

766	   The latest version of this document will always be available at
767	   http://www.muada.com/drafts/.  Please direct questions and comments
768	   to the multipathtcp@ietf.org mailing list or directly to the author.

770	Appendix B.  An implementation strategy

772	   In order to perform per-path congestion control, all of the ACK-based
773	   events that trigger congestion control responses as well as all the
774	   variables used by the congestion control algorithms must be
775	   recreated in the multipath situation.  These are the triggers and
776	   variables for the four mechanisms in RFC 2581.

778	   1.  the path MTU (page 4)

780	   2.  the arrival of an ACK that acknowledges new data (page 4)

782	   3.  the arrival of a non-duplicate ACK (page 4) or the sum of new
783	       data acknowledged (page 5)

785	   4.  triggering of the retransmission timer (page 5)

787	   5.  the flightsize or number of bytes sent but not acknowledged (page
788	       5)

790	   6.  the retransmission of a segment (page 5)

792	   7.  the arrival of a third or subsequent duplicate ACK (page 6, page
793	       7)

795	   8.  whether a retransmission timeout period has elapsed since the
796	       last reception of an ACK (page 7)

798	   Items 1, 4, 6 and 8 are maintained session-wide.

800	   We recreate these events and variables based on SACK information in
801	   the one-sequence number multipath TCP case as follows.

803	   We keep track of every packet sent.  (Alternatively: multi-packet
804	   contiguous blocks of data transmitted over the same path.)  When an
805	   ACK comes in, we first remove the stored information about packets/
806	   data blocks that are cumulatively ACKed, noting how much data was
807	   ACKed for each path that the packets were sent over.  Then we do the
808	   same for all the SACK blocks in the ACK.  Because we remove the
809	   information about (S)ACKed data and you can remove something just
810	   once, we don't have to keep track of previous SACKs like the current
811	   BSD implementation does.

813	   The only slightly tricky part is emulating duplicate ACKs.  This may
814	   not even be really necessary, as the SACKs give us better information
815	   to base fast retransmit on, but that's something for another day.
816	   What happens in the pseudo code is that when traversing the list of
817	   sent packets (this happens in order of seqnum), we note the path that
818	   packets that aren't SACKed are sent over.  When we're done processing
819	   SACK data and it turns out that for a path there are one or more
820	   packets that we skipped over when processing SACK data and there was
821	   also data SACKed after a skipped packet, there was a lost (or
822	   reordered) packet on this path.  When the amount of "duplicate ACKed"
823	   data grows beyond two segment sizes, we've reached the equivalent of
824	   three duplicate ACKs so we trigger fast retransmit (7).

826	   We update the congestion window (2 and 3) when there was data
827	   (S)ACKed for a path.  ACKs that don't acknowledge any data for a path
828	   aren't relevant because we don't need them to trigger fast retransmit
829	   and we assume that they're sent to (S)ACK data for other paths,
830	   anyway.  (Or they could be window updates.)

832	   We maintain the flightsize (5) by simply adding data bytes as packets
833	   are transmitted and subtracting when they're (S)ACKed.  Because we
834	   have explicit SACKs, we don't need to guess based on duplicate ACKs.
835	   The flightsize is also adjusted when we perform fast retransmit or a
836	   regular retransmission over a path other than which was used for the
837	   original packet.  In addition, we explicitly mark some packets to
838	   trigger once-per-RTT actions when they're ACKed.

840	   Pseudo code for the above:

842	   // initializing data structures is left as an exercise for the
843	   // reader

845	   // transmitting packets
846	   // assume we've selected a path to transmit over

848	   path.flightsize = path.flightsize + packet.datasize
849	   packet.path = path
850	   packet.status.acked = false
851	   // set up state to remember to do per RTT stuff when packet is
852	   // ACKed
853	   if path.do_per_rtt_next_packet == true
854	     path.per_rtt_seqnum = packet.seqnum.first
855	     packet.per_rtt = true
856	     path.do_per_rtt_next_packet = false
857	   else
858	     packet.per_rtt = false
859	   // don't set ECN on outgoing packets for now, can add logic
860	   // for deciding which packets to ECN enable later
861	   packet.ecn.sent = 0
862	   // add to linked list of sent packets (to handle retrans-
863	   // missions, linked list must maintain seqnum order, not FIFO
864	   // or LIFO)
865	   llpush(packet)

867	   // receiving (S)ACKs

869	   // normal flow-wide flow control actions based on cumACK
870	   // also happen (elsewhere)

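872	   // handle ECN, must detect transitions rather than
873	   // depend on actual value: ecn.current is true only for
874	   // the first ACK in a run of ACKs that have ECE set
875	   if packet.ecnecho == true and ecn.previous == false
876	     ecn.current = true
877	   else
878	     // clear stale indication, also when ECE is no longer set
879	     ecn.current = false
880	   // remember the ECE value for the next ACK
881	   ecn.previous = packet.ecnecho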

883	   // initialize some stuff before we handle the ACK
884	   for each path
885	     path.do_per_rtt = false
886	     path.ackedbytes = 0
887	     path.unacked.sure = 0
888	     path.unacked.maybe = 0
889	     path.ecn.received = false

891	   // remove cumulatively ACKed packets; path means packet.path
892	   llwalk_init
893	   packet = llwalk_next
894	   while packet.seqnum.first < ack.cumulative
895	     // ECN, we only act if we enabled ECN when we sent the packet
896	     if ecn.current & packet.ecn.sent <> 0
897	       path.ecn.received = true
898	     // if part of a packet is ACKed, we need some trickery
899	     if packet.seqnum.last_plus_one > ack.cumulative
900	       path.ackedbytes += ack.cumulative - packet.seqnum.first
901	       packet.seqnum.first = ack.cumulative
902	     else
903	       path.ackedbytes = path.ackedbytes + packet.datasize
904	       if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
905	         path.do_per_rtt = true
906	       llremove(packet)
907	     packet = llwalk_next

909	   // now we handle the SACKs (assume exactly one SACK block for
910	   // simplicity); we continue walking the linked list, no need to
911	   // restart
912	   while packet.seqnum.first < ack.sack.last_plus_one
913	     if packet.seqnum.last_plus_one > ack.sack.first
914	       // these packets overlap with the SACK block; for
915	       // simplicity, assume packets are always completely SACKed
916	       // (in reality we need to split a packet if only the
917	       // middle is SACKed)
918	       // ECN, we only act if we enabled ECN when we sent the packet
919	       if ecn.current & packet.ecn.sent <> 0
920	         path.ecn.received = true
921	       path.ackedbytes = path.ackedbytes + packet.datasize
922	       if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
923	         path.do_per_rtt = true
924	       // add potentially unacked bytes to for sure unacked bytes
925	       // because we now know we had a SACK hole if any
926	       // unacked maybe bytes
927	       path.unacked.sure = path.unacked.sure + path.unacked.maybe
928	       path.unacked.maybe = 0
929	       // remove packet from the list
930	       llremove(packet)
931	     else
932	       // note how many bytes we skipped unSACKed
933	       // if later data is SACKed, that's our version of a dup ACK
934	       path.unacked.maybe = path.unacked.maybe + packet.datasize
935	     packet = llwalk_next

937	   // done processing, now tally up the results
938	   for each path
939	     // update flightsize (item 5 in CC events/variables list)
940	     path.flightsize = path.flightsize - path.ackedbytes
941	     // if any data was ACKed
942	     if path.ackedbytes <> 0
943	       // some stuff was ACKed for this path
944	       if path.unacked.sure > 2 * path.mss
945	         // more than 2 * MSS worth of data in SACK hole = fast
946	         // retransmit; execute fast retransmit (item 7 in CC
947	         // events/variables list); need to handle flightsize in
948	         // some way here; ignore ECN because we already have a
949	         // loss, but send back ECN window update indication
950	       else
951	         // SACKs were cumulative for this path
952	         // execute cwnd update (items 2 and 3 in CC events/
953	         // variables list)
954	         // ECN must be taken into account here
955	         // and send back ECN window update indication
956	       if path.do_per_rtt
957	         // execute per RTT actions
958	         // indicate that this should be set for next packet sent
959	         path.do_per_rtt_next_packet = true

961	   Note that the pseudo-code doesn't cover all the mechanisms explained
962	   earlier.  Also, ECN is handled here because it's not too difficult to
963	   do.  The hard part is deciding which packets to enable ECN for.
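
For readers who prefer something executable, the per-path bookkeeping of the pseudo code can be approximated as in the sketch below. It is a simplified illustration under stated assumptions, not a reference implementation: the names Packet, PathState, send and process_ack are hypothetical, at most one SACK block is handled, packets are assumed to be completely SACKed or not at all, and ECN and the per-RTT markers are left out. Like the pseudo code, it compares the size of the SACK hole on a path against two segment sizes.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    first: int            # first sequence number in the packet
    last_plus_one: int    # sequence number just past the packet
    path: int             # index of the path the packet was sent over


@dataclass
class PathState:
    mss: int = 1000       # assumed segment size for the example
    flightsize: int = 0
    ackedbytes: int = 0
    unacked_sure: int = 0
    unacked_maybe: int = 0


def send(sent, paths, packet):
    """Record a transmitted packet; `sent` is kept in seqnum order."""
    paths[packet.path].flightsize += packet.last_plus_one - packet.first
    sent.append(packet)
    sent.sort(key=lambda p: p.first)


def process_ack(sent, paths, cum_ack, sack=None):
    """Apply one ACK; return the set of paths needing fast retransmit.

    `sack` is an optional (first, last_plus_one) tuple.
    """
    for state in paths.values():
        state.ackedbytes = state.unacked_sure = state.unacked_maybe = 0
    remaining = []
    for pkt in sent:
        state = paths[pkt.path]
        if pkt.last_plus_one <= cum_ack:
            # completely covered by the cumulative ACK: forget it
            state.ackedbytes += pkt.last_plus_one - pkt.first
            continue
        if pkt.first < cum_ack:
            # partially ACKed: shrink the stored packet
            state.ackedbytes += cum_ack - pkt.first
            pkt.first = cum_ack
        size = pkt.last_plus_one - pkt.first
        if sack and sack[0] <= pkt.first and pkt.last_plus_one <= sack[1]:
            # SACKed: bytes skipped earlier on this path are a hole
            state.ackedbytes += size
            state.unacked_sure += state.unacked_maybe
            state.unacked_maybe = 0
        else:
            state.unacked_maybe += size
            remaining.append(pkt)
    sent[:] = remaining
    fast_retransmit = set()
    for index, state in paths.items():
        state.flightsize -= state.ackedbytes
        if state.ackedbytes and state.unacked_sure > 2 * state.mss:
            fast_retransmit.add(index)
    return fast_retransmit
```

Because (S)ACKed packets are removed from the list as soon as they are seen, a later ACK that repeats the same SACK block changes nothing, which is the property the text above relies on.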

965	Author's Address

967	   Iljitsch van Beijnum
968	   IMDEA Networks
969	   Avda. del Mar Mediterraneo, 22
970	   Leganes, Madrid  28918
971	   Spain

973	   Email: iljitsch@muada.com