Internet Engineering Task Force                                C. Raiciu
Internet-Draft                                                M. Handley
Intended status: Experimental                                 D. Wischik
Expires: October 13, 2011                      University College London
                                                          April 11, 2011


      Coupled Congestion Control for Multipath Transport Protocols
                     draft-ietf-mptcp-congestion-03

Abstract

   Often endpoints are connected by multiple paths, but communications
   are usually restricted to a single path per connection.  Resource
   usage within the network would be more efficient were it possible
   for these multiple paths to be used concurrently.  Multipath TCP is
   a proposal to achieve multipath transport in TCP.

   New congestion control algorithms are needed for multipath transport
   protocols such as Multipath TCP, as single path algorithms have a
   series of issues in the multipath context.  One of the prominent
   problems is that running existing algorithms such as TCP New Reno
   independently on each path would give the multipath flow more than
   its fair share at a bottleneck link traversed by more than one of
   its subflows.  Further, it is desirable that a source with multiple
   paths available will transfer more traffic using the least congested
   of the paths, hence achieving resource pooling.  This would increase
   the overall efficiency of the network and also its robustness to
   failure.

   This document presents a congestion control algorithm which couples
   the congestion control algorithms running on different subflows by
   linking their increase functions, and dynamically controls the
   overall aggressiveness of the multipath flow.  The result is a
   practical algorithm that is fair to TCP at bottlenecks while moving
   traffic away from congested links.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 13, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the BSD License.

Table of Contents

   1.  Requirements Language
   2.  Introduction
   3.  Coupled Congestion Control Algorithm
   4.  Implementation Considerations
     4.1.  Implementation Considerations when CWND is Expressed in
           Packets
   5.  Discussion
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Introduction

   Multipath TCP (MPTCP, [I-D.ford-mptcp-multiaddressed]) is a set of
   extensions to regular TCP [RFC0793] that allows one TCP connection
   to be spread across multiple paths.  MPTCP distributes load through
   the creation of separate "subflows" across potentially disjoint
   paths.

   How should congestion control be performed for multipath TCP?
   First, each subflow must have its own congestion control state
   (i.e., cwnd) so that capacity on that path is matched by offered
   load.  The simplest way to achieve this is to run TCP New Reno
   congestion control [RFC5681] on each subflow.  However, this
   solution is unsatisfactory, as it gives the multipath flow an unfair
   share when the paths taken by its different subflows share a common
   bottleneck.

   Bottleneck fairness is just one requirement multipath congestion
   control should meet.  The following three goals capture the
   desirable properties of a practical multipath congestion control
   algorithm:

   o  Goal 1 (Improve Throughput) A multipath flow should perform at
      least as well as a single path flow would on the best of the
      paths available to it.

   o  Goal 2 (Do no harm) A multipath flow should not take up more
      capacity from any of the resources shared by its different paths
      than if it were a single flow using only one of these paths.
      This guarantees it will not unduly harm other flows.

   o  Goal 3 (Balance congestion) A multipath flow should move as much
      traffic as possible off its most congested paths, subject to
      meeting the first two goals.

   Goals 1 and 2 together ensure fairness at the bottleneck.  Goal 3
   captures the concept of resource pooling [WISCHIK]: if each
   multipath flow sends more data through its least congested path,
   the traffic in the network will move away from congested areas.
   This improves robustness and overall throughput, among other
   things.  The way to achieve resource pooling is to effectively
   "couple" the congestion control loops for the different subflows.

   We propose an algorithm that couples only the additive increase
   function of the subflows, and uses unmodified TCP New Reno behavior
   in case of a drop.  The algorithm relies on the traditional TCP
   mechanisms to detect drops, to retransmit data, etc.

   Detecting shared bottlenecks reliably is quite difficult, but it is
   just one part of a bigger question.  This bigger question is how
   much bandwidth a multipath user should use in total, even if there
   is no shared bottleneck.

   The congestion controller aims to set the multipath flow's
   aggregate bandwidth to be the same as a regular TCP flow would get
   on the best path available to the multipath flow.  To estimate the
   bandwidth of a regular TCP flow, the multipath flow estimates loss
   rates and round trip times and computes the target rate.  Then it
   adjusts the overall aggressiveness (parameter alpha) to achieve the
   desired rate.

   While the mechanism above always applies, its effect depends on
   whether or not the multipath TCP flow influences the link loss
   rates (high vs. low statistical multiplexing).  If MPTCP does not
   influence link loss rates, MPTCP will get the same throughput as
   TCP on the best path.  In cases with low statistical multiplexing,
   where the multipath flow influences the loss rates on the path, the
   multipath throughput will be strictly higher than a single TCP
   would get on any of the paths.  In particular, if using two idle
   paths, the multipath throughput will be the sum of the two paths'
   throughputs.

   This algorithm ensures bottleneck fairness and fairness in the
   broader, network sense.  We acknowledge that current TCP fairness
   criteria are far from ideal, but a multipath TCP needs to be
   deployable in the current Internet.  If needed, new fairness
   criteria can be implemented by the same algorithm we propose by
   appropriately scaling the overall aggressiveness.

   It is intended that the algorithm presented here can be applied to
   other multipath transport protocols, such as alternative multipath
   extensions to TCP, or indeed any other congestion-aware transport
   protocols.
   However, for the purposes of example, this document will, where
   appropriate, refer to the MPTCP protocol.

   The design decisions and evaluation of the congestion control
   algorithm are published in [NSDI].

   The algorithm presented here only extends TCP New Reno congestion
   control for multipath operation.  It is foreseeable that other
   congestion controllers will be implemented for multipath transport
   to achieve the bandwidth-scaling properties of the newer congestion
   control algorithms for regular TCP (such as Compound TCP and
   Cubic).

3.  Coupled Congestion Control Algorithm

   The algorithm we present only applies to the increase phase of the
   congestion avoidance state, which specifies how the window inflates
   upon receiving an ack.  The slow start, fast retransmit, and fast
   recovery algorithms, as well as the multiplicative decrease of the
   congestion avoidance state, are the same as in TCP [RFC5681].

   Let cwnd_i be the congestion window on subflow i.  Let tot_cwnd be
   the sum of the congestion windows of all subflows in the
   connection.  Let p_i, rtt_i and mss_i be the loss rate, round trip
   time (i.e., the smoothed round trip time estimate) and maximum
   segment size on subflow i.

   We assume throughout this document that the congestion window is
   maintained in bytes, unless otherwise specified.  We briefly
   describe the algorithm for packet-based implementations of cwnd in
   Section 4.1.

   Our proposed "Linked Increases" algorithm MUST:

   o  For each ack received on subflow i, increase cwnd_i by
      min(alpha*bytes_acked*mss_i/tot_cwnd, bytes_acked*mss_i/cwnd_i).

   The increase formula takes the minimum between the computed
   increase for the multipath subflow (the first argument to min) and
   the increase TCP would get in the same scenario (the second
   argument).  In this way, we ensure that any multipath subflow
   cannot be more aggressive than a TCP flow in the same
   circumstances, hence achieving Goal 2 (do no harm).

   "alpha" is a parameter of the algorithm that describes the
   aggressiveness of the multipath flow.  To meet Goal 1 (improve
   throughput), the value of alpha is chosen such that the aggregate
   throughput of the multipath flow is equal to the throughput a TCP
   flow would get if it ran on the best path.

   To get an intuition of what the algorithm is trying to do, let's
   take the case where all the subflows have the same round trip time
   and MSS.  In this case the algorithm will grow the total window by
   approximately alpha*MSS per RTT.  This increase is distributed to
   the individual flows according to their instantaneous window size.
   Subflow i will increase by alpha*cwnd_i/tot_cwnd segments per RTT.

   Note that, as in standard TCP, when tot_cwnd is large the increase
   may be 0.  In this case the increase MUST be set to 1.  We discuss
   how to implement this formula in practice in the next section.

   We assume appropriate byte counting (ABC, [RFC3465]) is used, hence
   the bytes_acked variable records the number of bytes newly
   acknowledged.  If ABC is not used, bytes_acked SHOULD be set to
   mss_i.

   When computing tot_cwnd, it is an easy mistake to simply sum up
   cwnd_i across all subflows: when a flow is in fast retransmit, its
   cwnd is typically inflated and no longer represents the real
   congestion window.  The correct behavior is to use the ssthresh
   value for flows in fast retransmit when computing tot_cwnd.  To
   cater for connections that are application limited, the computation
   should consider the minimum between flight_size_i and cwnd_i, and
   flight_size_i and ssthresh_i where appropriate.
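
   As a non-normative illustration, the sketch below shows how a
   byte-based implementation might compute tot_cwnd and apply the
   coupled increase on each ack.  The structure and function names
   (struct mptcp_subflow, compute_tot_cwnd, linked_increase) are
   hypothetical and not taken from any existing stack; alpha is shown
   here as a floating point parameter, with the integer-arithmetic
   version discussed in Section 4.

      #include <stdint.h>

      struct mptcp_subflow {
          uint64_t cwnd;         /* congestion window, in bytes      */
          uint64_t ssthresh;     /* slow start threshold, in bytes   */
          uint64_t flight_size;  /* outstanding data, in bytes       */
          uint32_t mss;          /* maximum segment size, in bytes   */
          uint32_t rtt_ms;       /* smoothed round trip time, in ms  */
          int      in_fast_retransmit;
      };

      /* Sum the subflow windows, substituting ssthresh for the
       * inflated cwnd of subflows in fast retransmit and capping each
       * term by the flight size for application-limited connections. */
      uint64_t compute_tot_cwnd(const struct mptcp_subflow *sf, int n)
      {
          uint64_t tot = 0;
          for (int i = 0; i < n; i++) {
              uint64_t w = sf[i].in_fast_retransmit ? sf[i].ssthresh
                                                    : sf[i].cwnd;
              if (sf[i].flight_size < w)
                  w = sf[i].flight_size;
              tot += w;
          }
          return tot;
      }

      /* Per-ack increase of cwnd_i during congestion avoidance:
       * min(alpha*bytes_acked*mss_i/tot_cwnd,
       *     bytes_acked*mss_i/cwnd_i),
       * never allowed to round down to zero.                         */
      void linked_increase(struct mptcp_subflow *sf_i,
                           uint64_t bytes_acked,
                           double alpha, uint64_t tot_cwnd)
      {
          uint64_t coupled = (uint64_t)(alpha * bytes_acked *
                                        sf_i->mss / tot_cwnd);
          uint64_t reno    = bytes_acked * sf_i->mss / sf_i->cwnd;
          uint64_t inc     = coupled < reno ? coupled : reno;

          sf_i->cwnd += inc > 0 ? inc : 1;
      }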

   The total throughput of a multipath flow depends on the value of
   alpha and the loss rates, maximum segment sizes and round trip
   times of its paths.  Since we require that the total throughput is
   no worse than the throughput a single TCP would get on the best
   path, it is impossible to choose a priori a single value of alpha
   that achieves the desired throughput on every occasion.  Hence,
   alpha must be computed based on the observed properties of the
   paths.

   The formula to compute alpha is:

                               cwnd_i
                         max --------
                          i        2
                              rtt_i
      alpha = tot_cwnd * ------------------
                         /      cwnd_i \ 2
                         | sum -------- |
                         \  i   rtt_i  /

   The formula is derived by equalizing the rate of the multipath flow
   with the rate of a TCP running on the best path, and solving for
   alpha.

4.  Implementation Considerations

   The formula for alpha above implies that alpha is a floating point
   value.  This would require performing costly floating point
   operations whenever an ack is received.  Further, in many kernels
   floating point operations are disabled.  There is an easy way to
   approximate the above calculations using integer arithmetic.

   Let alpha_scale be an integer.  When computing alpha, use
   alpha_scale * tot_cwnd instead of tot_cwnd, and do all the
   operations in integer arithmetic.

   Then, scale down the increase per ack by alpha_scale.  The
   algorithm is:

   o  For each ack received on subflow i, increase cwnd_i by
      min((alpha*bytes_acked*mss_i/tot_cwnd)/alpha_scale,
      bytes_acked*mss_i/cwnd_i).

   alpha_scale denotes the precision we want for computing alpha.
   Observe that the errors in computing the numerator or the
   denominator in the formula for alpha are quite small, as the cwnd
   in bytes is typically much larger than the RTT (measured in ms).

   With these changes, all the operations can be done using integer
   arithmetic.  We propose alpha_scale be a small power of two, to
   allow using faster shift operations instead of multiplication and
   division.  Our experiments show that using alpha_scale=512 works
   well in a wide range of scenarios.  Increasing alpha_scale
   increases precision, but also increases the risk of overflow when
   computing alpha.  Using 64-bit operations would solve this issue.
   Another option is to dynamically adjust alpha_scale when computing
   alpha; in this way we avoid overflow and obtain maximum precision.

   It is possible to implement the algorithm by calculating tot_cwnd
   on each ack; however, this would be costly, especially when the
   number of subflows is large.  To avoid this overhead the
   implementation MAY choose to maintain a new per-connection state
   variable called tot_cwnd.  If it does so, the implementation will
   update the tot_cwnd value whenever the individual subflows' windows
   are updated.  Updating only requires one more addition or
   subtraction operation compared to the regular, per-subflow
   congestion control code, so its performance impact should be
   minimal.

   Computing alpha per ack is also costly.  We propose alpha be a
   per-connection variable, computed whenever there is a drop and once
   per RTT otherwise.  More specifically, let cwnd_new be the new
   value of the congestion window after it is inflated or after a
   drop.  Update alpha only if cwnd_i/mss_i != cwnd_new_i/mss_i.
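
   As a non-normative illustration, the sketch below computes the
   scaled alpha using only 64-bit integer arithmetic, reusing the
   hypothetical struct mptcp_subflow from the sketch in Section 3.
   RTTs are assumed to be expressed in milliseconds, in line with the
   precision argument above; the guards against zero divisors and the
   fallback value are implementation choices, not part of the
   algorithm.

      /* Returns alpha scaled by alpha_scale:
       *
       *                                max_i (cwnd_i / rtt_i^2)
       *   alpha = alpha_scale * tot_cwnd * ---------------------
       *                                (sum_i cwnd_i / rtt_i)^2
       *
       * With alpha_scale = 512, cwnd in bytes and rtt in ms, the
       * intermediate products fit comfortably in 64 bits for
       * realistic window sizes (see the overflow discussion above). */
      uint64_t compute_alpha(const struct mptcp_subflow *sf, int n,
                             uint64_t tot_cwnd, uint64_t alpha_scale)
      {
          uint64_t max_num = 0;   /* max_i cwnd_i / rtt_i^2 */
          uint64_t sum_den = 0;   /* sum_i cwnd_i / rtt_i   */

          for (int i = 0; i < n; i++) {
              uint64_t rtt = sf[i].rtt_ms ? sf[i].rtt_ms : 1;
              uint64_t num = sf[i].cwnd / (rtt * rtt);

              if (num > max_num)
                  max_num = num;
              sum_den += sf[i].cwnd / rtt;
          }

          if (sum_den == 0)        /* degenerate case: no usable data; */
              return alpha_scale;  /* avoid dividing by zero           */

          return (alpha_scale * tot_cwnd * max_num) /
                 (sum_den * sum_den);
      }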

   In certain cases with small RTTs, computing alpha can still be
   expensive.  We observe that if RTTs were constant, it would be
   sufficient to compute alpha once per drop, as alpha does not change
   between drops (the insight here is that cwnd_i/cwnd_j = constant as
   long as both windows increase).  Experimental results show that
   even if round trip times are not constant, using the average round
   trip time instead of the instantaneous round trip time gives good
   precision for computing alpha.  Hence, it is possible to compute
   alpha only once per drop according to the formula above, by
   replacing rtt_i with rtt_avg_i.

   If using the average round trip time, rtt_avg_i will be computed by
   sampling rtt_i whenever the window can accommodate one more packet,
   i.e., when cwnd / mss < (cwnd+increase)/mss.  The samples are
   averaged once per sawtooth into rtt_avg_i.  This sampling ensures
   that there is no sampling bias for larger windows.

   Given tot_cwnd and alpha, the congestion control algorithm is run
   for each subflow independently, with similar complexity to the
   standard TCP increase code [RFC5681].

4.1.  Implementation Considerations when CWND is Expressed in Packets

   When the congestion control algorithm maintains cwnd in packets
   rather than bytes, the algorithms above must change to take into
   account path mss.

   To compute the increase when an ack is received, the implementation
   for multipath congestion control is a simple extension of the TCP
   New Reno code.  In TCP New Reno, cwnd_cnt is an additional state
   variable that tracks the number of segments acked since the last
   cwnd increment; cwnd is incremented only when cwnd_cnt > cwnd, and
   then cwnd_cnt is set to 0.

   In the multipath case, cwnd_cnt_i is maintained for each subflow as
   above, and cwnd_i is increased by 1 when cwnd_cnt_i >
   max(alpha_scale * tot_cwnd / alpha, cwnd_i).  An illustrative
   sketch of this rule is given at the end of this section.

   When computing alpha for packet-based stacks, the errors in
   computing the terms in the denominator are larger (this is because
   cwnd is much smaller and rtt may be comparatively large).  Let max
   be the index of the subflow used in the numerator.  To reduce
   errors, it is easiest to move rtt_max (once calculated) from the
   numerator to the denominator, obtaining the equivalent formula
   below.

                                                cwnd_max
      alpha = alpha_scale * tot_cwnd * -----------------------
                                       / rtt_max * cwnd_i \ 2
                                       | sum ------------- |
                                       \  i      rtt_i     /

   Note that the formula for computing alpha does not take into
   account path mss, and is the same for stacks that keep cwnd in
   bytes or packets.  With this formula, the algorithm for computing
   alpha will match the rate of TCP on the best path in B/s for
   byte-oriented stacks, and in packets/s for packet-based stacks.  In
   practice, mss rarely changes between paths so this shouldn't be a
   problem.

   However, it is simple to derive formulae allowing packet-based
   stacks to achieve byte-rate fairness (and vice versa) if needed.
   In particular, for packet-based stacks wanting byte-rate fairness,
   the formula above changes as follows: cwnd_max is replaced by
   cwnd_max * mss_max * mss_max, while cwnd_i is replaced with cwnd_i
   * mss_i.
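
   As a non-normative illustration, the sketch below applies the
   packet-based increase rule on each ack.  Here cwnd_i, cwnd_cnt_i
   and tot_cwnd are maintained in packets, alpha is kept scaled by
   alpha_scale as computed above, and the function and variable names
   are hypothetical.

      /* Congestion avoidance increase for subflow i in a packet-based
       * stack.  cwnd_i grows by one packet once cwnd_cnt_i exceeds
       * max(alpha_scale * tot_cwnd / alpha, cwnd_i), i.e. once enough
       * acks have accumulated for both the coupled bound and the New
       * Reno bound to allow a full-packet increase.                  */
      void packet_increase_on_ack(uint32_t *cwnd_i,
                                  uint32_t *cwnd_cnt_i,
                                  uint32_t tot_cwnd,
                                  uint64_t alpha, uint64_t alpha_scale)
      {
          uint64_t coupled = (alpha_scale * tot_cwnd) /
                             (alpha ? alpha : 1);
          uint64_t thresh  = coupled > *cwnd_i ? coupled : *cwnd_i;

          (*cwnd_cnt_i)++;
          if (*cwnd_cnt_i > thresh) {
              (*cwnd_i)++;
              *cwnd_cnt_i = 0;
          }
      }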

5.  Discussion

   To achieve perfect resource pooling, one must couple both the
   increase and decrease of congestion windows across subflows, as in
   [KELLY].  Yet this tends to exhibit "flappiness": when the paths
   have similar levels of congestion, the congestion controller will
   tend to allocate all the window to one random subflow, and allocate
   zero window to the other subflows.  The controller will perform
   random flips between these stable points.  This doesn't seem
   desirable in general, and is particularly bad when the achieved
   rates depend on the RTT (as in the current Internet): in such a
   case, the resulting rate will fluctuate unpredictably depending on
   which state the controller is in, hence violating Goal 1.

   By coupling only the increases, our proposal removes flappiness but
   also reduces the extent of resource pooling the protocol achieves.
   The algorithm will allocate window to the subflows such that
   p_i * cwnd_i = constant, for all i.  Thus, when the loss rates of
   the subflows are equal, each subflow will get an equal window,
   removing flappiness.  When the loss rates differ, progressively
   more window will be allocated to the flow with the lower loss rate.
   In contrast, perfect resource pooling requires that all the window
   should be allocated to the path with the lowest loss rate.

6.  Security Considerations

   None.

   Detailed security analysis for the Multipath TCP protocol itself is
   included in [I-D.ford-mptcp-multiaddressed] and [REF].

7.  Acknowledgements

   We thank Christoph Paasch for his suggestions for computing alpha
   in packet-based stacks.  The authors are supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.  The views expressed here are those of the
   author(s) only.  The European Commission is not liable for any use
   that may be made of the information in this document.

8.  IANA Considerations

   None.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ford-mptcp-multiaddressed]
              Ford, A., Raiciu, C., Handley, M., and S. Barre, "TCP
              Extensions for Multipath Operation with Multiple
              Addresses", draft-ford-mptcp-multiaddressed-01 (work in
              progress), July 2009.

   [KELLY]    Kelly, F. and T. Voice, "Stability of end-to-end
              algorithms for joint routing and rate control", ACM
              SIGCOMM CCR vol. 35 num. 2, pp. 5-12, 2005.

   [NSDI]     Wischik, D., Raiciu, C., Greenhalgh, A., and M. Handley,
              "Design, Implementation and Evaluation of Congestion
              Control for Multipath TCP", Usenix NSDI, March 2011.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC3465]  Allman, M., "TCP Congestion Control with Appropriate
              Byte Counting (ABC)", RFC 3465, February 2003.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [WISCHIK]  Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              Resource Pooling Principle", ACM SIGCOMM CCR vol. 38
              num. 5, pp. 47-52, October 2008.

Authors' Addresses

   Costin Raiciu
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk


   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk


   Damon Wischik
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: d.wischik@cs.ucl.ac.uk