2 Network Working Group S. Barre 3 Internet-Draft C. Paasch 4 Expires: September 8, 2011 O.
Bonaventure 5 UCLouvain, Belgium 6 March 7, 2011 8 MultiPath TCP - Guidelines for implementers 9 draft-barre-mptcp-impl-00 11 Abstract 13 Multipath TCP is a major extension to TCP that allows improving the 14 resource usage in the current Internet by transmitting data over 15 several TCP subflows, while still showing one single regular TCP 16 socket to the application. This document describes our experience in 17 writing a MultiPath TCP implementation in the Linux kernel and 18 discusses implementation guidelines that could be useful for other 19 developers who are planning to add MultiPath TCP to their networking 20 stack. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on September 8, 2011. 39 Copyright Notice 41 Copyright (c) 2011 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3 58 2. An architecture for Multipath transport . . . . . . . . . . . 5 59 2.1. MPTCP architecture . . . . . . . . . . . . . . . . . . . . 5 60 2.2. Structure of the Multipath Transport . . . . . . . . . . . 9 61 2.3. Structure of the Path Manager . . . . . . . . . . . . . . 9 62 3. MPTCP challenges for the OS . . . . . . . . . . . . . . . . . 12 63 3.1. Charging the application for its CPU cycles . . . . . . . 12 64 3.2. At connection/subflow establishment . . . . . . . . . . . 13 65 3.3. Subflow management . . . . . . . . . . . . . . . . . . . . 14 66 3.4. At the data sink . . . . . . . . . . . . . . . . . . . . . 14 67 3.4.1. Receive buffer tuning . . . . . . . . . . . . . . . . 15 68 3.4.2. Receive queue management . . . . . . . . . . . . . . . 15 69 3.4.3. Scheduling data ACKs . . . . . . . . . . . . . . . . . 16 70 3.5. At the data source . . . . . . . . . . . . . . . . . . . . 16 71 3.5.1. Send buffer tuning . . . . . . . . . . . . . . . . . . 17 72 3.5.2. Send queue management . . . . . . . . . . . . . . . . 17 73 3.5.3. Scheduling data . . . . . . . . . . . . . . . . . . . 20 74 3.5.3.1. The congestion controller . . . . . . . . . . . . 21 75 3.5.3.2. The Packet Scheduler . . . . . . . . . . . . . . . 22 76 3.6. At connection/subflow termination . . . . . . . . . . . . 23 77 4. Configuring the OS for MPTCP . . . . . . . . . 
. . . . . . . . 25 78 4.1. Source address based routing . . . . . . . . . . . . . . . 25 79 4.2. Buffer configuration . . . . . . . . . . . . . . . . . . . 27 80 5. Future work . . . . . . . . . . . . . . . . . . . . . . . . . 28 81 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 29 82 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 30 83 Appendix A. Design alternatives . . . . . . . . . . . . . . . . . 32 84 A.1. Another way to consider Path Management . . . . . . . . . 32 85 A.2. Implementing alternate Path Managers . . . . . . . . . . . 33 86 A.3. When to instantiate a new meta-socket ? . . . . . . . . . 34 87 A.4. Forcing more processing in user context . . . . . . . . . 34 88 A.5. Buffering data on a per-subflow basis . . . . . . . . . . 35 89 Appendix B. Ongoing discussions on implementation improvements . 39 90 B.1. Heuristics for subflow management . . . . . . . . . . . . 39 91 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 41 93 1. Introduction 95 The MultiPath TCP protocol [1] is a major TCP extension that allows 96 for simultaneous use of multiple paths, while being transparent to 97 the applications, fair to regular TCP flows [2] and deployable in the 98 current Internet. The MPTCP design goals and the protocol 99 architecture that allow reaching them are described in [3]. Besides 100 the protocol architecture, a number of non-trivial design choices 101 need to be made in order to extend an existing TCP implementation to 102 support MultiPath TCP. This document gathers a set of guidelines 103 that should help implementers writing an efficient and modular MPTCP 104 stack. The guidelines are expected to be applicable regardless of 105 the Operating System (although the MPTCP implementation described 106 here is done in Linux [4]). Another goal is to achieve the greatest 107 level of modularity without impacting efficiency, hence allowing 108 other multipath protocols to nicely co-exist in the same stack. In 109 order for the reader to clearly disambiguate "useful hints" from 110 "important requirements", we write the latter in their own 111 paragraphs, starting with the keyword "IMPORTANT". By important 112 requirements, we mean design options that, if not followed, would 113 lead to an under-performing MPTCP stack, maybe even slower than 114 regular TCP. 116 This draft presents implementation guidelines that are based on the 117 code which has been implemented in our MultiPath TCP aware Linux 118 kernel (the version covered here is 0.6) which is available from 119 http://inl.info.ucl.ac.be/mptcp. We also list configuration 120 guidelines that have proven to be useful in practice. In some cases, 121 we discuss some mechanisms that have not yet been implemented. These 122 mechanisms are clearly listed. During our work in implementing 123 MultiPath TCP, we evaluated other designs. Some of them are not used 124 anymore in our implementation. However, we explain in the appendix 125 the reason why these particular designs have not been considered 126 further. 128 This document is structured as follows. First we propose an 129 architecture that allows supporting MPTCP in a protocol stack 130 residing in an operating system. Then we consider a range of 131 problems that must be solved by an MPTCP stack (compared to a regular 132 TCP stack). In Section 4, we propose recommendations on how a system 133 administrator could correctly configure an MPTCP-enabled host. 
134 Finally, we discuss future work, in particular in the area of MPTCP 135 optimization. 137 1.1. Terminology 139 In this document we use the same terminology as in [3] and [1]. In 140 addition, we will use the following implementation-specific terms: 142 o Meta-socket: A socket structure used to reorder incoming data at 143 the connection level and schedule outgoing data to subflows. 145 o Master subsocket: The socket structure that is visible from the 146 application. If regular TCP is in use, this is the only active 147 socket structure. If MPTCP is used, this is the socket 148 corresponding to the first subflow. 150 o Slave subsocket: Any socket created by the kernel to provide an 151 additional subflow. Those sockets are not visible to the 152 application (unless a specific API [5] is used). The meta-socket, 153 master and slave subsocket are explained in more details in 154 Section 2.2. 156 o Endpoint id: Endpoint identifier. It is the tuple (saddr, sport, 157 daddr, dport) that identifies a particular subflow, hence a 158 particular subsocket. 160 o Fendpoint id: First Endpoint identifier. It is the endpoint 161 identifier of the Master subsocket. 163 o Connection id or token: It is a locally unique number, defined in 164 Section 2 of [1], that allows finding a connection during the 165 establishment of new subflows. 167 2. An architecture for Multipath transport 169 Section 4 of the MPTCP architecture document [3] describes the 170 functional decomposition of MPTCP. It lists four entities, namely 171 Path Management, Packet Scheduling, Subflow Interface and Congestion 172 Control. These entities can be further grouped based on the layer at 173 which they operate: 175 o Transport layer: This includes Packet Scheduling, Subflow 176 Interface and Congestion Control, and is grouped under the term 177 "Multipath Transport (MT)". From an implementation point of view, 178 they all will involve modifications to TCP. 180 o Any layer: Path Management. Path management can be done in the 181 transport layer, as is the case of the built-in path manager (PM) 182 described in [1]. That PM discovers paths through the exchange of 183 TCP options of type ADD_ADDR or the reception of a SYN on a new 184 address pair, and defines a path as an endpoint_id (saddr, sport, 185 daddr, dport). But, more generally, a PM could be any module able 186 to expose multiple paths to MPTCP, located either in kernel or 187 user space, and acting on any OSI layer (e.g. a bonding driver 188 that would expose its multiple links to the Multipath Transport). 190 Because of the fundamental independence of Path Management compared 191 to the three other entities, we draw a clear line between both, and 192 define a simple interface that allows MPTCP to benefit easily from 193 any appropriately interfaced multipath technology. In this document, 194 we stick to describing how the functional elements of MPTCP are 195 defined, using the built-in Path Manager described in [1], and we 196 leave for future separate documents the description of other path 197 managers. We describe in the first subsection the precise roles of 198 the Multipath Transport and the Path Manager. Then we detail how 199 they are interfaced with each other. 201 2.1. MPTCP architecture 203 Although, when using the built-in PM, MPTCP is fully contained in the 204 transport layer, it can still be organized as a Path Manager and a 205 Multipath Transport Layer as shown in Figure 1. 
The Path Manager announces to the MultiPath Transport what paths can be used, through path indices, for an MPTCP connection identified by the fendpoint_id (first endpoint id). The fendpoint_id is the tuple (saddr, sport, daddr, dport) seen by the application and uniquely identifies the MPTCP connection (an alternate way to identify the MPTCP connection being the conn_id, which is a token as described in Section 2 of [1]). The Path Manager maintains the mapping between the path_index and an endpoint_id. The endpoint_id is the tuple (saddr, sport, daddr, dport) that is to be used for the corresponding path index.

Note that the fendpoint_id itself represents a path and is thus a particular endpoint_id. By convention, the fendpoint_id is always represented as path index 1. As explained in [3], Section 5.6, it is not yet clear how an implementation should behave in the event of a failure in the first subflow. We expect, however, that the Master subsocket should be kept in use as an interface with the application, even if no data is transmitted anymore over it. This also allows the fendpoint_id to remain meaningful throughout the life of the connection. This behavior has yet to be tested and refined with Linux MPTCP.

Figure 1 shows an example sequence of MT-PM interactions happening at the beginning of an exchange. When the MT starts a new connection (through an application connect() or accept()), it can request the PM to be updated about possible alternate paths for this new connection. The PM can also spontaneously update the MT at any time (normally when the path set changes). This is step 1 in Figure 1. In the example, 4 paths can be used, hence 3 new ones. Based on the update, the MT can decide whether to establish new subflows, and how many of them. Here, the MT decides to establish one subflow only, and sends a request for an endpoint_id to the PM. This is step 2. In step 3, the answer is given: the endpoint_id to use for path index 2. Its source port is left unspecified, to allow the MT to ensure the uniqueness of the new endpoint_id thanks to the new_port() primitive (present in regular TCP as well). Note that messages 1, 2 and 3 need not be real messages and can be function calls instead (as is the case in Linux MPTCP).

                                Control plane
   +---------------------------------------------------------------+
   |                    Multipath Transport (MT)                   |
   +----------------------------------------------------|----------+
      ^                       |               ^         v
      |                       |               |  [Build new subsocket,
      | 1. For fendpt_id,     | 2. endpt_id   |   with the endpt_ids
      |    paths with indices |    for path   | 3. returned by the PM]
      |    1,2,3,4 can be     |    index 2 ?  |
      |    used.              |               |
      |                       v               |
   +---------------------------------------------------------------+
   |                       Path Manager (PM)                       |
   +---------------------------------------------------------------+
                         /               \
                  /---------------------------------------\
                  | mapping table:                        |
                  |      Subflow    <-->   endpoint_id    |
                  |     path index                        |
                  |                                       |
                  |          [see table below]            |
                  +---------------------------------------+

      Figure 1: Functional separation of MPTCP in the transport layer
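As a rough illustration of how messages 1, 2 and 3 of Figure 1 can be expressed as function calls, the following C sketch declares a minimal MT-PM interface. It is only a sketch under the assumptions of this section: all names (endpoint_id, mt_update_paths(), pm_get_endpoint()) are hypothetical and do not correspond to the actual Linux MPTCP symbols; IPv4 is used for brevity.

   /* Hypothetical sketch of the MT <-> PM interface (not the actual
    * Linux MPTCP code). */

   #include <stdint.h>
   #include <netinet/in.h>

   struct endpoint_id {                /* (saddr, sport, daddr, dport) */
           struct in_addr saddr;
           uint16_t       sport;       /* 0: let the MT pick a free port */
           struct in_addr daddr;
           uint16_t       dport;
   };

   /* Message 1: the PM tells the MT which path indices may be used
    * for the connection identified by 'token'. */
   void mt_update_paths(uint32_t token, const uint8_t *path_indices,
                        int nr_paths);

   /* Messages 2 and 3: the MT asks the PM for the endpoint_id to use
    * on a given path index; the PM fills in 'ep', leaving sport at 0
    * so that the MT can pick a locally unique port with new_port(). */
   int pm_get_endpoint(uint32_t token, uint8_t path_index,
                       struct endpoint_id *ep);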
The following options, described in [1], are managed by the Multipath Transport:

o MULTIPATH CAPABLE (MP_CAPABLE): Tells the peer that we support MPTCP and announces our local token.

o MP_JOIN/MP_AUTH: Initiates a new subflow (note that MP_AUTH is not yet part of our Linux implementation).

o DATA SEQUENCE NUMBER (DSN_MAP): Identifies the position of a set of bytes in the meta-flow.

o DATA_ACK: Acknowledges data at the connection level (subflow-level acknowledgments are contained in the normal TCP header).

o DATA FIN (DFIN): Terminates a connection.

o MP_PRIO: Asks the peer to revise the backup status of the subflow on which the option is sent. Although the option is sent by the Multipath Transport (because this allows using the TCP option space), it may be triggered by the Path Manager. This option is not yet supported by our MPTCP implementation.

o MP_FAIL: Checksum failed at connection level. Currently the Linux implementation does not implement the checksum in the DSN_MAP option, and hence does not implement the MP_FAIL option either.

The Path Manager applies a particular technology to give the MT the possibility to use several paths. The built-in MPTCP Path Manager uses multiple IPv4/v6 addresses as its means to influence the forwarding of packets through the Internet. When the MT starts a new connection, it chooses a token that will be used to identify the connection. This is necessary to allow future subflow-establishment SYNs (that is, those containing the MP_JOIN option) to be attached to the correct connection. An example mapping table is given hereafter:

   +---------+------------+----------------------+
   |  token  | path index |      Endpoint id     |
   +---------+------------+----------------------+
   | token_1 |     1      | (A1, pA1, B1, pB1)   |
   |         |            |                      |
   | token_1 |     2      | (A1, pA1', B2, pB2)  |
   |         |            |                      |
   | token_1 |     3      | (A2, pA2, B1, pB1)   |
   |         |            |                      |
   | token_1 |     4      | (A2, pA2', B2, pB2)  |
   |         |            |                      |
   | token_2 |     1      | (A1, pA3, B1, pB3)   |
   |         |            |                      |
   | token_2 |     2      | (A2, pA3', B1, pB3)  |
   +---------+------------+----------------------+

              Table 1: Example mapping table for built-in PM

Table 1 shows an example where two MPTCP connections are active. One is identified by token_1, the other by token_2. As per [1], the tokens must be locally unique. Since the endpoint identifier may change from one subflow to another, the attachment of incoming new subflows (identified by a SYN + MP_JOIN option) to the right connection is achieved thanks to the locally unique token. The following options (defined in [1]) are intended to be part of the built-in path manager:

o Add Address (ADD_ADDR): Announces a new address we own.

o Remove Address (REMOVE_ADDR): Withdraws a previously announced address.

Those options form the built-in MPTCP Path Manager, which is based on declaring IP addresses and carries its control information in TCP options. An implementation of Multipath TCP can use any Path Manager, but it must be able to fall back to the default PM in case the other end does not support the custom PM. Alternative Path Managers may be specified in separate documents in the future.

2.2. Structure of the Multipath Transport

The Multipath Transport handles three kinds of sockets. We define them here and use this notation throughout the entire document:

o Master subsocket: This is the first socket in use when a connection (TCP or MPTCP) starts. It is also the only one in use if we need to fall back to regular TCP. This socket is initiated by the application through the socket() system call.
Immediately 356 after a new master subsocket is created, MPTCP capability is 357 enabled by the creation of the meta-socket. 359 o Meta-socket: It holds the multipath control block, and acts as the 360 connection level socket. As data source, it holds the main send 361 buffer. As data sink, it holds the connection-level receive queue 362 and out-of-order queue (used for reordering). We represent it as 363 a normal (extended) socket structure in Linux MPTCP because this 364 allows reusing much of the existing TCP code with few 365 modifications. In particular, the regular socket structure 366 already holds pointers to SND.UNA, SND.NXT, SND.WND, RCV.NXT, 367 RCV.WND (as defined in [6]). It also holds all the necessary 368 queues for sending/receiving data. 370 o Slave subsocket: Any subflow created by MPTCP, in addition to the 371 first one (the master subsocket is always considered as a subflow 372 even though it may be in failed state at some point in the 373 communication). The slave subsockets are created by the kernel 374 (not visible from the application) The master subsocket and the 375 slave subsockets together form the pool of available subflows that 376 the MPTCP Packet Scheduler (called from the meta-socket) can use 377 to send packets. 379 2.3. Structure of the Path Manager 381 In contrast to the multipath transport, which is more complex and 382 divided in sub-entities (namely Packet Scheduler, Subflow Interface 383 and Congestion Control, see Section 2), the Path Manager just 384 maintains the mapping table and updates the Multipath Transport when 385 the mapping table changes. The mapping table has been described 386 above (Table 1). We detail in Table 2 the set of (event,action) 387 pairs that are implemented in the Linux MPTCP built-in path manager. 388 For reference, an earlier architecture for the Path Management is 389 discussed in Appendix A.1. Also, Appendix A.2 proposes a small 390 extension to this current architecture to allow supporting other path 391 managers. 393 +-------------------------+-----------------------------------------+ 394 | event | action | 395 +-------------------------+-----------------------------------------+ 396 | master_sk bound: This | Discovers the set of local addresses | 397 | event is triggered upon | and stores them in local_addr_table | 398 | either a bind(), | | 399 | connect(), or when a | | 400 | new server-side socket | | 401 | becomes established. | | 402 | | | 403 | ADD_ADDR option | Updates remote_addr_table | 404 | received or SYN+MP_JOIN | correspondingly | 405 | received on new address | | 406 | | | 407 | local/remote_addr_table | Updates mapping_table by adding any new | 408 | updated | address combinations, or removing the | 409 | | ones that have disappeared. Each | 410 | | address pair is given a path index. | 411 | | Once allocated to an address pair, a | 412 | | path index cannot be reallocated to | 413 | | another one, to ensure consistency of | 414 | | the mapping table. | 415 | | | 416 | Mapping_table updated | Sends notification to the Multipath | 417 | | Transport. The notification contains | 418 | | the new set of path indices that the MT | 419 | | is allowed to use. This is shown in | 420 | | Figure 1, msg 1. | 421 | | | 422 | Endpoint_id(path_index) | Retrieves the endpoint_ids for the | 423 | request received from | corresponding path index from the | 424 | MT (Figure 1, msg 2) | mapping table and returns them to the | 425 | | MT. One such request/response is | 426 | | illustrated in Figure 1, msg 3. 
Note | 427 | | that in that msg 3, the local port is | 428 | | set to zero. This is to let the | 429 | | operating system choose a unique local | 430 | | port for the new socket. | 431 +-------------------------+-----------------------------------------+ 433 Table 2: (event,action) pairs implemented in the built-in PM

3. MPTCP challenges for the OS

MPTCP is a major modification to the TCP stack. We have described above an architecture that separates Multipath Transport from Path Management. Path Management can be implemented rather simply. But Multipath Transport involves a set of new challenges that do not exist in regular TCP. We first describe how an MPTCP client or server can start a new connection, or a new subflow within a connection. Then we propose techniques (a concrete implementation of which is done in Linux MPTCP) to efficiently implement data reception (at the data sink) and data sending (at the data source).

3.1. Charging the application for its CPU cycles

As this document is about implementation, it is important not only to ensure that MPTCP is fast, but also that it is fair to other applications that share the same CPU. Otherwise one could have an extremely fast file transfer, while the rest of the system is just hanging. CPU fairness is ensured by the scheduler of the Operating System when things are implemented in user space. But in the kernel, we can choose to run code in "user context", that is, in a mode where each CPU cycle is charged to a particular application. Or we can (and must in some cases) run code in "interrupt context", that is, interrupting everything else until the task has finished. In Linux (and probably similarly in other systems), the arrival of a new packet on a NIC triggers a hardware interrupt, which in turn schedules a software interrupt that will pull the packet from the NIC and perform the initial processing. The challenge is to stop the processing of the incoming packet in the software interrupt as soon as it can be attached to a socket, and wake up the application. With TCP, an additional constraint is that incoming data should be acknowledged as soon as possible, which requires reordering. Van Jacobson has proposed a solution for this [7]: if an application is waiting on a recv() system call, incoming packets can be put into a special queue (called prequeue in Linux) and the application is woken up. Reordering and acknowledgement are then performed in user context. The execution path for outgoing packets is less critical from that point of view, because the vast majority of the processing can be done very easily in user context.

In this document, when discussing CPU fairness, we will use the following terms:

o User context: Execution environment that is under control of the OS scheduler. CPU cycles are charged to the associated application, which ensures fairness with other applications.

o Interrupt context: Execution environment that runs with higher priority than any process. Although it is impossible to completely avoid running code in interrupt context, it is important to minimize the amount of code running in such a context.

o VJ prequeues: This refers to Van Jacobson prequeues, as explained above [7].
3.2. At connection/subflow establishment

As described in [1], the establishment of an MPTCP connection is quite simple, being just a regular three-way exchange with additional options. As shown in Section 2.2, this is done in the master subsocket. Currently Linux MPTCP attaches a meta-socket to a socket as soon as it is created, that is, upon a socket() system call (client side), or when a server-side socket enters the ESTABLISHED state. An alternate solution is described in Appendix A.3.

An implementation can choose the best moment, maybe depending on the OS, to instantiate the meta-socket. However, if this meta-socket is needed to accept new subflows (as it is in Linux MPTCP), it should be attached at the latest when the MP_CAPABLE option is received. Otherwise incoming new subflow requests (SYN + MP_JOIN) may be lost, requiring retransmissions by the peer and delaying the subflow establishment.

The establishment of subflows, on the other hand, is trickier. The problem is that new SYNs (with the MP_JOIN option) must be accepted by a socket (the meta-socket in the proposed design) as if it were in LISTEN state, while its state is actually ESTABLISHED. The meta-socket has the following in common with a LISTEN socket:

o Temporary structure: Between the reception of the SYN and the final ACK, a mini-socket is used as a temporary structure.

o Queue of connection requests: The meta-socket, like a LISTEN socket, maintains a list of pending connection requests. There are two such lists. One contains mini-sockets, because the final ACK has not yet been received. The second list contains sockets in the ESTABLISHED state that have not yet been accepted. "Accepted" means, for regular TCP, returned to the application as a result of an accept() system call. For MPTCP it means that the new subflow has been integrated in the set of active subflows.

We can list the following differences from a normal LISTEN socket:

o Socket lookup for a SYN: When a SYN is received, the corresponding LISTEN socket is found by using the endpoint_id. This is not possible with MPTCP, since we can receive a SYN on any endpoint_id. Instead, the token must be used to retrieve the meta-socket to which the SYN must be attached. A new hashtable must be defined, with tokens as keys.

o Lookup for a connection request: In regular TCP, this lookup is quite similar to the previous one (in Linux at least). The 5-tuple is used, first to find the LISTEN socket, next to retrieve the corresponding mini-socket, stored in a private hashtable inside the LISTEN socket. With MPTCP, we cannot do that, because there is no way to retrieve the meta-socket from the final ACK. The 5-tuple can be anything, and the token is only present in the SYN. There is no token in the final ACK. Our Linux MPTCP implementation uses a global hashtable for pending connection requests, where the key is the 5-tuple of the connection request.

An implementation must carefully check the presence of the MP_JOIN option in incoming SYNs before performing the usual socket lookup. If it is present, only the token-based lookup must be done. If this lookup does not return a meta-socket, the SYN must be discarded. Failing to do that could lead to mistakenly attaching the incoming SYN to a LISTEN socket instead of attaching it to a meta-socket.
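The token-based lookup described above can be pictured with the following minimal C sketch. It is written in plain user-space C with hypothetical names (token_entry, meta_socket, mp_join_syn_lookup()), not the actual Linux MPTCP data structures.

   /* Sketch of a token-keyed hashtable used to demultiplex SYNs that
    * carry the MP_JOIN option (hypothetical names). */

   #include <stdint.h>
   #include <stddef.h>

   #define TOKEN_HASH_SIZE 256

   struct meta_socket;                      /* connection-level socket */

   struct token_entry {
           uint32_t            token;       /* locally unique token    */
           struct meta_socket *meta_sk;
           struct token_entry *next;        /* hash chaining           */
   };

   static struct token_entry *token_hash[TOKEN_HASH_SIZE];

   static struct meta_socket *lookup_by_token(uint32_t token)
   {
           struct token_entry *e;

           for (e = token_hash[token % TOKEN_HASH_SIZE]; e; e = e->next)
                   if (e->token == token)
                           return e->meta_sk;
           return NULL;
   }

   /* For a SYN carrying MP_JOIN, only the token lookup may be
    * performed.  A NULL result means the SYN must be discarded,
    * never handed to a LISTEN socket. */
   struct meta_socket *mp_join_syn_lookup(uint32_t token)
   {
           return lookup_by_token(token);
   }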
3.3. Subflow management

Further research is needed to define appropriate heuristics for deciding when to establish or terminate subflows. Initial thoughts are provided in Appendix B.1.

Currently, in a Linux MPTCP client, the Multipath Transport tries to open all subflows advertised by the Path Manager. On the other hand, the server only accepts new subflows, but does not try to establish new ones. The rationale for this is that the client is the connection initiator. New subflows are only established if the initiator requests them. This is subject to change in future releases of our MPTCP implementation.

3.4. At the data sink

There is a symmetry between the behavior of the data source and the data sink. Yet, the specific requirements are different. The data sink is described in this section while the data source is described in the next section.

3.4.1. Receive buffer tuning

The receive buffer required by MPTCP is larger than the sum of the buffers required by the individual subflows. The reason for this and proper values for the buffer are explained in [3], Section 5.3. Not following this could result in the MPTCP speed being capped at the bandwidth of the slowest subflow.

An interesting way to dynamically tune the receive buffer according to the bandwidth/delay product (BDP) of a path, for regular TCP, is described in [8] and implemented in recent Linux kernels. It uses the COPIED_SEQ sequence variable (sequence number of the next byte to copy to the app buffer) to count, every RTT, the number of bytes received during that RTT. This number of bytes is precisely the BDP. The accuracy of this technique is directly dependent on the accuracy of the RTT estimation. Unfortunately, the data sink does not have a reliable estimate of the SRTT. To solve this, [8] proposes two techniques:

1. Using the timestamp option (quite accurate).

2. Computing the time needed to receive one RCV.WND [6] worth of data. It is less precise and is used only to compute an upper bound on the required receive buffer.

As described in [1], Section 3.3.3, the MPTCP advertised receive window is shared by all subflows. Hence, no per-subflow information can be deduced from it, and the second technique from [8] cannot be used. [3] mentions that the allocated connection-level receive buffer should be 2*sum(BW_i)*RTT_max, where BW_i is the bandwidth seen by subflow i and RTT_max is the maximum RTT estimated among all the subflows. This is achieved in Linux MPTCP by slightly modifying the first tuning algorithm from [8] and disabling the second one. The modification consists in counting, on each subflow, every RTT_max, the number of bytes received during that time on this subflow. This gives the contribution of each subflow to the total receive buffer of the connection.
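The per-subflow accounting just described can be sketched as follows. This is a simplified user-space illustration with hypothetical names (subflow_rcv, rcvbuf_sample(), mptcp_rcvbuf()), not the actual Linux MPTCP code; time is assumed to be available in microseconds.

   /* Every RTT_max, record the bytes received on a subflow as its
    * contribution to the connection-level receive buffer. */

   #include <stdint.h>

   struct subflow_rcv {
           uint32_t copied_seq;        /* next byte to copy to the app */
           uint32_t last_copied_seq;   /* copied_seq at the last sample */
           uint64_t last_sample_us;    /* time of the last sample      */
           uint32_t contribution;      /* bytes received in one RTT_max */
   };

   /* Called when data is copied to the application on this subflow.
    * rtt_max_us is the largest smoothed RTT among all subflows. */
   void rcvbuf_sample(struct subflow_rcv *sf, uint64_t now_us,
                      uint64_t rtt_max_us)
   {
           if (now_us - sf->last_sample_us < rtt_max_us)
                   return;                  /* sample once per RTT_max */

           sf->contribution    = sf->copied_seq - sf->last_copied_seq;
           sf->last_copied_seq = sf->copied_seq;
           sf->last_sample_us  = now_us;
   }

   /* Connection-level buffer: twice the sum of the contributions,
    * approximating 2 * sum(BW_i) * RTT_max. */
   uint32_t mptcp_rcvbuf(const struct subflow_rcv *sf, int nr_subflows)
   {
           uint32_t total = 0;
           int i;

           for (i = 0; i < nr_subflows; i++)
                   total += sf[i].contribution;
           return 2 * total;
   }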
3.4.2. Receive queue management

As advised in [1], Section 3.3.1, "subflow-level processing should be undertaken separately from that at connection-level". This also has the side-effect of allowing much code reuse from the regular TCP stack. A regular TCP stack (in Linux at least) maintains a receive queue (for storing incoming segments until the application asks for them) and an out-of-order queue (to allow reordering). In Linux MPTCP, the subflow-level receive queue is not used. Incoming segments are reordered at the subflow level, just as if they were plain TCP data. But once the data is in order at the subflow level, it can be immediately handed to MPTCP (see Figure 7 of [3]) for connection-level reordering. The role of the subflow-level receive queue is now taken by the MPTCP-level receive queue. In order to maximize the CPU cycles spent in user context (see Section 3.1), VJ prequeues can be used just as in regular TCP (they are not yet supported in Linux MPTCP, though).

An alternate design, where the subflow-level receive queue is kept active and the MPTCP receive queue is not used, is discussed in Appendix A.4.

3.4.3. Scheduling data ACKs

As specified in [1], Section 3.3.2, data ACKs not only help the sender in having a consistent view of what data has been correctly received at the connection level. They are also used as the left edge of the advertised receive window.

In regular TCP, if a receive buffer becomes full, the receiver announces a zero receive window. When finally some bytes are given to the application, freeing space in the receive buffer, a duplicate ACK is sent to act as a window update, so that the sender knows it can transmit again. Likewise, when the MPTCP shared receive buffer becomes full, a zero window is advertised. When some bytes are delivered to the application, a duplicate DATA_ACK must be sent to act as a window update. Such an important DATA_ACK should be sent on all subflows, to maximize the probability that at least one of them reaches the peer. If, however, all DATA_ACKs are lost, there is no other option than relying on the window probes periodically sent by the data source, as in regular TCP.

In theory a DATA_ACK can be sent on any subflow, or even on all subflows simultaneously. As of version 0.5, Linux MPTCP simply adds the DATA_ACK option to any outgoing segment (regardless of whether it is data or a pure ACK). There is thus no particular DATA_ACK scheduling policy. The only exception is for a window update that follows a zero window. In this case, the behavior is as described in the previous paragraph.

3.5. At the data source

In this section we mirror the topics of the previous section, in the case of a data sender. The sender does not have the same view of the communication, because one has information that the other can only estimate. Also, the data source sends data and receives acknowledgements, while the data sink does the reverse. This results in a different set of problems to be dealt with by the data source.

3.5.1. Send buffer tuning

As explained in [3], end of Section 5.3, the send buffer should have the same size as the receive buffer. At the sender, we don't have the RTT estimation problem described in Section 3.4.1, because we can reuse the built-in TCP SRTT (smoothed RTT). Moreover, the sender has the congestion window, which is itself an estimate of the BDP, and is used in Linux to tune the send buffer of regular TCP. Unfortunately, we cannot use the congestion window with MPTCP, because the buffer equation does not involve the product BW_i*delay_i for the subflows (which is what the congestion window estimates), but it involves BW_i*delay_max, where delay_max is the maximum observed delay across all subflows. An obvious way to compute the contribution of each subflow to the receive buffer would be: 2*(cwnd_i/SRTT_i)*SRTT_max. However, some care is needed because of the variability of the SRTT (measurements show that, even smoothed, the SRTT is not quite stable). Currently Linux MPTCP estimates the bandwidth periodically by checking the sequence number progress. This however introduces new mechanisms in the kernel that could probably be avoided. Future experience will tell what is appropriate.
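The "obvious" per-subflow computation mentioned above, before the SRTT-variability caveat is addressed, can be illustrated with the following C sketch. The structure and function names are hypothetical; this is not the actual Linux MPTCP code.

   /* 2 * (cwnd_i / SRTT_i) * SRTT_max, i.e. 2 * BW_i * delay_max,
    * as a rough per-subflow buffer contribution (hypothetical names). */

   #include <stdint.h>

   struct subflow_snd {
           uint32_t cwnd_bytes;    /* congestion window, in bytes   */
           uint32_t srtt_us;       /* smoothed RTT, in microseconds */
   };

   uint64_t sndbuf_contribution(const struct subflow_snd *sf,
                                uint32_t srtt_max_us)
   {
           if (sf->srtt_us == 0)
                   return 0;                     /* no RTT sample yet */
           return 2ULL * sf->cwnd_bytes * srtt_max_us / sf->srtt_us;
   }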
3.5.2. Send queue management

As MultiPath TCP involves the use of several TCP subflows, a scheduler must be added to decide where to send each byte of data. Two possible places for the scheduler have been evaluated for Linux MPTCP. One option is to schedule data as soon as it arrives from the application buffer. This option, consisting in _pushing_ data to subflows as soon as it is available, was implemented in older versions of Linux MPTCP and is now abandoned. We keep a description of it (and why it has been abandoned) in Appendix A.5. Another option is to store all data centrally in the Multipath Transport, inside a shared send buffer (see Figure 2). Scheduling is then done at transmission time, whenever any subflow becomes ready to send more data (usually due to acknowledgements having opened space in the congestion window). In that scenario, the subflows _pull_ segments from the shared send queue whenever they are ready. Note that several subflows can become ready simultaneously, if an acknowledgement advertises a new receive window that opens more space in the shared send window. For that reason, when a subflow pulls data, the Packet Scheduler is run and other subflows may be fed by the Packet Scheduler at the same time.

                           Application
                                |
                                v
                              | * |
   Next segment to send (A) ->| * |
                              |---|   <- Shared send queue
   Sent, but not DATA-acked(B)->|_*_|
                                |
                                v
                         Packet Scheduler
                            /       \
                           /         \
                          |           |
                          v           v
   Sent, but not acked(B) -> |_|     |_| <- Subflow level congestion
                              |       |     window
                              v       v
                             NIC     NIC

                    Figure 2: Send queue configuration

This approach, similar to the one proposed in [9], presents several advantages:

o Each subflow can easily fill its pipe (as long as there is data to pull from the shared send buffer, and the scheduler is not applying a policy that restricts the subflow).

o If a subflow fails, it will no longer receive acknowledgements, and hence will naturally stop pulling from the shared send buffer. This removes the need for an explicit "failed state", to ensure that a failed subflow does not receive data (as opposed to e.g. SCTP-CMT, which needs an explicit marking of failed subflows by design, because it uses a single sequence number space [10]).

o Similarly, when a failed subflow becomes active again, the pending segments of its congestion window are finally acknowledged, allowing it to pull again from the shared send buffer. Note that in such a case, the acknowledged data is normally just dropped by the receiver, because the corresponding segments have been retransmitted on another subflow during the failure time.

Despite the adoption of that approach in Linux MPTCP, there are still two drawbacks:

o There is one single queue, in the Multipath Transport, from which all subflows pull segments.
In Linux, queue processing is optimized for handling segments, not bytes. This implies that the shared send queue must contain pre-built segments, hence requiring the _same_ MSS to be used for all subflows. We note however that today, the most commonly negotiated MSS is around 1380 bytes [4], so this approach sounds reasonable. Should this requirement become too constraining in the future, a more flexible approach could be devised (e.g., supporting a few Maximum Segment Sizes).

o Because the subflows pull data whenever they get new free space in their congestion window, the Packet Scheduler must run at that time. But that time most often corresponds to the reception of an acknowledgement, which happens in interrupt context (see Section 3.1). This is both unfair to other system processes, and slightly inefficient for high-speed communications. The problem is that the packet scheduler performs more operations than the usual "copy packet to NIC". One way to solve this problem would be to have a small subflow-specific send queue, which would actually lead to a hybrid architecture between the pull approach (described here) and the push approach (described in Appendix A.5). Doing that would require solving non-trivial problems, though, and requires further study.

As shown in Figure 2, a segment first enters the shared send queue, then, when reaching the bottom of that queue, it is pulled by some subflow. But to support failures, we need to be able to move segments from one subflow to another, so that the failure is invisible from the application. In Linux MPTCP, the segment data is kept in the shared send queue (B portion of the queue). When a subflow pulls a segment, it actually only copies the control structure (struct sk_buff), which Linux calls packet cloning, and increments its reference count. The following event/action table summarizes these operations:

   +-----------------+-------------------------------------------------+
   | event           | action                                          |
   +-----------------+-------------------------------------------------+
   | Segment         | Remove references to the segment from the       |
   | acknowledged at | subflow-level queue                             |
   | subflow level   |                                                 |
   |                 |                                                 |
   | Segment         | Remove references to the segment from the       |
   | acknowledged at | connection-level queue                          |
   | connection      |                                                 |
   | level           |                                                 |
   |                 |                                                 |
   | Timeout         | Push the segment to the best running subflow    |
   | (subflow-level) | (according to the Packet Scheduler).  If no     |
   |                 | subflow is available, push it to a temporary    |
   |                 | retransmit queue (not represented in Figure 2)  |
   |                 | for future pulling by an available subflow.     |
   |                 | The retransmit queue is parallel to the         |
   |                 | connection-level queue and is read with higher  |
   |                 | priority.                                       |
   |                 |                                                 |
   | Ready to put    | If the retransmit queue is not empty, first     |
   | new data on the | pull from there.  Otherwise, take new           |
   | wire (normally  | segment(s) from the connection-level send queue |
   | triggered by an | (A portion).  The pulling operation is a bit    |
   | incoming ack)   | special in that it can result in sending a      |
   |                 | segment over a different subflow than the one   |
   |                 | which initiated the pull.  This is because the  |
   |                 | Packet Scheduler is run as part of the pull,    |
   |                 | which can result in selecting any subflow.  In  |
   |                 | most cases, though, the subflow which           |
   |                 | originated the pull will get fresh data, given  |
   |                 | it has space for that in the congestion window. |
   |                 | Note that the subflows have no A portion in     |
   |                 | Figure 2, because they immediately send the     |
   |                 | data they pull.                                 |
   +-----------------+-------------------------------------------------+

        Table 3: (event,action) pairs implemented in the Multipath
                       Transport queue management
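The last row of Table 3, the pull operation, can be summarized in the following C sketch. All types and helper functions here are hypothetical placeholders, not the actual Linux MPTCP code; they only illustrate the control flow described above.

   /* Sketch of the pull path (hypothetical names).  Called whenever
    * a subflow gains free space in its congestion window, typically
    * from the ACK processing path. */

   struct segment;
   struct subflow;

   extern struct segment *peek_retransmit_queue(void);
   extern struct segment *peek_send_queue(void);      /* A portion */
   extern struct subflow *packet_scheduler(const struct segment *seg);
   extern int             cwnd_has_room(const struct subflow *sf);
   extern void            send_clone(struct subflow *sf,
                                     struct segment *seg);
   extern void            advance_queue(struct segment *seg);

   void subflow_pull(void)
   {
           struct segment *seg;
           struct subflow *target;

           for (;;) {
                   /* The retransmit queue is read with higher priority. */
                   seg = peek_retransmit_queue();
                   if (seg == NULL)
                           seg = peek_send_queue();
                   if (seg == NULL)
                           return;             /* nothing left to send */

                   /* The scheduler may select any subflow, not
                    * necessarily the one that initiated the pull. */
                   target = packet_scheduler(seg);
                   if (target == NULL || !cwnd_has_room(target))
                           return;    /* no subflow can send right now */

                   /* Only a clone (a reference) is handed to the
                    * subflow; the segment stays in the shared queue
                    * (B portion) until it is DATA_ACKed. */
                   send_clone(target, seg);
                   advance_queue(seg);
           }
   }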
IMPORTANT: A subflow can be stopped from transmitting by the congestion window, but also by the send window (that is, the receive window announced by the peer). Given that the receive window has a connection-level meaning, a DATA_ACK arriving on one subflow could unblock another subflow. Implementations should be aware of this to avoid stalling part of the subflows in such situations. In the case of Linux MPTCP, which follows the above architecture, this is ensured by running the Packet Scheduler at each pull operation. This is not completely optimal, though, and may be revised when more experience is gained.

3.5.3. Scheduling data

As several subflows may be used to transmit data, MPTCP must select a subflow on which to send each segment. First, we need to know which subflows are available for sending data. The mechanism that controls this is the congestion controller, which maintains a per-subflow congestion window. The aim of a Multipath congestion controller is to move data away from congested links, and ensure fairness when there is a shared bottleneck. The handling of the congestion window is explained in Section 3.5.3.1. Given a set of available subflows (according to the congestion window), one of these has to be selected by the Packet Scheduler. The role of the Packet Scheduler is to implement a particular policy, as will be explained in Section 3.5.3.2.

3.5.3.1. The congestion controller

The Coupled Congestion Control provided in Linux MPTCP implements the algorithm defined in [2]. Operating System kernels (Linux at least) do not support floating-point numbers for efficiency reasons. [2] makes extensive use of them, which must be worked around. Linux MPTCP solves that by performing fixed-point operations using a minimum number of fractions, and by scaling when divisions are necessary.

Linux already includes a work-around for floating-point operations in the Reno congestion avoidance implementation. Upon reception of an ack, the congestion window (counted in segments, not in bytes as [2] proposes) should be updated as cwnd+=1/cwnd. Instead, Linux increments the separate variable snd_cwnd_cnt, until snd_cwnd_cnt>=cwnd. When this happens, snd_cwnd_cnt is reset, and cwnd is incremented. Linux MPTCP reuses this to update the window in the CCC (Coupled Congestion Control) congestion avoidance phase: snd_cwnd_cnt is incremented as previously explained, and cwnd is incremented when snd_cwnd_cnt >= max(tot_cwnd / alpha, cwnd) (see [2]). Note that the bytes_acked variable, present in [2], is not included here because Linux MPTCP does not currently support ABC [11], but instead considers acknowledgements in MSS units. For ABC with Reno, Linux uses the bytes_acked variable instead of snd_cwnd_cnt. For Reno, cwnd is incremented by one if bytes_acked>=cwnd*MSS.
Hence, in the case of a CCC with ABC, one would increment cwnd when bytes_acked >= max(tot_cwnd*MSS / alpha, cwnd*MSS).

Unfortunately, the alpha parameter mentioned above involves many fractions. The current implementation of MPTCP uses a rewritten version of the alpha formula from [2]:

                             cwnd_max * scale_num
   alpha = tot_cwnd * ----------------------------------
                      / rtt_max * cwnd_i * scale_den \ 2
                      | sum  ---------------------------- |
                      \  i            rtt_i              /

This computation assumes that the MSS is shared by all subflows, which is true under the architecture described in Section 3.5.2 but implies that implementations choosing to support several MSS values cannot use the above simplified equation. The variables cwnd_max and rtt_max in the above equation are NOT, respectively, the maximum congestion window and RTT across all subflows. Instead, they are the values of the subflow i for which cwnd_i / rtt_i^2 is maximum. This corresponds to the numerator of the equation provided in [2].

scale_num and scale_den have to be selected in such a way that scale_num > scale_den^2. A good choice is to use scale_num=2^32 (using 64-bit arithmetic) and scale_den=2^10. In that case the final alpha value is scaled by 2^12, which gives a reasonable precision. Due to the scaling, it is also necessary to scale later in the formula that decides whether an increase of the congestion window is necessary or not: snd_cwnd_cnt >= max((tot_cwnd<<12) / alpha, cwnd).
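A minimal user-space sketch of this fixed-point computation is given below. It assumes hypothetical structure and function names (sf_cc, compute_alpha(), should_increase_cwnd()) and moderate window/RTT values so that the intermediate products fit in 64 bits; it is an illustration of the scaled formula above, not the actual Linux MPTCP code.

   /* Fixed-point alpha with scale_num = 2^32 and scale_den = 2^10,
    * so the returned alpha is scaled by 2^12 (hypothetical names). */

   #include <stdint.h>

   struct sf_cc {
           uint32_t cwnd;      /* congestion window, in segments     */
           uint32_t rtt;       /* smoothed RTT, in a common unit     */
   };

   #define ALPHA_SCALE_NUM (1ULL << 32)
   #define ALPHA_SCALE_DEN (1ULL << 10)

   uint64_t compute_alpha(const struct sf_cc *sf, int n)
   {
           uint64_t tot_cwnd = 0, denom = 0;
           int i, best = 0;

           /* Pick the subflow maximizing cwnd_i / rtt_i^2, using a
            * cross-multiplied comparison to avoid floating point. */
           for (i = 1; i < n; i++)
                   if ((uint64_t)sf[i].cwnd * sf[best].rtt * sf[best].rtt >
                       (uint64_t)sf[best].cwnd * sf[i].rtt * sf[i].rtt)
                           best = i;

           for (i = 0; i < n; i++) {
                   tot_cwnd += sf[i].cwnd;
                   /* rtt_max * cwnd_i * scale_den / rtt_i */
                   denom += (uint64_t)sf[best].rtt * sf[i].cwnd *
                            ALPHA_SCALE_DEN / sf[i].rtt;
           }
           if (denom == 0)
                   return 0;

           return tot_cwnd * sf[best].cwnd * ALPHA_SCALE_NUM /
                  (denom * denom);
   }

   /* Congestion avoidance check, with the matching 2^12 scaling:
    * snd_cwnd_cnt >= max((tot_cwnd << 12) / alpha, cwnd). */
   int should_increase_cwnd(uint64_t snd_cwnd_cnt, uint64_t tot_cwnd,
                            uint64_t alpha, uint32_t cwnd)
   {
           uint64_t threshold;

           if (alpha == 0)
                   return 0;
           threshold = (tot_cwnd << 12) / alpha;
           if (threshold < cwnd)
                   threshold = cwnd;
           return snd_cwnd_cnt >= threshold;
   }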
3.5.3.2. The Packet Scheduler

Whenever the Congestion Controller (described above) allows new data for at least one subflow, the Packet Scheduler is run. When only one subflow is available, the Packet Scheduler just decides which packet to pick from the A section of the shared send buffer (see Figure 2). Currently Linux MPTCP picks the bottom-most segment. If more than one subflow is available, there are three decisions to take:

o Which of the subflows to feed with fresh data: As the only Packet Scheduler currently supported in Linux MPTCP aims at filling all pipes, it always feeds data to all subflows as long as there is data to send.

o In what order to feed the selected subflows: When several subflows become available simultaneously, they are fed in order of time-distance to the client. We define the time-distance as the time needed for the packet to reach the peer if given to a particular subflow. This time depends on the RTT, bandwidth and queue size (in bytes), as follows: time_distance_i = queue_size_i/bw_i+RTT_i. Given that, with the architecture described in Section 3.5.2, the subflow-specific queue size cannot exceed a congestion window, the time_distance becomes time_distance_i~=RTT_i. This scheduling policy favors fast subflows for application-limited communications (where all subflows need not be used). However, for network-limited communications, this scheduling policy has little effect because all subflows will be used at some point, even the slow ones, to try to minimize the connection-level completion time.

o How much data to allocate to a single subflow: This question concerns the granularity of the allocation. Using big allocation units allows for better support of TCP Segmentation Offload (TSO). TSO allows the system to aggregate several times the MSS into one single segment, sparing memory and CPU cycles, by leaving the fragmentation task to the NIC. However, this is only possible if the large single segment is made of contiguous data, at the subflow level and the connection level (see also the important note below).

IMPORTANT: When scheduling data to subflows, an implementation must be careful that if two segments are contiguous at the subflow level, but non-contiguous at the connection level, they cannot be aggregated into one. As Linux (and probably other systems) merges segments when it is under memory pressure, it could easily decide to merge non-contiguous MPTCP segments, simply because they look contiguous from the subflow viewpoint. This must be avoided, because the DATA_SEQ mapping option would lose its meaning in such a case, leading to all possible kinds of misbehaviors.

3.6. At connection/subflow termination

In Linux MPTCP, subflows are terminated only when the whole connection terminates, because the heuristic for terminating subflows (without closing the connection) is not yet mature, as explained in Section 3.3.

At connection termination, an implementation must ensure that all subflows plus the meta-socket are cleanly removed. The obvious choice of propagating the close() system call on all subflows does not work. The problem is that a close() on a subflow appends a FIN at the end of the send queue. If we transpose this to the meta-socket, we would append a DATA_FIN on the shared send queue (see Section 3.5.2). That operation results in the shared send queue not accepting any more data from the application, which is correct. It also results in the subflow-specific queues not accepting any more data from the shared send queue. The shared send queue may however still be full of segments, which will never be sent because all gates are closed.

IMPORTANT: Upon a close() system call, an implementation must refrain from sending a FIN on all subflows, unless the implementation uses an architecture with no connection-level send queue (like the one described in Appendix A.5). Even in that case, it makes sense to keep all subflows open until the last byte is sent, to allow retransmission on any path, should any one of them fail.

Currently, upon a close() system call, Linux MPTCP appends a DATA_FIN to the connection-level send queue. Only when that DATA_FIN reaches the bottom of the send queue is the regular FIN sent on all subflows.

DISCUSSION: In the Linux MPTCP behavior described above, a connection could still stall near its end if one path fails while transmitting its last congestion window of data (because the maximum size of the subflow-specific send queue is cwnd). This can be avoided by waiting a bit longer before triggering the subflow-level FIN: instead of sending the FIN together with the DATA_FIN, send the DATA_FIN alone and wait for the corresponding DATA_ACK to trigger a FIN on all subflows. This however increases the duration of the overall connection termination by one RTT.
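The close() handling described above can be summarized with the following C sketch. The names (meta_sock, mptcp_close(), subflow_send_fin(), ...) are hypothetical helpers introduced for illustration; this is not the actual Linux MPTCP code.

   /* Sketch of connection termination (hypothetical names). */

   struct meta_sock;
   struct subflow;

   extern void queue_data_fin(struct meta_sock *msk); /* queue tail   */
   extern int  send_queue_only_data_fin(const struct meta_sock *msk);
   extern int  subflow_count(const struct meta_sock *msk);
   extern struct subflow *subflow_at(struct meta_sock *msk, int i);
   extern void subflow_send_fin(struct subflow *sf);

   /* Application close(): do NOT send a FIN on the subflows yet,
    * otherwise segments still sitting in the shared send queue
    * could never be transmitted. */
   void mptcp_close(struct meta_sock *msk)
   {
           queue_data_fin(msk);          /* DATA_FIN at the queue tail */
   }

   /* Called whenever the shared send queue drains: once the DATA_FIN
    * has reached the bottom of the queue, the regular FIN can be sent
    * on every subflow. */
   void mptcp_send_queue_drained(struct meta_sock *msk)
   {
           int i;

           if (!send_queue_only_data_fin(msk))
                   return;

           for (i = 0; i < subflow_count(msk); i++)
                   subflow_send_fin(subflow_at(msk, i));
   }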
4. Configuring the OS for MPTCP

Previous sections concentrated on implementation. In this section, we gather guidelines that help get the full potential from MPTCP through appropriate system configuration. Currently those guidelines apply especially to Linux, but the principles can be applied to other systems.

4.1. Source address based routing

As already pointed out by [12], the default behavior of most operating systems is not appropriate for the use of multiple interfaces. Most operating systems are typically configured to use at most one IP address at a time. It is more and more common to maintain several links in the up state (e.g. using the wired interface as the main link, but maintaining a ready-to-use wireless link in the background, to facilitate fallback when the wired link fails). But MPTCP is not about that. MPTCP is about _simultaneously_ using several interfaces (when available). It is expected that one of the most common MPTCP configurations will be through two or more NICs, each being assigned a different address. Another possible configuration would be to assign several IP addresses to the same interface, in which case the path diverges later in the network, based on the particular address that is used in the packet.

Usually an operating system has a single default route, with a single source IP address. If the host has several IP addresses and we want to do MultiPath TCP, it is necessary to configure source address based routing. This means that, based on the source address selected by the MultiPath TCP module in the operating system, the routing decision is based on a different routing table. Each of these routing tables defines a default route to the Internet. This is different from defining several default routes in the same routing table (which is also supported in Linux), because in that case only the first one is used. Any additional default route is considered as a fallback route, used only in case the main one fails.

It is easier to understand the necessary configuration by means of an example. Let a host have two interfaces, I1 and I2, both connected to the public Internet and assigned addresses A1 and A2, respectively. Such a host needs 3 routing tables. One of them is the classical default routing table, present in all systems. The default routing table is used to find a route based on the destination address only, when a segment is issued with the undetermined source address. The undetermined source address is typically used by applications that initiate a TCP connect() system call, specifying the destination address but letting the system choose the source address. In that case, after the default routing table has been consulted, an address is assigned to the socket by the system by applying [13]. The additional routing tables are used when the source address is specified. If the source address has no impact on the route that should be chosen, then the default routing table is sufficient. But this is a particular case (e.g., a host connected to one network only, but using two addresses to exploit ECMP paths later in the network). In most cases, a source address is attached to a specific interface, or at least a specific gateway. Both of those cases require defining a separate routing table, one per (gateway, outgoing interface) pair. To select the proper routing table based on the source address, an additional indirection level must be configured. It is called "policy routing" in Linux and is illustrated at the bottom of Figure 3.
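As an illustration only, routing tables and policy rules similar to those shown in Figure 3 below could be created on Linux with the iproute2 tool as sketched here; the interface names, addresses, gateways and subnets are placeholders matching Figure 3 and must be adapted to the actual setup.

   # Per-source routing tables (placeholder values).
   ip route add default via <Gateway-IP1> dev I1 table 1
   ip route add <Gateway1-Subnet> dev I1 scope link src A1 table 1

   ip route add default via <Gateway-IP2> dev I2 table 2
   ip route add <Gateway2-Subnet> dev I2 scope link src A2 table 2

   # Policy rules selecting a table based on the source address.
   ip rule add from A1 table 1
   ip rule add from A2 table 2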
4.2. Buffer configuration

Section 5.3 of [3] describes in detail the new, higher buffer requirements of MPTCP. Section 3.4.1 and Section 3.5.1 describe how the MPTCP buffers can be tuned dynamically. However, it is important to note that even the best tuning is capped by a maximum configured at the system level. When using MultiPath TCP, the maximum receive and send buffers should be configured to a higher value than for regular TCP. There is no universal guideline on what value is best. Instead, the most appropriate action for an administrator is probably to roughly estimate the maximum bandwidth and delay that can be observed on a particular connectivity setup, and to apply the equation from Section 5.3 of [3] to find a reasonable trade-off. This exercise could lead an administrator to disable MPTCP on some interfaces, because doing so consumes less memory while still achieving reasonable performance.
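As a back-of-the-envelope illustration of that estimation, and assuming the rule of [3], Section 5.3, as we recall it (a connection-level receive buffer of twice the sum of the subflow bandwidths multiplied by the largest RTT), the small program below computes the bound for a host with a 100 Mbps / 20 ms link and a 20 Mbps / 150 ms link. The figures are arbitrary examples:

   /* Buffer estimate assuming the rule from [3], Section 5.3:
    * buffer >= 2 * (sum of subflow bandwidths) * max RTT.
    * The interface figures below are arbitrary examples. */
   #include <stdio.h>

   int main(void)
   {
       double bw_bytes_per_s[] = { 12.5e6, 2.5e6 }; /* 100 Mbps, 20 Mbps */
       double rtt_s[]          = { 0.020, 0.150 };  /* 20 ms, 150 ms     */
       int n = 2;

       double bw_sum = 0, rtt_max = 0;
       for (int i = 0; i < n; i++) {
           bw_sum += bw_bytes_per_s[i];
           if (rtt_s[i] > rtt_max)
               rtt_max = rtt_s[i];
       }

       /* Prints 4500000 bytes (4.5 MB) for this example. */
       printf("suggested maximum buffer: %.0f bytes\n",
              2 * bw_sum * rtt_max);
       return 0;
   }

If such a value is considered too large, disabling MPTCP on the high-delay interface, as suggested above, brings the requirement back to that of regular TCP on the remaining path.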
5. Future work

A lot of work has yet to be done, and there is much room for improvement. In this section we assemble a list of future improvements that would complete these guidelines.

o  Today's host processors have more and more CPU cores. Given that Multipath TCP tries to exploit another form of parallelism, there is a challenge in finding how the two can work together optimally. An important question is how to work with hardware that behaves intelligently with TCP (e.g. flow-to-core affinity). This problem is discussed in more detail in [14].

o  An evaluation of Linux MPTCP exists [4], but many optimizations are still possible and should be evaluated. Examples are VJ prequeues (Section 3.1), an MPTCP fast path (that is, a translation of the existing TCP fast path to MPTCP) or DMA support. VJ prequeues, described in Section 3.1, are intended to defer segment processing until the application is awoken, when possible.

o  Currently, support for TCP Segmentation Offload remains a challenge because it interacts with the Maximum Segment Size. Linux MPTCP currently works with a single MSS across all subflows (see Section 3.5.2). Adding TSO support to MPTCP is certainly possible, but requires further work (Section 3.5.2). Also, support for Large Receive Offload has not been investigated yet.

o  There are ongoing discussions on heuristics that would be used to decide when to start new subflows. Those discussions are summarized in Appendix B.1, but none of the proposed heuristics have been evaluated yet.

6. Acknowledgements

Sebastien Barre, Christoph Paasch and Olivier Bonaventure are supported by Trilogy (http://www.trilogy-project.org), a research project (ICT-216372) partially funded by the European Community under its Seventh Framework Program. The views expressed here are those of the author(s) only. The European Commission is not liable for any use that may be made of the information in this document.

The authors gratefully acknowledge Costin Raiciu, who wrote a userland implementation of MPTCP and provided insight on implementation matters during several fruitful debates. Discussions with Janardhan Iyengar also helped in understanding the specificities of MPTCP compared to SCTP-CMT.

The authors would also like to thank the following people for useful discussions on the mailing list and/or reviews: Alan Ford, Bob Briscoe, Mark Handley, Michael Scharf.

7. References

[1]  Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for Multipath Operation with Multiple Addresses", draft-ietf-mptcp-multiaddressed-02 (work in progress), October 2010.

[2]  Raiciu, C., Handley, M., and D. Wischik, "Coupled Congestion Control for Multipath Transport Protocols", draft-ietf-mptcp-congestion-01 (work in progress), January 2011.

[3]  Ford, A., Raiciu, C., Handley, M., Barre, S., and J. Iyengar, "Architectural Guidelines for Multipath TCP Development", draft-ietf-mptcp-architecture-05 (work in progress), January 2011.

[4]  Barre, S., Paasch, C., and O. Bonaventure, "Multipath TCP: From Theory to Practice", IFIP Networking, Valencia, May 2011.

[5]  Scharf, M. and A. Ford, "MPTCP Application Interface Considerations", draft-ietf-mptcp-api-00 (work in progress), November 2010.

[6]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[7]  Jacobson, V., "Re: query about tcp header on tcp-ip", September 1993.

[8]  Fisk, M. and W. Feng, "Dynamic right-sizing in TCP", Los Alamos Computer Science Institute Symposium, 2001.

[9]  Hsieh, H. and R. Sivakumar, "pTCP: An End-to-End Transport Layer Protocol for Striped Connections", ICNP, 2002.

[10] Becke, M., Dreibholz, T., Iyengar, J., Natarajan, P., and M. Tuexen, "Load Sharing for the Stream Control Transmission Protocol (SCTP)", draft-tuexen-tsvwg-sctp-multipath-01 (work in progress), December 2010.

[11] Allman, M., "TCP Congestion Control with Appropriate Byte Counting (ABC)", RFC 3465, February 2003.
Seite, "Multiple Interfaces and 1223 Provisioning Domains Problem Statement", 1224 draft-ietf-mif-problem-statement-09 (work in progress), 1225 October 2010. 1227 [13] Draves, R., "Default Address Selection for Internet Protocol 1228 version 6 (IPv6)", RFC 3484, February 2003. 1230 [14] Watson, R., "Protocol stacks and multicore scalability", 1231 Presentation at Maastricht MPTCP workshop , Jul 2010, . 1235 Appendix A. Design alternatives 1237 In this appendix, we describe alternate designs that have been 1238 considered previously, and abandoned for various reasons (detailed as 1239 well). We keep them here for the archive and possible discussion. 1240 We also describe some potential designs that have not been explored 1241 yet but could reveal to be better in the future, in which case that 1242 would be moved to the draft body. 1244 A.1. Another way to consider Path Management 1246 In a previous implementation of MPTCP, it was proposed that the 1247 multipath transport had an even more abstract view of the paths in 1248 use than what is described in Section 2. In that design, the sub- 1249 sockets all shared the same tuple (saddr,sport,daddr,dport), and was 1250 disambiguated only by the path index. The advantage is that the 1251 Multipath Transport needs only to worry about how to efficiently 1252 spread data among multiple paths, without any knowledge about the 1253 addresses or ports used by each particular subflow. 1255 That design was particularly well suited for using Shim6 as a Path 1256 Manager, because Shim6 is already designed to work in the network 1257 layer and rewrite addresses. The first version of the Linux MPTCP 1258 implementation was using Shim6 as path manager. It looks also well 1259 suited to path managers that don't use addresses (e.g. path managers 1260 that write a label in the packet header, later interpreted by the 1261 network). Finally, it removes the need for the token in the 1262 multipath transport (connection identification is done naturally with 1263 the tuple, shared by all subflows). The token hence becomes specific 1264 to the built-in path manager, and can be just ignored with other path 1265 managers (the context tag plays a similar role in shim6, nothing is 1266 needed if the path manager just sets labels to the packets). 1268 However, this cleaner separation between Multipath Transport and Path 1269 Management suffers from three drawbacks: 1271 o It requires a heavy modification to the existing stacks, because 1272 it modifies the current way to identify sockets in the stack. 1273 They are currently unambiguously identified with the usual 1274 5-tuple. This architecture would require extending the 5-tuple 1275 with the path index, given all subflows would share the same 1276 5-tuple. 1278 o Although correctly implemented stacks could handle that new 1279 endpoint identifier (5-tuple+path index), having several flows 1280 with same 5-tuple could confuse middleboxes. 1282 o When the path manager involves using several addresses, forcing 1283 the same 5-tuple for all subflows at the Multipath Transport level 1284 implies that the Path Manager needs to rewrite the address fields 1285 of each packet. That rewriting operation is simply avoided if the 1286 sockets are bound to the addresses actually used to send the 1287 packets. Hence, this alternate design would involve avoidable 1288 costs for path managers that belong to the "multi-address" 1289 category. 1291 A.2. 
A.2. Implementing alternate Path Managers

In Section 2, the Path Manager is defined as an entity that maintains a (path_index <-> endpoint_id) mapping. This is enough in the case of the built-in path manager, because the segments are associated with a path within the socket itself, thanks to its endpoint_id. However, it is expected that most other path managers will need to apply a particular action, on a per-packet basis, to associate packets with a path. Example actions could be writing a number into a field of the segment, or choosing a gateway other than the default one in the routing table. In an earlier version of Linux MPTCP, based on a Shim6 Path Manager, the action consisted of rewriting the addresses of the packets.

To reflect the need for a per-packet action, the PM mapping table (an example of which is given in Table 1) only needs to be extended with an action field. As an illustration, we show hereafter a mapping table for a Path Manager that writes the path index into a field of the packets.

   +---------+------------+---------------+--------------------------+
   |  token  | path index |  Endpoint id  | Action (Write x in DSCP) |
   +---------+------------+---------------+--------------------------+
   | token_1 |     1      |               |            1             |
   |         |            |               |                          |
   | token_1 |     2      |               |            2             |
   |         |            |               |                          |
   | token_1 |     3      |               |            3             |
   |         |            |               |                          |
   | token_1 |     4      |               |            4             |
   |         |            |               |                          |
   | token_2 |     1      |               |            1             |
   |         |            |               |                          |
   | token_2 |     2      |               |            2             |
   +---------+------------+---------------+--------------------------+

          Table 4: Example mapping table for a label-based PM
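In code, the extension amounts to one additional field per mapping entry. The following sketch mirrors Table 4 for a DSCP-writing path manager; all names are illustrative and do not come from Linux MPTCP:

   /* Illustrative PM mapping table with a per-packet action field,
    * mirroring Table 4.  Names are examples, not Linux MPTCP symbols. */
   #include <stdint.h>
   #include <stdio.h>

   struct pm_mapping {
       uint32_t token;      /* identifies the MPTCP connection          */
       uint8_t  path_index; /* path chosen by the Multipath Transport   */
       /* endpoint id omitted, as in Table 4                            */
       uint8_t  dscp;       /* per-packet action: DSCP value to write   */
   };

   static const struct pm_mapping pm_table[] = {
       { .token = 1, .path_index = 1, .dscp = 1 },
       { .token = 1, .path_index = 2, .dscp = 2 },
       { .token = 1, .path_index = 3, .dscp = 3 },
       { .token = 1, .path_index = 4, .dscp = 4 },
       { .token = 2, .path_index = 1, .dscp = 1 },
       { .token = 2, .path_index = 2, .dscp = 2 },
   };

   /* Per-packet action: look up the entry and return the DSCP to write.
    * Returns -1 if the (token, path index) pair is unknown. */
   static int pm_action(uint32_t token, uint8_t path_index)
   {
       for (size_t i = 0; i < sizeof(pm_table) / sizeof(pm_table[0]); i++)
           if (pm_table[i].token == token &&
               pm_table[i].path_index == path_index)
               return pm_table[i].dscp;
       return -1;
   }

   int main(void)
   {
       printf("DSCP for (token_2, path 2): %d\n", pm_action(2, 2));
       return 0;
   }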
A.3. When to instantiate a new meta-socket?

The meta-socket is responsible only for MPTCP-related operations. This includes connection-level reordering for incoming data, scheduling for outgoing data, and subflow management. A natural choice would then be to instantiate a new meta-socket only when the peer has told us that it supports MPTCP. On the server side this is naturally the case, since the master subsocket is created upon reception of a SYN+MP_CAPABLE. The client, however, instantiates its master subsocket when the application issues a socket() system call, and needs to wait until the SYN+ACK to know whether its peer supports MPTCP. Yet, it must already provide its token in the SYN.

Linux MPTCP currently instantiates its client-side meta-socket when the master subsocket is created (just like the server side). The drawback is that if, after socket(), the application subsequently issues a listen(), we have built a useless meta-socket. The same happens if the peer's SYN+ACK does not carry the MP_CAPABLE option. To avoid that, one may want to instantiate the meta-socket upon reception of an MP_CAPABLE option. But this implies that the token (sent in the SYN) must be stored in some temporary place, or in the master subsocket, until the meta-socket is built.

A.4. Forcing more processing in user context

The implementation architecture proposed in this draft uses the following queue configuration:

o  Subflow level: out-of-order queue. Used for subflow-level reordering.

o  Connection level: out-of-order queue. Used for connection-level reordering.

o  Connection level: receive queue. Used for storing the ordered data until the application asks for it through a recvmsg() system call or similar.

In a previous version of Linux MPTCP, another queue configuration was examined:

o  Subflow level: out-of-order queue. Used for subflow-level reordering.

o  Subflow level: receive queue. Used for storing the data until the application asks for it through a recvmsg() system call or similar.

o  Connection level: out-of-order queue. Used for connection-level reordering.

In this alternate architecture, the connection-level data is lazily reordered as the application asks for it. The main goal was to ensure that as many CPU cycles as possible were spent in user context (see Section 3.1). VJ prequeues allow forcing user-context processing when the application is waiting on a recv() system call; otherwise the subflow-level reordering must be done in interrupt context. This remains true with MPTCP, because the subflow-level implementation is left unmodified where possible. With MPTCP, the question is: "Where do we perform connection-level reordering?". The answer of this alternate architecture is: "Always in user context", and this was its strength. Technically, the task of each subflow was to reorder its own segments and put them in its own receive queue, until the application asks for data. When the application wants to read more data, MPTCP searches all subflow-level receive queues for the next bytes to receive, and reorders them as appropriate using its own reordering queue. As soon as the requested number of bytes has been handed to the application buffer, the MPTCP reordering task finishes. A sketch of this lazy reordering loop is given after the drawbacks below.

Unfortunately, there are two major drawbacks to doing it that way:

o  The socket API supports the SO_RCVLOWAT option, which allows an application to ask not to be woken up until n bytes have been received. Counting those bytes requires reordering at least n bytes at the connection level in interrupt context.

o  The DATA_ACK [1] should report the latest byte received in order at the connection level. In this architecture, the best we can do is report the latest byte that has been copied to the application buffers, which would slightly change the DATA_ACK semantics described in Section 3.3.2 of [1]. This change could confuse peers that try to derive information from the received DATA_ACK.
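The sketch below illustrates the lazy, pull-based reordering described above, in plain user-space C with arrays standing in for the kernel queues. It is a simplification (the connection-level out-of-order queue is reduced to a scan of the subflow queue heads), and all names are illustrative:

   /* Lazy connection-level reordering, run only at recvmsg() time.
    * Purely illustrative: plain arrays instead of socket buffers. */
   #include <stdio.h>

   struct rcv_segment { unsigned data_seq; char byte; };

   struct subflow_queue {
       struct rcv_segment seg[16];   /* already in subflow order */
       int head, tail;
   };

   /* Pull up to 'len' in-order connection-level bytes out of the subflow
    * receive queues, starting at *next_seq.  Returns the number copied. */
   static int mptcp_recv(struct subflow_queue *sf, int nr_sf,
                         unsigned *next_seq, char *buf, int len)
   {
       int copied = 0;

       while (copied < len) {
           int progress = 0;
           for (int i = 0; i < nr_sf && copied < len; i++) {
               struct subflow_queue *q = &sf[i];
               /* Only the head of a subflow queue can be the next byte. */
               if (q->head < q->tail &&
                   q->seg[q->head].data_seq == *next_seq) {
                   buf[copied++] = q->seg[q->head++].byte;
                   (*next_seq)++;
                   progress = 1;
               }
           }
           if (!progress)    /* next byte not yet received: stop here */
               break;
       }
       return copied;
   }

   int main(void)
   {
       struct subflow_queue sf[2] = {
           { .seg = { {0, 'M'}, {2, 'T'}, {3, 'C'} }, .tail = 3 },
           { .seg = { {1, 'P'}, {4, 'P'} },           .tail = 2 },
       };
       unsigned next_seq = 0;
       char buf[8] = { 0 };
       int n = mptcp_recv(sf, 2, &next_seq, buf, 5);
       printf("%.*s\n", n, buf);    /* prints "MPTCP" */
       return 0;
   }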
A.5. Buffering data on a per-subflow basis

In previous versions of Linux MPTCP, the send queues were configured as shown in Figure 4.

                                Application
                                     |
                                     v
                              Packet Scheduler
                                  /     \
                                 /       \
                                |         |
                                v         v
                              | * |     |   |
   Next segment to send (A) ->| * |     | * |
                              |---|     |---|  <- Separate send queues
   Sent, but not acked (B) -> |_*_|     |_*_|
                                |         |
                                v         v
                               NIC       NIC

                    Figure 4: Send queue configuration

In contrast to the architecture presented in Section 3.5.2, there is no shared send queue. The Packet Scheduler is run each time data is produced by the application. Compared to the architecture of Section 3.5.2, the advantages and drawbacks are basically reversed. The advantages are:

o  This architecture supports subflow-specific Maximum Segment Sizes, because the subflow is selected before the segment is built.

o  The segments are stored in their final form in the subflow-specific send queues, and there is no need to run the Packet Scheduler at transmission time. The result is more fairness with other applications (because the Packet Scheduler runs in user context only) and faster data transmission when acknowledgements open the congestion window (because segments are buffered in their final form and no call to the Packet Scheduler is needed).

The drawback, which motivated the architecture change in Linux MPTCP, is the complexity of the data allocation (hence of the Packet Scheduler), and the computing cost involved. Given that there is no shared send buffer, the send buffer auto-tuning must be divided into its subflow contributions. This buffer size can easily be derived from Section 3.5.1. However, when scheduling a full send buffer of data in advance, we may be allocating a segment hundreds of milliseconds before it actually goes to the wire. The task of the Packet Scheduler is then complicated, because it must _predict_ the path properties. If the prediction is incorrect, two subflows may try to put on the wire segments that are very distant in terms of DATA_SEQ numbers. This can eventually stall some subflows, because the DATA_SEQ gap between two subflows exceeds the receive window announced by the receiver. The Packet Scheduler can compute a correct allocation of segments relatively easily if the path properties do not vary (simply because it is easy to predict a constant value), but the implementation was very sensitive to variations in delay or bandwidth. The previous implementation of Linux MPTCP solved this allocation problem by verifying, upon each failed transmission attempt, whether it was blocked by the receive window due to a gap in DATA_SEQ with other subflows. If that was the case, a full reallocation of segments was conducted. However, the cost of such a reallocation is very high, because it involves reconsidering the allocation of every single segment, and doing so for all the subflows. Worse, this costly reallocation sometimes needed to happen in interrupt context, which removed one of the advantages of this architecture.

Yet, under the assumption that the subflow-specific queue size is small, the above drawback almost disappears. For this reason the abandoned design described here could be used to feed a future hybrid architecture, as explained in Section 3.5.2. For the sake of comparison with Table 3, we provide hereafter the (event, action) table implemented by this architecture.

   +-----------------+-------------------------------------------------+
   | event           | action                                          |
   +-----------------+-------------------------------------------------+
   | Segment         | Remove references to it from the subflow-level  |
   | acknowledged at | queue.                                          |
   | subflow level   |                                                 |
   |                 |                                                 |
   | Segment         | No queue-related action.                        |
   | acknowledged at |                                                 |
   | connection      |                                                 |
   | level           |                                                 |
   |                 |                                                 |
   | Timeout         | Push the segment to the best subflow (according |
   | (subflow-level) | to the Packet Scheduler).  In contrast with the |
   |                 | solution of Section 3.5.2, there is no need for |
   |                 | a connection-level retransmit queue, because    |
   |                 | there is no requirement for a subflow to be     |
   |                 | immediately available to accept new data.       |
   |                 |                                                 |
   | Ready to put    | Just send the next segment from the A portion   |
   | new data on the | of the subflow-specific send queue, if any.     |
   | wire (normally  | Note that the "IMPORTANT" note from             |
   | triggered by an | Section 3.5.2 still applies with this           |
   | incoming ack)   | architecture.                                   |
   +-----------------+-------------------------------------------------+

     Table 5: (event, action) pairs implemented in a queue management
                      based on separate send queues
Appendix B. Ongoing discussions on implementation improvements

This appendix collects information on features that have not been implemented anywhere yet, but that can still be useful as hints for implementers to experiment with. Feedback from implementers will help converge on those topics and propose solid guidelines for future versions of this memo.

B.1. Heuristics for subflow management

Some heuristic should determine when it would be beneficial to add a new subflow. Linux MPTCP has no such heuristic at the moment, but the topic has been discussed on the MPTCP mailing list, so this section summarizes the input from many individuals. MPTCP is not useful for very short flows, so three questions arise:

o  How long is a "too short" flow?

o  How can we predict that a flow will be short?

o  When should we decide to add or remove subflows?

To answer the third question, it has been proposed to use hints from the application. On the other hand, experience shows that socket options are quite often used poorly or not at all, which motivates the parallel use of a good default heuristic. This default heuristic may be influenced in particular by the set of options that are enabled for MPTCP (e.g. an administrator can decide that some security mechanisms for subflow initiation are not needed in his environment and disable them, which changes the cost of establishing new subflows). The following elements have been proposed to feed the heuristic, none of them tested yet:

o  Check the size of the write operations from the application. Initiate a new subflow if the write size exceeds some threshold. This information can be taken only as a hint, because applications could send big chunks of data split into many small writes. A particular case of checking the size of write operations is when the application uses the sendfile() system call. In that situation MPTCP can know very precisely how many bytes will be transferred.

o  Check whether the flow is network limited or application limited. Initiate a new subflow only if it is network limited.

o  It may be useful to establish new subflows even for application-limited communications, to provide failure survivability. A way to do that would be to initiate a new subflow (if not done before by another trigger) after some time has elapsed, regardless of whether the communication is network or application limited.

o  Wait until slow start is done before establishing a new subflow. Measurements with Linux MPTCP suggest that slow start could be a reasonable tool for determining when it is worth starting a new subflow (without increasing the overall completion time). More analysis is needed in that area, however. Also, this should be taken as a hint only if slow start is actually progressing (otherwise a stalled subflow could prevent the establishment of another one, precisely when a new one would be useful).
o  Use information from the application-layer protocol. Some protocols (e.g. HTTP) carry flow length information in their headers, which can be used to decide how many subflows are useful.

o  Allow the administrator to configure subflow policies on a per-port basis. The host stack could also learn for which ports MPTCP turns out to be useful.

o  Check the underlying medium of each potential subflow. For example, if the initial subflow is initiated over 3G while WiFi is available, it probably makes sense to immediately negotiate an additional subflow over WiFi.

It is not only useful to determine when to start new subflows; an implementation should also sometimes decide to abandon some of them. An MPTCP implementation should be able to determine when removing a subflow would increase the aggregate bandwidth. This can happen, for example, when the subflow has a significantly higher delay than the other subflows and the maximum buffer size allowed by the administrator has been reached (Linux MPTCP has no such heuristic yet).

Authors' Addresses

Sebastien Barre
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

Email: sebastien.barre@uclouvain.be
URI:   http://inl.info.ucl.ac.be/sbarre

Christoph Paasch
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

Email: christoph.paasch@uclouvain.be
URI:   http://inl.info.ucl.ac.be/cpaasch

Olivier Bonaventure
Universite catholique de Louvain
Place Ste Barbe, 2
Louvain-la-Neuve 1348
BE

URI:   http://inl.info.ucl.ac.be/obo