Internet Engineering Task Force                                  A. Ford
Internet-Draft                                       Roke Manor Research
Intended status: Experimental                                  C. Raiciu
Expires: January 11, 2010                                     M. Handley
                                               University College London
                                                                S. Barre
                                                Universite catholique de
                                                                 Louvain
                                                           July 10, 2009


     TCP Extensions for Multipath Operation with Multiple Addresses
                   draft-ford-mptcp-multiaddressed-01

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 11, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   Often endpoints are connected by multiple paths, but the nature of
   TCP/IP restricts communications to a single path per socket.
   Resource usage within the network would be more efficient were these
   multiple paths able to be used concurrently. This should enhance
   user experience through higher throughput and improved resilience to
   network failure.
   This document presents extensions to TCP in order to transparently
   provide this multi-path functionality at the transport layer, if at
   least one endpoint is multi-addressed.

Table of Contents

   1.  Introduction
     1.1.  Motivation
     1.2.  Design Assumptions
     1.3.  Layered Representation
     1.4.  Operation Summary
     1.5.  Open Issues
     1.6.  Requirements Language
   2.  Terminology
   3.  Semantic Issues
   4.  MPTCP Protocol
     4.1.  Session Initiation
     4.2.  Starting a New Subflow
     4.3.  Address Knowledge Exchange (Path Management)
       4.3.1.  Adding Addresses
       4.3.2.  Remove Address
     4.4.  General MPTCP Operation
       4.4.1.  Receive Window Considerations
       4.4.2.  Congestion Control Considerations
       4.4.3.  Subflow Policy
       4.4.4.  Retransmissions
     4.5.  Closing a Connection
     4.6.  Error Handling
   5.  Congestion Control Coupling for MPTCP
   6.  Security Considerations
   7.  Interactions with Middleboxes
   8.  Interfaces
   9.  Acknowledgements
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Functional Separation
     A.1.  Motivations
     A.2.  TCP Performance
     A.3.  Architecture overview
     A.4.  PM/MPS interface
   Appendix B.  Notes on use of TCP Options
   Appendix C.  Resync Packet
   Authors' Addresses

1.  Introduction

   Multipath TCP is a set of extensions for regular TCP [RFC0793] to
   allow one TCP connection to be spread across multiple paths. This
   section describes the motivation behind the design of Multipath TCP
   (henceforth referred to as MPTCP), and gives a summary of its
   operation. The following sections describe in greater detail the
   proposed extensions and the operation of the resulting protocol.

1.1.  Motivation

   As the Internet evolves, demands on Internet resources are
   ever-increasing, but often these resources (in particular, bandwidth)
   cannot be fully utilised due to protocol constraints both on the
   end-systems and within the network. By the application of resource
   pooling [WISCHIK], these resources can be 'pooled' such that they
   appear as a single logical resource to the user. Multipath TCP
   achieves resource pooling by combining multiple TCP sessions running
   over multiple paths, and presenting them as a single TCP connection
   to the application.

   This form of resource pooling brings two key benefits:

   o  To increase the resilience of the connectivity by providing
      multiple paths, protecting end hosts from the failure of one.

   o  To increase the efficiency of the resource usage, and thus
      increase the network capacity available to end hosts.

   The protocol presented in this document follows the same service
   model as TCP [RFC0793]: byte-oriented, in-order, reliable delivery.
   To have a deployable protocol, we impose the following "do no harm"
   philosophy: multipath TCP should behave no worse (throughput-wise)
   than running a single TCP connection over any of its paths, and using
   multiple paths should not harm users using single-path TCP at shared
   bottlenecks.

1.2.  Design Assumptions

   In order to limit the potentially huge design space, the authors
   imposed two key constraints on the multipath TCP design presented in
   this document:

   o  It must be backwards-compatible with current, regular TCP, to
      increase its chances of deployment.

   o  It can be assumed that one or both endpoints are multihomed and
      multiaddressed.

   To simplify the design we assume that the presence of multiple
   addresses at an endpoint is sufficient to indicate the existence of
   multiple paths. These paths need not be entirely disjoint: they may
   share one or many routers between them. Even in such a situation
   making use of multiple paths is beneficial, improving resource
   utilisation and resilience to a subset of node failures.

   There are three aspects to the backwards-compatibility listed above:

   External Constraints:  The protocol must function through the vast
      majority of existing middleboxes such as NATs, firewalls and
      proxies, and as such must resemble existing TCP as far as possible
      on the wire.
      Furthermore, the protocol must not assume the segments it sends
      on the wire arrive unmodified at the destination: they may be
      split or coalesced; options may be removed or duplicated.

   Application Constraints:  The protocol must be usable with no change
      to existing applications that use the standard TCP API (although
      it is reasonable that not all features would be available to such
      legacy applications).

   Fall-back:  The protocol should be able to fall back to standard TCP
      with no interference from the user, to be able to communicate with
      legacy hosts.

   Areas for further study:

   o  In theory, since this is purely a TCP extension, it should be
      possible to use MPTCP with both IPv4 and IPv6 on dual-stack hosts,
      thus having the additional possible benefit of aiding transition.

   o  Some features of the design presented here could be extended to
      work with non-multi-addressed hosts by using packet marking or
      partial multipath.

   o  Some features of the design presented here could be combined with
      mechanisms such as shim6 [I-D.ietf-shim6-proto].

   This draft also suggests a safe way to couple congestion controllers
   in a way that achieves the "do no harm" philosophy. This is for
   completeness of our arguments: we expect this description to evolve
   into a new companion Internet-Draft.

1.3.  Layered Representation

   MPTCP operates at the transport layer, and its existence aims to be
   transparent to both higher and lower layers. It is a set of
   additional features on top of standard TCP, and as such MPTCP is
   designed to be usable by legacy applications with no changes. A
   possible implementation would be for such a feature to be a
   system-wide setting: "Use multipath TCP by default? Y/N".
   Multipath-aware applications would be able to use an extended sockets
   API to have further influence on the behaviour of MPTCP. Figure 1
   illustrates this architecture.

                                +-------------------------------+
                                |          Application          |
   +---------------+            +-------------------------------+
   |  Application  |            |             MPTCP             |
   +---------------+            + - - - - - - - + - - - - - - - +
   |      TCP      |            | Subflow (TCP) | Subflow (TCP) |
   +---------------+            +-------------------------------+
   |      IP       |            |      IP       |      IP       |
   +---------------+            +-------------------------------+

      Figure 1: Comparison of Standard TCP and MPTCP Protocol Stacks

   Detailed discussion of an architecture for developing a multipath TCP
   implementation, especially regarding the functional separation by
   which different components should be developed, is given in
   Appendix A.

1.4.  Operation Summary

   This section provides a high-level summary of normal operation in
   MPTCP, and is illustrated by the scenario shown in Figure 2. A
   detailed description of operation is given in Section 4.

   o  To a non-MPTCP-aware application, MPTCP will be indistinguishable
      from normal TCP. All MPTCP operation is handled by the MPTCP
      implementation, although extended APIs could provide additional
      control. An application begins by opening a TCP socket in the
      normal way.

   o  An MPTCP connection begins as a single TCP session. This is
      illustrated in Figure 2 as being between Addresses A1 and B1 on
      Hosts A and B respectively.

   o  If extra paths are available, additional TCP sessions are created
      on these paths, and are combined with the existing session, which
      continues to appear as a single connection to the applications at
      both ends. The creation of the additional TCP session is
      illustrated between Address A2 on Host A and Address B1 on Host B.

   o  MPTCP identifies multiple paths by the presence of multiple
      addresses at endpoints. Combinations of these multiple addresses
      equate to the additional paths. In the example, other potential
      paths that could be set up are A1<->B2 and A2<->B2.
      Although this additional session is shown as being initiated from
      A2, it could equally have been initiated from B1.

   o  The discovery and setup of additional TCP sessions (termed
      'subflows') will be achieved through a path management method.
      This document describes a mechanism by which an endpoint can
      initiate new subflows by using its additional addresses, or by
      signalling to the other endpoint its available addresses.

   o  The exact properties of these TCP sessions that are logically
      bonded are dependent upon the congestion and flow control
      characteristics of the endpoints' MPTCP implementation.

   o  MPTCP adds connection-level sequence numbers in order to
      reassemble the data stream in-order from multiple subflows.
      Connections are terminated by connection-level FIN packets as well
      as those relating to the individual subflows.

               Host A                               Host B
      ------------------------             ------------------------
      Address A1    Address A2             Address B1    Address B2
      ----------    ----------             ----------    ----------
          |             |                      |             |
          |     (initial connection setup)     |             |
          |----------------------------------->|             |
          |<-----------------------------------|             |
          |             |                      |             |
          |            (additional subflow setup)            |
          |             |--------------------->|             |
          |             |<---------------------|             |
          |             |                      |             |
          |             |                      |             |

                  Figure 2: Example MPTCP Usage Scenario

1.5.  Open Issues

   This specification is a work-in-progress, and as such there are many
   issues that are still to be resolved. This section lists many of the
   key open issues within this specification; these are discussed in
   more detail in the appropriate sections throughout this document.

   o  Best handshake mechanisms. This document contains a proposed
      scheme by which connections and subflows can be set up.
      It is felt that, although this is "no worse than regular TCP",
      there could be opportunities for significant improvements in
      security that could be included (potentially optionally) within
      this protocol.

   o  Issues around simultaneous opens, where both ends attempt to
      create a new subflow simultaneously, need to be investigated and
      behaviour specified.

   o  Appropriate mechanisms for controlling policy of subflow usage.
      The ECN signal is currently proposed but other alternatives,
      including path property options, could be employed instead.

1.6.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Terminology

   Path:  A sequence of links between a sender and a receiver, defined
      in this context by a source and destination address pair.

   Subflow:  A stream of TCP packets sent over a path. A subflow is a
      component part of a connection between two endpoints.

   Connection:  A collection of one or more subflows, over which an
      application can communicate between two endpoints. There is a
      one-to-one mapping between a connection and a socket.

   Token:  A unique identifier given to a multipath connection by an
      endpoint. May also be referred to as a "Connection ID".

   Endpoint:  A host operating an MPTCP implementation, and either
      initiating or terminating an MPTCP connection.

3.  Semantic Issues

   In order to support multipath operation, the semantics of some TCP
   components have changed. To aid clarity, this section collects these
   semantic changes as a reference.

   Sequence Number:  The TCP sequence number is subflow-specific, with a
      data sequence number used for reassembly at connection-level.

   FIN:  The FIN only applies to a subflow, not to a connection.
      For a connection-level FIN, use the DATA FIN option.

   ACK:  The ACK acknowledges the subflow sequence number only, and the
      mapping to the data sequence number is handled out-of-band.

   RST:  The RST only applies to a subflow. There is no
      connection-level RST, since it would be impossible to distinguish
      the two, as the link between a subflow and a connection is
      established at the SYN handshake. A connection is considered
      reset if every subflow sends a RST in response.

   Length:  There is additionally an explicit length for each MPTCP
      segment in order to separate potential TCP/IP-layer segmentation
      from the MPTCP data flow.

   Address List:  The address management is handled per-connection to
      permit the application of per-connection local policy.

   5-tuple:  The 5-tuple (protocol, local IP, local port, remote IP,
      remote port) presented to the application layer in a
      non-multipath-aware application is that of the first subflow, even
      if the subflow has since been closed and removed from the
      connection.

4.  MPTCP Protocol

   This section describes the operation of the MPTCP protocol, and is
   subdivided into sections for each key part of the protocol operation.

   All MPTCP operations are signalled using optional TCP header fields.
   These TCP Options will have option numbers allocated by IANA, as
   discussed in Section 10, and are defined throughout the following
   subsections.

4.1.  Session Initiation

   Session Initiation begins with a SYN, SYN/ACK exchange on a single
   path. Each of these packets will additionally feature the Multipath
   Capable TCP option (Figure 3), which declares the sender's locally
   unique 32-bit token for this connection, and a version field.

   The "Multipath Capable" option declares an endpoint to be capable of
   operating Multipath TCP (or rather, more accurately, a desire to
   operate Multipath TCP on this particular connection).
   As well as this declaration, this field presents a token, which is
   used when adding additional subflows to this connection.

   This token is generated by the sender and has local meaning only; it
   therefore MUST be unique for the sender. The token MUST be
   difficult for an attacker to guess, and thus it SHOULD be generated
   randomly. (However, see further discussions about security in
   Section 6, including the possibility of a 64-bit token and an initial
   data sequence number.)

   This option is only present in packets with the SYN flag set. It is
   only used in the first TCP session of a connection, in order to
   identify the connection; all subsequent subflows will use path
   management techniques to join the existing connection.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------------------------------+
   | Kind=OPT_MPC  |  Length = 7   |(resvd)|Version| Sender Token  :
   +---------------+---------------+-------------------------------+
   :   Sender Token (continued - 4 octets total)   |
   +-----------------------------------------------+

                    Figure 3: Multipath Capable option

   The version field represents the version of MPTCP in use. The
   version provided in this specification is 0. The reserved bits may
   be used for connection-specific flags in later versions.

   If a SYN contains a "multipath capable" option but the SYN/ACK does
   not, it is assumed that the recipient is not multipath capable and
   thus the MPTCP session will operate as regular, single-path TCP. If
   a SYN does not contain a "multipath capable" option, the SYN/ACK MUST
   NOT contain one in response.

   If these packets are unacknowledged, it is up to local policy to
   decide how to respond. It is expected that a sender will eventually
   fall back to single-path TCP (i.e.
   without the Multipath Capable Option), in order to work around
   middleboxes that may drop packets with unknown options; however, the
   number of multipath-capable attempts that are made first will be up
   to local policy. In the case of out-of-order packets, i.e. if a
   multipath-capable SYN/ACK is received in response to a
   multipath-capable SYN, after a standard SYN has been sent, then once
   again it is up to the sender to choose how to behave. For example,
   the sender could respond to new connections using the previously
   declared token, or it could simply drop any new multipath options
   within the flow.

   If an endpoint is known to be multiaddressed (e.g. through multiple
   addresses returned in a DNS lookup), alternative destination
   addresses should be tried first, before falling back to regular TCP.

   In addition to this option, a Data Sequence Number option (discussed
   in Section 4.4) is included to provide an initial data-level sequence
   number (and this initial SYN counts as one octet in this space, as
   for a regular SYN in single-path TCP).

4.2.  Starting a New Subflow

   Endpoints have knowledge of their own address(es), and can become
   aware of the other endpoint's addresses through a path management
   technique as described in Section 4.3. Using this knowledge, an
   endpoint will initiate a new subflow over a currently unused pair of
   addresses.

   A new subflow is started as a normal TCP SYN/ACK exchange. The
   Join Connection TCP option (Figure 4) is used to identify which
   connection the new subflow should become a part of. The token used
   is the locally unique token of the destination for the subflow, as
   defined by the Multipath Capable option received in the first SYN/ACK
   exchange.

   It should be noted that, in theory, additional subflows can exist
   between any pair of ports, and as such it is this token that is used
   for demuxing at the receiver.
   A receiver must store some mapping state, of (source_addr, dest_addr,
   source_port, dest_port) to its token, using information from the
   initial SYN exchange, in order to enable this. In practice, however,
   it is envisaged that most new subflows will connect to a port that is
   already in use as the source or destination port of an existing
   subflow, in order to have a greater chance of getting through
   firewalls and other middleboxes, and to support traffic engineering
   of the flows.

   This option includes an "Address ID". This is an identifier, locally
   unique to the sender of this option, that identifies the source
   address of this packet. This serves two purposes. Firstly, if an
   address becomes unexpectedly unavailable on the sender, it can signal
   this to the receiver via a remove address option (Section 4.3.2)
   without needing to know what the source address actually is (thus
   allowing the use of NATs). Secondly, it allows correlation between
   new connection attempts and address signalling (Section 4.3.1), to
   prevent duplicate subflow initiation.

   This option can only be present when the SYN flag is set.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------------------------------+
   | Kind=OPT_JOIN |  Length = 6   |Receiver Token (4 octets total):
   +---------------+---------------+---------------+---------------+
   :  Receiver Token (continued)   |  Address ID   |
   +-------------------------------+---------------+

                     Figure 4: Join Connection option

4.3.  Address Knowledge Exchange (Path Management)

   We use the term "path management" to refer to the exchange of
   information about additional paths between endpoints, which in this
   design is managed by multiple addresses at endpoints. For more
   detail of the architectural thinking behind this design, see the
   discussion of functional separation in Appendix A.

   This design makes use of two methods of sharing such information,
   used simultaneously. The first is the direct setup of new subflows,
   already described in Section 4.2. The second is described in the
   following subsections, whereby addresses are signalled explicitly to
   the endpoint to allow it to initiate new connections. This approach
   has been chosen so as to allow addresses to change in flight, and
   thus the use of NATs, whilst also allowing the signalling of
   previously unknown addresses, such as those belonging to other
   address families.

   In more detail, an example of the typical operation is as follows,
   where an existing address is used at one endpoint:

   o  An endpoint that is multihomed starts an additional TCP session to
      an address/port pair that is already in use on the other endpoint,
      using a token to identify the flow (Section 4.2). (A multihomed
      destination may open a new subflow from its new address to the
      source address and port, or a multihomed source may open a new
      subflow from its new address to the existing destination and
      port.)

   o  To expand upon this, say a connection is initiated from host "A"
      on (address, port) combination A1 to destination (address, port)
      B1 on host "B". If host A is multihomed, it starts an additional
      connection from new (address, port) A2 to B1, using B's previously
      declared token. Alternatively, if B is multihomed, it will try to
      set up a new TCP connection from B2 to A1, using A's previously
      declared token.

   o  Simultaneously (or near-simultaneously), an "Add Address" option
      (Section 4.3.1) is sent on an existing subflow, informing the
      receiver of the sender's alternative address(es). The recipient
      can use this information to open a new subflow to the sender's
      additional address. Using the previous notation, this would be an
      Add Address packet sent from A1 to B1, informing B of address A2.

   o  If host B successfully receives the first SYN, starting a new
      subflow, it can use the Address ID to correlate this with the Add
      Address option that will also arrive on an existing subflow, and
      it will respond to the SYN with a SYN/ACK. Otherwise, if it does
      not receive such a SYN, it tries to initiate a new subflow from
      one or more of its addresses to address A2. This is intended to
      permit new sessions to be opened if one endpoint is behind a NAT.

   Other scenarios are valid, however, such as those where entirely new
   addresses are signalled, e.g. to allow an IPv6 and an IPv4 path to be
   used simultaneously.

4.3.1.  Adding Addresses

   Announcing additional addresses that an endpoint can be reached on
   will be undertaken by the Add Address TCP Option (Figure 5), where an
   (ID, address) pair can be announced to the other endpoint. Several
   addresses can be added if there is sufficient TCP option space,
   otherwise multiple TCP messages containing this option must be sent.
   This option can be used at any time during a connection.

   The Add Address option announces a list of alternative IP addresses,
   beyond the current one in use, that the sender can be contacted on.
   This option can be used multiple times until all available addresses
   have been announced, in order to get around TCP option space limits.
   It should be noted that every address has an ID which can be used for
   address removal, and therefore endpoints must cache the mapping
   between ID and address. This is also used to identify Join
   Connection options (Section 4.2) relating to the same address, even
   when address translators are in use. The ID must be unique to the
   sender, and although it may be a sequential counter, this is not
   mandated.

   This option is shown for IPv4.
   For IPv6, the IPVer field will read 6, and the length of the address
   will be 16 octets not 4, and thus the length of the option will be
   2 + (18 * number_of_entries). Multiple addresses can be included,
   with an ID following on immediately from the previous address, and
   their existence can be inferred through the option length and version
   fields.

   NB: by having an IPVer field, we get four free reserved bits. These
   could be used in later versions of this protocol, e.g. one bit for
   "use now" or similar, to differentiate between subflows for backup
   purposes and those for throughput.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+---------------+-------+-------+
   | Kind=OPT_ADDR |     Length    |  Address ID   | IPVer |(resvd)|
   +---------------+---------------+---------------+-------+-------+
   |                   Address (IPv4 - 4 octets)                   |
   +---------------------------------------------------------------+
   (     ... further ID/Version/Address fields as required ...     )

                 Figure 5: Add Address option (for IPv4)

4.3.2.  Remove Address

   If, during the lifetime of an MPTCP connection, a previously-
   announced address becomes invalid (e.g. if the interface disappears),
   the affected endpoint should announce this so that the other endpoint
   can remove subflows related to this address.

   This is achieved through the Remove Address option (Figure 6), which
   will remove a previously-added address (or list of addresses) from a
   connection and terminate any subflows currently using that address.

   The sending and receipt of this message should trigger the sending of
   FINs by both endpoints on the affected subflow(s) (if possible), as a
   courtesy to allow middlebox state to be cleaned up, but endpoints may
   clean up their internal state without a long timeout.

   If there is no address at the requested ID, the receiver will
   silently ignore the request.
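
   The address handling described in Sections 4.3.1 and 4.3.2 can be
   sketched as follows. This is only an illustrative sketch, not part
   of the protocol definition: the names parse_add_address and
   AddressCache, and the Kind value 0x1F, are assumptions made for the
   example (no option numbers are assigned in this document). It walks
   the (ID, IPVer, address) entries of one Add Address option, caches
   the ID-to-address mapping, and ignores removal of an unknown ID.

```python
import ipaddress

ADDRESS_SIZES = {4: 4, 6: 16}  # IPVer value -> address length in octets


def parse_add_address(option: bytes) -> list:
    """Return [(address_id, address), ...] from one Add Address option."""
    length = option[1]
    if length != len(option):
        raise ValueError("Length field does not match the option size")
    entries = []
    pos = 2  # skip the Kind and Length octets
    while pos < length:
        if pos + 2 > length:
            raise ValueError("truncated Add Address entry")
        address_id = option[pos]
        ip_version = option[pos + 1] >> 4  # upper four bits; lower four reserved
        size = ADDRESS_SIZES.get(ip_version)
        if size is None or pos + 2 + size > length:
            raise ValueError("malformed Add Address entry")
        packed = option[pos + 2:pos + 2 + size]
        entries.append((address_id, ipaddress.ip_address(packed)))
        pos += 2 + size
    return entries


class AddressCache:
    """Per-connection mapping from Address ID to announced address."""

    def __init__(self):
        self._by_id = {}

    def add(self, address_id, address):
        self._by_id[address_id] = address

    def remove(self, address_id):
        """Forget an address; an unknown ID is silently ignored (None)."""
        return self._by_id.pop(address_id, None)


# Hypothetical Kind value 0x1F; one IPv4 entry: ID 2, address 192.0.2.1.
option = bytes([0x1F, 8, 2, 0x40, 192, 0, 2, 1])
cache = AddressCache()
for address_id, address in parse_add_address(option):
    cache.add(address_id, address)
```

   In a fuller implementation the value returned by remove() would be
   used to find and close the subflows that use the removed address;
   that bookkeeping is omitted here.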

   Address removal is undertaken by ID, so as to permit the use of NATs
   and other middleboxes, in cases where new connections have been
   initiated but now need to be removed.

   The closure of a single subflow, rather than all using a particular
   address, is undertaken as normal with a FIN exchange on the subflow;
   for more information, see Section 4.5.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+---------------+
   |Kind=OPT_REMADR| Length = 2+n  |  Address ID   | ...
   +---------------+---------------+---------------+

                     Figure 6: Remove Address option

4.4.  General MPTCP Operation

   This section discusses operation of MPTCP for data transfer,
   independent of the path management mechanism used.

   At a high level, an MPTCP implementation will take one input data
   stream from an application, and split it into one or more subflows.
   The data stream as a whole can be reassembled through the use of the
   Data Sequence Number (Figure 7) option, which defines the sequence in
   the data stream of the first octet of the packet's payload, and this
   is used by the receiver to ensure in-order delivery to the
   application layers. Meanwhile, the subflow-level sequence numbers
   (i.e. the regular sequence numbers in the TCP header) have
   subflow-only relevance.

   The only acknowledgements are those at the subflow-level, so the
   sender must be able to map these acknowledgements to the data
   sequence numbers that were contained in the relevant packets. The
   sender thus knows, if subflow data goes unacknowledged, which part of
   the original data stream this equates to, and thus what data must be
   retransmitted. It is expected (but not mandated) that SACK [RFC2018]
   is used for efficiency at the subflow level. Each subflow will
   maintain its own congestion window.
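
   One way a sender might keep the mapping between subflow-level
   acknowledgements and data sequence numbers is sketched below. This
   is a minimal sketch under stated assumptions: the class name
   SubflowMap and its methods are hypothetical, acknowledgements are
   treated as cumulative, and sequence number wrap-around and SACK are
   ignored for brevity.

```python
from collections import deque


class SubflowMap:
    """Tracks which data-level ranges were sent on one subflow."""

    def __init__(self):
        # Entries are (subflow_seq, data_seq, length), kept in send order.
        self._sent = deque()

    def record_send(self, subflow_seq, data_seq, length):
        self._sent.append((subflow_seq, data_seq, length))

    def on_ack(self, ack):
        """Handle a cumulative subflow ACK (next expected subflow octet).

        Returns the (data_seq, length) ranges that are now fully
        acknowledged.
        """
        acked = []
        while self._sent and self._sent[0][0] + self._sent[0][2] <= ack:
            _, data_seq, length = self._sent.popleft()
            acked.append((data_seq, length))
        return acked

    def unacknowledged(self):
        """Data-level ranges that would need retransmission."""
        return [(data_seq, length) for _, data_seq, length in self._sent]


subflow = SubflowMap()
subflow.record_send(subflow_seq=1000, data_seq=5000, length=100)
subflow.record_send(subflow_seq=1100, data_seq=7000, length=50)
newly_acked = subflow.on_ack(1100)   # covers only the first segment
pending = subflow.unacknowledged()   # the second segment is still open
```

   Keeping one such map per subflow lets the sender decide, when a
   subflow stalls, which data-level ranges to resend on another
   subflow.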

                          1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +---------------+---------------+-------------------------------+
     | Kind=OPT_DSN  |    Length     |    Data Sequence Number...    :
     +---------------+---------------+-------------------------------+
     :   ... ( (length-4) octets )   | Data-level Length (2 octets)  |
     +-------------------------------+-------------------------------+

                  Figure 7: Data Sequence Number option

   In addition to the Data Sequence Number, this option also includes a
   Data-level Length field.  The purpose of this field is to assist with
   compatibility with situations where TCP/IP segmentation is undertaken
   separately from the stack that is generating the data flow (e.g.
   through the use of TCP segmentation offloading on network interface
   cards, or by middleboxes).  This field declares what length of data
   this data sequence number is valid for, allowing a receiver to infer
   when it has received sufficient segments.  The primary motivation for
   this behaviour is the understanding that devices involved in
   re-segmentation typically repeat additional TCP options into every
   re-segmented packet.  The use of this length field will make it clear
   when all relevant segments have been received.  (Whether this is the
   optimal solution to this issue is for further study.)

   As a TCP option contains a length field, the length of the Data
   Sequence Number can be declared implicitly.  Although it is expected
   that initial implementations will use 32-bit sequence numbers (i.e. 4
   octets, so a length field of 8), setting the length field to 12 and
   including a 64-bit sequence number (of eight octets) MUST be
   considered valid and processed appropriately.  This may also have
   useful security implications, discussed in Section 6.

   As with the standard TCP sequence number, the data sequence number
   should not start at zero, but at a random value to make session
   hijacking harder.
This is done by including a Data Sequence Number 682 option along with the Multipath Capable option in the initial SYN 683 (which occupies one octet of data sequence space; see Section 4.1). 685 The Data Sequence Number is included in every MPTCP packet that 686 contains data (or a DATA FIN, see Section 4.5), even if only one path 687 is in use, so long as the MPTCP handshake has been completed and the 688 endpoints have therefore agreed to use MPTCP. 690 The MPTCP data and subflow level sequence numbering could be seen to 691 be analogous to that used in SACK, however there are subtle 692 differences. The key similarity is that it is possible to have 693 temporary "holes" in the received data sequence space - later data 694 may have arrived earlier (most likely on a different subflow), but 695 does not need to be retransmitted. The "holes" are later filled in. 696 The key difference, however, is that while SACK can rely on the 697 regular TCP cumulative acknowledgements to indicate how much data has 698 been successfully received (with no holes), there is no similar 699 method in MPTCP. Instead, the sender must keep track of the 700 acknowledgements to derive what data has been successfully received. 701 This leads to some oddities especially with session termination (see 702 Section 4.5). 704 4.4.1. Receive Window Considerations 706 Normal TCP advertises a receive window in each packet, telling the 707 sender how much data the receiver is willing to accept past the 708 cumulative ack. The receive window is used to implement flow 709 control, throttling down fast senders when receivers cannot keep up. 711 MPTCP also uses a unique receive window, shared between the subflows. 712 The idea is to allow any subflow to send data as long as the receiver 713 is willing to accept it; the alternative, maintaining per subflow 714 receive windows, could end-up stalling some subflows while others 715 would not use up their window. 717 4.4.2. 
Congestion Control Considerations

   Different subflows in an MPTCP connection have different congestion
   windows.  To achieve resource pooling [WISCHIK], it is necessary to
   couple the congestion windows in use on each subflow, in order to
   push most traffic to uncongested links.  One algorithm for achieving
   this is presented in Section 5; the algorithm does not achieve
   perfect resource pooling but is "safe" in that it is readily
   deployable in the current Internet.

   It is foreseeable that different congestion controllers will be
   implemented for MPTCP, each aiming to achieve different properties in
   the resource pooling/fairness/stability design space.  Much research
   is expected in this area in the near future.

   Regardless of the algorithm used, the design of the MPTCP protocol
   aims to provide the congestion control implementations with
   sufficient information to make the right decisions; this information
   includes, for each subflow, which packets were lost and when.

4.4.3.  Subflow Policy

   Within a local MPTCP implementation, a host may use any local policy
   it wishes to decide how to share the traffic to be sent over the
   available paths.

   In the typical use case, where the goal is to maximise throughput,
   all available paths will be used simultaneously for data transfer.
   It is expected, however, that other use cases will appear.

   For instance, a possibility is an 'all-or-nothing' approach, i.e.
   have a second path ready for use in the event of failure of the first
   path, but alternatives could include entirely saturating one path
   before using an additional path (the 'overflow' case).  Such choices
   would most likely be based on the monetary cost of links, but may
   also be based on properties such as delay or bandwidth, in cases
   where the additional paths are significantly worse and not worth
   including in the base operation.
   Other metrics such as these could be
   wrapped into an overall "cost" metric for a link.

   The ability to make effective choices at the sender requires full
   knowledge of the path cost, which is unlikely to be the case.  There
   is no mechanism in MPTCP for a receiver to signal its own particular
   preferences for paths, but this is a necessary feature since
   receivers will often be the multihomed party, such as in the case of
   laptop computers with wired and wireless connectivity.  Instead of
   incorporating complex signalling, it is proposed to use existing TCP
   features to signal priority implicitly.  If a receiver wishes to keep
   a path active as a backup but wishes to prevent data being sent on
   that path, this could be achieved by the receiver not sending ACKs
   for any data it receives on that path.  The sender would interpret
   this as severe congestion or a broken path and stop using it.  We do
   not advocate this method, however, since it is brutal, naive, and
   will result in unnecessary retransmissions.

   Therefore, it is proposed to use ECN [RFC3168] to provide fake
   congestion signals on paths that a receiver wishes to stop being used
   for data.  This has the benefit of causing the sender to back off
   without the need to retransmit data unnecessarily, as in the case of
   a lost ACK.  This should be sufficient to allow a receiver to express
   its policy, although it does not permit a rapid increase in
   throughput when switching to such a path.

4.4.4.  Retransmissions

   This protocol specification does not mandate any mechanisms for
   handling retransmissions in the event of path failures, and much will
   be dependent upon local policy (as discussed in Section 4.4.3).
The 785 data sequence number, as given in a TCP option, is used to reassemble 786 the incoming streams before presentation to the application layers, 787 so a sender is free to re-send data with the same data sequence 788 number on a different subflow. When doing this, an endpoint must 789 still retransmit the original data on the original subflow, in order 790 to preserve the subflow integrity (middleboxes could replay old data, 791 and/or could reject holes in subflows), and a receiver will ignore 792 these retransmissions. While this is clearly suboptimal, for 793 compatibility reasons we feel this is the best behaviour. 794 Optimisations could be negotiated in future versions of this 795 protocol. 797 Of course, retransmissions on alternative subflows will only occur if 798 this is what local policy suggests. Indeed, it may be equally valid 799 to retransmit on the same subflow if alternative paths have 800 considerably worse quality of service, or are only kept for backup 801 purposes. Additionally, it may be possible for some implementations 802 to signal from lower layers if there are problems with the paths, and 803 so more appropriate responses could occur. 805 4.5. Closing a Connection 807 Under single path TCP, a FIN signifies that the sender has no more 808 data to send. In order to allow subflows to operate independently, 809 however, and with as little change from regular TCP as possible, a 810 FIN in MPTCP will only affect the subflow on which it is sent. This 811 allows nodes to exercise considerable freedom over which paths are in 812 use at any one time. The semantics of a FIN remain as for regular 813 TCP, i.e. it is not until both sides have ACKed each other's FINs 814 that the subflow is fully closed. 816 When an application calls close() on a socket, this indicates that it 817 has no more data to send, and for regular TCP this would result in a 818 FIN on the connection. 
   For MPTCP, an equivalent mechanism is needed,
   and this is the DATA FIN.  This option, shown in Figure 8, is
   attached to a regular FIN on a subflow.

   A DATA FIN is an indication that the sender has no more data to send,
   and as such can be used as a rapid indication of the end of data from
   a sender.  Therefore, it is an optimisation that allows state
   associated with an MPTCP connection to be cleaned up, especially when
   some subflows may have failed.  Specifically, when a DATA FIN has
   been received, if all data has been successfully received, timeouts
   on all subflows MAY be reduced.  Similarly, when sending a DATA FIN,
   once all data (including the DATA FIN) has been acknowledged, FINs
   must be sent on every subflow.  This applies to both endpoints, and
   is required in order to clean up state in middleboxes.

   There are complex interactions, however, between a DATA FIN and
   subflow properties:

   o  A DATA FIN can only be sent on a packet which also has the FIN
      flag set.

   o  A DATA FIN occupies one octet (the final octet) of Data Sequence
      Number space.  Therefore, even if there is no user data, a Data
      Sequence Number option must be added to a packet containing the
      DATA FIN option.  This allows the receiver to easily determine the
      last data sequence number that should have been received.

   o  There is a one-to-one mapping between the DATA FIN and the
      subflow's FIN flag (and its associated sequence space and thus its
      acknowledgement).  In other words, when a subflow's FIN flag has
      been acknowledged, the associated DATA FIN is also acknowledged.

   o  As such, the acknowledgement of a FIN and DATA FIN does not by
      itself indicate that all data has been successfully received.
      Instead, because the data-level acknowledgement is inferred from
      subflow acknowledgements, the endpoint can tell when all data
      before the DATA FIN has been received.
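   As an illustration of how the data sequence number carried with a
   DATA FIN lets a receiver decide whether anything is still missing,
   the following sketch checks that the received segments cover the
   data sequence space without holes up to the DATA FIN.  The function
   name and the list-of-tuples representation are assumptions for the
   example, and sequence number wrap-around is ignored.

```python
def all_data_received(initial_dsn, data_fin_dsn, received):
    """Return True if every octet before the DATA FIN has arrived.

    initial_dsn:  data sequence number of the first octet of data
    data_fin_dsn: data sequence number occupied by the DATA FIN
    received:     iterable of (data_seq, length) segments, in any order,
                  possibly overlapping (e.g. duplicates across subflows)
    """
    next_expected = initial_dsn
    for start, length in sorted(received):
        if start > next_expected:
            return False  # a hole remains in the data sequence space
        next_expected = max(next_expected, start + length)
    return next_expected >= data_fin_dsn
```

   A receiver using a check of this kind could, for instance, apply the
   reduced subflow timeouts mentioned above only once the function
   returns True.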
855 It should be noted that an endpoint may also send a FIN on an 856 individual subflow to shut it down, but this impact is limited to the 857 subflow in question. If all subflows have been closed with a FIN, 858 that is equivalent to having closed the connection with a DATA FIN. 860 1 861 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 862 +---------------+---------------+ 863 | Kind=OPT_DFIN | Length = 2 | 864 +---------------+---------------+ 866 Figure 8: DATA FIN option 868 4.6. Error Handling 870 TBD 872 Unknown token in MPTCP SYN should equate to an unknown port, e.g. a 873 TCP reset? We should make this as silent and tolerant as possible. 874 Where possible, we should keep this close to the semantics of TCP. 875 The amount of error handling required may also have an impact on the 876 choice of path management schemes. Issues may include odd cases 877 where a data sequence number is missing from a subflow. Will 878 definitely need errors in those cases. 880 5. Congestion Control Coupling for MPTCP 882 Coupling congestion windows can achieve resource pooling, by pushing 883 traffic to underutilized areas of the network. Another effect of 884 coupling is fairness at bottleneck: when two MPTCP flows share a 885 common bottleneck, their combined throughput should not be more than 886 that of a single TCP flow. 888 To achieve perfect resource pooling, one must couple both increase 889 and decrease of congestion windows across subflows. Yet this tends 890 to exhibit flappyness: when the two paths have similar levels of 891 congestion, the controller will tend to allocate all the window to 892 one or the other subflows, and perform random flips between the two 893 equilibrium points. This seems not desirable in general, and is 894 particularly bad when the achieved rates depend on the RTT (as in the 895 current Internet). 897 By only coupling increases we remove flappyness but also reduce the 898 extent of resource pooling the protocol achieves. 
   We now succinctly
   describe our protocol, assuming there are only two subflows (the
   general case is easy to derive, but is more difficult to understand).

   Let v_1 and v_2 be the congestion windows on the two subflows, and
   assume there is always data to send.  Let w = v_1 + v_2.  Let p_i,
   rtt_i be the drop probability and round trip time on path i.

   Our proposed algorithm is as follows:

   o  Increase v_i by a/w for each ack received on subflow i.

   o  Decrease v_i by v_i/2 for each drop on subflow i.

   "a" is a parameter of the algorithm, and we describe next how to pick
   proper values for it.

   This algorithm will allocate window to the two subflows such that
   p_1 * v_1 = p_2 * v_2.  Thus, when the drop probabilities are equal,
   each subflow gets an equal window; when they are different, more
   window will be allocated to the subflow with the lower drop
   probability.

   The total throughput of the algorithm depends on the drop
   probabilities and rtts of the two paths.  We require that the total
   throughput is no worse than the throughput a single TCP would get on
   the fastest path.  If we kept "a" constant regardless of path
   properties, this requirement would be violated.  However, if we
   increase "a" according to the difference in drop probabilities and
   rtts, it is always possible to match the throughput of the best path.

   The second requirement is that none of the subflows should be, on
   their own, more aggressive than a single TCP on the same path.
   Increasing "a" indefinitely, as required above, may create fairness
   issues in some scenarios.  In such cases, the "a" parameter is capped
   on the paths where the increase is too aggressive, and some traffic
   is pushed onto the other paths.
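   The two window update rules above can be sketched as follows.  This
   is a minimal illustration only: the function names are hypothetical,
   windows are held as floating-point packet counts, and the adjustment
   and capping of "a" (which this document does not specify) is left
   out, so "a" is simply passed in by the caller.

```python
def on_ack(windows, i, a):
    """Coupled increase: grow v_i by a / w, where w is the total window."""
    w = sum(windows)
    windows[i] += a / w


def on_drop(windows, i):
    """Uncoupled decrease: shrink v_i by v_i / 2."""
    windows[i] -= windows[i] / 2
```

   Balancing the expected increase per ack (a/w) against the expected
   decrease per ack (p_i * v_i / 2) gives p_i * v_i = 2a/w on each
   subflow, which is consistent with the allocation
   p_1 * v_1 = p_2 * v_2 stated above.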

   It is possible to achieve all this behavior (adjusting and capping
   "a") by only using estimates of the rtts and the current windows for
   the two subflows; explicit estimates of the drop probabilities are
   not needed.

   A full description of the congestion control algorithm is beyond the
   scope of this document.  The algorithm will be thoroughly described
   in a companion document, soon to be released.

6.  Security Considerations

   TBD

   (Token generation, handshake mechanisms, new subflow authentication,
   etc...)

   The development of a TCP extension such as this will bring with it
   many additional security concerns.  We have set out here to produce a
   solution that is "no worse" than current TCP, with the possibility
   that more secure extensions could be proposed later.

   The primary area of concern will be around the handshake to start new
   subflows which join existing connections.  The proposal set out in
   Section 4.1 and Section 4.2 is for the initiator of the new subflow
   to include the token of the other endpoint in the handshake.  The
   purpose of this is to indicate that the sender of this token was the
   same entity that received this token at the initial handshake.

   One area of concern is that the token could be simply brute-forced.
   The token must be hard to guess, and as such could be randomly
   generated.  This may still not be strong enough, however, and so the
   use of 64 bits for the token would alleviate this somewhat.

   Use of these tokens only provides an indication that the token is the
   same as at the initial handshake, and does not say anything about the
   current sender of the token.  Therefore, another approach would be to
   bring a new measure of freshness into the handshake, so instead of
   using the initial token a sender could request a new token from the
   receiver to use in the next handshake.
974 Yet another alternative would be for all SYN packets to include a 975 data sequence number. This could either be used as a passive 976 identifier to indicate an awareness of the current data sequence 977 number (although a reasonable window would have to be allowed for 978 delays). Or, the SYN could form part of the data sequence space - 979 but this would cause issues in the event of lost SYNs (if a new 980 subflow is never established), thus causing unnecessary delays for 981 retransmissions. 983 7. Interactions with Middleboxes 985 TBD 987 How we get around NATs, firewalls. Problems with TCP proxies. How 988 to make an MPTCP-aware middlebox, ... 990 8. Interfaces 992 TBD 993 Interface with applications, interface with TCP, interface with lower 994 layers... 996 9. Acknowledgements 998 The authors are supported by Trilogy 999 (http://www.trilogy-project.org), a research project (ICT-216372) 1000 partially funded by the European Community under its Seventh 1001 Framework Program. The views expressed here are those of the 1002 author(s) only. The European Commission is not liable for any use 1003 that may be made of the information in this document. 1005 The authors gratefully acknowledge significant input into this 1006 document from many members of the Trilogy project, notably Iljitsch 1007 van Beijnum, Lars Eggert, Marcelo Bagnulo Braun, Robert Hancock, Pasi 1008 Sarolahti, Olivier Bonaventure, Toby Moncaster, Philip Eardley and 1009 Andrew McDonald. 1011 10. 
IANA Considerations 1013 This document will make a request to IANA to allocate new values for 1014 TCP Option identifiers, as follows: 1016 +------------+----------------------+---------------+-------+ 1017 | Symbol | Name | Ref | Value | 1018 +------------+----------------------+---------------+-------+ 1019 | OPT_MPC | Multipath Capable | Section 4.1 | (tbc) | 1020 | OPT_ADDR | Add Address | Section 4.3.1 | (tbc) | 1021 | OPT_REMADR | Remove Address | Section 4.3.2 | (tbc) | 1022 | OPT_JOIN | Join Connection | Section 4.2 | (tbc) | 1023 | OPT_DSN | Data Sequence Number | Section 4.4 | (tbc) | 1024 | OPT_DFIN | DATA FIN | Section 4.5 | (tbc) | 1025 +------------+----------------------+---------------+-------+ 1027 Table 1: TCP Options for MPTCP 1029 11. References 1031 11.1. Normative References 1033 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1034 Requirement Levels", BCP 14, RFC 2119, March 1997. 1036 11.2. Informative References 1038 [I-D.eddy-tcp-loo] 1039 Eddy, W. and A. Langley, "Extending the Space Available 1040 for TCP Options", draft-eddy-tcp-loo-04 (work in 1041 progress), July 2008. 1043 [I-D.ietf-shim6-proto] 1044 Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming 1045 Shim Protocol for IPv6", draft-ietf-shim6-proto-12 (work 1046 in progress), February 2009. 1048 [I-D.van-beijnum-1e-mp-tcp-00] 1049 van Beijnum, I., "One-ended Multipath TCP", 1050 draft-van-beijnum-1e-mp-tcp-00 (work in progress), 1051 May 2009. 1053 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 1054 RFC 793, September 1981. 1056 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 1057 Selective Acknowledgment Options", RFC 2018, October 1996. 1059 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 1060 of Explicit Congestion Notification (ECN) to IP", 1061 RFC 3168, September 2001. 1063 [WISCHIK] Wischik, D., Handley, M., and M. Bagnulo Braun, "The 1064 Resource Pooling Principle", ACM SIGCOMM CCR vol. 38 num. 
1065 5, pp. 47-52, October 2008, 1066 . 1068 Appendix A. Functional Separation 1070 [Potential to move to separate architectural document] 1072 This section describes the functional separation that drives the 1073 design of the MPTCP protocol. Its main goal is to separate MPTCP in 1074 two parts that communicate through a well defined interface. We 1075 first provide the motivations for this functional separation, then we 1076 describe in more details the two main components of the MPTCP 1077 architecture. 1079 A.1. Motivations 1081 The major goal behind MPTCP is to send data over different paths in 1082 the same time. This assumes that an MPTCP implementation must be 1083 able to discover and use the multiple paths that connect two given 1084 hosts, when they exist. However, different mechanisms can be 1085 envisioned for multipath discovery and use. Examples are as follows: 1087 Use multiple addresses: This is the method currently proposed in 1088 this document - if hosts are multi-addressed, different address 1089 pairs may take different routes. 1091 Use a path selector value: An end-host might be able to tag packets 1092 with a path selector value, or "colour". If some network nodes 1093 are able to read the colour and use it as a path selector, the 1094 host can influence the outgoing path of the packet. 1096 Next-hop selection: In a network configuration where multiple next- 1097 hops can offer to forward packets, a host may decide to send some 1098 of its packets through one next-hop, and some through another. 1100 The above list is not exhaustive, and could grow as new network 1101 technologies are deployed. 1103 A.2. TCP Performance 1105 In addition to purely sending data over multiple paths, MTCP must do 1106 it in a way that will not affect TCP performance. This raises the 1107 need for an efficient multipath congestion control algorithm. 
While 1108 this specification does not mandate the use of any particular 1109 algorithm for congestion control, it ensures that the protocol is 1110 designed in such a way that any CC algorithm can be designed, 1111 independently of the particular path management mechanism available 1112 to the host. Consequently our architecture for MTCP decouples the 1113 policy from the mechanism. The policy is the decision of what path 1114 to use for each packet to send. It is mainly driven by the 1115 implementation-dependent congestion control algorithm. The mechanism 1116 is the technology used to ensure that a packet will be sent on the 1117 desired path. This separation is intended to be relatively future- 1118 proof by allowing these components to evolve at different speeds. 1120 A.3. Architecture overview 1121 Control plane <-- | --> Data plane 1122 +---------------------------------------------------------------+ 1123 | Multipath Scheduler (MPS) | 1124 +---------------------------------------------------------------+ 1125 ^ | | 1126 | | | 1127 |Announcing new | +-------------+ 1128 |paths. (referred | | Data packet |<--Path idx:3 1129 |to as path indices) | +-------------+ attached 1130 | | | by MPS 1131 | | V 1132 +--------------------------------------------\------------------+ 1133 | Path Manager (PM) \__________zzzzz | 1134 +--------------------------------------------------------\------+ 1135 / \ | \ 1136 /---------------------\ | /"\ /"\ /"\ 1137 | Path key Action | | | | | | | | 1138 | 1 xxxxx | | | | | | | | 1139 | 2 yyyyy | | \./ \./ \./ 1140 | 3 zzzzz | | path1 path2 path3 1141 +---------------------+ 1143 Figure 9: Overview of MTCP architecture 1145 A general overview of the architecture is provided in Figure 9. The 1146 Multipath Scheduler (MPS) learns about the number of available paths 1147 through notifications received from the Path Manager (PM). From the 1148 point of view of the Multipath Scheduler, a path is just a number, 1149 called a Path Index. 
   Notifications from the PM to the MPS MAY
   contain supporting information about the paths, if relevant, so that
   the MPS can make more intelligent decisions about where to route
   traffic.  When the Multipath Scheduler initiates a communication to a
   new host, it can only send the packets to the default path.  But
   since the Path Manager is layered below the MPS, it can detect that a
   new communication is happening, and tell the MPS about the other
   paths it knows about.

   From then on, it is possible for the MPS to attach a Path Index to
   the control structure of its packets (internal to the MTCP
   implementation), so that the Path Manager can map this Path Index to
   the corresponding action (see the table in the lower left part of
   Figure 9).  The particular action depends on the network mechanism
   used to select a path.  Examples are address rewriting, tunnelling or
   setting a path selector value inside the packet.

   The applicability of the architecture is not limited to the MTCP
   protocol.  While we define in this document an MTCP MPS (MTCP
   Multipath Scheduler), other Multipath Schedulers can be defined.  For
   example, if an appropriate socket interface is designed, applications
   could behave as a Multipath Scheduler and decide where to send any
   particular data.  In this document we concentrate on the MTCP case,
   however.

   In this specification, we define the core protocol for Multipath TCP.
   The core protocol is not dependent on the Path Management technique
   that is chosen, and MUST be implemented in any MTCP MPS.  We also
   provide a default Path Manager that is based on declaring IP
   addresses, and carries control information in TCP options.  An
   implementation of Multipath TCP can use any Path Manager, but it MUST
   be able to fall back to the default PM in case the other end does not
   support the custom PM.
   Alternative Path Managers may be specified in
   separate documents in the future.

A.4.  PM/MPS interface

   The minimal set of requirements for a Path Manager is as follows:

   o  Outgoing untagged packets: Any outgoing packet flowing through the
      Path Manager is either tagged or untagged (by the MPS) with a path
      index.  If it is untagged, the packet is sent normally to the
      Internet, as if no multi-path support were present.  Untagged
      packets can be used to trigger a path discovery procedure, that
      is, a Path Manager can listen to untagged packets and decide at
      some time to find out whether any path other than the default one
      is usable for the corresponding host pair.  Note that any other
      criteria could be used to decide when to start discovering
      available paths.  Note also that MPS scheduling will not be
      possible until the Path Manager has notified the available paths.
      The PM is thus the first entity coming into action.

   o  Outgoing tagged packets: The Path Manager maintains a table
      mapping path indices to actions.  The action is the operation that
      allows using a particular path.  Examples of possible actions are
      route selection, interface selection or packet transformation.
      When the PM sees a packet tagged with a path index, it looks up
      its table to find the appropriate action for that packet.  The tag
      is purely local.  It is removed before the packet is transmitted.

   o  Incoming packets: A Path Manager MUST ensure that each incoming
      path is mapped unambiguously to exactly one outgoing path.  Note
      that this requirement implies that the same number of incoming and
      outgoing paths must be established.  Moreover, a PM MUST tag any
      incoming path with the same Path Index as the one used for the
      corresponding outgoing path.  This is necessary for MTCP to know
      what outgoing path is acknowledged by an incoming packet.

   o  Module interface: A PM MUST be able to notify the MPS about the
      number of available paths.  Such notifications MUST contain the
      path indices that are legal for use by the MPS.  In case the PM
      decides to stop providing service for one path, it MUST notify the
      MPS about path deletion.  Additionally, a PM MAY provide
      complementary path information when available, such as link
      quality or preference level.

Appendix B.  Notes on use of TCP Options

   The TCP option space is limited due to the length of the Data Offset
   field in the TCP header (4 bits), which defines the TCP header length
   in 32-bit words.  With the standard TCP header being 20 bytes, this
   leaves a maximum of 40 bytes for options, and many of these may
   already be used by options such as timestamp and SACK.

   As such, when doing address list manipulation, not all data may fit.
   This can be mitigated in one of two ways:

   o  Using an option to extend the option space, such as that in
      [I-D.eddy-tcp-loo], which proposes an option providing a 16-bit
      header length field.  Such an option could only be used between
      nodes that support it, however, and so long options could not be
      used until a handshake is complete.

   o  Alternatively, since at least one IP address option field should
      be able to fit per packet, address list manipulation can be
      undertaken with one address per packet.  One method could be to
      wait for data to send, and then append one new address per packet.
      This would seem reasonable if the TCP session begins rapidly, but
      if it is required that the multipath session is ready before the
      first data is to be sent, address list manipulation would be
      required on empty data (signalling only) packets.  Issues may
      arise regarding acknowledged delivery of signalling versus data -
      this is discussed in Section 3 below.

Appendix C.
Resync Packet 1255 In earlier versions of this draft, we proposed the use of a "re-sync" 1256 option that would be used in certain circumstances when a sender 1257 needs to instruct the receiver to skip over certain subflow sequence 1258 numbers (i.e. to treat the specified sequence space as having been 1259 received and acknowledged). 1261 The typical use of this option will be when packets are retransmitted 1262 on different subflows, after failing to be acknowledged on the 1263 original subflow. In such a case, it becomes necessary to move 1264 forward the original subflow's sequence numbering so as not to later 1265 transmit different data with a previously used sequence number (i.e. 1266 when more data comes to be transmitted on the original subflow, it 1267 would be different data, and so must not be sent with previously-used 1268 (but unacknowledged) sequence numbering). 1270 The rationale for needing to do this is two-fold: firstly, when ACKs 1271 are received they are for the subflow only, and the sender infers 1272 from this the data that was sent - if the same sequence space could 1273 be occupied by different data, the sender won't know whether the 1274 intended data was received. Secondly, certain classes of middleboxes 1275 may cache data and not send the new data on a previously-seen 1276 sequence number. 1278 This option was dropped, however, since some middleboxes may get 1279 confused when they meet a hole in the sequence space, and do not 1280 understand the resync option. It is therefore felt that the same 1281 data must continue to be retransmitted on a subflow even if it is 1282 already received after being retransmitted on another. There should 1283 not be a significant performance hit from this since the amount of 1284 data involved and needing to be retransmitted multiple times will be 1285 relatively small. 

   The earlier proposal therefore 're-synced' the expected sequence
   numbering at the receiving end of a subflow, using the following TCP
   option. The option declares a sequence number space (inclusive)
   which the receiving node should skip over, i.e. if the receiver's
   next expected sequence number was previously within the range
   start_seq_num to end_seq_num, move it forward to end_seq_num + 1.

   This option will be used on the first new packet on the subflow that
   needs its sequence numbering re-synchronised. It will continue to
   be included on every packet sent on this subflow until a packet
   containing this option has been acknowledged (i.e. if subflow
   acknowledgements exist for packets beyond the end sequence number).
   If the end sequence number is earlier than the current expected
   sequence number (i.e. if a resync packet has already been received),
   this option should be ignored.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+------------------------------+
   |Kind=OPT_RESYNC|  Length = 10  |    Start Sequence Number     :
   +---------------+---------------+------------------------------+
   :          (4 octets)           |     End Sequence Number      :
   +---------------+---------------+------------------------------+
   :          (4 octets)           |
   +-------------------------------+

                      Figure 10: Resync option

Authors' Addresses

   Alan Ford
   Roke Manor Research
   Old Salisbury Lane
   Romsey, Hampshire SO51 0ZN
   UK

   Phone: +44 1794 833 465
   Email: alan.ford@roke.co.uk

   Costin Raiciu
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: c.raiciu@cs.ucl.ac.uk

   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Sebastien Barre
   Universite catholique de Louvain
   Pl. Ste Barbe, 2
   Louvain-la-Neuve 1348
   Belgium

   Phone: +32 10 47 91 03
   Email: sebastien.barre@uclouvain.be