idnits 2.17.1

draft-ietf-mptcp-multiaddressed-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 22, 2012) is 4204 days in the past. Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: 'Data ACK' on line 474

  -- Looks like a reference, but probably isn't: 'Checksum' on line 475

  -- Looks like a reference, but probably isn't: 'Data FIN' on line 503

  -- Looks like a reference, but probably isn't: 'DFIN' on line 2871

  ** Obsolete normative reference: RFC 793 (ref. '1') (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-07) exists of
     draft-ietf-mptcp-api-05

  -- Obsolete informational reference (is this intentional?): RFC 1323
     (ref. '15') (Obsoleted by RFC 7323)

  -- Obsolete informational reference (is this intentional?): RFC 5226
     (ref. '24') (Obsoleted by RFC 8126)

     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Internet Engineering Task Force                                  A. Ford
3    Internet-Draft                                                     Cisco
4    Intended status: Experimental                                  C. Raiciu
5    Expires: April 25, 2013                        University Politehnica of
6                                                                   Bucharest
7                                                                  M. Handley
8                                                   University College London
9                                                              O. Bonaventure
10                                                   Universite catholique de
11                                                                    Louvain
12                                                           October 22, 2012

14        TCP Extensions for Multipath Operation with Multiple Addresses
15                      draft-ietf-mptcp-multiaddressed-12

17   Abstract

19      TCP/IP communication is currently restricted to a single path per
20      connection, yet multiple paths often exist between peers. The
21      simultaneous use of these multiple paths for a TCP/IP session would
22      improve resource usage within the network, and thus improve user
23      experience through higher throughput and improved resilience to
24      network failure.

26      Multipath TCP provides the ability to simultaneously use multiple
27      paths between peers. This document presents a set of extensions to
28      traditional TCP to support multipath operation. The protocol offers
29      the same type of service to applications as TCP (i.e. reliable
30      bytestream), and provides the components necessary to establish and
31      use multiple TCP flows across potentially disjoint paths.

33   Status of this Memo

35      This Internet-Draft is submitted in full conformance with the
36      provisions of BCP 78 and BCP 79.

38      Internet-Drafts are working documents of the Internet Engineering
39      Task Force (IETF). Note that other groups may also distribute
40      working documents as Internet-Drafts. The list of current Internet-
41      Drafts is at http://datatracker.ietf.org/drafts/current/.

43      Internet-Drafts are draft documents valid for a maximum of six months
44      and may be updated, replaced, or obsoleted by other documents at any
45      time. It is inappropriate to use Internet-Drafts as reference
46      material or to cite them other than as "work in progress."

48      This Internet-Draft will expire on April 25, 2013.

50   Copyright Notice

52      Copyright (c) 2012 IETF Trust and the persons identified as the
53      document authors. All rights reserved.
55      This document is subject to BCP 78 and the IETF Trust's Legal
56      Provisions Relating to IETF Documents
57      (http://trustee.ietf.org/license-info) in effect on the date of
58      publication of this document. Please review these documents
59      carefully, as they describe your rights and restrictions with respect
60      to this document. Code Components extracted from this document must
61      include Simplified BSD License text as described in Section 4.e of
62      the Trust Legal Provisions and are provided without warranty as
63      described in the Simplified BSD License.

65   Table of Contents

67      1. Introduction
68        1.1. Design Assumptions
69        1.2. Multipath TCP in the Networking Stack
70        1.3. Terminology
71        1.4. MPTCP Concept
72        1.5. Requirements Language
73      2. Operation Overview
74        2.1. Initiating an MPTCP connection
75        2.2. Associating a new subflow with an existing MPTCP
76             connection
77        2.3. Informing the other Host about another potential
78             address
79        2.4. Data transfer using MPTCP
80        2.5. Requesting a change in a path's priority
81        2.6. Closing an MPTCP connection
82        2.7. Notable features
83      3. MPTCP Protocol
84        3.1. Connection Initiation
85        3.2. Starting a New Subflow
86        3.3. General MPTCP Operation
87          3.3.1. Data Sequence Mapping
88          3.3.2. Data Acknowledgments
89          3.3.3. Closing a Connection
90          3.3.4. Receiver Considerations
91          3.3.5. Sender Considerations
92          3.3.6. Reliability and Retransmissions
93          3.3.7. Congestion Control Considerations
94          3.3.8. Subflow Policy
95        3.4. Address Knowledge Exchange (Path Management)
96          3.4.1. Address Advertisement
97          3.4.2. Remove Address
98        3.5. Fast Close
99        3.6. Fallback
100       3.7. Error Handling
101       3.8. Heuristics
102         3.8.1. Port Usage
103         3.8.2. Delayed Subflow Start
104         3.8.3. Failure Handling
105     4. Semantic Issues
106     5. Security Considerations
107     6. Interactions with Middleboxes
108     7. Acknowledgments
109     8. IANA Considerations
110     9. References
111       9.1. Normative References
112       9.2. Informative References
113     Appendix A. Notes on use of TCP Options
114     Appendix B. Control Blocks
115       B.1. MPTCP Control Block
116         B.1.1. Authentication and Metadata
117         B.1.2. Sending Side
118         B.1.3. Receiving Side
119       B.2. TCP Control Blocks
120         B.2.1. Sending Side
121         B.2.2. Receiving Side
122     Appendix C. Finite State Machine
123     Authors' Addresses

125  1. Introduction

127     MPTCP is a set of extensions to regular TCP [1] to provide a
128     Multipath TCP [2] service, which enables a transport connection to
129     operate across multiple paths simultaneously. This document presents
130     the protocol changes required to add multipath capability to TCP;
131     specifically, those for signaling and setting up multiple paths
132     ("subflows"), managing these subflows, reassembly of data, and
133     termination of sessions. This is not the only information required
134     to create a Multipath TCP implementation, however. This document is
135     complemented by three others:

137     o Architecture [2], which explains the motivations behind Multipath
138       TCP, contains a discussion of high-level design decisions on which
139       this design is based, and an explanation of a functional
140       separation through which an extensible MPTCP implementation can be
141       developed.

143     o Congestion Control [5], presenting a safe congestion control
144       algorithm for coupling the behaviour of the multiple paths in
145       order to "do no harm" to other network users.

147     o Application Considerations [6], discussing what impact MPTCP will
148       have on applications, what applications will want to do with
149       MPTCP, and as a consequence of these factors, what API extensions
150       an MPTCP implementation should present.

152  1.1. Design Assumptions

154     In order to limit the potentially huge design space, the working
155     group imposed two key constraints on the multipath TCP design
156     presented in this document:

158     o It must be backwards-compatible with current, regular TCP, to
159       increase its chances of deployment

161     o It can be assumed that one or both hosts are multihomed and
162       multiaddressed

164     To simplify the design we assume that the presence of multiple
165     addresses at a host is sufficient to indicate the existence of
166     multiple paths. These paths need not be entirely disjoint: they may
167     share one or many routers between them. Even in such a situation
168     making use of multiple paths is beneficial, improving resource
169     utilisation and resilience to a subset of node failures. The
170     congestion control algorithms defined in [5] ensure this does not act
171     detrimentally. Furthermore, there may be some scenarios where
172     different TCP ports on a single host can provide disjoint paths (such
173     as through certain ECMP implementations [7]), and so the MPTCP design
174     also supports the use of ports in path identifiers.

176     There are three aspects to the backwards-compatibility listed above
177     (discussed in more detail in [2]):

179     External Constraints: The protocol must function through the vast
180        majority of existing middleboxes such as NATs, firewalls and
181        proxies, and as such must resemble existing TCP as far as possible
182        on the wire. Furthermore, the protocol must not assume the
183        segments it sends on the wire arrive unmodified at the
184        destination: they may be split or coalesced; TCP options may be
185        removed or duplicated.

187     Application Constraints: The protocol must be usable with no change
188        to existing applications that use the common TCP API (although it
189        is reasonable that not all features would be available to such
190        legacy applications). Furthermore, the protocol must provide the
191        same service model as regular TCP to the application.
193     Fall-back: The protocol should be able to fall back to standard TCP
194        with no interference from the user, to be able to communicate with
195        legacy hosts.

197     The complementary application considerations document [6] discusses
198     the necessary features of an API to provide backwards-compatibility,
199     as well as API extensions to convey the behaviour of MPTCP at a level
200     of control and information equivalent to that available with regular,
201     single-path TCP.

203     Further discussion of the design constraints and associated design
204     decisions is given in the MPTCP Architecture document [2].

206  1.2. Multipath TCP in the Networking Stack

208     MPTCP operates at the transport layer and aims to be transparent to
209     both higher and lower layers. It is a set of additional features on
210     top of standard TCP; Figure 1 illustrates this layering. MPTCP is
211     designed to be usable by legacy applications with no changes;
212     detailed discussion of its interactions with applications is given in
213     [6].

215                                  +-------------------------------+
216                                  |          Application          |
217     +---------------+            +-------------------------------+
218     |  Application  |            |             MPTCP             |
219     +---------------+            + - - - - - - - + - - - - - - - +
220     |      TCP      |            | Subflow (TCP) | Subflow (TCP) |
221     +---------------+            +-------------------------------+
222     |      IP       |            |       IP      |      IP       |
223     +---------------+            +-------------------------------+

225       Figure 1: Comparison of Standard TCP and MPTCP Protocol Stacks

227  1.3. Terminology

229     This document makes use of a number of terms which are either MPTCP-
230     specific, or have defined meaning in the context of MPTCP, as
231     follows:

233     Path: A sequence of links between a sender and a receiver, defined
234        in this context by a 4-tuple of source and destination address/
235        port pairs.

237     Subflow: A flow of TCP segments operating over an individual path,
238        which forms part of a larger MPTCP connection. A subflow is
239        started and terminated similarly to a regular TCP connection.
241     (MPTCP) Connection: A set of one or more subflows, over which an
242        application can communicate between two hosts. There is a one-to-
243        one mapping between a connection and an application socket.

245     Data-level: The payload data is nominally transferred over a
246        connection, which in turn is transported over subflows. Thus the
247        term "data-level" is synonymous with "connection level", in
248        contrast to "subflow-level" which refers to properties of an
249        individual subflow.

251     Token: A locally unique identifier given to a multipath connection
252        by a host. May also be referred to as a "Connection ID".

254     Host: An end host operating an MPTCP implementation, and either
255        initiating or accepting an MPTCP connection.

257     In addition to these terms, note that MPTCP's interpretation of, and
258     effect on, regular single-path TCP semantics are discussed in
259     Section 4.

261  1.4. MPTCP Concept

263     This section provides a high-level summary of normal operation of
264     MPTCP, and is illustrated by the scenario shown in Figure 2. A
265     detailed description of operation is given in Section 3.

267     o To a non-MPTCP-aware application, MPTCP will behave the same as
268       normal TCP. Extended APIs could provide additional control to
269       MPTCP-aware applications [6]. An application begins by opening a
270       TCP socket in the normal way. MPTCP signaling and operation are
271       handled by the MPTCP implementation.

273     o An MPTCP connection begins similarly to a regular TCP connection.
274       This is illustrated in Figure 2 where an MPTCP connection is
275       established between addresses A1 and B1 on Hosts A and B
276       respectively.

278     o If extra paths are available, additional TCP sessions (termed
279       MPTCP "subflows") are created on these paths, and are combined
280       with the existing session, which continues to appear as a single
281       connection to the applications at both ends. The creation of the
282       additional TCP session is illustrated between Address A2 on Host A
283       and Address B1 on Host B.

285     o MPTCP identifies multiple paths by the presence of multiple
286       addresses at hosts. Combinations of these multiple addresses
287       equate to the additional paths. In the example, other potential
288       paths that could be set up are A1<->B2 and A2<->B2. Although this
289       additional session is shown as being initiated from A2, it could
290       equally have been initiated from B1.

292     o The discovery and setup of additional subflows will be achieved
293       through a path management method; this document describes a
294       mechanism by which a host can initiate new subflows by using its
295       own additional addresses, or by signaling its available addresses
296       to the other host.

298     o MPTCP adds connection-level sequence numbers to allow the
299       reassembly of segments arriving on multiple subflows with
300       differing network delays.

302     o Subflows are terminated as regular TCP connections, with a
303       four-way FIN handshake. The MPTCP connection is terminated by a
304       connection-level FIN.

306                 Host A                               Host B
307        ------------------------             ------------------------
308        Address A1    Address A2             Address B1    Address B2
309        ----------    ----------             ----------    ----------
310            |             |                      |             |
311            |     (initial connection setup)     |             |
312            |----------------------------------->|             |
313            |<-----------------------------------|             |
314            |             |                      |             |
315            |            (additional subflow setup)            |
316            |             |--------------------->|             |
317            |             |<---------------------|             |
318            |             |                      |             |
319            |             |                      |             |

321                   Figure 2: Example MPTCP Usage Scenario

323  1.5. Requirements Language

325     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
326     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
327     document are to be interpreted as described in RFC 2119 [3].

329  2. Operation Overview

331     This section presents a single description of common MPTCP operation,
332     with reference to the protocol operation. This is a high-level
333     overview of the key functions; the full specification follows in
334     Section 3. Extensibility and negotiated features are not discussed
335     here. Considerable reference is made to symbolic names of MPTCP
336     options throughout this section - these are subtypes of the IANA-
337     assigned MPTCP option (see Section 8), and their formats are defined
338     in the detailed protocol specification which follows in Section 3.

340     A Multipath TCP connection provides a bidirectional bytestream
341     between two hosts communicating like normal TCP and thus does not
342     require any change to the applications. However, Multipath TCP
343     enables the hosts to use different paths with different IP addresses
344     to exchange packets belonging to the MPTCP connection. A Multipath
345     TCP connection appears like a normal TCP connection to an
346     application. However, to the network layer each MPTCP subflow looks
347     like a regular TCP flow whose segments carry a new TCP option type.
348     Multipath TCP manages the creation, removal and utilization of these
349     subflows to send data. The number of subflows that are managed
350     within a Multipath TCP connection is not fixed and it can fluctuate
351     during the lifetime of the Multipath TCP connection.

353     All MPTCP operations are signaled with a TCP option - a single
354     numerical type for MPTCP, with "sub-types" for each MPTCP message.
355     What follows is a summary of the purpose and rationale of these
356     messages.

358  2.1. Initiating an MPTCP connection

360     This is the same signaling as for initiating a normal TCP connection,
361     but the SYN, SYN/ACK and ACK packets also carry the MP_CAPABLE
362     option. This is variable-length and serves multiple purposes.
363     Firstly, it verifies whether the remote host supports Multipath TCP;
364     and secondly, this option allows the hosts to exchange some
365     information to authenticate the establishment of additional subflows.
366     Further details are given in Section 3.1.
368        Host-A                                  Host-B
369        ------                                  ------
370        MP_CAPABLE            ->
371        [A's key, flags]
372                              <-                MP_CAPABLE
373                                                [B's key, flags]
374        ACK + MP_CAPABLE      ->
375        [A's key, B's key, flags]

377  2.2. Associating a new subflow with an existing MPTCP connection

379     The exchange of keys in the MP_CAPABLE handshake provides material
380     that can be used to authenticate the endpoints when new subflows are
381     set up. Additional subflows begin in the same way as initiating a
382     normal TCP connection, but the SYN, SYN/ACK and ACK packets also
383     carry the MP_JOIN option.

385     Host-A initiates a new subflow between one of its addresses and one
386     of Host-B's addresses. The token - generated from the key - is used
387     to identify which MPTCP connection it is joining, and the HMAC is
388     used for authentication. The HMAC uses the keys exchanged in the
389     MP_CAPABLE handshake, and the random numbers (nonces) exchanged in
390     these MP_JOIN options. MP_JOIN also contains flags and an Address ID
391     that can be used to refer to the source address without the sender
392     needing to know if it has been changed by a NAT. Further details are
393     given in Section 3.2.

395        Host-A                                  Host-B
396        ------                                  ------
397        MP_JOIN               ->
398        [B's token, A's nonce,
399         A's Address ID, flags]
400                              <-                MP_JOIN
401                                                [B's HMAC, B's nonce,
402                                                 B's Address ID, flags]
403        ACK + MP_JOIN         ->
404        [A's HMAC]

406                              <-                ACK

408  2.3. Informing the other Host about another potential address

410     The set of IP addresses associated with a multihomed host may change
411     during the lifetime of an MPTCP connection. MPTCP supports the
412     addition and removal of addresses on a host both implicitly and
413     explicitly. If Host-A has established a subflow starting at address
414     IP#-A1 and wants to open a second subflow starting at address IP#-A2,
415     it simply initiates the establishment of the subflow as explained
416     above. The remote host will then be implicitly informed about the
417     new address.
419     In some circumstances, a host may want to advertise to the remote
420     host the availability of an address without establishing a new
421     subflow, for example when a NAT prevents setup in one direction. In
422     the example below, Host-A informs Host-B about its alternative IP
423     address (IP#-A2). Host-B may later send an MP_JOIN to this new
424     address. Due to the presence of middleboxes that may translate IP
425     addresses, this option uses an address identifier to unambiguously
426     identify an address on a host. Further details are given in
     Section 3.4.1.

428        Host-A                                  Host-B
429        ------                                  ------
430        ADD_ADDR              ->
431        [IP#-A2,
432         IP#-A2's Address ID]

434     There is a corresponding signal for address removal, making use of
435     the Address ID that is signalled in the add address handshake.
436     Further details are given in Section 3.4.2.

438        Host-A                                  Host-B
439        ------                                  ------
440        REMOVE_ADDR           ->
441        [IP#-A2's Address ID]

443  2.4. Data transfer using MPTCP

445     To ensure reliable, in-order delivery of data over subflows that may
446     appear and disappear at any time, MPTCP uses a 64-bit Data Sequence
447     Number (DSN) to number all data sent over the MPTCP connection. Each
448     subflow has its own 32-bit sequence number space and an MPTCP option
449     maps the subflow sequence space to the data sequence space. In this
450     way, data can be retransmitted on different subflows (mapped to the
451     same DSN) in the event of failure.

453     The "Data Sequence Signal" carries the "Data Sequence Mapping". The
454     Data Sequence Mapping consists of the subflow sequence number, data
455     sequence number, and length for which this mapping is valid. This
456     option can also carry a connection-level acknowledgement (the "Data
457     ACK") for the received DSN.

459     With MPTCP, all subflows share the same receive buffer and advertise
460     the same receive window. There are two levels of acknowledgement in
461     MPTCP. Regular TCP acknowledgments are used on each subflow to
462     acknowledge the reception of the segments sent over the subflow
463     independently of their DSN. In addition, there are connection-level
464     acknowledgments for the data sequence space. These acknowledgments
465     track the advancement of the bytestream and slide the receiving
466     window.

468     Further details are in Section 3.3.

470        Host-A                                  Host-B
471        ------                                  ------
472        DATA_SEQUENCE_SIGNAL  ->
473        [Data Sequence Mapping]
474        [Data ACK]
475        [Checksum]

477  2.5. Requesting a change in a path's priority

479     Hosts can indicate at initial subflow setup whether they wish the
480     subflow to be used as a regular or backup path - a backup path being
481     only used if there are no regular paths available. During a
482     connection, Host-A can request a change in the priority of a subflow
483     through the MP_PRIO signal to Host-B. Further details are given in
484     Section 3.3.8.

486        Host-A                                  Host-B
487        ------                                  ------
488        MP_PRIO               ->

490  2.6. Closing an MPTCP connection

492     When Host-A wants to inform Host-B that it has no more data to send,
493     it signals this "Data FIN" as part of the Data Sequence Signal (see
494     above). It has the same semantics and behaviour as a regular TCP
495     FIN, but at the connection level. Once all the data on the MPTCP
496     connection has been successfully received, then this message is
497     acknowledged at the connection level with a DATA_ACK. Further
498     details are given in Section 3.3.3.

500        Host-A                                  Host-B
501        ------                                  ------
502        DATA_SEQUENCE_SIGNAL  ->
503        [Data FIN]

505                              <-                (MPTCP DATA_ACK)

507  2.7. Notable features

509     It is worth highlighting that MPTCP's signaling has been designed
510     with several key requirements in mind:

512     o To cope with NATs on the path, addresses are referred to by
513       Address IDs, in case the IP packet's source address gets changed
514       by a NAT. Setting up a new TCP flow is not possible if the
515       passive opener is behind a NAT; to allow subflows to be created
516       when either end is behind a NAT, MPTCP uses the ADD_ADDR message.

518     o MPTCP falls back to ordinary TCP if MPTCP operation is not
519       possible, for example if one host is not MPTCP capable or if a
520       middlebox alters the payload.

522     o To meet the threats identified in [8], the following steps are
523       taken: keys are sent in the clear in the MP_CAPABLE messages;
524       MP_JOIN messages are secured with HMAC-SHA1 ([9], [4]) using those
525       keys; and standard TCP validity checks are made on the other
526       messages (ensuring sequence numbers are in-window).

528  3. MPTCP Protocol

530     This section describes the operation of the MPTCP protocol, and is
531     subdivided into sections for each key part of the protocol operation.

533     All MPTCP operations are signalled using optional TCP header fields.
534     A single TCP option number ("Kind") will be assigned by IANA for
535     MPTCP (see Section 8), and then individual messages will be
536     determined by a "sub-type", the values of which will also be stored
537     in an IANA registry (and are also listed in Section 8).

539     Throughout this document, when reference is made to an MPTCP option
540     by symbolic name, such as "MP_CAPABLE", this refers to a TCP option
541     with the single MPTCP option type, and with the sub-type value of the
542     symbolic name as defined in Section 8. This sub-type is a four-bit
543     field - the first four bits of the option payload, as shown in
544     Figure 3. The MPTCP messages are defined in the following sections.
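
To make the option layout concrete, the following Python sketch splits one raw MPTCP option into its Kind, Length, four-bit sub-type and remaining data, following the description above and Figure 3. It is only an illustration under stated assumptions: the names `parse_mptcp_option` and `MptcpOption` are hypothetical, and the Kind value 30 is not defined by this draft (which leaves the assignment to IANA, see Section 8); it is the value that was assigned when the protocol was later published as RFC 6824.

```python
from typing import NamedTuple

# Assumption: 30 is the TCP option Kind assigned to MPTCP after this draft
# was written; the draft itself only says IANA will assign a value.
MPTCP_KIND = 30


class MptcpOption(NamedTuple):
    kind: int
    length: int
    subtype: int    # first four bits of the option payload
    low_bits: int   # remaining four bits of that octet (meaning depends on sub-type)
    data: bytes     # rest of the sub-type-specific data


def parse_mptcp_option(raw: bytes) -> MptcpOption:
    """Split one complete TCP option (Kind, Length, payload) carrying MPTCP signaling."""
    if len(raw) < 3:
        raise ValueError("need at least Kind, Length and the sub-type octet")
    kind, length = raw[0], raw[1]
    if kind != MPTCP_KIND:
        raise ValueError(f"not an MPTCP option (Kind={kind})")
    if length != len(raw):
        raise ValueError("Length field does not match the option size")
    subtype = raw[2] >> 4       # high nibble of the first payload octet
    low_bits = raw[2] & 0x0F    # low nibble of the same octet
    return MptcpOption(kind, length, subtype, low_bits, raw[3:])
```

A receiver would typically dispatch on `subtype` after this step; only the value 0 (MP_CAPABLE) is stated in this part of the document, so no sub-type table is included here.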
546                          1                   2                   3
547      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
548     +---------------+---------------+-------+-----------------------+
549     |     Kind      |    Length     |Subtype|                       |
550     +---------------+---------------+-------+                       |
551     |                     Subtype-specific data                     |
552     |                       (variable length)                       |
553     +---------------------------------------------------------------+

555                       Figure 3: MPTCP option format

557     Those MPTCP options associated with subflow initiation are used on
558     packets with the SYN flag set. Additionally, there is one MPTCP
559     option for signaling metadata to ensure segmented data can be
560     recombined for delivery to the application.

562     The remaining options, however, are signals that do not need to be on
563     a specific packet, such as those for signaling additional addresses.
564     Whilst an implementation may desire to send MPTCP options as soon as
565     possible, it may not be possible to combine all desired options (both
566     those for MPTCP and for regular TCP, such as SACK [10]) on a single
567     packet. Therefore, an implementation may choose to send duplicate
568     ACKs containing the additional signaling information. This changes
569     the semantics of a duplicate ACK, which in regular TCP is usually
570     only sent as a signal of a lost segment [11]. Therefore, an MPTCP
571     implementation receiving a duplicate ACK which contains an MPTCP
572     option MUST NOT treat it as a signal of congestion. Additionally, an
573     MPTCP implementation SHOULD NOT send more than two duplicate ACKs in
574     a row for the purposes of sending MPTCP options alone, in order to
575     ensure no middleboxes misinterpret this as a sign of congestion.

577     Furthermore, standard TCP validity checks (such as ensuring the
578     Sequence Number and Acknowledgement Number are within window) MUST be
579     undertaken before processing any MPTCP signals, as described in [12].

581  3.1. Connection Initiation

583     Connection Initiation begins with a SYN, SYN/ACK, ACK exchange on a
584     single path. Each packet contains the Multipath Capable (MP_CAPABLE)
585     TCP option (Figure 4). This option declares its sender is capable of
586     performing multipath TCP and wishes to do so on this particular
587     connection.

589     This option is used to declare the 64-bit key which the sender has
590     generated for this MPTCP connection. This key is used to
591     authenticate the addition of future subflows to this connection.
592     This is the only time the key will be sent in the clear on the wire
593     (unless "fast close", Section 3.5, is used); all future subflows will
594     identify the connection using a 32-bit "token". This token is a
595     cryptographic hash of this key. The algorithm for this process is
596     dependent on the authentication algorithm selected; the method of
597     selection is defined later in this section.

599     This key is generated by its sender, and its method of generation is
600     implementation-specific. The key MUST be hard to guess, and it MUST
601     be unique for the sending host at any one time. Recommendations for
602     generating random numbers for use in keys are given in [13].
603     Connections will be indexed at each host by the token (a one-way hash
604     of the key). Therefore, an implementation will require a mapping
605     from each token to the corresponding connection, and in turn to the
606     keys for the connection.

608     There is a risk that two different keys will hash to the same token.
609     The risk of hash collisions is usually small, unless the host is
610     handling many tens of thousands of connections. Therefore, an
611     implementation SHOULD check its list of connection tokens to ensure
612     there is not a collision before sending its key in the SYN/ACK. This
613     would, however, be costly for a server with thousands of connections.
614     In any case, the subflow handshake mechanism (Section 3.2) will
615     ensure that new subflows only join the correct connection, through
616     the cryptographic handshake, by checking the connection tokens in
617     both directions, and by ensuring sequence numbers are in-window. In
618     the worst case, if there were a token collision, the new subflow
619     would not succeed, but the MPTCP connection would continue to provide
620     a regular TCP service.

622     The MP_CAPABLE option is carried on the SYN, SYN/ACK, and ACK packets
623     that start the first subflow of an MPTCP connection. The data
624     carried by each packet is as follows, where A = initiator and B =
625     listener.

627     o SYN (A->B): A's Key for this connection.

629     o SYN/ACK (B->A): B's Key for this connection.

631     o ACK (A->B): A's Key followed by B's Key.

633     The contents of the option are determined by the SYN and ACK flags of
634     the packet, verified by the option's length field. For the diagram
635     shown in Figure 4, "sender" and "receiver" refer to the sender or
636     receiver of the TCP packet (which can be either host). If the SYN
637     flag is set, a single key is included; if only an ACK flag is set,
638     both keys are present.

640     B's Key is echoed in the ACK in order to allow the listener (host B)
641     to act statelessly until the TCP connection reaches the ESTABLISHED
642     state. If the listener acts in this way, however, it MUST generate
643     its key in a way that would allow it to verify that it generated the
644     key when it is echoed in the ACK.

646     This exchange allows the safe passage of MPTCP options on SYN packets
647     to be determined. If any of these options are dropped, MPTCP will
648     gracefully fall back to regular single-path TCP, as documented in
649     Section 3.6. Note that new subflows MUST NOT be established (using
650     the process documented in Section 3.2) until a DSS option has been
651     successfully received across the path (as documented in Section 3.3).
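
The rule that the SYN and ACK flags determine how many keys are present, checked against the option's length field, can be sketched in Python as follows. This is a hedged illustration, not part of the specification: the function name `mp_capable_keys` is hypothetical, and the length of 12 octets for the single-key form is inferred from Figure 4 (four octets of Kind, Length, subtype/version and flags, plus one 64-bit key), while 20 is the length Figure 4 gives for the two-key form.

```python
import struct
from typing import Optional, Tuple

ONE_KEY_LEN = 12   # inferred: 4 header octets + sender's 64-bit key
TWO_KEY_LEN = 20   # from Figure 4: as above + receiver's 64-bit key


def mp_capable_keys(option: bytes, syn: bool, ack: bool) -> Tuple[int, Optional[int]]:
    """Return (sender_key, receiver_key) from a raw MP_CAPABLE option.

    `syn` and `ack` are the TCP flags of the packet that carried the option.
    receiver_key is None on SYN and SYN/ACK packets, which carry one key.
    """
    if len(option) < 4 or option[1] != len(option):
        raise ValueError("malformed option")
    if option[2] >> 4 != 0:
        raise ValueError("not MP_CAPABLE (subtype 0)")
    if syn:
        expected = ONE_KEY_LEN       # SYN or SYN/ACK: a single key
    elif ack:
        expected = TWO_KEY_LEN       # ACK only: both keys
    else:
        raise ValueError("MP_CAPABLE is only expected on SYN, SYN/ACK or ACK")
    if option[1] != expected:
        raise ValueError("length field disagrees with the TCP flags")
    sender_key = struct.unpack_from("!Q", option, 4)[0]   # network byte order
    receiver_key = None
    if expected == TWO_KEY_LEN:
        receiver_key = struct.unpack_from("!Q", option, 12)[0]
    return sender_key, receiver_key
```

On a mismatch a real implementation would fall back to regular TCP as described in Section 3.6 rather than raise; the exceptions here only mark where that decision would be taken.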

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-------+---------------+
|     Kind      |    Length     |Subtype|Version|A|B|C|D|E|F|G|H|
+---------------+---------------+-------+-------+---------------+
|                 Option Sender's Key (64 bits)                 |
|                                                               |
|                                                               |
+---------------------------------------------------------------+
|                Option Receiver's Key (64 bits)                |
|                   (if option Length == 20)                    |
|                                                               |
+---------------------------------------------------------------+

        Figure 4: Multipath Capable (MP_CAPABLE) option

The first four bits of the first octet in the MP_CAPABLE option
(Figure 4) define the MPTCP option subtype (see Section 8; for
MP_CAPABLE, this is 0), and the remaining four bits of this octet
specify the MPTCP version in use (for this specification, this is 0).

The second octet is reserved for flags, allocated as follows:

A: The leftmost bit, labelled "A", SHOULD be set to 1 to indicate
   "Checksum Required", unless the system administrator has decided
   that checksums are not required (for example, if the environment is
   controlled and no middleboxes exist that might adjust the payload).

B: The second bit, labelled "B", is an extensibility flag, and MUST be
   set to 0 for current implementations. This will be used for an
   extensibility mechanism in a future specification, and the impact of
   this flag will be defined at a later date. If receiving a message
   with the "B" flag set to 1, and this is not understood, then this
   SYN MUST be silently ignored; the sender is expected to retry with a
   format compatible with this legacy specification. Note that the
   length of the MP_CAPABLE option, and the meanings of bits "C"
   through "H", may be altered by setting B=1.

C through H: The remaining bits, labelled "C" through "H", are used
   for crypto algorithm negotiation.
   Currently only the rightmost bit, labelled "H", is assigned. Bit
   "H" indicates the use of HMAC-SHA1 (as defined in Section 3.2). An
   implementation that only supports this method MUST set bit "H" to 1,
   and bits "C" through "G" to 0.

A crypto algorithm MUST be specified. If flag bits C through H are all
0, the MP_CAPABLE option MUST be treated as invalid and ignored (that
is, it must be treated as a regular TCP handshake).

The selection of the authentication algorithm also impacts the
algorithm used to generate the token and the Initial Data Sequence
Number. In this specification, with only the SHA-1 algorithm (bit "H")
specified and selected, the token MUST be a truncated (most significant
32 bits) SHA-1 hash ([4], [14]) of the key. A different, 64 bit
truncation (the least significant 64 bits) of the SHA-1 hash of the key
MUST be used as the Initial Data Sequence Number. Note that the key
MUST be hashed in network byte order. Also note that the "least
significant" bits MUST be the rightmost bits of the SHA-1 digest, as
per [4]. Future specifications of the use of the crypto bits may choose
to specify different algorithms for token and IDSN generation.

Both the crypto and checksum bits negotiate capabilities in similar
ways. For the Checksum Required bit (labelled "A"), if either host
requires the use of checksums, checksums MUST be used. In other words,
the only way for checksums not to be used is if both hosts in their
SYNs set A=0. This decision is confirmed by the setting of the "A" bit
in the third packet (the ACK) of the handshake. For example, if the
initiator sets A=0 in the SYN, but the responder sets A=1 in the
SYN/ACK, checksums MUST be used in both directions, and the initiator
will set A=1 in the ACK. The decision whether to use checksums will be
stored by an implementation in a per-connection binary state variable.
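
The token and IDSN derivation specified above for the SHA-1 case can be
sketched in Python as follows. This is a minimal illustration, assuming
the key is supplied as 8 octets already in network byte order; the
function name is hypothetical.

```python
import hashlib


def derive_token_and_idsn(key: bytes):
    """Derive (token, idsn) from a 64 bit key in network byte order."""
    if len(key) != 8:
        raise ValueError("key must be exactly 8 octets (64 bits)")
    digest = hashlib.sha1(key).digest()        # 20-octet SHA-1 digest
    token = int.from_bytes(digest[:4], "big")  # most significant 32 bits
    idsn = int.from_bytes(digest[-8:], "big")  # least significant 64 bits
    return token, idsn
```

An implementation following the collision check recommended earlier in
this section could compare the returned token against its table of
existing connection tokens and generate a fresh key if it is already
present.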

For crypto negotiation, the responder has the choice. The initiator
creates a proposal setting a bit for each algorithm it supports to 1
(in this version of the specification, there is only one proposal, so
bit "H" will always be set to 1). The responder responds with only one
bit set - this is the chosen algorithm. The rationale for this
behaviour is that the responder will typically be a server with
potentially many thousands of connections, so it may wish to choose an
algorithm with minimal computational complexity, depending on the load.
If a responder does not support (or does not want to support) any of
the initiator's proposals, it can respond without an MP_CAPABLE option,
thus forcing a fall-back to regular TCP.

The MP_CAPABLE option is only used in the first subflow of a
connection, in order to identify the connection; all following subflows
will use the "Join" option (see Section 3.2) to join the existing
connection.

If a SYN contains an MP_CAPABLE option but the SYN/ACK does not, it is
assumed that the passive opener is not multipath capable and thus the
MPTCP session MUST operate as a regular, single-path TCP. If a SYN does
not contain an MP_CAPABLE option, the SYN/ACK MUST NOT contain one in
response. If the third packet (the ACK) does not contain the MP_CAPABLE
option, then the session MUST fall back to operating as a regular,
single-path TCP. This is to maintain compatibility with middleboxes on
the path that drop some or all TCP options. Note that an implementation
MAY choose to attempt sending MPTCP options more than one time before
making this decision to operate as regular TCP (see Section 3.8).

If the SYN packets are unacknowledged, it is up to local policy to
decide how to respond. It is expected that a sender will eventually
fall back to single-path TCP (i.e.
without the MP_CAPABLE Option) in order to work around middleboxes that
may drop packets with unknown options; however, the number of
multipath-capable attempts that are made first will be up to local
policy. It is possible that MPTCP and non-MPTCP SYNs could get
re-ordered in the network. Therefore, the final state is inferred from
the presence or absence of the MP_CAPABLE option in the third packet of
the TCP handshake. If this option is not present, the connection SHOULD
fall back to regular TCP, as documented in Section 3.6.

The Initial Data Sequence Number (IDSN) on an MPTCP connection is
generated from the Key. The algorithm for IDSN generation is also
determined from the negotiated authentication algorithm. In this
specification, with only the SHA-1 algorithm specified and selected,
the IDSN of a host MUST be the least significant 64 bits of the SHA-1
hash of its key, i.e. IDSN-A = Hash(Key-A) and IDSN-B = Hash(Key-B).
This deterministic generation of the IDSN allows a receiver to ensure
that there are no gaps in sequence space at the start of the
connection. The SYN with MP_CAPABLE occupies the first octet of Data
Sequence Space, although this does not need to be acknowledged at the
connection level until the first data is sent (see Section 3.3).

3.2. Starting a New Subflow

Once an MPTCP connection has begun with the MP_CAPABLE exchange,
further subflows can be added to the connection. Hosts have knowledge
of their own address(es), and can become aware of the other host's
addresses through signaling exchanges as described in Section 3.4.
Using this knowledge, a host can initiate a new subflow over a
currently unused pair of addresses. It is permitted for either host in
a connection to initiate the creation of a new subflow, but it is
expected that this will normally be the original connection initiator
(see Section 3.8 for heuristics).

A new subflow is started as a normal TCP SYN/ACK exchange. The Join
Connection (MP_JOIN) TCP option is used to identify the connection to
be joined by the new subflow. It uses keying material that was
exchanged in the initial MP_CAPABLE handshake (Section 3.1), and that
handshake also negotiates the crypto algorithm in use for the MP_JOIN
handshake.

This section specifies the behaviour of MP_JOIN using the HMAC-SHA1
algorithm. An MP_JOIN option is present in the SYN, SYN/ACK and ACK of
the three-way handshake, although in each case with a different format.

In the first MP_JOIN on the SYN packet, illustrated in Figure 5, the
initiator sends a token, random number, and address ID.

The token is used to identify the MPTCP connection and is a
cryptographic hash of the receiver's key, as exchanged in the initial
MP_CAPABLE handshake (Section 3.1). In this specification, the tokens
presented in this option are generated by the SHA-1 ([4], [14])
algorithm, truncated to the most significant 32 bits. The token
included in the MP_JOIN option is the token that the receiver of the
packet uses to identify this connection, i.e. Host A will send Token-B
(which is generated from Key-B). Note that the hash generation
algorithm can be overridden by the choice of cryptographic handshake
algorithm, as defined in Section 3.1.

The MP_JOIN SYN not only sends the token (which is static for a
connection) but also Random Numbers (nonces) that are used to prevent
replay attacks on the authentication method. Recommendations for the
generation of random numbers for this purpose are given in [13].

The MP_JOIN option includes an "Address ID". This is an identifier that
only has significance within a single connection, where it identifies
the source address of this packet, even if the IP header has been
changed in transit by a middlebox.
The Address ID allows address removal (Section 3.4.2) without needing
to know what the source address at the receiver is, thus allowing
address removal through NATs. The Address ID also allows correlation
between new subflow setup attempts and address signaling
(Section 3.4.1), to prevent setting up duplicate subflows on the same
path, if an MP_JOIN and ADD_ADDR are sent at the same time.

The Address IDs of the subflow used in the initial SYN exchange of the
first subflow in the connection are implicit, and have the value zero.
A host MUST store the mappings between Address IDs and addresses both
for itself and the remote host. An implementation will also need to
know which local and remote Address IDs are associated with which
established subflows, for when addresses are removed from a local or
remote host.

The MP_JOIN option on packets with the SYN flag set also includes 4
bits of flags, 3 of which are currently reserved and MUST be set to
zero by the sender. The final bit, labelled 'B', indicates whether the
sender of this option wishes this subflow to be used as a backup path
(B=1) in the event of failure of other paths, or whether it wants it to
be used as part of the connection immediately. By setting B=1, the
sender of the option is requesting the other host to only send data on
this subflow if there are no available subflows where B=0. Subflow
policy is discussed in more detail in Section 3.3.8.
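
The fields described above (token, random number, Address ID, and the
'B' flag) could be packed into the 12-octet MP_JOIN SYN option of
Figure 5 roughly as in the following Python sketch. The option Kind
value and the MP_JOIN subtype value are not stated in this section;
the constants used here are assumptions standing in for the values
assigned in Section 8, and the function name is hypothetical.

```python
import struct

# Assumptions: placeholders for the values assigned in Section 8.
TCP_OPTION_KIND_MPTCP = 30
MP_JOIN_SUBTYPE = 0x1


def build_mp_join_syn(token: int, nonce: int, address_id: int,
                      backup: bool = False) -> bytes:
    """Pack an MP_JOIN option for the initial SYN (layout of Figure 5)."""
    # Third octet: 4-bit subtype, 3 reserved bits (zero), then the B flag.
    subtype_and_flags = (MP_JOIN_SUBTYPE << 4) | (1 if backup else 0)
    return struct.pack(
        "!BBBBII",                 # network byte order, 12 octets in total
        TCP_OPTION_KIND_MPTCP,
        12,                        # Length is fixed for this variant
        subtype_and_flags,
        address_id,
        token,                     # receiver's 32 bit token
        nonce,                     # sender's 32 bit random number
    )
```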

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----+-+---------------+
|     Kind      |  Length = 12  |Subtype|     |B|   Address ID  |
+---------------+---------------+-------+-----+-+---------------+
|                  Receiver's Token (32 bits)                   |
+---------------------------------------------------------------+
|               Sender's Random Number (32 bits)                |
+---------------------------------------------------------------+

  Figure 5: Join Connection (MP_JOIN) option (for initial SYN)

When receiving a SYN with an MP_JOIN option that contains a valid token
for an existing MPTCP connection, the recipient SHOULD respond with a
SYN/ACK also containing an MP_JOIN option containing a random number
and a truncated (leftmost 64 bits) Hash-based Message Authentication
Code (HMAC). This version of the option is shown in Figure 6. If the
token is unknown, or the host wants to refuse subflow establishment
(for example, due to a limit on the number of subflows it will permit),
the receiver will send back an RST, analogous to an unknown port in
TCP. Although calculating an HMAC requires cryptographic operations, it
is believed that the 32 bit token in the MP_JOIN SYN gives sufficient
protection against blind state exhaustion attacks and therefore there
is no need to provide mechanisms to allow a responder to operate
statelessly at the MP_JOIN stage.

An HMAC is sent by both hosts - by the initiator (Host A) in the third
packet (the ACK) and by the responder (Host B) in the second packet
(the SYN/ACK). Doing the HMAC exchange at this stage allows both hosts
to have first exchanged random data (in the first two SYN packets) that
is used as the "message". This specification defines that HMAC as
defined in [9] is used, along with the SHA-1 hash algorithm [4]
(potentially implemented as in [14]), thus generating a 160-bit / 20
octet HMAC.
Due to option space limitations, the HMAC included in the SYN/ACK is
truncated to the leftmost 64 bits, but this is acceptable since random
numbers are used, and thus an attacker only has one chance to guess the
HMAC correctly (if the HMAC is incorrect, the TCP connection is closed,
so a new MP_JOIN negotiation with a new random number is required).

The initiator's authentication information is sent in its first ACK
(the third packet of the handshake), as shown in Figure 7. This data
needs to be sent reliably, since it is the only time this HMAC is sent;
therefore, receipt of this packet MUST trigger a regular TCP ACK in
response, and the packet MUST be retransmitted if this ACK is not
received. In other words, sending the ACK/MP_JOIN packet places the
subflow in the PRE_ESTABLISHED state, and it moves to the ESTABLISHED
state only on receipt of an ACK from the receiver. It is not permitted
to send data while in the PRE_ESTABLISHED state. The reserved bits in
this option MUST be set to zero by the sender.

The key for the HMAC algorithm, in the case of the message transmitted
by Host A, will be Key-A followed by Key-B, and in the case of Host B,
Key-B followed by Key-A. These are the keys that were exchanged in the
original MP_CAPABLE handshake. The "message" for the HMAC algorithm in
each case is the concatenation of the Random Numbers for each host
(denoted by R): for Host A, R-A followed by R-B; and for Host B, R-B
followed by R-A.
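
The key and message construction described above can be sketched in
Python using the standard hmac and hashlib modules. This is an
illustrative sketch only: it assumes that keys are passed as 8-octet
strings and random numbers as 4-octet strings, all in network byte
order, and the function names are hypothetical.

```python
import hashlib
import hmac


def mp_join_hmac(local_key: bytes, remote_key: bytes,
                 local_nonce: bytes, remote_nonce: bytes) -> bytes:
    """Full 160 bit HMAC-SHA1 as computed by the local host.

    HMAC key = local key followed by remote key;
    message  = local random number followed by remote random number.
    """
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha1).digest()


def truncate_for_syn_ack(full_hmac: bytes) -> bytes:
    """Leftmost 64 bits, as carried in the SYN/ACK."""
    return full_hmac[:8]


def verify_peer_hmac(received: bytes, local_key: bytes, remote_key: bytes,
                     local_nonce: bytes, remote_nonce: bytes) -> bool:
    """Check an HMAC sent by the peer (full or truncated to 64 bits)."""
    # The peer places its own key and random number first.
    expected = mp_join_hmac(remote_key, local_key, remote_nonce, local_nonce)
    # Constant-time comparison is a precaution chosen for this sketch.
    return hmac.compare_digest(received, expected[:len(received)])
```

For example, Host A would call mp_join_hmac(Key-A, Key-B, R-A, R-B) to
produce the value it sends in the third ACK, and would use
verify_peer_hmac() on the 64 bit value received in the SYN/ACK.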

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----+-+---------------+
|     Kind      |  Length = 16  |Subtype|     |B|   Address ID  |
+---------------+---------------+-------+-----+-+---------------+
|                                                               |
|               Sender's Truncated HMAC (64 bits)               |
|                                                               |
+---------------------------------------------------------------+
|               Sender's Random Number (32 bits)                |
+---------------------------------------------------------------+

Figure 6: Join Connection (MP_JOIN) option (for responding SYN/ACK)

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+-----------------------+
|     Kind      |  Length = 24  |Subtype|      (reserved)       |
+---------------+---------------+-------+-----------------------+
|                                                               |
|                                                               |
|                   Sender's HMAC (160 bits)                    |
|                                                               |
|                                                               |
+---------------------------------------------------------------+

   Figure 7: Join Connection (MP_JOIN) option (for third ACK)

These various TCP options fit together to enable authenticated subflow
setup as illustrated in Figure 8.

            Host A                                  Host B
   ------------------------                       ----------
   Address A1    Address A2                       Address B1
   ----------    ----------                       ----------
        |             |                                |
        |            SYN + MP_CAPABLE(Key-A)           |
        |--------------------------------------------->|
        |<---------------------------------------------|
        |          SYN/ACK + MP_CAPABLE(Key-B)         |
        |             |                                |
        |        ACK + MP_CAPABLE(Key-A, Key-B)        |
        |--------------------------------------------->|
        |             |                                |
        |             |   SYN + MP_JOIN(Token-B, R-A)  |
        |             |------------------------------->|
        |             |<-------------------------------|
        |             | SYN/ACK + MP_JOIN(HMAC-B, R-B) |
        |             |                                |
        |             |     ACK + MP_JOIN(HMAC-A)      |
        |             |------------------------------->|
        |             |<-------------------------------|
        |             |             ACK                |

   HMAC-A = HMAC(Key=(Key-A+Key-B), Msg=(R-A+R-B))
   HMAC-B = HMAC(Key=(Key-B+Key-A), Msg=(R-B+R-A))

          Figure 8: Example use of MPTCP Authentication

If the token received at Host B is unknown or local policy prohibits
the acceptance of the new subflow, the recipient MUST respond with a
TCP RST for the subflow.

If the token is accepted at Host B, but the HMAC returned to Host A
does not match the one expected, Host A MUST close the subflow with a
TCP RST.

If Host B does not receive the expected HMAC, or the MP_JOIN option is
missing from the ACK, it MUST close the subflow with a TCP RST.

If the HMACs are verified as correct, then both hosts have
authenticated each other as being the same peers as existed at the
start of the connection, and they have agreed on which connection this
subflow will become a part of.

If the SYN/ACK as received at Host A does not have an MP_JOIN option,
Host A MUST close the subflow with a RST.

This covers all cases of the loss of an MP_JOIN. In more detail, if
MP_JOIN is stripped from the SYN on the path from A to B, and Host B
does not have a passive opener on the relevant port, it will respond
with an RST in the normal way.
If in response to a SYN with an MP_JOIN option, a SYN/ACK is received
without the MP_JOIN option (either since it was stripped on the return
path, or it was stripped on the outgoing path but the passive opener on
Host B responded as if it were a new regular TCP session), then the
subflow is unusable and Host A MUST close it with a RST.

Note that additional subflows can be created between any pair of ports
(but see Section 3.8 for heuristics); no explicit application-level
accept calls or bind calls are required to open additional subflows.
To associate a new subflow with an existing connection, the token
supplied in the subflow's SYN exchange is used for demultiplexing. This
then binds the 5-tuple of the TCP subflow to the local token of the
connection. A consequence is that it is possible to allow any port
pairs to be used for a connection.

Demultiplexing subflow SYNs MUST be done using the token; this is
unlike traditional TCP, where the destination port is used for
demultiplexing SYN packets. Once a subflow is set up, demultiplexing
packets is done using the five-tuple, as in traditional TCP. The
five-tuples will be mapped to the local connection identifier (token).
Note that Host A will know its local token for the subflow even though
it is not sent on the wire - only the responder's token is sent.

3.3. General MPTCP Operation

This section discusses operation of MPTCP for data transfer. At a high
level, an MPTCP implementation will take one input data stream from an
application, and split it into one or more subflows, with sufficient
control information to allow it to be reassembled and delivered
reliably and in-order to the recipient application. The following
subsections define this behaviour in detail.

The Data Sequence Mapping and the Data ACK are signalled in the Data
Sequence Signal (DSS) option.
Either or both can be signalled in one DSS, dependent on the flags set.
The Data Sequence Mapping defines how the sequence space on the subflow
maps to the connection level, and the Data ACK acknowledges receipt of
data at the connection level. These functions are described in more
detail in the following two subsections.

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+----------------------+
|     Kind      |    Length     |Subtype| (reserved) |F|m|M|a|A|
+---------------+---------------+-------+----------------------+
|         Data ACK (4 or 8 octets, depending on flags)         |
+--------------------------------------------------------------+
|   Data Sequence Number (4 or 8 octets, depending on flags)   |
+--------------------------------------------------------------+
|              Subflow Sequence Number (4 octets)              |
+-------------------------------+------------------------------+
|  Data-level Length (2 octets) |      Checksum (2 octets)     |
+-------------------------------+------------------------------+

          Figure 9: Data Sequence Signal (DSS) option

The flags, when set, define the contents of this option, as follows:

   o  A = Data ACK present

   o  a = Data ACK is 8 octets (if not set, Data ACK is 4 octets)

   o  M = Data Sequence Number, Subflow Sequence Number, Data-level
      Length, and Checksum present

   o  m = Data Sequence Number is 8 octets (if not set, DSN is 4
      octets)

The flags 'a' and 'm' only have meaning if the corresponding 'A' or 'M'
flags are set, otherwise they will be ignored. The maximum length of
this option, with all flags set, is 28 octets.

The 'F' flag indicates "DATA_FIN". If present, this means that this
mapping covers the final data from the sender.
This is the connection-level equivalent to the FIN flag in single-path
TCP. A connection is not closed unless there has been a DATA_FIN
exchange, or a timeout. The purpose of the DATA_FIN, along with the
interactions between this flag, the subflow-level FIN flag, and the
data sequence mapping are described in Section 3.3.3. The remaining
reserved bits MUST be set to zero by an implementation of this
specification.

Note that the Checksum is only present in this option if the use of
MPTCP checksumming has been negotiated at the MP_CAPABLE handshake (see
Section 3.1). The presence of the checksum can be inferred from the
length of the option. If a checksum is present, but its use had not
been negotiated in the MP_CAPABLE handshake, the checksum field MUST be
ignored. If a checksum is not present when its use has been negotiated,
the receiver MUST close the subflow with a RST as it is considered
broken.

3.3.1. Data Sequence Mapping

The data stream as a whole can be reassembled through the use of the
Data Sequence Mapping components of the DSS option (Figure 9), which
define the mapping from the subflow sequence number to the data
sequence number. This is used by the receiver to ensure in-order
delivery to the application layer. Meanwhile, the subflow-level
sequence numbers (i.e. the regular sequence numbers in the TCP header)
have subflow-only relevance. It is expected (but not mandated) that
SACK [10] is used at the subflow level to improve efficiency.

The Data Sequence Mapping specifies a mapping from subflow sequence
space to data sequence space. This is expressed in terms of starting
sequence numbers for the subflow and the data level, and a length of
bytes for which this mapping is valid.
This explicit mapping for a range of data was chosen rather than
per-packet signaling to assist with compatibility with situations where
TCP/IP segmentation or coalescing is undertaken separately from the
stack that is generating the data flow (e.g. through the use of TCP
segmentation offloading on network interface cards, or by middleboxes
such as performance enhancing proxies). It also allows a single mapping
to cover many packets, which may be useful in bulk transfer situations.

A mapping is fixed, in that the subflow sequence number is bound to the
data sequence number after the mapping has been processed. A sender
MUST NOT change this mapping after it has been declared; however, the
same data sequence number can be mapped to by different subflows for
retransmission purposes (see Section 3.3.6). This would also permit the
same data to be sent simultaneously on multiple subflows for resilience
or efficiency purposes, especially in the case of lossy links. Although
the detailed specification of such operation is outside the scope of
this document, an implementation SHOULD treat the first data that is
received at a subflow for the data sequence space as that which should
be delivered to the application, and SHOULD ignore any later data for
that sequence space.

The data sequence number is specified as an absolute value, whereas the
subflow sequence numbering is relative (the SYN at the start of the
subflow has relative subflow sequence number 0). This is to allow
middleboxes to change the Initial Sequence Number of a subflow, such as
firewalls that undertake ISN randomization.

The data sequence mapping also contains a checksum of the data that
this mapping covers, if use of checksums has been negotiated at the
MP_CAPABLE exchange. Checksums are used to detect if the payload has
been adjusted in any way by a non-MPTCP-aware middlebox.
If this checksum fails, it will trigger a failure of the subflow, or a
fallback to regular TCP, as documented in Section 3.6, since MPTCP can
no longer reliably know the subflow sequence space at the receiver to
build data sequence mappings.

The checksum algorithm used is the standard TCP checksum [1], operating
over the data covered by this mapping, along with a pseudo-header as
shown in Figure 10.

                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+--------------------------------------------------------------+
|                                                              |
|                Data Sequence Number (8 octets)               |
|                                                              |
+--------------------------------------------------------------+
|              Subflow Sequence Number (4 octets)              |
+-------------------------------+------------------------------+
|  Data-level Length (2 octets) |        Zeros (2 octets)      |
+-------------------------------+------------------------------+

            Figure 10: Pseudo-Header for DSS Checksum

Note that the Data Sequence Number used in the pseudo-header is always
the 64 bit value, irrespective of what length is used in the DSS option
itself. The standard TCP checksum algorithm has been chosen since it
will be calculated anyway for the TCP subflow, and if calculated first
over the data before adding the pseudo-headers, it only needs to be
calculated once. Furthermore, since the TCP checksum is additive, the
checksum for a DSN_MAP can be constructed by simply adding together the
checksums for the data of each constituent TCP segment, and adding the
checksum for the DSS pseudo-header.

Note that checksumming relies on the TCP subflow containing contiguous
data, and therefore a TCP subflow MUST NOT use the Urgent Pointer to
interrupt an existing mapping.
Further note, however, that if Urgent data is received on a subflow, it
SHOULD be mapped to the data sequence space and delivered to the
application analogous to Urgent data in regular TCP.

To avoid possible deadlock scenarios, subflow-level processing should
be undertaken separately from that at connection-level. Therefore, even
if a mapping does not exist from the subflow space to the data-level
space, the data SHOULD still be ACKed at the subflow (if it is
in-window). This data cannot, however, be acknowledged at the data
level (Section 3.3.2) because its data sequence numbers are unknown.

Implementations MAY hold onto such unmapped data for a short while in
the expectation that a mapping will arrive shortly. Such unmapped data
cannot be counted as being within the connection-level receive window
because this is relative to the data sequence numbers, so if the
receiver runs out of memory to hold this data, it will have to be
discarded. If a mapping for that subflow-level sequence space does not
arrive within a receive window of data, that subflow SHOULD be treated
as broken, closed with an RST, and any unmapped data silently
discarded.

Data sequence numbers are always 64 bit quantities, and MUST be
maintained as such in implementations. If a connection is progressing
at a slow rate, so protection against wrapped sequence numbers is not
required, then it is permissible to include just the lower 32 bits of
the data sequence number in the Data Sequence Mapping and/or Data ACK
as an optimization, and an implementation can make this choice
independently for each packet.

An implementation MUST send the full 64 bit Data Sequence Number if it
is transmitting at a sufficiently high rate that the 32 bit value could
wrap within the Maximum Segment Lifetime (MSL) [15].
The lengths of the DSNs used in these values (which may be different)
are declared with flags in the DSS option. Implementations MUST accept
a 32 bit DSN and implicitly promote it to a 64 bit quantity by
incrementing the upper 32 bits of sequence number each time the lower
32 bits wrap. A sanity check MUST be implemented to ensure that a wrap
occurs at an expected time (e.g. the sequence number jumps from a very
high number to a very low number) and is not triggered by out-of-order
packets.

As with the standard TCP sequence number, the data sequence number
should not start at zero, but at a random value to make blind session
hijacking harder. This specification requires setting the initial data
sequence number (IDSN) of each host to the least significant 64 bits of
the SHA-1 hash of the host's key, as described in Section 3.1.

A Data Sequence Mapping does not need to be included in every MPTCP
packet, as long as the subflow sequence space in that packet is covered
by a mapping known at the receiver. This can be used to reduce overhead
in cases where the mapping is known in advance; one such case is when
there is a single subflow between the hosts, another is when segments
of data are scheduled in larger-than-packet-sized chunks.

An "infinite" mapping can be used to fall back to regular TCP by
mapping the subflow-level data to the connection-level data for the
remainder of the connection (see Section 3.6). This is achieved by
setting the Data-level Length field of the DSS option to the reserved
value of 0. The checksum, in such a case, will also be set to zero.

3.3.2. Data Acknowledgments

To provide full end-to-end resilience, MPTCP provides a
connection-level acknowledgement, to act as a cumulative ACK for the
connection as a whole. This is the "Data ACK" field of the DSS option
(Figure 9).
The Data ACK is analogous to the behaviour of the standard TCP
cumulative ACK - indicating how much data has been successfully
received (with no holes). This is in comparison to the subflow-level
ACK, which acts analogously to TCP SACK, given that there may still be
holes in the data stream at the connection level. The Data ACK
specifies the next Data Sequence Number it expects to receive.

The Data ACK, as for the DSN, can be sent as the full 64 bit value, or
as the lower 32 bits. If data is received with a 64 bit DSN, it MUST be
acknowledged with a 64 bit Data ACK. If the DSN received is 32 bits, it
is valid for the implementation to choose whether to send a 32 bit or
64 bit Data ACK.

The Data ACK proves that the data, and all required MPTCP signaling,
has been received and accepted by the remote end. One key use of the
Data ACK signal is that it is used to indicate the left edge of the
advertised receive window. As explained in Section 3.3.4, the receive
window is shared by all subflows and is relative to the Data ACK.
Because of this, an implementation MUST NOT use the RCV.WND field of a
TCP segment at connection-level if it does not also carry a DSS option
with a Data ACK field. Furthermore, separating the connection-level
acknowledgments from the subflow-level allows processing to be done
separately, and a receiver has the freedom to drop segments after
acknowledgement at the subflow level, for example due to memory
constraints when many segments arrive out-of-order.

An MPTCP sender MUST NOT free data from the send buffer until it has
been acknowledged by both a Data ACK received on any subflow and at the
subflow level by all subflows the data was sent on.
The former condition ensures liveness of the connection and the latter
condition ensures liveness and self-consistency of a subflow when data
needs to be retransmitted. Note, however, that if some data needs to be
retransmitted multiple times over a subflow, there is a risk of
blocking the sending window. In this case, the MPTCP sender can decide
to terminate the subflow that is behaving badly by sending a RST.

The Data ACK MAY be included in all segments, however optimisations
SHOULD be considered in more advanced implementations, where the Data
ACK is present in segments only when the Data ACK value advances, and
this behaviour MUST be treated as valid. This behaviour ensures the
sender buffer is freed, while reducing overhead when the data transfer
is unidirectional.

3.3.3.  Closing a Connection

In regular TCP a FIN announces to the receiver that the sender has no
more data to send. In order to allow subflows to operate independently
and to keep the appearance of TCP over the wire, a FIN in MPTCP only
affects the subflow on which it is sent. This allows nodes to exercise
considerable freedom over which paths are in use at any one time. The
semantics of a FIN remain as for regular TCP, i.e. it is not until both
sides have ACKed each other's FINs that the subflow is fully closed.

When an application calls close() on a socket, this indicates that it
has no more data to send, and for regular TCP this would result in a
FIN on the connection. For MPTCP, an equivalent mechanism is needed,
and this is referred to as the DATA_FIN.

A DATA_FIN is an indication that the sender has no more data to send,
and as such can be used to verify that all data has been successfully
received. A DATA_FIN, as with the FIN on a regular TCP connection, is a
unidirectional signal.
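
The difference in scope between a subflow FIN and a DATA_FIN can be
illustrated with a small, non-normative model. This is only a sketch:
the class, method, and attribute names are hypothetical and are not
defined by this specification, and only one sending direction is
modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical connection-level state for one sending direction."""
    # Maps a subflow name to True once a FIN has been sent on it.
    fin_sent: dict = field(default_factory=dict)
    data_fin_sent: bool = False

    def add_subflow(self, name):
        self.fin_sent[name] = False

    def send_fin(self, name):
        # A FIN only affects the subflow on which it is sent.
        self.fin_sent[name] = True

    def close(self):
        # The application's close() maps to a DATA_FIN at the
        # connection level, not to a FIN on any particular subflow.
        self.data_fin_sent = True

    def open_subflows(self):
        return sorted(n for n, sent in self.fin_sent.items() if not sent)


conn = Connection()
conn.add_subflow("A1-B1")
conn.add_subflow("A2-B1")
conn.send_fin("A2-B1")   # closes one path; the connection stays open
```

In this model, sending a FIN on "A2-B1" leaves "A1-B1" usable and does
not set data_fin_sent; only close() signals the end of the data stream.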

The DATA_FIN is signalled by setting the 'F' flag in the Data Sequence
Signal option (Figure 9) to 1. A DATA_FIN occupies one octet (the final
octet) of the connection-level sequence space. Note that the DATA_FIN
is included in the Data-Level Length, but not at the subflow level: for
example, a segment with DSN 80, and Data-Level Length 11, with DATA_FIN
set, would map 10 octets from the subflow into data sequence space
80-89, the DATA_FIN is DSN 90, and therefore this segment including
DATA_FIN would be acknowledged with a DATA_ACK of 91.

Note that when the DATA_FIN is not attached to a TCP segment containing
data, the Data Sequence Signal MUST have Subflow Sequence Number of 0,
a Data-Level Length of 1, and the Data Sequence Number that corresponds
with the DATA_FIN itself. The checksum in this case will only cover the
pseudo-header.

A DATA_FIN has the same semantics and behaviour as a regular TCP FIN,
but at the connection level. Notably, it is only DATA_ACKed once all
data has been successfully received at the connection level. Note
therefore that a DATA_FIN is decoupled from a subflow FIN. It is only
permissible to combine these signals on one subflow if there is no data
outstanding on other subflows. Otherwise, it may be necessary to
retransmit data on different subflows. Essentially, a host MUST NOT
close all functioning subflows unless it is safe to do so, i.e. until
all outstanding data has been DATA_ACKed, or until the segment with the
DATA_FIN flag set is the only outstanding segment.

Once a DATA_FIN has been acknowledged, all remaining subflows MUST be
closed with standard FIN exchanges. Both hosts SHOULD send FINs on all
subflows, as a courtesy to allow middleboxes to clean up state even if
an individual subflow has failed. It is also encouraged to reduce the
timeouts (Maximum Segment Life) on subflows at end hosts.
In particular, any subflows where there is still outstanding data
queued (which has been retransmitted on other subflows in order to get
the DATA_FIN acknowledged) MAY be closed with an RST.

A connection is considered closed once both hosts' DATA_FINs have been
acknowledged by DATA_ACKs.

As specified above, a standard TCP FIN on an individual subflow only
shuts down the subflow on which it was sent. If all subflows have been
closed with a FIN exchange, but no DATA_FIN has been received and
acknowledged, the MPTCP connection is treated as closed only after a
timeout. This implies that an implementation will have TIME_WAIT states
at both the subflow and connection levels (see Appendix C). This
permits "break-before-make" scenarios where connectivity is lost on all
subflows before a new one can be re-established.

3.3.4.  Receiver Considerations

Regular TCP advertises a receive window in each packet, telling the
sender how much data the receiver is willing to accept past the
cumulative ack. The receive window is used to implement flow control,
throttling down fast senders when receivers cannot keep up.

MPTCP also uses a unique receive window, shared between the subflows.
The idea is to allow any subflow to send data as long as the receiver
is willing to accept it; the alternative, maintaining per subflow
receive windows, could end up stalling some subflows while others would
not use up their window.

The receive window is relative to the DATA_ACK. As in TCP, a receiver
MUST NOT shrink the right edge of the receive window (i.e. DATA_ACK +
receive window). The receiver will use the Data Sequence Number to tell
if a packet should be accepted at connection level.

When deciding to accept packets at subflow level, regular TCP checks
the sequence number in the packet against the allowed receive window.
With multipath, such a check is done using only the connection level
window. A sanity check SHOULD be performed at subflow level to ensure
that the subflow and mapped sequence numbers meet the following test:
SSN - SUBFLOW_ACK <= DSN - DATA_ACK, where SSN is the subflow sequence
number of the received packet and SUBFLOW_ACK is the RCV.NXT (next
expected sequence number) of the subflow (with the equivalent
connection-level definitions for DSN and DATA_ACK).

In regular TCP, once a segment is deemed in-window, it is either put in
the in-order receive queue or in the out-of-order queue. In multipath
TCP, the same happens but at connection-level: a segment is placed in
the connection level in-order or out-of-order queue if it is in-window
at both connection and subflow level. The stack still has to remember,
for each subflow, which segments were received successfully so that it
can ACK them at subflow level appropriately. Typically, this will be
implemented by keeping per subflow out-of-order queues (containing only
message headers, not the payloads) and remembering the value of the
cumulative ACK.

It is important for implementers to understand how large a receiver
buffer is appropriate. The lower bound for full network utilization is
the maximum bandwidth-delay product of any one of the paths. However
this might be insufficient when a packet is lost on a slower subflow
and needs to be retransmitted (see Section 3.3.6). A tight upper bound
would be the maximum RTT of any path multiplied by the total bandwidth
available across all paths. This permits all subflows to continue at
full speed while a packet is fast-retransmitted on the maximum RTT
path. Even this might be insufficient to maintain full performance in
the event of a retransmit timeout on the maximum RTT path.
It is for future study to determine the relationship between
retransmission strategies and receive buffer sizing.

3.3.5.  Sender Considerations

The sender remembers receiver window advertisements from the receiver.
It should only update its local receive window values when the largest
sequence number allowed (i.e. DATA_ACK + receive window) increases, on
the receipt of a DATA_ACK. This is important to allow using paths with
different RTTs, and thus different feedback loops.

MPTCP uses a single receive window across all subflows, and if the
receive window was guaranteed to be unchanged end-to-end, a host could
always read the most recent receive window value. However, some classes
of middleboxes may alter the TCP-level receive window. Typically these
will shrink the offered window, although for short periods of time it
may be possible for the window to be larger (however note that this
would not continue for long periods since ultimately the middlebox must
keep up with delivering data to the receiver). Therefore, if receive
window sizes differ on multiple subflows, when sending data MPTCP
SHOULD take the largest of the most recent window sizes as the one to
use in calculations. This rule is implicit in the requirement not to
reduce the right edge of the window.

The sender MUST also remember the receive windows advertised by each
subflow. The allowed window for subflow i is (ack_i, ack_i +
rcv_wnd_i), where ack_i is the subflow-level cumulative ack of subflow
i. This ensures data will not be sent to a middlebox unless there is
enough buffering for the data.

Putting the two rules together, we get the following: a sender is
allowed to send data segments with data-level sequence numbers between
(DATA_ACK, DATA_ACK + receive_window).
Each of these segments will be mapped onto subflows, as long as subflow
sequence numbers are in the allowed windows for those subflows. Note
that subflow sequence numbers do not generally affect flow control if
the same receive window is advertised across all subflows. They will
perform flow control for those subflows with a smaller advertised
receive window.

The send buffer MUST, at a minimum, be as big as the receive buffer, to
enable the sender to reach maximum throughput.

3.3.6.  Reliability and Retransmissions

The data sequence mapping allows senders to re-send data with the same
data sequence number on a different subflow. When doing this, a host
MUST still retransmit the original data on the original subflow, in
order to preserve the subflow integrity (middleboxes could replay old
data, and/or could reject holes in subflows), and a receiver will
ignore these retransmissions. While this is clearly suboptimal, for
compatibility reasons this is sensible behaviour. Optimisations could
be negotiated in future versions of this protocol.

This protocol specification does not mandate any mechanisms for
handling retransmissions, and much will be dependent upon local policy
(as discussed in Section 3.3.8). One can imagine aggressive connection
level retransmissions policies where every packet lost at subflow level
is retransmitted on a different subflow (hence wasting bandwidth but
possibly reducing application-to-application delays), or conservative
retransmission policies where connection-level retransmits are only
used after a few subflow level retransmission timeouts occur.

It is envisaged that a standard connection-level retransmission
mechanism would be implemented around a connection-level data queue:
all segments that haven't been DATA_ACKed are stored.
A timer is set when the head of the connection-level queue is ACKed at
subflow level but its corresponding data is not ACKed at data level.
This timer will guard against failures in re-transmission by
middleboxes that pro-actively ACK data.

The sender MUST keep data in its send buffer as long as the data has
not been acknowledged at both connection level and on all subflows it
has been sent on. In this way, the sender can always retransmit the
data if needed, on the same subflow or on a different one. A special
case is when a subflow fails: the sender will typically resend the data
on other working subflows after a timeout, and will keep trying to
retransmit the data on the failed subflow too. The sender will declare
the subflow failed after a predefined upper bound on retransmissions is
reached (which MAY be lower than the usual TCP limits of the Maximum
Segment Life), or on the receipt of an ICMP error, and only then delete
the outstanding data segments.

Multiple retransmissions are triggers that will indicate that a subflow
performs badly and could lead to a host resetting the subflow with an
RST. However, additional research is required to understand the
heuristics of how and when to reset underperforming subflows. For
example, a highly asymmetric path may be mis-diagnosed as
underperforming.

3.3.7.  Congestion Control Considerations

Different subflows in an MPTCP connection have different congestion
windows. To achieve fairness at bottlenecks and resource pooling, it is
necessary to couple the congestion windows in use on each subflow, in
order to push most traffic to uncongested links. One algorithm for
achieving this is presented in [5]; the algorithm does not achieve
perfect resource pooling but is "safe" in that it is readily deployable
in the current Internet.
By this, we mean that it does not take up more capacity on any one path
than if it was a single path flow using only that route, so this
ensures fair coexistence with single-path TCP at shared bottlenecks.

It is foreseeable that different congestion controllers will be
implemented for MPTCP, each aiming to achieve different properties in
the resource pooling/fairness/stability design space, as well as those
for achieving different properties in quality of service, reliability
and resilience.

Regardless of the algorithm used, the design of the MPTCP protocol aims
to provide the congestion control implementations sufficient
information to take the right decisions; this information includes, for
each subflow, which packets were lost and when.

3.3.8.  Subflow Policy

Within a local MPTCP implementation, a host may use any local policy it
wishes to decide how to share the traffic to be sent over the available
paths.

In the typical use case, where the goal is to maximise throughput, all
available paths will be used simultaneously for data transfer, using
coupled congestion control as described in [5]. It is expected,
however, that other use cases will appear.

For instance, a possibility is an 'all-or-nothing' approach, i.e. have
a second path ready for use in the event of failure of the first path,
but alternatives could include entirely saturating one path before
using an additional path (the 'overflow' case). Such choices would be
most likely based on the monetary cost of links, but may also be based
on properties such as the delay or jitter of links, where stability (of
delay or bandwidth) is more important than throughput. Application
requirements such as these are discussed in detail in [6].
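
The three local policies mentioned above (use all paths, keep a second
path in reserve, and overflow onto a second path) could be sketched as
follows. This is an illustration only, not a mandated scheduler: the
function name, the policy labels, and the dictionary keys 'name', 'up',
and 'has_room' are all hypothetical, and subflows are assumed to be
listed in order of local preference.

```python
def pick_subflows(policy, subflows):
    """Return the names of subflows that new data may be scheduled on.

    `subflows` is an ordered list of dicts with hypothetical keys:
    'name', 'up' (path is usable), and 'has_room' (the subflow can
    currently accept more data, e.g. its congestion window is not full).
    """
    usable = [s for s in subflows if s["up"]]
    if policy == "throughput":
        # Use every available path simultaneously.
        return [s["name"] for s in usable if s["has_room"]]
    if policy == "failover":
        # 'All-or-nothing': a later path is used only if earlier ones failed.
        return [usable[0]["name"]] if usable else []
    if policy == "overflow":
        # Saturate the preferred path before spilling onto the next one.
        for s in usable:
            if s["has_room"]:
                return [s["name"]]
        return []
    raise ValueError("unknown policy: %r" % (policy,))


paths = [
    {"name": "wifi", "up": True, "has_room": False},
    {"name": "lte", "up": True, "has_room": True},
]
```

With the preferred path saturated but still up, the "throughput" and
"overflow" policies both select the second path, while "failover" keeps
waiting on the first.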

The ability to make effective choices at the sender requires full
knowledge of the path "cost", which is unlikely to be the case. It
would be desirable for a receiver to be able to signal their own
preferences for paths, since they will often be the multihomed party,
and may have to pay for metered incoming bandwidth.

Whilst fine-grained control may be the most powerful solution, that
would require some mechanism such as overloading the ECN signal [16],
which is undesirable, and it is felt that there would not be sufficient
benefit to justify an entirely new signal. Therefore the MP_JOIN option
(see Section 3.2) contains the 'B' bit, which allows a host to indicate
to its peer that this path should be treated as a backup path to use
only in the event of failure of other working subflows (i.e. a subflow
where the receiver has indicated B=1 SHOULD NOT be used to send data
unless there are no usable subflows where B=0).

In the event that the available set of paths changes, a host may wish
to signal a change in priority of subflows to the peer (e.g. a subflow
that was previously set as backup should now take priority over all
remaining subflows). Therefore, the MP_PRIO option, shown in Figure 11,
can be used to change the 'B' flag of the subflow on which it is sent.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------+-----+-+--------------+
   |     Kind      |     Length    |Subtype|     |B| AddrID (opt) |
   +---------------+---------------+-------+-----+-+--------------+

                      Figure 11: MP_PRIO option

It should be noted that the backup flag is a request from a data
receiver to a data sender only, and the data sender SHOULD adhere to
these requests.
A host cannot assume that the data sender will do so, however, since
local policies, or technical difficulties, may override MP_PRIO
requests. Note also that this signal applies to a single direction, and
so the sender of this option could choose to continue using the subflow
to send data even if it has signalled B=1 to the other host.

This option can also be applied to other subflows than the one on which
it is sent, by setting the optional Address ID field. This applies the
given setting of B to all subflows in this connection that use the
address identified by the given Address ID. The presence of this field
is determined by the option length; if Length==4 then it is present, if
Length==3 then it applies to the current subflow only. The use case of
this is that a host can signal to its peer that an address is
temporarily unavailable (for example, if it has radio coverage issues)
and the peer should therefore drop to backup state on all subflows
using that Address ID.

3.4.  Address Knowledge Exchange (Path Management)

We use the term "path management" to refer to the exchange of
information about additional paths between hosts, which in this design
is managed by multiple addresses at hosts. For more detail of the
architectural thinking behind this design, see the separate
architecture document [2].

This design makes use of two methods of sharing such information, and
both can be used on a connection. The first is the direct setup of new
subflows, already described in Section 3.2, where the initiator has an
additional address. The second method, described in the following
subsections, signals addresses explicitly to the other host to allow it
to initiate new subflows. The two mechanisms are complementary: the
first is implicit and simple, while the second is explicit and more
complex, but more robust.
Together, the mechanisms allow addresses to change in flight (and thus
support operation through NATs, since the source address need not be
known), and also allow the signaling of previously unknown addresses,
and of addresses belonging to other address families (e.g. both IPv4
and IPv6).

Here is an example of typical operation of the protocol:

o  An MPTCP connection is initially set up between address/port A1 of
   host A and address/port B1 of host B. If host A is multihomed and
   multi-addressed, it can start an additional subflow from its address
   A2 to B1, by sending a SYN with a Join option from A2 to B1, using
   B's previously declared token for this connection. Alternatively, if
   B is multihomed, it can try to set up a new subflow from B2 to A1,
   using A's previously declared token. In either case, the SYN will be
   sent to the port already in use for the original subflow on the
   receiving host.

o  Simultaneously (or after a timeout), an ADD_ADDR option
   (Section 3.4.1) is sent on an existing subflow, informing the
   receiver of the sender's alternative address(es). The recipient can
   use this information to open a new subflow to the sender's
   additional address. In our example, A will send ADD_ADDR option
   informing B of address/port A2. The mix of using the SYN-based
   option and the ADD_ADDR option, including timeouts, is
   implementation-specific and can be tailored to agree with local
   policy.

o  If subflow A2-B1 is successfully set up, host B can use the Address
   ID in the Join option to correlate this with the ADD_ADDR option
   that will also arrive on an existing subflow; now B knows not to
   open A2-B1, ignoring the ADD_ADDR. Otherwise, if B has not received
   the A2-B1 MP_JOIN SYN but received the ADD_ADDR, it can try to
   initiate a new subflow from one or more of its addresses to address
   A2.
   This permits new sessions to be opened if one host is behind a NAT.

Other ways of using the two signaling mechanisms are possible; for
instance, signaling addresses in other address families can only be
done explicitly using the Add Address option.

3.4.1.  Address Advertisement

The Add Address (ADD_ADDR) TCP Option announces additional addresses
(and optionally, ports) on which a host can be reached (Figure 12).
Multiple instances of this TCP option can be added in a single message
if there is sufficient TCP option space, otherwise multiple TCP
messages containing this option will be sent. This option can be used
at any time during a connection, depending on when the sender wishes to
enable multiple paths and/or when paths become available. As with all
MPTCP signals, the receiver MUST undertake standard TCP validity checks
before acting upon it.

Every address has an Address ID which can be used for uniquely
identifying the address within a connection, for address removal. This
is also used to identify MP_JOIN options (see Section 3.2) relating to
the same address, even when address translators are in use. The Address
ID MUST uniquely identify the address to the sender (within the scope
of the connection), but the mechanism for allocating such IDs is
implementation-specific.

All address IDs learnt via either MP_JOIN or ADD_ADDR SHOULD be stored
by the receiver in a data structure that gathers all the Address ID to
address mappings for a connection (identified by a token pair). In this
way there is a stored mapping between Address ID, observed source
address and token pair for future processing of control information for
a connection.
Note that an implementation MAY discard incoming address advertisements
at will, for example for avoiding the required mapping state, or
because advertised addresses are of no use to it (for example, IPv6
addresses when it has IPv4 only). Therefore, a host MUST treat address
advertisements as soft state, and MAY choose to refresh advertisements
periodically.

This option is shown in Figure 12. The illustration is sized for IPv4
addresses (IPVer = 4). For IPv6, the IPVer field will read 6, and the
length of the address will be 16 octets (instead of 4).

The final two octets, specifying the TCP port number to use, are
optional, and their presence can be inferred from the length of the
option. Although it is expected that the majority of use cases will use
the same port pairs as used for the initial subflow (e.g. port 80
remains port 80 on all subflows, as does the ephemeral port at the
client), there may be cases (such as port-based load balancing) where
the explicit specification of a different port is required. If no port
is specified, MPTCP SHOULD attempt to connect to the specified address
on the same port as is already in use by the subflow on which the
ADD_ADDR signal was sent; this is discussed in more detail in
Section 3.8.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------+-------+---------------+
   |     Kind      |     Length    |Subtype| IPVer |  Address ID   |
   +---------------+---------------+-------+-------+---------------+
   |          Address (IPv4 - 4 octets / IPv6 - 16 octets)         |
   +-------------------------------+-------------------------------+
   |   Port (2 octets, optional)   |
   +-------------------------------+

               Figure 12: Add Address (ADD_ADDR) option

Due to the proliferation of NATs, it is reasonably likely that one host
may attempt to advertise private addresses [17].
It is not desirable to prohibit this, since there may be cases where
both hosts have additional interfaces on the same private network, and
a host MAY want to advertise such addresses. The MP_JOIN handshake to
create a new subflow (Section 3.2) provides mechanisms to minimise
security risks. The MP_JOIN message contains a 32 bit token that
uniquely identifies the connection to the receiving host. If the token
is unknown, the host will return with a RST. In the unlikely event that
the token is known, subflow setup will continue, but the HMAC exchange
must occur for authentication. This will fail, and will provide
sufficient protection against two unconnected hosts accidentally
setting up a new subflow upon the signal of a private address. Further
security considerations around the issue of ADD_ADDR messages that
accidentally mis-direct, or maliciously direct, new MP_JOIN attempts
are discussed in Section 5.

Ideally, ADD_ADDR and REMOVE_ADDR options would be sent reliably, and
in order, to the other end. This would ensure that this address
management does not unnecessarily cause an outage in the connection
when remove/add addresses are processed in reverse order, and also to
ensure that all possible paths are used. Note, however, that losing
reliability and ordering will not break the multipath connections, it
will just reduce the opportunity to open multiple paths and to survive
different patterns of path failures.

Therefore, implementing reliability signals for these TCP options is
not necessary. In order to minimise the impact of the loss of these
options, however, it is RECOMMENDED that a sender send these options on
all available subflows. If these options need to be received in-order,
an implementation SHOULD only send one ADD_ADDR/REMOVE_ADDR option per
RTT, to minimise the risk of misordering.
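
As an illustration of the layout in Figure 12, the following
non-normative sketch parses a single ADD_ADDR option and infers the
presence of the port from the option length. The function name and the
ADD_ADDR_SUBTYPE constant are hypothetical; in particular, the subtype
value 0x3 is an assumption made for this example and is not stated in
this section. The Kind octet is not checked here.

```python
import ipaddress
import struct

ADD_ADDR_SUBTYPE = 0x3  # assumed value; not defined in this section


def parse_add_addr(option):
    """Parse one ADD_ADDR option given as bytes, starting at the Kind octet.

    Returns (address_id, address, port); port is None when it is absent.
    """
    if len(option) < 4:
        raise ValueError("option too short")
    _kind, length, subtype_ipver, address_id = struct.unpack_from("!BBBB", option)
    if length != len(option):
        raise ValueError("Length field does not match option size")
    # Subtype is the upper nibble and IPVer the lower nibble of octet 3.
    subtype, ipver = subtype_ipver >> 4, subtype_ipver & 0x0F
    if subtype != ADD_ADDR_SUBTYPE:
        raise ValueError("not an ADD_ADDR option")
    if ipver == 4:
        addr_len, addr_cls = 4, ipaddress.IPv4Address
    elif ipver == 6:
        addr_len, addr_cls = 16, ipaddress.IPv6Address
    else:
        raise ValueError("unsupported IPVer")
    body = option[4:]
    if len(body) == addr_len:
        port = None  # connect to the port already in use on this subflow
    elif len(body) == addr_len + 2:
        (port,) = struct.unpack_from("!H", body, addr_len)
    else:
        raise ValueError("option length inconsistent with IPVer")
    return address_id, addr_cls(body[:addr_len]), port
```

A receiver that has no use for an address family can simply discard the
parsed result, since advertisements are soft state.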

A host can send an ADD_ADDR message with an already assigned Address
ID, but the Address MUST be the same as previously assigned to this
Address ID, and the Port MUST be different to one already in use for
this Address ID. If these conditions are not met, the receiver SHOULD
silently ignore the ADD_ADDR. A host wishing to replace an existing
Address ID MUST first remove the existing one (Section 3.4.2).

A host that receives an ADD_ADDR but finds a connection setup to that
IP address and port number is unsuccessful SHOULD NOT perform further
connection attempts to this address/port combination for this
connection. A sender that wants to trigger a new incoming connection
attempt on a previously advertised address/port combination can
therefore refresh ADD_ADDR information by sending the option again.

During normal MPTCP operation, it is unlikely that there will be
sufficient TCP option space for ADD_ADDR to be included along with
those for data sequence numbering (Section 3.3.1). Therefore, it is
expected that an MPTCP implementation will send the ADD_ADDR option on
separate ACKs. As discussed earlier, however, an MPTCP implementation
MUST NOT treat duplicate ACKs with any MPTCP option, with the exception
of the DSS option, as indications of congestion [11], and an MPTCP
implementation SHOULD NOT send more than two duplicate ACKs in a row
for signaling purposes.

3.4.2.  Remove Address

If, during the lifetime of an MPTCP connection, a previously-announced
address becomes invalid (e.g. if the interface disappears), the
affected host SHOULD announce this so that the peer can remove subflows
related to this address.

This is achieved through the Remove Address (REMOVE_ADDR) option
(Figure 13), which will remove a previously-added address (or list of
addresses) from a connection and terminate any subflows currently using
that address.

For security purposes, if a host receives a REMOVE_ADDR option, it must
ensure the affected path(s) are no longer in use before it instigates
closure. The receipt of REMOVE_ADDR SHOULD first trigger the sending of
a TCP Keepalive [18] on the path, and if a response is received the
path SHOULD NOT be removed. Typical TCP validity tests on the subflow
(e.g. ensuring sequence and ack numbers are correct) MUST also be
undertaken. An implementation can use indications of these test
failures as part of intrusion detection or error logging.

The sending and receipt (if no keepalive response was received) of this
message SHOULD trigger the sending of RSTs by both hosts on the
affected subflow(s) (if possible), as a courtesy to allow middleboxes
to clean up state, before cleaning up any local state.

Address removal is undertaken by ID, so as to permit the use of NATs
and other middleboxes that rewrite source addresses. If there is no
address at the requested ID, the receiver will silently ignore the
request.

A subflow that is still functioning MUST be closed with a FIN exchange
as in regular TCP, rather than using this option. For more information,
see Section 3.3.3.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------+-------+---------------+
   |     Kind      |  Length = 3+n |Subtype|(resvd)|   Address ID  | ...
   +---------------+---------------+-------+-------+---------------+
                        (followed by n-1 Address IDs, if required)

            Figure 13: Remove Address (REMOVE_ADDR) option

3.5.  Fast Close

Regular TCP has the means of sending a reset signal (RST) to abruptly
close a connection. With MPTCP, the RST only has the scope of the
subflow and will only close the concerned subflow but not affect the
remaining subflows. MPTCP's connection will stay alive at the
data-level, in order to permit break-before-make handover between
subflows. It is therefore necessary to provide an MPTCP-level "reset"
to allow the abrupt closure of the whole MPTCP connection, and this is
the MP_FASTCLOSE option.

MP_FASTCLOSE is used to indicate to the peer that the connection will
be abruptly closed and no data will be accepted any more. The reasons
for triggering an MP_FASTCLOSE are implementation-specific. Regular TCP
does not allow sending a RST while the connection is in a synchronized
state [1]. Nevertheless, implementations allow the sending of a RST in
this state, if for example the operating system is running out of
resources. In these cases, MPTCP should send the MP_FASTCLOSE. This
option is illustrated in Figure 14.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------+-----------------------+
   |     Kind      |    Length     |Subtype|      (reserved)       |
   +---------------+---------------+-------+-----------------------+
   |                     Option Receiver's Key                     |
   |                           (64 bits)                           |
   |                                                               |
   +---------------------------------------------------------------+

             Figure 14: Fast Close (MP_FASTCLOSE) option

If Host A wants to force the closure of an MPTCP connection, the MPTCP
Fast Close procedure is as follows:

o  Host A sends an ACK containing the MP_FASTCLOSE option on one
   subflow, containing the key of Host B as declared in the initial
   connection handshake. On all the other subflows, Host A sends a
   regular TCP RST to close these subflows, and tears them down. Host A
   now enters FASTCLOSE_WAIT state.
1857 o Upon receipt of an MP_FASTCLOSE, containing the valid key, Host B 1858 answers on the same subflow with a TCP RST and tears down all 1859 subflows. Host B can now close the whole MPTCP connection (it 1860 transitions directly to CLOSED state). 1862 o As soon as Host A has received the TCP RST on the remaining 1863 subflow, it can close this subflow and tear down the whole 1864 connection (transition from FASTCLOSE_WAIT to CLOSED states). If 1865 Host A receives an MP_FASTCLOSE instead of a TCP RST, both hosts 1866 attempted fast closure simultaneously. Host A should reply with a 1867 TCP RST and tear down the connection. 1869 o If Host A does not receive a TCP RST in reply to its MP_FASTCLOSE 1870 after one RTO (the RTO of the subflow where the MP_FASTCLOSE has been 1871 sent), it SHOULD retransmit the MP_FASTCLOSE. The number of 1872 retransmissions SHOULD be limited to prevent this connection from 1873 being retained for a long time, but this limit is implementation- 1874 specific. A RECOMMENDED number is 3. 1876 3.6. Fallback 1878 Sometimes, middleboxes will exist on a path that could prevent the 1879 operation of MPTCP. MPTCP has been designed in order to cope with 1880 many middlebox modifications (see Section 6), but there are still 1881 some cases where a subflow could fail to operate within the MPTCP 1882 requirements. These cases are notably: the loss of TCP options on a 1883 path; and the modification of payload data. If such an event occurs, 1884 it is necessary to "fall back" to the previous, safe operation. This 1885 may either be falling back to regular TCP, or removing a problematic 1886 subflow. 1888 At the start of an MPTCP connection (i.e. the first subflow), it is 1889 important to ensure that the path is fully MPTCP-capable and the 1890 necessary TCP options can reach each host.
The handshake as 1891 described in Section 3.1 SHOULD fall back to regular TCP if either of 1892 the SYN messages do not have the MPTCP options: this is the same, and 1893 desired, behaviour in the case where a host is not MPTCP capable, or 1894 the path does not support the MPTCP options. When attempting to join 1895 an existing MPTCP connection (Section 3.2), if a path is not MPTCP 1896 capable and the TCP options do not get through on the SYNs, the 1897 subflow will be closed according to the MP_JOIN logic. 1899 There is, however, another corner case which should be addressed. 1900 That is one of MPTCP options getting through on the SYN, but not on 1901 regular packets. This can be resolved if the subflow is the first 1902 subflow, and thus all data in flight is contiguous, using the 1903 following rules. 1905 A sender MUST include a DSS option with Data Sequence Mapping in 1906 every segment until one of the sent segments has been acknowledged 1907 with a DSS option containing a Data ACK. Upon reception of the 1908 acknowledgement, the sender has the confirmation that the DSS option 1909 passes in both directions and may choose to send fewer DSS options 1910 than once per segment. 1912 If, however, an ACK is received for data (not just for the SYN) 1913 without a DSS option containing a Data ACK, the sender determines the 1914 path is not MPTCP capable. In the case of this occurring on an 1915 additional subflow (i.e. one started with MP_JOIN), the host MUST 1916 close the subflow with an RST. In the case of the first subflow 1917 (i.e. that started with MP_CAPABLE), it MUST drop out of an MPTCP 1918 mode back to regular TCP. The sender will send one final Data 1919 Sequence Mapping, with the Data-Level Length value of 0 indicating an 1920 infinite mapping (in case the path drops options in one direction 1921 only), and then revert to sending data on the single subflow without 1922 any MPTCP options. 
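The sender-side rule above can be illustrated with a minimal sketch. It only models the decision taken when the first ACK covering data (not just the SYN) arrives on a subflow; the names Subflow, Action, and on_first_data_ack are hypothetical and are not defined by this document.

```python
from enum import Enum

# Data-Level Length value that signals an infinite mapping (per the text above).
INFINITE_MAPPING_LENGTH = 0


class Subflow(Enum):
    INITIAL = "started with MP_CAPABLE"
    ADDITIONAL = "started with MP_JOIN"


class Action(Enum):
    RELAX_DSS = "path confirmed; DSS may be sent less than once per segment"
    RESET_SUBFLOW = "close the subflow with an RST"
    FALL_BACK = "send one final infinite mapping, then regular TCP"


def on_first_data_ack(subflow, has_dss_data_ack):
    """Decide the sender's reaction to the first ACK that covers data."""
    if has_dss_data_ack:
        # The DSS option passed in both directions.
        return Action.RELAX_DSS
    if subflow is Subflow.ADDITIONAL:
        # Options lost on a joined subflow: the subflow is closed.
        return Action.RESET_SUBFLOW
    # Options lost on the first subflow: drop back to regular TCP.
    return Action.FALL_BACK


if __name__ == "__main__":
    for kind in Subflow:
        for seen in (True, False):
            print(kind.name, seen, "->", on_first_data_ack(kind, seen).name)
```

Until this decision has been taken, the sender keeps a DSS option with a Data Sequence Mapping in every segment.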
1924 Note that this rule essentially prohibits the sending of data on the 1925 third packet of an MP_CAPABLE or MP_JOIN handshake, since both that 1926 option and a DSS cannot fit in TCP option space. If the initiator is 1927 to send first, another segment must be sent that contains the data 1928 and DSS. Note also that an additional subflow cannot be used until 1929 the initial path has been verified as MPTCP-capable. 1931 These rules should cover all cases where such a failure could happen: 1932 whether it's on the forward or reverse path, and whether the server 1933 or the client first sends data. If lost options on data packets 1934 occur on any other subflow apart from the initial subflow, it 1935 should be treated as a standard path failure. The data would not be 1936 DATA_ACKed (since there is no mapping for the data), and the subflow 1937 can be closed with an RST. 1939 The case described above is a specialised case of fallback, for when 1940 the lack of MPTCP support is detected before any data is acknowledged 1941 at the connection level on a subflow. More generally, fallback 1942 (either closing a subflow, or falling back to regular TCP) can become necessary at 1943 any point during a connection if a non-MPTCP-aware middlebox changes 1944 the data stream. 1946 As described in Section 3.3, each portion of data for which there is 1947 a mapping is protected by a checksum. This mechanism is used to 1948 detect if middleboxes have made any adjustments to the payload 1949 (added, removed, or changed data). A checksum will fail if the data 1950 has been changed in any way. This will also detect if the length of 1951 data on the subflow is increased or decreased, and this means the 1952 Data Sequence Mapping is no longer valid. The sender no longer knows 1953 what subflow-level sequence number the receiver is genuinely 1954 operating at (the middlebox will be faking ACKs in return), and 1955 cannot signal any further mappings.
Furthermore, in addition to the 1956 possibility of payload modifications that are valid at the 1957 application layer, there is the possibility that false-positives 1958 could be hit across MPTCP segment boundaries, corrupting the data. 1959 Therefore, all data from the start of the segment that failed the 1960 checksum onwards is not trustworthy. 1962 When multiple subflows are in use, the data in-flight on a subflow 1963 will likely involve data that is not contiguously part of the 1964 connection-level stream, since segments will be spread across the 1965 multiple subflows. Due to the problems identified above, it is not 1966 possible to determine what the adjustment has done to the data 1967 (notably, any changes to the subflow sequence numbering). Therefore, 1968 it is not possible to recover the subflow, and the affected subflow 1969 must be immediately closed with an RST, featuring an MP_FAIL option 1970 (Figure 15), which defines the Data Sequence Number at the start of 1971 the segment (defined by the Data Sequence Mapping) which had the 1972 checksum failure. Note that the MP_FAIL option requires the use of 1973 the full 64-bit sequence number, even if 32-bit sequence numbers are 1974 normally in use in the DSS signals on the path. 1976 1 2 3 1977 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1978 +---------------+---------------+-------+----------------------+ 1979 | Kind | Length=12 |Subtype| (reserved) | 1980 +---------------+---------------+-------+----------------------+ 1981 | | 1982 | Data Sequence Number (8 octets) | 1983 | | 1984 +--------------------------------------------------------------+ 1986 Figure 15: Fallback (MP_FAIL) option 1988 The receiver MUST discard all data following the data sequence number 1989 specified. Failed data MUST NOT be DATA_ACKed and so will be re- 1990 transmitted on other subflows (Section 3.3.6). 1992 A special case is when there is a single subflow and it fails with a 1993 checksum error. 
If it is known that all unacknowledged data in 1994 flight is contiguous (which will usually be the case with a single 1995 subflow), an infinite mapping can be applied to the subflow without 1996 the need to close it first, and essentially turn off all further 1997 MPTCP signaling. In this case, if a receiver identifies a checksum 1998 failure when there is only one path, it will send back an MP_FAIL 1999 option on the subflow-level ACK, referring to the data-level sequence 2000 number of the start of the segment on which the checksum error was 2001 detected. The sender will receive this, and if all unacknowledged 2002 data in flight is contiguous, will signal an infinite mapping. This 2003 infinite mapping will be a DSS option (Section 3.3) on the first new 2004 packet, containing a Data Sequence Mapping that acts retroactively, 2005 referring to the start of the subflow sequence number of the last 2006 segment that was known to be delivered intact. From that point 2007 onwards data can be altered by a middlebox without affecting MPTCP, 2008 as the data stream is equivalent to a regular, legacy TCP session. 2010 In the rare case that the data is not contiguous (which could happen 2011 when there is only one subflow but it is retransmitting data from a 2012 subflow that has recently been uncleanly closed), the receiver MUST 2013 close the subflow with an RST with MP_FAIL. The receiver MUST 2014 discard all data that follows the data sequence number specified. 2015 The sender MAY attempt to create a new subflow belonging to the same 2016 connection, and if it chooses to do so, SHOULD place the single 2017 subflow immediately in single-path mode by setting an infinite data 2018 sequence mapping. This mapping will begin from the data-level 2019 sequence number that was declared in the MP_FAIL. 2021 After a sender signals an infinite mapping it MUST only use subflow 2022 ACKs to clear its send buffer.
This is because Data ACKs may become 2023 misaligned with the subflow ACKs when middleboxes insert or delete 2024 data. The receiver SHOULD stop generating Data ACKs after it receives 2025 an infinite mapping. 2027 When a connection has fallen back, only one subflow can send data, 2028 otherwise the receiver would not know how to reorder the data. In 2029 practice, this means that all MPTCP subflows will have to be 2030 terminated except one. Once MPTCP falls back to regular TCP, it MUST 2031 NOT revert to MPTCP later in the connection. 2033 It should be emphasised that we are not attempting to prevent the use 2034 of middleboxes that want to adjust the payload. An MPTCP-aware 2035 middlebox could provide such functionality by also rewriting 2036 checksums. 2038 3.7. Error Handling 2040 In addition to the fallback mechanism as described above, the 2041 standard classes of TCP errors may need to be handled in an MPTCP- 2042 specific way. Note that changing semantics - such as the relevance 2043 of an RST - are covered in Section 4. Where possible, we do not want 2044 to deviate from regular TCP behaviour. 2046 The following list covers possible errors and the appropriate MPTCP 2047 behaviour: 2049 o Unknown token in MP_JOIN (or HMAC failure in MP_JOIN ACK, or 2050 missing MP_JOIN in SYN/ACK response): send RST (analogous to TCP's 2051 behaviour on an unknown port) 2053 o DSN out of Window (during normal operation): drop the data, do not 2054 send Data ACKs. 2056 o Remove request for unknown address ID: silently ignore 2058 3.8. Heuristics 2060 There are a number of heuristics that are needed for performance or 2061 deployment but which are not required for protocol correctness. In 2062 this section we detail such heuristics. Note that discussion of 2063 buffering and certain sender and receiver window behaviours are 2064 presented in Section 3.3.4 and Section 3.3.5, as well as 2065 retransmission in Section 3.3.6. 2067 3.8.1.
Port Usage 2069 Under typical operation an MPTCP implementation SHOULD use the same 2070 ports as already in use. In other words, the destination port of a 2071 SYN containing an MP_JOIN option SHOULD be the same as the remote 2072 port of the first subflow in the connection. The local port for such 2073 SYNs SHOULD also be the same as for the first subflow (and as such, 2074 an implementation SHOULD reserve ephemeral ports across all local IP 2075 addresses), although there may be cases where this is infeasible. 2076 This strategy is intended to maximize the probability of the SYN 2077 being permitted by a firewall or NAT at the recipient and to avoid 2078 confusing any network monitoring software. 2080 There may also be cases, however, where the passive opener wishes to 2081 signal to the other host that a specific port should be used, and 2082 this facility is provided in the Add Address option as documented in 2083 Section 3.4.1. It is therefore feasible to allow multiple subflows 2084 between the same two addresses but using different port pairs, and 2085 such a facility could be used to allow load balancing within the 2086 network based on 5-tuples (e.g. some ECMP implementations [7]). 2088 3.8.2. Delayed Subflow Start 2090 Many TCP connections are short-lived and consist only of a few 2091 segments, and so the overheads of using MPTCP outweigh any benefits. 2092 A heuristic is required, therefore, to decide when to start using 2093 additional subflows in an MPTCP connection. We expect that 2094 experience gathered from deployments will provide further guidance on 2095 this, and will be affected by particular application characteristics 2096 (which are likely to change over time). However, a suggested 2097 general-purpose heuristic that an implementation MAY choose to employ 2098 is as follows. Results from experimental deployments are needed in 2099 order to verify the correctness of this proposal. 
2101 If a host has data buffered for its peer (which implies that the 2102 application has received a request for data), the host opens one 2103 subflow for each initial window's worth of data that is buffered. 2105 Consideration should also be given to limiting the rate of adding new 2106 subflows, as well as limiting the total number of subflows open for a 2107 particular connection. A host may choose to vary these values based 2108 on its load or knowledge of traffic and path characteristics. 2110 Note that this heuristic alone is probably insufficient. Traffic for 2111 many common applications, such as downloads, is highly asymmetric and 2112 the host that is multihomed may well be the client which will never 2113 fill its buffers, and thus never use MPTCP. Advanced APIs that allow 2114 an application to signal its traffic requirements would aid in these 2115 decisions. 2117 An additional time-based heuristic could be applied, opening 2118 additional subflows after a given period of time has passed. This 2119 would alleviate the above issue, and also provide resilience for low- 2120 bandwidth but long-lived applications. 2122 This section has shown some of the considerations that an implementer 2123 should give when developing MPTCP heuristics, but is not intended to 2124 be prescriptive. 2126 3.8.3. Failure Handling 2128 Requirements for MPTCP's handling of unexpected signals have been 2129 given in Section 3.7. There are other failure cases, however, where 2130 a host can choose appropriate behaviour. 2132 For example, Section 3.1 suggests that a host SHOULD fall back to 2133 trying regular TCP SYNs after one or more failures of MPTCP SYNs for 2134 a connection. A host may keep a system-wide cache of such 2135 information, so that it can back off from using MPTCP, firstly for 2136 that particular destination host, and eventually on a whole 2137 interface, if MPTCP connections continue failing.
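One possible shape for such a system-wide cache is sketched below. This is an illustration only: the class name MptcpFailureCache, its methods, and both thresholds are assumptions made for the example, and this document does not prescribe any particular values or data structure.

```python
class MptcpFailureCache:
    """Hypothetical cache of failed MPTCP SYN attempts (not part of the spec)."""

    def __init__(self, per_dest_limit=2, per_iface_limit=3):
        # Both limits are illustrative; real values are implementation-specific.
        self.per_dest_limit = per_dest_limit
        self.per_iface_limit = per_iface_limit
        self._failures = {}  # (interface, destination) -> failure count

    def record_failure(self, iface, dest):
        key = (iface, dest)
        self._failures[key] = self._failures.get(key, 0) + 1

    def record_success(self, iface, dest):
        # A working MPTCP connection clears the history for that destination.
        self._failures.pop((iface, dest), None)

    def _dest_backed_off(self, iface, dest):
        return self._failures.get((iface, dest), 0) >= self.per_dest_limit

    def _iface_backed_off(self, iface):
        # Back off on the whole interface once enough destinations fail on it.
        failing = sum(
            1
            for (i, _dest), count in self._failures.items()
            if i == iface and count >= self.per_dest_limit
        )
        return failing >= self.per_iface_limit

    def should_try_mptcp(self, iface, dest):
        return not (
            self._dest_backed_off(iface, dest) or self._iface_backed_off(iface)
        )


if __name__ == "__main__":
    cache = MptcpFailureCache()
    cache.record_failure("eth0", "192.0.2.1")
    cache.record_failure("eth0", "192.0.2.1")
    print(cache.should_try_mptcp("eth0", "192.0.2.1"))
    print(cache.should_try_mptcp("eth0", "192.0.2.2"))
```

The sketch escalates in the order the text describes: first per destination, then per interface. An implementation would probably also age entries out over time, which is omitted here.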
2139 Another failure could occur when the MP_JOIN handshake fails. 2140 Section 3.7 specifies that an incorrect handshake MUST lead to the 2141 subflow being closed with a RST. A host operating an active 2142 intrusion detection system may choose to start blocking MP_JOIN 2143 packets from the source host if multiple failed MP_JOIN attempts are 2144 seen. From the connection initiator's point of view, if an MP_JOIN 2145 fails, it SHOULD NOT attempt to connect to the same IP address and 2146 port during the lifetime of the connection, unless the other host 2147 refreshes the information with another ADD_ADDR option. Note that 2148 the ADD_ADDR option is informational only, and does not guarantee the 2149 other host will attempt a connection. 2151 In addition, an implementation may learn over a number of connections 2152 that certain interfaces or destination addresses consistently fail 2153 and may default to not trying to use MPTCP for these. Behaviour 2154 could also be learnt for particularly badly performing subflows or 2155 subflows that regularly fail during use, in order to temporarily 2156 choose not to use these paths. 2158 4. Semantic Issues 2160 In order to support multipath operation, the semantics of some TCP 2161 components have changed. To aid clarity, this section collects these 2162 semantic changes as a reference. 2164 Sequence Number: The (in-header) TCP sequence number is specific to 2165 the subflow. To allow the receiver to reorder application data, 2166 an additional data-level sequence space is used. In this data- 2167 level sequence space, the initial SYN and the final DATA_FIN 2168 occupy one octet of sequence space. There is an explicit mapping 2169 of data sequence space to subflow sequence space, which is 2170 signalled through TCP options in data packets. 2172 ACK: The ACK field in the TCP header acknowledges only the subflow 2173 sequence number, not the data-level sequence space. 
2174 Implementations SHOULD NOT attempt to infer a data-level 2175 acknowledgement from the subflow ACKs. This separates subflow- 2176 and connection-level processing at an end host. 2178 Duplicate ACK: A duplicate ACK that includes any MPTCP signaling 2179 (with the exception of the DSS option) MUST NOT be treated as a 2180 signal of congestion. To limit the chances of non-MPTCP-aware 2181 entities mistakenly interpreting duplicate ACKs as a signal of 2182 congestion, MPTCP SHOULD NOT send more than two duplicate ACKs 2183 containing (non-DSS) MPTCP signals in a row. 2185 Receive Window: The receive window in the TCP header indicates the 2186 amount of free buffer space for the whole data-level connection 2187 (as opposed to for this subflow) that is available at the 2188 receiver. This is the same semantics as regular TCP, but to 2189 maintain these semantics the receive window must be interpreted at 2190 the sender as relative to the sequence number given in the 2191 DATA_ACK rather than the subflow ACK in the TCP header. In this 2192 way the original flow control role is preserved. Note that some 2193 middleboxes may change the receive window, and so a host SHOULD 2194 use the maximum value of those recently seen on the constituent 2195 subflows for the connection-level receive window, and also needs 2196 to maintain a subflow-level window for subflow-level processing. 2198 FIN: The FIN flag in the TCP header applies only to the subflow it 2199 is sent on, not to the whole connection. For connection-level FIN 2200 semantics, the DATA_FIN option is used. 2202 RST: The RST flag in the TCP header applies only to the subflow it 2203 is sent on, not to the whole connection. The MP_FASTCLOSE option 2204 provides the fast-close functionality of a RST at the MPTCP 2205 connection level. 2207 Address List: Address list management (i.e. 
knowledge of the local 2208 and remote hosts' lists of available IP addresses) is handled on a 2209 per-connection basis (as opposed to per-subflow, per host, or per 2210 pair of communicating hosts). This permits the application of 2211 per-connection local policy. Adding an address to one connection 2212 (either explicitly through an Add Address message, or implicitly 2213 through a Join) has no implication for other connections between 2214 the same pair of hosts. 2216 5-tuple: The 5-tuple (protocol, local address, local port, remote 2217 address, remote port) presented by kernel APIs to the application 2218 layer in a non-multipath-aware application is that of the first 2219 subflow, even if the subflow has since been closed and removed 2220 from the connection. This decision, and other related API issues, 2221 are discussed in more detail in [6]. 2223 5. Security Considerations 2225 As identified in [8], the addition of multipath capability to TCP 2226 will bring with it a number of new classes of threat. In order to 2227 prevent these, [2] presents a set of requirements for a security 2228 solution for MPTCP. The fundamental goal is for the security of 2229 MPTCP to be "no worse" than regular TCP today, and the key security 2230 requirements are: 2232 o Provide a mechanism to confirm that the parties in a subflow 2233 handshake are the same as in the original connection setup. 2235 o Provide verification that the peer can receive traffic at a new 2236 address before using it as part of a connection. 2238 o Provide replay protection, i.e. ensure that a request to add/ 2239 remove a subflow is 'fresh'. 2241 In order to achieve these goals, MPTCP includes a hash-based 2242 handshake algorithm documented in Section 3.1 and Section 3.2. 
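A minimal sketch of the two computations this handshake relies on follows. It assumes the SHA-1 and HMAC-SHA1 choices and the key and nonce ordering given in Section 3.1 and Section 3.2, which are not reproduced in this section; the function names token and join_hmac are illustrative only.

```python
import hashlib
import hmac


def token(key):
    """Connection token: the most significant 32 bits of SHA-1(key)."""
    return hashlib.sha1(key).digest()[:4]


def join_hmac(local_key, remote_key, local_nonce, remote_nonce):
    """HMAC sent by the local host during an MP_JOIN handshake.

    The HMAC key is the local key followed by the remote key, and the
    message is the local nonce followed by the remote nonce (assumed
    ordering, per Section 3.2).
    """
    return hmac.new(
        local_key + remote_key, local_nonce + remote_nonce, hashlib.sha1
    ).digest()


if __name__ == "__main__":
    # Fixed example values; real keys and nonces must be random.
    key_a, key_b = bytes(range(8)), bytes(range(8, 16))
    nonce_a, nonce_b = b"\x01\x02\x03\x04", b"\x05\x06\x07\x08"

    sent_by_a = join_hmac(key_a, key_b, nonce_a, nonce_b)
    # Host B recomputes Host A's HMAC from its own copy of the keys.
    expected_at_b = join_hmac(key_a, key_b, nonce_a, nonce_b)
    print(token(key_b).hex(), hmac.compare_digest(sent_by_a, expected_at_b))
```

Because fresh nonces feed into the HMAC message, a captured HMAC is of no use in a later handshake, which is the replay protection listed in the requirements above.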
2244 The security of the MPTCP connection hangs on the use of keys that 2245 are shared once at the start of the first subflow, and are never sent 2246 again over the network (unless used in the fast close mechanism, 2247 Section 3.5). To ease demultiplexing whilst not giving away any 2248 cryptographic material, future subflows use a truncated cryptographic 2249 hash of this key as the connection identification "token". The keys 2250 are concatenated and used as keys for creating Hash-based Message 2251 Authentication Codes (HMAC) used on subflow setup, in order to verify 2252 that the parties in the handshake are the same as in the original 2253 connection setup. It also provides verification that the peer can 2254 receive traffic at this new address. Replay attacks would still be 2255 possible when only keys are used, and therefore the handshakes use 2256 single-use random numbers (nonces) at both ends - this ensures the 2257 HMAC will never be the same on two handshakes. Guidance on 2258 generating random numbers suitable for use as keys is given in [13] 2259 and discussed in Section 3.1. 2261 The use of crypto capability bits in the initial connection handshake 2262 to negotiate use of a particular algorithm allows the deployment of 2263 additional crypto mechanisms in the future. Note that this would be 2264 susceptible to bid-down attacks only if the attacker was on-path (and 2265 thus would be able to modify the data anyway). The security 2266 mechanism presented in this draft should therefore protect against 2267 all forms of flooding and hijacking attacks discussed in [8]. 2269 During normal operation, regular TCP protection mechanisms (such as 2270 ensuring sequence numbers are in-window) will provide the same level 2271 of protection against attacks on individual TCP subflows as exists for 2272 regular TCP today. Implementations will introduce additional buffers 2273 compared to regular TCP, to reassemble data at the connection level.
2274 The application of window sizing will minimize the risk of denial-of- 2275 service attacks consuming resources. 2277 As discussed in Section 3.4.1, a host may advertise its private 2278 addresses, but these might point to different hosts in the receiver's 2279 network. The MP_JOIN handshake (Section 3.2) will ensure that this 2280 does not succeed in setting up a subflow to the incorrect host. 2281 However, it could still create unwanted TCP handshake traffic. This 2282 feature of MPTCP could be a target for denial-of-service exploits, 2283 with malicious participants in MPTCP connections encouraging the 2284 recipient to target other hosts in the network. Therefore, 2285 implementations should consider heuristics (Section 3.8) at both the 2286 sender and receiver to reduce the impact of this. 2288 A small security risk could theoretically exist with key reuse, but 2289 in order to accomplish a replay attack, both the sender and receiver 2290 keys, and the sender and receiver random numbers, in the MP_JOIN 2291 handshake (Section 3.2) would have to match. 2293 Whilst this specification defines a "medium" security solution, 2294 meeting the criteria specified at the start of this section and the 2295 threat analysis ([8]), since attacks only ever get worse, it is likely 2296 that a future standards-track version of MPTCP would need to be able 2297 to support stronger security. There are several ways the security of 2298 MPTCP could potentially be improved; some of these would be 2299 compatible with MPTCP as defined in this document, whilst others may 2300 not be. For now, the best approach is to get experience with the 2301 current approach, establish what might work and check that the threat 2302 analysis is still accurate. 2304 Possible ways of improving MPTCP security could include: 2306 o defining a new MPTCP cryptographic algorithm, as negotiated in 2307 MP_CAPABLE.
A sub-case could be to include an additional 2308 deployment assumption, such as stateful servers, in order to allow 2309 a more powerful algorithm to be used. 2311 o defining how to secure data transfer with MPTCP, whilst not 2312 changing the signalling part of the protocol. 2314 o defining security that requires more option space, perhaps in 2315 conjunction with a "long options" proposal for extending the TCP 2316 options space (such as those surveyed in [19]), or perhaps 2317 building on the current approach with a second stage of MPTCP- 2318 option-based security. 2320 o re-visiting the working group's decision to exclusively use TCP 2321 options for MPTCP signalling, and instead look at also making use 2322 of the TCP payloads. 2324 MPTCP has been designed with several methods available to indicate a 2325 new security mechanism, including: 2327 o available flags in MP_CAPABLE (Figure 4); 2329 o available subtypes in the MPTCP option (Figure 3); 2330 o the version field in MP_CAPABLE (Figure 4). 2332 6. Interactions with Middleboxes 2334 Multipath TCP was designed to be deployable in the present world. 2335 Its design takes into account "reasonable" existing middlebox 2336 behaviour. In this section we outline a few representative 2337 middlebox-related failure scenarios and show how multipath TCP 2338 handles them. Next, we list the design decisions multipath has made 2339 to accommodate the different middleboxes. 2341 A primary concern is our use of a new TCP option. Middleboxes should 2342 forward packets with unknown options unchanged, yet there are some 2343 that don't. These we expect will either strip options and pass the 2344 data, drop packets with new options, copy the same option into 2345 multiple segments (e.g. when doing segmentation) or drop options 2346 during segment coalescing. 2348 MPTCP uses a single new TCP option "Kind", and all message types are 2349 defined by "subtype" values (see Section 8).
This should reduce the 2350 chances of only some types of MPTCP options being passed, and instead 2351 the key differing characteristics are different paths, and the 2352 presence of the SYN flag. 2354 MPTCP SYN packets on the first subflow of a connection contain the 2355 MP_CAPABLE option (Section 3.1). If this is dropped, MPTCP SHOULD 2356 fall back to regular TCP. If packets with the MP_JOIN option 2357 (Section 3.2) are dropped, the paths will simply not be used. 2359 If a middlebox strips options but otherwise passes the packets 2360 unchanged, MPTCP will behave safely. If an MP_CAPABLE option is 2361 dropped on either the outgoing or the return path, the initiating 2362 host can fall back to regular TCP, as illustrated in Figure 16 and 2363 discussed in Section 3.1. 2365 Subflow SYNs contain the MP_JOIN option. If this option is stripped 2366 on the outgoing path the SYN will appear to be a regular SYN to host 2367 B. Depending on whether there is a listening socket on the target 2368 port, host B will reply either with SYN/ACK or RST (subflow 2369 connection fails). When host A receives the SYN/ACK it sends a RST 2370 because the SYN/ACK does not contain the MP_JOIN option and its 2371 token. Either way, the subflow setup fails, but otherwise does not 2372 affect the MPTCP connection as a whole. 
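The subflow setup behaviour just described can be sketched from Host A's point of view as follows. The function name and the returned strings are hypothetical and serve only to make the three outcomes explicit.

```python
def react_to_join_reply(reply, has_mp_join):
    """Host A's reaction to Host B's reply to a SYN that carried MP_JOIN.

    reply is "SYN/ACK" or "RST"; has_mp_join says whether the reply
    still carries an MP_JOIN option.
    """
    if reply == "RST":
        # No listener on the target port once the SYN looked like a plain SYN.
        return "subflow setup failed; MPTCP connection unaffected"
    if reply == "SYN/ACK" and not has_mp_join:
        # The option was stripped somewhere on the path.
        return "send RST; subflow setup failed; MPTCP connection unaffected"
    if reply == "SYN/ACK":
        return "continue MP_JOIN handshake"
    raise ValueError("unexpected reply: %r" % (reply,))


if __name__ == "__main__":
    print(react_to_join_reply("SYN/ACK", False))
    print(react_to_join_reply("RST", False))
    print(react_to_join_reply("SYN/ACK", True))
```

In both failure cases only the new subflow is abandoned; data continues to flow on the existing subflows of the connection.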
2374 Host A Host B 2375 | Middlebox M | 2376 | | | 2377 | SYN(MP_CAPABLE) | SYN | 2378 |-------------------|---------------->| 2379 | SYN/ACK | 2380 |<------------------------------------| 2381 a) MP_CAPABLE option stripped on outgoing path 2383 Host A Host B 2384 | SYN(MP_CAPABLE) | 2385 |------------------------------------>| 2386 | Middlebox M | 2387 | | | 2388 | SYN/ACK |SYN/ACK(MP_CAPABLE)| 2389 |<----------------|-------------------| 2390 b) MP_CAPABLE option stripped on return path 2392 Figure 16: Connection Setup with Middleboxes that Strip Options from 2393 Packets 2395 We now examine data flow with MPTCP, assuming the flow is correctly 2396 setup, which implies the options in the SYN packets were allowed 2397 through by the relevant middleboxes. If options are allowed through 2398 and there is no resegmentation or coalescing to TCP segments, 2399 multipath TCP flows can proceed without problems. 2401 The case when options get stripped on data packets has been discussed 2402 in the Fallback section. If a fraction of options are stripped, 2403 behaviour is not deterministic. If some Data Sequence Mappings are 2404 lost, the connection can continue so long as mappings exist for the 2405 subflow-level data (e.g. if multiple maps have been sent that 2406 reinforce each other). If some subflow-level space is left unmapped, 2407 however, the subflow is treated as broken and is closed, through the 2408 process described in Section 3.6. MPTCP should survive with a loss 2409 of some Data ACKs, but performance will degrade as the fraction of 2410 stripped options increases. We do not expect such cases to appear in 2411 practice, though: most middleboxes will either strip all options or 2412 let them all through. 2414 We end this section with a list of middlebox classes, their behaviour 2415 and the elements in the MPTCP design that allow operation through 2416 such middleboxes. 
Issues surrounding dropping packets with options 2417 or stripping options were discussed above, and are not included here: 2419 o NATs [20] (Network Address (and Port) Translators) change the 2420 source address (and often source port) of packets. This means 2421 that a host will not know its public-facing address for signaling 2422 in MPTCP. Therefore, MPTCP permits implicit address addition via 2423 the MP_JOIN option, and the handshake mechanism ensures that 2424 connection attempts to private addresses [17] do not cause 2425 problems. Explicit address removal is undertaken by an Address ID 2426 to allow no knowledge of the source address. 2428 o Performance Enhancing Proxies (PEPs) [21] might pro-actively ACK 2429 data to increase performance. MPTCP, however, relies on accurate 2430 congestion control signals from the end host, and non-MPTCP-aware 2431 PEPs will not be able to provide such signals. MPTCP will 2432 therefore fall back to single-path TCP, or close the problematic 2433 subflow (see Section 3.6). 2435 o Traffic Normalizers [22] may not allow holes in sequence numbers, 2436 and may cache packets and retransmit the same data. MPTCP looks 2437 like standard TCP on the wire, and will not retransmit different 2438 data on the same subflow sequence number. In the event of a 2439 retransmission, the same data will be retransmitted on the 2440 original TCP subflow even if it is additionally retransmitted at 2441 the connection-level on a different subflow. 2443 o Firewalls [23] might perform initial sequence number randomization 2444 on TCP connections. MPTCP uses relative sequence numbers in data 2445 sequence mapping to cope with this. Like NATs, firewalls will not 2446 permit many incoming connections, so MPTCP supports address 2447 signaling (ADD_ADDR) so that a multi-addressed host can invite its 2448 peer behind the firewall/NAT to connect out to its additional 2449 interface. 

   o  Intrusion Detection Systems look out for traffic patterns and
      content that could threaten a network. Multipath operation may
      spread such data across subflows, making it more difficult for an
      IDS to analyse the traffic as a whole and potentially increasing
      the risk of false positives. An MPTCP-aware IDS, however, can
      read tokens to correlate multiple subflows and reassemble them
      for analysis.

   o  Application level middleboxes such as content-aware firewalls may
      alter the payload within a subflow, such as re-writing URIs in
      HTTP traffic. MPTCP will detect these using the checksum and
      close the affected subflow(s), if there are other subflows that
      can be used. If all subflows are affected, multipath will fall
      back to TCP, allowing such middleboxes to change the payload.
      MPTCP-aware middleboxes should be able to adjust the payload and
      MPTCP metadata in order not to break the connection.

   In addition, all classes of middleboxes may affect TCP traffic in the
   following ways:

   o  TCP Options may be removed, or packets with unknown options
      dropped, by many classes of middleboxes. It is intended that the
      initial SYN exchange, with a TCP Option, will be sufficient to
      identify the path capabilities. If such a packet does not get
      through, MPTCP will end up falling back to regular TCP.

   o  Segmentation/Coalescing (e.g. TCP segmentation offloading) might
      copy options between packets and might strip some options.
      MPTCP's data sequence mapping includes the relative subflow
      sequence number instead of using the sequence number in the
      segment. In this way, the mapping is independent of the packets
      that carry it.

   o  The Receive Window may be shrunk by some middleboxes at the
      subflow level. MPTCP will use the maximum window at data-level,
      but will also obey subflow specific windows.

7. Acknowledgments

   The authors were originally supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program.

   Alan Ford was originally supported by Roke Manor Research.

   The authors gratefully acknowledge significant input into this
   document from Sebastien Barre, Christoph Paasch, and Andrew McDonald.

   The authors also wish to acknowledge reviews and contributions from
   Iljitsch van Beijnum, Lars Eggert, Marcelo Bagnulo, Robert Hancock,
   Pasi Sarolahti, Toby Moncaster, Philip Eardley, Sergio Lembo,
   Lawrence Conroy, Yoshifumi Nishida, Bob Briscoe, Stein Gjessing,
   Andrew McGregor, Georg Hampel, Anumita Biswas, Wes Eddy, Alexey
   Melnikov, Francis Dupont, Adrian Farrel, Barry Leiba, Robert Sparks,
   Sean Turner, Stephen Farrell, and Martin Stiemerling.

8. IANA Considerations

   This document defines a new TCP option for MPTCP, assigned a value of
   30 (decimal) from the TCP Option space. This value is the value of
   "Kind" as seen in all MPTCP options in this document. This value is
   defined as:

          +------+--------+---------------+-----------------+
          | Kind | Length |    Meaning    |    Reference    |
          +------+--------+---------------+-----------------+
          |  30  |   N    | Multipath TCP | (This document) |
          +------+--------+---------------+-----------------+

                     Table 1: TCP Option Kind Numbers

   This document also defines a four-bit subtype field, for which IANA
   is to create and maintain a new sub-registry entitled "MPTCP option
   subtype values" under the TCP Parameters registry. Initial values
   for the MPTCP option subtype registry are given below; future
   assignments are to be defined by Standards Action as defined by [24].
   Assignments consist of the MPTCP subtype's symbolic name and its
   associated value, as per the following table.

   +--------------+----------------------------+---------------+-------+
   | Symbol       | Name                       | Reference     | Value |
   +--------------+----------------------------+---------------+-------+
   | MP_CAPABLE   | Multipath Capable          | Section 3.1   | 0x0   |
   | MP_JOIN      | Join Connection            | Section 3.2   | 0x1   |
   | DSS          | Data Sequence Signal (Data | Section 3.3   | 0x2   |
   |              | ACK and Data Sequence      |               |       |
   |              | Mapping)                   |               |       |
   | ADD_ADDR     | Add Address                | Section 3.4.1 | 0x3   |
   | REMOVE_ADDR  | Remove Address             | Section 3.4.2 | 0x4   |
   | MP_PRIO      | Change Subflow Priority    | Section 3.3.8 | 0x5   |
   | MP_FAIL      | Fallback                   | Section 3.6   | 0x6   |
   | MP_FASTCLOSE | Fast Close                 | Section 3.5   | 0x7   |
   +--------------+----------------------------+---------------+-------+

                      Table 2: MPTCP Option Subtypes

   The value 0xf is reserved for Private Use within controlled testbeds.

   This document also requests that IANA create another sub-registry,
   "MPTCP handshake algorithms", under the TCP Parameters registry,
   based on the flags in MP_CAPABLE (Section 3.1). The flags consist of
   eight bits, labelled "A" through "H", and this document assigns the
   bits as follows, where "(available)" means that the bit is available
   for future assignment:

      +----------+-------------------+----------------------------+
      | Flag Bit | Meaning           | Reference                  |
      +----------+-------------------+----------------------------+
      | A        | Checksum required | This document, Section 3.1 |
      | B        | Extensibility     | This document, Section 3.1 |
      | C        | (available)       |                            |
      | D        | (available)       |                            |
      | E        | (available)       |                            |
      | F        | (available)       |                            |
      | G        | (available)       |                            |
      | H        | HMAC-SHA1         | This document, Section 3.2 |
      +----------+-------------------+----------------------------+

                    Table 3: MPTCP Handshake Algorithms

   Note that the meanings of bits C through H may depend on bit B,
   according to how Extensibility is defined in future specifications;
   see Section 3.1 for more information.

   Future assignments in this registry are also to be defined by
   Standards Action as defined by [24]. Assignments consist of the
   value of the flags, a symbolic name for the algorithm, and a
   reference to its specification.

9. References

9.1. Normative References

   [1]  Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
        September 1981.

   [2]  Ford, A., Raiciu, C., Handley, M., Barre, S., and J. Iyengar,
        "Architectural Guidelines for Multipath TCP Development",
        RFC 6182, March 2011.

   [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [4]  National Institute of Standards and Technology, "Secure Hash
        Standard", Federal Information Processing Standard
        (FIPS) 180-3, October 2008.

9.2. Informative References

   [5]  Raiciu, C., Handley, M., and D. Wischik, "Coupled Congestion
        Control for Multipath Transport Protocols", RFC 6356,
        October 2011.

   [6]  Scharf, M. and A. Ford, "MPTCP Application Interface
        Considerations", draft-ietf-mptcp-api-05 (work in progress),
        April 2012.

   [7]  Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm",
        RFC 2992, November 2000.

   [8]  Bagnulo, M., "Threat Analysis for TCP Extensions for Multipath
        Operation with Multiple Addresses", RFC 6181, March 2011.

   [9]  Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing
        for Message Authentication", RFC 2104, February 1997.

   [10] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
        Selective Acknowledgment Options", RFC 2018, October 1996.

   [11] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
        Control", RFC 5681, September 2009.

   [12] Gont, F., "Survey of Security Hardening Methods for
        Transmission Control Protocol (TCP) Implementations",
        draft-ietf-tcpm-tcp-security-03 (work in progress), March 2012.

   [13] Eastlake, D., Schiller, J., and S. Crocker, "Randomness
        Requirements for Security", BCP 106, RFC 4086, June 2005.

   [14] Eastlake, D. and T. Hansen, "US Secure Hash Algorithms (SHA and
        SHA-based HMAC and HKDF)", RFC 6234, May 2011.

   [15] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions for
        High Performance", RFC 1323, May 1992.

   [16] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
        Explicit Congestion Notification (ECN) to IP", RFC 3168,
        September 2001.

   [17] Rekhter, Y., Moskowitz, R., Karrenberg, D., Groot, G., and E.
        Lear, "Address Allocation for Private Internets", BCP 5,
        RFC 1918, February 1996.

   [18] Braden, R., "Requirements for Internet Hosts - Communication
        Layers", STD 3, RFC 1122, October 1989.

   [19] Ramaiah, A., "TCP option space extension",
        draft-ananth-tcpm-tcpoptext-00 (work in progress), March 2012.

   [20] Srisuresh, P. and K. Egevang, "Traditional IP Network Address
        Translator (Traditional NAT)", RFC 3022, January 2001.

   [21] Border, J., Kojo, M., Griner, J., Montenegro, G., and Z.
        Shelby, "Performance Enhancing Proxies Intended to Mitigate
        Link-Related Degradations", RFC 3135, June 2001.

   [22] Handley, M., Paxson, V., and C. Kreibich, "Network Intrusion
        Detection: Evasion, Traffic Normalization, and End-to-End
        Protocol Semantics", Usenix Security 2001, 2001.

   [23] Freed, N., "Behavior of and Requirements for Internet
        Firewalls", RFC 2979, October 2000.

   [24] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
        Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

Appendix A. Notes on use of TCP Options

   The TCP option space is limited due to the length of the Data Offset
   field in the TCP header (4 bits), which defines the TCP header length
   in 32 bit words. With the standard TCP header being 20 bytes, this
   leaves a maximum of 40 bytes for options, and many of these may
   already be used by options such as timestamp and SACK.

   We have performed a brief study on the commonly used TCP options in
   SYN, data, and pure ACK packets, and found that there is enough room
   to fit all the options we propose using in this draft.

   SYN packets typically include MSS (4 bytes), window scale (3 bytes),
   SACK permitted (2 bytes) and timestamp (10 bytes) options. Together
   these sum to 19 bytes. Some operating systems appear to pad each
   option up to a word boundary, thus using 24 bytes (a brief survey
   suggests Windows XP and Mac OS X do this, whereas Linux does not).
   Optimistically, therefore, we have 21 bytes spare, or 16 if it has to
   be word-aligned. In either case, however, the SYN versions of
   Multipath Capable (12 bytes) and Join (12 or 16 bytes) options will
   fit in this remaining space.

   TCP data packets typically carry timestamp options in every packet,
   taking 10 bytes (or 12 with padding). That leaves 30 bytes (or 28,
   if word-aligned). The Data Sequence Signal (DSS) option varies in
   length depending on whether the Data Sequence Mapping and DATA_ACK
   are included, and whether the sequence numbers in use are 4 or 8
   octets. The maximum size of the DSS option is 28 bytes, so even that
   will fit in the available space. But unless a connection is both
   bi-directional and high-bandwidth, it is unlikely that all that
   option space will be required on each DSS option.

   Within the DSS option, it is not necessary to include the Data
   Sequence Mapping and DATA_ACK in each packet, and in many cases it
   may be possible to alternate their presence (so long as the mapping
   covers the data being sent in the following packet). It would also
   be possible to alternate between 4 and 8 byte sequence numbers in
   each option.

   On subflow and connection setup, an MPTCP option is also carried on
   the third packet (an ACK). These are 20 bytes (for Multipath
   Capable) and 24 bytes (for Join), both of which will fit in the
   available option space.

   Pure ACKs in TCP typically contain only timestamps (10 bytes). Here,
   multipath TCP typically needs to encode only the DATA_ACK (maximum of
   12 bytes). Occasionally ACKs will contain SACK information.
   Depending on the number of lost packets, SACK may utilize the entire
   option space. If a DATA_ACK has to be included, it will probably be
   necessary to reduce the number of SACK blocks to accommodate the
   DATA_ACK. However, the presence of the DATA_ACK is unlikely to be
   necessary in a case where SACK is in use, since until at least some
   of the SACK blocks have been retransmitted, the cumulative data-level
   ACK will not be moving forward (or if it does, due to retransmissions
   on another path, then that path can also be used to transmit the new
   DATA_ACK).
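
   As an illustration only, the option-space arithmetic above can be
   checked with a short Python sketch. The byte counts are those quoted
   in this appendix; the names (MAX_OPTION_SPACE, padded, spare_bytes
   and so on) are hypothetical and are not part of this specification.

```python
# Illustrative option-space budget check. Sizes are the byte counts quoted
# in this appendix; every name here is a hypothetical helper, not protocol API.
MAX_OPTION_SPACE = 40  # 60-byte maximum TCP header minus the 20-byte base header

# Options typically seen on a SYN, with their lengths in bytes.
SYN_OPTIONS = {"MSS": 4, "window scale": 3, "SACK permitted": 2, "timestamp": 10}

# MPTCP option sizes quoted in the text.
MP_CAPABLE_SYN = 12
MP_JOIN_SYN_MAX = 16  # 12 on the SYN, 16 on the SYN/ACK
DSS_MAX = 28


def padded(length, word=4):
    """Round an option length up to the next 32-bit word boundary."""
    return -(-length // word) * word


def spare_bytes(option_lengths, pad_each=False):
    """Return the option space left once the given options are placed."""
    used = sum(padded(n) if pad_each else n for n in option_lengths)
    return MAX_OPTION_SPACE - used


syn_spare = spare_bytes(SYN_OPTIONS.values())                        # no padding
syn_spare_padded = spare_bytes(SYN_OPTIONS.values(), pad_each=True)  # padded
data_spare = spare_bytes([10])                    # timestamp only, no padding
data_spare_padded = spare_bytes([10], pad_each=True)

# Even in the padded case the largest handshake and DSS options still fit.
fits_on_syn = max(MP_CAPABLE_SYN, MP_JOIN_SYN_MAX) <= syn_spare_padded
fits_on_data = DSS_MAX <= data_spare_padded
```

   The sketch follows the text in padding each option individually; a
   stack that pads only the option block as a whole would see slightly
   different figures.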

   The ADD_ADDR option can be between 8 and 22 bytes, depending on
   whether IPv4 or IPv6 is used, and whether the port number is present
   or not. It is unlikely that such signaling would fit in a data
   packet (although if there is space, it is fine to include it). It is
   recommended to use duplicate ACKs with no other payload or options in
   order to transmit these rare signals. Note this is the reason for
   mandating that duplicate ACKs with MPTCP options are not taken as a
   signal of congestion.

   Finally, there are issues with reliable delivery of options. As
   options can also be sent on pure ACKs, their delivery is not
   reliable. This is not an issue for DATA_ACK due to their cumulative
   nature, but may be an issue for ADD_ADDR/REMOVE_ADDR options. Here,
   it is recommended to send these options redundantly (whether on
   multiple paths, or on the same path on a number of ACKs - but
   interspersed with data in order to avoid interpretation as
   congestion). The cases where options are stripped by middleboxes are
   discussed in Section 6.

Appendix B. Control Blocks

   Conceptually, an MPTCP connection can be represented as an MPTCP
   control block that contains several variables that track the progress
   and the state of the MPTCP connection and a set of linked TCP control
   blocks that correspond to the subflows that have been established.

   RFC793 [1] specifies several state variables. Whenever possible, we
   reuse the same terminology as RFC793 to describe the state variables
   that are maintained by MPTCP.

B.1. MPTCP Control Block

   The MPTCP control block contains the following variables per
   connection.

B.1.1. Authentication and Metadata

   Local.Token (32 bits): This is the token chosen by the local host on
      this MPTCP connection. The token MUST be unique among all
      established MPTCP connections and is generated from the local key.

   Local.Key (64 bits): This is the key sent by the local host on this
      MPTCP connection.

   Remote.Token (32 bits): This is the token chosen by the remote host
      on this MPTCP connection, generated from the remote key.

   Remote.Key (64 bits): This is the key chosen by the remote host on
      this MPTCP connection.

   MPTCP.Checksum (flag): This flag is set to true if at least one of
      the hosts has set the A bit ("Checksum required", see Table 3) in
      the MP_CAPABLE options exchanged during connection establishment,
      and is set to false otherwise. If this flag is set, the checksum
      must be computed in all DSS options.

B.1.2. Sending Side

   SND.UNA (64 bits): This is the Data Sequence Number of the next byte
      to be acknowledged, at the MPTCP connection level. This variable
      is updated upon reception of a DSS option containing a DATA_ACK.

   SND.NXT (64 bits): This is the Data Sequence Number of the next byte
      to be sent. SND.NXT is used to determine the value of the DSN in
      the DSS option.

   SND.WND (32 bits with RFC1323, 16 bits without): This is the sending
      window. MPTCP maintains the sending window at the MPTCP
      connection level and the same window is shared by all subflows.
      All subflows use the MPTCP connection level SND.WND to compute the
      SEG.WND value which is sent in each transmitted segment.

B.1.3. Receiving Side

   RCV.NXT (64 bits): This is the Data Sequence Number of the next byte
      which is expected on the MPTCP connection. This state variable is
      modified upon reception of in-order data. The value of RCV.NXT is
      used to specify the DATA_ACK which is sent in the DSS option on
      all subflows.

   RCV.WND (32 bits with RFC1323, 16 bits otherwise): This is the
      connection-level receive window, which is the maximum of the
      RCV.WND on all the subflows.

B.2. TCP Control Blocks

   The MPTCP control block also contains a list of the TCP control
   blocks that are associated with the MPTCP connection.

   Note that the TCP control block on the TCP subflows does not contain
   the RCV.WND and SND.WND state variables as these are maintained at
   the MPTCP connection level and not at the subflow level.

   Inside each TCP control block, the following state variables are
   defined:

B.2.1. Sending Side

   SND.UNA (32 bits): This is the sequence number of the next byte to
      be acknowledged on the subflow. This variable is updated upon
      reception of each TCP acknowledgement on the subflow.

   SND.NXT (32 bits): This is the sequence number of the next byte to
      be sent on the subflow. SND.NXT is used to set the value of
      SEG.SEQ upon transmission of the next segment.

B.2.2. Receiving Side

   RCV.NXT (32 bits): This is the sequence number of the next byte
      which is expected on the subflow. This state variable is modified
      upon reception of in-order segments. The value of RCV.NXT is
      copied to the SEG.ACK field of the next segments transmitted on
      the subflow.

   RCV.WND (32 bits with RFC1323, 16 bits otherwise): This is the
      subflow-level receive window which is updated with the window
      field from the segments received on this subflow.

Appendix C. Finite State Machine

   The diagram in Figure 17 shows the Finite State Machine for
   connection-level closure. This illustrates how the DATA_FIN
   connection-level signal (indicated as the DFIN flag on a DATA_ACK)
   interacts with subflow-level FINs, and permits "break-before-make"
   handover between subflows.
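
   As a reading aid, the transitions of Figure 17 can also be written
   as a lookup table. The following Python sketch is one possible
   transcription of the figure and is not a normative definition; the
   state spellings, event strings, and the helper names step and run
   are hypothetical choices made for this example.

```python
# Connection-level closure transitions as read from Figure 17.
# Keys are (state, event); values are (next state, action).
# State/event spellings are informal labels chosen for this sketch.
TRANSITIONS = {
    ("M_ESTAB", "M_CLOSE"): ("M_FIN_WAIT_1", "snd DATA_FIN"),
    ("M_ESTAB", "rcv DATA_FIN"): ("M_CLOSE_WAIT", "snd DATA_ACK[DFIN]"),
    ("M_FIN_WAIT_1", "rcv DATA_ACK[DFIN]"): ("M_FIN_WAIT_2", "CLOSE all subflows"),
    ("M_FIN_WAIT_1", "rcv DATA_FIN"): ("M_CLOSING", "snd DATA_ACK"),
    ("M_FIN_WAIT_2", "rcv DATA_FIN"): ("M_TIME_WAIT", "snd DATA_ACK[DFIN]"),
    ("M_CLOSING", "rcv DATA_ACK[DFIN]"): ("M_TIME_WAIT", "CLOSE all subflows"),
    ("M_CLOSE_WAIT", "M_CLOSE"): ("M_LAST_ACK", "snd DATA_FIN"),
    ("M_LAST_ACK", "rcv DATA_ACK[DFIN]"): (
        "M_CLOSED",
        "CLOSE all subflows; delete MPTCP PCB",
    ),
    ("M_TIME_WAIT", "all subflows CLOSED"): ("M_CLOSED", "delete MPTCP PCB"),
}


def step(state, event):
    """Return (next_state, action) or raise ValueError if not in the figure."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}") from None


def run(events, state="M_ESTAB"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state, _action = step(state, event)
    return state


# Active close: the local application closes first.
active = run(["M_CLOSE", "rcv DATA_ACK[DFIN]", "rcv DATA_FIN",
              "all subflows CLOSED"])
# Passive close: the peer's DATA_FIN arrives first.
passive = run(["rcv DATA_FIN", "M_CLOSE", "rcv DATA_ACK[DFIN]"])
```

   A table keeps the sketch close to the figure: each arrow in the
   diagram corresponds to exactly one entry, so the two can be compared
   edge by edge.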

                               +---------+
                               | M_ESTAB |
                               +---------+
                      M_CLOSE    |     |    rcv DATA_FIN
                      -------    |     |    -------
   +---------+     snd DATA_FIN /       \ snd DATA_ACK[DFIN] +---------+
   |  M_FIN  |<-----------------         ------------------->| M_CLOSE |
   | WAIT-1  |---------------------------                    |   WAIT  |
   +---------+              rcv DATA_FIN \                   +---------+
     | rcv DATA_ACK[DFIN]     -------    |                   M_CLOSE |
     | --------------       snd DATA_ACK |                   ------- |
     | CLOSE all subflows                |              snd DATA_FIN |
     V                                   V                           V
   +-----------+              +-----------+                +-----------+
   |M_FINWAIT-2|              | M_CLOSING |                | M_LAST-ACK|
   +-----------+              +-----------+                +-----------+
     |                rcv DATA_ACK[DFIN] |        rcv DATA_ACK[DFIN] |
     | rcv DATA_FIN       -------------- |            -------------- |
     |   -------      CLOSE all subflows |        CLOSE all subflows |
     | snd DATA_ACK[DFIN]                V          delete MPTCP PCB V
     \                        +-----------+                  +---------+
       ---------------------->|M_TIME WAIT|----------------->| M_CLOSED|
                              +-----------+                  +---------+
                                         All subflows in CLOSED
                                              ------------
                                            delete MPTCP PCB

          Figure 17: Finite State Machine for Connection Closure

Authors' Addresses

   Alan Ford
   Cisco
   Ruscombe Business Park
   Ruscombe, Berkshire RG10 9NN
   UK

   Email: alanford@cisco.com

   Costin Raiciu
   University Politehnica of Bucharest
   Splaiul Independentei 313
   Bucharest
   Romania

   Email: costin.raiciu@cs.pub.ro

   Mark Handley
   University College London
   Gower Street
   London WC1E 6BT
   UK

   Email: m.handley@cs.ucl.ac.uk

   Olivier Bonaventure
   Universite catholique de Louvain
   Pl. Ste Barbe, 2
   Louvain-la-Neuve 1348
   Belgium

   Email: olivier.bonaventure@uclouvain.be