Internet Engineering Task Force                                M. Scharf
Internet-Draft                                  Alcatel-Lucent Bell Labs
Intended status: Experimental                              July 12, 2010
Expires: January 13, 2011

                 Multi-Connection TCP (MCTCP) Transport
                      draft-scharf-mptcp-mctcp-01

Abstract

   Multipath transport over potentially different paths can be realized
   by several coupled Transmission Control Protocol (TCP) connections.
   Multi-Connection TCP (MCTCP) transport aggregates multiple TCP
   connections between potentially different addresses into a single
   session that can be accessed by an application like a single TCP
   connection.
   MCTCP encodes control information, as far as possible, in the
   payload of the TCP connections and therefore requires only minor
   changes to TCP implementations, and it is transparent in the
   single-path case. MCTCP is therefore proposed as a simple, modular,
   and extensible mechanism for multipath transport.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Design Considerations
     3.1.  Objectives
     3.2.  Operation Summary
     3.3.  Differences to Other Multipath Transport Solutions
   4.  TCP Extensions by MCTCP
     4.1.  Setup of the Initial Connection
     4.2.  Setup of Coupled Connection
     4.3.  Usage of Coupled Connections
     4.4.  Operation Mode Switch
   5.  MCTCP Session Protocol Messages
     5.1.  Data Segmentation and Encoding
     5.2.  Retransmission Requests
     5.3.  Address Advertisement
     5.4.  Connection Management and Fallback
   6.  MCTCP Session Policies and Algorithms
     6.1.  Message Scheduling
     6.2.  Congestion and Flow Control
   7.  Interfaces
     7.1.  Interface between MCTCP and TCP
     7.2.  Interface to Applications
   8.  Interaction with Middleboxes
     8.1.  Middleboxes that Manipulate TCP Options
     8.2.  Middleboxes that Change Content
     8.3.  Middleboxes that Translate Addresses/Ports
     8.4.  Middleboxes that Want to Control MCTCP Traffic
     8.5.  Middleboxes that Proactively Acknowledge Data
   9.  Open Issues
   10. Security Considerations
   11. IANA Considerations
   12. Conclusion
   13. Acknowledgments
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Possible Future MCTCP Extension
   Appendix B.  Change History of the Document

1.  Introduction

   The objective of Multipath TCP is to enable transport over multiple
   paths that can be used like a regular TCP connection [1]. The
   motivation for using multiple paths, as well as design
   considerations, are discussed in [7].

   One key question concerning the Multipath TCP protocol design is how
   to transport the control information, which is required for the setup
   and the teardown of different sub-flows, as well as for the
   segmentation and reassembly of the byte stream in the sender and
   receiver, respectively. One possibility is to encode this signaling
   information in several new TCP options [8].

   This document describes Multi-Connection TCP (MCTCP) transport.
   MCTCP is an alternative solution that transports both application and
   control data with its own framing mechanism in the payload of
   parallel TCP connections, but only if multipath transport is really
   needed. MCTCP is simpler and more modular while providing almost the
   same service as a Multipath TCP protocol with option signaling.

   To applications, MCTCP offers the same reliable, in-order,
   byte-stream transport as TCP. It is designed to be
   backward-compatible with both applications and the network layer.
   Applications can use MCTCP exactly like a single TCP connection, as
   described in [11]. As long as multiple paths are not used, an MCTCP
   transfer is identical to a standard TCP transfer, except for a new
   TCP option in SYN segments that detects MCTCP support at the
   remote end.
   Once multi-connection transfer is enabled, data chunks are sent over
   several TCP connections with a new type-length-value (TLV) framing
   format. This framing also permits the exchange of arbitrary amounts
   of control information between the endpoints of the MCTCP session.
   The multiple TCP connections operate independently, but the MCTCP
   session coordinates the congestion control states. MCTCP can
   therefore use a coupled congestion control (e.g., [10]) that does
   not harm other network users.

2.  Terminology

   This document uses terminology that differs slightly from [8]:

   Path: A sequence of links between a sender and a receiver, defined
      in this context by a source and destination address pair.

   Initial connection: The first TCP connection between the two
      endpoints of the MCTCP session.

   Coupled connection: A coupled connection is a follow-up TCP
      connection that is part of the session. It roughly corresponds to
      a "subflow" in [8].

   Session: A collection of the initial connection and, if in use,
      one or more coupled TCP connections. The applications at the two
      endpoints of the session can communicate as if there were only a
      single TCP connection. For an application, there is a one-to-one
      mapping between a session and the socket. If a session includes
      only the initial connection, it is almost identical to a standard
      TCP connection, except for a new TCP option in the SYN segments.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [3].

3.  Design Considerations

   This section gives a high-level, non-compulsory overview of MCTCP's
   design and its usage.

3.1.  Objectives

   With multipath transport, applications should be able to use the
   aggregated bandwidth of several paths without having to deal with
   the details of data transport, path management, scheduling, and
   congestion control. This can improve both performance and resilience
   compared to the current data transport that is mostly limited to a
   single path.

   Yet, a multipath transport solution that requires multiple addresses
   on at least one side will only be useful under certain constraints:
   First, it requires endsystems with more than one address. One
   example is mobile devices with several radio interfaces, which are
   increasingly common. But even in that case it can make sense to use
   one interface only, for instance in order to save battery energy.
   Second, due to the signaling overhead and the effort of negotiation,
   a multipath transport mechanism is mainly useful for long bulk data
   file transfers. In the Internet, this use case only represents a
   small subset of TCP's usage scenarios.

   Given this rather specific use case, this document argues that a
   multipath transport mechanism should neither require complex
   modifications of the TCP stack, nor fundamentally change the TCP data
   transmission as seen by middleboxes on the path, at least as long as
   only a single path is in use. Obviously, once multipath transport is
   enabled, any middlebox performing deep packet inspection may get
   confused, as it will only see that part of the byte stream that is
   transported over the corresponding path. As a consequence, one can
   use a different framing format in that case. Furthermore, rapid
   deployment of a multipath solution would also significantly benefit
   from the possibility to implement it in user space, as far as
   possible.

   Multi-Connection TCP (MCTCP) transport is designed to be a simple,
   modular, extensible, and non-disruptive multipath transport
   mechanism.
   Key design objectives are:

   o  Backward-compatibility: MCTCP is designed to be entirely
      backward-compatible with a single TCP connection and falls back to
      standard TCP if it is not supported by both endsystems, or if the
      setup of additional coupled connections fails.

   o  Few TCP options only: MCTCP only requires new, short TCP options
      in SYN segments, at least for the basic operation. As a result,
      middleboxes that strip, duplicate, or modify TCP options, drop
      such packets, or reassemble the byte stream cannot affect the
      integrity of the data transport.

   o  Identical byte stream: MCTCP's byte stream is identical to that of
      a TCP connection until multipath usage is negotiated, except for
      the new TCP option in the SYN. As a fallback, it is in principle
      even possible to seamlessly continue the transport of the whole
      application data over the initial TCP connection if multipath
      transport fails (e.g., due to middleboxes).

   o  Simplicity: MCTCP tries to minimize the changes required inside
      existing network stacks. Except for a few straightforward
      add-ons, a coupled TCP connection is set up, maintained, and
      closed like a standard TCP connection. The major functions of
      MCTCP can be implemented in user space.

   o  Same API: MCTCP can provide the same API to applications as
      existing TCP.

   o  Multi-address assumption: MCTCP assumes that one or both endpoints
      of an MCTCP session are multihomed and multiaddressed.

   These objectives are achieved by defining two different operation
   modes of MCTCP, the single-connection and the multi-connection mode.

3.2.  Operation Summary

   In single-connection mode, an MCTCP session is equivalent to a single
   TCP connection. The required minimum of control information is
   exchanged by TCP options.
   When multipath transfer is to be enabled, MCTCP switches to the
   multi-connection mode, in which it opens additional, coupled TCP
   connections from or to possibly different addresses of the same
   endsystems. Initial and coupled connections are linked by two tokens
   in each session endpoint, which are exchanged during the setup of the
   initial connection.

   Each coupled TCP connection can transport control information and
   data chunks in messages that are encoded in a type-length-value (TLV)
   framing format. In multi-connection mode, the MCTCP transport on one
   of the coupled TCP connections is similar to the Transport Layer
   Security (TLS) protocol [5], except that data is not encrypted but
   partitioned over different connections. TLS can be used on top of
   MCTCP without requiring any adaptation.

   In summary, in single-connection mode MCTCP is transparent, while in
   multi-connection mode it acts as a shim layer between several coupled
   TCP connections and the upper protocol layers, with a payload
   encoding similar to that of TLS. An MCTCP session can also fall back
   to single-connection mode as a means to further increase MCTCP's
   robustness when facing problems with certain types of middleboxes.

              +-------------------------------+
              |          Application          |
              +-------------------------------+
                            ^^^^
                            ||||  Byte stream (e.g., socket interface)
                            VVVV
              +-------------------------------+
              |      MCTCP session layer      |
              +-------------------------------+
                ^^   ^               ^   ^^
      Chunked   ||   : Connection &  :   ||   Chunked
      data      ||   : cong. control :   ||   data
                VV   V               V   VV
              +---------------+---------------+
              | TCP connection| TCP connection|
              +-------------------------------+
              |      IP       |      IP       |
              +-------------------------------+

                 Figure 1: MCTCP in the protocol stack

   Figure 1 shows the position of MCTCP in the protocol stack, as a shim
   layer between (coupled) TCP connections and upper-layer protocols or
   applications. For MCTCP's connection management and the coupled
   congestion control, the MCTCP session layer requires an additional
   interface to each TCP connection, as well as some simple changes in
   the TCP stack, e.g., to set the new TCP option in SYN segments.
   Both modifications are straightforward and only affect a small subset
   of TCP's functions.

   The MCTCP session layer can be implemented in kernel space as an
   extension of the socket interface processing. Alternatively, the
   connection management, data segmentation/reassembly, and congestion
   control coupling can be realized in user space, in combination
   with some small modifications of TCP. As an example, MCTCP could be
   implemented as an extension of the library that offers the socket
   interface to applications. In both cases the MCTCP session layer can
   be completely transparent to applications, i.e., they can continue
   to use the existing socket interface to TCP [11].

   In the following, a high-level summary of the normal operation of
   MCTCP is provided, for the scenario shown in Figure 2:

   o  To a non-MCTCP-aware application, MCTCP will be transparent and
      indistinguishable from normal TCP. All MCTCP operation is handled
      by the MCTCP implementation, although extended APIs could provide
      additional control and influence [11]. An application begins by
      opening a TCP socket in the normal way.

   o  An MCTCP session begins in single-connection mode with a single
      TCP connection ("initial connection").
      This is illustrated in Figure 2 between Addresses A1 and B1 on
      Hosts A and B, respectively.

   o  MCTCP uses a "Multipath Capable" TCP option in the SYN segments
      to determine whether both endsystems support MCTCP. If the option
      is not echoed in the SYN/ACK, the connection initiator knows that
      the destination is not MCTCP-capable. If the SYN segment has to
      be retransmitted, the connection initiator will not set the
      "Multipath Capable" TCP option again, in order to circumvent
      problems with middleboxes that cannot deal with unknown TCP
      options. In that case, multipath transport cannot be used to that
      destination.

   o  MCTCP does not exchange much signaling information in
      single-connection mode, as this would require further TCP options
      outside SYN segments. The only exception is the non-mandatory
      "Mode" TCP option, which can be set by one endpoint in order to
      signal to the other endpoint that it should switch to
      multi-connection mode by establishing a coupled connection to the
      same destination IP address, over which additional information
      can then be exchanged. If this TCP option is removed on the path,
      MCTCP may not be able to enable multipath transport in some usage
      scenarios (e.g., behind NAPTs), but the single-connection
      transport will continue without being impacted.

   o  If additional addresses are available, and if they are to be used,
      MCTCP switches to the multi-connection mode.

   o  When entering multi-connection mode, the MCTCP session endpoints
      establish one or more coupled TCP connections. The first coupled
      connection should use the same IP source and destination address
      as the initial connection, in order to establish a control
      channel over which more information can be exchanged. Each
      coupled connection is added to the MCTCP session.

   o  MCTCP identifies multiple paths by the presence of multiple
      addresses at endpoints, and it can establish coupled connections
      between combinations of these multiple addresses. In the example
      shown in Figure 2, coupled connections are set up between A1 and
      B1, and between A2 and B1.

   o  The discovery and setup of additional coupled TCP connections is
      achieved through a path management method described later in
      this document.

   o  The coupled connections use TLV-encoded messages and can thus
      transport both control messages and data chunks. The data chunks
      include a session-level sequence number to allow the in-order
      reassembly of the data chunks from multiple coupled connections at
      the receiver.

          Host A                               Host B
 ------------------------             ------------------------
 Address A1    Address A2             Address B1    Address B2
 ----------    ----------             ----------    ----------
      |             |                      |             |
      |     "Initial connection" setup     |             |    ^
      |--------------SYN+MPCAP------------>|             |    |
      |(incl. Multipath Capable TCP option)|             |    | Single-
      |             |                      |             |    | conn.
      |<----------SYN/ACK+MPCAP------------|             |    | mode
      |             |                      |             |    |
      |#####Byte stream data transfer######|             |    V
      |             |                      |             |
      ~             ~                      ~             ~
      |             |                      |             |
      |     "Coupled connections" setup    |             |
      |--------------SYN+JOIN------------->|             |
      |<-----------SYN/ACK+JOIN------------|             |    ^
      |             |                      |             |    |
      |             |------SYN+JOIN------->|             |    | Multi-
      |             |<----SYN/ACK+JOIN-----|             |    | conn.
      |             |                      |             |    | mode
      |##########TLV data transfer#########|             |    |
      |             |                      |             |    |
      |             |##TLV data transfer###|             |    V
      |             |                      |             |

                     Figure 2: MCTCP usage scenario

   For simplicity, MCTCP does not send further data over the initial
   connection after it has triggered the transition to multi-connection
   mode. As a consequence, the initial connection will be unused in
   multi-connection mode. This document mandates keeping the initial
   connection open as long as other coupled connections exist.
   This design choice is motivated later in this document.

3.3.  Differences to Other Multipath Transport Solutions

   MCTCP follows the design principles outlined in [7], but it differs
   from the protocol design described in [8], which uses TCP options to
   transport all control information. In the following, the key
   advantages of MCTCP are summarized:

   o  MCTCP does not rely on frequently sent TCP options, in particular
      not on options that may have to be present in many packets. In
      the simplest case, it only requires two new types of TCP options
      that are set in SYN segments only. The required options are
      short and do not consume much of the TCP option space, which is
      already scarce in SYNs. It should also be noted that the
      selective acknowledgment (SACK) option [2] is currently the only
      major TCP option that is sporadically set after connection setup.
      Yet, SACK options are only present after packet losses or
      reordering events, which are rare, and they are often set in
      segments without payload. Sporadically adding other new TCP
      options to all kinds of segments may increase the complexity of
      the TCP sender, since the MSS must be adapted correspondingly. As
      a consequence, MCTCP may also be simpler to realize in combination
      with TCP segmentation offload on network cards.

   o  MCTCP's operation is much more robust in combination with
      middleboxes that strip, duplicate, or modify TCP options and/or
      drop packets with unknown TCP options. The worst case is that
      multipath transport will not be enabled on a path with such
      middleboxes, but the data stream's integrity will not be affected.
      In general, the transport of information in TCP options outside
      SYNs is not necessarily reliable, unless an acknowledgement and
      retransmission mechanism for that information exists.
      As a consequence, TCP options are not well suited for the
      transport of information that is absolutely essential for data
      integrity. It is also impossible to safely detect whether novel
      TCP options can indeed be exchanged between two hosts in the
      Internet, as the routing may change and additional middleboxes
      may appear on the paths, e.g., in mobile networks. Therefore, a
      signaling method that transports essential control information
      such as sequence numbers in TCP options is not robust in such
      environments. Obviously, it cannot efficiently use multiple paths
      if a middlebox blocks TCP options, as there is no way to reliably
      exchange control information in options. There are also
      situations where multipath transport with option encoding cannot
      even fall back to single-path transport, e.g., if routing changes
      and afterwards TCP options cannot be exchanged on all used paths.
      Unlike MCTCP, multipath transport with option encoding would break
      and not be able to complete ongoing data transfers in such cases,
      unless it used an MCTCP-like approach as well.

   o  MCTCP is also rather robust when middleboxes rewrite content, as
      it can use a checksum to safely detect content modifications in
      one or several connections. It could even define schemes that
      transfer such content in a different content encoding format.

   o  MCTCP offers a simple mechanism by which a middlebox can prevent
      the transport of any multi-connection traffic: It can simply drop
      SYN segments with the "JOIN" TCP option. In that case, unless
      routing changes, paths through that middlebox will not be used in
      multi-connection mode. If that middlebox is on the path of the
      initial connection, it will always see the whole, unmodified byte
      stream.

      This middlebox-friendly design is an advantage of the distinction
      between initial and coupled connections.
      It could also help to comply with certain network policies such
      as lawful interception.

   o  The TCP option space is limited to 40 bytes. In multi-connection
      mode, MCTCP can exchange any amount of information between the
      endsystems. As such, it is more extensible and flexible. For
      instance, without a length limitation, one can easily exchange a
      list of several IPv6 candidate addresses in the payload of a
      single TCP segment. It would also be possible to announce lists
      of candidate port numbers or even to exchange address information
      in the form of a Uniform Resource Identifier (URI) or any other
      referral object structure. Finally, MCTCP could use strong
      protection mechanisms between coupled connections to ensure that
      they indeed have the same endpoint, such as longer tokens.

   o  The design is modular, as the operation of a single TCP connection
      is almost independent of the multipath transport, except for the
      necessary coupling of congestion control. For instance, there is
      no need to modify the SACK scoreboard implementation in existing
      TCP implementations, and synchronization issues between different
      TCP connections are avoided.

   o  MCTCP has a reasonable deployment roadmap. Most functions of
      MCTCP can be realized in user space with only a small patch of
      the TCP implementation. The required extensions inside the
      network stack are simple, straightforward, and non-disruptive.
      This means that MCTCP can initially be deployed mostly as a user
      space solution, without lacking any features. As a second step,
      once the protocol is widely supported in the Internet, it could
      become an integral part of the network stack.

   o  The transport of control information in the payload is reliable
      and congestion-controlled. TLV-encoded messaging is
      straightforward and well-known, e.g., from TLS.
      MCTCP does not use a mandatory positive acknowledgement mechanism
      and therefore does not require frequent additional data transport
      in the reverse direction.

   o  MCTCP can be extended in the future, for instance to use a
      stronger protection for the coupling of connections, possibly
      even by exchange of cryptographic keys, if needed. A list of
      possible future extensions is provided in the appendix.

   MCTCP shares a number of properties with [8]. It can use a coupled
   congestion control in a similar way, and it is able to enable
   multipath transport under the same constraints.

   Still, it must be noted that there are a number of potential
   drawbacks of MCTCP's design as well:

   o  MCTCP is designed for the use case of a bulk data transfer that
      starts as a single-path transfer and is later "upgraded" in order
      to use multiple interfaces. This is the most obvious use case of
      multipath transfer, as transporting smaller amounts of data over
      multiple paths would result in a significant overhead. In
      contrast, MCTCP is less efficient if the multipath transfer is to
      be used right from the beginning of a transfer, due to the
      backward-compatible design of MCTCP's single-connection mode,
      which results in very limited control. If this use case were
      important, an MCTCP variant with payload encoding in the initial
      connection could be developed, too. Its design is
      straightforward, but left for further study, as it would only be
      of use in certain scenarios.

   o  MCTCP opens an additional TCP connection when switching to
      multi-connection mode, and it does not continue using the initial
      connection. The connection setup of the coupled connections
      results in a small delay, i.e., the path may not be fully
      utilized for a short time.
      An obvious optimization would be to transfer the congestion
      control state from the initial connection to the first coupled
      connection, in order to avoid the TCP Slow-Start there. Both
      connections should use the same path. It must be noted that not
      using the initial connection after the switch-over to the
      multi-connection mode is the simplest solution; alternative
      solutions are possible. Furthermore, the "handover" process and
      the resulting delay could be minimized by further optimization,
      but this is left for further study.

   o  MCTCP session endpoints do not exchange address information before
      entering the multi-connection mode, even if this would be possible
      by additional TCP options [8]. Both endsystems can initiate a
      change of operation mode, and address information can be exchanged
      by the MCTCP session protocol once this is successful. If the
      "Mode" TCP option is supported, an endpoint can even trigger the
      setup of a coupled connection by the other endpoint, e.g., if
      that host is located behind a NAPT. Yet, while in
      single-connection mode, MCTCP provides no means to learn other
      addresses. As a consequence, endsystems may try to enter the
      multi-connection mode in vain if they assume that their peer is
      multi-homed. If that peer is not multi-homed, it can either agree
      to switch to multi-connection mode, or refuse (by not responding
      with a "Join" option). In the former case, an additional TCP
      connection is needlessly established between both peers, and in
      the latter case data transfer could briefly slow down until MCTCP
      falls back to single-connection mode. For long-lived connections,
      which benefit most from multi-connection mode, neither case
      causes much harm.

   o  Given that MCTCP transports control information in the payload, it
      is more complex for middleboxes to parse and potentially modify
      MCTCP's control information.
      In order to do so, a middlebox has to perform deep packet
      inspection and reassemble the messages of the coupled TCP
      connection(s). This may prevent certain operations and
      optimizations by middleboxes. However, it should be noted that
      middleboxes cannot affect the payload in other related protocols
      such as TLS either, i.e., MCTCP is somewhat similar to TLS in
      that sense. Of course, middleboxes can still perform certain
      forms of traffic engineering for an individual coupled
      connection, such as randomizing initial sequence numbers or
      modifying the advertised receive window (which may, of course,
      do harm to any end-to-end connection). A middlebox that wants to
      prevent MCTCP usage can simply and safely drop packets with the
      TCP "Join" option and will then not be traversed by any multi-
      connection traffic, except if routing changes.

   o  If MCTCP detects that one coupled connection stalls, it can
      retransmit data over another connection, which can reduce the
      delivery time and prevent head-of-line blocking. However, if
      MCTCP is partly realized in user space, it might not be able to
      retransmit a lost segment immediately over another coupled
      connection, given that this would require complex changes of the
      segmentation and SACK scoreboard implementation in each coupled
      connection. As a result, if congestion occurs on a subset of the
      coupled connections, the end-to-end delivery delay of a user-
      space solution may be larger than the delay of a protocol that is
      tightly integrated into the protocol stack. In general, an
      implementation inside the protocol stack can assign data more
      flexibly and more dynamically to the different interfaces. This
      would be an advantage of a kernel-space implementation.
      Yet, a reasonable MCTCP session layer scheduling can reduce the
      risk of head-of-line blocking by simply avoiding long send buffer
      queues, even if it is realized in user space.

   o  MCTCP as defined in this document does not provide some signaling
      mechanisms of [8], such as the "DATA FIN". While it is obviously
      possible to add these mechanisms as well, doing so would result
      in a more complex protocol design and is therefore not addressed
      in this version of the protocol specification.

4.  TCP Extensions by MCTCP

   This section describes the modifications to the TCP protocol that
   are required by MCTCP. MCTCP only defines additional TCP options.
   Several TCP options and mechanisms are similar to [8], but differ in
   details. Later, Section 7.1 describes which information inside the
   TCP stack an MCTCP session must have access to.

4.1.  Setup of the Initial Connection

   The initial connection of an MCTCP session is set up like a TCP
   connection with a three-way handshake. A connection initiator that
   wants to announce its MCTCP capability sets the "Multipath Capable"
   TCP option in the SYN, as shown in Figure 3. This option only
   declares that its sender is capable of using MCTCP, even if it will
   not be enabled for that session. It includes a field that carries a
   locally unique token identifying this connection. The two tokens
   (one per endpoint) will be used when adding additional coupled
   connections to verify that the endpoint is identical.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------------------------------+
   | Kind=OPT_MPCAP|   Length=6    |         Sender Token          :
   +---------------+---------------+-------------------------------+
   :     Sender Token (contd.)     |
   +-------------------------------+

                    Figure 3: Multipath Capable option

   This option MUST only be present in packets with the SYN flag set.
   It is only used in the initial TCP connection, in order to identify
   the MCTCP session; all following (coupled) connections will use
   another, similar option to join the MCTCP session.

   If a SYN contains a "Multipath Capable" option but the SYN/ACK does
   not, it is assumed that the responder is not multipath capable and
   thus the MCTCP session MUST fall back to standard TCP. If a SYN does
   not contain a "Multipath Capable" option, the SYN/ACK MUST NOT
   contain one in response.

   There are two tokens in an MCTCP session, one per endsystem. The
   token is generated by the sender and has local meaning only. It MUST
   be unique for the sender. The token MUST be difficult for an
   attacker to guess, and thus it SHOULD be generated randomly.

   If the SYN packets are unacknowledged, it is up to a local policy to
   decide how to respond. A sender SHOULD fall back to standard TCP
   (i.e., without the "Multipath Capable" option) after a maximum
   number of attempts, in order to work around middleboxes that may
   drop packets with unknown options. The number of attempts that are
   made is up to local policy. Once the connection initiator has sent a
   SYN without the "Multipath Capable" option, it MUST fall back to
   regular TCP behavior, even if it subsequently receives a SYN/ACK
   that contains a "Multipath Capable" option. This might happen if the
   "Multipath Capable" SYN and the subsequent non-MP-capable SYN are
   reordered. This ensures that the two endpoints end up in an
   interoperable state, no matter in what order the SYNs arrive at the
   passive opener.

4.2.  Setup of Coupled Connection

   An MCTCP session can open additional, coupled TCP connections. These
   coupled TCP connections all run the MCTCP session protocol with TLV
   encoding, as specified below.
   The endsystems can also use the coupled connections, in particular
   the first one, to exchange knowledge about their own address(es).
   Using this knowledge, an endpoint can initiate further coupled
   connections over currently unused pairs of addresses. Either
   endpoint that is part of an MCTCP session SHOULD be able to initiate
   the creation of a new coupled connection.

   A new coupled connection is started as a normal TCP three-way
   handshake. The "Join" TCP option (Figure 4) identifies the session
   that the new connection should become a part of. The token used is
   the locally unique token of the destination of the connection, as
   received in the "Multipath Capable" option during the SYN and
   SYN/ACK exchange of the initial connection.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+---------------+-------------------------------+
   | Kind=OPT_JOIN |   Length=6    |        Receiver Token         :
   +---------------+---------------+-------------------------------+
   :    Receiver Token (contd.)    |
   +-------------------------------+

                     Figure 4: Multipath Join option

   This option MUST only be present when the SYN flag is set. The
   recipient of a "Join" option with a token that is valid for an
   existing MCTCP session must decide whether to allow an additional
   coupled connection, or whether to deny it. If the coupled connection
   is to be established, the recipient of the SYN responds with a
   SYN/ACK that also contains a "Join" option, carrying the initiator's
   token.

   Otherwise, if the recipient decides to deny the setup of a coupled
   connection, it MUST reply with a TCP RST. If the token is unknown at
   the recipient, the recipient MUST also respond with a TCP RST in the
   same way as when an unknown TCP port is used.
   Similarly, if the initiator of a coupled connection receives a
   SYN/ACK with an invalid token or a SYN/ACK without the "Join"
   option, it must send a TCP RST. In all these cases, the setup
   procedure of that coupled connection MUST be abandoned. As a result,
   the endpoints MUST return to single-connection mode if it is the
   first coupled connection. If there are already other coupled
   connections, the endpoints SHOULD NOT use that address pair for
   multipath transport. The verification of the tokens in both
   endpoints of the MCTCP session ensures that the endpoints of a
   coupled connection are identical to the endpoints of the initial
   connection. Also, middleboxes that drop packets with SYN options, or
   strip the option, can be detected in that way.

   A local policy SHOULD ensure that an endpoint stops re-sending SYNs
   with the "Join" option if it receives a TCP RST or if it does not
   receive corresponding SYN/ACKs. In general, an endpoint SHOULD NOT
   try to open further coupled connections if previous attempts to the
   same destination address failed. An endpoint SHOULD also refrain
   from attempts to switch to multi-connection mode if this repeatedly
   failed before; this SHOULD be governed by a local policy.

          Host A                               Host B
 ------------------------             ------------------------
 Address A1    Address A2             Address B1    Address B2
 ----------    ----------             ----------    ----------
      |             |                      |             |
      |---------SYN+MPCAP (Token A)------->|             |   ^
      |<-----SYN/ACK+MPCAP (Token B)-------|             |   | Single-
      |             |                      |             |   | conn.
      |########Initial connection##########|             |   | mode
      |             |                      |             |   V
      ~             ~                      ~             ~
      |             |                      |             |
      |---------SYN+JOIN (Token B)-------->|             |
      |<------SYN/ACK+JOIN (Token A)-------|             |   ^
      |             |                      |             |   |
      |<=====E.g., MCTCP Add. Address======|             |   | Multi-
      |             |                      |             |   | conn.
      |             |----------SYN+JOIN (Token B)------->|   | mode
      |             |<-------SYN/ACK+JOIN (Token A)------|   |
      |             |                      |             |   |
      |######First coupled connection######|             |   |
      |             |                      |             |   |
      |             |#####Second coupled connection######|   V
      |             |                      |             |

                  Figure 5: Example use of MCTCP tokens

   Figure 5 illustrates the usage of the two MCTCP tokens. An endpoint
   can decide to switch to multi-connection mode at any time, as long
   as the initial connection is established. In multi-connection mode,
   an endpoint can add further coupled connections at any time.

4.3.  Usage of Coupled Connections

   The setup of the first coupled connection MUST use the same source
   and destination IP addresses and SHOULD use the same destination
   port as the initial connection. This implies that the first coupled
   connection SHOULD be actively opened by the initiator of the initial
   connection. This constraint ensures that the first coupled
   connection indeed uses valid addresses and that it uses the same
   path as the initial connection. It also facilitates user-space
   implementation and network address port translation (NAPT)
   traversal. The first coupled connection has a special role because
   it enables the exchange of addresses or other information, which can
   be useful to set up additional coupled connections.

   The token supplied in the initial connection's SYN exchange is used
   for the demultiplexing of coupled connections, i.e., to associate a
   new coupled connection with an existing MCTCP session. This means
   that the port numbers in a SYN of a coupled connection are not used
   for demultiplexing. Still, an active opener of a new coupled
   connection SHOULD use a destination port number that is already in
   use by the passive opener, as long as the 5-tuple is unique for each
   host. Once a coupled connection is established, demultiplexing
   packets is done using the 5-tuple, as in traditional TCP.
   This strategy is intended to maximize the probability of the SYN
   being permitted by a firewall or network address port translation
   (NAPT) at the recipient and to avoid confusing any network
   monitoring software.

   Control information can be sent over any established coupled
   connection, and it always affects the MCTCP session as a whole. As
   control information and data chunks are transported over the same
   pipe and may experience queueing in the send buffer, it is
   reasonable to send important control information immediately after
   the establishment of a new coupled connection (as shown in Figure 5
   for an "MCTCP Additional Address" message). A scheduler in the MCTCP
   session layer decides which MCTCP messages are sent over which
   coupled connection.

4.4.  Operation Mode Switch

   An MCTCP session endpoint MUST change its operation mode from
   single-connection to multi-connection mode once the first coupled
   connection is successfully set up.

   Either endpoint of an MCTCP session can request the other endpoint
   to switch to multi-connection mode by means of the "Mode" TCP option
   depicted in Figure 6. This may be useful if only the other endpoint
   can establish coupled TCP connections, e.g., if it is located behind
   a middlebox performing network address port translation (NAPT).

                        1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +---------------+---------------+
   | Kind=OPT_MODE |   Length=2    |
   +---------------+---------------+

                          Figure 6: Mode option

   This TCP option MAY be set in segments of the initial connection.
   Its implementation is RECOMMENDED. It MAY be set in segments with or
   without payload once the initial connection is established, as long
   as the MCTCP session is not in multi-connection mode. The option is
   also allowed in SYN/ACK segments, but not in pure SYN segments.
   If it is set in the SYN/ACK, it asks the connection initiator to
   enter multi-connection mode immediately. When receiving a "Mode" TCP
   option, an MCTCP endpoint MAY send a SYN with the "Join" TCP option
   to the destination address and port of the initial connection, and
   switch to multi-connection mode. It is also allowed to silently
   ignore that notification and to continue in single-connection mode.
   An endsystem MUST refrain from resending "Mode" TCP options
   frequently if the MCTCP session cannot successfully negotiate
   multi-connection mode, in order to avoid needless effort.

5.  MCTCP Session Protocol Messages

   All coupled TCP connections run the MCTCP session protocol, which
   transports both data chunks and control messages in the format that
   is defined in this section.

5.1.  Data Segmentation and Encoding

   In multi-connection mode, MCTCP segments data into chunks and
   transports them as TLV-encoded messages over one or more coupled TCP
   connections. The framing format of these chunks is shown in
   Figure 7.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_CHUNK|     Total message length      |C|  Reserved   |
   +---------------+-------------------------------+---------------+
   |               Session sequence number (64 bit)                :
   +---------------------------------------------------------------+
   :               Session sequence number (contd.)                |
   +---------------------------------------------------------------+
   |                                                               |
   ~                     Data chunk (variable)                     ~
   |                                                               |
   +---------------------------------------------------------------+
   |                  Optional checksum (32 bit)                   |
   +---------------------------------------------------------------+

                    Figure 7: MCTCP Data Chunk message

   If a receiver observes a corrupted MCTCP message, e.g., an invalid
   TLV format or an invalid checksum, it SHOULD close the corresponding
   coupled connection by sending a TCP FIN.

   MCTCP uses a global sequence number space during a session. The
   value 0 refers to the first byte that is sent over the initial
   connection. An MCTCP receiver reassembles the byte stream according
   to that sequence number and delivers the data in order to the upper
   protocol layer or application.

   If the C flag is set, the MCTCP Data Chunk message includes a 32-bit
   checksum that covers the whole MCTCP message. The checksum is
   OPTIONAL, but it helps to detect middleboxes that modify the TCP
   byte stream. If it is present, a receiver MUST verify the checksum.
   If there is a checksum mismatch, the receiver MUST discard the MCTCP
   message and its data, and it SHOULD close the corresponding coupled
   connection, as the integrity of the TLV framing on that connection
   is not guaranteed any more. The receiver MAY ask for a
   retransmission of the corresponding data chunk over an alternative
   coupled connection, as defined in the next section. If there is only
   one coupled connection, it is possible to fall back to transport
   over the initial connection, as discussed below.

   If present, the checksum is calculated by the Castagnoli CRC-32C
   algorithm that is also used in the Stream Control Transmission
   Protocol (SCTP) [4].

   The sequence number in the first Data Chunk message sent over
   coupled TCP connections SHOULD be that of the first byte that the
   MCTCP implementation has not already enqueued on the initial
   connection. In that case, there is no overlap between data
   transported over the initial connection and data transported over
   the coupled connections, which simplifies the reassembly.
   An MCTCP sender MAY also resend data that has already been written
   to the initial connection if a coupled connection can use a faster
   path, but it MUST NOT resend data that has already been acknowledged
   on the initial connection by the receiver.

   A sender SHOULD NOT write further data to the initial connection
   after it has sent its first Data Chunk message to a coupled
   connection, in order to simplify the reconstruction of the byte
   stream in the receiver. The only exception is a fallback to single-
   connection mode, which is needed if all coupled connections are
   closed. The initial connection transports the upper layer protocol's
   byte stream without any gaps, i.e., the global session sequence
   number implicitly increases continuously even after multi-connection
   mode is entered. As a consequence, apart from redundancy and
   fallback, it does not make much sense to continue sending the
   application byte stream over the initial connection. A receiver
   SHOULD close the MCTCP session if it detects an inconsistency
   between the byte stream received over the initial connection and the
   data chunks on the coupled connections.

   The maximum allowed size of an MCTCP message is 65535 octets.
   Therefore, the maximum data chunk size is 2^16-13 = 65523 octets,
   i.e., 65535 octets minus the 12-octet header that precedes the data
   chunk. The minimum allowed data chunk size is 1 octet.

   The segmentation of the application byte stream into data chunks and
   their assignment to coupled TCP connections is decided by a local
   algorithm in the MCTCP sender, which may take into account path
   characteristics such as MSS, congestion control state, and other
   relevant information (e.g., the page size in case of a kernel
   implementation). An efficient segmentation algorithm should avoid
   sending small data chunks, in order to reduce the header overhead in
   both the MCTCP and the TCP layer.
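
   The framing rules of this section can be illustrated by a short,
   non-normative sketch in Python. It packs and parses the Data Chunk
   message of Figure 7 and enforces the size limits given above.
   Several details are assumptions made for illustration only: the
   numeric value of MSG_CHUNK, the position of the C flag as the most
   significant bit of the last header octet, and the rule that the
   checksum is computed over the header and the data chunk and is
   counted in the total message length.

```python
import struct

MSG_CHUNK = 0x01                 # hypothetical type code, not assigned here
FLAG_C = 0x80                    # assumed bit position of the C flag
HEADER = struct.Struct("!BHBQ")  # type, total length, flags, 64-bit sequence
MAX_MESSAGE = 65535              # maximum MCTCP message size in octets


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def encode_chunk(seq: int, data: bytes, with_checksum: bool = False) -> bytes:
    """Build one Data Chunk message carrying `data` at sequence number `seq`."""
    total = HEADER.size + len(data) + (4 if with_checksum else 0)
    if not data or total > MAX_MESSAGE:
        raise ValueError("data chunk size out of range")
    flags = FLAG_C if with_checksum else 0
    msg = HEADER.pack(MSG_CHUNK, total, flags, seq) + data
    if with_checksum:
        # Assumption: the checksum covers everything that precedes it.
        msg += struct.pack("!I", crc32c(msg))
    return msg


def decode_chunk(msg: bytes) -> tuple[int, bytes]:
    """Parse one complete Data Chunk message and return (seq, data)."""
    if len(msg) < HEADER.size:
        raise ValueError("truncated Data Chunk message")
    msg_type, total, flags, seq = HEADER.unpack_from(msg)
    end = total - (4 if flags & FLAG_C else 0)
    if msg_type != MSG_CHUNK or total != len(msg) or end <= HEADER.size:
        raise ValueError("malformed Data Chunk message")
    if flags & FLAG_C:
        (expected,) = struct.unpack_from("!I", msg, end)
        if crc32c(msg[:end]) != expected:
            raise ValueError("checksum mismatch")
    return seq, msg[HEADER.size:end]
```

   With the 12-octet header, encode_chunk accepts at most 65523 data
   octets when no checksum is requested. Under the assumption stated
   above, enabling the checksum reduces this limit by 4 octets.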

   MCTCP does not provide positive acknowledgements at session layer,
   since TCP transport is reliable as long as paths do not fail. An
   MCTCP instance is allowed to free the memory after handing data over
   to a connection. In that case, if a coupled TCP connection fails or
   if it is closed, it may be impossible to complete the transfer on
   other coupled connections. Therefore, it is RECOMMENDED that an
   MCTCP instance caches sent data for a certain time. An MCTCP sender
   can duplicate or retransmit data chunks over other coupled
   connections, even with overlapping sequence numbers. The receiver
   can explicitly request such retransmissions as described in the next
   section. A retransmission strategy is more efficient if the
   retransmission is sent over a coupled connection that does not have
   a long-standing sending queue. The MCTCP sender can infer the
   connection state from the sequence numbers and congestion control
   state of the individual connections.

5.2.  Retransmission Requests

   As the individual coupled TCP connections already provide reliable
   transport, the session error recovery only has to deal with
   connection failure or middlebox problems. If a path fails, it will
   be necessary to retransmit the data that has not been successfully
   transported. In this case the MCTCP sender SHOULD retransmit the
   data on a coupled connection over another path by assembling new
   MCTCP Data Chunk messages. It MAY also close the MCTCP session
   instead.

   There are two ways for the MCTCP sender to determine what data has
   to be retransmitted. It can try to implicitly determine the missing
   data from the amount of unacknowledged data in the connection that
   failed, if it has access to this information.

   Alternatively, the MCTCP receiver can explicitly request the
   retransmission of data that has not successfully been received.
   Since MCTCP session messages are transported reliably, MCTCP uses a
   negative acknowledgment (NACK) mechanism: The receiver MAY send
   MCTCP Retransmission Request messages in order to indicate gaps in
   the received global sequence number space. However, a receiver
   SHOULD wait until there is reasonable evidence that the data has
   been lost due to path failure, or that a retransmission over another
   coupled connection would be of significant benefit, in order to
   avoid spurious retransmissions. The MCTCP Retransmission Request
   message MAY also be sent after a checksum mismatch in a Data Chunk
   message. It is allowed to send these messages over several coupled
   connections in parallel. Such messages should rarely be required,
   since TCP transport is in general reliable unless paths completely
   fail. If there are several gaps in the sequence number space, the
   receiver SHOULD coalesce the sequence numbers in a reasonable way to
   reduce the overhead. The message format of the MCTCP Retransmission
   Request message is defined in Figure 8:

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_RTXRQ|     Total message length      |C|  Reserved   |
   +---------------+-------------------------------+---------------+
   |            Start session sequence number (64 bit)             :
   +---------------------------------------------------------------+
   :            Start session sequence number (contd.)             |
   +---------------------------------------------------------------+
   |             End session sequence number (64 bit)              :
   +---------------------------------------------------------------+
   :             End session sequence number (contd.)              |
   +---------------------------------------------------------------+

              Figure 8: MCTCP Retransmission Request message

   The two sequence numbers refer to the first and last missing byte in
   the session sequence number space. Upon reception of this message,
   an MCTCP sender SHOULD retransmit the data over one or more coupled
   connections other than the one that was originally used. The MCTCP
   sender must still have the data buffered in order to be able to
   retransmit the data. MCTCP also allows the MCTCP sender to close the
   MCTCP session instead of retransmitting data, as single-path data
   transport over that path would have failed, too.

5.3.  Address Advertisement

   As motivated in [7], path management refers to the exchange of
   information about additional paths between endpoints. MCTCP requires
   multiple addresses at endpoints to be able to use multiple, possibly
   at least partly disjoint paths.

   In multi-connection mode, MCTCP can explicitly signal additional
   addresses of one endpoint to the other endpoint, which allows it to
   initiate new connections. The MCTCP session can therefore also deal
   with addresses that change.

   The "Additional Address" MCTCP message announces additional
   addresses on which an endpoint can be reached (Figure 9 and
   Figure 10). Multiple messages can be sent in succession in order to
   advertise several addresses. This message can be sent at any time
   over any coupled connection.

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_AADD4|   Total message length = 8    |   Reserved    |
   +---------------+-------------------------------+---------------+
   |                     IPv4 address (32 bit)                     |
   +---------------------------------------------------------------+

              Figure 9: MCTCP Additional IPv4 Address message

   In Figure 9, the "Additional Address" message is shown for IPv4. The
   reserved bits could be used to express priorities or policies (e.g.,
   "use now").

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_AADD6|   Total message length = 20   |   Reserved    |
   +---------------+-------------------------------+---------------+
   |                                                               |
   ~                    IPv6 address (128 bit)                     ~
   |                                                               |
   +---------------------------------------------------------------+

             Figure 10: MCTCP Additional IPv6 Address message

   Furthermore, there are MCTCP messages to remove candidate addresses,
   which are shown in Figure 11 and Figure 12. If an address is
   removed, an endpoint SHOULD NOT try to open further coupled
   connections to that address. Already established coupled connections
   are not affected by these messages and must be explicitly closed
   separately.
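
   As a non-normative illustration, the address messages of Figure 9
   to Figure 12 can be encoded and parsed as in the following Python
   sketch. All four messages share a 4-octet header followed by the
   packed address, so the total message length is 8 octets for IPv4 and
   20 octets for IPv6. The numeric type codes are hypothetical, as this
   section does not assign values, and the reserved octet is assumed to
   be sent as zero and ignored on reception.

```python
import ipaddress
import struct

# Hypothetical type codes; this document does not assign numeric values.
MSG_AADD4, MSG_AADD6, MSG_RADD4, MSG_RADD6 = 0x10, 0x11, 0x12, 0x13
HEADER = struct.Struct("!BHB")  # type, total message length, reserved
LENGTHS = {MSG_AADD4: 8, MSG_RADD4: 8, MSG_AADD6: 20, MSG_RADD6: 20}


def encode_address(address: str, remove: bool = False) -> bytes:
    """Build an Additional Address or Remove Address message."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        msg_type = MSG_RADD4 if remove else MSG_AADD4
    else:
        msg_type = MSG_RADD6 if remove else MSG_AADD6
    # Reserved octet is assumed to be zero on transmission.
    return HEADER.pack(msg_type, LENGTHS[msg_type], 0) + ip.packed


def decode_address(msg: bytes):
    """Parse one address message and return (remove, address)."""
    if len(msg) < HEADER.size:
        raise ValueError("truncated address message")
    msg_type, total, _reserved = HEADER.unpack_from(msg)
    if LENGTHS.get(msg_type) != total or total != len(msg):
        raise ValueError("malformed address message")
    remove = msg_type in (MSG_RADD4, MSG_RADD6)
    return remove, ipaddress.ip_address(msg[HEADER.size:])
```

   A receiver of a Remove Address message would stop opening coupled
   connections to the decoded address, while leaving established
   coupled connections untouched.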

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_RADD4|   Total message length = 8    |   Reserved    |
   +---------------+-------------------------------+---------------+
   |                     IPv4 address (32 bit)                     |
   +---------------------------------------------------------------+

               Figure 11: MCTCP Remove IPv4 Address message

                        1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +---------------+-------------------------------+---------------+
   | Type=MSG_RADD6|   Total message length = 20   |   Reserved    |
   +---------------+-------------------------------+---------------+
   |                                                               |
   ~                    IPv6 address (128 bit)                     ~
   |                                                               |
   +---------------------------------------------------------------+

               Figure 12: MCTCP Remove IPv6 Address message

5.4.  Connection Management and Fallback

   Each coupled TCP connection is maintained individually. A FIN only
   closes that individual connection. If an application closes the
   socket, the MCTCP shim layer MUST close the initial connection and
   all existing coupled connections. Apart from that, the MCTCP layer
   may always close (or even re-open) coupled connections, governed by
   the local path management policies. In multi-connection mode, the
   MCTCP session is only closed once all coupled connections are
   closed. Coupled connections can be kept in the half-open state, but
   the MCTCP connection management SHOULD avoid this. It would be
   possible to specify an MCTCP message for explicitly closing the
   MCTCP session, or several coupled connections, but this is left for
   further study.

   MCTCP SHOULD keep the initial connection established while in
   multi-connection mode, even if it is not used for data transport any
   more. This makes it possible to expose valid addresses and port
   numbers to the application [11]. Keep-alives MAY be sent.
   The initial connection is closed by the MCTCP layer when all coupled
   connections are closed. If the initial connection is closed, the
   whole MCTCP session SHOULD be closed, too. Further studies are
   needed to understand whether the initial connection could safely be
   closed earlier, and whether an MCTCP session can be kept established
   even if the addresses of the initial connection cannot be used any
   more.

   If an MCTCP receiver detects that the byte stream on a coupled
   connection has been modified by a middlebox, it SHOULD close the
   corresponding coupled connection. By means of error recovery and
   retransmission schemes, the corresponding data can then be
   transferred over other coupled connections. If all coupled
   connections are closed, the session SHOULD fall back to single-
   connection mode. Then, data transfer SHOULD continue over the
   initial connection. The MCTCP session MUST NOT try to enter multi-
   connection mode again. As an alternative, either of the two session
   endpoints MAY decide to close the MCTCP session in case of such a
   violation of TCP's end-to-end semantics.

   In certain cases, byte counters of the initial connection in the
   sender and receiver could get desynchronized if a middlebox
   transparently changes the length of the content sent over the
   initial connection. As also discussed in Section 8, this violation
   of TCP's end-to-end semantics can be detected in the receiver, e.g.,
   if there is a gap between the first byte received from the coupled
   connections and the last byte received from the initial connection.
   Alternatively, there could be an overlap or potentially even
   mismatching content. If the receiver detects this, it SHOULD
   immediately close all coupled connections.
   This means that the MCTCP session falls back to single-connection
   mode and continues the byte stream data transport over the initial
   connection, including all middlebox modifications. As another
   remedy, or if a fallback is not possible, either sender or receiver
   MAY also decide to close the MCTCP session in case of such an event.
   Further work is needed to define whether MCTCP should also have a
   method to resynchronize the sequence numbers at sender and receiver
   in such cases.

6.  MCTCP Session Policies and Algorithms

   This document does not mandate specific policies on how to use and
   share resources on the coupled connections. Still, this section
   addresses some important issues that an MCTCP implementation must
   take into account.

6.1.  Message Scheduling

   Data and control messages can be assigned to any coupled TCP
   connection and are then sent over that connection. Messages may be
   duplicated or retransmitted for redundancy reasons. The receiver
   MUST process the messages in one coupled TCP connection in the order
   of arrival. In-order message processing among several coupled
   connections of one MCTCP session is not ensured.

6.2.  Congestion and Flow Control

   The MCTCP protocol has neither its own congestion control nor its
   own flow control. Instead, it relies on the algorithms in the
   individual TCP connections. In the following, the operation is
   explained in more detail for the multi-connection mode. In single-
   connection mode, there is no difference compared to a normal TCP
   connection.

   Concerning flow control, the operation is straightforward: If the
   MCTCP receiver runs out of buffer space, it stops reading data from
   one or more coupled TCP connections.
   Depending on TCP's flow control and the available receive buffer,
   the flow control on one or more connections may throttle data
   transport until the MCTCP layer can process data again.

   The MCTCP layer SHOULD at least be able to queue one full-sized
   MCTCP message (i.e., 65535 octets) for each established coupled TCP
   connection. In order to avoid stalls of the data transfer, an
   endsystem SHOULD NOT actively or passively open coupled TCP
   connections when it is short on memory. Similarly, coupled
   connections SHOULD NOT be established if an application explicitly
   sets small send or receive buffer sizes [11].

   The coupled connections have different congestion windows. To
   achieve resource pooling, it is necessary to couple the congestion
   windows in use on each connection, in order to push most traffic to
   uncongested links and avoid unfairness. One algorithm that aims at
   achieving this objective is presented in [10]. MCTCP is able to use
   this or other coupled congestion control algorithms.

   In addition, an MCTCP sender may have local policies to decide how
   much traffic to send over the available connections. It could also
   obtain path cost metrics from the receiver. The latter could be
   realized by a new MCTCP message defining connection priorities,
   which is left for further study.

7.  Interfaces

   This section describes MCTCP's interfaces from a functional point of
   view. Their realization is implementation-specific.

7.1.  Interface between MCTCP and TCP

   MCTCP must be able to control a small set of features inside a TCP
   stack and therefore requires a corresponding interface:

   o  The MCTCP layer must be able to set a "Multipath Capable" or
      "Join" TCP option in SYN segments.
It must also be notified if 1186 those options are set in an incoming SYN segment, it must be able 1187 to access the tokens, and it must be able to influence how to 1188 respond depending on the token value (i. e., either by a SYN/ACK 1189 or RST). 1191 o The MCTCP layer may set the "Mode" TCP option on the established 1192 initial connection, in any segment other than pure SYNs, and it 1193 should be notified if that option is received. 1195 o The MCTCP layer must be able to affect the congestion window on 1196 each coupled connection. Depending on the algorithm, it may be 1197 sufficient just to set periodically certain parameters of the 1198 congestion control, such as the additive increase factor. 1200 For efficient operation, MCTCP may also have to read certain 1201 information from each coupled TCP connection, such as: 1203 o The current amount of acknowledged and unacknowledged data on that 1204 connection, or the corresponding pointers to the byte stream. 1206 o The receive window advertised by the other endpoint on that 1207 connection. 1209 o The estimated round-trip time. 1211 o The maximum transmission unit (MTU) of the path, or TCP's maximum 1212 segment size (MSS). Note that the MSS is not a constant value if 1213 TCP options are added to data segments. 1215 Many operating systems provide already information about a subset of 1216 these parameters by a kernel/user-space interface. 1218 7.2. Interface to Applications 1220 MCTCP provides reliable, in-order, byte-stream transport to 1221 applications and thus can be used by legacy applications like a 1222 standard TCP connection [11]. When MCTCP is realized inside the 1223 network stack, it is a new function block between the TCP instance 1224 and the socket interface, which is transparent to applications. 
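
   What "transparent to applications" means in practice can be
   illustrated with an ordinary TCP program.  The following Python
   sketch uses only the standard socket API over the loopback
   interface.  Nothing in it is MCTCP-specific, and no MCTCP
   implementation is assumed to exist on the host; the point being
   illustrated is the assumption that an in-stack MCTCP would leave
   such application code unchanged.  The helper names (echo_once,
   run_demo) are hypothetical and chosen for this example only.

```python
import socket
import threading


def echo_once(listener: socket.socket) -> None:
    """Accept one connection and echo bytes back until the peer shuts down."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def run_demo(payload: bytes) -> bytes:
    """Send a small payload over loopback TCP and return the echoed bytes.

    Intended for small payloads only: the client writes everything
    before it starts reading.
    """
    # Port 0 lets the operating system pick a free port.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_once, args=(listener,))
    server.start()

    received = bytearray()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)  # signal the end of our byte stream
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            received.extend(chunk)

    server.join()
    listener.close()
    return bytes(received)
```

   The application only sees a reliable, in-order byte stream; how many
   TCP connections carry that stream underneath is not visible at this
   interface.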

   Alternatively, MCTCP can be implemented in large parts by a
   user-space library that accesses an extended network stack by the
   socket interface, which may have to be enhanced to provide some
   additional control functions as explained in the previous section.
   Applications could then still use the standard APIs to that library
   and would not be affected at all.  Such a user-space implementation
   in combination with a simple patch of the network stack could
   facilitate the initial deployment of MCTCP.

8.  Interaction with Middleboxes

   There are various types of middleboxes in the Internet.  Some of
   them only parse a TCP stream (e.g., deep packet inspection), while
   others change TCP header fields on the fly, and some may even
   rewrite the TCP payload.  MCTCP is designed to be compatible with
   most types of middleboxes, but as middlebox behavior is not well
   specified, some open issues may remain.

8.1.  Middleboxes that Manipulate TCP Options

   One class of middleboxes may strip, duplicate, or modify TCP options
   and/or drop packets with unknown TCP options, and this may even
   depend on whether the SYN flag is set or not.  If a middlebox
   removes MCTCP's TCP options in SYN segments, multipath transport
   will not be enabled at all (if that middlebox is on the path of the
   initial connection), or not over that path (if the middlebox is on
   the path of a potential coupled connection towards another address).
   Still, data transfer over the initial connection or other coupled
   connection(s) can continue without being significantly affected.

   Other TCP options that could be used by MCTCP are non-mandatory,
   i.e., the data integrity is not affected when these options are
   stripped or duplicated.
   In summary, unlike protocols that transport essential information in
   TCP options outside SYNs, MCTCP operates safely in an environment
   with middleboxes that strip, duplicate, or modify TCP options and/or
   drop packets with unknown TCP options.

8.2.  Middleboxes that Change Content

   Other middleboxes may rewrite the content of the TCP payload and
   possibly also its length (e.g., by rewriting URIs).  MCTCP, as well
   as other multipath transport solutions, requires a session-level
   sequence number space for the in-order reassembly of the application
   data.  If a middlebox changes the content and/or length on the
   initial connection or on coupled connections, it may be impossible
   to correctly reassemble the byte stream at the receiver.

   MCTCP will in many cases be able to detect changes of content over
   coupled connections, as it loses track of the TLV framing on that
   connection.  Content modifications can be detected even more
   reliably if the sender adds checksums to the data chunks.  If MCTCP
   detects a middlebox that changes the byte stream on a coupled
   connection, it will close the corresponding coupled connection.
   Using error recovery and retransmission schemes, the corresponding
   content can then be transferred over other coupled connections, or
   over the initial connection as a fallback method.

   If a middlebox changes the length of the byte stream on the initial
   connection, the sequence numbers at sender and receiver will not be
   synchronized when entering multi-connection mode, and there could be
   a gap or an overlap, possibly even with mismatching content.  MCTCP
   can detect both cases.  MCTCP keeps the initial connection open even
   in multi-connection mode.  Therefore, if a content length
   modification on the initial connection is detected, it can fall back
   to the initial connection by closing all coupled connections and
   continue to use single-path transport.
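
   The gap/overlap check and the resulting fallback can be sketched as
   follows.  This is a minimal illustration in Python and not part of
   the specification.  All names (Offset, classify_offset, Session,
   on_first_chunk) are hypothetical.  The sketch assumes that session
   sequence numbers count bytes of the application byte stream, that
   they can be modeled as plain integers without wrap-around, and that
   the receiver knows the session sequence number at which the first
   data chunk starts.

```python
from dataclasses import dataclass, field
from enum import Enum


class Offset(Enum):
    IN_SYNC = "in_sync"
    GAP = "gap"          # receiver saw fewer bytes than the sender sent
    OVERLAP = "overlap"  # receiver saw more bytes than the sender sent


def classify_offset(bytes_received: int, first_chunk_seq: int) -> Offset:
    """Compare the bytes delivered on the initial connection with the
    session sequence number of the first data chunk (assumed model)."""
    if first_chunk_seq == bytes_received:
        return Offset.IN_SYNC
    if first_chunk_seq > bytes_received:
        return Offset.GAP
    return Offset.OVERLAP


@dataclass
class Session:
    """Toy session state: labels of open coupled connections and the mode."""
    coupled: list = field(default_factory=list)
    mode: str = "multi"

    def fall_back(self) -> None:
        # Close all coupled connections; the initial connection stays open.
        self.coupled.clear()
        self.mode = "single"


def on_first_chunk(session: Session, bytes_received: int,
                   first_chunk_seq: int) -> Offset:
    """Fall back to single-connection mode on any length mismatch."""
    result = classify_offset(bytes_received, first_chunk_seq)
    if result is not Offset.IN_SYNC:
        session.fall_back()
    return result
```

   In this model, a gap corresponds to a middlebox that shortened the
   byte stream on the initial connection and an overlap to one that
   lengthened it.  Both cases trigger the same fallback.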

8.3.  Middleboxes that Translate Addresses/Ports

   NAPT middleboxes that are unaware of MCTCP create two problems:
   First, as hosts have local addresses only, and the global addresses
   are not necessarily known to hosts behind the NAPT, it may not be
   possible to advertise addresses to the other endpoint.  Second, it
   may be impossible for one endpoint to open a coupled TCP connection
   to an endpoint sitting behind a NAPT middlebox.

   In order to address the latter issue, MCTCP defines the Mode option.
   With that option, one endpoint can ask the other endpoint to enter
   multi-connection mode.  As shown in Figure 13, sending this TCP
   option is useful if one endpoint has multiple public IP addresses,
   but cannot announce them over the initial connection.  If the host
   behind the NAPT middlebox receives the option and establishes a
   coupled connection, this can be used to convey the information about
   the other public address, and a coupled connection to that address
   can then be established, too.

            Host A             NAPT             Host B
   ------------------------     //     ------------------------
   Address A1    Address A2     //     Address B1    Address B2
   (private)     (private)      //     (public)      (public)
   ----------    ----------     //     ----------    ----------
       |             |          //          |             |
       |---------SYN+MPCAP------//--------->|             |    ^
       |<-----SYN/ACK+MPCAP-----//----------|             |    | Single-
       |             |          //          |             |    | conn.
       |###Initial connection###//##########|             |    | mode
       |             |          //          |             |    V
       ~             ~          ~~          ~             ~
       |             |          //          |             |
       |<--------Mode option----//----------|             |
       |             |          //          |             |
       |---------SYN+JOIN-------//--------->|             |
       |<------SYN/ACK+JOIN-----//----------|             |    ^
       |             |          //          |             |    |
       |#1st coupled connection#//##########|             |    |
       |             |          //          |             |    |
       |<=MCTCP Add. Address B2=//==========|             |    | Multi-
       |             |          //          |             |    | conn.
       |---------SYN+JOIN-------//----------------------->|    | mode
       |<------SYN/ACK+JOIN-----//------------------------|    |
       |             |          //          |             |    |
       |#2nd coupled connection#//########################|    V
       |             |          //          |             |

                Figure 13: Example use of the Mode option

8.4.  Middleboxes that Want to Control MCTCP Traffic

   Given that MCTCP transports control information in the payload, it
   is more complex for middleboxes to parse and potentially modify
   MCTCP's control information.  In order to do so, a middlebox must
   perform deep packet inspection and it has to parse the MCTCP session
   messages in the TCP connection.  This may prevent certain operations
   and optimizations by middleboxes.  However, it should be noted that
   middleboxes cannot affect the payload in TLS either, i.e., MCTCP is
   somewhat similar to TLS in that sense.  As a remedy, it could be
   possible to define a TCP option that contains an offset field with a
   pointer to the first byte of an MCTCP control message, so that a
   middlebox can find control messages without parsing the whole byte
   stream of a coupled TCP connection.  Yet, such an option would be
   subject to all limitations of sporadically added TCP options.

   A middlebox that wants to prevent MCTCP usage can drop SYN segments
   containing the "Join" TCP option without causing any significant
   harm.  If that middlebox is on the path of the initial connection,
   MCTCP will continue using the backward-compatible initial TCP
   connection only.  If the middlebox is on the path towards another
   address, i.e., if the multi-connection mode is already entered,
   MCTCP will not establish an additional coupled connection.  Under
   the assumption of stable routing, no TLV-encoded content will pass
   that middlebox in either case.
   Instead of dropping SYN segments with the "Join" TCP option, a
   middlebox could also strip the "Join" option, as the setup of a
   coupled connection will then fail.  This method would avoid timeouts
   and further retransmission attempts by the sender.

   Alternatively, a middlebox could remove the "Multipath Capable" TCP
   option from SYN segments.  Then, MCTCP will behave identically to a
   standard TCP connection and never try to switch to multi-connection
   mode.  However, it is not recommended to drop SYN segments
   containing the "Multipath Capable" TCP option as a means to prevent
   MCTCP, since this needlessly results in a longer connection setup
   time, and since just dropping segments with the "Join" option would
   be sufficient.

8.5.  Middleboxes that Proactively Acknowledge Data

   Finally, there might be middleboxes that proactively acknowledge
   data, or middleboxes that transparently split the TCP connection.
   Such middleboxes break the end-to-end semantics of TCP connections
   [6], i.e., TCP cannot ensure a reliable end-to-end transport of data
   over such middleboxes.  Mitigating the drawbacks of proactively
   acknowledging middleboxes is mostly orthogonal to multipath
   transport.

   Yet, if such a middlebox is on a path used by MCTCP, and if this
   path fails, a specific problem arises: The MCTCP sender may
   erroneously assume that the data over the corresponding coupled
   connections has already been received by the receiver, and therefore
   it will not retransmit it.  In that case, after some time, the MCTCP
   receiver will observe a gap in the session sequence number space and
   can issue a request for retransmission.  The sender can then decide
   whether to retransmit the data over another coupled connection to
   solve this problem, or it can just close the session.
   MCTCP explicitly allows the latter behavior, as a single-path
   transport over the path with that middlebox would have failed, too.

   If MCTCP used positive session layer acknowledgements, future
   middleboxes could parse MCTCP's session messages and proactively
   acknowledge data on the session level, too.  MCTCP does not
   incorporate a positive session layer acknowledgement mechanism in
   order to prevent such a further violation of the end-to-end
   principle.  Of course, future middleboxes could still try to modify
   the retransmission requests inside the coupled connections, but this
   would not have any significant benefit.

9.  Open Issues

   o  Avoiding inconsistencies when switching in parallel to
      multi-connection mode.

   o  MCTCP does not support out-of-band TCP signaling transport
      (urgent flag).

10.  Security Considerations

   A generic threat analysis for the addition of multipath capabilities
   to TCP is presented in [9].  MCTCP is designed along the assumptions
   of that document, with some enhancements.  In general, MCTCP is
   subject to security threats similar to those of [8], but due to its
   extensibility, additional protection mechanisms could be
   incorporated in a future version.  For instance, MCTCP can employ
   more secure mechanisms to protect the coupling of TCP connections,
   even by cryptographic keys like in TLS.

   MCTCP uses a 32-bit token only, in order to save TCP option space in
   SYN segments.  This is reasonable, as this token is only required to
   authenticate the initiator of the first coupled connection, which
   must use the same IP source and destination addresses as the initial
   connection, i.e., off-path attacks are not possible.
   Coupled connections that are added subsequently could use a more
   secure protection scheme at the MCTCP session layer, either by
   longer 64-bit tokens, or even by cryptographic methods, which could
   be exchanged by corresponding MCTCP control messages (not specified
   in this version of the document).

   This section will be extended in a later version of this document.

11.  IANA Considerations

   This document will make a request to IANA to allocate new values for
   TCP option identifiers:

   o  OPT_MPCAP ("Multipath Capable" option)

   o  OPT_JOIN ("Join" option in order to add a coupled connection to
      the MCTCP session)

   o  OPT_MODE ("Mode" option that requests change from
      single-connection to multi-connection operation mode)

   This document also defines several types of MCTCP messages:

   o  MSG_CHUNK ("MCTCP Data Chunk")

   o  MSG_RTXRQ ("MCTCP Retransmission Request")

   o  MSG_AADD4 ("MCTCP Additional IPv4 Address")

   o  MSG_AADD6 ("MCTCP Additional IPv6 Address")

   o  MSG_RADD4 ("MCTCP Remove IPv4 Address")

   o  MSG_RADD6 ("MCTCP Remove IPv6 Address")

12.  Conclusion

   Multi-connection TCP transport is a simple, modular, and extensible
   solution to enable reliable transfer over multiple paths.  This
   specification defines the protocol on top of the TCP byte stream,
   the few required extensions of TCP, and the light-weight interface
   between MCTCP and each TCP connection.  In summary, MCTCP is a
   reasonable and incrementally deployable alternative to a signaling
   mechanism that uses TCP options only.

13.  Acknowledgments

   Michael Scharf is supported by the German-Lab project
   (http://www.german-lab.de/) funded by the German Federal Ministry of
   Education and Research (BMBF).

14.  References

14.1.  Normative References

   [1]   Postel, J., "Transmission Control Protocol", STD 7, RFC 793,
         September 1981.
   [2]   Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
         Selective Acknowledgment Options", RFC 2018, October 1996.
   [3]   Bradner, S., "Key words for use in RFCs to Indicate
         Requirement Levels", BCP 14, RFC 2119, March 1997.
   [4]   Stewart, R., "Stream Control Transmission Protocol", RFC 4960,
         September 2007.
   [5]   Dierks, T. and E. Rescorla, "The Transport Layer Security
         (TLS) Protocol Version 1.2", RFC 5246, August 2008.

14.2.  Informative References

   [6]   Border, J., Kojo, M., Griner, J., Montenegro, G., and Z.
         Shelby, "Performance Enhancing Proxies Intended to Mitigate
         Link-Related Degradations", RFC 3135, June 2001.
   [7]   Ford, A., Raiciu, C., Barre, S., and J. Iyengar,
         "Architectural Guidelines for Multipath TCP Development",
         draft-ietf-mptcp-architecture-01 (work in progress),
         June 2010.
   [8]   Ford, A., Raiciu, C., and M. Handley, "TCP Extensions for
         Multipath Operation with Multiple Addresses",
         draft-ietf-mptcp-multiaddressed-00 (work in progress),
         June 2010.
   [9]   Bagnulo, M., "Threat Analysis for Multi-addressed/Multi-path
         TCP", draft-ietf-mptcp-threat-02 (work in progress),
         March 2010.
   [10]  Raiciu, C., Handley, M., and D. Wischik, "Coupled
         Multipath-Aware Congestion Control",
         draft-raiciu-mptcp-congestion-01 (work in progress),
         March 2010.
   [11]  Scharf, M. and A. Ford, "MPTCP Application Interface
         Considerations", draft-scharf-mptcp-api-02 (work in
         progress), July 2010.

Appendix A.  Possible Future MCTCP Extension

   This memo describes the baseline specification of MCTCP and the
   required minimum set of functions.  A future version of this
   specification may additionally add several other features to MCTCP,
   such as:

   o  Exchange of longer tokens (e.g., 64-bit) for connection coupling,
      using MCTCP control messages.

   o  Signaling messages to exchange policy information concerning the
      usage of the coupled TCP connections.

   o  A signaling message that advertises combinations of addresses and
      port numbers, e.g., to deal with corresponding policies on one
      endpoint.

   o  A signaling message that advertises additional addresses in
      another format, e.g., as URI.

   o  Positive MCTCP session-level acknowledgements ("data
      acknowledgement").

   o  A checksum in all MCTCP messages.

   o  Signaling messages to negotiate different payload encoding
      formats, e.g., MIME-like encoding.  A future version of the MCTCP
      session protocol could also define retransmission requests for a
      different encoding format to work around content-modifying
      middleboxes.

   o  MCTCP control messages that manage coupled connections, such as a
      method to explicitly ask for closing several connections at MCTCP
      layer, similar to a "DATA FIN".

   o  A simple MCTCP session flow control mechanism, complementing
      TCP's flow control.

   o  A negotiation whether to indeed keep the initial connection
      established in multi-connection mode, assuming that it could
      either be closed or reused as a coupled connection.

   o  A variant of this protocol that uses TLV-encoded message
      transport right from the beginning.

   o  A method to discover and negotiate features between the two MCTCP
      session endpoints, e.g., by Hello messages similar to TLS.

   Further studies are needed to determine whether some of these
   functions should be added to MCTCP.  If so, their implementation may
   partly be optional and negotiated between the session endpoints.
   The baseline MCTCP design should be kept as simple as possible.

Appendix B.  Change History of the Document

   Changes compared to version 00:

   o  Addition of a checksum in data chunk messages

   o  Definition of a message to request retransmission

   o  Description of how to fall back to single-connection mode

   o  Discussion of proactively acking middleboxes

   o  Various clarifications of the design motivations

Author's Address

   Michael Scharf
   Alcatel-Lucent Bell Labs
   Lorenzstrasse 10
   70435 Stuttgart
   Germany

   EMail: michael.scharf@alcatel-lucent.com