idnits 2.17.1

draft-van-beijnum-1e-mp-tcp-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (May 6, 2009) is 5469 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 1323 (Obsoleted by RFC 7323)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Downref: Normative reference to an Informational RFC: RFC 2992

  -- Obsolete informational reference (is this intentional?): RFC 1072
     (Obsoleted by RFC 1323, RFC 2018, RFC 6247)

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)

  == Outdated reference: A later version (-12) exists of
     draft-ietf-shim6-proto-11


     Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Network working group                                     I. van Beijnum
Internet-Draft                                            IMDEA Networks
Expires: November 7, 2009                                    May 6, 2009


                        One-ended multipath TCP
                     draft-van-beijnum-1e-mp-tcp-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 7, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   Normal TCP/IP operation is for the routing system to select a best
   path that remains stable for some time, and for TCP to adjust to the
   properties of this path to optimize throughput.  A multipath TCP
   would be able to either use capacity on multiple paths, or
   dynamically find the best performing path, and therefore reach higher
   throughput.  By adapting to the properties of several paths through
   the usual congestion control algorithms, a multipath TCP shifts its
   traffic to less congested paths, leaving more capacity on the more
   congested paths available for traffic that can't move to another
   path.  And when a path fails, this can be detected and worked around
   by TCP much more quickly than by waiting for the routing system to
   repair the failure.

   This memo specifies a multipath TCP that is implemented on the
   sending host only, without requiring modifications on the receiving
   host.

Table of Contents

   1.  Introduction
   2.  Notational Conventions
   3.  Congestion control
     3.1.  RTT measurements
     3.2.  Fast retransmit
     3.3.  Slow retransmit
     3.4.  SACK
     3.5.  Fairness and TCP friendliness
   4.  Path selection
     4.1.  The multipath IP layer
     4.2.  The path indication option
     4.3.  Timestamp integration option
     4.4.  Path for retransmissions
     4.5.  ECN
     4.6.  Path MTU discovery
   5.  Flow control and buffer sizes
   6.  Handling of RSTs
   7.  Middlebox considerations
   8.  Security considerations
   9.  IANA considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informational References
   Appendix A.  Document and discussion information
   Appendix B.  An implementation strategy
   Author's Address

1.  Introduction

   In order to achieve redundancy to protect against failures, network
   operators generally install more links than the minimum necessary to
   achieve reachability.  So there are often multiple paths between any
   two given hosts, even when paths not allowed by policy are removed.
   However, routing protocols usually select a single "best" path.
   When multiple paths are used at the same time by the routing system,
   those tend to be parallel links between two routers or paths that are
   otherwise very similar.  As such, a lot of potentially usable network
   capacity is left unused.  A multipath transport protocol would be
   able to use more of that capacity by sending its data along multiple
   paths at the same time, or by switching to a path with more available
   capacity.

   As TCP [RFC0793] is used by the vast majority of all networked
   applications, and TCP is responsible for the vast majority of all
   data transmitted over the internet, the logical choice would be to
   make TCP capable of using multiple paths.  SCTP already has the
   ability to use multiple paths through the use of multiple addresses.
   However, using SCTP in this way requires significant application
   changes and deployment would be challenging because there is no
   obvious way for an application to know whether a service is available
   over SCTP rather than, or in addition to, TCP.  In addition, SCTP as
   defined today [RFC2960] does not accommodate the concurrent use of
   multiple paths.  Additional paths are purely used for backup
   purposes.

   This memo describes a one-ended multipath TCP, which only changes the
   behavior of the TCP sender, achieving multipath advantages when
   communicating with unmodified TCP receivers.  This means it is not
   possible to perform path selection by using different destination
   addresses.  However, other mechanisms that are transparent to the
   receiver are possible.  A simple one would be for the sender to send
   some packets to one router, and other packets to another router.  If
   these routers then make different routing decisions for the
   destination address in the TCP packets, the packets flow over
   different paths part of the way.  Other mechanisms to achieve the
   same goal are also possible.
   However, with a single destination address, paths can't be
   completely disjoint.

   Using multiple paths at the same time brings up a number of
   challenges and questions:

   o  Naive scheduling (such as round robin) of transmissions over the
      different paths reduces performance of each path to that of the
      slowest path.

   o  Using multiple paths causes reordering, which triggers the fast
      retransmit algorithm, causing unnecessary retransmissions and
      reduced performance.

   o  TCP requires in-order delivery of data to the application, so when
      losses occur on one path, buffer capacity may run out and data
      can't be transmitted on unaffected paths until the lost data has
      been retransmitted.

   o  Using multiple paths with an instance of regular congestion
      control on each path for a single TCP session makes that session
      use network capacity more aggressively than single path sessions,
      which can be considered "unfair" and increases packet loss.

   This memo seeks to address the first two issues by running separate
   instances of TCP's congestion control algorithms for the subflows
   that flow over different paths.  Buffer issues are addressed by
   retransmitting packets before buffer space runs out, even if normal
   retransmission timers haven't fired yet.  The fairness issue is a
   topic of ongoing research; this specification simply limits the
   number of subflows to limit unfairness and increased loss.

   The one-ended multipath TCP takes advantage of the fact that TCP
   [RFC0793] congestion control [RFC2581] and flow control are performed
   by the sender.  With regard to flow control and congestion control,
   the role of the receiver is limited to sending back acknowledgments
   and advertising how much data it is prepared to receive.
   Hence, it is possible for the sender to utilize different paths and
   modify the fast retransmit logic as long as the receiver recognizes
   the packets as belonging to the same session.  So a multipath TCP
   sender can distribute packets over multiple paths as long as this
   doesn't require incompatible modifications to the IP or TCP header
   contents, most notably the addresses.  A single-ended multipath TCP
   session must still be between a single source address and a single
   destination address, regardless of the path taken by packets.

   The subset of the packets belonging to a TCP session flowing over a
   given path is designated a subflow.

   In order to benefit from using multiple paths, it's necessary for the
   multipath TCP sender to execute separate TCP congestion control
   instances for the packets belonging to different subflows.  In the
   case where all packets are subject to the same congestion window,
   performance over a fast and a slow path will often be poorer than
   over just the fast path, defeating the purpose of using multiple
   paths.  For instance, in the case of a 10 Mbps and a 100 Mbps path
   with otherwise identical properties, a simple round robin
   distribution of the packets and the use of a single congestion window
   will limit performance to that of the slowest path multiplied by the
   number of paths, 20 Mbps in this case.

2.  Notational Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Congestion control

   A multipath TCP maintains instances of all congestion control related
   variables for each subflow.  This includes, but is not limited to,
   the congestion window, the ssthresh, the retransmission timeout
   (RTO), the user timeout and RTT measurements.
   However, because TCP requires in-order delivery of data, there must
   be a single send buffer and a single receive buffer, thus flow
   control must happen session-wide.

   Per-subflow congestion control is performed by recording the path
   used to transmit each packet.  Acknowledgments are then attributed to
   the subflow the acknowledged packets were sent over and the
   congestion window and other congestion control variables for the
   relevant subflow are updated accordingly.

3.1.  RTT measurements

   Because a multipath TCP sender knows which packet it sent over which
   path, it can perform per-path round trip time measurements.  This
   only works if return packets are consistently sent over the same path
   (or a set of paths with the same latency).  If the receiver is not
   multipath-aware, this condition will generally hold: acknowledgments
   will flow from the receiver to the sender over a single path unless
   there is a topology change in the routing system or packets that
   belong to a single session are distributed over different paths by
   routers, which is rare.  To multipath-capable routers on the return
   path (if any), the non-multipath-aware host appears to select the
   default path for all of its packets.

   However, if, like the sender, the receiver is multipath-aware, then
   the return path that the receiver chooses to send ACKs over will
   influence the RTTs seen by the original sender.  The situation where
   the sender is unaware of the fact that the receiver selects different
   return paths with different latencies is suboptimal, even compared to
   consistently measuring the RTT over the slowest path, as this leads
   to higher variability in the RTT measurements and therefore a higher
   RTO.
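
   As a rough illustration of the bookkeeping described in Section 3
   and in the paragraphs above, the following Python sketch records the
   path used for each transmitted segment and attributes each
   acknowledgment, and the RTT sample derived from it, to that path.  It
   is not part of this specification.  The names Subflow and PathTracker
   are hypothetical, the smoothing constants and the 1-second RTO floor
   are borrowed from RFC 6298 (which this memo does not itself mandate),
   and the sketch assumes one acknowledgment per segment and ignores
   retransmission ambiguity.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Subflow:
    """Per-path RTT state (illustrative subset of the per-subflow variables)."""
    srtt: Optional[float] = None  # smoothed RTT in seconds
    rttvar: float = 0.0           # RTT variance estimate
    rto: float = 1.0              # retransmission timeout in seconds

    def on_rtt_sample(self, r: float) -> None:
        # Assumption: RFC 6298 smoothing with alpha = 1/8 and beta = 1/4.
        if self.srtt is None:
            self.srtt, self.rttvar = r, r / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r
        self.rto = max(1.0, self.srtt + 4 * self.rttvar)


class PathTracker:
    """Remembers which path each segment used so ACKs can be attributed."""

    def __init__(self, n_paths: int) -> None:
        self.subflows: Dict[int, Subflow] = {p: Subflow() for p in range(n_paths)}
        self.in_flight: Dict[int, Tuple[int, float]] = {}  # seq -> (path, send time)

    def on_send(self, seq: int, path: int, now: float) -> None:
        self.in_flight[seq] = (path, now)

    def on_ack(self, seq: int, now: float) -> Optional[int]:
        """Update the RTT state of the path that carried seq; return that path."""
        entry = self.in_flight.pop(seq, None)
        if entry is None:
            return None  # duplicate or unknown acknowledgment
        path, sent = entry
        self.subflows[path].on_rtt_sample(now - sent)
        return path
```

   Keeping a separate Subflow object per path is what yields the
   per-path RTO: a slow path inflates only its own timeout.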

   Having the receiver send ACKs over the same path mitigates the
   problem somewhat; but presumably, if the receiver is also multipath
   capable and has data to send, it will want to send this data over
   more than one path.  So RTT measurements may inadvertently end up
   measuring different return paths in that case.  A better solution is
   for the sender to include an indication in packets that allows the
   receiver to determine through which path the sender sent the packet.
   This information, along with the path initially chosen for the
   outgoing packet that is acknowledged, allows TCP to attribute each
   RTT measurement to a specific path.

   Because congestion control happens per path, there must also be a
   separate retransmission timeout (RTO) value for each path.

3.2.  Fast retransmit

   Different paths will almost certainly have different RTTs, and even
   if the average RTT is the same, normal burstiness and differences in
   packet sizes will make packets routinely arrive through the different
   paths in a different order than the order in which they were
   transmitted.  Without modifications to the algorithm, this would
   trigger the fast retransmit algorithm unnecessarily.  To avoid this,
   fast retransmit is executed whenever, for packets belonging to the
   same subflow, after an unACKed packet or sequence of packets, more
   than two segments of new data are ACKed with SACK.  This means fast
   retransmit happens per subflow, and reordering between subflows no
   longer triggers fast retransmit.

3.3.  Slow retransmit

   In multipath TCP, a per-path RTO is employed to recover from
   congestion events that fast retransmit can't handle.  Because the
   missing packets create holes in the data stream, subsequent packets
   received over other paths must be buffered in the receive buffer.
   Unless the receive buffer is extremely large, this means the entire
   session stalls when the receive buffer fills up.
   This situation persists until the RTO expires for the congested or
   broken path so the missing packets can be retransmitted.  Should the
   path in question be completely broken, this will then lead to an
   almost immediate new stall, and the stall/RTO cycles will then
   continue until the user timeout / R2 timer [RFC1122] for the subflow
   expires.

   This is solved by taking unacknowledged packets transmitted over
   subflows that are stalled because they have exhausted their
   congestion window and are now waiting for the RTO to expire, and
   scheduling retransmissions of those packets over other paths before
   the RTO of the stalled subflow expires.  This should be done such
   that the missing packet arrives before it becomes necessary to stop
   sending data altogether because the receiver advertises a zero
   receive buffer.  Such retransmissions therefore happen as the receive
   buffer space advertised by the receiver reaches RTT * MSS for the
   path that will be used for the retransmission; presumably the path
   with the lowest RTT.  In essence, this creates a second level of fast
   retransmit that acts across subflows in addition to the normal fast
   retransmit that happens per subflow.  This mechanism is named "slow
   retransmit".

   In the case of single path TCP, scheduling retransmissions before the
   RTO expires could be problematic because this would be more
   aggressive than standard (New)Reno congestion control.  But in the
   case of multipath TCP, the retransmission can happen over one of the
   other paths, which is still progressing.

   By scheduling a retransmission faster than an RTO, there is an
   increased risk that a packet that was still working its way through
   the network is retransmitted unnecessarily.  However, the alternative
   is allowing the progress of the session to stall (on all paths),
   reducing throughput significantly.

3.4.  SACK

   When packets (belonging to different subflows) arrive out of order,
   the receiver can't acknowledge the receipt of the out of order
   packets using TCP's normal cumulative acknowledgment.  However, the
   Selective Acknowledgment (SACK) mechanism [RFC2018] (also see
   [RFC1072]) is widely implemented.  SACK makes it possible for a
   receiver to indicate that three or four additional ranges of data
   were received in addition to what is acknowledged using a normal
   cumulative ACK.  When packets are sent over multiple paths and arrive
   out of order, the information in the SACK returned by the receiver
   can tell the sender how each subflow is progressing, so per-subflow
   congestion control can progress smoothly and unnecessary
   retransmissions are largely avoided.

   One-ended multipath TCP requires the use of SACK to be able to
   determine which subflows are progressing even if other subflows are
   stalled, and thus the normal TCP ACK isn't progressing.  If the
   remote host doesn't indicate the SACK capability during the three-way
   handshake, a multipath TCP implementation SHOULD limit itself to
   using only a single subflow, thus disabling multipath processing for
   the session in question.

3.5.  Fairness and TCP friendliness

   One of the goals of multipath TCP is increased performance over
   regular TCP.  However, it would be harmful to realize this benefit by
   taking more than a "fair" share of the available bandwidth.  One
   choice would be to execute normal NewReno congestion control on each
   subflow, so that each individual subflow competes with other TCPs on
   the same footing as a regular TCP session.
   If all subflows use non-overlapping physical paths, other TCPs are
   no worse off than in the situation where the multipath TCP were a
   regular TCP sharing their path, so this could be considered fair even
   though the multipath TCP increases its bandwidth in direct
   relationship to the number of subflows used.  Note that in this case,
   although multipath TCP sends at the same rate as regular TCP on a
   given path, resource pooling [wischik08pooling] benefits are still
   realized because a given transmission completes faster so it uses up
   resources for a shorter amount of time.

   But if several logical paths share a physical path, multipath TCP
   takes a larger share of the bandwidth on that path.  This would only
   be acceptable as fair for a very small number of subflows.  The other
   end of the spectrum would be for multipath TCP to conform to exactly
   the same congestion window increase and decrease envelope that a
   regular TCP exhibits, being no more aggressive than a regular single
   path TCP session.  At this point in time we will assume that fairness
   is a tunable factor of the regular NewReno AIMD envelope.  A simple
   way to limit the amount of additional aggressiveness exhibited by
   multipath TCP is a limit on the number of subflows.  Until more
   analysis has been performed and/or there is more experience with
   multipath TCP, a multipath TCP implementation SHOULD limit itself to
   using no more than 3 subflows concurrently.

4.  Path selection

   Note that in order to gain multipath benefits, the multipath TCP
   layer must be able to determine the logical path followed by each
   packet so it can measure path properties and perform per-path
   congestion control.  In order to limit the number of packets flowing
   over each path to the amount allowed by the per path congestion
   window, the multipath TCP layer must be able to specify over which
   path a given packet is transmitted.
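
   The requirement above, that the sender keeps the packets in flight on
   each path within that path's congestion window, can be illustrated
   with a minimal scheduler sketch in Python.  It is only an
   illustration under stated assumptions: pick_path and MAX_SUBFLOWS are
   hypothetical names, windows are counted in segments, and preferring
   the lowest-numbered path with room is an assumption rather than a
   rule from this memo.  The cap of 3 is the SHOULD limit from Section
   3.5.

```python
from typing import Optional, Sequence

MAX_SUBFLOWS = 3  # Section 3.5: SHOULD use no more than 3 subflows concurrently


def pick_path(cwnd: Sequence[int], in_flight: Sequence[int]) -> Optional[int]:
    """Return a path whose congestion window still has room, or None.

    cwnd[p] and in_flight[p] are per-path counts in segments.  Paths
    numbered MAX_SUBFLOWS and above are never used.
    """
    usable = min(len(cwnd), len(in_flight), MAX_SUBFLOWS)
    for path in range(usable):
        if in_flight[path] < cwnd[path]:
            return path
    return None  # every usable window is full; wait for ACKs
```

   Returning None rather than falling back to some other path keeps each
   subflow within its own window, which is the property the per-path
   congestion control depends on.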

   The situation where routers distribute packets over different paths
   based on their own criteria makes it impossible for hosts to send
   less traffic over congested paths and more traffic over uncongested
   paths and is therefore incompatible with multipath TCP.  When routers
   distribute traffic belonging to the same flow (or, in the case of
   multipath TCP: subflow) over different paths this will also cause
   reordering and the associated performance impact on TCP.

4.1.  The multipath IP layer

   The one-ended multipath TCP is logically layered on a multipath IP
   layer, which is able to deliver packets to the same destination
   address through one or more logical paths, where the set of n logical
   paths share between one and m physical paths.  In some cases, the
   multipath IP layer will be able to determine that a logical path
   isn't working, or maps to the same physical path as a previous
   logical path.  For example, if the multipath TCP indicates that a
   packet should be sent over the third path, and the multipath IP is
   set up to use different next hop addresses for path selection, but
   only two next hop addresses are available, the multipath IP layer can
   provide feedback to the multipath TCP layer.  In other cases, packets
   simply won't be delivered, or will be delivered through the same
   physical path used by other logical paths.  This may for instance
   happen when multipath TCP selects path 1 and multipath IP puts a path
   selector with value "1" in the packet, but there are no multipath
   capable routers between the source and destination, so all packets,
   regardless of the presence and/or value of a path selector, are
   routed over the same physical path.

   It is up to the multipath TCP layer to handle each of these
   situations.

   For the purposes of this multipath TCP specification, the simplest
   possible interface to the multipath IP layer is assumed.
   When TCP segments traveling down the stack from the TCP layer to the
   IP layer aren't accompanied by a path selector value, or the path
   selector value is zero, the IP layer delivers packets in the same way
   as for unmodified TCP and other existing transport protocols, i.e.,
   over the default path.  Segments may also be accompanied by a path
   selector value higher than zero, which indicates the desired path.
   If the desired logical path is available, or may be available, the
   multipath IP layer attempts to deliver the packet using that logical
   path.  If the desired logical path is known to be unavailable, the
   multipath IP layer drops the segment.

   It is assumed that paths as seen by the multipath IP layer are mapped
   to logical paths with increasing numbers roughly ordered in order of
   decreasing assumed performance or availability.  I.e., if path x
   doesn't work or has low performance, that doesn't necessarily mean
   that path x+1 doesn't work or has low performance, but if paths x,
   x+1 and x+2 don't work or have low performance, then it's highly
   likely that paths x+3 and beyond also don't work or have even lower
   performance.  Routers may have good next hop or even intra-domain
   link weight information and link congestion information, but they
   generally don't have information about the end-to-end path
   properties, so the ordering of paths from high to low availability/
   performance must be considered little more than a hint.
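
   The interface assumed above is small enough to sketch in a few lines
   of Python.  This is an illustration, not a definition of the
   interface: PathState and ip_send are hypothetical names, and treating
   a path about which nothing is known as "may be available" is an
   assumption made for the sketch.

```python
from enum import Enum
from typing import Dict, Optional


class PathState(Enum):
    AVAILABLE = 1    # the logical path is known to work
    UNKNOWN = 2      # the logical path may be available
    UNAVAILABLE = 3  # the logical path is known not to work


def ip_send(selector: Optional[int],
            states: Dict[int, PathState]) -> Optional[int]:
    """Return the logical path a segment is sent on, or None if dropped.

    A missing or zero selector means the default path (path 0), exactly
    as for unmodified TCP.
    """
    if not selector:
        return 0
    # Assumption: a path with no recorded state is treated as UNKNOWN.
    state = states.get(selector, PathState.UNKNOWN)
    if state is PathState.UNAVAILABLE:
        return None  # known to be unavailable: the segment is dropped
    return selector  # available, or possibly available: attempt delivery
```

   Dropping rather than silently rerouting matters to the TCP layer: a
   segment that quietly travels over a different path than the one
   requested would be charged to the wrong congestion window.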

   The multipath IP layer may be implemented through a variety of
   mechanisms, including but not limited to:

   o  Using different outgoing interfaces on the host

   o  Directing packets towards different next hop routers

   o  Integration with shim6 [I-D.ietf-shim6-proto] so that packets can
      use different address pairs

   o  Manipulation of fields used in ECMP [RFC2992] (i.e., a different
      flow label)

   o  Type of service routing (such as [RFC4915])

   o  Different lower layer encapsulation, such as MPLS

   o  Tunneling through overlays

   o  Source routing

   o  An explicit path selector field in packets, acted upon by routers

   At this time, no choice is made between these different mechanisms.

4.2.  The path indication option

   Note that several of the fields discussed below are defined with
   future developments in mind; they are not necessarily immediately
   useful.

   In order to allow for accurate RTT measurements and to inform the IP
   layer of the selected path, a TCP option indicating the desired path
   is included in all segments that don't use the default path.  The
   format of this option is as follows:

      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |   KIND=TBA    |  LENGTH = 3   |D| MP  |R| SP  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The length is 3.

   D is the "discard eligibility" flag (1 bit).  It is similar, but not
   identical, to the frame relay discard eligibility bit or the ATM cell
   loss priority bit.  Set to zero, no special behavior is requested.
   Set to one, this indicates that loss of the packet will be
   inconsequential.  This allows routers to drop packets with D=1 more
   readily than other packets under congested conditions, and also to
   completely block packets with D=1 on links that are considered
   long-term congested or expensive, even if there is no momentary
   congestion.
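
   The three-byte layout in the diagram above can be packed and parsed
   as in the following Python sketch, using the field widths given in
   this section (D and R one bit each, MP and SP three bits each).  The
   function names are hypothetical, and because the option kind is still
   to be assigned, the kind is passed in as a parameter; the value 253
   in the usage note is merely one of the option kinds that RFC 4727
   sets aside for experiments and is not a value assigned to this
   option.

```python
from typing import Tuple


def pack_path_option(kind: int, d: int, mp: int, sp: int) -> bytes:
    """Encode the path indication option as KIND, LENGTH=3, D|MP|R|SP."""
    if not 0 <= kind <= 255:
        raise ValueError("kind must fit in one byte")
    if d not in (0, 1) or not 0 <= mp <= 7 or not 0 <= sp <= 7:
        raise ValueError("D is 1 bit; MP and SP are 3 bits each")
    # Leftmost field in the diagram is the most significant bit.
    # The reserved R bit (0x08) is always sent as zero.
    flags = (d << 7) | (mp << 4) | sp
    return bytes([kind, 3, flags])


def unpack_path_option(data: bytes) -> Tuple[int, int, int, int]:
    """Decode the option into (kind, D, MP, SP), ignoring the R bit."""
    if len(data) < 3 or data[1] != 3:
        raise ValueError("not a three-byte path indication option")
    flags = data[2]
    return data[0], (flags >> 7) & 1, (flags >> 4) & 7, flags & 7
```

   For example, pack_path_option(253, 1, 2, 5) yields the bytes FD 03
   A5, and unpack_path_option() returns the same D, MP and SP values
   whether or not a sender has set the reserved bit.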

   Setting the D bit to 1 for some subflows (presumably, ones with a
   performance lower than the best performing subflow) allows multipath
   TCP to give way to regular TCP and other single path traffic on
   congested or expensive paths.  As long as the multipath TCP sets D to
   0 on the subflow with the best performance, multipath TCP should
   still perform better than regular TCP, but the reduction in bandwidth
   use on the other paths helps achieve resource pooling benefits.

   MP is a path selector that may be interpreted by multiple routers
   along the way (3 bits).  A value of 0 is the default path that is
   also taken by packets that don't contain a multipath option.
   Multipath TCP aware routers should take this value into account when
   performing ECMP [RFC2992].  Packets with any value for MP MUST be
   forwarded, even if the number of available paths is smaller than the
   value in MP.

   R (1 bit) is reserved for future use.  MUST be set to zero on
   transmission and ignored on reception.

   SP is a path selector that is interpreted only once by the local TCP
   stack or a router close to the sender (3 bits).  A value of 0 is the
   default path that is also taken by packets that don't contain a
   multipath option.  If the value in SP points to a path that isn't
   available, the packet SHOULD be silently dropped.  This behavior, as
   opposed to selecting an alternate path out of the available ones,
   helps avoid the use of duplicate paths.  As such, a router may only
   interpret SP rather than MP when it is known that the router is the
   only one acting on SP.  All other routers may only act on MP.

   It is not expected that routers will make routing decisions directly
   based on the path indication option, as this option occurs deep
   inside the packet and not in a fixed place.
   However, a multipath IP layer or a middlebox may write a path
   selection value into a field in packets that is easily accessible to
   routers.  But conceptually, the routers act upon the values in SP and
   MP.

   The initial packets for each TCP session MUST use D, MP and SP values
   of zero.  If D, MP and SP are all zero, then the path selector option
   isn't included in the packet.  This makes sure that single path
   operation remains possible even if packets with the path selector
   option are filtered in the network or rejected by the receiver.  The
   packets that are part of the TCP three-way handshake SHOULD be sent
   over the default path, in which case they don't contain the path
   selector option; hence the ability to do multipath TCP isn't
   indicated to the correspondent at the beginning of the session as is
   usual for most other TCP extensions.

4.3.  Timestamp integration option

   As an optimization, hosts MAY borrow the four bits used by the path
   selector option from the timestamp option, and thus save one byte of
   option space, which means the path selector option can replace the
   padding necessary when the timestamp option is used and not increase
   header overhead.  In that case, the combined path selector and
   timestamp options MUST appear as follows:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   KIND=TBA    |  LENGTH = 2   |    KIND=8     |  LENGTH = 10  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |D| MP  |                   TS Value (TSval)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     TS Echo Reply (TSecr)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   D and MP are the same as in the three-byte form of the path selector
   option.  R and SP do not occur in this form of the path selector
   option and are assumed to be zero.

   TSval is the locally generated timestamp.
   Because the timestamp is reduced to 28 bits, the minimum clock period
   (the interval between timestamp ticks) is increased from the 59
   nanoseconds mandated by [RFC1323] to 1 microsecond, so that the
   timestamp still wraps in no less than 255 seconds.

   TSecr is the timestamp echoed back to the other side (32 bits).

   All hosts conforming to this specification MUST be able to recognize
   the integrated path selector and timestamp options, but they are not
   required to generate them.

4.4.  Path for retransmissions

   A multipath TCP implementation MUST be capable of scheduling
   retransmissions over a path different from the path used to transmit
   the packet originally.  This includes packets subject to fast
   retransmit.

4.5.  ECN

   Explicit Congestion Notification works by routers setting a
   congestion indication in the IP header of packets rather than
   dropping those packets when they experience congestion.  The receiver
   echoes this information back to the sender which then performs
   congestion control in exactly the same way as if a packet was lost.
   The ECN specification ([RFC3168]) is such that the receiver sets the
   ECN-Echo (ECE) flag in the TCP header for all subsequent packets that
   it sends back until the sender sets the Congestion Window Reduced
   (CWR) flag.  As the ECE flag is set in multiple ACKs, there is no
   obvious way to correlate the ECN indication in an ACK with a specific
   packet that experienced congestion, and subsequently, the path that
   is congested.

   At this time, a multipath TCP conforming to this specification SHOULD
   NOT use ECN.  ECN MAY be negotiated, but when more than a single path
   is used at a given time, packets SHOULD be sent with the ECN field
   set to Not-ECN (00), and incoming non-zero ECE flags SHOULD NOT be
   acted upon with regard to congestion control.

4.6.  Path MTU discovery

   Path MTU discovery [RFC1191] is performed for TCP by having TCP
   reduce its packet sizes whenever "packet too big but DF set" ICMP
   messages are received.  As the name suggests, the path MTU is
   dependent on the path used, so multipath TCP must maintain MTU
   information for each path, and adjust this information for each path
   individually based on the too big messages that it receives.

   The time between probing with a larger than previously discovered MTU
   must either be randomized or explicitly coordinated to avoid probing
   larger MTUs for multiple subflows at the same time, as probing larger
   MTUs is likely to lead to a lost packet, and having losses on
   multiple paths at the same time would be suboptimal.  For instance,
   rather than probe every t, in the case of 2 paths, after t*0.5 the
   first path is probed, after t the second and after t*1.5 the first is
   probed again.

   Both the IPv4 and IPv6 versions of ICMP return enough of the original
   packet in a "packet too big" message to be able to recover the
   sequence number from the original packet, which makes it possible to
   correlate the too big message with the packet that caused it, and
   thus the path used to transmit the packet.

5.  Flow control and buffer sizes

   In order to accommodate the increased number of packets in flight,
   the send buffer must be increased in direct relationship with the
   number of paths being used.  Alternatively, the number of paths used
   concurrently should be limited to send buffer / avgRTT.

   Although under normal operation, the receive buffer doesn't fill up,
   there are two reasons the receive buffer must be the same size as the
   send buffer: it must be able to accommodate a round trip time plus
   two segments' worth of data during fast retransmit, and the
   advertised receive window limits the amount of data the sender will
   transmit before waiting for acknowledgments.
   So in practice, the receive buffer limits the maximum size of the
   send buffer, and therefore, the number of paths that can be
   supported concurrently.

   There is no simple rule of thumb to determine the number of paths
   that should be used, as the maximum number of paths that the
   receive window can accommodate depends both on the maximum receive
   window advertised by the receiver and on the RTTs of the paths.

6.  Handling of RSTs

   If an RST is received after enabling a new path, this could be a
   reaction to the presence of an unknown option. So the optimal
   situation would be for an RST to reset just the path used to send
   the packet that generated the RST, not the entire session. The
   entire session should be reset only when the last path or the
   default path (on which packets don't include special options)
   receives an RST.

7.  Middlebox considerations

   NATs are designed to be transparent to TCP. Because one-ended
   multipath TCP conforms to normal TCP semantics on the wire,
   multipath TCP should in principle also be compatible with NAT.
   However, if different paths are served by different NATs that apply
   different translations, the receiver won't be able to determine
   that the different subflows through the different paths belong to
   the same TCP session. So for NAT to work, the translation must
   either happen in a location that all paths flow through, or the
   different NATs on the different paths must act as a single,
   distributed NAT and apply the same translation to the different
   subflows.

   Middleboxes that only see traffic flowing over a subset of the
   paths used will see large numbers of gaps in the sequence number
   space. They may also observe only a partial three-way handshake, or
   not observe any ACKs. As such, like NATs, middleboxes that enforce
   conformance to known TCP behavior must be placed such that they
   observe all subflows.
   For middleboxes that just check whether packets fall inside the TCP
   window, it may be sufficient for multipath TCP senders to make sure
   that all paths see at least one packet per window. Middleboxes that
   enforce sequence number integrity will almost certainly also block
   TCP packets for which they didn't observe the three-way handshake.
   A possible way to accommodate that behavior would be to send copies
   of all session establishment and teardown packets over all paths
   that the sender may use. However, this strategy is still likely to
   fail unless the receiver does the same, so that the middleboxes
   observe the signaling packets flowing in both directions.

   It's also possible that middleboxes (or perhaps hosts themselves)
   reject packets with the path selector TCP option. Since packets
   flowing over the default path don't carry the path selector option,
   these packets should always be allowed through, so single path
   operation is always possible. In that case, when a multipath TCP
   sender starts to send packets over alternative paths, those packets
   won't make it to the receiver because they contain the path
   selector option. The result is that a new subflow, which would use
   a congestion window of two maximum segment sizes, would send two
   packets and then experience a retransmission timeout. Slow
   retransmit makes sure the packets are transmitted before the
   session stalls, so the impact of the lost packets is negligible.

8.  Security considerations

   None at this time.

9.  IANA considerations

   IANA is requested to provide a TCP option kind number for the path
   selector option.

10.  Acknowledgements

   The single ended multipath TCP was developed together with Marcelo
   Bagnulo and Arturo Azcorra.

   Members of the Trilogy project, especially Costin Raiciu, have
   contributed valuable insights.

   Iljitsch van Beijnum is supported by Trilogy
   (http://www.trilogy-project.org), a research project (ICT-216372)
   partially funded by the European Community under its Seventh
   Framework Program. The views expressed here are those of the
   author(s) only. The European Commission is not liable for any use
   that may be made of the information in this document.

11.  References

11.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery",
              RFC 1191, November 1990.

   [RFC1323]  Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
              for High Performance", RFC 1323, May 1992.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC2992]  Hopps, C., "Analysis of an Equal-Cost Multi-Path
              Algorithm", RFC 2992, November 2000.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

11.2.  Informational References

   [RFC1072]  Jacobson, V. and R. Braden, "TCP extensions for
              long-delay paths", RFC 1072, October 1988.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
              Zhang, L., and V. Paxson, "Stream Control Transmission
              Protocol", RFC 2960, October 2000.

   [RFC4915]  Psenak, P., Mirtorabi, S., Roy, A., Nguyen, L., and P.
              Pillay-Esnault, "Multi-Topology (MT) Routing in OSPF",
              RFC 4915, June 2007.

   [wischik08pooling]
              Wischik, D., Handley, M., and M. Bagnulo Braun, "The
              resource pooling principle", Computer Communication
              Review 38, September 2008.

   [I-D.ietf-shim6-proto]
              Nordmark, E. and M. Bagnulo, "Shim6: Level 3 Multihoming
              Shim Protocol for IPv6", draft-ietf-shim6-proto-11 (work
              in progress), December 2008.

Appendix A.  Document and discussion information

   The latest version of this document will always be available at
   http://www.muada.com/drafts/. Please direct questions and comments
   to the multipathtcp@ietf.org mailing list or directly to the
   author.

Appendix B.  An implementation strategy

   In order to perform per-path congestion control, all of the
   ACK-based events that trigger congestion control responses as well
   as all the variables used by the congestion control algorithms must
   be recreated in the multipath situation. These are the triggers and
   variables for the four mechanisms in RFC 2581:

   1.  the path MTU (page 4)

   2.  the arrival of an ACK that acknowledges new data (page 4)

   3.  the arrival of a non-duplicate ACK (page 4) or the sum of new
       data acknowledged (page 5)

   4.  triggering of the retransmission timer (page 5)

   5.  the flightsize or number of bytes sent but not acknowledged
       (page 5)

   6.  the retransmission of a segment (page 5)

   7.  the arrival of a third or subsequent duplicate ACK (page 6,
       page 7)

   8.  whether a retransmission timeout period has elapsed since the
       last reception of an ACK (page 7)

   Items 1, 4, 6 and 8 are maintained session-wide.

   We recreate these events and variables based on SACK information in
   the one-sequence number multipath TCP case as follows.

   We keep track of every packet sent. (Alternatively: multi-packet
   contiguous blocks of data transmitted over the same path.)
   When an ACK comes in, we first remove the stored information about
   packets/data blocks that are cumulatively ACKed, noting how much
   data was ACKed for each path that the packets were sent over. Then
   we do the same for all the SACK blocks in the ACK. Because we
   remove the information about (S)ACKed data and you can remove
   something just once, we don't have to keep track of previous SACKs
   like the current BSD implementation does.

   The only slightly tricky part is emulating duplicate ACKs. This may
   not even be really necessary, as the SACKs give us better
   information to base fast retransmit on, but that's something for
   another day. What happens in the pseudo code is that when
   traversing the list of sent packets (this happens in order of
   seqnum), we note the path over which each packet that isn't SACKed
   was sent. When we're done processing SACK data and it turns out
   that for a path there are one or more packets that we skipped over
   when processing SACK data and there was also data SACKed after a
   skipped packet, there was a lost (or reordered) packet on this
   path. When the amount of "duplicate ACKed" data grows beyond two
   segment sizes, we've reached the equivalent of three duplicate ACKs
   so we trigger fast retransmit (7).

   We update the congestion window (2 and 3) when there was data
   (S)ACKed for a path. ACKs that don't acknowledge any data for a
   path aren't relevant because we don't need them to trigger fast
   retransmit and we assume that they're sent to (S)ACK data for other
   paths, anyway. (Or they could be window updates.)

   We maintain the flightsize (5) by simply adding data bytes as
   packets are transmitted and subtracting when they're (S)ACKed.
   Because we have explicit SACKs, we don't need to guess based on
   duplicate ACKs.
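
   The bookkeeping described so far can be sketched as a small,
   runnable Python model. This is only an illustration under
   simplifying assumptions: packets are always (S)ACKed completely, an
   ACK carries at most one SACK block, and the Path, Packet and
   Scoreboard names are hypothetical rather than part of this
   specification. For each path that had data (S)ACKed, on_ack()
   reports the number of bytes acknowledged and the number of bytes
   that were skipped ahead of SACKed data on that path.

```python
from dataclasses import dataclass


@dataclass(eq=False)          # identity hashing, so paths can be dict keys
class Path:
    name: str
    flightsize: int = 0       # bytes sent on this path, not yet (S)ACKed


@dataclass
class Packet:
    first: int                # first sequence number
    end: int                  # last sequence number plus one
    path: Path


class Scoreboard:
    """Sent, not yet (S)ACKed packets, kept in sequence number order."""

    def __init__(self):
        self.sent = []

    def on_send(self, first, size, path):
        path.flightsize += size
        self.sent.append(Packet(first, first + size, path))
        # Retransmissions can be queued late, so restore seqnum order.
        self.sent.sort(key=lambda p: p.first)

    def on_ack(self, cum_ack, sack=None):
        """sack is an optional (first, last_plus_one) tuple."""
        acked, maybe, hole = {}, {}, {}
        kept = []
        for pkt in self.sent:
            size = pkt.end - pkt.first
            sacked = (sack is not None
                      and sack[0] <= pkt.first and pkt.end <= sack[1])
            if pkt.end <= cum_ack or sacked:
                acked[pkt.path] = acked.get(pkt.path, 0) + size
                if sacked:
                    # Later data on this path was SACKed, so bytes that
                    # were skipped earlier on it are a confirmed hole.
                    hole[pkt.path] = (hole.get(pkt.path, 0)
                                      + maybe.pop(pkt.path, 0))
                continue
            kept.append(pkt)
            if sack is not None and pkt.first < sack[1]:
                maybe[pkt.path] = maybe.get(pkt.path, 0) + size
        self.sent = kept
        for path, nbytes in acked.items():
            path.flightsize -= nbytes
        return {p: (n, hole.get(p, 0)) for p, n in acked.items()}
```

   A caller could compare the reported hole size for a path against
   two segment sizes to decide whether to start fast retransmit on
   that path, and otherwise treat the acknowledged bytes as input to
   the congestion window update for that path.
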
   The flightsize is also adjusted when we perform fast retransmit or
   a regular retransmission over a path other than the one that was
   used for the original packet. In addition, we explicitly mark some
   packets to trigger once-per-RTT actions when they're ACKed.

   Pseudo code for the above:

      // initializing data structures is left as an exercise for the
      // reader

      // transmitting packets
      // assume we've selected a path to transmit over

      path.flightsize = path.flightsize + packet.datasize
      packet.path = path
      packet.status.acked = false
      // set up state to remember to do per RTT stuff when packet is
      // ACKed
      if path.do_per_rtt_next_packet == true
        path.per_rtt_seqnum = packet.seqnum.first
        packet.per_rtt = true
        path.do_per_rtt_next_packet = false
      else
        packet.per_rtt = false
      // don't set ECN on outgoing packets for now, can add logic
      // for deciding which packets to ECN enable later
      packet.ecn.sent = 0
      // add to linked list of sent packets (to handle
      // retransmissions, linked list must maintain seqnum order,
      // not FIFO or LIFO)
      llpush(packet)

      // receiving (S)ACKs

      // normal flow-wide flow control actions based on cumACK
      // also happen (elsewhere)

      // handle ECN, must detect transitions rather than
      // depend on actual value
      if ack.ecnecho == true
        if ecn.previous == true
          ecn.current = false
        else
          ecn.current = true
          ecn.previous = true
      else
        // no ECE on this ACK, so there is nothing new to act on
        ecn.current = false
        ecn.previous = false

      // initialize some stuff before we handle the ACK
      for each path
        path.do_per_rtt = false
        path.ackedbytes = 0
        path.unacked.sure = 0
        path.unacked.maybe = 0
        path.ecn.received = false

      // remove cumulatively ACKed packets
      llwalk_init
      packet = llwalk_next
      while packet.seqnum.first < ack.cumulative
        // account against the path the packet was sent over
        path = packet.path
        // ECN, we only act if we enabled ECN when we sent the packet
        if ecn.current & packet.ecn.sent <> 0
          path.ecn.received = true
        // if part of a packet is ACKed, we need some trickery:
        // the unACKed tail of the packet stays in the list
        if packet.seqnum.last_plus_one > ack.cumulative
          path.ackedbytes += ack.cumulative - packet.seqnum.first
          packet.datasize -= ack.cumulative - packet.seqnum.first
          packet.seqnum.first = ack.cumulative
        else
          path.ackedbytes = path.ackedbytes + packet.datasize
          if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
            path.do_per_rtt = true
          llremove(packet)
        packet = llwalk_next

      // now we handle the SACKs (assume exactly one SACK block for
      // simplicity); we continue walking the linked list, no need
      // to restart
      while packet.seqnum.first < ack.sack.last_plus_one
        path = packet.path
        if packet.seqnum.last_plus_one > ack.sack.first
          // these packets overlap with the SACK block
          // for simplicity, assume packets are always completely
          // SACKed; in reality we need to split a packet if only
          // the middle is SACKed
          // ECN, we only act if we enabled ECN when we sent the
          // packet
          if ecn.current & packet.ecn.sent <> 0
            path.ecn.received = true
          path.ackedbytes = path.ackedbytes + packet.datasize
          if packet.per_rtt & packet.seqnum.first == path.per_rtt_seqnum
            path.do_per_rtt = true
          // add potentially unacked bytes to for sure unacked bytes
          // because we now know we had a SACK hole if any
          // unacked maybe bytes
          path.unacked.sure = path.unacked.sure + path.unacked.maybe
          path.unacked.maybe = 0
          // remove packet from the list
          llremove(packet)
        else
          // note how many bytes we skipped unSACKed
          // if later data is SACKed, that's our version of a dup ACK
          path.unacked.maybe = path.unacked.maybe + packet.datasize
        packet = llwalk_next

      // done processing, now tally up the results
      for each path
        // update flightsize (item 5 in CC events/variables list)
        path.flightsize = path.flightsize - path.ackedbytes
        // if any data was ACKed
        if path.ackedbytes <> 0
          // some stuff was ACKed for this path
          if path.unacked.sure > 2 * path.mss
            // more than 2 * MSS worth of data in SACK hole = fast
            // retransmit
            // execute fast retransmit (item 7 in CC events/
            // variables list)
            // need to handle flightsize in some way here
            // ignore ECN because we already have a loss
            // send back ECN window update indication, though
          else
            // SACKs were cumulative for this path
            // execute cwnd update (items 2 and 3 in CC events/
            // variables list)
            // ECN must be taken into account here
            // and send back ECN window update indication
        if path.do_per_rtt
          // execute per RTT actions
          // indicate that this should be set for next packet sent
          path.do_per_rtt_next_packet = true

   Note that the pseudo-code doesn't cover all the mechanisms
   explained earlier. Also, ECN is handled here because it's not too
   difficult to do. The hard part is deciding which packets to enable
   ECN for.

Author's Address

   Iljitsch van Beijnum
   IMDEA Networks
   Avda. del Mar Mediterraneo, 22
   Leganes, Madrid  28918
   Spain

   Email: iljitsch@muada.com