Internet Congestion Control Research Group                      M. Welzl
Internet-Draft                                                  S. Islam
Intended status: Experimental                                  K. Hiorth
Expires: May 4, 2017                                  University of Oslo
                                                                  J. You
                                                                  Huawei
                                                        October 31, 2016


          TCP-CCC: single-path TCP congestion control coupling
                         draft-welzl-tcp-ccc-00

Abstract

   This document specifies a method, TCP-CCC, to combine the congestion
   controls of multiple TCP connections between the same pair of hosts.
   This can have several performance benefits, and it makes it possible
   to precisely assign a share of the congestion window to the
   connections based on priorities. This document also addresses the
   problem that TCP connections between the same pair of hosts may not
   share the same path. We discuss methods to detect whether, or
   enforce that, connections traverse a common bottleneck.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Coupled Congestion Control
   3.  Ensuring a Common Bottleneck
     3.1.  Encapsulation
       3.1.1.  TCP in UDP
       3.1.2.  Other Methods
   4.  Related Work
   5.  Implementation Status
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   When multiple TCP connections between the same host pair compete on
   the same bottleneck, they often incur more delay and losses than a
   single TCP connection. Moreover, it is often not possible to
   precisely divide the available capacity among the connections.
   To address this problem, this document presents TCP-CCC, a method to
   combine the congestion controls of multiple TCP connections between
   the same pair of hosts. This can have several performance benefits:

   o  Reduced average loss and queuing delay (because the competition
      between the encapsulated TCP connections is avoided).

   o  Precise assignment of a capacity share based on a priority.

   o  Better fairness between the TCP connections, even in the absence
      of prioritization.

   o  No need for new connections to slow start up to a reasonable cwnd
      value that ongoing connections already have: a connection can
      immediately be assigned its share of the aggregate's total cwnd.
      This can significantly reduce the completion time of short
      connections.

   All of these benefits only play out when there is more than one TCP
   connection. Some of them are more significant when some transfers
   are short, which makes TCP-CCC especially attractive in such
   situations.

   We discuss methods to determine if connections traverse the same
   bottleneck as well as methods to ensure this. To this end, we
   propose a light-weight, dynamically configured TCP-in-UDP (TiU)
   encapsulation scheme. TiU is optional, as our coupled congestion
   control strategy is applicable wherever overlapping TCP flows must
   follow the same path (such as when routed over a VPN tunnel).

2.  Coupled Congestion Control

   For each TCP connection c, the algorithm described below receives
   cwnd and ssthresh as input and stores the following information:

   o  The Connection ID.

   o  A priority P(c) -- e.g., an integer value in the range from 1
      (unimportant) to 10 (very important).

   o  The cwnd previously used by the connection c, ccc_cwnd(c).

   o  The ssthresh previously used by the connection c,
      ccc_ssthresh(c).
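
   As an illustration only, the per-connection information listed above
   could be kept in a small record such as the one in the Python sketch
   below. The names ConnState and table are hypothetical and not part
   of this specification; the fields mirror the Connection ID, P(c),
   ccc_cwnd(c) and ccc_ssthresh(c).

```python
from dataclasses import dataclass


@dataclass
class ConnState:
    """State stored per TCP connection c (hypothetical layout)."""
    conn_id: int         # the Connection ID
    priority: int        # P(c), e.g. 1 (unimportant) .. 10 (very important)
    ccc_cwnd: float      # cwnd previously used by c
    ccc_ssthresh: float  # ssthresh previously used by c


# Illustrative registry keyed by Connection ID (example values only).
table = {}
table[0] = ConnState(conn_id=0, priority=5, ccc_cwnd=10.0, ccc_ssthresh=64.0)
table[1] = ConnState(conn_id=1, priority=10, ccc_cwnd=20.0, ccc_ssthresh=64.0)
```

   Keeping the record keyed by Connection ID is a design choice of this
   sketch; any lookup structure that finds the entry for c would do.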

   Three global variables sum_cwnd, sum_ssthresh and sum_p are used to
   represent the sum of all the ccc_cwnd values, ccc_ssthresh values and
   priorities of all TCP connections, respectively. sum_cwnd and
   sum_ssthresh are used to update the cwnd and ssthresh values for all
   connections.

   This algorithm emulates the behavior of a single TCP connection by
   choosing one connection as the connection that dictates the increase
   / decrease behavior for the aggregate. We call it the "Coordinating
   Connection" (CoCo). The algorithm was designed to be as simple as
   possible. Below, abbreviations are used to refer to the phases of
   TCP congestion control as defined in [RFC5681]: SS refers to Slow
   Start, CA refers to Congestion Avoidance and FR refers to Fast
   Recovery.

   For simplicity, this algorithm refrains from changing cwnd when a
   connection is in FR. SS should not happen as long as ACKs arrive.
   Hence, the algorithm ensures that the aggregate's behavior is only
   dictated by SS when all connections are in the SS phase. We use a
   bit array, ssbits, with a bit for each connection in the group. We
   set the bit if the connection state is SS due to an RTO.

   (1)  When a connection c starts, it adds its priority P(c) to sum_p.
        If it is the very first connection, it sets sum_cwnd to its own
        cwnd. After that, the connection's globally known cwnd and
        ssthresh values (ccc_cwnd(c) and ccc_ssthresh(c)) are updated,
        and the connection updates its own cwnd and ssthresh values to
        be equal to ccc_cwnd(c) and ccc_ssthresh(c).

        ccc_P(c) = P
        sum_p = sum_p + P
        sum_cwnd = sum_cwnd + cwnd
        ccc_cwnd(c) = P * sum_cwnd / sum_p
        ccc_ssthresh(c) = ssthresh
        if sum_ssthresh > 0 then
            ccc_ssthresh(c) = P * sum_ssthresh / sum_p
        end if
        // Update c's own cwnd and ssthresh for immediate use:
        Send ccc_cwnd(c) and ccc_ssthresh(c) to c

   (2)  When a connection c stops, its entry is removed.
        sum_p is recalculated.

        if c == CoCo then
            CoCo = the next connection
        end if
        sum_p = sum_p - ccc_P(c)
        Remove ccc_P(c), ccc_cwnd(c), ccc_ssthresh(c)

   (3)  Every time the congestion controller of a connection c
        calculates a new cwnd, the connection calls UPDATE, which
        carries out the tasks listed below to derive the new cwnd and
        ssthresh values. Whenever the CoCo calls UPDATE, sum_cwnd and
        sum_ssthresh are additionally updated to reflect the current sum
        of all stored ccc_cwnd and ccc_ssthresh values. Initially,
        there is only one connection and this connection automatically
        becomes the CoCo. It updates sum_cwnd to its own cwnd and sets
        sum_ssthresh to 0.

   (4)  When a non-CoCo connection c calls UPDATE:

        if (all other connections, including the CoCo, are in CA
            but c is in FR)
            c becomes the new CoCo.
        else
            if (c is in CA or SS)
                c's cwnd is assigned its previously stored ccc_cwnd
                value.
            end if
        end if

   (5)  When the CoCo c calls UPDATE:

        if CoCo == c then
            if state == CA and ssbits(c) == 0 then
                if cwnd >= ccc_cwnd(c) then  // increased cwnd
                    sum_cwnd = sum_cwnd + cwnd - ccc_cwnd(c)
                else
                    sum_cwnd = sum_cwnd * cwnd / ccc_cwnd(c)
                end if
                ccc_cwnd(c) = ccc_P(c) * sum_cwnd / sum_p
                ccc_ssthresh(c) = ssthresh
                if sum_ssthresh > 0 then
                    ccc_ssthresh(c) = ccc_P(c) * sum_ssthresh / sum_p
                end if
            else if state == FR then
                sum_ssthresh = sum_cwnd / 2
            else if state == SS then
                if c experienced a timeout then
                    ssbits(c) = 1
                end if
                if ssbits(x) == 1 for all x then
                    ssbits(x) = 0  // for all x
                    sum_cwnd = sum_cwnd * cwnd / ccc_cwnd(c)
                    ccc_cwnd(c) = ccc_P(c) * sum_cwnd / sum_p
                    sum_ssthresh = sum_cwnd / 2
                else
                    CoCo = first connection where ccc_state == SS
                end if
            end if
        end if

   (6)  After that, if ccc_state(c) is not equal to FR, the stored
        values are sent to c:

        if state != FR then
            Send ccc_cwnd(c) and ccc_ssthresh(c) to c
        end if

   When a flow gets a large share of the aggregate immediately after
   joining, it can potentially create a burst in the network. We
   propose a mechanism [anrw2016] to clock the packet transmission out
   by using the ack-clock of TCP. Our algorithm achieves a form of
   "pacing", but it does not rely on any timers.

   When a connection c joins, it turns on the ack-clock feature and
   calculates the share of the aggregate, clocked_cwnd(c). Below, we
   illustrate the ack-clock mechanism that is used to distribute the
   share of the cwnd based on the acknowledgements received from other
   flows.

        if clocked_cwnd(c) <= 0 then
            return  // alg. ends; other connections can increase cwnd
                    // again
        end if
        if number_of_acks(c) % N == 0 then
            send a new segment for connection c
            clocked_cwnd(c) = clocked_cwnd(c) - 1
        end if
        number_of_acks(c) = number_of_acks(c) + 1

3.  Ensuring a Common Bottleneck

   Our algorithm, as well as EFCM [EFCM], E-TCP [ETCP] and the CM
   [RFC3124], assumes that multiple TCP connections between the same
   host pair traverse the same bottleneck. This is not always true:
   load-balancing mechanisms such as Link Aggregation Group (LAG) and
   Equal-Cost Multi-Path (ECMP) may force them to take different paths
   [RFC7424]. If this leads to the connections seeing different
   bottlenecks, combining the congestion controllers would incur wrong
   behavior. There are, however, several application scenarios where
   the single-bottleneck assumption is correct.

   Sometimes, the network configuration is known, and it is known that
   mechanisms such as ECMP and LAG do not operate on the bottleneck or
   are simply not in use. Alternatively, measurements can infer whether
   flows traverse the same bottleneck [I-D.ietf-rmcat-sbd]. When IPv6
   is available, the TCP connections could be assigned the same IPv6
   flow label. According to [RFC6437], "The usage of the 3-tuple of the
   Flow Label, Source Address, and Destination Address fields enables
   efficient IPv6 flow classification, where only IPv6 main header
   fields in fixed positions are used" - this would be favorable for TCP
   congestion control coupling. However, [RFC6437] does not make a
   clear recommendation about using either the 3-tuple or the 5-tuple
   (which includes the port numbers) - both methods are valid. Thus,
   whether it works to use the flow label as the sole means to put
   connections on the same path depends on router configuration. When
   it works, it is an attractive option because it does not require
   changing the receiver.

   Finally, encapsulating packets with a header that ensures a common
   path is another possibility to make connections traverse the same
   bottleneck. We will discuss encapsulation in the next section.

3.1.  Encapsulation

3.1.1.  TCP in UDP

3.1.1.1.  Introduction

   We want to be able to ensure that TCP congestion control coupling can
   always work, provided that the required code is available at the
   receiver - and be able to efficiently fall back to the standard
   behaviour in case it is not. To achieve this, we present a method,
   TCP-in-UDP (TiU), to encapsulate multiple TCP connections using the
   same UDP port pair.

   TCP-in-UDP (TiU) is based on [Che13]. It differs from it in that:

   o  Unlike [Che13], TiU encapsulates multiple TCP connections using
      the same UDP port number pair. TCP port numbers are preserved; a
      single well-known UDP port is used for TiU. If TiU is
      implemented in the kernel, this allows using normal TCP sockets,
      where enabling the usage of TiU could be done via a socket
      option, for example.

   o  The header format is slightly different: a Connection ID that
      represents a TCP connection is encoded with a few bits, spread
      across the original TCP header's "Reserved" field and the URG
      (Urgent) flag. With this encoding, similar to the encapsulation
      in [Che13], the total TiU header size does not exceed the
      original TCP header size.

   o  A (TiU-encapsulated) TCP SYN uses a newly defined TCP option to
      establish the mapping between a Connection ID and the original TCP
      port number pair.

   TiU inherits all the benefits of [Che13] and a preceding similar
   proposal, [Den08]. It enables TCP-CCC coupled congestion control,
   and it adds the potential disadvantage of not being able to benefit
   from ECMP. In short, the benefits and features of TiU that are
   already explained in detail in [Che13] and [Den08] are:

   o  To establish direct communication between two devices that are
      both behind NAT gateways, Interactive Connectivity Establishment
      (ICE) [RFC5245] is used to create the necessary mappings in both
      NAT gateways, and ICE can have higher success rates using UDP
      [RFC5128].

   o  TCP options, as required for Multipath TCP [RFC6824], for example,
      are expected to work more reliably because middleboxes will be
      less able to interfere with them.

   o  Because the packet format allows the most-significant four bits
      of the first octet to be in the range 0x0-0x3 (as is the case for
      a STUN [RFC5389] packet, where the most significant two bits are
      always zero), the UDP port number pair used by TiU can be used to
      exchange STUN packets with a STUN server that is unaware of TiU.

   o  Following the method described in [Che13] and [Den08], transport
      protocols other than TCP (e.g., SCTP) could be UDP-encapsulated
      in a similar fashion. With TiU, the same outer UDP port number
      pair could be used for different encapsulated protocols at the
      same time.

   [Che13] also lists a disadvantage of UDP-encapsulating TCP packets:
   because NAT gateways typically use shorter timeouts for UDP port
   mappings than they do for TCP port mappings, long-lived
   UDP-encapsulated TCP connections will need to send more frequent
   keepalive packets than native TCP connections. TiU inherits this
   problem too, although using a single five-tuple for multiple TCP
   connections alleviates it by reducing the chance of experiencing long
   periods of silence.

3.1.1.2.  Specification

   TiU uses a header that is very similar to the header format in
   [Den08] and [Che13], where it is explained in greater detail. It
   consists of a UDP header that is followed by a slightly altered TCP
   header. The UDP source and destination ports are semantically
   different from [Den08] and [Che13]: TiU uses a single well-known UDP
   port, and multiple TCP connections use the same UDP port number pair.
   The encapsulated TCP header is changed to fit into a UDP packet
   without increasing the MSS; this is achieved by removing the TCP
   source and destination ports, the Urgent Pointer and the (now
   unnecessary) TCP checksum.
   Moreover, the order of fields is changed to move the Data Offset
   field to the beginning of the UDP payload. This allows using it to
   identify other encapsulated content such as a STUN packet: for TCP,
   the Data Offset must be at least 5, i.e. the most-significant four
   bits of the first octet of the UDP payload are in the range 0x5-0xF,
   whereas this is not the case for other protocols (e.g., STUN
   requires the two most-significant bits to be 0, so these four bits
   are in the range 0x0-0x3). The altered TCP header for TiU is shown
   below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Length             |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data | Conn  |C|E|C|A|P|R|S|F|                               |
   | Offset|  ID   |W|C|I|C|S|S|Y|I|            Window             |
   |       |       |R|E|D|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Acknowledgment Number                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      (Optional) Options                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 1: Encapsulated TCP-in-UDP Header Format (the first 8 bytes
                           are the UDP header)

   Different from [Den08] and [Che13], the least-significant four bits
   of the first octet and a bit that replaces the URG bit in the next
   octet together form a five-bit "Connection ID" (Conn ID). TiU
   maintains the port numbers of the TCP connections that it
   encapsulates; the Connection ID is a way to encode the port number
   information with a few unused header bits. It uniquely identifies a
   port number pair of a TCP connection that is encapsulated with TiU.
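
   The bit placement described above can be sketched in Python as
   follows. This is an illustration under stated assumptions, not part
   of the specification: the function names are hypothetical, and the
   text does not say which of the five ID bits occupies the former URG
   position, so the sketch assumes the four high-order ID bits go into
   the first octet and the low-order ID bit replaces URG.

```python
URG_BIT = 0x20  # position of the URG flag in the TCP flags octet


def pack_first_two_octets(data_offset, conn_id, flags):
    """Build the first two octets of the UDP payload of a TiU packet.

    data_offset: 4-bit TCP Data Offset (5..15).
    conn_id:     5-bit Connection ID (0..31).
    flags:       TCP flags octet (CWR..FIN); its URG bit is discarded.
    ASSUMPTION: high four ID bits in octet 0, low ID bit in the URG slot.
    """
    if not 5 <= data_offset <= 15:
        raise ValueError("Data Offset must be in 5..15")
    if not 0 <= conn_id <= 31:
        raise ValueError("Connection ID must be in 0..31")
    octet0 = (data_offset << 4) | (conn_id >> 1)
    octet1 = (flags & ~URG_BIT & 0xFF) | (URG_BIT if conn_id & 1 else 0)
    return bytes([octet0, octet1])


def unpack_first_two_octets(octets):
    """Inverse of pack_first_two_octets: (data_offset, conn_id, flags)."""
    data_offset = octets[0] >> 4
    low_bit = 1 if octets[1] & URG_BIT else 0
    conn_id = ((octets[0] & 0x0F) << 1) | low_bit
    flags = octets[1] & ~URG_BIT & 0xFF
    return data_offset, conn_id, flags
```

   Because the Data Offset is at least 5, the first octet produced this
   way always has its four most-significant bits in the range 0x5-0xF,
   whichever bit ordering is chosen for the Connection ID.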
   Using these five bits, TiU can combine up to 32 TCP connections with
   one UDP port number pair.

   The TiU-TCP SYN and SYN/ACK packets look slightly different, because
   they need to establish the mapping between the Connection ID and the
   port numbers that are used by TiU-encapsulated TCP connections:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Length             |           Checksum            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |  Re-  |C|E| |A|P|R|S|F|                               |
   | Offset| served|W|C|0|C|S|S|Y|I|            Window             |
   |       |       |R|E| |K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Acknowledgment Number                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Encapsulated Source Port    | Encapsulated Destination Port |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Options                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 2: Encapsulated TCP-in-UDP SYN and SYN/ACK Packet Header
                                  Format

   The Encapsulated Source Port and Encapsulated Destination Port are
   the port numbers of the TCP connection. To create this header, an
   implementation can simply swap the position of the original TCP
   header's port number fields with the position of the Data Offset /
   Reserved / Flags / Window fields.

   Every TiU SYN or TiU SYN/ACK packet also carries at least the
   TiU-Setup TCP option. This option contains a Connection ID number.
   On a SYN packet, it is the Connection ID that the sender intends to
   use in future packets to represent the Encapsulated Source Port and
   Encapsulated Destination Port.
   On a SYN/ACK packet, it confirms that such usage is accepted by the
   recipient of the SYN. A special value of 255 is used to signify an
   error, upon which TiU will no longer be used (i.e., the next packet
   is expected to be a non-encapsulated TCP packet). The TiU-Setup TCP
   option is defined as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Kind      |    Length     |             ExID              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Connection ID |
   +-+-+-+-+-+-+-+-+

                      Figure 3: TiU Setup TCP Option

   The option follows the format for Experimental TCP Options defined in
   [RFC6994]. It has Kind=253, Length=5, an ExID whose value is TBD
   (see Section 6) and the Connection ID. The Connection ID is an 8-bit
   field for easier parsing, but only values 0-31 are valid Connection
   IDs (because the Connection ID in TiU packets other than SYN or
   SYN/ACK packets is only 5 bits long).

3.1.1.3.  Protocol Operation and Implementation Notes

   There can be several ways to implement TCP-in-UDP. The following
   gives an overview of how a TiU implementation can operate. This
   description matches the implementation described in Section 5.

   A goal of TiU is to achieve congestion control coupling with a simple
   implementation that minimizes changes to existing code. It is thus
   recommended to implement TiU in the kernel, as a change to the
   existing kernel TCP code. The changes fall in two basic categories:

   o  Encapsulation and decapsulation: this is code that should, in the
      simplest case, operate just before a TCP segment is transmitted.
      Based on e.g. a socket option that enables/disables TiU, the TCP
      segment is changed into the TiU header format (Figure 1). In case
      it is a TCP SYN or TCP SYN/ACK packet, the header format is
      defined as in Figure 2, and the TiU-Setup TCP option is appended.
      This packet is then transmitted. For decapsulation, the reverse
      mechanism applies, upon reception of a UDP packet that uses
      destination port XXX (TBD, see Section 6). Both hosts keep a list
      of encapsulated TCP port numbers and their corresponding
      Connection IDs. In case a SYN packet requests using a Connection
      ID that is already reserved, an error (Connection ID value 255 in
      the TiU-Setup TCP option) must be signified to the other end in a
      TiU-encapsulated TCP SYN/ACK, and encapsulation must be disabled
      on all further TCP packets. Similarly, when receiving a TiU
      SYN/ACK with an error, a TCP sender must stop encapsulating TCP
      packets.

   The TCP port number space usage on the host is left unchanged: the
   original code can reserve TCP ports as it always did. Except for the
   TiU encapsulation compressing the port numbers into a Connection ID
   field, TCP ports should be used similarly to normal TCP operation. A
   TCP port that is in use by a TiU-encapsulated TCP connection must
   therefore not be made available to non-encapsulated TCP connections,
   and vice versa.

   For each TCP connection, two variables must be configured: 1)
   TiU-ENABLE, which is a boolean, deciding whether to use TiU or not,
   and 2) Priority, which is a value, e.g. from 1 to 10, that is used by
   the coupled congestion control algorithm to assign an appropriate
   share of the total cwnd to the connection. Priority values are local
   and their range does not matter for this algorithm: the algorithm
   works with a flow's priority portion of the sum of all priority
   values. The configuration of the two per-connection variables can be
   implemented in various ways, e.g. through an API option.
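
   To illustrate why the range of priority values does not matter, the
   following Python sketch applies the share rule from Section 2,
   ccc_cwnd(c) = P(c) * sum_cwnd / sum_p, to two sets of priorities that
   differ only by a constant factor. The function name shares and the
   example values are assumptions made for this illustration.

```python
def shares(priorities, sum_cwnd):
    """Split sum_cwnd among connections in proportion to their priorities.

    priorities: dict mapping a connection name to its priority P(c).
    Returns a dict mapping each connection to P(c) * sum_cwnd / sum_p.
    """
    sum_p = sum(priorities.values())
    return {c: p * sum_cwnd / sum_p for c, p in priorities.items()}


# Two connections; the second is twice as important as the first.
a = shares({"c1": 1, "c2": 2}, sum_cwnd=30)
# Scaling both priorities by the same factor leaves the shares unchanged.
b = shares({"c1": 5, "c2": 10}, sum_cwnd=30)
print(a)
print(b)
```

   In both cases c1 is assigned one third and c2 two thirds of the
   aggregate cwnd, since only the ratio P(c) / sum_p enters the rule.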

   With these code changes in place, TiU can operate as follows,
   assuming no previous TiU connections have been made between a
   specific host pair and a client tries to connect to a server:

   o  An application uses an API option to request TiU operation. The
      kernel then sends out a TiU TCP SYN that contains a TiU-Setup TCP
      option. This packet header contains the encapsulated TCP port
      numbers (source port A and destination port B) and the Connection
      ID X.

   o  The server listens on UDP port XXX (TBD, see Section 6). Upon
      receiving a packet on this port, it knows that it is a TiU packet
      and decodes it, handing the resulting TCP packet over to "normal"
      TCP processing. The TiU-Setup TCP option allows the server to
      associate future TiU packets containing Connection ID X with ports
      A and B. The server sends its response as a TiU SYN/ACK.

   o  TCP operates as normal from here on, but packets are
      TiU-encapsulated before sending them out and decapsulated upon
      reception, using Connection ID X. Both hosts associate TiU
      packets carrying Connection ID X with a local identifier that
      matches ports A and B, just like they would associate
      non-encapsulated TCP packets with the same local identifier when
      seeing ports A and B in the TCP header.

   o  If an application on either side of the TiU connection wants to
      connect to a destination host on the other side and requests TiU
      operation, the kernel sends out another TiU TCP SYN, this time
      containing a different TCP source port number and either the same
      or a different destination port number (C and D), and a TiU-Setup
      TCP option with Connection ID Y. From now on, packets carrying
      Connection ID Y will be associated with ports C and D on both
      hosts. Otherwise, TiU operation continues as described above.

   o  Now, because there are two or more connections available between
      the same host pair, coupled congestion control begins to operate
      for all outgoing TiU packets (see Section 2 for details). This is
      a local operation, applying the priority values that were
      configured for the TiU-encapsulated TCP connections.

   Unless it is known that UDP packets with destination port number XXX
   (TBD, see Section 6) can be used without problems on the path between
   two communicating hosts, it is advisable for TiU implementations to
   contain methods to fall back to non-encapsulated ("raw") TCP
   communication. Such fall-back must be supported for the case of
   Connection ID collisions anyway. Middleboxes have been known to
   track TCP connections [Honda11], and falling back to communication
   with raw TCP packets without ever using a raw TCP SYN - SYN/ACK
   handshake may lead to problems with such devices. The following
   method is recommended to efficiently fall back to raw TCP
   communication:

   o  After sending out a TiU SYN packet, additionally send a raw TCP
      SYN packet.

   o  After sending out a TiU SYN/ACK packet, additionally send a raw
      TCP SYN/ACK packet.

   o  Upon receiving a TiU SYN packet, after responding with a TiU
      SYN/ACK packet and raw TCP SYN/ACK packet, immediately store the
      encapsulated port numbers and Connection ID. As long as a TiU
      connection is ongoing, ignore any additional incoming TCP SYN or
      TCP SYN/ACK packets from the same host that carry port numbers
      matching the stored encapsulated port numbers. Otherwise, process
      TCP SYN or TCP SYN/ACK packets as normal.

   This method ensures that the TCP SYN / SYN/ACK handshake is visible
   to middleboxes and allows immediately switching back to raw TCP
   communication in case of failures.
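
   The receiver-side bookkeeping of the third step above can be
   sketched in Python as follows. This is a minimal illustration, not
   the implementation described in Section 5: the class SynFilter, its
   method names and the use of a peer address string as part of the
   lookup key are all assumptions.

```python
class SynFilter:
    """Tracks ongoing TiU connections so that duplicate raw TCP SYN or
    SYN/ACK packets for the same port pair can be ignored."""

    def __init__(self):
        # (peer, src_port, dst_port) -> Connection ID of a TiU connection
        self.tiu_ports = {}

    def on_tiu_syn(self, peer, src_port, dst_port, conn_id):
        """Store the encapsulated port numbers and Connection ID."""
        self.tiu_ports[(peer, src_port, dst_port)] = conn_id

    def on_tiu_close(self, peer, src_port, dst_port):
        """Forget a TiU connection once it is no longer ongoing."""
        self.tiu_ports.pop((peer, src_port, dst_port), None)

    def accept_raw_syn(self, peer, src_port, dst_port):
        """Return True if a raw TCP SYN or SYN/ACK is processed as normal,
        False if it matches an ongoing TiU connection and is ignored."""
        return (peer, src_port, dst_port) not in self.tiu_ports


f = SynFilter()
f.on_tiu_syn("192.0.2.1", 40000, 80, conn_id=3)
print(f.accept_raw_syn("192.0.2.1", 40000, 80))   # matches stored ports
print(f.accept_raw_syn("192.0.2.1", 40001, 80))   # unrelated connection
```

   In this sketch the first query is rejected and the second accepted,
   because only the first carries port numbers that match a stored TiU
   connection from the same host.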
   If implemented on both sides as described above and no TiU SYN or TiU
   SYN/ACK packet arrives, yet a TCP SYN or TCP SYN/ACK packet does,
   this can only mean that the other host does not support TiU, a UDP
   packet was dropped, or the UDP and TCP packets were reordered in
   transit. Reordering in the host (e.g., a server responding to a TCP
   SYN before it responds to a TiU SYN) can be a problem for similar
   methods (e.g. [RFC6555]), but it can be eliminated by prescribing
   the processing order as above.

   Because TCP does not preserve message boundaries and the size of the
   TCP header can vary depending on the options that are used, it is
   also no problem to precede the TCP header in the UDP packet with a
   different header (e.g. PLUS or SPUD [I-D.hildebrand-spud-prototype])
   without exceeding the known MTU limit. When creating a TCP segment,
   a TCP sender needs to consider the length of this header when
   calculating the segment size, just like it would consider the length
   of a TCP option. For this to work, the usage of other headers such
   as PLUS or SPUD in-between the UDP header and the TiU header must
   therefore be known to both the sender-side and receiver-side code
   that processes TiU.

3.1.1.4.  Usage Considerations

   TiU cannot work with applications that require the Urgent pointer
   (which is not recommended for use by new applications anyway
   [RFC6093], but should be considered if TiU is implemented in a way
   that allows it to be applied onto existing applications; telnet is a
   well-known example of an application that uses this functionality).
   TiU can also be used as a method to experimentally test new TCP
   functionality in the presence of middleboxes that would otherwise
   create problems (as some have been known to do [Honda11]).

   Reasons to use TiU include the benefits of [Che13] and [Den08] that
   were discussed in Section 3.1.1.1.
   TiU has the disadvantage of disabling ECMP for the TCP connections
   that it encapsulates. This can reduce the capacity usage of these
   TCP connections. It has the advantage of being able to apply TCP-CCC
   coupled congestion control, which can provide precise congestion
   window assignment based on a priority.

3.1.2.  Other Methods

   There are many possible encapsulation schemes for various use cases.
   For example, Generic UDP Encapsulation (GUE)
   [I-D.draft-ietf-nvo3-gue] allows us to multiplex several TCP
   connections onto the same UDP port number pair. Several
   encapsulation methods transmit layer-2 frames over an IP network -
   e.g. VXLAN [RFC7348] (over UDP/IP) and NvGRE [RFC7637] (over GRE/IP).
   Because layer-2 networks should be agnostic to the transport
   connections running over them, the path should not depend on the TCP
   port number pair and our algorithm should work. Some care must still
   be taken: for example, for NvGRE, [RFC7637] says: "If ECMP is used,
   it is RECOMMENDED that the ECMP hash is calculated either using the
   outer IP frame fields and entire Key field (32 bits) or the inner IP
   and transport frame fields". If routers do use the inner transport
   frame fields (typically, port numbers) for this hashing, we have the
   same problem even over NvGRE.

4.  Related Work

   The TCPMUX mechanism in [RFC1078] multiplexes TCP connections under
   the same outer transport port number; it does not, however, preserve
   the port numbers of the original TCP connections, and no method to
   couple congestion controls is described in [RFC1078].

   Congestion control coupling follows the style of RTP application
   congestion control coupling in [I-D.ietf-rmcat-coupled-cc], which is
   designed to be easy to implement, and to minimize the number of
   changes that need to be made to the underlying congestion control
   mechanisms. This method was shown to yield several benefits in
   [fse].
   TCP-CCC requires slightly deeper changes to TCP's congestion
   control, making it harder to implement than
   [I-D.ietf-rmcat-coupled-cc], but it is still a much smaller code
   change than the Congestion Manager [RFC3124].

   Combining congestion controls as TCP-CCC does has some similarities
   with Ensemble Sharing in [RFC2140], which, however, only concerns
   initial values of variables used by new connections and does not
   share the congestion window (cwnd).  The cwnd variable is shared
   across ongoing connections in [ETCP] and [EFCM], and the mechanism
   described in Section 2 resembles the mechanisms in these works, but
   neither [ETCP] nor [EFCM] addresses the problem of ECMP.

   Coupled congestion control has also been specified for Multipath TCP
   [RFC6356].  MPTCP's coupled congestion control combines the
   congestion controls of subflows that may traverse different paths,
   whereas we propose congestion control coupling for flows sharing a
   single path.  TCP-CCC builds on the assumption that all its
   encapsulated TCP connections traverse the same path.  This makes the
   two methods for coupled congestion control very different, even
   though they both aim at emulating the behavior of a single TCP
   connection in the case where all flows traverse the same network
   bottleneck.  For example, a new flow obtaining a larger-than-IW
   share of the aggregate cwnd would be inappropriate for an MPTCP
   subflow.

5.  Implementation Status

   We have implemented TCP-CCC and TiU encapsulation for both the
   sender and receiver in the FreeBSD kernel, as a simple add-on to the
   TCP implementation that is controlled via a socket option.

6.  IANA Considerations

   This document specifies a new TCP option that uses the shared
   experimental options format [RFC6994].  No value has yet been
   assigned for ExID.

   This document requires a well-known UDP port (referred to as port
   XXX in this document).
   Due to the highly experimental nature of TiU, this document is being
   shared with the community to solicit comments before requesting such
   a port number.

7.  Security Considerations

   TBD

8.  Acknowledgements

   This work has received funding from Huawei Technologies Co., Ltd.,
   and the European Union's Horizon 2020 research and innovation
   programme under grant agreement No. 644334 (NEAT).  The views
   expressed are solely those of the author(s).

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

9.2.  Informative References

   [anrw2016]
              Islam, S. and M. Welzl, "Start Me Up: Determining and
              Sharing TCP's Initial Congestion Window", ACM, IRTF, ISOC
              Applied Networking Research Workshop 2016 (ANRW 2016),
              2016.

   [Che13]    Cheshire, S., Graessley, J., and R. McGuire,
              "Encapsulation of TCP and other Transport Protocols over
              UDP", Internet-draft draft-cheshire-tcp-over-udp-00, June
              2013.

   [Den08]    Denis-Courmont, R., "UDP-Encapsulated Transport
              Protocols", Internet-draft draft-denis-udp-transport-00,
              July 2008.

   [EFCM]     Savoric, M., Karl, H., Schlager, M., Poschwatta, T., and
              A. Wolisz, "Analysis and performance evaluation of the
              EFCM common congestion controller for TCP connections",
              Computer Networks (2005), 2005.

   [ETCP]     Eggert, L., Heidemann, J., and J. Joe, "Effects of
              ensemble-TCP", ACM SIGCOMM Computer Communication Review
              (2000), 2000.

   [fse]      Islam, S., Welzl, M., Gjessing, S., and N. Khademi,
              "Coupled Congestion Control for RTP Media", ACM SIGCOMM
              Capacity Sharing Workshop (CSWS 2014) and ACM SIGCOMM CCR
              44(4) 2014; extended version available as a technical
              report from
              http://safiquli.at.ifi.uio.no/paper/fse-tech-report.pdf,
              2014.

   [Honda11]  Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A.,
              Handley, M., and H. Tokuda, "Is it still possible to
              extend TCP?", Proc. of ACM Internet Measurement Conference
              (IMC) '11, November 2011.

   [I-D.draft-ietf-nvo3-gue]
              Herbert, T., Yong, L., and O. Zia, "Generic UDP
              Encapsulation", Internet-draft draft-ietf-nvo3-gue-05,
              October 2016.

   [I-D.hildebrand-spud-prototype]
              Hildebrand, J. and B. Trammell, "Substrate Protocol for
              User Datagrams (SPUD) Prototype",
              draft-hildebrand-spud-prototype-03 (work in progress),
              March 2015.

   [I-D.ietf-rmcat-coupled-cc]
              Islam, S., Welzl, M., and S. Gjessing, "Coupled congestion
              control for RTP media", draft-ietf-rmcat-coupled-cc-03
              (work in progress), July 2016.

   [I-D.ietf-rmcat-sbd]
              Hayes, D., Ferlin, S., Welzl, M., and K. Hiorth, "Shared
              Bottleneck Detection for Coupled Congestion Control for
              RTP Media.", draft-ietf-rmcat-sbd-04 (work in progress),
              March 2016.

   [RFC1078]  Lottor, M., "TCP port service Multiplexer (TCPMUX)",
              RFC 1078, DOI 10.17487/RFC1078, November 1988.

   [RFC2140]  Touch, J., "TCP Control Block Interdependence", RFC 2140,
              DOI 10.17487/RFC2140, April 1997.

   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
              RFC 3124, DOI 10.17487/RFC3124, June 2001.

   [RFC5128]  Srisuresh, P., Ford, B., and D. Kegel, "State of
              Peer-to-Peer (P2P) Communication across Network Address
              Translators (NATs)", RFC 5128, DOI 10.17487/RFC5128,
              March 2008.

   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
              (ICE): A Protocol for Network Address Translator (NAT)
              Traversal for Offer/Answer Protocols", RFC 5245,
              DOI 10.17487/RFC5245, April 2010.

   [RFC5389]  Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
              "Session Traversal Utilities for NAT (STUN)", RFC 5389,
              DOI 10.17487/RFC5389, October 2008.

   [RFC5681]  Allman, M., Paxson, V., and E.
              Blanton, "TCP Congestion Control", RFC 5681,
              DOI 10.17487/RFC5681, September 2009.

   [RFC6093]  Gont, F. and A. Yourtchenko, "On the Implementation of the
              TCP Urgent Mechanism", RFC 6093, DOI 10.17487/RFC6093,
              January 2011.

   [RFC6356]  Raiciu, C., Handley, M., and D. Wischik, "Coupled
              Congestion Control for Multipath Transport Protocols",
              RFC 6356, DOI 10.17487/RFC6356, October 2011.

   [RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
              "IPv6 Flow Label Specification", RFC 6437,
              DOI 10.17487/RFC6437, November 2011.

   [RFC6555]  Wing, D. and A. Yourtchenko, "Happy Eyeballs: Success with
              Dual-Stack Hosts", RFC 6555, DOI 10.17487/RFC6555, April
              2012.

   [RFC6824]  Ford, A., Raiciu, C., Handley, M., and O. Bonaventure,
              "TCP Extensions for Multipath Operation with Multiple
              Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
              RFC 6994, DOI 10.17487/RFC6994, August 2013.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B.
              Khasnabish, "Mechanisms for Optimizing Link Aggregation
              Group (LAG) and Equal-Cost Multipath (ECMP) Component Link
              Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424,
              January 2015.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015.
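
Appendix A.  Illustrative Sketch (Non-Normative)

   The following Python sketch is not part of the specification.  It
   illustrates two points made in Section 3.1: a sender has to subtract
   the length of any header placed between the UDP header and the TiU
   header when it calculates the segment size, and connections that
   share one outer UDP port number pair present identical input to an
   ECMP hash over the outer header fields.  All function names,
   constants, addresses and port numbers are hypothetical.  The TiU
   header length is passed in by the caller rather than assumed, and
   CRC-32 merely stands in for whatever hash function a router might
   use.

```python
import zlib

# Sizes are in bytes. All names and values here are hypothetical and
# chosen for illustration; none of them is defined by this document.
IPV4_HEADER_LEN = 20  # IPv4 header without options
IPV6_HEADER_LEN = 40  # IPv6 fixed header
UDP_HEADER_LEN = 8


def tiu_max_payload(path_mtu, tiu_header_len, shim_header_len=0, ipv6=False):
    """Largest TCP payload that fits into one UDP datagram.

    tiu_header_len is the TiU header length including TCP options and
    is supplied by the caller. shim_header_len is the length of any
    header (e.g. PLUS or SPUD) between the UDP and TiU headers; it is
    subtracted just like the length of a TCP option would be.
    """
    ip_len = IPV6_HEADER_LEN if ipv6 else IPV4_HEADER_LEN
    overhead = ip_len + UDP_HEADER_LEN + shim_header_len + tiu_header_len
    payload = path_mtu - overhead
    if payload <= 0:
        raise ValueError("headers do not fit within the path MTU")
    return payload


def toy_ecmp_path(header_fields, num_paths):
    """Toy ECMP model: hash the given header fields onto a path index.

    Real routers use their own hash functions; CRC-32 is only a
    stand-in here.
    """
    key = "|".join(str(field) for field in header_fields).encode("ascii")
    return zlib.crc32(key) % num_paths


if __name__ == "__main__":
    # 20 is a placeholder TiU header length, not a value from this document.
    print(tiu_max_payload(1500, 20))                     # no extra header
    print(tiu_max_payload(1500, 20, shim_header_len=8))  # 8-byte extra header

    # Two TCP connections that differ only in their source port.
    conn_a = ("192.0.2.1", "198.51.100.1", 6, 50001, 80)
    conn_b = ("192.0.2.1", "198.51.100.1", 6, 50002, 80)
    # Hashing the TCP header fields may or may not pick the same path.
    print(toy_ecmp_path(conn_a, 4), toy_ecmp_path(conn_b, 4))

    # Encapsulated, both connections share one outer UDP 5-tuple, so a
    # hash over the outer fields has identical input for both of them.
    outer = ("192.0.2.1", "198.51.100.1", 17, 40000, 40001)
    print(toy_ecmp_path(outer, 4))
```

   Both sender-side and receiver-side code would need to agree on the
   value passed as shim_header_len, which mirrors the requirement that
   the use of such headers be known to both sides.  The toy hash also
   shows why a router that hashes inner transport fields can still
   spread encapsulated connections over several paths.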

Authors' Addresses

   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   Oslo  N-0316
   Norway

   Email: michawe@ifi.uio.no


   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   Oslo  N-0316
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no


   Kristian Hiorth
   University of Oslo
   PO Box 1080 Blindern
   Oslo  N-0316
   Norway

   Email: kristahi@ifi.uio.no


   Jianjie You
   Huawei
   101 Software Avenue, Yuhua District
   Nanjing  210012
   China

   Email: youjianjie@huawei.com