2 Network Working Group S. Bensley 3 Internet-Draft D. Thaler 4 Intended status: Informational P. Balasubramanian 5 Expires: March 2, 2018 Microsoft 6 L. Eggert 7 NetApp 8 G. Judd 9 Morgan Stanley 10 August 29, 2017 12 Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters 13 draft-ietf-tcpm-dctcp-10 15 Abstract 17 This informational memo describes Datacenter TCP (DCTCP), a TCP 18 congestion control scheme for datacenter traffic. DCTCP extends the 19 Explicit Congestion Notification (ECN) processing to estimate the 20 fraction of bytes that encounter congestion, rather than simply 21 detecting that some congestion has occurred. DCTCP then scales the 22 TCP congestion window based on this estimate. This method achieves 23 high burst tolerance, low latency, and high throughput with shallow- 24 buffered switches. This memo also discusses deployment issues 25 related to the coexistence of DCTCP and conventional TCP, the lack of 26 a negotiating mechanism between sender and receiver, and presents 27 some possible mitigations. This memo documents DCTCP as currently 28 implemented by several major operating systems. DCTCP as described 29 in this draft is applicable to deployments in controlled environments 30 like datacenters but it must not be deployed over the public Internet 31 without additional measures. 33 Status of This Memo 35 This Internet-Draft is submitted in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF). Note that other groups may also distribute 40 working documents as Internet-Drafts. The list of current Internet- 41 Drafts is at http://datatracker.ietf.org/drafts/current/. 43 Internet-Drafts are draft documents valid for a maximum of six months 44 and may be updated, replaced, or obsoleted by other documents at any 45 time. It is inappropriate to use Internet-Drafts as reference 46 material or to cite them other than as "work in progress." 
48 This Internet-Draft will expire on March 2, 2018. 50 Copyright Notice 52 Copyright (c) 2017 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 68 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 4 70 3.1. Marking Congestion on the L3 Switches and Routers . . . . 4 71 3.2. Echoing Congestion Information on the Receiver . . . . . 5 72 3.3. Processing Echoed Congestion Indications on the Sender . 6 73 3.4. Handling of Congestion Window Growth . . . . . . . . . . 8 74 3.5. Handling of Packet Loss . . . . . . . . . . . . . . . . . 8 75 3.6. Handling of SYN, SYN-ACK, RST Packets . . . . . . . . . . 8 76 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 8 77 4.1. Configuration of DCTCP . . . . . . . . . . . . . . . . . 8 78 4.2. Computation of DCTCP.Alpha . . . . . . . . . . . . . . . 9 79 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 10 80 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 11 81 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12 82 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 83 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 84 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 85 10.1. Normative References . . . . . . . . . . . . . . . . . . 13 86 10.2. Informative References . . . . . . . . . . . . . . . . . 13 87 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 89 1. Introduction 91 Large datacenters necessarily need many network switches to 92 interconnect their many servers. Therefore, a datacenter can greatly 93 reduce its capital expenditure by leveraging low-cost switches. 94 However, such low-cost switches tend to have limited queue capacities 95 and are thus more susceptible to packet loss due to congestion. 97 Network traffic in a datacenter is often a mix of short and long 98 flows, where the short flows require low latencies and the long flows 99 require high throughputs. Datacenters also experience incast bursts, 100 where many servers send traffic to a single server at the same time. 101 For example, this traffic pattern is a natural consequence of 102 MapReduce [MAPREDUCE] workload: The worker nodes complete at 103 approximately the same time, and all reply to the master node 104 concurrently. 106 These factors place some conflicting demands on the queue occupancy 107 of a switch: 109 o The queue must be short enough that it does not impose excessive 110 latency on short flows. 112 o The queue must be long enough to buffer sufficient data for the 113 long flows to saturate the path capacity. 115 o The queue must be long enough to absorb incast bursts without 116 excessive packet loss. 
118 Standard TCP congestion control [RFC5681] relies on packet loss to 119 detect congestion. This does not meet the demands described above. 120 First, short flows will start to experience unacceptable latencies 121 before packet loss occurs. Second, by the time TCP congestion 122 control kicks in on the senders, most of the incast burst has already 123 been dropped. 125 [RFC3168] describes a mechanism for using Explicit Congestion 126 Notification (ECN) from the switches for detection of congestion. 127 However, this method only detects the presence of congestion, not its 128 extent. In the presence of mild congestion, the TCP congestion 129 window is reduced too aggressively, and this unnecessarily reduces the 130 throughput of long flows. 132 Datacenter TCP (DCTCP) changes traditional ECN processing by 133 estimating the fraction of bytes that encounter congestion, rather 134 than simply detecting that some congestion has occurred. DCTCP then 135 scales the TCP congestion window based on this estimate. This method 136 achieves high burst tolerance, low latency, and high throughput with 137 shallow-buffered switches. DCTCP is a modification to the processing 138 of ECN by a conventional TCP and requires that standard TCP 139 congestion control be used for handling packet loss. 141 DCTCP should only be deployed in an intra-datacenter environment 142 where both endpoints and the switching fabric are under a single 143 administrative domain. DCTCP MUST NOT be deployed over the public 144 Internet without additional measures, as detailed in Section 5. 146 The objective of this Informational RFC is to document DCTCP as a new 147 approach to address TCP congestion control in data centers that is 148 known to be widely implemented and deployed. It is the consensus of the 149 IETF TCPM working group that a DCTCP standard would require further 150 work. A precise documentation of running code enables follow-up IETF 151 Experimental or Standards Track RFCs. 153 This document describes DCTCP as implemented in Microsoft Windows 154 Server 2012 [WINDOWS]. The Linux [LINUX] and FreeBSD [FREEBSD] 155 operating systems have also implemented support for DCTCP in a way 156 that is believed to follow this document. Deployment experiences 157 with DCTCP have been documented in [MORGANSTANLEY]. 159 2. Terminology 161 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 162 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 163 document are to be interpreted as described in [RFC2119]. 165 Normative language is used to describe how necessary the various 166 aspects of a DCTCP implementation are for interoperability, but even 167 compliant implementations without the measures in Sections 4-6 would 168 still only be safe to deploy in controlled environments, i.e., not 169 over the public Internet. 171 3. DCTCP Algorithm 173 There are three components involved in the DCTCP algorithm: 175 o The switches (or other intermediate devices in the network) detect 176 congestion and set the Congestion Encountered (CE) codepoint in 177 the IP header. 179 o The receiver echoes the congestion information back to the sender, 180 using the ECN-Echo (ECE) flag in the TCP header. 182 o The sender computes a congestion estimate and reacts by reducing 183 the TCP congestion window (cwnd) accordingly. 185 3.1. 
Marking Congestion on the L3 Switches and Routers 187 The level-3 (L3) switches and routers in a datacenter fabric indicate 188 congestion to the end nodes by setting the CE codepoint in the IP 189 header as specified in Section 5 of [RFC3168]. For example, the 190 switches may be configured with a congestion threshold. When a 191 packet arrives at a switch and its queue length is greater than the 192 congestion threshold, the switch sets the CE codepoint in the packet. 193 For example, Section 3.4 of [DCTCP10] suggests threshold marking with 194 a threshold K > (RTT * C)/7, where C is the link rate in packets per 195 second. In typical deployments the marking threshold is set to be a 196 small value to maintain a short average queueing delay. However, the 197 actual algorithm for marking congestion is an implementation detail 198 of the switch and will generally not be known to the sender and 199 receiver. Therefore, sender and receiver should not assume that a 200 particular marking algorithm is implemented by the switching fabric. 202 3.2. Echoing Congestion Information on the Receiver 204 According to Section 6.1.3 of [RFC3168], the receiver sets the ECE 205 flag if any of the packets being acknowledged had the CE code point 206 set. The receiver then continues to set the ECE flag until it 207 receives a packet with the Congestion Window Reduced (CWR) flag set. 208 However, the DCTCP algorithm requires more detailed congestion 209 information. In particular, the sender must be able to determine the 210 number of bytes sent that encountered congestion. Thus, the scheme 211 described in [RFC3168] does not suffice. 213 One possible solution is to ACK every packet and set the ECE flag in 214 the ACK if and only if the CE code point was set in the packet being 215 acknowledged. However, this prevents the use of delayed ACKs, which 216 are an important performance optimization in datacenters. If the 217 delayed ACK frequency is m, then an ACK is generated every m packets. 218 The typical value of m is 2 but it could be affected by ACK 219 throttling or packet coalescing techniques designed to improve 220 performance. 222 Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP 223 Congestion Encountered" (DCTCP.CE), which is initialized to false and 224 stored in the Transmission Control Block (TCB). When sending an ACK, 225 the ECE flag MUST be set if and only if DCTCP.CE is true. When 226 receiving packets, the CE codepoint MUST be processed as follows: 228 1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to 229 true and send an immediate ACK. 231 2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE 232 to false and send an immediate ACK. 234 3. Otherwise, ignore the CE codepoint. 236 Since the immediate ACK reflects the new DCTCP.CE state, it may 237 acknowledge any previously unacknowledged packets in the old state. 238 This can lead to an incorrect rate computation at the sender per 239 Section 3.3. To avoid this, an implementation MAY choose to send two 240 ACKs, one for previously unacknowledged packets and another 241 acknowledging the most recently received packet. 243 Receiver handling of the "Congestion Window Reduced" (CWR) bit is 244 also per [RFC3168] including [RFC3168-ERRATA3639]. That is, on 245 receipt of a segment with both the CE and CWR bits set, CWR is 246 processed first and then CE is processed. 248 Send immediate 249 ACK with ECE=0 250 .-----. .--------------. .-----. 251 Send 1 ACK / v v | | \ 252 for every | .------------. 
.------------. | Send 1 ACK 253 m packets | | DCTCP.CE=0 | | DCTCP.CE=1 | | for every 254 with ECE=0 | '------------' '------------' | m packets 255 \ | | ^ ^ / with ECE=1 256 '-----' '--------------' '-----' 257 Send immediate 258 ACK with ECE=1 260 Figure 1: ACK generation state machine. DCTCP.CE abbreviated as CE. 262 3.3. Processing Echoed Congestion Indications on the Sender 264 The sender estimates the fraction of bytes sent that encountered 265 congestion. The current estimate is stored in a new TCP state 266 variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be 267 updated as follows: 269 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M 271 where 273 o g is the estimation gain, a real number between 0 and 1. The 274 selection of g is left to the implementation. See Section 4 for 275 further considerations. 277 o M is the fraction of bytes sent that encountered congestion during 278 the previous observation window, where the observation window is 279 chosen to be approximately the Round Trip Time (RTT). In 280 particular, an observation window ends when all bytes in flight at 281 the beginning of the window have been acknowledged. 283 In order to update DCTCP.Alpha, the TCP state variables defined in 284 [RFC0793] are used, and three additional TCP state variables are 285 introduced: 287 o DCTCP.WindowEnd: The TCP sequence number threshold when one 288 observation window ends and another is to begin; initialized to 289 SND.UNA. 291 o DCTCP.BytesAcked: The number of sent bytes acknowledged during the 292 current observation window; initialized to zero. 294 o DCTCP.BytesMarked: The number of bytes sent during the current 295 observation window that encountered congestion; initialized to 296 zero. 298 The congestion estimator on the sender MUST process acceptable ACKs 299 as follows: 301 1. Compute the bytes acknowledged (TCP SACK options [RFC2018] are 302 ignored for this computation): 304 BytesAcked = SEG.ACK - SND.UNA 306 2. Update the bytes acknowledged: 308 DCTCP.BytesAcked += BytesAcked 310 3. If the ECE flag is set, update the bytes marked: 312 DCTCP.BytesMarked += BytesAcked 314 4. If the acknowledgment number is less than or equal to 315 DCTCP.WindowEnd, stop processing. Otherwise, the end of the 316 observation window has been reached, so proceed to update the 317 congestion estimate as follows: 319 5. Compute the congestion level for the current observation window: 321 M = DCTCP.BytesMarked / DCTCP.BytesAcked 323 6. Update the congestion estimate: 325 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M 327 7. Determine the end of the next observation window: 329 DCTCP.WindowEnd = SND.NXT 331 8. Reset the byte counters: 333 DCTCP.BytesAcked = DCTCP.BytesMarked = 0 335 9. Rather than always halving the congestion window as described in 336 [RFC3168], the sender SHOULD update cwnd as follows: 338 cwnd = cwnd * (1 - DCTCP.Alpha / 2) 340 Just as specified in [RFC3168], DCTCP does not react to congestion 341 indications more than once for every window of data. The setting of 342 the "Congestion Window Reduced" (CWR) bit is also as per [RFC3168]. 343 This is required for interoperation with classic ECN receivers due to 344 potential misconfigurations. 346 3.4. Handling of Congestion Window Growth 348 A DCTCP sender grows its congestion window in the same way as 349 conventional TCP. Slow start and congestion avoidance algorithms are 350 handled as specified in [RFC5681]. 
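   To make the steps in Section 3.3 concrete, the following minimal
   user-space sketch in C illustrates the per-ACK processing described
   above.  It is illustrative only: it is not excerpted from the
   Windows, Linux, or FreeBSD implementations referenced in Section 1,
   all identifiers are hypothetical, and it omits sequence number
   wraparound, SACK processing, congestion window growth, loss
   handling, and the rule that cwnd is reduced at most once per window
   of data.

   /* Illustrative sketch of the DCTCP sender-side estimator
    * (Section 3.3).  Hypothetical code, not taken from any cited
    * implementation. */
   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   struct dctcp_sender {
       double   alpha;         /* DCTCP.Alpha, initialized to 1 */
       double   g;             /* estimation gain, 0 < g < 1 */
       uint32_t window_end;    /* DCTCP.WindowEnd, init to SND.UNA */
       uint64_t bytes_acked;   /* DCTCP.BytesAcked */
       uint64_t bytes_marked;  /* DCTCP.BytesMarked */
       double   cwnd;          /* congestion window in bytes */
       uint32_t snd_una;       /* oldest unacknowledged sequence number */
       uint32_t snd_nxt;       /* next sequence number to be sent */
   };

   /* Process one acceptable ACK carrying SEG.ACK and the ECE flag
    * (steps 1-9 of Section 3.3). */
   static void dctcp_process_ack(struct dctcp_sender *s,
                                 uint32_t seg_ack, bool ece)
   {
       uint32_t bytes_acked = seg_ack - s->snd_una;        /* step 1 */

       s->bytes_acked += bytes_acked;                      /* step 2 */
       if (ece)
           s->bytes_marked += bytes_acked;                 /* step 3 */
       s->snd_una = seg_ack;

       if (seg_ack <= s->window_end)                       /* step 4 */
           return;

       /* End of the observation window: update the estimate. */
       double m = s->bytes_acked ?
           (double)s->bytes_marked / (double)s->bytes_acked : 0.0;
                                                           /* step 5 */
       s->alpha = s->alpha * (1.0 - s->g) + s->g * m;      /* step 6 */
       s->window_end = s->snd_nxt;                         /* step 7 */
       s->bytes_acked = s->bytes_marked = 0;               /* step 8 */
       s->cwnd = s->cwnd * (1.0 - s->alpha / 2.0);         /* step 9 */
   }

   int main(void)
   {
       struct dctcp_sender s = {
           .alpha = 1.0, .g = 1.0 / 16.0, .window_end = 0,
           .cwnd = 10 * 1460.0, .snd_una = 0, .snd_nxt = 0
       };

       /* Simulate three rounds of ten full-sized segments; in each
        * round the first three ACKs carry ECE (~30% marked bytes). */
       for (int rtt = 1; rtt <= 3; rtt++) {
           s.snd_nxt += 10 * 1460;
           for (int i = 1; i <= 10; i++)
               dctcp_process_ack(&s, s.snd_una + 1460, i <= 3);
           printf("RTT %d: DCTCP.Alpha=%.4f cwnd=%.0f\n",
                  rtt, s.alpha, s.cwnd);
       }
       return 0;
   }

   Compiled and run, the sketch shows DCTCP.Alpha decaying gradually
   from its initial value of 1 toward the assumed marked-byte fraction
   while cwnd is scaled down at the end of each observation window.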
352 3.5. Handling of Packet Loss 354 A DCTCP sender MUST react to loss episodes in the same way as 355 conventional TCP, including fast retransmit and fast recovery 356 algorithms, as specified in [RFC5681]. For cases where the packet 357 loss is inferred and not explicitly signaled by ECN, the cwnd and 358 other state variables like ssthresh MUST be changed in the same way 359 that a conventional TCP would have changed them. As with ECN, a DCTCP 360 sender will only reduce the cwnd once per window of data across all 361 loss signals. Just as specified in [RFC5681], upon a timeout, the 362 cwnd MUST be set to no more than the loss window (1 full-sized 363 segment), regardless of previous cwnd reductions in a given window of 364 data. 366 3.6. Handling of SYN, SYN-ACK, RST Packets 368 If SYN, SYN-ACK and RST packets for DCTCP connections have the "ECN 369 Capable Transport" (ECT) codepoint set in the IP header, they will 370 receive the same treatment as other DCTCP packets when forwarded by a 371 switching fabric under load. Lack of ECT in these packets can result 372 in a higher drop rate depending on the switching fabric 373 configuration. Hence, for DCTCP connections, the sender SHOULD set 374 ECT for SYN, SYN-ACK and RST packets. A DCTCP receiver ignores CE 375 codepoints set on any SYN, SYN-ACK, or RST packets. 377 4. Implementation Issues 379 4.1. Configuration of DCTCP 381 An implementation needs to know when to use DCTCP. Datacenter 382 servers may need to communicate with endpoints outside the 383 datacenter, where DCTCP is unsuitable or unsupported. Thus, a global 384 configuration setting to enable DCTCP will generally not suffice. 385 DCTCP provides no mechanism for negotiating its use. Thus, 386 additional management and configuration functionality is needed to 387 ensure that DCTCP is not used with non-DCTCP endpoints. 389 Known solutions rely on either configuration or heuristics. 390 Heuristics need to allow endpoints to individually enable DCTCP, to 391 ensure a DCTCP sender is always paired with a DCTCP receiver. One 392 approach is to enable DCTCP based on the IP address of the remote 393 endpoint. Another approach is to detect connections that transmit 394 within the bounds of a datacenter. For example, an implementation could 395 support automatic selection of DCTCP if the estimated RTT is less 396 than a threshold (like 10 msec) and ECN is successfully negotiated, 397 under the assumption that if the RTT is low, then the two endpoints 398 are likely in the same datacenter network. 400 [RFC3168] forbids the ECN-marking of pure ACK packets, because of the 401 inability of TCP to mitigate ACK-path congestion. RFC 3168 also 402 forbids ECN-marking of retransmissions, window probes and RSTs. 403 However, dropping all these control packets - rather than ECN marking 404 them - has considerable performance disadvantages. It is RECOMMENDED 405 that an implementation provide a configuration knob that will cause 406 ECT to be set on such control packets, which can be used in 407 environments where such concerns do not apply. See 408 [ECN-EXPERIMENTATION] for details. 410 It is useful to implement DCTCP as additional actions on top of an 411 existing congestion control algorithm like Reno [RFC5681]. The DCTCP 412 implementation MAY also allow configuration of resetting the value of 413 DCTCP.Alpha as part of processing any loss episodes. 415 4.2. Computation of DCTCP.Alpha 417 As noted in Section 3.3, the implementation will need to choose a 418 suitable estimation gain. 
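   The choice involves a trade-off between how quickly the estimate
   converges and how much it smooths transient bursts of marking.  As
   a purely hypothetical illustration (not drawn from any cited
   implementation), the following short C program iterates the update
   formula from Section 3.3 against a constant marking fraction M and
   reports how many observation windows DCTCP.Alpha needs to move from
   its initial value of 1 to within 5% of M.

   #include <math.h>
   #include <stdio.h>

   /* Toy calculation: observation windows needed for DCTCP.Alpha
    * (initialized to 1) to approach an assumed constant marking
    * fraction M, for two example gains. */
   int main(void)
   {
       const double m = 0.1;               /* assumed marking fraction */
       const double gains[] = { 1.0 / 16.0, 1.0 / 4.0 };

       for (int i = 0; i < 2; i++) {
           double g = gains[i], alpha = 1.0;
           int windows = 0;
           while (fabs(alpha - m) > 0.05 * (1.0 - m) && windows < 1000) {
               alpha = alpha * (1.0 - g) + g * m;  /* Section 3.3 update */
               windows++;
           }
           printf("g = %.4f: about %d observation windows\n", g, windows);
       }
       return 0;
   }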
[DCTCP10] provides a theoretical basis for 419 selecting the gain. However, it may be more practical to use 420 experimentation to select a suitable gain for a particular network 421 and workload. A fixed estimation gain of 1/16 is used in some 422 implementations. (It should be noted that values of 0 or 1 for g 423 result in problematic behavior; g=0 fixes DCTCP.Alpha to its initial 424 value and g=1 sets it to M without any smoothing.) 426 The DCTCP.Alpha computation as per the formula in Section 3.3 427 involves fractions. An efficient kernel implementation MAY scale the 428 DCTCP.Alpha value for efficient computation using shift operations. 429 For example, if the implementation chooses g as 1/16, multiplications 430 of DCTCP.Alpha by g become right-shifts by 4. A scaling 431 implementation SHOULD ensure that DCTCP.Alpha is able to reach zero 432 once it falls below the smallest shifted value (16 in the above 433 example). At the other extreme, a scaled update needs to ensure 434 DCTCP.Alpha does not exceed the scaling factor, which would be 435 equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be 436 clamped after an update. 438 This results in the following computations replacing steps 5 and 6 in 439 Section 3.3, where SCF is the chosen scaling factor (65536 in the 440 example) and SHF is the shift factor (4 in the example): 442 1. Compute the congestion level for the current observation window: 444 ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked 446 2. Update the congestion estimate: 448 if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 450 DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) 452 if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF 454 5. Deployment Issues 456 DCTCP and conventional TCP congestion control do not coexist well in 457 the same network. In typical DCTCP deployments, the marking 458 threshold in the switching fabric is set to a very low value to 459 reduce queueing delay, and a relatively small amount of congestion 460 will exceed the marking threshold. During such periods of 461 congestion, conventional TCP will suffer packet loss and quickly and 462 drastically reduce cwnd. DCTCP, on the other hand, will use the 463 fraction of marked packets to reduce cwnd more gradually. Thus, the 464 rate reduction in DCTCP will be much slower than that of conventional 465 TCP, and DCTCP traffic will gain a larger share of the capacity 466 compared to conventional TCP traffic traversing the same path. If 467 the traffic in the datacenter is a mix of conventional TCP and DCTCP, 468 it is RECOMMENDED that DCTCP traffic be segregated from conventional 469 TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP 470 Differentiated Services Code Point (DSCP) bits to segregate the 471 network such that Active Queue Management (AQM) [RFC7567] is applied 472 to DCTCP traffic, whereas TCP traffic is managed via drop-tail 473 queueing. 475 Deployments should take into account segregation of non-TCP traffic 476 as well. Today's commodity switches allow configuration of different 477 marking/drop profiles for non-TCP and non-IP packets. Non-TCP and 478 non-IP packets should be able to pass through such switches, unless 479 they really run out of buffer space. 481 Since DCTCP relies on congestion marking by the switches, DCTCP's 482 potential can only be realized in datacenters where the entire 483 network infrastructure supports ECN. The switches may also support 484 configuration of the congestion threshold used for marking. 
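   As a purely illustrative calculation of the guideline quoted in
   Section 3.1, using assumed numbers rather than recommended ones: on
   a 10 Gbps link carrying 1500-byte packets, C is roughly 833,000
   packets per second; with an assumed fabric RTT of 100 microseconds,
   K > (0.0001 * 833,000) / 7, i.e., a marking threshold on the order
   of 12 packets, or about 18 KB of queue.  Any threshold derived this
   way should be validated against the actual fabric and workload.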
The 485 proposed parameterization can be configured with switches that 486 implement Random Early Detection (RED) [RFC2309]. [DCTCP10] provides 487 a theoretical basis for selecting the congestion threshold, but as 488 with the estimation gain, it may be more practical to rely on 489 experimentation or simply to use the default configuration of the 490 device. DCTCP will revert to loss-based congestion control when 491 packet loss is experienced (e.g. when transiting a congested drop- 492 tail link, or a link with an AQM drop behavior). 494 DCTCP requires changes on both the sender and the receiver, so both 495 endpoints must support DCTCP. Furthermore, DCTCP provides no 496 mechanism for negotiating its use, so both endpoints must be 497 configured through some out-of-band mechanism to use DCTCP. A 498 variant of DCTCP that can be deployed unilaterally and only requires 499 standard ECN behavior has been described in [ODCTCP][BSDCAN], but 500 requires additional experimental evaluation. 502 6. Known Issues 504 DCTCP relies on the sender's ability to reconstruct the stream of CE 505 codepoints received by the remote endpoint. To accomplish this, 506 DCTCP avoids using a single ACK packet to acknowledge segments 507 received both with and without the CE codepoint set. However, if one 508 or more ACK packets are dropped, it is possible that a subsequent ACK 509 will cumulatively acknowledge a mix of CE and non-CE segments. This 510 will, of course, result in a less accurate congestion estimate. 511 There are some potential considerations: 513 o Even with an inaccurate congestion estimate, DCTCP may still 514 perform better than [RFC3168]. 516 o If the estimation gain is small relative to the packet loss rate, 517 the estimate may not be too inaccurate. 519 o If ACK packet loss mostly occurs under heavy congestion, most 520 drops will occur during an unbroken string of CE packets, and the 521 estimate will be unaffected. 523 However, the effect of packet drops on DCTCP under real world 524 conditions has not been analyzed. 526 DCTCP provides no mechanism for negotiating its use. The effect of 527 using DCTCP with a standard ECN endpoint has been analyzed in 528 [ODCTCP][BSDCAN]. Furthermore, it is possible that other 529 implementations may also modify [RFC3168] behavior without 530 negotiation, causing further interoperability issues. 532 Much like standard TCP, DCTCP is biased against flows with longer 533 RTTs. A method for improving the RTT fairness of DCTCP has been 534 proposed in [ADCTCP], but requires additional experimental 535 evaluation. 537 7. Security Considerations 539 DCTCP enhances ECN and thus inherits the general security 540 considerations discussed in [RFC3168], although additional mitigation 541 options exist due to the limited intra-datacenter deployment of 542 DCTCP. 544 The processing changes introduced by DCTCP do not exacerbate the 545 considerations in [RFC3168] or introduce new ones. In particular, 546 with either algorithm, the network infrastructure or the remote 547 endpoint can falsely report congestion and thus cause the sender to 548 reduce cwnd. However, this is no worse than what can be achieved by 549 simply dropping packets. 551 [RFC3168] requires that a compliant TCP must not set ECT on SYN or 552 SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets, 553 but maintains the restriction of no ECT on SYN packets. Both these 554 RFCs prohibit ECT in SYN packets due to security concerns regarding 555 malicious SYN packets with ECT set. 
These RFCs, however, are 556 intended for general Internet use, and do not directly apply to a 557 controlled datacenter environment. The security concerns addressed 558 by both these RFCs might not apply in controlled environments like 559 datacenters, and it might not be necessary to account for the 560 presence of non-ECN servers. Beyond the security considerations 561 related to virtual servers, additional security can be imposed in the 562 physical servers to intercept and drop traffic resembling an attack. 564 8. IANA Considerations 566 This document has no actions for IANA. 568 9. Acknowledgements 570 The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] 571 by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, 572 Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari 573 Sridharan. 575 We would like to thank Andrew Shewmaker for identifying the problem 576 of clamping DCTCP.Alpha and proposing a solution for it. 578 Lars Eggert has received funding from the European Union's Horizon 579 2020 research and innovation program 2014-2018 under grant agreement 580 No. 644866 ("SSICLOPS"). This document reflects only the authors' 581 views and the European Commission is not responsible for any use that 582 may be made of the information it contains. 584 10. References 586 10.1. Normative References 588 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 589 RFC 793, DOI 10.17487/RFC0793, September 1981, 590 . 592 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 593 Selective Acknowledgment Options", RFC 2018, 594 DOI 10.17487/RFC2018, October 1996, . 597 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 598 Requirement Levels", BCP 14, RFC 2119, 599 DOI 10.17487/RFC2119, March 1997, . 602 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 603 of Explicit Congestion Notification (ECN) to IP", 604 RFC 3168, DOI 10.17487/RFC3168, September 2001, 605 . 607 [RFC3168-ERRATA3639] 608 Scheffenegger, R., "RFC3168 Errata ID 3639", 2013, 609 . 612 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 613 Ramakrishnan, "Adding Explicit Congestion Notification 614 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 615 DOI 10.17487/RFC5562, June 2009, . 618 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 619 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 620 . 622 10.2. Informative References 624 [ADCTCP] Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis 625 of DCTCP: Stability, Convergence, and Fairness", 626 DOI 10.1145/1993744.1993753, Proc. ACM SIGMETRICS Joint 627 International Conference on Measurement and Modeling of 628 Computer Systems (SIGMETRICS 11), June 2011, 629 . 631 [BSDCAN] Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and 632 H. Tokuda, "Extensions to FreeBSD Datacenter TCP for 633 Incremental Deployment Support", BSDCan 2015, June 2015, 634 . 636 [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, 637 P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data 638 Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc. 639 ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010, 640 . 642 [ECN-EXPERIMENTATION] 643 Black, D., "Explicit Congestion Notification (ECN) 644 Experimentation", 2017, . 647 [FREEBSD] Kato, M. and H. Panchasara, "DCTCP (Data Center TCP) 648 implementation", 2015, 649 . 652 [LINUX] Borkmann, D. and F. Westphal, "Linux DCTCP patch", 2014, 653 . 657 [MAPREDUCE] 658 Dean, J. and S. 
Ghemawat, "MapReduce: Simplified Data 659 Processing on Large Clusters", Proc. 6th ACM/USENIX 660 Symposium on Operating Systems Design and Implementation 661 (OSDI 04), December 2004, . 664 [MORGANSTANLEY] 665 Judd, G., "Attaining the Promise and Avoiding the Pitfalls 666 of TCP in the Datacenter", Proc. 12th USENIX Symposium on 667 Networked Systems Design and Implementation (NSDI 15), May 668 2015, . 671 [ODCTCP] Kato, M., "Improving Transmission Performance with One- 672 Sided Datacenter TCP", M.S. Thesis, Keio University, 673 2014, . 675 [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, 676 S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., 677 Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, 678 S., Wroclawski, J., and L. Zhang, "Recommendations on 679 Queue Management and Congestion Avoidance in the 680 Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, 681 . 683 [RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF 684 Recommendations Regarding Active Queue Management", 685 BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015, 686 . 688 [WINDOWS] Microsoft, "Windows DCTCP reference", 2012, 689 . 692 Authors' Addresses 694 Stephen Bensley 695 Microsoft 696 One Microsoft Way 697 Redmond, WA 98052 698 USA 700 Phone: +1 425 703 5570 701 Email: sbens@microsoft.com 703 Dave Thaler 704 Microsoft 706 Phone: +1 425 703 8835 707 Email: dthaler@microsoft.com 709 Praveen Balasubramanian 710 Microsoft 712 Phone: +1 425 538 2782 713 Email: pravb@microsoft.com 714 Lars Eggert 715 NetApp 716 Sonnenallee 1 717 Kirchheim 85551 718 Germany 720 Phone: +49 151 120 55791 721 Email: lars@netapp.com 722 URI: http://eggert.org/ 724 Glenn Judd 725 Morgan Stanley 727 Phone: +1 973 979 6481 728 Email: glenn.judd@morganstanley.com