Network Working Group                                         S. Bensley
Internet-Draft                                                 D. Thaler
Intended status: Informational                        P. Balasubramanian
Expires: January 18, 2018                                      Microsoft
                                                               L. Eggert
                                                                  NetApp
                                                                 G. Judd
                                                          Morgan Stanley
                                                           July 17, 2017

     Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
                         draft-ietf-tcpm-dctcp-09

Abstract

   This informational memo describes Datacenter TCP (DCTCP), a TCP
   congestion control scheme for datacenter traffic.  DCTCP extends the
   Explicit Congestion Notification (ECN) processing to estimate the
   fraction of bytes that encounter congestion, rather than simply
   detecting that some congestion has occurred.  DCTCP then scales the
   TCP congestion window based on this estimate.  This method achieves
   high burst tolerance, low latency, and high throughput with shallow-
   buffered switches.  This memo also discusses deployment issues
   related to the coexistence of DCTCP and conventional TCP, the lack
   of a negotiating mechanism between sender and receiver, and presents
   some possible mitigations.  This memo documents DCTCP as currently
   implemented by several major operating systems.  DCTCP as described
   in this draft is applicable to deployments in controlled
   environments like datacenters but it must not be deployed over the
   public Internet without additional measures.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This Internet-Draft will expire on January 18, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  DCTCP Algorithm
     3.1.  Marking Congestion on the L3 Switches and Routers
     3.2.  Echoing Congestion Information on the Receiver
     3.3.  Processing Echoed Congestion Indications on the Sender
     3.4.  Handling of packet loss
     3.5.  Handling of SYN, SYN-ACK, RST Packets
   4.  Implementation Issues
     4.1.  Configuration of DCTCP
     4.2.  Computation of DCTCP.Alpha
   5.  Deployment Issues
   6.  Known Issues
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Large datacenters necessarily need many network switches to
   interconnect their many servers.  Therefore, a datacenter can
   greatly reduce its capital expenditure by leveraging low-cost
   switches.  However, such low-cost switches tend to have limited
   queue capacities and are thus more susceptible to packet loss due to
   congestion.

   Network traffic in a datacenter is often a mix of short and long
   flows, where the short flows require low latencies and the long
   flows require high throughputs.  Datacenters also experience incast
   bursts, where many servers send traffic to a single server at the
   same time.  For example, this traffic pattern is a natural
   consequence of MapReduce [MAPREDUCE] workload: The worker nodes
   complete at approximately the same time, and all reply to the master
   node concurrently.

   These factors place some conflicting demands on the queue occupancy
   of a switch:

   o  The queue must be short enough that it does not impose excessive
      latency on short flows.

   o  The queue must be long enough to buffer sufficient data for the
      long flows to saturate the path capacity.

   o  The queue must be long enough to absorb incast bursts without
      excessive packet loss.

   Standard TCP congestion control [RFC5681] relies on packet loss to
   detect congestion.  This does not meet the demands described above.
   First, short flows will start to experience unacceptable latencies
   before packet loss occurs.  Second, by the time TCP congestion
   control kicks in on the senders, most of the incast burst has
   already been dropped.

   [RFC3168] describes a mechanism for using Explicit Congestion
   Notification (ECN) from the switches for detection of congestion.
   However, this method only detects the presence of congestion, not
   its extent.  In the presence of mild congestion, the TCP congestion
   window is reduced too aggressively, and this unnecessarily reduces
   the throughput of long flows.

   Datacenter TCP (DCTCP) changes traditional ECN processing by
   estimating the fraction of bytes that encounter congestion, rather
   than simply detecting that some congestion has occurred.  DCTCP then
   scales the TCP congestion window based on this estimate.  This
   method achieves high burst tolerance, low latency, and high
   throughput with shallow-buffered switches.  DCTCP is a modification
   to the processing of ECN by a conventional TCP and requires that
   standard TCP congestion control be used for handling packet loss.

   DCTCP should only be deployed in an intra-datacenter environment
   where both endpoints and the switching fabric are under a single
   administrative domain.  DCTCP MUST NOT be deployed over the public
   Internet without additional measures, as detailed in Section 5.

   The objective of this Informational RFC is to document DCTCP as a
   new approach to address TCP congestion control in data centers that
   is known to be widely implemented and deployed.  It is the consensus
   of the IETF TCPM working group that a DCTCP standard would require
   further work.  A precise documentation of running code enables
   follow-up IETF Experimental or Standards Track RFCs.

   This document describes DCTCP as implemented in Microsoft Windows
   Server 2012 [WINDOWS].  The Linux [LINUX] and FreeBSD [FREEBSD]
   operating systems have also implemented support for DCTCP in a way
   that is believed to follow this document.  Deployment experiences
   with DCTCP have been documented in [MORGANSTANLEY].

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Normative language is used to describe how necessary the various
   aspects of a DCTCP implementation are for interoperability, but even
   compliant implementations without the measures in Sections 4-6 would
   still only be safe to deploy in controlled environments, i.e., not
   over the public Internet.

3.  DCTCP Algorithm

   There are three components involved in the DCTCP algorithm:

   o  The switches (or other intermediate devices in the network)
      detect congestion and set the Congestion Encountered (CE)
      codepoint in the IP header.

   o  The receiver echoes the congestion information back to the
      sender, using the ECN-Echo (ECE) flag in the TCP header.

   o  The sender computes a congestion estimate and reacts by reducing
      the TCP congestion window (cwnd) accordingly.

3.1.  Marking Congestion on the L3 Switches and Routers

   The level-3 (L3) switches and routers in a datacenter fabric
   indicate congestion to the end nodes by setting the CE codepoint in
   the IP header as specified in Section 5 of [RFC3168].  For example,
   the switches may be configured with a congestion threshold.  When a
   packet arrives at a switch and its queue length is greater than the
   congestion threshold, the switch sets the CE codepoint in the
   packet.  For example, Section 3.4 of [DCTCP10] suggests threshold
   marking with a threshold K > (RTT * C)/7, where C is the link rate
   in packets per second.  In typical deployments, the marking
   threshold is set to a small value to maintain a short average
   queueing delay.  However, the actual algorithm for marking
   congestion is an implementation detail of the switch and will
   generally not be known to the sender and receiver.  Therefore,
   sender and receiver should not assume that a particular marking
   algorithm is implemented by the switching fabric.
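   As a non-normative illustration of the threshold-marking behavior
   described above, the following C sketch shows the logic a switch
   might apply on packet arrival.  All names (struct pkt, on_enqueue(),
   and so on) are hypothetical; real switches implement marking in
   hardware with vendor-specific algorithms.

      #include <stdint.h>

      /* Minimal stand-in for the IP header's 2-bit ECN field
       * (RFC 3168): Not-ECT = 0b00, ECT(1) = 0b01, ECT(0) = 0b10,
       * CE = 0b11. */
      #define ECN_NOT_ECT 0x0
      #define ECN_CE      0x3

      struct pkt {
          uint8_t ecn;    /* 2-bit ECN field from the IP header */
      };

      /* Threshold marking: when the instantaneous queue length
       * exceeds the configured threshold K, mark (rather than drop)
       * ECN-capable packets.  [DCTCP10] suggests K > (RTT * C)/7,
       * with C in packets per second. */
      static void on_enqueue(struct pkt *p, unsigned queue_len,
                             unsigned k)
      {
          if (queue_len > k && p->ecn != ECN_NOT_ECT)
              p->ecn = ECN_CE;  /* signal congestion to endpoints */
      }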
3.2.  Echoing Congestion Information on the Receiver

   According to Section 6.1.3 of [RFC3168], the receiver sets the ECE
   flag if any of the packets being acknowledged had the CE code point
   set.  The receiver then continues to set the ECE flag until it
   receives a packet with the Congestion Window Reduced (CWR) flag set.
   However, the DCTCP algorithm requires more detailed congestion
   information.  In particular, the sender must be able to determine
   the number of bytes sent that encountered congestion.  Thus, the
   scheme described in [RFC3168] does not suffice.

   One possible solution is to ACK every packet and set the ECE flag in
   the ACK if and only if the CE code point was set in the packet being
   acknowledged.  However, this prevents the use of delayed ACKs, which
   are an important performance optimization in datacenters.  If the
   delayed ACK frequency is m, then an ACK is generated every m
   packets.  The typical value of m is 2, but it could be affected by
   ACK throttling or packet coalescing techniques designed to improve
   performance.

   Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP
   Congestion Encountered" (DCTCP.CE), which is initialized to false
   and stored in the Transmission Control Block (TCB).  When sending an
   ACK, the ECE flag MUST be set if and only if DCTCP.CE is true.  When
   receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE
       to true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set
       DCTCP.CE to false and send an immediate ACK.

   3.  Otherwise, ignore the CE codepoint.

   Since the immediate ACK reflects the new DCTCP.CE state, it may
   acknowledge any previously unacknowledged packets in the old state.
   This can lead to an incorrect rate computation at the sender per
   Section 3.3.  To avoid this, an implementation MAY choose to send
   two ACKs, one for previously unacknowledged packets and another
   acknowledging the most recently received packet.

   Receiver handling of the "Congestion Window Reduced" (CWR) bit is
   also per [RFC3168], including [RFC3168-ERRATA3639].  That is, on
   receipt of a segment with both the CE and CWR bits set, CWR is
   processed first and then CE is processed.

                            Send immediate
                            ACK with ECE=0
                  .-----.   .--------------.   .-----.
     Send 1 ACK  /       v v                |  |      \
      for every |   .------------.   .------------.   |  Send 1 ACK
      m packets |   |    CE=0    |   |    CE=1    |   |   for every
     with ECE=0 |   '------------'   '------------'   |   m packets
                 \      |      |       ^        ^    /    with ECE=1
                  '-----'      '-------'        '---'
                            Send immediate
                            ACK with ECE=1

    Figure 1: ACK generation state machine.  DCTCP.CE abbreviated as
                                    CE.
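   The state machine above translates directly into code.  The
   following C sketch is a non-normative illustration of rules 1-3
   combined with delayed ACKs, using the single-ACK option; the names
   (struct dctcp_rcv, on_data_packet(), send_ack()) are hypothetical
   and do not correspond to any particular stack.

      #include <stdbool.h>
      #include <stdio.h>

      struct dctcp_rcv {
          bool ce;           /* DCTCP.CE from the TCB, initially false */
          unsigned unacked;  /* data packets since the last ACK */
          unsigned m;        /* delayed-ACK frequency, typically 2 */
      };

      /* Stub for the stack's ACK transmission path; ece is the value
       * placed in the TCP header's ECE flag. */
      static void send_ack(bool ece)
      {
          printf("ACK sent, ECE=%d\n", (int)ece);
      }

      static void on_data_packet(struct dctcp_rcv *r, bool ce_codepoint)
      {
          if (ce_codepoint != r->ce) {
              /* Rules 1 and 2: flip DCTCP.CE and send an immediate
               * ACK reflecting the new state.  An implementation MAY
               * instead send two ACKs here, one for packets received
               * in the old state and one for the newest packet. */
              r->ce = ce_codepoint;
              send_ack(r->ce);
              r->unacked = 0;
          } else if (++r->unacked >= r->m) {
              /* Rule 3 plus delayed ACKs: one ACK per m packets,
               * with ECE set if and only if DCTCP.CE is true. */
              send_ack(r->ce);
              r->unacked = 0;
          }
      }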
3.3.  Processing Echoed Congestion Indications on the Sender

   The sender estimates the fraction of bytes sent that encountered
   congestion.  The current estimate is stored in a new TCP state
   variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be
   updated as follows:

      DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

   where

   o  g is the estimation gain, a real number between 0 and 1.  The
      selection of g is left to the implementation.  See Section 4 for
      further considerations.

   o  M is the fraction of bytes sent that encountered congestion
      during the previous observation window, where the observation
      window is chosen to be approximately the Round Trip Time (RTT).
      In particular, an observation window ends when all bytes in
      flight at the beginning of the window have been acknowledged.

   In order to update DCTCP.Alpha, the TCP state variables defined in
   [RFC0793] are used, and three additional TCP state variables are
   introduced:

   o  DCTCP.WindowEnd: The TCP sequence number threshold when one
      observation window ends and another is to begin; initialized to
      SND.UNA.

   o  DCTCP.BytesAcked: The number of sent bytes acknowledged during
      the current observation window; initialized to zero.

   o  DCTCP.BytesMarked: The number of bytes sent during the current
      observation window that encountered congestion; initialized to
      zero.

   The congestion estimator on the sender MUST process acceptable ACKs
   as follows:

   1.  Compute the bytes acknowledged (TCP SACK options [RFC2018] are
       ignored for this computation):

          BytesAcked = SEG.ACK - SND.UNA

   2.  Update the bytes sent:

          DCTCP.BytesAcked += BytesAcked

   3.  If the ECE flag is set, update the bytes marked:

          DCTCP.BytesMarked += BytesAcked

   4.  If the acknowledgment number is less than or equal to
       DCTCP.WindowEnd, stop processing.  Otherwise, the end of the
       observation window has been reached, so proceed to update the
       congestion estimate as follows:

   5.  Compute the congestion level for the current observation window:

          M = DCTCP.BytesMarked / DCTCP.BytesAcked

   6.  Update the congestion estimate:

          DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

   7.  Determine the end of the next observation window:

          DCTCP.WindowEnd = SND.NXT

   8.  Reset the byte counters:

          DCTCP.BytesAcked = DCTCP.BytesMarked = 0

   9.  Rather than always halving the congestion window as described in
       [RFC3168], the sender SHOULD update cwnd as follows:

          cwnd = cwnd * (1 - DCTCP.Alpha / 2)

   Just as specified in [RFC3168], DCTCP does not react to congestion
   indications more than once for every window of data.  The setting of
   the "Congestion Window Reduced" (CWR) bit is also as per [RFC3168].
   This is required for interop with classic ECN receivers due to
   potential misconfigurations.
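   A non-normative C sketch of the per-ACK processing in steps 1-9
   follows.  The names (struct dctcp_snd, dctcp_ack(),
   dctcp_enter_cwr()) are hypothetical; sequence numbers are compared
   in modular arithmetic, and the cwnd update of step 9 is shown
   separately because a real stack applies it at most once per window
   of data.

      #include <stdbool.h>
      #include <stdint.h>

      struct dctcp_snd {
          double   alpha;        /* DCTCP.Alpha, initialized to 1 */
          double   g;            /* estimation gain, e.g. 1.0 / 16 */
          uint32_t window_end;   /* DCTCP.WindowEnd, init to SND.UNA */
          uint64_t bytes_acked;  /* DCTCP.BytesAcked */
          uint64_t bytes_marked; /* DCTCP.BytesMarked */
      };

      /* Process one acceptable ACK.  seg_ack, snd_una, and snd_nxt
       * correspond to SEG.ACK, SND.UNA, and SND.NXT from RFC 793;
       * ece is the ECE flag of the ACK.  SACK options are ignored,
       * as in step 1. */
      static void dctcp_ack(struct dctcp_snd *s, uint32_t seg_ack,
                            uint32_t snd_una, uint32_t snd_nxt,
                            bool ece)
      {
          uint32_t bytes_acked = seg_ack - snd_una;    /* step 1 */

          s->bytes_acked += bytes_acked;               /* step 2 */
          if (ece)
              s->bytes_marked += bytes_acked;          /* step 3 */

          /* Step 4: modular "less than or equal" in sequence space. */
          if ((int32_t)(seg_ack - s->window_end) <= 0)
              return;

          /* Steps 5-8: end of the observation window. */
          double m = (double)s->bytes_marked / (double)s->bytes_acked;
          s->alpha = s->alpha * (1 - s->g) + s->g * m;
          s->window_end = snd_nxt;
          s->bytes_acked = s->bytes_marked = 0;
      }

      /* Step 9: reaction to ECN-Echo, applied at most once per
       * window of data instead of halving cwnd. */
      static void dctcp_enter_cwr(const struct dctcp_snd *s,
                                  double *cwnd)
      {
          *cwnd = *cwnd * (1 - s->alpha / 2);
      }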
3.4.  Handling of packet loss

   A DCTCP sender MUST react to loss episodes in the same way as
   conventional TCP.  For cases where the packet loss is inferred and
   not explicitly signaled by ECN, the cwnd and other state variables
   like ssthresh MUST be changed in the same way that a conventional
   TCP would have changed them.  As with ECN, a DCTCP sender will only
   reduce the cwnd once per window of data across all loss signals.
   Just as specified in [RFC5681], upon a timeout, the cwnd MUST be set
   to no more than the loss window (1 full-sized segment), regardless
   of previous cwnd reductions in a given window of data.

3.5.  Handling of SYN, SYN-ACK, RST Packets

   If SYN, SYN-ACK and RST packets for DCTCP connections have the "ECN
   Capable Transport" (ECT) codepoint set in the IP header, they will
   receive the same treatment as other DCTCP packets when forwarded by
   a switching fabric under load.  Lack of ECT in these packets can
   result in a higher drop rate depending on the switching fabric
   configuration.  Hence for DCTCP connections, the sender SHOULD set
   ECT for SYN, SYN-ACK and RST packets.  A DCTCP receiver ignores CE
   codepoints set on any SYN, SYN-ACK, or RST packets.

4.  Implementation Issues

4.1.  Configuration of DCTCP

   An implementation needs to know when to use DCTCP.  Datacenter
   servers may need to communicate with endpoints outside the
   datacenter, where DCTCP is unsuitable or unsupported.  Thus, a
   global configuration setting to enable DCTCP will generally not
   suffice.  DCTCP provides no mechanism for negotiating its use.
   Thus, additional management and configuration functionality is
   needed to ensure that DCTCP is not used with non-DCTCP endpoints.

   Known solutions rely on either configuration or heuristics.  Either
   approach needs to allow endpoints to individually enable DCTCP, to
   ensure that a DCTCP sender is always paired with a DCTCP receiver.
   One approach is to enable DCTCP based on the IP address of the
   remote endpoint.  Another approach is to detect connections that
   transmit within the bounds of a datacenter.  For example, an
   implementation could support automatic selection of DCTCP if the
   estimated RTT is less than a threshold (like 10 msec) and ECN is
   successfully negotiated, under the assumption that if the RTT is
   low, then the two endpoints are likely in the same datacenter
   network (a sketch of such a heuristic appears at the end of this
   section).

   [RFC3168] forbids the ECN-marking of pure ACK packets, because of
   the inability of TCP to mitigate ACK-path congestion.  RFC 3168
   also forbids ECN-marking of retransmissions, window probes, and
   RSTs.  However, dropping all these control packets - rather than
   ECN-marking them - has considerable performance disadvantages.  It
   is RECOMMENDED that an implementation provide a configuration knob
   that will cause ECT to be set on such control packets, which can be
   used in environments where such concerns do not apply.  See
   [ECN-EXPERIMENTATION] for details.

   It is useful to implement DCTCP as additional actions on top of an
   existing congestion control algorithm like Reno [RFC5681].  The
   DCTCP implementation MAY also allow configuration of resetting the
   value of DCTCP.Alpha as part of processing any loss episodes.
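   The following C sketch illustrates the RTT-plus-ECN enabling
   heuristic described above.  It is a non-normative example; the
   names are hypothetical, and the 10 msec threshold is an
   illustrative default rather than a specified value.

      #include <stdbool.h>
      #include <stdint.h>

      /* Illustrative default matching the 10 msec example in the
       * text; in practice this would be an administrator-
       * configurable knob. */
      #define DCTCP_RTT_THRESHOLD_US (10 * 1000)

      /* Enable DCTCP for a connection only if ECN negotiation
       * succeeded and the smoothed RTT estimate suggests an intra-
       * datacenter path. */
      static bool dctcp_should_enable(uint32_t srtt_us,
                                      bool ecn_negotiated)
      {
          return ecn_negotiated && srtt_us < DCTCP_RTT_THRESHOLD_US;
      }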
4.2.  Computation of DCTCP.Alpha

   As noted in Section 3.3, the implementation will need to choose a
   suitable estimation gain.  [DCTCP10] provides a theoretical basis
   for selecting the gain.  However, it may be more practical to use
   experimentation to select a suitable gain for a particular network
   and workload.  A fixed estimation gain of 1/16 is used in some
   implementations.  (It should be noted that values of 0 or 1 for g
   result in problematic behavior; g=0 fixes DCTCP.Alpha to its
   initial value, and g=1 sets it to M without any smoothing.)

   The DCTCP.Alpha computation as per the formula in Section 3.3
   involves fractions.  A kernel implementation MAY scale the
   DCTCP.Alpha value so that it can be computed efficiently using
   integer shift operations.  For example, if the implementation
   chooses g as 1/16, multiplications of DCTCP.Alpha by g become
   right-shifts by 4.  A scaling implementation SHOULD ensure that
   DCTCP.Alpha is able to reach zero once it falls below the smallest
   shifted value (16 in the above example).  At the other extreme, a
   scaled update needs to ensure DCTCP.Alpha does not exceed the
   scaling factor, which would be equivalent to greater than 100%
   congestion.  So, DCTCP.Alpha MUST be clamped after an update.

   This results in the following computations replacing steps 5 and 6
   in Section 3.3, where SCF is the chosen scaling factor (65536 in
   the example) and SHF is the shift factor (4 in the example):

   1.  Compute the congestion level for the current observation window:

          ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked

   2.  Update the congestion estimate:

          if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0

          DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF)

          if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF
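   As a non-normative illustration, the scaled replacement for steps 5
   and 6 can be written in C as follows, assuming the example values
   SHF = 4 (g = 1/16) and SCF = 65536; the function name and types are
   hypothetical.

      #include <stdint.h>

      #define SHF 4         /* shift factor; g = 1/16 */
      #define SCF 65536u    /* scaling factor applied to DCTCP.Alpha */

      /* Scaled update of DCTCP.Alpha at the end of an observation
       * window.  alpha holds DCTCP.Alpha * SCF; the caller
       * guarantees that bytes_acked is nonzero, since data was
       * acknowledged during the window. */
      static void dctcp_update_alpha(uint32_t *alpha,
                                     uint64_t bytes_marked,
                                     uint64_t bytes_acked)
      {
          uint32_t scaled_m =
              (uint32_t)(SCF * bytes_marked / bytes_acked);

          /* Let DCTCP.Alpha decay all the way to zero once it falls
           * below the smallest shifted value. */
          if ((*alpha >> SHF) == 0)
              *alpha = 0;

          /* Unsigned wraparound yields the correct result because
           * the true value of the update is never negative. */
          *alpha += (scaled_m >> SHF) - (*alpha >> SHF);

          if (*alpha > SCF)   /* clamp: more than 100% congestion */
              *alpha = SCF;
      }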
5.  Deployment Issues

   DCTCP and conventional TCP congestion control do not coexist well
   in the same network.  In typical DCTCP deployments, the marking
   threshold in the switching fabric is set to a very low value to
   reduce queueing delay, and a relatively small amount of congestion
   will exceed the marking threshold.  During such periods of
   congestion, conventional TCP will suffer packet loss and quickly
   and drastically reduce cwnd.  DCTCP, on the other hand, will use
   the fraction of marked packets to reduce cwnd more gradually.
   Thus, the rate reduction in DCTCP will be much slower than that of
   conventional TCP, and DCTCP traffic will gain a larger share of the
   capacity compared to conventional TCP traffic traversing the same
   path.  If the traffic in the datacenter is a mix of conventional
   TCP and DCTCP, it is RECOMMENDED that DCTCP traffic be segregated
   from conventional TCP traffic.  [MORGANSTANLEY] describes a
   deployment that uses the IP Differentiated Services Code Point
   (DSCP) bits to segregate the network such that Active Queue
   Management (AQM) [RFC7567] is applied to DCTCP traffic, whereas TCP
   traffic is managed via drop-tail queueing.

   Deployments should take into account segregation of non-TCP traffic
   as well.  Today's commodity switches allow configuration of
   different marking/drop profiles for non-TCP and non-IP packets.
   Non-TCP and non-IP packets should be able to pass through such
   switches, unless they really run out of buffer space.

   Since DCTCP relies on congestion marking by the switches, DCTCP's
   potential can only be realized in datacenters where the entire
   network infrastructure supports ECN.  The switches may also support
   configuration of the congestion threshold used for marking.  The
   proposed parameterization can be configured with switches that
   implement Random Early Detection (RED) [RFC2309].  [DCTCP10]
   provides a theoretical basis for selecting the congestion
   threshold, but as with the estimation gain, it may be more
   practical to rely on experimentation or simply to use the default
   configuration of the device.  DCTCP will revert to loss-based
   congestion control when packet loss is experienced (e.g., when
   transiting a congested drop-tail link or a link with an AQM drop
   behavior).

   DCTCP requires changes on both the sender and the receiver, so both
   endpoints must support DCTCP.  Furthermore, DCTCP provides no
   mechanism for negotiating its use, so both endpoints must be
   configured through some out-of-band mechanism to use DCTCP.  A
   variant of DCTCP that can be deployed unilaterally and only
   requires standard ECN behavior has been described in
   [ODCTCP][BSDCAN], but it requires additional experimental
   evaluation.

6.  Known Issues

   DCTCP relies on the sender's ability to reconstruct the stream of
   CE codepoints received by the remote endpoint.  To accomplish this,
   DCTCP avoids using a single ACK packet to acknowledge segments
   received both with and without the CE codepoint set.  However, if
   one or more ACK packets are dropped, it is possible that a
   subsequent ACK will cumulatively acknowledge a mix of CE and non-CE
   segments.  This will, of course, result in a less accurate
   congestion estimate.  There are some potential considerations:

   o  Even with an inaccurate congestion estimate, DCTCP may still
      perform better than [RFC3168].

   o  If the estimation gain is small relative to the packet loss
      rate, the estimate may not be too inaccurate.

   o  If ACK packet loss mostly occurs under heavy congestion, most
      drops will occur during an unbroken string of CE packets, and
      the estimate will be unaffected.

   However, the effect of packet drops on DCTCP under real-world
   conditions has not been analyzed.

   DCTCP provides no mechanism for negotiating its use.  The effect of
   using DCTCP with a standard ECN endpoint has been analyzed in
   [ODCTCP][BSDCAN].  Furthermore, it is possible that other
   implementations may also modify [RFC3168] behavior without
   negotiation, causing further interoperability issues.

   Much like standard TCP, DCTCP is biased against flows with longer
   RTTs.  A method for improving the RTT fairness of DCTCP has been
   proposed in [ADCTCP], but it requires additional experimental
   evaluation.

7.  Security Considerations

   DCTCP enhances ECN and thus inherits the general security
   considerations discussed in [RFC3168], although additional
   mitigation options exist due to the limited intra-datacenter
   deployment of DCTCP.

   The processing changes introduced by DCTCP do not exacerbate the
   considerations in [RFC3168] or introduce new ones.  In particular,
   with either algorithm, the network infrastructure or the remote
   endpoint can falsely report congestion and thus cause the sender to
   reduce cwnd.  However, this is no worse than what can be achieved
   by simply dropping packets.

   [RFC3168] requires that a compliant TCP must not set ECT on SYN or
   SYN-ACK packets.  [RFC5562] proposes setting ECT on SYN-ACK
   packets, but maintains the restriction of no ECT on SYN packets.
   Both these RFCs prohibit ECT in SYN packets due to security
   concerns regarding malicious SYN packets with ECT set.  These RFCs,
   however, are intended for general Internet use, and do not directly
   apply to a controlled datacenter environment.  The security
   concerns addressed by both these RFCs might not apply in controlled
   environments like datacenters, and it might not be necessary to
   account for the presence of non-ECN servers.  Beyond the security
   considerations related to virtual servers, additional security can
   be imposed in the physical servers to intercept and drop traffic
   resembling an attack.

8.  IANA Considerations

   This document has no actions for IANA.
9.  Acknowledgements

   The DCTCP algorithm was originally proposed and analyzed in
   [DCTCP10] by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu
   Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and
   Murari Sridharan.

   We would like to thank Andrew Shewmaker for identifying the problem
   of clamping DCTCP.Alpha and proposing a solution for it.

   Lars Eggert has received funding from the European Union's Horizon
   2020 research and innovation program 2014-2018 under grant
   agreement No. 644866 ("SSICLOPS").  This document reflects only the
   authors' views and the European Commission is not responsible for
   any use that may be made of the information it contains.

10.  References

10.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              DOI 10.17487/RFC2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC3168-ERRATA3639]
              Scheffenegger, R., "RFC3168 Errata ID 3639", 2013.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September
              2009.

10.2.  Informative References

   [ADCTCP]   Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis
              of DCTCP: Stability, Convergence, and Fairness",
              DOI 10.1145/1993744.1993753, Proc. ACM SIGMETRICS Joint
              International Conference on Measurement and Modeling of
              Computer Systems (SIGMETRICS 11), June 2011.

   [BSDCAN]   Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and
              H. Tokuda, "Extensions to FreeBSD Datacenter TCP for
              Incremental Deployment Support", BSDCan 2015, June 2015.

   [DCTCP10]  Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J.,
              Patel, P., Prabhakar, B., Sengupta, S., and M.
              Sridharan, "Data Center TCP (DCTCP)",
              DOI 10.1145/1851182.1851192, Proc. ACM SIGCOMM 2010
              Conference (SIGCOMM 10), August 2010.

   [ECN-EXPERIMENTATION]
              Black, D., "Explicit Congestion Notification (ECN)
              Experimentation", 2017.

   [FREEBSD]  Kato, M. and H. Panchasara, "DCTCP (Data Center TCP)
              implementation", 2015.

   [LINUX]    Borkmann, D. and F. Westphal, "Linux DCTCP patch", 2014.

   [MAPREDUCE]
              Dean, J. and S. Ghemawat, "MapReduce: Simplified Data
              Processing on Large Clusters", Proc. 6th ACM/USENIX
              Symposium on Operating Systems Design and Implementation
              (OSDI 04), December 2004.

   [MORGANSTANLEY]
              Judd, G., "Attaining the Promise and Avoiding the
              Pitfalls of TCP in the Datacenter", Proc. 12th USENIX
              Symposium on Networked Systems Design and Implementation
              (NSDI 15), May 2015.

   [ODCTCP]   Kato, M., "Improving Transmission Performance with One-
              Sided Datacenter TCP", M.S. Thesis, Keio University,
              2014.
   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B.,
              Deering, S., Estrin, D., Floyd, S., Jacobson, V.,
              Minshall, G., Partridge, C., Peterson, L., Ramakrishnan,
              K., Shenker, S., Wroclawski, J., and L. Zhang,
              "Recommendations on Queue Management and Congestion
              Avoidance in the Internet", RFC 2309,
              DOI 10.17487/RFC2309, April 1998.

   [RFC7567]  Baker, F., Ed. and G. Fairhurst, Ed., "IETF
              Recommendations Regarding Active Queue Management",
              BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015.

   [WINDOWS]  Microsoft, "Windows DCTCP reference", 2012.

Authors' Addresses

   Stephen Bensley
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 425 703 5570
   Email: sbens@microsoft.com

   Dave Thaler
   Microsoft

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com

   Praveen Balasubramanian
   Microsoft

   Phone: +1 425 538 2782
   Email: pravb@microsoft.com

   Lars Eggert
   NetApp
   Sonnenallee 1
   Kirchheim  85551
   Germany

   Phone: +49 151 120 55791
   Email: lars@netapp.com
   URI:   http://eggert.org/

   Glenn Judd
   Morgan Stanley

   Phone: +1 973 979 6481
   Email: glenn.judd@morganstanley.com