2 Network Working Group S. Bensley
3 Internet-Draft D. Thaler
4 Intended status: Informational P. Balasubramanian
5 Expires: December 29, 2017 Microsoft
6 L. Eggert
7 NetApp
8 G. Judd
9 Morgan Stanley
10 June 27, 2017

12 Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
13 draft-ietf-tcpm-dctcp-08

15 Abstract

17 This informational memo describes Datacenter TCP (DCTCP), a TCP
18 congestion control scheme for datacenter traffic. DCTCP extends the
19 Explicit Congestion Notification (ECN) processing to estimate the
20 fraction of bytes that encounter congestion, rather than simply
21 detecting that some congestion has occurred. DCTCP then scales the
22 TCP congestion window based on this estimate. This method achieves
23 high burst tolerance, low latency, and high throughput with shallow-
24 buffered switches. This memo also discusses deployment issues
25 related to the coexistence of DCTCP and conventional TCP, the lack of
26 a negotiating mechanism between sender and receiver, and presents
27 some possible mitigations. This memo documents DCTCP as currently
28 implemented by several major operating systems. DCTCP as described
29 in this draft is applicable to deployments in controlled environments
30 like datacenters but it must not be deployed over the public Internet
31 without additional measures.

33 Status of This Memo

35 This Internet-Draft is submitted in full conformance with the
36 provisions of BCP 78 and BCP 79.

38 Internet-Drafts are working documents of the Internet Engineering
39 Task Force (IETF). Note that other groups may also distribute
40 working documents as Internet-Drafts. The list of current Internet-
41 Drafts is at http://datatracker.ietf.org/drafts/current/.

43 Internet-Drafts are draft documents valid for a maximum of six months
44 and may be updated, replaced, or obsoleted by other documents at any
45 time. It is inappropriate to use Internet-Drafts as reference
46 material or to cite them other than as "work in progress."

48 This Internet-Draft will expire on December 29, 2017.
50 Copyright Notice 52 Copyright (c) 2017 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 68 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 4 70 3.1. Marking Congestion on the L3 Switches and Routers . . . . 4 71 3.2. Echoing Congestion Information on the Receiver . . . . . 5 72 3.3. Processing Echoed Congestion Indications on the Sender . 6 73 3.4. Handling of packet loss . . . . . . . . . . . . . . . . . 8 74 3.5. Handling of SYN, SYN-ACK, RST Packets . . . . . . . . . . 8 75 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 8 76 4.1. Configuration of DCTCP . . . . . . . . . . . . . . . . . 8 77 4.2. Computation of DCTCP.Alpha . . . . . . . . . . . . . . . 9 78 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 10 79 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 11 80 7. Implementation Status . . . . . . . . . . . . . . . . . . . . 11 81 8. Security Considerations . . . . . . . . . . . . . . . . . . . 12 82 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 83 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 84 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 85 11.1. Normative References . . . . . . . . . . . . . . . . . . 13 86 11.2. Informative References . . . . . . . . . . . . . . . . . 14 87 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 89 1. Introduction 91 Large datacenters necessarily need many network switches to 92 interconnect their many servers. Therefore, a datacenter can greatly 93 reduce its capital expenditure by leveraging low-cost switches. 94 However, such low-cost switches tend to have limited queue capacities 95 and are thus more susceptible to packet loss due to congestion. 97 Network traffic in a datacenter is often a mix of short and long 98 flows, where the short flows require low latencies and the long flows 99 require high throughputs. Datacenters also experience incast bursts, 100 where many servers send traffic to a single server at the same time. 101 For example, this traffic pattern is a natural consequence of 102 MapReduce [MAPREDUCE] workload: The worker nodes complete at 103 approximately the same time, and all reply to the master node 104 concurrently. 106 These factors place some conflicting demands on the queue occupancy 107 of a switch: 109 o The queue must be short enough that it does not impose excessive 110 latency on short flows. 112 o The queue must be long enough to buffer sufficient data for the 113 long flows to saturate the path capacity. 115 o The queue must be long enough to absorb incast bursts without 116 excessive packet loss. 118 Standard TCP congestion control [RFC5681] relies on packet loss to 119 detect congestion. 
This does not meet the demands described above.
120 First, short flows will start to experience unacceptable latencies
121 before packet loss occurs. Second, by the time TCP congestion
122 control kicks in on the senders, most of the incast burst has already
123 been dropped.

125 [RFC3168] describes a mechanism for using Explicit Congestion
126 Notification (ECN) from the switches for detection of congestion.
127 However, this method only detects the presence of congestion, not its
128 extent. In the presence of mild congestion, the TCP congestion
129 window is reduced too aggressively and this unnecessarily reduces the
130 throughput of long flows.

132 Datacenter TCP (DCTCP) improves traditional ECN processing by
133 estimating the fraction of bytes that encounter congestion, rather
134 than simply detecting that some congestion has occurred. DCTCP then
135 scales the TCP congestion window based on this estimate. This method
136 achieves high burst tolerance, low latency, and high throughput with
137 shallow-buffered switches. DCTCP is a modification to the processing
138 of ECN by a conventional TCP and requires that standard TCP
139 congestion control be used for handling packet loss.

141 DCTCP should only be deployed in an intra-datacenter environment
142 where both endpoints and the switching fabric are under a single
143 administrative domain. DCTCP MUST NOT be deployed over the public
144 Internet without additional measures, as detailed in Section 5.

146 The objective of this Informational RFC is to document DCTCP as an
147 alternative TCP congestion control algorithm [RFC5033] that is known
148 to be widely implemented and deployed. There is consensus in the IETF
149 TCPM working group that a DCTCP standard would require further work.
150 A precise documentation of running code enables follow-up IETF
151 Experimental or Standards Track RFCs.

153 2. Terminology

155 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
156 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
157 document are to be interpreted as described in [RFC2119].

159 Normative language is used to describe how necessary the various
160 aspects of a DCTCP implementation are for interoperability, but even
161 compliant implementations without the measures in sections 4-6 would
162 still only be safe to deploy in controlled environments, i.e., not
163 over the public Internet.

165 3. DCTCP Algorithm

167 There are three components involved in the DCTCP algorithm:

169 o The switches (or other intermediate devices in the network) detect
170 congestion and set the Congestion Encountered (CE) codepoint in
171 the IP header.

173 o The receiver echoes the congestion information back to the sender,
174 using the ECN-Echo (ECE) flag in the TCP header.

176 o The sender computes a congestion estimate and reacts by reducing
177 the TCP congestion window (cwnd) accordingly.

179 3.1. Marking Congestion on the L3 Switches and Routers

181 The level-3 (L3) switches and routers in a datacenter fabric indicate
182 congestion to the end nodes by setting the CE codepoint in the IP
183 header as specified in Section 5 of [RFC3168]. For example, the
184 switches may be configured with a congestion threshold. When a
185 packet arrives at a switch and its queue length is greater than the
186 congestion threshold, the switch sets the CE codepoint in the packet.
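As a purely illustrative sketch (not part of any specification), the per-packet marking rule described above might look as follows in C; the packet structure and the names queue_bytes and k_bytes are hypothetical, and real switches typically implement this logic in hardware or as part of a RED/AQM profile:

   #include <stdint.h>

   struct pkt {
       uint8_t ecn;             /* two-bit ECN field from the IP header */
   };

   #define ECN_NOT_ECT 0x0      /* not ECN-Capable Transport */
   #define ECN_CE      0x3      /* Congestion Encountered */

   /* Hypothetical per-packet enqueue hook: mark instead of dropping
    * once the instantaneous queue exceeds the configured threshold. */
   static void mark_on_enqueue(struct pkt *p, uint32_t queue_bytes,
                               uint32_t k_bytes)
   {
       if (p->ecn == ECN_NOT_ECT)
           return;              /* non-ECT traffic falls back to the drop policy */
       if (queue_bytes > k_bytes)
           p->ecn = ECN_CE;
   }

The interesting question is how to choose the threshold itself, which is discussed next.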
187 For example, Section 3.4 of [DCTCP10] suggests threshold marking with 188 a threshold K > (RTT * C)/7, where C is the link rate in packets per 189 second. In typical deployments the marking threshold is set to be a 190 small value to maintain a short average queueing delay. However, the 191 actual algorithm for marking congestion is an implementation detail 192 of the switch and will generally not be known to the sender and 193 receiver. Therefore, sender and receiver should not assume that a 194 particular marking algorithm is implemented by the switching fabric. 196 3.2. Echoing Congestion Information on the Receiver 198 According to Section 6.1.3 of [RFC3168], the receiver sets the ECE 199 flag if any of the packets being acknowledged had the CE code point 200 set. The receiver then continues to set the ECE flag until it 201 receives a packet with the Congestion Window Reduced (CWR) flag set. 202 However, the DCTCP algorithm requires more detailed congestion 203 information. In particular, the sender must be able to determine the 204 number of bytes sent that encountered congestion. Thus, the scheme 205 described in [RFC3168] does not suffice. 207 One possible solution is to ACK every packet and set the ECE flag in 208 the ACK if and only if the CE code point was set in the packet being 209 acknowledged. However, this prevents the use of delayed ACKs, which 210 are an important performance optimization in datacenters. If the 211 delayed ACK frequency is m, then an ACK is generated every m packets. 212 The typical value of m is 2 but it could be affected by ACK 213 throttling or packet coalescing techniques designed to improve 214 performance. 216 Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP 217 Congestion Encountered" (DCTCP.CE), which is initialized to false and 218 stored in the Transmission Control Block (TCB). When sending an ACK, 219 the ECE flag MUST be set if and only if DCTCP.CE is true. When 220 receiving packets, the CE codepoint MUST be processed as follows: 222 1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to 223 true and send an immediate ACK. 225 2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE 226 to false and send an immediate ACK. 228 3. Otherwise, ignore the CE codepoint. 230 Since the immediate ACK reflects the new DCTCP.CE state, it may 231 acknowledge any previously unacknowledged packets in the old state. 232 This can lead to an incorrect DCTCP.Alpha value computation at the 233 sender per Section 3.3. To avoid this, an implementation may choose 234 to send two ACKs, one for previously unacknowledged packets and 235 another acknowledging the most recently received packet. 237 Receiver handling of the "Congestion Window Reduced" (CWR) bit is 238 also per [RFC3168] including [RFC3168-ERRATA3639]. That is, on 239 receipt of a segment with both the CE and CWR bits set, CWR is 240 processed first and then CE is processed. 242 Send immediate 243 ACK with ECE=0 244 .----. .-------------. .---. 245 Send 1 ACK / v v | | \ 246 for every | .------. .------. | Send 1 ACK 247 m packets | | CE=0 | | CE=1 | | for every 248 with ECE=0 | '------' '------' | m packets 249 \ | | ^ ^ / with ECE=1 250 '---' '------------' '----' 251 Send immediate 252 ACK with ECE=1 254 Figure 1: ACK generation state machine. DCTCP.CE abbreviated as CE. 256 3.3. Processing Echoed Congestion Indications on the Sender 258 The sender estimates the fraction of bytes sent that encountered 259 congestion. 
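Before describing that estimate in detail, the receiver-side ACK generation rules of Section 3.2 and Figure 1 can be summarized in a short sketch; this is illustrative only, the structure fields, the delayed-ACK factor m, and send_ack() are hypothetical names, and the optional "two ACKs" refinement mentioned above is omitted for brevity:

   #include <stdbool.h>
   #include <stdint.h>

   struct dctcp_rcv {
       bool     ce;             /* DCTCP.CE, initialized to false */
       uint32_t unacked;        /* segments received since the last ACK */
       uint32_t m;              /* delayed-ACK frequency, typically 2 */
   };

   extern void send_ack(bool ece);   /* hypothetical: emit an ACK, ECE as given */

   static void dctcp_rcv_segment(struct dctcp_rcv *r, bool ce_marked)
   {
       if (ce_marked != r->ce) {
           r->ce = ce_marked;
           send_ack(r->ce);     /* immediate ACK reflecting the new state */
           r->unacked = 0;
           return;
       }
       if (++r->unacked >= r->m) {
           send_ack(r->ce);     /* delayed ACK; ECE mirrors DCTCP.CE */
           r->unacked = 0;
       }
   }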
The current estimate is stored in a new TCP state
260 variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be
261 updated as follows:

263 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

265 where

267 o g is the estimation gain, a real number between 0 and 1. The
268 selection of g is left to the implementation. See Section 4 for
269 further considerations.

271 o M is the fraction of bytes sent that encountered congestion during
272 the previous observation window, where the observation window is
273 chosen to be approximately the Round Trip Time (RTT). In
274 particular, an observation window ends when all bytes in flight at
275 the beginning of the window have been acknowledged.

277 In order to update DCTCP.Alpha, the TCP state variables defined in
278 [RFC0793] are used, and three additional TCP state variables are
279 introduced:

281 o DCTCP.WindowEnd: The TCP sequence number threshold for beginning a
282 new observation window; initialized to SND.UNA.

284 o DCTCP.BytesAcked: The number of sent bytes acknowledged during the
285 current observation window; initialized to zero.

287 o DCTCP.BytesMarked: The number of bytes sent during the current
288 observation window that encountered congestion; initialized to
289 zero.

291 The congestion estimator on the sender SHOULD process acceptable ACKs
292 as follows:

294 1. Compute the bytes acknowledged (TCP SACK options [RFC2018] are
295 ignored for this computation):

297 BytesAcked = SEG.ACK - SND.UNA

299 2. Update the bytes sent:

301 DCTCP.BytesAcked += BytesAcked

303 3. If the ECE flag is set, update the bytes marked:

305 DCTCP.BytesMarked += BytesAcked

307 4. If the acknowledgment number is less than or equal to
308 DCTCP.WindowEnd, stop processing. Otherwise, the end of the
309 observation window has been reached, so proceed to update the
310 congestion estimate as follows:

312 5. Compute the congestion level for the current observation window:

314 M = DCTCP.BytesMarked / DCTCP.BytesAcked

316 6. Update the congestion estimate:

318 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

320 7. Determine the end of the next observation window:

322 DCTCP.WindowEnd = SND.NXT

324 8. Reset the byte counters:

326 DCTCP.BytesAcked = DCTCP.BytesMarked = 0

328 9. Rather than always halving the congestion window as described in
329 [RFC3168], the sender SHOULD update cwnd as follows:

331 cwnd = cwnd * (1 - DCTCP.Alpha / 2)

333 Thus, when no bytes sent experienced congestion, DCTCP.Alpha equals
334 zero, and cwnd is left unchanged. When all sent bytes experienced
335 congestion, DCTCP.Alpha equals one, and cwnd is reduced by half.
336 Lower levels of congestion will result in correspondingly smaller
337 reductions to cwnd.

339 Just as specified in [RFC3168], DCTCP does not react to congestion
340 indications more than once for every window of data. The setting of
341 the "Congestion Window Reduced" (CWR) bit is also as per [RFC3168].
342 This is required for interop with classic ECN receivers due to
343 potential misconfigurations.

345 3.4. Handling of packet loss

347 A DCTCP sender MUST react to loss episodes in the same way as
348 conventional TCP. For cases where the packet loss is inferred and
349 not explicitly signaled by ECN, the cwnd and other state variables
350 like ssthresh must be changed in the same way that a conventional TCP
351 would have changed them. As with ECN, a DCTCP sender will only reduce
352 the cwnd once per window of data across all loss signals.
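Putting Sections 3.3 and 3.4 together, the per-ACK processing on the sender might be sketched as follows; this is illustrative only, uses floating-point arithmetic for clarity rather than the scaled integer form of Section 4.2, and all structure and parameter names are hypothetical:

   #include <stdint.h>

   struct dctcp_snd {
       double   alpha;          /* DCTCP.Alpha, initialized to 1.0 */
       double   g;              /* estimation gain, e.g. 1.0 / 16 */
       uint32_t window_end;     /* DCTCP.WindowEnd, initialized to SND.UNA */
       uint64_t bytes_acked;    /* DCTCP.BytesAcked */
       uint64_t bytes_marked;   /* DCTCP.BytesMarked */
   };

   /* Called for each acceptable ACK.  snd_una is SND.UNA before the ACK
    * is applied, snd_nxt is SND.NXT, and ece is the ACK's ECE flag.
    * Loss-triggered reductions are handled by the conventional TCP code
    * elsewhere, as required by Section 3.4. */
   static void dctcp_on_ack(struct dctcp_snd *s, uint32_t seg_ack,
                            uint32_t snd_una, uint32_t snd_nxt,
                            int ece, uint32_t *cwnd)
   {
       uint32_t acked = seg_ack - snd_una;               /* step 1 */

       s->bytes_acked += acked;                          /* step 2 */
       if (ece)
           s->bytes_marked += acked;                     /* step 3 */

       if ((int32_t)(seg_ack - s->window_end) <= 0)
           return;                                       /* step 4 */

       /* End of the observation window: steps 5-8. */
       double m = (double)s->bytes_marked / (double)s->bytes_acked;
       s->alpha = s->alpha * (1.0 - s->g) + s->g * m;
       s->window_end = snd_nxt;
       s->bytes_acked = s->bytes_marked = 0;

       /* Step 9: scale cwnd by the estimated extent of congestion
        * (at most once per window of data) instead of halving it. */
       *cwnd = (uint32_t)(*cwnd * (1.0 - s->alpha / 2.0));
   }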
Just as
353 specified in [RFC5681], upon a timeout, the cwnd MUST be set to no
354 more than the loss window (1 full-sized segment), regardless of
355 previous cwnd reductions in a given window of data.

357 3.5. Handling of SYN, SYN-ACK, RST Packets

359 If SYN, SYN-ACK and RST packets for DCTCP connections have the "ECN
360 Capable Transport" (ECT) codepoint set in the IP header, they will
361 receive the same treatment as other DCTCP packets when forwarded by a
362 switching fabric under load. Lack of ECT in these packets may result
363 in a higher drop rate depending on the switching fabric
364 configuration. Hence for DCTCP connections, the sender SHOULD set
365 ECT for SYN, SYN-ACK and RST packets. A DCTCP receiver ignores CE
366 codepoints set on any SYN, SYN-ACK, or RST packets.

368 4. Implementation Issues

370 4.1. Configuration of DCTCP

372 An implementation should decide when to use DCTCP. Datacenter
373 servers may need to communicate with endpoints outside the
374 datacenter, where DCTCP is unsuitable or unsupported. Thus, a global
375 configuration setting to enable DCTCP will generally not suffice.
376 DCTCP provides no mechanism for negotiating its use. Thus, there is
377 additional management and configuration overhead required to ensure
378 that DCTCP is not used with non-DCTCP endpoints.

380 Potential solutions rely on either configuration or heuristics.
381 Heuristics need to allow endpoints to individually enable DCTCP, to
382 ensure a DCTCP sender is always paired with a DCTCP receiver. One
383 approach is to enable DCTCP based on the IP address of the remote
384 endpoint. Another approach is to detect connections that transmit
385 within the bounds of a datacenter. For example, an implementation could
386 support automatic selection of DCTCP if the estimated RTT is less
387 than a threshold (like 10 msec) and ECN is successfully negotiated,
388 under the assumption that if the RTT is low, then the two endpoints
389 are likely in the same datacenter network.

391 [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
392 inability of TCP to mitigate ACK-path congestion. RFC 3168 also
393 forbids ECN-marking of retransmissions, window probes and RSTs.
394 However, dropping all these control packets - rather than ECN marking
395 them - has considerable performance disadvantages. It is RECOMMENDED
396 that an implementation provide a configuration knob that will cause
397 ECT to be set on such control packets, which can be used in
398 environments where such concerns do not apply. See
399 [ECN-EXPERIMENTATION] for details.

401 It is useful to implement DCTCP as additional actions on top of an
402 existing congestion control algorithm like Reno [RFC5681]. The DCTCP
403 implementation MAY also allow configuration of resetting the value of
404 DCTCP.Alpha as part of processing any loss episodes.

406 4.2. Computation of DCTCP.Alpha

408 As noted in Section 3.3, the implementation will need to choose a
409 suitable estimation gain. [DCTCP10] provides a theoretical basis for
410 selecting the gain. However, it may be more practical to use
411 experimentation to select a suitable gain for a particular network
412 and workload. A fixed estimation gain of 1/16 is used in some
413 implementations.

415 The DCTCP.Alpha computation as per the formula in Section 3.3
416 involves fractions. An efficient kernel implementation MAY scale the
417 DCTCP.Alpha value for efficient computation using shift operations.
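One possible shape of such a scaled update is sketched below, using the gain g = 1/16 and the scaling conventions of the worked example that follows (SHF = 4, SCF = 65536); all names are illustrative, and the decay-to-zero and clamping behavior match the guidance below:

   #include <stdint.h>

   #define DCTCP_SCF 65536u    /* scaling factor: Alpha is kept in [0, SCF] */
   #define DCTCP_SHF 4u        /* shift factor: g = 1/16, so "* g" is ">> 4" */

   static uint32_t dctcp_update_alpha(uint32_t alpha, uint64_t bytes_marked,
                                      uint64_t bytes_acked)
   {
       uint32_t scaled_m;

       if (bytes_acked == 0)         /* defensive; not expected at window end */
           return alpha;

       scaled_m = (uint32_t)(DCTCP_SCF * bytes_marked / bytes_acked);

       /* Let Alpha decay all the way to zero once it becomes small. */
       if ((alpha >> DCTCP_SHF) == 0)
           alpha = 0;

       /* Alpha = Alpha * (1 - g) + g * M, in scaled integer arithmetic. */
       alpha = alpha - (alpha >> DCTCP_SHF) + (scaled_m >> DCTCP_SHF);

       /* Clamp: a value above SCF would mean more than 100% congestion. */
       if (alpha > DCTCP_SCF)
           alpha = DCTCP_SCF;

       return alpha;
   }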
418 For example, if the implementation chooses g as 1/16, multiplications 419 of DCTCP.Alpha by g become right-shifts by 4. A scaling 420 implementation SHOULD ensure that DCTCP.Alpha is able to reach zero 421 once it falls below the smallest shifted value (16 in the above 422 example). At the other extreme, a scaled update must ensure 423 DCTCP.Alpha does not exceed the scaling factor, which would be 424 equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be 425 clamped after an update. 427 This results in the following computations replacing steps 5 and 6 in 428 Section 3.3, where SCF is the chosen scaling factor (65536 in the 429 example) and SHF is the shift factor (4 in the example): 431 1. Compute the congestion level for the current observation window: 433 ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked 435 2. Update the congestion estimate: 437 if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 439 DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) 441 if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF 443 5. Deployment Issues 445 DCTCP and conventional TCP congestion control do not coexist well in 446 the same network. In typical DCTCP deployments, the marking 447 threshold in the switching fabric is set to a very low value to 448 reduce queueing delay, and a relatively small amount of congestion 449 will exceed the marking threshold. During such periods of 450 congestion, conventional TCP will suffer packet loss and quickly and 451 drastically reduce cwnd. DCTCP, on the other hand, will use the 452 fraction of marked packets to reduce cwnd more gradually. Thus, the 453 rate reduction in DCTCP will be much slower than that of conventional 454 TCP, and DCTCP traffic will gain a larger share of the capacity 455 compared to conventional TCP traffic traversing the same path. If 456 the traffic in the datacenter is a mix of conventional TCP and DCTCP, 457 it is RECOMMENDED that DCTCP traffic be segregated from conventional 458 TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP 459 Differentiated Services Code Point (DSCP) bits to segregate the 460 network such that Active Queue Management (AQM) is applied to DCTCP 461 traffic, whereas TCP traffic is managed via drop-tail queueing. 463 Deployments should take into account segregation of non-TCP traffic 464 as well. Today's commodity switches allow configuration of different 465 marking/drop profiles for non-TCP and non-IP packets. Non-TCP and 466 non-IP packets should be able to pass through such switches, unless 467 they really run out of buffer space. 469 Since DCTCP relies on congestion marking by the switches, DCTCP's 470 potential can only be realized in datacenters where the entire 471 network infrastructure supports ECN. The switches may also support 472 configuration of the congestion threshold used for marking. The 473 proposed parameterization can be configured with switches that 474 implement Random Early Detection (RED). [DCTCP10] provides a 475 theoretical basis for selecting the congestion threshold, but as with 476 the estimation gain, it may be more practical to rely on 477 experimentation or simply to use the default configuration of the 478 device. DCTCP will revert to loss-based congestion control when 479 packet loss is experienced (e.g. when transiting a congested drop- 480 tail link, or a link with an AQM drop behavior). 482 DCTCP requires changes on both the sender and the receiver, so both 483 endpoints must support DCTCP. 
Furthermore, DCTCP provides no
484 mechanism for negotiating its use, so both endpoints must be
485 configured through some out-of-band mechanism to use DCTCP. A
486 variant of DCTCP that can be deployed unilaterally and only requires
487 standard ECN behavior has been described in [ODCTCP][BSDCAN], but
488 requires additional experimental evaluation.

490 6. Known Issues

492 DCTCP relies on the sender's ability to reconstruct the stream of CE
493 codepoints received by the remote endpoint. To accomplish this,
494 DCTCP avoids using a single ACK packet to acknowledge segments
495 received both with and without the CE codepoint set. However, if one
496 or more ACK packets are dropped, it is possible that a subsequent ACK
497 will cumulatively acknowledge a mix of CE and non-CE segments. This
498 will, of course, result in a less accurate congestion estimate.
499 There are some potential considerations:

501 o Even with an inaccurate congestion estimate, DCTCP may still
502 perform better than [RFC3168].

504 o If the estimation gain is small relative to the packet loss rate,
505 the estimate may not be too inaccurate.

507 o If ACK packet loss mostly occurs under heavy congestion, most
508 drops will occur during an unbroken string of CE packets, and the
509 estimate will be unaffected.

511 However, the effect of packet drops on DCTCP under real-world
512 conditions has not been analyzed.

514 DCTCP provides no mechanism for negotiating its use. The effect of
515 using DCTCP with a standard ECN endpoint has been analyzed in
516 [ODCTCP][BSDCAN]. Furthermore, it is possible that other
517 implementations may also modify [RFC3168] behavior without
518 negotiation, causing further interoperability issues.

520 Much like standard TCP, DCTCP is biased against flows with longer
521 RTTs. A method for improving the RTT fairness of DCTCP has been
522 proposed in [ADCTCP], but requires additional experimental
523 evaluation.

525 7. Implementation Status

527 This section documents the implementation status of the specification
528 in this document, as recommended by [RFC7942].

530 This document describes DCTCP as implemented in Microsoft Windows
531 Server 2012 [WINDOWS]. The Linux [LINUX] and FreeBSD [FREEBSD]
532 operating systems have also implemented support for DCTCP in a way
533 that is believed to follow this document. Deployment experiences
534 with DCTCP have been documented in [MORGANSTANLEY].

536 8. Security Considerations

538 DCTCP enhances ECN and thus inherits the general security
539 considerations discussed in [RFC3168], although additional mitigation
540 options exist due to the limited intra-datacenter deployment of
541 DCTCP.

543 The processing changes introduced by DCTCP do not exacerbate the
544 considerations in [RFC3168] or introduce new ones. In particular,
545 with either algorithm, the network infrastructure or the remote
546 endpoint can falsely report congestion and thus cause the sender to
547 reduce cwnd. However, this is no worse than what can be achieved by
548 simply dropping packets.

550 [RFC3168] requires that a compliant TCP must not set ECT on SYN or
551 SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets,
552 but maintains the restriction of no ECT on SYN packets. Both these
553 RFCs prohibit ECT in SYN packets due to security concerns regarding
554 malicious SYN packets with ECT set. These RFCs, however, are
555 intended for general Internet use, and do not directly apply to a
556 controlled datacenter environment.
The security concerns addressed 557 by both these RFCs might not apply in controlled environments like 558 datacenters, and it might not be necessary to account for the 559 presence of non-ECN servers. Since most servers run virtualized in 560 datacenters, additional security can be imposed in the physical 561 servers to intercept and drop traffic resembling an attack. 563 9. IANA Considerations 565 This document has no actions for IANA. 567 10. Acknowledgements 569 The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] 570 by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, 571 Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari 572 Sridharan. 574 We would like to thank Andrew Shewmaker for identifying the problem 575 of clamping DCTCP.Alpha and proposing a solution for it. 577 Lars Eggert has received funding from the European Union's Horizon 578 2020 research and innovation program 2014-2018 under grant agreement 579 No. 644866 ("SSICLOPS"). This document reflects only the authors' 580 views and the European Commission is not responsible for any use that 581 may be made of the information it contains. 583 11. References 585 11.1. Normative References 587 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 588 RFC 793, DOI 10.17487/RFC0793, September 1981, 589 . 591 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 592 Selective Acknowledgment Options", RFC 2018, 593 DOI 10.17487/RFC2018, October 1996, 594 . 596 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 597 Requirement Levels", BCP 14, RFC 2119, 598 DOI 10.17487/RFC2119, March 1997, 599 . 601 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 602 of Explicit Congestion Notification (ECN) to IP", 603 RFC 3168, DOI 10.17487/RFC3168, September 2001, 604 . 606 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 607 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 608 . 610 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 611 Ramakrishnan, "Adding Explicit Congestion Notification 612 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 613 DOI 10.17487/RFC5562, June 2009, 614 . 616 [RFC3168-ERRATA3639] 617 Scheffenegger, R., "RFC3168 Errata ID 3639", 2013, 618 . 621 11.2. Informative References 623 [RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running 624 Code: The Implementation Status Section", BCP 205, 625 RFC 7942, DOI 10.17487/RFC7942, July 2016, 626 . 628 [RFC5033] Floyd, S. and M. Allman, "Specifying New Congestion 629 Control Algorithms", BCP 133, RFC 5033, 630 DOI 10.17487/RFC5033, August 2007, 631 . 633 [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, 634 P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data 635 Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc. 636 ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010, 637 . 639 [ODCTCP] Kato, M., "Improving Transmission Performance with One- 640 Sided Datacenter TCP", M.S. Thesis, Keio University, 641 2014, . 643 [BSDCAN] Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and 644 H. Tokuda, "Extensions to FreeBSD Datacenter TCP for 645 Incremental Deployment Support", BSDCan 2015, June 2015, 646 . 648 [ADCTCP] Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis 649 of DCTCP: Stability, Convergence, and Fairness", 650 DOI 10.1145/1993744.1993753, Proc. ACM SIGMETRICS Joint 651 International Conference on Measurement and Modeling of 652 Computer Systems (SIGMETRICS 11), June 2011, 653 . 
655 [WINDOWS] Microsoft, "Windows DCTCP reference", 2012, 656 . 659 [LINUX] Borkmann, D. and F. Westphal, "Linux DCTCP patch", 2014, 660 . 664 [FREEBSD] Kato, M. and H. Panchasara, "DCTCP (Data Center TCP) 665 implementation", 2015, 666 . 669 [MORGANSTANLEY] 670 Judd, G., "Attaining the Promise and Avoiding the Pitfalls 671 of TCP in the Datacenter", Proc. 12th USENIX Symposium on 672 Networked Systems Design and Implementation (NSDI 15), May 673 2015, . 676 [ECN-EXPERIMENTATION] 677 Black, D., "Explicit Congestion Notification (ECN) 678 Experimentation", 2017, . 681 [MAPREDUCE] 682 Dean, J. and S. Ghemawat, "MapReduce: Simplified Data 683 Processing on Large Clusters", Proc. 6th ACM/USENIX 684 Symposium on Operating Systems Design and Implementation 685 (OSDI 04), December 2004, . 688 Authors' Addresses 690 Stephen Bensley 691 Microsoft 692 One Microsoft Way 693 Redmond, WA 98052 694 USA 696 Phone: +1 425 703 5570 697 Email: sbens@microsoft.com 699 Dave Thaler 700 Microsoft 702 Phone: +1 425 703 8835 703 Email: dthaler@microsoft.com 705 Praveen Balasubramanian 706 Microsoft 708 Phone: +1 425 538 2782 709 Email: pravb@microsoft.com 710 Lars Eggert 711 NetApp 712 Sonnenallee 1 713 Kirchheim 85551 714 Germany 716 Phone: +49 151 120 55791 717 Email: lars@netapp.com 718 URI: http://eggert.org/ 720 Glenn Judd 721 Morgan Stanley 723 Phone: +1 973 979 6481 724 Email: glenn.judd@morganstanley.com