Network Working Group                                         S. Bensley
Internet-Draft                                                 D. Thaler
Intended status: Informational                        P. Balasubramanian
Expires: November 11, 2017                                     Microsoft
                                                               L. Eggert
                                                                  NetApp
                                                                 G. Judd
                                                          Morgan Stanley
                                                            May 10, 2017

     Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
                        draft-ietf-tcpm-dctcp-06

Abstract

   This informational memo describes Datacenter TCP (DCTCP), a TCP
   congestion control scheme for datacenter traffic.
   DCTCP extends the Explicit Congestion Notification (ECN) processing
   to estimate the fraction of bytes that encounter congestion, rather
   than simply detecting that some congestion has occurred. DCTCP then
   scales the TCP congestion window based on this estimate. This method
   achieves high burst tolerance, low latency, and high throughput with
   shallow-buffered switches. This memo also discusses deployment
   issues related to the coexistence of DCTCP and conventional TCP,
   discusses the lack of a negotiating mechanism between sender and
   receiver, and presents some possible mitigations. DCTCP as described
   in this draft is applicable to deployments in controlled environments
   like datacenters, but it must not be deployed over the public
   Internet without additional measures, as detailed in Section 5.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 11, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document. Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. DCTCP Algorithm
     3.1. Marking Congestion on the L3 Switches and Routers
     3.2. Echoing Congestion Information on the Receiver
     3.3. Processing Echoed Congestion Indications on the Sender
     3.4. Handling of packet loss
     3.5. Handling of SYN, SYN-ACK, RST Packets
   4. Implementation Issues
     4.1. Configuration of DCTCP
     4.2. Computation of DCTCP.Alpha
   5. Deployment Issues
   6. Known Issues
   7. Implementation Status
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
     11.1. Normative References
     11.2. Informative References
   Authors' Addresses

1. Introduction

   Large datacenters necessarily need many network switches to
   interconnect their many servers.
   Therefore, a datacenter can greatly reduce its capital expenditure by
   leveraging low-cost switches. However, such low-cost switches tend to
   have limited queue capacities and are thus more susceptible to packet
   loss due to congestion.

   Network traffic in a datacenter is often a mix of short and long
   flows, where the short flows require low latencies and the long flows
   require high throughputs. Datacenters also experience incast bursts,
   where many servers send traffic to a single server at the same time.
   For example, this traffic pattern is a natural consequence of a
   MapReduce workload: the worker nodes complete at approximately the
   same time, and all reply to the master node concurrently.

   These factors place some conflicting demands on the queue occupancy
   of a switch:

   o  The queue must be short enough that it does not impose excessive
      latency on short flows.

   o  The queue must be long enough to buffer sufficient data for the
      long flows to saturate the path capacity.

   o  The queue must be long enough to absorb incast bursts without
      excessive packet loss.

   Standard TCP congestion control [RFC5681] relies on packet loss to
   detect congestion. This does not meet the demands described above.
   First, short flows will start to experience unacceptable latencies
   before packet loss occurs. Second, by the time TCP congestion
   control kicks in on the senders, most of the incast burst has already
   been dropped.

   [RFC3168] describes a mechanism for using Explicit Congestion
   Notification (ECN) from the switches for detection of congestion.
   However, this method only detects the presence of congestion, not its
   extent. In the presence of mild congestion, the TCP congestion
   window is reduced too aggressively and this unnecessarily reduces the
   throughput of long flows.
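
   To make the last point concrete, the following sketch models the
   window reduction of a conventional ECN sender in a deliberately
   simplified form. It is an illustration only: the function name
   classic_ecn_cwnd and the use of segments as the unit are assumptions
   made for this example, and the model ignores details of [RFC3168]
   and [RFC5681] such as the lower bound on ssthresh.

```python
def classic_ecn_cwnd(cwnd, marked_packets):
    """Simplified model of a conventional ECN sender (an assumption of
    this sketch, not a full RFC 3168 implementation): if any packet in
    the last window was CE-marked, cwnd is halved, no matter how many
    packets were marked."""
    if marked_packets < 0:
        raise ValueError("marked_packets must be non-negative")
    if marked_packets > 0:
        return cwnd / 2
    return cwnd


# With a window of 100 segments, a single mark costs as much as 100 marks.
mild = classic_ecn_cwnd(100.0, marked_packets=1)
severe = classic_ecn_cwnd(100.0, marked_packets=100)
unmarked = classic_ecn_cwnd(100.0, marked_packets=0)
```

   In this model, mild and severe congestion both leave the sender with
   half of its window, which is the loss of throughput described above.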

   Datacenter TCP (DCTCP) improves traditional ECN processing by
   estimating the fraction of bytes that encounter congestion, rather
   than simply detecting that some congestion has occurred. DCTCP then
   scales the TCP congestion window based on this estimate. This method
   achieves high burst tolerance, low latency, and high throughput with
   shallow-buffered switches. DCTCP is a modification to the processing
   of ECN by a conventional TCP and requires that standard TCP
   congestion control be used for handling packet loss.

   DCTCP should only be deployed in a datacenter environment where the
   endpoints and the switching fabric are under a single administrative
   domain. DCTCP MUST NOT be deployed over the public Internet without
   additional measures, as detailed in Section 5.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119]. Normative
   language is used to describe how necessary the various aspects of the
   Microsoft implementation are for interoperability, but even compliant
   implementations without the measures in Sections 4-6 would still only
   be safe to deploy in controlled environments.

3. DCTCP Algorithm

   There are three components involved in the DCTCP algorithm:

   o  The switches (or other intermediate devices in the network) detect
      congestion and set the Congestion Encountered (CE) codepoint in
      the IP header.

   o  The receiver echoes the congestion information back to the sender,
      using the ECN-Echo (ECE) flag in the TCP header.

   o  The sender computes a congestion estimate and reacts by reducing
      the TCP congestion window (cwnd) accordingly.

3.1. Marking Congestion on the L3 Switches and Routers

   The L3 switches and routers in a datacenter fabric indicate
   congestion to the end nodes by setting the CE codepoint in the IP
   header as specified in Section 5 of [RFC3168]. For example, the
   switches may be configured with a congestion threshold. When a
   packet arrives at a switch whose queue length is greater than the
   congestion threshold, the switch sets the CE codepoint in the packet.
   Section 3.4 of [DCTCP10], for instance, suggests threshold marking
   with a threshold K > (RTT * C)/7, where C is the link rate in packets
   per second (so that K is expressed in packets). However, the actual
   algorithm for marking congestion is an implementation detail of the
   switch and will generally not be known to the sender and receiver.
   Therefore, sender and receiver should not assume that a particular
   marking algorithm is implemented by the switching fabric.

3.2. Echoing Congestion Information on the Receiver

   According to Section 6.1.3 of [RFC3168], the receiver sets the ECE
   flag if any of the packets being acknowledged had the CE codepoint
   set. The receiver then continues to set the ECE flag until it
   receives a packet with the Congestion Window Reduced (CWR) flag set.
   However, the DCTCP algorithm requires more detailed congestion
   information. In particular, the sender must be able to determine the
   number of bytes sent that encountered congestion. Thus, the scheme
   described in [RFC3168] does not suffice.

   One possible solution is to ACK every packet and set the ECE flag in
   the ACK if and only if the CE codepoint was set in the packet being
   acknowledged. However, this prevents the use of delayed ACKs, which
   are an important performance optimization in datacenters. If the
   delayed ACK frequency is m, then an ACK is generated every m packets.
   The typical value of m is 2, but it could be affected by ACK
   throttling or packet coalescing techniques designed to improve
   performance.

   Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP
   Congestion Encountered" (DCTCP.CE), which is initialized to false and
   stored in the Transmission Control Block (TCB). When sending an ACK,
   the ECE flag MUST be set if and only if DCTCP.CE is true. When
   receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
       true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
       to false and send an immediate ACK.

   3.  Otherwise, ignore the CE codepoint.

   Since the immediate ACK reflects the new DCTCP.CE state, it may
   acknowledge any previously unacknowledged packets in the old state.
   This can lead to an incorrect DCTCP.Alpha value computation at the
   sender per Section 3.3. To avoid this, an implementation may choose
   to send two ACKs, one for previously unacknowledged packets and
   another acknowledging the most recently received packet.

   Receiver handling of the "Congestion Window Reduced" (CWR) bit is
   also per [RFC3168] including [RFC3168-ERRATA3639]. That is, on
   receipt of a segment with both the CE and CWR bits set, CWR is
   processed first and then CE is processed.

                          Send immediate
                          ACK with ECE=0
                .----.    .-------------.    .---.
   Send 1 ACK  /     v    v             |    |    \
    for every |     .------.           .------.    | Send 1 ACK
    m packets |     | CE=0 |           | CE=1 |    | for every
   with ECE=0 |     '------'           '------'    | m packets
               \     |    |             ^    ^    /  with ECE=1
                '----'    '-------------'    '---'
                          Send immediate
                          ACK with ECE=1

   Figure 1: ACK generation state machine. DCTCP.CE abbreviated as CE.

3.3. Processing Echoed Congestion Indications on the Sender

   The sender estimates the fraction of bytes sent that encountered
   congestion.
   The current estimate is stored in a new TCP state variable,
   DCTCP.Alpha, which is initialized to 1 and SHOULD be updated as
   follows:

      DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

   where

   o  g is the estimation gain, a real number between 0 and 1. The
      selection of g is left to the implementation. See Section 4.2 for
      further considerations.

   o  M is the fraction of bytes sent that encountered congestion during
      the previous observation window, where the observation window is
      chosen to be approximately the Round Trip Time (RTT). In
      particular, an observation window ends when all bytes in flight at
      the beginning of the window have been acknowledged.

   In order to update DCTCP.Alpha, the TCP state variables defined in
   [RFC0793] are used, and three additional TCP state variables are
   introduced:

   o  DCTCP.WindowEnd: The TCP sequence number threshold for beginning a
      new observation window; initialized to SND.UNA.

   o  DCTCP.BytesAcked: The number of sent bytes acknowledged during the
      current observation window; initialized to zero.

   o  DCTCP.BytesMarked: The number of bytes sent during the current
      observation window that encountered congestion; initialized to
      zero.

   The congestion estimator on the sender SHOULD process acceptable ACKs
   as follows:

   1.  Compute the bytes acknowledged (TCP SACK options [RFC2018] are
       ignored for this computation):

          BytesAcked = SEG.ACK - SND.UNA

   2.  Update the bytes acknowledged in the current observation window:

          DCTCP.BytesAcked += BytesAcked

   3.  If the ECE flag is set, update the bytes marked:

          DCTCP.BytesMarked += BytesAcked

   4.  If the acknowledgment number is less than or equal to
       DCTCP.WindowEnd, stop processing. Otherwise, the end of the
       observation window has been reached, so proceed to update the
       congestion estimate as follows:

   5.  Compute the congestion level for the current observation window:

          M = DCTCP.BytesMarked / DCTCP.BytesAcked

   6.  Update the congestion estimate:

          DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

   7.  Determine the end of the next observation window:

          DCTCP.WindowEnd = SND.NXT

   8.  Reset the byte counters:

          DCTCP.BytesAcked = DCTCP.BytesMarked = 0

   9.  Rather than always halving the congestion window as described in
       [RFC3168], the sender SHOULD update cwnd as follows:

          cwnd = cwnd * (1 - DCTCP.Alpha / 2)

   Thus, when no bytes sent experienced congestion, DCTCP.Alpha
   converges to zero, and cwnd is left essentially unchanged. When all
   sent bytes experienced congestion, DCTCP.Alpha converges to one, and
   cwnd is reduced by half. Lower levels of congestion will result in
   correspondingly smaller reductions to cwnd.

   Just as specified in [RFC3168], DCTCP does not react to congestion
   indications more than once for every window of data. The setting of
   the "Congestion Window Reduced" (CWR) bit is also as per [RFC3168].
   This is required for interoperability with classic ECN receivers due
   to potential misconfigurations.

3.4. Handling of packet loss

   A DCTCP sender MUST react to loss episodes in the same way as
   conventional TCP. For cases where the packet loss is inferred and
   not explicitly signaled by ECN, the cwnd and other state variables
   like ssthresh must be changed in the same way that a conventional TCP
   would have changed them. As with ECN, a DCTCP sender will only
   reduce the cwnd once per window of data across all loss signals.
   Just as specified in [RFC5681], upon a timeout, the cwnd MUST be set
   to no more than the loss window (1 full-sized segment), regardless of
   previous cwnd reductions in a given window of data.

3.5. Handling of SYN, SYN-ACK, RST Packets

   If SYN, SYN-ACK, and RST packets for DCTCP connections have ECT set
   in the IP header, they will receive the same treatment as other DCTCP
   packets when forwarded by a switching fabric under load.
   Lack of ECT in these packets may result in a higher drop rate
   depending on the switching fabric configuration. Hence, for DCTCP
   connections, the sender SHOULD set ECT for SYN, SYN-ACK, and RST
   packets. A DCTCP receiver ignores CE codepoints set on any SYN,
   SYN-ACK, or RST packets.

4. Implementation Issues

4.1. Configuration of DCTCP

   An implementation should decide when to use DCTCP. Datacenter
   servers may need to communicate with endpoints outside the
   datacenter, where DCTCP is unsuitable or unsupported. Thus, a global
   configuration setting to enable DCTCP will generally not suffice.
   DCTCP provides no mechanism for negotiating its use. Thus, there is
   additional management and configuration overhead required to ensure
   that DCTCP is not used with non-DCTCP endpoints.

   Potential solutions rely on either configuration or heuristics.
   Heuristics need to allow endpoints to individually enable DCTCP, to
   ensure a DCTCP sender is always paired with a DCTCP receiver. One
   approach is to enable DCTCP based on the IP address of the remote
   endpoint. Another approach is to detect connections that transmit
   within the bounds of a datacenter. For example, an implementation
   could support automatic selection of DCTCP if the estimated RTT is
   less than a threshold (such as 10 ms) and ECN is successfully
   negotiated, under the assumption that if the RTT is low, then the two
   endpoints are likely in the same datacenter network.

   [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
   inability of TCP to mitigate ACK-path congestion. [RFC3168] also
   forbids ECN-marking of retransmissions, window probes, and RSTs.
   However, dropping these control packets, rather than ECN-marking
   them, has considerable performance disadvantages. It is RECOMMENDED
It is RECOMMENDED 384 that an implementation provide a configuration knob that will cause 385 ECT to be set on such control packets, which can be used in 386 environments where such concerns do not apply. See 387 [ECN-EXPERIMENTATION] for details. 389 It is useful to implement DCTCP as additional actions on top of an 390 existing congestion control algorithm like NewReno. The DCTCP 391 implementation MAY also allow configuration of resetting the value of 392 DCTCP.Alpha as part of processing any loss episodes. 394 4.2. Computation of DCTCP.Alpha 396 As noted in Section 3.3, the implementation will need to choose a 397 suitable estimation gain. [DCTCP10] provides a theoretical basis for 398 selecting the gain. However, it may be more practical to use 399 experimentation to select a suitable gain for a particular network 400 and workload. A fixed estimation gain of 1/16 is used in some 401 implementations. 403 The DCTCP.Alpha computation as per the formula in Section 3.3 404 involves fractions. An efficient kernel implementation MAY scale the 405 DCTCP.Alpha value for efficient computation using shift operations. 406 For example, if the implementation chooses g as 1/16, multiplications 407 of DCTCP.Alpha by g become right-shifts by 4. A scaling 408 implementation SHOULD ensure that DCTCP.Alpha is able to reach zero 409 once it falls below the smallest shifted value (16 in the above 410 example). At the other extreme, a scaled update must ensure 411 DCTCP.Alpha does not exceed the scaling factor, which would be 412 equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be 413 clamped after an update. 415 This results in the following computations replacing steps 5 and 6 in 416 Section 3.3, where SCF is the chosen scaling factor (65536 in the 417 example) and SHF is the shift factor (4 in the example): 419 1. Compute the congestion level for the current observation window: 421 ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked 423 2. 
Update the congestion estimate: 425 if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 427 DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) 429 if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF 431 5. Deployment Issues 433 DCTCP and conventional TCP congestion control do not coexist well in 434 the same network. In typical DCTCP deployments, the marking 435 threshold in the switching fabric is set to a very low value to 436 reduce queueing delay, and a relatively small amount of congestion 437 will exceed the marking threshold. During such periods of 438 congestion, conventional TCP will suffer packet loss and quickly and 439 drastically reduce cwnd. DCTCP, on the other hand, will use the 440 fraction of marked packets to reduce cwnd more gradually. Thus, the 441 rate reduction in DCTCP will be much slower than that of conventional 442 TCP, and DCTCP traffic will gain a larger share of the capacity 443 compared to conventional TCP traffic traversing the same path. If 444 the traffic in the datacenter is a mix of conventional TCP and DCTCP, 445 it is RECOMMENDED that DCTCP traffic be segregated from conventional 446 TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP 447 DSCP bits to segregate the network such that AQM is applied to DCTCP 448 traffic, whereas TCP traffic is managed via drop-tail queueing. 450 Deployments should take into account segregation of non-TCP traffic 451 as well. Today's commodity switches allow configuration of different 452 marking/drop profiles for non-TCP and non-IP packets. Non-TCP and 453 non-IP packets should be able to pass through such switches, unless 454 they really run out of buffer space. 456 Since DCTCP relies on congestion marking by the switches, DCTCP's 457 potential can only be realized in datacenters where the entire 458 network infrastructure supports ECN. The switches may also support 459 configuration of the congestion threshold used for marking. 
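
   As a rough illustration of sizing such a threshold, the guideline
   quoted in Section 3.1, K > (RTT * C)/7, can be evaluated for a given
   link. The sketch below is not part of DCTCP; the function name
   min_marking_threshold, the 1500-byte packet size, and the example
   link parameters are assumptions chosen for illustration only.

```python
import math


def min_marking_threshold(rtt_seconds, link_bps, packet_bytes=1500):
    """Smallest integer K (in packets) satisfying K > (RTT * C) / 7,
    where C is the link rate converted to packets per second.
    The fixed packet size is an assumption of this sketch."""
    c_pps = link_bps / (8 * packet_bytes)  # link rate in packets/second
    bound = rtt_seconds * c_pps / 7        # right-hand side of the guideline
    return math.floor(bound) + 1           # strict inequality: K > bound


# Assumed example: 10 Gbit/s link, 100 microsecond RTT, 1500-byte packets.
k_10g = min_marking_threshold(100e-6, 10e9)

# Same RTT on an assumed 1 Gbit/s link.
k_1g = min_marking_threshold(100e-6, 1e9)
```

   Under these assumptions the bound works out to roughly 11.9 packets
   for the 10 Gbit/s link, so the smallest threshold that satisfies the
   guideline is 12 packets. As the text notes, experimentation may be a
   more practical way to choose the value.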
   The proposed parameterization can be configured with switches that
   implement Random Early Detection (RED). [DCTCP10] provides a
   theoretical basis for selecting the congestion threshold, but as with
   the estimation gain, it may be more practical to rely on
   experimentation or simply to use the default configuration of the
   device. DCTCP will revert to loss-based congestion control when
   packet loss is experienced (e.g., when transiting a congested
   drop-tail link, or a link with an AQM drop behavior).

   DCTCP requires changes on both the sender and the receiver, so both
   endpoints must support DCTCP. Furthermore, DCTCP provides no
   mechanism for negotiating its use, so both endpoints must be
   configured through some out-of-band mechanism to use DCTCP. A
   variant of DCTCP that can be deployed unilaterally and only requires
   standard ECN behavior has been described in [ODCTCP][BSDCAN], but
   requires additional experimental evaluation.

6. Known Issues

   DCTCP relies on the sender's ability to reconstruct the stream of CE
   codepoints received by the remote endpoint. To accomplish this,
   DCTCP avoids using a single ACK packet to acknowledge segments
   received both with and without the CE codepoint set. However, if one
   or more ACK packets are dropped, it is possible that a subsequent ACK
   will cumulatively acknowledge a mix of CE and non-CE segments. This
   will, of course, result in a less accurate congestion estimate.
   There are some potential considerations:

   o  Even with an inaccurate congestion estimate, DCTCP may still
      perform better than [RFC3168].

   o  If the estimation gain is small relative to the packet loss rate,
      the estimate may not be degraded much.

   o  If ACK packet loss mostly occurs under heavy congestion, most
      drops will occur during an unbroken string of CE packets, and the
      estimate will be unaffected.
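
   The inaccuracy described above can be illustrated with a toy trace.
   The following sketch is not taken from any implementation; the
   function names, the 1000-byte segments, the delayed ACK frequency of
   m = 2, in-order delivery without timers, and the treatment of the
   whole trace as a single observation window are all assumptions made
   for this example. The receiver follows the state machine of
   Section 3.2, using the optional two-ACK variant on a state change,
   and the sender applies steps 1, 2, 3, and 5 of Section 3.3.

```python
def receiver_acks(ce_bits, seg_size=1000, m=2):
    """Return the (ack_number, ece) pairs a simplified DCTCP receiver
    would send for in-order segments with the given CE bits."""
    acks = []
    dctcp_ce = False  # DCTCP.CE, initialized to false
    rcv_nxt = 0       # next expected byte
    pending = 0       # segments received but not yet acknowledged
    for ce in ce_bits:
        ce = bool(ce)
        if ce != dctcp_ce:
            if pending:
                # Two-ACK variant: first acknowledge older segments
                # with the old DCTCP.CE state.
                acks.append((rcv_nxt, dctcp_ce))
                pending = 0
            dctcp_ce = ce
            rcv_nxt += seg_size
            acks.append((rcv_nxt, dctcp_ce))  # immediate ACK, new state
        else:
            rcv_nxt += seg_size
            pending += 1
            if pending == m:
                acks.append((rcv_nxt, dctcp_ce))  # delayed ACK
                pending = 0
    if pending:
        # Stand-in for the delayed ACK timer firing at the end of the trace.
        acks.append((rcv_nxt, dctcp_ce))
    return acks


def congestion_fraction(acks):
    """Compute M = BytesMarked / BytesAcked over one observation window
    from the cumulative ACKs that actually reach the sender."""
    snd_una = 0
    bytes_acked = 0
    bytes_marked = 0
    for seg_ack, ece in acks:
        newly_acked = seg_ack - snd_una  # step 1
        snd_una = seg_ack
        bytes_acked += newly_acked       # step 2
        if ece:
            bytes_marked += newly_acked  # step 3
    if bytes_acked == 0:
        raise ValueError("no bytes acknowledged")
    return bytes_marked / bytes_acked    # step 5


ce_bits = [0, 0, 1, 1, 1, 0, 0, 0]  # 3 of 8 segments carry CE
acks = receiver_acks(ce_bits)
exact = congestion_fraction(acks)    # 0.375

# Drop the ACK for bytes 3000-5000 (ECE set); the next ACK, with ECE
# clear, then cumulatively covers two CE-marked segments.
lossy = congestion_fraction([a for a in acks if a != (5000, True)])  # 0.125
```

   In this trace, a single lost ACK lowers the measured fraction for the
   window from 0.375 to 0.125. The estimation gain g then limits how
   far that one window moves DCTCP.Alpha.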

   However, the effect of packet drops on DCTCP under real-world
   conditions has not been analyzed.

   DCTCP provides no mechanism for negotiating its use. The effect of
   using DCTCP with a standard ECN endpoint has been analyzed in
   [ODCTCP][BSDCAN]. Furthermore, it is possible that other
   implementations may also modify [RFC3168] behavior without
   negotiation, causing further interoperability issues.

   Much like standard TCP, DCTCP is biased against flows with longer
   RTTs. A method for improving the RTT fairness of DCTCP has been
   proposed in [ADCTCP], but requires additional experimental
   evaluation.

7. Implementation Status

   This section documents the implementation status of the specification
   in this document, as recommended by [RFC7942].

   This document describes DCTCP as implemented in Microsoft Windows
   Server 2012. Since publication of the first versions of this
   document, the Linux [LINUX] and FreeBSD [FREEBSD] operating systems
   have also implemented support for DCTCP in a way that is believed to
   follow this document.

8. Security Considerations

   DCTCP enhances ECN and thus inherits the security considerations
   discussed in [RFC3168]. The processing changes introduced by DCTCP
   do not exacerbate these considerations or introduce new ones. In
   particular, with either algorithm, the network infrastructure or the
   remote endpoint can falsely report congestion and thus cause the
   sender to reduce cwnd. However, this is no worse than what can be
   achieved by simply dropping packets.

   [RFC3168] requires that a compliant TCP must not set ECT on SYN or
   SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets,
   but maintains the restriction of no ECT on SYN packets. Both these
   RFCs prohibit ECT in SYN packets due to security concerns regarding
   malicious SYN packets with ECT set.
   These RFCs, however, are intended for general Internet use, and do
   not directly apply to a controlled datacenter environment. The
   security concerns addressed by both these RFCs might not apply in
   controlled environments like datacenters, and it might not be
   necessary to account for the presence of non-ECN servers. Since most
   servers run virtualized in datacenters, additional security can be
   imposed in the physical servers to intercept and drop traffic
   resembling an attack.

9. IANA Considerations

   This document has no actions for IANA.

10. Acknowledgements

   The DCTCP algorithm was originally proposed and analyzed in [DCTCP10]
   by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye,
   Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari
   Sridharan.

   We would like to thank Andrew Shewmaker for identifying the problem
   of clamping DCTCP.Alpha and proposing a solution for it.

   Lars Eggert has received funding from the European Union's Horizon
   2020 research and innovation program 2014-2018 under grant agreement
   No. 644866 ("SSICLOPS"). This document reflects only the authors'
   views and the European Commission is not responsible for any use that
   may be made of the information it contains.

11. References

11.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018,
              DOI 10.17487/RFC2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
              DOI 10.17487/RFC5562, June 2009.

11.2. Informative References

   [RFC7942]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", BCP 205,
              RFC 7942, DOI 10.17487/RFC7942, July 2016.

   [DCTCP10]  Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
              P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data
              Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc.
              ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010.

   [ODCTCP]   Kato, M., "Improving Transmission Performance with
              One-Sided Datacenter TCP", M.S. Thesis, Keio University,
              2014.

   [BSDCAN]   Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and
              H. Tokuda, "Extensions to FreeBSD Datacenter TCP for
              Incremental Deployment Support", BSDCan 2015, June 2015.

   [ADCTCP]   Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis
              of DCTCP: Stability, Convergence, and Fairness",
              DOI 10.1145/1993744.1993753, Proc. ACM SIGMETRICS Joint
              International Conference on Measurement and Modeling of
              Computer Systems (SIGMETRICS 11), June 2011.

   [LINUX]    Borkmann, D. and F. Westphal, "Linux DCTCP patch", 2014.

   [FREEBSD]  Kato, M. and H. Panchasara, "DCTCP (Data Center TCP)
              implementation", 2015.

   [MORGANSTANLEY]
              Judd, G., "Attaining the Promise and Avoiding the Pitfalls
              of TCP in the Datacenter", Proc. 12th USENIX Symposium on
              Networked Systems Design and Implementation (NSDI 15), May
              2015.

   [RFC3168-ERRATA3639]
              Scheffenegger, R., "RFC3168 Errata ID 3639", 2013.

   [ECN-EXPERIMENTATION]
              Black, D., "Explicit Congestion Notification (ECN)
              Experimentation", 2017.

Authors' Addresses

   Stephen Bensley
   Microsoft
   One Microsoft Way
   Redmond, WA 98052
   USA

   Phone: +1 425 703 5570
   Email: sbens@microsoft.com

   Dave Thaler
   Microsoft

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com

   Praveen Balasubramanian
   Microsoft

   Phone: +1 425 538 2782
   Email: pravb@microsoft.com

   Lars Eggert
   NetApp
   Sonnenallee 1
   Kirchheim 85551
   Germany

   Phone: +49 151 120 55791
   Email: lars@netapp.com
   URI:   http://eggert.org/

   Glenn Judd
   Morgan Stanley

   Phone: +1 973 979 6481
   Email: glenn.judd@morganstanley.com