2 Network Working Group S. Bensley
3 Internet-Draft D. Thaler
4 Intended status: Informational P. Balasubramanian
5 Expires: December 29, 2017 Microsoft
6 L. Eggert
7 NetApp
8 G. Judd
9 Morgan Stanley
10 June 27, 2017

12 Datacenter TCP (DCTCP): TCP Congestion Control for Datacenters
13 draft-ietf-tcpm-dctcp-08

15 Abstract

17 This informational memo describes Datacenter TCP (DCTCP), a TCP
18 congestion control scheme for datacenter traffic. DCTCP extends the
19 Explicit Congestion Notification (ECN) processing to estimate the
20 fraction of bytes that encounter congestion, rather than simply
21 detecting that some congestion has occurred. DCTCP then scales the
22 TCP congestion window based on this estimate. This method achieves
23 high burst tolerance, low latency, and high throughput with shallow-
24 buffered switches. This memo also discusses deployment issues
25 related to the coexistence of DCTCP and conventional TCP, the lack of
26 a negotiating mechanism between sender and receiver, and presents
27 some possible mitigations. This memo documents DCTCP as currently
28 implemented by several major operating systems. DCTCP as described
29 in this draft is applicable to deployments in controlled environments
30 like datacenters but it must not be deployed over the public Internet
31 without additional measures.

33 Status of This Memo

35 This Internet-Draft is submitted in full conformance with the
36 provisions of BCP 78 and BCP 79.

38 Internet-Drafts are working documents of the Internet Engineering
39 Task Force (IETF). Note that other groups may also distribute
40 working documents as Internet-Drafts. The list of current Internet-
41 Drafts is at http://datatracker.ietf.org/drafts/current/.

43 Internet-Drafts are draft documents valid for a maximum of six months
44 and may be updated, replaced, or obsoleted by other documents at any
45 time. It is inappropriate to use Internet-Drafts as reference
46 material or to cite them other than as "work in progress."

48 This Internet-Draft will expire on December 29, 2017.
50 Copyright Notice 52 Copyright (c) 2017 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents 57 (http://trustee.ietf.org/license-info) in effect on the date of 58 publication of this document. Please review these documents 59 carefully, as they describe your rights and restrictions with respect 60 to this document. Code Components extracted from this document must 61 include Simplified BSD License text as described in Section 4.e of 62 the Trust Legal Provisions and are provided without warranty as 63 described in the Simplified BSD License. 65 Table of Contents 67 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 68 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 4 70 3.1. Marking Congestion on the L3 Switches and Routers . . . . 4 71 3.2. Echoing Congestion Information on the Receiver . . . . . 5 72 3.3. Processing Echoed Congestion Indications on the Sender . 6 73 3.4. Handling of packet loss . . . . . . . . . . . . . . . . . 8 74 3.5. Handling of SYN, SYN-ACK, RST Packets . . . . . . . . . . 8 75 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 8 76 4.1. Configuration of DCTCP . . . . . . . . . . . . . . . . . 8 77 4.2. Computation of DCTCP.Alpha . . . . . . . . . . . . . . . 9 78 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 10 79 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 11 80 7. Implementation Status . . . . . . . . . . . . . . . . . . . . 11 81 8. Security Considerations . . . . . . . . . . . . . . . . . . . 12 82 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 83 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 12 84 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 85 11.1. Normative References . . . . . . . . . . . . . . . . . . 13 86 11.2. Informative References . . . . . . . . . . . . . . . . . 14 87 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 89 1. Introduction 91 Large datacenters necessarily need many network switches to 92 interconnect their many servers. Therefore, a datacenter can greatly 93 reduce its capital expenditure by leveraging low-cost switches. 94 However, such low-cost switches tend to have limited queue capacities 95 and are thus more susceptible to packet loss due to congestion. 97 Network traffic in a datacenter is often a mix of short and long 98 flows, where the short flows require low latencies and the long flows 99 require high throughputs. Datacenters also experience incast bursts, 100 where many servers send traffic to a single server at the same time. 101 For example, this traffic pattern is a natural consequence of 102 MapReduce [MAPREDUCE] workload: The worker nodes complete at 103 approximately the same time, and all reply to the master node 104 concurrently. 106 These factors place some conflicting demands on the queue occupancy 107 of a switch: 109 o The queue must be short enough that it does not impose excessive 110 latency on short flows. 112 o The queue must be long enough to buffer sufficient data for the 113 long flows to saturate the path capacity. 115 o The queue must be long enough to absorb incast bursts without 116 excessive packet loss. 118 Standard TCP congestion control [RFC5681] relies on packet loss to 119 detect congestion. 
This does not meet the demands described above.
120 First, short flows will start to experience unacceptable latencies
121 before packet loss occurs. Second, by the time TCP congestion
122 control kicks in on the senders, most of the incast burst has already
123 been dropped.

125 [RFC3168] describes a mechanism for using Explicit Congestion
126 Notification (ECN) from the switches for detection of congestion.
127 However, this method only detects the presence of congestion, not its
128 extent. In the presence of mild congestion, the TCP congestion
129 window is reduced too aggressively and this unnecessarily reduces the
130 throughput of long flows.

132 Datacenter TCP (DCTCP) improves traditional ECN processing by
133 estimating the fraction of bytes that encounter congestion, rather
134 than simply detecting that some congestion has occurred. DCTCP then
135 scales the TCP congestion window based on this estimate. This method
136 achieves high burst tolerance, low latency, and high throughput with
137 shallow-buffered switches. DCTCP is a modification to the processing
138 of ECN by a conventional TCP and requires that standard TCP
139 congestion control be used for handling packet loss.

141 DCTCP should only be deployed in an intra-datacenter environment
142 where both endpoints and the switching fabric are under a single
143 administrative domain. DCTCP MUST NOT be deployed over the public
144 Internet without additional measures, as detailed in Section 5.

146 The objective of this Informational RFC is to document DCTCP as an
147 alternative TCP congestion control algorithm [RFC5033] that is known
148 to be widely implemented and deployed. There is consensus in the IETF
149 TCPM working group that a DCTCP standard would require further work.
150 A precise documentation of running code enables follow-up IETF
151 Experimental or Standards Track RFCs.

153 2. Terminology

155 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
156 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
157 document are to be interpreted as described in [RFC2119].

159 Normative language is used to describe how necessary the various
160 aspects of a DCTCP implementation are for interoperability, but even
161 compliant implementations without the measures in sections 4-6 would
162 still only be safe to deploy in controlled environments, i.e., not
163 over the public Internet.

165 3. DCTCP Algorithm

167 There are three components involved in the DCTCP algorithm:

169 o The switches (or other intermediate devices in the network) detect
170 congestion and set the Congestion Encountered (CE) codepoint in
171 the IP header.

173 o The receiver echoes the congestion information back to the sender,
174 using the ECN-Echo (ECE) flag in the TCP header.

176 o The sender computes a congestion estimate and reacts by reducing
177 the TCP congestion window (cwnd) accordingly.

179 3.1. Marking Congestion on the L3 Switches and Routers

181 The level-3 (L3) switches and routers in a datacenter fabric indicate
182 congestion to the end nodes by setting the CE codepoint in the IP
183 header as specified in Section 5 of [RFC3168]. For example, the
184 switches may be configured with a congestion threshold. When a
185 packet arrives at a switch and its queue length is greater than the
186 congestion threshold, the switch sets the CE codepoint in the packet.
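As a purely illustrative sketch (not part of any specification), the per-packet marking rule described above might look as follows in C; the packet structure and the names queue_bytes and k_bytes are hypothetical, and real switches typically implement this logic in hardware or as part of a RED/AQM profile:

   #include <stdint.h>

   struct pkt {
       uint8_t ecn;             /* two-bit ECN field from the IP header */
   };

   #define ECN_NOT_ECT 0x0      /* not ECN-Capable Transport */
   #define ECN_CE      0x3      /* Congestion Encountered */

   /* Hypothetical per-packet enqueue hook: mark instead of dropping
    * once the instantaneous queue exceeds the configured threshold. */
   static void mark_on_enqueue(struct pkt *p, uint32_t queue_bytes,
                               uint32_t k_bytes)
   {
       if (p->ecn == ECN_NOT_ECT)
           return;              /* non-ECT traffic falls back to the drop policy */
       if (queue_bytes > k_bytes)
           p->ecn = ECN_CE;
   }

The interesting question is how to choose the threshold itself, which is discussed next.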
187 For example, Section 3.4 of [DCTCP10] suggests threshold marking with 188 a threshold K > (RTT * C)/7, where C is the link rate in packets per 189 second. In typical deployments the marking threshold is set to be a 190 small value to maintain a short average queueing delay. However, the 191 actual algorithm for marking congestion is an implementation detail 192 of the switch and will generally not be known to the sender and 193 receiver. Therefore, sender and receiver should not assume that a 194 particular marking algorithm is implemented by the switching fabric. 196 3.2. Echoing Congestion Information on the Receiver 198 According to Section 6.1.3 of [RFC3168], the receiver sets the ECE 199 flag if any of the packets being acknowledged had the CE code point 200 set. The receiver then continues to set the ECE flag until it 201 receives a packet with the Congestion Window Reduced (CWR) flag set. 202 However, the DCTCP algorithm requires more detailed congestion 203 information. In particular, the sender must be able to determine the 204 number of bytes sent that encountered congestion. Thus, the scheme 205 described in [RFC3168] does not suffice. 207 One possible solution is to ACK every packet and set the ECE flag in 208 the ACK if and only if the CE code point was set in the packet being 209 acknowledged. However, this prevents the use of delayed ACKs, which 210 are an important performance optimization in datacenters. If the 211 delayed ACK frequency is m, then an ACK is generated every m packets. 212 The typical value of m is 2 but it could be affected by ACK 213 throttling or packet coalescing techniques designed to improve 214 performance. 216 Instead, DCTCP introduces a new Boolean TCP state variable, "DCTCP 217 Congestion Encountered" (DCTCP.CE), which is initialized to false and 218 stored in the Transmission Control Block (TCB). When sending an ACK, 219 the ECE flag MUST be set if and only if DCTCP.CE is true. When 220 receiving packets, the CE codepoint MUST be processed as follows: 222 1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to 223 true and send an immediate ACK. 225 2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE 226 to false and send an immediate ACK. 228 3. Otherwise, ignore the CE codepoint. 230 Since the immediate ACK reflects the new DCTCP.CE state, it may 231 acknowledge any previously unacknowledged packets in the old state. 232 This can lead to an incorrect DCTCP.Alpha value computation at the 233 sender per Section 3.3. To avoid this, an implementation may choose 234 to send two ACKs, one for previously unacknowledged packets and 235 another acknowledging the most recently received packet. 237 Receiver handling of the "Congestion Window Reduced" (CWR) bit is 238 also per [RFC3168] including [RFC3168-ERRATA3639]. That is, on 239 receipt of a segment with both the CE and CWR bits set, CWR is 240 processed first and then CE is processed. 242 Send immediate 243 ACK with ECE=0 244 .----. .-------------. .---. 245 Send 1 ACK / v v | | \ 246 for every | .------. .------. | Send 1 ACK 247 m packets | | CE=0 | | CE=1 | | for every 248 with ECE=0 | '------' '------' | m packets 249 \ | | ^ ^ / with ECE=1 250 '---' '------------' '----' 251 Send immediate 252 ACK with ECE=1 254 Figure 1: ACK generation state machine. DCTCP.CE abbreviated as CE. 256 3.3. Processing Echoed Congestion Indications on the Sender 258 The sender estimates the fraction of bytes sent that encountered 259 congestion. 
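Before describing that estimate in detail, the receiver-side ACK generation rules of Section 3.2 and Figure 1 can be summarized in a short sketch; this is illustrative only, the structure fields, the delayed-ACK factor m, and send_ack() are hypothetical names, and the optional "two ACKs" refinement mentioned above is omitted for brevity:

   #include <stdbool.h>
   #include <stdint.h>

   struct dctcp_rcv {
       bool     ce;             /* DCTCP.CE, initialized to false */
       uint32_t unacked;        /* segments received since the last ACK */
       uint32_t m;              /* delayed-ACK frequency, typically 2 */
   };

   extern void send_ack(bool ece);   /* hypothetical: emit an ACK, ECE as given */

   static void dctcp_rcv_segment(struct dctcp_rcv *r, bool ce_marked)
   {
       if (ce_marked != r->ce) {
           r->ce = ce_marked;
           send_ack(r->ce);     /* immediate ACK reflecting the new state */
           r->unacked = 0;
           return;
       }
       if (++r->unacked >= r->m) {
           send_ack(r->ce);     /* delayed ACK; ECE mirrors DCTCP.CE */
           r->unacked = 0;
       }
   }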
The current estimate is stored in a new TCP state
260 variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be
261 updated as follows:

263 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

265 where

267 o g is the estimation gain, a real number between 0 and 1. The
268 selection of g is left to the implementation. See Section 4 for
269 further considerations.

271 o M is the fraction of bytes sent that encountered congestion during
272 the previous observation window, where the observation window is
273 chosen to be approximately the Round Trip Time (RTT). In
274 particular, an observation window ends when all bytes in flight at
275 the beginning of the window have been acknowledged.

277 In order to update DCTCP.Alpha, the TCP state variables defined in
278 [RFC0793] are used, and three additional TCP state variables are
279 introduced:

281 o DCTCP.WindowEnd: The TCP sequence number threshold for beginning a
282 new observation window; initialized to SND.UNA.

284 o DCTCP.BytesAcked: The number of sent bytes acknowledged during the
285 current observation window; initialized to zero.

287 o DCTCP.BytesMarked: The number of bytes sent during the current
288 observation window that encountered congestion; initialized to
289 zero.

291 The congestion estimator on the sender SHOULD process acceptable ACKs
292 as follows:

294 1. Compute the bytes acknowledged (TCP SACK options [RFC2018] are
295 ignored for this computation):

297 BytesAcked = SEG.ACK - SND.UNA

299 2. Update the bytes sent:

301 DCTCP.BytesAcked += BytesAcked

303 3. If the ECE flag is set, update the bytes marked:

305 DCTCP.BytesMarked += BytesAcked

307 4. If the acknowledgment number is less than or equal to
308 DCTCP.WindowEnd, stop processing. Otherwise, the end of the
309 observation window has been reached, so proceed to update the
310 congestion estimate as follows:

312 5. Compute the congestion level for the current observation window:

314 M = DCTCP.BytesMarked / DCTCP.BytesAcked

316 6. Update the congestion estimate:

318 DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M

320 7. Determine the end of the next observation window:

322 DCTCP.WindowEnd = SND.NXT

324 8. Reset the byte counters:

326 DCTCP.BytesAcked = DCTCP.BytesMarked = 0

328 9. Rather than always halving the congestion window as described in
329 [RFC3168], the sender SHOULD update cwnd as follows:

331 cwnd = cwnd * (1 - DCTCP.Alpha / 2)

333 Thus, when no bytes sent experienced congestion, DCTCP.Alpha equals
334 zero, and cwnd is left unchanged. When all sent bytes experienced
335 congestion, DCTCP.Alpha equals one, and cwnd is reduced by half.
336 Lower levels of congestion will result in correspondingly smaller
337 reductions to cwnd.

339 Just as specified in [RFC3168], DCTCP does not react to congestion
340 indications more than once for every window of data. The setting of
341 the "Congestion Window Reduced" (CWR) bit is also as per [RFC3168].
342 This is required for interop with classic ECN receivers due to
343 potential misconfigurations.

345 3.4. Handling of packet loss

347 A DCTCP sender MUST react to loss episodes in the same way as
348 conventional TCP. For cases where the packet loss is inferred and
349 not explicitly signaled by ECN, the cwnd and other state variables
350 like ssthresh must be changed in the same way that a conventional TCP
351 would have changed them. As with ECN, a DCTCP sender will only reduce
352 the cwnd once per window of data across all loss signals.
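Putting Sections 3.3 and 3.4 together, the per-ACK processing on the sender might be sketched as follows; this is illustrative only, uses floating-point arithmetic for clarity rather than the scaled integer form of Section 4.2, and all structure and parameter names are hypothetical:

   #include <stdint.h>

   struct dctcp_snd {
       double   alpha;          /* DCTCP.Alpha, initialized to 1.0 */
       double   g;              /* estimation gain, e.g. 1.0 / 16 */
       uint32_t window_end;     /* DCTCP.WindowEnd, initialized to SND.UNA */
       uint64_t bytes_acked;    /* DCTCP.BytesAcked */
       uint64_t bytes_marked;   /* DCTCP.BytesMarked */
   };

   /* Called for each acceptable ACK.  snd_una is SND.UNA before the ACK
    * is applied, snd_nxt is SND.NXT, and ece is the ACK's ECE flag.
    * Loss-triggered reductions are handled by the conventional TCP code
    * elsewhere, as required by Section 3.4. */
   static void dctcp_on_ack(struct dctcp_snd *s, uint32_t seg_ack,
                            uint32_t snd_una, uint32_t snd_nxt,
                            int ece, uint32_t *cwnd)
   {
       uint32_t acked = seg_ack - snd_una;               /* step 1 */

       s->bytes_acked += acked;                          /* step 2 */
       if (ece)
           s->bytes_marked += acked;                     /* step 3 */

       if ((int32_t)(seg_ack - s->window_end) <= 0)
           return;                                       /* step 4 */

       /* End of the observation window: steps 5-8. */
       double m = (double)s->bytes_marked / (double)s->bytes_acked;
       s->alpha = s->alpha * (1.0 - s->g) + s->g * m;
       s->window_end = snd_nxt;
       s->bytes_acked = s->bytes_marked = 0;

       /* Step 9: scale cwnd by the estimated extent of congestion
        * (at most once per window of data) instead of halving it. */
       *cwnd = (uint32_t)(*cwnd * (1.0 - s->alpha / 2.0));
   }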
Just as
353 specified in [RFC5681], upon a timeout, the cwnd MUST be set to no
354 more than the loss window (1 full-sized segment), regardless of
355 previous cwnd reductions in a given window of data.

357 3.5. Handling of SYN, SYN-ACK, RST Packets

359 If SYN, SYN-ACK and RST packets for DCTCP connections have the "ECN
360 Capable Transport" (ECT) codepoint set in the IP header, they will
361 receive the same treatment as other DCTCP packets when forwarded by a
362 switching fabric under load. Lack of ECT in these packets may result
363 in a higher drop rate depending on the switching fabric
364 configuration. Hence for DCTCP connections, the sender SHOULD set
365 ECT for SYN, SYN-ACK and RST packets. A DCTCP receiver ignores CE
366 codepoints set on any SYN, SYN-ACK, or RST packets.

368 4. Implementation Issues

370 4.1. Configuration of DCTCP

372 An implementation should decide when to use DCTCP. Datacenter
373 servers may need to communicate with endpoints outside the
374 datacenter, where DCTCP is unsuitable or unsupported. Thus, a global
375 configuration setting to enable DCTCP will generally not suffice.
376 DCTCP provides no mechanism for negotiating its use. Thus, there is
377 additional management and configuration overhead required to ensure
378 that DCTCP is not used with non-DCTCP endpoints.

380 Potential solutions rely on either configuration or heuristics.
381 Heuristics need to allow endpoints to individually enable DCTCP, to
382 ensure a DCTCP sender is always paired with a DCTCP receiver. One
383 approach is to enable DCTCP based on the IP address of the remote
384 endpoint. Another approach is to detect connections that transmit
385 within the bounds of a datacenter. For example, an implementation could
386 support automatic selection of DCTCP if the estimated RTT is less
387 than a threshold (like 10 msec) and ECN is successfully negotiated,
388 under the assumption that if the RTT is low, then the two endpoints
389 are likely in the same datacenter network.

391 [RFC3168] forbids the ECN-marking of pure ACK packets, because of the
392 inability of TCP to mitigate ACK-path congestion. RFC 3168 also
393 forbids ECN-marking of retransmissions, window probes and RSTs.
394 However, dropping all these control packets - rather than ECN marking
395 them - has considerable performance disadvantages. It is RECOMMENDED
396 that an implementation provide a configuration knob that will cause
397 ECT to be set on such control packets, which can be used in
398 environments where such concerns do not apply. See
399 [ECN-EXPERIMENTATION] for details.

401 It is useful to implement DCTCP as additional actions on top of an
402 existing congestion control algorithm like Reno [RFC5681]. The DCTCP
403 implementation MAY also allow configuration of resetting the value of
404 DCTCP.Alpha as part of processing any loss episodes.

406 4.2. Computation of DCTCP.Alpha

408 As noted in Section 3.3, the implementation will need to choose a
409 suitable estimation gain. [DCTCP10] provides a theoretical basis for
410 selecting the gain. However, it may be more practical to use
411 experimentation to select a suitable gain for a particular network
412 and workload. A fixed estimation gain of 1/16 is used in some
413 implementations.

415 The DCTCP.Alpha computation as per the formula in Section 3.3
416 involves fractions. An efficient kernel implementation MAY scale the
417 DCTCP.Alpha value for efficient computation using shift operations.
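One possible shape of such a scaled update is sketched below, using the gain g = 1/16 and the scaling conventions of the worked example that follows (SHF = 4, SCF = 65536); all names are illustrative, and the decay-to-zero and clamping behavior match the guidance below:

   #include <stdint.h>

   #define DCTCP_SCF 65536u    /* scaling factor: Alpha is kept in [0, SCF] */
   #define DCTCP_SHF 4u        /* shift factor: g = 1/16, so "* g" is ">> 4" */

   static uint32_t dctcp_update_alpha(uint32_t alpha, uint64_t bytes_marked,
                                      uint64_t bytes_acked)
   {
       uint32_t scaled_m;

       if (bytes_acked == 0)         /* defensive; not expected at window end */
           return alpha;

       scaled_m = (uint32_t)(DCTCP_SCF * bytes_marked / bytes_acked);

       /* Let Alpha decay all the way to zero once it becomes small. */
       if ((alpha >> DCTCP_SHF) == 0)
           alpha = 0;

       /* Alpha = Alpha * (1 - g) + g * M, in scaled integer arithmetic. */
       alpha = alpha - (alpha >> DCTCP_SHF) + (scaled_m >> DCTCP_SHF);

       /* Clamp: a value above SCF would mean more than 100% congestion. */
       if (alpha > DCTCP_SCF)
           alpha = DCTCP_SCF;

       return alpha;
   }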
418 For example, if the implementation chooses g as 1/16, multiplications 419 of DCTCP.Alpha by g become right-shifts by 4. A scaling 420 implementation SHOULD ensure that DCTCP.Alpha is able to reach zero 421 once it falls below the smallest shifted value (16 in the above 422 example). At the other extreme, a scaled update must ensure 423 DCTCP.Alpha does not exceed the scaling factor, which would be 424 equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be 425 clamped after an update. 427 This results in the following computations replacing steps 5 and 6 in 428 Section 3.3, where SCF is the chosen scaling factor (65536 in the 429 example) and SHF is the shift factor (4 in the example): 431 1. Compute the congestion level for the current observation window: 433 ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked 435 2. Update the congestion estimate: 437 if (DCTCP.Alpha >> SHF) == 0 then DCTCP.Alpha = 0 439 DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF) 441 if DCTCP.Alpha > SCF then DCTCP.Alpha = SCF 443 5. Deployment Issues 445 DCTCP and conventional TCP congestion control do not coexist well in 446 the same network. In typical DCTCP deployments, the marking 447 threshold in the switching fabric is set to a very low value to 448 reduce queueing delay, and a relatively small amount of congestion 449 will exceed the marking threshold. During such periods of 450 congestion, conventional TCP will suffer packet loss and quickly and 451 drastically reduce cwnd. DCTCP, on the other hand, will use the 452 fraction of marked packets to reduce cwnd more gradually. Thus, the 453 rate reduction in DCTCP will be much slower than that of conventional 454 TCP, and DCTCP traffic will gain a larger share of the capacity 455 compared to conventional TCP traffic traversing the same path. If 456 the traffic in the datacenter is a mix of conventional TCP and DCTCP, 457 it is RECOMMENDED that DCTCP traffic be segregated from conventional 458 TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP 459 Differentiated Services Code Point (DSCP) bits to segregate the 460 network such that Active Queue Management (AQM) is applied to DCTCP 461 traffic, whereas TCP traffic is managed via drop-tail queueing. 463 Deployments should take into account segregation of non-TCP traffic 464 as well. Today's commodity switches allow configuration of different 465 marking/drop profiles for non-TCP and non-IP packets. Non-TCP and 466 non-IP packets should be able to pass through such switches, unless 467 they really run out of buffer space. 469 Since DCTCP relies on congestion marking by the switches, DCTCP's 470 potential can only be realized in datacenters where the entire 471 network infrastructure supports ECN. The switches may also support 472 configuration of the congestion threshold used for marking. The 473 proposed parameterization can be configured with switches that 474 implement Random Early Detection (RED). [DCTCP10] provides a 475 theoretical basis for selecting the congestion threshold, but as with 476 the estimation gain, it may be more practical to rely on 477 experimentation or simply to use the default configuration of the 478 device. DCTCP will revert to loss-based congestion control when 479 packet loss is experienced (e.g. when transiting a congested drop- 480 tail link, or a link with an AQM drop behavior). 482 DCTCP requires changes on both the sender and the receiver, so both 483 endpoints must support DCTCP. 
Furthermore, DCTCP provides no
484 mechanism for negotiating its use, so both endpoints must be
485 configured through some out-of-band mechanism to use DCTCP. A
486 variant of DCTCP that can be deployed unilaterally and only requires
487 standard ECN behavior has been described in [ODCTCP][BSDCAN], but
488 requires additional experimental evaluation.

490 6. Known Issues

492 DCTCP relies on the sender's ability to reconstruct the stream of CE
493 codepoints received by the remote endpoint. To accomplish this,
494 DCTCP avoids using a single ACK packet to acknowledge segments
495 received both with and without the CE codepoint set. However, if one
496 or more ACK packets are dropped, it is possible that a subsequent ACK
497 will cumulatively acknowledge a mix of CE and non-CE segments. This
498 will, of course, result in a less accurate congestion estimate.
499 There are some potential considerations:

501 o Even with an inaccurate congestion estimate, DCTCP may still
502 perform better than [RFC3168].

504 o If the estimation gain is small relative to the packet loss rate,
505 the estimate may not be too inaccurate.

507 o If ACK packet loss mostly occurs under heavy congestion, most
508 drops will occur during an unbroken string of CE packets, and the
509 estimate will be unaffected.

511 However, the effect of packet drops on DCTCP under real-world
512 conditions has not been analyzed.

514 DCTCP provides no mechanism for negotiating its use. The effect of
515 using DCTCP with a standard ECN endpoint has been analyzed in
516 [ODCTCP][BSDCAN]. Furthermore, it is possible that other
517 implementations may also modify [RFC3168] behavior without
518 negotiation, causing further interoperability issues.

520 Much like standard TCP, DCTCP is biased against flows with longer
521 RTTs. A method for improving the RTT fairness of DCTCP has been
522 proposed in [ADCTCP], but requires additional experimental
523 evaluation.

525 7. Implementation Status

527 This section documents the implementation status of the specification
528 in this document, as recommended by [RFC7942].

530 This document describes DCTCP as implemented in Microsoft Windows
531 Server 2012 [WINDOWS]. The Linux [LINUX] and FreeBSD [FREEBSD]
532 operating systems have also implemented support for DCTCP in a way
533 that is believed to follow this document. Deployment experiences
534 with DCTCP have been documented in [MORGANSTANLEY].

536 8. Security Considerations

538 DCTCP enhances ECN and thus inherits the general security
539 considerations discussed in [RFC3168], although additional mitigation
540 options exist due to the limited intra-datacenter deployment of
541 DCTCP.

543 The processing changes introduced by DCTCP do not exacerbate the
544 considerations in [RFC3168] or introduce new ones. In particular,
545 with either algorithm, the network infrastructure or the remote
546 endpoint can falsely report congestion and thus cause the sender to
547 reduce cwnd. However, this is no worse than what can be achieved by
548 simply dropping packets.

550 [RFC3168] requires that a compliant TCP must not set ECT on SYN or
551 SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets,
552 but maintains the restriction of no ECT on SYN packets. Both these
553 RFCs prohibit ECT in SYN packets due to security concerns regarding
554 malicious SYN packets with ECT set. These RFCs, however, are
555 intended for general Internet use, and do not directly apply to a
556 controlled datacenter environment.
The security concerns addressed 557 by both these RFCs might not apply in controlled environments like 558 datacenters, and it might not be necessary to account for the 559 presence of non-ECN servers. Since most servers run virtualized in 560 datacenters, additional security can be imposed in the physical 561 servers to intercept and drop traffic resembling an attack. 563 9. IANA Considerations 565 This document has no actions for IANA. 567 10. Acknowledgements 569 The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] 570 by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, 571 Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari 572 Sridharan. 574 We would like to thank Andrew Shewmaker for identifying the problem 575 of clamping DCTCP.Alpha and proposing a solution for it. 577 Lars Eggert has received funding from the European Union's Horizon 578 2020 research and innovation program 2014-2018 under grant agreement 579 No. 644866 ("SSICLOPS"). This document reflects only the authors' 580 views and the European Commission is not responsible for any use that 581 may be made of the information it contains. 583 11. References 585 11.1. Normative References 587 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 588 RFC 793, DOI 10.17487/RFC0793, September 1981, 589 . 591 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 592 Selective Acknowledgment Options", RFC 2018, 593 DOI 10.17487/RFC2018, October 1996, 594 . 596 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 597 Requirement Levels", BCP 14, RFC 2119, 598 DOI 10.17487/RFC2119, March 1997, 599 . 601 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition 602 of Explicit Congestion Notification (ECN) to IP", 603 RFC 3168, DOI 10.17487/RFC3168, September 2001, 604 . 606 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 607 Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, 608 . 610 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. 611 Ramakrishnan, "Adding Explicit Congestion Notification 612 (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, 613 DOI 10.17487/RFC5562, June 2009, 614 . 616 [RFC3168-ERRATA3639] 617 Scheffenegger, R., "RFC3168 Errata ID 3639", 2013, 618 . 621 11.2. Informative References 623 [RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running 624 Code: The Implementation Status Section", BCP 205, 625 RFC 7942, DOI 10.17487/RFC7942, July 2016, 626 . 628 [RFC5033] Floyd, S. and M. Allman, "Specifying New Congestion 629 Control Algorithms", BCP 133, RFC 5033, 630 DOI 10.17487/RFC5033, August 2007, 631 . 633 [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, 634 P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data 635 Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proc. 636 ACM SIGCOMM 2010 Conference (SIGCOMM 10), August 2010, 637 . 639 [ODCTCP] Kato, M., "Improving Transmission Performance with One- 640 Sided Datacenter TCP", M.S. Thesis, Keio University, 641 2014, . 643 [BSDCAN] Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and 644 H. Tokuda, "Extensions to FreeBSD Datacenter TCP for 645 Incremental Deployment Support", BSDCan 2015, June 2015, 646 . 648 [ADCTCP] Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis 649 of DCTCP: Stability, Convergence, and Fairness", 650 DOI 10.1145/1993744.1993753, Proc. ACM SIGMETRICS Joint 651 International Conference on Measurement and Modeling of 652 Computer Systems (SIGMETRICS 11), June 2011, 653 . 
655 [WINDOWS] Microsoft, "Windows DCTCP reference", 2012, 656 . 659 [LINUX] Borkmann, D. and F. Westphal, "Linux DCTCP patch", 2014, 660 . 664 [FREEBSD] Kato, M. and H. Panchasara, "DCTCP (Data Center TCP) 665 implementation", 2015, 666 . 669 [MORGANSTANLEY] 670 Judd, G., "Attaining the Promise and Avoiding the Pitfalls 671 of TCP in the Datacenter", Proc. 12th USENIX Symposium on 672 Networked Systems Design and Implementation (NSDI 15), May 673 2015, . 676 [ECN-EXPERIMENTATION] 677 Black, D., "Explicit Congestion Notification (ECN) 678 Experimentation", 2017, . 681 [MAPREDUCE] 682 Dean, J. and S. Ghemawat, "MapReduce: Simplified Data 683 Processing on Large Clusters", Proc. 6th ACM/USENIX 684 Symposium on Operating Systems Design and Implementation 685 (OSDI 04), December 2004, . 688 Authors' Addresses 690 Stephen Bensley 691 Microsoft 692 One Microsoft Way 693 Redmond, WA 98052 694 USA 696 Phone: +1 425 703 5570 697 Email: sbens@microsoft.com 699 Dave Thaler 700 Microsoft 702 Phone: +1 425 703 8835 703 Email: dthaler@microsoft.com 705 Praveen Balasubramanian 706 Microsoft 708 Phone: +1 425 538 2782 709 Email: pravb@microsoft.com 710 Lars Eggert 711 NetApp 712 Sonnenallee 1 713 Kirchheim 85551 714 Germany 716 Phone: +49 151 120 55791 717 Email: lars@netapp.com 718 URI: http://eggert.org/ 720 Glenn Judd 721 Morgan Stanley 723 Phone: +1 973 979 6481 724 Email: glenn.judd@morganstanley.com