TSVWG                                                           R. Even
Internet-Draft                                                    Huawei
Intended status: Informational                                  R. Huang
Expires: April 25, 2020                    Huawei Technologies Co., Ltd.
                                                        October 23, 2019

              Data Center Fast Congestion Management
              draft-even-iccrg-dc-fast-congestion-00

Abstract

   Fast congestion control is discussed in academic papers as well as
   in the different standards bodies.  No single proposal provides a
   solution that works for all use cases, which has led to multiple
   approaches.  By congestion control we refer to an end-to-end
   solution and not only to the congestion control algorithm on the
   sender side.  This document describes the current state of flow
   control and congestion management for Data Centers and proposes
   future directions.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 25, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Abbreviations
   4.  Alternative Congestion Management mechanisms
     4.1.  Mechanisms based on estimation of network status
     4.2.  Network provides limited information
       4.2.1.  ECN and DCTCP
       4.2.2.  DCQCN
       4.2.3.  SCE - Some Congestion Experienced
       4.2.4.  L4S - Low Latency, Low Loss, Scalable Throughput
     4.3.  Network provides more information
     4.4.  Network provides proactive control
   5.  Summary and Proposal
     5.1.  Reflect the network status more accurately
     5.2.  Notify the reaction point as soon as possible.
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   Fast congestion control is discussed in academic papers as well as
   in the different standards bodies.  No single proposal provides a
   solution that works for all use cases, which has led to multiple
   approaches.  By congestion control we refer to an end-to-end
   solution and not only to the congestion control algorithm on the
   sender side.

   The major use case that we are looking at is congestion control for
   Data Centers, a controlled environment [RFC8085].  With the
   emergence of Distributed Storage, AI/HPC (High Performance
   Computing), Machine Learning, etc., modern data center applications
   demand high throughput (40 Gbps and above) with ultra-low latency of
   less than 10 microseconds per hop from the network, with low CPU
   overhead.  The end-to-end latency should be less than 50
   microseconds; this value is based on DCQCN [DCQCN].  The high link
   speeds (>40 Gb/s) in Data Centers (DC) make network transfers
   complete faster and in fewer RTTs.
   Network traffic in a data center is often a mix of short and long
   flows, where the short flows require low latency and the long flows
   require high throughput.

   On IP-routed data center networks, RDMA is deployed using the RoCEv2
   [RoCEv2] protocol or iWARP [RFC5040].  RoCEv2 [RoCEv2] is a
   straightforward extension of the RoCE protocol that involves a
   simple modification of the RoCE packet format.  RoCEv2 packets carry
   an IP header, which allows traversal of IP L3 routers, and a UDP
   header that serves as a stateless encapsulation layer for the RDMA
   transport protocol packets over IP.  In Data Centers, RDMA over
   RoCEv2 expects a lossless fabric, which is achieved using ECN and
   PFC.  iWARP congestion control is based on TCP congestion control
   (DCTCP [RFC8257]).

   A good congestion control for data centers should provide low
   latency, fast convergence and high link utilization.  Since multiple
   applications with different requirements may run on the DC network,
   it is important to provide fairness between applications that may
   use different congestion algorithms.  An important issue from the
   user perspective is to achieve a short Flow Completion Time (FCT).

   This document investigates the current congestion control proposals
   and discusses future data center congestion control directions that
   aim to achieve high performance and collaboration.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Abbreviations

   RCM - RoCEv2 Congestion Management

   PFC - Priority-based Flow Control

   ECN - Explicit Congestion Notification

   DCQCN - Data Center Quantized Congestion Notification

   AI/HPC - Artificial Intelligence/High-Performance Computing

   ECMP - Equal-Cost Multipath

   NIC - Network Interface Card

   RED - Random Early Detection gateways for congestion avoidance

4.  Alternative Congestion Management mechanisms

   This section describes alternative directions based on current work.
   Looking at the alternatives from the network perspective, we can
   classify them as:

   1.  Based on estimation of network status: traditional TCP, Timely.

   2.  Network provides limited information: DCQCN using only ECN, SCE
       and L4S.

   3.  Network provides some information: HPCC.

   4.  Network provides proactive control: RCP (Rate Control Protocol).

   Note that any research on congestion control that requires network
   participation will be irrelevant if we cannot find a viable
   deployment path in which only part of the network devices support
   the proposed congestion control.

4.1.  Mechanisms based on estimation of network status

   Traditional mechanisms use the fate of packets, e.g. loss or delay,
   as the congestion signal fed back to the sender.  This is based on
   the facts that packets are dropped when a buffer is full and packets
   are delayed when a queue is building up.  It can be achieved simply
   by the interactions between the sender and the receiver, without the
   involvement of the network.  It has worked well on the Internet for
   a very long time, especially for best-effort applications that do
   not have specific performance requirements.

   However, these mechanisms are not optimized for some data center
   applications because the convergence time and throughput are not
   good enough.  This is mainly because the endpoints' estimation of
   the network status is not accurate enough, and these mechanisms lack
   further information to adjust the sender behavior.
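   As an illustration of this class of mechanisms, the sketch below
   adjusts the sending rate from the RTT gradient measured at the
   endpoints, in the spirit of Timely.  The constants and helper names
   are illustrative assumptions and are not taken from any cited
   specification.

      # Illustrative delay-gradient rate control (Timely-like sketch).
      # All constants (ADD_STEP, BETA, EWMA_G) are assumptions chosen
      # only for this example.

      class DelayGradientCc:
          ADD_STEP = 10e6     # additive increase, 10 Mb/s per update
          BETA = 0.8          # multiplicative decrease factor
          EWMA_G = 0.1        # smoothing gain for the RTT gradient

          def __init__(self, rate_bps, min_rtt_s):
              self.rate = rate_bps
              self.prev_rtt = min_rtt_s
              self.gradient = 0.0

          def on_ack(self, rtt_s):
              # Smooth the normalized RTT gradient; a positive value
              # means the queue is growing.
              raw = (rtt_s - self.prev_rtt) / self.prev_rtt
              self.gradient = ((1 - self.EWMA_G) * self.gradient
                               + self.EWMA_G * raw)
              self.prev_rtt = rtt_s
              if self.gradient <= 0:
                  # Queue draining or stable: probe for more bandwidth.
                  self.rate += self.ADD_STEP
              else:
                  # Queue building: back off in proportion to the gradient.
                  self.rate *= max(1 - self.BETA * self.gradient, 0.1)
              return self.rate

   A sender would call on_ack() for every acknowledgment and pace its
   transmissions at the returned rate; no support from the network
   devices is required.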
4.2.  Network provides limited information

   In these mechanisms, the network utilizes the ECN field of the IP
   header to provide some hints about the network status.  The
   following sections describe some typical proposals.

4.2.1.  ECN and DCTCP

   The Internet solutions use ECN [RFC3168] for marking the state of
   the queues in the network device.  They may use an AQM mechanism
   (FQ-CoDel [RFC8290], PIE [RFC8033]) in the network devices and a
   congestion algorithm (New Reno [RFC5681], CUBIC [RFC8312] or DCTCP
   [RFC8257]) on the sender side to address the congestion in the
   network.  Note that ECN is signaled earlier than packet drop but may
   cause an earlier exit from TCP slow start.

   One of the problems for TCP is that ECN is specified for TCP in such
   a way that only one feedback signal can be transmitted per Round-
   Trip Time (RTT).  [I-D.ietf-tcpm-accurate-ecn] specifies an
   alternative feedback scheme that provides more accurate information
   and can be used by DCTCP and L4S.

   Traditional TCP uses the ECN signal to indicate that congestion was
   experienced instead of relying on packet loss; however, it does not
   provide information about the degree of the congestion.  DCTCP
   [RFC8257] tries to solve this issue.  It estimates the fraction of
   bytes that encountered congestion rather than simply detecting the
   presence of congestion, and scales its sending rate accordingly.
   DCTCP is widely implemented in current data center environments.

4.2.2.  DCQCN

   An enhancement to the congestion handling for RoCEv2 is the
   Congestion Control for Large-Scale RDMA Deployments [DCQCN], which
   provides functionality similar to QCN [QCN] and DCTCP [RFC8257].  It
   is implemented in some of the RoCEv2 NICs but is not part of the
   RoCEv2 specification.  As such, vendors have their own
   implementations, which makes it difficult to interoperate with each
   other efficiently.

   The DCQCN tests assume that the Congestion Point uses RED-based ECN
   marking and that the RDMA CNP message is used by the Notification
   Point (the receiver) to report ECN Congestion Experienced (CE).
   DCQCN as presented includes parameters that have to be set, and it
   provides the parameter values that were used during the specific
   tests with Mellanox NICs.  One of the comments about DCQCN is that
   it is not simple to set the parameters so as to obtain an optimized
   solution.  This solution is specific to RoCEv2, addresses only the
   congestion control algorithm, and is implemented in the NIC.

   The DCQCN notification uses a CNP that only reports that at least
   one CE-marked packet was received in the last 50 microseconds; this
   is similar to TCP reporting.  Other UDP-based transports such as RTP
   and QUIC provide information about how many packets marked with CE,
   ECT(0) or ECT(1) were received.
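   Both DCTCP and DCQCN maintain a running estimate of the fraction of
   CE-marked traffic and reduce the sending rate in proportion to that
   estimate.  The sketch below shows a DCTCP-style estimator in the
   spirit of [RFC8257]; the gain value and the variable names are
   illustrative assumptions for this example.

      # DCTCP-style estimator of the fraction of CE-marked bytes
      # (sketch).  The gain G is an assumption; a commonly used value
      # is 1/16.

      G = 1.0 / 16

      class MarkedFractionEstimator:
          def __init__(self):
              self.alpha = 0.0      # smoothed fraction of marked bytes
              self.marked = 0
              self.total = 0

          def on_ack(self, acked_bytes, ce_marked):
              self.total += acked_bytes
              if ce_marked:
                  self.marked += acked_bytes

          def on_window_end(self, cwnd_bytes):
              # Once per observation window (roughly one RTT), update
              # alpha and shrink the congestion window in proportion to
              # the measured congestion level.
              frac = self.marked / self.total if self.total else 0.0
              self.alpha = (1 - G) * self.alpha + G * frac
              self.marked = self.total = 0
              if self.alpha > 0:
                  cwnd_bytes = max(int(cwnd_bytes * (1 - self.alpha / 2)), 1)
              return cwnd_bytes

   Because the window shrinks by alpha/2 rather than by half, a flow
   that sees only a small fraction of marks gives up only a small part
   of its rate, which is what allows the queue to be kept short without
   sacrificing throughput.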
4.2.3.  SCE - Some Congestion Experienced

   [I-D.morton-taht-tsvwg-sce] proposes using ECT(1) as an early
   notification of congestion on ECT(0)-marked packets, which can be
   used by AQM algorithms and transports as an earlier signal of
   congestion than CE ("Congestion Experienced").

   The ECN specification says that the congestion algorithm should
   treat a CE mark the same as a dropped packet.  Using ECT(1) to
   signal SCE permits middleboxes implementing AQM to signal incipient
   congestion, below the threshold required to justify setting CE.
   Existing [RFC3168]-compliant receivers MUST transparently ignore
   this new signal with respect to congestion control, and both
   existing and SCE-aware middleboxes MAY convert SCE to CE in the same
   circumstances as for ECT, thus ensuring backwards compatibility with
   ECN [RFC3168] endpoints.

   This solution uses ECT(1), which was associated in ECN [RFC3168]
   with a one-bit nonce; that use is obsoleted by RFC 8311, and SCE
   reuses the codepoint for the SCE mark.  Other documents may try to
   use this codepoint as well; for example, L4S uses it to signal L4S
   support.  The SCE markings are applied by the AQM algorithm (RED,
   CoDel) and are sent back to the sender by the transport, so support
   for conveying the SCE marking to the sender may need to be added
   (QUIC, for example, already reports the counts of ECT(0) and ECT(1)
   separately).  This solution is simpler than HPCC but provides less
   information.

   [I-D.heist-tsvwg-sce-one-and-two-flow-tests] presents one- and two-
   flow test results for the SCE reference implementation.  These tests
   are not intended to be a comprehensive real-world evaluation of SCE,
   but an illustration of SCE's influence on basic TCP metrics in a
   controlled environment.  The goal of the one-flow tests is to
   analyze the impact of SCE on the TCP throughput and TCP RTT of
   single TCP flows across a range of simulated path bandwidths and
   RTTs.  The tests were run with Reno and DCTCP.  Even though using
   SCE gave better results in general, there was significant under-
   utilization at low bandwidths (<10 Mb/s; <25 Mb/s), a slight
   increase in TCP RTT for DCTCP-SCE at 100 Mbit/160 ms, and a slight
   increase in TCP RTT for SCE-Reno at high BDPs.  The document does
   not describe the congestion algorithm that was used for DCTCP-SCE or
   Reno-SCE and comments that further work is needed to understand the
   reason for this behavior.

   The goal of the two-flow tests is to measure fairness between and
   among SCE and non-SCE TCP flows, through either a single queue or
   with fair queuing.

   The initial results show that SCE-enabled flows back off in the face
   of competition, whereas non-SCE flows fill the queue until a drop or
   CE mark occurs, so fairness is not achieved.  By changing the ramp
   at which SCE is marked, and marking SCE closer to the point of drop
   or CE, fairness improves.
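   The sketch below illustrates how a sender might react to the two
   levels of signal: a gentle reduction in response to SCE feedback and
   an [RFC3168]-style halving in response to CE.  The response factors
   are illustrative assumptions and are not taken from the SCE drafts.

      # Illustrative two-level reaction to SCE and CE feedback (sketch).
      # The reduction factors are assumptions chosen for the example.

      SCE_BACKOFF = 1.0 / 8   # mild reduction for a fully SCE-marked RTT
      CE_BACKOFF = 1.0 / 2    # classic halving on CE, as for a loss

      def update_cwnd(cwnd, acked_segments, sce_count, ce_seen):
          """Return the new congestion window (in segments) after one RTT."""
          if ce_seen:
              return max(cwnd * (1 - CE_BACKOFF), 1.0)
          if sce_count and acked_segments:
              sce_fraction = min(sce_count / acked_segments, 1.0)
              return max(cwnd * (1 - SCE_BACKOFF * sce_fraction), 1.0)
          return cwnd + 1.0   # no congestion signal: additive increase

   Because SCE is signaled below the CE threshold, a sender reacting
   this way can keep the bottleneck queue short while still leaving CE
   (or loss) as the hard back-off signal for flows that ignore SCE.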
4.2.4.  L4S - Low Latency, Low Loss, Scalable Throughput

   There are three main components to the L4S architecture
   [I-D.ietf-tsvwg-l4s-arch]:

   1.  Network: L4S traffic needs to be isolated from the queuing
       latency of Classic traffic.  However, the two should be able to
       freely share a common pool of capacity.  This is because there
       is no way to predict how many flows at any one time might use
       each service, and capacity in access networks is too scarce to
       partition into two.  The Dual Queue Coupled AQM
       [I-D.ietf-tsvwg-aqm-dualq-coupled] was developed as a minimal-
       complexity solution to this problem.  The two queues appear to
       be separated by a 'semi-permeable' membrane that partitions
       latency but not bandwidth.  Per-flow queuing such as in
       [RFC8290] could be used, but it partitions both latency and
       bandwidth between every end-to-end flow.  It is therefore rather
       overkill, which brings disadvantages, not least that a large
       number of queues is needed when two are sufficient.

   2.  Protocol: A host needs to distinguish L4S and Classic packets
       with an identifier so that the network can classify them into
       their separate treatments.  [I-D.ietf-tsvwg-ecn-l4s-id]
       considers various alternative identifiers and concludes that all
       alternatives involve compromises, but the ECT(1) and CE
       codepoints of the ECN field represent a workable solution.

   3.  Host: Scalable congestion controls already exist.  They solve
       the scaling problem with TCP that was first pointed out in
       [RFC3649].  The one used most widely (in controlled
       environments) is Data Center TCP (DCTCP [RFC8257]).  Although
       DCTCP as-is 'works' well over the public Internet, most
       implementations lack certain safety features that will be
       necessary once it is used outside controlled environments like
       data centers.  A similar scalable congestion control will also
       need to be transplanted into protocols other than TCP (QUIC,
       SCTP, RTP/RTCP, RMCAT, etc.).  Indeed, while the L4S
       architecture document was being drafted, the following scalable
       congestion controls were implemented: TCP Prague, QUIC Prague
       and an L4S variant of the RMCAT SCReAM controller [RFC8298].

   Using the Dual Queue provides better fairness between DCTCP and
   Reno/Cubic.  This is less relevant to Data Centers, where the
   competing streams may use DCQCN and DCTCP.
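   A minimal sketch of the coupling idea in
   [I-D.ietf-tsvwg-aqm-dualq-coupled] is shown below: the Classic queue
   computes a drop/mark probability, the L4S queue is CE-marked with a
   probability coupled to it, and the L4S queue additionally applies a
   shallow delay threshold of its own.  The coupling constant, the toy
   Classic law and the threshold values are assumptions made for this
   example, not the values in the draft.

      # DualQ coupled marking sketch.  The square-root coupling follows
      # the coupled-AQM idea; the queue laws and constants below are
      # simplified assumptions for illustration only.

      import math

      K = 2.0  # coupling factor between Classic and L4S marking levels

      def classic_drop_probability(classic_delay_s, target_s=0.015):
          # Toy law: probability grows with queue delay above the target.
          return max(min((classic_delay_s - target_s) * 10.0, 1.0), 0.0)

      def l4s_mark_probability(p_classic, l4s_delay_s, step_s=0.001):
          # L4S marking is the coupled probability, plus a shallow step
          # threshold on the L4S queue's own delay.
          p_coupled = min(K * math.sqrt(p_classic), 1.0)
          p_step = 1.0 if l4s_delay_s > step_s else 0.0
          return max(p_coupled, p_step)

      # Example: 25 ms of Classic queue delay gives a Classic drop/mark
      # probability of about 0.1 and a coupled L4S CE-marking
      # probability of about 0.63.
      p_c = classic_drop_probability(0.025)
      print(p_c, l4s_mark_probability(p_c, 0.0004))

   The square-root coupling is what lets a scalable (DCTCP-like) flow
   in the L4S queue and a Classic flow in the other queue converge to
   roughly equal rates even though they respond to marks differently.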
4.3.  Network provides more information

   HPCC (High Precision Congestion Control) [HPCC] is a new-generation
   congestion control protocol for high-speed cloud networks, aiming to
   achieve the ultimate performance and high stability of the high-
   speed cloud network at the same time.  HPCC was presented at ACM
   SIGCOMM 2019.

   The key design choice of HPCC is to rely on switches to provide
   fine-grained load information, such as queue size and accumulated
   tx/rx traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC
   can consistently maintain a close-to-zero queue for low latency.

   HPCC is a sender-driven CC framework.  Each packet a sender sends
   will be acknowledged by the receiver.  During the propagation of the
   packet from the sender to the receiver, each switch along the path
   leverages the In-band Network Telemetry (INT) feature of its
   switching ASIC to insert some metadata that reports the current load
   of the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B).  When the receiver gets the packet, it copies all the metadata
   recorded by the switches into the ACK message it sends back to the
   sender.  The sender decides how to adjust its flow rate each time it
   receives an ACK with network load information.

   Current IETF activity on IOAM [I-D.ietf-ippm-ioam-data] provides a
   standard mechanism for the switches in the middle to insert
   metadata.  IOAM can provide an optional method for the network to
   feed metadata about the congestion status back to the endpoints.
   But to use IOAM, the following points should be considered:

   1.  Whether the current IOAM data fields are sufficient for
       congestion control.

   2.  The encapsulation of IOAM in the data center for congestion
       control.

   3.  The feedback format for sender-driven congestion control.

   The HPCC framework requires each node in the middle to add
   information about its state to the forward-going packet until it
   reaches the receiver, which will send the acknowledgment.  We can
   think of other modes, such as having the nodes in the middle update
   the status information based on their available resources.  This
   solution requires support for INT or IOAM; both protocols need to
   specify the packet format with the INT/IOAM extension.  The HPCC
   document specifies how to implement it for RoCEv2, while for IOAM
   there are some drafts in the IPPM WG describing how to implement it
   for different transports and layer 2 packets.

   The conclusion from the trials was that HPCC can be a next-
   generation CC for high-speed networks, achieving ultra-low latency,
   high bandwidth, and stability simultaneously.  HPCC achieves fast
   convergence, small queues, and fairness by leveraging precise load
   information from INT.

   A similar mechanism is defined in Quick-Start for TCP and IP
   [RFC4782].  There is a difference in the starting rate: while HPCC
   starts at the maximum line speed, [RFC4782] starts at the rate
   specified in the Quick-Start request message.  Quick-Start is
   specified for TCP; if another transport (UDP) is used, there is a
   need to specify how the receiver sends the Quick-Start response
   message.
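   The sketch below shows the flavor of the computation HPCC describes:
   from consecutive per-hop INT records the sender estimates each
   link's utilization and scales its rate according to the most loaded
   hop.  The record fields mirror the ones listed above (ts, qLen,
   txBytes, B); the target utilization value and the update rule are
   simplifications and assumptions, not the exact HPCC algorithm, which
   also includes an additive-increase term and per-RTT reference
   windows.

      # Sketch of HPCC-style rate computation from per-hop INT metadata.
      # ETA and the update rule are simplified assumptions.

      from dataclasses import dataclass

      ETA = 0.95  # target link utilization

      @dataclass
      class HopRecord:
          ts: float        # time the packet left the hop (seconds)
          q_len: int       # queue length at the egress port (bytes)
          tx_bytes: int    # cumulative bytes transmitted on the port
          link_bps: float  # link capacity (bits per second)

      def hop_utilization(prev, cur, base_rtt):
          # Estimate the normalized utilization of one hop from two
          # consecutive INT samples of the same port.
          dt = max(cur.ts - prev.ts, 1e-9)
          tx_rate_bps = (cur.tx_bytes - prev.tx_bytes) * 8.0 / dt
          queue_bps = cur.q_len * 8.0 / base_rtt   # standing-queue term
          return (tx_rate_bps + queue_bps) / cur.link_bps

      def new_rate(cur_rate_bps, prev_hops, cur_hops, base_rtt):
          # Scale the sending rate by the most utilized hop on the path.
          worst = max(hop_utilization(p, c, base_rtt)
                      for p, c in zip(prev_hops, cur_hops))
          return cur_rate_bps * ETA / max(worst, 1e-9)

   The same computation could in principle be driven by IOAM trace data
   instead of INT, which is one of the questions raised in the list
   above.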
4.4.  Network provides proactive control

   The typical algorithm in this category is RCP (Rate Control
   Protocol) [RCP].  In the basic RCP algorithm, a router maintains a
   single rate, R(t), for every link.  The router "stamps" R(t) on
   every passing packet (unless it already carries a slower value).
   The receiver sends the value back to the sender, thus informing it
   about the slowest (or bottleneck) rate along the path.  In this way,
   the sender quickly finds out the rate it should be using (without
   the need for Slow-Start).  The router updates R(t) approximately
   once per round-trip time and strives to emulate Processor Sharing
   among flows.  The biggest advantage of RCP is the short flow
   completion times under a wide range of network and traffic
   characteristics.

   The downside of RCP is that it involves the routers in congestion
   control, so it needs help from the infrastructure.  Although the
   computations are simple, they are performed per packet.  Another
   downside is that although the RCP algorithm strives to keep the
   buffer occupancy low most of the time, there is no guarantee that
   buffers will not overflow or that packet loss will be zero.

5.  Summary and Proposal

   Congestion control is all about how to utilize the network resources
   in a better and more reasonable way under different network
   conditions.  Senders are the reaction points that consume network
   resources, and network nodes are the congestion points.  Ideally,
   reaction points should react as soon as possible when the network
   status changes.  To achieve that, there are two directions:

5.1.  Reflect the network status more accurately

   In order to provide more information than just ECN CE marking, there
   is a need to standardize a mechanism for the network device to
   provide such information and for the receiver to send more
   information to the sender.  The network device should not insert any
   new fields into the IP packet but should be able to modify the value
   of fields in the packets sent from the data sender.

   The network device will update the metadata in the forward-going
   packet to provide more information than a single CE mark or an SCE-
   like solution.

   The receiver will analyze the metadata and report back to the
   sender.  Unlike the general Internet, a data center network can
   benefit more from having accurate information to achieve better
   congestion control, and this means the network and the hosts must
   collaborate to achieve it.

   Issues to be addressed:

   o  How to add the metadata to the forward stream (IOAM is a valid
      option since we are interested in a single DC domain).  The
      encapsulations for both IPv4 and IPv6 should be considered.

   o  Negotiation of the capabilities of different nodes.

   o  The format of the network information feedback to the sender in
      the case of sender-driven mechanisms (a possible encoding is
      sketched after this list).

   o  The semantics of the message (notification or proactive).

   o  Investigation of the extra load on the network device for adding
      the metadata.
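   As a concrete illustration of the feedback-format question, the
   sketch below packs the per-hop load metadata collected on the
   forward path into a compact receiver-to-sender report.  The field
   set and the fixed-size binary layout are assumptions for discussion,
   not a proposed wire format.

      # Sketch of a receiver-to-sender feedback report carrying per-hop
      # load metadata (e.g. collected via IOAM/INT on the forward
      # path).  The field set and layout are illustrative assumptions.

      import struct

      # node_id, queue_len (KB), tx utilization (0..10000), timestamp (ns)
      HOP_FMT = "!IHHQ"
      HOP_SIZE = struct.calcsize(HOP_FMT)

      def encode_report(flow_id, hops):
          # hops: list of (node_id, queue_len_kb, tx_util, ts_ns) tuples.
          header = struct.pack("!IB", flow_id, len(hops))
          return header + b"".join(struct.pack(HOP_FMT, *h) for h in hops)

      def decode_report(buf):
          flow_id, count = struct.unpack_from("!IB", buf, 0)
          hops = [struct.unpack_from(HOP_FMT, buf, 5 + i * HOP_SIZE)
                  for i in range(count)]
          return flow_id, hops

      # Example: two hops, the second one 83.5% utilized with a 96 KB
      # queue.
      report = encode_report(7, [(1, 4, 1200, 1000), (2, 96, 8350, 2000)])
      assert decode_report(report) == (7, [(1, 4, 1200, 1000),
                                           (2, 96, 8350, 2000)])

   A small, fixed-format report keeps the parsing cost low, which
   matters given the NIC implementations and hardware implications
   mentioned elsewhere in this document.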
5.2.  Notify the reaction point as soon as possible.

   In this direction, it is worth investigating whether it is possible
   for the middle nodes to notify the sender directly (like IOAM
   Postcards) about network conditions.  Such a method is challenging
   in terms of addressing security issues, and the first concern will
   be that it can serve as a tool for a DoS attack.  Other ways, for
   example carrying the information in the reverse traffic, would be an
   alternative as long as reverse traffic exists.

   Issues to be addressed:

   o  How to deal with multiple congestion points.

   o  How to identify support by the sender and receiver for this mode
      and how to support legacy systems (same as the previous mode).

   o  How to authenticate the validity of the data.

   o  Hardware implications.

6.  Security Considerations

   TBD

7.  IANA Considerations

   No IANA action.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [CongestionManagment]
              "Understanding RoCEv2 Congestion Management", December
              2018.

   [DCQCN]    Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM Computer Communication Review,
              Vol. 45, pp. 523-536, August 2015.

   [HPCC]     Li, Y., Miao, R., Liu, H. H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM 2019, August 2019.

   [I-D.heist-tsvwg-sce-one-and-two-flow-tests]
              Heist, P., Grimes, R., and J. Morton, "Some Congestion
              Experienced One and Two-Flow Tests", draft-heist-tsvwg-
              sce-one-and-two-flow-tests-00 (work in progress), July
              2019.

   [I-D.herbert-ipv4-eh]
              Herbert, T., "IPv4 Extension Headers and Flow Label",
              draft-herbert-ipv4-eh-01 (work in progress), May 2019.

   [I-D.ietf-ippm-ioam-data]
              Brockners, F., Bhandari, S., Pignataro, C., Gredler, H.,
              Leddy, J., Youell, S., Mizrahi, T., Mozes, D., Lapukhov,
              P., Chang, R., daniel.bernier@bell.ca, d., and J. Lemon,
              "Data Fields for In-situ OAM", draft-ietf-ippm-ioam-
              data-07 (work in progress), September 2019.

   [I-D.ietf-quic-transport]
              Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed
              and Secure Transport", draft-ietf-quic-transport-23 (work
              in progress), September 2019.

   [I-D.ietf-tcpm-accurate-ecn]
              Briscoe, B., Kuehlewind, M., and R. Scheffenegger, "More
              Accurate ECN Feedback in TCP", draft-ietf-tcpm-accurate-
              ecn-09 (work in progress), July 2019.

   [I-D.ietf-tsvwg-aqm-dualq-coupled]
              Schepper, K., Briscoe, B., and G. White, "DualQ Coupled
              AQMs for Low Latency, Low Loss and Scalable Throughput
              (L4S)", draft-ietf-tsvwg-aqm-dualq-coupled-10 (work in
              progress), July 2019.

   [I-D.ietf-tsvwg-ecn-l4s-id]
              Schepper, K. and B. Briscoe, "Identifying Modified
              Explicit Congestion Notification (ECN) Semantics for
              Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s-
              id-07 (work in progress), July 2019.

   [I-D.ietf-tsvwg-l4s-arch]
              Briscoe, B., Schepper, K., Bagnulo, M., and G. White,
              "Low Latency, Low Loss, Scalable Throughput (L4S)
              Internet Service: Architecture", draft-ietf-tsvwg-l4s-
              arch-04 (work in progress), July 2019.

   [I-D.morton-taht-tsvwg-sce]
              Morton, J. and D. Taht, "The Some Congestion Experienced
              ECN Codepoint", draft-morton-taht-tsvwg-sce-00 (work in
              progress), March 2019.

   [IEEE.802.1QBB_2011]
              IEEE, "IEEE Standard for Local and metropolitan area
              networks--Media Access Control (MAC) Bridges and Virtual
              Bridged Local Area Networks--Amendment 17: Priority-based
              Flow Control", IEEE 802.1Qbb-2011,
              DOI 10.1109/ieeestd.2011.6032693, September 2011.

   [QCN]      Alizadeh, M., Atikoglu, B., Kabbani, A., Lakshmikantha,
              A., Pan, R., Prabhakar, B., and M. Seaman, "Data Center
              Transport Mechanisms: Congestion Control Theory and IEEE
              Standardization", September 2008.

   [RCP]      Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              Control to Make Flows Complete Quickly", October 2007.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC3649]  Floyd, S., "HighSpeed TCP for Large Congestion Windows",
              RFC 3649, DOI 10.17487/RFC3649, December 2003,
              <https://www.rfc-editor.org/info/rfc3649>.

   [RFC4782]  Floyd, S., Allman, M., Jain, A., and P. Sarolahti,
              "Quick-Start for TCP and IP", RFC 4782,
              DOI 10.17487/RFC4782, January 2007,
              <https://www.rfc-editor.org/info/rfc4782>.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007, <https://www.rfc-editor.org/info/rfc5040>.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC6679]  Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P.,
              and K. Carlberg, "Explicit Congestion Notification (ECN)
              for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, August
              2012, <https://www.rfc-editor.org/info/rfc6679>.

   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
              "Proportional Integral Controller Enhanced (PIE): A
              Lightweight Control Scheme to Address the Bufferbloat
              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
              <https://www.rfc-editor.org/info/rfc8033>.
   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
              March 2017, <https://www.rfc-editor.org/info/rfc8085>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
              J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler
              and Active Queue Management Algorithm", RFC 8290,
              DOI 10.17487/RFC8290, January 2018,
              <https://www.rfc-editor.org/info/rfc8290>.

   [RFC8298]  Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation
              for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December
              2017, <https://www.rfc-editor.org/info/rfc8298>.

   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
              RFC 8312, DOI 10.17487/RFC8312, February 2018,
              <https://www.rfc-editor.org/info/rfc8312>.

   [RoCEv2]   InfiniBand Trade Association, "Supplement to InfiniBand
              Architecture Specification Volume 1 Release 1.2.2, Annex
              A17: RoCEv2 (IP Routable RoCE)".

Authors' Addresses

   Roni Even
   Huawei

   Email: roni.even@huawei.com


   Rachel Huang
   Huawei Technologies Co., Ltd.

   Email: rachel.huang@huawei.com