Transport Area Working Group                                 H. Dai, Ed.
Internet-Draft                                                     B. Fu
Intended status: Informational                                    K. Tan
Expires: 13 October 2021                                          Huawei
                                                            11 April 2021


                  PFC-Free Low Delay Control Protocol
              draft-dai-tsvwg-pfc-free-congestion-control-01

Abstract

Today, low-latency transport protocols such as RDMA over Converged Ethernet (RoCE) can deliver good delay and throughput performance in small, lightly loaded high-speed data center networks, because the transport is lossless thanks to priority-based flow control (PFC). However, PFC suffers from various issues ranging from performance degradation to unreliability (e.g., deadlock), which limits the deployment of RoCE to small-scale clusters (around 1000 machines).

This document presents LDCP, a new transport that scales loss-sensitive transports, e.g., RDMA, to entire data centers containing tens of thousands of machines, without depending on PFC for losslessness, i.e., PFC-free. LDCP introduces a novel end-to-end congestion control scheme that keeps queue occupancy very low even under high network utilization or large traffic churn, resulting in almost no packet loss. Meanwhile, LDCP allows a new flow to start at full speed from the very first packet and therefore minimizes the latency of short RPC-style transactions. LDCP relies only on WRED and ECN, two features widely supported by switches, so it can be easily deployed in existing network infrastructures. Finally, LDCP is simple by design and can therefore be implemented easily by programmable or ASIC NICs.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 13 October 2021.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  LDCP algorithm
     2.1.  ECN
     2.2.  Stable stage algorithm
     2.3.  Zero-RTT bandwidth acquisition
   3.  Reference Implementations
     3.1.  Implementation on programmable NIC
     3.2.  Implementation on commercial NIC
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

Modern cloud applications, such as web search, social networking, real-time communication, and retail recommendation, require high-throughput, low-latency networks to meet increasing customer demands. Meanwhile, new trends in data centers, such as resource disaggregation, heterogeneous computing, and block storage over NVMe, continuously drive the need for high-speed networks. High-speed networks with 40 Gbps to 100 Gbps link speeds have recently been deployed in many large data centers.

Conventional software TCP/IP stacks incur high latency and substantial CPU overhead, and prevent applications from fully utilizing the physical network capacity. RDMA over Converged Ethernet (RoCE), by contrast, has shown very good delay and throughput performance in small and lightly loaded networks, thanks to OS bypass and a lossless transport built on hop-by-hop flow control, i.e., PFC. Nevertheless, in a large data center network (with tens of thousands of machines) carrying bursty traffic, PFC backpressure leads to cascading queue buildup and collateral damage to victim flows, resulting in neither low latency nor high throughput [Guo2016rdma]. Therefore, high-speed networks still face fundamental challenges in delivering low latency and high throughput at scale.
This document describes LDCP, a scalable end-to-end congestion control scheme that achieves low latency even under high network utilization. The key insight behind LDCP is to use ACKs to grant credits to, or revoke credits from, senders, mimicking receiver-driven pulling. LDCP requires data receivers to return ACKs as quickly as possible, preferably one ACK for each data packet received (per-packet ACK). The congestion window is adjusted on a per-ACK basis using a parameterized AIMD algorithm. This algorithm smooths out traffic burstiness and stabilizes the queue size at an ultra-low level, preventing queue buildup while preserving high link utilization. A first-RTT bandwidth acquisition algorithm is also proposed that allows new flows to start sending at a high rate; excess packets are actively dropped by WRED if they would overwhelm the network, in order to protect ongoing flows. When heavy congestion occurs because a large number of concurrent flows contend for the bottleneck link, e.g., large-scale incast, LDCP allows the congestion window to drop below one packet, so the number of flows LDCP can endure increases remarkably compared with TCP or DCTCP.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2.  LDCP algorithm

LDCP primarily involves two algorithms: a fast start algorithm used in the first RTT, and a stable stage algorithm that governs the rest of a flow's lifespan. Each algorithm works with its own ECN setting. Because we want to use as few priority classes as possible, we leverage the common WRED/ECN [CiscoGuide2012] [RFC3168] feature of commodity switches to support multiple ECN marking policies within one priority class.

2.1.  ECN

LDCP employs WRED/ECN at intermediate switches to mark packets when congestion happens [Floyd1993random]. Instead of using the average queue size for marking as in the original RED proposal, LDCP marks based on the instantaneous queue size, which gives more precise congestion information to end hosts [Alizadeh2010data] [Kuzmanovic2005power]. The switch is configured with four parameters, K_min, K_max, P_max and buf_max, and marks a packet with the following probability:

   if q < K_min,           p = 0

   if K_min <= q < K_max,  p = (q - K_min) / (K_max - K_min) * P_max

   if q >= K_max,          p = 1

If q is larger than the maximum buffer of the port (buf_max), the packet is dropped. This general ECN model serves both algorithms developed in LDCP, but with different parameter sets, as explained below.
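The marking policy above can be expressed as a small software model of the egress queue. This is only a sketch: the queue length q is assumed to be in the same units as K_min, K_max and buf_max, and the function name is illustrative.

   import random

   def wred_ecn_mark(q, k_min, k_max, p_max, buf_max):
       """Return 'drop', 'mark' or 'forward' for a packet arriving when
       the instantaneous queue length is q."""
       if q > buf_max:                    # queue beyond the port buffer: drop
           return "drop"
       if q < k_min:                      # below K_min: never mark
           p = 0.0
       elif q < k_max:                    # linear ramp between K_min and K_max
           p = (q - k_min) / (k_max - k_min) * p_max
       else:                              # at or above K_max: always mark
           p = 1.0
       return "mark" if random.random() < p else "forward"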
2.2.  Stable stage algorithm

In the stable stage, i.e., the rounds after the fast start (Section 2.3), a flow is in the congestion avoidance state and LDCP works as follows.

The sender maintains a congestion window (cw) to control the sending rate of data packets. The receiver returns ACK packets to confirm the delivery of these data packets. Meanwhile, the CE (Congestion Experienced) flag in data packets is echoed back by the ECN-Echo (ECE) flag in the ACKs. An ACK that does not carry an ECE flag (ECE=0) informs the sender that the network is not congested, while an ACK that carries an ECE flag (ECE=1) informs the sender that the network is congested.

There are two possible ways to generate ACKs. The simplest one is to have the receiver generate an ACK for every received data packet (per-packet ACK) and set the ECE flag if the corresponding packet carries a CE mark. Alternatively, if the receiver is busy, it can employ delayed ACKs and generate one ACK for at most m data packets as long as none of them is marked, but it must generate an ACK with the ECE flag immediately once a CE-marked packet is received. The goal of this receiver behavior is to ensure that the sender has precise information about CE marking. A similar design appears in [Alizadeh2010data].

An LDCP sender updates cw upon each ACK arrival according to the ECE flag, namely per-ACK window adjustment (PAWA). An ECE=0 flag increases cw, while an ECE=1 flag decreases cw. When per-packet ACK is used at the receiver, the update rule is as follows:

   if ECN-Echo = 0, cw = cw + alpha/cw

   if ECN-Echo = 1, cw = cw - beta                                --(1)

where alpha and beta are constants (0 < alpha, beta <= 1) and cw >= 1.

Eq. (1) shows that if an incoming ACK does not carry an ECE flag (ECE=0), it grants the sender credit and cw is increased by alpha/cw; if the ACK carries an ECE flag (ECE=1), it revokes credit from the sender and cw is decreased by beta.

In essence, Eq. (1) implements an additive increase and multiplicative decrease (AIMD) policy similar to previous work, e.g., DCTCP [Alizadeh2010data]. But PAWA, together with per-packet ACK, has the following benefits. First, PAWA reacts to each received ECE mark (or its absence) immediately, rather than running an RTT-granularity averaging process and reacting only once per RTT (as DCTCP does), so it responds to congestion more quickly and accurately. Second, together with WRED/ECN, PAWA de-synchronizes flows: instead of cutting a large portion of cw immediately upon the first ECE-marked ACK (as ECN-enabled TCP does), LDCP spreads the window reduction over one round. This de-synchronization reduces window fluctuation and stabilizes a low queue at the switches. Moreover, per-packet ACK allows ACK clocking to pace out packets better: as each ACK confirms the delivery of one packet, an ACK arrival also clocks out one new packet, so packets are almost equally spaced. Finally, PAWA has a tiny state footprint, i.e., a single state variable cw, and is easier to implement in hardware than DCTCP.

Per-packet ACK and PAWA follow a principle from discrete control systems: increase the controller's action rate but take a small control step per action. This is effective in improving control stability and accuracy.

If delayed ACK is used on the receiver side, an ACK can confirm the delivery of multiple (denoted by n) packets, and Eq. (1) becomes:

   if ECN-Echo = 0, cw = cw + n * alpha/cw

   if ECN-Echo = 1, cw = cw - n * beta                            --(2)

In extremely congested cases where a large number of flows contend for the bottleneck link, e.g., heavy incast with thousands of senders, large queues can still build up even if each flow maintains a window of merely one packet. To handle these situations, LDCP allows cw to drop below one packet. A flow with cw < 1 is clocked out by a timer whose timeout is set to RTT/cw. Accordingly, the cw update rule is:

   if ECN-Echo = 0, cw = cw + gamma

   if ECN-Echo = 1, cw = max{gamma, eta * cw}

where cw < 1. We choose eta = 1/2. gamma is the increase step when an ACK is not marked with ECE, and is also the minimum window size (typical values of gamma include 1/4, 1/8 and 1/16).
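The complete per-ACK update, covering Eq. (1), the delayed-ACK form of Eq. (2), and the sub-packet regime, can be sketched as below. The handover between the two regimes is not pinned down by the text, so the sketch simply lets a decrease carry cw below one packet; the constants are illustrative, and the RTT/cw pacing timer is omitted.

   def pawa_update(cw, ece, n=1, alpha=1.0, beta=0.5, gamma=0.0625, eta=0.5):
       """Per-ACK window adjustment (PAWA).
       cw : congestion window in packets (may be fractional)
       ece: True if the ACK carries the ECN-Echo flag
       n  : number of data packets this ACK confirms (1 for per-packet ACK)
       """
       if cw >= 1.0:
           if not ece:
               cw += n * alpha / cw       # additive increase, Eq. (1)/(2)
           else:
               cw -= n * beta             # decrease by beta per marked ACK
           return max(cw, gamma)          # a deep cut may enter the sub-packet regime
       # cw < 1: the flow is clocked by a timer with timeout RTT/cw
       if not ece:
           return cw + gamma              # small additive step
       return max(gamma, eta * cw)        # halve, but never below gamma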
2.3.  Zero-RTT bandwidth acquisition

Setting the initial rate at the very beginning of a flow is challenging. Since the sender has not yet had a chance to probe the network, it faces a dilemma: if it picks too large an initial window (IW), it may cause congestion inside the network, resulting in large queue buildup or even packet drops; on the other hand, if it chooses too conservative an IW, it may lose transmission opportunities in the first RTT and greatly hurt the performance of short flows, which could otherwise have finished in one round. LDCP resolves this dilemma with a zero-RTT bandwidth acquisition algorithm, which allows the sender to fast start opportunistically without adverse impact on flows already in the stable stage. The design of the fast start algorithm is described first, followed by an implementation using existing techniques.

Specifically, when a flow starts, the sender chooses a large enough initial window (e.g., the BDP) and sends out as many packets as possible in the first RTT. (For brevity, packets transmitted by a sender in the first RTT are called first-RTT-packets, and packets transmitted in the congestion avoidance state (Section 2.2) are called stable-stage-packets.) By intention, first-RTT-packets are marked as low priority, while stable-stage-packets are marked as high priority. The two priority classes are controlled by two separate AQM policies.

First-RTT-packets are controlled by an AQM policy that simply drops packets if they are sent too aggressively, i.e., if the queue exceeds a configured threshold K. A network switch receives packets transmitted by the senders and puts them into a queue. The queue distinguishes first-RTT-packets from stable-stage-packets according to the marks in the packets. Because first-RTT-packets have low priority, they are dropped if the queue size exceeds the configured threshold, while stable-stage-packets are enqueued as long as the queue size is below the queue capacity; stable-stage-packets are dropped only when the queue is full.

Senders and switches must cooperate. The sender adds one mark to first-RTT-packets, and the switches identify first-RTT-packets by this mark; the sender adds another mark to stable-stage-packets, and the switches recognize packets sent beyond the first RTT by that mark.

In summary, first-RTT-packets are sent at a high rate, controlled by a separate AQM, to quickly acquire free bandwidth if there is any; their low priority protects on-going long flows if there is not.

The above design can be implemented by leveraging a common feature of modern switches. On a commodity switch, the WRED/ECN feature of an ECN-enabled queue works as follows. ECN-capable packets (the two-bit ECN field in the IP header set to '01' or '10') are subject to ECN marking, while ECN-incapable packets (ECN field set to '00') are subject to WRED dropping, i.e., ECN-incapable packets are dropped if the queue size exceeds a configured threshold K, as in Eq. (3):

   if q < K,   D(q) = no drop

   if q >= K,  D(q) = drop                                        --(3)
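Under this model, a single ECN-enabled queue applies the two policies side by side. The sketch below combines Eq. (3) for ECN-incapable packets with the marking function of Section 2.1 for ECN-capable packets; treating "ECN-capable" as a boolean input is a simplification of the two-bit ECN field.

   import random

   def ldcp_queue_admit(ecn_capable, q, K, k_min, k_max, p_max, buf_max):
       """One ECN-enabled queue, two AQM policies.
       ecn_capable=False: first-RTT packet (Not-ECT) -> WRED dropping, Eq. (3)
       ecn_capable=True : stable-stage packet (ECT)  -> ECN marking, Sec. 2.1
       """
       if q > buf_max:
           return "drop"                              # buffer exhausted
       if not ecn_capable:
           return "drop" if q >= K else "forward"     # Eq. (3)
       if q < k_min:
           p = 0.0
       elif q < k_max:
           p = (q - k_min) / (k_max - k_min) * p_max
       else:
           p = 1.0
       return "mark" if random.random() < p else "forward"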
The fast start algorithm uses this WRED/ECN feature to distinguish first-RTT and stable-stage packets: the sender sets the low-priority first-RTT-packets to ECN-incapable and the high-priority stable-stage-packets to ECN-capable. All the packets carry the same DSCP value and are mapped to the same priority queue on switches. This queue is used exclusively by LDCP flows. First-RTT-packets are either dropped or pass through the switch successfully. After the first RTT, the sender counts how many in-order packets have been acknowledged, takes this as a good estimate of cw, and enters the stable stage (Section 2.2).

At first glance, the above design might look counterintuitive. If we want to improve the performance of short flows, why should we drop their packets instead of queuing them, even with higher priority? The answer is that if we allowed blind bursts in the first RTT, these first-RTT-packets could build excessively large queues, e.g., in a heavy incast scenario, and eventually these packets might still be dropped. Therefore, an AQM policy is necessary to keep the queue of first-RTT-packets low. An additional benefit of this strategy is that it protects flows in the stable stage: those flows experience almost no packet loss and constant performance even under rather dynamic churn of short flows. Finally, we note that while the first-RTT traffic could be put into a separate high-priority queue, we believe this is not really necessary. With LDCP's stable stage algorithm, the queue at the switch is already small, so the benefit of a separate priority queue would be limited. Given the limited number of priority queues in Ethernet, it is a fair choice to map both classes into one priority queue while applying different WRED/ECN policies to control their behavior.

3.  Reference Implementations

3.1.  Implementation on programmable NIC

LDCP has been implemented with RoCEv2 on a programmable many-core NIC (referred to as uNIC). uNIC has hardware enhancements for RoCEv2 packet (IB/UDP/IP stack) encapsulation and decapsulation. The RoCEv2 stack, as well as the congestion control algorithm, is implemented in microcode software on uNIC.

A congestion window cw is first added to RoCEv2 to limit the amount of in-flight data. RoCEv2 uses the Packet Sequence Number (PSN) to ensure in-order delivery, but the PSN can jump if SEND/WRITE requests are interleaved with READ requests, and packets can have different sizes. It is therefore difficult to compute the in-flight data size from the PSN alone. A new byte sequence number, the LDCP Sequence Number (LSN), is used to slide the window. Packets belonging to READ and SEND/WRITE requests share the same LSN space, while packets of READ responses have a separate LSN space, carried in a customized header.

In the stable stage of LDCP, cw is updated in the PAWA manner, and the uNIC is programmed to reply with an ACK for each data packet it receives (the uNIC can automatically coalesce ACKs based on its current load), echoing back the CE mark if the data packet is marked. Note that there are no ACK packets for READ responses in the RDMA protocol; the uNIC is therefore also programmed to reply with ACKs for READ responses to enable congestion control. Because out-of-order delivery of READ responses can be discovered by the requester, which then reissues the READ request, it is not necessary to add a NAK protocol for READ responses to ensure reliability. The CE-Echo bits are carried in a customized header encapsulated in the ACK.
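The receiver-side behaviour described here, per-packet ACKs that echo CE, with load-dependent coalescing that still reports every CE mark immediately, could look roughly like the following. The class and its fields are illustrative; m is the coalescing bound from Section 2.2.

   class LdcpAckGenerator:
       """Decides when the receiver emits an ACK and with which ECE value."""

       def __init__(self, m=1):
           self.m = m             # m = 1 gives strict per-packet ACKs
           self.pending = 0       # packets received but not yet acknowledged

       def on_data(self, ce_marked):
           """Return (acked_packets, ece) when an ACK should be sent, else None."""
           self.pending += 1
           if ce_marked:
               acked, self.pending = self.pending, 0
               return (acked, True)           # a CE mark must be echoed immediately
           if self.pending >= self.m:
               acked, self.pending = self.pending, 0
               return (acked, False)          # coalesced ACK for up to m packets
           return None                        # keep coalescing

On the uNIC, the same logic also covers READ responses, since standard RoCE defines no ACKs for them.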
On switches, fast-start packets need WRED dropping while stable-stage packets need ECN marking, so the packets must carry different flags that the switches can identify. The WRED/ECN feature of an ECN-enabled queue works as described above: ECN-capable packets are subject to ECN marking, while ECN-incapable packets are subject to WRED dropping, i.e., they are dropped if the queue size exceeds the threshold K. This WRED/ECN feature is therefore used to tag fast-start and stable-stage packets: the uNIC sets fast-start packets to ECN-incapable and stable-stage packets to ECN-capable. All the packets are mapped to the same priority queue.

If tail loss happens among the fast-start packets, the sender has to wait for a retransmission timeout. To prevent this problem, the last fast-start packet is set to ECN-capable: the IW-th packet if the message is larger than the BDP, or the last packet of the message if its size is below the BDP. This ECN-capable packet will not be dropped by WRED, so it passes the switches and arrives at the receiver, allowing the receiver to detect whether packet loss has happened.

If a new flow does not finish within the fast-start stage, it transitions to the stable stage. There are two transition conditions: 1) Packet loss is detected in the fast-start stage, which indicates that the network is overloaded. cw for the stable stage is set to the number of packets that were cumulatively acknowledged before the loss, and the lost packets are retransmitted using go-back-N. 2) A full IW of packets has been acknowledged. (IW is set to the BDP as suggested in Section 2.3.) This condition covers flows that are larger than the BDP and finish the fast-start stage without packet loss. Since all packets sent during the fast-start stage are confirmed, the stable stage algorithm takes over and cw is set to the BDP. Note that acknowledging a BDP worth of data takes two RTTs (the ACK for the IW-th packet returns at the end of the second RTT), while sending it takes only one RTT. After the end of the first RTT the flow does not stop sending (the ACK of the first packet returns and frees up cw), but it sets all packets to ECN-capable from then on.

All these implementation details are transparent to user applications. LDCP supports all RDMA transport operations (READ, WRITE, SEND, with or without immediate data, ATOMIC), and thus fully supports IB verbs.

3.2.  Implementation on commercial NIC

LDCP has also been implemented on the Mellanox CX6-DX NIC. This NIC provides a programmable congestion control (PCC) platform that allows users to define their own algorithms, while the RoCE protocol itself is standard and implemented in the ASIC. In PCC, users can issue a request to measure the round-trip time (RTT): a standalone RTT request packet is sent among the data packets to the receiver NIC. Upon receiving an RTT request, the receiver NIC returns a standalone RTT response packet to the sender, and the sender compares the timestamps to calculate the RTT.

When a data-sender NIC receives ACKs, NACKs, CNPs or RTT responses, or after it transmits a burst of data, it generates the corresponding types of events and pushes them to PCC. In PCC, users define event-handling functions that calculate the transmission rate. The rate is then fed to the transmit hardware to control the speed at which data packets are put onto the wire.

LDCP is implemented in these event-handling functions. As LDCP is an AIMD algorithm, the AI logic increases the window by a fixed step per RTT, and the MD logic decreases the window by beta upon *every* CNP. MD can therefore be implemented easily in the CNP handling function; the difficulty is how to implement AI, since standard RoCE has no per-packet ACK for SEND/WRITE requests and READ responses have no ACKs at all. The AI implementation therefore relies on the RTT request and response. RTT requests are issued as follows: at the beginning of a flow an RTT request is sent out, and the next RTT request is sent after the RTT response to the previous request is received (or a timeout occurs). Upon the arrival of an RTT response, it is certain that one RTT has elapsed, so the window should increase by alpha. The AI logic is therefore implemented in the RTT response handling function, where the window grows by alpha. The window is converted to a rate by dividing it by the RTT, and the rate is provided to the TX pipeline via an interface in PCC.
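A rough mapping of this logic onto event handlers might look like the sketch below. The handler names, the byte-based window, and the minimum-window floor are assumptions made for illustration; they are not the actual PCC API.

   class LdcpPccState:
       """AI per RTT response, MD per CNP; the window is converted to a rate."""

       MIN_WINDOW = 1500.0                    # assumed floor: one MTU worth of bytes

       def __init__(self, init_window, alpha, beta, rtt):
           self.window = init_window          # allowed bytes in flight
           self.alpha = alpha                 # additive increase per RTT, in bytes
           self.beta = beta                   # decrease applied on every CNP, in bytes
           self.rtt = rtt                     # latest RTT estimate, in seconds

       def on_cnp(self):
           # MD branch: every CNP revokes credit from the sender.
           self.window = max(self.MIN_WINDOW, self.window - self.beta)
           return self._rate()

       def on_rtt_response(self, measured_rtt):
           # AI branch: an RTT response means one RTT has elapsed since the
           # request was issued, so grow the window and send the next request.
           self.rtt = measured_rtt
           self.window += self.alpha
           return self._rate()

       def _rate(self):
           # The transmit hardware is rate-based, so window/RTT is handed to it.
           return self.window / self.rtt      # bytes per second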
In conclusion, the LDCP implementation on the Mellanox CX6-DX is quite straightforward and does not require any hardware customization. Evaluation results show that LDCP outperforms DCQCN and TIMELY remarkably in both throughput and latency.

4.  IANA Considerations

This document makes no request of IANA.

5.  Security Considerations

To be added.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

6.2.  Informative References

   [Alizadeh2010data]
              Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J.,
              Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan,
              "Data Center TCP (DCTCP)", ACM SIGCOMM, pp. 63-74, 2010.

   [CiscoGuide2012]
              Cisco, "Cisco IOS Quality of Service Solutions
              Configuration Guide", 2012.

   [Floyd1993random]
              Floyd, S. and V. Jacobson, "Random Early Detection
              Gateways for Congestion Avoidance", IEEE/ACM Transactions
              on Networking 1(4), pp. 397-413, 1993.

   [Guo2016rdma]
              Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J.,
              and M. Lipshteyn, "RDMA over Commodity Ethernet at
              Scale", ACM SIGCOMM, pp. 202-215, 2016.

   [Kuzmanovic2005power]
              Kuzmanovic, A., "The Power of Explicit Congestion
              Notification", ACM SIGCOMM, pp. 61-72, 2005.
Authors' Addresses

   Huichen Dai (editor)
   Huawei
   Huawei Mansion, No.3, Xinxi Road, Haidian District
   Beijing
   China

   Email: daihuichen@huawei.com


   Binzhang Fu
   Huawei
   Huawei Mansion, No.3, Xinxi Road, Haidian District
   Beijing
   China

   Email: fubinzhang@huawei.com


   Kun Tan
   Huawei
   Huawei Mansion, No.3, Xinxi Road, Haidian District
   Beijing
   China

   Email: kun.tan@huawei.com