Transport Area Working Group                                H. Dai, Ed.
Internet-Draft                                                    B. Fu
Intended status: Informational                                   K. Tan
Expires: 14 January 2021                                         Huawei
                                                            13 July 2020

                  PFC-Free Low Delay Control Protocol
            draft-dai-tsvwg-pfc-free-congestion-control-00

Abstract

Today, low-latency transport protocols such as RDMA over Converged
Ethernet (RoCE) can provide good delay and throughput performance in
small and lightly loaded high-speed datacenter networks, thanks to a
lossless transport based on priority-based flow control (PFC).
However, PFC suffers from various issues, ranging from performance
degradation to unreliability (e.g., deadlock), which limit the
deployment of RoCE to small-scale clusters (on the order of 1000
machines).

This document presents LDCP, a new transport that scales loss-
sensitive transports, e.g., RDMA, to entire data centers containing
tens of thousands of machines, without depending on PFC for
losslessness, i.e., PFC-free.  LDCP introduces a novel end-to-end
congestion control scheme that achieves very low queue occupancy even
under high network utilization or large traffic churn, resulting in
almost no packet loss.  Meanwhile, LDCP allows a new flow to jump-
start at full speed from the very beginning and therefore minimizes
the latency of short RPC-style transactions.  LDCP relies only on WRED
and ECN, two features widely supported on switches, so it can be
easily deployed in existing network infrastructures.  Finally, LDCP is
simple by design and thus can be easily implemented by programmable or
ASIC NICs.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.
It is inappropriate to use Internet-Drafts as reference material or to
cite them other than as "work in progress."

This Internet-Draft will expire on 14 January 2021.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Simplified BSD License.

Table of Contents

1. Introduction
   1.1. Requirements Language
2. LDCP algorithm
   2.1. ECN
   2.2. Stable stage algorithm
   2.3. Zero-RTT bandwidth acquisition
3. Reference Implementation
4. IANA Considerations
5. Security Considerations
6. References
   6.1. Normative References
   6.2. Informative References
Authors' Addresses

1. Introduction

Modern cloud applications, such as web search, social networking,
real-time communication, and retail recommendation, require high-
throughput, low-latency networks to meet the increasing demands of
customers.  Meanwhile, new trends in data centers, such as resource
disaggregation, heterogeneous computing, and block storage over NVMe,
continue to drive the need for high-speed networks.  Recently, high-
speed networks with 40 Gbps to 100 Gbps link speed have been deployed
in many large data centers.

Conventional software TCP/IP stacks incur high latency and substantial
CPU overhead, preventing applications from fully utilizing the
physical network capacity.  RDMA over Converged Ethernet (RoCE), in
contrast, has shown very good delay and throughput performance in
small and lightly loaded networks, owing to OS bypass and a lossless
transport that performs hop-by-hop flow control, i.e., PFC.
Nevertheless, in a large data-center network (with tens of thousands
of machines) carrying bursty traffic, PFC backpressure leads to
cascading queue buildups and collateral damage to victim flows,
resulting in neither low latency nor high throughput [Guo2016rdma].
Therefore, high-speed networks still face fundamental challenges in
delivering the aforementioned goals.

This document describes LDCP, a scalable end-to-end congestion control
scheme that achieves low latency even under high network utilization.
The key insight behind LDCP is to use ACKs to grant credits to, or
revoke credits from, senders, in order to mimic receiver-driven
pulling.
LDCP requires data receivers to return ACKs as quickly as possible,
preferably one ACK for each data packet received (per-packet ACK).
The congestion window is adjusted on a per-ACK basis using a
parameterized AIMD algorithm.  This algorithm smooths out traffic
burstiness and stabilizes the queue size at an ultra-low level,
preventing queue buildups while preserving high link utilization.  A
first-RTT bandwidth acquisition algorithm is also proposed to allow
new flows to start sending at a high rate; excess packets are actively
dropped by WRED if they would overwhelm the network, in order to
protect ongoing flows.  When heavy congestion occurs because a large
number of concurrent flows contend for the bottleneck link, e.g.,
large-scale incast, LDCP allows the congestion window to fall below
one packet, so the number of flows that LDCP can tolerate increases
remarkably compared with TCP or DCTCP.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].

2. LDCP algorithm

LDCP consists primarily of two algorithms: a fast start algorithm that
is used in the first RTT, and a stable stage algorithm that governs
the rest of a flow's lifespan.  Each algorithm works with its own ECN
setting.  Because we want to use as few priority classes as possible,
we leverage the common WRED/ECN [CiscoGuide2012] [RFC3168] feature in
commodity switches to support multiple ECN marking policies within one
priority class.

2.1. ECN

LDCP employs WRED/ECN at intermediate switches to mark packets when
congestion happens [Floyd1993random].  Instead of using the average
queue size for marking as in the original RED proposal, LDCP uses
instant-queue-based ECN to give more precise congestion information to
end hosts [Alizadeh2010data] [Kuzmanovic2005power].  The switch is
configured with four parameters, K_min, K_max, P_max, and buf_max, and
it marks a packet with a probability given by the following function:

   if q < K_min,          p = 0

   if K_min <= q < K_max, p = (q - K_min) / (K_max - K_min) * P_max

   if q >= K_max,         p = 1

If q is larger than the maximum buffer of the port (buf_max), the
packet is dropped.  This general ECN model works for both algorithms
developed in LDCP, but with different sets of parameters, as explained
below.
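As a non-normative illustration, the instant-queue marking policy
above can be sketched as follows (Python; the function name, return
convention, and random-draw realization are illustrative only):

   import random

   def wred_ecn_mark(q, k_min, k_max, p_max, buf_max):
       """Illustrative instant-queue marking decision (Section 2.1).

       q is the instantaneous queue length, in the same unit as the
       thresholds (e.g., cells or KB).  Returns "drop" if the packet
       would overflow the port buffer, "mark" if it should be
       CE-marked, and "forward" otherwise.
       """
       if q > buf_max:
           return "drop"            # queue exceeds the port buffer
       if q < k_min:
           return "forward"         # p = 0
       if q < k_max:
           p = (q - k_min) / (k_max - k_min) * p_max
       else:
           p = 1.0                  # q >= K_max
       return "mark" if random.random() < p else "forward"

For example, with K_min = 20, K_max = 80, and P_max = 0.2, a packet
arriving at an instantaneous queue of 50 is CE-marked with probability
0.1.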
2.2. Stable stage algorithm

In the stable stage, i.e., the rounds after the fast start
(Section 2.3), the flow is in the congestion avoidance state, and LDCP
works as follows.

The sender maintains a congestion window (cw) to control the sending
rate of data packets.  The receiver returns ACK packets to confirm the
delivery of these data packets.  Meanwhile, the CE (Congestion
Experienced) flag in data packets is echoed back by the ECN-Echo (ECE)
flag in the ACKs.  An ACK that does not carry an ECE flag (ECE=0)
informs the sender that the network is not congested, while an ACK
that carries an ECE flag (ECE=1) informs the sender that the network
is congested.

There are two possible receiver behaviors regarding the number of ACKs
generated.  The simplest is for the receiver to generate an ACK for
every received data packet (i.e., per-packet ACK) and set the ECE flag
if the corresponding packet carries a CE mark.  Alternatively, if the
receiver is busy, it can employ delayed ACKs and generate one ACK for
at most m data packets if none of them is marked, but it generates an
ACK with the ECE flag set immediately once a CE-marked packet is
received.  The goal of this receiver behavior is to ensure that the
sender has precise information about CE marking.  A similar design
appears in [Alizadeh2010data].

An LDCP sender updates cw upon each ACK arrival according to the ECE
marks, namely per-ACK window adjustment (PAWA).  An ACK with ECE=0
increases cw, while an ACK with ECE=1 decreases cw.  When per-packet
ACK is used by the receiver, the update rule is as follows:

   if ECN-Echo = 0, cw = cw + alpha/cw

   if ECN-Echo = 1, cw = cw - beta                            --(1)

where alpha and beta are constants (0 < alpha, beta <= 1) and cw >= 1.

Eq. (1) shows that if an incoming ACK does not carry an ECE flag
(ECE=0), it grants the sender credits, and cw is increased by
alpha/cw; if the ACK carries an ECE flag (ECE=1), it revokes credits
from the sender, and cw is decreased by beta.

In essence, Eq. (1) implements an additive-increase multiplicative-
decrease (AIMD) policy similar to previous work, e.g., DCTCP
[Alizadeh2010data].  But PAWA, together with per-packet ACK, has the
following benefits.  First, PAWA reacts to each received ECE mark (or
absence of a mark) immediately, rather than employing an
RTT-granularity averaging process and reacting only once per RTT (as
DCTCP does), so it responds to congestion more quickly and accurately.
Second, along with WRED/ECN, PAWA is able to de-synchronize flows.
Instead of cutting a large portion of cw immediately upon the first
ECE-marked ACK (as ECN-enabled TCP does), LDCP distributes the window
reduction over one round.  Such de-synchronization is effective in
reducing window fluctuation and stabilizing a low queue at the
switches.  Moreover, per-packet ACK allows ACK clocking to pace out
packets better: since each ACK confirms the delivery of one packet, an
ACK arrival also clocks out one new packet, so packets are almost
equally spaced.  Finally, PAWA has a tiny state footprint, i.e., a
single state variable cw, and is easier to implement in hardware than
DCTCP.

Per-packet ACK and PAWA follow a principle of discrete control
systems: increase the controller's action rate but take a small
control step per action.  This is effective in improving control
stability and accuracy.

If delayed ACK is used on the receiver side, an ACK can confirm the
delivery of multiple packets (denoted by n), and Eq. (1) becomes:

   if ECN-Echo = 0, cw = cw + n * alpha/cw

   if ECN-Echo = 1, cw = cw - n * beta                         --(2)

In extremely congested cases where a large number of flows contend for
the bottleneck link, e.g., heavy incast with thousands of senders,
large queues would still build up even if each flow maintained a
window of merely one packet.  To handle these situations, LDCP allows
cw to drop below one packet.  A flow with cw < 1 is clocked by a
timer whose timeout is set to RTT/cw.  Accordingly, the cw update rule
becomes:

   if ECN-Echo = 0, cw = cw + gamma

   if ECN-Echo = 1, cw = max{gamma, eta * cw}

where cw < 1.  We choose eta = 1/2.  gamma is the increase step when
an ACK is not ECE-marked, and is also the minimum window size (typical
values of gamma include 1/4, 1/8, and 1/16).
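As a non-normative illustration, the sketch below combines Eq. (2)
with the sub-packet rule in a single update function.  The handoff
between the two regimes and the default parameter values (alpha = 1,
beta = 1/128) are assumptions made for the example, not values
mandated by this document:

   def pawa_update(cw, ece, n=1, alpha=1.0, beta=1/128, gamma=1/16,
                   eta=0.5):
       """Illustrative per-ACK window adjustment (PAWA), Section 2.2.

       cw     current congestion window in packets (may be below 1)
       ece    True if the ACK carries the ECN-Echo flag
       n      number of data packets acknowledged by this ACK
       alpha, beta  AIMD constants in (0, 1]
       gamma        increase step and window floor when cw < 1
       eta          multiplicative decrease factor when cw < 1
       """
       if cw >= 1:
           # Eq. (2): per-ACK additive increase / small decrease.
           cw = cw + n * alpha / cw if not ece else cw - n * beta
       else:
           # Sub-packet regime: the flow is clocked by a timer of
           # RTT/cw rather than by ACK arrivals.
           cw = cw + gamma if not ece else eta * cw
       return max(cw, gamma)   # gamma is also the minimum window size

For example, with cw = 10 an unmarked per-packet ACK (n = 1,
alpha = 1) grows the window to 10.1, while a marked ACK with
beta = 1/128 shaves off only 1/128 of a packet, spreading the window
reduction over the whole round.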
2.3. Zero-RTT bandwidth acquisition

Setting an initial rate at the very beginning of a flow is
challenging.  Since the sender has not yet had a chance to probe the
network, it faces a difficult dilemma: if it picks too large an
initial window (IW), it may cause congestion inside the network,
resulting in large queue buildup or even packet drops; on the other
hand, if it chooses too conservative an IW, it may lose transmission
opportunities in the first RTT and greatly hurt the performance of
short flows, which could otherwise have finished within one round.
LDCP resolves this dilemma with a zero-RTT bandwidth acquisition
algorithm, which allows the sender to start fast opportunistically
without adverse impact on ongoing flows in the stable stage.  In what
follows, the design of the fast start algorithm is described first;
afterwards, an implementation using existing techniques is provided.

Specifically, when a flow starts, the sender chooses a large enough
initial window (e.g., the BDP) and sends out as many packets as
possible in the first RTT.  (For brevity, packets transmitted by a
sender in the first RTT are denoted first-RTT-packets, and packets
transmitted in the congestion avoidance state (Section 2.2) are
referred to as stable-stage-packets.)  By intention, first-RTT-packets
are marked as having lower priority, while stable-stage-packets are
marked as having high priority.  The two priority classes are
controlled by two separate AQM policies.

The first-RTT-packets are controlled by an AQM policy that simply
drops packets if they are sent too aggressively, i.e., if the queue
exceeds a configured threshold K.  A network switch receives packets
transmitted by the senders and puts them into a queue.  The queue
distinguishes first-RTT-packets from stable-stage-packets according to
the marks in the packets.  Because first-RTT-packets have low
priority, they are dropped if the receiving queue size exceeds the
configured threshold, while stable-stage-packets are enqueued as long
as the queue size is below the queue capacity.  Stable-stage-packets
are dropped only when the queue is full.

Senders and switches must cooperate.  The sender adds one mark to
first-RTT-packets, and the switches identify first-RTT-packets by this
mark; the sender adds another mark to stable-stage-packets, and the
switches recognize packets sent beyond the first RTT by that mark.

In summary, first-RTT-packets are sent at a high rate and controlled
by a separate AQM, to quickly acquire free bandwidth if there is any;
their low priority protects ongoing long flows if there is not.

The above design can be implemented by leveraging a common feature of
modern switches.  On a commodity switch, the WRED/ECN feature on an
ECN-enabled queue works as follows.  ECN-capable packets (the two-bit
ECN field in the IP header set to '01' or '10') are subject to ECN
marking, while ECN-incapable packets (the two-bit ECN field set to
'00') are subject to WRED dropping, i.e., ECN-incapable packets are
dropped if the queue size exceeds a configured threshold K, as in
Eq. (3):

   if q < K,  D(q) = no drop

   if q >= K, D(q) = drop                                      --(3)
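As a non-normative illustration, the sketch below combines Eq. (3)
with the Section 2.1 marking curve to show how a single ECN-enabled
queue treats the two packet classes; the function name and the
mark_fn callback are illustrative and do not correspond to any
particular switch configuration interface:

   def ldcp_queue_treatment(ecn_capable, q, k, buf_max, mark_fn):
       """Illustrative per-packet treatment in the shared LDCP queue.

       ecn_capable  True for stable-stage packets, False for
                    first-RTT packets
       q            instantaneous queue length
       k            WRED drop threshold K from Eq. (3)
       buf_max      port buffer limit
       mark_fn      callable q -> bool realizing the Section 2.1 curve
       Returns "drop", "mark", or "forward".
       """
       if q >= buf_max:
           return "drop"                 # queue full: nothing enqueued
       if not ecn_capable:
           # ECN-incapable (first-RTT) packets follow WRED dropping,
           # Eq. (3).
           return "drop" if q >= k else "forward"
       # ECN-capable (stable-stage) packets are enqueued and possibly
       # CE-marked according to the Section 2.1 curve.
       return "mark" if mark_fn(q) else "forward"

For instance, with K = 32, an ECN-incapable packet arriving at a queue
of 40 is dropped, while an ECN-capable packet arriving at the same
queue is enqueued and CE-marked according to the Section 2.1 curve.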
The fast start algorithm makes use of this WRED/ECN feature to
distinguish first-RTT and stable-stage packets: the sender sets the
low-priority first-RTT-packets to ECN-incapable, and sets the
high-priority stable-stage-packets to ECN-capable.  All packets carry
the same DSCP value and are mapped to the same priority queue on
switches.  This queue is used exclusively by LDCP flows.  First-RTT-
packets are either dropped or pass the switch successfully.  After the
first RTT, the sender counts how many in-order packets have been
acknowledged, takes this count as a good estimate of cw, and enters
the stable stage (Section 2.2).

At first glance, the above design might look counterintuitive: if we
want to improve the performance of short flows, why should we drop
their packets instead of queuing them, even at a higher priority?  The
answer lies in the fact that if we allowed blind bursts in the first
RTT, these first-RTT-packets could build excessively large queues,
e.g., in a heavy incast scenario, and eventually these packets might
still get dropped.  Therefore, an AQM policy is necessary to keep the
queue of first-RTT packets low.  An additional benefit of this
strategy is that it also protects flows in the stable stage.  Those
stable-stage flows will experience little packet loss and constant
performance even in the face of rather dynamic churn of short flows.
Finally, we note that while the first-RTT traffic could be put into a
separate high-priority queue, we believe this is not strictly
necessary.  The reason is that with LDCP's stable stage algorithm the
queue at the switch is already small, so the benefit of a separate
priority queue would be limited.  Given the limited number of priority
queues in Ethernet, it is a fair choice to map both classes into one
priority queue while applying different WRED/ECN policies to control
their behavior.
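As a non-normative illustration, the sender side of the fast start can
be summarized by the following sketch; the class and field names are
illustrative, and IW = BDP follows the suggestion above:

   ECN_NOT_ECT = 0b00  # ECN-incapable: WRED-dropped (first RTT)
   ECN_ECT0    = 0b10  # ECN-capable:   CE-marked (stable stage)

   class LdcpFastStartSender:
       """Illustrative zero-RTT bandwidth acquisition (Section 2.3)."""

       def __init__(self, bdp_packets):
           self.iw = bdp_packets        # initial window of roughly one BDP
           self.in_fast_start = True
           self.acked_in_order = 0      # in-order packets acked so far
           self.cw = None               # set when the stable stage begins

       def ecn_codepoint(self):
           # First-RTT packets are sent ECN-incapable so that switches
           # may drop them under Eq. (3); later packets are ECN-capable.
           return ECN_NOT_ECT if self.in_fast_start else ECN_ECT0

       def on_ack(self, newly_acked_in_order):
           self.acked_in_order += newly_acked_in_order

       def end_first_rtt(self):
           # The number of in-order packets acknowledged is taken as a
           # good estimate of cw, and the flow enters the stable stage
           # (Section 2.2).
           self.in_fast_start = False
           self.cw = max(1, self.acked_in_order)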
3. Reference Implementation

LDCP has been implemented with RoCEv2 on a programmable many-core NIC
(referred to as uNIC).  uNIC has hardware enhancements for RoCEv2
packet (IB/UDP/IP stack) encapsulation and decapsulation.  The RoCEv2
stack, as well as the congestion control algorithm, is implemented in
microcode software on the uNIC.

We first add the congestion window cw to RoCEv2.  RoCEv2 uses the
Packet Sequence Number (PSN) to ensure in-order delivery, but the PSN
can jump if SEND/WRITE requests are interleaved with READ requests,
and packets can have different sizes.  Therefore, it is difficult to
compute the data size covered by cw from the PSN.  We add a new byte
sequence number to packets, the LDCP Sequence Number (LSN).  Packets
belonging to READ and SEND/WRITE requests share the same LSN space,
while packets of READ Responses have a separate LSN space, coded in a
customized header.  The LDCP sliding window is based on the LSN.

In the stable stage of LDCP, cw is updated in the PAWA manner, and we
program the uNIC to reply with an ACK for each data packet it receives
(the uNIC is able to automatically coalesce ACKs if all cores are
busy); the ACK echoes back the CE mark if the data packet is marked.
Since there are no ACK packets for Read Responses in the RDMA
protocol, we also program the uNIC to reply with ACKs for Read
Responses in order to slide the window.  Because out-of-order delivery
of Read Responses can be detected by the requester, which will issue a
repeated read request, it is not necessary to add a NAK protocol for
Read Responses to ensure reliability.  The CE-Echo bits are coded in a
customized header encapsulated in the ACK.

As mentioned, packets in the fast-start stage and the stable stage are
distinguished by ECN capability.  If a new flow does not finish within
the fast-start stage, it transitions to the stable stage.  There are
two transition conditions: 1) Packet loss is detected in the
fast-start stage, which indicates that the network is overloaded.  cw
in the stable stage is set to the number of packets that were
cumulatively acknowledged before the packet loss.  The lost packets
are retransmitted using go-back-N.  2) A full IW of packets has been
acknowledged.  (IW is set to the BDP as suggested in Section 2.3.)
This condition covers flows that are larger than the BDP and finish
the fast-start stage without packet loss.  Since all packets sent
during the fast-start stage have been confirmed, the stable stage
algorithm now takes over and cw is set to the BDP.  Note that
acknowledging a BDP worth of data takes two RTTs (the ACK for the
IW-th packet returns at the end of the second RTT), but sending
BDP-sized data requires only one RTT.  After the end of the first RTT,
the flow does not stop sending (because the ACK of the first packet
returns and frees up cw), but it sets its packets to ECN-capable from
then on.

There is a practical issue to consider: if all the packets sent out
during the fast-start stage are dropped due to overload, how can the
sender quickly detect the packet loss and avoid a retransmission
timeout?  We solve this problem by setting the IW-th packet to
ECN-capable during the fast-start stage.  For messages smaller than IW
packets, the last packet is set to ECN-capable.  These ECN-capable
packets are not dropped even if the queue size exceeds K (unless the
queue buffer overflows), since they are subject to ECN marking.  They
pass through the switches and arrive at the receiver, allowing the
receiver to detect whether packet loss has happened.

All these implementation details are transparent to user applications.
LDCP supports all RDMA transport operations (READ, WRITE, and SEND,
with or without immediate data, and ATOMIC), and thus fully supports
IB verbs.
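As a non-normative illustration, the two transition conditions and the
loss-detection trick above can be sketched as follows; the helper
names and return conventions are illustrative only:

   def ecn_capable_in_fast_start(pkt_idx, msg_packets, iw):
       """Illustrative rule from Section 3: during the fast-start stage
       only the IW-th packet (or the last packet of a message shorter
       than IW) is sent ECN-capable, so it survives WRED dropping and
       lets the receiver detect that earlier first-RTT packets were
       lost."""
       last_idx = min(iw, msg_packets) - 1      # 0-based index
       return pkt_idx == last_idx

   def fast_start_exit_cw(acked_before_loss, loss_detected,
                          iw_fully_acked, iw):
       """Illustrative transition conditions from Section 3.  Returns
       the initial cw for the stable stage, or None to remain in the
       fast-start stage."""
       if loss_detected:
           # Condition 1: loss during fast start; go-back-N
           # retransmission is triggered elsewhere, and cw seeds from
           # the cumulatively acknowledged packets.
           return max(1, acked_before_loss)
       if iw_fully_acked:
           # Condition 2: a full IW (about one BDP) acknowledged
           # without loss.
           return iw
       return None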
4. IANA Considerations

This document makes no request of IANA.

5. Security Considerations

To be added.

6. References

6.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119,
           March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
           Explicit Congestion Notification (ECN) to IP", RFC 3168,
           September 2001, <https://www.rfc-editor.org/info/rfc3168>.

6.2. Informative References

[Alizadeh2010data]
           Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
           P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data
           Center TCP (DCTCP)", ACM SIGCOMM, pp. 63-74, 2010.

[CiscoGuide2012]
           "Cisco IOS Quality of Service Solutions Configuration
           Guide", 2012.

[Floyd1993random]
           Floyd, S. and V. Jacobson, "Random Early Detection Gateways
           for Congestion Avoidance", IEEE/ACM Transactions on
           Networking 1, 4, pp. 397-413, 1993.

[Guo2016rdma]
           Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and
           M. Lipshteyn, "RDMA over Commodity Ethernet at Scale", ACM
           SIGCOMM, pp. 202-215, 2016.

[Kuzmanovic2005power]
           Kuzmanovic, A., "The Power of Explicit Congestion
           Notification", ACM SIGCOMM, pp. 61-72, 2005.

Authors' Addresses

Huichen Dai (editor)
Huawei
Huawei Mansion, No.3, Xinxi Road, Haidian District
Beijing
China

Email: daihuichen@huawei.com

Binzhang Fu
Huawei
Huawei Mansion, No.3, Xinxi Road, Haidian District
Beijing
China

Email: fubinzhang@huawei.com

Kun Tan
Huawei
Huawei Mansion, No.3, Xinxi Road, Haidian District
Beijing
China

Email: kun.tan@huawei.com