TSVWG                                                            F. Chen
Internet-Draft                                                    W. Sun
Intended status: Informational                                     X. Yu
Expires: January 8, 2020                   Huawei Technologies Co., Ltd.
                                                            R. Even, Ed.
                                                                  Huawei
                                                            July 7, 2019

            Data Center Congestion Management requirements
              draft-yueven-tsvwg-dccm-requirements-01

Abstract

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol or iWARP.  The RoCEv2 specification does not define strong
   congestion management mechanisms or load balancing methods.
   RoCEv2 relies on the existing link-layer flow control, IEEE
   802.1Qbb (Priority-based Flow Control, PFC), to provide a lossless
   fabric.  RoCEv2 Congestion Management (RCM) uses ECN (Explicit
   Congestion Notification, defined in RFC 3168) to signal congestion
   to the destination, and uses the congestion notification to reduce
   the injection rate, increasing it again when the extent of
   congestion decreases.  iWARP depends on TCP congestion handling.
   This document describes the current state of flow control and
   congestion handling in the DC and provides requirements for new
   directions toward better congestion control.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on January 8, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Abbreviations
   4.  Current Congestion Management Mechanisms
     4.1.  Priority-based Flow Control (PFC)
     4.2.  Explicit Congestion Notification
   5.  Congestion Management Practice
     5.1.  Packet Retransmission
     5.2.  Congestion Control Mechanisms
       5.2.1.  RTT-based Congestion Control
       5.2.2.  Credit-based Congestion Control
       5.2.3.  ECN-based Congestion Control
     5.3.  Re-ordering
     5.4.  Load Balancing
       5.4.1.  Equal-cost Multi-path Routing (ECMP)
       5.4.2.  Flowlet
       5.4.3.  Per-packet
   6.  Data Center Congestion Management Requirements
   7.  Summary
   8.  Security Considerations
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.
Introduction

   With emerging Distributed Storage, AI/HPC (High Performance
   Computing), Machine Learning, and similar workloads, modern
   datacenter applications demand high throughput (40 Gbps and above)
   with ultra-low latency of less than 10 microseconds per hop from
   the network, with low CPU overhead.  The high link speeds (>40
   Gb/s) in Data Centers (DC) are making network transfers complete
   faster and in fewer RTTs.  Network traffic in a data center is
   often a mix of short and long flows, where the short flows require
   low latency and the long flows require high throughput.

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol or iWARP [RFC5040].  RoCEv2 [RoCEv2] is a straightforward
   extension of the RoCE protocol that involves a simple modification
   of the RoCE packet format.  RoCEv2 packets carry an IP header,
   which allows traversal of IP L3 routers, and a UDP header that
   serves as a stateless encapsulation layer for the RDMA transport
   protocol packets over IP.

   RoCEv2 Congestion Management (RCM) provides the capability to avoid
   congestion hot spots and optimize the throughput of the fabric.
   RCM relies on the existing link-layer flow control, IEEE 802.1Qbb
   (PFC) [IEEE.802.1QBB_2011], to provide a drop-free network.  RCM
   also uses ECN [RFC3168] to signal congestion to the destination,
   and uses the congestion notification as an input to the sender to
   reduce the injection rate, increasing it again when the extent of
   congestion decreases.  The rate reduction by the sender, as well as
   the increase in data injection, is left to the implementation.

   An enhancement to the congestion handling for RoCEv2 is DCQCN
   [DCQCN], which provides functionality similar to QCN and DCTCP; it
   is implemented in some RoCEv2 NICs but is not part of the RoCEv2
   specification.
   As such, vendors have their own implementations, which makes it
   difficult to interoperate with each other efficiently.

   iWARP [RFC5040] provides a TCP-based transport for RDMA; it is
   implemented in the NIC, leverages TCP retransmission, and does not
   require a lossless fabric.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.

3.  Abbreviations

   RCM - RoCEv2 Congestion Management

   PFC - Priority-based Flow Control

   ECN - Explicit Congestion Notification

   DCQCN - Data Center Quantized Congestion Notification

   AI/HPC - Artificial Intelligence/High-Performance Computing

   ECMP - Equal-Cost Multipath

   NIC - Network Interface Card

4.  Current Congestion Management Mechanisms

4.1.  Priority-based Flow Control (PFC)

   RDMA can be deployed using the RoCEv2 protocol [RoCEv2], which
   relies on IEEE 802.1Qbb Priority-based Flow Control (PFC)
   [IEEE.802.1QBB_2011] to enable a drop-free network.

   PFC is a link-level protocol that allows a receiver to assert flow
   control by requesting that the transmitter pause sending traffic
   for a specified priority.  However, because PFC stops all traffic
   in a particular traffic class at the ingress port, flows destined
   to other ports are also blocked.

   The known problems of PFC are head-of-line blocking, unfairness,
   and deadlock [deadlocks].

4.2.  Explicit Congestion Notification

   Explicit Congestion Notification (ECN) [RFC3168] is used by the
   network to signal that congestion has been detected before packets
   are actually dropped.
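   As a rough illustration of the marking behavior (a RED-style sketch
   only; the thresholds, probability, and function names below are
   hypothetical and not taken from [RFC3168] or any switch
   implementation), an ECN-capable queue marks the CE codepoint
   instead of dropping once its depth crosses a threshold:

```python
import random

def ecn_mark(queue_depth, ect, kmin=20, kmax=80, pmax=0.1):
    """RED-style decision for one arriving packet.

    Returns "forward", "mark" (set the CE codepoint), or "drop".
    kmin/kmax are queue depths in packets and, like pmax, are
    illustrative values only, not taken from any standard.
    """
    if queue_depth < kmin:
        return "forward"                  # queue short: no signal
    if queue_depth >= kmax:
        return "mark" if ect else "drop"  # severe congestion
    # Between kmin and kmax, signal with probability rising to pmax.
    p = pmax * (queue_depth - kmin) / (kmax - kmin)
    if random.random() < p:
        return "mark" if ect else "drop"
    return "forward"
```

   The key point for this document is the first branch of the signal:
   an ECN-capable flow receives a congestion notification while the
   packet is still delivered, rather than being dropped.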
   Data Center TCP (DCTCP) [RFC8257], "TCP Congestion Control for Data
   Centers", is an Informational RFC that extends ECN processing to
   estimate the fraction of bytes that encounter congestion; DCTCP
   then scales the TCP congestion window based on this estimate.
   DCTCP does not change the ECN reporting in TCP.  Other ECN
   notification mechanisms for UDP-based transports are specified for
   RTP in [RFC6679] and for QUIC in [I-D.ietf-quic-transport].  The
   ECN notifications are reported from the end receiver to the sender;
   the notification includes only the occurrence of ECN in the TCP
   case, and the number of ECN-marked packets for RTP and QUIC.

5.  Congestion Management Practice

5.1.  Packet Retransmission

   NICs were not designed to deal with losses efficiently.  The
   receiver discards out-of-order packets, and the sender performs
   go-back-N retransmission on detecting packet loss.  RoCEv2 adopts
   go-back-N loss recovery and needs a lossless layer 2 (provided by
   PFC) for good performance.

   iWARP [RFC5040] provides a TCP-based transport for RDMA; it is
   implemented in the NIC, leverages TCP retransmission, and does not
   require a lossless fabric.

   Based on iWARP congestion and packet loss handling, an experiment
   to optimize congestion control is the improved RoCE NIC design
   [IRN], which makes two key changes to current RoCE NICs: (1)
   improving the loss recovery mechanism (similar to TCP with SACK),
   and (2) adding basic end-to-end flow control (termed BDP-FC) that
   bounds the number of in-flight packets by the bandwidth-delay
   product of the network.  BDP-FC is a static value calculated based
   on the number of hops between the sender and the receiver.  The
   test results show that it provides better congestion handling
   compared to DCQCN [DCQCN].  IRN works without PFC, which is one of
   the concerns when using DCQCN.
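   The BDP-FC bound can be made concrete with a small sketch.  Note
   that [IRN] derives the cap from the network's bandwidth-delay
   product; the specific link speed, per-hop delay, MTU, and function
   names below are hypothetical inputs chosen for illustration:

```python
def bdp_fc_cap(link_gbps, hops, per_hop_delay_us, mtu_bytes=1024):
    """Static in-flight packet cap in the spirit of IRN's BDP-FC.

    Bounds outstanding packets by the bandwidth-delay product,
    using a worst-case RTT estimated from the hop count.  All
    parameter values here are illustrative.
    """
    rtt_s = 2 * hops * per_hop_delay_us * 1e-6  # round trip across 'hops' hops
    bdp_bytes = (link_gbps * 1e9 / 8) * rtt_s   # bytes in flight at line rate
    return max(1, int(bdp_bytes // mtu_bytes))  # cap, at least one packet

def can_send(inflight_packets, cap):
    """Sender-side check: admit a new packet only under the cap."""
    return inflight_packets < cap

# E.g., a 40 Gb/s link, 3 hops, 10 us per hop gives a cap of a few
# hundred packets; the sender stalls once that many are outstanding.
cap = bdp_fc_cap(link_gbps=40, hops=3, per_hop_delay_us=10)
```

   Because the cap is static, no per-RTT measurement or switch
   feedback is needed; the trade-off is that it cannot track dynamic
   queueing delay.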
   Enhancements such as selective retransmission can be considered so
   as not to rely on a lossless network.

5.2.  Congestion Control Mechanisms

5.2.1.  RTT-based Congestion Control

   A typical example of RTT-based congestion control is TIMELY
   [TIMELY].  TIMELY shows that simple packet delay, measured as
   round-trip times at hosts, is an effective congestion signal that
   does not require switch feedback.  TIMELY measures RTT with
   microsecond accuracy, and these RTTs are sufficient to estimate
   switch queuing.  TIMELY adjusts transmission rates using RTT
   gradients to keep packet latency low while delivering high
   bandwidth.  TIMELY is a delay-based congestion control protocol for
   use in the datacenter.

   Because the RDMA transport is in the NIC and is sensitive to packet
   drops, PFC is necessary, as packet drops hurt performance badly.
   TIMELY needs PFC to provide a lossless underlay network.

5.2.2.  Credit-based Congestion Control

   ExpressPass [ExpressPass] is an end-to-end, credit-scheduled,
   delay-bounded congestion control for data centers.  ExpressPass
   uses credit packets to control congestion even before sending data
   packets, which enables it to achieve bounded delay and fast
   convergence.  It uses end-to-end credit transfer for bandwidth
   allocation and fine-grained packet scheduling.

5.2.3.  ECN-based Congestion Control

   Data Center Quantized Congestion Notification (DCQCN) [DCQCN] is an
   end-to-end congestion control scheme for RoCEv2.  DCQCN combines
   ECN and PFC to support end-to-end lossless Ethernet.  The idea
   behind DCQCN is to let ECN perform flow control by decreasing the
   transmission rate at the sender when congestion starts, thereby
   minimizing the time PFC is triggered.  Configuring the ECN and PFC
   thresholds is challenging when there are more routers in the DC.

5.3.
Re-ordering

   When packets arrive at the destination out of order, the
   destination should store them in order to restore the original
   order, assigning dedicated buffer resources to perform the
   re-ordering.  There are many methods to implement re-ordering,
   either on the switches or on the NIC side.

5.4.  Load Balancing

5.4.1.  Equal-cost Multi-path Routing (ECMP)

   RoCEv2 packets carry an opaque flow identifier in the UDP Source
   Port field so that the ECMP method can implement path selection for
   load balancing and improve utilization of the fabric topology.
   Traditional ECMP cannot balance load well in the data center
   network because it splits load at the granularity of a flow.  The
   finer the granularity of load balancing, the more effective the
   load balancing is and the higher the utilization of network
   bandwidth that can be achieved.

5.4.2.  Flowlet

   A typical flowlet-based load balancing scheme is CONGA [CONGA].
   CONGA is a network-based, distributed, congestion-aware load
   balancing mechanism for datacenters.  It splits TCP flows into
   flowlets, estimates real-time congestion on fabric paths, and
   allocates flowlets to paths based on feedback from remote switches.

   Flowlets are bursts of packets from a flow.  The idle interval
   between two bursts of packets is larger than the maximum difference
   in latency among the paths, so the second burst can be sent along a
   different path than the first without reordering packets.

5.4.3.  Per-packet

   Packet-based load balancing is the most effective because its
   granularity is the smallest.  The consequence is that packets
   belonging to the same flow will be allocated to different paths.
   When the forwarding delays of the paths differ, packets may arrive
   at the receiver out of order.

6.  Data Center Congestion Management Requirements

   The first issue is incast traffic.
   Network congestion happens in network routers when the incoming
   traffic is larger than the bandwidth of the outgoing link on which
   it has to be transmitted.  Congestion is the primary source of loss
   in the network, and congestion leads to performance degradation.

   The data sender makes its congestion management decisions based on
   information from the data receiver, which provides only partial
   information about the state of the network itself.

   Another issue to address is packet loss due to out-of-order
   packets, which may happen when load balancing is used.  RoCEv2
   adopts go-back-N loss recovery and requires a lossless fabric to
   prevent retransmission, but this does not address the packet loss
   caused by re-ordering.

   RoCEv2 relies on link-layer flow control, IEEE 802.1Qbb (PFC)
   [IEEE.802.1QBB_2011], to provide a lossless underlay network.  A
   lossless network is implemented by a flow control mechanism that
   pauses traffic, at priority granularity, on the incoming link
   before the buffer overfills, thereby preventing packet drops
   [CongestionManagment].  However, PFC can lead to poor application
   performance due to problems like head-of-line blocking and
   unfairness [DCQCN].

   Although DCQCN is widely deployed, due to the lack of a formal
   specification, vendors have their own implementations, which makes
   it difficult to interoperate with each other efficiently.
   Moreover, potential new congestion control mechanisms should also
   be designed to be compatible with existing ones.

   In addition, with the growth of RDMA fabrics, the mixture of RDMA
   traffic and normal TCP traffic may also raise issues, because the
   two employ different flow control and congestion control
   mechanisms.
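   To illustrate how much behavior is left to implementations, the
   following is a much-simplified DCQCN-style sender rate update in
   the spirit of [DCQCN].  The gain constant, the recovery rule, and
   the function names are hypothetical choices; these are exactly the
   kinds of detail that can differ between vendor implementations:

```python
def on_cnp(rate, target, alpha, g=1/256):
    """On a congestion notification packet (CNP): raise the
    congestion estimate alpha, remember the current rate as the
    recovery target, and cut the rate multiplicatively."""
    alpha = (1 - g) * alpha + g     # congestion seen: raise alpha
    target = rate                   # remember rate before the cut
    rate = rate * (1 - alpha / 2)   # multiplicative decrease
    return rate, target, alpha

def on_quiet_period(rate, target, alpha, g=1/256):
    """No notifications for a timer period: decay alpha and close
    half of the gap back toward the pre-cut target rate."""
    alpha = (1 - g) * alpha         # no congestion: decay alpha
    rate = (rate + target) / 2      # fast recovery toward target
    return rate, target, alpha

# E.g., a sender at 40 (Gb/s) with alpha=1.0 halves its rate on a
# CNP, then recovers halfway back after a quiet period.
rate, target, alpha = on_cnp(40.0, 40.0, 1.0)
rate, target, alpha = on_quiet_period(rate, target, alpha)
```

   Every constant and rule above (the value of g, the cut of
   alpha/2, the halfway recovery step) is a tunable that a formal
   specification would need to pin down for implementations to
   interoperate predictably.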
   In order to achieve high throughput and low latency in the large-
   scale datacenter network, the following requirements for datacenter
   network congestion management are suggested:

   o  Resolve incast traffic in the network.

   o  Provide more efficient network congestion management for RDMA
      traffic to avoid retransmission.

   o  Provide better interoperability between vendors.

   o  Provide fairness for a mixture of RDMA traffic and normal TCP
      traffic.

   o  Provide compatibility when more than one congestion control
      mechanism is used.

7.  Summary

   As discussed in Section 6, we need an enhancement to current RDMA
   transport protocols with stronger congestion management
   capabilities to achieve high throughput and low latency in the
   large-scale datacenter network.  Network co-operation can help get
   better information to the data sender.  The solution should also
   place more flexible requirements on the underlay network.  The
   solution should enable better congestion management capabilities
   and interoperability for RoCEv2 and iWARP in the data center
   environment.

8.  Security Considerations

   TBD

9.  IANA Considerations

   No IANA action.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

10.2.  Informative References

   [CONGA]    Alizadeh, M., Edsall, T., Dharmapurikar, S.,
              Vaidyanathan, R., Chu, K., Lam, V. T., Matus, F., Pan,
              R., Yadav, N., and G. Varghese, "CONGA: Distributed
              Congestion-Aware Load Balancing for Datacenters",
              February 2015.

   [CongestionManagment]
              "Understanding RoCEv2 Congestion Management",
              December 2018.
   [DCQCN]    Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn,
              M., Liron, Y., Padhye, J., Raindel, S., Yahia, M. H.,
              and M. Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM Computer Communication Review,
              Vol. 45, pp. 523-536, August 2015.

   [deadlocks]
              Hu, S., Zhu, Y., Cheng, P., Guo, C., Tan, K., Padhye,
              J., and K. Chen, "Deadlocks in Datacenter Networks: Why
              Do They Form, and How to Avoid Them", November 2016.

   [ExpressPass]
              Cho, I., Han, D., and K. Jang, "ExpressPass: End-to-End
              Credit-based Congestion Control for Datacenters",
              October 2016.

   [I-D.ietf-quic-transport]
              Iyengar, J. and M. Thomson, "QUIC: A UDP-Based
              Multiplexed and Secure Transport", draft-ietf-quic-
              transport-20 (work in progress), April 2019.

   [IEEE.802.1QBB_2011]
              IEEE, "IEEE Standard for Local and metropolitan area
              networks--Media Access Control (MAC) Bridges and Virtual
              Bridged Local Area Networks--Amendment 17: Priority-
              based Flow Control", IEEE 802.1Qbb-2011,
              DOI 10.1109/ieeestd.2011.6032693, September 2011.

   [IRN]      Mittal, R., Shpiner, A., Panda, A., Zahavi, E.,
              Krishnamurthy, A., Ratnasamy, S., and S. Shenker,
              "Revisiting Network Support for RDMA", Proceedings of
              the 2018 Conference of the ACM Special Interest Group
              on Data Communication (SIGCOMM '18), August 2018.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
              Garcia, "A Remote Direct Memory Access Protocol
              Specification", RFC 5040, DOI 10.17487/RFC5040, October
              2007, <https://www.rfc-editor.org/info/rfc5040>.

   [RFC6679]  Westerlund, M., Johansson, I., Perkins, C., O'Hanlon,
              P., and K. Carlberg, "Explicit Congestion Notification
              (ECN) for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679,
              August 2012, <https://www.rfc-editor.org/info/rfc6679>.
   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert,
              L., and G. Judd, "Data Center TCP (DCTCP): TCP
              Congestion Control for Data Centers", RFC 8257,
              DOI 10.17487/RFC8257, October 2017,
              <https://www.rfc-editor.org/info/rfc8257>.

   [RoCEv2]   InfiniBand Trade Association, "Supplement to InfiniBand
              Architecture Specification Volume 1 Release 1.2.2,
              Annex A17: RoCEv2 (IP Routable RoCE)".

   [TIMELY]   Mittal, R., Lam, T., Dukkipati, N., Blem, E., Wassel,
              H., Ghobadi, M., Vahdat, A., Wang, Y., Wetherall, D.,
              and D. Zats, "TIMELY: RTT-based Congestion Control for
              the Datacenter", August 2015.

Authors' Addresses

   Fei Chen
   Huawei Technologies Co., Ltd.

   Email: chenfei57@huawei.com

   Wenhao Sun
   Huawei Technologies Co., Ltd.

   Email: sam.sunwenhao@huawei.com

   Xiang Yu
   Huawei Technologies Co., Ltd.

   Email: yolanda.yu@huawei.com

   Roni Even (editor)
   Huawei

   Email: roni.even@huawei.com