TSVWG                                                            R. Even
Internet-Draft                                                    M. Liu
Intended status: Informational                                  Y. Zhang
Expires: August 7, 2020                                           Huawei
                                                        February 4, 2020

               Data Center Fast Congestion Management
            draft-even-tsvwg-datacenter-fast-congestion-00

Abstract

   A good congestion control for data centers (DC) should provide low
   latency, fast convergence, and high link utilization.  Since
   multiple applications with different requirements may run on the DC
   network, it is important to provide fairness between applications
   that may use different congestion control algorithms.  An important
   goal from the user perspective is to achieve a short Flow
   Completion Time (FCT).  This document proposes a direction for data
   center congestion control that aims to achieve high performance
   while providing fairness.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on August 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Congestion Handling Cases
     3.1.  Congestion only in leaf switch connected to receiver
     3.2.  Congestion in the Spine switch
       3.2.1.  ECN case
       3.2.2.  Spine and leaf switches share information
       3.2.3.  FCR from spine and leaf switches
     3.3.  Congestion in leaf switch connected to data sender
   4.  Summary
   5.  Rate Information
   6.  Requirements
   7.  Implementation Options
   8.  Test Results
     8.1.  Many senders to one receiver
   9.  Security Considerations
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The major use case that we are looking at is congestion control for
   Data Centers (DC), a controlled environment as specified in
   [RFC8085].  With the emerging Distributed Storage, AI/HPC (High
   Performance Computing), Machine Learning, and similar workloads,
   modern data center applications demand high throughput (40 Gbps and
   above) with ultra-low latency of less than 10 microseconds per hop
   from the network, at low CPU overhead.  The end-to-end latency
   should be less than 50 microseconds; this value is based on DCQCN
   [DCQCN].  The high link speeds (>40 Gb/s) in data centers make
   network transfers complete faster and in fewer RTTs.  Network
   traffic in a data center is often a mix of short and long flows,
   where the short flows require low latency and the long flows
   require high throughput.

   A good congestion control for data centers should provide low
   latency, fast convergence, and high link utilization.  Since
   multiple applications with different requirements may run on the DC
   network, it is important to provide fairness between applications
   that may use different congestion control algorithms.  An important
   goal from the user perspective is to achieve a short Flow
   Completion Time (FCT).

   A typical DC architecture is composed of a spine-leaf topology in
   which a flow traverses at most three switches.  Looking from the
   flow's point of view, we can assume that there is a low probability
   of congestion at the first-hop switch; congestion is more likely to
   occur at the spine or at the last hop.  The figure below shows a
   simple spine-leaf topology; in a typical DC there will be multiple
   spines and leaves.

                            --------
                            |Spine1|
                            --------
                             /    \
                            /      \
                           /        \
       ------   -------              -------   ------
       |NIC |___|Leaf1|              |Leaf2|___|NIC |
       |Send|   -------              -------   |Recv|
       ------                                  ------
2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
   and "OPTIONAL" in this document are to be interpreted as described
   in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in
   all capitals, as shown here.

3.  Congestion Handling Cases

3.1.  Congestion only in leaf switch connected to receiver

   The leaf switch is congested and does not receive any ECN CE
   marking on incoming streams.  The leaf switch sends an FCR (Fast
   Congestion Response) message to all sending NICs.  The general case
   requires that the leaf switch know who the senders are and whether
   they support FCR.  It is also necessary to define how the congested
   leaf connects to the senders and delivers the FCR message to them.
   If not all senders whose streams are congesting the same egress
   port support FCR, the congested leaf switch will fall back to using
   ECN CE marking toward the receiver.  Another option is to send FCR
   to the senders that support it and use ECN CE marking on the flows
   from senders that do not; in this case the switch should wait at
   least one RTT before sending a second FCR, to allow all senders to
   reduce their sending rate.

3.2.  Congestion in the Spine switch

   There are a couple of options for supporting this case.  The spine
   and leaf switches will need to be aware of which option is in use.

3.2.1.  ECN case

   The leaf switch receives ECN CE marks from the spine.  The leaf
   switch does not know what rate information it could send,
   regardless of whether it is itself congested.  The leaf switch will
   convey the ECN marking to the receiver.

3.2.2.  Spine and leaf switches share information

   The spine switch provides rate/congestion information to the
   downstream leaf switch.  The leaf switch, whether congested or not,
   will be responsible for sending the FCR message to the sending
   NICs.  The information from the spine may provide rate information
   using an FCR-like message.

3.2.3.  FCR from spine and leaf switches

   The spine switch will send FCR to the sending NICs and will not
   send ECN marking to the downstream leaf switch.  In this option, if
   there is also congestion on the downstream leaf, a second FCR
   message will be sent from the leaf to the sending NICs, which will
   have to use the lower of the recommended rates.

3.3.  Congestion in leaf switch connected to data sender

   This case has lower probability, but in case of congestion the leaf
   switch will send an FCR message to all NICs contributing flows to
   the congested egress port.  If FCR is not supported by all the
   congesting NICs, the switch will CE-mark those flows; this will
   cause the FCR-supporting NICs to respond faster, and the switch
   should give the other streams time to respond (wait a little over
   an RTT) before sending another FCR, as illustrated by the sketch
   below.
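   The fallback logic of Sections 3.1 and 3.3 can be summarized in a
   short sketch.  It is illustrative only: the Sender record, the
   send_fcr() and mark_ce() helpers, and the RTT constant are
   assumptions made for this example and are not part of any
   specification.

      from dataclasses import dataclass

      @dataclass
      class Sender:
          addr: str           # sender NIC address
          supports_fcr: bool  # capability announced by the sender

      RTT = 50e-6  # assumed end-to-end latency budget (Section 1)

      def notify_senders(senders, rate, now, last_fcr,
                         send_fcr, mark_ce):
          """Handle one congestion event on an egress port and
          return the time of the most recent FCR."""
          # All senders support FCR: pure FCR mode.
          if all(s.supports_fcr for s in senders):
              for s in senders:
                  send_fcr(s, rate)
              return now
          # No sender supports FCR: fall back to ECN CE marking.
          if not any(s.supports_fcr for s in senders):
              mark_ce(senders)
              return last_fcr
          # Mixed mode: CE-mark the non-supporting flows, and wait
          # at least one RTT before the next FCR so that all senders
          # have had a chance to reduce their sending rate.
          mark_ce([s for s in senders if not s.supports_fcr])
          if now - last_fcr >= RTT:
              for s in senders:
                  if s.supports_fcr:
                      send_fcr(s, rate)
              return now
          return last_fcr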
4.  Summary

   If all NICs currently sending data through the leaf switch support
   FCR messages, it is safe to use FCR; if the congestion is in the
   spine switch, the action will be according to the options in
   Section 3.2.

   If the leaf switch knows that not all NICs sending data through the
   switch support FCR, the leaf switch may fall back to ECN marking.
   Another option is to use a mixed mode, sending FCR to supporting
   NICs and ECN marks toward the receiver; the senders that support
   FCR should use the received FCR and ignore the ECN feedback from
   the receiver.

   In the case where there are multiple congestion points, the NIC
   should use the lowest rate information from all received FCRs.

5.  Rate Information

   The leaf switch needs to supply rate information using the FCR
   message.  The same rate information will be sent to all data
   senders to the congested port, regardless of the rate each of them
   needs.  This may cause underutilization of the available bandwidth
   if some of them have no need for the full recommended rate; this
   will be addressed by the leaf switch sending updated rate
   information, based on the current usage, after a number of RTTs.
   The leaf switch may also send an updated FCR message when more
   bandwidth becomes available, for example when senders stop sending.
   Note that sending such information may cause congestion on upstream
   switches; another option is to rely on the sender's congestion
   control to raise the sending rate according to its CC algorithm.

   In the tests that were done so far, all senders received the same
   rate information.  We still need to specify the content of the rate
   information in the FCR message (bits/sec, or a number of bytes to
   send, similar to the wnd in TCP).
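   To illustrate how the two candidate encodings relate, the sketch
   below converts a recommended rate into an equivalent per-RTT byte
   allowance.  The function names and the equal-share policy are
   assumptions based on the test behavior described above, not a
   specified algorithm, and the 50-microsecond RTT is taken from the
   Section 1 latency budget.

      def fair_share_bps(link_bps, n_senders):
          # Behavior used in the tests so far: every sender to the
          # congested port receives the same recommended rate.
          return link_bps / n_senders

      def rate_to_window_bytes(rate_bps, rtt_s):
          # Equivalent per-RTT byte allowance, similar to the TCP wnd.
          return int(rate_bps * rtt_s / 8)

      # Example: a 25 Gbit/s port shared by five long flows, 50 us RTT.
      rate = fair_share_bps(25e9, 5)            # 5 Gbit/s per sender
      wnd = rate_to_window_bytes(rate, 50e-6)   # 31250 bytes per RTT

      # With multiple congestion points (Section 4) the NIC keeps the
      # minimum over the rates received in all FCR messages.
      effective = min(rate, 3e9)                # e.g., a lower second FCR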
6.  Requirements

   Supporting FCR in the above use cases requires the following:

   1.  The congested leaf should be able to know which data sources
       support FCR.

   2.  The congested leaf should be able to send the FCR message
       in-path, for example by using TCP/UDP options or the UDP
       application's back channel.  Another option is to establish a
       connection to the data senders and send FCR messages over it.

   3.  A sender should be able to start sending at the maximum rate if
       the new stream is the only stream sent by that sender.

7.  Implementation Options

   The FCR message from the network to the data sender MUST only be
   deployed in a controlled environment [RFC8085] such as a data
   center.  The FCR message should provide an identification of the
   stream, for example by providing the source and destination IP
   addresses and port numbers of the flow.

   FCR should only be deployed in an intra-data-center environment
   where both the endpoints and the switching fabric are under a
   single administrative domain.  FCR MUST NOT be deployed over the
   public Internet.

   1.  The tests are based on RoCEv2 [RoCEv2] using a revised CNP
       message and assume all senders support FCR.  To use this option
       for RoCEv2, the data sender should signal support for the
       revised CNP message; this allows the leaf switch to know
       whether it can send back the revised CNP.  This implementation
       mode is for testing only; we do not propose it as a solution.

   2.  For the proposed solution for the general case there may be a
       couple of options.  The preference is to use a generic message
       at the transport level (TCP/UDP); otherwise a different message
       per application will be needed.  The suggested proposal is to
       use a new TCP option [RFC0793] (and [RFC2460] for IPv6) that
       can be piggybacked on the ACK message.  There should be an FCR
       support option sent by the data sender.  For UDP, where a back
       channel usually exists in the application layer, we can use UDP
       options [I-D.ietf-tsvwg-udp-options] for announcing FCR
       support, and use the application back channel, an application
       extension, or a UDP option to send the FCR (in the testing we
       used a revised CNP message for RoCEv2).  Another option is to
       use an IOAM-like mechanism (the general IOAM specification is
       [I-D.ietf-ippm-ioam-data], the loopback option is in
       [I-D.ietf-ippm-ioam-flags], and sending the message from the
       leaf switch can be based on
       https://tools.ietf.org/id/draft-ioamteam-ippm-ioam-direct-export-00.txt).
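   To make the flow identification described above concrete, the
   sketch below shows one hypothetical way to lay out an FCR message
   carrying the source and destination IP addresses and ports plus a
   recommended rate.  No wire format has been defined for FCR; the
   field layout, field sizes, and rate unit here are assumptions made
   for illustration only.

      import socket
      import struct

      # Hypothetical layout, IPv4 assumed: source IP, destination IP,
      # source port, destination port, recommended rate in Mbit/s.
      FCR_FMT = "!4s4sHHI"

      def pack_fcr(src_ip, dst_ip, src_port, dst_port, rate_mbps):
          return struct.pack(FCR_FMT,
                             socket.inet_aton(src_ip),
                             socket.inet_aton(dst_ip),
                             src_port, dst_port, rate_mbps)

      def unpack_fcr(buf):
          src, dst, sport, dport, rate = struct.unpack(FCR_FMT, buf)
          return (socket.inet_ntoa(src), socket.inet_ntoa(dst),
                  sport, dport, rate)

      # Example: recommend 5 Gbit/s for one RoCEv2 flow (UDP port 4791).
      msg = pack_fcr("10.0.0.1", "10.0.0.2", 4791, 4791, 5000)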
8.  Test Results

   Note: this section can become an appendix later if relevant.

8.1.  Many senders to one receiver

   In this test scenario there were six senders and one receiver on a
   single switch with 25 Gbit/s connections.  Five senders were
   sending long flows to create congestion, and the sixth sender sent
   continuous 8-byte packets to measure latency.

   +------------------+--------+------------+-----------------------+
   | Network Average  | NIC CC | Network CC | Improvement           |
   | Load             |        |            | percentage            |
   +------------------+--------+------------+-----------------------+
   | 30%              | 1.61   | 1.61       | 0.00%                 |
   | 50%              | 2.68   | 2.68       | 0.00%                 |
   | 80%              | 4.23   | 4.24       | 0.24%                 |
   | 100%             | 4.36   | 4.51       | 3.44%                 |
   +------------------+--------+------------+-----------------------+

                      Sender NIC bandwidth (Gbps)

   The bandwidth of NIC CC and Network CC is almost the same in the
   long-flow case.

   +------------------+--------+------------+-----------------------+
   | Network Average  | NIC CC | Network CC | Improvement           |
   | Load             |        |            | percentage            |
   +------------------+--------+------------+-----------------------+
   | 30%              | 5.89   | 5.79       | 1.70%                 |
   | 50%              | 6.04   | 6.04       | 0.00%                 |
   | 80%              | 7.33   | 6.67       | 9.00%                 |
   | 100%             | 7.45   | 6.78       | 8.99%                 |
   +------------------+--------+------------+-----------------------+

                  Latency flow result (us) - Average

   +------------------+--------+------------+-----------------------+
   | Network Average  | NIC CC | Network CC | Improvement           |
   | Load             |        |            | percentage            |
   +------------------+--------+------------+-----------------------+
   | 30%              | 8.89   | 7.65       | 13.95%                |
   | 50%              | 9.14   | 8.46       | 7.44%                 |
   | 80%              | 12.94  | 8.74       | 32.46%                |
   | 100%             | 11.6   | 8.74       | 24.66%                |
   +------------------+--------+------------+-----------------------+

                    Latency flow result (us) - 99%

   +------------------+--------+------------+-----------------------+
   | Network Average  | NIC CC | Network CC | Improvement           |
   | Load             |        |            | percentage            |
   +------------------+--------+------------+-----------------------+
   | 30%              | 21.77  | 8.79       | 59.62%                |
   | 50%              | 24.89  | 11.8       | 52.59%                |
   | 80%              | 23.45  | 9.36       | 60.09%                |
   | 100%             | 22.91  | 9.19       | 59.89%                |
   +------------------+--------+------------+-----------------------+

                   Latency flow result (us) - 99.9%

   We can see that the average latency is reduced by up to 9%, and
   that the 99.9% latency, which reflects the maximum queue size, is
   reduced by up to 60%.

   The results show that in the long-flow many-to-one situation,
   Network CC achieves the same bandwidth as NIC CC with better
   latency for the mice flows.

9.  Security Considerations

   The FCR message is hard to secure; sending an FCR message from the
   network to the source has security risks, since it can easily be
   used for a DoS attack.  This solution must only be used in a
   managed network [RFC8085].  The FCR message must be terminated
   within the managed network and should not cross the network domain
   boundary.

   Since this message is sent in a closed, managed network, it does
   not have the same security concerns as the ICMP Source Quench
   message [RFC5927] on the general Internet.

   An attacker can send an FCR message with lower or higher rate
   information.  This may cause underutilization of the network, or
   congestion.  The network entity closest to the receiver should
   raise an alert if an unexpected rate is being used, which may hint
   that such an attack is taking place.  A sender may also try to
   verify that the FCR message carries rate information in the
   expected range.

10.  IANA Considerations

   TBD

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

11.2.  Informative References

   [DCQCN]    Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn,
              M., Liron, Y., Padhye, J., Raindel, S., Yahia, M. H.,
              and M. Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM Computer Communication Review,
              Vol. 45, pp. 523-536, August 2015.

   [I-D.ietf-ippm-ioam-data]
              Brockners, F., Bhandari, S., Pignataro, C., Gredler, H.,
              Leddy, J., Youell, S., Mizrahi, T., Mozes, D., Lapukhov,
              P., Chang, R., daniel.bernier@bell.ca, d., and J. Lemon,
              "Data Fields for In-situ OAM",
              draft-ietf-ippm-ioam-data-07 (work in progress),
              September 2019.

   [I-D.ietf-ippm-ioam-flags]
              Mizrahi, T., Brockners, F., Bhandari, S., Sivakolundu,
              R., Pignataro, C., Kfir, A., Gafni, B., Spiegel, M., and
              J. Lemon, "In-situ OAM Flags",
              draft-ietf-ippm-ioam-flags-00 (work in progress),
              October 2019.

   [I-D.ietf-tsvwg-udp-options]
              Touch, J., "Transport Options for UDP",
              draft-ietf-tsvwg-udp-options-08 (work in progress),
              September 2019.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC2460]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", RFC 2460, DOI 10.17487/RFC2460,
              December 1998, <https://www.rfc-editor.org/info/rfc2460>.

   [RFC5927]  Gont, F., "ICMP Attacks against TCP", RFC 5927,
              DOI 10.17487/RFC5927, July 2010,
              <https://www.rfc-editor.org/info/rfc5927>.

   [RFC8085]  Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
              Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
              March 2017, <https://www.rfc-editor.org/info/rfc8085>.

   [RoCEv2]   InfiniBand Trade Association, "Supplement to InfiniBand
              Architecture Specification Volume 1 Release 1.2.2 Annex
              A17: RoCEv2 (IP Routable RoCE)".

Authors' Addresses

   Roni Even
   Huawei

   Email: roni.even@huawei.com

   Mengzhu Liu
   Huawei

   Email: liumengzhu@huawei.com

   Yali Zhang
   Huawei

   Email: zhangyali369@huawei.com