idnits 2.17.1  draft-zhuang-tsvwg-open-cc-architecture-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------
     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------
     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------
     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------
  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year
  == Line 435 has weird spacing: '...5% comp ms |...'
  == Line 456 has weird spacing: '...5% comp ms |...'
  == Line 479 has weird spacing: '...5% comp ms |...'
  == Line 498 has weird spacing: '...5% comp ms |...'
  == Line 516 has weird spacing: '...5% comp ms |...'
  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.
  -- The document date (November 3, 2019) is 1637 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------
  -- Obsolete informational reference (is this intentional?): RFC 8312
     (Obsoleted by RFC 9438)

  Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

  Run idnits with the --verbose option for more detailed information about
  the items above.

--------------------------------------------------------------------------------

TSVWG                                                          Y. Zhuang
Internet-Draft                                                    W. Sun
Intended status: Informational                                    L. Yan
Expires: May 6, 2020                       Huawei Technologies Co., Ltd.
                                                        November 3, 2019


  An Open Congestion Control Architecture for High Performance Fabrics
               draft-zhuang-tsvwg-open-cc-architecture-00

Abstract

   This document describes an open congestion control architecture for
   high performance fabrics that allows cloud operators and algorithm
   developers to deploy or develop new congestion control algorithms on
   smart NICs, and to configure them appropriately for their traffic,
   in a more efficient and flexible way.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 6, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Abbreviations
   4.  Observations in a storage network
   5.  Requirements of the open congestion control architecture
   6.  Open Congestion Control (OpenCC) Architecture Overview
     6.1.  Congestion Control Platform and its user interfaces
     6.2.  Congestion Control Engine (CCE) and its interfaces
   7.  Interoperability Considerations
     7.1.  Negotiating the congestion control algorithm
     7.2.  Negotiating the congestion control parameters
   8.  Security Considerations
   9.  Manageability Considerations
   10. IANA Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Experiments
   Authors' Addresses

1.  Introduction

   Datacenter networks (DCNs) today not only carry tenant traffic over
   the TCP/IP protocol stack, but are also required to carry RDMA
   traffic for High Performance Computing (HPC) and distributed storage
   applications, which demand low latency and high throughput.

   For such datacenter applications, the latency and throughput
   requirements are therefore more stringent than for ordinary Internet
   traffic, and the network congestion and queuing caused by incast are
   the main sources of increased latency and reduced throughput.
   Accordingly, congestion control algorithms aimed at low latency and
   high bandwidth have been proposed, such as DCTCP [RFC8257] and [BBR]
   for TCP, and [DCQCN] for [RoCEv2].

   Besides congestion control, reducing the CPU overhead of protocol
   processing is another way to improve transmission efficiency for low
   latency applications.  By offloading part of the protocol processing
   onto smart NICs and bypassing the CPU, applications can write
   directly to hardware, which reduces transmission latency.  RDMA over
   RoCEv2 is currently a good example of the benefit of bypassing the
   kernel/CPU, while TCP offloading is also under discussion in
   [NVMe-oF].

   In general, on the one hand, cloud operators and application
   developers are working on new congestion control algorithms to meet
   the requirements of applications such as HPC, AI, and storage in
   high performance fabrics; on the other hand, smart NIC vendors are
   offloading data plane and control plane functions onto hardware to
   reduce processing latency and improve performance.  This raises the
   question of how smart NICs can be optimized by offloading functions
   onto hardware while still giving customers the flexibility to
   develop or change their congestion control algorithms and to run
   their experiments more easily.
   That said, it would be beneficial to have an open, modular design
   for congestion control on smart NICs, so that new algorithms can be
   developed and deployed while still taking advantage of hardware
   offloading in a generic way.

   This document describes an open congestion control architecture for
   high performance fabrics on smart NICs, which allows cloud operators
   and application developers to install or develop new congestion
   control algorithms and to select appropriate controls in a more
   efficient and flexible way.

   The document focuses only on the basic functionality and discusses
   common interfaces towards the network environment, administrators,
   and application developers; detailed implementations are vendor-
   specific designs and are out of scope.

   Discussions of new congestion control algorithms and improved active
   queue management (AQM) are also out of scope for this document.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Abbreviations

   IB - InfiniBand

   HPC - High Performance Computing

   ECN - Explicit Congestion Notification

   AI/HPC - Artificial Intelligence/High Performance Computing

   RDMA - Remote Direct Memory Access

   NIC - Network Interface Card

   AQM - Active Queue Management

4.  Observations in a storage network

   Besides easing the development of new congestion control algorithms
   by developers while still benefiting from the hardware offloading
   improvements made by NIC vendors, we observe that there are also
   benefits in choosing a proper algorithm for each specific traffic
   pattern.

   As stated above, several congestion control algorithms exist for low
   latency, high throughput datacenter applications, and the industry
   is still working on enhanced algorithms for the requirements of new
   applications in the high performance area.  This raises two
   questions: how to select a proper congestion control algorithm for a
   network, and whether a single selected algorithm is efficient and
   sufficient for all traffic in that network.

   To study these questions, we use a simplified storage network as a
   use case.  This typical network mainly carries two traffic types:
   query and backup.  Query traffic is latency sensitive, while backup
   traffic requires high throughput.  We selected several well-known
   TCP congestion control algorithms (Reno [RFC5681], CUBIC [RFC8312],
   DCTCP [RFC8257], and BBR [BBR]) for this study.

   Two sets of experiments were run to evaluate the performance of
   these algorithms for the different traffic types (i.e., traffic
   patterns).  The first set studies the performance when one algorithm
   is used for both traffic types; the second set runs the two traffic
   types with different combinations of congestion control algorithms.
   The detailed experiments and test results can be found in
   Appendix A.

   According to the results of the first experiment set, BBR performs
   better than the others when applied to both traffic types, while in
   the second experiment set some algorithm combinations show better
   performance than using the same algorithm for both, even compared
   with BBR.

   As such, we believe there are benefits in letting different traffic
   patterns in the same network use their own algorithms to achieve
   better overall performance.  From a cloud operations perspective,
   this is another reason to have an open congestion control framework
   on the NIC that can select a proper algorithm for each traffic
   pattern.
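   On ordinary hosts, the same idea of per-traffic-pattern algorithm
   selection can already be expressed per connection; for example, the
   Linux host stack exposes the TCP_CONGESTION socket option.  The
   sketch below illustrates that host-side mechanism only: it is not
   the OpenCC interface described later in this document, the algorithm
   names must be available on the host, and the chosen combination
   simply reflects one of the pairings examined in Appendix A.

   /*
    * Illustration only: per-connection congestion control selection on
    * a standard Linux host stack.  Error handling is minimal.
    */
   #include <netinet/in.h>
   #include <netinet/tcp.h>
   #include <string.h>
   #include <sys/socket.h>

   /* Pick a congestion control algorithm for one connection. */
   static int set_cc(int sock, const char *algo)
   {
       return setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION,
                         algo, strlen(algo));
   }

   /* Apply a per-traffic-pattern policy: latency-sensitive query
    * traffic and throughput-oriented backup traffic get different
    * algorithms (one combination studied in Appendix A). */
   static int apply_traffic_policy(int query_sock, int backup_sock)
   {
       if (set_cc(query_sock, "bbr") < 0)
           return -1;
       if (set_cc(backup_sock, "dctcp") < 0)
           return -1;
       return 0;
   }

   In the OpenCC architecture described below, the NIC-resident
   congestion control platform would play the analogous selection role
   for offloaded transports such as RoCEv2, where no such host socket
   option exists.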
5.  Requirements of the open congestion control architecture

   Based on these observations, the architecture design is suggested to
   follow several principles:

   o  Support developers in writing their own congestion control
      algorithms for NICs, while keeping the benefit of the congestion
      control offloading provided by NIC vendors.

   o  Support vendors in optimizing NIC performance through hardware
      offloading, while allowing users to deploy and select new
      congestion control algorithms.

   o  Support configuration of congestion controls by administrators
      according to traffic patterns.

   o  Support settings from applications to express their QoS
      requirements.

   o  Be transport protocol independent; for example, support both TCP
      and RoCE.

6.  Open Congestion Control (OpenCC) Architecture Overview

   The architecture shown in Figure 1 includes only the congestion
   control related components; components for other functions are
   omitted.  The OpenCC architecture consists of three layers.

   The bottom layer is the congestion control engine, which provides
   common function blocks that are independent of the transport
   protocol and can be implemented in hardware.  The middle layer is
   the congestion control platform, in which the different congestion
   control algorithms are deployed.  These algorithms can be installed
   by NIC vendors or developed by algorithm developers.  Finally, the
   top layer provides the interfaces (i.e., APIs) to users: to
   administrators, who select proper algorithms and set proper
   parameters for their networks; to applications, which indicate their
   QoS requirements, which are in turn mapped to runtime settings of
   congestion control parameters; and to algorithm developers, who
   write their own algorithms.

               +------------+  +-----------------+  +---------------+
   User        | Parameters |  | Application(run |  | CC developers |
   interfaces  |            |  |  time settings) |  |               |
               +-----+------+  +--------+--------+  +-------+-------+
                     |                  |                   |
                     |                  |                   |
                     |                  |                   |
               +-----+------------------+---------+         |
               |  Congestion control algorithms   |         |
               |      +-----------------+  <-----------------+
   CC platform |     +-----------------+ |        |
               |    +-----------------+| |        |
               |    | CC algorithm#1  |+-+        |
               |    +-----------------+           |
               +--+--------+----------+---------+-+
                  |        |          |         |
                  |        |          |         |
               +--+--+  +--+----+  +--+-----+  ++-----+
               |     |  |       |  |        |  |      |   / NIC signals
   CC Engine   |Token|  |Packet |  |Schedule|  |CC    |  /--------------
               |mgr  |  |Process|  |        |  |signal|  \--------------
               +-----+  +-------+  +--------+  +------+   \ Network signals

           Figure 1.  The architecture of open congestion control

6.1.  Congestion Control Platform and its user interfaces

   The congestion control platform is a software environment in which
   various congestion control algorithms are deployed and configured.
   It offers three types of interfaces to the user layer, for different
   purposes.

   The first is for administrators, who use it to select proper
   congestion control algorithms for their network traffic and to
   configure the corresponding parameters of the selected algorithms.

   The second can be an interface defined by NIC vendors or developers
   that provides APIs for application developers to express their QoS
   requirements, which are then mapped to runtime configuration of the
   controls.

   The last is for algorithm developers to write their own algorithms
   for the system.  It is suggested to define a common language for
   writing algorithms, which can then be compiled by vendor-specific
   environments (which may provide toolkits or libraries) to generate
   platform dependent code.  A sketch of what such an algorithm
   interface might look like is given below.
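   As an illustration of the kind of developer-facing interface the
   platform could provide, the following C sketch shows a callback-
   based algorithm module that is fed by the signals described in
   Section 6.2 and returns rate or window decisions to the engine.  All
   type and function names here are hypothetical assumptions and are
   not defined by this document; a real platform would define its own
   common language or API, and vendor toolchains would compile it into
   platform dependent code.

   /*
    * Hypothetical sketch of a developer-facing algorithm interface.
    * None of these names are defined by this document; they only
    * illustrate the callback style: signals in, decisions out.
    */
   #include <stdint.h>

   /* Signals collected by the CC engine (Section 6.2): network signals
    * such as ECN marks and RTT samples, and NIC signals such as local
    * queue occupancy. */
   struct opencc_signals {
       uint32_t acked_bytes;   /* newly acknowledged bytes            */
       uint32_t lost_bytes;    /* bytes declared lost                 */
       uint32_t ecn_marked;    /* packets observed with ECN CE marks  */
       uint32_t rtt_us;        /* latest RTT sample, in microseconds  */
       uint32_t local_queue;   /* NIC-local queue occupancy, in bytes */
   };

   /* Per-flow decision returned to the engine: a pacing rate and/or a
    * window (credit) limit, depending on what the transport uses. */
   struct opencc_decision {
       uint64_t pacing_rate_bps;
       uint32_t cwnd_bytes;
   };

   /* Callbacks provided by the developer; flow_state is private to the
    * algorithm. */
   struct opencc_algorithm_ops {
       const char *name;
       void (*init)(void *flow_state);
       void (*on_signals)(void *flow_state,
                          const struct opencc_signals *sig,
                          struct opencc_decision *out);
       void (*release)(void *flow_state);
   };

   /* Registration hook assumed to be offered by the platform. */
   int opencc_register_algorithm(const struct opencc_algorithm_ops *ops);

   An algorithm such as DCTCP or BBR would then be expressed purely in
   terms of these callbacks, leaving token management, scheduling, and
   packet processing to the engine described in Section 6.2.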
6.2.  Congestion Control Engine (CCE) and its interfaces

   Components in the congestion control engine can be offloaded to
   hardware to improve performance.  It is therefore suggested that the
   engine provide common and basic functions, while the platform above
   it provides the extensibility and flexibility for additional
   functions.

   The CCE includes the basic modules for packet transmission and the
   corresponding control.  Several function blocks are illustrated
   here, while the detailed implementation is out of scope for this
   document and left to NIC vendors.  The token manager distributes
   tokens to traffic, while the schedule block schedules the
   transmission time of that traffic.  The packet process block edits
   or processes packets before transmission.  The congestion control
   signal block collects and monitors signals from both the network and
   other NICs, which are fed to the congestion control algorithms.

   As such, an interface for obtaining congestion control signals
   should be defined in the congestion control engine to receive
   signals from both other NICs and the network, for existing
   congestion control algorithms as well as new extensions.  This
   information is used as input to the control algorithms to adjust the
   sending rate, drive loss recovery, and so on.

7.  Interoperability Considerations

7.1.  Negotiating the congestion control algorithm

   Since several congestion control algorithms may be available, hosts
   might negotiate their supported congestion control capabilities
   during the session setup phase.  However, they should use the
   existing congestion control behavior as the default, to remain
   compatible with legacy devices; a sketch of such a resolution is
   given after Section 7.2.

   Also, the network devices on the path should be able to indicate
   their support for any specific signals that a congestion control
   algorithm needs.  The capability negotiation between NICs and
   switches can be carried out either through in-band, ECN-like
   negotiation or through out-of-band, individual negotiation messages.

   Alternatively, the system can use a centralized administration
   platform to configure the algorithms on NICs and network devices.

7.2.  Negotiating the congestion control parameters

   The parameters might be set by administrators to match their traffic
   patterns and network environments, or be set through mappings from
   application requirements.  Hence, these parameters might change
   after a session has been set up.  As such, hosts should be able to
   negotiate their parameters when they change, or be configured
   consistently.
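   The following C sketch illustrates one way the capability
   negotiation of Section 7.1 could resolve to a common algorithm, with
   a fallback to the existing default behavior for compatibility with
   legacy devices.  It is an assumption-laden illustration: the
   function, the preference lists, and the "cubic" default are
   hypothetical, and this document does not define any negotiation
   encoding, whether in-band, out-of-band, or via a centralized
   administration platform.

   /*
    * Hypothetical sketch of resolving a congestion control algorithm
    * at session setup.  No wire format is defined by this document.
    */
   #include <string.h>

   #define OPENCC_LEGACY_DEFAULT "cubic"   /* assumed legacy behavior */

   /* Pick the first locally preferred algorithm that the peer also
    * supports; otherwise fall back to the legacy default. */
   static const char *opencc_pick_algorithm(const char *const *local_pref,
                                            size_t n_local,
                                            const char *const *peer,
                                            size_t n_peer)
   {
       for (size_t i = 0; i < n_local; i++)
           for (size_t j = 0; j < n_peer; j++)
               if (strcmp(local_pref[i], peer[j]) == 0)
                   return local_pref[i];
       return OPENCC_LEGACY_DEFAULT;
   }

   For example, if the local preference list is {"dctcp", "bbr"} and
   the peer only advertises {"bbr", "cubic"}, the hosts settle on
   "bbr"; with no overlap at all, both sides keep the legacy default.
   The same resolution logic could equally be applied by a centralized
   administration platform instead of the endpoints.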
8.  Security Considerations

   TBD

9.  Manageability Considerations

   TBD

10.  IANA Considerations

   This document has no IANA actions.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

11.2.  Informative References

   [BBR]      Cardwell, N., Cheng, Y., and S. Yeganeh, "BBR Congestion
              Control".

   [DCQCN]    "Congestion Control for Large-Scale RDMA Deployments".

   [NVMe-oF]  "NVMe over Fabrics".

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257,
              DOI 10.17487/RFC8257, October 2017,
              <https://www.rfc-editor.org/info/rfc8257>.

   [RFC8312]  Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
              R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
              RFC 8312, DOI 10.17487/RFC8312, February 2018,
              <https://www.rfc-editor.org/info/rfc8312>.

   [RoCEv2]   InfiniBand Trade Association, "InfiniBand Architecture
              Specification Volume 1 and Volume 2".

Appendix A.  Experiments

   This appendix describes two sets of experiments that study the
   performance of congestion control algorithms in a simplified storage
   network.  The first set studies one algorithm applied to both query
   and backup traffic, while the second set studies the performance
   when different algorithms are used for the query traffic and the
   backup traffic.  The metrics are the throughput of the backup
   traffic, the average completion time of the query traffic, and the
   95th percentile query completion time.

        +----------+                        +----------+
        | Database |                        | Database |
        |    S3 ....                    .... S4        |
        +---+------+ .                  . +-----+------+
            |         .     query      .        |
            |           .            .          |
            |             .        .            |
     backup |               ......               | backup
            |             .        .             |
            |           .            .           |
        +---V----V-+  .                .  +-V----V---+
        | Database <----------------------> Database |
        |    S1    |        backup        |    S2    |
        +----------+                      +----------+

            Figure 2.  Simplified storage network topology

   All experiments use a full implementation of the congestion control
   algorithms (Reno, CUBIC, DCTCP, and BBR) on NICs.  The testbed
   consists of 4 servers connected to one switch; each server has a
   10 Gbps NIC connected to a 10 Gbps port on the switch.  However, all
   ports are limited to 1 Gbps to create congestion points.  In the
   experiments, database server S1 receives backup traffic from both S3
   and S2 and one query flow from S4, while server S2 receives backup
   traffic from S1 and S4 and one query flow from S3.  Three traffic
   flows are therefore transmitted to S1 from one egress port on the
   switch, which can cause congestion.

   In the first experiment set, we test one algorithm for both traffic
   types.  The results are shown in Table 1.

   +-----------------+-----------+-----------+-----------+-----------+
   |                 |   reno    |   cubic   |    bbr    |   dctcp   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s |   64.92   |   65.97   |   75.25   |   70.06   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Avg. comp. ms   |  821.61   |  858.05   |   85.68   |   99.90   |
   +-----------------+-----------+-----------+-----------+-----------+
   | 95% comp. ms    |  894.65   |  911.23   |  231.75   |  273.92   |
   +-----------------+-----------+-----------+-----------+-----------+

      Table 1.  Performance when one congestion control algorithm is
                used for both query and backup traffic

   As the table shows, the average completion time of BBR and DCTCP is
   about 10 times better than that of Reno and CUBIC, and BBR is the
   best at keeping high throughput.
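   For reference, the completion-time statistics reported in the tables
   of this appendix can be derived from per-query samples as sketched
   below, using the arithmetic mean and the nearest-rank 95th
   percentile.  This is only an illustration; the appendix does not
   specify the exact statistical procedure used for the measurements.

   /*
    * One common way to compute an average and a 95th percentile
    * completion time from per-query samples (nearest-rank method).
    * Note that qsort() reorders samples[] in place.
    */
   #include <math.h>
   #include <stdlib.h>

   static int cmp_double(const void *a, const void *b)
   {
       double x = *(const double *)a, y = *(const double *)b;
       return (x > y) - (x < y);
   }

   /* samples[] holds per-query completion times in ms; n must be > 0. */
   static void completion_stats(double *samples, size_t n,
                                double *avg_ms, double *p95_ms)
   {
       double sum = 0.0;
       for (size_t i = 0; i < n; i++)
           sum += samples[i];
       *avg_ms = sum / (double)n;

       qsort(samples, n, sizeof(double), cmp_double);
       size_t rank = (size_t)ceil(0.95 * (double)n);  /* nearest rank */
       *p95_ms = samples[rank - 1];
   }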
   In the second set, we test all combinations of algorithms for the
   two traffic types.

   1.  Reno for query traffic

                                 reno@query
   +-----------------+-----------+-----------+-----------+-----------+
   |     @backup     |   cubic   |    bbr    |   dctcp   |   reno    |
   +-----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s |   66.00   |   76.19   |   64.00   |   64.92   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Avg. comp. ms   |  859.61   |   81.87   |   18.38   |  821.61   |
   +-----------------+-----------+-----------+-----------+-----------+
   | 95% comp. ms    |  917.80   |  149.88   |   20.38   |  894.65   |
   +-----------------+-----------+-----------+-----------+-----------+

          Table 2.  reno @ query and cubic, bbr, dctcp @ backup

   With reno used for the query traffic, bbr for the backup traffic
   achieves better throughput than the other candidates.  However,
   dctcp for the backup traffic achieves much better average and 95%
   completion times, roughly 4 to 7 times better than bbr, even though
   its throughput is lower.  One possible reason is that bbr does not
   take lost packets and congestion levels into account, which can
   cause many retransmissions.  In this test set, dctcp for the backup
   traffic gives the best overall performance.

   2.  Cubic for query traffic

                                 cubic@query
   +-----------------+-----------+-----------+-----------+-----------+
   |     @backup     |   reno    |    bbr    |   dctcp   |   cubic   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s |   64.92   |   75.02   |   65.29   |   65.97   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Avg. comp. ms   |  819.23   |   83.50   |   18.42   |  858.05   |
   +-----------------+-----------+-----------+-----------+-----------+
   | 95% comp. ms    |  902.66   |  170.96   |   20.99   |  911.23   |
   +-----------------+-----------+-----------+-----------+-----------+

          Table 3.  cubic @ query and reno, bbr, dctcp @ backup

   The results with cubic for the query traffic are similar to those
   with reno.  Even with lower throughput, dctcp gives average and 95%
   completion times several times better than bbr and more than an
   order of magnitude better than reno and cubic.

   3.  BBR for query traffic

                                  bbr@query
   +-----------------+-----------+-----------+-----------+-----------+
   |     @backup     |   reno    |   cubic   |   dctcp   |    bbr    |
   +-----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s |   64.28   |   66.61   |   65.29   |   75.25   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Avg. comp. ms   |  866.05   |  895.12   |   18.49   |   85.68   |
   +-----------------+-----------+-----------+-----------+-----------+
   | 95% comp. ms    |  925.06   |  967.67   |   20.86   |  231.75   |
   +-----------------+-----------+-----------+-----------+-----------+

          Table 4.  bbr @ query and reno, cubic, dctcp @ backup

   These results again match those obtained with reno and cubic.  In
   the last two columns, dctcp for the backup traffic shows better
   completion times even compared with bbr for the backup traffic,
   which indicates that bbr @ query with dctcp @ backup performs better
   than using bbr for both query and backup.
   4.  DCTCP for query traffic

                                 dctcp@query
   +-----------------+-----------+-----------+-----------+-----------+
   |     @backup     |   reno    |   cubic   |    bbr    |   dctcp   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Throughput MB/s |   60.93   |   64.49   |   76.15   |   70.06   |
   +-----------------+-----------+-----------+-----------+-----------+
   | Avg. comp. ms   |  2817.53  |  3077.20  |  816.45   |   99.90   |
   +-----------------+-----------+-----------+-----------+-----------+
   | 95% comp. ms    |  3448.53  |  3639.94  |  2362.72  |  273.92   |
   +-----------------+-----------+-----------+-----------+-----------+

          Table 5.  dctcp @ query and reno, cubic, bbr @ backup

   The completion times with dctcp for the query traffic are worse than
   the others, since L4S is not used in these experiments; dctcp
   therefore backs off most of the time when congestion happens, which
   makes the query traffic suffer long latency.  The best performance
   in this test set occurs with dctcp for the backup traffic as well,
   in which case both traffic types use the same mechanism to back off.
   However, the numbers are still worse than when another algorithm is
   used for the query traffic and dctcp is used for the backup traffic.

Authors' Addresses

   Yan Zhuang
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: zhuangyan.zhuang@huawei.com

   Wenhao Sun
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: sam.sunwenhao@huawei.com

   Long Yan
   Huawei Technologies Co., Ltd.
   101 Software Avenue, Yuhua District
   Nanjing, Jiangsu  210012
   China

   Email: yanlong20@huawei.com