<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "http://xml.resource.org/authoring/rfc2629.dtd" [
	<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
	<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
	<!ENTITY RFC8257 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8257.xml">
	<!ENTITY RFC5681 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5681.xml">
	<!ENTITY RFC8312 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8312.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc iprnotified="no" ?>
<?rfc strict="no" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no"?>
<?rfc sortrefs="yes" ?>
<rfc category="info" docName="draft-zhuang-tsvwg-open-cc-architecture-00" ipr="trust200902">
	<front>
		<title abbrev="open congestion control">
                An Open Congestion Control Architecture for High Performance Fabrics </title>
		<author initials="Y. Z." surname="Zhuang" fullname="Yan Zhuang">
			<organization>Huawei Technologies Co., Ltd.</organization>
			<address>
			<postal>
          <street>101 Software Avenue, Yuhua District</street>

          <city>Nanjing</city>

          <region>Jiangsu</region>

          <code>210012</code>

          <country>China</country>
        </postal>
				<email>zhuangyan.zhuang@huawei.com</email>
			</address>
		</author>
		<author initials="W.S." surname="Sun" fullname="Wenhao Sun">
			<organization>Huawei Technologies Co., Ltd.</organization>
			<address>
			<postal>
          <street>101 Software Avenue, Yuhua District</street>

          <city>Nanjing</city>

          <region>Jiangsu</region>

          <code>210012</code>

          <country>China</country>
        </postal>
				<email>sam.sunwenhao@huawei.com</email>
			</address>
		</author>
		<author initials="L.Y." surname="Yan" fullname="Long Yan">
			<organization>Huawei Technologies Co., Ltd.</organization>
			<address>
			<postal>
          <street>101 Software Avenue, Yuhua District</street>

          <city>Nanjing</city>

          <region>Jiangsu</region>

          <code>210012</code>

          <country>China</country>
        </postal>
				<email>yanlong20@huawei.com</email>
			</address>
		</author>
		<date month="November" year="2019"/>
		<area>TSV</area>
		<workgroup>TSVWG</workgroup>
		<abstract>
			<t>This document describes an open congestion control architecture for high performance fabrics that allows cloud operators and algorithm developers to deploy or develop new 
			congestion control algorithms, and to make appropriate per-traffic configurations, on smart NICs in a more efficient and flexible way.</t>
		</abstract>
	</front>
	<middle>
		<section title="Introduction" anchor="introduction">
			<t>Datacenter networks (DCNs) today not only carry tenant traffic over the TCP/IP protocol stack, but are also required to carry RDMA traffic 
			for High Performance Computing (HPC) and distributed storage applications, which require low latency and high throughput.</t>

			<t>Thus, for today's datacenter applications, the latency and throughput requirements are more critical than for normal Internet traffic, while network congestion and queuing 
			caused by incast are what increase traffic latency and degrade network throughput. To address this, congestion control algorithms aimed at low latency and high bandwidth
			have been proposed, such as DCTCP <xref target="RFC8257"/> and <xref target="BBR"/> for TCP, and <xref target="DCQCN"/> for <xref target="RoCEv2"/>.</t>
			
			<t>Besides, CPU utilization is another factor in improving the efficiency of traffic transmission for low-latency applications. By offloading some protocol processing onto 
			smart NICs and bypassing the CPU, applications can write directly to hardware, which reduces the latency of traffic transmission. RDMA over RoCEv2 is currently a good example of the 
			benefit of bypassing the kernel/CPU, while TCP offloading is also under discussion in <xref target="NVMe-oF"/>.</t>
			
			<t>In general, on one hand, cloud operators and application developers are working on new congestion control algorithms to meet the requirements of applications such as HPC, AI, and storage 
			in high performance fabrics; on the other hand, smart NIC vendors are offloading data plane and control plane functions onto hardware to reduce processing 
			latency and improve performance. This raises the question of how smart NICs can be optimized by offloading some functions onto hardware while still being
			able to give customers the flexibility to develop or change their congestion control algorithms and run their experiments more easily.</t>
			
			<t>That said, it would be beneficial to have an open, modular design for congestion control on smart NICs that makes it possible to develop and deploy new algorithms while taking advantage
			of hardware offloading in a generic way.</t>
			
			<t>This document describes an open congestion control architecture for high performance fabrics on smart NICs that allows cloud operators and application developers to install or 
			develop new congestion control algorithms, as well as select appropriate controls, in a more efficient and flexible way.</t>
			
			<t>It focuses only on the basic functionality and discusses some common interfaces to network environments, administrators, and application developers; the detailed implementations 
			are vendor-specific designs and are out of scope.</t>
			
			<t>Discussions of new congestion control algorithms and improved active queue management (AQM) are also out of scope for this document.</t>
			
		</section>

		<section title="Conventions">
			<t>	The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
      NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
      "MAY", and "OPTIONAL" in this document are to be interpreted as
      described in BCP 14 <xref target="RFC2119"/>
				<xref target="RFC8174"/> when, and only when, they
      appear in all capitals, as shown here.				
				
			</t>
		</section>
		<section title="Abbreviations">
			<t>
				<list style="hanging">
					<t>IB - InfiniBand</t>
					<t>HPC - High Performance Computing</t>
					<t>ECN - Explicit Congestion Notification</t>
					<t>AI - Artificial Intelligence</t>
					<t>RDMA - Remote Direct Memory Access</t>
					<t>NIC - Network Interface Card</t>
					<t>AQM - Active Queue Management</t>
				</list>
			</t>
		</section>
		
		<section title = "Observations in storage network">
		<t>Besides easing the development of new congestion control algorithms by developers while taking advantage of hardware offloading improvements by NIC vendors, we notice that
		there are also benefits in choosing the proper algorithm for each specific traffic pattern.</t>
		
		<t>As stated, there are several congestion control algorithms for low latency, high throughput datacenter applications, and the industry is still working on enhanced algorithms to meet the requirements 
		of new applications in the high performance area. A question might then be asked: how should a proper congestion control algorithm be selected for the network, and is a single selected algorithm efficient and
		sufficient for all traffic in the network?</t>
		
		<t>With this question in mind, we use a simplified storage network as a case study. This typical network mainly carries two traffic types: query and backup. Query is latency-sensitive 
		traffic while backup is high-throughput traffic. We select several well-known TCP congestion control algorithms (Reno <xref target="RFC5681"/>, CUBIC <xref target="RFC8312"/>, DCTCP <xref target="RFC8257"/>, and BBR <xref target="BBR"/>) for this study.</t>
		
		<t>Two sets of experiments were run to evaluate the performance of these algorithms for different traffic types (i.e., traffic patterns). The first set studies the performance when one algorithm is used for both traffic types; the second set runs the two 
		traffic types with combinations of congestion control algorithms. The detailed experiments and test results can be found in Appendix A.</t>
		
		<t>According to the results of the first experiment set, BBR performs better than the others when applied to both traffic types; in the second experiment set, some algorithm combinations show better performance than using the same algorithm for both, even compared with BBR.</t>
		
		<t>As such, we believe there are benefits in letting different traffic patterns use their own algorithms in the same network to achieve better performance. From a cloud operation perspective, this is another reason to have an open congestion control architecture on the NIC
		that can select the proper algorithm for each traffic pattern.</t>
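		<t>The per-pattern selection described above can be sketched as a simple policy table. This is an illustration only; the pattern names, algorithm identifiers, and default are assumptions for the sketch (the query/backup mapping mirrors the Appendix A observation that bbr @ query with dctcp @ backup performed well), not part of any defined API.</t>
		<figure>
			<artwork>
```python
# Illustrative policy table mapping traffic patterns to congestion
# control algorithms. Pattern and algorithm names are hypothetical
# examples; the query/backup choices follow the Appendix A results.
POLICY = {
    "query":  "bbr",    # latency-sensitive traffic
    "backup": "dctcp",  # throughput-oriented traffic
}

DEFAULT_ALGORITHM = "cubic"  # assumed widely deployed fallback

def select_algorithm(traffic_pattern):
    """Return the configured algorithm for a traffic pattern,
    falling back to the default for unknown patterns."""
    return POLICY.get(traffic_pattern, DEFAULT_ALGORITHM)
```
			</artwork>
		</figure>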
		</section>
		
		<section title="Requirements of the open congestion control architecture">
		<t>According to these observations, the architecture design is suggested to follow some principles:
		</t>
		<t>
				<list style="symbols">
					<t>Support developers in writing their congestion control algorithms for NICs while keeping the benefit of congestion control offloading provided by NIC vendors.</t>
					<t>Support vendors in optimizing NIC performance through hardware offloading while allowing users to deploy and select new congestion control algorithms.</t>
					<t>Support configuration of congestion controls by administrators according to traffic patterns.</t>
					<t>Support settings from applications to express QoS requirements.</t>
					<t>Be transport protocol independent; for example, support both TCP and RoCE.</t>
				</list>
		</t>
		</section>
		
		
		<section title="Open Congestion Control (OpenCC) Architecture Overview">
		<t>The architecture shown in Figure 1 includes only the congestion control related components; components for other functions are omitted. The OpenCC architecture consists of
		three layers.</t>
		
		<t>The bottom layer, called the congestion control engine, provides common function blocks independent of transport protocols, which can be implemented in hardware. The middle layer is the congestion 
		control platform, in which different congestion control algorithms are deployed; these algorithms can be installed by NIC vendors or developed by algorithm developers. Finally, the top layer provides the interfaces (i.e., APIs) to users. The users can be administrators, who select 
		proper algorithms and set proper parameters for their networks; applications, which indicate their QoS requirements, which can be further mapped to runtime settings of congestion control parameters; and algorithm developers, who write their 
		own algorithms. </t>
		
		<figure>
			<artwork>

             +------------+  +-----------------+   +---------------+
 User        | Parameters |  | Application(run |   | CC developers |
 interfaces  |            |  | time settings)  |   |               |
             +-----+------+  +-------+---------+   +------+--------+
                   |                 |                    |
                   |                 |                    |
                   |                 |                    |
             +-----------------------+---------+          |
             |  Congestion control Algorithms  |          |
             |        +-----------------+      &lt;----------+
 CC platform |       +-----------------+|      |
             |      +-----------------+|+      |
             |      |  CC algorithm#1 |+       |
             |      +-----------------+        |
             +--+--------+---------+---------+-+
                |        |         |         |
                |        |         |         |
             +--+--+ +---+---+ +---+----+ +--+---+
             |     | |       | |        | |      |   /  NIC signals
 CC Engine   |Token| |Packet | |Schedule| |CC    |  /--------------
             |mgr  | |Process| |        | |signal|  \--------------
             +-----+ +-------+ +--------+ +------+   \  Network signals


   Figure 1. The architecture of open congestion control
			</artwork>
	</figure>
		<section title="Congestion Control Platform and its user interfaces">
			<t>The congestion control platform is a software environment for deploying and configuring various congestion control 
			algorithms. It exposes three types of interfaces to the user layer, each for a different use.</t>
			
			<t>The first is for administrators, who use it to select proper congestion control algorithms for their network traffic 
			and to configure the corresponding parameters of the selected algorithms.</t>
			
			<t>The second can be an interface defined by NIC vendors or developers that provides APIs for application 
			developers to express their QoS requirements, which are then mapped to runtime configuration of the controls. 
			</t>
			
			<t>The last is for algorithm developers to write their own algorithms for the system. It is suggested to define a
			common language for writing algorithms, which can then be compiled by vendor-specific environments (which may provide toolkits
			or libraries) to generate platform-dependent code.</t>
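			<t>As an illustration only, such a developer-facing interface might resemble the following skeleton. The class and hook names, and the signal dictionary, are assumptions for this sketch, not a proposed standard; a vendor toolchain would compile something like this to platform-dependent code.</t>
			<figure>
				<artwork>
```python
class CongestionControlAlgorithm:
    """Hypothetical base class an algorithm developer would
    subclass on the congestion control platform."""

    def __init__(self, initial_rate_bps):
        self.rate_bps = initial_rate_bps

    def on_signal(self, signal):
        """Called by the platform for each congestion signal
        delivered by the congestion control engine."""
        raise NotImplementedError

class ExampleAimd(CongestionControlAlgorithm):
    """A toy additive-increase/multiplicative-decrease algorithm,
    shown only to exercise the interface."""

    def on_signal(self, signal):
        if signal.get("congested"):
            # Multiplicative decrease on a congestion indication.
            self.rate_bps = max(1, self.rate_bps // 2)
        else:
            # Additive increase: probe for more bandwidth.
            self.rate_bps += 1_000_000
        return self.rate_bps
```
				</artwork>
			</figure>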
			
		</section>
		
		<section title="Congestion Control Engine (CCE) and its interfaces">
			<t>Components in the congestion control engine can be offloaded to hardware to improve performance. As such, it
			is suggested that the engine provide some common and basic functions, while the upper platform provides greater extensibility and
			flexibility for additional functions.</t>
			
			<t>The CCE includes the basic modules for packet transmission and the corresponding control. Several function blocks are illustrated
			here, while the detailed implementation is out of scope for this document and left to NIC vendors. The token manager distributes tokens
			to traffic, while the schedule block schedules the transmission time for that traffic. The packet
			process block edits or processes packets before transmission. The congestion control signal block collects or
			monitors signals from both the network and other NICs, which are fed to the congestion control algorithms. </t>
			
			<t>As such, an interface for obtaining congestion control signals should be defined, able to receive signals
			from both other NICs and the network for existing congestion control algorithms and new extensions. This information is
			used as input to the control algorithms to adjust the sending rate, operate loss recovery, etc.</t>
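			<t>A minimal sketch of such a signal record follows. The field names (ECN marks, RTT samples, loss counts) and the simple predicate are assumptions chosen for illustration, not a defined wire or API format.</t>
			<figure>
				<artwork>
```python
from dataclasses import dataclass

@dataclass
class CongestionSignal:
    """Illustrative record the CC engine could hand to an
    algorithm; all field names are assumptions for this sketch."""
    source: str        # "network" (e.g. ECN) or "nic" (peer NIC)
    ecn_marked: bool   # ECN CE mark observed on this path
    rtt_us: int        # latest round-trip-time sample, microseconds
    lost_packets: int  # packets declared lost since the last signal

def is_congested(sig):
    """Simple predicate an algorithm might apply to a signal."""
    return sig.ecn_marked or sig.lost_packets > 0
```
				</artwork>
			</figure>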
		</section>
		</section>

		<section title="Interoperability Consideration">
			<section title ="Negotiate the congestion control algorithm">
			<t>Since there will be several congestion control algorithms, hosts might negotiate their supported congestion control 
			capabilities during the session setup phase. However, the existing congestion control should be used as the default to provide
			compatibility with legacy devices.</t>
			
			<t>Also, the network devices on the path should be capable of indicating their support for any specific signals that a 
			congestion control algorithm needs. The capability negotiation between NICs and switches can be performed either with 
			in-band, ECN-like negotiation or with out-of-band individual message negotiation.</t>
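			<t>The negotiation outcome can be sketched as picking the first mutually supported algorithm and falling back to the existing default for legacy peers. The algorithm names, the preference order, and the function shape are illustrative assumptions, not a defined negotiation protocol.</t>
			<figure>
				<artwork>
```python
LEGACY_DEFAULT = "cubic"  # assumed existing default for legacy peers

def negotiate_algorithm(local_prefs, remote_supported):
    """Pick the first locally preferred algorithm the peer also
    supports; fall back to the legacy default otherwise."""
    remote = set(remote_supported)
    for candidate in local_prefs:
        if candidate in remote:
            return candidate
    return LEGACY_DEFAULT
```
				</artwork>
			</figure>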
			
			<t>Alternatively, the system can also use a centralized administration platform to configure the algorithms on NICs and network devices.</t>
			</section>
			
			<section title="Negotiate the congestion control parameters">
			<t>The parameters might be set by administrators to match their traffic patterns and network environments, or set by mapping from application 
			requirements. Hence, these parameters might change after the session is set up. As such, hosts should be able to renegotiate their parameters
			when they change, or be configured to keep them consistent.</t>
			
			</section>
		</section>
		<section title="Security Considerations" anchor="Security">
			<t>
			TBD
			</t>
		</section>
		<section title="Manageability Consideration" anchor="Manageability">
			<t>TBD</t>
		</section>
		<section title="IANA Considerations" anchor="IANA">
			<t>No IANA action</t>
		</section>
	</middle>
	<back>
	
		<references title="Normative References">
   &RFC2119;
   &RFC8174;
		</references>
		
		<references title="Informative References">
			&RFC8257;
			&RFC5681;
			&RFC8312;
			<reference anchor="BBR" target="https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00">
				<front>
					<title>BBR Congestion Control</title>
					<author initials='N' surname='Cardwell' fullname='Neal Cardwell'>
					</author>
					<author initials='Y' surname='Cheng' fullname='Yuchung Cheng'>
					</author>
					<author initials='S' surname='Yeganeh' fullname='Soheil Hassas Yeganeh'>
					</author>
					<date/>
				</front>
			</reference>
			<reference anchor="DCQCN" target="https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf">
				<front>
					<title>Congestion Control for Large-Scale RDMA Deployments</title>
					<author/>
					<date/>
				</front>
			</reference>
			<reference anchor="RoCEv2" target="https://cw.infinibandta.org/document/dl/7781">
				<front>
					<title>InfiniBand(TM) Architecture Specification Volume 1 and Volume 2</title>
					<author>
						<organization>InfiniBand Trade Association</organization>
					</author>
					<date/>
				</front>
			</reference>
			<reference anchor="NVMe-oF" target="https://nvmexpress.org/wp-content/uploads/NVMe_Over_Fabrics.pdf">
				<front>
					<title>NVMe over Fabrics</title>
					<author/>
					<date/>
				</front>
			</reference>
		</references>
		
		<section anchor="Appendix" title="Experiments">
		<t>This section includes two sets of experiments to study the performance of congestion control algorithms in a simplified storage network.
		The first set studies one algorithm applied to both query and backup traffic, while the second set studies the performance when 
		different algorithms are used for the query traffic and the backup traffic. The metrics include the throughput of the backup traffic, the average completion 
		time of the query traffic, and the 95th percentile query completion time.</t>
		
		<figure>
			<artwork>

     +----------+           +----------+
     | Database |           | Database |
     |    S3    ....     ....    S4    |
     +---+------+  .     .  +------+---+
         |         .     .         |
         |         .query.         |
         |         .     .         |
 backup  |         .     .         | backup
         |   .............         |
         |   .     .............   |
         |   .                 .   |
     +---V---V--+           +--V---V---+
     | Database &lt;-----------&gt; Database |
     |    S1    |  backup   |    S2    |
     +----------+           +----------+
Figure 2. Simplified storage network topology
			</artwork>
		</figure>
		<t>All experiments use full implementations of the congestion control algorithms on NICs, including Reno, CUBIC, DCTCP, and BBR. Our testbed 
		includes 4 servers connected to one switch. Each server has a 10Gbps NIC connected to a 10Gbps port on the switch; however, we limit all ports to 
		1Gbps to create congestion points. In the experiments, the database server S1 receives backup traffic from both S3 and S2 and one query 
		traffic flow from S4. The server S2 receives backup traffic from S1 and S4 and one query traffic flow from S3. Three traffic flows are thus 
		transmitted to S1 from one egress port on the switch, which can cause congestion.</t>
		
		<t>In the first experiment set, we test one algorithm applied to both traffic types. The results are shown below in Table 1.</t>
		
				<figure>
			<artwork>

+----------------+-----------+-----------+-----------+-----------+
|                |   reno    |   cubic   |    bbr    |   dctcp   |
+----------------+-----------+-----------+-----------+-----------+
| Throughput MB/s|   64.92   |   65.97   |   75.25   |   70.06   |
+----------------+-----------+-----------+-----------+-----------+
|  Avg. comp ms  |  821.61   |  858.05   |   85.68   |   99.90   |
+----------------+-----------+-----------+-----------+-----------+
|  95% comp  ms  |  894.65   |  911.23   |  231.75   |  273.92   |
+----------------+-----------+-----------+-----------+-----------+
Table 1. Performance when one cc is used for both query and backup traffic
			</artwork>
		</figure>
		
		<t>As we can see, the average completion times of BBR and DCTCP are about 10 times better than those of Reno and CUBIC. BBR is the best at keeping high 
		throughput. </t>
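		<t>The roughly 10x gap in average completion time can be checked directly from the Table 1 numbers:</t>
		<figure>
			<artwork>
```python
# Average query completion times (ms) from Table 1.
avg_comp = {"reno": 821.61, "cubic": 858.05, "bbr": 85.68, "dctcp": 99.90}

# Speedup of bbr over reno and cubic in average completion time.
speedup_vs_reno = avg_comp["reno"] / avg_comp["bbr"]    # about 9.6x
speedup_vs_cubic = avg_comp["cubic"] / avg_comp["bbr"]  # about 10.0x
```
			</artwork>
		</figure>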
		
		<t>In the second set, we test all combinations of algorithms for the two traffic types.</t>
		
		<t> 1. Reno for query traffic
		</t>
						<figure>
			<artwork>
reno@query
+----------------+-----------+-----------+-----------+-----------+
|    @backup     |   cubic   |    bbr    |   dctcp   |    reno   |
+----------------+-----------+-----------+-----------+-----------+
| Throughput MB/s|   66.00   |   76.19   |   64.00   |   64.92   |
+----------------+-----------+-----------+-----------+-----------+
|  Avg. comp ms  |  859.61   |   81.87   |   18.38   |  821.61   |
+----------------+-----------+-----------+-----------+-----------+
|  95% comp  ms  |  917.80   |  149.88   |   20.38   |  894.65   |
+----------------+-----------+-----------+-----------+-----------+

Table 2. reno @ query and cubic, bbr, dctcp @ backup
			</artwork>
		</figure>
		<t>It shows that, given reno used for the query traffic, bbr for the backup traffic gets better throughput than the other candidates. However, dctcp
		for the backup traffic gets much better average and 95th percentile completion times, almost 6 times better than those of bbr, even though its throughput
		is lower than bbr's. The reason might be that bbr does not take lost packets and congestion levels into account, which can cause many retransmissions. 
		In this test set, dctcp for the backup traffic gets the best performance.</t>
		
		<t>2. Cubic for query traffic</t>
		<figure>
			<artwork>
 cubic@query
 +----------------+-----------+-----------+-----------+-----------+
 |    @backup     |   reno    |    bbr    |   dctcp   |   cubic   |
 +----------------+-----------+-----------+-----------+-----------+
 | Throughput MB/s|   64.92   |   75.02   |   65.29   |   65.97   |
 +----------------+-----------+-----------+-----------+-----------+
 |  Avg. comp ms  |  819.23   |   83.50   |   18.42   |  858.05   |
 +----------------+-----------+-----------+-----------+-----------+
 |  95% comp  ms  |  902.66   |  170.96   |   20.99   |  911.23   |
 +----------------+-----------+-----------+-----------+-----------+
Table 3. cubic @ query and reno, bbr, dctcp @ backup
			</artwork>
		</figure>
		<t>The results for cubic on the query traffic are similar to those for reno. Even with lower throughput, dctcp is almost 6 times better than bbr in average
		completion time and 95th percentile completion time, and nearly 10 times better than reno and cubic.
		</t>
		
		<t>3. Bbr for query traffic</t>
		<figure>
			<artwork>
bbr@query
+----------------+-----------+-----------+-----------+-----------+
|    @backup     |   reno    |   cubic   |   dctcp   |    bbr    |
+----------------+-----------+-----------+-----------+-----------+
| Throughput MB/s|   64.28   |   66.61   |   65.29   |   75.25   |
+----------------+-----------+-----------+-----------+-----------+
|  Avg. comp ms  |  866.05   |  895.12   |   18.49   |   85.68   |
+----------------+-----------+-----------+-----------+-----------+
|  95% comp  ms  |  925.06   |  967.67   |   20.86   |  231.75   |
+----------------+-----------+-----------+-----------+-----------+
Table 4. bbr @ query and reno, cubic, dctcp @ backup
			</artwork>
		</figure>
		<t>The results still match those obtained with reno and cubic. In the last two columns, dctcp for backup shows better performance even compared with bbr
		used for backup. This indicates that bbr @ query with dctcp @ backup is better than bbr for both query and backup.
		</t>
		
		<t>4. Dctcp for query traffic</t>
		<figure>
			<artwork>
dctcp@query
+----------------+-----------+-----------+-----------+-----------+
|    @backup     |   reno    |   cubic   |    bbr    |   dctcp   |
+----------------+-----------+-----------+-----------+-----------+
| Throughput MB/s|   60.93   |   64.49   |   76.15   |   70.06   |
+----------------+-----------+-----------+-----------+-----------+
|  Avg. comp ms  | 2817.53   | 3077.20   |  816.45   |   99.90   |
+----------------+-----------+-----------+-----------+-----------+
|  95% comp  ms  | 3448.53   | 3639.94   | 2362.72   |  273.92   |
+----------------+-----------+-----------+-----------+-----------+
Table 5. dctcp @ query and reno, cubic, bbr @ backup
			</artwork>
		</figure>
		<t>The completion times for dctcp@query look worse than the others, since we did not introduce L4S in the experiments; dctcp therefore backs off most of the 
		time when congestion happens, which makes the query traffic bear long latency. The best performance in this test set occurs with dctcp@backup, where both 
		traffic types use the same mechanism to back off. However, the numbers are still worse than when other algorithms are used for query with dctcp used for 
		backup.</t>
		
		   </section>
		<!-- generic-out-of-band-aspects -->
	</back>
</rfc>
