INTERNET-DRAFT                                                   F. Chen
Intended Status: Informational                                    W. Sun
Expires: Sep 22, 2019                                              X. Yu
                                                     Huawei Technologies
                                                            Mar 21, 2019


             Requirements for RoCEv3 Congestion Management
               draft-chen-iccrg-rocev3-cm-requirements-00

Abstract

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol. The RoCEv2 specification does not define strong congestion
   management mechanisms or load balancing methods. RoCEv2 relies on
   the existing link-layer flow control of IEEE 802.1Qbb (Priority-based
   Flow Control, PFC) to provide a lossless fabric. RoCEv2 Congestion
   Management (RCM) uses ECN (Explicit Congestion Notification, defined
   in RFC 3168) to signal congestion to the destination, and uses the
   resulting congestion notification to reduce the injection rate and
   to increase it again when the extent of congestion decreases. A
   growing number of congestion management practices for RoCEv2, such
   as DCQCN (Data Center Quantized Congestion Notification), are
   appearing in the industry. There is demand for a new RoCEv3 protocol
   that provides stronger congestion management and load balancing
   mechanisms for RDMA deployment in modern datacenters.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction
   2  Terminology
   3  Abbreviations
   4  RoCEv3 Congestion Management Requirements
   5  Current Congestion Management for RoCEv2
     5.1  PFC
     5.2  ECN
   6  Congestion Management Practice
     6.1  Packet Retransmission
     6.2  Congestion Control Mechanisms
     6.3  Re-ordering
     6.4  Load Balancing
   7  Summary
   8  Security Considerations
   9  IANA Considerations
   10 References

1 Introduction

   With the emergence of distributed storage, AI/HPC, machine learning,
   and similar workloads, modern datacenter applications demand high
   throughput (40 Gbps and above) and ultra-low latency (less than 10
   microseconds per hop) from the network, together with low CPU
   overhead. Remote Direct Memory Access (RDMA) can meet these needs on
   Ethernet.

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol. RoCEv2 is a straightforward extension of the RoCE protocol
   that involves a simple modification of the RoCE packet format.
   RoCEv2 packets carry an IP header, which allows traversal of IP L3
   routers, and a UDP header, which serves as a stateless encapsulation
   layer for the RDMA transport protocol packets over IP [1].

   RoCEv2 Congestion Management (RCM) provides the capability to avoid
   congestion hot spots and optimize the throughput of the fabric. RCM
   relies on the existing link-layer flow control of IEEE 802.1Qbb
   (PFC) to provide a drop-free network. RCM also uses ECN (RFC 3168)
   to signal congestion to the destination, and uses the resulting
   congestion notification to reduce the injection rate and to increase
   it again when the extent of congestion decreases. A growing number
   of congestion management practices for RoCEv2, such as DCQCN, are
   appearing in the industry.

   This raises the question of whether to develop a next-generation
   RoCE protocol (referred to here as RoCEv3) with stronger congestion
   management and load balancing mechanisms for RDMA deployment in
   modern datacenters.

2 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3 Abbreviations

   RCM    - RoCEv2 Congestion Management
   PFC    - Priority-based Flow Control
   ECN    - Explicit Congestion Notification
   DCQCN  - Data Center Quantized Congestion Notification
   AI/HPC - Artificial Intelligence/High-Performance Computing
   ECMP   - Equal-Cost Multipath

4 RoCEv3 Congestion Management Requirements

   Network congestion happens in the network switches when the incoming
   traffic exceeds the bandwidth of the outgoing link on which it has
   to be transmitted. Congestion is the primary source of packet loss
   in the network and leads to dramatic performance degradation.

   Generally, RoCEv2 relies on the link-layer flow control of IEEE
   802.1Qbb (PFC) to provide a lossless underlying network. Lossless
   networks implement a flow control mechanism that pauses traffic on
   the incoming link, at per-priority granularity, before the buffer
   overflows, and thereby prevents packet drops [2]. However, PFC can
   lead to poor application performance due to problems such as
   head-of-line blocking and unfairness [3].

   To avoid the problems caused by PFC, another line of research
   studies congestion control mechanisms that operate over lossy
   networks.

   A protocol, provisionally named RoCEv3, is needed that has stronger
   congestion management capability, so as to achieve high throughput
   and low latency in large-scale datacenter networks while placing
   more flexible requirements on the underlay network. Interoperability
   among the existing industry practices is also required.

5 Current Congestion Management for RoCEv2

5.1 PFC

   RDMA is deployed using the RoCEv2 protocol, which relies on IEEE
   802.1Qbb Priority-based Flow Control (PFC) to enable a drop-free
   network. PFC is a link-level protocol that allows a receiver to
   assert flow control, telling the transmitter to pause sending
   traffic for a specified priority. However, because PFC stops all
   traffic in a particular traffic class at the ingress port, flows
   destined to other ports are also blocked.
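
   The blocking effect described above can be illustrated with a
   minimal sketch. It is not taken from IEEE 802.1Qbb or from any
   implementation: the Packet type, the sendable() helper, and the
   flow and port names are hypothetical and exist only for this
   illustration. The one property it models is that a pause applies
   to a whole priority, regardless of where each packet is headed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    flow: str         # hypothetical flow label
    priority: int     # traffic class the packet is mapped to (0-7)
    egress_port: str  # port on which it would leave the downstream switch


def sendable(packets, paused_priorities):
    """Return the packets the upstream transmitter may still send.

    A PFC pause is keyed by priority only; the egress port is never
    consulted, which is what causes head-of-line blocking.
    """
    return [p for p in packets if p.priority not in paused_priorities]


queue = [
    Packet("A", 3, "port1"),  # destined to the congested port
    Packet("B", 3, "port2"),  # destined to an idle port, same priority
    Packet("C", 5, "port2"),  # different priority
]

# Assume port1 is congested and the receiver pauses priority 3.
still_flowing = sendable(queue, paused_priorities={3})
blocked_flows = [p.flow for p in queue if p not in still_flowing]

print("still flowing:", [p.flow for p in still_flowing])
print("blocked:", blocked_flows)
```

   In this sketch flow B is a victim: its egress port is idle, but it
   shares priority 3 with flow A and is therefore paused as well.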

   The known problems of PFC are head-of-line blocking, unfairness,
   and deadlock [4].

5.2 ECN

   Explicit Congestion Notification (ECN) enables end-to-end congestion
   notification between two endpoints on TCP/IP-based networks. ECN
   signals congestion with the goal of reducing packet loss and delay:
   the sending device decreases its transmission rate until the
   congestion clears, without packets being dropped. ECN is defined in
   RFC 3168, "The Addition of Explicit Congestion Notification (ECN) to
   IP".

6 Congestion Management Practice

6.1 Packet Retransmission

   NICs were not designed to deal with losses efficiently. The receiver
   discards out-of-order packets, and the sender performs go-back-N
   retransmission on detecting packet loss. RoCEv2 adopts go-back-N
   loss recovery and needs a lossless layer 2 (by using PFC) for good
   performance [5].

   If a new RDMA protocol does not rely on a lossless layer 2 network,
   an efficient packet retransmission method is necessary.

6.2 Congestion Control Mechanisms

6.2.1 RTT-based Congestion Control

   A typical example of RTT-based congestion control is TIMELY [6], a
   delay-based congestion control protocol for use in the datacenter.
   It shows that simple packet delay, measured as round-trip time (RTT)
   at the hosts, is an effective congestion signal that needs no switch
   feedback. TIMELY measures RTT with microsecond accuracy, and these
   RTTs are sufficient to estimate switch queueing. TIMELY adjusts
   transmission rates using RTT gradients to keep packet latency low
   while delivering high bandwidth.

   Because the RDMA transport runs in the NIC and is sensitive to
   packet drops, which hurt performance badly, TIMELY still needs PFC
   to provide a lossless underlay network.

6.2.2 Credit-based Congestion Control

   ExpressPass [7] is an end-to-end credit-scheduled, delay-bounded
   congestion control for datacenters.
   ExpressPass uses credit packets to control congestion even before
   data packets are sent, which enables it to achieve bounded delay and
   fast convergence. It uses end-to-end credit transfer for bandwidth
   allocation and fine-grained packet scheduling.

6.2.3 ECN-based Congestion Control

   Data Center Quantized Congestion Notification (DCQCN) [3] is an
   end-to-end congestion control scheme for RoCEv2. DCQCN combines ECN
   and PFC to support end-to-end lossless Ethernet. The idea behind
   DCQCN is to let ECN perform flow control by decreasing the
   transmission rate at the sender when congestion starts, thereby
   minimizing the time for which PFC is triggered.

   Although the RoCEv2 standard [1] does not list DCQCN as the RCM
   mechanism, it is widely used in industry practice.

6.3 Re-ordering

   When packets arrive at the destination out of order, the destination
   should store them in order to restore the original order. The
   destination should assign dedicated buffer resources to perform the
   re-ordering. There are many methods for implementing re-ordering,
   either on the switch or on the NIC side. This document does not go
   into the details.

6.4 Load Balancing

6.4.1 ECMP

   RoCEv2 packets carry an opaque flow identifier in the UDP Source
   Port field, which ECMP uses for path selection in order to balance
   load and improve utilization of the fabric topology.

   Traditional ECMP cannot balance load well in the datacenter network
   because it splits load at flow granularity. The finer the
   granularity of load balancing, the more effective the load balancing
   and the higher the achievable utilization of network bandwidth.

6.4.2 Flowlet

   A typical example of flowlet-based load balancing is CONGA [8].
   CONGA is a network-based, distributed, congestion-aware load
   balancing mechanism for datacenters. It splits TCP flows into
   flowlets, estimates real-time congestion on fabric paths, and
   allocates flowlets to paths based on feedback from remote switches.
   Flowlets are bursts of packets from a flow.
   The idle interval between two bursts is larger than the maximum
   difference in latency among the paths, so the second burst can be
   sent along a different path than the first without reordering
   packets.

6.4.3 Per-packet

   Per-packet load balancing is the most effective because its
   granularity is the smallest. The consequence is that packets
   belonging to the same flow are spread across different paths. When
   the forwarding delays of the paths differ, packets may arrive at the
   receiver out of order.

7 Summary

   Newly emerging RoCE-based applications are driving the adoption of
   different congestion management mechanisms in many kinds of modern
   large-scale datacenter networks. This problem statement does not
   introduce all of the mainstream mechanisms, and it will need to be
   extended. A future RoCE protocol, provisionally named RoCEv3, with
   robust congestion management capability and more flexible
   requirements on the layer 2 network might be the next direction.

8 Security Considerations

   This document does not introduce any additional security
   constraints.

9 IANA Considerations

   TBD

10 References

   [1]  InfiniBand Trade Association, "Supplement to InfiniBand
        Architecture Specification Volume 1 Release 1.2.2 Annex A17:
        RoCEv2 (IP Routable RoCE)", 2014.

   [2]  "Understanding RoCEv2 Congestion Management",
        https://community.mellanox.com/docs/DOC-2321

   [3]  Zhu, Y., et al., "Congestion Control for Large-Scale RDMA
        Deployments", ACM SIGCOMM Computer Communication Review 45(5),
        pp. 523-536, 2015.

   [4]  Hu, S., et al., "Deadlocks in Datacenter Networks: Why Do They
        Form, and How to Avoid Them", ACM Workshop on Hot Topics in
        Networks (HotNets), pp. 92-98, 2016.

   [5]  Mittal, R., et al., "Revisiting Network Support for RDMA",
        2018.

   [6]  Mittal, R., et al., "TIMELY: RTT-based Congestion Control for
        the Datacenter", ACM Conference on Special Interest Group on
        Data Communication (SIGCOMM), pp. 537-550, 2015.

   [7]  Cho, I., Han, D., and K. Jang, "ExpressPass: End-to-End
        Credit-based Congestion Control for Datacenters", 2016.

   [8]  Alizadeh, M., et al., "CONGA: Distributed Congestion-Aware Load
        Balancing for Datacenters", ACM Conference on SIGCOMM,
        pp. 503-514, 2014.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC1776]  Crocker, S., "The Address is the Message", RFC 1776,
              April 1 1995.

   [TRUTHS]   Callon, R., "The Twelve Networking Truths", RFC 1925,
              April 1 1996.

   [EVILBIT]  Bellovin, S., "The Security Flag in the IPv4 Header",
              RFC 3514, April 1 2003.

   [RFC5513]  Farrel, A., "IANA Considerations for Three Letter
              Acronyms", RFC 5513, April 1 2009.

   [RFC5514]  Vyncke, E., "IPv6 over Social Networks", RFC 5514,
              April 1 2009.

Authors' Addresses

   Fei Chen
   Huawei Technologies Co., Ltd.
   Email: chenfei57@huawei.com

   Wenhao Sun
   Huawei Technologies Co., Ltd.
   Email: sam.sunwenhao@huawei.com

   Xiang Yu
   Huawei Technologies Co., Ltd.
   Email: yolanda.yu@huawei.com