INTERNET-DRAFT                                                   F. Chen
Intended Status: Informational                                    W. Sun
Expires: Feb 09, 2019                                Huawei Technologies
                                                            Aug 10, 2018


           Problem Statement of RoCEv2 Congestion Management
            draft-chen-nfsv4-rocev2-cm-problem-statement-00


Abstract

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol.
   The RoCEv2 specification does not define congestion management or
   load balancing methods. RoCEv2 relies on the existing link-layer
   flow control, IEEE 802.1Qbb Priority-based Flow Control (PFC), to
   provide a lossless network. RoCEv2 Congestion Management (RCM) uses
   Explicit Congestion Notification (ECN, defined in RFC 3168) to
   signal congestion to the destination, and uses the resulting
   congestion notification to reduce the injection rate, increasing it
   again when the extent of congestion decreases. A growing number of
   congestion management practices for RoCEv2, such as DCQCN (Data
   Center Quantized Congestion Notification), are appearing in the
   industry. There is demand for a new RoCE protocol (temporary alias
   RoCEv3) that provides stronger congestion management and load
   balancing mechanisms for RDMA deployment in modern datacenters.


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1  Introduction
   2  Terminology
   3  Abbreviations
   4  Problem statement & requirements
   5  Current Congestion Management for RoCEv2
     5.1  PFC
     5.2  ECN
   6  Congestion Management Practice
     6.1  Packet Retransmission
     6.2  Congestion Control Mechanisms
     6.3  Re-ordering
     6.4  Load Balancing
   7  Summary
   8  Security Considerations
   9  IANA Considerations
   10 References


1  Introduction

   With the emergence of distributed storage, AI/HPC, machine learning,
   and similar workloads, modern datacenter applications demand high
   throughput (40 Gbps and above) and ultra-low latency (less than 10
   microseconds per hop) from the network, with low CPU overhead.
   Remote Direct Memory Access (RDMA) can meet these needs on Ethernet.

   On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
   protocol. RoCEv2 is a straightforward extension of the RoCE protocol
   that involves a simple modification of the RoCE packet format.
   RoCEv2 packets carry an IP header, which allows traversal of IP L3
   routers, and a UDP header, which serves as a stateless encapsulation
   layer for the RDMA transport protocol packets over IP [1].

   RoCEv2 Congestion Management (RCM) provides the capability to avoid
   congestion hot spots and optimize the throughput of the fabric. RCM
   relies on the existing link-layer flow control, IEEE 802.1Qbb (PFC),
   to provide a drop-free network. RCM also uses ECN (RFC 3168) to
   signal congestion to the destination, and uses the resulting
   congestion notification to reduce the injection rate, increasing it
   again when the extent of congestion decreases.

   A growing number of congestion management practices for RoCEv2, such
   as DCQCN, are appearing in the industry. Should we consider
   developing a next-generation RoCE protocol (alias RoCEv3) with
   stronger congestion management and load balancing mechanisms for
   RDMA deployment in modern datacenters?


2  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


3  Abbreviations

   RCM - RoCEv2 Congestion Management

   PFC - Priority-based Flow Control

   ECN - Explicit Congestion Notification

   DCQCN - Data Center Quantized Congestion Notification

   AI/HPC - Artificial Intelligence/High-Performance Computing

   ECMP - Equal-Cost Multipath


4  Problem statement & requirements

   Network congestion occurs in network switches when incoming traffic
   exceeds the bandwidth of the outgoing link on which it has to be
   transmitted.
   Congestion is the primary source of packet loss in the network and
   leads to dramatic performance degradation.

   Generally, RoCEv2 relies on the link-layer flow control IEEE
   802.1Qbb (PFC) to provide a lossless underlying network. Lossless
   networks implement a flow control mechanism that pauses the traffic
   on the incoming link before the buffer overflows, and thereby
   prevents packets from being dropped [2]. However, PFC can lead to
   poor application performance due to problems such as head-of-line
   blocking and unfairness [3]. To avoid the problems introduced by
   PFC, another line of research studies congestion control mechanisms
   over lossy networks.

   A protocol with stronger congestion management capability is needed
   to achieve high throughput and low latency in large-scale datacenter
   networks while placing more flexible requirements on the underlay
   network. Interoperability among industry practices is also required.


5  Current Congestion Management for RoCEv2

5.1  PFC

   RDMA is deployed using the RoCEv2 protocol, which relies on IEEE
   802.1Qbb Priority-based Flow Control (PFC) to enable a drop-free
   network.

   PFC is a link-level protocol that allows a receiver to assert flow
   control, telling the transmitter to pause sending traffic for a
   specified priority. However, because PFC stops all traffic in a
   particular traffic class at the ingress port, flows destined to
   other ports are also blocked.

   The known problems of PFC are head-of-line blocking, unfairness, and
   deadlock [4].

5.2  ECN

   Explicit Congestion Notification (ECN) enables end-to-end congestion
   notification between two endpoints on TCP/IP-based networks. ECN
   signals congestion with the goal of reducing packet loss and delay
   by making the sending device decrease its transmission rate until
   the congestion clears, without dropping packets.
   RFC 3168, "The Addition of Explicit Congestion Notification (ECN) to
   IP", defines ECN.


6  Congestion Management Practice

6.1  Packet Retransmission

   NICs were not designed to deal with losses efficiently. The receiver
   discards out-of-order packets, and the sender performs go-back-N
   retransmission on detecting packet loss. RoCEv2 adopts go-back-N
   loss recovery and therefore needs a lossless layer 2 (provided by
   PFC) for good performance [5].

   If a new RDMA protocol does not rely on a lossless layer 2 network,
   an efficient packet retransmission method is necessary.

6.2  Congestion Control Mechanisms

6.2.1  RTT-based Congestion Control

   A typical example of RTT-based congestion control is TIMELY [6], a
   delay-based congestion control protocol for use in the datacenter.
   It shows that simple packet delay, measured as round-trip time (RTT)
   at hosts, is an effective congestion signal without the need for
   switch feedback. TIMELY measures RTT with microsecond accuracy, and
   these RTTs are sufficient to estimate switch queueing. TIMELY
   adjusts transmission rates using RTT gradients to keep packet
   latency low while delivering high bandwidth.

   Because the RDMA transport runs in the NIC and is sensitive to
   packet drops, which hurt performance badly, PFC remains necessary.
   That is, TIMELY needs PFC to provide a lossless underlay network.

6.2.2  Credit-based Congestion Control

   ExpressPass [7] is an end-to-end credit-scheduled, delay-bounded
   congestion control for datacenters. ExpressPass uses credit packets
   to control congestion even before sending data packets, which
   enables it to achieve bounded delay and fast convergence. It uses
   end-to-end credit transfer for bandwidth allocation and fine-grained
   packet scheduling.

6.2.3  ECN-based Congestion Control

   Data Center Quantized Congestion Notification (DCQCN) [3] is an
   end-to-end congestion control scheme for RoCEv2.
   DCQCN combines ECN and PFC to support end-to-end lossless Ethernet.
   The idea behind DCQCN is to let ECN perform flow control by
   decreasing the transmission rate at the sender when congestion
   starts, thereby minimizing the time during which PFC is triggered.

   Although the RoCEv2 standard [1] does not list DCQCN as the RCM
   mechanism, it is widely used in industry practice.

6.3  Re-ordering

   When packets arrive at the destination out of order, the destination
   should store the packets to restore the order. The destination
   should assign dedicated buffer resources to perform re-ordering.
   There are many methods to implement re-ordering, either on the
   switch or on the NIC side. This document does not go into the
   details.

6.4  Load Balancing

6.4.1  ECMP

   RoCEv2 packets carry an opaque flow identifier in the UDP Source
   Port field, which ECMP uses for path selection to balance load and
   improve utilization of the fabric topology.

   Traditional ECMP cannot balance load well in the datacenter network
   because it splits load at the granularity of a flow.

   The finer the granularity of load balancing, the more effective the
   load balancing is and the higher the utilization of network
   bandwidth that can be achieved.

6.4.2  Flowlet

   A typical example of flowlet-based load balancing is CONGA [8].
   CONGA is a network-based, distributed, congestion-aware load
   balancing mechanism for datacenters. It splits TCP flows into
   flowlets, estimates real-time congestion on fabric paths, and
   allocates flowlets to paths based on feedback from remote switches.

   Flowlets are bursts of packets from a flow. When the idle interval
   between two bursts of packets is larger than the maximum difference
   in latency among the paths, the second burst can be sent along a
   different path than the first without reordering packets.
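
   The flowlet rule above can be illustrated with a minimal sketch. It
   is not CONGA itself: the class name FlowletSwitch, the parameter
   gap, and the per-path packet counter used as a stand-in for a real
   congestion estimate are all assumptions made for this illustration.
   CONGA relies on feedback from remote switches instead.

```python
from typing import Dict, Hashable, List, Tuple


class FlowletSwitch:
    """Toy flowlet table (hypothetical, for illustration only)."""

    def __init__(self, num_paths: int, gap: float) -> None:
        # gap: idle threshold in seconds. It is assumed to exceed the
        # maximum latency difference among the paths.
        self.gap = gap
        # Packets sent per path; a crude local stand-in for the
        # congestion feedback a real scheme would use.
        self.load: List[int] = [0] * num_paths
        # flow id -> (arrival time of last packet, path in use)
        self.table: Dict[Hashable, Tuple[float, int]] = {}

    def route(self, flow_id: Hashable, now: float) -> int:
        """Return the path index for a packet of flow_id seen at now."""
        entry = self.table.get(flow_id)
        if entry is not None and now - entry[0] <= self.gap:
            # Same flowlet: stay on the current path to keep ordering.
            path = entry[1]
        else:
            # Idle gap exceeded (or new flow): a new flowlet starts and
            # may safely move to the least-loaded path.
            path = min(range(len(self.load)), key=self.load.__getitem__)
        self.table[flow_id] = (now, path)
        self.load[path] += 1
        return path
```

   With two paths and a 500-microsecond gap, packets of one flow
   arriving at 0 s and 0.0001 s stay on path 0, while a packet arriving
   at 0.001 s starts a new flowlet and is moved to path 1.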

6.4.3  Per-packet

   Per-packet load balancing is the most effective because it has the
   finest granularity. The consequence is that packets belonging to the
   same flow are allocated to different paths. When the forwarding
   delays of the paths differ, packets may arrive at the receiver out
   of order.


7  Summary

   Emerging RoCE-based applications are driving the adoption of
   different congestion management mechanisms in various modern
   large-scale datacenter networks. This problem statement does not
   cover all mainstream mechanisms, and it will need to be extended
   when considering a future RoCE protocol (temporary alias RoCEv3)
   with robust congestion management capability and more flexible
   requirements on the layer 2 network, which might be the next
   direction.


8  Security Considerations

   This document does not introduce any additional security
   constraints.


9  IANA Considerations

   TBD


10  References

   [1] InfiniBand Trade Association. Supplement to InfiniBand
   Architecture Specification Volume 1 Release 1.2.2, Annex A17: RoCEv2
   (IP Routable RoCE), 2014.

   [2] Understanding RoCEv2 Congestion Management,
   https://community.mellanox.com/docs/DOC-2321

   [3] Zhu, Yibo, et al. "Congestion Control for Large-Scale RDMA
   Deployments." ACM SIGCOMM Computer Communication Review 45.5 (2015):
   523-536.

   [4] Hu, Shuihai, et al. "Deadlocks in Datacenter Networks: Why Do
   They Form, and How to Avoid Them." ACM Workshop on Hot Topics in
   Networks, 2016: 92-98.

   [5] Mittal, Radhika, et al. "Revisiting Network Support for RDMA."
   (2018).

   [6] Mittal, Radhika, et al. "TIMELY: RTT-based Congestion Control for
   the Datacenter." ACM Conference on Special Interest Group on Data
   Communication (SIGCOMM), 2015: 537-550.

   [7] Cho, Inho, D. Han, and K. Jang. "ExpressPass: End-to-End
   Credit-based Congestion Control for Datacenters." (2016).

   [8] Alizadeh, Mohammad, et al. "CONGA: Distributed Congestion-Aware
   Load Balancing for Datacenters." ACM Conference on SIGCOMM, 2014:
   503-514.


Authors' Addresses

   Fei Chen
   Huawei Technologies Co., Ltd.
   Email: chenfei57@huawei.com

   Wenhao Sun
   Huawei Technologies Co., Ltd.
   Email: sam.sunwenhao@huawei.com