Network Working Group                                           R. Miao
Internet-Draft                                                   H. Liu
Intended status: Experimental                             Alibaba Group
Expires: September 8, 2022                                        R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                             Mellanox Technologies, Inc.
                                                             J. Tantsura
                                                   Microsoft Corporation
                                                           March 7, 2021

          HPCC++: Enhanced High Precision Congestion Control
                     draft-miao-rtgwg-hpccplus-00

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent
   limitations for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion
   and overreaction to the congestion signaling using inband and
   granular telemetry, HPCC++ can quickly converge to utilize all the
   available bandwidth while avoiding congestion, and can maintain
   near-zero in-network queues for ultra-low latency.  HPCC++ is also
   fair and easy to deploy in hardware, implementable with commodity
   NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 8, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  System Overview
   4.  HPCC++ Algorithm
       4.1.  Notations
       4.2.  Design Functions and Procedures
   5.  Configuration Parameters
   6.  Design enhancement and implementation
       6.1.  Inband telemetry padding at the network switches
             6.1.1.  Inband telemetry on IFA2.0
             6.1.2.  Inband telemetry on IOAM
             6.1.3.  Inband telemetry on P4
       6.2.  Congestion Notification
             6.2.1.  Forward direction Congestion detection
             6.2.2.  Reverse direction
       6.3.  Congestion control at NICs
             6.3.1.  Sender-based HPCC
             6.3.2.  Receiver-based HPCC
   7.  Reference Implementation
       7.1.  Implementation on RDMA RoCEv2
       7.2.  Implementation on TCP
   8.  IANA Considerations
   9.  Discussion
       9.1.  Internet Deployment
       9.2.  Switch-assisted congestion control
       9.3.  Work with QoS queuing
       9.4.  Path migration
   10. Acknowledgments
   11. Contributors
   12. References
       12.1.  Normative References
       12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements, as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory
   access) often use hardware-offloading solutions.  In some cases, the
   RDMA networks still face fundamental challenges in reconciling low
   latency, high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.  The key idea behind HPCC++ is to leverage the
   precise link load information signaled through inband telemetry to
   compute accurate flow rate updates.  Unlike existing approaches that
   often require a large number of iterations to find the proper flow
   rates, HPCC++ requires only one rate update step in most cases.
   Using precise information from inband telemetry enables HPCC++ to
   address the limitations of current congestion control schemes.
   First, HPCC++ senders can quickly ramp up flow rates for high
   utilization and ramp down flow rates for congestion avoidance.
   Second, HPCC++ senders can quickly adjust the flow rates to keep
   each link's output rate slightly lower than the link's capacity,
   preventing queues from building up while preserving high link
   utilization.  Finally, since sending rates are computed precisely
   based on direct measurements at switches, HPCC++ requires merely
   three independent parameters that are used to tune fairness and
   efficiency.

   The base form of HPCC++ is the original HPCC algorithm, whose full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes system
   constraints into account and aims to reduce the design overhead and
   further improve performance.  Section 6 describes the proposed
   design enhancements and guidelines in detail.

   This document describes the architecture changes in switches and
   end-hosts needed to transmit and consume inband telemetry, which
   improves the efficiency of handling network congestion.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.  When the receiver gets
   the packet, it may copy all the inband telemetry recorded from the
   network into the ACK message it sends back to the sender, and then
   the sender decides how to adjust its flow rate each time it receives
   an ACK with network load information.
   Alternatively, the receiver may calculate the flow rate based on the
   inband telemetry information and feed the calculated rate back to
   the sender.  The notification packets would include delayed ACK
   information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the feedback reports traverse.
   Those network nodes are not shown in the figure for the sake of
   brevity.

   +---------+   pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
   |  Data   |-------->|       |-------->|       |-------->|   Data   |
   |  Sender |=========|Switch1|=========|Switch2|=========| Receiver |
   +---------+  Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
       /|\                                                       |
        |                                                        |
        +--------------------------------------------------------+
                       Notification Packets/ACKs

          Figure 1: System Overview (tlm = inband telemetry)

   o  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based congestion control scheme that controls the
      number of inflight bytes.  The inflight bytes are the amount of
      data that has been sent but not yet acknowledged.  Controlling
      inflight bytes has an important advantage compared to controlling
      rates.  In the absence of congestion, the inflight bytes and rate
      are interchangeable via the equation inflight = rate * T, where T
      is the base propagation RTT.  The rate can be calculated locally
      or obtained from the notification packet.  The sender may further
      use a data pacing mechanism, potentially implemented in hardware,
      to limit the rate accordingly.

   o  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.  The inband telemetry
      information reports the current load of the packet's egress port,
      including timestamp (ts), queue length (qLen), transmitted bytes
      (txBytes), and link bandwidth capacity (B).  In addition, the
      inband telemetry contains the switch_ID and port_ID to identify a
      link.

   o  Data receiver: responsible for either reflecting the inband
      telemetry information in the data packet back to the sender, or
      calculating the proper flow rate based on the network congestion
      information in the inband telemetry and sending notification
      packets back to the sender.

4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the core
   congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used in
   the HPCC++ algorithm.  Figure 3 also includes the default values for
   the algorithm parameters, chosen either to represent a typical
   setting in practical applications or based on theoretical and
   simulation studies.

   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for Link j                            |
   | I_j          | Estimated inflight bytes for Link j             |
   | U_j          | Normalized inflight bytes for Link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                      Figure 2: List of variables.

   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

     Figure 3: List of algorithm parameters and their default values.

4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:     txRate = (ack.L[i].txBytes - L[i].txBytes) /
                    (ack.L[i].ts - L[i].ts);
    5:     u' = min(ack.L[i].qlen, L[i].qlen) / (ack.L[i].B * T)
                + txRate / ack.L[i].B;
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:     W = Wc / (U/eta) + W_ai;
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W;

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Lines 14-15
   and 18-19).  The sender also remembers the pacing rate and the
   current inband telemetry information at Line 27.  The sender
   computes a new window size W at Line 23 or Line 26, depending on
   whether Wc is updated, using the functions MeasureInflight and
   ComputeWind.  Function MeasureInflight estimates the normalized
   inflight bytes with Eqn (2) of [SIGCOMM-HPCC] at Line 5.  First, it
   computes the txRate of each link from the current and last
   accumulated transferred bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to 7 selects
   max_i(U_i) as in Eqn (3).  Instead of directly using max_i(U_i), we
   use an EWMA (Exponentially Weighted Moving Average) to filter out
   the noise from timer inaccuracy and transient queues (Line 9).
   Function ComputeWind combines multiplicative increase/decrease
   (MI/MD) and additive increase (AI) to balance the reaction speed and
   fairness.  If a sender finds it should increase the window size, it
   first tries AI for maxStage times with the step W_ai (Line 17).  If
   it still finds room to increase after maxStage rounds of AI, or if
   the normalized inflight bytes are above eta, it applies Eqn (4) once
   to quickly ramp up or ramp down the window size (Lines 12-13).

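   The following non-normative Python sketch shows one way a host could
   realize the pseudocode above.  The class and member names
   (HpccSender, LinkTelemetry, new_ack, etc.) are illustrative
   assumptions of this sketch rather than part of this specification;
   the parameters T, eta, maxStage, and W_ai correspond to Figure 3.

      # Illustrative, non-normative translation of the pseudocode in
      # Section 4.2.  All names below are introduced for readability
      # only.
      from dataclasses import dataclass

      @dataclass
      class LinkTelemetry:
          ts: float        # timestamp when the packet left the egress port
          qlen: float      # queue length in bytes
          txBytes: float   # accumulated transmitted bytes at time ts
          B: float         # link bandwidth capacity in bytes/second

      class HpccSender:
          def __init__(self, w_init, T, eta=0.95, max_stage=5, w_ai=0.0):
              self.W = self.Wc = w_init    # current and reference windows
              self.T, self.eta = T, eta
              self.max_stage, self.w_ai = max_stage, w_ai
              self.U = 0.0                 # EWMA of normalized inflight bytes
              self.inc_stage = 0
              self.last_update_seq = 0
              self.L = None                # last per-link telemetry snapshot
              self.R = w_init / T          # pacing rate

          def measure_inflight(self, ack_links):
              """Lines 1-10: estimate normalized inflight bytes on the path."""
              u, tau = 0.0, self.T
              for new, old in zip(ack_links, self.L):
                  tx_rate = (new.txBytes - old.txBytes) / (new.ts - old.ts)
                  u_i = (min(new.qlen, old.qlen) / (new.B * self.T)
                         + tx_rate / new.B)
                  if u_i > u:
                      u, tau = u_i, new.ts - old.ts
              tau = min(tau, self.T)
              self.U = (1 - tau / self.T) * self.U + (tau / self.T) * u
              return self.U

          def compute_wind(self, U, update_wc):
              """Lines 11-20: MI/MD on overload, AI otherwise."""
              if U >= self.eta or self.inc_stage >= self.max_stage:
                  W = self.Wc / (U / self.eta) + self.w_ai
                  if update_wc:
                      self.inc_stage, self.Wc = 0, W
              else:
                  W = self.Wc + self.w_ai
                  if update_wc:
                      self.inc_stage, self.Wc = self.inc_stage + 1, W
              return W

          def new_ack(self, ack_seq, snd_nxt, ack_links):
              """Lines 21-27: per-ACK window and pacing-rate update."""
              if self.L is None:
                  self.L = ack_links       # first sample, nothing to compare
              elif ack_seq > self.last_update_seq:
                  self.W = self.compute_wind(
                      self.measure_inflight(ack_links), True)
                  self.last_update_seq = snd_nxt
              else:
                  self.W = self.compute_wind(
                      self.measure_inflight(ack_links), False)
              self.R = self.W / self.T     # pacing rate
              self.L = ack_links

   In such a realization, a host would keep one HpccSender instance per
   flow and call new_ack() for every ACK or notification packet that
   carries telemetry; W and R then drive the window limit and the
   pacing rate.
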
5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals); we set it to 95% by default, which
   sacrifices only 5% of bandwidth but achieves an almost zero queue.
   maxStage controls a simple tradeoff between steady-state stability
   and the speed of reclaiming free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high-bandwidth networks.  W_ai controls the
   tradeoff between the maximum number of concurrent flows on a link
   that can sustain near-zero queues and the speed of convergence to
   fairness.  Note that none of the three parameters are reliability-
   critical.

   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the
   workloads of datacenter applications, where flows are usually short
   and latency-sensitive.  Normally we set a very small W_ai to support
   a large number of concurrent flows on a link, because slower
   convergence to fairness is not critical.  A rule of thumb is to set
   W_ai = W_init*(1-eta)/N, where N is the expected or receiver-
   reported maximum number of concurrent flows on a link.  The
   intuition is that the total additive increase every round (N*W_ai)
   should not exceed the bandwidth headroom, and thus no queue forms.
   Even if the actual number of concurrent flows on a link exceeds N,
   the CC is still stable and achieves full utilization, but just
   cannot maintain zero queues.

6.  Design enhancement and implementation

   There are three components HPCC++ needs to implement: telemetry
   padding, congestion notification, and rate update.

6.1.  Inband telemetry padding at the network switches

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  The switch should capture inband telemetry
   information that includes the link load (txBytes, qlen, ts) and the
   link spec (switch_ID, port_ID, B) at the egress port.  Note that
   each switch should record all of this information in a single
   snapshot to achieve a precise link load estimate.  Inside a data
   center, the path length is often no more than 5 hops.  The overhead
   of the inband telemetry padding for HPCC++ is therefore considered
   to be low.

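   As a purely illustrative aid, the per-hop snapshot described above
   can be modeled as the record below.  The field names follow Sections
   3 and 6.1; the record layout is an assumption of this sketch, not a
   wire format (Sections 6.1.1 to 6.1.3 point to concrete encodings).

      # Non-normative sketch of the per-hop snapshot a switch appends
      # at its egress port.  Widths and encoding are left to the
      # telemetry protocol in use (IFA, IOAM, or INT).
      from dataclasses import dataclass
      from typing import List

      @dataclass
      class HopRecord:
          switch_id: int   # identifies the switch
          port_id: int     # identifies the egress port (link)
          B: int           # link bandwidth capacity, bytes/second
          ts: int          # egress timestamp of this packet
          qlen: int        # egress queue length in bytes at time ts
          txBytes: int     # accumulated bytes transmitted on the port

      def append_hop(records: List[HopRecord], snapshot: HopRecord) -> None:
          # All fields of 'snapshot' must be taken at the same instant so
          # the sender can derive txRate and inflight bytes consistently
          # (see Section 4.2).
          records.append(snapshot)
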
   As long as the above requirements are met, HPCC++ is open to a
   variety of inband telemetry format standards, which are orthogonal
   to the HPCC++ algorithm.  Although this document does not mandate a
   particular inband telemetry header format or encapsulation, we
   provide concrete implementation specifications using standard inband
   telemetry protocols, including IFA [I-D.ietf-kumar-ippm-ifa], IETF
   IOAM [I-D.ietf-ippm-ioam-data], and P4.org INT [P4-INT].  In fact,
   emerging inband telemetry protocols are informing the evolution of a
   broader range of protocols and network functions; this document
   leverages that trend to propose architecture changes that support
   in-network functions such as congestion control with high
   efficiency.

6.1.1.  Inband telemetry on IFA2.0

   For more details, please refer to IFA [I-D.ietf-kumar-ippm-ifa].

6.1.2.  Inband telemetry on IOAM

   Please refer to IETF IOAM [I-D.ietf-ippm-ioam-data].

6.1.3.  Inband telemetry on P4

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     nHop      |        pathID         |        Padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Speed |                   Timestamp                   |txBytes|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        txBytes(lower)         |         Queue Length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            2nd Hop                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         2nd Hop(lower)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 4: Example P4.org INT header

   Figure 4 shows the packet format of the INT padding after the UDP
   header.  The field nHop is the hop count of the packet's path.  The
   field pathID is the XOR of all the switch IDs (which are 12 bits)
   along the path.  The sender sets nHop and pathID to 0.  Each switch
   along the path increments nHop by 1 and XORs its own switch ID into
   the pathID.  The sender uses pathID to determine whether the path of
   the flow has changed.  If so, it discards the existing status
   records of the flow and builds up new records.  Each switch has an
   8-byte field to record the status of the packet's egress port when
   the packet is emitted.  B (the Speed field) is an enum type which
   indicates the speed of the port (e.g., 40Gbps, 100Gbps, etc.).
   Timestamp (24 bits) is the time when the packet is emitted from its
   egress port, txBytes (20 bits) is the accumulated total bytes sent
   from the egress port, and Queue Length (16 bits) is the current
   queue length of the egress port.

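   The sketch below illustrates how an end-host might decode one 8-byte
   per-hop record of the example layout above.  The 4-bit width assumed
   for the Speed field is inferred from the figure and prose (4 + 24 +
   20 + 16 = 64 bits) and is an assumption of this sketch, not a
   normative definition.

      # Non-normative sketch: decoding one 8-byte per-hop record from
      # the example INT layout in Figure 4.
      def decode_hop_record(record: bytes):
          """Return (speed_enum, timestamp, tx_bytes, queue_length)."""
          assert len(record) == 8
          value = int.from_bytes(record, "big")
          speed_enum   = (value >> 60) & 0xF        # 4-bit port speed enum
          timestamp    = (value >> 36) & 0xFFFFFF   # 24-bit egress timestamp
          tx_bytes     = (value >> 16) & 0xFFFFF    # 20-bit accumulated bytes
          queue_length = value & 0xFFFF             # 16-bit queue length
          return speed_enum, timestamp, tx_bytes, queue_length
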
6.2.  Congestion Notification

   HPCC++ uses congestion notification to fetch network congestion
   information from switches for proper rate updates at end-hosts.
   Although the basic algorithm described in Section 4 adds inband
   telemetry information into every data packet for optimal
   performance, HPCC++ supports flexible implementation choices to work
   seamlessly with transport protocol stacks.  We consider congestion
   notification choices in both the forward and reverse directions of
   the traffic.

6.2.1.  Forward direction Congestion detection

   The forward direction is the traffic direction of data packets that
   experience bandwidth contention and possible network congestion.
   The function of congestion notification in the forward direction is
   to fetch inband telemetry from switches.  HPCC++ defines two
   approaches for doing this.

   1.  Inband with data packet.

   This is the basic algorithm setting described in Section 4, where
   the end-host inserts an inband telemetry header into data packets.
   Switches along the path detect the inband telemetry header and add
   inband telemetry information into the data packet, so the sender can
   react to congestion as soon as the very first packet observes the
   network congestion.  This is especially helpful to reduce the risk
   of severe congestion in incast scenarios during the first round-trip
   time.  In addition, the original HPCC algorithm introduces Wc
   precisely to solve the over-reaction issue that arises from this
   per-packet response.  Different from Section 4, the end-host can
   choose to use every data packet or only a subset of data packets to
   reduce the overhead.  To insert the telemetry header, each telemetry
   protocol (IFA, IETF IOAM, and P4.org INT) has its own specific
   settings.

   2.  Probe packet.

   Since having switches touch every data packet to insert inband
   telemetry may raise security or performance concerns, HPCC++ also
   supports an ``out-of-band'' approach in which end-hosts use
   specially generated probe packets to fetch inband telemetry from
   switches.  The probe packets should take the same routing path and
   QoS queueing as the data packets.  End-hosts can generate probe
   packets less frequently, and we recommend once per round-trip time.
   In addition, the end-host issues probe packets only when it has data
   packets in flight.

6.2.2.  Reverse direction

   In the reverse direction, the receiver conveys inband telemetry back
   to the traffic sender for rate updates.  Similar to the forward
   direction, there are also inband and out-of-band approaches.

   1.  Inband with ACK packet.

   HPCC++ supports using the ACK packets of transport protocols to
   convey the inband telemetry.  TCP generates an ACK once per data
   packet or once per a few data packets.  With ACK packets, the
   receiver sends the accumulated inband telemetry back to the sender
   for rate updates.

   2.  Notification packet.

   Using ACK packets for inband telemetry notification requires
   transport-stack modification and can delay the notification when a
   delayed-acknowledgment mechanism is used.  Hence, HPCC++ allows the
   receiver to use specially generated notification packets to deliver
   inband telemetry.  A notification packet is generated for each probe
   packet or data packet that carries inband telemetry.

6.3.  Congestion control at NICs

6.3.1.  Sender-based HPCC

   Figure 5 shows an HPCC++ implementation on a NIC.  The NIC provides
   an HPCC++ module that resides on the data path of the NIC; the
   HPCC++ module realizes both the sender and receiver roles.

   +--------------------------------------------------------------------+
   | +---------+  window update  +-----------+  PktSend   +-----------+ |
   | |         |---------------->| Scheduler |----------->|Tx pipeline|-+->
   | |         |   rate update   +-----------+            +-----------+ |
   | | HPCC++  |                                                ^       |
   | |         |                              inband telemetry |        |
   | | module  |                                                |       |
   | |         |                                          +-----+-----+ |
   | |         |<-----------------------------------------|Rx pipeline|<+--
   | +---------+         telemetry response event         +-----------+ |
   +--------------------------------------------------------------------+

               Figure 5: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC CC algorithm on the sender side for
   every flow in the NIC.  A flow can be defined by transport
   parameters including the 5-tuple, the destination QP (queue pair),
   etc.  The module receives per-flow inband telemetry response events
   generated by the RX pipeline, adjusts the sending window and rate,
   and updates the scheduler with the rate and window of the flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value it receives from the algorithm.  It also
   maintains the current sending window size for active flows.  If the
   pacing mechanism and the flow's sending window permit, the scheduler
   issues a PktSend command for the flow to the TX pipeline.

   The TX pipeline implements packet processing.  Once it receives the
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response probe
   packet.

   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data packet
   containing inband telemetry, or a dedicated telemetry request probe
   packet.  The TX pipeline may process and edit the telemetry data and
   then sends the data back to the sender using either an ACK packet of
   the flow or a dedicated telemetry response packet.

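   The sketch below gives a purely illustrative view of the sender-side
   event flow described in item 1 above.  The Scheduler interface and
   the on_telemetry_response() handler are assumptions made for this
   sketch; HpccSender refers to the illustrative class in Section 4.2.

      # Non-normative sketch of the sender-side event flow of Figure 5.
      class Scheduler:
          """Pacing mechanism plus per-flow sending-window bookkeeping."""
          def __init__(self):
              self.flows = {}   # flow_id -> (rate, window)

          def update(self, flow_id, rate, window):
              self.flows[flow_id] = (rate, window)

          def may_send(self, flow_id, inflight_bytes, pacing_ok):
              # If True, the scheduler would issue a PktSend command for
              # this flow to the TX pipeline.
              rate, window = self.flows[flow_id]
              return pacing_ok and inflight_bytes < window

      def on_telemetry_response(flow_id, ack_seq, snd_nxt, hop_records,
                                senders, scheduler):
          # The RX pipeline has parsed a telemetry response (an ACK with
          # telemetry or a dedicated response packet) and raised a
          # per-flow event.
          cc = senders[flow_id]                       # HpccSender instance
          cc.new_ack(ack_seq, snd_nxt, hop_records)   # window/rate update
          scheduler.update(flow_id, cc.R, cc.W)       # pace accordingly
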
6.3.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If ACK packets already exist
   for reliability purposes, the inband telemetry information can be
   echoed back to the sender via ACK self-clocking.  Not all ACK
   packets need to carry the inband telemetry information.  To reduce
   the Packet Per Second (PPS) overhead, the receiver may examine the
   inband telemetry information and adopt the technique of delayed ACKs
   that only sends out an ACK for every few received packets.  In order
   to reduce PPS even further, one may implement the algorithm at the
   receiver and feed back the calculated window in the ACK packet once
   every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The receiver
   performs the same functions except using int.L instead of ack.L.
   The new procedure NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W);
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out an ACK packet that includes the W information.
   Otherwise, it just updates the W information locally.  This reduces
   the amount of traffic that needs to be fed back to the data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   The improvement would allow flows to quickly converge to fairness
   without causing large swings under heavy load.

7.  Reference Implementation

   HPCC++ can be adopted as the CC algorithm by a wide range of
   transport protocols such as TCP and UDP, as well as others that may
   run on top of them, such as iWARP, RoCE, etc.  It requires a window
   limit and congestion feedback through ACK self-clocking, which
   naturally conforms to the TCP design paradigm.  On top of that,
   HPCC++ introduces a scheme to measure the total inflight bytes for
   more precise congestion control.  To run over UDP, some incremental
   modifications are needed to enforce the window limit and to collect
   congestion feedback via probing packets.

7.1.  Implementation on RDMA RoCEv2

   We describe a reference implementation on RDMA RoCEv2.  This is an
   implementation of ``Sender-based HPCC++'' (see Section 6.3.1) that
   uses dedicated probe packets to collect the telemetry.  The HPCC++
   module in the sender triggers the sending of a ``telemetry request
   packet'' for a given flow.  The NIC then sends the probe packet,
   which has the same IP and UDP headers as the data packets of the
   given flow.  Such a packet is expected to be sent every RTT; see
   Section 6 for more details.  On receiving a telemetry request
   packet, the receiver NIC extracts the telemetry from all the links
   along the path from the sender.  The HPCC++ module chooses the link
   with the highest inflight bytes and sends its telemetry (queue
   length, timestamp, and tx bytes) back to the sender in a dedicated
   ``telemetry response packet''.  On receiving a telemetry response
   packet, the sender NIC extracts the telemetry and passes it to the
   HPCC++ module, which uses this information to implement the rate
   update scheme.

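   The following sketch illustrates, under the assumptions of the
   earlier sketches, how the receiver-side selection described above
   might be realized: the receiver keeps the previous per-hop sample
   for the flow, estimates each link's normalized load as in Section
   4.2, and echoes only the most loaded hop's telemetry back to the
   sender.  The helper names are hypothetical.

      # Non-normative sketch of the receiver-side selection in Section
      # 7.1.  'prev' and 'curr' are lists of HopRecord (Section 6.1
      # sketch); 'prev' holds the previous per-hop sample for this flow,
      # which is an assumption of this sketch.
      def select_bottleneck_hop(prev, curr, T):
          """Return the index of the hop with the highest normalized load."""
          best_idx, best_u = 0, -1.0
          for i, (old, new) in enumerate(zip(prev, curr)):
              tx_rate = (new.txBytes - old.txBytes) / max(new.ts - old.ts, 1)
              u = min(new.qlen, old.qlen) / (new.B * T) + tx_rate / new.B
              if u > best_u:
                  best_idx, best_u = i, u
          return best_idx

      def build_telemetry_response(prev, curr, T):
          # Echo only the bottleneck hop's (qlen, ts, txBytes) to the
          # sender, keeping the response packet small.
          i = select_bottleneck_hop(prev, curr, T)
          hop = curr[i]
          return {"qlen": hop.qlen, "ts": hop.ts, "txBytes": hop.txBytes}
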
7.2.  Implementation on TCP

   Bringing the benefit of precise congestion control to TCP is a
   natural next step.  Since TCP segmentation on the TX side (e.g.,
   TSO) and coalescing on the RX side (e.g., GRO) happen in the NIC
   hardware or the lower layers of the TCP/IP stack, carrying
   per-packet inband telemetry information between the TCP congestion
   control engine and the network fabric has to work with TSO and GRO.
   Instead, one way to adopt HPCC++ for TCP is to use special probe and
   notification packets to retrieve inband telemetry information.  The
   sender generates a probe packet when it is actively sending data.
   The probe packet has the same 5-tuple (source and destination
   addresses, source and destination ports, and protocol number) as the
   data packets, plus the inband telemetry header.  The switches along
   the path identify the probe packet by its inband telemetry header
   and insert the inband telemetry.  Once it receives the probe packet
   with inband telemetry, the receiver replies with a response packet
   that piggybacks the inband telemetry back to the sender.  Note that
   both probe and response packets use a special DSCP number so that
   they can bypass TSO and GRO on each side.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Discussion

9.1.  Internet Deployment

   Although the discussion above mainly focuses on the data center
   environment, HPCC++ can be adopted on the Internet at large.  There
   are several security considerations one should be aware of.

   Privacy concerns may arise when the telemetry information is
   conveyed across Autonomous Systems (ASes) and back to end-users.
   The link load information captured in telemetry can potentially
   reveal the provider's network capacity, route utilization,
   scheduling policy, etc.  These are usually considered sensitive data
   by network providers.  Hence, certain actions may be taken to
   anonymize the telemetry data and to convey only a relative ratio for
   rate adaptation across ASes without revealing the actual network
   load.

   Another consideration is the security of the received telemetry
   information.  The rate adaptation mechanism in HPCC++ relies on
   feedback from the network.  As such, it is vulnerable to attacks
   where feedback messages are hijacked, replaced, or intentionally
   injected with misleading information resulting in denial of service,
   similar to those that can affect TCP.  It is therefore RECOMMENDED
   that the notification feedback message be at least integrity-
   checked.  In addition, [I-D.ietf-avtcore-cc-feedback-message]
   discusses the potential risk of a receiver providing misleading
   congestion feedback information and the mechanisms for mitigating
   such risks.

9.2.  Switch-assisted congestion control

   HPCC++ falls in the general category of switch-assisted congestion
   control.  However, HPCC++ includes a few unique design choices that
   are different from other switch-assisted approaches.

   o  First, HPCC++ implements a primal-mode algorithm that requires
      only the ``write-to-packet'' operation from switches, which is
      already supported by telemetry protocols like INT [P4-INT] or
      IOAM [I-D.ietf-ippm-ioam-data].  Please note that this is very
      different from dual-mode algorithms such as XCP
      [Katabi-SIGCOMM2002] and RCP [Dukkipati-RCP], where switches take
      an active role in determining flows' rates.

   o  Second, HPCC++ achieves fast utilization convergence by
      decoupling it from fairness convergence, which is inspired by
      XCP.

   o  Third, HPCC++ enables switch-guided multiplicative increase (MI)
      by defining the ``inflight bytes'' to quantify the link load.
      The inflight bytes indicate both the underload and overload of a
      link precisely, and thus allow a flow to increase or decrease its
      rate multiplicatively and safely.  By contrast, traditional
      approaches that use the queue length or RTT as the feedback
      cannot guide the rate increase and instead have to rely on
      additive increase (AI) with heuristics.  As the link speed
      continues to grow, this becomes increasingly slow in reclaiming
      the unused bandwidth.  Besides, queue-based feedback mechanisms
      are subject to latency inflation.

   o  Last, HPCC++ uses the TX rate instead of the RX rate used by XCP
      and RCP.  As detailed in [SIGCOMM-HPCC], we view the TX rate as
      more precise, because the RX rate and the queue length overlap
      and thus cause oscillation.

9.3.  Work with QoS queuing

   Under the use of QoS (Quality of Service) priority queuing in
   switches, the length of a flow's own queue does not reflect the
   actual queuing time and the exact extent of congestion.  Although
   general approaches for running congestion control with QoS queuing
   are outside the scope of this document, we provide a few hints for
   running HPCC++ in a QoS-friendly manner.  In this case, HPCC++ can
   leverage the packet sojourn time (the egress timestamp minus the
   ingress timestamp) instead of the queue length to quantify the
   packet's actual queuing delay.  In addition, operators typically use
   Deficit Weighted Round Robin (DWRR) instead of strict priority (SP)
   as their QoS scheduling to prevent traffic starvation.  DWRR
   provides a minimum bandwidth guarantee for each queue, so HPCC++ can
   leverage it for precise rate updates to avoid congestion.

9.4.  Path migration

   HPCC++ allows switches and end-hosts to share precise information
   about network utilization, which suggests a framework for path
   selection and rate control at end-hosts.  The framework HPCC++
   enables leverages each switch to report its link load information
   via inband telemetry.  The end-host fetches inband telemetry along
   the traffic routes and makes timely and accurate decisions on path
   selection and traffic admission.

10.  Acknowledgments

   The authors would like to thank RTGWG members for their valuable
   review comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation and
   evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

12.2.  Informative References

   [Dukkipati-RCP]
              Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              control to make flows complete quickly.", Stanford
              University, 2008.

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. A. Ramalho,
              "RTP Control Protocol (RTCP) Feedback for Congestion
              Control", draft-ietf-avtcore-cc-feedback-message-09 (work
              in progress), November 2020.

   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", February 2019.

   [Katabi-SIGCOMM2002]
              Katabi, D., Handley, M., and C. Rohrs, "Congestion
              Control for High Bandwidth-Delay Product Networks", ACM
              SIGCOMM, Pittsburgh, Pennsylvania, USA, October 2002.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification,
              v2.0", February 2020.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Fei Feng, F.,
              Tang, L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M.,
              and M. Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom, August
              2015.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   USA

   Email: miao.rui@alibaba-inc.com

   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   USA

   Email: hongqiang.liu@alibaba-inc.com

   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   USA

   Email: rong.pan@intel.com

   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: jk.lee@intel.com

   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: chang.kim@intel.com

   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   USA

   Email: gbarak@mellanox.com

   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com

   Jeff Tantsura
   Microsoft Corporation
   One Microsoft Way
   Redmond, Washington 98052-6399
   USA

   Email: jefftantsura@microsoft.com