Network Working Group                                           R. Miao
Internet-Draft                                                    H. Liu
Intended status: Experimental                              Alibaba Group
Expires: 10 June 2022                                             R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                             Mellanox Technologies, Inc.
                                                             J. Tantsura
                                                   Microsoft Corporation
                                                         7 December 2021


           HPCC++: Enhanced High Precision Congestion Control
                      draft-miao-iccrg-hpccplus-01

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent
   limitations for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion
   and overreaction to the congestion signaling using inband and
   granular telemetry, HPCC++ can quickly converge to utilize all the
   available bandwidth while avoiding congestion, and can maintain
   near-zero in-network queues for ultra-low latency.  HPCC++ is also
   fair and easy to deploy in hardware, implementable with commodity
   NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 10 June 2022.
Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  System Overview
   4.  HPCC++ Algorithm
     4.1.  Notations
     4.2.  Design Functions and Procedures
   5.  Configuration Parameters
   6.  Design Enhancement and Implementation
     6.1.  HPCC++ Guidelines
     6.2.  Receiver-based HPCC
   7.  Reference Implementations
     7.1.  Inband telemetry padding at the network elements
     7.2.  Congestion control at NICs
   8.  IANA Considerations
   9.  Discussion
     9.1.  Internet Deployment
     9.2.  Switch-assisted congestion control
     9.3.  Work with transport protocols
     9.4.  Work with QoS queuing
   10. Acknowledgments
   11. Contributors
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory
   access) often use hardware-offloading solutions.  In some cases,
   RDMA networks still face fundamental challenges in reconciling low
   latency, high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.
   The key idea behind HPCC++ is to leverage the precise link load
   information signaled through inband telemetry to compute accurate
   flow rate updates.  Unlike existing approaches that often require a
   large number of iterations to find the proper flow rates, HPCC++
   requires only one rate update step in most cases.  Using precise
   information from inband telemetry enables HPCC++ to address the
   limitations in current congestion control schemes.  First, HPCC++
   senders can quickly ramp up flow rates for high utilization and ramp
   down flow rates for congestion avoidance.  Second, HPCC++ senders
   can quickly adjust the flow rates to keep each link's output rate
   slightly lower than the link's capacity, preventing queues from
   building up as well as preserving high link utilization.  Finally,
   since sending rates are computed precisely based on direct
   measurements at switches, HPCC++ requires merely three independent
   parameters that are used to tune fairness and efficiency.

   The base form of HPCC++ is the original HPCC algorithm, whose full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes into
   account system constraints, aims to reduce the design overhead, and
   further improves the performance.  Section 6 describes these
   proposed design enhancements and guidelines in detail.

   This document describes the architecture changes in switches and
   end-hosts to support the needed transmission of inband telemetry and
   its consumption, which improves the efficiency in handling network
   congestion.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.  When the receiver gets
   the packet, it may copy all the inband telemetry recorded from the
   network into the ACK message it sends back to the sender, and the
   sender then decides how to adjust its flow rate each time it
   receives an ACK with network load information.  Alternatively, the
   receiver may calculate the flow rate based on the inband telemetry
   information and feed the calculated rate back to the sender.  The
   notification packets would include delayed ACK information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the notification packets/ACKs
   traverse.  Those network nodes are not shown in the figure for the
   sake of brevity.

   +---------+   pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
   |  Data   |-------->|       |-------->|       |-------->|   Data   |
   | Sender  |=========|Switch1|=========|Switch2|=========| Receiver |
   +---------+  Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
       /|\                                                       |
        |                                                        |
        +--------------------------------------------------------+
                        Notification Packets/ACKs

            Figure 1: System Overview (tlm = inband telemetry)

   *  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based congestion control scheme that controls the
      number of inflight bytes.  The inflight bytes are the amount of
      data that has been sent but not yet acknowledged.  Controlling
      inflight bytes has an important advantage compared to controlling
      rates.  In the absence of congestion, the inflight bytes and rate
      are interchangeable via the equation inflight = rate * T, where T
      is the base propagation RTT.  The rate can be calculated locally
      or obtained from the notification packet.  The sender may further
      use a data pacing mechanism, potentially implemented in hardware,
      to limit the rate accordingly.

   *  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.  The inband telemetry
      information reports the current load of the packet's egress port,
      including timestamp (ts), queue length (qLen), transmitted bytes
      (txBytes), and link bandwidth capacity (B).  In addition, the
      inband telemetry contains switch_ID and port_ID to identify a
      link.  An illustrative layout of this per-hop record is sketched
      after this list.

   *  Data receiver: responsible for either reflecting back the inband
      telemetry information in the data packet or calculating the
      proper flow rate based on the network congestion information in
      the inband telemetry, and sending notification packets back to
      the sender.
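   The following Python sketch illustrates one way to model the per-hop
   telemetry record and the per-packet path vector described above.
   The field names follow this section; the class name, the example
   values, and the list layout are hypothetical and do not imply any
   particular wire format or telemetry encapsulation.

   from dataclasses import dataclass
   from typing import List

   @dataclass
   class HopTelemetry:
       """Load snapshot taken at one egress port (one hop on the path)."""
       switch_id: int    # switch_ID: identifies the network node
       port_id: int      # port_ID: identifies the egress port (link)
       ts: float         # timestamp of the snapshot, in seconds
       qlen: int         # queue length at the egress port, in bytes
       tx_bytes: int     # accumulated bytes transmitted on the link
       bandwidth: float  # link capacity B, in bytes per second

   # A data packet (or the ACK that echoes it) carries one record per
   # hop, e.g. the two switches on the path of Figure 1:
   path_telemetry: List[HopTelemetry] = [
       HopTelemetry(switch_id=1, port_id=3, ts=100e-6, qlen=6000,
                    tx_bytes=1_250_000, bandwidth=12.5e9),
       HopTelemetry(switch_id=2, port_id=7, ts=102e-6, qlen=0,
                    tx_bytes=2_300_000, bandwidth=12.5e9),
   ]

   Section 4 describes how the sender turns two consecutive records for
   the same link into a utilization estimate and a window update.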
4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the core
   congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used in
   the HPCC++ algorithm.  Figure 3 also includes default values for the
   algorithm parameters, chosen either to represent a typical setting
   in practical applications or based on theoretical and simulation
   studies.

   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for Link j                            |
   | I_j          | Estimated inflight bytes for Link j             |
   | U_j          | Normalized inflight bytes for Link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                      Figure 2: List of variables.
   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

    Figure 3: List of algorithm parameters and their default values.

4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:              ack.L[i].txBytes - L[i].txBytes
         txRate = --------------------------------- ;
                       ack.L[i].ts - L[i].ts
    5:         min(ack.L[i].qlen, L[i].qlen)     txRate
         u' = ------------------------------- + ------------ ;
                     ack.L[i].B * T              ack.L[i].B
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:           Wc
         W = --------- + W_ai;
              U / eta
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Line 14-15 and
   18-19).  The sender also remembers the pacing rate and current
   inband telemetry information at Line 27.  The sender computes a new
   window size W at Line 23 or Line 26, depending on whether to update
   Wc, with the functions MeasureInflight and ComputeWind.  Function
   MeasureInflight estimates the normalized inflight bytes at Line 5.
   First, it computes txRate of each link from the current and last
   accumulated transmitted bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to 7 selects the
   maximum per-link normalized inflight bytes, max_i(U_i).  Instead of
   directly using max_i(U_i), an EWMA (Exponentially Weighted Moving
   Average) is applied to filter out the noise from timer inaccuracy
   and transient queues (Line 9).  Function ComputeWind combines
   multiplicative increase/decrease (MI/MD) and additive increase (AI)
   to balance the reaction speed and fairness.  If a sender finds it
   should increase the window size, it first tries AI for maxStage
   times with the step W_ai (Line 17).  If it still finds room to
   increase after maxStage times of AI, or the normalized inflight
   bytes is above eta, it performs a multiplicative update once to
   quickly ramp up or ramp down the window size (Line 12-13).
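   The following runnable Python sketch mirrors the pseudocode above,
   using the default parameters from Figure 3.  The class name, the
   per-hop record attributes (ts, qlen, tx_bytes, bandwidth, as in the
   Section 3 sketch), and the handling of the very first ACK are
   illustrative assumptions, not a normative implementation of this
   specification.

   # Illustrative Python sketch of the sender-side HPCC++ logic.
   # Windows are in bytes, rates in bytes per second, times in seconds.

   T = 5e-6          # known baseline RTT
   ETA = 0.95        # target link utilization
   MAX_STAGE = 5     # maximum number of additive-increase stages

   class HpccSender:
       def __init__(self, w_init, w_ai):
           self.w = w_init            # current window W
           self.wc = w_init           # reference window Wc
           self.w_ai = w_ai           # additive increase step W_ai
           self.inc_stage = 0
           self.u = 0.0               # EWMA of normalized inflight bytes
           self.last_update_seq = 0
           self.snd_nxt = 0           # advanced by the transport on send
           self.last_tlm = None       # telemetry L from the previous ACK

       def measure_inflight(self, ack_tlm):
           """Lines 1-10: estimate the normalized inflight bytes U."""
           u_max, tau = 0.0, T
           for cur, prev in zip(ack_tlm, self.last_tlm):
               tx_rate = (cur.tx_bytes - prev.tx_bytes) / (cur.ts - prev.ts)
               u_link = (min(cur.qlen, prev.qlen) / (cur.bandwidth * T)
                         + tx_rate / cur.bandwidth)
               if u_link > u_max:
                   u_max, tau = u_link, cur.ts - prev.ts
           tau = min(tau, T)
           self.u = (1 - tau / T) * self.u + (tau / T) * u_max
           return self.u

       def compute_wind(self, u, update_wc):
           """Lines 11-20: combine MI/MD and AI to get the new window."""
           if u >= ETA or self.inc_stage >= MAX_STAGE:
               w = self.wc / (u / ETA) + self.w_ai    # multiplicative step
               if update_wc:
                   self.inc_stage, self.wc = 0, w
           else:
               w = self.wc + self.w_ai                # additive step
               if update_wc:
                   self.inc_stage += 1
                   self.wc = w
           return w

       def new_ack(self, ack_seq, ack_tlm):
           """Lines 21-27: per-ACK entry point; returns pacing rate R."""
           if self.last_tlm is None:       # first ACK: only record state
               self.last_tlm = ack_tlm
               return self.w / T
           if ack_seq > self.last_update_seq:
               self.w = self.compute_wind(self.measure_inflight(ack_tlm),
                                          True)
               self.last_update_seq = self.snd_nxt
           else:
               self.w = self.compute_wind(self.measure_inflight(ack_tlm),
                                          False)
           self.last_tlm = ack_tlm         # L = ack.L
           return self.w / T               # R = W / T

   In a real NIC, the window W and the pacing rate R computed here would
   be handed to the scheduler described in Section 7.2 rather than
   returned to a caller.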
5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals), so we set it to 95% by default, which only
   loses 5% bandwidth but achieves almost zero queue.  maxStage
   controls a simple tradeoff between steady-state stability and the
   speed to reclaim free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high-bandwidth networks.  W_ai controls the
   tradeoff between the maximum number of concurrent flows on a link
   that can sustain near-zero queues and the speed of convergence to
   fairness.  Note that none of the three parameters are reliability-
   critical.

   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the workload
   of datacenter applications, where flows are usually short and
   latency-sensitive.  Normally we set a very small W_ai to support a
   large number of concurrent flows on a link, because slower fairness
   convergence is not critical.  A rule of thumb is to set
   W_ai = W_init*(1-eta)/N, where N is the expected or receiver-
   reported maximum number of concurrent flows on a link; a worked
   example is given below.  The intuition is that the total additive
   increase every round (N*W_ai) should not exceed the bandwidth
   headroom, and thus no queue forms.  Even if the actual number of
   concurrent flows on a link exceeds N, the CC is still stable and
   achieves full utilization, but just cannot maintain zero queues.
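   As a concrete illustration of the rule of thumb above, the following
   sketch plugs in assumed example values: a 100 Gbps link, the default
   T = 5us and eta = 95% from Figure 3, and an expected maximum of
   N = 1000 concurrent flows.  The numbers are purely illustrative.

   bandwidth = 100e9 / 8          # link capacity B, in bytes per second
   T = 5e-6                       # baseline RTT, in seconds
   eta = 0.95                     # target utilization
   N = 1000                       # expected max concurrent flows per link

   w_init = bandwidth * T         # line-rate starting window: 62500 bytes
   w_ai = w_init * (1 - eta) / N  # ~3.1 bytes of additive increase

   print(f"W_init = {w_init:.0f} B, W_ai = {w_ai:.2f} B per update")

   The aggregate additive increase per round, N * W_ai = W_init *
   (1 - eta), stays within the 5% bandwidth headroom left by eta, so no
   standing queue builds up even with N flows probing simultaneously.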
6.  Design Enhancement and Implementation

   The basic design of HPCC++, i.e., the original HPCC as described
   above, is to add inband telemetry information into every data packet
   so that a sender can react to congestion as soon as the very first
   packet observes it.  This is especially helpful to reduce the risk
   of severe congestion in incast scenarios within the first round-trip
   time.  In addition, the original HPCC algorithm introduces Wc for
   the purpose of solving the over-reaction issue that arises from this
   per-packet response.

   Alternatively, the inband telemetry information need not be added to
   every data packet, in order to reduce the overhead.  Switches can
   attach inband telemetry less frequently, e.g., once per RTT or upon
   congestion occurrence.

6.1.  HPCC++ Guidelines

   To ensure network stability, HPCC++ establishes a few guidelines for
   different implementations:

   *  The algorithm should commit the window/rate update at most once
      per round-trip time, similar to the procedure of updating Wc.

   *  To support different workloads and to properly set W_ai, HPCC++
      allows the option to incorporate mechanisms to speed up the
      fairness convergence.

   *  The switch should capture inband telemetry information that
      includes link load (txBytes, qlen, ts) and link spec (switch_ID,
      port_ID, B) at the egress port.  Note that each switch should
      record all of this information in a single snapshot to achieve a
      precise link load estimate.

   *  HPCC++ can use a probe packet to query the inband telemetry
      information.  In that case, the probe packets should take the
      same routing path and QoS queueing as the data packets.

   As long as the above guidelines are met, this document does not
   mandate a particular inband telemetry header format or
   encapsulation, which are orthogonal to the HPCC++ algorithm
   described in this document.  The algorithm can be implemented with a
   choice of inband telemetry protocols, such as in-band network
   telemetry [P4-INT], IOAM [I-D.ietf-ippm-ioam-data], IFA
   [I-D.ietf-kumar-ippm-ifa], and others.  In fact, the emerging inband
   telemetry protocols can inform the evolution of a broader range of
   protocols and network functions; this document leverages that trend
   to propose the architecture changes that support the HPCC++
   algorithm.

6.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If ACK packets already exist
   for reliability purposes, the inband telemetry information can be
   echoed back to the sender via ACK self-clocking.  Not all ACK
   packets need to carry the inband telemetry information.  To reduce
   the Packet Per Second (PPS) overhead, the receiver may examine the
   inband telemetry information and adopt the technique of delayed ACKs
   that only sends out an ACK for every few received packets.  In order
   to reduce PPS even further, one may implement the algorithm at the
   receiver and feed the calculated window back in the ACK packet once
   every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The receiver
   performs the same functions except using int.L instead of ack.L.
   The new procedure NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W)
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out an ACK packet that includes the W information.
   Otherwise, it just updates the W information locally.  This reduces
   the amount of traffic that needs to be fed back to the data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   The improvement would allow flows to quickly converge to fairness
   without causing large swings under heavy load.

7.  Reference Implementations

   A prototype of HPCC++ is implemented in NICs to realize the
   congestion control algorithm and in switches to realize the inband
   telemetry feature.

7.1.  Inband telemetry padding at the network elements

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  HPCC++ is open to a variety of inband
   telemetry format standards.  Inside a data center, the path length
   is often no more than 5 hops.  The overhead of the inband telemetry
   padding for HPCC++ is therefore considered to be low; a rough
   estimate is sketched below.
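   As an illustrative back-of-the-envelope estimate of that overhead,
   the sketch below assumes a fixed telemetry header plus a fixed
   amount of per-hop metadata.  The byte counts are assumptions made
   for the sake of the example; the actual sizes depend on the
   telemetry protocol and encapsulation chosen and are not defined by
   this document.

   hops = 5            # typical maximum path length inside a data center
   bytes_per_hop = 12  # assumed per-hop metadata (ts, qLen, txBytes, B, IDs)
   header_bytes = 8    # assumed fixed telemetry instruction/shim header
   mtu = 1500          # common Ethernet MTU, in bytes

   padding = header_bytes + hops * bytes_per_hop
   print(f"{padding} bytes of telemetry per packet, "
         f"{100 * padding / mtu:.1f}% of a {mtu}-byte packet")
   # -> 68 bytes, i.e. about 4.5% of a 1500-byte packet, and less with
   #    jumbo frames or when telemetry rides only on periodic probes.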
7.2.  Congestion control at NICs

   Figure 4 shows the HPCC++ implementation on a NIC.  The NIC provides
   an HPCC++ module that resides on the data path of the NIC; the
   HPCC++ module realizes both the sender and receiver roles.

   +-------------------------------------------------------------------+
   | +---------+ window update +-----------+  PktSend  +-----------+   |
   | |         |-------------->| Scheduler |---------->|Tx pipeline|---+->
   | |         |  rate update  +-----------+           +-----------+   |
   | | HPCC++  |                                             ^         |
   | |         |                           inband telemetry  |         |
   | | module  |                                             |         |
   | |         |                                       +-----+-----+   |
   | |         |<--------------------------------------|Rx pipeline|<--+--
   | +---------+  telemetry response event             +-----------+   |
   +-------------------------------------------------------------------+

                 Figure 4: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC congestion control algorithm on the
   sender side for every flow in the NIC.  A flow can be defined by
   transport parameters including the 5-tuple, destination QP (queue
   pair), etc.  The module receives inband telemetry response events
   per flow, which are generated by the RX pipeline, adjusts the
   sending window and rate, and updates the scheduler with the rate and
   window of the flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value it receives from the algorithm.  It also
   maintains the current sending window size for active flows.  If the
   pacing mechanism and the flow's sending window permit, the scheduler
   issues a PktSend command for the flow to the TX pipeline.

   The TX pipeline implements packet processing.  Once it receives a
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response probe
   packet.

   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data packet
   containing inband telemetry, or a dedicated telemetry request probe
   packet.  The TX pipeline may process and edit the telemetry data,
   and then sends the data back to the sender using either an ACK
   packet of the flow or a dedicated telemetry response packet.
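   The following simplified Python model illustrates the gating role of
   the scheduler described above: a packet is released to the TX
   pipeline only when both the pacing rate and the sending window
   permit.  The class and method names are hypothetical, and an actual
   NIC realizes this logic in hardware.

   import time

   class FlowScheduler:
       """Per-flow pacing and window gating (illustrative model)."""

       def __init__(self, window, rate):
           self.window = window        # sending window W from the HPCC++ module
           self.rate = rate            # pacing rate R = W/T, in bytes per second
           self.inflight = 0           # bytes sent but not yet acknowledged
           self.next_send_time = 0.0   # earliest time the pacer allows a send

       def on_window_update(self, window, rate):
           # Called by the HPCC++ module after a telemetry response event.
           self.window, self.rate = window, rate

       def on_ack(self, acked_bytes):
           self.inflight = max(0, self.inflight - acked_bytes)

       def try_send(self, pkt_len, now=None):
           """Issue PktSend toward the TX pipeline if window and pacer allow."""
           now = time.monotonic() if now is None else now
           if self.inflight + pkt_len > self.window:
               return False                   # blocked by the sending window
           if now < self.next_send_time:
               return False                   # blocked by the pacer
           self.inflight += pkt_len
           self.next_send_time = now + pkt_len / self.rate
           # ... a real scheduler would issue PktSend(flow_id) here ...
           return True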
8.  IANA Considerations

   This document makes no request of IANA.

9.  Discussion

9.1.  Internet Deployment

   Although the discussion above mainly focuses on the data center
   environment, HPCC++ can be adopted on the Internet at large.  There
   are several security considerations one should be aware of.

   Privacy concerns may arise when the telemetry information is
   conveyed across Autonomous Systems (ASes) and back to end-users.
   The link load information captured in telemetry can potentially
   reveal the provider's network capacity, route utilization,
   scheduling policy, etc.  These are usually considered to be
   sensitive data of the network providers.  Hence, certain actions may
   be taken to anonymize the telemetry data and to convey only the
   relative ratio used for rate adaptation across ASes, without
   revealing the actual network load.

   Another consideration is the security of receiving telemetry
   information.  The rate adaptation mechanism in HPCC++ relies on
   feedback from the network.  As such, it is vulnerable to attacks
   where feedback messages are hijacked, replaced, or intentionally
   injected with misleading information resulting in denial of service,
   similar to those that can affect TCP.  It is therefore RECOMMENDED
   that the notification feedback message is at least integrity
   checked.  In addition, [I-D.ietf-avtcore-cc-feedback-message]
   discusses the potential risk of a receiver providing misleading
   congestion feedback information and the mechanisms for mitigating
   such risks.

9.2.  Switch-assisted congestion control

   HPCC++ falls in the general category of switch-assisted congestion
   control.  However, HPCC++ includes a few unique design choices that
   are different from other switch-assisted approaches.

   *  First, HPCC++ implements a primal-mode algorithm that requires
      only the ``write-to-packet'' operation from switches, which is
      already supported by telemetry protocols like INT [P4-INT] or
      IOAM [I-D.ietf-ippm-ioam-data].  Please note that this is very
      different from dual-mode algorithms such as XCP
      [Katabi-SIGCOMM2002] and RCP [Dukkipati-RCP], where switches take
      an active role in determining flows' rates.

   *  Second, HPCC++ achieves fast utilization convergence by
      decoupling it from fairness convergence, which is inspired by
      XCP.

   *  Third, HPCC++ enables switch-guided multiplicative increase (MI)
      by defining the ``inflight bytes'' to quantify the link load.
      The inflight bytes indicate both the underload and the overload
      of the link precisely, and thus allow the flow to increase or
      decrease its rate multiplicatively and safely.  By contrast,
      traditional approaches that use the queue length or RTT as the
      feedback cannot guide the rate increase and instead have to rely
      on additive increase (AI) with heuristics.  As link speeds
      continue to grow, this becomes increasingly slow in reclaiming
      the unused bandwidth.  Besides, queue-based feedback mechanisms
      are subject to latency inflation.

   *  Last, HPCC++ uses the TX rate instead of the RX rate used by XCP
      and RCP.  As detailed in [SIGCOMM-HPCC], we view the TX rate as
      more precise, because the RX rate and the queue length overlap
      and thus cause oscillation.

9.3.  Work with transport protocols

   HPCC++ can be adopted as the CC algorithm by a wide range of
   transport protocols such as TCP and UDP, as well as others that may
   run on top of them, such as iWARP, RoCE, etc.  It requires a window
   limit and congestion feedback through ACK self-clocking, which
   naturally conforms to the paradigm of TCP design.  With that, HPCC++
   introduces a scheme to measure the total inflight bytes for more
   precise congestion control.  To run over UDP, some modifications
   need to be made to enforce the window limit and to collect
   congestion feedback via probing packets, which is an incremental
   change.
9.4.  Work with QoS queuing

   When QoS (Quality of Service) priority queuing is used in switches,
   the length of a flow's own queue cannot tell the actual queuing time
   and the exact extent of congestion.  Although general approaches for
   running congestion control with QoS queuing are out of the scope of
   this document, we provide a few hints for running HPCC++ in harmony
   with QoS queuing.  In this case, HPCC++ can leverage the packet
   sojourn time (the egress timestamp minus the ingress timestamp)
   instead of the queue length to quantify the packet's actual queuing
   delay.  In addition, operators typically use Deficit Weighted Round
   Robin (DWRR) instead of strict priority (SP) as their QoS scheduling
   to prevent traffic starvation.  DWRR provides a minimum bandwidth
   guarantee for each queue, so that HPCC++ can leverage it for precise
   rate updates to avoid congestion.

10.  Acknowledgments

   The authors would like to thank ICCRG members for their valuable
   review comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation and
   evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

12.2.  Informative References

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. A. Ramalho,
              "RTP Control Protocol (RTCP) Feedback for Congestion
              Control", Work in Progress, Internet-Draft, draft-ietf-
              avtcore-cc-feedback-message-09, 2 November 2020.

   [Katabi-SIGCOMM2002]
              Katabi, D., Handley, M., and C. Rohrs, "Congestion
              Control for High Bandwidth-Delay Product Networks", ACM
              SIGCOMM, Pittsburgh, Pennsylvania, USA, October 2002.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom, August
              2015.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification,
              v2.0", February 2020.

   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", Work in Progress,
              Internet-Draft, draft-ietf-ippm-ioam-data-09, March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", Work in Progress, Internet-Draft,
              draft-kumar-ippm-ifa-01, February 2019.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

   [Dukkipati-RCP]
              Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              control to make flows complete quickly", Stanford
              University, 2008.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   United States of America

   Email: miao.rui@alibaba-inc.com


   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   United States of America

   Email: hongqiang.liu@alibaba-inc.com
   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   United States of America

   Email: rong.pan@intel.com


   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   United States of America

   Email: jk.lee@intel.com


   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   United States of America

   Email: chang.kim@intel.com


   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   United States of America

   Email: gbarak@mellanox.com


   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com


   Jeff Tantsura
   Microsoft Corporation
   One Microsoft Way
   Redmond, Washington 98052-6399
   United States of America

   Email: jefftantsura@microsoft.com