Network Working Group                                           R. Miao
Internet-Draft                                                   H. Liu
Intended status: Experimental                              Alibaba Group
Expires: January 30, 2021                                         R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                             Mellanox Technologies, Inc.
                                                           July 29, 2020

           HPCC++: Enhanced High Precision Congestion Control
                       draft-pan-tsvwg-hpccplus-01

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent
   limitations for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed inband telemetry information
   during congestion and overreaction to inband telemetry information,
   HPCC++ can quickly converge to utilize free bandwidth while avoiding
   congestion, and can maintain near-zero in-network queues for ultra-
   low latency.  HPCC++ is also fair and easy to deploy in hardware,
   implementable with commodity NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 30, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  System Overview
   4.  HPCC++ Algorithm
     4.1.  Notations
     4.2.  Design Functions and Procedures
   5.  Configuration Parameters
   6.  Design Enhancement and Implementation
     6.1.  HPCC++ Guidelines
     6.2.  Receiver-based HPCC
     6.3.  Switch-side Optimizations
   7.  Reference Implementations
     7.1.  Inband telemetry padding at the network elements
     7.2.  Congestion control at NICs
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgments
   11. Contributors
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements
   [Zhu-SIGCOMM2015], offloading network stacks into hardware is an
   inevitable direction in high-speed networks.  Large-scale networks
   with RDMA (remote direct memory access) over Converged Ethernet
   Version 2 (RoCEv2) often use hardware-offloading solutions.  In some
   cases, RDMA networks still face fundamental challenges in
   reconciling low latency, high bandwidth utilization, and high
   stability.

   This document describes a new CC mechanism, HPCC++ (Enhanced High
   Precision Congestion Control), for large-scale, high-speed networks.
   The key idea behind HPCC++ is to leverage the precise link load
   information from inband telemetry to compute accurate flow rate
   updates.
   Unlike existing approaches that often require a large number of
   iterations to find the proper flow rates, HPCC++ requires only one
   rate update step in most cases.  Using precise information from
   inband telemetry enables HPCC++ to address the limitations in
   current CC schemes.  First, HPCC++ senders can quickly ramp up flow
   rates for high utilization and ramp down flow rates for congestion
   avoidance.  Second, HPCC++ senders can quickly adjust the flow rates
   to keep each link's output rate slightly lower than the link's
   capacity, preventing queues from building up while preserving high
   link utilization.  Finally, since sending rates are computed
   precisely based on direct measurements at switches, HPCC++ requires
   merely three independent parameters that are used to tune fairness
   and efficiency.

   The base form of HPCC++ is the original HPCC algorithm; its full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes system
   constraints into account, aims to reduce the design overhead, and
   further improves the performance.  Section 6 describes the proposed
   design enhancements and guidelines in detail.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.  When the receiver gets
   the packet, it may copy all the inband telemetry recorded from the
   network into the ACK message it sends back to the sender, and the
   sender then decides how to adjust its flow rate each time it
   receives an ACK with network load information.  Alternatively, the
   receiver may calculate the flow rate based on the inband telemetry
   information and feed the calculated rate back to the sender.  The
   notification packets would include delayed ACK information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the notification packets/ACKs
   traverse.  Those network nodes are not shown in the figure for the
   sake of brevity.

   +---------+  pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
   |  Data   |------->|       |-------->|       |-------->|   Data   |
   | Sender  |========|Switch1|=========|Switch2|=========| Receiver |
   +---------+ Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
       /|\                                                      |
        |                                                       |
        +-------------------------------------------------------+
                         Notification Packets/ACKs

            Figure 1: System Overview (tlm=inband telemetry)

   o  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based CC scheme that controls the number of inflight
      bytes.  Inflight bytes are the amount of data that has been sent
      but not yet acknowledged at the sender.
      Controlling inflight bytes has an important advantage compared to
      controlling rates.  In the absence of congestion, the inflight
      bytes and rate are interchangeable with the equation inflight =
      rate * T, where T is the base propagation RTT.  The rate can be
      calculated locally or obtained from the notification packet.  The
      sender may further use the data pacing mechanism in hardware to
      limit the rate accordingly.

   o  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.  The inband telemetry
      information reports the current load of the packet's egress
      port, including timestamp (ts), queue length (qLen), transmitted
      bytes (txBytes), and the link bandwidth capacity (B).  In
      addition, the inband telemetry contains switch_ID and port_ID to
      identify a link.

   o  Data receiver: responsible for either reflecting back the inband
      telemetry information in the data packet or calculating the
      proper flow rate based on the network congestion information in
      the inband telemetry and sending notification packets back to
      the sender.

4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the
   core congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used
   in the HPCC++ algorithm.  Figure 3 also includes default values for
   the algorithm parameters, chosen either to represent a typical
   setting in practical applications or based on theoretical and
   simulation studies.

   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for link j                            |
   | I_j          | Estimated inflight bytes for link j             |
   | U_j          | Normalized inflight bytes for link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                      Figure 2: List of variables.

   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

     Figure 3: List of algorithm parameters and their default values.
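
   As a non-normative illustration only, the variables and parameters
   above might be represented at a sender as in the following Python
   sketch.  The class and field names (HopTelemetry, HpccParams) are
   assumptions made for this sketch and are not mandated by this
   document.

      from dataclasses import dataclass

      @dataclass
      class HopTelemetry:
          """One per-hop inband telemetry record (Section 3)."""
          switch_id: int   # switch that added this record
          port_id: int     # egress port identifier (link j)
          B: float         # link bandwidth capacity, bytes per second
          ts: float        # timestamp of the snapshot, seconds
          txBytes: int     # accumulated bytes transmitted at time ts
          qlen: int        # egress queue length in bytes at time ts

      @dataclass
      class HpccParams:
          """Algorithm parameters with the defaults from Figure 3."""
          T: float = 5e-6      # known baseline RTT (5 us)
          eta: float = 0.95    # target link utilization
          maxStage: int = 5    # maximum additive-increase stages
          W_ai: float = 0.0    # additive increase amount, e.g.
                               # W_init * (1 - eta) / N (Section 5)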
4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:     txRate = (ack.L[i].txBytes - L[i].txBytes) /
                    (ack.L[i].ts - L[i].ts);
    5:     u' = min(ack.L[i].qlen, L[i].qlen) / (ack.L[i].B * T)
               + txRate / ack.L[i].B;
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:     W = Wc / (U/eta) + W_ai;
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W;

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Lines 14-15
   and 18-19).  The sender also remembers the pacing rate and current
   inband telemetry information at Line 27.  The sender computes a new
   window size W at Line 23 or Line 26, depending on whether to update
   Wc, with the functions MeasureInflight and ComputeWind.  Function
   MeasureInflight estimates the normalized inflight bytes at Line 5.
   First, it computes txRate of each link from the current and last
   accumulated transmitted bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to Line 7 selects the
   maximum per-link normalized inflight bytes.  Instead of directly
   using this maximum, we use an EWMA (Exponentially Weighted Moving
   Average) to filter out noise from timer inaccuracy and transient
   queues (Line 9).  Function ComputeWind combines multiplicative
   increase/decrease (MI/MD) and additive increase (AI) to balance the
   reaction speed and fairness.  If a sender finds it should increase
   the window size, it first tries AI for maxStage times with the step
   W_ai (Line 17).  If it still finds room to increase after maxStage
   rounds of AI, or the normalized inflight bytes are above eta, it
   applies the multiplicative adjustment at Line 13 once to quickly
   ramp up or ramp down the window size (Lines 12-13).

5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals), so we set it to 95% by default, which only
   loses 5% bandwidth but achieves almost zero queue.  maxStage
   controls a simple tradeoff between steady-state stability and the
   speed to reclaim free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high bandwidth networks.
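
   To make the interaction of these parameters with the procedures of
   Section 4.2 concrete, the following is a minimal, non-normative
   Python sketch of the sender-side algorithm.  It assumes the
   HopTelemetry and HpccParams structures sketched in Section 4.1 and
   keeps per-flow state in a small class; the class and method names
   are illustrative only and are not mandated by this document.

      class HpccSenderSketch:
          """Per-flow sender state; a sketch of Section 4.2, not a
          reference implementation."""

          def __init__(self, params, W_init, path_len):
              self.p = params
              self.Wc = W_init             # reference window
              self.W = W_init              # current window
              self.R = W_init / params.T   # pacing rate
              self.U = 0.0                 # EWMA of normalized inflight bytes
              self.incStage = 0
              self.lastUpdateSeq = 0
              self.snd_nxt = 0             # updated by the transmit path
              self.L = [None] * path_len   # last telemetry record per hop

          def measure_inflight(self, ack):
              u, tau = 0.0, self.p.T
              for i, hop in enumerate(ack.L):
                  old = self.L[i]
                  if old is None or hop.ts <= old.ts:
                      continue             # need two distinct samples per hop
                  txRate = (hop.txBytes - old.txBytes) / (hop.ts - old.ts)
                  u_i = (min(hop.qlen, old.qlen) / (hop.B * self.p.T)
                         + txRate / hop.B)
                  if u_i > u:
                      u, tau = u_i, hop.ts - old.ts
              tau = min(tau, self.p.T)
              self.U = (1 - tau / self.p.T) * self.U + (tau / self.p.T) * u
              return self.U

          def compute_wind(self, U, update_wc):
              if U >= self.p.eta or self.incStage >= self.p.maxStage:
                  # MI/MD step (guarded against U == 0 in this sketch)
                  W = self.Wc / max(U / self.p.eta, 1e-9) + self.p.W_ai
                  if update_wc:
                      self.incStage, self.Wc = 0, W
              else:
                  W = self.Wc + self.p.W_ai          # AI step
                  if update_wc:
                      self.incStage, self.Wc = self.incStage + 1, W
              return W

          def new_ack(self, ack):
              # ack is assumed to expose .seq and .L (per-hop telemetry).
              update_wc = ack.seq > self.lastUpdateSeq
              self.W = self.compute_wind(self.measure_inflight(ack), update_wc)
              if update_wc:
                  self.lastUpdateSeq = self.snd_nxt
              self.R = self.W / self.p.T   # pacing rate for the scheduler
              self.L = list(ack.L)         # remember this ACK's telemetry

   As one example of intended use, a flow might be initialized with
   W_init equal to bandwidth * T (line rate, as discussed below) and
   new_ack() invoked for every received ACK that carries telemetry.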
   W_ai controls the tradeoff between the maximum number of concurrent
   flows on a link that can sustain near-zero queues and the speed of
   convergence to fairness.  Note that none of the three parameters
   are reliability-critical.

   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the
   workload of datacenter applications, where flows are usually short
   and latency-sensitive.  Normally we set a very small W_ai to
   support a large number of concurrent flows on a link, because
   slower fairness convergence is not critical.  A rule of thumb is to
   set W_ai = W_init*(1-eta) / N, where N is the expected or receiver-
   reported maximum number of concurrent flows on a link.  The
   intuition is that the total additive increase every round (N*W_ai)
   should not exceed the bandwidth headroom, and thus no queue forms.
   Even if the actual number of concurrent flows on a link exceeds N,
   the CC is still stable and achieves full utilization, but just
   cannot maintain zero queues.

6.  Design Enhancement and Implementation

   The basic design of HPCC++, i.e., the original HPCC as described
   above, adds inband telemetry information to every data packet so
   that the sender can respond to congestion as soon as the very first
   packet observes it.  This is especially helpful to reduce the risk
   of severe congestion in incast scenarios within the first round-
   trip time.  In addition, the original HPCC algorithm introduces Wc
   to solve the over-reaction issue that arises from this per-packet
   response.

   Alternatively, to reduce overhead, the inband telemetry information
   need not be added to every data packet.  Switches can generate
   inband telemetry less frequently, e.g., once per RTT or when
   congestion occurs.

6.1.  HPCC++ Guidelines

   To ensure network stability, HPCC++ establishes a few guidelines
   for different implementations:

   o  The algorithm should commit the window/rate update at most once
      per round-trip time, similar to the procedure of updating Wc.

   o  To support different workloads and to properly set W_ai, HPCC++
      allows the option to incorporate mechanisms to speed up the
      fairness convergence.

   o  The switch should capture inband telemetry information that
      includes link load (txBytes, qlen, ts) and link spec (switch_ID,
      port_ID, B) at the egress port.  Note that each switch should
      record all of this information in a single snapshot to achieve a
      precise link load estimate (see the sketch at the end of this
      subsection).

   o  HPCC++ can use a probe packet to query the inband telemetry
      information.  In that case, the probe packets should take the
      same routing path and QoS queueing as the data packets.

   As long as the above guidelines are met, this document does not
   mandate a particular inband telemetry header format or
   encapsulation, which are orthogonal to the HPCC++ algorithms
   described in this document.  The algorithm can be implemented with
   a choice of inband telemetry protocols, such as in-band network
   telemetry [P4-INT], IOAM [I-D.ietf-ippm-ioam-data], IFA
   [I-D.ietf-kumar-ippm-ifa] and others.
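
   As a non-normative illustration of the single-snapshot guideline
   above, a switch egress port might expose its telemetry fields as
   one consistent record, for example as in the following sketch.  The
   port attribute names are assumptions made for this sketch; real
   switches implement this in the forwarding pipeline.

      def capture_egress_snapshot(port, switch_id, now):
          """Read txBytes, qlen and ts at the same instant, so that the
          sender's txRate and inflight estimates refer to one
          consistent state of the link (sketch only)."""
          return HopTelemetry(
              switch_id=switch_id,
              port_id=port.port_id,
              B=port.link_capacity_bps / 8.0,  # capacity, bytes/second
              ts=now,                          # timestamp of this snapshot
              txBytes=port.tx_byte_counter,    # accumulated transmitted bytes
              qlen=port.queue_bytes,           # current egress queue length
          )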
6.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If the ACK packets already
   exist for reliability purposes, the inband telemetry information
   can be echoed back to the sender via ACK self-clocking.  Not all
   ACK packets need to carry the inband telemetry information.  To
   reduce the Packet Per Second (PPS) overhead, the receiver may
   examine the inband telemetry information and adopt the technique of
   delayed ACKs that only sends out an ACK for every few received
   packets.  In order to reduce PPS even further, one may implement
   the algorithm at the receiver and feed back the calculated window
   in the ACK packet once every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The
   receiver performs the same functions except using int.L instead of
   ack.L.  The new procedure NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W)
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out an ACK packet that includes the W
   information.  Otherwise, it just updates the W information locally.
   This reduces the amount of traffic that needs to be fed back to the
   data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   The improvement would allow flows to quickly converge to fairness
   without causing large swings under heavy load.

6.3.  Switch-side Optimizations

   Switches can potentially generate and send separate packets
   containing inband telemetry information (aka inband telemetry
   response packets) directly back to the data senders so that they
   can slow down as soon as possible.  This fast feedback and reaction
   can further reduce buffer consumption upon heavy incast.  Switches
   can consider the level of congestion to decide when to trigger
   direct inband telemetry responses.  A simple bloom filter and timer
   can be used at switches to avoid sending a burst of inband
   telemetry responses to the same sender (a non-normative sketch of
   such a throttle appears below).  An inband telemetry response
   packet must carry the sequence number of the original data packet,
   so that the sender can correctly correlate the inband telemetry
   response with the data packet that triggered it.

   One may optimize the inband telemetry header overhead by
   implementing a simple subscription-based inband telemetry
   mechanism.  The data senders may use a different DSCP codepoint or
   a flag bit in the inband telemetry instruction header to indicate
   inband telemetry subscription.  (We expect future inband telemetry
   specs to support such a subscription service.)  The senders can
   selectively subscribe to inband telemetry on a per-packet basis to
   control the inband telemetry data overhead.
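
   The following non-normative Python sketch illustrates such a bloom-
   filter-and-timer throttle at a switch.  The hash construction,
   filter size, response interval, and congestion threshold are all
   illustrative assumptions, not part of this specification.

      import hashlib
      import time

      class TelemetryResponseThrottle:
          """Per-switch throttle: suppress repeated inband telemetry
          responses to the same sender within a short interval
          (behavioral sketch only)."""

          def __init__(self, num_bits=4096, num_hashes=3, interval=50e-6):
              self.bits = [float("-inf")] * num_bits  # last-set time per bit
              self.num_hashes = num_hashes
              self.interval = interval       # minimum gap between responses, s

          def _positions(self, sender_id):
              digest = hashlib.sha256(str(sender_id).encode()).digest()
              return [int.from_bytes(digest[4*k:4*k + 4], "big") % len(self.bits)
                      for k in range(self.num_hashes)]

          def should_respond(self, sender_id, queue_bytes, congestion_threshold):
              if queue_bytes < congestion_threshold:
                  return False               # only respond when congested
              now = time.monotonic()
              pos = self._positions(sender_id)
              # If every bit was refreshed recently, a response to this
              # sender was likely generated already; stay silent.
              if all(now - self.bits[p] < self.interval for p in pos):
                  return False
              for p in pos:
                  self.bits[p] = now         # remember that we responded
              return True

   An inband telemetry response generated under such a throttle would
   still carry the sequence number of the triggering data packet, as
   required above.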
   While forwarding inband telemetry-subscribed data packets, the
   switches can monitor the level of congestion and conditionally
   generate separate inband telemetry responses as described above.
   The inband telemetry responses can be sent directly back to the
   senders or to the receivers, depending on which version of the
   HPCC++ algorithm (sender-based or receiver-based) is used in the
   network.

7.  Reference Implementations

   A prototype of HPCC++ has been implemented in NICs to realize the
   CC algorithm and in switches to realize the inband telemetry
   feature.

7.1.  Inband telemetry padding at the network elements

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  HPCC++ is open to a variety of inband
   telemetry format standards.  Inside a data center, the path length
   is often no more than 5 hops.  The overhead of the inband telemetry
   padding for HPCC++ is considered to be low.

7.2.  Congestion control at NICs

   Figure 4 shows the HPCC++ implementation on a NIC.  The NIC
   provides an HPCC++ module that resides on the data path of the NIC;
   HPCC++ modules realize both sender and receiver roles.

   +------------------------------------------------------------------+
   | +---------+ window update +-----------+  PktSend  +-----------+  |
   | |         |-------------->| Scheduler |---------->|Tx pipeline|--+->
   | |         |  rate update  +-----------+           +-----------+  |
   | | HPCC++  |                                             ^        |
   | |         |                          inband telemetry  |        |
   | | module  |                                             |        |
   | |         |                                       +-----+-----+  |
   | |         |<--------------------------------------|Rx pipeline|<-+--
   | +---------+    telemetry response event           +-----------+  |
   +------------------------------------------------------------------+

                Figure 4: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC CC algorithm on the sender side for
   every flow in the NIC.  A flow can be defined by transport
   parameters including the 5-tuple, destination QP (queue pair), etc.
   The module receives per-flow inband telemetry response events
   generated from the RX pipeline, adjusts the sending window and
   rate, and updates the scheduler with the rate and window of the
   flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value provided by the algorithm.  It also maintains
   the current sending window size for active flows.  If the pacing
   mechanism and the flow's sending window permit, the scheduler
   issues a PktSend command for the flow to the TX pipeline.

   The TX pipeline implements RoCEv2 processing.  Once it receives the
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response
   probe packet.
   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data
   packet containing inband telemetry, or a dedicated telemetry
   request probe packet.  The TX pipeline may process and edit the
   telemetry data, and then sends the data back to the sender using
   either an ACK packet of the flow or a dedicated telemetry response
   packet.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Security Considerations

   The rate adaptation mechanism in HPCC++ relies on feedback from the
   network.  As such, it is vulnerable to attacks where feedback
   messages are hijacked, replaced, or intentionally injected with
   misleading information resulting in denial of service, similar to
   those that can affect TCP.  It is therefore RECOMMENDED that the
   notification feedback messages are at least integrity checked.  In
   addition, [I-D.ietf-avtcore-cc-feedback-message] discusses the
   potential risk of a receiver providing misleading congestion
   feedback information and the mechanisms for mitigating such risks.

10.  Acknowledgments

   The authors would like to thank ... for their valuable review
   comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation
   and evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

12.2.  Informative References

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP
              Control Protocol (RTCP) Feedback for Congestion Control",
              draft-ietf-avtcore-cc-feedback-message-07 (work in
              progress), June 2020.

   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", draft-ietf-ippm-ioam-
              data-09 (work in progress), March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", draft-kumar-ippm-ifa-01 (work in
              progress), February 2019.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane
              Specification, v2.0", February 2020.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., and M. Zhang, "HPCC: High Precision
              Congestion Control", ACM SIGCOMM, Beijing, China, August
              2019.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn,
              M., Liron, Y., Padhye, J., Raindel, S., Yahia, M., and
              M. Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom,
              August 2015.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   USA

   Email: miao.rui@alibaba-inc.com

   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   USA

   Email: hongqiang.liu@alibaba-inc.com

   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   USA

   Email: rong.pan@intel.com

   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: jk.lee@intel.com

   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: chang.kim@intel.com

   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   USA

   Email: gbarak@mellanox.com

   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com