Network Working Group                                           R. Miao
Internet-Draft                                                    H. Liu
Intended status: Experimental                              Alibaba Group
Expires: 10 June 2022                                             R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                             Mellanox Technologies, Inc.
                                                             J. Tantsura
                                                   Microsoft Corporation
                                                         7 December 2021


           HPCC++: Enhanced High Precision Congestion Control
                      draft-miao-iccrg-hpccplus-01

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent
   limitations for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion
   and overreaction to the congestion signaling using inband and
   granular telemetry, HPCC++ can quickly converge to utilize all the
   available bandwidth while avoiding congestion, and can maintain
   near-zero in-network queues for ultra-low latency.  HPCC++ is also
   fair and easy to deploy in hardware, implementable with commodity
   NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 10 June 2022.
Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  System Overview
   4.  HPCC++ Algorithm
     4.1.  Notations
     4.2.  Design Functions and Procedures
   5.  Configuration Parameters
   6.  Design Enhancement and Implementation
     6.1.  HPCC++ Guidelines
     6.2.  Receiver-based HPCC
   7.  Reference Implementations
     7.1.  Inband telemetry padding at the network elements
     7.2.  Congestion control at NICs
   8.  IANA Considerations
   9.  Discussion
     9.1.  Internet Deployment
     9.2.  Switch-assisted congestion control
     9.3.  Work with transport protocols
     9.4.  Work with QoS queuing
   10. Acknowledgments
   11. Contributors
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory
   access) often use hardware-offloading solutions.  In some cases,
   RDMA networks still face fundamental challenges in reconciling low
   latency, high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.
   The key idea behind HPCC++ is to leverage the precise link load
   information signaled through inband telemetry to compute accurate
   flow rate updates.  Unlike existing approaches that often require a
   large number of iterations to find the proper flow rates, HPCC++
   requires only one rate update step in most cases.  Using precise
   information from inband telemetry enables HPCC++ to address the
   limitations in current congestion control schemes.  First, HPCC++
   senders can quickly ramp up flow rates for high utilization and ramp
   down flow rates for congestion avoidance.  Second, HPCC++ senders
   can quickly adjust the flow rates to keep each link's output rate
   slightly lower than the link's capacity, preventing queues from
   building up as well as preserving high link utilization.  Finally,
   since sending rates are computed precisely based on direct
   measurements at switches, HPCC++ requires merely three independent
   parameters that are used to tune fairness and efficiency.

   The base form of HPCC++ is the original HPCC algorithm, whose full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes into
   account system constraints, aims to reduce the design overhead, and
   further improves the performance.  Section 6 describes these
   proposed design enhancements and guidelines in detail.

   This document describes the architecture changes in switches and
   end-hosts to support the needed transmission of inband telemetry and
   its consumption, which improves the efficiency in handling network
   congestion.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.  When the receiver gets
   the packet, it may copy all the inband telemetry recorded from the
   network into the ACK message it sends back to the sender, and the
   sender then decides how to adjust its flow rate each time it
   receives an ACK with network load information.  Alternatively, the
   receiver may calculate the flow rate based on the inband telemetry
   information and feed the calculated rate back to the sender.  The
   notification packets would include delayed ACK information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the notification packets/ACKs
   traverse.  Those network nodes are not shown in the figure for the
   sake of brevity.

   +---------+   pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
   |  Data   |-------->|       |-------->|       |-------->|   Data   |
   | Sender  |=========|Switch1|=========|Switch2|=========| Receiver |
   +---------+  Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
       /|\                                                       |
        |                                                        |
        +--------------------------------------------------------+
                        Notification Packets/ACKs

            Figure 1: System Overview (tlm = inband telemetry)

   *  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based congestion control scheme that controls the
      number of inflight bytes.  The inflight bytes are the amount of
      data that has been sent but not yet acknowledged.  Controlling
      inflight bytes has an important advantage compared to controlling
      rates.  In the absence of congestion, the inflight bytes and rate
      are interchangeable via the equation inflight = rate * T, where T
      is the base propagation RTT.  The rate can be calculated locally
      or obtained from the notification packet.  The sender may further
      use a data pacing mechanism, potentially implemented in hardware,
      to limit the rate accordingly.

   *  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.  The inband telemetry
      information reports the current load of the packet's egress port,
      including timestamp (ts), queue length (qLen), transmitted bytes
      (txBytes), and link bandwidth capacity (B).  In addition, the
      inband telemetry contains switch_ID and port_ID to identify a
      link.  An illustrative layout of this per-hop record is sketched
      after this list.

   *  Data receiver: responsible for either reflecting back the inband
      telemetry information in the data packet or calculating the
      proper flow rate based on the network congestion information in
      the inband telemetry, and sending notification packets back to
      the sender.
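   The following Python sketch illustrates one way to model the per-hop
   telemetry record and the per-packet path vector described above.
   The field names follow this section; the class name, the example
   values, and the list layout are hypothetical and do not imply any
   particular wire format or telemetry encapsulation.

   from dataclasses import dataclass
   from typing import List

   @dataclass
   class HopTelemetry:
       """Load snapshot taken at one egress port (one hop on the path)."""
       switch_id: int    # switch_ID: identifies the network node
       port_id: int      # port_ID: identifies the egress port (link)
       ts: float         # timestamp of the snapshot, in seconds
       qlen: int         # queue length at the egress port, in bytes
       tx_bytes: int     # accumulated bytes transmitted on the link
       bandwidth: float  # link capacity B, in bytes per second

   # A data packet (or the ACK that echoes it) carries one record per
   # hop, e.g. the two switches on the path of Figure 1:
   path_telemetry: List[HopTelemetry] = [
       HopTelemetry(switch_id=1, port_id=3, ts=100e-6, qlen=6000,
                    tx_bytes=1_250_000, bandwidth=12.5e9),
       HopTelemetry(switch_id=2, port_id=7, ts=102e-6, qlen=0,
                    tx_bytes=2_300_000, bandwidth=12.5e9),
   ]

   Section 4 describes how the sender turns two consecutive records for
   the same link into a utilization estimate and a window update.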
4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the core
   congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used in
   the HPCC++ algorithm.  Figure 3 also includes default values for the
   algorithm parameters, chosen either to represent a typical setting
   in practical applications or based on theoretical and simulation
   studies.

   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for Link j                            |
   | I_j          | Estimated inflight bytes for Link j             |
   | U_j          | Normalized inflight bytes for Link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                      Figure 2: List of variables.
   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

    Figure 3: List of algorithm parameters and their default values.

4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:              ack.L[i].txBytes - L[i].txBytes
         txRate = --------------------------------- ;
                       ack.L[i].ts - L[i].ts
    5:         min(ack.L[i].qlen, L[i].qlen)     txRate
         u' = ------------------------------- + ------------ ;
                     ack.L[i].B * T              ack.L[i].B
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:           Wc
         W = --------- + W_ai;
              U / eta
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Line 14-15 and
   18-19).  The sender also remembers the pacing rate and current
   inband telemetry information at Line 27.  The sender computes a new
   window size W at Line 23 or Line 26, depending on whether to update
   Wc, with the functions MeasureInflight and ComputeWind.  Function
   MeasureInflight estimates the normalized inflight bytes at Line 5.
   First, it computes txRate of each link from the current and last
   accumulated transmitted bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to 7 selects the
   maximum per-link normalized inflight bytes, max_i(U_i).  Instead of
   directly using max_i(U_i), an EWMA (Exponentially Weighted Moving
   Average) is applied to filter out the noise from timer inaccuracy
   and transient queues (Line 9).  Function ComputeWind combines
   multiplicative increase/decrease (MI/MD) and additive increase (AI)
   to balance the reaction speed and fairness.  If a sender finds it
   should increase the window size, it first tries AI for maxStage
   times with the step W_ai (Line 17).  If it still finds room to
   increase after maxStage times of AI, or the normalized inflight
   bytes is above eta, it performs a multiplicative update once to
   quickly ramp up or ramp down the window size (Line 12-13).
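   The following runnable Python sketch mirrors the pseudocode above,
   using the default parameters from Figure 3.  The class name, the
   per-hop record attributes (ts, qlen, tx_bytes, bandwidth, as in the
   Section 3 sketch), and the handling of the very first ACK are
   illustrative assumptions, not a normative implementation of this
   specification.

   # Illustrative Python sketch of the sender-side HPCC++ logic.
   # Windows are in bytes, rates in bytes per second, times in seconds.

   T = 5e-6          # known baseline RTT
   ETA = 0.95        # target link utilization
   MAX_STAGE = 5     # maximum number of additive-increase stages

   class HpccSender:
       def __init__(self, w_init, w_ai):
           self.w = w_init            # current window W
           self.wc = w_init           # reference window Wc
           self.w_ai = w_ai           # additive increase step W_ai
           self.inc_stage = 0
           self.u = 0.0               # EWMA of normalized inflight bytes
           self.last_update_seq = 0
           self.snd_nxt = 0           # advanced by the transport on send
           self.last_tlm = None       # telemetry L from the previous ACK

       def measure_inflight(self, ack_tlm):
           """Lines 1-10: estimate the normalized inflight bytes U."""
           u_max, tau = 0.0, T
           for cur, prev in zip(ack_tlm, self.last_tlm):
               tx_rate = (cur.tx_bytes - prev.tx_bytes) / (cur.ts - prev.ts)
               u_link = (min(cur.qlen, prev.qlen) / (cur.bandwidth * T)
                         + tx_rate / cur.bandwidth)
               if u_link > u_max:
                   u_max, tau = u_link, cur.ts - prev.ts
           tau = min(tau, T)
           self.u = (1 - tau / T) * self.u + (tau / T) * u_max
           return self.u

       def compute_wind(self, u, update_wc):
           """Lines 11-20: combine MI/MD and AI to get the new window."""
           if u >= ETA or self.inc_stage >= MAX_STAGE:
               w = self.wc / (u / ETA) + self.w_ai    # multiplicative step
               if update_wc:
                   self.inc_stage, self.wc = 0, w
           else:
               w = self.wc + self.w_ai                # additive step
               if update_wc:
                   self.inc_stage += 1
                   self.wc = w
           return w

       def new_ack(self, ack_seq, ack_tlm):
           """Lines 21-27: per-ACK entry point; returns pacing rate R."""
           if self.last_tlm is None:       # first ACK: only record state
               self.last_tlm = ack_tlm
               return self.w / T
           if ack_seq > self.last_update_seq:
               self.w = self.compute_wind(self.measure_inflight(ack_tlm),
                                          True)
               self.last_update_seq = self.snd_nxt
           else:
               self.w = self.compute_wind(self.measure_inflight(ack_tlm),
                                          False)
           self.last_tlm = ack_tlm         # L = ack.L
           return self.w / T               # R = W / T

   In a real NIC, the window W and the pacing rate R computed here would
   be handed to the scheduler described in Section 7.2 rather than
   returned to a caller.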
5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals), so we set it to 95% by default, which only
   loses 5% bandwidth but achieves almost zero queue.  maxStage
   controls a simple tradeoff between steady-state stability and the
   speed to reclaim free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high-bandwidth networks.  W_ai controls the
   tradeoff between the maximum number of concurrent flows on a link
   that can sustain near-zero queues and the speed of convergence to
   fairness.  Note that none of the three parameters are reliability-
   critical.

   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the workload
   of datacenter applications, where flows are usually short and
   latency-sensitive.  Normally we set a very small W_ai to support a
   large number of concurrent flows on a link, because slower fairness
   convergence is not critical.  A rule of thumb is to set
   W_ai = W_init*(1-eta)/N, where N is the expected or receiver-
   reported maximum number of concurrent flows on a link; a worked
   example is given below.  The intuition is that the total additive
   increase every round (N*W_ai) should not exceed the bandwidth
   headroom, and thus no queue forms.  Even if the actual number of
   concurrent flows on a link exceeds N, the CC is still stable and
   achieves full utilization, but just cannot maintain zero queues.
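   As a concrete illustration of the rule of thumb above, the following
   sketch plugs in assumed example values: a 100 Gbps link, the default
   T = 5us and eta = 95% from Figure 3, and an expected maximum of
   N = 1000 concurrent flows.  The numbers are purely illustrative.

   bandwidth = 100e9 / 8          # link capacity B, in bytes per second
   T = 5e-6                       # baseline RTT, in seconds
   eta = 0.95                     # target utilization
   N = 1000                       # expected max concurrent flows per link

   w_init = bandwidth * T         # line-rate starting window: 62500 bytes
   w_ai = w_init * (1 - eta) / N  # ~3.1 bytes of additive increase

   print(f"W_init = {w_init:.0f} B, W_ai = {w_ai:.2f} B per update")

   The aggregate additive increase per round, N * W_ai = W_init *
   (1 - eta), stays within the 5% bandwidth headroom left by eta, so no
   standing queue builds up even with N flows probing simultaneously.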
6.  Design Enhancement and Implementation

   The basic design of HPCC++, i.e., the original HPCC as described
   above, is to add inband telemetry information into every data packet
   so that a sender can react to congestion as soon as the very first
   packet observes it.  This is especially helpful to reduce the risk
   of severe congestion in incast scenarios within the first round-trip
   time.  In addition, the original HPCC algorithm introduces Wc for
   the purpose of solving the over-reaction issue that arises from this
   per-packet response.

   Alternatively, the inband telemetry information need not be added to
   every data packet, in order to reduce the overhead.  Switches can
   attach inband telemetry less frequently, e.g., once per RTT or upon
   congestion occurrence.

6.1.  HPCC++ Guidelines

   To ensure network stability, HPCC++ establishes a few guidelines for
   different implementations:

   *  The algorithm should commit the window/rate update at most once
      per round-trip time, similar to the procedure of updating Wc.

   *  To support different workloads and to properly set W_ai, HPCC++
      allows the option to incorporate mechanisms to speed up the
      fairness convergence.

   *  The switch should capture inband telemetry information that
      includes link load (txBytes, qlen, ts) and link spec (switch_ID,
      port_ID, B) at the egress port.  Note that each switch should
      record all of this information in a single snapshot to achieve a
      precise link load estimate.

   *  HPCC++ can use a probe packet to query the inband telemetry
      information.  In that case, the probe packets should take the
      same routing path and QoS queueing as the data packets.

   As long as the above guidelines are met, this document does not
   mandate a particular inband telemetry header format or
   encapsulation, which are orthogonal to the HPCC++ algorithm
   described in this document.  The algorithm can be implemented with a
   choice of inband telemetry protocols, such as in-band network
   telemetry [P4-INT], IOAM [I-D.ietf-ippm-ioam-data], IFA
   [I-D.ietf-kumar-ippm-ifa], and others.  In fact, the emerging inband
   telemetry protocols can inform the evolution of a broader range of
   protocols and network functions; this document leverages that trend
   to propose the architecture changes that support the HPCC++
   algorithm.

6.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If ACK packets already exist
   for reliability purposes, the inband telemetry information can be
   echoed back to the sender via ACK self-clocking.  Not all ACK
   packets need to carry the inband telemetry information.  To reduce
   the Packet Per Second (PPS) overhead, the receiver may examine the
   inband telemetry information and adopt the technique of delayed ACKs
   that only sends out an ACK for every few received packets.  In order
   to reduce PPS even further, one may implement the algorithm at the
   receiver and feed the calculated window back in the ACK packet once
   every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The receiver
   performs the same functions except using int.L instead of ack.L.
   The new procedure NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W)
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out an ACK packet that includes the W information.
   Otherwise, it just updates the W information locally.  This reduces
   the amount of traffic that needs to be fed back to the data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   The improvement would allow flows to quickly converge to fairness
   without causing large swings under heavy load.

7.  Reference Implementations

   A prototype of HPCC++ is implemented in NICs to realize the
   congestion control algorithm and in switches to realize the inband
   telemetry feature.

7.1.  Inband telemetry padding at the network elements

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  HPCC++ is open to a variety of inband
   telemetry format standards.  Inside a data center, the path length
   is often no more than 5 hops.  The overhead of the inband telemetry
   padding for HPCC++ is therefore considered to be low; a rough
   estimate is sketched below.
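   As an illustrative back-of-the-envelope estimate of that overhead,
   the sketch below assumes a fixed telemetry header plus a fixed
   amount of per-hop metadata.  The byte counts are assumptions made
   for the sake of the example; the actual sizes depend on the
   telemetry protocol and encapsulation chosen and are not defined by
   this document.

   hops = 5            # typical maximum path length inside a data center
   bytes_per_hop = 12  # assumed per-hop metadata (ts, qLen, txBytes, B, IDs)
   header_bytes = 8    # assumed fixed telemetry instruction/shim header
   mtu = 1500          # common Ethernet MTU, in bytes

   padding = header_bytes + hops * bytes_per_hop
   print(f"{padding} bytes of telemetry per packet, "
         f"{100 * padding / mtu:.1f}% of a {mtu}-byte packet")
   # -> 68 bytes, i.e. about 4.5% of a 1500-byte packet, and less with
   #    jumbo frames or when telemetry rides only on periodic probes.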
7.2.  Congestion control at NICs

   Figure 4 shows the HPCC++ implementation on a NIC.  The NIC provides
   an HPCC++ module that resides on the data path of the NIC; the
   HPCC++ module realizes both the sender and receiver roles.

   +-------------------------------------------------------------------+
   | +---------+ window update +-----------+  PktSend  +-----------+   |
   | |         |-------------->| Scheduler |---------->|Tx pipeline|---+->
   | |         |  rate update  +-----------+           +-----------+   |
   | | HPCC++  |                                             ^         |
   | |         |                           inband telemetry  |         |
   | | module  |                                             |         |
   | |         |                                       +-----+-----+   |
   | |         |<--------------------------------------|Rx pipeline|<--+--
   | +---------+  telemetry response event             +-----------+   |
   +-------------------------------------------------------------------+

                 Figure 4: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC congestion control algorithm on the
   sender side for every flow in the NIC.  A flow can be defined by
   transport parameters including the 5-tuple, destination QP (queue
   pair), etc.  The module receives inband telemetry response events
   per flow, which are generated by the RX pipeline, adjusts the
   sending window and rate, and updates the scheduler with the rate and
   window of the flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value it receives from the algorithm.  It also
   maintains the current sending window size for active flows.  If the
   pacing mechanism and the flow's sending window permit, the scheduler
   issues a PktSend command for the flow to the TX pipeline.

   The TX pipeline implements packet processing.  Once it receives a
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response probe
   packet.

   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data packet
   containing inband telemetry, or a dedicated telemetry request probe
   packet.  The TX pipeline may process and edit the telemetry data,
   and then sends the data back to the sender using either an ACK
   packet of the flow or a dedicated telemetry response packet.
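   The following simplified Python model illustrates the gating role of
   the scheduler described above: a packet is released to the TX
   pipeline only when both the pacing rate and the sending window
   permit.  The class and method names are hypothetical, and an actual
   NIC realizes this logic in hardware.

   import time

   class FlowScheduler:
       """Per-flow pacing and window gating (illustrative model)."""

       def __init__(self, window, rate):
           self.window = window        # sending window W from the HPCC++ module
           self.rate = rate            # pacing rate R = W/T, in bytes per second
           self.inflight = 0           # bytes sent but not yet acknowledged
           self.next_send_time = 0.0   # earliest time the pacer allows a send

       def on_window_update(self, window, rate):
           # Called by the HPCC++ module after a telemetry response event.
           self.window, self.rate = window, rate

       def on_ack(self, acked_bytes):
           self.inflight = max(0, self.inflight - acked_bytes)

       def try_send(self, pkt_len, now=None):
           """Issue PktSend toward the TX pipeline if window and pacer allow."""
           now = time.monotonic() if now is None else now
           if self.inflight + pkt_len > self.window:
               return False                   # blocked by the sending window
           if now < self.next_send_time:
               return False                   # blocked by the pacer
           self.inflight += pkt_len
           self.next_send_time = now + pkt_len / self.rate
           # ... a real scheduler would issue PktSend(flow_id) here ...
           return True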
8.  IANA Considerations

   This document makes no request of IANA.

9.  Discussion

9.1.  Internet Deployment

   Although the discussion above mainly focuses on the data center
   environment, HPCC++ can be adopted on the Internet at large.  There
   are several security considerations one should be aware of.

   Privacy concerns may arise when the telemetry information is
   conveyed across Autonomous Systems (ASes) and back to end-users.
   The link load information captured in telemetry can potentially
   reveal the provider's network capacity, route utilization,
   scheduling policy, etc.  These are usually considered to be
   sensitive data of the network providers.  Hence, certain actions may
   be taken to anonymize the telemetry data and to convey only the
   relative ratio used for rate adaptation across ASes, without
   revealing the actual network load.

   Another consideration is the security of receiving telemetry
   information.  The rate adaptation mechanism in HPCC++ relies on
   feedback from the network.  As such, it is vulnerable to attacks
   where feedback messages are hijacked, replaced, or intentionally
   injected with misleading information resulting in denial of service,
   similar to those that can affect TCP.  It is therefore RECOMMENDED
   that the notification feedback message is at least integrity
   checked.  In addition, [I-D.ietf-avtcore-cc-feedback-message]
   discusses the potential risk of a receiver providing misleading
   congestion feedback information and the mechanisms for mitigating
   such risks.

9.2.  Switch-assisted congestion control

   HPCC++ falls in the general category of switch-assisted congestion
   control.  However, HPCC++ includes a few unique design choices that
   are different from other switch-assisted approaches.

   *  First, HPCC++ implements a primal-mode algorithm that requires
      only the ``write-to-packet'' operation from switches, which is
      already supported by telemetry protocols like INT [P4-INT] or
      IOAM [I-D.ietf-ippm-ioam-data].  Please note that this is very
      different from dual-mode algorithms such as XCP
      [Katabi-SIGCOMM2002] and RCP [Dukkipati-RCP], where switches take
      an active role in determining flows' rates.

   *  Second, HPCC++ achieves fast utilization convergence by
      decoupling it from fairness convergence, which is inspired by
      XCP.

   *  Third, HPCC++ enables switch-guided multiplicative increase (MI)
      by defining the ``inflight bytes'' to quantify the link load.
      The inflight bytes indicate both the underload and the overload
      of the link precisely, and thus allow the flow to increase or
      decrease its rate multiplicatively and safely.  By contrast,
      traditional approaches that use the queue length or RTT as the
      feedback cannot guide the rate increase and instead have to rely
      on additive increase (AI) with heuristics.  As link speeds
      continue to grow, this becomes increasingly slow in reclaiming
      the unused bandwidth.  Besides, queue-based feedback mechanisms
      are subject to latency inflation.

   *  Last, HPCC++ uses the TX rate instead of the RX rate used by XCP
      and RCP.  As detailed in [SIGCOMM-HPCC], we view the TX rate as
      more precise, because the RX rate and the queue length overlap
      and thus cause oscillation.

9.3.  Work with transport protocols

   HPCC++ can be adopted as the CC algorithm by a wide range of
   transport protocols such as TCP and UDP, as well as others that may
   run on top of them, such as iWARP, RoCE, etc.  It requires a window
   limit and congestion feedback through ACK self-clocking, which
   naturally conforms to the paradigm of TCP design.  With that, HPCC++
   introduces a scheme to measure the total inflight bytes for more
   precise congestion control.  To run over UDP, some modifications
   need to be made to enforce the window limit and to collect
   congestion feedback via probing packets, which is an incremental
   change.
9.4.  Work with QoS queuing

   When QoS (Quality of Service) priority queuing is used in switches,
   the length of a flow's own queue cannot tell the actual queuing time
   and the exact extent of congestion.  Although general approaches for
   running congestion control with QoS queuing are out of the scope of
   this document, we provide a few hints for running HPCC++ in harmony
   with QoS queuing.  In this case, HPCC++ can leverage the packet
   sojourn time (the egress timestamp minus the ingress timestamp)
   instead of the queue length to quantify the packet's actual queuing
   delay.  In addition, operators typically use Deficit Weighted Round
   Robin (DWRR) instead of strict priority (SP) as their QoS scheduling
   to prevent traffic starvation.  DWRR provides a minimum bandwidth
   guarantee for each queue, so that HPCC++ can leverage it for precise
   rate updates to avoid congestion.

10.  Acknowledgments

   The authors would like to thank ICCRG members for their valuable
   review comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation and
   evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

12.2.  Informative References

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. A. Ramalho,
              "RTP Control Protocol (RTCP) Feedback for Congestion
              Control", Work in Progress, Internet-Draft, draft-ietf-
              avtcore-cc-feedback-message-09, 2 November 2020.

   [Katabi-SIGCOMM2002]
              Katabi, D., Handley, M., and C. Rohrs, "Congestion
              Control for High Bandwidth-Delay Product Networks", ACM
              SIGCOMM, Pittsburgh, Pennsylvania, USA, October 2002.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom, August
              2015.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification,
              v2.0", February 2020.

   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", Work in Progress,
              Internet-Draft, draft-ietf-ippm-ioam-data-09, March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", Work in Progress, Internet-Draft,
              draft-kumar-ippm-ifa-01, February 2019.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Feng, F., Tang,
              L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M.
              Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

   [Dukkipati-RCP]
              Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              control to make flows complete quickly", Stanford
              University, 2008.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   United States of America

   Email: miao.rui@alibaba-inc.com


   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   United States of America

   Email: hongqiang.liu@alibaba-inc.com
   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   United States of America

   Email: rong.pan@intel.com


   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   United States of America

   Email: jk.lee@intel.com


   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   United States of America

   Email: chang.kim@intel.com


   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   United States of America

   Email: gbarak@mellanox.com


   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com


   Jeff Tantsura
   Microsoft Corporation
   One Microsoft Way
   Redmond, Washington 98052-6399
   United States of America

   Email: jefftantsura@microsoft.com