Network Working Group                                           R. Miao
Internet-Draft                                                   H. Liu
Intended status: Experimental                             Alibaba Group
Expires: September 8, 2022                                        R. Pan
                                                                  J. Lee
                                                                  C. Kim
                                                       Intel Corporation
                                                                B. Gafni
                                                           Y. Shpigelman
                                             Mellanox Technologies, Inc.
                                                             J. Tantsura
                                                   Microsoft Corporation
                                                           March 7, 2021

          HPCC++: Enhanced High Precision Congestion Control
                     draft-miao-rtgwg-hpccplus-00

Abstract

   Congestion control (CC) is the key to achieving ultra-low latency,
   high bandwidth and network stability in high-speed networks.
   However, the existing high-speed CC schemes have inherent
   limitations for reaching these goals.

   In this document, we describe HPCC++ (High Precision Congestion
   Control), a new high-speed CC mechanism which achieves the three
   goals simultaneously.  HPCC++ leverages inband telemetry to obtain
   precise link load information and controls traffic precisely.  By
   addressing challenges such as delayed signaling during congestion
   and overreaction to the congestion signaling using inband and
   granular telemetry, HPCC++ can quickly converge to utilize all the
   available bandwidth while avoiding congestion, and can maintain
   near-zero in-network queues for ultra-low latency.  HPCC++ is also
   fair and easy to deploy in hardware, implementable with commodity
   NICs and switches.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 8, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  System Overview
   4.  HPCC++ Algorithm
       4.1.  Notations
       4.2.  Design Functions and Procedures
   5.  Configuration Parameters
   6.  Design enhancement and implementation
       6.1.  Inband telemetry padding at the network switches
             6.1.1.  Inband telemetry on IFA2.0
             6.1.2.  Inband telemetry on IOAM
             6.1.3.  Inband telemetry on P4
       6.2.  Congestion Notification
             6.2.1.  Forward direction Congestion detection
             6.2.2.  Reverse direction
       6.3.  Congestion control at NICs
             6.3.1.  Sender-based HPCC
             6.3.2.  Receiver-based HPCC
   7.  Reference Implementation
       7.1.  Implementation on RDMA RoCEv2
       7.2.  Implementation on TCP
   8.  IANA Considerations
   9.  Discussion
       9.1.  Internet Deployment
       9.2.  Switch-assisted congestion control
       9.3.  Work with QoS queuing
       9.4.  Path migration
   10. Acknowledgments
   11. Contributors
   12. References
       12.1.  Normative References
       12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The link speed in data center networks has grown from 1Gbps to
   100Gbps in the past decade, and this growth is continuing.  Ultra-
   low latency and high bandwidth, which are demanded by more and more
   applications, are two critical requirements in today's and future
   high-speed networks.

   Given that traditional software-based network stacks in hosts can no
   longer sustain the critical latency and bandwidth requirements, as
   described in [Zhu-SIGCOMM2015], offloading network stacks into
   hardware is an inevitable direction in high-speed networks.  As an
   example, large-scale networks with RDMA (remote direct memory
   access) often use hardware-offloading solutions.  In some cases, the
   RDMA networks still face fundamental challenges in reconciling low
   latency, high bandwidth utilization, and high stability.

   This document describes a new congestion control mechanism, HPCC++
   (Enhanced High Precision Congestion Control), for large-scale, high-
   speed networks.  The key idea behind HPCC++ is to leverage the
   precise link load information signaled through inband telemetry to
   compute accurate flow rate updates.  Unlike existing approaches that
   often require a large number of iterations to find the proper flow
   rates, HPCC++ requires only one rate update step in most cases.
   Using precise information from inband telemetry enables HPCC++ to
   address the limitations of current congestion control schemes.
   First, HPCC++ senders can quickly ramp up flow rates for high
   utilization and ramp down flow rates for congestion avoidance.
   Second, HPCC++ senders can quickly adjust the flow rates to keep
   each link's output rate slightly lower than the link's capacity,
   preventing queues from building up while preserving high link
   utilization.  Finally, since sending rates are computed precisely
   based on direct measurements at switches, HPCC++ requires merely
   three independent parameters that are used to tune fairness and
   efficiency.

   The base form of HPCC++ is the original HPCC algorithm, whose full
   description can be found in [SIGCOMM-HPCC].  While the original
   design lays the foundation for inband telemetry based precision
   congestion control, HPCC++ is an enhanced version which takes system
   constraints into account and aims to reduce the design overhead and
   further improve performance.  Section 6 describes the proposed
   design enhancements and guidelines in detail.

   This document describes the architecture changes in switches and
   end-hosts needed to transmit and consume inband telemetry, which
   improves the efficiency of handling network congestion.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  System Overview

   Figure 1 shows the end-to-end system that HPCC++ operates in.  As a
   packet travels from the sender to the receiver, each switch along
   the path inserts inband telemetry that reports the current state of
   the packet's egress port, including timestamp (ts), queue length
   (qLen), transmitted bytes (txBytes), and the link bandwidth capacity
   (B), together with switch_ID and port_ID.  When the receiver gets
   the packet, it may copy all the inband telemetry recorded from the
   network into the ACK message it sends back to the sender, and then
   the sender decides how to adjust its flow rate each time it receives
   an ACK with network load information.
   Alternatively, the receiver may calculate the flow rate based on the
   inband telemetry information and feed the calculated rate back to
   the sender.  The notification packets would include delayed ACK
   information as well.

   Note that there also exist network nodes along the reverse
   (potentially uncongested) path that the feedback reports traverse.
   Those network nodes are not shown in the figure for the sake of
   brevity.

   +---------+   pkt   +-------+ pkt+tlm +-------+ pkt+tlm +----------+
   |  Data   |-------->|       |-------->|       |-------->|   Data   |
   |  Sender |=========|Switch1|=========|Switch2|=========| Receiver |
   +---------+  Link-0 +-------+  Link-1 +-------+  Link-2 +----------+
       /|\                                                       |
        |                                                        |
        +--------------------------------------------------------+
                       Notification Packets/ACKs

          Figure 1: System Overview (tlm = inband telemetry)

   o  Data sender: responsible for controlling inflight bytes.  HPCC++
      is a window-based congestion control scheme that controls the
      number of inflight bytes.  The inflight bytes are the amount of
      data that has been sent but not yet acknowledged.  Controlling
      inflight bytes has an important advantage compared to controlling
      rates.  In the absence of congestion, the inflight bytes and rate
      are interchangeable via the equation inflight = rate * T, where T
      is the base propagation RTT.  The rate can be calculated locally
      or obtained from the notification packet.  The sender may further
      use a data pacing mechanism, potentially implemented in hardware,
      to limit the rate accordingly.

   o  Network nodes: responsible for inserting the inband telemetry
      information into the data packet.  The inband telemetry
      information reports the current load of the packet's egress port,
      including timestamp (ts), queue length (qLen), transmitted bytes
      (txBytes), and link bandwidth capacity (B).  In addition, the
      inband telemetry contains the switch_ID and port_ID to identify a
      link.

   o  Data receiver: responsible for either reflecting the inband
      telemetry information in the data packet back to the sender, or
      calculating the proper flow rate based on the network congestion
      information in the inband telemetry and sending notification
      packets back to the sender.

4.  HPCC++ Algorithm

   HPCC++ is a window-based congestion control algorithm.  The key
   design choice of HPCC++ is to rely on network nodes to provide fine-
   grained load information, such as queue size and accumulated tx/rx
   traffic, to compute precise flow rates.  This has two major
   benefits: (i) HPCC++ can quickly converge to proper flow rates to
   highly utilize bandwidth while avoiding congestion; and (ii) HPCC++
   can consistently maintain a close-to-zero queue for low latency.

   This section introduces the list of notations and describes the core
   congestion control algorithm.

4.1.  Notations

   This section summarizes the list of variables and parameters used in
   the HPCC++ algorithm.  Figure 3 also includes the default values for
   the algorithm parameters, chosen either to represent a typical
   setting in practical applications or based on theoretical and
   simulation studies.

   +--------------+-------------------------------------------------+
   | Notation     | Variable Name                                   |
   +--------------+-------------------------------------------------+
   | W_i          | Window for flow i                               |
   | Wc_i         | Reference window for flow i                     |
   | B_j          | Bandwidth for Link j                            |
   | I_j          | Estimated inflight bytes for Link j             |
   | U_j          | Normalized inflight bytes for Link j            |
   | qlen         | Telemetry info: link j queue length             |
   | txRate       | Telemetry info: link j output rate              |
   | ts           | Telemetry info: timestamp                       |
   | txBytes      | Telemetry info: link j total transmitted bytes  |
   |              | associated with timestamp ts                    |
   +--------------+-------------------------------------------------+

                      Figure 2: List of variables.

   +--------------+----------------------------------+----------------+
   | Notation     | Parameter Name                   | Default Value  |
   +--------------+----------------------------------+----------------+
   | T            | Known baseline RTT               | 5us            |
   | eta          | Target link utilization          | 95%            |
   | maxStage     | Maximum stages for additive      |                |
   |              | increases                        | 5              |
   | N            | Maximum number of flows          | ...            |
   | W_ai         | Additive increase amount         | ...            |
   +--------------+----------------------------------+----------------+

     Figure 3: List of algorithm parameters and their default values.

4.2.  Design Functions and Procedures

   The HPCC++ algorithm can be outlined as below:

    1: Function MeasureInflight(ack)
    2:   u = 0;
    3:   for each link i on the path do
    4:     txRate = (ack.L[i].txBytes - L[i].txBytes) /
                    (ack.L[i].ts - L[i].ts);
    5:     u' = min(ack.L[i].qlen, L[i].qlen) / (ack.L[i].B * T)
                + txRate / ack.L[i].B;
    6:     if u' > u then
    7:       u = u'; tau = ack.L[i].ts - L[i].ts;
    8:   tau = min(tau, T);
    9:   U = (1 - tau/T)*U + tau/T*u;
   10:   return U;

   11: Function ComputeWind(U, updateWc)
   12:   if U >= eta or incStage >= maxStage then
   13:     W = Wc / (U/eta) + W_ai;
   14:     if updateWc then
   15:       incStage = 0; Wc = W;
   16:   else
   17:     W = Wc + W_ai;
   18:     if updateWc then
   19:       incStage++; Wc = W;
   20:   return W;

   21: Procedure NewAck(ack)
   22:   if ack.seq > lastUpdateSeq then
   23:     W = ComputeWind(MeasureInflight(ack), True);
   24:     lastUpdateSeq = snd_nxt;
   25:   else
   26:     W = ComputeWind(MeasureInflight(ack), False);
   27:   R = W/T; L = ack.L;

   The above illustrates the overall process of CC at the sender side
   for a single flow.  Each newly received ACK message triggers the
   procedure NewAck at Line 21.  At Line 22, the variable lastUpdateSeq
   is used to remember the first packet sent with a new Wc, and the
   sequence number in the incoming ACK should be larger than
   lastUpdateSeq to trigger a new sync between Wc and W (Lines 14-15
   and 18-19).  The sender also remembers the pacing rate and the
   current inband telemetry information at Line 27.  The sender
   computes a new window size W at Line 23 or Line 26, depending on
   whether Wc is updated, using the functions MeasureInflight and
   ComputeWind.  Function MeasureInflight estimates the normalized
   inflight bytes with Eqn (2) of [SIGCOMM-HPCC] at Line 5.  First, it
   computes the txRate of each link from the current and last
   accumulated transferred bytes txBytes and timestamps ts (Line 4).
   It also uses the minimum of the current and last qlen to filter out
   noise in qlen (Line 5).  The loop from Line 3 to 7 selects
   max_i(U_i) as in Eqn (3).  Instead of directly using max_i(U_i), we
   use an EWMA (Exponentially Weighted Moving Average) to filter out
   the noise from timer inaccuracy and transient queues (Line 9).
   Function ComputeWind combines multiplicative increase/decrease
   (MI/MD) and additive increase (AI) to balance the reaction speed and
   fairness.  If a sender finds it should increase the window size, it
   first tries AI for maxStage times with the step W_ai (Line 17).  If
   it still finds room to increase after maxStage rounds of AI, or if
   the normalized inflight bytes are above eta, it applies Eqn (4) once
   to quickly ramp up or ramp down the window size (Lines 12-13).

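   The following non-normative Python sketch shows one way a host could
   realize the pseudocode above.  The class and member names
   (HpccSender, LinkTelemetry, new_ack, etc.) are illustrative
   assumptions of this sketch rather than part of this specification;
   the parameters T, eta, maxStage, and W_ai correspond to Figure 3.

      # Illustrative, non-normative translation of the pseudocode in
      # Section 4.2.  All names below are introduced for readability
      # only.
      from dataclasses import dataclass

      @dataclass
      class LinkTelemetry:
          ts: float        # timestamp when the packet left the egress port
          qlen: float      # queue length in bytes
          txBytes: float   # accumulated transmitted bytes at time ts
          B: float         # link bandwidth capacity in bytes/second

      class HpccSender:
          def __init__(self, w_init, T, eta=0.95, max_stage=5, w_ai=0.0):
              self.W = self.Wc = w_init    # current and reference windows
              self.T, self.eta = T, eta
              self.max_stage, self.w_ai = max_stage, w_ai
              self.U = 0.0                 # EWMA of normalized inflight bytes
              self.inc_stage = 0
              self.last_update_seq = 0
              self.L = None                # last per-link telemetry snapshot
              self.R = w_init / T          # pacing rate

          def measure_inflight(self, ack_links):
              """Lines 1-10: estimate normalized inflight bytes on the path."""
              u, tau = 0.0, self.T
              for new, old in zip(ack_links, self.L):
                  tx_rate = (new.txBytes - old.txBytes) / (new.ts - old.ts)
                  u_i = (min(new.qlen, old.qlen) / (new.B * self.T)
                         + tx_rate / new.B)
                  if u_i > u:
                      u, tau = u_i, new.ts - old.ts
              tau = min(tau, self.T)
              self.U = (1 - tau / self.T) * self.U + (tau / self.T) * u
              return self.U

          def compute_wind(self, U, update_wc):
              """Lines 11-20: MI/MD on overload, AI otherwise."""
              if U >= self.eta or self.inc_stage >= self.max_stage:
                  W = self.Wc / (U / self.eta) + self.w_ai
                  if update_wc:
                      self.inc_stage, self.Wc = 0, W
              else:
                  W = self.Wc + self.w_ai
                  if update_wc:
                      self.inc_stage, self.Wc = self.inc_stage + 1, W
              return W

          def new_ack(self, ack_seq, snd_nxt, ack_links):
              """Lines 21-27: per-ACK window and pacing-rate update."""
              if self.L is None:
                  self.L = ack_links       # first sample, nothing to compare
              elif ack_seq > self.last_update_seq:
                  self.W = self.compute_wind(
                      self.measure_inflight(ack_links), True)
                  self.last_update_seq = snd_nxt
              else:
                  self.W = self.compute_wind(
                      self.measure_inflight(ack_links), False)
              self.R = self.W / self.T     # pacing rate
              self.L = ack_links

   In such a realization, a host would keep one HpccSender instance per
   flow and call new_ack() for every ACK or notification packet that
   carries telemetry; W and R then drive the window limit and the
   pacing rate.
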
5.  Configuration Parameters

   HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai.
   eta controls a simple tradeoff between utilization and transient
   queue length (due to the temporary collision of packets caused by
   their random arrivals); we set it to 95% by default, which
   sacrifices only 5% of bandwidth but achieves an almost zero queue.
   maxStage controls a simple tradeoff between steady-state stability
   and the speed of reclaiming free bandwidth.  We find maxStage = 5 is
   conservatively large for stability, while the speed of reclaiming
   free bandwidth is still much faster than traditional additive
   increase, especially in high-bandwidth networks.  W_ai controls the
   tradeoff between the maximum number of concurrent flows on a link
   that can sustain near-zero queues and the speed of convergence to
   fairness.  Note that none of the three parameters are reliability-
   critical.

   HPCC++'s design brings advantages to short-lived flows by allowing
   flows to start at line rate and by separating utilization
   convergence from fairness convergence.  HPCC++ achieves fast
   utilization convergence to mitigate congestion in almost one round-
   trip time, while allowing flows to gradually converge to fairness.
   This design feature of HPCC++ is especially helpful for the
   workloads of datacenter applications, where flows are usually short
   and latency-sensitive.  Normally we set a very small W_ai to support
   a large number of concurrent flows on a link, because slower
   convergence to fairness is not critical.  A rule of thumb is to set
   W_ai = W_init*(1-eta)/N, where N is the expected or receiver-
   reported maximum number of concurrent flows on a link.  The
   intuition is that the total additive increase every round (N*W_ai)
   should not exceed the bandwidth headroom, and thus no queue forms.
   Even if the actual number of concurrent flows on a link exceeds N,
   the CC is still stable and achieves full utilization, but just
   cannot maintain zero queues.

6.  Design enhancement and implementation

   There are three components HPCC++ needs to implement: telemetry
   padding, congestion notification, and rate update.

6.1.  Inband telemetry padding at the network switches

   HPCC++ only relies on packets to share information across senders,
   receivers, and switches.  The switch should capture inband telemetry
   information that includes the link load (txBytes, qlen, ts) and the
   link spec (switch_ID, port_ID, B) at the egress port.  Note that
   each switch should record all of this information in a single
   snapshot to achieve a precise link load estimate.  Inside a data
   center, the path length is often no more than 5 hops.  The overhead
   of the inband telemetry padding for HPCC++ is therefore considered
   to be low.

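   As a purely illustrative aid, the per-hop snapshot described above
   can be modeled as the record below.  The field names follow Sections
   3 and 6.1; the record layout is an assumption of this sketch, not a
   wire format (Sections 6.1.1 to 6.1.3 point to concrete encodings).

      # Non-normative sketch of the per-hop snapshot a switch appends
      # at its egress port.  Widths and encoding are left to the
      # telemetry protocol in use (IFA, IOAM, or INT).
      from dataclasses import dataclass
      from typing import List

      @dataclass
      class HopRecord:
          switch_id: int   # identifies the switch
          port_id: int     # identifies the egress port (link)
          B: int           # link bandwidth capacity, bytes/second
          ts: int          # egress timestamp of this packet
          qlen: int        # egress queue length in bytes at time ts
          txBytes: int     # accumulated bytes transmitted on the port

      def append_hop(records: List[HopRecord], snapshot: HopRecord) -> None:
          # All fields of 'snapshot' must be taken at the same instant so
          # the sender can derive txRate and inflight bytes consistently
          # (see Section 4.2).
          records.append(snapshot)
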
   As long as the above requirements are met, HPCC++ is open to a
   variety of inband telemetry format standards, which are orthogonal
   to the HPCC++ algorithm.  Although this document does not mandate a
   particular inband telemetry header format or encapsulation, we
   provide concrete implementation specifications using standard inband
   telemetry protocols, including IFA [I-D.ietf-kumar-ippm-ifa], IETF
   IOAM [I-D.ietf-ippm-ioam-data], and P4.org INT [P4-INT].  In fact,
   emerging inband telemetry protocols are informing the evolution of a
   broader range of protocols and network functions; this document
   leverages that trend to propose architecture changes that support
   in-network functions such as congestion control with high
   efficiency.

6.1.1.  Inband telemetry on IFA2.0

   For more details, please refer to IFA [I-D.ietf-kumar-ippm-ifa].

6.1.2.  Inband telemetry on IOAM

   Please refer to IETF IOAM [I-D.ietf-ippm-ioam-data].

6.1.3.  Inband telemetry on P4

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     nHop      |        pathID         |        Padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Speed |                   Timestamp                   |txBytes|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        txBytes(lower)         |         Queue Length          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            2nd Hop                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         2nd Hop(lower)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Options            |            Padding            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 4: Example P4.org INT header

   Figure 4 shows the packet format of the INT padding after the UDP
   header.  The field nHop is the hop count of the packet's path.  The
   field pathID is the XOR of all the switch IDs (which are 12 bits)
   along the path.  The sender sets nHop and pathID to 0.  Each switch
   along the path increments nHop by 1 and XORs its own switch ID into
   the pathID.  The sender uses pathID to determine whether the path of
   the flow has changed.  If so, it discards the existing status
   records of the flow and builds up new records.  Each switch has an
   8-byte field to record the status of the packet's egress port when
   the packet is emitted.  B (the Speed field) is an enum type which
   indicates the speed of the port (e.g., 40Gbps, 100Gbps, etc.).
   Timestamp (24 bits) is the time when the packet is emitted from its
   egress port, txBytes (20 bits) is the accumulated total bytes sent
   from the egress port, and Queue Length (16 bits) is the current
   queue length of the egress port.

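   The sketch below illustrates how an end-host might decode one 8-byte
   per-hop record of the example layout above.  The 4-bit width assumed
   for the Speed field is inferred from the figure and prose (4 + 24 +
   20 + 16 = 64 bits) and is an assumption of this sketch, not a
   normative definition.

      # Non-normative sketch: decoding one 8-byte per-hop record from
      # the example INT layout in Figure 4.
      def decode_hop_record(record: bytes):
          """Return (speed_enum, timestamp, tx_bytes, queue_length)."""
          assert len(record) == 8
          value = int.from_bytes(record, "big")
          speed_enum   = (value >> 60) & 0xF        # 4-bit port speed enum
          timestamp    = (value >> 36) & 0xFFFFFF   # 24-bit egress timestamp
          tx_bytes     = (value >> 16) & 0xFFFFF    # 20-bit accumulated bytes
          queue_length = value & 0xFFFF             # 16-bit queue length
          return speed_enum, timestamp, tx_bytes, queue_length
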
6.2.  Congestion Notification

   HPCC++ uses congestion notification to fetch network congestion
   information from switches for proper rate updates at end-hosts.
   Although the basic algorithm described in Section 4 adds inband
   telemetry information into every data packet for optimal
   performance, HPCC++ supports flexible implementation choices to work
   seamlessly with transport protocol stacks.  We consider congestion
   notification choices in both the forward and reverse directions of
   the traffic.

6.2.1.  Forward direction Congestion detection

   The forward direction is the traffic direction of data packets that
   experience bandwidth contention and possible network congestion.
   The function of congestion notification in the forward direction is
   to fetch inband telemetry from switches.  HPCC++ defines two
   approaches for doing this.

   1.  Inband with data packet.

   This is the basic algorithm setting described in Section 4, where
   the end-host inserts an inband telemetry header into data packets.
   Switches along the path detect the inband telemetry header and add
   inband telemetry information into the data packet, so the sender can
   react to congestion as soon as the very first packet observes the
   network congestion.  This is especially helpful to reduce the risk
   of severe congestion in incast scenarios during the first round-trip
   time.  In addition, the original HPCC algorithm introduces Wc
   precisely to solve the over-reaction issue that arises from this
   per-packet response.  Different from Section 4, the end-host can
   choose to use every data packet or only a subset of data packets to
   reduce the overhead.  To insert the telemetry header, each telemetry
   protocol (IFA, IETF IOAM, and P4.org INT) has its own specific
   settings.

   2.  Probe packet.

   Since having switches touch every data packet to insert inband
   telemetry may raise security or performance concerns, HPCC++ also
   supports an ``out-of-band'' approach in which end-hosts use
   specially generated probe packets to fetch inband telemetry from
   switches.  The probe packets should take the same routing path and
   QoS queueing as the data packets.  End-hosts can generate probe
   packets less frequently, and we recommend once per round-trip time.
   In addition, the end-host issues probe packets only when it has data
   packets in flight.

6.2.2.  Reverse direction

   In the reverse direction, the receiver conveys inband telemetry back
   to the traffic sender for rate updates.  Similar to the forward
   direction, there are also inband and out-of-band approaches.

   1.  Inband with ACK packet.

   HPCC++ supports using the ACK packets of transport protocols to
   convey the inband telemetry.  TCP generates an ACK once per data
   packet or once per a few data packets.  With ACK packets, the
   receiver sends the accumulated inband telemetry back to the sender
   for rate updates.

   2.  Notification packet.

   Using ACK packets for inband telemetry notification requires
   transport-stack modification and can delay the notification when a
   delayed-acknowledgment mechanism is used.  Hence, HPCC++ allows the
   receiver to use specially generated notification packets to deliver
   inband telemetry.  A notification packet is generated for each probe
   packet or data packet that carries inband telemetry.

6.3.  Congestion control at NICs

6.3.1.  Sender-based HPCC

   Figure 5 shows an HPCC++ implementation on a NIC.  The NIC provides
   an HPCC++ module that resides on the data path of the NIC; the
   HPCC++ module realizes both the sender and receiver roles.

   +--------------------------------------------------------------------+
   | +---------+  window update  +-----------+  PktSend   +-----------+ |
   | |         |---------------->| Scheduler |----------->|Tx pipeline|-+->
   | |         |   rate update   +-----------+            +-----------+ |
   | | HPCC++  |                                                ^       |
   | |         |                              inband telemetry |        |
   | | module  |                                                |       |
   | |         |                                          +-----+-----+ |
   | |         |<-----------------------------------------|Rx pipeline|<+--
   | +---------+         telemetry response event         +-----------+ |
   +--------------------------------------------------------------------+

               Figure 5: Overview of NIC Implementation

   1.  Sender side flow

   The HPCC++ module runs the HPCC CC algorithm on the sender side for
   every flow in the NIC.  A flow can be defined by transport
   parameters including the 5-tuple, the destination QP (queue pair),
   etc.  The module receives per-flow inband telemetry response events
   generated by the RX pipeline, adjusts the sending window and rate,
   and updates the scheduler with the rate and window of the flow.

   The scheduler contains a pacing mechanism that determines the flow
   rate from the value it receives from the algorithm.  It also
   maintains the current sending window size for active flows.  If the
   pacing mechanism and the flow's sending window permit, the scheduler
   issues a PktSend command for the flow to the TX pipeline.

   The TX pipeline implements packet processing.  Once it receives the
   PktSend event with a flow ID from the scheduler, it generates the
   corresponding packet and delivers it to the network.  If a sent
   packet should collect telemetry on its way, the TX pipeline may add
   indications/headers that trigger the network elements to add
   telemetry data according to the inband telemetry protocol in use.
   The telemetry can be collected by the data packet or by dedicated
   probe packets generated in the TX pipeline.

   The RX pipeline parses the incoming packets from the network and
   identifies whether telemetry is embedded in the parsed packet.  On
   receiving a telemetry response packet, the RX pipeline extracts the
   network status from the packet and passes it to the HPCC++ module
   for processing.  A telemetry response packet can be an ACK
   containing inband telemetry, or a dedicated telemetry response probe
   packet.

   2.  Receiver side flow

   On receiving a packet containing inband telemetry, the RX pipeline
   extracts the network status and the flow parameters from the packet
   and passes them to the TX pipeline.  The packet can be a data packet
   containing inband telemetry, or a dedicated telemetry request probe
   packet.  The TX pipeline may process and edit the telemetry data and
   then sends the data back to the sender using either an ACK packet of
   the flow or a dedicated telemetry response packet.

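   The sketch below gives a purely illustrative view of the sender-side
   event flow described in item 1 above.  The Scheduler interface and
   the on_telemetry_response() handler are assumptions made for this
   sketch; HpccSender refers to the illustrative class in Section 4.2.

      # Non-normative sketch of the sender-side event flow of Figure 5.
      class Scheduler:
          """Pacing mechanism plus per-flow sending-window bookkeeping."""
          def __init__(self):
              self.flows = {}   # flow_id -> (rate, window)

          def update(self, flow_id, rate, window):
              self.flows[flow_id] = (rate, window)

          def may_send(self, flow_id, inflight_bytes, pacing_ok):
              # If True, the scheduler would issue a PktSend command for
              # this flow to the TX pipeline.
              rate, window = self.flows[flow_id]
              return pacing_ok and inflight_bytes < window

      def on_telemetry_response(flow_id, ack_seq, snd_nxt, hop_records,
                                senders, scheduler):
          # The RX pipeline has parsed a telemetry response (an ACK with
          # telemetry or a dedicated response packet) and raised a
          # per-flow event.
          cc = senders[flow_id]                       # HpccSender instance
          cc.new_ack(ack_seq, snd_nxt, hop_records)   # window/rate update
          scheduler.update(flow_id, cc.R, cc.W)       # pace accordingly
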
6.3.2.  Receiver-based HPCC

   Note that the window/rate calculation can be implemented at either
   the data sender or the data receiver.  If ACK packets already exist
   for reliability purposes, the inband telemetry information can be
   echoed back to the sender via ACK self-clocking.  Not all ACK
   packets need to carry the inband telemetry information.  To reduce
   the Packet Per Second (PPS) overhead, the receiver may examine the
   inband telemetry information and adopt the technique of delayed ACKs
   that only sends out an ACK for every few received packets.  In order
   to reduce PPS even further, one may implement the algorithm at the
   receiver and feed back the calculated window in the ACK packet once
   every RTT.

   The receiver-based algorithm, Rx-HPCC, is based on int.L, which is
   the inband telemetry information in the packet header.  The receiver
   performs the same functions except using int.L instead of ack.L.
   The new procedure NewINT(int.L) replaces NewAck(ack):

   28: Procedure NewINT(int.L)
   29:   if now > (lastUpdateTime + T) then
   30:     W = ComputeWind(MeasureInflight(int), True);
   31:     send_ack(W);
   32:     lastUpdateTime = now;
   33:   else
   34:     W = ComputeWind(MeasureInflight(int), False);

   Here, since the receiver does not know the starting sequence number
   of a burst, it simply records the lastUpdateTime.  If time T has
   passed since lastUpdateTime, the algorithm recalculates Wc as in
   Line 30 and sends out an ACK packet that includes the W information.
   Otherwise, it just updates the W information locally.  This reduces
   the amount of traffic that needs to be fed back to the data sender.

   Note that the receiver can also measure the number of outstanding
   flows, N, if the last hop is the congestion point, and use this
   information to dynamically adjust W_ai to achieve better fairness.
   The improvement would allow flows to quickly converge to fairness
   without causing large swings under heavy load.

7.  Reference Implementation

   HPCC++ can be adopted as the CC algorithm by a wide range of
   transport protocols such as TCP and UDP, as well as others that may
   run on top of them, such as iWARP, RoCE, etc.  It requires a window
   limit and congestion feedback through ACK self-clocking, which
   naturally conforms to the TCP design paradigm.  On top of that,
   HPCC++ introduces a scheme to measure the total inflight bytes for
   more precise congestion control.  To run over UDP, some incremental
   modifications are needed to enforce the window limit and to collect
   congestion feedback via probing packets.

7.1.  Implementation on RDMA RoCEv2

   We describe a reference implementation on RDMA RoCEv2.  This is an
   implementation of ``Sender-based HPCC++'' (see Section 6.3.1) that
   uses dedicated probe packets to collect the telemetry.  The HPCC++
   module in the sender triggers the sending of a ``telemetry request
   packet'' for a given flow.  The NIC then sends the probe packet,
   which has the same IP and UDP headers as the data packets of the
   given flow.  Such a packet is expected to be sent every RTT; see
   Section 6 for more details.  On receiving a telemetry request
   packet, the receiver NIC extracts the telemetry from all the links
   along the path from the sender.  The HPCC++ module chooses the link
   with the highest inflight bytes and sends its telemetry (queue
   length, timestamp, and tx bytes) back to the sender in a dedicated
   ``telemetry response packet''.  On receiving a telemetry response
   packet, the sender NIC extracts the telemetry and passes it to the
   HPCC++ module, which uses this information to implement the rate
   update scheme.

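   The following sketch illustrates, under the assumptions of the
   earlier sketches, how the receiver-side selection described above
   might be realized: the receiver keeps the previous per-hop sample
   for the flow, estimates each link's normalized load as in Section
   4.2, and echoes only the most loaded hop's telemetry back to the
   sender.  The helper names are hypothetical.

      # Non-normative sketch of the receiver-side selection in Section
      # 7.1.  'prev' and 'curr' are lists of HopRecord (Section 6.1
      # sketch); 'prev' holds the previous per-hop sample for this flow,
      # which is an assumption of this sketch.
      def select_bottleneck_hop(prev, curr, T):
          """Return the index of the hop with the highest normalized load."""
          best_idx, best_u = 0, -1.0
          for i, (old, new) in enumerate(zip(prev, curr)):
              tx_rate = (new.txBytes - old.txBytes) / max(new.ts - old.ts, 1)
              u = min(new.qlen, old.qlen) / (new.B * T) + tx_rate / new.B
              if u > best_u:
                  best_idx, best_u = i, u
          return best_idx

      def build_telemetry_response(prev, curr, T):
          # Echo only the bottleneck hop's (qlen, ts, txBytes) to the
          # sender, keeping the response packet small.
          i = select_bottleneck_hop(prev, curr, T)
          hop = curr[i]
          return {"qlen": hop.qlen, "ts": hop.ts, "txBytes": hop.txBytes}
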
7.2.  Implementation on TCP

   Bringing the benefit of precise congestion control to TCP is a
   natural next step.  Since TCP segmentation on the TX side (e.g.,
   TSO) and coalescing on the RX side (e.g., GRO) happen in the NIC
   hardware or the lower layers of the TCP/IP stack, carrying
   per-packet inband telemetry information between the TCP congestion
   control engine and the network fabric has to work with TSO and GRO.
   Instead, one way to adopt HPCC++ for TCP is to use special probe and
   notification packets to retrieve inband telemetry information.  The
   sender generates a probe packet when it is actively sending data.
   The probe packet has the same 5-tuple (source and destination
   addresses, source and destination ports, and protocol number) as the
   data packets, plus the inband telemetry header.  The switches along
   the path identify the probe packet by its inband telemetry header
   and insert the inband telemetry.  Once it receives the probe packet
   with inband telemetry, the receiver replies with a response packet
   that piggybacks the inband telemetry back to the sender.  Note that
   both probe and response packets use a special DSCP number so that
   they can bypass TSO and GRO on each side.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Discussion

9.1.  Internet Deployment

   Although the discussion above mainly focuses on the data center
   environment, HPCC++ can be adopted on the Internet at large.  There
   are several security considerations one should be aware of.

   Privacy concerns may arise when the telemetry information is
   conveyed across Autonomous Systems (ASes) and back to end-users.
   The link load information captured in telemetry can potentially
   reveal the provider's network capacity, route utilization,
   scheduling policy, etc.  These are usually considered sensitive data
   by network providers.  Hence, certain actions may be taken to
   anonymize the telemetry data and to convey only a relative ratio for
   rate adaptation across ASes without revealing the actual network
   load.

   Another consideration is the security of the received telemetry
   information.  The rate adaptation mechanism in HPCC++ relies on
   feedback from the network.  As such, it is vulnerable to attacks
   where feedback messages are hijacked, replaced, or intentionally
   injected with misleading information resulting in denial of service,
   similar to those that can affect TCP.  It is therefore RECOMMENDED
   that the notification feedback message be at least integrity-
   checked.  In addition, [I-D.ietf-avtcore-cc-feedback-message]
   discusses the potential risk of a receiver providing misleading
   congestion feedback information and the mechanisms for mitigating
   such risks.

9.2.  Switch-assisted congestion control

   HPCC++ falls in the general category of switch-assisted congestion
   control.  However, HPCC++ includes a few unique design choices that
   are different from other switch-assisted approaches.

   o  First, HPCC++ implements a primal-mode algorithm that requires
      only the ``write-to-packet'' operation from switches, which is
      already supported by telemetry protocols like INT [P4-INT] or
      IOAM [I-D.ietf-ippm-ioam-data].  Please note that this is very
      different from dual-mode algorithms such as XCP
      [Katabi-SIGCOMM2002] and RCP [Dukkipati-RCP], where switches take
      an active role in determining flows' rates.

   o  Second, HPCC++ achieves fast utilization convergence by
      decoupling it from fairness convergence, which is inspired by
      XCP.

   o  Third, HPCC++ enables switch-guided multiplicative increase (MI)
      by defining the ``inflight bytes'' to quantify the link load.
      The inflight bytes indicate both the underload and overload of a
      link precisely, and thus allow a flow to increase or decrease its
      rate multiplicatively and safely.  By contrast, traditional
      approaches that use the queue length or RTT as the feedback
      cannot guide the rate increase and instead have to rely on
      additive increase (AI) with heuristics.  As the link speed
      continues to grow, this becomes increasingly slow in reclaiming
      the unused bandwidth.  Besides, queue-based feedback mechanisms
      are subject to latency inflation.

   o  Last, HPCC++ uses the TX rate instead of the RX rate used by XCP
      and RCP.  As detailed in [SIGCOMM-HPCC], we view the TX rate as
      more precise, because the RX rate and the queue length overlap
      and thus cause oscillation.

9.3.  Work with QoS queuing

   Under the use of QoS (Quality of Service) priority queuing in
   switches, the length of a flow's own queue does not reflect the
   actual queuing time and the exact extent of congestion.  Although
   general approaches for running congestion control with QoS queuing
   are outside the scope of this document, we provide a few hints for
   running HPCC++ in a QoS-friendly manner.  In this case, HPCC++ can
   leverage the packet sojourn time (the egress timestamp minus the
   ingress timestamp) instead of the queue length to quantify the
   packet's actual queuing delay.  In addition, operators typically use
   Deficit Weighted Round Robin (DWRR) instead of strict priority (SP)
   as their QoS scheduling to prevent traffic starvation.  DWRR
   provides a minimum bandwidth guarantee for each queue, so HPCC++ can
   leverage it for precise rate updates to avoid congestion.

9.4.  Path migration

   HPCC++ allows switches and end-hosts to share precise information
   about network utilization, which suggests a framework for path
   selection and rate control at end-hosts.  The framework HPCC++
   enables leverages each switch to report its link load information
   via inband telemetry.  The end-host fetches inband telemetry along
   the traffic routes and makes timely and accurate decisions on path
   selection and traffic admission.

10.  Acknowledgments

   The authors would like to thank RTGWG members for their valuable
   review comments and helpful input to this specification.

11.  Contributors

   The following individuals have contributed to the implementation and
   evaluation of the proposed scheme, and therefore have helped to
   validate and substantially improve this specification: Pedro Y.
   Segura, Roberto P. Cebrian, Robert Southworth and Malek Musleh.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

12.2.  Informative References

   [Dukkipati-RCP]
              Dukkipati, N., "Rate Control Protocol (RCP): Congestion
              control to make flows complete quickly.", Stanford
              University, 2008.

   [I-D.ietf-avtcore-cc-feedback-message]
              Sarker, Z., Perkins, C., Singh, V., and M. A. Ramalho,
              "RTP Control Protocol (RTCP) Feedback for Congestion
              Control", draft-ietf-avtcore-cc-feedback-message-09 (work
              in progress), November 2020.

   [I-D.ietf-ippm-ioam-data]
              "Data Fields for In-situ OAM", March 2020.

   [I-D.ietf-kumar-ippm-ifa]
              "Inband Flow Analyzer", February 2019.

   [Katabi-SIGCOMM2002]
              Katabi, D., Handley, M., and C. Rohrs, "Congestion
              Control for High Bandwidth-Delay Product Networks", ACM
              SIGCOMM, Pittsburgh, Pennsylvania, USA, October 2002.

   [P4-INT]   "In-band Network Telemetry (INT) Dataplane Specification,
              v2.0", February 2020.

   [SIGCOMM-HPCC]
              Li, Y., Miao, R., Liu, H., Zhuang, Y., Fei Feng, F.,
              Tang, L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M.,
              and M. Yu, "HPCC: High Precision Congestion Control", ACM
              SIGCOMM, Beijing, China, August 2019.

   [Zhu-SIGCOMM2015]
              Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
              Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M.
              Zhang, "Congestion Control for Large-Scale RDMA
              Deployments", ACM SIGCOMM, London, United Kingdom, August
              2015.

Authors' Addresses

   Rui Miao
   Alibaba Group
   525 Almanor Ave, 4th Floor
   Sunnyvale, CA 94085
   USA

   Email: miao.rui@alibaba-inc.com

   Hongqiang H. Liu
   Alibaba Group
   108th Ave NE, Suite 800
   Bellevue, WA 98004
   USA

   Email: hongqiang.liu@alibaba-inc.com

   Rong Pan
   Intel, Corp.
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   USA

   Email: rong.pan@intel.com

   Jeongkeun Lee
   Intel, Corp.
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: jk.lee@intel.com

   Changhoon Kim
   Intel Corporation
   4750 Patrick Henry Dr.
   Santa Clara, CA 95054
   USA

   Email: chang.kim@intel.com

   Barak Gafni
   Mellanox Technologies, Inc.
   350 Oakmead Parkway, Suite 100
   Sunnyvale, CA 94085
   USA

   Email: gbarak@mellanox.com

   Yuval Shpigelman
   Mellanox Technologies, Inc.
   Haim Hazaz 3A
   Netanya 4247417
   Israel

   Email: yuvals@nvidia.com

   Jeff Tantsura
   Microsoft Corporation
   One Microsoft Way
   Redmond, Washington 98052-6399
   USA

   Email: jefftantsura@microsoft.com