| < draft-pan-tsvwg-hpccplus-01.txt | draft-pan-tsvwg-hpccplus-02.txt > | |||
|---|---|---|---|---|
| Network Working Group R. Miao | Network Working Group R. Miao | |||
| Internet-Draft H. Liu | Internet-Draft H. Liu | |||
| Intended status: Experimental Alibaba Group | Intended status: Experimental Alibaba Group | |||
| Expires: January 30, 2021 R. Pan | Expires: March 15, 2021 R. Pan | |||
| J. Lee | J. Lee | |||
| C. Kim | C. Kim | |||
| Intel Corporation | Intel Corporation | |||
| B. Gafni | B. Gafni | |||
| Y. Shpigelman | Y. Shpigelman | |||
| Mellanox Technologies, Inc. | Mellanox Technologies, Inc. | |||
| July 29, 2020 | September 11, 2020 | |||
| HPCC++: Enhanced High Precision Congestion Control | HPCC++: Enhanced High Precision Congestion Control | |||
| draft-pan-tsvwg-hpccplus-01 | draft-pan-tsvwg-hpccplus-02 | |||
| Abstract | Abstract | |||
| Congestion control (CC) is the key to achieving ultra-low latency, | Congestion control (CC) is the key to achieving ultra-low latency, | |||
| high bandwidth and network stability in high-speed networks. | high bandwidth and network stability in high-speed networks. | |||
| However, the existing high-speed CC schemes have inherent limitations | However, the existing high-speed CC schemes have inherent limitations | |||
| for reaching these goals. | for reaching these goals. | |||
| In this document, we describe HPCC++ (High Precision Congestion | In this document, we describe HPCC++ (High Precision Congestion | |||
| Control), a new high-speed CC mechanism which achieves the three | Control), a new high-speed CC mechanism which achieves the three | |||
| skipping to change at page 2, line 4 ¶ | skipping to change at page 2, line 4 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on January 30, 2021. | This Internet-Draft will expire on March 15, 2021. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2020 IETF Trust and the persons identified as the | Copyright (c) 2020 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 3. System Overview . . . . . . . . . . . . . . . . . . . . . . . 4 | 3. System Overview . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 4. HPCC++ Algorithm . . . . . . . . . . . . . . . . . . . . . . 5 | 4. HPCC++ Algorithm . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 4.1. Notations . . . . . . . . . . . . . . . . . . . . . . . . 5 | 4.1. Notations . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 4.2. Design Functions and Procedures . . . . . . . . . . . . . 6 | 4.2. Design Functions and Procedures . . . . . . . . . . . . . 6 | |||
| 5. Configuration Parameters . . . . . . . . . . . . . . . . . . 8 | 5. Configuration Parameters . . . . . . . . . . . . . . . . . . 8 | |||
| 6. Design Enhancement and Implementation . . . . . . . . . . . . 8 | 6. Design Enhancement and Implementation . . . . . . . . . . . . 8 | |||
| 6.1. HPCC++ Guidelines . . . . . . . . . . . . . . . . . . . . 9 | 6.1. HPCC++ Guidelines . . . . . . . . . . . . . . . . . . . . 9 | |||
| 6.2. Receiver-based HPCC . . . . . . . . . . . . . . . . . . . 9 | 6.2. Receiver-based HPCC . . . . . . . . . . . . . . . . . . . 9 | |||
| 6.3. Switch-side Optimizations . . . . . . . . . . . . . . . . 10 | 7. Reference Implementations . . . . . . . . . . . . . . . . . . 10 | |||
| 7. Reference Implementations . . . . . . . . . . . . . . . . . . 11 | 7.1. Inband telemetry padding at the network elements . . . . 10 | |||
| 7.1. Inband telemetry padding at the network elements . . . . 11 | 7.2. Congestion control at NICs . . . . . . . . . . . . . . . 10 | |||
| 7.2. Congestion control at NICs . . . . . . . . . . . . . . . 11 | ||||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 9. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | 9. Security Considerations . . . . . . . . . . . . . . . . . . . 12 | |||
| 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 13 | 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 11. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 13 | 11. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 | 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 12.1. Normative References . . . . . . . . . . . . . . . . . . 13 | 12.1. Normative References . . . . . . . . . . . . . . . . . . 12 | |||
| 12.2. Informative References . . . . . . . . . . . . . . . . . 13 | 12.2. Informative References . . . . . . . . . . . . . . . . . 13 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 14 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 1. Introduction | 1. Introduction | |||
| The link speed in data center networks has grown from 1Gbps to | The link speed in data center networks has grown from 1Gbps to | |||
| 100Gbps in the past decade, and this growth is continuing. Ultra-low | 100Gbps in the past decade, and this growth is continuing. Ultra-low | |||
| latency and high bandwidth, which are demanded by more and more | latency and high bandwidth, which are demanded by more and more | |||
| applications, are two critical requirements in today's and future | applications, are two critical requirements in today's and future | |||
| high-speed networks. | high-speed networks. | |||
| Given that traditional software-based network stacks in hosts can no | Given that traditional software-based network stacks in hosts can no | |||
| longer sustain the critical latency and bandwidth requirements | longer sustain the critical latency and bandwidth requirements | |||
| [Zhu-SIGCOMM2015], offloading network stacks into hardware is an | [Zhu-SIGCOMM2015], offloading network stacks into hardware is an | |||
| inevitable direction in high-speed networks. Large-scale networks | inevitable direction in high-speed networks. Large-scale networks | |||
| with RDMA (remote direct memory access) over Converged Ethernet | with RDMA (remote direct memory access) often use hardware- | |||
| Version 2 (RoCEv2) often use hardware-offloading solutions. In some | offloading solutions. In some cases, the RDMA networks still face | |||
| cases, the RDMA networks still face fundamental challenges to | fundamental challenges to reconcile low latency, high bandwidth | |||
| reconcile low latency, high bandwidth utilization, and high | utilization, and high stability. | |||
| stability. | ||||
| This document describes a new CC mechanism, HPCC++ (Enhanced High | This document describes a new CC mechanism, HPCC++ (Enhanced High | |||
| Precision Congestion Control), for large-scale, high-speed networks. | Precision Congestion Control), for large-scale, high-speed networks. | |||
| The key idea behind HPCC++ is to leverage the precise link load | The key idea behind HPCC++ is to leverage the precise link load | |||
| information from inband telemetry to compute accurate flow rate | information from inband telemetry to compute accurate flow rate | |||
| updates. Unlike existing approaches that often require a large | updates. Unlike existing approaches that often require a large | |||
| number of iterations to find the proper flow rates, HPCC++ requires | number of iterations to find the proper flow rates, HPCC++ requires | |||
| only one rate update step in most cases. Using precise information | only one rate update step in most cases. Using precise information | |||
| from inband telemetry enables HPCC++ to address the limitations in | from inband telemetry enables HPCC++ to address the limitations in | |||
| current CC schemes. First, HPCC++ senders can quickly ramp up flow | current CC schemes. First, HPCC++ senders can quickly ramp up flow | |||
| skipping to change at page 3, line 43 ¶ | skipping to change at page 3, line 40 ¶ | |||
| and efficiency. | and efficiency. | |||
| The base form of HPCC++ is the original HPCC algorithm and its full | The base form of HPCC++ is the original HPCC algorithm and its full | |||
| description can be found in [SIGCOMM-HPCC]. While the original | description can be found in [SIGCOMM-HPCC]. While the original | |||
| design lays the foundation for inband telemetry based precision | design lays the foundation for inband telemetry based precision | |||
| congestion control, HPCC++ is an enhanced version which takes into | congestion control, HPCC++ is an enhanced version which takes into | |||
| account system constraints and aims to reduce the design overhead and | account system constraints and aims to reduce the design overhead and | |||
| further improves the performance. Section 6 describes these detailed | further improves the performance. Section 6 describes these detailed | |||
| proposed design enhancements and guidelines. | proposed design enhancements and guidelines. | |||
| HPCC++ proposes a new architecture for congestion control in large- | ||||
| scale, high-speed networks. On one hand, HPCC++ leverages inband | ||||
| telemetry for congestion feedback, which offers more precise link | ||||
| load information for congestion avoidance than conventional signals | ||||
| such as ECN or RTT. This draft describes the architecture changes | ||||
| in switches and end-hosts needed to support inband telemetry and | ||||
| demonstrates their efficiency in handling network congestion. On | ||||
| the other hand, HPCC++ is generic enough to support a wide range of | ||||
| transport protocols such as TCP, UDP, iWARP, etc. It requires a | ||||
| window limit and congestion feedback through ACK self-clocking, | ||||
| which naturally conforms to the paradigm of TCP/iWARP design. In | ||||
| addition, HPCC++ introduces a scheme to measure the total inflight | ||||
| bytes for more precise congestion control. To run over UDP, | ||||
| incremental modifications are needed to enforce the window limit | ||||
| and to collect congestion feedback via probing packets. This new | ||||
| architecture works for both datacenter and WAN networks, provided | ||||
| that inband telemetry is supported by the network switches and | ||||
| end-host protocols. | ||||
| 2. Terminology | 2. Terminology | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
| "OPTIONAL" in this document are to be interpreted as described in BCP | "OPTIONAL" in this document are to be interpreted as described in BCP | |||
| 14 [RFC2119] [RFC8174] when, and only when, they appear in all | 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
| capitals, as shown here. | capitals, as shown here. | |||
| 3. System Overview | 3. System Overview | |||
| skipping to change at page 6, line 39 ¶ | skipping to change at page 6, line 39 ¶ | |||
| | N | Maximum number of flows | ... | | | N | Maximum number of flows | ... | | |||
| | W_ai | Additive increase amount | ... | | | W_ai | Additive increase amount | ... | | |||
| +--------------+----------------------------------+----------------+ | +--------------+----------------------------------+----------------+ | |||
| Figure 3: List of algorithm parameters and their default values. | Figure 3: List of algorithm parameters and their default values. | |||
| 4.2. Design Functions and Procedures | 4.2. Design Functions and Procedures | |||
| The HPCC++ algorithm can be outlined as below: | The HPCC++ algorithm can be outlined as below: | |||
| 1: Function MeasureInflight(ack) | 1: Function MeasureInflight(ack) | |||
| 2: u = 0; | 2: u = 0; | |||
| 3: for each link i on the path do | 3: for each link i on the path do | |||
| 4: ack.L[i].txBytes-L[i].txBytes | 4: ack.L[i].txBytes-L[i].txBytes | |||
| txRate = ----------------------------- ; | txRate = ----------------------------- ; | |||
| ack.L[i].ts-L[i].ts | ack.L[i].ts-L[i].ts | |||
| 5: min(ack.L[i].qlen,L[i].qlen) txRate | 5: min(ack.L[i].qlen,L[i].qlen) txRate | |||
| u' = ----------------------------- + ---------- ; | u' = ----------------------------- + ---------- ; | |||
| ack.L[i].B*T ack.L[i].B | ack.L[i].B*T ack.L[i].B | |||
| 6: if u' > u then | 6: if u' > u then | |||
| 7: u = u'; tau = ack.L[i].ts - L[i].ts; | 7: u = u'; tau = ack.L[i].ts - L[i].ts; | |||
| 8: tau = min(tau, T); | 8: tau = min(tau, T); | |||
| 9: U = (1 - tau/T)*U + tau/T*u; | 9: U = (1 - tau/T)*U + tau/T*u; | |||
| 10: return U; | 10: return U; | |||
| 11: Function ComputeWind(U, updateWc) | 11: Function ComputeWind(U, updateWc) | |||
| 12: if U >= eta or incStage >= maxStage then | 12: if U >= eta or incStage >= maxStage then | |||
| 13: Wc | 13: Wc | |||
| W = ----- + W_ai; | W = ----- + W_ai; | |||
| U/eta | U/eta | |||
| 14: if updateWc then | 14: if updateWc then | |||
| 15: incStage = 0; Wc = W ; | 15: incStage = 0; Wc = W ; | |||
| 16: else | 16: else | |||
| 17: W = Wc + W_ai ; | 17: W = Wc + W_ai ; | |||
| 18: if updateWc then | 18: if updateWc then | |||
| 19: incStage++; Wc = W ; | 19: incStage++; Wc = W ; | |||
| 20: return W | 20: return W | |||
| 21: Procedure NewAck(ack) | 21: Procedure NewAck(ack) | |||
| 22: if ack.seq > lastUpdateSeq then | 22: if ack.seq > lastUpdateSeq then | |||
| 23: W = ComputeWind(MeasureInflight(ack), True); | 23: W = ComputeWind(MeasureInflight(ack), True); | |||
| 24: lastUpdateSeq = snd_nxt; | 24: lastUpdateSeq = snd_nxt; | |||
| 25: else | 25: else | |||
| 26: W = ComputeWind(MeasureInflight(ack), False); | 26: W = ComputeWind(MeasureInflight(ack), False); | |||
| 27: R = W/T; L = ack.L; | 27: R = W/T; L = ack.L; | |||
| The above illustrates the overall process of CC at the sender side | The above illustrates the overall process of CC at the sender side | |||
| for a single flow. Each newly received ACK message triggers the | for a single flow. Each newly received ACK message triggers the | |||
| procedure NewACK at Line 21. At Line 22, the variable lastUpdateSeq | procedure NewACK at Line 21. At Line 22, the variable lastUpdateSeq | |||
| is used to remember the first packet sent with a new Wc, and the | is used to remember the first packet sent with a new Wc, and the | |||
| sequence number in the incoming ACK should be larger than | sequence number in the incoming ACK should be larger than | |||
| lastUpdateSeq to trigger a new sync between Wc and W (Line 14-15 and | lastUpdateSeq to trigger a new sync between Wc and W (Line 14-15 and | |||
| 18-19). The sender also remembers the pacing rate and current inband | 18-19). The sender also remembers the pacing rate and current inband | |||
| telemetry information at Line 27. The sender computes a new window | telemetry information at Line 27. The sender computes a new window | |||
| size W at Line 23 or Line 26, depending on whether to update Wc, | size W at Line 23 or Line 26, depending on whether to update Wc, | |||
| skipping to change at page 8, line 11 ¶ | skipping to change at page 8, line 11 ¶ | |||
| room to increase after maxStage rounds of additive increase, or the | room to increase after maxStage rounds of additive increase, or the | |||
| normalized inflight bytes is above eta, it calls Eqn (4) once to | normalized inflight bytes is above eta, it calls Eqn (4) once to | |||
| quickly ramp up or ramp down the window size (Line 12-13). | quickly ramp up or ramp down the window size (Line 12-13). | |||
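The sender-side procedure above can be sketched as runnable Python. The Link fields mirror the per-hop telemetry notations (txBytes, ts, qlen, B) used in the pseudocode; the class layout and the w_init/w_ai values are illustrative, not defaults mandated by this draft.

```python
class Link:
    """Per-hop telemetry snapshot: tx byte counter, timestamp (s),
    queue length (bytes) and link bandwidth B (bytes/s)."""
    def __init__(self, tx_bytes, ts, qlen, B):
        self.tx_bytes, self.ts, self.qlen, self.B = tx_bytes, ts, qlen, B

class HpccSender:
    def __init__(self, w_init, T, eta=0.95, max_stage=5, w_ai=1500):
        self.T = T                    # base RTT (seconds)
        self.eta = eta                # target utilization
        self.max_stage = max_stage    # max additive-increase stages
        self.w_ai = w_ai              # additive increase (bytes), illustrative
        self.Wc = self.W = w_init     # reference and current windows (bytes)
        self.R = w_init / T           # pacing rate (bytes/s)
        self.U = 0.0                  # smoothed normalized inflight bytes
        self.inc_stage = 0
        self.last_update_seq = 0
        self.snd_nxt = 0
        self.L = None                 # prime with the first telemetry snapshot

    def measure_inflight(self, ack_L):
        """Lines 1-10: estimate the normalized inflight bytes U."""
        u, tau = 0.0, self.T
        for new, old in zip(ack_L, self.L):
            tx_rate = (new.tx_bytes - old.tx_bytes) / (new.ts - old.ts)
            u_i = (min(new.qlen, old.qlen) / (new.B * self.T)
                   + tx_rate / new.B)
            if u_i > u:                       # track the bottleneck hop
                u, tau = u_i, new.ts - old.ts
        tau = min(tau, self.T)
        self.U = (1 - tau / self.T) * self.U + (tau / self.T) * u  # EWMA
        return self.U

    def compute_wind(self, U, update_wc):
        """Lines 11-20: multiplicative adjust or additive increase."""
        if U >= self.eta or self.inc_stage >= self.max_stage:
            W = self.Wc / (U / self.eta) + self.w_ai
            if update_wc:
                self.inc_stage, self.Wc = 0, W
        else:
            W = self.Wc + self.w_ai
            if update_wc:
                self.inc_stage += 1
                self.Wc = W
        return W

    def new_ack(self, ack_seq, ack_L):
        """Lines 21-27: per-ACK entry point."""
        if ack_seq > self.last_update_seq:
            self.W = self.compute_wind(self.measure_inflight(ack_L), True)
            self.last_update_seq = self.snd_nxt
        else:
            self.W = self.compute_wind(self.measure_inflight(ack_L), False)
        self.R = self.W / self.T      # derive the pacing rate from W
        self.L = ack_L                # remember current telemetry
```

With a fully utilized 100Gbps bottleneck (U = 1.0) and eta = 0.95, one call to new_ack multiplicatively shrinks the window to 95% of Wc plus the additive step, matching the one-step convergence behavior described earlier.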
| 5. Configuration Parameters | 5. Configuration Parameters | |||
| HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai. | HPCC++ has three easy-to-set parameters: eta, maxStage, and W_ai. | |||
| eta controls a simple tradeoff between utilization and transient | eta controls a simple tradeoff between utilization and transient | |||
| queue length (due to the temporary collision of packets caused by | queue length (due to the temporary collision of packets caused by | |||
| their random arrivals), so we set it to 95% by default, which only | their random arrivals), so we set it to 95% by default, which only | |||
| loses 5% bandwidth but achieves almost zero queue. maxStage controls | loses 5% bandwidth but achieves almost zero queue. maxStage controls | |||
| a simple tradeoff between steady state stability and the speed to | a simple tradeoff between steady state stability and the speed to | |||
| reclaim free bandwidth. We find maxStage = 5 is conservatively large | reclaim free bandwidth. We find maxStage = 5 is conservatively large | |||
| for stability, while the speed of reclaiming free bandwidth is still | for stability, while the speed of reclaiming free bandwidth is still | |||
| much faster than traditional additive increase, especially in high | much faster than traditional additive increase, especially in high | |||
| bandwidth networks. W_ai controls the tradeoff between the maximum | bandwidth networks. W_ai controls the tradeoff between the maximum | |||
| number of concurrent flows on a link that can sustain near-zero | number of concurrent flows on a link that can sustain near-zero | |||
| queues and the speed of convergence to fairness. Note that none of | queues and the speed of convergence to fairness. Note that none of | |||
| the three parameters are reliability-critical. | the three parameters are reliability-critical. | |||
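As a worked example of the W_ai tradeoff, the original HPCC paper [SIGCOMM-HPCC] sizes the additive-increase step so that N concurrent flows together add at most a (1 - eta) fraction of the bandwidth-delay product; the function and numbers below are illustrative, not normative defaults from this draft.

```python
def w_ai_for(bdp_bytes, eta=0.95, n_flows=100):
    """Additive-increase step (bytes) sized so that up to n_flows
    concurrent flows keep total inflight near eta * BDP, following
    the sizing rationale in [SIGCOMM-HPCC]."""
    return bdp_bytes * (1 - eta) / n_flows

# Illustrative: 100 Gbps link with a 12 us base RTT has a BDP of
# 100e9 / 8 * 12e-6 = 150 kB.
bdp = 150_000                  # bytes
step = w_ai_for(bdp)           # ~75 bytes per additive-increase step
```

A larger n_flows yields a smaller step (near-zero queues under heavier concurrency) at the cost of slower convergence to fairness, which is exactly the tradeoff described above.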
| HPCC++'s design brings advantages to short-lived flows, by allowing | HPCC++'s design brings advantages to short-lived flows, by allowing | |||
| skipping to change at page 10, line 15 ¶ | skipping to change at page 10, line 15 ¶ | |||
| 29: if now > (lastUpdateTime + T) then | 29: if now > (lastUpdateTime + T) then | |||
| 30: W = ComputeWind(MeasureInflight(int), True); | 30: W = ComputeWind(MeasureInflight(int), True); | |||
| 31: send_ack(W) | 31: send_ack(W) | |||
| 32: lastUpdateTime = now; | 32: lastUpdateTime = now; | |||
| 33: else | 33: else | |||
| 34: W = ComputeWind(MeasureInflight(int), False); | 34: W = ComputeWind(MeasureInflight(int), False); | |||
| Here, since the receiver does not know the starting sequence number | Here, since the receiver does not know the starting sequence number | |||
| of a burst, it simply records the lastUpdateTime. If time T has | of a burst, it simply records the lastUpdateTime. If time T has | |||
| passed since lastUpdateTime, the algorithm would recalculate Wc as | passed since lastUpdateTime, the algorithm would recalculate Wc as | |||
| in Line 30 and send out the ACK packet, which would include the W | in Line 30 and send out the ACK packet, which would include the W | |||
| information. Otherwise, it would just update the W information | information. Otherwise, it would just update the W information | |||
| locally. This would reduce the amount of feedback traffic that | locally. This would reduce the amount of feedback traffic that | |||
| needs to be sent to the data sender. | needs to be sent to the data sender. | |||
| Note that the receiver can also measure the number of outstanding | Note that the receiver can also measure the number of outstanding | |||
| flows, N, if the last hop is the congestion point and use this | flows, N, if the last hop is the congestion point and use this | |||
| information to dynamically adjust W_ai to achieve better fairness. | information to dynamically adjust W_ai to achieve better fairness. | |||
| The improvement would allow flows to quickly converge to fairness | The improvement would allow flows to quickly converge to fairness | |||
| without causing large swings under heavy load. | without causing large swings under heavy load. | |||
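The timer-gated receiver logic above (Lines 28-34) can be sketched as follows; measure_inflight and compute_wind are assumed to be the same functions as in the sender-based algorithm, and the return values are an illustrative stand-in for the ACK format.

```python
import time

class HpccReceiver:
    """Sketch of the receiver-based variant: Wc is refreshed at most
    once per base RTT T, and W is piggybacked on the ACK only when it
    was refreshed, reducing feedback traffic to the data sender."""
    def __init__(self, algo, T):
        self.algo = algo                       # provides the two functions
        self.T = T                             # base RTT (seconds)
        self.W = None
        self.last_update_time = float("-inf")

    def on_data(self, int_hdr, now=None):
        """int_hdr is the inband telemetry carried by a data packet."""
        now = time.monotonic() if now is None else now
        if now > self.last_update_time + self.T:       # Line 29
            self.W = self.algo.compute_wind(
                self.algo.measure_inflight(int_hdr), True)
            self.last_update_time = now                # Line 32
            return ("ack_with_window", self.W)         # Line 31
        self.W = self.algo.compute_wind(               # Line 34: local only
            self.algo.measure_inflight(int_hdr), False)
        return ("ack", None)
```

Only one ACK per interval T carries the window value; the intervening updates refresh W locally, which is the traffic reduction the paragraph above describes.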
| 6.3. Switch-side Optimizations | ||||
| Switches can potentially generate and send separate packets | ||||
| containing inband telemetry information (aka inband telemetry | ||||
| response packets) directly back to the data senders so that they can | ||||
| slow down as soon as possible. This fast feedback and reaction can | ||||
| further reduce buffer size consumption upon heavy incast. Switches | ||||
| can consider the level of congestion to decide when to trigger direct | ||||
| inband telemetry responses. A simple bloom-filter and timer can be | ||||
| used at switches to avoid sending a burst of inband telemetry | ||||
| responses to the same sender. An inband telemetry response packet | ||||
| must carry the sequence number of the original data packet, so that | ||||
| the sender can correctly correlate the inband telemetry response | ||||
| with the data packet that triggered it. | ||||
| One may optimize the inband telemetry header overhead by implementing | ||||
| a simple subscription-based inband telemetry. The data senders may | ||||
| use a different DSCP codepoint or a flag bit in the inband telemetry | ||||
| instruction header to indicate inband telemetry subscription. (We | ||||
| expect future inband telemetry specs to support such a subscription | ||||
| service.) The senders can selectively subscribe to inband telemetry | ||||
| on a per-packet basis to control the inband telemetry data overhead. | ||||
| While forwarding inband telemetry-subscribed data packets, the | ||||
| switches can monitor the level of congestion and conditionally | ||||
| generate separate inband telemetry responses as described above. The | ||||
| inband telemetry responses can be directly sent back to the senders | ||||
| or to the receivers depending on which version of HPCC++ algorithm | ||||
| (sender-based or receiver-based) is used in the network. | ||||
| 7. Reference Implementations | 7. Reference Implementations | |||
| A prototype of HPCC++ in NICs is implemented to realize the CC | A prototype of HPCC++ in NICs is implemented to realize the CC | |||
| algorithm and switches to realize the inband telemetry feature. | algorithm and switches to realize the inband telemetry feature. | |||
| 7.1. Inband telemetry padding at the network elements | 7.1. Inband telemetry padding at the network elements | |||
| HPCC++ only relies on packets to share information across senders, | HPCC++ only relies on packets to share information across senders, | |||
| receivers, and switches. HPCC++ is open to a variety of inband | receivers, and switches. HPCC++ is open to a variety of inband | |||
| telemetry format standards. Inside a data center, the path length is | telemetry format standards. Inside a data center, the path length is | |||
| often no more than 5 hops. The overhead of the inband telemetry | often no more than 5 hops. The overhead of the inband telemetry | |||
| padding for HPCC++ is considered to be low. | padding for HPCC++ is considered to be low. | |||
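As a rough sanity check of that claim, the padding cost can be estimated from the hop count and the per-hop metadata size. The byte counts below are illustrative placeholders: the real sizes depend on the telemetry format chosen (e.g. [P4-INT], IOAM, IFA).

```python
def telemetry_overhead(hops, per_hop_bytes, header_bytes, payload_bytes):
    """Fraction of wire bytes consumed by inband telemetry padding.
    All sizes are illustrative; actual field sizes are defined by the
    telemetry format in use."""
    pad = header_bytes + hops * per_hop_bytes
    return pad / (payload_bytes + pad)

# Illustrative: 5 hops, 8 bytes of metadata per hop, a 2-byte shim,
# and 1000-byte data packets -> roughly 4% overhead.
ratio = telemetry_overhead(5, 8, 2, 1000)
```

Because datacenter paths rarely exceed 5 hops, the padding stays a small, bounded fraction of each packet, which is why the overhead is considered low.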
| 7.2. Congestion control at NICs | 7.2. Congestion control at NICs | |||
| Figure 4 shows the HPCC++ implementation on a NIC. The NIC provides | Figure 4 shows the HPCC++ implementation on a NIC. The NIC provides | |||
| an HPCC++ module that resides on the data path of the NIC; the | an HPCC++ module that resides on the data path of the NIC; the | |||
| HPCC++ module realizes both sender and receiver roles. | HPCC++ module realizes both sender and receiver roles. | |||
| +------------------------------------------------------------------+ | +------------------------------------------------------------------+ | |||
| | +---------+ window update +-----------+ PktSend +-----------+ | | | +---------+ window update +-----------+ PktSend +-----------+ | | |||
| | | |-------------->| Scheduler |-------> |Tx pipeline|---+-> | | | |-------------->| Scheduler |-------> |Tx pipeline|---+-> | |||
| | | | rate update +-----------+ +-----------+ | | | | | rate update +-----------+ +-----------+ | | |||
| | | HPCC++ | ^ | | | | HPCC++ | ^ | | |||
| | | | inband telemetry| | | | | | inband telemetry| | | |||
| | | module | | | | | | module | | | | |||
| skipping to change at page 12, line 11 ¶ | skipping to change at page 11, line 34 ¶ | |||
| receives inband telemetry response events per flow which are | receives inband telemetry response events per flow which are | |||
| generated from the RX pipeline, adjusts the sending window and rate, | generated from the RX pipeline, adjusts the sending window and rate, | |||
| and updates the scheduler with the rate and window of the flow. | and updates the scheduler with the rate and window of the flow. | |||
| The scheduler contains a pacing mechanism that determines the flow | The scheduler contains a pacing mechanism that determines the flow | |||
| rate based on the value computed by the algorithm. It also maintains | rate based on the value computed by the algorithm. It also maintains | |||
| the current sending window size for active flows. If the pacing | the current sending window size for active flows. If the pacing | |||
| mechanism and the flow's sending window permit, the scheduler issues | mechanism and the flow's sending window permit, the scheduler issues | |||
| a PktSend command for the flow to the TX pipeline. | a PktSend command for the flow to the TX pipeline. | |||
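The dual gate described here (pacing timer plus sending window) can be sketched as follows; the field names and the callback-style hand-off to the TX pipeline are illustrative, not part of the NIC design.

```python
class Scheduler:
    """Sketch of the scheduler's gate: a packet is released to the TX
    pipeline only if the flow's pacing timer has expired AND its
    inflight bytes fit within the sending window W."""
    def __init__(self):
        self.flows = {}   # flow_id -> {"rate", "window", "inflight", "next_send"}

    def update(self, flow_id, rate, window):
        """Rate/window update pushed by the HPCC++ module."""
        f = self.flows.setdefault(flow_id, {"inflight": 0, "next_send": 0.0})
        f["rate"], f["window"] = rate, window

    def try_send(self, flow_id, pkt_bytes, now, pkt_send):
        f = self.flows[flow_id]
        if now >= f["next_send"] and f["inflight"] + pkt_bytes <= f["window"]:
            pkt_send(flow_id)                  # PktSend to the TX pipeline
            f["inflight"] += pkt_bytes
            f["next_send"] = now + pkt_bytes / f["rate"]   # pace at rate R
            return True
        return False
```

The window gate bounds total inflight bytes (HPCC++'s primary control), while the pacing gate spaces packets at rate R = W/T to avoid bursts.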
| The TX pipeline implements RoCEv2 processing. Once it receives the | The TX pipeline implements packet processing. Once it receives the | |||
| PktSend event with flow ID from the scheduler, it generates the | PktSend event with flow ID from the scheduler, it generates the | |||
| corresponding packet and delivers it to the network. If a sent | corresponding packet and delivers it to the network. If a sent | |||
| packet should collect telemetry on its way, the TX pipeline may add | packet should collect telemetry on its way, the TX pipeline may add | |||
| indications/headers that trigger the network elements to add | indications/headers that trigger the network elements to add | |||
| telemetry data according to the inband telemetry protocol in use. | telemetry data according to the inband telemetry protocol in use. | |||
| The telemetry can be collected by the data packet or by dedicated | The telemetry can be collected by the data packet or by dedicated | |||
| probe packets generated in the TX pipeline. | probe packets generated in the TX pipeline. | |||
| The RX pipeline parses the incoming packets from the network and | The RX pipeline parses the incoming packets from the network and | |||
| identifies whether telemetry is embedded in the parsed packet. On | identifies whether telemetry is embedded in the parsed packet. On | |||
| skipping to change at page 13, line 35 ¶ | skipping to change at page 13, line 10 ¶ | |||
| [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | |||
| 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | |||
| May 2017, <https://www.rfc-editor.org/info/rfc8174>. | May 2017, <https://www.rfc-editor.org/info/rfc8174>. | |||
| 12.2. Informative References | 12.2. Informative References | |||
| [I-D.ietf-avtcore-cc-feedback-message] | [I-D.ietf-avtcore-cc-feedback-message] | |||
| Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP | Sarker, Z., Perkins, C., Singh, V., and M. Ramalho, "RTP | |||
| Control Protocol (RTCP) Feedback for Congestion Control", | Control Protocol (RTCP) Feedback for Congestion Control", | |||
| draft-ietf-avtcore-cc-feedback-message-07 (work in | draft-ietf-avtcore-cc-feedback-message-08 (work in | |||
| progress), June 2020. | progress), September 2020. | |||
| [I-D.ietf-ippm-ioam-data] | [I-D.ietf-ippm-ioam-data] | |||
| "Data Fields for In-situ OAM", March 2020, | "Data Fields for In-situ OAM", March 2020, | |||
| <https://tools.ietf.org/html/draft-ietf-ippm-ioam-data- | <https://tools.ietf.org/html/draft-ietf-ippm-ioam-data- | |||
| 09>. | 09>. | |||
| [I-D.ietf-kumar-ippm-ifa] | [I-D.ietf-kumar-ippm-ifa] | |||
| "Inband Flow Analyzer", February 2019, | "Inband Flow Analyzer", February 2019, | |||
| <https://tools.ietf.org/html/draft-kumar-ippm-ifa-01>. | <https://tools.ietf.org/html/draft-kumar-ippm-ifa-01>. | |||
| [P4-INT] "In-band Network Telemetry (INT) Dataplane Specification, | [P4-INT] "In-band Network Telemetry (INT) Dataplane Specification, | |||
| v2.0", February 2020, <https://github.com/p4lang/p4- | v2.0", February 2020, <https://github.com/p4lang/p4- | |||
| applications/blob/master/docs/INT_v2_0.pdf>. | applications/blob/master/docs/INT_v2_0.pdf>. | |||
| [SIGCOMM-HPCC] | [SIGCOMM-HPCC] | |||
| Li, Y., Miao, R., Liu, H., Zhuang, Y., Fei Feng, F., Tang, | Li, Y., Miao, R., Liu, H., Zhuang, Y., Fei Feng, F., Tang, | |||
| L., Cao, Z., and M. Zhang, "HPCC: High Precision | L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and M. | |||
| Congestion Control", ACM SIGCOMM Beijing, China, August | Yu, "HPCC: High Precision Congestion Control", ACM | |||
| 2019. | SIGCOMM Beijing, China, August 2019. | |||
| [Zhu-SIGCOMM2015] | [Zhu-SIGCOMM2015] | |||
| Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., | Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., | |||
| Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M. | Liron, Y., Padhye, J., Raindel, S., Yahia, M., and M. | |||
| Zhang, "Congestion Control for Large-Scale RDMA | Zhang, "Congestion Control for Large-Scale RDMA | |||
| Deployments", ACM SIGCOMM London, United Kingdom, August | Deployments", ACM SIGCOMM London, United Kingdom, August | |||
| 2015. | 2015. | |||
| Authors' Addresses | Authors' Addresses | |||
| End of changes. 20 change blocks. | ||||
| 91 lines changed or deleted | 79 lines changed or added | |||
This html diff was produced by rfcdiff 1.48.