Internet-Draft CCM October 2023
Lyu, et al. Expires 22 April 2024
Workgroup: RTGWG
Internet-Draft: draft-lyu-rtgwg-coordinated-cm-00
Published: October 2023
Intended Status: Standards Track
Expires: 22 April 2024
Authors: Y. Lyu (Huawei), Y. Zhang (Huawei), M. Liu (Huawei)

Coordinated Congestion Management

Abstract

AI fabrics are bandwidth sensitive. Congestion management, which includes congestion control and load balancing, is the main means of fully utilizing network resources. However, current congestion management mechanisms are not coordinated with each other, which reduces throughput. This document provides a scheme for coordinating different congestion management mechanisms. It describes the design principles and the behaviors of network switches and hosts in the scheme, and gives an example that shows the end-to-end procedure.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 22 April 2024.

1. Introduction

ML/AI has progressed rapidly over the last decade. The compute required by ML/AI models, measured in FLOPs, is constantly increasing. Training such large models therefore requires distributed parallel training in an AI cluster.

Communication in an AI cluster is bandwidth sensitive. Data parallelism and model parallelism, the two acceleration methods used in AI training, both produce an on-off, bursty traffic pattern with a very large traffic volume in each iteration.

Therefore, an AI fabric should provide high effective bandwidth in order to shorten communication time and improve computation efficiency. High effective bandwidth means that link bandwidth is fully utilized to achieve high throughput. Congestion management, which includes congestion control mechanisms and load balancing mechanisms, is the key technology for achieving this.

This document discusses the lack of coordination among current congestion management mechanisms, which leads to throughput issues that are particularly harmful in an AI fabric. It then provides a scheme for coordinating different congestion management mechanisms that can be deployed effectively and widely in AI fabrics.

2. Terminology

3. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. Existing congestion management

Congestion is usually caused by incast traffic and/or imbalanced network load. Incast traffic is traffic sent from multiple source hosts towards the same destination host. Commonly used solutions include congestion control algorithms, which control sending rates, and load balancing algorithms, which adjust the paths that traffic takes.

Currently, congestion control mechanisms and adaptive routing work independently, without coordination, which has a negative impact on system performance. For example, when congestion caused by imbalanced network load occurs on a switch, both DCQCN [DCQCN] and adaptive routing are activated. The switch marks ECN in data packets, which causes a Congestion Notification Packet (CNP) to be sent back to the sender, and the sender reduces the sending rate of the congested flow. Meanwhile, the switch changes the path of the congested flow and steers newly arriving packets onto a lightly loaded path. As a result, the congested flow is forwarded on the lightly loaded path at a low rate, and DCQCN needs some time to recover the sending rate on the new path. This reduces effective bandwidth and seriously impacts computation efficiency in AI training.

As another example, if the congestion is caused by incast traffic, congestion control alone should be enough. Additional adaptive routing adjustments not only fail to mitigate the congestion, but may also cause more packets to arrive out of order.

This shows that current congestion management cannot efficiently handle congestion in an AI fabric. Uncoordinated behaviors reduce the effective network bandwidth that is essential for AI workloads.
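The cost of the first example (a rate cut combined with a path switch) can be illustrated with a toy calculation. The sketch below is not DCQCN: it assumes a single multiplicative rate cut followed by a fixed additive increase per interval, and the line rate, cut factor, and step are hypothetical values chosen only to make the effect visible.

```python
def recovery_cost(line_rate: float, cut_factor: float, step: float):
    """Toy model (an assumption, not DCQCN's actual recovery).

    The sender cuts its rate once to line_rate * cut_factor, then adds
    `step` per interval until it is back at line rate. Returns the number
    of intervals needed and the capacity left unused during recovery,
    expressed in rate-units times intervals.
    """
    rate = line_rate * cut_factor
    intervals = 0
    unused = 0.0
    while rate < line_rate:
        unused += line_rate - rate          # capacity the new path could have carried
        rate = min(line_rate, rate + step)  # additive increase, capped at line rate
        intervals += 1
    return intervals, unused


# Hypothetical numbers: 100 Gbit/s link, rate halved, +5 Gbit/s per interval.
intervals, unused = recovery_cost(100.0, 0.5, 5.0)
print(intervals, unused)  # 10 275.0
```

Under these assumed values the flow spends 10 intervals below line rate even though the new path is lightly loaded, which is the loss of effective bandwidth described above.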

5. Design principle of coordinated congestion management

Coordinated congestion management is designed to coordinate congestion control and adaptive routing. Design principle is shown as below.

6. Coordinated congestion management scheme

The key to coordinated congestion management is to distinguish CC traffic from non-CC traffic so that the network treats them differently when congestion occurs. Here CC stands for congestion control and AR for adaptive routing. When the network recognizes a CC flow, it notifies the source host, and the source host tags the subsequent packets of that flow. The tag tells network switches to apply the CC mechanism to the flow instead of AR. For non-CC traffic, the network switch first performs AR. Only when the AR mechanism cannot find a lightly loaded path to switch to does the traffic become CC traffic, and CC is then run to alleviate the congestion.

Coordinated congestion management requires interaction between network switches and source hosts, and adds a new tag to data packets for the coordination. The following sections explain the details of the scheme.

6.1. Coordination tag

A coordination tag is inserted into data packets. The tag contains a CC indicator and an AR indicator.

  • CC indicator: indicates whether the packet belongs to a flow that needs congestion control, such as an incast flow.

  • AR indicator: indicates the location of the upstream AR point where adaptive routing can be performed. The AR point can be a network switch or a source host. The AR indicator can be an ID, an IP address, or other information that tells a switch how to send a message to the AR point.

The tag can be carried in data packets using an in-band telemetry scheme. A newer method, CSIG [I-D.draft-ravi-ippm-csig], may provide another option.
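As a minimal sketch, the tag can be modeled as a small in-memory structure. The class, field, and method names below are hypothetical, and the sketch deliberately does not define a wire encoding, since this document does not specify one.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CoordinationTag:
    """Hypothetical in-memory form of the coordination tag."""
    cc: bool = False                # CC indicator: flow needs congestion control
    ar_point: Optional[str] = None  # AR indicator: ID or IP address of the upstream
                                    # AR point, or None if there is none

    def upstream_ar_available(self) -> bool:
        """True if some upstream point can perform adaptive routing."""
        return self.ar_point is not None

    def stamped_by(self, point_id: str) -> "CoordinationTag":
        """Return a copy whose AR indicator names `point_id`."""
        return replace(self, ar_point=point_id)


# A source host that can re-route traffic itself names itself as the AR point.
tag = CoordinationTag(cc=False, ar_point="H-1-1")
# A switch with several lightly loaded paths later overwrites the AR indicator.
tag_after_switch = tag.stamped_by("L1-1-1")
```

Keeping the tag immutable is a design choice of this sketch only; it makes each hop's update explicit as a new value.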

6.2. Notification message

There are three types of notification messages.

  • Type 1: congestion control required
    Example: A type 1 message is sent from the switch experiencing incast congestion to the source host of an incast flow, notifying the source host to tag (set the CC indicator in) the packets of the incast flow.

  • Type 2: congestion control released
    Example: When the incast congestion is eliminated, the switch sends a type 2 message to the corresponding source hosts, notifying them to clear the CC indicator in the subsequent packets of the corresponding flow.

  • Type 3: upstream AR required
    Example: If the switch decides that AR should be performed upstream, it sends a type 3 message to the upstream AR point. The upstream AR point can be a one-hop neighbor of the switch or a point multiple hops away.

The notification message includes the source IP, the destination IP, the notification type, and the flow key. The source IP is the IP address of the switch that sends the notification. The destination IP is the IP address of the node that will handle the notification message. The notification type is one of the three types above. The flow key identifies the flow to be handled, for example by its 5-tuple.
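The four fields can be sketched as a simple record. The names below are hypothetical, the flow key is assumed to be a 5-tuple, and no wire format is implied.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class NotificationType(IntEnum):
    CC_REQUIRED = 1           # type 1: congestion control required
    CC_RELEASED = 2           # type 2: congestion control released
    UPSTREAM_AR_REQUIRED = 3  # type 3: upstream AR required


# Assumed flow key: (source IP, destination IP, protocol, source port, destination port)
FlowKey = Tuple[str, str, int, int, int]


@dataclass(frozen=True)
class Notification:
    src_ip: str              # switch that sends the notification
    dst_ip: str              # node that will handle the notification
    ntype: NotificationType
    flow_key: FlowKey


def cc_required(switch_ip: str, flow_key: FlowKey) -> Notification:
    """Build a type 1 notification addressed to the flow's source host."""
    return Notification(switch_ip, flow_key[0], NotificationType.CC_REQUIRED, flow_key)


# Example addresses and ports are illustrative only.
flow = ("10.0.0.1", "10.0.1.9", 17, 49152, 4791)
msg = cc_required("192.0.2.1", flow)
```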

6.3. Behavior of network switches

6.3.1. Identify congestion type

When congestion is detected, the network switch determines whether it is CC congestion or non-CC congestion. CC congestion includes incast congestion and congestion caused by a high-speed port sending traffic to a low-speed port.

If congestion occurs at a switch egress port and the switch is the last-hop switch before the destination host, the congestion is determined to be incast congestion. The flows causing the incast congestion are identified as incast flows.

Other methods of identifying the congestion type may exist. This document does not restrict them.
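The rule above can be sketched as a small classifier. The function and parameter names are hypothetical, and how a switch detects a speed mismatch is left as an input because this document does not describe it.

```python
from enum import Enum


class CongestionType(Enum):
    CC = "cc"          # should be handled by congestion control
    NON_CC = "non-cc"  # adaptive routing is tried first


def classify_congestion(at_egress_port: bool,
                        is_last_hop: bool,
                        speed_mismatch: bool = False) -> CongestionType:
    """Classify detected congestion following the rules in this section."""
    # Egress congestion on the last-hop switch is treated as incast congestion.
    if at_egress_port and is_last_hop:
        return CongestionType.CC
    # A high-speed port feeding a low-speed port is also CC congestion.
    if speed_mismatch:
        return CongestionType.CC
    return CongestionType.NON_CC
```

For instance, egress congestion on a switch that is not the last hop, with no speed mismatch, is classified as non-CC congestion, so adaptive routing is attempted first.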

6.3.2. Notify CC congestion

When the network switch determines that CC congestion has occurred, it generates a type 1 notification message for each identified CC flow and sends it to the source host of that flow. When the CC congestion is eliminated, the switch sends type 2 notification messages to the same source hosts.
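A minimal sketch of this bookkeeping follows. The class is hypothetical; it assumes a 5-tuple flow key whose first element is the source host address, and it suppresses duplicate type 1 messages for a flow that has already been notified, which is an assumption not stated in this document.

```python
from typing import List, Set, Tuple

FlowKey = Tuple[str, str, int, int, int]   # assumed 5-tuple, source IP first
Message = Tuple[str, str, int, FlowKey]    # (src IP, dst IP, type, flow key)


class CcNotifier:
    """Tracks which CC flows a switch has notified (hypothetical helper)."""

    def __init__(self, switch_ip: str) -> None:
        self.switch_ip = switch_ip
        self.notified: Set[FlowKey] = set()

    def on_cc_congestion(self, cc_flows: List[FlowKey]) -> List[Message]:
        """Return one type 1 message per newly identified CC flow."""
        messages = []
        for flow in cc_flows:
            if flow not in self.notified:  # assumption: do not repeat type 1
                self.notified.add(flow)
                messages.append((self.switch_ip, flow[0], 1, flow))
        return messages

    def on_congestion_cleared(self) -> List[Message]:
        """Return one type 2 message per previously notified flow."""
        messages = [(self.switch_ip, flow[0], 2, flow) for flow in sorted(self.notified)]
        self.notified.clear()
        return messages
```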

6.3.3. Notify upstream point to perform AR

When the network switch decides to perform AR but cannot do so locally, and the AR indicator in the data packet shows that AR is available upstream, it sends a type 3 notification message to the upstream point identified by the AR indicator.

6.3.4. Perform congestion control

A network switch performs congestion control in the following cases.

  • The congestion is identified as CC congestion.

  • The congestion is identified as non-CC congestion, but adaptive routing cannot be used because no new path is available for traffic switching, either locally or upstream.

This document does not restrict which CC mechanism is used.
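The two cases reduce to a short predicate, sketched here with hypothetical parameter names.

```python
def should_perform_cc(is_cc_congestion: bool,
                      local_path_available: bool,
                      upstream_ar_available: bool) -> bool:
    """Return True when the switch falls back to congestion control."""
    if is_cc_congestion:
        return True
    # Non-CC congestion: CC is used only if AR is impossible both
    # locally and upstream.
    return not (local_path_available or upstream_ar_available)
```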

6.3.5. Perform adaptive routing

A network switch performs adaptive routing in the following cases.

  • The flow is non-CC traffic. The CC indicator in the data packet determines whether the flow is CC traffic or non-CC traffic.

  • A type 3 notification message is received. A new path is selected for the subsequent packets of the flow according to the flow information in the notification.

To enable upstream AR, the AR indicator in data packets needs to be updated hop by hop. When a data packet arrives at a network switch:

  • If several local lightly loaded paths are available for AR on the switch, the switch updates the AR indicator in the data packet to identify itself, for example with its own ID. The switch then selects an appropriate local path on which to send the data packet. This document does not define the local path selection algorithm; it depends on the routing strategy of the network switch.

  • If only one local lightly loaded path is available for AR, the network switch can only select that path for the traffic. The AR indicator in the data packet is not updated.

  • If there is no local lightly loaded path, the network switch learns whether upstream AR is available by reading the AR indicator in the data packet. If the AR indicator shows that an upstream point can perform AR, the network switch generates a type 3 notification message and sends it directly to that upstream point. Otherwise, the network switch triggers a congestion control mechanism, such as setting ECN in the data packet.
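The three hop-by-hop cases above can be sketched as one function for non-CC packets. All names are hypothetical; picking the first path in the list stands in for the local path selection algorithm, which this document leaves to the switch, and setting an `ecn` flag stands in for whichever CC mechanism is triggered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Packet:
    cc: bool                 # CC indicator
    ar_point: Optional[str]  # AR indicator, None if no upstream AR point
    ecn: bool = False        # set when congestion control is triggered


def handle_arrival(switch_id: str, pkt: Packet,
                   light_paths: List[str]) -> Tuple[str, Optional[str]]:
    """Apply the hop-by-hop AR rules to a non-CC packet.

    Returns (action, detail): ("forward", path),
    ("notify_upstream", ar_point) or ("congestion_control", None).
    """
    if len(light_paths) > 1:
        # Several lightly loaded paths: this switch becomes the AR point.
        pkt.ar_point = switch_id
        return ("forward", light_paths[0])  # placeholder selection policy
    if len(light_paths) == 1:
        # Only one choice: forward on it, leave the AR indicator unchanged.
        return ("forward", light_paths[0])
    if pkt.ar_point is not None:
        # No local option: ask the upstream AR point (type 3 notification).
        return ("notify_upstream", pkt.ar_point)
    # Neither local nor upstream AR is possible: fall back to CC.
    pkt.ecn = True
    return ("congestion_control", None)
```

Leaving the AR indicator unchanged in the single-path case means a downstream switch still sees the nearest upstream point that had a real choice of paths, which is the point a type 3 notification should reach.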

6.4. Behavior of source hosts

When it receives a type 1 notification message, the source host sets the CC indicator in the subsequent packets of the corresponding flow.

When it receives a type 2 notification message, the source host clears the CC indicator in the subsequent packets of the corresponding flow.

When it receives a type 3 notification message, the source host performs AR on the subsequent packets of the corresponding flow.

When it receives congestion control signals for a flow whose CC indicator is set, the source host performs CC on that flow.
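These four rules amount to a small per-flow state machine, sketched below. The class is hypothetical; in particular, incrementing a per-flow path selector is only a placeholder for whatever the host does to perform AR, which this document does not define.

```python
from collections import defaultdict
from typing import Hashable

CC_REQUIRED, CC_RELEASED, UPSTREAM_AR_REQUIRED = 1, 2, 3


class SourceHost:
    """Per-flow coordination state kept by a source host (hypothetical)."""

    def __init__(self) -> None:
        self.cc_flows = set()                # flows whose CC indicator is set
        self.path_choice = defaultdict(int)  # placeholder AR state per flow

    def on_notification(self, ntype: int, flow: Hashable) -> None:
        if ntype == CC_REQUIRED:
            self.cc_flows.add(flow)          # tag subsequent packets
        elif ntype == CC_RELEASED:
            self.cc_flows.discard(flow)      # stop tagging subsequent packets
        elif ntype == UPSTREAM_AR_REQUIRED:
            self.path_choice[flow] += 1      # pick a different path (placeholder)
        else:
            raise ValueError(f"unknown notification type: {ntype}")

    def cc_indicator(self, flow: Hashable) -> bool:
        """Value of the CC indicator for the next packet of `flow`."""
        return flow in self.cc_flows

    def on_congestion_signal(self, flow: Hashable) -> bool:
        """Return True if the host should run CC on `flow`."""
        return flow in self.cc_flows
```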

7. An example of end-to-end procedure

The network topology is shown in Figure 1. It is a 4-layer fat-tree topology with n computing racks and m switching racks. Computing racks contain source hosts, layer 1 switches, and layer 2 switches. Switching racks contain layer 3 and layer 4 switches.

      Switching Rack 1    Switching Rack m
      +---------------+   +---------------+
      |L4-1-1...L4-1-e|   |L4-m-1...L4-m-e|
      |  | \    / |   |   |  | \    / |   |
      |  |  \  /  |   |   |  |  \  /  |   |
      |  |   \/   |   |   |  |   \/   |   |
      |  |   /\   |   |...|  |   /\   |   |
      |  |  /  \  |   |   |  |  /  \  |   |
      |  | /    \ |   |   |  | /    \ |   |
      |L3-1-1...L3-1-d|   |L3-m-1...L3-m-d|
      +--+-----------\    +-/----------+--+
         |            \    /           |
         |             \  /            |
         |  ......      \/     ......  |
         |              /\             |
         |             /  \            |
         |            /    \           |
      +--+-----------/      \----------+---+
      |L2-1-1...L2-1-c|    |L2-n-1...L2-n-c|
      |  | \    / |   |    |  | \    / |   |
      |  |  \  /  |   |    |  |  \  /  |   |
      |  |   \/   |   |    |  |   \/   |   |
      |  |   /\   |   |... |  |   /\   |   |
      |  |  /  \  |   |    |  |  /  \  |   |
      |  | /    \ |   |    |  | /    \ |   |
      |L1-1-1...L1-1-b|    |L1-n-1...L1-n-b|
      |  +        +   |    |  +        +   |
      | H-1-1... H-1-a|    | H-n-1... H-n-a|
      +---------------+    +---------------+
      Computing Rack 1     Computing Rack n

Figure 1: Network Topology

8. Security Considerations

TBD.

9. IANA Considerations

TBD.

10. References

10.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

10.2. Informative References

[I-D.draft-ravi-ippm-csig]
Ravi, A., Dukkipati, N., Mehta, N., and J. Kumar, "Congestion Signaling (CSIG)", Work in Progress, Internet-Draft, draft-ravi-ippm-csig-00, <https://datatracker.ietf.org/doc/html/draft-ravi-ippm-csig-00>.
[DCQCN]
"Congestion Control for Large-Scale RDMA Deployments", <https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf>.
[Timely]
"TIMELY: RTT-based Congestion Control for the Datacenter", <https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p537.pdf>.
[PLB]
"PLB: Congestion Signals are Simple and Effective for Network Load Balancing", <https://dl.acm.org/doi/pdf/10.1145/3544216.3544226>.

Authors' Addresses

Yunping(Lily) Lyu
Huawei
Yuhan Zhang
Huawei
Mengzhu Liu
Huawei