Network Working Group                                            H. Wang
Internet-Draft                                                  H. Huang
Intended status: Standards Track                                  Huawei
Expires: 8 May 2024                                      5 November 2023


      Application-aware Data Center Network (APDN) Use Cases and
                              Requirements
             draft-wh-rtgwg-application-aware-dc-network-01

Abstract

   Deploying large-scale AI services in data centers poses new
   challenges to traditional technologies such as load balancing and
   congestion control.  In addition, emerging network technologies such
   as in-network computing are gradually being accepted and used in AI
   data centers.  These network-assisted application acceleration
   technologies require that cross-layer interaction information can be
   flexibly exchanged between end-hosts and network nodes.  The
   Application-aware Data Center Network (APDN) adopts the APN
   framework for the application side to provide more application-aware
   information to the data center network, enabling the fast evolution
   of network-application co-design technologies.  This document
   elaborates on APDN use cases and proposes the corresponding
   requirements.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 May 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Requirements Language
   2.  Use Cases and Requirements for the Application-aware Data
       Center Network
     2.1.  Fine-grained packet scheduling for load balancing
     2.2.  In-network computing for distributed machine learning
           training
     2.3.  Refined congestion control that requires feedback of
           accurate congestion information
   3.  Encapsulation
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Acknowledgements
   Contributors
   Authors' Addresses

1.  Introduction

   Distributed training of large AI models has become an important
   workload in large-scale data centers since the emergence of models
   such as AlphaGo and ChatGPT.  To improve the efficiency of large
   model training, large numbers of computing units (for example,
   thousands of GPUs running simultaneously) perform the computation in
   parallel to reduce the job completion time (JCT).  The concurrent
   computing nodes require periodic and bandwidth-intensive
   communications.  This new multi-party communication pattern between
   computing units places higher demands on the throughput, load
   balancing, and congestion handling capabilities of the entire data
   center network.

   Traditional data center technology usually regards the network
   purely as the data transmission carrier for upper-layer
   applications, with the network providing basic connectivity
   services.  In the scenario of large AI model training, however,
   network-assisted technologies (e.g., offloading partial computation
   into the network) are being introduced to improve the efficiency of
   AI jobs through joint optimization of network communication and
   computing applications.  In most existing network-assisted cases,
   network operators customize and implement private protocols within a
   very small scope and cannot achieve general interoperability.
   Emerging data center network technologies need to serve different
   transports and applications, as the scale of AI data centers
   continues to increase and there is a trend to provide cloud services
   for different AI jobs.  The construction of large-scale data centers
   therefore needs to consider not only general interoperability
   between devices but also interoperability between network devices
   and end-host services.

   This document illustrates use cases that require application-aware
   information to be exchanged between network nodes and applications.
   Current ways of conveying such information are limited by the
   extensibility of packet headers: only coarse-grained information can
   be transmitted between the network and the host through a limited
   header space (for example, the one-bit ECN mark [RFC3168] in the IP
   layer).

   The Application-aware Networking (APN) framework
   [I-D.li-apn-framework] defines application-aware information (i.e.,
   the APN attribute), including an APN identification (ID) and/or APN
   parameters (e.g., network performance requirements), that is
   encapsulated at network edge devices and carried in packets
   traversing an APN domain in order to facilitate service
   provisioning, perform fine-granularity traffic steering, and adjust
   network resources.  The APN framework for the application side
   [I-D.li-rtgwg-apn-app-side-framework] defines an extension of the
   APN framework in which the APN resources of an APN domain are
   allocated to applications, which compose and encapsulate the APN
   attribute in packets.  This document explores the APN framework for
   the application side to provide richer interactive information
   between hosts and networks within the data center.  It provides use
   cases and proposes the corresponding requirements for the
   APplication-aware Data center Network (APDN).
1.1.  Terminology

   APDN: APplication-aware Data center Network

   SQN: SeQuence Number

   TOR: Top Of Rack switch

   PFC: Priority-based Flow Control

   NIC: Network Interface Card

   ECMP: Equal-Cost Multi-Path routing

   AI: Artificial Intelligence

   JCT: Job Completion Time

   PS: Parameter Server

   INC: In-Network Computing

   APN: APplication-aware Network

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Use Cases and Requirements for the Application-aware Data Center
    Network

2.1.  Fine-grained packet scheduling for load balancing

   Traditional data centers adopt the per-flow ECMP method to balance
   traffic across multiple paths.  In traditional data centers focused
   on cloud computing, the number of data flows is large owing to the
   diversity of services and the randomness of access, but most flows
   are small and short-lived.  The ECMP method can therefore achieve a
   near-equal distribution of traffic over multiple paths.

   In contrast, the communication pattern during large AI model
   training is different.  The traffic demands larger bandwidth than
   ever: a single data flow between machines can often saturate the
   entire upstream bandwidth of a server's egress NIC (for example, the
   throughput of a single data flow can reach nearly X*100 Gb/s).  When
   per-flow ECMP (e.g., hash-based or round-robin ECMP) is applied, it
   is common for concurrent elephant flows to be placed on the same
   path.  For example, two concurrent 100 Gb/s flows may be distributed
   to the same path, competing for only 100 Gb/s of available
   bandwidth.  In such cases, traffic congestion is evident and greatly
   increases the flow completion time of AI jobs.

   Therefore, it is necessary to implement fine-grained per-packet ECMP
   -- all the packets of the same flow are sprayed over multiple paths
   to achieve balance and avoid congestion.  Due to the differences in
   delay (propagation, switching) between paths, packets of the same
   flow are likely to arrive at the end-host significantly out of
   order, degrading the performance of the upper transport and
   application layers.  To this end, a feasible method is to reorder
   the disordered packets at the egress TOR (top-of-rack switch) when
   per-packet ECMP is applied.  Assuming the multipath transmission
   domain spans from the ingress TOR to the egress TOR, the principle
   of reordering is that, for each TOR-TOR pair, the order in which
   packets leave the last TOR is consistent with the order in which
   they arrived at the first TOR.

   To realize packet reordering at the egress TOR, the order in which
   packets arrive at the ingress TOR must be clearly indicated.
   Looking at existing protocols, sequence number (SQN) information is
   not directly available at the Ethernet and IP layers:

   *  In current implementations, the per-flow/application SQN is
      generally encapsulated in the transport (e.g., TCP, QUIC, RoCEv2)
      or application layer.  If packet reordering depended on that SQN,
      network devices would have to parse a large number of transport/
      application-layer protocols.

   *  The SQN in an upper-layer protocol is allocated per transport/
      application-level flow.  That is, the sequence number space and
      initial value may differ between flows, so such SQNs cannot
      directly express the order in which packets arrive at the ingress
      TOR.  Although the egress TOR could assign a reordering queue to
      each flow and reorder packets using the upper-layer SQN, the
      hardware resource consumption cannot be overlooked.

   *  If the network device directly overwrites the upper-layer SQN
      with a TOR-TOR pairwise SQN, end-to-end transmission reliability
      will no longer work.

   Therefore, given a multipath forwarding domain, specific order
   information needs to be conveyed from the first device to the last
   device with reordering functionality.  The APN framework is explored
   to carry this order information, which, in this case, records the
   sequence number of packets arriving at the ingress TOR (for example,
   each TOR-TOR pair maintains an independent, incrementing SQN); the
   egress TOR reorders the packets according to that information.
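   As an illustration of this mechanism, the following Python sketch
   shows per-TOR-pair SQN stamping at the ingress TOR and in-order
   release at the egress TOR.  The dictionary-based packet
   representation and the "apn" attribute layout are assumptions made
   for illustration only; this document does not define an encoding.

      import itertools
      from collections import defaultdict

      class IngressTor:
          """Stamps each packet with a per-(ingress, egress) TOR SQN."""
          def __init__(self, tor_id):
              self.tor_id = tor_id
              # One independent, incrementing counter per egress TOR.
              self.counters = defaultdict(itertools.count)

          def send(self, packet, egress_tor_id):
              # Hypothetical APN attribute carrying the ingress TOR ID
              # and the TOR-pair SQN ([REQ1-1]).
              packet["apn"] = {
                  "ingress": self.tor_id,
                  "sqn": next(self.counters[egress_tor_id]),
              }
              return packet  # sprayed onto any available path

      class EgressTor:
          """Releases each TOR pair's packets in ingress arrival order."""
          def __init__(self):
              self.expected = defaultdict(int)  # next SQN per ingress TOR
              self.pending = defaultdict(dict)  # out-of-order packets

          def receive(self, packet):
              apn = packet["apn"]
              src, sqn = apn["ingress"], apn["sqn"]
              self.pending[src][sqn] = packet
              in_order = []
              # Deliver the longest in-order prefix; gaps wait buffered.
              while self.expected[src] in self.pending[src]:
                  pkt = self.pending[src].pop(self.expected[src])
                  pkt.pop("apn")  # [REQ1-2]: cleared at the egress device
                  in_order.append(pkt)
                  self.expected[src] += 1
              return in_order

   A deployable implementation would additionally need a timeout or
   gap-skip policy so that a lost packet does not stall its reordering
   queue indefinitely.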
   Requirements:

   *  [REQ1-1] APN SHOULD encapsulate each packet with an SQN, in
      addition to the APN ID, for reordering.  The ingress TOR SHOULD
      assign and record the SQN at a certain granularity in each packet
      according to its arrival order.  The granularity of SQN
      assignment can be TOR-TOR, port-port, or queue-queue.

   *  [REQ1-2] The SQN in APN MUST NOT be modified inside the
      multipathing domain and could be cleared from APN at the egress
      device.

   *  [REQ1-3] APN SHOULD be able to carry the necessary queue
      information (i.e., the sorting queue ID) usable for the fine-
      grained reordering process.  The queue ID SHOULD be at the same
      granularity as the SQN assignment.

2.2.  In-network computing for distributed machine learning training

   Distributed machine learning training commonly applies the AllReduce
   communication mode [mpi-doc] for cross-accelerator data transfer in
   data-parallel and model-parallel scenarios, which execute an
   application in parallel on multiple processors.  The exchange of the
   intermediate results (i.e., gradient data in machine learning) of
   per-processor training occupies the majority of the communication
   process.  Under the Parameter Server (PS) architecture [atp], in
   which a centralized parameter server collects gradient data from
   multiple clients, aggregates it, and sends the aggregated result
   back to each client, incast (many-to-one) congestion is prone to
   occur at the server when multiple clients send large amounts of
   gradient data to the same server simultaneously.

   In-network computing (INC) offloads the processing behavior of the
   server to the switch.  When an on-path network device with both high
   switching capacity and line-rate computing capability (for simple
   arithmetic operations) acts as the parameter server, replacing the
   traditional end-host server for gradient aggregation (the "addition"
   operation), a distributed AI training application can complete
   gradient aggregation on the way.  On one hand, this merges multiple
   data streams into a single stream within the network, eliminating
   incast congestion at the server.  On the other hand, distributed
   computing applications can also benefit from INC because on-switch
   computing (e.g., ASIC) is faster than server-based computing (e.g.,
   CPU).
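   The switch-side aggregation logic just described can be sketched as
   follows.  This is a minimal illustration, assuming a fixed number of
   workers per task and hypothetical packet fields (task_id, chunk_id,
   gradients); it does not reflect any defined INC protocol.

      from collections import defaultdict

      class IncSwitch:
          """Aggregates gradients from n_workers clients per (task,
          chunk) and emits one result packet, turning many-to-one
          traffic into a single stream inside the network."""
          def __init__(self, n_workers):
              self.n_workers = n_workers
              self.partial = {}             # (task, chunk) -> running sum
              self.seen = defaultdict(int)  # (task, chunk) -> count

          def on_packet(self, pkt):
              # Hypothetical APN-carried fields: an INC task identifier
              # [REQ2-1] and the gradient data to aggregate [REQ2-2].
              key = (pkt["task_id"], pkt["chunk_id"])
              grads = pkt["gradients"]
              if key in self.partial:
                  acc = self.partial[key]
                  self.partial[key] = [a + g for a, g in zip(acc, grads)]
              else:
                  self.partial[key] = list(grads)
              self.seen[key] += 1
              if self.seen[key] < self.n_workers:
                  return None  # absorb: aggregation still in progress
              del self.seen[key]
              # Complete result with computation status recorded
              # [REQ2-4]; sent back to every worker on the return path.
              return {"task_id": key[0], "chunk_id": key[1],
                      "gradients": self.partial.pop(key),
                      "complete": True}

   In hardware, the running sums would live in fixed-size switch
   registers rather than an unbounded table, which is why carrying
   explicit task and segment identifiers in the packet matters.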
   [I-D.draft-lou-rtgwg-sinc] argues that, to implement in-network
   computing, network devices need to be aware of the computing tasks
   required by applications and to correctly parse the corresponding
   data units.  For multi-source computing, synchronization signals of
   the different data source streams need to be explicitly indicated as
   well.  Current implementations (e.g., ATP [atp], NetReduce
   [netreduce]) require the switches to parse upper-layer protocols and
   understand application-specific logic dedicated to a certain
   application, because there are as yet no general transport or
   application protocols for INC.  To support various INC applications,
   a switch would have to adapt to all kinds of transport/application
   protocols.  Furthermore, end users may simply encrypt the whole
   payload for security, even though they are willing to expose some
   non-sensitive information in order to benefit from accelerated INC
   operations.  In that case, the switch cannot obtain the information
   necessary for INC operations without decrypting the whole payload.
   The current state of protocols makes it difficult for applications
   and INC operations to interoperate.  Fortunately, APN is able to
   convey the requested INC operations as well as the corresponding
   data segments, with which applications can offload some analysis and
   computation to the network.

   Requirements:

   *  [REQ2-1] APN MUST carry an identifier to distinguish different
      INC tasks.

   *  [REQ2-2] APN MUST support carrying application data of various
      formats and lengths (such as the gradients in this use case) to
      which INC is applied, together with the expected operations.

   *  [REQ2-3] In order to improve the efficiency of INC, APN SHOULD be
      able to carry other application-aware information that can assist
      the computation, while ensuring that the reliability of the end-
      to-end transport is not compromised.

   *  [REQ2-4] APN MUST be able to carry complete INC results and
      record the computation status in the data packets.

2.3.  Refined congestion control that requires feedback of accurate
      congestion information

   A data center exhibits at least the following congestion scenarios:

   *  Multi-accelerator collaborative AI model training commonly adopts
      the AllReduce and All2All communication modes (Section 2.2).
      When multiple clients send large amounts of gradient data to a
      server at the same time, incast congestion is likely to occur on
      the server side.

   *  Different flows may adopt different load balancing methods and
      strategies, which may overload individual links.

   *  Due to random access to services in the data center, traffic
      bursts can still increase queue lengths and cause congestion.

   The industry has proposed different types of congestion control
   algorithms to alleviate traffic congestion on the paths of a data
   center network.  Among them, ECN-based congestion control
   algorithms, such as DCTCP [RFC8257] and DCQCN [dcqcn], are commonly
   used in data centers; they use ECN marks set according to switch
   buffer occupancy to signal congestion.  However, these methods can
   only use a one-bit mark in the packet to indicate congestion (i.e.,
   that a queue size has reached a threshold) and cannot convey richer
   in-situ measurement information within the limited header space.

   Other proposals, such as HPCC++ [I-D.draft-miao-ccwg-hpcc], collect
   congestion information along the path hop by hop through in-band
   telemetry, appending the information of interest to the data
   packets.  However, this greatly increases the length of data packets
   as they traverse hops and consumes more bandwidth.  A trade-off
   method such as AECN [I-D.draft-shi-ccwg-advanced-ecn] can be used to
   collect the most important information representing the congestion
   along the path.  Meanwhile, AECN-like methods apply hop-by-hop
   calculation to avoid carrying redundant information.  For example,
   the queueing delay and the number of congested hops can be
   accumulated as packets traverse the path.

   In this use case, the end-host specifies the scope of the
   information it wants to collect, and the network devices record or
   update the corresponding information in the data packet hop by hop.
   The collected information might be echoed back to the sender via the
   transport protocol.  APN could serve as such an interaction channel
   between hosts and switches to realize customized information
   collection.
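   As a rough illustration of this interaction, the following Python
   sketch models an AECN-style cumulative collection, in which the
   sender declares the measurements of interest and every hop updates
   fixed-size fields in place instead of appending per-hop records.
   The field names and units are assumptions of this sketch, not of
   AECN or APN.

      def sender_mark(packet,
                      wanted=("queue_delay_us", "congested_hops",
                              "min_link_gbps")):
          # [REQ3-1]: the sender expresses which measurements the
          # network should collect for this packet.
          packet["collect"] = {
              m: (float("inf") if m == "min_link_gbps" else 0)
              for m in wanted
          }
          return packet

      def switch_update(packet, queue_delay_us, link_rate_gbps,
                        congested):
          # [REQ3-2]: each hop folds its local state into the same
          # fixed-size fields, so the header does not grow with the
          # path length.
          c = packet.get("collect")
          if c is None:
              return packet
          if "queue_delay_us" in c:
              c["queue_delay_us"] += queue_delay_us  # cumulative delay
          if "congested_hops" in c and congested:
              c["congested_hops"] += 1               # congested hop count
          if "min_link_gbps" in c:
              c["min_link_gbps"] = min(c["min_link_gbps"],
                                       link_rate_gbps)
          return packet

   Because every hop writes into the same fields, the per-packet
   overhead stays constant regardless of the number of hops traversed,
   unlike append-style in-band telemetry.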
   Requirements:

   *  [REQ3-1] The APN framework MUST allow the data sender to express
      which measurements it wants to collect.

   *  [REQ3-2] APN MUST allow network nodes to record or update the
      necessary measurement results, if the nodes decide to do so.  The
      measurements could include the queue lengths of ports, monitored
      link rates, the number of PFC frames, probed RTT and its
      variation, and so on.  APN MAY record the collector of each
      measurement so that information consumers can identify possible
      congestion points.

3.  Encapsulation

   The encapsulation of the application-aware information proposed by
   the APDN use cases in the APN header [I-D.draft-li-apn-header] will
   be defined in a future version of this draft.

4.  Security Considerations

   TBD.

5.  IANA Considerations

   This document has no IANA actions.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

6.2.  Informative References

   [mpi-doc]  "Message-Passing Interface Standard", August 2023.

   [dcqcn]    "Congestion Control for Large-Scale RDMA Deployments",
              n.d.

   [netreduce]
              "NetReduce: RDMA-Compatible In-Network Reduction for
              Distributed DNN Training Acceleration", n.d.

   [atp]      "ATP: In-network Aggregation for Multi-tenant Learning",
              n.d.

   [I-D.li-apn-framework]
              Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C.,
              and G. S. Mishra, "Application-aware Networking (APN)
              Framework", Work in Progress, Internet-Draft,
              draft-li-apn-framework-07, 3 April 2023.

   [I-D.li-rtgwg-apn-app-side-framework]
              Li, Z. and S. Peng, "Extension of Application-aware
              Networking (APN) Framework for Application Side", Work in
              Progress, Internet-Draft,
              draft-li-rtgwg-apn-app-side-framework-00, 22 October
              2023.

   [I-D.draft-lou-rtgwg-sinc]
              Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
              "Signaling In-Network Computing operations (SINC)", Work
              in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
              September 2023.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257,
              DOI 10.17487/RFC8257, October 2017,
              <https://www.rfc-editor.org/info/rfc8257>.
   [I-D.draft-miao-ccwg-hpcc]
              Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
              Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++:
              Enhanced High Precision Congestion Control", Work in
              Progress, Internet-Draft, draft-miao-ccwg-hpcc-00, 5 July
              2023.

   [I-D.draft-shi-ccwg-advanced-ecn]
              Shi, H. and T. Zhou, "Advanced Explicit Congestion
              Notification", Work in Progress, Internet-Draft,
              draft-shi-ccwg-advanced-ecn-00, 10 July 2023.

   [I-D.draft-li-apn-header]
              Li, Z., Peng, S., and S. Zhang, "Application-aware
              Networking (APN) Header", Work in Progress,
              Internet-Draft, draft-li-apn-header-04, 12 April 2023.

Acknowledgements

Contributors

Authors' Addresses

   Haibo Wang
   Huawei

   Email: rainsword.wang@huawei.com


   Hongyi Huang
   Huawei

   Email: hongyi.huang@huawei.com