Network Working Group                                            H. Wang
Internet-Draft                                                  H. Huang
Intended status: Standards Track                                  Huawei
Expires: 8 May 2024                                      5 November 2023


      Application-aware Data Center Network (APDN) Use Cases and
                              Requirements
             draft-wh-rtgwg-application-aware-dc-network-01

Abstract

   Deploying large-scale AI services in data centers poses new
   challenges to traditional technologies such as load balancing and
   congestion control.  In addition, emerging network technologies such
   as in-network computing are gradually being accepted and used in AI
   data centers.  These network-assisted application acceleration
   technologies require that cross-layer interaction information can be
   flexibly exchanged between end-hosts and network nodes.  The
   Application-aware Data Center Network (APDN) adopts the APN
   framework for the application side to provide more application-aware
   information to the data center network, enabling the fast evolution
   of network-application co-design technologies.  This document
   elaborates on APDN use cases and proposes the corresponding
   requirements.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 May 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Requirements Language
   2.  Use Cases and Requirements for the Application-aware Data
       Center Network
     2.1.  Fine-grained packet scheduling for load balancing
     2.2.  In-network computing for distributed machine learning
           training
     2.3.  Refined congestion control that requires feedback of
           accurate congestion information
   3.  Encapsulation
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Acknowledgements
   Contributors
   Authors' Addresses

1.  Introduction

   Distributed training of large AI models has become an important
   workload in large-scale data centers since the emergence of models
   such as AlphaGo and ChatGPT.  To improve the efficiency of large
   model training, large numbers of computing units (for example,
   thousands of GPUs running simultaneously) perform the computation in
   parallel to reduce the job completion time (JCT).  The concurrent
   computing nodes require periodic and bandwidth-intensive
   communications.  This new multi-party communication pattern between
   computing units places higher demands on the throughput, load
   balancing, and congestion handling capabilities of the entire data
   center network.

   Traditional data center technology usually regards the network
   purely as the data transmission carrier for upper-layer
   applications, with the network providing basic connectivity
   services.  In the scenario of large AI model training, however,
   network-assisted technologies (e.g., offloading partial computation
   into the network) are being introduced to improve the efficiency of
   AI jobs through joint optimization of network communication and
   computing applications.  In most existing network-assisted cases,
   network operators customize and implement private protocols within a
   very small scope and cannot achieve general interoperability.
   Emerging data center network technologies need to serve different
   transports and applications, as the scale of AI data centers
   continues to increase and there is a trend to provide cloud services
   for different AI jobs.  The construction of large-scale data centers
   therefore needs to consider not only general interoperability
   between devices but also interoperability between network devices
   and end-host services.

   This document illustrates use cases that require application-aware
   information to be exchanged between network nodes and applications.
   Current ways of conveying such information are limited by the
   extensibility of packet headers: only coarse-grained information can
   be transmitted between the network and the host through a limited
   header space (for example, the one-bit ECN mark [RFC3168] in the IP
   layer).

   The Application-aware Networking (APN) framework
   [I-D.li-apn-framework] defines application-aware information (i.e.,
   the APN attribute), including an APN identification (ID) and/or APN
   parameters (e.g., network performance requirements), that is
   encapsulated at network edge devices and carried in packets
   traversing an APN domain in order to facilitate service
   provisioning, perform fine-granularity traffic steering, and adjust
   network resources.  The APN framework for the application side
   [I-D.li-rtgwg-apn-app-side-framework] defines an extension of the
   APN framework in which the APN resources of an APN domain are
   allocated to applications, which compose and encapsulate the APN
   attribute in packets.  This document explores the APN framework for
   the application side to provide richer interactive information
   between hosts and networks within the data center.  It provides use
   cases and proposes the corresponding requirements for the
   APplication-aware Data center Network (APDN).
1.1.  Terminology

   APDN: APplication-aware Data center Network

   SQN: SeQuence Number

   TOR: Top Of Rack switch

   PFC: Priority-based Flow Control

   NIC: Network Interface Card

   ECMP: Equal-Cost Multi-Path routing

   AI: Artificial Intelligence

   JCT: Job Completion Time

   PS: Parameter Server

   INC: In-Network Computing

   APN: APplication-aware Network

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Use Cases and Requirements for the Application-aware Data Center
    Network

2.1.  Fine-grained packet scheduling for load balancing

   Traditional data centers adopt the per-flow ECMP method to balance
   traffic across multiple paths.  In traditional data centers focused
   on cloud computing, the number of data flows is large owing to the
   diversity of services and the randomness of access, but most flows
   are small and short-lived.  The ECMP method can therefore achieve a
   near-equal distribution of traffic over multiple paths.

   In contrast, the communication pattern during large AI model
   training is different.  The traffic demands larger bandwidth than
   ever: a single data flow between machines can often saturate the
   entire upstream bandwidth of a server's egress NIC (for example, the
   throughput of a single data flow can reach nearly X*100 Gb/s).  When
   per-flow ECMP (e.g., hash-based or round-robin ECMP) is applied, it
   is common for concurrent elephant flows to be placed on the same
   path.  For example, two concurrent 100 Gb/s flows may be distributed
   to the same path, competing for only 100 Gb/s of available
   bandwidth.  In such cases, traffic congestion is evident and greatly
   increases the flow completion time of AI jobs.

   Therefore, it is necessary to implement fine-grained per-packet ECMP
   -- all the packets of the same flow are sprayed over multiple paths
   to achieve balance and avoid congestion.  Due to the differences in
   delay (propagation, switching) between paths, packets of the same
   flow are likely to arrive at the end-host significantly out of
   order, degrading the performance of the upper transport and
   application layers.  To this end, a feasible method is to reorder
   the disordered packets at the egress TOR (top-of-rack switch) when
   per-packet ECMP is applied.  Assuming the multipath transmission
   domain spans from the ingress TOR to the egress TOR, the principle
   of reordering is that, for each TOR-TOR pair, the order in which
   packets leave the last TOR is consistent with the order in which
   they arrived at the first TOR.

   To realize packet reordering at the egress TOR, the order in which
   packets arrive at the ingress TOR must be clearly indicated.
   Looking at existing protocols, sequence number (SQN) information is
   not directly available at the Ethernet and IP layers:

   *  In current implementations, the per-flow/application SQN is
      generally encapsulated in the transport (e.g., TCP, QUIC, RoCEv2)
      or application layer.  If packet reordering depended on that SQN,
      network devices would have to parse a large number of transport/
      application-layer protocols.

   *  The SQN in an upper-layer protocol is allocated per transport/
      application-level flow.  That is, the sequence number space and
      initial value may differ between flows, so such SQNs cannot
      directly express the order in which packets arrive at the ingress
      TOR.  Although the egress TOR could assign a reordering queue to
      each flow and reorder packets using the upper-layer SQN, the
      hardware resource consumption cannot be overlooked.

   *  If the network device directly overwrites the upper-layer SQN
      with a TOR-TOR pairwise SQN, end-to-end transmission reliability
      will no longer work.

   Therefore, given a multipath forwarding domain, specific order
   information needs to be conveyed from the first device to the last
   device with reordering functionality.  The APN framework is explored
   to carry this order information, which, in this case, records the
   sequence number of packets arriving at the ingress TOR (for example,
   each TOR-TOR pair maintains an independent, incrementing SQN); the
   egress TOR reorders the packets according to that information.
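   As an illustration of this mechanism, the following Python sketch
   shows per-TOR-pair SQN stamping at the ingress TOR and in-order
   release at the egress TOR.  The dictionary-based packet
   representation and the "apn" attribute layout are assumptions made
   for illustration only; this document does not define an encoding.

      import itertools
      from collections import defaultdict

      class IngressTor:
          """Stamps each packet with a per-(ingress, egress) TOR SQN."""
          def __init__(self, tor_id):
              self.tor_id = tor_id
              # One independent, incrementing counter per egress TOR.
              self.counters = defaultdict(itertools.count)

          def send(self, packet, egress_tor_id):
              # Hypothetical APN attribute carrying the ingress TOR ID
              # and the TOR-pair SQN ([REQ1-1]).
              packet["apn"] = {
                  "ingress": self.tor_id,
                  "sqn": next(self.counters[egress_tor_id]),
              }
              return packet  # sprayed onto any available path

      class EgressTor:
          """Releases each TOR pair's packets in ingress arrival order."""
          def __init__(self):
              self.expected = defaultdict(int)  # next SQN per ingress TOR
              self.pending = defaultdict(dict)  # out-of-order packets

          def receive(self, packet):
              apn = packet["apn"]
              src, sqn = apn["ingress"], apn["sqn"]
              self.pending[src][sqn] = packet
              in_order = []
              # Deliver the longest in-order prefix; gaps wait buffered.
              while self.expected[src] in self.pending[src]:
                  pkt = self.pending[src].pop(self.expected[src])
                  pkt.pop("apn")  # [REQ1-2]: cleared at the egress device
                  in_order.append(pkt)
                  self.expected[src] += 1
              return in_order

   A deployable implementation would additionally need a timeout or
   gap-skip policy so that a lost packet does not stall its reordering
   queue indefinitely.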
   Requirements:

   *  [REQ1-1] APN SHOULD encapsulate each packet with an SQN, in
      addition to the APN ID, for reordering.  The ingress TOR SHOULD
      assign and record the SQN at a certain granularity in each packet
      according to its arrival order.  The granularity of SQN
      assignment can be TOR-TOR, port-port, or queue-queue.

   *  [REQ1-2] The SQN in APN MUST NOT be modified inside the
      multipathing domain and could be cleared from APN at the egress
      device.

   *  [REQ1-3] APN SHOULD be able to carry the necessary queue
      information (i.e., the sorting queue ID) usable for the fine-
      grained reordering process.  The queue ID SHOULD be at the same
      granularity as the SQN assignment.

2.2.  In-network computing for distributed machine learning training

   Distributed machine learning training commonly applies the AllReduce
   communication mode [mpi-doc] for cross-accelerator data transfer in
   data-parallel and model-parallel scenarios, which execute an
   application in parallel on multiple processors.  The exchange of the
   intermediate results (i.e., gradient data in machine learning) of
   per-processor training occupies the majority of the communication
   process.  Under the Parameter Server (PS) architecture [atp], in
   which a centralized parameter server collects gradient data from
   multiple clients, aggregates it, and sends the aggregated result
   back to each client, incast (many-to-one) congestion is prone to
   occur at the server when multiple clients send large amounts of
   gradient data to the same server simultaneously.

   In-network computing (INC) offloads the processing behavior of the
   server to the switch.  When an on-path network device with both high
   switching capacity and line-rate computing capability (for simple
   arithmetic operations) acts as the parameter server, replacing the
   traditional end-host server for gradient aggregation (the "addition"
   operation), a distributed AI training application can complete
   gradient aggregation on the way.  On one hand, this merges multiple
   data streams into a single stream within the network, eliminating
   incast congestion at the server.  On the other hand, distributed
   computing applications can also benefit from INC because on-switch
   computing (e.g., ASIC) is faster than server-based computing (e.g.,
   CPU).
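   The switch-side aggregation logic just described can be sketched as
   follows.  This is a minimal illustration, assuming a fixed number of
   workers per task and hypothetical packet fields (task_id, chunk_id,
   gradients); it does not reflect any defined INC protocol.

      from collections import defaultdict

      class IncSwitch:
          """Aggregates gradients from n_workers clients per (task,
          chunk) and emits one result packet, turning many-to-one
          traffic into a single stream inside the network."""
          def __init__(self, n_workers):
              self.n_workers = n_workers
              self.partial = {}             # (task, chunk) -> running sum
              self.seen = defaultdict(int)  # (task, chunk) -> count

          def on_packet(self, pkt):
              # Hypothetical APN-carried fields: an INC task identifier
              # [REQ2-1] and the gradient data to aggregate [REQ2-2].
              key = (pkt["task_id"], pkt["chunk_id"])
              grads = pkt["gradients"]
              if key in self.partial:
                  acc = self.partial[key]
                  self.partial[key] = [a + g for a, g in zip(acc, grads)]
              else:
                  self.partial[key] = list(grads)
              self.seen[key] += 1
              if self.seen[key] < self.n_workers:
                  return None  # absorb: aggregation still in progress
              del self.seen[key]
              # Complete result with computation status recorded
              # [REQ2-4]; sent back to every worker on the return path.
              return {"task_id": key[0], "chunk_id": key[1],
                      "gradients": self.partial.pop(key),
                      "complete": True}

   In hardware, the running sums would live in fixed-size switch
   registers rather than an unbounded table, which is why carrying
   explicit task and segment identifiers in the packet matters.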
   [I-D.draft-lou-rtgwg-sinc] argues that, to implement in-network
   computing, network devices need to be aware of the computing tasks
   required by applications and to correctly parse the corresponding
   data units.  For multi-source computing, synchronization signals of
   the different data source streams need to be explicitly indicated as
   well.  Current implementations (e.g., ATP [atp], NetReduce
   [netreduce]) require the switches to parse upper-layer protocols and
   understand application-specific logic dedicated to a certain
   application, because there are as yet no general transport or
   application protocols for INC.  To support various INC applications,
   a switch would have to adapt to all kinds of transport/application
   protocols.  Furthermore, end users may simply encrypt the whole
   payload for security, even though they are willing to expose some
   non-sensitive information in order to benefit from accelerated INC
   operations.  In that case, the switch cannot obtain the information
   necessary for INC operations without decrypting the whole payload.
   The current state of protocols makes it difficult for applications
   and INC operations to interoperate.  Fortunately, APN is able to
   convey the requested INC operations as well as the corresponding
   data segments, with which applications can offload some analysis and
   computation to the network.

   Requirements:

   *  [REQ2-1] APN MUST carry an identifier to distinguish different
      INC tasks.

   *  [REQ2-2] APN MUST support carrying application data of various
      formats and lengths (such as the gradients in this use case) to
      which INC is applied, together with the expected operations.

   *  [REQ2-3] In order to improve the efficiency of INC, APN SHOULD be
      able to carry other application-aware information that can assist
      the computation, while ensuring that the reliability of the end-
      to-end transport is not compromised.

   *  [REQ2-4] APN MUST be able to carry complete INC results and
      record the computation status in the data packets.

2.3.  Refined congestion control that requires feedback of accurate
      congestion information

   A data center exhibits at least the following congestion scenarios:

   *  Multi-accelerator collaborative AI model training commonly adopts
      the AllReduce and All2All communication modes (Section 2.2).
      When multiple clients send large amounts of gradient data to a
      server at the same time, incast congestion is likely to occur on
      the server side.

   *  Different flows may adopt different load balancing methods and
      strategies, which may overload individual links.

   *  Due to random access to services in the data center, traffic
      bursts can still increase queue lengths and cause congestion.

   The industry has proposed different types of congestion control
   algorithms to alleviate traffic congestion on the paths of a data
   center network.  Among them, ECN-based congestion control
   algorithms, such as DCTCP [RFC8257] and DCQCN [dcqcn], are commonly
   used in data centers; they use ECN marks set according to switch
   buffer occupancy to signal congestion.  However, these methods can
   only use a one-bit mark in the packet to indicate congestion (i.e.,
   that a queue size has reached a threshold) and cannot convey richer
   in-situ measurement information within the limited header space.

   Other proposals, such as HPCC++ [I-D.draft-miao-ccwg-hpcc], collect
   congestion information along the path hop by hop through in-band
   telemetry, appending the information of interest to the data
   packets.  However, this greatly increases the length of data packets
   as they traverse hops and consumes more bandwidth.  A trade-off
   method such as AECN [I-D.draft-shi-ccwg-advanced-ecn] can be used to
   collect the most important information representing the congestion
   along the path.  Meanwhile, AECN-like methods apply hop-by-hop
   calculation to avoid carrying redundant information.  For example,
   the queueing delay and the number of congested hops can be
   accumulated as packets traverse the path.

   In this use case, the end-host specifies the scope of the
   information it wants to collect, and the network devices record or
   update the corresponding information in the data packet hop by hop.
   The collected information might be echoed back to the sender via the
   transport protocol.  APN could serve as such an interaction channel
   between hosts and switches to realize customized information
   collection.
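   As a rough illustration of this interaction, the following Python
   sketch models an AECN-style cumulative collection, in which the
   sender declares the measurements of interest and every hop updates
   fixed-size fields in place instead of appending per-hop records.
   The field names and units are assumptions of this sketch, not of
   AECN or APN.

      def sender_mark(packet,
                      wanted=("queue_delay_us", "congested_hops",
                              "min_link_gbps")):
          # [REQ3-1]: the sender expresses which measurements the
          # network should collect for this packet.
          packet["collect"] = {
              m: (float("inf") if m == "min_link_gbps" else 0)
              for m in wanted
          }
          return packet

      def switch_update(packet, queue_delay_us, link_rate_gbps,
                        congested):
          # [REQ3-2]: each hop folds its local state into the same
          # fixed-size fields, so the header does not grow with the
          # path length.
          c = packet.get("collect")
          if c is None:
              return packet
          if "queue_delay_us" in c:
              c["queue_delay_us"] += queue_delay_us  # cumulative delay
          if "congested_hops" in c and congested:
              c["congested_hops"] += 1               # congested hop count
          if "min_link_gbps" in c:
              c["min_link_gbps"] = min(c["min_link_gbps"],
                                       link_rate_gbps)
          return packet

   Because every hop writes into the same fields, the per-packet
   overhead stays constant regardless of the number of hops traversed,
   unlike append-style in-band telemetry.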
   Requirements:

   *  [REQ3-1] The APN framework MUST allow the data sender to express
      which measurements it wants to collect.

   *  [REQ3-2] APN MUST allow network nodes to record or update the
      necessary measurement results, if the nodes decide to do so.  The
      measurements could include the queue lengths of ports, monitored
      link rates, the number of PFC frames, probed RTT and its
      variation, and so on.  APN MAY record the collector of each
      measurement so that information consumers can identify possible
      congestion points.

3.  Encapsulation

   The encapsulation of the application-aware information proposed by
   the APDN use cases in the APN header [I-D.draft-li-apn-header] will
   be defined in a future version of this draft.

4.  Security Considerations

   TBD.

5.  IANA Considerations

   This document has no IANA actions.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

6.2.  Informative References

   [mpi-doc]  "Message-Passing Interface Standard", August 2023.

   [dcqcn]    "Congestion Control for Large-Scale RDMA Deployments",
              n.d.

   [netreduce]
              "NetReduce: RDMA-Compatible In-Network Reduction for
              Distributed DNN Training Acceleration", n.d.

   [atp]      "ATP: In-network Aggregation for Multi-tenant Learning",
              n.d.

   [I-D.li-apn-framework]
              Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C.,
              and G. S. Mishra, "Application-aware Networking (APN)
              Framework", Work in Progress, Internet-Draft,
              draft-li-apn-framework-07, 3 April 2023.

   [I-D.li-rtgwg-apn-app-side-framework]
              Li, Z. and S. Peng, "Extension of Application-aware
              Networking (APN) Framework for Application Side", Work in
              Progress, Internet-Draft,
              draft-li-rtgwg-apn-app-side-framework-00, 22 October
              2023.

   [I-D.draft-lou-rtgwg-sinc]
              Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
              "Signaling In-Network Computing operations (SINC)", Work
              in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
              September 2023.

   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
              Control for Data Centers", RFC 8257,
              DOI 10.17487/RFC8257, October 2017,
              <https://www.rfc-editor.org/info/rfc8257>.
   [I-D.draft-miao-ccwg-hpcc]
              Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
              Shpigelman, Y., Tantsura, J., and G. Caspary, "HPCC++:
              Enhanced High Precision Congestion Control", Work in
              Progress, Internet-Draft, draft-miao-ccwg-hpcc-00, 5 July
              2023.

   [I-D.draft-shi-ccwg-advanced-ecn]
              Shi, H. and T. Zhou, "Advanced Explicit Congestion
              Notification", Work in Progress, Internet-Draft,
              draft-shi-ccwg-advanced-ecn-00, 10 July 2023.

   [I-D.draft-li-apn-header]
              Li, Z., Peng, S., and S. Zhang, "Application-aware
              Networking (APN) Header", Work in Progress,
              Internet-Draft, draft-li-apn-header-04, 12 April 2023.

Acknowledgements

Contributors

Authors' Addresses

   Haibo Wang
   Huawei

   Email: rainsword.wang@huawei.com


   Hongyi Huang
   Huawei

   Email: hongyi.huang@huawei.com