Application-Layer Traffic Optimization                            K. Yao
Internet-Draft                                                     Z. Li
Intended status: Informational                              China Mobile
Expires: 14 September 2023                                        T. Pan
                                                                   Y. Zou
                     Beijing University of Posts and Telecommunications
                                                           13 March 2023


          A Load-aware core level load balancing framework
            draft-yao-alto-core-level-load-balancing-00

Abstract

   Most existing literature on load balancing in data center networks
   (DCN) focuses on balancing traffic across the links between servers,
   while relatively little attention has been given to balancing
   traffic at the core level within individual servers.  In this draft,
   we present a load balancing framework for DCN that is aware of
   core-level load, in order to address this issue.  Specifically, our
   approach reports the real-time load of server CPUs to an L4 load
   balancer, which then selects a core with lower load, based on this
   load information, to which the data packets are delivered.
   Theoretically, our approach can avoid single-core overload entirely,
   making the system more stable and enabling higher CPU utilization
   without overprovisioning.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 14 September 2023.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions Used in This Document
     2.1.  Terminology
     2.2.  Requirements Language
   3.  Framework Overview
   4.  Server side design
   5.  LB side design
   6.  Security Considerations
   7.  IANA Considerations
   8.  Normative References
   Authors' Addresses

1.  Introduction

   Current load balancing strategies in data centers primarily focus on
   link balancing, with the goal of distributing traffic evenly across
   parallel paths to improve network link utilization and prevent
   network congestion.  Methods such as ECMP [RFC2991] and WCMP [WCMP]
   use hashing to distribute traffic to different paths.
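   As a point of reference for how such hash-based schemes spread
   flows, the sketch below hashes a flow's 5-tuple and uses the result
   to index one of the parallel paths.  It only illustrates the general
   idea; the field choices, hash function, and path count are
   assumptions and are not taken from any particular ECMP or WCMP
   implementation.

      # Illustrative 5-tuple hashing for ECMP-style path selection.
      import hashlib

      def select_path(src_ip, dst_ip, proto, src_port, dst_port,
                      num_paths):
          # Hash the flow identity so that all packets of the same
          # flow map to the same path index.
          key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
          digest = hashlib.sha1(key).digest()
          return int.from_bytes(digest[:4], "big") % num_paths

      # Example: select_path("10.0.0.1", "10.0.1.2", 6, 12345, 80, 4)
      # always returns the same index in [0, 4) for this flow.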
   However, these methods do not consider core-level load balancing on
   the servers, which can lead to actual load imbalances in data center
   networks where heavy-hitter flows coexist with low-rate flows.  Some
   existing works balance load between servers based on traffic
   estimates, but this has two issues.  First, the load estimation may
   be biased, causing traffic to be assigned to heavily loaded servers.
   Second, these methods lack granularity at the CPU core level, which
   may result in single-core overload within servers.

   In this draft, we address these issues by transmitting the CPU load
   to the load balancer, which obtains global load information and then
   allocates traffic to servers with core-level targets.

   Our solution faces the following challenges.  On the server side,
   the first challenge is to smooth the rapidly changing load on each
   core.  The second challenge is to insert the load information into
   data packets and transmit it to the load balancer over the shortest
   path.  Finally, reaching the load balancer may require multi-hop
   routing, so keeping the load information timely enough for real-time
   load balancing becomes a critical issue.  On the load balancer side,
   the first challenge is to allocate processing cores to flows based
   on the load information.  The second challenge is to accurately
   deliver the marked service flows to the designated cores.  Finally,
   the load balancer needs to maintain flow consistency while
   minimizing extreme single-core pressure.

2.  Conventions Used in This Document

2.1.  Terminology

   CID  Core Identifier

   CPU  Central Processing Unit

   LB   Load Balancer

   NIC  Network Interface Card

2.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Framework Overview

   We present the overall framework of our design, as shown in
   Figure 1.  Each server sends its internal core load information to
   the load balancer; we use smoothed CPU utilization as the measure of
   load.  The server encapsulates the load into a load-aware packet and
   passes it to the load balancer over a layer-2 or layer-3 path.  The
   load balancer can be an x86-based server running a virtual machine
   or a switch based on a programmable ASIC.  The load balancer parses
   the load information carried in the load-aware packet and maintains
   the k least-loaded server-CPU pairs in a minimum-heap data
   structure.  New connections are assigned evenly across these k
   server-CPU pairs by the load balancer.  To ensure flow consistency,
   we use a Redirect table in the load balancer to record the state of
   flows: data packets that hit a table entry are forwarded directly to
   the corresponding CPU.
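   The following is a minimal sketch of this selection logic, ahead of
   the overall framework shown in Figure 1.  The class and method
   names, the rebuild-on-update step, and the round-robin iterator are
   illustrative assumptions rather than a normative algorithm.

      # Sketch: minimum heap over per-core load plus a Redirect table
      # keyed by flow ID, as described in Sections 3 and 5.
      import heapq
      import itertools

      class CoreLevelLB:
          def __init__(self, k=4):
              self.k = k
              self.core_load = {}       # (DIP, CID) -> smoothed load
              self.redirect_table = {}  # flow ID -> (DIP, CID)
              self._rr = None           # round-robin over k least-loaded cores

          def update_load(self, dip, cid, load):
              # Called whenever a load-aware packet from (dip, cid) is parsed.
              self.core_load[(dip, cid)] = load
              k_least = heapq.nsmallest(self.k, self.core_load.items(),
                                        key=lambda kv: kv[1])
              self._rr = itertools.cycle([pair for pair, _ in k_least])

          def dispatch(self, flow_id):
              # Redirect table hit: keep flow consistency.
              if flow_id in self.redirect_table:
                  return self.redirect_table[flow_id]
              if self._rr is None:
                  raise RuntimeError("no load reports received yet")
              # Miss: round-robin the new flow onto one of the k
              # least-loaded cores and record the choice.
              target = next(self._rr)
              self.redirect_table[flow_id] = target
              return target

   For example, after load reports such as update_load("10.0.0.2", 3,
   0.21) have been processed, dispatch(flow_id) returns the same
   (DIP, CID) pair for every subsequent packet of that flow.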
         +----------+-----+-----+         +--------------------+
  Flow   | Flow ID  | DIP | CID |  miss   | Minimum heap with  |
-------->+----------+-----+-----+ ------->|k least-loaded cores|
         |          |     |     |         |                    |
         +----------+-----+-----+         +----------+---------+
             Redirect table                          | Select core
                    | hit                            | for IP flow
        data packet v                                v
          +------------------------------------------------+
          |         L4 Load-balancer (Tofino/x86)          |
          +-+--^----------+--^----------+--^----------+--^-+
data packet |  |          |  |          |  |          |  | Load-aware
            |  |          |  |          |  |          |  | packet
            |  |          |  |          |  |          |  |
         +--v--+--+    +--v--+--+    +--v--+--+    +--v--+--+
         | Rack1  |    | Rack2  |    | Rack3  |    | Rack4  |
         | +----+ |    | +----+ |    | +----+ |    | +----+ |
         | +----+ |    | +----+ |    | +----+ |    | +----+ |
         |        |    |        |    |        |    |        |
         | +----+ |    | +----+ |    | +----+ |    | +----+ |
         | +----+ |    | +----+ |    | +----+ |    | +----+ |
         |        |    |        |    |        |    |        |
         +--------+    +--------+    +--------+    +--------+

             Figure 1: Load-aware core-level LB Framework

4.  Server side design

   To smooth the historical load of each core on the server side, we
   apply exponential smoothing to the CPU utilization obtained at each
   sampling:

      Load_n = alpha * Load_get + (1 - alpha) * Load_n-1

   where Load_get is the load value obtained in the current sampling,
   Load_n-1 is the result of the previous calculation, and alpha is the
   smoothing parameter.  With the above formula, we obtain a smoothed
   load value Load_n that represents the current CPU load.

   When congestion occurs in the network, the core load information
   carried by the packets may become stale.  Therefore, we record in
   each packet the timestamp at which it is sent from the server.  When
   the packet arrives at the load balancer, the load balancer
   calculates its transmission delay.  If the transmission delay is too
   large, the load balancer will not select that server as a target for
   traffic delivery (assuming there is only one path from the LB to the
   server), because the path is already congested.  In that case, no
   traffic is assigned to that server, regardless of whether its cores
   are overloaded.

   When designing the packet structure that carries load information,
   it is not necessary to record the load of all cores; only the load
   of the n least-loaded cores needs to be reported.  We have designed
   two solutions for different scenarios.  First, when there is a fixed
   path between the load balancer and the server, source routing can be
   used to send the packets to the load balancer.  The message
   structure is designed as follows: after the Ethernet frame header,
   we add the SR fields, and at each switch on the path one SR entry is
   popped from the stack until the packet finally arrives at the load
   balancer.  The packet also contains the source IP, CPU ID, CPU load,
   and timestamp.
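   Below is a minimal sketch of the server-side agent described in this
   section: it smooths the per-core utilization and serializes a load
   notification whose layout matches Figure 2, which follows.  The
   alpha value, the EtherType, the mapping of utilization to an 8-bit
   load value, and the millisecond timestamp are illustrative
   assumptions.

      # Sketch of server-side smoothing and load-notification packing.
      import struct
      import time

      ALPHA = 0.3                      # smoothing parameter (assumed)
      LOAD_NOTIFY_ETHERTYPE = 0x88B5   # assumed experimental EtherType

      def smooth(load_get, load_prev, alpha=ALPHA):
          # Load_n = alpha * Load_get + (1 - alpha) * Load_n-1
          return alpha * load_get + (1 - alpha) * load_prev

      def build_load_packet(dst_mac, src_mac, sr_hops, src_ip, cpu_id,
                            cpu_load):
          # Fields follow Figure 2: 48-bit MACs, 16-bit Type, four
          # 8-bit SR entries, 32-bit source IP, 8-bit CPU ID, 8-bit CPU
          # load, and a 48-bit timestamp (here: milliseconds, assumed).
          assert len(sr_hops) == 4
          ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
          return struct.pack("!6s6sH4B4sBB6s",
                             dst_mac, src_mac, LOAD_NOTIFY_ETHERTYPE,
                             *sr_hops, src_ip,
                             cpu_id, int(cpu_load * 255),
                             ts_ms.to_bytes(6, "big"))

   The load balancer can compare the carried timestamp with its own
   clock to estimate the transmission delay and ignore reports whose
   delay exceeds a configured threshold, as described above.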
   +-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Destination Mac                         |
   +-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Source Mac                            |
   +-+~ 16 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+-+~ 8 bits~-+
   |     Type      |    SR1     |    SR2     |    SR3    |    SR4    |
   +-+-+-+-+~ 32 bits~-+-+-+-+-+-+-+-+~ 8 bits~-+-+-+~ 8 bits~-+-+-+-+
   |            Source IP             |    CPU ID    |   CPU Load    |
   +-+-+-+-+-+-+-+-+-+-+-+-+~ 48 bits~-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Timestamp                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 2: Message format

   Second, when multiple hops over non-fixed routes are required
   between the load balancer and the server, the core ID and core load
   can be carried in an IPv6 packet that is routed to the load
   balancer.  We mark the packet as a load notification packet in the
   IPv6 Traffic Class field, fill the 8-bit CPU ID and the 8-bit core
   load into the Flow Label field, and then route it to the load
   balancer.

5.  LB side design

   The load balancer needs to parse the load notification packets from
   the servers, extract the load of each core in each server, and
   maintain an internal minimum heap structure.  The load information
   obtained within a given period is organized through the heap to
   obtain the k cores with the lowest load.  Until the next minimum
   heap is generated, new flows are allocated to these k least-loaded
   cores in round-robin fashion.

   Ordinary L4 load balancers map packets destined for a service with a
   virtual IP address (VIP) to a pool of servers with multiple direct
   IP addresses (DIP or DIP pool).  In our solution, the load balancer
   needs to be accurate at the core level.  Therefore, we specify the
   DIP and the delivery core directly for the first packet of a flow
   and record the flow identifier (Flow ID), direct IP address (DIP),
   and core identifier (CID) in a table to ensure flow consistency.  In
   extreme situations, such as when a single large flow puts too much
   pressure on a core, we can reduce the CPU load by applying back
   pressure to that flow.

   We need to mark the packets to specify the target core in the
   destination server.  For the IP address, we can overwrite the
   original VIP with the DIP.  For the core ID, in order not to
   overwrite the original information of the user packet, we construct
   a new packet header and insert it into the user packet, which is
   then transmitted to the server NIC.  The NIC parses the destination
   core ID, removes the added header, and delivers the restored user
   packet to the designated core.

6.  Security Considerations

   TBD.

7.  IANA Considerations

   TBD.

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2991]  Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
              Multicast Next-Hop Selection", RFC 2991,
              DOI 10.17487/RFC2991, November 2000.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [WCMP]     Zhou, J., Tewari, M., Zhu, M., Kabbani, A., Poutievski,
              L., Singh, A., and A. Vahdat, "WCMP: Weighted cost
              multipathing for improved fairness in data centers",
              April 2014.
Authors' Addresses

   Kehan Yao
   China Mobile
   Beijing
   100053
   China

   Email: yaokehan@chinamobile.com


   Zhiqiang Li
   China Mobile
   Beijing
   100053
   China

   Email: lizhiqiangyjy@chinamobile.com


   Tian Pan
   Beijing University of Posts and Telecommunications
   Beijing
   100876
   China

   Email: pan@bupt.edu.cn


   Yan Zou
   Beijing University of Posts and Telecommunications
   Beijing
   100876
   China

   Email: zouyan@bupt.edu.cn