IDR Working Group                                             P. Huo
Internet-Draft                                               G. Chen
Intended status: Standards Track                           ByteDance
Expires: February 23, 2025                                    C. Lin
                                                New H3C Technologies
                                               Syed Hasan Raza Naqvi
                                                            Broadcom
                                                  Yossi Kikozashvili
                                                           DriveNets
                                                               C. Qi
                                                           ByteDance
                                                     August 23, 2024

                BGP Extension for Tunnel Egress Point
             draft-hcl-idr-extend-tunnel-egress-point-02

Abstract

   In AI networks, traffic typically consists of a small number of
   flows, each with high bandwidth, so traditional flow-level load
   balancing easily causes network congestion. Current traffic
   scheduling approaches instead balance load across the individual
   packets of a flow, which requires the packets to be re-sorted based
   on Tunnel Egress Point information from the remote end. This
   document describes a method of advertising the Tunnel Egress Point
   through BGP.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

hcl, et al.            Expires February 23, 2025               [Page 1]
Internet-Draft  BGP Extension for Tunnel Egress Point       August 2024

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on February 23, 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Motivation
   3. Terminology
   4. Solution
   5. Protocol Extension
      5.1. Implementation based on different types of networks
      5.2. Encoding TEP(Tunnel Egress Point)
      5.3. Encoding Tunnel Egress Point in IPv4/IPv6 address family
      5.4. Encoding Tunnel Egress Point in EVPN family
   6. Procedure
      6.1. Procedure for IPv4/IPv6
      6.2. Procedure for EVPN
   7. Deployment consideration
   8. IANA Considerations
   9. Security Considerations
   10. References
      10.1. Normative References
      10.2.
      Informative References
   Acknowledgments
   Contributors
   Authors' Addresses

1. Introduction

   With the widespread application of AI technology, AI computing
   centers have developed rapidly, and potential issues within AI
   networks have drawn increasing attention. AI traffic consists of a
   small number of flows with substantial bandwidth per flow, so
   traditional flow-level load balancing is highly likely to hash
   multiple flows onto the same link, congesting some links while
   others remain idle. This leads to low network utilization and an
   inability to handle sudden surges in network traffic. A new load
   balancing and scheduling model is therefore needed.

   Current scheduling approaches in AI networks balance load across the
   individual packets of each flow, "spraying" a single flow across all
   available paths to make better use of the existing bandwidth.
   However, balancing the packets of a flow individually can reorder
   the packets of that flow. The egress point therefore needs to convey
   the egress characteristics of the traffic to the ingress point, so
   that packets can be sorted based on those characteristics and the
   packets of a flow are delivered in the proper sequence.

   This document describes a method of conveying the egress
   characteristics of routes as route attributes through BGP to inform
   the ingress device.

1.1.
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2. Motivation

   As shown in Figure 1, Leaf devices connect downward to host devices
   and upward to Spine devices. When hosts communicate with each other,
   multiple ECMP paths are available for forwarding packets toward the
   OSF-Egress. For example, traffic from H1 to H8 can take the path
   H1 -> Leaf1 -> Spine1 -> Leaf4 -> H8 or the path
   H1 -> Leaf1 -> Spine2 -> Leaf4 -> H8. In traditional load balancing,
   the traffic is hashed and all packets of the same flow are forwarded
   along the same path. In AI networks, where there are few flows but
   each flow carries high bandwidth, this strategy can lead to network
   congestion.

   To adapt to the characteristics of AI networks, when load balancing
   with ECMP, multiple small data packets can be combined into a larger
   packet for transmission, and large packets can be divided into
   relatively smaller packets for transmission. The resulting packets
   are then evenly distributed over the ECMP paths to fully utilize the
   bandwidth of each path. However, this may result in packet
   reordering, so the packets must be put back in order at their
   destination.

   During sorting, all packets destined for the same end-point need to
   be sorted together. For example, two data packets from H1 to H8 are
   sorted based on the destination (Leaf4 + H8) to ensure that the
   packets arrive at H8 in the correct order. Therefore, end-point
   information must be synchronized from OSF-Egress to OSF-Ingress
   through the control plane.
   When sending packets, OSF-Ingress numbers the packets per end-point
   and selects different paths over which to "spray" them. The
   intermediate OSF-Forwarder devices forward packets toward the final
   destination device based on the end-point, without regard to packet
   order. Finally, OSF-Egress reorders the packets that carry the same
   end-point according to their numbers and forwards them to the hosts.

   This document primarily describes how the control plane delivers
   end-point information.

              +-----------+                 +-----------+
              |  Spine1   |                 |  Spine2   |
              +-----------+                 +-----------+
                   (each Leaf connects to every Spine)
       +---------+    +---------+    +---------+    +---------+
       |  Leaf1  |    |  Leaf2  |    |  Leaf3  |    |  Leaf4  |
       +--+---+--+    +--+---+--+    +--+---+--+    +--+---+--+
          |   |          |   |          |   |          |   |
          H1  H2         H3  H4         H5  H6         H7  H8

                         Figure 1: AI network

3. Terminology

   The following terms are used in this document.

   TEP: Tunnel Egress Point. Defined in this document.

   OSF: Open Scheduled Fabric. [draft-hcl-rtgwg-osf-framework-00]

   OSF-Ingress: OSF Ingress router. [draft-hcl-rtgwg-osf-framework-00]

   OSF-Egress: OSF Egress router. [draft-hcl-rtgwg-osf-framework-00]

   OSF-Forwarder: OSF Forwarder router.
   [draft-hcl-rtgwg-osf-framework-00]

4. Solution

   As shown in Figure 2, in the Spine/Leaf network, each Leaf device
   includes the Tunnel Egress Point information corresponding to a
   route prefix when it advertises that prefix. When the ingress Leaf
   device receives the route, it extracts the Tunnel Egress Point
   information and passes it to the forwarding layer. The specific
   usage of the Tunnel Egress Point by the forwarding layer is beyond
   the scope of this document.
   +---------------------------------------------------------------+
   |          +-----------+                 +-----------+          |
   |          |  Spine1   |                 |  Spine2   |          |
   |          +-----------+                 +-----------+          |
   |              (each Spine connects to every Leaf)              |
   |   +---------+    +---------+    +---------+    +---------+    |
   |   |  Leaf1  |    |  Leaf2  |    |  Leaf3  |    |  Leaf4  |    |
   |   +----+----+    +----+----+    +----+----+    +----+----+    |
   |        | TEP1         | TEP2         | TEP3         | TEP4    |
   +--------+--------------+--------------+--------------+---------+
            |              |              |              |
            P1             P2             P3             P4

                     Figure 2: Spine/Leaf network

   The forwarding paths for traffic are illustrated in Figure 3. For
   the same traffic from Leaf1 to Leaf2, there are two possible paths:
   Spine1->Leaf2 and Spine2->Leaf2. Different paths for the same
   traffic have the same Tunnel Egress Point information.

             +---------+
      Leaf1: |P2       |----+ Spine1---Leaf2 TEP2
             +---------+    + Spine2---Leaf2 TEP2

          Figure 3: Illustration of Multiple Forwarding Paths

   In addition to the path information, Leaf1 can also prepare the
   required encapsulation information in advance, so that Leaf2 can
   forward packets directly without a secondary table lookup.

             +---------+
      Leaf1: |P2       |----+ Spine1---Leaf2 TEP2,Encap1
             +---------+    + Spine2---Leaf2 TEP2,Encap1

   The specific synchronization process is as follows:

   1) When Leaf2 advertises routing information, it carries the TEP2
      information.

   2) Leaf2 also advertises the encapsulation information Encap1
      needed to reach P2.

   3) When Leaf1 forwards packets, it specifies the forwarding path
      and the destination information TEP2.
      At the same time, based on the destination address P2, it
      specifies the final encapsulation information Encap1 to be used
      for delivery.

   4) The intermediate device independently determines the path to
      TEP2 and forwards the packet to TEP2.

   5) The device that owns TEP2 (Leaf2), as the last-hop router,
      encapsulates the packet directly according to Encap1 and
      delivers it to P2.

5. Protocol Extension

   This section introduces the method of extending BGP to carry the
   Tunnel Egress Point information within the Tunnel Encapsulation
   attribute [RFC9012].

   The Tunnel Egress Point information includes the Device Index and
   the Port Index. The Device Index is globally unique and is used to
   distinguish different Leaf devices, while the Port Index is unique
   within the local device and is used to differentiate between the
   interfaces on that device.

5.1. Implementation based on different types of networks

   The network shown in Figure 1 can be a regular Layer 3 IP network
   or a Layer 2 network based on EVPN.

   In a regular Layer 3 IP network, network routes and the host routes
   corresponding to next-hop addresses are advertised using the
   IPv4/IPv6 address families. When network routes are advertised, the
   extended TEP attribute information is carried. When host routes
   corresponding to next-hop addresses are advertised, the extended
   TEP attribute information for the host address and the
   encapsulation information are carried.

   In an EVPN network, segment route information is advertised through
   EVPN Type-5 routes, carrying the TEP attribute information for the
   segment route. The TEP attribute information for the next-hop route
   and the encapsulation information for the next-hop address are
   advertised through EVPN Type-2 routes.

5.2. Encoding TEP(Tunnel Egress Point)

   Add a new type, TEP type, to "BGP Tunnel Encapsulation Attribute
   Tunnel Types."
   The Tunnel Encapsulation attribute is an optional transitive BGP
   path attribute.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Tunnel Type (2 octets)    |       Length (2 octets)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Value (variable)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Tunnel Egress Point attribute

   Tunnel Type: TBD, 2 octets. Identifies a type of tunnel. The field
   contains values from the IANA registry "BGP Tunnel Encapsulation
   Attribute Tunnel Types" [IANA-BGP-TUNNEL-ENCAP] [RFC9012].

   Length: 2 octets. The length of the Value field.

   Currently, two types of TEP are defined: one that carries a single
   SystemPort attribute, and another that carries multiple SystemPort
   attributes. When the destination address is a unicast address, the
   corresponding destination is a single node, and the TEP carries a
   single SystemPort attribute. When the destination address is a
   broadcast address, the corresponding destination is a group of
   nodes, and the TEP carries multiple SystemPort attributes.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   TEP Type    |Length(1 octet)|     SystemPort (2 octets)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Single Port Index attribute

   TEP Type 1: TBD1, Single Port Index.

   Length: The length of one SystemPort.

   SystemPort: The System ID and Port ID.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |TEP Type(1 oct)|       Length (2 octets)       | SysLen(1 oct) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      SysPort1 (2 octets)      |      SysPort2 (2 octets)      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      SysPort3 (2 octets)      |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Multiple Port Index attribute

   TEP Type 2: TBD2, Multiple Port Index.

   Length: The total length of the SystemPort fields.

   SysLen: The length of one SystemPort.

   SystemPort: The System ID and Port ID.

   Additionally, a TEP type carrying an encapsulation ID is defined to
   enable direct packet encapsulation on the egress device based on
   the encapsulation ID.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   TEP Type    |Length(1 octet)|      Encap ID (2 octets)      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Encapsulation ID attribute

   TEP Type 3: TBD3, Encapsulation ID.

   Length: The length of the Encap ID.

   Encap ID: 2 octets.

5.3. Encoding Tunnel Egress Point in IPv4/IPv6 address family

   In a regular IP network, IPv4/IPv6 routes are advertised using the
   IPv4/IPv6 address family, and the TEP and EncapID attributes are
   advertised as path attributes of those routes.

5.4.
Encoding Tunnel Egress Point in EVPN family

   When advertising Type-2 routes (MAC/IP Advertisement) as defined in
   [RFC7432], which announce the mapping between host MAC and IP
   addresses, TEP information is carried to announce the corresponding
   TEP and EncapID information.

            +---------------------------------------+
            |  RD (8 octets)                        |
            +---------------------------------------+
            |Ethernet Segment Identifier (10 octets)|
            +---------------------------------------+
            |  Ethernet Tag ID (4 octets)           |
            +---------------------------------------+
            |  MAC Address Length (1 octet)         |
            +---------------------------------------+
            |  MAC Address (6 octets)               |
            +---------------------------------------+
            |  IP Address Length (1 octet)          |
            +---------------------------------------+
            |  IP Address (0, 4, or 16 octets)      |
            +---------------------------------------+
            |  MPLS Label1 (3 octets)               |
            +---------------------------------------+
            |  MPLS Label2 (0 or 3 octets)          |
            +---------------------------------------+
            |  TEP attribute (0 or variable octets) |
            +---------------------------------------+
            |  Encap ID attribute (0 or 4 octets)   |
            +---------------------------------------+

   When advertising Type-5 routes (IP Prefix) as defined in [RFC9136],
   which announce IP prefix information, TEP information is carried to
   announce the corresponding TEP and EncapID information for the
   prefix.
            +---------------------------------------+
            |  RD (8 octets)                        |
            +---------------------------------------+
            |Ethernet Segment Identifier (10 octets)|
            +---------------------------------------+
            |  Ethernet Tag ID (4 octets)           |
            +---------------------------------------+
            |  IP Prefix Length (1 octet, 0 to 32)  |
            +---------------------------------------+
            |  IP Prefix (4 octets)                 |
            +---------------------------------------+
            |  GW IP Address (4 octets)             |
            +---------------------------------------+
            |  MPLS Label (3 octets)                |
            +---------------------------------------+
            |  TEP attribute (0 or variable octets) |
            +---------------------------------------+
            |  Encap ID attribute (0 or 4 octets)   |
            +---------------------------------------+

                  EVPN IP Prefix Route NLRI for IPv4

            +---------------------------------------+
            |  RD (8 octets)                        |
            +---------------------------------------+
            |Ethernet Segment Identifier (10 octets)|
            +---------------------------------------+
            |  Ethernet Tag ID (4 octets)           |
            +---------------------------------------+
            |  IP Prefix Length (1 octet, 0 to 128) |
            +---------------------------------------+
            |  IP Prefix (16 octets)                |
            +---------------------------------------+
            |  GW IP Address (16 octets)            |
            +---------------------------------------+
            |  MPLS Label (3 octets)                |
            +---------------------------------------+
            |  TEP attribute (0 or variable octets) |
            +---------------------------------------+
            |  Encap ID attribute (0 or 4 octets)   |
            +---------------------------------------+

                  EVPN IP Prefix Route NLRI for IPv6

   In an EVPN network, routes are advertised through the EVPN address
   families, and the SystemPort and EncapID attributes are defined as
   a new Prefix Descriptor TLV type in EVPN, carried within the Link
   State information.
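   The TEP encodings laid out in Section 5.2 can be sketched in code as
   follows. This is a minimal illustration, not a normative encoder:
   the numeric code points are placeholders for the values this
   document leaves as TBD, TBD1, TBD2, and TBD3, the function names
   are hypothetical, SystemPort is treated as an opaque 16-bit value
   because the split between System ID and Port ID is not specified
   here, and the multiple-port Length is assumed to count only the
   SysPort octets.

```python
import struct

# Placeholder code points (assumptions): the draft lists these as TBD.
TUNNEL_TYPE_TEP = 0xFFFF   # "TBD" Tunnel Type
TEP_SINGLE_PORT = 1        # "TBD1"
TEP_MULTI_PORT = 2         # "TBD2"
TEP_ENCAP_ID = 3           # "TBD3"
SYSPORT_LEN = 2            # one SystemPort is 2 octets in the diagrams


def encode_single_port(system_port):
    """TEP Type (1) | Length (1) | SystemPort (2)."""
    return struct.pack("!BBH", TEP_SINGLE_PORT, SYSPORT_LEN, system_port)


def encode_multi_port(system_ports):
    """TEP Type (1) | Length (2) | SysLen (1) | SysPort1 | SysPort2 | ..."""
    ports = list(system_ports)
    total = SYSPORT_LEN * len(ports)  # assumed: SysPort octets only
    header = struct.pack("!BHB", TEP_MULTI_PORT, total, SYSPORT_LEN)
    return header + struct.pack("!%dH" % len(ports), *ports)


def encode_encap_id(encap_id):
    """TEP Type (1) | Length (1) | Encap ID (2)."""
    return struct.pack("!BBH", TEP_ENCAP_ID, 2, encap_id)


def wrap_tunnel_tlv(value):
    """Tunnel Type (2) | Length (2) | Value (variable)."""
    return struct.pack("!HH", TUNNEL_TYPE_TEP, len(value)) + value


def decode_tep(data):
    """Return (tep_type, values) for one encoded TEP item."""
    if len(data) < 4:
        raise ValueError("truncated TEP item")
    tep_type = data[0]
    if tep_type in (TEP_SINGLE_PORT, TEP_ENCAP_ID):
        _, length, value = struct.unpack("!BBH", data[:4])
        if length != 2:
            raise ValueError("unexpected length")
        return tep_type, [value]
    if tep_type == TEP_MULTI_PORT:
        _, total, sys_len = struct.unpack("!BHB", data[:4])
        if sys_len != SYSPORT_LEN or total % sys_len or len(data) < 4 + total:
            raise ValueError("malformed multiple-port item")
        count = total // sys_len
        ports = struct.unpack("!%dH" % count, data[4:4 + total])
        return tep_type, list(ports)
    raise ValueError("unknown TEP type")
```

   For example, decode_tep(encode_multi_port([1, 2, 3])) yields the
   multiple-port type together with the three SystemPort values, and
   wrap_tunnel_tlv(encode_single_port(0x0102)) prepends the 4-octet
   Tunnel Type and Length header to a single-port item.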
   The support for BGP Multicast VPN (MVPN) Services [RFC6513] with
   Tunnel Egress Point is outside the scope of this document.

6. Procedure

6.1. Procedure for IPv4/IPv6

   When OSF-Egress advertises IPv4/IPv6 prefix routes, the TEP and
   EncapID attributes are carried as path attributes of those routes.
   Upon receiving the prefix routes, OSF-Ingress identifies and
   records the TEP attribute information and the encapsulation
   information (EncapID).

   When OSF-Ingress sends packets, it looks up the route and sequences
   the packets according to the TEP attribute information.
   OSF-Forwarder devices can forward based on IPv4/IPv6 addresses or
   based on TEP attributes, without regard to packet reordering during
   forwarding.

   OSF-Egress receives the packets, sorts them and, if necessary,
   reassembles them. It then adds the Layer 2 encapsulation based on
   the EncapID and forwards the packets to the server in sequence.

                    +------------+
   +--------+-->p1  |   ECMP1    +-    +-->p3  +---------+-->p1(encap)
   |OSF-    |       ============== \  /        |OSF-     |
   |        +-->p2  |   ECMP2    +------->p2   |         +-->p2(encap)
   |Ingress |       ============== /\          |Egress   |
   |        +-->p3  |    ...     +-  +--->p1   |         +-->p3(encap)
   +--------+       ==============             +---------+
                    |   ECMPn    |
                    +------------+

6.2. Procedure for EVPN

   When OSF-Egress advertises Type-2 routes (MAC/IP Advertisement), it
   carries the TEP and EncapID attributes. When OSF-Egress advertises
   Type-5 routes (IP Prefix), it also carries the TEP and EncapID
   attributes. Upon receiving Type-2 and Type-5 routes, OSF-Ingress
   identifies and records the TEP and EncapID attribute information.
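   The per-TEP numbering at OSF-Ingress and the reordering at
   OSF-Egress that this section describes can be sketched as follows.
   This is a minimal illustration and not part of the protocol: the
   class names and the tuple layout are hypothetical, the sketch
   assumes a single ingress and no packet loss, and the data-plane
   format of the sequence number is not defined by this document.

```python
from collections import defaultdict


class IngressSequencer:
    """Numbers outgoing packets independently for each TEP."""

    def __init__(self):
        self._next_seq = defaultdict(int)

    def tag(self, tep, payload):
        # Assign the next sequence number for this TEP, then advance it.
        seq = self._next_seq[tep]
        self._next_seq[tep] += 1
        return tep, seq, payload


class EgressReorderer:
    """Releases packets of each TEP strictly in sequence order."""

    def __init__(self):
        self._expected = defaultdict(int)   # next sequence number per TEP
        self._pending = defaultdict(dict)   # early arrivals per TEP

    def receive(self, tep, seq, payload):
        # Buffer the packet, then release every packet that is now in order.
        self._pending[tep][seq] = payload
        released = []
        while self._expected[tep] in self._pending[tep]:
            released.append(self._pending[tep].pop(self._expected[tep]))
            self._expected[tep] += 1
        return released


# Usage: three packets toward the same TEP arrive out of order.
ingress = IngressSequencer()
p0, p1, p2 = (ingress.tag("TEP2", data) for data in ("a", "b", "c"))
egress = EgressReorderer()
first = egress.receive(*p2)    # held back: packets 0 and 1 are missing
second = egress.receive(*p0)   # packet 0 can be released
third = egress.receive(*p1)    # packets 1 and 2 are released together
```

   Keeping one counter and one buffer per TEP mirrors the text above:
   packets are sprayed over any path, and only the egress needs to
   restore order for each end-point. A deployment with several
   ingress devices would need state per ingress and TEP pair, which
   this sketch omits.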
   When OSF-Ingress sends packets, it performs MAC forwarding for
   Layer 2 packets and route lookups for Layer 3 packets, and in both
   cases sequences the packets according to the TEP attribute
   information. OSF-Forwarder devices can forward based on MAC or IP
   addresses, or based on TEP attributes, without regard to packet
   reordering during forwarding.

   OSF-Egress receives the packets, sorts them and, if necessary,
   reassembles them. It then adds the Layer 2 encapsulation to the
   packets based on the EncapID and forwards them to the server in
   sequence.

7. Deployment consideration

   The Device ID of each Spine device must be globally unique. This
   can be ensured through configuration or through centralized
   assignment by a controller.

8. IANA Considerations

   This document registers the following in the "BGP Tunnel
   Encapsulation Attribute Tunnel Types" registry [RFC9012].

      TBD    Tunnel Egress Point attribute

9. Security Considerations

   TBD

10. References

10.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
             2119 Key Words", BCP 14, RFC 8174, May 2017.

   [RFC9012] Patel, K., Van de Velde, G., Sangli, S., and J. Scudder,
             "The BGP Tunnel Encapsulation Attribute", RFC 9012,
             ISSN: 2070-1721, April 2021.

10.2. Informative References

   TBD.
Acknowledgments

   TBD

Contributors

   Jia Li
   New H3C Technologies
   China
   Email: lij@h3c.com

   Meng Li
   New H3C Technologies
   China
   Email: li_meng_limeng@h3c.com

   Jian Chen
   New H3C Technologies
   China
   Email: jian_chen@h3c.com

   Haina Zhong
   New H3C Technologies
   China
   Email: zhonghaina.06454@h3c.com

   Jincan Li
   RuiJie
   China
   Email: lijincan@ruijie.com.cn

   Yanrong Liang
   RuiJie
   China
   Email: liangyanrong@ruijie.com.cn

   Daniel Roytenberg
   DriveNets
   Email: danielro@drivenets.com

   Eyal Hezi
   DriveNets
   Email: ehezi@drivenets.com

   Alvin Yu Zhang
   DriveNets
   Email: azhang@drivenets.com

   Yehonatan Lemberger
   DriveNets
   Email: ylemberger@drivenets.com

   Yanjun Yang
   Broadcom
   Email: Yanjun.yang@broadcom.com

Authors' Addresses

   PengFei Huo
   ByteDance
   China
   Email: huopengfei@bytedance.com

   Gang Chen
   ByteDance
   China
   Email: chengang.gary@bytedance.com

   Changwang Lin
   New H3C Technologies
   China
   Email: linchangwang.04414@h3c.com

   Syed Hasan Raza Naqvi
   Broadcom
   Email: syed.naqvi@broadcom.com

   Yossi Kikozashvili
   DriveNets
   Email: ykikozashvili@drivenets.com

   Chenchen Qi
   ByteDance
   China
   Email: qichenchen@bytedance.com