Network Working Group                                             Y. Liu
Internet-Draft                                                  T. Jiang
Intended status: Informational                              China Mobile
Expires: 24 April 2023                                         T. Eckert
                                                               Futurewei
                                                                   Z. Li
                                                     Huawei Technologies
                                                               G. Mishra
                                                            Verizon Inc.
                                                                  Z. Qin
                                                            China Unicom
                                                                  C. Lin
                                                    New H3C Technologies
                                                                 X. Geng
                                                                  Huawei
                                                         21 October 2022


       Problem Statement of IPv6 Multicast Source Routing (MSR6)
                  draft-liu-msr6-problem-statement-01

Abstract

   This document analyses the gaps of the existing IPv6 multicast
   solutions under discussion in the IETF based on the requirements.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 April 2023.

Liu, et al.               Expires 24 April 2023                 [Page 1]

Internet-Draft          Problem Statement of MSR6           October 2022

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Revised
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Problem Statement for Multicast of Large-scale Network
     2.1.  Typical Scenario in DCN
       2.1.1.  AI Training
       2.1.2.  HPC
       2.1.3.  Storage
   3.  Problem Statement for IPv6 Multicast with IPSec
   4.  Problem Statement for IPv6 Host-initiated Multicast
   5.  Summary
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Normative References
   Authors' Addresses

1.  Introduction

   Multicast can provide efficient P2MP service without wasting
   bandwidth.  The increasing amount of live video traffic in the
   network brings new requirements for multicast solutions.  The
   existing multicast solutions require multicast tree building in the
   control plane and maintenance of end-to-end tree state per flow,
   which impacts router state capacity and network convergence time.
   There has been a lot of work in the IETF to simplify service
   deployment, in which source routing is a very important technology,
   including SRv6, BIER, etc.  Source routing reduces the state held in
   intermediate nodes and lets the ingress node indicate multicast
   forwarding, which can simplify multicast deployment.  Source routing
   requires sufficient flexibility in the forwarding plane, and IPv6
   has the advantage of good scalability.
   Therefore, it is important to simplify multicast deployment and meet
   high-quality service requirements with IPv6 source-routing-based
   multicast.

   The MSR6 WG will focus on use cases identified in
   [I-D.liu-msr6-use-cases] with the following set of characteristics:

   -  Large network scale with numerous multicast services

   -  IPv6 multicast flows transmitted through the Internet with a
      requirement for encryption

   -  IPv6 host-initiated or overlay multicast transport

   Based on these use cases, this document analyses the problems of the
   existing IPv6 multicast solutions under discussion in the IETF.  To
   solve these problems, MSR6 can be used as a complementary multicast
   solution.

2.  Problem Statement for Multicast of Large-scale Network

   In a large-scale network with numerous multicast services, existing
   multicast solutions have scalability issues.  Based on the use case
   document, two typical scenarios are considered as examples:

   *  Multicast for 5G transport, e.g., with 1.5k egress nodes and 10k
      multicast services;

   *  Multicast for DCN, e.g., with 3k switches, 60k links, and 1k
      multicast services.

   If PIM/mLDP/P2MP RSVP-TE are used in these cases, per-flow state
   protocols set up the multicast tree, which requires periodic state
   refresh and the corresponding protocol messages.  Multicast stream
   state is maintained in the intermediate nodes.  When there are
   thousands of concurrent multicast services, per-flow state brings
   scalability issues for network devices, especially when the
   multicast tree is dynamic.

   BIER/BIER-TE ([RFC8279]) was introduced in order to avoid explicit
   multicast tree building and per-flow state in intermediate nodes.
   But BIER faces challenges in a large-scale network.  Bit position
   allocation for BIER is related to the scale of the network topology.
   The number of bit positions directly affects the BIFT size and the
   bitstring length.
   When there are too many egress nodes/links in the network, the
   encapsulation expense and the number of BIFT entries could be
   unacceptable.  If the network is divided into several SDs or SIs,
   too many copies are sent, and the excessive traffic redundancy
   resembles a degradation to head-end replication.

   For example, if BIER is used for P2MP tunnels in the network, bit
   positions should be allocated for all egress nodes, i.e., 9k bit
   positions for all possible leaves.  In a sparse multicast example,
   most of the bit positions are 0 and only a few of them are set.  In
   this case, the BIER header is inefficient and the encapsulation
   expense is unacceptable.  Considering that the number of bit
   positions also determines the BIFT entry size, forwarding speed may
   also be affected.

   There are some possible methods to improve the situation in BIER.
   For example, a "set" could be used to save the cost of bit
   positions, but multiple packets have to be sent when the BFR-IDs of
   the receivers belong to different sets.  And when the network size
   is large, the usefulness of sets is not obvious.  In the case shown
   above, even if 10 sets are planned, about 900 bit positions are
   needed for each packet, and different sets require different BIFTs
   in each node.

   In BIER-TE, the bitstring needs to carry bits to indicate not only
   the receiving BFERs but also the intermediate hops/links across
   which the packet must be sent.  For the most common case, bit
   positions should be allocated for all adjacencies.  About 100k bit
   positions are required.  The bit positions representing adjacencies
   that the multicast tree goes through are set and the rest of the
   bit positions are set to 0.  In the example above, 7 bit positions
   are set in the bitstring.  The BIER-TE header is less efficient and
   the encapsulation expense is more significant, even compared to
   BIER.
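
   The bit-position arithmetic above can be checked with a short
   sketch.  The Python below is illustrative only: the helper name
   bitstring_cost, the even split of bit positions across sets, and
   the sample bit positions are assumptions made for this example and
   are not part of BIER, BIER-TE, or MSR6.  In particular, BIER chooses
   the BitStringLength from a fixed set of values, which this sketch
   ignores in order to follow the rough figures used in this section
   (9k bit positions for BIER, about 100k for BIER-TE).

```python
import math


def bitstring_cost(total_bit_positions, num_sets, set_positions):
    """Estimate the per-packet bitstring cost when the bit positions are
    split evenly across `num_sets` sets.

    Hypothetical helper for illustration; not an API of any BIER stack.
    `set_positions` holds the 1-based bit positions that must be set.
    """
    bits_per_set = math.ceil(total_bit_positions / num_sets)
    # Bit position p falls into set (p - 1) // bits_per_set.
    sets_hit = {(p - 1) // bits_per_set for p in set_positions}
    return {
        "bits_per_packet": bits_per_set,
        "bytes_per_packet": math.ceil(bits_per_set / 8),
        # One packet copy is needed for every set that holds a set bit.
        "packets_from_ingress": len(sets_hit),
        "bits_set": len(set(set_positions)),
    }


# Assumed sparse tree: three receivers scattered over 9k bit positions.
receivers = [1, 950, 8999]
one_set = bitstring_cost(9000, 1, receivers)
ten_sets = bitstring_cost(9000, 10, receivers)

# Assumed BIER-TE tree: 7 adjacency bits set out of about 100k.
bier_te = bitstring_cost(100_000, 1, [10, 20, 30, 40, 50, 60, 70])

for name, cost in (("1 set", one_set), ("10 sets", ten_sets),
                   ("BIER-TE", bier_te)):
    print(name, cost)
```

   With these assumed inputs, splitting 9k bit positions into 10 sets
   shrinks each bitstring from 9000 bits to 900 bits, but the three
   receivers fall into three different sets, so the ingress has to
   send three copies.  A single 100k-bit BIER-TE bitstring would
   occupy 12500 bytes while carrying only 7 set bits.
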
   The controller is also supposed to allocate different BIFTs for 10k
   nodes.

   Some methods defined in BIER-TE are introduced to improve the
   situation.  "Set" could also be used, but it is not enough, as the
   analysis above shows.  There are some other methods for reducing the
   number of required bits, such as unicast (forward_routed()), ECMP(),
   or flood (DNC) over "uninteresting" sub-parts of the topology, which
   bring different kinds of limitations for path planning.

   Since the existing BIER/BIER-TE cannot satisfy the requirements of
   multicast in a large-scale network, new source-routing-based
   solutions need to be introduced for multicast TE.  Possible
   solutions can be defined in the existing drafts.

   The basic idea is a combination of an RH segment list and a
   bitstring to specify the multicast path.  The existing BIER header
   cannot satisfy the requirement of encapsulating such information.
   Instead, the IPv6 Routing Header combined with other IPv6 extension
   headers can serve the purpose well.  The possible encapsulation is
   shown in the following figure.

   +--------------------------------+ ---
   |          IPv6 Header           |
   +--------------------------------+ IPv6 Multicast TE Tunnel Header
   |IPv6 RH (Segment List/Bitstring)|
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

2.1.  Typical Scenario in DCN

   In order to better show the requirements in the data center, we list
   3 typical potential multicast scenarios with P2MP services: AI
   training, HPC, and storage.

   The multicast requirements for large scale are expressed in 3
   aspects:

   -  Network Scale: number of switches, number of links, number of
      hosts

   -  Multicast Tree Size: number of intermediate nodes; number of
      receivers

   -  Multicast Service Number

2.1.1.  AI Training

   The following figure shows a typical RDMA AI training scenario.

   [Figure: two PS (Parameter Server) nodes, each a CPU server, linked
   to three GPU workers; the links are labelled "Gradients" and
   "Parameters".]

   Worker->PS: The gradient of each worker is pushed to the PS node.

   PS->Worker: The PS sends the parameters back to all workers after
   aggregation.

   In this process, the second stage distributes the same data content
   to all workers.  With unicast, N separate connections are used, so
   the bandwidth efficiency is 1/N: the larger the scale, the lower the
   efficiency.

                    +---------------+
                    |     Source    |
                    | +---+   +---+ |
                    | |CPU|   |GPU| |
                    | +-+-+   +-+-+ |
                    |   |       |   |
                    |    \     /    |
                    |   +-V---V-+   |
                    |   |  HCA  |   |
                    |   +-------+   |
                    +--+-+-+-+-+-+--+
                       | | ... | |
                    +--V-V-----V-V--+
                    |     Switch    |
                    +-+-----------+-+
                     /             \
      +-------------V-+           +-V-------------+
      |  Destination  |           |  Destination  |
      |   +-------+   |           |   +-------+   |
      |   |  HCA  |   |           |   |  HCA  |   |
      |   +-V---V-+   |           |   +-V---V-+   |
      |    /     \    |           |    /     \    |
      |   |       |   |           |   |       |   |
      | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
      | |CPU|   |GPU| |           | |CPU|   |GPU| |
      | +---+   +---+ |           | +---+   +---+ |
      +---------------+           +---------------+

   If the source sends only one copy to the network and the switches
   replicate the packet to the different destinations, bandwidth is
   used more efficiently and training is faster.

   The large-scale multicast requirement in this scenario is as follows:

   -  Network Scale: 10-10k GPUs

   -  Multicast Tree Size: 10-10k receivers

   -  Multicast Service Number: depends on the scenario

2.1.2.  HPC

   The following is an example of MPI in an HPC scenario.
   +-------------------------------------------+
   |                Dispatcher                 |
   |                  Master                   |
   +---------------------+---------------------+
                         |
       +-----------------+
       |
   +---+----+  +--------+             +--------+
   |+--V---+|  |+------+|             |+------+|
   ||Dispa-||  ||Dispa-||             ||Dispa-||
   ||Agent ||  ||Agent ||             ||Agent ||
   |+---+--+|  |+---+--+|             |+---+--+|
   |    |   |  |    |   |             |    |   |
   |+---V--+|  |+---V--+|             |+---V--+|
   || MPI  ||  || MPI  ||     ...     || MPI  ||
   ||Proces||  ||Proces||             ||Proces||
   |+---^--+|  |+---^--+|             |+---^--+|
   |    |   |  |    |   |             |    |   |
   |+---V--+|  |+---V--+|             |+---V--+|
   || RoCE |<-->| RoCE |<------------->| RoCE ||
   |+------+|  |+------+|             |+------+|
   +--------+  +--------+             +--------+

   Stage 1: The Dispatcher Master senses millions of cores and
   schedules millions of Rank MPI jobs on demand.  The Dispatcher
   Master sends the scheduling results to a Dispatcher Agent.

   Stage 2: The Dispatcher Agent starts Million Rank MPI on each node.
   The Dispatcher Agent that receives the message broadcasts it to the
   other Dispatcher Agents and performs the initialization before
   starting the MPI application.

   Stage 3: The Dispatcher Agent broadcasts the message to start the
   MPI application.  MPI internal initialization: synchronize the RoCE
   endpoints in an allgather way after the MPI application is started.

   The last 2 stages could benefit from multicast, which reduces task
   completion time.

   The large-scale multicast requirement in this scenario is as follows:

   -  Network Scale: 1000k CPUs/GPUs

   -  Multicast Tree Size: 10k~100k receivers

   -  Multicast Service Number: 1~100

2.1.3.  Storage

   Ceph is an open-source distributed software platform.  It mainly
   focuses on a scale-out file system, including storage distribution
   and availability, and is widely used in storage.  Ceph Object
   Storage Daemons (OSDs) are responsible for storing objects on a
   local file system on behalf of Ceph clients.
   Also, Ceph OSDs use the CPU, memory, and networking of Ceph cluster
   nodes for data replication, erasure coding, recovery, monitoring,
   and reporting functions.

   The following process requires P2MP service:

   -  The application initiates a "write" operation from a client to a
      server.

   -  The client finds the server to write to, and 3 copies are sent to
      3 servers.

          +-------+          +-------+
          |Client1|          |Client2|
          +---+---+          +---+---+
              |                  |
              +---------+--------+
                        |
                +-------+-------+
                |     Switch    |
                +-------+-------+
                        |
       +----------------+----------------+
       |                |                |
   +---+---+        +---+---+        +---+---+
   | Server|        | Server|        | Server|
   +-------+        +-------+        +-------+

   The large-scale multicast requirement in this scenario is as follows:

   -  Network Scale: 3k servers (1 Pod)

   -  Multicast Tree Size: 3 receivers

   -  Multicast Service Number: 10k

3.  Problem Statement for IPv6 Multicast with IPSec

   In a typical scenario such as IPv6-based SD-WAN, multicast traffic
   may traverse the Internet through an IPv6-based multicast tunnel.
   At the same time, the traffic must be encrypted for security.  IPSec
   can be adopted for encryption.  The independent layer design of BIER
   brings the following challenges:

   Option 1: If the IPv6 IPSec extension header is used for security
   (shown in the following figure), the BIER header will be encrypted
   and the traffic steering information cannot be acquired by the BIER
   nodes.  That is, BIER cannot work in this option.
   +--------------------------------+ ---
   |          IPv6 Header           |  ^
   +--------------------------------+  |
   |  IPv6 IPSec Header (ESP & AH)  | IPv6 Multicast Tunnel Header
   +--------------------------------+  |
   |          BIER Header           |  |
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   Option 2: In order for the BIER header to work while implementing
   the security function, a new security header may have to be
   introduced for the BIER layer (shown in the following figure).  This
   means that: 1) the existing IPv6 IPSec extension header cannot be
   reused; 2) there can be conflicting functions in the two layers, the
   IPv6 layer and the BIER layer.

   +--------------------------------+ ---
   |          IPv6 Header           |  ^
   +--------------------------------+  |
   |          BIER Header           | IPv6 Multicast Tunnel Header
   +--------------------------------+  |
   |      New Security Header       |  |
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   MSR6, which is designed based on native IPv6, can reuse the IPv6
   Authentication Header and Encapsulating Security Payload header.  If
   MSR6 is used in this case, the packet is supposed to be encapsulated
   as follows to implement end-to-end multicast security:

   +--------------------------------+ ---
   |          IPv6 Header           |  ^
   +--------------------------------+  |
   |  IPv6 EH (MSR6 EH or Options)  | IPv6 Multicast Tunnel Header
   +--------------------------------+  |
   |  IPv6 IPSec Header (ESP & AH)  |  |
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   Just like IPsec, there are other existing functionalities defined in
   the IETF based on IPv6, for example fragmentation, network slicing,
   and IOAM, which could all be reused in MSR6 because it is based on
   the IPv6 data plane.  In comparison, these functions/headers have to
   be defined again if they are to be used in BIER, which brings
   redundancy.

4.  Problem Statement for IPv6 Host-initiated Multicast

   In the IPv6 host-initiated multicast scenarios, the host originates
   the IPv6 packet to be replicated for the different leaf hosts.  The
   packet originated by the host may have the format shown in the
   following figure.  The packet has the encapsulation of the IP layer
   and the transport layer.

   +--------------------------------+ ---
   |          IPv6 Header           | IP Layer
   +--------------------------------+ ---
   |           UDP Header           | Transport Layer
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   If BIER is adopted for multicast traffic steering, the independent
   layer design of BIER may make the packet originated by the host look
   as follows.  This violates the layer architecture of the Internet,
   that is, it introduces an extra layer (the BIER layer).  This does
   not work in the host.

   +--------------------------------+ ---
   |          IPv6 Header           | IP Layer
   +--------------------------------+ ---
   |          BIER Header           | BIER Layer
   +--------------------------------+ ---
   |           UDP Header           | Transport Layer
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   For MSR6, the multicast traffic steering information is encapsulated
   in the IPv6 extension header shown in the following figure.  This
   maintains the layer architecture of the Internet.

   +--------------------------------+ ---
   |          IPv6 Header           |
   +--------------------------------+ IP Layer
   |  IPv6 EH (MSR6 EH or Options)  |
   +--------------------------------+ ---
   |           UDP Header           | Transport Layer
   +--------------------------------+ ---
   |            Payload             |
   +--------------------------------+

   Besides, multicast source routing requires no explicit multicast
   tree setup protocols.  The network device replicates and forwards
   the packet based only on the MSR6 header encapsulated by the host.

5.  Summary

   In summary, in order to satisfy the requirements of the use cases
   characterized as follows,

   -  Large network scale with numerous multicast services

   -  IPv6 multicast flows transmitted through the Internet with a
      requirement for encryption

   -  IPv6 host-initiated or overlay multicast transport

   and according to the analysis of the problems of the existing
   multicast solutions, an MSR6 solution should be introduced that
   takes advantage of the IPv6 extension header to encapsulate the
   extensible multicast traffic steering information and reuses the
   existing IPv6 encapsulations such as IPSec.  There can be a unified
   encapsulation for the IPv6 tunneled packet and the IPv6
   host-initiated packet.  The abstract MSR6 header is shown in the
   following figure:

   +--------------------------------+
   |          IPv6 Header           |
   +--------------------------------+
   |IPv6 RH (Segment List/Bitstring)|
   +--------------------------------+
   |    IPv6 EH (MCAST Options)     |
   +--------------------------------+
   |  IPv6 IPSec Header (ESP & AH)  |
   +--------------------------------+

6.  IANA Considerations

   This document makes no request of IANA.

7.  Security Considerations

   TBD

8.  Acknowledgements

   TBD

9.  Normative References

   [I-D.cheng-spring-ipv6-msr-design-consideration]
              Cheng, W., Mishra, G., Li, Z., Wang, A., Qin, Z., and C.
              Fan, "Design Consideration of IPv6 Multicast Source
              Routing (MSR6)", Work in Progress, Internet-Draft,
              draft-cheng-spring-ipv6-msr-design-consideration-01,
              25 October 2021.

   [I-D.liu-msr6-use-cases]
              Liu, Y., Yang, F., Wang, A., Zhang, X., Geng, X., and Z.
              Li, "MSR6 (Multicast Source Routing over IPv6) Use
              Cases", Work in Progress, Internet-Draft,
              draft-liu-msr6-use-cases-01, 11 July 2022.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S.
              Aldrin, "Multicast Using Bit Index Explicit Replication
              (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
              for Bit Index Explicit Replication (BIER) in MPLS and
              Non-MPLS Networks", RFC 8296, DOI 10.17487/RFC8296,
              January 2018.

   [RFC8663]  Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx,
              W., and Z. Li, "MPLS Segment Routing over IP", RFC 8663,
              DOI 10.17487/RFC8663, December 2019.

Authors' Addresses

   Yisong Liu
   China Mobile
   Email: liuyisong@chinamobile.com

   Tianji Jiang
   China Mobile
   1525 McCathy Blvd.
   Milpitas, CA 95035
   United States of America
   Email: tianjijiang@chinamobile.com

   Toerless Eckert
   Futurewei
   Email: tte+ietf@cs.fau.de

   Zhenbin Li
   Huawei Technologies
   Email: lizhenbin@huawei.com

   Gyan Mishra
   Verizon Inc.
   Email: gyan.s.mishra@verizon.com

   Zhuangzhuang Qin
   China Unicom
   Email: qinzhuangzhuang@chinaunicom.cn

   Changwang Lin
   New H3C Technologies
   Email: linchangwang.04414@h3c.com

   Xuesong Geng
   Huawei
   Email: gengxuesong@huawei.com