Internet-Draft Problem Statement of MSR6 October 2022
Liu, et al. Expires 24 April 2023
Workgroup: Network Working Group
Internet-Draft: draft-liu-msr6-problem-statement-01
Published: October 2022
Intended Status: Informational
Expires: 24 April 2023
Authors:
Y. Liu
China Mobile
T. Jiang
China Mobile
T. Eckert
Futurewei
Z. Li
Huawei Technologies
G. Mishra
Verizon Inc.
Z. Qin
China Unicom
C. Lin
New H3C Technologies
X. Geng
Huawei

Problem Statement of IPv6 Multicast Source Routing (MSR6)

Abstract

This document analyses the gaps in the existing IPv6 multicast solutions under discussion in the IETF, based on the requirements of the MSR6 use cases.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 24 April 2023.

1. Introduction

Multicast provides efficient point-to-multipoint (P2MP) service without wasting bandwidth. The increasing amount of live video traffic in the network brings new requirements for multicast solutions. Existing multicast solutions require building multicast trees in the control plane and maintaining end-to-end tree state per flow, which affects router state capacity and network convergence time. The IETF has done a lot of work to simplify service deployment, in which source routing, as used by SRv6, BIER, etc., is a very important technology. Source routing reduces the state held in intermediate nodes and lets the ingress node indicate how multicast packets are forwarded, which can simplify multicast deployment. Source routing requires sufficient flexibility in the forwarding plane, and IPv6 has the advantage of good scalability. Therefore, it is important to simplify multicast deployment and meet high-quality service requirements with multicast based on IPv6 source routing.

The MSR6 WG will focus on use cases identified in [I-D.liu-msr6-use-cases] with the following set of characteristics:

- Large network scale with numerous multicast services

- IPv6 multicast flows transmitted through the Internet with a requirement for encryption

- IPv6 Host Initiated or overlay Multicast Transport

Based on these use cases, this document analyses the problems of the existing IPv6 multicast solutions under discussion in the IETF. To solve these problems, MSR6 can be used as a complementary multicast solution.

2. Problem Statement for Multicast of Large-scale Network

In a large-scale network with numerous multicast services, existing multicast solutions have scalability issues.

Based on the use case document, 2 typical scenarios are considered as an example:

If PIM, mLDP or P2MP RSVP-TE is used in these cases, a per-flow state protocol sets up the multicast tree, which requires periodic state refresh and the corresponding protocol messages. Multicast flow state is maintained in the intermediate nodes. When there are thousands of concurrent multicast services, per-flow state brings scalability issues for network devices, especially when the multicast tree is dynamic.
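As a rough illustration of why per-flow state scales poorly, the following sketch counts state entries under the simplifying assumption that every on-tree router keeps one entry per flow. The inputs (5,000 concurrent flows, 20 on-tree routers per flow, a 60-second refresh interval) are hypothetical values chosen for the example and are not taken from the use case document.

```python
def per_flow_state_entries(flows: int, on_tree_routers_per_flow: int) -> int:
    """Network-wide entries if each on-tree router keeps one entry per flow."""
    return flows * on_tree_routers_per_flow


def refreshes_per_second(entries: int, refresh_interval_s: float) -> float:
    """Upper bound on refreshes per second (one refresh per entry per interval)."""
    return entries / refresh_interval_s


# Hypothetical inputs: 5,000 flows, 20 on-tree routers per flow,
# and an assumed 60-second refresh interval.
entries = per_flow_state_entries(5000, 20)
print(entries)  # 100000
print(refreshes_per_second(entries, 60.0))
```

Real protocols can aggregate several refreshes into one message, so the second figure is only an upper bound on refresh events, not a message count.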

BIER/BIER-TE ([RFC8279]) was introduced to avoid explicit multicast tree building and per-flow state in intermediate nodes. However, BIER faces challenges in a large-scale network. Bit position allocation in BIER depends on the scale of the network topology, and the number of bit positions directly affects the BIFT size and the bitstring length. When there are too many egress nodes or links in the network, the encapsulation overhead and the number of BIFT entries can become unacceptable. If the network is divided into several SDs or SIs, the ingress has to send too many copies, which causes excessive traffic redundancy, similar to degrading to head-end replication.

For example, if BIER is used for a P2MP tunnel in the network, a bit position has to be allocated for every egress node, i.e., 9k bit positions for all possible leaves. In a sparse multicast example, most of the bit positions are 0 and only a few of them are set. In this case, the BIER header is inefficient and the encapsulation overhead is unacceptable. Since the number of bit positions also determines the BIFT entry size, forwarding speed may also be affected.
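The cost of a flat bitstring for the 9k-leaf example can be estimated with a short calculation. The sketch below assumes that a single bitstring covers all bit positions and that its length is rounded up to a power of two of at least 64 bits, as BIER bitstring lengths are. The receiver count of 5 is a hypothetical sparse group. The resulting 16384-bit string is longer than the bitstring lengths defined in [RFC8296], so the numbers are purely illustrative.

```python
def flat_bitstring_bits(bit_positions: int, min_bits: int = 64) -> int:
    """Smallest power-of-two length (>= min_bits) that holds all bit positions."""
    length = min_bits
    while length < bit_positions:
        length *= 2
    return length


def header_cost(bit_positions: int, receivers: int) -> dict:
    """Bitstring size and the fraction of bits actually set for one packet."""
    bits = flat_bitstring_bits(bit_positions)
    return {
        "bitstring_bits": bits,
        "bitstring_bytes": bits // 8,
        "bits_set_ratio": receivers / bits,
    }


# 9k bit positions from the example above; 5 receivers is a hypothetical value.
print(header_cost(9000, 5))
```

With these inputs the bitstring alone takes 2048 bytes per packet while only 5 of its bits carry information.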

There are some possible methods to improve the situation in BIER. For example, a "set" can be used to reduce the number of bit positions per packet, but multiple packets have to be sent when the BFR-ids of the receivers belong to different sets. When the network is large, the benefit of sets is limited. In the case shown above, even if 10 sets are planned, each packet still needs about 900 bit positions, and each set requires a different BIFT in each node.
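The trade-off between sets and packet copies can be made concrete with a small calculation. The sketch below divides the 9k bit positions into 10 equal sets and counts how many copies the ingress must send for a given receiver list, using the BIER convention that the set of a receiver is (BFR-id - 1) divided by the set size. The receiver BFR-ids are hypothetical, and a set size of 900 follows the example in the text rather than a standard bitstring length.

```python
import math


def bits_per_set(total_positions: int, num_sets: int) -> int:
    """Bit positions each packet must carry when positions are split evenly."""
    return math.ceil(total_positions / num_sets)


def copies_needed(receiver_bfr_ids: list, set_size: int) -> int:
    """One packet copy is required per distinct set that contains a receiver."""
    return len({(bfr_id - 1) // set_size for bfr_id in receiver_bfr_ids})


set_size = bits_per_set(9000, 10)
print(set_size)                                       # 900
print(copies_needed([1, 950, 4000, 8999], set_size))  # 4
print(copies_needed([1, 2, 3], set_size))             # 1
```

Four receivers spread across the network already cost four copies at the ingress, which is the same as head-end replication for that group.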

In BIER-TE, the bitstring needs to carry bits that indicate not only the receiving BFERs but also the intermediate hops/links across which the packet must be sent. In the most common case, a bit position has to be allocated for every adjacency, so about 100k bit positions are required. The bit positions representing the adjacencies that the multicast tree traverses are set, and the rest are set to 0. In the example above, 7 bit positions are set in the bitstring. The BIER-TE header is therefore even less efficient than the BIER header, and its encapsulation overhead is more significant. In addition, the controller has to allocate different BIFTs for 10k nodes.

Some methods defined in BIER-TE have been introduced to improve the situation. "Set" can also be used, but as analysed above it is not enough. There are other methods for reducing the number of required bits, such as unicast (forward_routed()), ECMP() or flood (DNC) over "uninteresting" sub-parts of the topology, each of which brings different limitations for path planning.

Since the existing BIER/BIER-TE cannot satisfy the requirements of multicast in a large-scale network, new source-routing-based solutions need to be introduced for multicast and multicast TE. Possible solutions are defined in existing drafts. The basic idea is to combine an RH segment list and a bitstring to specify the multicast path. The existing BIER header cannot encapsulate such information. Instead, an IPv6 Routing Header combined with other IPv6 extension headers can serve the purpose well. A possible encapsulation is shown in the following figure.

     +--------------------------------+      ---
     |          IPv6 Header           |
     +--------------------------------+ IPv6 Multicast TE Tunnel Header
     |IPv6 RH (Segment List/Bitstring)|
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
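To see why listing only the hops on the tree can be attractive for sparse trees, the following sketch compares the size of a flat bitstring over all adjacencies with the size of an explicit list of on-tree hops. It assumes one 128-bit IPv6 address per listed hop, which is an assumption made for this comparison and not an encoding defined by MSR6. The 100k bit positions and 7 on-tree adjacencies come from the BIER-TE example above.

```python
SEGMENT_BYTES = 16  # assumption: one 128-bit IPv6 address per listed hop


def flat_bitstring_bytes(total_bit_positions: int) -> int:
    """Bytes needed to carry one bit per adjacency in the whole network."""
    return (total_bit_positions + 7) // 8


def explicit_list_bytes(hops_on_tree: int, segment_bytes: int = SEGMENT_BYTES) -> int:
    """Bytes needed to list only the adjacencies the tree actually uses."""
    return hops_on_tree * segment_bytes


print(flat_bitstring_bytes(100_000))  # 12500
print(explicit_list_bytes(7))         # 112
```

The explicit list grows with the size of the tree, so for dense trees the comparison reverses, which is one reason to combine a segment list with a bitstring.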

2.1. Typical Scenario in DCN

To better show the requirements in the data center, we list 3 typical potential multicast scenarios with P2MP services: AI training, HPC and storage.

The large-scale multicast requirements are expressed in 3 aspects:

- Network Scale: number of switches, number of links, number of hosts

- Multicast Tree Size: number of intermediate nodes; number of receivers

- Multicast Service Number

2.1.1. AI Training

The following figure shows a typical RDMA AI training scenario.

                 PS(Parameter Server) Nodes
               +-------+          +-------+
               |  CPU  |          |  CPU  |
               | Server|          | Server|
               +-+-+-+-+          +-+-+-+-+
    ^            | | |              | | |          |
    |         +--|-|-|--------------+ | |          |
    |       +----+ | +----------------------+      |
    |       | |    +--------+ +-------+ |   |      V
Gradients   | |             | |         |   | Parameters
        +---+-+-+       +---+-+-+     +-+---+-+
        |  GPU  |       |  GPU  |     |  GPU  |
        | Worker|       | Worker|     | Worker|
        +-------+       +-------+     +-------+

Worker->PS: The gradient of each worker is pushed to the PS node.

PS->Worker: After aggregation, the parameters are sent from the PS back to all workers.

In this process, the second stage is information distribution, in which the same data content is sent to every worker. With unicast, N connections are used to transmit it separately, so the bandwidth efficiency is 1/N: the larger the scale, the lower the efficiency.

                      +---------------+
                      |     Source    |
                      | +---+   +---+ |
                      | |CPU|   |GPU| |
                      | +-+-+   +-+-+ |
                      |   |       |   |
                      |    \     /    |
                      |   +-V---V-+   |
                      |   |  HCA  |   |
                      |   +-------+   |
                      +--+-+-+-+-+-+--+
                         | | ... | |
                      +--V-V-----V-V--+
                      |     Switch    |
                      +-+-----------+-+
                       /             \
        +-------------V-+           +-V-------------+
        |  Destination  |           |  Destination  |
        |   +-------+   |           |   +-------+   |
        |   |  HCA  |   |           |   |  HCA  |   |
        |   +-V---V-+   |           |   +-V---V-+   |
        |    /     \    |           |    /     \    |
        |   |       |   |           |   |       |   |
        | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
        | |CPU|   |GPU| |           | |CPU|   |GPU| |
        | +---+   +---+ |           | +---+   +---+ |
        +---------------+           +---------------+

If the source sends only 1 copy to the network and the switches replicate the packet to the different destinations, bandwidth is used more efficiently and training is faster.
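The 1/N efficiency figure and its effect on distribution time can be checked with a short calculation. The sketch below only models the source's uplink and ignores protocol overhead and congestion. The payload size (1 GB), link rate (100 Gbit/s) and worker count (100) are hypothetical values chosen for the example.

```python
def source_link_copies(receivers: int, multicast: bool) -> int:
    """Copies of the same data the source puts on its uplink."""
    return 1 if multicast else receivers


def bandwidth_efficiency(receivers: int, multicast: bool) -> float:
    """Useful share of the source uplink: 1/N for unicast, 1 for multicast."""
    return 1 / source_link_copies(receivers, multicast)


def distribution_time_s(payload_bytes: int, link_bps: float,
                        receivers: int, multicast: bool) -> float:
    """Time to push the payload through the source uplink."""
    return source_link_copies(receivers, multicast) * payload_bytes * 8 / link_bps


# Hypothetical inputs: 1 GB of parameters, a 100 Gbit/s uplink, 100 workers.
print(bandwidth_efficiency(100, multicast=False))                # 0.01
print(distribution_time_s(10**9, 100e9, 100, multicast=False))   # 8.0
print(distribution_time_s(10**9, 100e9, 100, multicast=True))
```

Under these assumptions the same parameter update occupies the source uplink 100 times longer with unicast than with in-network replication.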

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 10-10k GPUs

- Multicast Tree Size: 10-10k receivers

- Multicast Service Number: depends on the scenario

2.1.2. HPC

The following is an example of MPI in an HPC scenario.

      +-------------------------------------------+
      |                Dispatcher                 |
      |                  Master                   |
      +---------------------+---------------------+
                            |
          +-----------------+
          |
      +---+----+  +--------+             +--------+
      |+--V---+|  |+------+|             |+------+|
      ||Dispa-||  ||Dispa-||             ||Dispa-||
      ||Agent ||  ||Agent ||             ||Agent ||
      |+---+--+|  |+---+--+|             |+---+--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
      ||Proces||  ||Proces||             ||Proces||
      |+---^--+|  |+---^--+|             |+---^--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      || RoCE |<-->| RoCE |<------------->| RoCE ||
      |+------+|  |+------+|             |+------+|
      +--------+  +--------+             +--------+

Stage 1: The Dispatcher Master senses millions of cores and schedules million-rank MPI jobs on demand. The Dispatcher Master sends the scheduling results to a Dispatcher Agent.

Stage 2: The Dispatcher Agents start the million-rank MPI job on each node. The Dispatcher Agent that receives the message broadcasts it to the other Dispatcher Agents, which perform initialization before starting the MPI application.

Stage 3: The Dispatcher Agent broadcasts the message to start the MPI application. After the MPI application is started, MPI internal initialization synchronizes the RoCE endpoints in an allgather manner.

The last 2 stages could benefit from multicast, which would reduce task completion time.

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 1000k CPUs/GPUs

- Multicast Tree Size: 10k~100k receivers

- Multicast Service Number: 1~100

2.1.3. Storage

Ceph is an open-source distributed storage platform. It mainly focuses on a scale-out file system, including storage distribution and availability, and is widely used in storage.

Ceph Object Storage Daemons (OSDs) are responsible for storing objects on a local file system on behalf of Ceph clients. Ceph OSDs also use the CPU, memory and networking of Ceph cluster nodes for data replication, erasure coding, recovery, monitoring and reporting functions.

The following process requires P2MP service:

- The application initiates a "write" operation from a client to a server.

- The client finds the servers to write to, and 3 copies are sent to 3 servers.

               +-------+          +-------+
               |Client1|          |Client2|
               +---+---+          +---+---+
                   |                  |
                   +---------+--------+
                             |
                     +-------+-------+
                     |     Switch    |
                     +-------+-------+
                             |
            +----------------+----------------+
            |                |                |
        +---+---+        +---+---+        +---+---+
        | Server|        | Server|        | Server|
        +-------+        +-------+        +-------+

The large-scale multicast requirements in this scenario are as follows:

- Network Scale: 3k servers (1 Pod)

- Multicast Tree Size: 3 receivers

- Multicast Service Number: 10k

3. Problem Statement for IPv6 Multicast with IPsec

In a typical scenario such as IPv6-based SD-WAN, multicast traffic may traverse the Internet through an IPv6-based multicast tunnel. At the same time, the traffic must be encrypted for security. IPsec can be adopted for encryption.

The independent layer design of BIER brings the following challenges:

Option 1: If the IPv6 IPsec extension header is used for security (as shown in the following figure), the BIER header is encrypted and the BIER nodes cannot read the traffic steering information. That is, BIER cannot work with this option.

     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |  IPv6 IPsec Header (ESP & AH)  | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |           BIER Header          |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+

Option 2: For the BIER header to work while the security function is implemented, a new security header may have to be introduced for the BIER layer (as shown in the following figure). This means that 1) the existing IPv6 IPsec extension header cannot be reused, and 2) there can be conflicting functions in the two layers, the IPv6 layer and the BIER layer.

     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |          BIER Header           | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |       New Security Header      |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+

MSR6, which is designed based on native IPv6, can reuse the IPv6 Authentication Header and the Encapsulating Security Payload header. If MSR6 is used in this case, the packet is encapsulated as follows to implement end-to-end multicast security:

     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |  IPv6 EH (MSR6 EH or Options)  | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |  IPv6 IPsec Header (ESP & AH)  |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
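The difference between Option 1 and the MSR6 encapsulation comes down to where the steering header sits relative to ESP. The sketch below models a packet as an ordered list of header names and assumes that a transit node can parse everything up to and including the ESP header, while everything after it is encrypted. The header labels ("BIER", "MSR6-EH") are illustrative names only, and AH is left out for simplicity.

```python
def readable_by_transit(chain: list) -> list:
    """Headers a transit node can parse: everything up to and including ESP."""
    readable = []
    for header in chain:
        readable.append(header)
        if header == "ESP":
            break  # everything after the ESP header is encrypted
    return readable


def can_steer(chain: list, steering_header: str) -> bool:
    """True if the steering header is still visible to transit nodes."""
    return steering_header in readable_by_transit(chain)


option1 = ["IPv6", "ESP", "BIER", "Payload"]     # BIER header behind ESP
msr6 = ["IPv6", "MSR6-EH", "ESP", "Payload"]     # steering EH ahead of ESP

print(can_steer(option1, "BIER"))   # False
print(can_steer(msr6, "MSR6-EH"))   # True
```

Placing the steering information in an extension header ahead of ESP matches the usual IPv6 extension header ordering, in which the Routing header precedes the ESP header.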

As with IPsec, other existing IPv6-based functionalities defined in the IETF, for example fragmentation, network slicing and IOAM, can all be reused in MSR6, which is based on the IPv6 data plane. In comparison, these functions and headers would have to be defined again for BIER, which brings redundancy.

4. Problem Statement for IPv6 Host-initiated Multicast

In IPv6 host-initiated multicast scenarios, the host originates the IPv6 packet to be replicated to the different leaf hosts. The packet originated by the host may have the format shown in the following figure, with IP layer and transport layer encapsulation.

     +--------------------------------+       ---
     |          IPv6 Header           |    IP Layer
     +--------------------------------+       ---
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+

If BIER is adopted for multicast traffic steering, the independent layer design of BIER may give the packet originated by the host the following format. This violates the layered architecture of the Internet because it introduces an extra layer (the BIER layer), which does not work in the host.

     +--------------------------------+       ---
     |          IPv6 Header           |    IP Layer
     +--------------------------------+       ---
     |          BIER Header           |   BIER Layer
     +--------------------------------+       ---
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+

For MSR6, the multicast traffic steering information is encapsulated in an IPv6 extension header, as shown in the following figure. This maintains the layered architecture of the Internet.

     +--------------------------------+       ---
     |          IPv6 Header           |
     +--------------------------------+    IP Layer
     |   IPv6 EH (MSR6 EH or Options) |
     +--------------------------------+       ---
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+

In addition, multicast source routing requires no explicit multicast tree setup protocol. The network device replicates and forwards the packet based only on the MSR6 header encapsulated by the host.
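To illustrate how a node can replicate a packet from header information alone, the following sketch applies BIER-style bitstring replication in the spirit of [RFC8279]. It is not the forwarding behaviour defined for MSR6, which this document does not specify. The forwarding table layout, the bit masks and the neighbor names "B" and "C" are hypothetical.

```python
def replicate(bitstring: int, table: dict) -> list:
    """Return (neighbor, bitstring) copies for one incoming packet.

    table maps a 1-based bit position to (forwarding bit mask, neighbor).
    """
    copies = []
    remaining = bitstring
    while remaining:
        low = remaining & -remaining       # lowest set bit still to be served
        position = low.bit_length()        # its 1-based bit position
        entry = table.get(position)
        if entry is None:
            remaining &= ~low              # no route for this receiver: skip it
            continue
        mask, neighbor = entry
        copies.append((neighbor, remaining & mask))
        remaining &= ~(mask | low)         # receivers served by this copy are done
    return copies


# Hypothetical table: receivers 1 and 2 are reached via B, 3 and 4 via C.
table = {1: (0b0011, "B"), 2: (0b0011, "B"), 3: (0b1100, "C"), 4: (0b1100, "C")}
print(replicate(0b1011, table))  # [('B', 3), ('C', 8)]
```

The node keeps no per-flow state: the table depends only on the topology, and the set of receivers travels in the packet.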

5. Summary

In summary, the use cases considered in this document have the following characteristics:

- Large network scale with numerous multicast services

- IPv6 multicast flows transmitted through the Internet with a requirement for encryption

- IPv6 Host Initiated or overlay Multicast Transport

Based on the analysis of the problems of the existing multicast solutions, the MSR6 solution should be introduced to satisfy these requirements. It takes advantage of IPv6 extension headers to encapsulate extensible multicast traffic steering information and reuses existing IPv6 encapsulations such as IPsec. This allows a unified encapsulation for IPv6 tunneled packets and IPv6 host-initiated packets. The abstract MSR6 header is shown in the following figure:

     +--------------------------------+
     |          IPv6 Header           |
     +--------------------------------+
     |IPv6 RH (Segment List/Bitstring)|
     +--------------------------------+
     |    IPv6 EH (MCAST Options)     |
     +--------------------------------+
     |  IPv6 IPsec Header (ESP & AH)  |
     +--------------------------------+

6. IANA Considerations

This document makes no request of IANA.

7. Security Considerations

TBD

8. Acknowledgements

TBD

9. Normative References

[I-D.cheng-spring-ipv6-msr-design-consideration]
Cheng, W., Mishra, G., Li, Z., Wang, A., Qin, Z., and C. Fan, "Design Consideration of IPv6 Multicast Source Routing (MSR6)", Work in Progress, Internet-Draft, draft-cheng-spring-ipv6-msr-design-consideration-01, <https://www.ietf.org/archive/id/draft-cheng-spring-ipv6-msr-design-consideration-01.txt>.
[I-D.liu-msr6-use-cases]
Liu, Y., Yang, F., Wang, A., Zhang, X., Geng, X., and Z. Li, "MSR6(Multicast Source Routing over IPv6) Use Cases", Work in Progress, Internet-Draft, draft-liu-msr6-use-cases-01, <https://www.ietf.org/archive/id/draft-liu-msr6-use-cases-01.txt>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8279]
Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, <https://www.rfc-editor.org/info/rfc8279>.
[RFC8296]
Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation for Bit Index Explicit Replication (BIER) in MPLS and Non-MPLS Networks", RFC 8296, DOI 10.17487/RFC8296, <https://www.rfc-editor.org/info/rfc8296>.
[RFC8663]
Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx, W., and Z. Li, "MPLS Segment Routing over IP", RFC 8663, DOI 10.17487/RFC8663, <https://www.rfc-editor.org/info/rfc8663>.

Authors' Addresses

Yisong Liu
China Mobile
Tianji Jiang
China Mobile
1525 McCarthy Blvd.
Milpitas, CA 95035
United States of America
Toerless Eckert
Futurewei
Zhenbin Li
Huawei Technologies
Gyan Mishra
Verizon Inc.
Zhuangzhuang Qin
China Unicom
Changwang Lin
New H3C Technologies
Xuesong Geng
Huawei