Internet Research Task Force                                      Y. Cui
Internet-Draft                                                    Y. Wei
Intended Status:                                                   Z. Xu
Expires: 17 October 2023                             Tsinghua University
                                                                  P. Liu
                                                                   Z. Du
                                                            China Mobile
                                                              April 2023

Graph Neural Network Based Modeling for Digital Twin Network

Abstract

This draft introduces the scenarios and requirements for performance modeling of digital twin networks, explores implementation methods for network models, and proposes a network modeling method based on graph neural networks (GNNs). The method combines GNNs with graph sampling techniques to improve the expressiveness and granularity of the model. The model is trained from data and validated in typical scenarios; it predicts QoS metrics such as network latency well, providing a reference option for network performance modeling.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 17 October 2023.

Table of Contents

1.  Introduction
2.  Definition of Terms
3.  Scenarios, Requirements and Challenges of Network Modeling for DTN
  3.1.  Scenarios
  3.2.  Requirements
  3.3.  Main Challenges
4.  Modeling Digital Twin Networks
  4.1.  Consideration/Analysis on Network Modeling Methods
  4.2.  Network Modeling Framework
  4.3.  Building a Network Model
    4.3.1.  Networking System as a Relation Graph
    4.3.2.  Message-passing on the Heterogeneous Graph
    4.3.3.  State Transition Learning
    4.3.4.  Model Training
  4.4.  Model Performance in Data Center Networks and Wide Area Networks
    4.4.1.  QoS Inference in Data Center Networks
    4.4.2.  Time-Series Prediction in Data Center Networks
    4.4.3.  Steady-State QoS Inference in Wide Area Networks
5.  Conclusion
6.  Security Considerations
7.  IANA Considerations
8.  Informative References
Authors' Addresses

1. Introduction

Digital twin networks are virtual images (or simulations) of physical network infrastructures that help network designers achieve simplified, automated, elastic, and full-lifecycle operations. The task of network modeling is to predict how network performance metrics, such as throughput and latency, change in various "what-if" scenarios [I-D.irtf-nmrg-network-digital-twin-arch], such as changes in traffic conditions and reconfigurations of network devices. In this document, we propose a network performance modeling framework based on graph neural networks, which supports modeling various network configurations (including topology, routing, and caching) and can make time-series predictions of flow-level performance metrics.

2. Definition of Terms

This document makes use of the following terms:

Digital Twin Network (DTN): a virtual representation of a physical network, kept consistent with the physical network and used for its analysis, simulation, and optimization.

Graph Neural Network (GNN): a neural network that operates on graph-structured data and computes node representations by iteratively exchanging and aggregating messages along edges.

Networking Graph Network (NGN): a "graph-to-graph" neural network module with heterogeneous nodes, used in this document as the fundamental building block for network modeling (see Section 4.3.2).

3. Scenarios, Requirements and Challenges of Network Modeling for DTN

3.1. Scenarios

Digital twin networks are digital virtual mappings of physical networks, and some of their main applications include network technology experiments, network configuration validation, network performance optimization, etc. All of these applications require accurate network models in the twin network to enable precise simulation and prediction of the functionality and performance characteristics of the physical network.

This document mainly focuses on network performance modeling, while the modeling for network configuration validation is not within the scope.

3.2. Requirements

Physical networks are composed of various network elements and the links between them, and different network elements have different functionalities and performance characteristics. In the early planning stages of the network lifecycle, the physical network does not yet fully exist, but the network owner wants to predict the network's capabilities based on the network model and its simulation: to determine whether the network can meet the requirements of future applications running on it, such as throughput capacity and latency requirements, and to build the network at the optimal cost. During the operation stage, network performance modeling can work in conjunction with the online physical network to carry out network changes and optimization while reducing operational risks and costs. Therefore, a network model must be able to represent the various performance-related factors in the physical network and strive for as much accuracy as possible. This places higher demands on network modeling, including the following aspects.

(1) To produce accurate predictions, a network model must have sufficient expressiveness to include as many factors influencing network performance indicators as possible; otherwise, it will inevitably fail to generalize to broader network environments. These factors span levels of operation from end hosts to network devices: congestion control at the host level; scheduling strategies and active queue management at the queue level; bandwidth and propagation delay at the link level; shared-buffer management strategies at the device level; and topology and routing schemes at the network level. In addition, there are complex interactions between these factors.

(2) The granularity that operators care about varies greatly across network scenarios. In wide area networks, operators primarily focus on the long-term average performance of aggregated traffic, where path-level steady-state modeling is usually sufficient to guide planning processes such as traffic engineering. In local area networks and cloud data center networks, operators are more concerned with meeting performance metrics such as latency and throughput, as well as network infrastructure utilization. Fine-grained network performance observation, which provides precise information about when and which traffic is being interfered with, is a goal that network operators and cloud providers continuously strive for. This requires network models to support flow-level time-series performance prediction.

3.3. Main Challenges

(1) Challenges related to the large state space. Corresponding to the expressiveness requirement, the number of potential scenarios a network model faces is enormous: network systems typically consist of dozens to hundreds of network nodes, each of which may carry multiple configurations, leading to a combinatorial explosion of potential states. One simple way to build a network model is to construct a large neural network that takes a flat feature vector containing all configuration information as input. However, the input size of such a neural network is fixed, so it cannot scale to an arbitrary number of nodes and configurations, and its complexity grows with the number of configurations, making it difficult to train and generalize.

(2) Challenges related to modeling granularity. Unlike aggregated end-to-end path-level traffic, the transmission behavior of individual flows exhibits cascading effects because it is typically governed by a control loop (e.g., congestion control). Once control-related configurations (e.g., the ECN threshold or queue buffer size) change during flow transmission, the resulting traffic measurements (e.g., throughput and packet loss) change significantly, and the traffic state measured before the change no longer reflects its results. Therefore, predicting flow-level performance may be harder than inferring QoS from traffic measurements. In this document, using traffic measurements as input to derive the corresponding QoS is called "inference", while using traffic demand as input to output flow-level performance (e.g., flow completion time) for a hypothetical scenario is called "prediction".

4. Modeling Digital Twin Networks

4.1. Consideration/Analysis on Network Modeling Methods

Traditional network modeling typically uses methods such as queuing theory and network calculus, which model the network mainly from the perspective of queues and their forwarding capabilities. In operator networks, network elements come from different device vendors with varying processing capabilities, and these differences lack precise quantification, so modeling networks built from such devices is a very complex task. Beyond queue forwarding behavior, the network is also influenced by various configuration policies and related features (such as ECN and policy routing); coupled with the variability of network size, these methods are difficult to adapt to the modeling requirements of digital twin networks.

In recent years, the academic community has proposed data-driven graph neural network (GNN) methods, which extend neural networks to systems represented as graphs. Networks themselves are graph structures, and GNNs can learn complex network behavior from data; their ability to model non-linear relationships and adapt to different types of data improves the expressiveness and granularity of network modeling. Combining GNNs with graph sampling techniques improves this further: subgraphs are sampled from the original network based on specific criteria, such as connectivity degree and centrality, and these subgraphs are then used to train a GNN model that captures the most relevant network features. Experimental results show that this method can improve the accuracy and granularity of network modeling compared to traditional techniques.
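The degree-based subgraph sampling mentioned above can be sketched as follows. This is a toy illustration, not the draft's implementation: node degree stands in for the "connectivity and centrality" criteria, and all names are hypothetical.

```python
def sample_subgraph(adj, k):
    """Keep the k highest-degree nodes and the edges among them.
       adj: {node: [neighbor, ...]} adjacency lists."""
    # rank nodes by degree (a simple stand-in for centrality criteria)
    keep = set(sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:k])
    # retain only edges whose both endpoints survive the sampling
    return {n: [m for m in adj[n] if m in keep] for n in keep}

adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
sub = sample_subgraph(adj, 3)  # drops the lowest-degree node "d"
```

In a training pipeline, many such subgraphs would be sampled and fed to the GNN so that it sees relevant local structures without processing the full topology at once.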

This document will introduce a method of network modeling using graph neural networks (GNNs) as a technical option for providing network modeling for DTN.

4.2. Network Modeling Framework

+--------------------+
| +----------------+ | +----------------------+   +-----------------+
| |    Intent      |-->|Network Graph Abstract|-->|NGN Configuration|
| +----------------+ | +----------^-----------+   +-------+---------+
|                    |            |                       |
| +----------------+ |            |              +--------V---------+
| |Domain Knowledge|--------------+              | State Transition |
| +----------------+ |                           |Model Construction|
|                    |                           +--------+---------+
|                    |                                    |
| +----------------+ |    +---------------+     +---------V---------+
| |     Data       |----->|Model Training |<----| Network Model Desc|
| +----------------+ |    +-------+-------+     +-------------------+
|                    |            |
|  Target Network    |    +-------V-------+
+--------------------+    | Network Model |
                          +---------------+

        Figure 1: Network modeling design process

Network modeling design process:

1. Before modeling, determine the network configurations and modeling granularity based on the modeling intent.

2. Use domain knowledge from network experts to abstract the network system into a network relationship graph to represent the complex relationships between different network entities.

3. Build the network model using configurable graph neural network modules and determine the form of the aggregation function based on the properties of the relationships.

4. Use a recurrent graph neural network to model the changes in network state between adjacent time steps.

5. Train the model parameters using the collected data.

4.3. Building a Network Model

This section describes the process and results of network modeling, i.e., Steps 2 to 5 of the network modeling design process in Section 4.2.

4.3.1. Networking System as a Relation Graph

The network system is represented as a heterogeneous relationship graph (hereafter "graph") to provide a unified interface for simulating various network configurations and their complex relationships. Network entities related to performance are mapped to graph nodes with relevant characteristics. Heterogeneous nodes represent different network entities according to their attributes or configurations, and edges connect nodes that are considered directly related. There are two types of nodes in the graph: physical nodes representing concrete network entities with local configurations (e.g., switches with buffers of a certain size), and virtual nodes representing performance-related entities (e.g., flows or paths), allowing final performance metrics to be attached to the graph. Edges reflect the relationships between entities and can be used to embed biases induced by domain knowledge; specifically, edges can model local or global configurations.
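As an illustrative sketch (not part of the draft; all names and attribute choices are hypothetical), such a relation graph can be held as typed nodes plus adjacency lists, with physical nodes (here, links) carrying configuration and virtual nodes (here, flows) connected to the links they traverse:

```python
from collections import defaultdict

def build_relation_graph(links, flows):
    """links: {link_id: {"bandwidth": ..., "delay": ...}}  (physical)
       flows: {flow_id: {"size": ..., "path": [link_id, ...]}}  (virtual)"""
    graph = {"nodes": {}, "edges": defaultdict(list)}
    for lid, cfg in links.items():           # physical nodes with config
        graph["nodes"][("link", lid)] = dict(cfg)
    for fid, cfg in flows.items():           # virtual nodes for flows
        graph["nodes"][("flow", fid)] = {"size": cfg["size"]}
        for lid in cfg["path"]:              # edge: flow traverses link
            graph["edges"][("flow", fid)].append(("link", lid))
            graph["edges"][("link", lid)].append(("flow", fid))
    return graph

g = build_relation_graph(
    links={"l1": {"bandwidth": 10, "delay": 1.0}},
    flows={"f1": {"size": 1500, "path": ["l1"]}},
)
```

Typed node keys such as `("link", "l1")` keep the graph heterogeneous: per-type neural functions can later be dispatched on the first element of the key.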

4.3.2. Message-passing on the Heterogeneous Graph

This document uses Networking Graph Networks (NGN) [battaglia2018] as the fundamental building block for network modeling. An NGN module is a "graph-to-graph" module with heterogeneous nodes: it takes an attributed graph as input and, after a series of message-passing steps, outputs another graph with different attributes. Attributes represent node features and are tensors of fixed dimensions. Each NGN block contains multiple configurable functions, such as aggregation, transformation, and update functions, which can be implemented with standard neural networks and are shared among same-type nodes. The aggregation function can take the form of a simple sum or an RNN, while the transformation function maps the information of heterogeneous nodes into the hidden space of the target node type, allowing unified operations in the update function without limiting the modeling capability of the GNN.

One feed-forward NGN pass can be viewed as one step of message passing on the graph. In each round of message passing, nodes aggregate same-type messages using the corresponding aggregation function and transform the aggregated messages using the type transformation function to handle heterogeneous nodes. The transformed messages are then fed into the update function to update the node's state. After a specified number of rounds of message passing, a readout function is used to predict the final performance metric.
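The round of message passing described above can be sketched as follows. This is a toy illustration with scalar node states: the plain callables stand in for the learned per-type transformation and update networks, and none of the names come from [battaglia2018].

```python
from collections import defaultdict

def message_passing_round(graph, state, transform, update):
    """graph: {"nodes": [(type, id), ...], "edges": {node: [neighbor, ...]}}
       state: {node: float}.  transform[src_type] and update[dst_type]
       stand in for learned functions shared among same-type nodes."""
    new_state = {}
    for node in graph["nodes"]:
        # 1. aggregate same-type messages from neighbors (simple sum)
        agg = defaultdict(float)
        for src in graph["edges"].get(node, []):
            agg[src[0]] += state[src]
        # 2. transform each aggregate by source type, then combine
        msg = sum(transform[t](v) for t, v in agg.items())
        # 3. update the node's state from its old state and the message
        new_state[node] = update[node[0]](state[node], msg)
    return new_state

# Toy graph: one flow traversing one link.
graph = {"nodes": [("flow", "f1"), ("link", "l1")],
         "edges": {("flow", "f1"): [("link", "l1")],
                   ("link", "l1"): [("flow", "f1")]}}
state = {("flow", "f1"): 1.0, ("link", "l1"): 2.0}
transform = {"flow": lambda v: 0.5 * v, "link": lambda v: 0.1 * v}
update = {"flow": lambda s, m: s + m, "link": lambda s, m: s + m}
state = message_passing_round(graph, state, transform, update)
```

Running this function several times corresponds to the multiple rounds of message passing that resolve circular dependencies between update operations.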

Typically, NGNs first perform a global update and then independent local updates for nodes in each local domain. Circular dependencies between different update operations can be resolved through multiple rounds of message passing.

4.3.3. State Transition Learning

The network model needs to support fine-grained prediction and transient prediction (such as the state of a flow) at short time scales. To achieve this, this document uses a recurrent form of the NGN module that learns to predict future states from the current state. The model runs one time step at a time and has an "encoder-processor-decoder" structure.

                       +-------------------+
                       | +--------------+  |
                       | | +----------+ |  |
G_hidden(t-1)---^----->| +>| NGN_core |-+  |------+----->G_hidden(t)
                |      |   +----------+    |      |
         +------+----+ |Message passing x M| +----V------+
G_in(t)->|NGN_encoder| +-------------------+ |NGN_decoder|->G_out(t)
         +-----------+      Processor        +-----------+

         Figure 2: State transition learning

These three components are NGN modules with the same abstract graph but different neural network parameters.

Encoder: converts the input state into a fixed-dimensional vector, encoding each node independently, ignoring relationships between nodes, and performing no message passing.

Processor: performs M rounds of message passing, with the input being the output of the encoder and the previous output of the processor.

Decoder: decodes each node independently as the readout function, extracting dynamic information from the hidden graph, including the current performance metrics and the state used for the next state update. Note that the next input graph G_in(t+1) is updated according to G_out(t), which is not shown in Figure 2.
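The per-time-step computation and the recurrent rollout can be sketched as follows. This is a schematic with scalar "graphs" and plain callables in place of the three NGN modules; the function names are illustrative only.

```python
def state_transition_step(g_in, g_hidden, encoder, processor, decoder, M=3):
    """One time step of the encoder-processor-decoder structure.
       The encoder and decoder act per node without message passing;
       the processor runs M rounds of message passing, taking the
       encoder output and the previous hidden graph as input."""
    h = encoder(g_in)                        # encode input state
    for _ in range(M):
        g_hidden = processor(h, g_hidden)    # "Message passing x M"
    g_out = decoder(g_hidden)                # readout: metrics + next state
    return g_out, g_hidden

# Toy scalar stand-ins to show the rollout over three time steps.
enc = lambda g: g * 2.0
proc = lambda h, s: h + 0.1 * (s if s is not None else 0.0)
dec = lambda s: s
outputs, hidden = [], None
for g_in in [1.0, 1.0, 1.0]:
    g_out, hidden = state_transition_step(g_in, hidden, enc, proc, dec, M=1)
    outputs.append(g_out)
```

The hidden graph G_hidden carried across iterations is what makes the module recurrent: each step's output depends on the accumulated state, not only on the current input.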

To support state transition modeling, the model distinguishes between the static and dynamic features of the network system and represents them as different graphs. The static graph contains the static configuration of the system, including physical node configurations (such as queue priorities and switch buffer sizes) and virtual node configurations (such as flow sizes). The dynamic graph contains the temporary state of the system, mainly related to virtual nodes (such as the remaining size of a flow or end-to-end delay of a path). In addition, when considering dynamic configurations (such as time-varying ECN thresholds), the actions taken (i.e., new configurations) should be placed in the dynamic graph and input at each time step.
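For illustration, the static/dynamic split might look like the following fragment (the attribute names and values are assumed, not taken from the draft):

```python
# Static graph: configuration that does not change during a rollout.
static_graph = {
    ("switch", "s1"): {"buffer_size": 16_000_000, "queue_priority": 0},
    ("flow", "f1"): {"size": 1_000_000},
}

# Dynamic graph: per-time-step state, mainly on virtual nodes, plus any
# time-varying configuration actions (e.g., a new ECN threshold) that
# are fed in at each step.
dynamic_graph = {
    ("flow", "f1"): {"remaining_size": 400_000},
    ("path", "p1"): {"e2e_delay_ms": 1.8},
    ("switch", "s1"): {"ecn_threshold": 200},
}
```

Only the dynamic graph is re-supplied (or predicted) each time step; the static graph is encoded once and reused across the whole trajectory.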

4.3.4. Model Training

The L2 loss between the predicted values and the corresponding ground truth is used to supervise the output features of each node generated by the decoder. To generate long-term prediction trajectories, the model iteratively feeds the updated absolute state predictions back to itself as input. As data pre- and post-processing steps, the inputs and outputs of the NGN model are standardized.
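As a minimal numeric sketch of this objective and the standardization step (pure Python, one scalar feature per node for brevity; these helpers are illustrative, not the draft's implementation):

```python
def standardize(values, mean, std):
    """Pre-processing: map features to zero mean / unit variance."""
    return [(v - mean) / std for v in values]

def destandardize(values, mean, std):
    """Post-processing: map model outputs back to the original scale."""
    return [v * std + mean for v in values]

def l2_loss(pred, true):
    """Mean squared (L2) loss between predictions and ground truth,
       averaged over the nodes' output features."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

z = standardize([10.0, 20.0], mean=15.0, std=5.0)   # -> [-1.0, 1.0]
loss = l2_loss([1.0, 2.0], [1.0, 4.0])              # -> 2.0
```

In iterative rollout training, `destandardize` would be applied to each step's prediction before it is fed back as the next absolute input state.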

4.4. Model Performance in Data Center Networks and Wide Area Networks

4.4.1. QoS Inference in Data Center Networks

This use case aims to verify whether the model can accurately perform time-series inference and generalize to unseen configurations, demonstrating the application of online performance monitoring. The network model must infer the evolution of path-level latency over time given real-time measurements of traffic on the given path. The dataset used in this scenario is generated by ns-3 [NS-3]. Under specific experimental settings, the MAPE of path-level latency can be kept below 7% [wang2022].

4.4.2. Time-Series Prediction in Data Center Networks

This use case verifies whether the model can provide flow-level time-series modeling under different configurations. Unlike the previous case, the network model here behaves like a network simulator: it must predict the Flow Completion Time (FCT) without any traffic measurements, using only flow descriptions and static topology information as input. The dataset used in this scenario is generated by ns-3 [NS-3]. Under specific experimental settings, the predicted FCT distribution matches the true distribution well, with a Pearson correlation coefficient of 0.9 [wang2022]. The model can also predict throughput, latency, and other path/flow-level metrics in time series. This use case verifies the model's ability in time-series prediction, and theoretical analysis combined with experimental verification shows that the model does not accumulate errors in long-term time-series prediction.

4.4.3. Steady-State QoS Inference in Wide Area Networks

This use case aims to verify that the model can work in the Wide Area Network (WAN) scenario and that it can effectively model and generalize to global and local configurations, reflecting the application of offline network planning. Notably, the WAN scenario involves more topology changes than the data center scenario, which imposes higher demands on the model's performance. The public network modeling dataset [NM-D] is used in this scenario for evaluation. Under specific experimental settings, the model is verified on three different WAN topologies (NSFnet, GEANT2, and RedIRIS) and achieves a 50th-percentile APE of 10% for path-level latency, comparable to the performance of the domain-specific model RouteNet [rusek2019]. This use case verifies the model's generalization across topologies and configurations and its versatility in the scenario.

5. Conclusion

This draft presents a network performance modeling method based on graph neural networks, addressing the challenges of expressiveness and modeling granularity in network modeling. The model's versatility and generalization are verified in typical network scenarios, and good simulation-based performance prediction is achieved.

6. Security Considerations


7. IANA Considerations


8. Informative References

[battaglia2018]
           Battaglia, P. W., Hamrick, J. B., Bapst, V.,
           Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M.,
           Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., and
           others, "Relational inductive biases, deep learning, and
           graph networks", 2018.

[I-D.irtf-nmrg-network-digital-twin-arch]
           Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu,
           Q., Boucadair, M., and C. Jacquenet, "Digital Twin Network:
           Concepts and Reference Architecture", Work in Progress,
           Internet-Draft, draft-irtf-nmrg-network-digital-twin-arch-02.

[NM-D]     "Network Modeling Datasets".

[NS-3]     "Network Simulator, NS-3".

[rusek2019]
           Rusek, K., Suarez-Varela, J., Mestres, A., Barlet-Ros, P.,
           and A. Cabellos-Aparicio, "Unveiling the potential of Graph
           Neural Networks for network modeling and optimization in
           SDN", 2019.

[wang2022] Wang, M., Hui, L., Cui, Y., Liang, R., and Z. Liu, "xNet:
           Improving Expressiveness and Granularity for Network
           Modeling with Graph Neural Networks", IEEE INFOCOM, 2022.


Authors' Addresses

Yong Cui
Tsinghua University
30 Shuangqing Rd, Haidian District
Yunze Wei
Tsinghua University
30 Shuangqing Rd, Haidian District
Zhiyong Xu
Tsinghua University
30 Shuangqing Rd, Haidian District
Peng Liu
China Mobile
No.32 XuanWuMen West Street
Zongpeng Du
China Mobile
No.32 XuanWuMen West Street