Network Management Research Group                             M-S. Kim
Internet-Draft                                                     ETRI
Intended status: Informational                                 Y-H. Han
Expires: September 12, 2019                                   KoreaTech
                                                              Y-G. Hong
                                                                   ETRI
                                                         March 11, 2019

      Intelligent Reinforcement-learning-based Network Management
                          draft-kim-nmrg-rl-04

Abstract

   This document presents intelligent network management scenarios
   based on reinforcement-learning approaches.
   Today's heterogeneous networks are expected to provide real-time
   connectivity, network management that preserves the quality of
   real-time data, and transmission services for the application
   traffic generated by operating systems.  For this reason, an
   intelligent management system is needed to support real-time
   connectivity and protection, through efficient handling of
   interfering network traffic, for high-quality data transmission in
   both cloud and IoE network systems.  Reinforcement-learning is a
   machine learning technique that can intelligently and autonomously
   support management systems operating over a communication network.
   It has been developed and extended with deep learning, following
   model-driven or data-driven approaches, and these techniques have
   been widely applied to build adaptive networking models with
   effective strategies against environmental disturbances across a
   variety of networking areas.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 12, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Conventions and Terminology . . . . . . . . . . . . . . . . .  3
   3.  Theoretical Approaches  . . . . . . . . . . . . . . . . . . .  4
     3.1.  Reinforcement-learning  . . . . . . . . . . . . . . . . .  4
     3.2.  Deep-reinforcement-learning . . . . . . . . . . . . . . .  4
     3.3.  Advantage Actor Critic (A2C)  . . . . . . . . . . . . . .  4
     3.4.  Asynchronous Advantage Actor Critic (A3C) . . . . . . . .  5
   4.  Reinforcement-learning-based Process Scenarios  . . . . . . .  5
     4.1.  Single-agent with Single-model  . . . . . . . . . . . . .  6
     4.2.  Multi-agents Sharing Single-model . . . . . . . . . . . .  6
     4.3.  Adversarial Self-Play with Single-model . . . . . . . . .  6
     4.4.  Cooperative Multi-agents with Multiple-models . . . . . .  6
     4.5.  Competitive Multi-agents with Multiple-models . . . . . .  7
   5.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .  7
     5.1.  Intelligent Edge-computing for Traffic Control using
           Deep-reinforcement-learning . . . . . . . . . . . . . . .  7
     5.2.  Edge Computing System at a Construction Site using
           Reinforcement-learning  . . . . . . . . . . . . . . . . .  7
     5.3.  Deep-reinforcement-learning-based Cyber Physical
           Management Control System over a Network  . . . . . . . .  8
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  9
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  9
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  9
     8.1.  Normative References  . . . . . . . . . . . . . . . . . .  9
     8.2.  Informative References  . . . . . . . . . . . . . . . . .  9
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . . 11

1.  Introduction

   Reinforcement-learning for intelligent and autonomous network
   management is, in general, a challenging method to apply in dynamic,
   complex, and cluttered network environments.  Such an intelligent
   approach requires the development of computational systems on a
   single node or across a large set of distributed networking nodes,
   where the environment offers only limited and incomplete knowledge.

   Reinforcement-learning can be an effective technique for
   transferring and sharing information through the global environment,
   as it does not require a priori knowledge of the agent behavior or
   of the environment to accomplish its tasks [Megherbi].  Such
   knowledge is instead acquired and learned repeatedly and
   autonomously by trial and error.  Reinforcement-learning is also one
   of the machine learning techniques expected to be adapted to various
   networking environments for automated networks
   [I-D.jiang-nmlrg-network-machine-learning].

   Deep-reinforcement-learning, recently proposed as an extension of
   reinforcement-learning, can act as a more powerful model-driven or
   data-driven model in a large state space and overcome the limits of
   the classical reinforcement-learning process.  Classical
   reinforcement-learning is, however, of limited use in networking
   areas, since networking environments consist of significantly large
   and complex components in fields such as routing configuration,
   optimization, and system management; deep-reinforcement-learning can
   handle much more state information in the learning process [MS].

   There are many network management problems to solve intelligently,
   such as connectivity, traffic management, and low-latency Internet
   access.  Reinforcement-learning-based approaches can provide
   specific solutions for several of these cases beyond human operating
   capacity, although this remains a challenging area for a multitude
   of reasons: a large state space, complexity in assigning rewards,
   difficulty in controlling actions, and difficulty in sharing and
   merging trained knowledge held in distributed memory nodes and
   transferred over a communication network [MS].

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Theoretical Approaches
3.1.  Reinforcement-learning

   Reinforcement-learning is an area of machine learning concerned with
   how software agents should take actions in an environment so as to
   maximize some notion of cumulative reward [Wikipedia].
   Reinforcement-learning is normally used with a reward issued by a
   centralized node (the global brain) and is capable of autonomously
   acquiring and incorporating knowledge.  The agent continuously
   improves itself and becomes more efficient as it learns from its own
   experience, optimizing management performance through an autonomous
   learning process [Sutton][Madera].

3.2.  Deep-reinforcement-learning

   Advanced reinforcement-learning techniques are increasingly combined
   with deep learning in neural networks, which has made it possible to
   extract high-level features from raw data, for example in computer
   vision [Krizhevsky].  Deep-learning models such as convolutional
   neural networks and recurrent neural networks raise many challenges
   when used with the reinforcement-learning approach.  The benefit of
   deep-learning applications is the large number of networking models
   they enable; the problematic issue is that complex and cluttered
   networking structures require large amounts of labelled training
   data.

   Recently, advances in training deep neural networks have been used
   to develop a novel artificial agent, termed a deep Q-network (a
   deep-reinforcement-learning network), that can learn successful
   policies directly from high-dimensional sensory inputs using end-to-
   end reinforcement-learning [Mnih].

   Deep-reinforcement-learning (the deep Q-network) enables more
   extensive and powerful scenarios for building networking models with
   optimized action control, very large system state spaces, and real-
   time reward functions.  Moreover, the technique has a significant
   advantage in handling highly sequential data in a large model state
   space [MS].  In particular, the data distribution in reinforcement-
   learning changes as the agent learns new behaviors, which is a
   problem for deep-learning approaches that assume a fixed underlying
   distribution [Mnih].

3.3.  Advantage Actor Critic (A2C)

   Advantage Actor Critic is an intelligent reinforcement-learning
   model based on the policy-gradient approach.  It can optimize a deep
   neural network controller with reinforcement-learning algorithms,
   and it has been shown that parallel actor-learners have a
   stabilizing effect on training and allow all of these methods to
   successfully train neural network controllers [Volodymyr].  Even
   though earlier deep-reinforcement-learning algorithms with
   experience replay memory perform very well in challenging control
   domains, they still require more memory and computational power
   because of their off-policy learning methods.  Actor-critic
   algorithms appeared to make up for these limitations.

   The Advantage Actor Critic method, consisting of an actor and a
   critic, implements generalized policy iteration, alternating between
   a policy evaluation step and a policy improvement step.  The actor
   is the policy-based part: it improves the current policy toward the
   best available next action.  The critic is the value-based part: it
   evaluates the current policy and reduces variance through
   bootstrapping.  The combination is more stable and effective than
   pure policy-gradient methods [MS].
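   The following is a minimal sketch of the one-step advantage actor-
   critic update described above, written in Python on a toy problem.
   The environment, the state and action sizes, and the hyper-
   parameters are hypothetical and chosen only for illustration; the
   temporal-difference (TD) error computed by the critic serves as the
   advantage estimate that scales the actor's policy-gradient step.

      # Minimal one-step advantage actor-critic (A2C) sketch on a toy MDP.
      # The MDP, sizes, and hyper-parameters are illustrative assumptions.
      import numpy as np

      N_STATES, N_ACTIONS = 4, 2
      rng = np.random.default_rng(0)

      theta = np.zeros((N_STATES, N_ACTIONS))  # actor (softmax) parameters
      value = np.zeros(N_STATES)               # critic: state-value estimates
      alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.95

      def policy(state):
          prefs = theta[state] - theta[state].max()
          p = np.exp(prefs)
          return p / p.sum()

      def env_step(state, action):
          # Hypothetical environment: action 1 in the last state earns
          # reward 1, and states are visited in a fixed cycle.
          reward = 1.0 if (state == N_STATES - 1 and action == 1) else 0.0
          return (state + 1) % N_STATES, reward

      state = 0
      for _ in range(5000):
          probs = policy(state)
          action = rng.choice(N_ACTIONS, p=probs)
          next_state, reward = env_step(state, action)

          # Critic: the TD error doubles as the advantage estimate A(s, a).
          td_error = reward + gamma * value[next_state] - value[state]
          value[state] += alpha_critic * td_error

          # Actor: policy-gradient step weighted by the advantage.
          grad_log = -probs
          grad_log[action] += 1.0
          theta[state] += alpha_actor * td_error * grad_log

          state = next_state

      print(np.round(policy(N_STATES - 1), 2))  # learns to prefer action 1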
3.4.  Asynchronous Advantage Actor Critic (A3C)

   Asynchronous Advantage Actor Critic is an updated algorithm based on
   Advantage Actor Critic.  The main concept is to run multiple
   environments in parallel and to run the agents asynchronously
   instead of relying on experience replay.  The parallel environments
   reduce the correlation of the agents' data and let each agent
   experience a variety of states, so that the learning process becomes
   closer to a stationary process.  The algorithm is beneficial from a
   practical point of view, since it achieves good learning performance
   even on a general multi-core CPU.  In addition, it can be applied to
   continuous as well as discrete action spaces and has the advantage
   of training both feedforward and recurrent agents [MS].

   The A3C algorithm also admits a number of complementary improvements
   to the neural network architecture.  It has been shown to produce
   more accurate estimates of Q-values by including separate streams
   for the state value and the advantage in the network, improving both
   value-based and policy-based methods by making it easier for the
   network to represent the relevant features [Volodymyr].

4.  Reinforcement-learning-based Process Scenarios

   With a single agent or multiple agents trained for intelligent
   network management, a variety of training scenarios are possible,
   depending on how the agents interact and how many models are linked
   to them.  The following are possible RL training scenarios for
   network management.

4.1.  Single-agent with Single-model

   This is the traditional scenario of training a single agent that
   tries to achieve one goal related to network management.  The agent
   receives all information and rewards from a network (or a simulated
   network) and decides the appropriate action for the current network
   status.

4.2.  Multi-agents Sharing Single-model

   In this scenario, multiple agents share a single model and a single
   goal linked to that model.  Each agent, however, is connected to an
   independent part of the network, or to an independent network of its
   own, so that the agents receive different information and rewards
   from their respective environments.  The agents therefore gather
   different experiences on their connected networks, but this does not
   mean that their training behavior for network management will
   diverge: every agent's experience is used to train the same single
   model.  This scenario is a parallelized version of the traditional
   'Single-agent with Single-model' scenario, which can speed up the RL
   training process and stabilize the model's behavior.

4.3.  Adversarial Self-Play with Single-model

   This scenario contains two interacting agents with inverse reward
   functions linked to a single model.  It gives an agent a perfectly
   matched opponent: itself, and trains the agent to become
   increasingly skilled at network management.  Inverse rewards are
   used so that the opposing agent is punished whenever an agent
   receives a positive reward, and vice versa.  The two agents are
   linked to a single model for network management, and the model is
   trained and stabilized while both agents interact in this
   conflicting manner.
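   As an illustration of this scenario, the following Python sketch
   shows two agents drawing actions from one shared model, with every
   experience fed back into that model twice: once with the reward r
   and once with the inverse reward -r.  The simulated network
   interaction, the state, and the action names are hypothetical
   placeholders used only for this example.

      # Structural sketch of adversarial self-play with one shared model.
      # Environment, states, actions, and rewards are assumptions.
      import random

      class SharedModel:
          """Single model (here a tiny Q-table) trained from both agents."""
          def __init__(self):
              self.q = {}

          def update(self, state, action, reward, lr=0.1):
              key = (state, action)
              old = self.q.get(key, 0.0)
              self.q[key] = old + lr * (reward - old)

          def act(self, state, actions, epsilon=0.2):
              if random.random() < epsilon:
                  return random.choice(actions)
              return max(actions, key=lambda a: self.q.get((state, a), 0.0))

      def simulated_network_step(action_a, action_b):
          # Hypothetical zero-sum interaction: what agent A gains in
          # priority, its self-play opponent B loses, so the rewards are
          # exact inverses.
          return 1.0 if action_a != action_b else -1.0

      model = SharedModel()
      ACTIONS = ["low_priority", "high_priority"]

      for episode in range(1000):
          state = "congested"              # placeholder network state
          a = model.act(state, ACTIONS)    # agent A uses the shared model
          b = model.act(state, ACTIONS)    # agent B, the self-play opponent
          r = simulated_network_step(a, b)
          model.update(state, a, r)        # A's experience with reward r
          model.update(state, b, -r)       # B's experience with reward -r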
4.4.  Cooperative Multi-agents with Multiple-models

   In this scenario, two or more interacting agents share a common
   reward function linked to multiple different models for network
   management.  A common goal is set up, and all agents are trained to
   achieve together a goal that would be hard to achieve alone.
   Usually, each agent has access only to partial information about the
   network status and determines an appropriate action by using its own
   model.  Each action is taken independently in order to accomplish a
   management task and collaboratively achieve the common goal.

4.5.  Competitive Multi-agents with Multiple-models

   This scenario contains two or more interacting agents with diverse
   reward functions linked to multiple different models.  The agents
   compete with one another to obtain a limited set of network
   resources, each trying to achieve its own goal.  In a network there
   will be tasks with different management objectives, which leads to
   multi-objective optimization problems that are generally difficult
   to solve analytically.  This scenario is suitable for solving such
   multi-objective optimization problems related to network management
   by letting each agent solve a single-objective problem while
   competing with the others.

5.  Use Cases

5.1.  Intelligent Edge-computing for Traffic Control using
      Deep-reinforcement-learning

   Edge computing is a concept that allows data from a variety of
   devices to be analyzed directly at, or near, the place where the
   data is produced, rather than being sent to a centralized data
   center such as the cloud.  As such, edge computing supports data
   flow acceleration by processing data with low latency in real time.
   In addition, by processing large amounts of data efficiently near
   the source, it also reduces Internet bandwidth usage.

   Deep-reinforcement-learning is a useful technique for improving
   system performance in an intelligent edge-controlled service system
   that requires fast response times, reliability, and security.  Deep-
   reinforcement-learning is a model-free approach, so algorithms such
   as DQN, A2C, and A3C can be adopted to resolve network problems in
   time-sensitive systems.

5.2.  Edge Computing System at a Construction Site using
      Reinforcement-learning

   On a construction site there are many dangerous conditions, such as
   noise, gas leaks, and vibration, that require alerts.  A real-time
   monitoring system that detects these alerts using machine learning
   techniques can therefore provide a more effective way to recognize
   dangerous conditions on the site.

   Typically, CCTV (closed-circuit television) cameras monitor these
   conditions by streaming locally and continuously from the
   construction site.  It is ineffective and wasteful, however, for the
   CCTV system to constantly stream unchanging scenes in high
   definition.  When an alert is raised by one of the dangerous
   conditions, the stream should instead be switched to high-quality
   streaming data so that the dangerous situation can be shown and
   detected rapidly.  Technically, deep-reinforcement-learning can
   provide a solution that automatically detects these kinds of
   dangerous situations and predicts them in advance.  It can also
   deliver the transformed data, including the high-rate streaming
   video, and quickly prevent further risks.  Deep-reinforcement-
   learning thus plays an important role in efficiently managing and
   monitoring the given data in real time.
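   The stream-quality decision described above can be cast as a small
   reinforcement-learning problem.  The Python sketch below uses
   tabular Q-learning with two states (alert or no alert) and two
   actions (low or high bitrate); the reward values and the alert
   probability are assumptions made only for this example, and a real
   deployment would obtain the alert state from the edge monitoring
   system.

      # Q-learning sketch for switching CCTV stream quality on alerts.
      # States, actions, rewards, and alert statistics are assumptions.
      import random

      ACTIONS = ["low_bitrate", "high_bitrate"]
      STATES = ["no_alert", "alert"]
      q_table = {(s, a): 0.0 for s in STATES for a in ACTIONS}
      alpha, gamma, epsilon = 0.1, 0.9, 0.1

      def reward(state, action):
          # Reward high-definition streaming only while an alert is
          # active; otherwise penalize the wasted uplink bandwidth.
          if state == "alert":
              return 1.0 if action == "high_bitrate" else -1.0
          return 0.2 if action == "low_bitrate" else -0.5

      def next_alert_state():
          # Hypothetical sensor model: alerts (gas, noise, vibration)
          # are rare events.
          return "alert" if random.random() < 0.05 else "no_alert"

      state = "no_alert"
      for _ in range(20000):
          if random.random() < epsilon:
              action = random.choice(ACTIONS)
          else:
              action = max(ACTIONS, key=lambda a: q_table[(state, a)])
          r = reward(state, action)
          nxt = next_alert_state()
          best_next = max(q_table[(nxt, a)] for a in ACTIONS)
          q_table[(state, action)] += alpha * (r + gamma * best_next
                                               - q_table[(state, action)])
          state = nxt

      # The learned policy keeps the stream at low bitrate until an
      # alert is detected and then switches to high bitrate.
      for s in STATES:
          print(s, max(ACTIONS, key=lambda a: q_table[(s, a)]))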
5.3.  Deep-reinforcement-learning-based Cyber Physical Management
      Control System over a Network

   A nonlinear control system such as a cyber physical system presents
   an unstable system environment in its initial control state because
   of its nonlinear nature.  To control this unstable initial state,
   classical mathematical control methods such as the Linear Quadratic
   Regulator (LQR) and Proportional Integral Derivative (PID) control
   are used for successful control and management, but these approaches
   require difficult mathematical modeling and considerable effort.
   Deep-reinforcement-learning can therefore provide a more effective
   technical approach that, compared with those methods, does not
   require a difficult initial set of control states.

   The ultimate purpose of reinforcement-learning is to interact with
   the environment and maximize the target reward value.  At each step
   the state is observed, an action is selected by the policy, and the
   reward assigns a value through the compensation given by the
   environment.  Deep-reinforcement-learning using a Convolutional
   Neural Network (CNN) can provide a better-performing learning
   process for stable control and management.

   Figure 1 shows how the physical environment and the cyber
   environment interact with the reinforcement-learning module over a
   network.  The actions that control the physical environment are
   produced by the reinforcement-learning model based on DQN and
   transferred as data to the physical environment using networking
   communication tools, as shown below.

   +-----Environment-----+         +---Control and Management---+
   .                     .         .                            .
   . +-----------------+ . Network .    +--------------+        .
   . . Physical System . .-------->.    . Cyber Module .        .
   . .                 . .<--------.    .              .        .
   . +-----------------+ .         .    +--------------+        .
   .                     .         .          .     +--------+  .
   +---------------------+         .          .-----.RL Agent.  .
                                   .                +--------+  .
                                   +............................+

     Figure 1: DRL-based Cyber Physical Management Control System
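   As an illustration of the control loop in Figure 1, the following
   Python sketch outlines a DQN-based cyber module written against
   PyTorch.  The physical system is replaced by a placeholder function
   standing in for the plant reached over the network, and the state
   dimension, action set, network size, and hyper-parameters are
   assumptions made for this example rather than values specified by
   this document.

      # DQN sketch for the cyber module of Figure 1 (PyTorch assumed).
      # The plant model and all hyper-parameters are illustrative.
      import random
      from collections import deque

      import torch
      import torch.nn as nn

      STATE_DIM, N_ACTIONS = 4, 3     # e.g., a pendulum-like plant
      GAMMA, EPSILON, BATCH = 0.99, 0.1, 32

      q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                            nn.Linear(64, N_ACTIONS))
      target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
      target_net.load_state_dict(q_net.state_dict())
      optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
      replay = deque(maxlen=10000)

      def physical_system(state, action):
          """Placeholder for the plant reached over the network."""
          next_state = (0.99 * state + 0.01 * (action - 1)
                        + 0.01 * torch.randn(STATE_DIM))
          reward = -float(next_state.abs().sum())  # stay near equilibrium
          return next_state, reward

      state = torch.zeros(STATE_DIM)
      for step in range(10000):
          # Cyber module: epsilon-greedy action from the Q-network.
          if random.random() < EPSILON:
              action = random.randrange(N_ACTIONS)
          else:
              with torch.no_grad():
                  action = int(q_net(state).argmax())

          next_state, reward = physical_system(state, action)
          replay.append((state, action, reward, next_state))
          state = next_state

          if len(replay) >= BATCH:
              batch = random.sample(replay, BATCH)
              s = torch.stack([b[0] for b in batch])
              a = torch.tensor([b[1] for b in batch])
              r = torch.tensor([b[2] for b in batch])
              s2 = torch.stack([b[3] for b in batch])

              q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
              with torch.no_grad():
                  target = r + GAMMA * target_net(s2).max(1).values
              loss = nn.functional.mse_loss(q, target)
              optimizer.zero_grad()
              loss.backward()
              optimizer.step()

          if step % 500 == 0:
              # Periodically synchronize the target network.
              target_net.load_state_dict(q_net.state_dict())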
6.  IANA Considerations

   There are no IANA considerations related to this document.

7.  Security Considerations

   [TBD]

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

8.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", draft-jiang-nmlrg-
              network-machine-learning-02 (work in progress), October
              2016.

   [Wikipedia]
              "Reinforcement learning", Wikipedia,
              <https://en.wikipedia.org/wiki/Reinforcement_learning>.

   [Megherbi] Megherbi, D. B., Kim, M., and M. Madera, "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., and J. Boulenouar, "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and V. C. M. Leung, "How Network Monitoring
              and Reinforcement Learning Can Improve TCP Fairness in
              Wireless Multi-hop Networks", EURASIP Journal on Wireless
              Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and M. Kim, "A Hybrid P2P and Master-
              Slave Cooperative Distributed Multi-Agent Reinforcement
              Learning System with Asynchronously Triggered Exploratory
              Trials and Clutter-index-based Selected Sub goals", IEEE
              CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and R. Bedi, "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and W. Burgard, "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and Y. De Hauwere, "Game Theory and
              Multi-agent Reinforcement Learning", in Reinforcement
              Learning: State of the Art, Adaptation, Learning, and
              Optimization Volume 12, 2012.

   [Kok-Lim]  Yau, K-L. A., Goh, H. G., Chieng, D., and K. H. Kwong,
              "Application of Reinforcement Learning to Wireless Sensor
              Networks: Models and Algorithms", Computing, Volume 97,
              Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and A. G. Barto, "Reinforcement Learning:
              An Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and D. B. Megherbi, "An Interconnected
              Dynamical System Composed of Dynamics-based Reinforcement
              Learning Agents in a Distributed Environment: A Case
              Study", Proceedings of the IEEE International Conference
              on Computational Intelligence for Measurement Systems and
              Applications, Italy, 2012.

   [Al-Dayaa] Al-Dayaa, H. S. and D. B. Megherbi, "Towards A Multiple-
              Lookahead-Levels Reinforcement-Learning Technique and Its
              Implementation in Integrated Circuits", Journal of
              Artificial Intelligence, Journal of Supercomputing, Vol.
              62, Issue 1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and N. Doss, "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.

   [Mnih]     Mnih, V., et al., "Human-level Control Through Deep
              Reinforcement Learning", Nature 518.7540, 2015.

   [Stampa]   Stampa, G., Arias, M., et al., "A Deep-Reinforcement
              Learning Approach for Software-Defined Networking Routing
              Optimization", arXiv (cs.NI), 2017.

   [Krizhevsky]
              Krizhevsky, A., Sutskever, I., and G. Hinton, "ImageNet
              Classification with Deep Convolutional Neural Networks",
              in Advances in Neural Information Processing Systems, pp.
              1106-1114, 2012.

   [Volodymyr]
              Mnih, V., et al., "Asynchronous Methods for Deep
              Reinforcement Learning", ICML, arXiv:1602.01783, 2016.

   [MS]       Kim, M-S., Han, Y-H., and Y-G. Hong, "Intelligent Network
              Management using Reinforcement-learning", draft-kim-nmrg-
              rl-03 (work in progress), 2018.

   [Ju-Bong]  Kim, J-B., et al., "Deep Q-Network Based Rotary Inverted
              Pendulum System and Its Monitoring on the EdgeX
              Platform", International Conference on Artificial
              Intelligence in Information and Communication (ICAIIC),
              2019.

Authors' Addresses

   Min-Suk Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon 305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr

   Youn-Hee Han
   KoreaTech
   Byeongcheon-myeon Gajeon-ri, Dongnam-gu
   Cheonan-si, Chungcheongnam-do
   330-708
   Korea

   Phone: +82 41 560 1486
   Email: yhhan@koreatech.ac.kr

   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon 305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr