Network Management Research Group                              M-S. Kim
Internet-Draft                                                      ETRI
Intended status: Informational                                  Y-H. Han
Expires: January 9, 2020                                       KoreaTech
                                                               Y-G. Hong
                                                                    ETRI
                                                            July 8, 2019


      Intelligent Reinforcement-learning-based Network Management
                          draft-kim-nmrg-rl-05

Abstract

This document presents intelligent network management based on Artificial Intelligence (AI) approaches such as reinforcement-learning.
In a heterogeneous network, AI-based intelligent management should provide real-time connectivity, manage the quality of real-time data, and support the transmission services generated by application services.  For that reason, an intelligent management system is needed to support real-time connection and protection, through efficient management of interfering network traffic, for high-quality network data transmission in both cloud and IoE network systems.  Reinforcement-learning is one of the machine learning approaches that can provide intelligent and autonomous management over a communication network.  Reinforcement-learning has been developed and extended with deep learning techniques, following model-driven or data-driven approaches, and these techniques are now widely used to build adaptive networking models with effective strategies against environmental disturbances in a variety of networking areas.  For network AI with such intelligent and effective strategies, intent-based networking (IBN) can also be considered, since it continuously and automatically evaluates the network status against the required policy for dynamic network optimization.  A key element of an intent-based network is that it verifies whether the expressed network intent is implementable or is currently implemented in the network.  In addition, this approach needs to take action in real time whenever the desired network state and the actual state are inconsistent.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 9, 2020.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Theoretical Approaches
     3.1.  Reinforcement-learning
     3.2.  Deep-reinforcement-learning
     3.3.  Advantage Actor Critic (A2C)
     3.4.  Asynchronous Advantage Actor Critic (A3C)
     3.5.  Intent-based Network (IBN)
   4.  Reinforcement-learning-based Process Scenarios
     4.1.  Single-agent with Single-model
     4.2.  Multi-agents Sharing Single-model
     4.3.  Adversarial Self-Play with Single-model
     4.4.  Cooperative Multi-agents with Multiple-models
     4.5.  Competitive Multi-agents with Multiple-models
   5.  Use Cases
     5.1.  Intelligent Edge-computing for Traffic Control using Deep-reinforcement-learning
     5.2.  Edge Computing System in a Construction Site using Reinforcement-learning
     5.3.  Deep-reinforcement-learning-based Remote Control System over a Software-defined Network
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

Reinforcement-learning for autonomous, intelligent network management is, in general, a challenging approach in dynamic, complex, and cluttered network environments.  The intelligent approach requires the development of computational systems on a single networking node or across large sets of distributed nodes, in environments that involve limited and incomplete knowledge.

Reinforcement-learning can be an effective technique for transferring and sharing information through the global environment, as it does not require a priori knowledge of the agent behavior or of the environment to accomplish its tasks [Megherbi].  Such knowledge is instead acquired repeatedly and autonomously by trial and error.  Reinforcement-learning is also one of the machine learning techniques expected to be adapted to various networking environments for network automation [I-D.jiang-nmlrg-network-machine-learning].

Deep-reinforcement-learning has recently been developed as an extension of reinforcement-learning; it provides more powerful model-driven or data-driven models over large state spaces and overcomes limitations of the classical reinforcement-learning process.  Classical reinforcement-learning is difficult to adopt directly in networking areas, because networking environments consist of very large and complex components in fields such as routing configuration, optimization, and system management; deep-reinforcement-learning can handle much richer state information during the learning process [MS].

There are many different network management problems to solve intelligently, such as connectivity, traffic management, and low-latency Internet service.
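
As a rough illustration of the trial-and-error learning loop described above, the following Python sketch shows a minimal tabular Q-learning agent that learns to pick an outgoing link for a toy set of traffic classes.  The state and action spaces, the delay measurement, and the reward signal are hypothetical placeholders introduced only for this example; they are not part of any mechanism specified by this document.

   import random
   from collections import defaultdict

   # Toy problem: for each traffic class (state), pick one of three
   # outgoing links (action); the reward is assumed to be the negative
   # of a measured delay reported by a monitoring function.
   STATES = ["voice", "video", "bulk"]       # toy traffic classes
   ACTIONS = [0, 1, 2]                       # toy link indices
   ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2     # learning parameters

   q_table = defaultdict(float)              # Q[(state, action)] -> value

   def observe_delay(state, action):
       # Placeholder for a real measurement; here a noisy synthetic value.
       return random.uniform(1, 10) / (action + 1)

   def choose_action(state):
       # Epsilon-greedy exploration over the toy action set.
       if random.random() < EPSILON:
           return random.choice(ACTIONS)
       return max(ACTIONS, key=lambda a: q_table[(state, a)])

   for episode in range(1000):
       state = random.choice(STATES)
       action = choose_action(state)
       reward = -observe_delay(state, action)    # lower delay, higher reward
       next_state = random.choice(STATES)        # toy transition model
       best_next = max(q_table[(next_state, a)] for a in ACTIONS)
       # Standard Q-learning update: Q <- Q + alpha * (target - Q)
       q_table[(state, action)] += ALPHA * (
           reward + GAMMA * best_next - q_table[(state, action)])

In a real deployment the state would come from network telemetry and the reward from the management objective, but the update rule stays the same.
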
Reinforcement-learning-based approaches can provide solutions to several of these problems that go beyond human operating capacity, although this remains a challenging area for a multitude of reasons: large state spaces, complexity in assigning rewards, difficulty in controlling actions, and difficulty in sharing and merging trained knowledge held in distributed memory nodes and transferred over a communication network [MS].

In addition, intent-based networking helps to bridge the gap between the network business model and the technical scheme, and to solve some of these network problems.  Intents should be applied to application service levels, security policies, compliance, operational processes, and other business needs.  The network should constantly monitor itself and adjust so that the intent is met.  An intent-based network has to satisfy the following requirements: (1) translation of the intent, (2) automatic policy activation, and (3) assurance through continuous monitoring and verification [Cisco].  By continuously monitoring network data, network information can be collected and analyzed with artificial intelligence techniques; if the analysis shows that a network configuration parameter needs to be changed, an optimized value is derived and the network is reconfigured.

2.  Conventions and Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3.  Theoretical Approaches

3.1.  Reinforcement-learning

Reinforcement-learning is an area of machine learning concerned with how software agents should take actions in an environment so as to maximize some notion of cumulative reward [Wikipedia].  Reinforcement-learning is normally driven by a reward from a centralized node (the global brain) and is capable of autonomously acquiring and incorporating knowledge.  It is continuously self-improving, becoming more efficient as it learns from the agent's experience, which makes it suitable for optimizing management performance through an autonomous learning process [Sutton][Madera].

3.2.  Deep-reinforcement-learning

Advanced reinforcement-learning techniques are now combined with deep learning in neural networks, which has made it possible to extract high-level features from raw data, for example in computer vision [Krizhevsky].  Applying deep-learning models such as convolutional neural networks and recurrent neural networks to the reinforcement-learning approach raises many challenges.  The benefit of deep learning is that many networking models can be built; the problematic issue is that complex and cluttered networking structures typically require large amounts of labelled training data.

Recently, advances in training deep neural networks have produced a novel artificial agent, termed a deep Q-network (a deep-reinforcement-learning network), that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning [Mnih].

Deep-reinforcement-learning (the deep Q-network) can therefore support more extensive and powerful scenarios for building networking models with optimized action control, huge system states, and real-time reward functions.
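
To make the preceding description more concrete, the following sketch shows the core of a DQN-style update in Python using PyTorch.  The state dimensionality, action set, reward, and network sizes are hypothetical stand-ins for real network telemetry and management objectives, and the use of PyTorch is an assumption of this example; the sketch is illustrative only and is not an implementation mandated by this document.

   import random
   import torch
   import torch.nn as nn

   STATE_DIM, NUM_ACTIONS = 8, 4      # hypothetical telemetry size / actions
   GAMMA, EPSILON = 0.99, 0.1

   # Q-network: maps a network-state vector to one Q-value per action.
   q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, NUM_ACTIONS))
   target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                              nn.Linear(64, NUM_ACTIONS))
   target_net.load_state_dict(q_net.state_dict())
   optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

   def select_action(state):
       # Epsilon-greedy selection; 'state' is a 1-D float tensor.
       if random.random() < EPSILON:
           return random.randrange(NUM_ACTIONS)
       with torch.no_grad():
           return int(q_net(state).argmax())

   def dqn_update(state, action, reward, next_state, done):
       # One temporal-difference step toward the bootstrapped target.
       q_value = q_net(state)[action]
       with torch.no_grad():
           bootstrap = 0.0 if done else GAMMA * float(target_net(next_state).max())
           target = torch.tensor(reward + bootstrap)
       loss = nn.functional.mse_loss(q_value, target)
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

In practice, transitions would be stored in a replay buffer, the target network would be refreshed periodically, and the state vectors would be built from actual network measurements.
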
The technique also has a significant advantage in handling highly sequential data in a large model state space [MS].  In particular, the data distribution in reinforcement-learning can change as the agent learns new behaviors, which is a problem for deep learning approaches that assume a fixed underlying distribution [Mnih].

3.3.  Advantage Actor Critic (A2C)

Advantage Actor Critic is an intelligent reinforcement-learning model based on policy gradients.  The approach optimizes deep neural network controllers with reinforcement-learning algorithms, and it has been shown that parallel actor-learners have a stabilizing effect on training and allow these methods to successfully train neural network controllers [Volodymyr].  Although the earlier deep-reinforcement-learning algorithm with an experience replay memory performs well in challenging control domains, it requires more memory and computational power because of its off-policy learning method.  The Advantage Actor Critic algorithm was introduced to compensate for this.

The Advantage Actor Critic method (consisting of an actor and a critic) implements generalized policy iteration, alternating between a policy evaluation step and a policy improvement step.  The actor is the policy-based part that improves the current policy toward the best available next action.  The critic is the value-based part that evaluates the current policy and reduces variance through bootstrapping.  The combination is more stable and effective than pure policy-gradient methods [MS].

3.4.  Asynchronous Advantage Actor Critic (A3C)

Asynchronous Advantage Actor Critic is an updated algorithm based on Advantage Actor Critic.  Its main idea is to run multiple environments in parallel and to run the agents asynchronously, instead of relying on experience replay.  The parallel environments reduce the correlation of the agents' data and let each agent experience a variety of states, so that the learning process becomes closer to a stationary process.  The algorithm is also beneficial from a practical point of view, since it learns well even on a general multi-core CPU.  In addition, it can be applied to continuous as well as discrete action spaces, and it can train both feedforward and recurrent agents [MS].

The A3C algorithm also admits a number of complementary improvements to the neural network architecture; for example, including separate streams for the state value and the advantage in the network has been shown to produce more accurate estimates of Q-values, improving both value-based and policy-based methods by making it easier for the network to represent the relevant features [Volodymyr].

3.5.  Intent-based Network (IBN)

Intent-based Network is a new technical approach that adapts the network flexibly through configuration parameters derived from data analysis for network machine learning.  Software-defined Networking (SDN) is a similar concept; however, SDN alone has not yet tipped the sector toward full network automation.  With the intent-based approach, network machine learning is integrated with network analysis, routing, wireless communications, and resource management.
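
As an informal illustration of the verification step described in the Abstract (checking whether the desired network state matches the actual state and acting when they diverge), the following Python sketch shows a hypothetical intent record and a verify-and-remediate loop.  The intent fields, the telemetry function, and the remediation function are placeholders invented for this example; no standard intent format is implied.

   import time

   # Hypothetical intent: keep latency for the "video" class under 20 ms.
   intent = {"service": "video", "metric": "latency_ms", "max": 20.0}

   def measure(service, metric):
       # Placeholder for real telemetry collection (e.g., streaming counters).
       ...

   def remediate(service):
       # Placeholder for a corrective action, e.g., asking an RL agent for
       # a new routing or queueing configuration and pushing it out.
       ...

   def verify_loop(intent, interval=10):
       # Continuously compare the desired state (intent) with the actual state.
       while True:
           value = measure(intent["service"], intent["metric"])
           if value is not None and value > intent["max"]:
               remediate(intent["service"])   # desired != actual: take action
           time.sleep(interval)
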
However, unlike the field of computer vision, where sufficient data can easily be acquired, it is difficult to obtain data from a real network.  This limits the application to the networking field of machine learning techniques that depend on such data.  Reinforcement learning greatly reduces the importance of securing large amounts of high-quality data, so combining reinforcement learning with intent-based networking may overcome this limitation and close the gap between network machine learning and networking technique.

Intent-based networking also changes how configuration values for network management and operation are applied: instead of specifying them procedurally, the operator declares the intent, and the core of the approach is intent processing that automatically interprets that declarative specification.  Although the basic concepts of intent-based networking have been announced, there is as yet no standardized form of intent-processing technology.  While intent-based networking has the advantage of providing a higher level of abstraction for network management and operation, along with ease of use, a more specific and clear definition of the technology is likely to be needed.

4.  Reinforcement-learning-based Process Scenarios

With a single agent or multiple agents trained for intelligent network management, a variety of training scenarios are possible, depending on how the agents interact and how many models are linked to them.  The following are possible RL training scenarios for network management.

4.1.  Single-agent with Single-model

This is the traditional scenario of training a single agent that tries to achieve one goal related to network management.  The agent receives all of its information and rewards from a network (or a simulated network) and decides an appropriate action for the current network status.

4.2.  Multi-agents Sharing Single-model

In this scenario, multiple agents share a single model and the single goal linked to that model.  Each agent, however, is connected to an independent part of the network, or to an independent network of its own, so the agents receive different information and rewards.  The agents therefore have different experiences on their connected networks, but this does not mean that their training behavior for network management will diverge: each of their experiences is used to train the one shared model.  This scenario is a parallelized version of the traditional 'Single-agent with Single-model' scenario, which can speed up the RL training process and stabilize the shared model's behavior (an informal sketch of this scenario is given after Section 4.3).

4.3.  Adversarial Self-Play with Single-model

This scenario contains two interacting agents with inverse reward functions linked to a single model.  It gives an agent a perfectly matched opponent, itself, and trains the agent to become increasingly skilled at network management.  Inverse rewards are used so that when one agent receives a positive reward, the opposing agent is punished, and vice versa.  Both agents are linked to a single model for network management, and the model is trained and stabilized while the two agents interact in this conflicting manner.
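
As an informal illustration of the 'Multi-agents Sharing Single-model' scenario of Section 4.2, the following Python sketch shows several workers, each observing its own (here simulated) network segment, feeding their experience into one shared Q-table.  The segment behavior, state encoding, and reward are invented placeholders for illustration only.

   import random
   from collections import defaultdict

   ACTIONS = [0, 1, 2]                   # toy action set (e.g., path choices)
   ALPHA, GAMMA = 0.1, 0.9
   shared_q = defaultdict(float)         # the single shared model

   class SegmentWorker:
       """One agent attached to an independent (simulated) network segment."""

       def __init__(self, seed):
           self.rng = random.Random(seed)    # each segment behaves differently
           self.state = "normal"

       def step(self):
           # Epsilon-greedy action chosen on the *shared* model.
           if self.rng.random() < 0.2:
               action = self.rng.choice(ACTIONS)
           else:
               action = max(ACTIONS, key=lambda a: shared_q[(self.state, a)])
           # Toy reward and next state observed only by this worker.
           reward = -self.rng.uniform(1, 10) / (action + 1)
           next_state = self.rng.choice(["normal", "congested"])
           # Every worker's experience updates the same shared Q-table.
           best_next = max(shared_q[(next_state, a)] for a in ACTIONS)
           shared_q[(self.state, action)] += ALPHA * (
               reward + GAMMA * best_next - shared_q[(self.state, action)])
           self.state = next_state

   workers = [SegmentWorker(seed) for seed in range(4)]
   for _ in range(1000):
       for w in workers:                 # round-robin stands in for parallelism
           w.step()

In a real deployment the workers would run in parallel (as in A3C) against live or emulated network segments, and the shared model would typically be a neural network rather than a table.
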
4.4.  Cooperative Multi-agents with Multiple-models

In this scenario, two or more interacting agents share a common reward function linked to multiple different models for network management.  A common goal is set up, and all agents are trained to achieve together a goal that would be hard to achieve alone.  Usually, each agent has access only to partial information about the network status and determines an appropriate action by using its own model.  Actions are taken independently in order to accomplish a management task and collaboratively achieve the common goal.

4.5.  Competitive Multi-agents with Multiple-models

This scenario contains two or more interacting agents with different reward functions linked to multiple different models.  The agents compete with one another for a limited set of network resources while each tries to achieve its own goal.  In a network there are tasks with different management objectives, which leads to multi-objective optimization problems that are generally difficult to solve analytically.  This scenario is suitable for solving such multi-objective optimization problems in network management by letting each agent solve a single-objective problem while competing with the others.

5.  Use Cases

5.1.  Intelligent Edge-computing for Traffic Control using Deep-reinforcement-learning

Edge computing is a concept that allows data from a variety of devices to be analyzed directly at, or near, the place where the data is produced, rather than being sent to a centralized data center such as the cloud.  Edge computing thus accelerates data flows by processing data with low latency in real time.  In addition, because large amounts of data are processed close to the source, Internet bandwidth usage is also reduced.

Deep-reinforcement-learning is a useful technique for improving system performance in an intelligent edge-controlled service system that requires fast response times, reliability, and security.  Deep-reinforcement-learning is a model-free approach, so algorithms such as DQN, A2C, and A3C can be adopted to resolve network problems in time-sensitive systems.

5.2.  Edge Computing System in a Construction Site using Reinforcement-learning

A construction site contains many hazards, such as noise, gas leaks, and vibration, that must trigger alerts.  A real-time monitoring system that detects these conditions using machine learning techniques can therefore provide a more effective way to recognize dangerous elements on the site.

Typically, these hazards are monitored with CCTV (closed-circuit television) cameras that broadcast locally and continuously across the construction site.  It is ineffective and wasteful for the CCTV system to constantly stream unchanging scenes in high definition; however, the stream should be switched to high-quality data so that a dangerous situation can be shown and inspected quickly whenever an alert is raised by one of the hazards.  Deep-reinforcement-learning can provide a technical solution that automatically detects, and even predicts in advance, these kinds of dangerous situations.  It can also trigger the switch to high-rate streaming video and quickly help prevent further risks.  Deep-reinforcement-learning therefore plays an important role in efficiently managing and monitoring the site with the given data in real time.
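
A rough sketch of how the stream-quality decision in this use case could be framed for an RL agent is shown below.  The alert levels, bitrate choices, and reward weights are hypothetical values chosen only for illustration.

   # Hypothetical framing of the CCTV use case as an RL problem.
   ALERT_LEVELS = ["none", "warning", "danger"]   # state: current alert level
   BITRATES = [0.5, 2.0, 8.0]                     # actions: stream bitrate (Mbps)

   def reward(alert, bitrate):
       # Assumed trade-off: always penalize bandwidth use, and penalize low
       # quality heavily only when a hazard must be inspected.
       bandwidth_cost = 0.1 * bitrate
       quality_need = {"none": 0.0, "warning": 1.0, "danger": 4.0}[alert]
       missed_detail = quality_need * (BITRATES[-1] - bitrate)
       return -(bandwidth_cost + missed_detail)

   # With these weights a greedy agent keeps the lowest bitrate while the
   # alert level is "none" and switches to the highest bitrate under "danger".
   for alert in ALERT_LEVELS:
       best = max(BITRATES, key=lambda b: reward(alert, b))
       print(alert, "->", best, "Mbps")

An RL agent would learn such a switching policy from experience with a reward of this general shape, rather than having the rule written by hand.
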
5.3.  Deep-reinforcement-learning-based Remote Control System over a Software-defined Network

A nonlinear control system such as a cyber-physical system presents an unstable environment in its initial control state because of its nonlinear nature.  To stably control the unstable initial state, classical mathematical control methods (Linear Quadratic Regulator, Proportional Integral Derivative control) are used for control and management, but these approaches require difficult mathematical modelling and considerable effort.  Deep-reinforcement-learning can therefore provide a more effective technical approach that, compared with those methods, does not require a difficult initial specification of control states.

The ultimate purpose of reinforcement-learning is to interact with the environment and maximize the target reward value.  At each step the agent observes the state, performs the action chosen by its policy, and is evaluated through the reward given by the environment.  Deep-reinforcement-learning using a Convolutional Neural Network (CNN) can provide a better-performing learning process for stable control and management.

Figure 1 shows how the physical environment and the cyber environment interact with the reinforcement-learning module over a network.  The actions that control the physical environment are produced by the DQN-based reinforcement-learning model and delivered as data to the physical environment using networking communication tools, as shown below.

   +-----Environment-----+           +---Control and Management---+
   .                     .           .                            .
   . +-----------------+ .  Network  .  +--------------+          .
   . . Physical System . .---------->.  . Cyber Module .          .
   . .                 . .<----------.  .              .          .
   . +-----------------+ .           .  +--------------+          .
   .                     .           .        .        +--------+ .
   +---------------------+           .        .--------.RL Agent. .
                                     .                 +--------+ .
                                     +............................+

      Figure 1: DRL-based Cyber Physical Management Control System

In this use case, the reinforcement-learning agent interacts with the remote physical device by exchanging network packets.  A software-defined network controller manages the network traffic transmission, so the system is naturally composed of a cyber environment and a physical environment, and the two environments work closely and synchronously together [Ju-Bong].

For intelligent traffic management in this system, software-defined networking for automation (a basic concept underlying IBN) should be used to control and manage the connection between the cyber-physical system and the edge computing module.  The intelligent approach consists of software that intelligently controls the network and of techniques that allow software to set up and control the network.  Network operation is controlled centrally by software programming, centralizing the switch/router control functions that are based on existing hardware.  This makes it possible to manage the network according to the requirements without detailed per-device network configuration.

In addition, a software-defined networking switch allows the network traffic to be controlled and managed by software-based controllers.
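
As an informal sketch of the interaction in Figure 1, the following Python fragment shows a cyber module that receives state reports from the physical system over the network, asks a (placeholder) DQN policy for an action, and sends the action back.  The JSON message format, port numbers, and address are invented for illustration (the IP address is taken from a documentation range); they are not part of any specified protocol.

   import json
   import socket

   # Hypothetical addressing for the physical system; a real deployment
   # would use whatever transport the SDN-managed network provides.
   PHYSICAL_ADDR = ("192.0.2.10", 9999)

   def choose_action(state):
       # Placeholder for the DQN policy: map the observed state to an action.
       return 0

   def control_loop(steps=100):
       sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
       sock.bind(("", 9998))                  # listen for state reports
       sock.settimeout(1.0)
       for _ in range(steps):
           try:
               data, _ = sock.recvfrom(4096)  # state from the physical side
           except socket.timeout:
               continue
           msg = json.loads(data)             # {"state": [...], "reward": r}
           action = choose_action(msg["state"])
           sock.sendto(json.dumps({"action": action}).encode(), PHYSICAL_ADDR)
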
This approach is very similar to intent-based networking, since both share the principle of using software to run the network; however, intent-based networking offers an abstraction layer in which the intended policy and instructions are implemented across all the physical hardware within the infrastructure for automated networking.  To achieve effective intent-based networking over a real network, the physical control system will be implemented so that it automatically manages and provides an IoE edge smart traffic control service with a high-quality, real-time connection.

6.  IANA Considerations

There are no IANA considerations related to this document.

7.  Security Considerations

[TBD]

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

8.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", draft-jiang-nmlrg-
              network-machine-learning-02 (work in progress), October
              2016.

   [Megherbi] Megherbi, D. B., Kim, M., and Madera, M., "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., and Boulenouar, J., "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and Leung, V. C. M., "How network monitoring
              and reinforcement learning can improve TCP fairness in
              wireless multi-hop networks", EURASIP Journal on Wireless
              Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and Kim, M., "A Hybrid P2P and Master-
              Slave Cooperative Distributed Multi-Agent Reinforcement
              Learning System with Asynchronously Triggered Exploratory
              Trials and Clutter-index-based Selected Sub goals", IEEE
              CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and Bedi, R., "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and Burgard, W., "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and De Hauwere, Y., "Game Theory and
              Multi-agent Reinforcement Learning", in Reinforcement
              Learning: State of the Art, Adaptation, Learning, and
              Optimization, Volume 12, 2012.

   [Kok-Lim]  Yau, K.-L. A., Goh, H. G., Chieng, D., and Kwong, K. H.,
              "Application of Reinforcement Learning to wireless sensor
              networks: models and algorithms", Computing, Volume 97,
              Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and Barto, A. G., "Reinforcement Learning:
              An Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and Megherbi, D. B., "An Interconnected
              Dynamical System Composed of Dynamics-based Reinforcement
              Learning Agents in a Distributed Environment: A Case
              Study", Proceedings of the IEEE International Conference
              on Computational Intelligence for Measurement Systems and
              Applications, Italy, 2012.
   [Al-Dayaa] Al-Dayaa, H. S. and Megherbi, D. B., "Towards A Multiple-
              Lookahead-Levels Reinforcement-Learning Technique and Its
              Implementation in Integrated Circuits", Journal of
              Artificial Intelligence, Journal of Supercomputing, Vol.
              62, Issue 1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and Doss, N., "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.

   [Mnih]     Mnih, V., et al., "Human-level Control Through Deep
              Reinforcement Learning", Nature 518.7540, 2015.

   [Stampa]   Stampa, G., Arias, M., et al., "A Deep-Reinforcement
              Learning Approach for Software-Defined Networking Routing
              Optimization", arXiv, cs.NI, 2017.

   [Krizhevsky]
              Krizhevsky, A., Sutskever, I., and Hinton, G., "ImageNet
              Classification with Deep Convolutional Neural Networks",
              Advances in Neural Information Processing Systems, pp.
              1106-1114, 2012.

   [Volodymyr]
              Mnih, V., et al., "Asynchronous Methods for Deep
              Reinforcement Learning", ICML, arXiv:1602.01783, 2016.

   [MS]       "Intelligent Network Management using Reinforcement-
              learning", draft-kim-nmrg-rl-03 (work in progress), 2018.

   [Ju-Bong]  "Deep Q-Network Based Rotary Inverted Pendulum System and
              Its Monitoring on the EdgeX Platform", International
              Conference on Artificial Intelligence in Information and
              Communication (ICAIIC), 2019.

Authors' Addresses

   Min-Suk Kim
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr


   Youn-Hee Han
   KoreaTech
   Byeongcheon-myeon Gajeon-ri, Dongnam-gu
   Choenan-si, Chungcheongnam-do  330-708
   Korea

   Phone: +82 41 560 1486
   Email: yhhan@koreatech.ac.kr


   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr