NML Research Group                                              M-S. Kim
Internet-Draft                                                 Y-G. Hong
Intended status: Informational                                      ETRI
Expires: September 14, 2017                               March 13, 2017


   Collaborative Intelligent Multi-agent Reinforcement Learning over a
                                 Network
                      draft-kim-nmlrg-network-00

Abstract

   This document describes agent reinforcement learning (RL) in a
   distributed environment to transfer or share information for
   autonomous shortest path-planning over a communication network.  The
   centralized node, the main node that manages the agent workflow in a
   hybrid peer-to-peer environment, provides a cumulative reward for
   each action that a given agent takes with respect to an optimal path
   based on a to-be-learned policy over the learning process.  A reward
   from the centralized node is reflected when an agent explores to
   reach its destination for autonomous shortest path-planning in
   distributed nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Motivation
     3.1.  General Motivation for Reinforcement Learning (RL)
     3.2.  Reinforcement Learning (RL) in Networks
     3.3.  Motivation in Our Work
   4.  Related Works
     4.1.  Autonomous Driving System
     4.2.  Game Theory
     4.3.  Wireless Sensor Network (WSN)
     4.4.  Routing Enhancement
   5.  Multi-agent Reinforcement Learning (RL) Technologies
     5.1.  Reinforcement Learning (RL)
     5.2.  Reward of Distance and Frequency
     5.3.  Distributed Computing Node
     5.4.  Agent Sharing Information
     5.5.  Sub-goal Selection
     5.6.  Clutter-index-based Scheme
   6.  Proposed Architecture for Reinforcement Learning (RL)
   7.  Use Cases of Multi-agent Reinforcement Learning (RL)
     7.1.  Distributed Multi-agent Reinforcement Learning: Sharing
           Information
     7.2.  Use Case of Shortest Path-planning via Sub-goal Selection
     7.3.  Use Case of Asynchronous Triggered Multi-agent with Terrain
           Clutter-index-based Scheme
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   In large surveillance applications, it is necessary to protect and
   share information about Critical Key Infrastructures and Resources
   (CKIR) across large ground, maritime, and airborne areas, where
   there is a particular need for collaborative intelligent distributed
   systems with intelligent learning schemes.  These applications also
   require the development of computational multi-agent learning
   systems over large numbers of distributed networking nodes, where
   the agents have limited, incomplete knowledge and access only to
   local information held in distributed computing nodes over a
   communication network.

   Reinforcement Learning (RL) is one effective technique for
   transferring and sharing information among agents for autonomous
   shortest path-planning, as it does not require a priori knowledge of
   the agent's behavior or environment to accomplish its tasks
   [Megherbi].  Such knowledge is usually acquired and learned
   automatically and autonomously by trial and error.
   Reinforcement Learning (RL) actions involve interacting with a given
   environment, so the environment provides the agent learning process
   with the following elements:

   o  A starting agent state, one or more obstacles, and agent
      destinations

   o  Initially, an agent explores randomly in a given node

   o  Agent actions to avoid an obstacle and move to one or more
      available positions in order to reach its goal(s)

   o  After an agent reaches its goal, it can use the information
      collected in the initial random path-planning work to improve its
      learning speed

   o  Optimal paths in the following phase and exploratory learning
      trials

   Reinforcement Learning (RL) is one of the Machine Learning
   techniques that will be adapted to various networking environments
   for automatic networks [I-D.jiang-nmlrg-network-machine-learning].
   Thus, this document provides motivation, a learning technique, and
   use cases for network machine learning.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Motivation

3.1.  General Motivation for Reinforcement Learning (RL)

   Reinforcement Learning (RL) is a system capable of autonomous
   acquisition and incorporation of knowledge.  It can continuously
   self-improve its learning speed with experience, and it attempts to
   maximize the cumulative reward in order to find an optimal path
   faster, as used in multi-agent-based monitoring systems [Teiralbar].

3.2.  Reinforcement Learning (RL) in Networks

   In large surveillance applications, it is necessary to protect and
   share information across many infrastructure and resource areas.  In
   wireless networking layers, Reinforcement Learning (RL) is an
   emerging technology for monitoring the dynamics of the network in
   order to achieve fair resource allocation for nodes within a
   wireless mesh setting.  Monitoring network parameters and adjusting
   to network dynamics has been shown to improve fairness in wireless
   infrastructures and resources [Nasim].

3.3.  Motivation in Our Work

   There are many different networking issues such as latency, traffic,
   management, etc.  Reinforcement learning (RL) is one of the Machine
   Learning mechanisms that will be applied in multiple cases to solve
   diverse networking problems beyond human operating capacities.  This
   can be challenging for a multitude of reasons, such as a large state
   space to search, complexity in assigning rewards, difficulty in
   agent action selection, and difficulty in sharing and merging
   learned information among the agents in distributed memory nodes so
   that it can be transferred over a communication network [Minsuk].

4.  Related Works

4.1.  Autonomous Driving System

   An autonomous vehicle is capable of driving itself without human
   supervision, relying on a trust region policy optimized by
   reinforcement learning (RL) that enables learning of more complex
   and specialized neural networks.  Such a vehicle provides a
   comfortable user experience safely and reliably over an interactive
   communication network [April][Markus].

4.2.  Game Theory

   Adaptive multi-agent systems, which combine complexities arising
   from interacting game players, have developed within the field of
   reinforcement learning (RL).  Early interdisciplinary work in game
   theory focused only on competitive games, but Reinforcement Learning
   (RL) has developed into a general framework for analyzing strategic
   interaction and has attracted fields as diverse as psychology,
   economics, and biology [Ann].

4.3.  Wireless Sensor Network (WSN)

   A wireless sensor network (WSN) consists of a large number of
   sensors and sink nodes for monitoring systems with event parameters
   such as temperature, humidity, air conditioning, etc.  Reinforcement
   learning (RL) in WSNs has been applied to a wide range of schemes
   such as cooperative communication, routing, and rate control.  The
   sensors and sink nodes are able to observe and carry out optimal
   actions on their respective operating environments for network and
   application performance enhancement [Kok-Lim].

4.4.  Routing Enhancement

   Reinforcement Learning (RL) is used to enhance multicast routing
   protocols in wireless ad hoc networks, where each node has different
   capabilities.  Routers in the multicast routing protocol discover
   the optimal route with a predicted reward, and then the routers
   create the optimal path with multicast transmissions to reduce the
   overhead of Reinforcement Learning (RL) [Kok-Lim].

5.  Multi-agent Reinforcement Learning (RL) Technologies

5.1.  Reinforcement Learning (RL)

   Reinforcement Learning (RL) is one of the machine learning
   algorithms based on an agent learning process.  Reinforcement
   Learning (RL) is normally used with a reward from the centralized
   node and is capable of autonomous acquisition and incorporation of
   knowledge.  It continuously self-improves and becomes more efficient
   as it learns from the agent's experience, increasing the agent's
   learning speed for autonomous shortest path-planning
   [Sutton][Madera].
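   As a non-normative illustration of the agent learning process
   described above, the following minimal tabular Q-learning sketch in
   Python shows how an agent can improve its path estimates by trial
   and error from a reward.  The grid size, reward values, and
   parameter settings are assumptions made only for this example and
   are not specified by this document.

   import random

   # Illustrative 5x5 grid; states are (x, y) cells and reaching the
   # goal cell yields a positive reward (assumed values).
   GRID, GOAL = 5, (4, 4)
   ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
   ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration
   Q = {}                                 # Q[(state, action)] -> estimated return

   def step(state, action):
       # Apply an action, clip to the grid, and return (next_state, reward).
       nx = min(max(state[0] + action[0], 0), GRID - 1)
       ny = min(max(state[1] + action[1], 0), GRID - 1)
       nxt = (nx, ny)
       return nxt, (1.0 if nxt == GOAL else -0.01)

   def choose(state):
       # Epsilon-greedy selection: mostly exploit, sometimes explore.
       if random.random() < EPSILON:
           return random.choice(ACTIONS)
       return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

   for trial in range(500):               # trial-and-error episodes
       state = (0, 0)
       while state != GOAL:
           action = choose(state)
           nxt, reward = step(state, action)
           best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
           # Q-learning update toward reward plus discounted future value.
           Q[(state, action)] = (1 - ALPHA) * Q.get((state, action), 0.0) \
                                + ALPHA * (reward + GAMMA * best_next)
           state = nxt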
5.2.  Reward of Distance and Frequency

   In general, an agent takes the return values of its current state
   and of the next available states to decide on and perform an action,
   but the learning process in Reinforcement Learning (RL) has
   limitations, since it provides the agents with only a single level
   of exploratory learning.  This limitation reduces the agent's
   learning speed toward an optimal path, so the Distance-and-Frequency
   technique, based on the Euclidean distance, was derived to enhance
   the agent's optimal learning speed.  Distance-and-Frequency uses
   more levels of agent visibility and enhances the learning algorithm
   with an additional term based on the state occurrence frequency
   [Al-Dayaa].

5.3.  Distributed Computing Node

   Autonomous path-planning in a multi-agent environment involves the
   agent transfer of path information, as the agents require this
   information to achieve efficient path-planning on a given local node
   or on distributed memory nodes over a communication network.

5.4.  Agent Sharing Information

   The quality of agent decision making often depends on the
   willingness of agents to share learned information with other agents
   for optimal path-planning.  Sharing information means that an agent
   shares and communicates the knowledge it has learned and acquired
   with other agents using the Message Passing Interface (MPI).  When
   sharing information, each agent attempts to explore its environment,
   and all agents explore to reach their destinations via a distributed
   reinforcement reward-based learning method on the existing local
   distributed memory nodes.  The agents can run on the same or on
   different nodes over a communication network (via sharing
   information).  The agents have limited resources and incomplete
   knowledge of their environments.  Even if the agents individually
   lack the capabilities and resources to monitor an entire large
   terrain, they are able to share the information needed for
   collaborative path-planning across distributed networking nodes
   [Chowdappa][Minsuk].
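   As a non-normative sketch of the sharing step described in this
   section, the following Python fragment, assuming the mpi4py binding
   of MPI is available, gathers locally learned values from all agents
   and merges them.  The table layout and the merge rule (keeping the
   larger estimate) are illustrative assumptions, not requirements of
   this document.

   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   # Each agent keeps a local table of learned values for its portion of
   # the terrain; it is filled in by the agent's own RL loop (not shown).
   local_q = {}             # {(state, action): estimated return}

   def share_learned_info(table):
       # Every rank receives every other rank's table, then merges them.
       gathered = comm.allgather(table)
       merged = dict(table)
       for remote in gathered:
           for key, value in remote.items():
               # Assumed merge rule: keep the larger estimate for each entry.
               if key not in merged or value > merged[key]:
                   merged[key] = value
       return merged

   # Called periodically, e.g. after each exploratory trial, so agents
   # running on the same or different nodes can reuse knowledge learned
   # elsewhere.
   local_q = share_learned_info(local_q)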
5.5.  Sub-goal Selection

   A new technical method for agent sub-goal selection in distributed
   nodes is introduced to reduce the agent's initial random exploration
   with a given selected sub-goal.

   [TBD]

5.6.  Clutter-index-based Scheme

   We propose a learning algorithm to optimize agent sub-goal
   selection.  It is a clutter-index-based technique for a new
   reinforcement learning scheme with a reward, together with an
   improved method to optimize multi-agent learning speed over a
   communication network.

   [TBD]

6.  Proposed Architecture for Reinforcement Learning (RL)

   The architecture using Reinforcement Learning (RL) describes a
   collaborative multi-agent-based system in distributed environments,
   as shown in Figure 1.  It is a hybrid architecture that makes use of
   both a master/slave architecture and a peer-to-peer one.  The
   centralized node assigns each slave computing node a portion of the
   distributed terrain and an initial number of agents.  The network
   communication layer handles all communication among components and
   agents in the distributed networking environment; the components are
   deployed on different nodes.  The communication handler runs in a
   separate thread on each node with two message queues, an incoming
   queue and an outgoing queue, and alternately sends one message from
   the outgoing queue and delivers one message from the incoming queue
   to the destination agent or component.

                 +--------------------------------------+
    +------------|----------+        |     +------------|----------+
    | Communication Handler |        |     | Communication Handler |
    +-----------------------+        |     +-----------------------+
    |        Terrain        |        |     |        Terrain        |
    +-----------------------+        |     +-----------------------+
                                     |
                 +--------------------------------------+
    +------------|----------+        |     +------------|----------+
    | Communication Handler |        |     | Communication Handler |
    +-----------------------+        |     +-----------------------+
    |        Terrain        |        |     |        Terrain        |
    +-----------------------+        |     +-----------------------+
                                     |
                         +-----------------------+
                         | Communication Handler |
                         +-----------------------+
                         |Centralized Global Node|
                         +-----------------------+

    Figure 1: Top-level components, deployment and agent communication
                                  handler
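   As a non-normative sketch, the per-node communication handler
   described above could be organized along the following lines in
   Python.  The class and parameter names are assumptions made for
   illustration, and the actual transport between nodes (for example,
   sockets or MPI) is intentionally left abstract.

   import queue
   import threading
   import time

   class CommunicationHandler(threading.Thread):
       # Runs in its own thread on each node and alternates between
       # forwarding one outgoing message and delivering one incoming one.
       def __init__(self, send_to_network, deliver_locally):
           super().__init__(daemon=True)
           self.outgoing = queue.Queue()   # produced by local agents/components
           self.incoming = queue.Queue()   # received from other nodes
           self.send_to_network = send_to_network
           self.deliver_locally = deliver_locally

       def run(self):
           while True:
               try:
                   self.send_to_network(self.outgoing.get(timeout=0.1))
               except queue.Empty:
                   pass
               try:
                   self.deliver_locally(self.incoming.get(timeout=0.1))
               except queue.Empty:
                   pass

   # Usage sketch: every node, including the centralized global node,
   # starts one handler; here the transport is replaced by print().
   handler = CommunicationHandler(print, print)
   handler.start()
   handler.outgoing.put({"to": "centralized-node", "type": "state-update"})
   time.sleep(0.5)   # give the handler thread time to forward the message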
   Figure 2 shows the relationship of action, state, and reward between
   an agent and its destination in the environment for reinforcement
   learning.  The agent takes an action that leads to a reward for
   progressing along an optimal path toward its destination.

                         +-------------------------+
    States & Reward -----| Centralized Global Node |<----------------+
          |              +-------------------------+                 |
          |                                                          |
          |                                                       States
          |                                                          |
          |                                                          |
    +-------------+                              +-------------+     |
    | Multi-agent |-----------Action------------>| Destination |-----+
    +-------------+                              +-------------+

                    Figure 2: Architecture Overview

7.  Use Cases of Multi-agent Reinforcement Learning (RL)

7.1.  Distributed Multi-agent Reinforcement Learning: Sharing
      Information

   In this section, we deal with the case of collaborative distributed
   multi-agent learning, where each agent has the same or a different
   individual destination in a distributed environment.  Since the
   information-sharing scheme among the agents is a problematic one, we
   need to expand on the work described by solving the challenging
   cases.

   The main proposed algorithm for distributed multi-agent
   reinforcement learning is presented below:

   +--Proposed Algorithm--------------------------------------------+
   |                                                                |
   | Let N, A and D denote number of nodes, agents and destinations |
   +----------------------------------------------------------------+
   | Place N, A and D in random positions (x, y)                    |
   +----------------------------------------------------------------+
   | For every agent A in the N nodes                               |
   +----------------------------------------------------------------+
   | Do initial exploration (random) toward D                       |
   |   (1) Let S denote the current state                           |
   |   (2) Relinquish S so other agents can occupy the position     |
   |   (3) Assign the agent's new position                          |
   |   (4) Update the current state S <- Sn                         |
   +----------------------------------------------------------------+
   | Do optimized exploration (RL) for a number of trials           |
   |   (1) Let S denote the current state                           |
   |   (2) Let P denote the action                                  |
   |   (3) Let R denote the discounted reward value                 |
   |   (4) Choose action P <- Policy(S, P) in RL                    |
   |   (5) Move in an available direction chosen by the agent       |
   |   (6) Update the learning model with the new value             |
   |   (7) Update the current state S <- Sn                         |
   +----------------------------------------------------------------+

       Figure 3: Use case of Multi-agent Reinforcement Learning

   Multi-agent reinforcement learning (RL) in distributed nodes can
   improve overall system performance when transferring or sharing
   information from one node to another in the following cases:
   expanded complexity of the RL technique with various experimental
   factors and conditions, and analysis of multi-agent information
   sharing for agent learning speed.
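   As a non-normative illustration, the two phases of the proposed
   algorithm in Figure 3 could be organized in Python roughly as
   follows.  The environment, policy, and model-update callables are
   placeholders that a concrete implementation would have to supply;
   they are not defined by this document.

   def run_agent(env, policy, update_model, num_trials):
       # Phase 1: initial random exploration toward the destination D.
       state = env.start_state
       while not env.is_destination(state):
           env.relinquish(state)               # free S so other agents can occupy it
           state = env.random_neighbor(state)  # assign the new position, S <- Sn

       # Phase 2: optimized exploration using the reinforcement learning policy.
       for _ in range(num_trials):
           state = env.start_state
           while not env.is_destination(state):
               action = policy(state)                           # P <- Policy(S, P)
               next_state, reward = env.step(state, action)
               update_model(state, action, reward, next_state)  # discounted reward update
               state = next_state                               # S <- Sn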
7.2.  Use Case of Shortest Path-planning via Sub-goal Selection

   Sub-goal selection is a distributed multi-agent RL technique based
   on selected intermediary agent sub-goal(s), with the aim of reducing
   the initial random trial.  The scheme improves multi-agent system
   performance with asynchronously triggered exploratory phase(s) and
   selected agent sub-goal(s) for autonomous shortest path-planning.

   [TBD]

7.3.  Use Case of Asynchronous Triggered Multi-agent with Terrain
      Clutter-index-based Scheme

   This is a newly proposed reward scheme based on the proposed
   environment clutter index for fast-learning-speed path-planning.

   [TBD]

8.  IANA Considerations

   There are no IANA considerations related to this document.

9.  Security Considerations

   [TBD]

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

10.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", draft-jiang-
              nmlrg-network-machine-learning-02 (work in progress),
              October 2016.

   [Megherbi] Megherbi, D. B., Kim, M., and M. Madera, "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., and J. Boulenouar, "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and V. C. M. Leung, "How network monitoring
              and reinforcement learning can improve TCP fairness in
              wireless multi-hop networks", EURASIP Journal on Wireless
              Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and M. Kim, "A Hybrid P2P and
              Master-Slave Cooperative Distributed Multi-Agent
              Reinforcement Learning System with Asynchronously
              Triggered Exploratory Trials and Clutter-index-based
              Selected Sub-goals", IEEE CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and R. Bedi, "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and W. Burgard, "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and Y. De Hauwere, "Game Theory and
              Multi-agent Reinforcement Learning", in Reinforcement
              Learning: State of the Art, Adaptation, Learning, and
              Optimization, Volume 12, 2012.

   [Kok-Lim]  Yau, K.-L. A., Goh, H. G., Chieng, D., and K. H. Kwong,
              "Application of reinforcement learning to wireless sensor
              networks: models and algorithms", Computing, Volume 97,
              Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and A. G. Barto, "Reinforcement Learning:
              an Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and D. B. Megherbi, "An Interconnected
              Dynamical System Composed of Dynamics-based Reinforcement
              Learning Agents in a Distributed Environment: A Case
              Study", Proceedings of the IEEE International Conference
              on Computational Intelligence for Measurement Systems and
              Applications, Italy, 2012.

   [Al-Dayaa] Al-Dayaa, H. S. and D. B. Megherbi, "Towards A Multiple-
              Lookahead-Levels Reinforcement-Learning Technique and Its
              Implementation in Integrated Circuits", Journal of
              Artificial Intelligence, Journal of Supercomputing,
              Vol. 62, Issue 1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and N. Doss, "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.
Authors' Addresses

   Min-Suk Kim
   ETRI
   218 Gajeongno, Yuseong
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr


   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr