NML Research Group                                           M-S. Kim
Internet-Draft                                               Y-G. Hong
Intended status: Informational                                    ETRI
Expires: September 14, 2017                             March 13, 2017


     Collaborative Intelligent Multi-agent Reinforcement Learning
                             over a Network
                       draft-kim-nmlrg-network-00

Abstract

   This document describes multi-agent Reinforcement Learning (RL) in
   a distributed environment, where agents transfer or share
   information for autonomous shortest path-planning over a
   communication network.  The centralized node, which is the main
   node managing the agent workflow in a hybrid peer-to-peer
   environment, provides a cumulative reward for each action that a
   given agent takes with respect to an optimal path, based on a
   policy learned over the learning process.  The reward from the
   centralized node is reflected when an agent explores to reach its
   destination for autonomous shortest path-planning in distributed
   nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Conventions and Terminology
   3. Motivation
      3.1. General Motivation for Reinforcement Learning (RL)
      3.2. Reinforcement Learning (RL) in Networks
      3.3. Motivation in Our Work
   4. Related Works
      4.1. Autonomous Driving System
      4.2. Game Theory
      4.3. Wireless Sensor Network (WSN)
      4.4. Routing Enhancement
   5. Multi-agent Reinforcement Learning (RL) Technologies
      5.1. Reinforcement Learning (RL)
      5.2. Reward of Distance and Frequency
      5.3. Distributed Computing Node
      5.4. Agent Sharing Information
      5.5. Sub-goal Selection
      5.6. Clutter-index-based Scheme
   6. Proposed Architecture for Reinforcement Learning (RL)
   7. Use case of Multi-agent Reinforcement Learning (RL)
      7.1. Distributed Multi-agent Reinforcement Learning: Sharing
           Information
      7.2. Use case of Shortest Path-planning via Sub-goal Selection
      7.3. Use case of Asynchronously Triggered Multi-agent with
           Terrain Clutter Index
   8. IANA Considerations
   9. Security Considerations
   10. References
      10.1. Normative References
      10.2. Informative References
   Authors' Addresses

1. Introduction

   Large surveillance applications need to protect and share
   information on Critical Key Infrastructures and Resources (CKIR)
   across large ground, maritime, and airborne areas, where there is a
   special need for collaborative intelligent distributed systems with
   intelligent learning schemes.  These applications also need the
   development of computational multi-agent learning systems on large
   numbers of distributed networking nodes, where the agents have
   limited, incomplete knowledge and only access to local information
   in distributed computing nodes over a communication network.

   Reinforcement Learning (RL) is one effective technique for
   transferring and sharing information among agents for autonomous
   shortest path-planning, as it does not require a-priori knowledge
   of the agent's behavior or environment to accomplish its tasks
   [Megherbi].  Such knowledge is usually acquired and learned
   automatically and autonomously by trial and error.  RL actions
   involve interacting with a given environment, so the environment
   provides the agent learning process with the following elements:

   o  The starting agent state, one or more obstacles, and the agent
      destinations.

   o  Initially, an agent explores randomly within a given node.

   o  Agent actions avoid obstacles and move to one or more available
      positions so the agent can reach its goal(s).

   o  After an agent reaches its goal, it can use the information
      collected during the initial random path-planning to improve
      its learning speed.

   o  Optimal paths are found in the following phases and exploratory
      learning trials.

   Reinforcement Learning (RL) is one of the machine learning
   techniques that can be adapted to various networking environments
   for automatic networks [I-D.jiang-nmlrg-network-machine-learning].
   Thus, this document provides the motivation, the learning
   technique, and use cases for network machine learning.

2. Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3. Motivation

3.1. General Motivation for Reinforcement Learning (RL)

   Reinforcement Learning (RL) is a technique by which a system is
   capable of autonomously acquiring and incorporating knowledge.  It
   can continuously self-improve its learning speed with experience
   and attempts to maximize the cumulative reward to find a faster
   optimal path, as used in multi-agent-based monitoring systems
   [Teiralbar].
3.2. Reinforcement Learning (RL) in Networks

   In large surveillance applications, it is necessary to protect and
   share information across many infrastructure and resource areas.
   In wireless networking layers, Reinforcement Learning (RL) is an
   emerging technology for monitoring the dynamics of the network to
   achieve fair resource allocation for nodes within a wireless mesh
   setting.  Monitoring network parameters and adjusting them
   according to the network dynamics has been demonstrated to improve
   fairness in wireless infrastructure and resource environments
   [Nasim].

3.3. Motivation in Our Work

   There are many different networking issues, such as latency,
   traffic, and management.  Reinforcement Learning (RL) is one of the
   machine learning mechanisms that can be applied in multiple cases
   to solve diverse networking problems beyond human operating
   capacities.  It can be challenging for a multitude of reasons, such
   as a large state-space search, complexity in assigning rewards,
   difficulty in agent action selection, and difficulty in sharing and
   merging the information learned among agents in distributed memory
   nodes, which must be transferred over a communication network
   [Minsuk].

4. Related Works

4.1. Autonomous Driving System

   An autonomous vehicle is capable of driving without human
   supervision by relying on a trust-region policy optimized by
   Reinforcement Learning (RL), which enables the learning of more
   complex and specialized neural networks.  Such a vehicle provides a
   comfortable user experience, safely and reliably, on an interactive
   communication network [April][Markus].

4.2. Game Theory

   Adaptive multi-agent systems, which combine the complexities of
   interacting game players, have developed within the field of
   Reinforcement Learning (RL).  Early interdisciplinary work in game
   theory focused only on competitive games, but RL has since
   developed into a general framework for analyzing strategic
   interaction and has attracted fields as diverse as psychology,
   economics, and biology [Ann].

4.3. Wireless Sensor Network (WSN)

   A wireless sensor network (WSN) consists of a large number of
   sensors and sink nodes for monitoring systems with event parameters
   such as temperature, humidity, air conditioning, etc.
   Reinforcement Learning (RL) in WSNs has been applied in a wide
   range of schemes, such as cooperative communication, routing, and
   rate control.  The sensors and sink nodes are able to observe and
   carry out optimal actions on their respective operating
   environments for network and application performance enhancements
   [Kok-Lim].

4.4. Routing Enhancement

   Reinforcement Learning (RL) is used to enhance multicast routing
   protocols in wireless ad hoc networks, where each node has
   different capabilities.  Routers in the multicast routing protocol
   discover the optimal route with a predicted reward, and then create
   the optimal path with multicast transmissions to reduce the
   overhead of Reinforcement Learning (RL) [Kok-Lim].

5. Multi-agent Reinforcement Learning (RL) Technologies

5.1. Reinforcement Learning (RL)

   Reinforcement Learning (RL) is one of the machine learning
   algorithms based on an agent learning process.  RL is normally used
   with a reward from the centralized node and is capable of
   autonomously acquiring and incorporating knowledge.  It
   continuously self-improves and becomes more efficient as it learns
   from the agent's experience, increasing the agent's learning speed
   for autonomous shortest path-planning [Sutton][Madera].
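   As a concrete illustration of such an agent learning process, the
   following minimal Python sketch shows a tabular value update of the
   kind described in [Sutton].  The action set, parameter values, and
   reward handling here are assumptions made for the sketch, not
   definitions from this document.

      import random
      from collections import defaultdict

      # Illustrative tabular Q-learning (assumed parameter values).
      ALPHA = 0.1    # learning rate
      GAMMA = 0.9    # discount factor for the cumulative reward
      EPSILON = 0.2  # probability of random exploration

      ACTIONS = ["up", "down", "left", "right"]  # assumed action set
      q_table = defaultdict(float)  # maps (state, action) -> value

      def choose_action(state):
          # Epsilon-greedy: explore randomly, otherwise exploit the
          # best value learned so far.
          if random.random() < EPSILON:
              return random.choice(ACTIONS)
          return max(ACTIONS, key=lambda a: q_table[(state, a)])

      def update(state, action, reward, next_state):
          # Q(s,a) <- Q(s,a) + alpha*(r + gamma*max Q(s',a') - Q(s,a))
          best_next = max(q_table[(next_state, a)] for a in ACTIONS)
          error = reward + GAMMA * best_next - q_table[(state, action)]
          q_table[(state, action)] += ALPHA * error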
5.2. Reward of Distance and Frequency

   In general, an agent takes the return values of its current state
   and the next available states to decide on and take an action.
   However, the learning process in Reinforcement Learning (RL) is
   limited when it provides the agents with only a single level of
   exploratory learning.  This limitation reduces the agent's learning
   speed toward an optimal path, so the Distance-and-Frequency
   technique, based on the Euclidean distance, was derived to enhance
   the agent's optimal learning speed.  Distance-and-Frequency uses
   more levels of agent visibility to enhance the learning algorithm,
   with an additional factor given by the state occurrence frequency
   [Al-Dayaa].

5.3. Distributed Computing Node

   Autonomous path-planning in a multi-agent environment involves the
   transfer of path information among agents, as the agents require
   this information to achieve efficient path-planning on a given
   local node or on distributed memory nodes over a communication
   network.

5.4. Agent Sharing Information

   The quality of agent decision making often depends on the
   willingness of agents to share learned information with other
   agents for optimal path-planning.  Sharing information means that
   an agent shares and communicates the knowledge it has learned and
   acquired with other agents using the Message Passing Interface
   (MPI).  When sharing information, each agent attempts to explore
   its environment, and all agents explore to reach their destinations
   via a distributed reinforcement reward-based learning method on the
   existing local distributed memory nodes.  The agents can be running
   on the same node or on different nodes over a communication network
   (via sharing information).  The agents have limited resources and
   incomplete knowledge of their environments.  Even though the agents
   individually lack the capabilities and resources to monitor an
   entire given large terrain, they are able to share the needed
   information for collaborative path-planning in distributed
   networking nodes [Chowdappa][Minsuk], as sketched below.
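   The following sketch illustrates one way such sharing could look
   over MPI.  The mpi4py package, the table layout, and the
   merge-by-maximum rule are assumptions made for illustration; this
   document only names MPI as the transport.

      # Run with, e.g.: mpiexec -n 4 python share.py
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()

      # Each node holds a locally learned table of
      # (state, action) -> value entries (dummy data here).
      local_q = {(("cell-%d" % rank), "up"): float(rank)}

      # Gather every node's table at the master (rank 0), merge by
      # keeping the larger learned value, then broadcast the merged
      # table back to all nodes.
      tables = comm.gather(local_q, root=0)
      merged = {}
      if rank == 0:
          for table in tables:
              for key, value in table.items():
                  merged[key] = max(merged.get(key, value), value)
      merged = comm.bcast(merged, root=0)
      local_q.update(merged)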
5.5. Sub-goal Selection

   A new technical method for agent sub-goal selection in distributed
   nodes is introduced to reduce the agent's initial random
   exploration with a given selected sub-goal.  [TBD]

5.6. Clutter-index-based Scheme

   We propose a learning algorithm to optimize agent sub-goal
   selection.  It is a clutter-index-based technique for a new
   reinforcement learning reward scheme and an improved method to
   optimize multi-agent learning speed over a communication network.
   [TBD]

6. Proposed Architecture for Reinforcement Learning (RL)

   The architecture using Reinforcement Learning (RL) describes a
   collaborative multi-agent-based system in distributed environments,
   as shown in Figure 1.  The architecture is a hybrid one, making use
   of both a master/slave architecture and a peer-to-peer one.  The
   centralized node assigns each slave computing node a portion of the
   distributed terrain and an initial number of agents.

   The network communication layer handles all communication among
   components and agents in the distributed networking environment.
   The components are deployed on different nodes.  The communication
   handler runs in a separate thread on each node with two message
   queues, an incoming queue and an outgoing queue; it alternately
   sends one message from the outgoing queue and delivers one message
   from the incoming queue to the destination agent or component, as
   sketched after Figure 1.

   +-----------------------+            +-----------------------+
   | Communication Handler |            | Communication Handler |
   +-----------------------+            +-----------------------+
   |        Terrain        |            |        Terrain        |
   +-----------------------+            +-----------------------+
               |                                    |
   +-----------------------+            +-----------------------+
   | Communication Handler |            | Communication Handler |
   +-----------------------+            +-----------------------+
   |        Terrain        |            |        Terrain        |
   +-----------------------+            +-----------------------+
               |                                    |
               +-----------------+------------------+
                                 |
                     +-----------------------+
                     | Communication Handler |
                     +-----------------------+
                     |Centralized Global Node|
                     +-----------------------+

        Figure 1: Top-level components, deployment, and agent
                  communication handler
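   A minimal sketch of such a per-node communication handler is shown
   below: a separate thread that owns the two message queues and
   alternates between the outgoing and incoming directions.  The send
   and deliver callbacks stand in for the actual network transport and
   agent dispatch, which this document does not define.

      import queue
      import threading

      class CommunicationHandler(threading.Thread):
          """One handler thread per node, with two message queues."""

          def __init__(self, send_fn, deliver_fn):
              super().__init__(daemon=True)
              self.incoming = queue.Queue()
              self.outgoing = queue.Queue()
              self.send_fn = send_fn        # pushes onto the network
              self.deliver_fn = deliver_fn  # hands off to local agent

          def run(self):
              while True:
                  # Alternate: forward one outgoing message, then
                  # deliver one incoming message, skipping a direction
                  # whose queue is momentarily empty.
                  try:
                      self.send_fn(self.outgoing.get(timeout=0.1))
                  except queue.Empty:
                      pass
                  try:
                      self.deliver_fn(self.incoming.get(timeout=0.1))
                  except queue.Empty:
                      pass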
   Figure 2 shows the relationship of action, state, and reward
   between an agent and its destination in the environment for
   reinforcement learning.  The agent takes an action that leads to a
   reward for achieving an optimal path toward its destination.

   +-------------------------+
   | Centralized Global Node |<--------------------------------+
   +-------------------------+                                 |
          |                                                    |
          | States & Reward                                    |
          v                                                    |
   +-------------+                          +-------------+    |
   | Multi-agent |----------Action--------->| Destination |----+
   +-------------+                          +-------------+

                  Figure 2: Architecture Overview

7. Use case of Multi-agent Reinforcement Learning (RL)

7.1. Distributed Multi-agent Reinforcement Learning: Sharing
     Information

   In this section, we deal with the case of collaborative distributed
   multi-agents, where each agent has the same or a different
   individual destination in a distributed environment.  Since the
   information-sharing scheme among the agents is a problematic one,
   we need to expand on the work described by solving the challenging
   cases.  The main proposed algorithm, distributed multi-agent
   reinforcement learning, is presented below.

   +--Proposed Algorithm------------------------------------------+
   |                                                              |
   | Let N, A and D denote the numbers of nodes, agents and       |
   | destinations                                                 |
   +--------------------------------------------------------------+
   | Place N, A and D at random positions (x, y)                  |
   +--------------------------------------------------------------+
   | For every agent A in the N nodes                             |
   +--------------------------------------------------------------+
   | Do initial (random) exploration toward D                     |
   |  (1) Let S denote the current state                          |
   |  (2) Relinquish S so other agents can occupy the position    |
   |  (3) Assign the agent's new position                         |
   |  (4) Update the current state S <- Sn                        |
   +--------------------------------------------------------------+
   | Do optimized (RL) exploration for a number of trials         |
   |  (1) Let S denote the current state                          |
   |  (2) Let P denote an action                                  |
   |  (3) Let R denote the discounted reward value                |
   |  (4) Choose action P <- Policy(S, P) in RL                   |
   |  (5) Move in a direction available to the agent              |
   |  (6) Update the learning model with the new value            |
   |  (7) Update the current state S <- Sn                        |
   +--------------------------------------------------------------+

       Figure 3: Use case of Multi-agent Reinforcement Learning

   Multi-agent Reinforcement Learning (RL) in distributed nodes can
   improve overall system performance when transferring or sharing
   information from one node to another in the following cases:
   expanded complexity of the RL technique under various experimental
   factors and conditions, and analysis of multi-agent information
   sharing for agent learning speed.
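   The following self-contained Python sketch is one possible
   rendering of the two-phase procedure in Figure 3 on a small grid.
   The grid size, action set, reward values, and the epsilon-greedy
   policy are assumptions made for the sketch, not part of the
   proposed algorithm.

      import random
      from collections import defaultdict

      GRID = 10                                # assumed terrain size
      MOVES = {"up": (0, 1), "down": (0, -1),
               "left": (-1, 0), "right": (1, 0)}
      q_table = defaultdict(float)             # (state, action) -> value

      def step(state, action):
          # Move one cell, staying within the grid bounds.
          x, y = state
          dx, dy = MOVES[action]
          return (min(max(x + dx, 0), GRID - 1),
                  min(max(y + dy, 0), GRID - 1))

      def policy(state, epsilon=0.1):
          # Choose action P <- Policy(S, P): mostly greedy, with some
          # random exploration so every trial terminates.
          if random.random() < epsilon:
              return random.choice(list(MOVES))
          return max(MOVES, key=lambda a: q_table[(state, a)])

      def run_agent(destination, trials=100, alpha=0.1, gamma=0.9):
          # Phase 1: initial random exploration toward D.
          state = (random.randrange(GRID), random.randrange(GRID))
          while state != destination:
              state = step(state, random.choice(list(MOVES)))
          # Phase 2: RL-optimized exploration for a number of trials.
          for _ in range(trials):
              state = (random.randrange(GRID), random.randrange(GRID))
              while state != destination:
                  action = policy(state)
                  nxt = step(state, action)
                  # Assumed reward: positive at the goal, small cost
                  # per move (the discounted reward value R).
                  r = 1.0 if nxt == destination else -0.01
                  best = max(q_table[(nxt, a)] for a in MOVES)
                  q_table[(state, action)] += alpha * (
                      r + gamma * best - q_table[(state, action)])
                  state = nxt                  # S <- Sn

      run_agent(destination=(GRID - 1, GRID - 1))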
7.2. Use case of Shortest Path-planning via Sub-goal Selection

   Sub-goal selection is a distributed multi-agent RL technique based
   on selected intermediary agent sub-goal(s), with the aim of
   reducing the initial random trial.  The scheme improves multi-agent
   system performance with asynchronously triggered exploratory
   phase(s) and selected agent sub-goal(s) for autonomous shortest
   path-planning.  [TBD]

7.3. Use case of Asynchronously Triggered Multi-agent with Terrain
     Clutter Index

   This is a newly proposed reward scheme based on the proposed
   environment clutter index for path-planning with fast learning
   speed.  [TBD]

8. IANA Considerations

   There are no IANA considerations related to this document.

9. Security Considerations

   [TBD]

10. References

10.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

10.2. Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning",
              draft-jiang-nmlrg-network-machine-learning-02 (work in
              progress), October 2016.

   [Megherbi] Megherbi, D. B., Kim, M., and M. Madera, "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., and J. Boulenouar, "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and V. C. M. Leung, "How network monitoring
              and reinforcement learning can improve TCP fairness in
              wireless multi-hop networks", EURASIP Journal on
              Wireless Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and M. Kim, "A Hybrid P2P and
              Master-Slave Cooperative Distributed Multi-Agent
              Reinforcement Learning System with Asynchronously
              Triggered Exploratory Trials and Clutter-index-based
              Selected Sub-goals", IEEE CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and R. Bedi, "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and W. Burgard, "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", IEEE International Conference on
              Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and Y. De Hauwere, "Game Theory
              and Multi-agent Reinforcement Learning", in
              Reinforcement Learning: State of the Art, Adaptation,
              Learning, and Optimization, Volume 12, 2012.

   [Kok-Lim]  Yau, K.-L. A., Goh, H. G., Chieng, D., and K. H. Kwong,
              "Application of reinforcement learning to wireless
              sensor networks: models and algorithms", Computing,
              Volume 97, Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and A. G. Barto, "Reinforcement Learning:
              An Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and D. B. Megherbi, "An Interconnected
              Dynamical System Composed of Dynamics-based
              Reinforcement Learning Agents in a Distributed
              Environment: A Case Study", Proceedings of the IEEE
              International Conference on Computational Intelligence
              for Measurement Systems and Applications, Italy, 2012.

   [Al-Dayaa] Al-Dayaa, H. S. and D. B. Megherbi, "Towards A
              Multiple-Lookahead-Levels Reinforcement-Learning
              Technique and Its Implementation in Integrated
              Circuits", Journal of Supercomputing, Volume 62, Issue
              1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and N. Doss, "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.

Authors' Addresses

   Min-Suk Kim
   ETRI
   218 Gajeongno, Yuseong
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr

   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr