NML Research Group                                              M-S. Kim
Internet-Draft                                                 Y-G. Hong
Intended status: Informational                                      ETRI
Expires: September 14, 2017                               March 13, 2017


   Collaborative Intelligent Multi-agent Reinforcement Learning over a
                                 Network
                      draft-kim-nmlrg-network-00

Abstract

   This document describes agent reinforcement learning (RL) in a
   distributed environment to transfer or share information for
   autonomous shortest path-planning over a communication network.  The
   centralized node, the main node that manages the agent workflow in a
   hybrid peer-to-peer environment, provides a cumulative reward for
   each action that a given agent takes with respect to an optimal path
   based on a to-be-learned policy over the learning process.  A reward
   from the centralized node is reflected when an agent explores to
   reach its destination for autonomous shortest path-planning in
   distributed nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 14, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions and Terminology
   3.  Motivation
     3.1.  General Motivation for Reinforcement Learning (RL)
     3.2.  Reinforcement Learning (RL) in Networks
     3.3.  Motivation in Our Work
   4.  Related Works
     4.1.  Autonomous Driving System
     4.2.  Game Theory
     4.3.  Wireless Sensor Network (WSN)
     4.4.  Routing Enhancement
   5.  Multi-agent Reinforcement Learning (RL) Technologies
     5.1.  Reinforcement Learning (RL)
     5.2.  Reward of Distance and Frequency
     5.3.  Distributed Computing Node
     5.4.  Agent Sharing Information
     5.5.  Sub-goal Selection
     5.6.  Clutter-index-based Scheme
   6.  Proposed Architecture for Reinforcement Learning (RL)
   7.  Use Cases of Multi-agent Reinforcement Learning (RL)
     7.1.  Distributed Multi-agent Reinforcement Learning: Sharing
           Information
     7.2.  Use Case of Shortest Path-planning via Sub-goal Selection
     7.3.  Use Case of Asynchronous Triggered Multi-agent with Terrain
           Clutter-index-based Scheme
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   In large surveillance applications, it is necessary to protect and
   share information about Critical Key Infrastructures and Resources
   (CKIR) across large ground, maritime, and airborne areas, where
   there is a particular need for collaborative intelligent distributed
   systems with intelligent learning schemes.  These applications also
   require the development of computational multi-agent learning
   systems over large numbers of distributed networking nodes, where
   the agents have limited, incomplete knowledge and access only to
   local information held in distributed computing nodes over a
   communication network.

   Reinforcement Learning (RL) is one effective technique for
   transferring and sharing information among agents for autonomous
   shortest path-planning, as it does not require a priori knowledge of
   the agent's behavior or environment to accomplish its tasks
   [Megherbi].  Such knowledge is usually acquired and learned
   automatically and autonomously by trial and error.
   Reinforcement Learning (RL) actions involve interacting with a given
   environment, so the environment provides the agent learning process
   with the following elements:

   o  A starting agent state, one or more obstacles, and agent
      destinations

   o  Initially, an agent explores randomly in a given node

   o  Agent actions to avoid an obstacle and move to one or more
      available positions in order to reach its goal(s)

   o  After an agent reaches its goal, it can use the information
      collected in the initial random path-planning work to improve its
      learning speed

   o  Optimal paths in the following phase and exploratory learning
      trials

   Reinforcement Learning (RL) is one of the Machine Learning
   techniques that will be adapted to various networking environments
   for automatic networks [I-D.jiang-nmlrg-network-machine-learning].
   Thus, this document provides motivation, a learning technique, and
   use cases for network machine learning.

2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Motivation

3.1.  General Motivation for Reinforcement Learning (RL)

   Reinforcement Learning (RL) is a system capable of autonomous
   acquisition and incorporation of knowledge.  It can continuously
   self-improve its learning speed with experience, and it attempts to
   maximize the cumulative reward in order to find an optimal path
   faster, as used in multi-agent-based monitoring systems [Teiralbar].

3.2.  Reinforcement Learning (RL) in Networks

   In large surveillance applications, it is necessary to protect and
   share information across many infrastructure and resource areas.  In
   wireless networking layers, Reinforcement Learning (RL) is an
   emerging technology for monitoring the dynamics of the network in
   order to achieve fair resource allocation for nodes within a
   wireless mesh setting.  Monitoring network parameters and adjusting
   to network dynamics has been shown to improve fairness in wireless
   infrastructures and resources [Nasim].

3.3.  Motivation in Our Work

   There are many different networking issues such as latency, traffic,
   management, etc.  Reinforcement learning (RL) is one of the Machine
   Learning mechanisms that will be applied in multiple cases to solve
   diverse networking problems beyond human operating capacities.  This
   can be challenging for a multitude of reasons, such as a large state
   space to search, complexity in assigning rewards, difficulty in
   agent action selection, and difficulty in sharing and merging
   learned information among the agents in distributed memory nodes so
   that it can be transferred over a communication network [Minsuk].

4.  Related Works

4.1.  Autonomous Driving System

   An autonomous vehicle is capable of driving itself without human
   supervision, relying on a trust region policy optimized by
   reinforcement learning (RL) that enables learning of more complex
   and specialized neural networks.  Such a vehicle provides a
   comfortable user experience safely and reliably over an interactive
   communication network [April][Markus].

4.2.  Game Theory

   Adaptive multi-agent systems, which combine complexities arising
   from interacting game players, have developed within the field of
   reinforcement learning (RL).  Early interdisciplinary work in game
   theory focused only on competitive games, but Reinforcement Learning
   (RL) has developed into a general framework for analyzing strategic
   interaction and has attracted fields as diverse as psychology,
   economics, and biology [Ann].

4.3.  Wireless Sensor Network (WSN)

   A wireless sensor network (WSN) consists of a large number of
   sensors and sink nodes for monitoring systems with event parameters
   such as temperature, humidity, air conditioning, etc.  Reinforcement
   learning (RL) in WSNs has been applied to a wide range of schemes
   such as cooperative communication, routing, and rate control.  The
   sensors and sink nodes are able to observe and carry out optimal
   actions on their respective operating environments for network and
   application performance enhancement [Kok-Lim].

4.4.  Routing Enhancement

   Reinforcement Learning (RL) is used to enhance multicast routing
   protocols in wireless ad hoc networks, where each node has different
   capabilities.  Routers in the multicast routing protocol discover
   the optimal route with a predicted reward, and then the routers
   create the optimal path with multicast transmissions to reduce the
   overhead of Reinforcement Learning (RL) [Kok-Lim].

5.  Multi-agent Reinforcement Learning (RL) Technologies

5.1.  Reinforcement Learning (RL)

   Reinforcement Learning (RL) is one of the machine learning
   algorithms based on an agent learning process.  Reinforcement
   Learning (RL) is normally used with a reward from the centralized
   node and is capable of autonomous acquisition and incorporation of
   knowledge.  It continuously self-improves and becomes more efficient
   as it learns from the agent's experience, increasing the agent's
   learning speed for autonomous shortest path-planning
   [Sutton][Madera].
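   As a non-normative illustration of the agent learning process
   described above, the following minimal tabular Q-learning sketch in
   Python shows how an agent can improve its path estimates by trial
   and error from a reward.  The grid size, reward values, and
   parameter settings are assumptions made only for this example and
   are not specified by this document.

   import random

   # Illustrative 5x5 grid; states are (x, y) cells and reaching the
   # goal cell yields a positive reward (assumed values).
   GRID, GOAL = 5, (4, 4)
   ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
   ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration
   Q = {}                                 # Q[(state, action)] -> estimated return

   def step(state, action):
       # Apply an action, clip to the grid, and return (next_state, reward).
       nx = min(max(state[0] + action[0], 0), GRID - 1)
       ny = min(max(state[1] + action[1], 0), GRID - 1)
       nxt = (nx, ny)
       return nxt, (1.0 if nxt == GOAL else -0.01)

   def choose(state):
       # Epsilon-greedy selection: mostly exploit, sometimes explore.
       if random.random() < EPSILON:
           return random.choice(ACTIONS)
       return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

   for trial in range(500):               # trial-and-error episodes
       state = (0, 0)
       while state != GOAL:
           action = choose(state)
           nxt, reward = step(state, action)
           best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
           # Q-learning update toward reward plus discounted future value.
           Q[(state, action)] = (1 - ALPHA) * Q.get((state, action), 0.0) \
                                + ALPHA * (reward + GAMMA * best_next)
           state = nxt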
5.2.  Reward of Distance and Frequency

   In general, an agent takes the return values of its current state
   and of the next available states to decide on and perform an action,
   but the learning process in Reinforcement Learning (RL) has
   limitations, since it provides the agents with only a single level
   of exploratory learning.  This limitation reduces the agent's
   learning speed toward an optimal path, so the Distance-and-Frequency
   technique, based on the Euclidean distance, was derived to enhance
   the agent's optimal learning speed.  Distance-and-Frequency uses
   more levels of agent visibility and enhances the learning algorithm
   with an additional term based on the state occurrence frequency
   [Al-Dayaa].

5.3.  Distributed Computing Node

   Autonomous path-planning in a multi-agent environment involves the
   agent transfer of path information, as the agents require this
   information to achieve efficient path-planning on a given local node
   or on distributed memory nodes over a communication network.

5.4.  Agent Sharing Information

   The quality of agent decision making often depends on the
   willingness of agents to share learned information with other agents
   for optimal path-planning.  Sharing information means that an agent
   shares and communicates the knowledge it has learned and acquired
   with other agents using the Message Passing Interface (MPI).  When
   sharing information, each agent attempts to explore its environment,
   and all agents explore to reach their destinations via a distributed
   reinforcement reward-based learning method on the existing local
   distributed memory nodes.  The agents can run on the same or on
   different nodes over a communication network (via sharing
   information).  The agents have limited resources and incomplete
   knowledge of their environments.  Even if the agents individually
   lack the capabilities and resources to monitor an entire large
   terrain, they are able to share the information needed for
   collaborative path-planning across distributed networking nodes
   [Chowdappa][Minsuk].
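   As a non-normative sketch of the sharing step described in this
   section, the following Python fragment, assuming the mpi4py binding
   of MPI is available, gathers locally learned values from all agents
   and merges them.  The table layout and the merge rule (keeping the
   larger estimate) are illustrative assumptions, not requirements of
   this document.

   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   # Each agent keeps a local table of learned values for its portion of
   # the terrain; it is filled in by the agent's own RL loop (not shown).
   local_q = {}             # {(state, action): estimated return}

   def share_learned_info(table):
       # Every rank receives every other rank's table, then merges them.
       gathered = comm.allgather(table)
       merged = dict(table)
       for remote in gathered:
           for key, value in remote.items():
               # Assumed merge rule: keep the larger estimate for each entry.
               if key not in merged or value > merged[key]:
                   merged[key] = value
       return merged

   # Called periodically, e.g. after each exploratory trial, so agents
   # running on the same or different nodes can reuse knowledge learned
   # elsewhere.
   local_q = share_learned_info(local_q)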
5.5.  Sub-goal Selection

   A new technical method for agent sub-goal selection in distributed
   nodes is introduced to reduce the agent's initial random exploration
   with a given selected sub-goal.

   [TBD]

5.6.  Clutter-index-based Scheme

   We propose a learning algorithm to optimize agent sub-goal
   selection.  It is a clutter-index-based technique for a new
   reinforcement learning scheme with a reward, together with an
   improved method to optimize multi-agent learning speed over a
   communication network.

   [TBD]

6.  Proposed Architecture for Reinforcement Learning (RL)

   The architecture using Reinforcement Learning (RL) describes a
   collaborative multi-agent-based system in distributed environments,
   as shown in Figure 1.  It is a hybrid architecture that makes use of
   both a master/slave architecture and a peer-to-peer one.  The
   centralized node assigns each slave computing node a portion of the
   distributed terrain and an initial number of agents.  The network
   communication layer handles all communication among components and
   agents in the distributed networking environment; the components are
   deployed on different nodes.  The communication handler runs in a
   separate thread on each node with two message queues, an incoming
   queue and an outgoing queue, and alternately sends one message from
   the outgoing queue and delivers one message from the incoming queue
   to the destination agent or component.

                 +--------------------------------------+
    +------------|----------+        |     +------------|----------+
    | Communication Handler |        |     | Communication Handler |
    +-----------------------+        |     +-----------------------+
    |        Terrain        |        |     |        Terrain        |
    +-----------------------+        |     +-----------------------+
                                     |
                 +--------------------------------------+
    +------------|----------+        |     +------------|----------+
    | Communication Handler |        |     | Communication Handler |
    +-----------------------+        |     +-----------------------+
    |        Terrain        |        |     |        Terrain        |
    +-----------------------+        |     +-----------------------+
                                     |
                         +-----------------------+
                         | Communication Handler |
                         +-----------------------+
                         |Centralized Global Node|
                         +-----------------------+

    Figure 1: Top-level components, deployment and agent communication
                                  handler
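   As a non-normative sketch, the per-node communication handler
   described above could be organized along the following lines in
   Python.  The class and parameter names are assumptions made for
   illustration, and the actual transport between nodes (for example,
   sockets or MPI) is intentionally left abstract.

   import queue
   import threading
   import time

   class CommunicationHandler(threading.Thread):
       # Runs in its own thread on each node and alternates between
       # forwarding one outgoing message and delivering one incoming one.
       def __init__(self, send_to_network, deliver_locally):
           super().__init__(daemon=True)
           self.outgoing = queue.Queue()   # produced by local agents/components
           self.incoming = queue.Queue()   # received from other nodes
           self.send_to_network = send_to_network
           self.deliver_locally = deliver_locally

       def run(self):
           while True:
               try:
                   self.send_to_network(self.outgoing.get(timeout=0.1))
               except queue.Empty:
                   pass
               try:
                   self.deliver_locally(self.incoming.get(timeout=0.1))
               except queue.Empty:
                   pass

   # Usage sketch: every node, including the centralized global node,
   # starts one handler; here the transport is replaced by print().
   handler = CommunicationHandler(print, print)
   handler.start()
   handler.outgoing.put({"to": "centralized-node", "type": "state-update"})
   time.sleep(0.5)   # give the handler thread time to forward the message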
   Figure 2 shows the relationship of action, state, and reward between
   an agent and its destination in the environment for reinforcement
   learning.  The agent takes an action that leads to a reward for
   progressing along an optimal path toward its destination.

                         +-------------------------+
    States & Reward -----| Centralized Global Node |<----------------+
          |              +-------------------------+                 |
          |                                                          |
          |                                                       States
          |                                                          |
          |                                                          |
    +-------------+                              +-------------+     |
    | Multi-agent |-----------Action------------>| Destination |-----+
    +-------------+                              +-------------+

                    Figure 2: Architecture Overview

7.  Use Cases of Multi-agent Reinforcement Learning (RL)

7.1.  Distributed Multi-agent Reinforcement Learning: Sharing
      Information

   In this section, we deal with the case of collaborative distributed
   multi-agent learning, where each agent has the same or a different
   individual destination in a distributed environment.  Since the
   information-sharing scheme among the agents is a problematic one, we
   need to expand on the work described by solving the challenging
   cases.

   The main proposed algorithm for distributed multi-agent
   reinforcement learning is presented below:

   +--Proposed Algorithm--------------------------------------------+
   |                                                                |
   | Let N, A and D denote number of nodes, agents and destinations |
   +----------------------------------------------------------------+
   | Place N, A and D in random positions (x, y)                    |
   +----------------------------------------------------------------+
   | For every agent A in the N nodes                               |
   +----------------------------------------------------------------+
   | Do initial exploration (random) toward D                       |
   |   (1) Let S denote the current state                           |
   |   (2) Relinquish S so other agents can occupy the position     |
   |   (3) Assign the agent's new position                          |
   |   (4) Update the current state S <- Sn                         |
   +----------------------------------------------------------------+
   | Do optimized exploration (RL) for a number of trials           |
   |   (1) Let S denote the current state                           |
   |   (2) Let P denote the action                                  |
   |   (3) Let R denote the discounted reward value                 |
   |   (4) Choose action P <- Policy(S, P) in RL                    |
   |   (5) Move in an available direction chosen by the agent       |
   |   (6) Update the learning model with the new value             |
   |   (7) Update the current state S <- Sn                         |
   +----------------------------------------------------------------+

       Figure 3: Use case of Multi-agent Reinforcement Learning

   Multi-agent reinforcement learning (RL) in distributed nodes can
   improve overall system performance when transferring or sharing
   information from one node to another in the following cases:
   expanded complexity of the RL technique with various experimental
   factors and conditions, and analysis of multi-agent information
   sharing for agent learning speed.
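   As a non-normative illustration, the two phases of the proposed
   algorithm in Figure 3 could be organized in Python roughly as
   follows.  The environment, policy, and model-update callables are
   placeholders that a concrete implementation would have to supply;
   they are not defined by this document.

   def run_agent(env, policy, update_model, num_trials):
       # Phase 1: initial random exploration toward the destination D.
       state = env.start_state
       while not env.is_destination(state):
           env.relinquish(state)               # free S so other agents can occupy it
           state = env.random_neighbor(state)  # assign the new position, S <- Sn

       # Phase 2: optimized exploration using the reinforcement learning policy.
       for _ in range(num_trials):
           state = env.start_state
           while not env.is_destination(state):
               action = policy(state)                           # P <- Policy(S, P)
               next_state, reward = env.step(state, action)
               update_model(state, action, reward, next_state)  # discounted reward update
               state = next_state                               # S <- Sn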
7.2.  Use Case of Shortest Path-planning via Sub-goal Selection

   Sub-goal selection is a distributed multi-agent RL technique based
   on selected intermediary agent sub-goal(s), with the aim of reducing
   the initial random trial.  The scheme improves multi-agent system
   performance with asynchronously triggered exploratory phase(s) and
   selected agent sub-goal(s) for autonomous shortest path-planning.

   [TBD]

7.3.  Use Case of Asynchronous Triggered Multi-agent with Terrain
      Clutter-index-based Scheme

   This is a newly proposed reward scheme based on the proposed
   environment clutter index for fast-learning-speed path-planning.

   [TBD]

8.  IANA Considerations

   There are no IANA considerations related to this document.

9.  Security Considerations

   [TBD]

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

10.2.  Informative References

   [I-D.jiang-nmlrg-network-machine-learning]
              Jiang, S., "Network Machine Learning", draft-jiang-
              nmlrg-network-machine-learning-02 (work in progress),
              October 2016.

   [Megherbi] Megherbi, D. B., Kim, M., and M. Madera, "A Study of
              Collaborative Distributed Multi-Goal and Multi-agent
              based Systems for Large Critical Key Infrastructures and
              Resources (CKIR) Dynamic Monitoring and Surveillance",
              IEEE International Conference on Technologies for
              Homeland Security, 2013.

   [Teiralbar]
              Megherbi, D. B., Teiralbar, A., and J. Boulenouar, "A
              Time-varying Environment Machine Learning Technique for
              Autonomous Agent Shortest Path Planning", Proceedings of
              SPIE International Conference on Signal and Image
              Processing, Orlando, Florida, 2001.

   [Nasim]    Arianpoo, N. and V. C. M. Leung, "How network monitoring
              and reinforcement learning can improve TCP fairness in
              wireless multi-hop networks", EURASIP Journal on Wireless
              Communications and Networking, 2016.

   [Minsuk]   Megherbi, D. B. and M. Kim, "A Hybrid P2P and
              Master-Slave Cooperative Distributed Multi-Agent
              Reinforcement Learning System with Asynchronously
              Triggered Exploratory Trials and Clutter-index-based
              Selected Sub-goals", IEEE CIG Conference, 2016.

   [April]    Yu, A., Palefsky-Smith, R., and R. Bedi, "Deep
              Reinforcement Learning for Simulated Autonomous Vehicle
              Control", Stanford University, 2016.

   [Markus]   Kuderer, M., Gulati, S., and W. Burgard, "Learning
              Driving Styles for Autonomous Vehicles from
              Demonstration", Robotics and Automation (ICRA), 2015.

   [Ann]      Nowe, A., Vrancx, P., and Y. De Hauwere, "Game Theory and
              Multi-agent Reinforcement Learning", in Reinforcement
              Learning: State of the Art, Adaptation, Learning, and
              Optimization, Volume 12, 2012.

   [Kok-Lim]  Yau, K.-L. A., Goh, H. G., Chieng, D., and K. H. Kwong,
              "Application of reinforcement learning to wireless sensor
              networks: models and algorithms", Computing, Volume 97,
              Issue 11, pp. 1045-1075, November 2015.

   [Sutton]   Sutton, R. S. and A. G. Barto, "Reinforcement Learning:
              an Introduction", MIT Press, 1998.

   [Madera]   Madera, M. and D. B. Megherbi, "An Interconnected
              Dynamical System Composed of Dynamics-based Reinforcement
              Learning Agents in a Distributed Environment: A Case
              Study", Proceedings of the IEEE International Conference
              on Computational Intelligence for Measurement Systems and
              Applications, Italy, 2012.

   [Al-Dayaa] Al-Dayaa, H. S. and D. B. Megherbi, "Towards A Multiple-
              Lookahead-Levels Reinforcement-Learning Technique and Its
              Implementation in Integrated Circuits", Journal of
              Artificial Intelligence, Journal of Supercomputing,
              Vol. 62, Issue 1, pp. 588-61, 2012.

   [Chowdappa]
              Chowdappa, A., Skjellum, A., and N. Doss, "Thread-Safe
              Message Passing with P4 and MPI", Technical Report
              TR-CS-941025, Computer Science Department and NSF
              Engineering Research Center, Mississippi State
              University, 1994.
Authors' Addresses

   Min-Suk Kim
   ETRI
   218 Gajeongno, Yuseong
   Daejeon  305-700
   Korea

   Phone: +82 42 860 5930
   Email: mskim16@etri.re.kr


   Yong-Geun Hong
   ETRI
   161 Gajeong-Dong Yuseung-Gu
   Daejeon  305-700
   Korea

   Phone: +82 42 860 6557
   Email: yghong@etri.re.kr