NMRG                                          P. Martinez-Julia, Ed.
Internet-Draft                                                  NICT
Updates: draft-pedro-nmrg-intelligent-                      S. Homma
         reasoning-00 (if approved)                              NTT
Intended status: Informational                        March 06, 2020
Expires: September 7, 2020

   Intelligent Reasoning on External Events for Network Management
              draft-pedro-nmrg-intelligent-reasoning-01

Abstract

   The adoption of AI in network management solutions is becoming a
   reality.  It is mainly driven by the need to resolve complex
   problems arising from the acceptance of SDN/NFV technologies as
   well as network slicing.  This allows current computer and network
   system infrastructures to constantly grow in complexity, in
   parallel with the demands of users.  However, exploiting the
   possibilities of AI is not an easy task.  There has been a lot of
   effort to make Machine Learning (ML) solutions reliable and
   acceptable but, at the same time, other mechanisms have been
   forgotten.  This is particularly the case of reasoning.  Although
   it can provide enormous benefits to management solutions by, for
   example, inferring new knowledge and applying different kinds of
   rules (e.g. logical) to choose among several actions, it has
   received little attention.  While ML solutions work with data, so
   their only requirement from the network infrastructure is data
   retrieval, reasoning solutions work in collaboration with the
   network they are managing.  This makes the challenges arising from
   intelligent reasoning key for the evolution of network management
   towards the full adoption of AI.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Background
     3.1.  Virtual Computer and Network Systems
     3.2.  SDN and NFV
     3.3.  Management and Control
     3.4.  Slice Gateway (SLG)
   4.  Applying AI to Network Management
     4.1.  Beyond Machine Learning
     4.2.  Briefing Artificial Intelligence
   5.  Extended Management Operation
     5.1.  Intelligent Network Management Process
     5.2.  Closed Loop Management Approach
   6.  Deep Exploitation of AI in Network Management
     6.1.  From Data to Wisdom
     6.2.  External Event Detectors
     6.3.  Network Requirement Anticipation
     6.4.  Intelligent Reasoning
     6.5.  Gaps and Standardization Issues
   7.  Relation to Other IETF/IRTF Initiatives
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Information Model to Support Reasoning on External
                Events
     A.1.  Tree Structure
       A.1.1.  event-payloads
         A.1.1.1.  basic
         A.1.1.2.  seismometer
         A.1.1.3.  bigdata
       A.1.2.  external-events
       A.1.3.  notifications/event
     A.2.  YANG Module
   Appendix B.  The Autonomic Resource Control Architecture (ARCA)
   Appendix C.  ARCA Integration With ETSI-NFV-MANO
     C.1.  Functional Integration
     C.2.  Target Experiment and Scenario
     C.3.  OpenStack Platform
     C.4.  Initial Results
   Authors' Addresses

1.  Introduction

   The current network ecosystem is quickly evolving from an almost
   fixed network to a highly flexible, powerful, and somewhat hybrid
   system.  Network slicing, Software Defined Networking (SDN), and
   Network Function Virtualization (NFV) provide the basis for such
   evolution.  The need to automate the management and control of such
   systems has motivated the move towards autonomic networking (ANIMA)
   and the inclusion of AI solutions alongside the management plane of
   the network.  This is amply justified by the increasing size and
   complexity of the network, which exposes complex problems that must
   be resolved at scales that are beyond human capabilities.
   Therefore, in order to allow current computer and network system
   infrastructures to constantly grow in complexity, in parallel with
   the demands of users, AI solutions must work together with other
   network management solutions.

   However, exploiting the possibilities of AI is not an easy task.
   There has been a lot of effort to make Machine Learning (ML)
   solutions reliable and acceptable but, at the same time, other
   mechanisms have been forgotten.  This is particularly the case of
   reasoning.  Although it can provide enormous benefits to management
   solutions by, for example, inferring new knowledge and applying
   different kinds of rules (e.g. logical) to choose among several
   actions, it has received little attention.  While ML solutions work
   with data, so their only requirement from the network
   infrastructure is data retrieval, reasoning solutions work in
   collaboration with the network they are managing.  This makes the
   challenges arising from intelligent reasoning key for the evolution
   of network management towards the full adoption of AI.

   The present document aims to gather the information necessary to
   get the most benefit from the application of intelligent reasoning
   to network management, including, but not limited to, defining the
   gaps that must be covered for reasoning to be correctly integrated
   into network management solutions.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Background

3.1.  Virtual Computer and Network Systems

   The continuous search for efficiency and cost reduction to achieve
   the optimum exploitation of available resources (e.g. CPU power and
   electricity) has driven current physical infrastructures to move
   towards virtualization infrastructures.  Also, this trend enables
   end systems to be centralized and/or distributed, so that they are
   deployed to best accomplish customer requirements in terms of
   resources and qualities.

   One of the key functional requirements imposed on computer and
   network virtualization is a high degree of flexibility and
   reliability.
   Both qualities are subject to the underlying technologies but,
   while the latter has always been enforced on computer and network
   systems, flexibility is a relatively new requirement, which would
   not have been imposed without the backing of virtualization and
   cloud technologies.

3.2.  SDN and NFV

   SDN and NFV are conceived to bring a high degree of flexibility and
   conceptual centralization to the network.  On the one hand, with
   SDN, the network can be programmed to implement a dynamic behavior
   that changes its topology and overall qualities.  On the other
   hand, with NFV, the functions that are typically provided by
   physical network equipment are now implemented as virtual
   appliances that can be deployed and linked together to provide
   customized network services.  SDN and NFV complement each other to
   actually implement the network aspect of the aforementioned virtual
   computer and network systems.

   Although centralization can lead us to think of the single-point-
   of-failure concept, this is not the case for these technologies.
   Conceptual centralization differs greatly from centralized
   deployment.  It brings all the benefits of having a single point of
   decision while retaining the benefits of distributed systems.  For
   instance, control decisions in SDN can be centralized while the
   mechanisms that enforce such decisions in the network (SDN
   controllers) can be implemented as highly distributed systems.  The
   same approach can be applied to NFV.  Network functions can be
   implemented in a central computing facility, but they can also take
   advantage of several replication and distribution techniques to
   achieve the properties of distributed systems.  Nevertheless, NFV
   also allows the deployment of functions on top of distributed
   systems, so they benefit from both distribution alternatives at the
   same time.

3.3.  Management and Control

   The introduction of virtualization into the computer and network
   system landscape has increased the complexity of both underlying
   and overlying systems.  On the one hand, virtualizing underlying
   systems adds extra functions that must be managed properly to
   ensure the correct operation of the whole system, which encompasses
   not just the underlying elements but also the virtual elements
   running on top of them.  Such functions are used to actually host
   the overlying virtual elements, so there is an indirect management
   operation that involves virtual systems.  Moreover, such
   complexities are inherited by the final systems that get
   virtualized and deployed on top of those virtualization
   infrastructures.

   In parallel, virtual systems are empowered with additional, and
   widely exploited, functionality that must be managed correctly.
   This is the case of the dynamic adaptation of virtual resources to
   the specific needs of their operation environments, or even the
   composition of distributed elements across heterogeneous underlying
   infrastructures, and possibly providers.

   Taking both complex functions into account, either separately or
   jointly, makes it clear that management requirements have greatly
   surpassed the limits of humans, so automation has become essential
   to accomplish most common tasks.

3.4.  Slice Gateway (SLG)

   A slice gateway (SLG) (see [I-D.homma-nfvrg-slice-gateway]) is
   basically a data-plane component with the role of data packet
   processing.
   Moreover, it provides an interface to export its functions for
   interacting with control and management components, so it is quite
   relevant for implementing the requirements described above within
   the network slicing domain.

   Furthermore, an SLG might be required to support handling the
   services provided on network slices in addition to controlling
   them, because an SLG is the edge node of an end-to-end network
   slice (E2E-NS).

   Therefore, the SLG exposes the following requirements:

      Data plane for NSs as infrastructure.

      Control/management plane for NSs as infrastructure.

      Data plane for services on NSs.

      Control/management plane for services on NSs.

   In summary, the SLG provides the functions required for the
   enforcement of AI decisions in multi-domain (and federated) network
   slices, so it will play a key role in general network management.

4.  Applying AI to Network Management

4.1.  Beyond Machine Learning

   ML is not AI.  AI has a broader spectrum of methods, some of which
   have already been exploited in the network for a long time.
   Perception, reasoning, and planning are still not fully exploited
   in the network.

4.2.  Briefing Artificial Intelligence

   Intelligence does not directly imply intelligent.  On the one hand,
   intelligence emphasizes data gathering and management, which can be
   processed by systematic methods or intelligent methods.  On the
   other hand, intelligent emphasizes the reasoning and understanding
   of data to actually "possess" the intelligence.

   The justification for applying AI to network (and) management is
   sometimes overlooked.  First, management decisions are more and
   more complex.  We have moved from asking simple questions ("Is
   there a problem in my system?") to much more complex ones ("Where
   should I migrate this VM to accomplish my goals?").  Moreover,
   operation environments are more and more dynamic.  On the one hand,
   softwarization and programmability elevate flexibility and allow
   networks to be totally adapted to their static and/or dynamic
   requirements.  On the other hand, network virtualization greatly
   enables network automation.

   The new functions and possibilities allow network devices to become
   autonomic.  However, they must make complex decisions by
   themselves, without human intervention, realizing the "dream" of
   zero-touch management (ZTM) of networks, which exploits fully
   programmable elements and advanced automation methods (ETSI ZSM).
   Nevertheless, we have to remember that AI methods are just
   resources, not solutions.  They will not replace human decisions,
   just complement and "automate" them.

5.  Extended Management Operation

5.1.  Intelligent Network Management Process

   In general, the correct and pertinent application of AI to network
   management provides enormous benefits, mainly in terms of making
   complex management operations feasible and improving the
   performance of typically expensive tasks.  By taking advantage of
   these benefits, the amount of data that can be analyzed to make
   decisions on the network can be hugely increased.

   As a result, AI makes it possible to enlarge the management process
   towards the Intelligent Network Management Process (INMP).
   Instead of just being focused on the analysis of performance
   measurements retrieved from the managed network and the subsequent
   decision (proaction or reaction), the extension of management
   operation enabled by INMP encompasses different sub-processes.

   First, INMP has a sub-process for retrieving the performance
   measurements from the managed network.  This is the same sub-
   process found in typical management processes.  Moreover, INMP
   encourages the application of ML techniques at this stage to obtain
   some insight into the situation of the managed network.

   Second, INMP incorporates a reasoning sub-process.  It receives
   both the output of the previous sub-process and additional context
   information, which can be provided by an external event detector,
   as described below.  Then, this sub-process finds out and
   particularizes the rules that govern the situation described above.
   Such rules are semantically constructed and will abstract the
   situation of the network in terms of logical and other semantic
   concepts, together with actions and transformations that can be
   applied to those rules.  All such constructions will be stored in
   the Intelligent Network Management Knowledge Base (INMKB), which
   will follow a pre-determined ontology and will also extend the
   knowledge by applying basic and atomic logic inference statements.

   Third, INMP defines the solving sub-process.  It works as follows.
   Once the abstracted situation of the managed network and the rules
   that apply to it have been obtained, the solving sub-process builds
   a graph with all semantic constructions.  It reflects the managed
   network, since all network elements have their semantic
   counterpart, but it also contains all situations, rules, actions,
   and even the measurements.  The solving sub-process applies
   ontology transformations to find a graph that is acceptable in
   terms of the associated situation and its adherence to
   administrative goals.

   Fourth, INMP incorporates the planning sub-process.  It receives
   the solution graph obtained by the previous sub-process and makes a
   linear plan of actions to execute in order to enforce the required
   changes in the network.  The actions used by this planning sub-
   process are the building blocks of the plan.  Each block will be
   defined with a precondition, an invariant, and a postcondition.  A
   planning algorithm should be used to obtain such a plan of actions
   by linking the building blocks so they can be enforced to finally
   adapt the managed network to reach the desired situation.

   All these processes must be executed in parallel, using strong
   inter-process communication and synchronization constraints.
   Moreover, the requests to the underlying infrastructure for the
   adaptation of the managed network will be sent to the corresponding
   controllers without waiting for the deliberation cycle to finish.
   This way, the time required by the whole cycle is highly reduced.
   This is possible because of the assumptions and anticipations tied
   to INMP and the intelligence it denotes.

5.2.  Closed Loop Management Approach

   Beginning with INMP, a key approach for achieving proper network
   management goals is to follow the closed control loop methodology.
   It ensures that the objectives are not just accomplished at a
   certain moment but kept in future cycles of both the management and
   network life-cycles.
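   The following non-normative sketch illustrates how the INMP sub-
   processes described in Section 5.1 could be wired into such a
   closed loop.  It is written in Python for concreteness; every class
   and function in it is an illustrative stub of this document's
   concepts, not part of any existing implementation, and a real
   deployment would run the sub-processes in parallel under the
   synchronization constraints discussed above.

   # Minimal, self-contained sketch of an INMP closed control loop.
   class ManagedNetwork:
       """Stub network: one scalar load metric, adjustable resources."""
       def __init__(self):
           self.load, self.resources = 0.8, 2
       def measurements(self):
           return {"load": self.load}
       def enforce(self, action):
           if action == "scale-out":
               self.resources += 1
               # Crude model: load dilutes as resources are added.
               self.load *= self.resources / (self.resources + 1)

   def reason(measurements, events):
       # Reasoning sub-process: derive the rule governing the situation.
       overloaded = measurements["load"] > 0.7 or "incident" in events
       return {"overloaded": overloaded}

   def solve_and_plan(rules):
       # Solving and planning sub-processes, collapsed into one step.
       return ["scale-out"] if rules["overloaded"] else []

   net, events = ManagedNetwork(), {"incident"}
   for _ in range(3):                    # the loop closes on itself:
       rules = reason(net.measurements(), events)
       for action in solve_and_plan(rules):
           net.enforce(action)           # enforcement feeds the next cycle
       events = set()                    # handled events expire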
   To obtain the benefits of integrating AI within the closed loop,
   the INMP processes must be re-wired to connect their outputs to
   their inputs, thus obtaining feedback analysis.  Moreover, an
   additional process must be defined to ensure that the objectives
   defined in the last steps of INMP are actually present in the near-
   future situation of the managed network.

   In addition, the data plane elements, such as the SLG described
   above, must provide some capabilities to make them coherent with
   the closed control loop.  Particularly, they must provide symmetric
   enforcement and telemetry interfaces, so that the elements
   composing the managed network can be modified and monitored using
   the same identifiers and having the same assumptions about their
   topology and context.  For instance, the SLG must be able to
   provide the functionality needed for INMP to request it to set up
   and connect the necessary structures for telemetry collection, as
   well as to request slice switching.

6.  Deep Exploitation of AI in Network Management

6.1.  From Data to Wisdom

   As AI methods gain access to a huge amount of (intelligence) data
   from the systems they manage, they become more and more able to
   take strategic decisions, mainly by refining such data into
   knowledge and, further, into wisdom.  This follows the well-known
   DIKW process (Data, Information, Knowledge, Wisdom) that enables
   elements to operate autonomously, subject to the goals established
   by administrators.

   In this way, AI methods can be guided by the events or situations
   found in underlying networks in a constantly evolving model.  We
   can call it the Knowledge (and Intelligence) Driven Network.  In
   this new network architecture, the very structure of the network
   results from reasoning on intelligence data.  The network adapts to
   new situations without requiring human involvement, but
   administrative policies are still enforced on decisions.
   Nevertheless, intelligence data must be managed properly to exploit
   all its potential.  Data with high accuracy and high frequency will
   be processed in real time.  Meanwhile, fast and scalable methods
   for information retrieval and decision enforcement become essential
   to the objectives of the network.

   To achieve such goals, AI algorithms must be adapted to work on
   network problems.  Joint physical and virtual network elements can
   form a multi-agent system focused on achieving such system goals.
   This can be applied to several use-cases.  For instance, it can be
   used for predicting traffic behavior, iterative network
   optimization, and assessment of administrative policies.

6.2.  External Event Detectors

   As mentioned above, the current mechanisms used to achieve
   automated management and control rely only on the continuous
   monitoring of the resources they control or the underlying
   infrastructure that hosts them.  However, there are several other
   sources of information that can be exploited to make the systems
   more robust and efficient.  This is the case of the notifications
   that can be provided by physical or virtual elements or devices
   that are watching for specific events, hence called external event
   detectors.

   More specifically, although the notifications provided by these
   external event detectors are related to occurrences outside the
   boundaries of the controlled system, such occurrences can affect
   the typical operation of controlled systems.  For instance, a heavy
   rainfall or snowfall can be detected and correlated with a huge
   increase in the number of requests experienced by some emergency
   support service.
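   As an illustration of the kind of notification such a detector
   could emit, the following non-normative Python sketch builds one
   external-event message with the fields of the information model
   defined in Appendix A.  The collector URL and the field values are
   assumptions made for the example, and the actual transport towards
   the management plane is out of the scope of this sketch.

   # Illustrative detector-side sketch: build one external-event
   # notification following the model of Appendix A.
   import base64, json, time, urllib.request

   event = {
       "id": "evt-0001",
       "source": "seismometer-27",
       "context": "building-3/basement",   # free-form context string
       "sequence": 1,
       "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
       # The payload is opaque binary; here, a seismometer reading.
       "payload": base64.b64encode(json.dumps(
           {"plid": "seismometer", "location": "38.0N,140.9E",
            "magnitude": 5}).encode()).decode(),
   }

   req = urllib.request.Request(
       "http://collector.example.net/events",   # hypothetical endpoint
       data=json.dumps(event).encode(),
       headers={"Content-Type": "application/json"})
   # urllib.request.urlopen(req)  # disabled: no real collector here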
6.3.  Network Requirement Anticipation

   One of the main goals of the MANO mechanisms is to ensure that the
   virtual computer and network system they manage meets the
   requirements established by its owners and administrators.  This is
   currently achieved by observing and analyzing the performance
   measurements obtained either by directly asking the resources
   forming the managed system or by asking the controllers of the
   underlying infrastructure that hosts such resources.  Thus, under
   changing or eventful situations, the managed system must be adapted
   either to cope with the new requirements, by increasing the amount
   of resources assigned to it, or to make efficient use of available
   infrastructures, by reducing that amount.

   However, the time required by the infrastructure to make the
   adaptations requested by the MANO mechanisms effective is longer
   than the time client requests need to overload the system and make
   it discard further client requests.  This situation is generally
   undesired but particularly dangerous for some systems, such as the
   emergency support system mentioned above.  Therefore, in order to
   avoid the disruption of the service, the change in requirements
   must be anticipated to ensure that any adaptation has finished as
   soon as possible, preferably before the target system gets
   overloaded or underloaded.

   Here we link the application of AI to network management to ARCA
   (Appendix B).  It is integrated with NFV-MANO to enable the latter
   to take advantage of the events notified by the external event
   detectors, by correlating them with the target amount of resources
   required by the managed system and enforcing the necessary
   adaptations beforehand, particularly before the system performance
   metrics have actually changed.

   The following abstract algorithm formalizes the workflow expected
   to be followed by the different implementations of the operation
   proposed here.

   while TRUE do
       event = GetExternalEventInformation()
       if event != NONE then
           anticipated_resource_amount = Anticipator.Get(event)
           if IsPolicyCompliant(anticipated_resource_amount) then
               current_resource_amount = anticipated_resource_amount
               anticipation_time = NOW
           end if
       end if
       anticipated_event = event
       if anticipated_event != NONE and
               (NOW - anticipation_time) > EXPIRATION_TIME then
           current_resource_amount = DEFAULT_RESOURCE_AMOUNT
           anticipated_event = NONE
       end if
       state = GetSystemState()
       if not IsAcceptable(state, current_resource_amount) then
           current_resource_amount = GetResourceAmountForState(state)
           if anticipated_event is not NONE then
               Anticipator.Set(anticipated_event,
                               current_resource_amount)
               anticipated_event = NONE
           end if
       end if
   end while

   This algorithm considers both internal and external events to
   determine the control and management actions necessary to achieve
   the proper anticipation of the resources assigned to the target
   system.  We propose that the different implementations follow the
   same approach so they can anticipate what to expect when they
   interact.  For instance, a consumer, such as an Application Service
   Provider (ASP), can expect some specific behavior from the Virtual
   Network Operator (VNO) from which it is consuming resources.  This
   helps both the ASP and the VNO to properly address resource
   fluctuations.
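   A possible concrete rendering of this abstract workflow is sketched
   below in Python.  It is non-normative and deliberately simplified:
   the Anticipator is a plain event-to-resources map, the policy-
   compliance check is omitted, and GetResourceAmountForState() is
   replaced by a stub that scales resources linearly with the observed
   load.

   # Simplified, illustrative rendering of the anticipation workflow.
   import time

   DEFAULT, EXPIRATION = 2, 60.0      # servants / seconds (assumed)

   anticipator = {"earthquake": 8}    # learned event -> servants map
   current, anticipated, t0 = DEFAULT, None, 0.0

   def step(event, load):             # one pass of the loop body above
       global current, anticipated, t0
       if event is not None:          # anticipate on an external event
           current = anticipator.get(event, DEFAULT)
           anticipated, t0 = event, time.time()
       if anticipated and time.time() - t0 > EXPIRATION:
           current, anticipated = DEFAULT, None   # anticipation expired
       needed = max(1, round(load * 10))  # stub for the reactive path
       if needed > current:           # state not acceptable: react...
           current = needed
           if anticipated:            # ...and learn the correlation
               anticipator[anticipated] = current
               anticipated = None
       return current

   step("earthquake", 0.3)   # -> 8 servants before the load rises
   step(None, 1.1)           # -> 11 servants; the anticipator learns 11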
6.4.  Intelligent Reasoning

   It is trivial for anybody to understand that the behavior of the
   network results from user activity.  For instance, more users means
   more traffic.  However, it is not commonly considered that user
   activity has a direct dependency on events that occur outside the
   boundaries of the networks they use.  For example, if a video
   becomes trendy, the load of the network that hosts the video
   increases, but so does the load of any network with users watching
   the video.  In the same way, if a natural incident occurs (e.g.
   heavy rainfall, earthquake), people try to contact their relatives
   and the load of a telephony network increases.  From this we can
   easily find out that there is a clear causality relation between
   events occurring in the real and digital world and the behavior of
   the network (a.k.a. the Internet).

   Network management outcomes, in terms of system stability,
   performance, reliability, etc., would greatly improve by exploiting
   such a causality relation.  An easy and straightforward way to do
   so is to apply AI reasoning methods.  These methods can be used to
   "guess" the effect of a given cause.  Moreover, reasoning can be
   used to choose the specific events that can impact the system,
   hence being the cause of some effect.

   Meanwhile, reasoning on network behavior from performance
   measurements and external events poses some challenges.  First,
   external event information must cross the administrative domain of
   the network to which it is relevant.  This means that there must be
   interfaces and security policies that regulate how information is
   exchanged between the external event detector, which can be some
   sensor deployed in some "smart" place (e.g. smart city, smart
   building), and the management solution, which resides inside the
   administrative domain of the managed network.  This function must
   be well regulated, and the protocols used to achieve it must be
   widely accepted and tested, in order to exploit the overall
   potential of external events.

   Second, enough meta-data must be associated with performance
   measurements to clearly identify all aspects of the effects, so
   that they can be traced back to their causes (events).  Such meta-
   data must follow an ontology (information model) that is common and
   widely accepted or, at least, easy to transform among the different
   formats and models used by different vendors and software.

   Third, the management ontology must be extended with all concepts
   from the boundaries of the managed network, its external
   environment (surroundings), and any entity that, albeit being far
   away, can impact the function of the managed network.

6.5.  Gaps and Standardization Issues

   Several gaps and standardization issues arise from applying AI and
   reasoning to network management solutions:

      Methods from different providers/vendors must be able to coexist
      and work together, either directly or by means of a translator.
      They must, however, use the same concepts, albeit using
      different naming, so they actually share a common ontology.
      Information retrieval must be assessed for quality so that the
      outputs from AI reasoning, and thus management solutions, can be
      reliable.

      Ontological concepts must be consistent so that the types and
      qualities of the information that is retrieved from a system or
      object are as expected.

      The protocols used to communicate (or disseminate, or publish)
      the information must respond to the constraints of their target
      usage.

7.  Relation to Other IETF/IRTF Initiatives

   TBD

8.  IANA Considerations

   This memo includes no request to IANA.

9.  Security Considerations

   As with other AI mechanisms, the major security concern for the
   adoption of intelligent reasoning on external events to manage
   network slices and SDN/NFV systems is that the boundaries of the
   control and management planes are crossed to introduce information
   from outside.  Such communications must be highly and heavily
   secured since some malfunction or explicit attacks might compromise
   the integrity and execution of the controlled system.  However, it
   is up to implementers to deploy the necessary countermeasures to
   avoid such situations.  From the design point of view, since all
   operations are performed within the control and/or management
   planes, the security level of reasoning solutions is inherited, and
   thus determined, by the security measures established by the
   systems forming such planes.

10.  Acknowledgements

   TBD

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

11.2.  Informative References

   [ETSI-NFV-IFA-004]
              ETSI NFV GS NFV-IFA 004, "Network Functions
              Virtualisation (NFV); Acceleration Technologies;
              Management Aspects Specification", 2016.

   [ETSI-NFV-IFA-005]
              ETSI NFV GS NFV-IFA 005, "Network Functions
              Virtualisation (NFV); Management and Orchestration;
              Or-Vi reference point - Interface and Information Model
              Specification", 2016.

   [ETSI-NFV-IFA-006]
              ETSI NFV GS NFV-IFA 006, "Network Functions
              Virtualisation (NFV); Management and Orchestration;
              Vi-Vnfm reference point - Interface and Information
              Model Specification", 2016.

   [ETSI-NFV-IFA-019]
              ETSI NFV GS NFV-IFA 019, "Network Functions
              Virtualisation (NFV); Acceleration Technologies;
              Management Aspects Specification; Release 3", 2017.

   [ETSI-NFV-MANO]
              ETSI NFV GS NFV-MAN 001, "Network Functions
              Virtualisation (NFV); Management and Orchestration",
              2014.

   [I-D.geng-coms-architecture]
              Geng, L., Qiang, L., Lucena, J., Ameigeiras, P., Lopez,
              D., and L. Contreras, "COMS Architecture", draft-geng-
              coms-architecture-02 (work in progress), March 2018.

   [I-D.homma-nfvrg-slice-gateway]
              Homma, S., Foy, X., and A. Galis, "Gateway Function for
              Network Slicing", draft-homma-nfvrg-slice-gateway-00
              (work in progress), July 2018.

   [I-D.qiang-coms-netslicing-information-model]
              Qiang, L., Galis, A., Geng, L., Makhijani, K.,
              Martinez-Julia, P., Flinck, H., and X. Foy, "Technology
              Independent Information Model for Network Slicing",
              draft-qiang-coms-netslicing-information-model-02 (work
              in progress), January 2018.

   [I-D.song-ntf]
              Song, H., Zhou, T., Li, Z., Fioccola, G., Li, Z.,
              Martinez-Julia, P., Ciavaglia, L., and A.
Wang, "Toward a 660 Network Telemetry Framework", draft-song-ntf-02 (work in 661 progress), July 2018. 663 [ICIN-2017] 664 P. Martinez-Julia, V. P. Kafle, and H. Harai, "Achieving 665 the autonomic adaptation of resources in virtualized 666 network environments, in Proceedings of the 20th ICIN 667 Conference (Innovations in Clouds, Internet and Networks, 668 ICIN 2017). Washington, DC, USA: IEEE, 2018, pp. 1--8", 669 2017. 671 [ICIN-2018] 672 P. Martinez-Julia, V. P. Kafle, and H. Harai, 673 "Anticipating minimum resources needed to avoid service 674 disruption of emergency support systems, in Proceedings of 675 the 21th ICIN Conference (Innovations in Clouds, Internet 676 and Networks, ICIN 2018). Washington, DC, USA: IEEE, 2018, 677 pp. 1--8", 2018. 679 [OPENSTACK] 680 The OpenStack Project, "http://www.openstack.org/", 2018. 682 Appendix A. Information Model to Support Reasoning on External Events 684 In this section we introduce the basic model needed to support 685 reasoning on external events. It basically includes the concepts and 686 structures used to describe external events and notify (communicate) 687 them to the interested sink, the network controller/manager, through 688 the control and management plane, depending on the specific 689 instantiation of the system. 691 A.1. Tree Structure 692 module: ietf-nmrg-nict-ai-reasoning 693 +--rw events 694 +--rw event-payloads 695 +--rw external-events 697 notifications: 698 +---n event 700 The main models included in the tree structure of the module are the 701 events and notifications. On the one hand, events are structured in 702 payloads and the content of events itself (external-events). On the 703 other hand, there is only one notification, which is the event 704 itself. 706 A.1.1. event-payloads 708 +--rw event-payloads 709 +--rw event-payloads-basic 710 +--rw event-payloads-seismometer 711 +--rw event-payloads-bigdata 713 The event payloads are, for the time being, composed of three types. 714 First, we have defined the basic payload, which is intended to carry 715 any arbitrary data. Second, we have defined the seismometer payload 716 to carry information about seisms. Third, we have defined the 717 bigdata payload that carries notifications coming from BigData 718 sources. 720 A.1.1.1. basic 722 +--rw event-payloads-basic* [plid] 723 +--rw plid string 724 +--rw data? union 726 The basic payload is able to hold any data type, so it has a union of 727 several types. It is intended to be used by any source of events 728 that is (still) not covered by other model. In general, any source 729 of telemetry information (e.g. OpenStack [OPENSTACK] controllers) 730 can use this model as such sources can encode on it their 731 information, which typically is very simple and plain. Therefore, 732 the current model is tightly interrelated to a framework to retrieve 733 network telemetry (see [I-D.song-ntf]). 735 A.1.1.2. seismometer 736 +--rw event-payloads-seismometer* [plid] 737 +--rw plid string 738 +--rw location? string 739 +--rw magnitude? uint8 741 The seismometer model includes the main information related to a 742 seism, such as the location of the incident and its magnitude. 743 Additional fields can be defined in the future by extending this 744 model. 746 A.1.1.3. bigdata 748 +--rw event-payloads-bigdata* [plid] 749 +--rw plid string 750 +--rw description? string 751 +--rw severity? uint8 753 The bigdata model includes a description of an event (or incident) 754 and its estimated general severity, unrelated to the system. 
   The description is an arbitrary string of characters that would
   normally carry information that describes the event using some
   higher-level format, such as Turtle or N3 for carrying RDF
   knowledge items.

A.1.2.  external-events

   +--rw external-events* [id]
      +--rw id           string
      +--rw source?      string
      +--rw context?     string
      +--rw sequence?    int64
      +--rw timestamp?   yang:date-and-time
      +--rw payload?     binary

   The model defined to encode external events, which encapsulates the
   payloads introduced above, is completed with an identifier of the
   message, a string describing the source of the event, a sequence
   number, and a timestamp.  Additionally, it includes a string
   describing the context of the event.  It is intended to communicate
   the required information about the system that detected the event,
   its location, etc.  Like the description of the bigdata payload,
   this field can be formatted with a high-level format, such as RDF.

A.1.3.  notifications/event

   notifications:
     +---n event
        +--ro id?          string
        +--ro source?      string
        +--ro context?     string
        +--ro sequence?    int64
        +--ro timestamp?   yang:date-and-time
        +--ro payload?     binary

   The event notification inherits all the fields from the model of
   external events defined above.  It is intended to allow software
   and hardware elements to send, receive, and interpret not just the
   events that have been detected and notified by, for instance, a
   sensor, but also the notifications issued by the underlying
   infrastructure controllers, such as the OpenStack Controller.

A.2.  YANG Module

   module ietf-nmrg-nict-ai-reasoning {
     namespace "urn:ietf:params:xml:ns:yang:ietf-nmrg-nict-ainm";
     prefix rant;

     import ietf-yang-types { prefix yang; }

     grouping external-event-information {
       leaf id { type string; }
       leaf source { type string; }
       leaf context { type string; }
       leaf sequence { type int64; }
       leaf timestamp { type yang:date-and-time; }
       leaf payload { type binary; }
     }

     grouping event-payload-basic {
       leaf plid { type string; }
       leaf data { type union { type string; type binary; } }
     }

     grouping event-payload-seismometer {
       leaf plid { type string; }
       leaf location { type string; }
       leaf magnitude { type uint8; }
     }

     grouping event-payload-bigdata {
       leaf plid { type string; }
       leaf description { type string; }
       leaf severity { type uint8; }
     }

     notification event {
       uses external-event-information;
     }

     container events {
       container event-payloads {
         list event-payloads-basic {
           key "plid";
           uses event-payload-basic;
         }
         list event-payloads-seismometer {
           key "plid";
           uses event-payload-seismometer;
         }
         list event-payloads-bigdata {
           key "plid";
           uses event-payload-bigdata;
         }
       }
       list external-events {
         key "id";
         uses external-event-information;
       }
     }
   }
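   For illustration, an external event carrying the seismometer
   payload could be notified with the following NETCONF-style XML
   instance.  It is non-normative: the field values are invented, the
   payload is shown truncated, and only the namespace follows the
   module above.

   <event xmlns="urn:ietf:params:xml:ns:yang:ietf-nmrg-nict-ainm">
     <id>evt-0001</id>
     <source>seismometer-27</source>
     <context>building-3/basement</context>
     <sequence>1</sequence>
     <timestamp>2020-03-06T09:30:00Z</timestamp>
     <payload>eyJwbGlkIjogInNlaXNtb21ldGVyIiwgLi4u</payload>
   </event>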
Appendix B.  The Autonomic Resource Control Architecture (ARCA)

   As deeply discussed in ICIN 2018 [ICIN-2018], ARCA leverages the
   elastic adaptation of the resources assigned to virtual computer
   and network systems by calculating or estimating their requirements
   from the analysis of load measurements and the detection of
   external events.  These events can be notified by physical elements
   (things, sensors) that detect changes in the environment, as well
   as by software elements that analyze digital information, such as
   connectors to sources or analyzers of Big Data.  For instance, ARCA
   is able to consider the detection of an earthquake or a heavy
   rainfall to overcome the damage it can cause to the controlled
   system.

   The policies that ARCA must enforce will be specified by
   administrators during the configuration of the control/management
   engine.  Then, ARCA continues running autonomously, with no more
   human involvement unless some parameter must be changed.  ARCA will
   adopt the required control and management operations to adapt the
   controlled system to the new situation or requirements.  The main
   goal of ARCA is thus to reduce the time required for resource
   adaptation from hours/minutes to seconds/milliseconds.  With the
   aforementioned statements, system administrators are able to
   specify the general operational boundaries in terms of lower and
   upper system load thresholds, as well as the minimum and maximum
   amount of resources that can be allocated to the controlled system
   to overcome any eventual situation, including the natural crossing
   of such thresholds.

   ARCA's functional goal is to run autonomously, while its
   performance goal is to keep the resources assigned to the
   controlled system as close as possible to the optimum (e.g. within
   5 % of the optimum) while avoiding service disruption as much as
   possible, keeping the client request discard rate as low as
   possible (e.g. below 1 %).  To achieve both goals, ARCA relies on
   the Autonomic Computing (AC) paradigm, in the form of
   interconnected micro-services.  Therefore, ARCA includes the four
   main elements and activities defined by AC, incarnated as:

   Collector  Is responsible for gathering and formatting the
              heterogeneous observations that will be used in the
              control cycle.

   Analyzer   Correlates the observations with each other in order to
              find the situation of the controlled system, especially
              the current load of the resources allocated to the
              system and the occurrence of an incident that can affect
              the normal operation of the system, such as an
              earthquake that increases the traffic in an emergency-
              support system, which is the main target scenario
              studied in this document.

   Decider    Determines the necessary actions to adjust the resources
              to the load of the controlled system.

   Enforcer   Requests the underlying and overlying infrastructure,
              such as OpenStack, to make the necessary changes to
              reflect the effects of the decided actions in the
              system.

   Being a micro-service architecture means that the different
   components are executed in parallel.  This allows such components
   to operate in two ways.  First, their operation can be dispatched
   by receiving a message from the previous service or an external
   service.  Second, the services can be self-dispatched, so they can
   activate some action or send some message without being previously
   stimulated by any message.  The overall control process loops
   indefinitely and is closed by checking that the expected effects of
   an action are actually taking place.  The coherence among the
   distributed services involved in the ARCA control process is
   ensured by enforcing a common semantic representation and ontology
   on the messages they exchange.
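   The following non-normative Python sketch shows the four ARCA
   activities as queue-connected micro-services.  All handlers are toy
   stand-ins for the real components: a real deployment would run each
   stage as a separate service exchanging semantically annotated
   messages, as described above.

   # Sketch: the four ARCA activities as queue-connected stages.
   import queue, threading

   def stage(handler, inbox, outbox):
       while True:
           msg = inbox.get()
           if msg is None:            # shutdown marker propagates
               if outbox:
                   outbox.put(None)
               return
           out = handler(msg)
           if outbox:
               outbox.put(out)

   collector = lambda raw: {"load": raw}                 # format input
   analyzer = lambda obs: {"overloaded": obs["load"] > 0.7}
   decider = lambda sit: "add-servant" if sit["overloaded"] else "noop"
   enforcer = lambda act: print("enforcing:", act)       # e.g. NFVI call

   q = [queue.Queue() for _ in range(4)]
   stages = [(collector, q[0], q[1]), (analyzer, q[1], q[2]),
             (decider, q[2], q[3]), (enforcer, q[3], None)]
   threads = [threading.Thread(target=stage, args=s) for s in stages]
   for t in threads:
       t.start()
   for sample in (0.5, 0.9):          # two load observations
       q[0].put(sample)
   q[0].put(None)                     # shut the pipeline down
   for t in threads:
       t.join()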
   ARCA semantics are built with the Resource Description Framework
   (RDF) and the Web Ontology Language (OWL), which are well-known and
   widely used standards for the semantic representation and
   management of knowledge.  They provide the ability to represent new
   concepts without requiring changes to the software, just plugging
   extensions into the ontology.  ARCA stores all its knowledge in the
   Knowledge Base (KB), which is queried and kept up-to-date by the
   analyzer and decider micro-services.  It is implemented with Apache
   Jena Fuseki, which is a high-performance RDF data store that
   supports SPARQL through an HTTP/REST interface.  Being de-facto
   standards, both technologies enable ARCA to be easily integrated
   with virtualization platforms like OpenStack.
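   Since Fuseki exposes SPARQL over plain HTTP, querying the KB only
   requires standard tooling.  The following non-normative Python
   sketch illustrates such a query; the dataset name ("arca") and the
   ontology terms in the query are assumptions made for the example.

   # Illustrative query against the ARCA knowledge base in Fuseki.
   import json, urllib.parse, urllib.request

   query = """
   PREFIX arca: <http://example.org/arca#>
   SELECT ?servant ?load WHERE {
     ?servant a arca:Servant ;
              arca:cpuLoad ?load .
     FILTER (?load > 0.7)
   }"""

   url = ("http://localhost:3030/arca/query?" +
          urllib.parse.urlencode({"query": query}))
   req = urllib.request.Request(
       url, headers={"Accept": "application/sparql-results+json"})
   # with urllib.request.urlopen(req) as resp:       # needs a server
   #     for row in json.load(resp)["results"]["bindings"]:
   #         print(row["servant"]["value"], row["load"]["value"])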
Appendix C.  ARCA Integration With ETSI-NFV-MANO

   In this section we describe how to fit ARCA into a general SDN/NFV
   underlying infrastructure and introduce a showcase experiment that
   demonstrates its operation on an OpenStack-based experimentation
   platform.  We first describe the integration of ARCA with the NFV-
   MANO reference architecture.  We contextualize the significance of
   this integration by describing an emergency support scenario that
   clearly benefits from it.  Then we proceed to detail the elements
   forming the OpenStack platform and finally we discuss some initial
   results obtained from them.

C.1.  Functional Integration

   The most important functional blocks of the NFV reference
   architecture promoted by ETSI (see ETSI-NFV-MANO [ETSI-NFV-MANO])
   are the system support functions for operations and business (OSS/
   BSS), the element management (EM) and, obviously, the Virtual
   Network Functions (VNFs).  But these functions cannot exist without
   being instantiated on a specific infrastructure, the NFV
   infrastructure (NFVI), and all of them must be coordinated,
   orchestrated, and managed by the general NFV-MANO functions.

   Both the NFVI and the NFV-MANO elements are subdivided into several
   sub-components.  The NFVI has the underlying physical computing,
   storage, and network resources, which are sliced (see
   [I-D.qiang-coms-netslicing-information-model] and
   [I-D.geng-coms-architecture]) and virtualized to form the virtual
   computing, storage, and network resources that will host the VNFs.
   In addition, the NFV-MANO is subdivided into the NFV Orchestrator
   (NFVO), the VNF Manager (VNFM), and the Virtual Infrastructure
   Manager (VIM).  As their names indicate, all high-level elements
   and sub-components have their own very specific objective in the
   NFV architecture.

   During the design of ARCA we enforced both operational and
   interfacing aspects of its main objectives.  From the operational
   point of view, ARCA processes observations to manage virtual
   resources, so it plays the role of the VIM mentioned above.
   Therefore, ARCA has been designed with the appropriate interfaces
   to fit in the place of the VIM.  This way, ARCA provides the NFV
   reference architecture with the ability to react to external events
   to adapt virtual computer and network systems, even anticipating
   such adaptations, as performed by ARCA itself.  However, some
   interfaces must be extended to fully enable ARCA to perform its
   work within the NFV architecture.

   Once ARCA is placed in the position of the VIM, it enhances the
   general NFV architecture with its autonomic management
   capabilities.  In particular, it relieves the VNFM and NFVO of some
   responsibilities, so they can focus on their own business while the
   virtual resources behave as they expect (and request).  Moreover,
   ARCA improves the scalability and reliability of the managed system
   in case of disconnection from the orchestration layer due to some
   failure, network split, etc.  This is also achieved by the
   autonomic capabilities, which, as described above, are guided by
   the rules and policies specified by the administrators and, here,
   communicated to ARCA through the NFVO.  However, ARCA will not be
   limited to such operation; more generally, it will accomplish the
   requirements established by the Virtual Network Operators (VNOs),
   which are the owners of the slice of virtual resources that is
   managed by a particular instance of NFV-MANO, and therefore by
   ARCA.

   In addition to the operational functions, ARCA incorporates the
   necessary mechanisms to engage the interfaces that enable it to
   interact with other elements of the NFV-MANO reference
   architecture.  More specifically, ARCA is bound to the Or-Vi (see
   ETSI-NFV-IFA-005 [ETSI-NFV-IFA-005]) and the Nf-Vi (see ETSI-NFV-
   IFA-004 [ETSI-NFV-IFA-004] and ETSI-NFV-IFA-019
   [ETSI-NFV-IFA-019]).  The former is the point of attachment between
   the NFVO and the VIM, while the latter is the point of attachment
   between the NFVI and the VIM.  In our current design we decided to
   avoid supporting the point of attachment between the VNFM and the
   VIM, called Vi-Vnfm (see ETSI-NFV-IFA-006 [ETSI-NFV-IFA-006]).  We
   leave it for future evolutions of the proposed integration, which
   will be enabled by a possible solution that provides the functions
   of the VNFM required by ARCA.

   Through the Or-Vi, ARCA receives the instructions it will enforce
   on the virtual computer and network system it is controlling.  As
   mentioned above, these are specified in the form of rules and
   policies, which are in turn formatted as several statements and
   embedded into the Or-Vi messages.  In general, these will be high-
   level objectives, so ARCA will use its reasoning capabilities to
   translate them into more specific, low-level objectives.  For
   instance, the Or-Vi can specify some high-level statement to avoid
   CPU overloading and ARCA will use its innate and acquired knowledge
   to translate it into specific statements that indicate which
   parameters it has to measure (the CPU load of the assigned servers)
   and their desired boundaries, in the form of a high threshold and a
   low threshold, as sketched below.  Moreover, the Or-Vi will be used
   by the NFVO to specify which actions can be used by ARCA to
   overcome the violation of the mentioned policies.

   All information flowing through the Or-Vi interface is encoded and
   formatted by following a simple but highly extensible ontology and
   exploiting the aforementioned semantic formats.  This ensures that
   the interconnected system is able to evolve, including the
   replacement of components, the updating (addition or removal) of
   the supported concepts to understand new scenarios, and the
   connection of external tools to further enhance the management
   process.  The only requirement to ensure this feature is that all
   elements support the mentioned ontology and semantic formats.
   Although it is not a finished task, the development of semantic
   technologies allows the easy adaptation and translation of existing
   information formats, so it is expected that more and more software
   pieces will become easily integrable with the ETSI-NFV-MANO
   [ETSI-NFV-MANO] architecture.
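   The translation of a high-level Or-Vi objective into measurable
   parameters could look like the following non-normative Python
   sketch.  The statement syntax, the thresholds, and the translation
   rule are all invented for the example; in ARCA this step is driven
   by the ontology and the reasoner rather than by hard-coded rules.

   # Hypothetical illustration of the Or-Vi policy translation.
   HIGH_LEVEL = "arca:EmergencyService arca:mustAvoid arca:CpuOverload"

   def translate(statement, knowledge):
       # Assumed reasoning step: map an abstract goal onto concrete
       # measurement targets, bounds, and permitted actions.
       if "CpuOverload" in statement:
           return {"metric": "cpu_util",
                   "scope": knowledge["servers"],
                   "low_threshold": 0.3,    # illustrative bounds
                   "high_threshold": 0.7,
                   "actions": ["add-servant", "remove-servant"]}
       raise ValueError("no rule for: " + statement)

   print(translate(HIGH_LEVEL, {"servers": ["servant-1", "servant-2"]}))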
   In contrast to the Or-Vi interface, the Nf-Vi interface exposes
   more precise and low-level operations.  Although this makes it
   easier to integrate with ARCA, it also ties it to specific
   implementations.  In other words, building a proxy that enforces
   the aforementioned ontology on different interface instances to
   homogenize them adds undesirable complexity.  Therefore, new
   components have been specifically developed for ARCA to be able to
   interact with different NFVIs.  Nevertheless, this specialization
   is limited to the collector and enforcer.  Moreover, it allows ARCA
   to have optimized low-level operations, greatly improving the
   overall performance.  This is the case of the specific
   implementations of the collector and enforcer used with Mininet and
   Docker, which were used as underlying infrastructures in the
   previous experiments described in ICIN 2017 [ICIN-2017].  Moreover,
   as discussed in the following section, this is also the case of the
   implementations of the collector and enforcer tied to the OpenStack
   telemetry and compute interfaces, respectively.  Hence, it is
   important to ensure that telemetry is properly addressed, so we
   insist on the need to adopt a common framework at such an endpoint
   (see [I-D.song-ntf]).

   Although OpenStack still lacks some functionality regarding the
   construction of specific virtual networks, we use it as the NFVI
   functional block in the integrated approach.  Therefore, OpenStack
   is the provider of the underlying SDN/NFV infrastructure and we
   exploited its APIs and SDK to achieve the integration.  More
   specifically, in our showcase we use the APIs provided by the
   Ceilometer, Gnocchi, and Compute services as well as the SDK
   provided for Python, as sketched below.  All of them are gathered
   within the Nf-Vi interface.  Moreover, we have extended the Or-Vi
   interface to connect external elements, such as the physical or
   environmental event detectors and Big Data connectors, which is
   becoming a mandatory requirement of the current virtualization
   ecosystem and constitutes our main extension to the NFV
   architecture.
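   For illustration, the following non-normative Python sketch shows
   the shape of the two Nf-Vi endpoints used by ARCA, based on the
   public openstacksdk interface.  The cloud name, image, flavor, and
   network identifiers are placeholders, and error handling is
   omitted.

   # Sketch of the collector/enforcer endpoints towards OpenStack.
   import openstack

   conn = openstack.connect(cloud="arca-domain-1")  # reads clouds.yaml

   def enforce_add_servant(index):
       # Enforcer side: attach one more servant VM to the service.
       return conn.compute.create_server(
           name="servant-%d" % index,
           image_id="IMAGE_UUID", flavor_id="FLAVOR_UUID",
           networks=[{"uuid": "NETWORK_UUID"}])

   def collect_cpu_measures(metric_id):
       # Collector side: CPU measures stored by Ceilometer/Gnocchi,
       # fetched through Gnocchi's REST API via the SDK session.
       resp = conn.session.get(
           "/v1/metric/%s/measures" % metric_id,
           endpoint_filter={"service_type": "metric"})
       return resp.json()  # [[timestamp, granularity, value], ...]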
C.2.  Target Experiment and Scenario

   From the beginning of our work on the design of ARCA we have been
   targeting real-world scenarios, so we get better-suited
   requirements.  In particular, we work with a scenario that
   represents an emergency support service that is hosted on a virtual
   computer and network system, which is in turn hosted on the
   distributed virtualization infrastructure of a medium-sized
   organization.  The objective is to clearly represent an application
   that requires high dynamicity and a high degree of reliability.
   The emergency support service accomplishes this by being barely
   used when there is no incident but heavily loaded when there is
   one.

   Both the underlying infrastructure and the virtual network share
   the same topology.  They have four independent but interconnected
   network domains that form part of the same administrative domain
   (organization).  The first domain hosts the systems of the
   headquarters (HQ) of the owner organization, so the VNFs it hosts
   (servants) implement the emergency support service.  We call them
   "servants" because they are Virtual Machine (VM) instances that
   work together to provide a single service by backing the Load
   Balancer (LB) instances deployed in the separate domains.  The
   amount of resources (servants) assigned to the service will be
   adjusted by ARCA, attaching or detaching servants to meet the load
   boundaries specified by administrators.

   The other domains represent different buildings of the organization
   and will host the clients that access the service when an incident
   occurs.  They also host the necessary LB instances, which are also
   VNFs controlled by ARCA to regulate the access of clients to
   servants.  All domains will have physical detectors to provide
   external information that can (and will) be correlated with the
   load of the controlled virtual computer and network system and will
   thus affect the number of servants assigned to it.  Although the
   underlying infrastructure, the servants, and the ARCA instance are
   the same as those used in the real world, both clients and
   detectors will be emulated.  Anyway, this does not reduce the
   transferability of the results obtained from our experiments, as it
   allows us to expand the number of clients beyond the limits of most
   physical infrastructures.

   Each underlying OpenStack domain will be able to host a maximum of
   100 clients, as they will be deployed on a low-profile virtual
   machine (flavor in OpenStack).  In general, clients will perform
   requests at a rate of one request every ten seconds, so there will
   be a maximum of 30 requests per second.  However, under the
   simulated incident, the clients will raise their load to reach a
   common maximum of 1200 requests per second.  This mimics the shape
   and size of a real medium-size organization of about 300 users that
   perform a maximum of four requests per second when they need some
   support.

   The topology of the underlying network is simplified by connecting
   the four domains to the same high-performance switch.  However, the
   topology of the virtual network is built by using direct links
   between the HQ domain and the other three domains.  These are
   complemented by links between domains 2 and 3, and between domains
   3 and 4.  This way, the three domains have three paths to reach the
   HQ domain: a direct path with just one hop, and two indirect paths
   with two and three hops, respectively.

   During the execution of the experiment, the detectors notify the
   incident to the controller as soon as it happens.  However,
   although the clients are stimulated at the same time, there is some
   delay between the occurrence of the incident and the moment the
   network service receives the increase in load.  One of the main
   targets of our experiment is to study such delay and take advantage
   of it to anticipate the number of servants required by the system.
   We discuss it below.

   In summary, this scenario highlights the main benefits of ARCA
   playing the role of the VIM and interacting with the underlying
   OpenStack platform.  This means advancing towards an efficient use
   of resources and thus reducing the CAPEX of the system.
C.3.  OpenStack Platform

The implementation of the scenario described above reflects the
requirements of any edge/branch networking infrastructure, which is
composed of several distributed micro-data-centers deployed in the
wiring centers of buildings and/or storeys.  We chose OpenStack to
meet such requirements because it is widely used in production
infrastructures, so the resulting infrastructure has the necessary
robustness to accomplish our objectives while reflecting the typical
underlying platform found in any SDN/NFV environment.

We have deployed four separate network domains, each with its own
OpenStack instantiation.  All domains are fully capable of running
regular OpenStack workloads, i.e. executing VMs and networks, but, as
mentioned above, we designate domain 1 as the headquarters of the
organization.  The different underlying networks required by this
(quite complex) deployment are provided by several VLANs within a
high-end L2 switch.  This switch represents the distributed network
of the organization.  Four separate VLANs are used to isolate the
traffic within each domain by connecting an interface of OpenStack's
controller and compute nodes.  These VLANs therefore form the
distributed data plane.  Another VLAN carries the control plane as
well as the management plane, which are used by the NFV-MANO, and
thus by ARCA, which is instantiated in the physical machine called
the ARCA Node to exchange the control and management operations
performed by the collector and enforcer defined in ARCA.  This VLAN
is shared among all OpenStack domains to implement the global control
of the virtualization environment of the organization.  Finally, one
more VLAN is used by the infrastructure to interconnect the data
planes of the separate domains and to allow all elements of the
infrastructure to access the Internet for software installation and
updates.

The OpenStack installation is provided by the Red Hat OpenStack
Platform, which is tightly dependent on the Linux operating system
and closely related to the software developed by the OpenStack Open
Source project.  It provides a comprehensive way to install the whole
platform while being easily customizable to meet our specific
requirements, and it is backed by operational-quality support.

The ARCA node is also based on Linux but, since it is not directly
related to the OpenStack deployment, it is not based on the same
distribution.  It is simply configured to access the control and
management interfaces offered by OpenStack, and it is therefore
connected to the VLAN that hosts the control and management planes.
On this node we deploy the NFV-MANO components, including the micro-
services that form an ARCA instance.
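Since the control and management VLAN is shared among the four
domains, the ARCA node can reach every OpenStack instantiation
through a single configuration.  The sketch below shows one plausible
arrangement, assuming openstacksdk and a clouds.yaml file with one
(hypothetical) entry per domain.

   # Minimal sketch: one openstacksdk connection per OpenStack
   # domain, all reached over the shared control/management VLAN.
   # The cloud names are hypothetical entries in clouds.yaml.
   import openstack

   DOMAINS = ["arca-domain1", "arca-domain2",
              "arca-domain3", "arca-domain4"]

   # Keep one connection per domain; domain 1 hosts the HQ servants.
   connections = {name: openstack.connect(cloud=name)
                  for name in DOMAINS}

   def servants_in_hq():
       """List the servant VMs currently running in the HQ domain."""
       hq = connections["arca-domain1"]
       return [s for s in hq.compute.servers()
               if s.name.startswith("servant")]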
In summary, we dedicate nine physical computers to the OpenStack
deployment, all of them Dell PowerEdge R610 machines with 2 x Xeon
5670 2.96 GHz (6 cores / 12 threads) CPUs, 48 GiB of RAM, 6 x 146 GiB
HDs at 10 kRPM, and 4 x 1 GE NICs.  Moreover, we dedicate an
additional computer with the same specification to the ARCA Node.  We
dedicate a less powerful computer to implement the physical router
because it is involved neither in the general execution of OpenStack
nor in the specific experiments carried out with it.  Finally, as
detailed above, we dedicate a high-end physical switch, an HP
ProCurve 1810G-24, to build the interconnection networks.

C.4.  Initial Results

Using the platform described above, we execute an initial but long-
lasting experiment based on the target scenario introduced at the
beginning of this section.  The objective of this experiment is
twofold.  First, we aim to demonstrate how ARCA behaves in a real
environment.  Second, we aim to stress the coupling points between
ARCA and OpenStack, which will expose the limitations of the existing
interfaces.

With these objectives in mind, we define a timeline that is followed
by both the clients and the external event detectors.  It forces the
virtualized system to experience different situations, including
incidents of varying severities.  When an incident appears in the
timeline, the detectors notify the ARCA-based VIM and the clients
change their request rates, which depend on the severity of the
incident.  This behavior is widely discussed in ICIN 2018
[ICIN-2018], which remarks on how users behave after a disaster or
similar incident occurs.

The ARCA-based VIM learns of the occurrence of the incident from two
sources.  First, it receives the notification from the event
detectors.  Second, it notices the change in the CPU load of the
servants assigned to the target service.  In this situation, ARCA has
different opportunities to overcome the possible overload (or
underload) of the system.  We explore the anticipation approach
discussed in depth in ICIN 2018 [ICIN-2018].  Its operation is
enclosed in the analyzer and decider, and it is based on an algorithm
that is divided into two sub-algorithms.

The first sub-algorithm reacts to the detection of the incident and
the subsequent correlation of its severity with the amount of
servants required by the system.  This sub-algorithm hosts the
regression of the learner, which is based on the SVM/SVR technique,
and predicts the necessary resources from two features: the severity
of the incident and the time elapsed since it happened.  The
resulting amount of servants is established as the minimum amount
that the VIM can use.

The second sub-algorithm is fed with the CPU load measurements of the
servants assigned to the service, as reported by the OpenStack
platform.  With this information it checks whether the system is
within the operating parameters established by the NFVO.  If not, it
adjusts the resources assigned to the system, using the minimum
amount established by the other sub-algorithm as the basis for the
assignment.  After every correction, this algorithm learns the
behavior by adding new correlation vectors to the SVM/SVR structure.
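The first sub-algorithm can be approximated with an off-the-shelf
regressor.  The sketch below is a minimal illustration, not the
actual ARCA learner: it assumes scikit-learn's SVR, the training
vectors are hypothetical, and it refits the model on every correction
to mimic the incremental addition of correlation vectors, since SVR
has no native incremental update.

   # Minimal sketch of the anticipation sub-algorithm: an SVR that
   # maps (incident severity, seconds elapsed) to the minimum server
   # assignation (MSA).  The training vectors below are hypothetical.
   import math
   from sklearn.svm import SVR

   # Features: [severity (0-4), elapsed time (s)] -> observed MSA.
   X = [[0, 0], [1, 10], [2, 15], [3, 10], [3, 20], [4, 15]]
   y = [1, 2, 4, 6, 5, 9]

   model = SVR(kernel="rbf", C=10.0)
   model.fit(X, y)

   def anticipated_msa(severity, elapsed):
       """Predict the MSA floor for the current incident state."""
       raw = model.predict([[severity, elapsed]])[0]
       # Clamp to [1, hard maximum of 15] as in the scenario.
       return max(1, min(15, math.ceil(raw)))

   def learn_correction(severity, elapsed, observed_msa):
       """Mimic 'adding new correlation vectors': append and refit."""
       X.append([severity, elapsed])
       y.append(observed_msa)
       model.fit(X, y)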
When the experiment is running, the collector component of the ARCA-
based VIM is attached to the telemetry interface of OpenStack, using
the SDK to access the measurement data generated by Ceilometer and
stored by Gnocchi.  In addition, it is attached to the external event
detectors in order to receive their notifications.  The enforcer
component, in turn, is attached to the Compute interface of
OpenStack, also using its SDK, to request the infrastructure to
create, destroy, query, or change the status of a VM that hosts a
servant of the controlled system.  Finally, the enforcer also updates
the lists of servers used by the load balancers to distribute the
clients among the available resources.

During the execution of the experiment, we make the ARCA-based VIM
report the severity of the last incident (if any), the time elapsed
since it occurred, the amount of servants assigned to the controlled
system, the minimum amount of servants to be assigned as determined
by the anticipation algorithm, and the average load of all servants.
In this instance, the severities are spread between 0 (no incident)
and 4 (strongest incident), the elapsed times are less than 35
seconds, and the minimum server assignation (MSA) stays below 10,
although the hard maximum is 15.

With these measurements we illustrate how the correlation of the
three features (dimensions) mentioned above is learned.  When there
is no incident (severity = 0), the MSA is kept at the minimum.  In
parallel, regardless of the severity level, the algorithm learned
that there is no need to increase the MSA during the first 5 to 10
seconds.  This shows the behavior discussed in this document, namely
that there is a delay between the occurrence of an event and the
actual need for an updated amount of resources, and it forms one
fundamental aspect of our research.

By inspecting the results, we know that there is a burst of client
demands whose peak is centered around 15 seconds after the occurrence
of an incident or any other change in the accounted severity.  We
also know that the burst lasts longer for higher severities, and that
it fluctuates a bit for the highest severities.  Finally, we can also
notice that, for the majority of severities, the increased MSA is no
longer required 25 seconds after the severity change was notified.

All that information becomes part of the knowledge of ARCA.  It is
stored both in the internal structures of the SVM/SVR and, once
represented semantically, in the semantic database that manages the
knowledge base of ARCA, and it is used to predict future behavior.
For instance, if an incident of severity 3 occurred 10 seconds ago,
ARCA knows that it will need to set the MSA to 6 servants.  In fact,
this information has been used during the experiment, so we can also
assess the accuracy of the algorithm by comparing the anticipated MSA
value with the required value (or even the best value).  However, the
analysis of such information is left for the future.
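Bringing the pieces together, the corrective loop of the second sub-
algorithm can be outlined as follows.  This is a minimal sketch under
stated assumptions: the polling interval, the load bounds, and the
helper callables (for polling the average CPU load, counting the
servants, and scaling them) are hypothetical stand-ins for the NFVO
parameters and the collector/enforcer operations, and it reuses
anticipated_msa and learn_correction from the earlier sketch.

   # Minimal sketch of the corrective loop: keep the average servant
   # CPU load within the bounds set by the NFVO, never dropping below
   # the anticipated MSA floor.  All bounds and helper callables are
   # hypothetical.
   import time

   LOW, HIGH = 0.30, 0.70   # hypothetical NFVO load bounds
   HARD_MAX = 15            # hard maximum amount of servants

   def control_loop(get_avg_cpu_load, count_servants,
                    scale_servants_to, current_incident):
       while True:
           load = get_avg_cpu_load()      # polled from Gnocchi
           n = count_servants()
           severity, elapsed = current_incident()
           floor = anticipated_msa(severity, elapsed)  # 1st sub-alg.
           if load > HIGH:                # overload: add a servant
               target = n + 1
           elif load < LOW:               # underload: remove one
               target = n - 1
           else:
               target = n
           target = max(floor, min(HARD_MAX, target))
           if target != n:
               scale_servants_to(target)
               learn_correction(severity, elapsed, target)
           time.sleep(10)  # telemetry granularity of the platform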
While preparing and executing the experiment, we found several
limitations intrinsic to the current OpenStack platform.  First,
regardless of the CPU and memory resources assigned to the underlying
controller nodes, the platform is unable to record and deliver
performance measurements at intervals shorter than 10 seconds, so it
is currently not suitable for real-time operations, which is
important for our long-term research objectives.  Moreover, we found
that the time required by the infrastructure to create a server that
hosts a somewhat heavy servant is around 10 seconds, which is too far
from our targets.  Although these limitations may be overcome in the
future, they clearly justify that our anticipation approach is
essential for the proper working of a virtual system and, thus, that
the integration of external information becomes mandatory for future
system management technologies, especially in virtualization
environments.

Finally, we found it difficult to have the required measurements
pushed to external components, so we had to poll for them.  The
alternative would be to instantiate some component of ARCA alongside
the main OpenStack components and services, so that it has first-hand
and prompt access to such features.  This way, ARCA could receive
push notifications with the measurements, as it already does for the
external detectors.  This is a key aspect that affects the placement
of the NFV-VIM, or some subpart of it, in the general architecture.
Therefore, for future iterations of the NFV reference architecture,
an integrated view of the VIM and the NFVI could be required to
reflect the future reality.

Authors' Addresses

   Pedro Martinez-Julia (editor)
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Phone: +81 42 327 7293
   Email: pedro@nict.go.jp

   Shunsuke Homma
   NTT
   Japan

   Email: shunsuke.homma.fp@hco.ntt.co.jp