NMRG                                          P. Martinez-Julia, Ed.
Internet-Draft                                                  NICT
Updates: draft-pedro-nmrg-intelligent-                      S. Homma
         reasoning-00 (if approved)                              NTT
Intended status: Informational                        March 06, 2020
Expires: September 7, 2020

   Intelligent Reasoning on External Events for Network Management
              draft-pedro-nmrg-intelligent-reasoning-01

Abstract

   The adoption of AI in network management solutions is becoming a
   reality.  It is mainly driven by the need to resolve complex
   problems arising from the acceptance of SDN/NFV technologies as
   well as network slicing.  This allows current computer and network
   system infrastructures to constantly grow in complexity, in
   parallel with the demands of users.  However, exploiting the
   possibilities of AI is not an easy task.  There has been a lot of
   effort to make Machine Learning (ML) solutions reliable and
   acceptable but, at the same time, other mechanisms have been
   forgotten.  This is particularly the case of reasoning.  Although
   it can provide enormous benefits to management solutions by, for
   example, inferring new knowledge and applying different kinds of
   rules (e.g. logical) to choose among several actions, it has
   received little attention.  While ML solutions work with data, so
   their only requirement from the network infrastructure is data
   retrieval, reasoning solutions work in collaboration with the
   network they are managing.  This makes the challenges arising from
   intelligent reasoning key for the evolution of network management
   towards the full adoption of AI.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Background
     3.1.  Virtual Computer and Network Systems
     3.2.  SDN and NFV
     3.3.  Management and Control
     3.4.  Slice Gateway (SLG)
   4.  Applying AI to Network Management
     4.1.  Beyond Machine Learning
     4.2.  Briefing Artificial Intelligence
   5.  Extended Management Operation
     5.1.  Intelligent Network Management Process
     5.2.  Closed Loop Management Approach
   6.  Deep Exploitation of AI in Network Management
     6.1.  From Data to Wisdom
     6.2.  External Event Detectors
     6.3.  Network Requirement Anticipation
     6.4.  Intelligent Reasoning
     6.5.  Gaps and Standardization Issues
   7.  Relation to Other IETF/IRTF Initiatives
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Information Model to Support Reasoning on External
                Events
     A.1.  Tree Structure
       A.1.1.  event-payloads
         A.1.1.1.  basic
         A.1.1.2.  seismometer
         A.1.1.3.  bigdata
       A.1.2.  external-events
       A.1.3.  notifications/event
     A.2.  YANG Module
   Appendix B.  The Autonomic Resource Control Architecture (ARCA)
   Appendix C.  ARCA Integration With ETSI-NFV-MANO
     C.1.  Functional Integration
     C.2.  Target Experiment and Scenario
     C.3.  OpenStack Platform
     C.4.  Initial Results
   Authors' Addresses

1.  Introduction

   The current network ecosystem is quickly evolving from an almost
   fixed network to a highly flexible, powerful, and somewhat hybrid
   system.  Network slicing, Software Defined Networking (SDN), and
   Network Function Virtualization (NFV) provide the basis for such
   evolution.  The need to automate the management and control of such
   systems has motivated the move towards autonomic networking (ANIMA)
   and the inclusion of AI solutions alongside the management plane of
   the network.  This is amply justified by the increasing size and
   complexity of the network, which exposes complex problems that must
   be resolved at scales that are beyond human capabilities.
   Therefore, in order to allow current computer and network system
   infrastructures to constantly grow in complexity, in parallel with
   the demands of users, AI solutions must work together with other
   network management solutions.

   However, exploiting the possibilities of AI is not an easy task.
   There has been a lot of effort to make Machine Learning (ML)
   solutions reliable and acceptable but, at the same time, other
   mechanisms have been forgotten.  This is particularly the case of
   reasoning.  Although it can provide enormous benefits to management
   solutions by, for example, inferring new knowledge and applying
   different kinds of rules (e.g. logical) to choose among several
   actions, it has received little attention.  While ML solutions work
   with data, so their only requirement from the network
   infrastructure is data retrieval, reasoning solutions work in
   collaboration with the network they are managing.  This makes the
   challenges arising from intelligent reasoning key for the evolution
   of network management towards the full adoption of AI.

   The present document aims to gather the information necessary to
   get the most benefit from the application of intelligent reasoning
   to network management, including, but not limited to, defining the
   gaps that must be covered for reasoning to be correctly integrated
   into network management solutions.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Background

3.1.  Virtual Computer and Network Systems

   The continuous search for efficiency and cost reduction to achieve
   the optimum exploitation of available resources (e.g. CPU power and
   electricity) has driven current physical infrastructures to move
   towards virtualization infrastructures.  Also, this trend enables
   end systems to be centralized and/or distributed, so that they are
   deployed to best accomplish customer requirements in terms of
   resources and qualities.

   One of the key functional requirements imposed on computer and
   network virtualization is a high degree of flexibility and
   reliability.
   Both qualities are subject to the underlying technologies but,
   while the latter has always been enforced on computer and network
   systems, flexibility is a relatively new requirement, which would
   not have been imposed without the backing of virtualization and
   cloud technologies.

3.2.  SDN and NFV

   SDN and NFV are conceived to bring a high degree of flexibility and
   conceptual centralization to the network.  On the one hand, with
   SDN, the network can be programmed to implement a dynamic behavior
   that changes its topology and overall qualities.  On the other
   hand, with NFV, the functions that are typically provided by
   physical network equipment are now implemented as virtual
   appliances that can be deployed and linked together to provide
   customized network services.  SDN and NFV complement each other to
   actually implement the network aspect of the aforementioned virtual
   computer and network systems.

   Although centralization can lead us to think of the single-point-
   of-failure concept, this is not the case for these technologies.
   Conceptual centralization differs greatly from centralized
   deployment.  It brings all the benefits of having a single point of
   decision while retaining the benefits of distributed systems.  For
   instance, control decisions in SDN can be centralized while the
   mechanisms that enforce such decisions in the network (SDN
   controllers) can be implemented as highly distributed systems.  The
   same approach can be applied to NFV.  Network functions can be
   implemented in a central computing facility, but they can also take
   advantage of several replication and distribution techniques to
   achieve the properties of distributed systems.  Nevertheless, NFV
   also allows the deployment of functions on top of distributed
   systems, so they benefit from both distribution alternatives at the
   same time.

3.3.  Management and Control

   The introduction of virtualization into the computer and network
   system landscape has increased the complexity of both underlying
   and overlying systems.  On the one hand, virtualizing underlying
   systems adds extra functions that must be managed properly to
   ensure the correct operation of the whole system, which encompasses
   not just the underlying elements but also the virtual elements
   running on top of them.  Such functions are used to actually host
   the overlying virtual elements, so there is an indirect management
   operation that involves virtual systems.  Moreover, such
   complexities are inherited by the final systems that get
   virtualized and deployed on top of those virtualization
   infrastructures.

   In parallel, virtual systems are empowered with additional, and
   widely exploited, functionality that must be managed correctly.
   This is the case of the dynamic adaptation of virtual resources to
   the specific needs of their operation environments, or even the
   composition of distributed elements across heterogeneous underlying
   infrastructures, and possibly providers.

   Taking both complex functions into account, either separately or
   jointly, makes it clear that management requirements have greatly
   surpassed the limits of humans, so automation has become essential
   to accomplish most common tasks.

3.4.  Slice Gateway (SLG)

   A slice gateway (SLG) (see [I-D.homma-nfvrg-slice-gateway]) is
   basically a data-plane component with the role of data packet
   processing.
   Moreover, it provides an interface to export its functions for
   interacting with control and management components, so it is quite
   relevant for implementing the requirements described above within
   the network slicing domain.

   Furthermore, an SLG might be required to support handling the
   services provided on network slices in addition to controlling
   them, because an SLG is the edge node of an end-to-end network
   slice (E2E-NS).

   Therefore, the SLG exposes the following requirements:

      Data plane for NSs as infrastructure.

      Control/management plane for NSs as infrastructure.

      Data plane for services on NSs.

      Control/management plane for services on NSs.

   In summary, the SLG provides the functions required for the
   enforcement of AI decisions in multi-domain (and federated) network
   slices, so it will play a key role in general network management.

4.  Applying AI to Network Management

4.1.  Beyond Machine Learning

   ML is not AI.  AI has a broader spectrum of methods, some of which
   have already been exploited in the network for a long time.
   Perception, reasoning, and planning are still not fully exploited
   in the network.

4.2.  Briefing Artificial Intelligence

   Intelligence does not directly imply intelligent.  On the one hand,
   intelligence emphasizes data gathering and management, which can be
   processed by systematic methods or intelligent methods.  On the
   other hand, intelligent emphasizes the reasoning and understanding
   of data to actually "possess" the intelligence.

   The justification for applying AI to network (and) management is
   sometimes overlooked.  First, management decisions are more and
   more complex.  We have moved from asking simple questions ("Is
   there a problem in my system?") to much more complex ones ("Where
   should I migrate this VM to accomplish my goals?").  Moreover,
   operation environments are more and more dynamic.  On the one hand,
   softwarization and programmability elevate flexibility and allow
   networks to be totally adapted to their static and/or dynamic
   requirements.  On the other hand, network virtualization greatly
   enables network automation.

   The new functions and possibilities allow network devices to become
   autonomic.  However, they must make complex decisions by
   themselves, without human intervention, realizing the "dream" of
   zero-touch management (ZTM) of networks, which exploits fully
   programmable elements and advanced automation methods (ETSI ZSM).
   Nevertheless, we have to remember that AI methods are just
   resources, not solutions.  They will not replace human decisions,
   just complement and "automate" them.

5.  Extended Management Operation

5.1.  Intelligent Network Management Process

   In general, the correct and pertinent application of AI to network
   management provides enormous benefits, mainly in terms of making
   complex management operations feasible and improving the
   performance of typically expensive tasks.  By taking advantage of
   these benefits, the amount of data that can be analyzed to make
   decisions on the network can be hugely increased.

   As a result, AI makes it possible to enlarge the management process
   towards the Intelligent Network Management Process (INMP).
   Instead of just being focused on the analysis of performance
   measurements retrieved from the managed network and the subsequent
   decision (proaction or reaction), the extension of management
   operation enabled by INMP encompasses different sub-processes.

   First, INMP has a sub-process for retrieving the performance
   measurements from the managed network.  This is the same sub-
   process found in typical management processes.  Moreover, INMP
   encourages the application of ML techniques at this stage to obtain
   some insight into the situation of the managed network.

   Second, INMP incorporates a reasoning sub-process.  It receives
   both the output of the previous sub-process and additional context
   information, which can be provided by an external event detector,
   as described below.  Then, this sub-process finds out and
   particularizes the rules that govern the situation described above.
   Such rules are semantically constructed and will abstract the
   situation of the network in terms of logical and other semantic
   concepts, together with actions and transformations that can be
   applied to those rules.  All such constructions will be stored in
   the Intelligent Network Management Knowledge Base (INMKB), which
   will follow a pre-determined ontology and will also extend the
   knowledge by applying basic and atomic logic inference statements.

   Third, INMP defines the solving sub-process.  It works as follows.
   Once the abstracted situation of the managed network and the rules
   that apply to it have been obtained, the solving sub-process builds
   a graph with all semantic constructions.  It reflects the managed
   network, since all network elements have their semantic
   counterpart, but it also contains all situations, rules, actions,
   and even the measurements.  The solving sub-process applies
   ontology transformations to find a graph that is acceptable in
   terms of the associated situation and its adherence to
   administrative goals.

   Fourth, INMP incorporates the planning sub-process.  It receives
   the solution graph obtained by the previous sub-process and makes a
   linear plan of actions to execute in order to enforce the required
   changes in the network.  The actions used by this planning sub-
   process are the building blocks of the plan.  Each block will be
   defined with a precondition, an invariant, and a postcondition.  A
   planning algorithm should be used to obtain such a plan of actions
   by linking the building blocks so they can be enforced to finally
   adapt the managed network to reach the desired situation.

   All these processes must be executed in parallel, using strong
   inter-process communication and synchronization constraints.
   Moreover, the requests to the underlying infrastructure for the
   adaptation of the managed network will be sent to the corresponding
   controllers without waiting for the deliberation cycle to finish.
   This way, the time required by the whole cycle is highly reduced.
   This is possible because of the assumptions and anticipations tied
   to INMP and the intelligence it denotes.

5.2.  Closed Loop Management Approach

   Beginning with INMP, a key approach for achieving proper network
   management goals is to follow the closed control loop methodology.
   It ensures that the objectives are not just accomplished at a
   certain moment but kept in future cycles of both the management and
   network life-cycles.
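   The following non-normative sketch illustrates how the INMP sub-
   processes described in Section 5.1 could be wired into such a
   closed loop.  It is written in Python for concreteness; every class
   and function in it is an illustrative stub of this document's
   concepts, not part of any existing implementation, and a real
   deployment would run the sub-processes in parallel under the
   synchronization constraints discussed above.

   # Minimal, self-contained sketch of an INMP closed control loop.
   class ManagedNetwork:
       """Stub network: one scalar load metric, adjustable resources."""
       def __init__(self):
           self.load, self.resources = 0.8, 2
       def measurements(self):
           return {"load": self.load}
       def enforce(self, action):
           if action == "scale-out":
               self.resources += 1
               # Crude model: load dilutes as resources are added.
               self.load *= self.resources / (self.resources + 1)

   def reason(measurements, events):
       # Reasoning sub-process: derive the rule governing the situation.
       overloaded = measurements["load"] > 0.7 or "incident" in events
       return {"overloaded": overloaded}

   def solve_and_plan(rules):
       # Solving and planning sub-processes, collapsed into one step.
       return ["scale-out"] if rules["overloaded"] else []

   net, events = ManagedNetwork(), {"incident"}
   for _ in range(3):                    # the loop closes on itself:
       rules = reason(net.measurements(), events)
       for action in solve_and_plan(rules):
           net.enforce(action)           # enforcement feeds the next cycle
       events = set()                    # handled events expire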
   To obtain the benefits of integrating AI within the closed loop,
   the INMP processes must be re-wired to connect their outputs to
   their inputs, thus obtaining feedback analysis.  Moreover, an
   additional process must be defined to ensure that the objectives
   defined in the last steps of INMP are actually present in the near-
   future situation of the managed network.

   In addition, the data plane elements, such as the SLG described
   above, must provide some capabilities to make them coherent with
   the closed control loop.  Particularly, they must provide symmetric
   enforcement and telemetry interfaces, so that the elements
   composing the managed network can be modified and monitored using
   the same identifiers and having the same assumptions about their
   topology and context.  For instance, the SLG must be able to
   provide the functionality needed for INMP to request it to set up
   and connect the necessary structures for telemetry collection, as
   well as to request slice switching.

6.  Deep Exploitation of AI in Network Management

6.1.  From Data to Wisdom

   As AI methods gain access to a huge amount of (intelligence) data
   from the systems they manage, they become more and more able to
   take strategic decisions, mainly by refining such data into
   knowledge and, further, into wisdom.  This follows the well-known
   DIKW process (Data, Information, Knowledge, Wisdom) that enables
   elements to operate autonomously, subject to the goals established
   by administrators.

   In this way, AI methods can be guided by the events or situations
   found in underlying networks in a constantly evolving model.  We
   can call it the Knowledge (and Intelligence) Driven Network.  In
   this new network architecture, the very structure of the network
   results from reasoning on intelligence data.  The network adapts to
   new situations without requiring human involvement, but
   administrative policies are still enforced on decisions.
   Nevertheless, intelligence data must be managed properly to exploit
   all its potential.  Data with high accuracy and high frequency will
   be processed in real time.  Meanwhile, fast and scalable methods
   for information retrieval and decision enforcement become essential
   to the objectives of the network.

   To achieve such goals, AI algorithms must be adapted to work on
   network problems.  Joint physical and virtual network elements can
   form a multi-agent system focused on achieving such system goals.
   This can be applied to several use-cases.  For instance, it can be
   used for predicting traffic behavior, iterative network
   optimization, and assessment of administrative policies.

6.2.  External Event Detectors

   As mentioned above, the current mechanisms used to achieve
   automated management and control rely only on the continuous
   monitoring of the resources they control or the underlying
   infrastructure that hosts them.  However, there are several other
   sources of information that can be exploited to make the systems
   more robust and efficient.  This is the case of the notifications
   that can be provided by physical or virtual elements or devices
   that are watching for specific events, hence called external event
   detectors.

   More specifically, although the notifications provided by these
   external event detectors are related to occurrences outside the
   boundaries of the controlled system, such occurrences can affect
   the typical operation of controlled systems.  For instance, a heavy
   rainfall or snowfall can be detected and correlated with a huge
   increase in the number of requests experienced by some emergency
   support service.
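   As an illustration of the kind of notification such a detector
   could emit, the following non-normative Python sketch builds one
   external-event message with the fields of the information model
   defined in Appendix A.  The collector URL and the field values are
   assumptions made for the example, and the actual transport towards
   the management plane is out of the scope of this sketch.

   # Illustrative detector-side sketch: build one external-event
   # notification following the model of Appendix A.
   import base64, json, time, urllib.request

   event = {
       "id": "evt-0001",
       "source": "seismometer-27",
       "context": "building-3/basement",   # free-form context string
       "sequence": 1,
       "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
       # The payload is opaque binary; here, a seismometer reading.
       "payload": base64.b64encode(json.dumps(
           {"plid": "seismometer", "location": "38.0N,140.9E",
            "magnitude": 5}).encode()).decode(),
   }

   req = urllib.request.Request(
       "http://collector.example.net/events",   # hypothetical endpoint
       data=json.dumps(event).encode(),
       headers={"Content-Type": "application/json"})
   # urllib.request.urlopen(req)  # disabled: no real collector here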
6.3.  Network Requirement Anticipation

   One of the main goals of the MANO mechanisms is to ensure that the
   virtual computer and network system they manage meets the
   requirements established by its owners and administrators.  This is
   currently achieved by observing and analyzing the performance
   measurements obtained either by directly asking the resources
   forming the managed system or by asking the controllers of the
   underlying infrastructure that hosts such resources.  Thus, under
   changing or eventful situations, the managed system must be adapted
   either to cope with the new requirements, by increasing the amount
   of resources assigned to it, or to make efficient use of available
   infrastructures, by reducing that amount.

   However, the time required by the infrastructure to make the
   adaptations requested by the MANO mechanisms effective is longer
   than the time client requests need to overload the system and make
   it discard further client requests.  This situation is generally
   undesired but particularly dangerous for some systems, such as the
   emergency support system mentioned above.  Therefore, in order to
   avoid the disruption of the service, the change in requirements
   must be anticipated to ensure that any adaptation has finished as
   soon as possible, preferably before the target system gets
   overloaded or underloaded.

   Here we link the application of AI to network management to ARCA
   (Appendix B).  It is integrated with NFV-MANO to enable the latter
   to take advantage of the events notified by the external event
   detectors, by correlating them with the target amount of resources
   required by the managed system and enforcing the necessary
   adaptations beforehand, particularly before the system performance
   metrics have actually changed.

   The following abstract algorithm formalizes the workflow expected
   to be followed by the different implementations of the operation
   proposed here.

   while TRUE do
       event = GetExternalEventInformation()
       if event != NONE then
           anticipated_resource_amount = Anticipator.Get(event)
           if IsPolicyCompliant(anticipated_resource_amount) then
               current_resource_amount = anticipated_resource_amount
               anticipation_time = NOW
           end if
       end if
       anticipated_event = event
       if anticipated_event != NONE and
               (NOW - anticipation_time) > EXPIRATION_TIME then
           current_resource_amount = DEFAULT_RESOURCE_AMOUNT
           anticipated_event = NONE
       end if
       state = GetSystemState()
       if not IsAcceptable(state, current_resource_amount) then
           current_resource_amount = GetResourceAmountForState(state)
           if anticipated_event is not NONE then
               Anticipator.Set(anticipated_event,
                               current_resource_amount)
               anticipated_event = NONE
           end if
       end if
   end while

   This algorithm considers both internal and external events to
   determine the control and management actions necessary to achieve
   the proper anticipation of the resources assigned to the target
   system.  We propose that the different implementations follow the
   same approach so they can anticipate what to expect when they
   interact.  For instance, a consumer, such as an Application Service
   Provider (ASP), can expect some specific behavior from the Virtual
   Network Operator (VNO) from which it is consuming resources.  This
   helps both the ASP and the VNO to properly address resource
   fluctuations.
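   A possible concrete rendering of this abstract workflow is sketched
   below in Python.  It is non-normative and deliberately simplified:
   the Anticipator is a plain event-to-resources map, the policy-
   compliance check is omitted, and GetResourceAmountForState() is
   replaced by a stub that scales resources linearly with the observed
   load.

   # Simplified, illustrative rendering of the anticipation workflow.
   import time

   DEFAULT, EXPIRATION = 2, 60.0      # servants / seconds (assumed)

   anticipator = {"earthquake": 8}    # learned event -> servants map
   current, anticipated, t0 = DEFAULT, None, 0.0

   def step(event, load):             # one pass of the loop body above
       global current, anticipated, t0
       if event is not None:          # anticipate on an external event
           current = anticipator.get(event, DEFAULT)
           anticipated, t0 = event, time.time()
       if anticipated and time.time() - t0 > EXPIRATION:
           current, anticipated = DEFAULT, None   # anticipation expired
       needed = max(1, round(load * 10))  # stub for the reactive path
       if needed > current:           # state not acceptable: react...
           current = needed
           if anticipated:            # ...and learn the correlation
               anticipator[anticipated] = current
               anticipated = None
       return current

   step("earthquake", 0.3)   # -> 8 servants before the load rises
   step(None, 1.1)           # -> 11 servants; the anticipator learns 11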
6.4.  Intelligent Reasoning

   It is trivial for anybody to understand that the behavior of the
   network results from user activity.  For instance, more users means
   more traffic.  However, it is not commonly considered that user
   activity has a direct dependency on events that occur outside the
   boundaries of the networks they use.  For example, if a video
   becomes trendy, the load of the network that hosts the video
   increases, but so does the load of any network with users watching
   the video.  In the same way, if a natural incident occurs (e.g.
   heavy rainfall, earthquake), people try to contact their relatives
   and the load of a telephony network increases.  From this we can
   easily find out that there is a clear causality relation between
   events occurring in the real and digital world and the behavior of
   the network (a.k.a. the Internet).

   Network management outcomes, in terms of system stability,
   performance, reliability, etc., would greatly improve by exploiting
   such a causality relation.  An easy and straightforward way to do
   so is to apply AI reasoning methods.  These methods can be used to
   "guess" the effect of a given cause.  Moreover, reasoning can be
   used to choose the specific events that can impact the system,
   hence being the cause of some effect.

   Meanwhile, reasoning on network behavior from performance
   measurements and external events poses some challenges.  First,
   external event information must cross the administrative domain of
   the network to which it is relevant.  This means that there must be
   interfaces and security policies that regulate how information is
   exchanged between the external event detector, which can be some
   sensor deployed in some "smart" place (e.g. smart city, smart
   building), and the management solution, which resides inside the
   administrative domain of the managed network.  This function must
   be well regulated, and the protocols used to achieve it must be
   widely accepted and tested, in order to exploit the overall
   potential of external events.

   Second, enough meta-data must be associated with performance
   measurements to clearly identify all aspects of the effects, so
   that they can be traced back to their causes (events).  Such meta-
   data must follow an ontology (information model) that is common and
   widely accepted or, at least, easy to transform among the different
   formats and models used by different vendors and software.

   Third, the management ontology must be extended with all concepts
   from the boundaries of the managed network, its external
   environment (surroundings), and any entity that, albeit being far
   away, can impact the function of the managed network.

6.5.  Gaps and Standardization Issues

   Several gaps and standardization issues arise from applying AI and
   reasoning to network management solutions:

      Methods from different providers/vendors must be able to coexist
      and work together, either directly or by means of a translator.
      They must, however, use the same concepts, albeit using
      different naming, so they actually share a common ontology.
      Information retrieval must be assessed for quality so that the
      outputs from AI reasoning, and thus management solutions, can be
      reliable.

      Ontological concepts must be consistent so that the types and
      qualities of the information that is retrieved from a system or
      object are as expected.

      The protocols used to communicate (or disseminate, or publish)
      the information must respond to the constraints of their target
      usage.

7.  Relation to Other IETF/IRTF Initiatives

   TBD

8.  IANA Considerations

   This memo includes no request to IANA.

9.  Security Considerations

   As with other AI mechanisms, the major security concern for the
   adoption of intelligent reasoning on external events to manage
   network slices and SDN/NFV systems is that the boundaries of the
   control and management planes are crossed to introduce information
   from outside.  Such communications must be highly and heavily
   secured since some malfunction or explicit attacks might compromise
   the integrity and execution of the controlled system.  However, it
   is up to implementers to deploy the necessary countermeasures to
   avoid such situations.  From the design point of view, since all
   operations are performed within the control and/or management
   planes, the security level of reasoning solutions is inherited, and
   thus determined, by the security measures established by the
   systems forming such planes.

10.  Acknowledgements

   TBD

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

11.2.  Informative References

   [ETSI-NFV-IFA-004]
              ETSI NFV GS NFV-IFA 004, "Network Functions
              Virtualisation (NFV); Acceleration Technologies;
              Management Aspects Specification", 2016.

   [ETSI-NFV-IFA-005]
              ETSI NFV GS NFV-IFA 005, "Network Functions
              Virtualisation (NFV); Management and Orchestration;
              Or-Vi reference point - Interface and Information Model
              Specification", 2016.

   [ETSI-NFV-IFA-006]
              ETSI NFV GS NFV-IFA 006, "Network Functions
              Virtualisation (NFV); Management and Orchestration;
              Vi-Vnfm reference point - Interface and Information
              Model Specification", 2016.

   [ETSI-NFV-IFA-019]
              ETSI NFV GS NFV-IFA 019, "Network Functions
              Virtualisation (NFV); Acceleration Technologies;
              Management Aspects Specification; Release 3", 2017.

   [ETSI-NFV-MANO]
              ETSI NFV GS NFV-MAN 001, "Network Functions
              Virtualisation (NFV); Management and Orchestration",
              2014.

   [I-D.geng-coms-architecture]
              Geng, L., Qiang, L., Lucena, J., Ameigeiras, P., Lopez,
              D., and L. Contreras, "COMS Architecture", draft-geng-
              coms-architecture-02 (work in progress), March 2018.

   [I-D.homma-nfvrg-slice-gateway]
              Homma, S., Foy, X., and A. Galis, "Gateway Function for
              Network Slicing", draft-homma-nfvrg-slice-gateway-00
              (work in progress), July 2018.

   [I-D.qiang-coms-netslicing-information-model]
              Qiang, L., Galis, A., Geng, L., Makhijani, K.,
              Martinez-Julia, P., Flinck, H., and X. Foy, "Technology
              Independent Information Model for Network Slicing",
              draft-qiang-coms-netslicing-information-model-02 (work
              in progress), January 2018.

   [I-D.song-ntf]
              Song, H., Zhou, T., Li, Z., Fioccola, G., Li, Z.,
              Martinez-Julia, P., Ciavaglia, L., and A.
Wang, "Toward a 660 Network Telemetry Framework", draft-song-ntf-02 (work in 661 progress), July 2018. 663 [ICIN-2017] 664 P. Martinez-Julia, V. P. Kafle, and H. Harai, "Achieving 665 the autonomic adaptation of resources in virtualized 666 network environments, in Proceedings of the 20th ICIN 667 Conference (Innovations in Clouds, Internet and Networks, 668 ICIN 2017). Washington, DC, USA: IEEE, 2018, pp. 1--8", 669 2017. 671 [ICIN-2018] 672 P. Martinez-Julia, V. P. Kafle, and H. Harai, 673 "Anticipating minimum resources needed to avoid service 674 disruption of emergency support systems, in Proceedings of 675 the 21th ICIN Conference (Innovations in Clouds, Internet 676 and Networks, ICIN 2018). Washington, DC, USA: IEEE, 2018, 677 pp. 1--8", 2018. 679 [OPENSTACK] 680 The OpenStack Project, "http://www.openstack.org/", 2018. 682 Appendix A. Information Model to Support Reasoning on External Events 684 In this section we introduce the basic model needed to support 685 reasoning on external events. It basically includes the concepts and 686 structures used to describe external events and notify (communicate) 687 them to the interested sink, the network controller/manager, through 688 the control and management plane, depending on the specific 689 instantiation of the system. 691 A.1. Tree Structure 692 module: ietf-nmrg-nict-ai-reasoning 693 +--rw events 694 +--rw event-payloads 695 +--rw external-events 697 notifications: 698 +---n event 700 The main models included in the tree structure of the module are the 701 events and notifications. On the one hand, events are structured in 702 payloads and the content of events itself (external-events). On the 703 other hand, there is only one notification, which is the event 704 itself. 706 A.1.1. event-payloads 708 +--rw event-payloads 709 +--rw event-payloads-basic 710 +--rw event-payloads-seismometer 711 +--rw event-payloads-bigdata 713 The event payloads are, for the time being, composed of three types. 714 First, we have defined the basic payload, which is intended to carry 715 any arbitrary data. Second, we have defined the seismometer payload 716 to carry information about seisms. Third, we have defined the 717 bigdata payload that carries notifications coming from BigData 718 sources. 720 A.1.1.1. basic 722 +--rw event-payloads-basic* [plid] 723 +--rw plid string 724 +--rw data? union 726 The basic payload is able to hold any data type, so it has a union of 727 several types. It is intended to be used by any source of events 728 that is (still) not covered by other model. In general, any source 729 of telemetry information (e.g. OpenStack [OPENSTACK] controllers) 730 can use this model as such sources can encode on it their 731 information, which typically is very simple and plain. Therefore, 732 the current model is tightly interrelated to a framework to retrieve 733 network telemetry (see [I-D.song-ntf]). 735 A.1.1.2. seismometer 736 +--rw event-payloads-seismometer* [plid] 737 +--rw plid string 738 +--rw location? string 739 +--rw magnitude? uint8 741 The seismometer model includes the main information related to a 742 seism, such as the location of the incident and its magnitude. 743 Additional fields can be defined in the future by extending this 744 model. 746 A.1.1.3. bigdata 748 +--rw event-payloads-bigdata* [plid] 749 +--rw plid string 750 +--rw description? string 751 +--rw severity? uint8 753 The bigdata model includes a description of an event (or incident) 754 and its estimated general severity, unrelated to the system. 
   The description is an arbitrary string of characters that would
   normally carry information that describes the event using some
   higher-level format, such as Turtle or N3 for carrying RDF
   knowledge items.

A.1.2.  external-events

   +--rw external-events* [id]
      +--rw id           string
      +--rw source?      string
      +--rw context?     string
      +--rw sequence?    int64
      +--rw timestamp?   yang:date-and-time
      +--rw payload?     binary

   The model defined to encode external events, which encapsulates the
   payloads introduced above, is completed with an identifier of the
   message, a string describing the source of the event, a sequence
   number, and a timestamp.  Additionally, it includes a string
   describing the context of the event.  It is intended to communicate
   the required information about the system that detected the event,
   its location, etc.  Like the description of the bigdata payload,
   this field can be formatted with a high-level format, such as RDF.

A.1.3.  notifications/event

   notifications:
     +---n event
        +--ro id?          string
        +--ro source?      string
        +--ro context?     string
        +--ro sequence?    int64
        +--ro timestamp?   yang:date-and-time
        +--ro payload?     binary

   The event notification inherits all the fields from the model of
   external events defined above.  It is intended to allow software
   and hardware elements to send, receive, and interpret not just the
   events that have been detected and notified by, for instance, a
   sensor, but also the notifications issued by the underlying
   infrastructure controllers, such as the OpenStack Controller.

A.2.  YANG Module

   module ietf-nmrg-nict-ai-reasoning {
     namespace "urn:ietf:params:xml:ns:yang:ietf-nmrg-nict-ainm";
     prefix rant;

     import ietf-yang-types { prefix yang; }

     grouping external-event-information {
       leaf id { type string; }
       leaf source { type string; }
       leaf context { type string; }
       leaf sequence { type int64; }
       leaf timestamp { type yang:date-and-time; }
       leaf payload { type binary; }
     }

     grouping event-payload-basic {
       leaf plid { type string; }
       leaf data { type union { type string; type binary; } }
     }

     grouping event-payload-seismometer {
       leaf plid { type string; }
       leaf location { type string; }
       leaf magnitude { type uint8; }
     }

     grouping event-payload-bigdata {
       leaf plid { type string; }
       leaf description { type string; }
       leaf severity { type uint8; }
     }

     notification event {
       uses external-event-information;
     }

     container events {
       container event-payloads {
         list event-payloads-basic {
           key "plid";
           uses event-payload-basic;
         }
         list event-payloads-seismometer {
           key "plid";
           uses event-payload-seismometer;
         }
         list event-payloads-bigdata {
           key "plid";
           uses event-payload-bigdata;
         }
       }
       list external-events {
         key "id";
         uses external-event-information;
       }
     }
   }
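   For illustration, an external event carrying the seismometer
   payload could be notified with the following NETCONF-style XML
   instance.  It is non-normative: the field values are invented, the
   payload is shown truncated, and only the namespace follows the
   module above.

   <event xmlns="urn:ietf:params:xml:ns:yang:ietf-nmrg-nict-ainm">
     <id>evt-0001</id>
     <source>seismometer-27</source>
     <context>building-3/basement</context>
     <sequence>1</sequence>
     <timestamp>2020-03-06T09:30:00Z</timestamp>
     <payload>eyJwbGlkIjogInNlaXNtb21ldGVyIiwgLi4u</payload>
   </event>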
Appendix B.  The Autonomic Resource Control Architecture (ARCA)

   As deeply discussed in ICIN 2018 [ICIN-2018], ARCA leverages the
   elastic adaptation of the resources assigned to virtual computer
   and network systems by calculating or estimating their requirements
   from the analysis of load measurements and the detection of
   external events.  These events can be notified by physical elements
   (things, sensors) that detect changes in the environment, as well
   as by software elements that analyze digital information, such as
   connectors to sources or analyzers of Big Data.  For instance, ARCA
   is able to consider the detection of an earthquake or a heavy
   rainfall to overcome the damage it can cause to the controlled
   system.

   The policies that ARCA must enforce will be specified by
   administrators during the configuration of the control/management
   engine.  Then, ARCA continues running autonomously, with no more
   human involvement unless some parameter must be changed.  ARCA will
   adopt the required control and management operations to adapt the
   controlled system to the new situation or requirements.  The main
   goal of ARCA is thus to reduce the time required for resource
   adaptation from hours/minutes to seconds/milliseconds.  With the
   aforementioned statements, system administrators are able to
   specify the general operational boundaries in terms of lower and
   upper system load thresholds, as well as the minimum and maximum
   amount of resources that can be allocated to the controlled system
   to overcome any eventual situation, including the natural crossing
   of such thresholds.

   ARCA's functional goal is to run autonomously, while its
   performance goal is to keep the resources assigned to the
   controlled system as close as possible to the optimum (e.g. within
   5 % of the optimum) while avoiding service disruption as much as
   possible, keeping the client request discard rate as low as
   possible (e.g. below 1 %).  To achieve both goals, ARCA relies on
   the Autonomic Computing (AC) paradigm, in the form of
   interconnected micro-services.  Therefore, ARCA includes the four
   main elements and activities defined by AC, incarnated as:

   Collector  Is responsible for gathering and formatting the
              heterogeneous observations that will be used in the
              control cycle.

   Analyzer   Correlates the observations with each other in order to
              find the situation of the controlled system, especially
              the current load of the resources allocated to the
              system and the occurrence of an incident that can affect
              the normal operation of the system, such as an
              earthquake that increases the traffic in an emergency-
              support system, which is the main target scenario
              studied in this document.

   Decider    Determines the necessary actions to adjust the resources
              to the load of the controlled system.

   Enforcer   Requests the underlying and overlying infrastructure,
              such as OpenStack, to make the necessary changes to
              reflect the effects of the decided actions in the
              system.

   Being a micro-service architecture means that the different
   components are executed in parallel.  This allows such components
   to operate in two ways.  First, their operation can be dispatched
   by receiving a message from the previous service or an external
   service.  Second, the services can be self-dispatched, so they can
   activate some action or send some message without being previously
   stimulated by any message.  The overall control process loops
   indefinitely and is closed by checking that the expected effects of
   an action are actually taking place.  The coherence among the
   distributed services involved in the ARCA control process is
   ensured by enforcing a common semantic representation and ontology
   on the messages they exchange.
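   The following non-normative Python sketch shows the four ARCA
   activities as queue-connected micro-services.  All handlers are toy
   stand-ins for the real components: a real deployment would run each
   stage as a separate service exchanging semantically annotated
   messages, as described above.

   # Sketch: the four ARCA activities as queue-connected stages.
   import queue, threading

   def stage(handler, inbox, outbox):
       while True:
           msg = inbox.get()
           if msg is None:            # shutdown marker propagates
               if outbox:
                   outbox.put(None)
               return
           out = handler(msg)
           if outbox:
               outbox.put(out)

   collector = lambda raw: {"load": raw}                 # format input
   analyzer = lambda obs: {"overloaded": obs["load"] > 0.7}
   decider = lambda sit: "add-servant" if sit["overloaded"] else "noop"
   enforcer = lambda act: print("enforcing:", act)       # e.g. NFVI call

   q = [queue.Queue() for _ in range(4)]
   stages = [(collector, q[0], q[1]), (analyzer, q[1], q[2]),
             (decider, q[2], q[3]), (enforcer, q[3], None)]
   threads = [threading.Thread(target=stage, args=s) for s in stages]
   for t in threads:
       t.start()
   for sample in (0.5, 0.9):          # two load observations
       q[0].put(sample)
   q[0].put(None)                     # shut the pipeline down
   for t in threads:
       t.join()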
   ARCA semantics are built with the Resource Description Framework
   (RDF) and the Web Ontology Language (OWL), which are well-known and
   widely used standards for the semantic representation and
   management of knowledge.  They provide the ability to represent new
   concepts without requiring changes to the software, just plugging
   extensions into the ontology.  ARCA stores all its knowledge in the
   Knowledge Base (KB), which is queried and kept up-to-date by the
   analyzer and decider micro-services.  It is implemented with Apache
   Jena Fuseki, which is a high-performance RDF data store that
   supports SPARQL through an HTTP/REST interface.  Being de-facto
   standards, both technologies enable ARCA to be easily integrated
   with virtualization platforms like OpenStack.
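   Since Fuseki exposes SPARQL over plain HTTP, querying the KB only
   requires standard tooling.  The following non-normative Python
   sketch illustrates such a query; the dataset name ("arca") and the
   ontology terms in the query are assumptions made for the example.

   # Illustrative query against the ARCA knowledge base in Fuseki.
   import json, urllib.parse, urllib.request

   query = """
   PREFIX arca: <http://example.org/arca#>
   SELECT ?servant ?load WHERE {
     ?servant a arca:Servant ;
              arca:cpuLoad ?load .
     FILTER (?load > 0.7)
   }"""

   url = ("http://localhost:3030/arca/query?" +
          urllib.parse.urlencode({"query": query}))
   req = urllib.request.Request(
       url, headers={"Accept": "application/sparql-results+json"})
   # with urllib.request.urlopen(req) as resp:       # needs a server
   #     for row in json.load(resp)["results"]["bindings"]:
   #         print(row["servant"]["value"], row["load"]["value"])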
Appendix C.  ARCA Integration With ETSI-NFV-MANO

   In this section we describe how to fit ARCA into a general SDN/NFV
   underlying infrastructure and introduce a showcase experiment that
   demonstrates its operation on an OpenStack-based experimentation
   platform.  We first describe the integration of ARCA with the NFV-
   MANO reference architecture.  We contextualize the significance of
   this integration by describing an emergency support scenario that
   clearly benefits from it.  Then we proceed to detail the elements
   forming the OpenStack platform and finally we discuss some initial
   results obtained from them.

C.1.  Functional Integration

   The most important functional blocks of the NFV reference
   architecture promoted by ETSI (see ETSI-NFV-MANO [ETSI-NFV-MANO])
   are the system support functions for operations and business (OSS/
   BSS), the element management (EM) and, obviously, the Virtual
   Network Functions (VNFs).  But these functions cannot exist without
   being instantiated on a specific infrastructure, the NFV
   infrastructure (NFVI), and all of them must be coordinated,
   orchestrated, and managed by the general NFV-MANO functions.

   Both the NFVI and the NFV-MANO elements are subdivided into several
   sub-components.  The NFVI has the underlying physical computing,
   storage, and network resources, which are sliced (see
   [I-D.qiang-coms-netslicing-information-model] and
   [I-D.geng-coms-architecture]) and virtualized to form the virtual
   computing, storage, and network resources that will host the VNFs.
   In addition, the NFV-MANO is subdivided into the NFV Orchestrator
   (NFVO), the VNF Manager (VNFM), and the Virtual Infrastructure
   Manager (VIM).  As their names indicate, all high-level elements
   and sub-components have their own very specific objective in the
   NFV architecture.

   During the design of ARCA we enforced both operational and
   interfacing aspects of its main objectives.  From the operational
   point of view, ARCA processes observations to manage virtual
   resources, so it plays the role of the VIM mentioned above.
   Therefore, ARCA has been designed with the appropriate interfaces
   to fit in the place of the VIM.  This way, ARCA provides the NFV
   reference architecture with the ability to react to external events
   to adapt virtual computer and network systems, even anticipating
   such adaptations, as performed by ARCA itself.  However, some
   interfaces must be extended to fully enable ARCA to perform its
   work within the NFV architecture.

   Once ARCA is placed in the position of the VIM, it enhances the
   general NFV architecture with its autonomic management
   capabilities.  In particular, it relieves the VNFM and NFVO of some
   responsibilities, so they can focus on their own business while the
   virtual resources behave as they expect (and request).  Moreover,
   ARCA improves the scalability and reliability of the managed system
   in case of disconnection from the orchestration layer due to some
   failure, network split, etc.  This is also achieved by the
   autonomic capabilities, which, as described above, are guided by
   the rules and policies specified by the administrators and, here,
   communicated to ARCA through the NFVO.  However, ARCA will not be
   limited to such operation; more generally, it will accomplish the
   requirements established by the Virtual Network Operators (VNOs),
   which are the owners of the slice of virtual resources that is
   managed by a particular instance of NFV-MANO, and therefore by
   ARCA.

   In addition to the operational functions, ARCA incorporates the
   necessary mechanisms to engage the interfaces that enable it to
   interact with other elements of the NFV-MANO reference
   architecture.  More specifically, ARCA is bound to the Or-Vi (see
   ETSI-NFV-IFA-005 [ETSI-NFV-IFA-005]) and the Nf-Vi (see ETSI-NFV-
   IFA-004 [ETSI-NFV-IFA-004] and ETSI-NFV-IFA-019
   [ETSI-NFV-IFA-019]).  The former is the point of attachment between
   the NFVO and the VIM, while the latter is the point of attachment
   between the NFVI and the VIM.  In our current design we decided to
   avoid supporting the point of attachment between the VNFM and the
   VIM, called Vi-Vnfm (see ETSI-NFV-IFA-006 [ETSI-NFV-IFA-006]).  We
   leave it for future evolutions of the proposed integration, which
   will be enabled by a possible solution that provides the functions
   of the VNFM required by ARCA.

   Through the Or-Vi, ARCA receives the instructions it will enforce
   on the virtual computer and network system it is controlling.  As
   mentioned above, these are specified in the form of rules and
   policies, which are in turn formatted as several statements and
   embedded into the Or-Vi messages.  In general, these will be high-
   level objectives, so ARCA will use its reasoning capabilities to
   translate them into more specific, low-level objectives.  For
   instance, the Or-Vi can specify some high-level statement to avoid
   CPU overloading and ARCA will use its innate and acquired knowledge
   to translate it into specific statements that indicate which
   parameters it has to measure (the CPU load of the assigned servers)
   and their desired boundaries, in the form of a high threshold and a
   low threshold, as sketched below.  Moreover, the Or-Vi will be used
   by the NFVO to specify which actions can be used by ARCA to
   overcome the violation of the mentioned policies.

   All information flowing through the Or-Vi interface is encoded and
   formatted by following a simple but highly extensible ontology and
   exploiting the aforementioned semantic formats.  This ensures that
   the interconnected system is able to evolve, including the
   replacement of components, the updating (addition or removal) of
   the supported concepts to understand new scenarios, and the
   connection of external tools to further enhance the management
   process.  The only requirement to ensure this feature is that all
   elements support the mentioned ontology and semantic formats.
   Although it is not a finished task, the development of semantic
   technologies allows the easy adaptation and translation of existing
   information formats, so it is expected that more and more software
   pieces will become easily integrable with the ETSI-NFV-MANO
   [ETSI-NFV-MANO] architecture.
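   The translation of a high-level Or-Vi objective into measurable
   parameters could look like the following non-normative Python
   sketch.  The statement syntax, the thresholds, and the translation
   rule are all invented for the example; in ARCA this step is driven
   by the ontology and the reasoner rather than by hard-coded rules.

   # Hypothetical illustration of the Or-Vi policy translation.
   HIGH_LEVEL = "arca:EmergencyService arca:mustAvoid arca:CpuOverload"

   def translate(statement, knowledge):
       # Assumed reasoning step: map an abstract goal onto concrete
       # measurement targets, bounds, and permitted actions.
       if "CpuOverload" in statement:
           return {"metric": "cpu_util",
                   "scope": knowledge["servers"],
                   "low_threshold": 0.3,    # illustrative bounds
                   "high_threshold": 0.7,
                   "actions": ["add-servant", "remove-servant"]}
       raise ValueError("no rule for: " + statement)

   print(translate(HIGH_LEVEL, {"servers": ["servant-1", "servant-2"]}))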
   In contrast to the Or-Vi interface, the Nf-Vi interface exposes
   more precise and low-level operations.  Although this makes it
   easier to integrate with ARCA, it also ties it to specific
   implementations.  In other words, building a proxy that enforces
   the aforementioned ontology on different interface instances to
   homogenize them adds undesirable complexity.  Therefore, new
   components have been specifically developed for ARCA to be able to
   interact with different NFVIs.  Nevertheless, this specialization
   is limited to the collector and enforcer.  Moreover, it allows ARCA
   to have optimized low-level operations, greatly improving the
   overall performance.  This is the case of the specific
   implementations of the collector and enforcer used with Mininet and
   Docker, which were used as underlying infrastructures in the
   previous experiments described in ICIN 2017 [ICIN-2017].  Moreover,
   as discussed in the following section, this is also the case of the
   implementations of the collector and enforcer tied to the OpenStack
   telemetry and compute interfaces, respectively.  Hence, it is
   important to ensure that telemetry is properly addressed, so we
   insist on the need to adopt a common framework at such an endpoint
   (see [I-D.song-ntf]).

   Although OpenStack still lacks some functionality regarding the
   construction of specific virtual networks, we use it as the NFVI
   functional block in the integrated approach.  Therefore, OpenStack
   is the provider of the underlying SDN/NFV infrastructure and we
   exploited its APIs and SDK to achieve the integration.  More
   specifically, in our showcase we use the APIs provided by the
   Ceilometer, Gnocchi, and Compute services as well as the SDK
   provided for Python, as sketched below.  All of them are gathered
   within the Nf-Vi interface.  Moreover, we have extended the Or-Vi
   interface to connect external elements, such as the physical or
   environmental event detectors and Big Data connectors, which is
   becoming a mandatory requirement of the current virtualization
   ecosystem and constitutes our main extension to the NFV
   architecture.
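   For illustration, the following non-normative Python sketch shows
   the shape of the two Nf-Vi endpoints used by ARCA, based on the
   public openstacksdk interface.  The cloud name, image, flavor, and
   network identifiers are placeholders, and error handling is
   omitted.

   # Sketch of the collector/enforcer endpoints towards OpenStack.
   import openstack

   conn = openstack.connect(cloud="arca-domain-1")  # reads clouds.yaml

   def enforce_add_servant(index):
       # Enforcer side: attach one more servant VM to the service.
       return conn.compute.create_server(
           name="servant-%d" % index,
           image_id="IMAGE_UUID", flavor_id="FLAVOR_UUID",
           networks=[{"uuid": "NETWORK_UUID"}])

   def collect_cpu_measures(metric_id):
       # Collector side: CPU measures stored by Ceilometer/Gnocchi,
       # fetched through Gnocchi's REST API via the SDK session.
       resp = conn.session.get(
           "/v1/metric/%s/measures" % metric_id,
           endpoint_filter={"service_type": "metric"})
       return resp.json()  # [[timestamp, granularity, value], ...]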
C.2.  Target Experiment and Scenario

   From the beginning of our work on the design of ARCA we have been
   targeting real-world scenarios, so we get better-suited
   requirements.  In particular, we work with a scenario that
   represents an emergency support service that is hosted on a virtual
   computer and network system, which is in turn hosted on the
   distributed virtualization infrastructure of a medium-sized
   organization.  The objective is to clearly represent an application
   that requires high dynamicity and a high degree of reliability.
   The emergency support service accomplishes this by being barely
   used when there is no incident but heavily loaded when there is
   one.

   Both the underlying infrastructure and the virtual network share
   the same topology.  They have four independent but interconnected
   network domains that form part of the same administrative domain
   (organization).  The first domain hosts the systems of the
   headquarters (HQ) of the owner organization, so the VNFs it hosts
   (servants) implement the emergency support service.  We call them
   "servants" because they are Virtual Machine (VM) instances that
   work together to provide a single service by backing the Load
   Balancer (LB) instances deployed in the separate domains.  The
   amount of resources (servants) assigned to the service will be
   adjusted by ARCA, attaching or detaching servants to meet the load
   boundaries specified by administrators.

   The other domains represent different buildings of the organization
   and will host the clients that access the service when an incident
   occurs.  They also host the necessary LB instances, which are also
   VNFs controlled by ARCA to regulate the access of clients to
   servants.  All domains will have physical detectors to provide
   external information that can (and will) be correlated with the
   load of the controlled virtual computer and network system and will
   thus affect the number of servants assigned to it.  Although the
   underlying infrastructure, the servants, and the ARCA instance are
   the same as those used in the real world, both clients and
   detectors will be emulated.  Anyway, this does not reduce the
   transferability of the results obtained from our experiments, as it
   allows us to expand the number of clients beyond the limits of most
   physical infrastructures.

   Each underlying OpenStack domain will be able to host a maximum of
   100 clients, as they will be deployed on a low-profile virtual
   machine (flavor in OpenStack).  In general, clients will perform
   requests at a rate of one request every ten seconds, so there will
   be a maximum of 30 requests per second.  However, under the
   simulated incident, the clients will raise their load to reach a
   common maximum of 1200 requests per second.  This mimics the shape
   and size of a real medium-size organization of about 300 users that
   perform a maximum of four requests per second when they need some
   support.

   The topology of the underlying network is simplified by connecting
   the four domains to the same high-performance switch.  However, the
   topology of the virtual network is built by using direct links
   between the HQ domain and the other three domains.  These are
   complemented by links between domains 2 and 3, and between domains
   3 and 4.  This way, the three domains have three paths to reach the
   HQ domain: a direct path with just one hop, and two indirect paths
   with two and three hops, respectively.

   During the execution of the experiment, the detectors notify the
   incident to the controller as soon as it happens.  However,
   although the clients are stimulated at the same time, there is some
   delay between the occurrence of the incident and the moment the
   network service receives the increase in load.  One of the main
   targets of our experiment is to study such delay and take advantage
   of it to anticipate the number of servants required by the system.
   We discuss it below.

   In summary, this scenario highlights the main benefits of ARCA
   playing the role of the VIM and interacting with the underlying
   OpenStack platform.  This means advancing towards an efficient use
   of resources and thus reducing the CAPEX of the system.
C.3.  OpenStack Platform

The implementation of the scenario described above reflects the
requirements of any edge/branch networking infrastructure, which is
composed of several distributed micro-data-centers deployed in the
wiring centers of buildings and/or storeys.  We chose OpenStack to
meet such requirements because it is widely used in production
infrastructures, so the resulting infrastructure has the necessary
robustness to accomplish our objectives while reflecting the typical
underlying platform found in any SDN/NFV environment.

We have deployed four separate network domains, each with its own
OpenStack instantiation.  All domains are fully capable of running
regular OpenStack workloads, i.e. executing VMs and networks, but, as
mentioned above, we designate domain 1 as the headquarters of the
organization.  The different underlying networks required by this
(quite complex) deployment are provided by several VLANs within a
high-end L2 switch.  This switch represents the distributed network
of the organization.  Four separate VLANs are used to isolate the
traffic within each domain by connecting an interface of OpenStack's
controller and compute nodes.  These VLANs therefore form the
distributed data plane.  Another VLAN carries the control plane as
well as the management plane, which are used by the NFV-MANO, and
thus by ARCA, which is instantiated in the physical machine called
the ARCA Node to exchange the control and management operations
performed by the collector and enforcer defined in ARCA.  This VLAN
is shared among all OpenStack domains to implement the global control
of the virtualization environment of the organization.  Finally, one
more VLAN is used by the infrastructure to interconnect the data
planes of the separate domains and to allow all elements of the
infrastructure to access the Internet for software installation and
updates.

The OpenStack installation is provided by the Red Hat OpenStack
Platform, which is tightly dependent on the Linux operating system
and closely related to the software developed by the OpenStack Open
Source project.  It provides a comprehensive way to install the whole
platform while being easily customizable to meet our specific
requirements, and it is backed by operational-quality support.

The ARCA node is also based on Linux but, since it is not directly
related to the OpenStack deployment, it is not based on the same
distribution.  It is simply configured to access the control and
management interfaces offered by OpenStack, and it is therefore
connected to the VLAN that hosts the control and management planes.
On this node we deploy the NFV-MANO components, including the micro-
services that form an ARCA instance.
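Since the control and management VLAN is shared among the four
domains, the ARCA node can reach every OpenStack instantiation
through a single configuration.  The sketch below shows one plausible
arrangement, assuming openstacksdk and a clouds.yaml file with one
(hypothetical) entry per domain.

   # Minimal sketch: one openstacksdk connection per OpenStack
   # domain, all reached over the shared control/management VLAN.
   # The cloud names are hypothetical entries in clouds.yaml.
   import openstack

   DOMAINS = ["arca-domain1", "arca-domain2",
              "arca-domain3", "arca-domain4"]

   # Keep one connection per domain; domain 1 hosts the HQ servants.
   connections = {name: openstack.connect(cloud=name)
                  for name in DOMAINS}

   def servants_in_hq():
       """List the servant VMs currently running in the HQ domain."""
       hq = connections["arca-domain1"]
       return [s for s in hq.compute.servers()
               if s.name.startswith("servant")]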
In summary, we dedicate nine physical computers to the OpenStack
deployment, all of them Dell PowerEdge R610 machines with 2 x Xeon
5670 2.96 GHz (6 cores / 12 threads) CPUs, 48 GiB of RAM, 6 x 146 GiB
HDs at 10 kRPM, and 4 x 1 GE NICs.  Moreover, we dedicate an
additional computer with the same specification to the ARCA Node.  We
dedicate a less powerful computer to implement the physical router
because it is involved neither in the general execution of OpenStack
nor in the specific experiments carried out with it.  Finally, as
detailed above, we dedicate a high-end physical switch, an HP
ProCurve 1810G-24, to build the interconnection networks.

C.4.  Initial Results

Using the platform described above, we execute an initial but long-
lasting experiment based on the target scenario introduced at the
beginning of this section.  The objective of this experiment is
twofold.  First, we aim to demonstrate how ARCA behaves in a real
environment.  Second, we aim to stress the coupling points between
ARCA and OpenStack, which will expose the limitations of the existing
interfaces.

With these objectives in mind, we define a timeline that is followed
by both the clients and the external event detectors.  It forces the
virtualized system to experience different situations, including
incidents of varying severities.  When an incident appears in the
timeline, the detectors notify the ARCA-based VIM and the clients
change their request rates, which depend on the severity of the
incident.  This behavior is widely discussed in ICIN 2018
[ICIN-2018], which remarks on how users behave after a disaster or
similar incident occurs.

The ARCA-based VIM learns of the occurrence of the incident from two
sources.  First, it receives the notification from the event
detectors.  Second, it notices the change in the CPU load of the
servants assigned to the target service.  In this situation, ARCA has
different opportunities to overcome the possible overload (or
underload) of the system.  We explore the anticipation approach
discussed in depth in ICIN 2018 [ICIN-2018].  Its operation is
enclosed in the analyzer and decider, and it is based on an algorithm
that is divided into two sub-algorithms.

The first sub-algorithm reacts to the detection of the incident and
the subsequent correlation of its severity with the amount of
servants required by the system.  This sub-algorithm hosts the
regression of the learner, which is based on the SVM/SVR technique,
and predicts the necessary resources from two features: the severity
of the incident and the time elapsed since it happened.  The
resulting amount of servants is established as the minimum amount
that the VIM can use.

The second sub-algorithm is fed with the CPU load measurements of the
servants assigned to the service, as reported by the OpenStack
platform.  With this information it checks whether the system is
within the operating parameters established by the NFVO.  If not, it
adjusts the resources assigned to the system, using the minimum
amount established by the other sub-algorithm as the basis for the
assignment.  After every correction, this algorithm learns the
behavior by adding new correlation vectors to the SVM/SVR structure.
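The first sub-algorithm can be approximated with an off-the-shelf
regressor.  The sketch below is a minimal illustration, not the
actual ARCA learner: it assumes scikit-learn's SVR, the training
vectors are hypothetical, and it refits the model on every correction
to mimic the incremental addition of correlation vectors, since SVR
has no native incremental update.

   # Minimal sketch of the anticipation sub-algorithm: an SVR that
   # maps (incident severity, seconds elapsed) to the minimum server
   # assignation (MSA).  The training vectors below are hypothetical.
   import math
   from sklearn.svm import SVR

   # Features: [severity (0-4), elapsed time (s)] -> observed MSA.
   X = [[0, 0], [1, 10], [2, 15], [3, 10], [3, 20], [4, 15]]
   y = [1, 2, 4, 6, 5, 9]

   model = SVR(kernel="rbf", C=10.0)
   model.fit(X, y)

   def anticipated_msa(severity, elapsed):
       """Predict the MSA floor for the current incident state."""
       raw = model.predict([[severity, elapsed]])[0]
       # Clamp to [1, hard maximum of 15] as in the scenario.
       return max(1, min(15, math.ceil(raw)))

   def learn_correction(severity, elapsed, observed_msa):
       """Mimic 'adding new correlation vectors': append and refit."""
       X.append([severity, elapsed])
       y.append(observed_msa)
       model.fit(X, y)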
When the experiment is running, the collector component of the ARCA-
based VIM is attached to the telemetry interface of OpenStack, using
the SDK to access the measurement data generated by Ceilometer and
stored by Gnocchi.  In addition, it is attached to the external event
detectors in order to receive their notifications.  The enforcer
component, in turn, is attached to the Compute interface of
OpenStack, also using its SDK, to request the infrastructure to
create, destroy, query, or change the status of a VM that hosts a
servant of the controlled system.  Finally, the enforcer also updates
the lists of servers used by the load balancers to distribute the
clients among the available resources.

During the execution of the experiment, we make the ARCA-based VIM
report the severity of the last incident (if any), the time elapsed
since it occurred, the amount of servants assigned to the controlled
system, the minimum amount of servants to be assigned as determined
by the anticipation algorithm, and the average load of all servants.
In this instance, the severities are spread between 0 (no incident)
and 4 (strongest incident), the elapsed times are less than 35
seconds, and the minimum server assignation (MSA) stays below 10,
although the hard maximum is 15.

With these measurements we illustrate how the correlation of the
three features (dimensions) mentioned above is learned.  When there
is no incident (severity = 0), the MSA is kept at the minimum.  In
parallel, regardless of the severity level, the algorithm learned
that there is no need to increase the MSA during the first 5 to 10
seconds.  This shows the behavior discussed in this document, namely
that there is a delay between the occurrence of an event and the
actual need for an updated amount of resources, and it forms one
fundamental aspect of our research.

By inspecting the results, we know that there is a burst of client
demands whose peak is centered around 15 seconds after the occurrence
of an incident or any other change in the accounted severity.  We
also know that the burst lasts longer for higher severities, and that
it fluctuates a bit for the highest severities.  Finally, we can also
notice that, for the majority of severities, the increased MSA is no
longer required 25 seconds after the severity change was notified.

All that information becomes part of the knowledge of ARCA.  It is
stored both in the internal structures of the SVM/SVR and, once
represented semantically, in the semantic database that manages the
knowledge base of ARCA, and it is used to predict future behavior.
For instance, if an incident of severity 3 occurred 10 seconds ago,
ARCA knows that it will need to set the MSA to 6 servants.  In fact,
this information has been used during the experiment, so we can also
assess the accuracy of the algorithm by comparing the anticipated MSA
value with the required value (or even the best value).  However, the
analysis of such information is left for the future.
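Bringing the pieces together, the corrective loop of the second sub-
algorithm can be outlined as follows.  This is a minimal sketch under
stated assumptions: the polling interval, the load bounds, and the
helper callables (for polling the average CPU load, counting the
servants, and scaling them) are hypothetical stand-ins for the NFVO
parameters and the collector/enforcer operations, and it reuses
anticipated_msa and learn_correction from the earlier sketch.

   # Minimal sketch of the corrective loop: keep the average servant
   # CPU load within the bounds set by the NFVO, never dropping below
   # the anticipated MSA floor.  All bounds and helper callables are
   # hypothetical.
   import time

   LOW, HIGH = 0.30, 0.70   # hypothetical NFVO load bounds
   HARD_MAX = 15            # hard maximum amount of servants

   def control_loop(get_avg_cpu_load, count_servants,
                    scale_servants_to, current_incident):
       while True:
           load = get_avg_cpu_load()      # polled from Gnocchi
           n = count_servants()
           severity, elapsed = current_incident()
           floor = anticipated_msa(severity, elapsed)  # 1st sub-alg.
           if load > HIGH:                # overload: add a servant
               target = n + 1
           elif load < LOW:               # underload: remove one
               target = n - 1
           else:
               target = n
           target = max(floor, min(HARD_MAX, target))
           if target != n:
               scale_servants_to(target)
               learn_correction(severity, elapsed, target)
           time.sleep(10)  # telemetry granularity of the platform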
While preparing and executing the experiment, we found several
limitations intrinsic to the current OpenStack platform.  First,
regardless of the CPU and memory resources assigned to the underlying
controller nodes, the platform is unable to record and deliver
performance measurements at intervals shorter than 10 seconds, so it
is currently not suitable for real-time operations, which is
important for our long-term research objectives.  Moreover, we found
that the time required by the infrastructure to create a server that
hosts a somewhat heavy servant is around 10 seconds, which is too far
from our targets.  Although these limitations may be overcome in the
future, they clearly justify that our anticipation approach is
essential for the proper working of a virtual system and, thus, that
the integration of external information becomes mandatory for future
system management technologies, especially in virtualization
environments.

Finally, we found it difficult to have the required measurements
pushed to external components, so we had to poll for them.  The
alternative would be to instantiate some component of ARCA alongside
the main OpenStack components and services, so that it has first-hand
and prompt access to such features.  This way, ARCA could receive
push notifications with the measurements, as it already does for the
external detectors.  This is a key aspect that affects the placement
of the NFV-VIM, or some subpart of it, in the general architecture.
Therefore, for future iterations of the NFV reference architecture,
an integrated view of the VIM and the NFVI could be required to
reflect the future reality.

Authors' Addresses

   Pedro Martinez-Julia (editor)
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Phone: +81 42 327 7293
   Email: pedro@nict.go.jp

   Shunsuke Homma
   NTT
   Japan

   Email: shunsuke.homma.fp@hco.ntt.co.jp