OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                       Cisco Systems, Inc.
Expires: September 10, 2020                                  Y. El Fathi
                                                Orange Business Services
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                           March 9, 2020


      Service Assurance for Intent-based Networking Architecture
        draft-claise-opsawg-service-assurance-architecture-02

Abstract

This document describes an architecture for Service Assurance for
Intent-based Networking (SAIN).  This architecture aims at assuring
that service instances are correctly running.  As services rely on
multiple subservices provided by the underlying network devices,
assuring a healthy service is only possible with a holistic view of
the network devices.  This architecture not only helps to correlate a
service degradation with its network root cause but also identifies
which services are impacted when a network component fails or
degrades.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 10, 2020.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.  Code Components
extracted from this document must include Simplified BSD License text
as described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Terminology
2.  Introduction
3.  Architecture
  3.1.  Decomposing a Service Instance Configuration into an
        Assurance Graph
  3.2.  Intent and Assurance Graph
  3.3.  Subservices
  3.4.  Building the Expression Graph from the Assurance Graph
  3.5.  Building the Expression from a Subservice
  3.6.  Open Interfaces with YANG Modules
  3.7.  Handling Maintenance Windows
  3.8.  Flexible Architecture
4.  Security Considerations
5.  IANA Considerations
6.  Open Issues
7.  References
  7.1.  Normative References
  7.2.  Informative References
Appendix A.  Changes between revisions
Acknowledgements
Authors' Addresses

1.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

SAIN Agent: Component that communicates with a device, a set of
devices, or another agent to build an expression graph from a
received assurance graph and perform the corresponding computation.

Assurance Graph: DAG representing the assurance case for one or
several service instances.  The nodes (also known as vertices in the
context of a DAG) are the service instances themselves and the
subservices; the edges indicate dependency relations.

SAIN Collector: Component that fetches or receives the computer-
consumable output of the agent(s) and displays it in a user-friendly
form or processes it locally.

DAG: Directed Acyclic Graph.

ECMP: Equal-Cost Multipath.

Expression Graph: Generic term for a DAG representing a computation
in SAIN.  More specific terms are:

o  Subservice Expressions: expression graph representing all the
   computations to execute for a subservice.

o  Service Expressions: expression graph representing all the
   computations to execute for a service instance, i.e., including
   the computations for all dependent subservices.

o  Global Computation Graph: expression graph representing all the
   computations to execute for all service instances (i.e., all
   computations performed).

Dependency: The directed relationship between subservice instances in
the assurance graph.
Informational Dependency: Type of dependency whose score does not
impact the score of its parent subservice or service instance(s) in
the assurance graph.  However, the symptoms should be taken into
account in the parent service instance or subservice instance(s), for
informational reasons.

Impacting Dependency: Type of dependency whose score impacts the
score of its parent subservice or service instance(s) in the
assurance graph.  The symptoms are taken into account in the parent
service instance or subservice instance(s), as the impacting reasons.

Metric: Information retrieved from a network device.

Metric Engine: Maps metrics to a list of candidate metric
implementations depending on the target model.

Metric Implementation: Actual way of retrieving a metric from a
device.

Network Service YANG Module: Describes the characteristics of a
service, as agreed upon with consumers of that service [RFC8199].

Service Instance: A specific instance of a service.

Service Configuration Orchestrator: Quoting RFC 8199, "Network
Service YANG Modules describe the characteristics of a service, as
agreed upon with consumers of that service.  That is, a service
module does not expose the detailed configuration parameters of all
participating network elements and features but describes an
abstract model that allows instances of the service to be decomposed
into instance data according to the Network Element YANG Modules of
the participating network elements.  The service-to-element
decomposition is a separate process; the details depend on how the
network operator chooses to realize the service.  For the purpose of
this document, the term "orchestrator" is used to describe a system
implementing such a process."

SAIN Orchestrator: Component of SAIN in charge of fetching the
configuration specific to each service instance and converting it
into an assurance graph.

Health Status: Score and symptoms indicating whether a service
instance or a subservice is healthy.  A non-maximal score MUST always
be explained by one or more symptoms.

Health Score: Integer ranging from 0 to 100 indicating the health of
a subservice.  A score of 0 means that the subservice is broken; a
score of 100 means that the subservice is perfectly operational.

Subservice: Part of an assurance graph that assures a specific
feature or subpart of the network system.

Symptom: Reason explaining why a service instance or a subservice is
not completely healthy.

2.  Introduction

Network Service YANG Modules [RFC8199] describe the configuration,
state data, operations, and notifications of abstract representations
of services implemented on one or multiple network elements.

Quoting RFC 8199: "Network Service YANG Modules describe the
characteristics of a service, as agreed upon with consumers of that
service.  That is, a service module does not expose the detailed
configuration parameters of all participating network elements and
features but describes an abstract model that allows instances of the
service to be decomposed into instance data according to the Network
Element YANG Modules of the participating network elements.  The
service-to-element decomposition is a separate process; the details
depend on how the network operator chooses to realize the service.
For the purpose of this document, the term "orchestrator" is used to
describe a system implementing such a process."

In other words, service configuration orchestrators deploy Network
Service YANG Modules through the configuration of Network Element
YANG Modules.  Network configuration is based on those YANG data
models, with protocols/encodings such as NETCONF/XML [RFC6241],
RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Since knowing that
a configuration has been applied does not imply that the service is
running correctly (for example, the service might be degraded because
of a failure in the network), the network operator must monitor the
service operational data at the same time as the configuration.  The
industry has been standardizing on telemetry to push network element
performance information.

A network administrator needs to monitor her network and services as
a whole, independently of the use cases or the management protocols.
With different protocols come different data models, and different
ways to model the same type of information.  When network
administrators deal with multiple protocols, the network management
system must perform the difficult and time-consuming job of mapping
data models: the model used for configuration with the model used
for monitoring.  This problem is compounded by a large, disparate set
of data sources (MIB modules, YANG models [RFC7950], IPFIX
information elements [RFC7011], syslog plain text [RFC3164], TACACS+
[I-D.ietf-opsawg-tacacs], RADIUS [RFC2865], etc.).  In order to avoid
this data model mapping, the industry converged on model-driven
telemetry to stream the service operational data, reusing the YANG
models used for configuration.  Model-driven telemetry greatly
facilitates the notion of closed-loop automation, whereby events from
the network drive remediation changes back into the network.

However, it proves difficult for network operators to correlate a
service degradation with its network root cause.  For example, why
does my L3VPN fail to connect?  Why is this specific service slow?
The reverse, i.e., which services are impacted when a network
component fails or degrades, is even more interesting for the
operators.  For example, which service(s) is (are) impacted when the
dBm of this specific optic begins to degrade?  Which application is
impacted by this ECMP imbalance?  Is that issue actually impacting
any other customers?

Intent-based approaches are often declarative, starting from a
statement such as "The service works correctly" and trying to
enforce it.  Such approaches are mainly suited for greenfield
deployments.

Instead of approaching intent in a declarative way, this framework
focuses on already defined services and tries to infer the meaning of
"The service works correctly".  To do so, the framework works from an
assurance graph, deduced from the service definition and from the
network configuration.  This assurance graph is decomposed into
components, which are then assured independently.  The root of the
assurance graph represents the service to assure, and its children
represent components identified as its direct dependencies; each
component can have dependencies as well.  The SAIN architecture
maintains the correct assurance graph when services are modified or
when the network conditions change.
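As a rough illustration of this decomposition (a sketch only; the
class and field names below are hypothetical and are not taken from
any SAIN specification), an assurance graph can be modeled as nodes
linked by typed, directed dependencies:

   from dataclasses import dataclass, field
   from enum import Enum
   from typing import List

   class DependencyType(Enum):
       IMPACTING = "impacting"          # score propagates to parent
       INFORMATIONAL = "informational"  # only symptoms propagate

   @dataclass
   class Node:
       """A service instance or subservice instance in the graph."""
       name: str
       dependencies: List["Dependency"] = field(default_factory=list)

       def depends_on(self, child: "Node",
                      dep_type: DependencyType) -> None:
           self.dependencies.append(Dependency(child, dep_type))

   @dataclass
   class Dependency:
       target: Node
       dep_type: DependencyType

   # The root is the service to assure; its children are the
   # components identified as its direct dependencies.
   tunnel = Node("Tunnel Service Instance")
   tunnel.depends_on(Node("Peer1 Tunnel Interface"),
                     DependencyType.IMPACTING)
   tunnel.depends_on(Node("IP Connectivity"),
                     DependencyType.IMPACTING)

The two dependency types (impacting versus informational), defined in
Section 1, determine how scores and symptoms propagate upwards in the
graph.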
When a service is degraded, the framework will highlight where in the
assurance graph to look, as opposed to going hop by hop to
troubleshoot the issue.  Not only can this framework help to
correlate service degradation with network root causes/symptoms, but
it can also deduce from the assurance graph the number and type of
services impacted by a component degradation/failure.  This added
value informs the operational team where to focus its attention for
maximum return.

This architecture provides the building blocks to assure both
physical and virtual entities and is flexible in terms of services
and subservices, of (distributed) graphs, and of components
(Section 3.8).

3.  Architecture

SAIN aims at assuring that service instances are operating correctly
and, if not, at pinpointing what is wrong.  More precisely, SAIN
computes a score for each service instance and outputs symptoms
explaining that score, especially why the score is not maximal.  The
score augmented with the symptoms is called the health status.

As an example of a service, let us consider a point-to-point L2VPN
connection (i.e., a pseudowire).  Such a service would take as
parameters the two ends of the connection (device, interface or
subinterface, and address of the other end) and configure both
devices (and maybe more) so that an L2VPN connection is established
between the two devices.  Examples of symptoms might be "Interface
has high error rate", "Interface flapping", or "Device almost out of
memory".

To compute the health status of such a service, the service is
decomposed into an assurance graph formed by subservices linked
through dependencies.  Each subservice is then turned into an
expression graph that details how to fetch metrics from the devices
and compute the health status of the subservice.  The subservice
expressions are combined according to the dependencies between the
subservices in order to obtain the expression graph that computes the
health status of the service, as sketched below.
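As a minimal sketch of this computation, assume a simple combination
rule in which the worst score among impacting dependencies wins and
symptoms accumulate (the actual heuristics are deliberately left open
by this architecture, and the metric names below are hypothetical):

   from typing import Dict, List, Tuple

   # A health status is a score (0..100) plus the explaining symptoms.
   HealthStatus = Tuple[int, List[str]]

   def interface_health(metrics: Dict[str, float]) -> HealthStatus:
       """Hypothetical subservice expression: infer an interface's
       health from two metrics fetched from the device."""
       score, symptoms = 100, []
       if metrics["error_rate"] > 0.01:
           score, symptoms = 50, ["Interface has high error rate"]
       if metrics["oper_status"] != 1:  # 1 means "up"
           score, symptoms = 0, symptoms + ["Interface is down"]
       return score, symptoms

   def combine(children: List[HealthStatus]) -> HealthStatus:
       """Combine the statuses of impacting dependencies into the
       parent's status: worst score wins, symptoms accumulate."""
       score = min((s for s, _ in children), default=100)
       symptoms = [sym for _, syms in children for sym in syms]
       return score, symptoms

   # A tunnel service depending on two interface subservices.
   peer1 = interface_health({"error_rate": 0.02, "oper_status": 1})
   peer2 = interface_health({"error_rate": 0.0, "oper_status": 1})
   print(combine([peer1, peer2]))
   # -> (50, ['Interface has high error rate'])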
The overall architecture of our solution is presented in Figure 1.
Based on the service configuration, the SAIN orchestrator deduces the
assurance graph.  It then sends the assurance graph, along with some
other configuration options, to the SAIN agents.  The SAIN agents are
responsible for building the expression graph and computing the
health statuses in a distributed manner.  The collector is in charge
of collecting and displaying the current inferred health status of
the service instances and subservices.  Finally, the automation loop
is closed by having the SAIN Collector provide feedback to the
network orchestrator.

   +-----------------+
   |     Service     |
   |  Configuration  |<--------------------+
   |  Orchestrator   |                     |
   +-----------------+                     |
      |    |                               |
      |    | Network                       |
      |    | Service                       | Feedback
      |    | Instance                      | Loop
      |    | Configuration                 |
      |    |                               |
      |    V                               |
      |  +-----------------+     +-------------------+
      |  |      SAIN       |     |       SAIN        |
      |  |  Orchestrator   |     |     Collector     |
      |  +-----------------+     +-------------------+
      |    |                               ^
      |    | Configuration                 | Health Status
      |    | (assurance graph)             | (Score + Symptoms)
      |    V                               | Streamed
      |   +-------------------+            | via Telemetry
      |   |+-------------------+           |
      |   ||+-------------------+          |
      |   +||       SAIN        |----------+
      |    +|      agent        |
      |     +-------------------+
      |        ^   ^   ^
      |        |   |   |
      |        |   |   |  Metric Collection
      V        V   V   V
   +-------------------------------------------------------------+
   |                     Monitored Entities                      |
   |                                                             |
   +-------------------------------------------------------------+

                     Figure 1: SAIN Architecture

In order to produce the score assigned to a service instance, the
architecture performs the following tasks:

o  Analyze the configuration pushed to the network device(s) for
   configuring the service instance and decide which information is
   needed from the device(s) (such a piece of information is called a
   metric) and which operations to apply to the metrics for computing
   the health status.

o  Stream (via telemetry [RFC8641]) operational and config metric
   values when possible; otherwise, continuously poll.

o  Continuously compute the health status of the service instances,
   based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

In order to structure the assurance of a service instance, the
service instance is decomposed into so-called subservice instances.
Each subservice instance focuses on a specific feature or subpart of
the network system.

The decomposition into subservices is an important function of this
architecture, for the following reasons:

o  The result of this decomposition is the assurance case of a
   service instance, which can be represented as a graph (called the
   assurance graph) to the operator.

o  Subservices provide a scope for particular expertise and thereby
   enable contributions from external experts.  For instance, the
   subservice dealing with the optics health should be reviewed and
   extended by an expert in optical interfaces.

o  Subservices that are common to several service instances are
   reused, reducing the amount of computation needed.

The assurance graph of a service instance is a DAG representing the
structure of the assurance case for the service instance.  The nodes
of this graph are service instances or subservice instances.  Each
edge of this graph indicates a dependency between the two nodes at
its extremities: the service or subservice at the source of the edge
depends on the service or subservice at the destination of the edge.

Figure 2 depicts a simplified example of the assurance graph for a
tunnel service.  The node at the top is the service instance; the
nodes below are its dependencies.  In the example, the tunnel service
instance depends on the Peer1 and Peer2 tunnel interfaces, which in
turn depend on the respective physical interfaces, which finally
depend on the respective Peer1 and Peer2 devices.  The tunnel service
instance also depends on the IP connectivity, which depends on the
IS-IS routing protocol.
                      +------------------+
                      |      Tunnel      |
                      | Service Instance |
                      +------------------+
                               |
         +---------------------+---------------------+
         |                     |                     |
  +-------------+       +-------------+       +--------------+
  |    Peer1    |       |    Peer2    |       |      IP      |
  |    Tunnel   |       |    Tunnel   |       | Connectivity |
  |  Interface  |       |  Interface  |       |              |
  +-------------+       +-------------+       +--------------+
         |                     |                     |
  +-------------+       +-------------+       +-------------+
  |    Peer1    |       |    Peer2    |       |    IS-IS    |
  |   Physical  |       |   Physical  |       |   Routing   |
  |  Interface  |       |  Interface  |       |   Protocol  |
  +-------------+       +-------------+       +-------------+
         |                     |
  +-------------+       +-------------+
  |    Peer1    |       |    Peer2    |
  |    Device   |       |    Device   |
  +-------------+       +-------------+

                  Figure 2: Assurance Graph Example

Depicting the assurance graph helps the operator to understand (and
assert) the decomposition.  The assurance graph shall be maintained
during normal operation with addition, modification, and removal of
service instances.  A change in the network configuration or topology
shall be reflected in the assurance graph.  As a first example, a
change of routing protocol from IS-IS to OSPF would change the
assurance graph accordingly.  As a second example, assume that ECMP
is in place for the source router for that specific tunnel; in that
case, multiple interfaces must now be monitored, on top of monitoring
the ECMP health itself.

3.2.  Intent and Assurance Graph

The SAIN orchestrator analyzes the configuration of a service
instance to:

o  Try to capture the intent of the service instance, i.e., what the
   service instance is trying to achieve,

o  Decompose the service instance into subservices representing the
   network features on which the service instance relies.

The SAIN orchestrator must be able to analyze configuration from
various devices and produce the assurance graph.

To schematize what a SAIN orchestrator does, assume that the
configuration for a service instance touches two devices and
configures a virtual tunnel interface on each device.  Then:

o  Capturing the intent would start by detecting that the service
   instance is actually a tunnel between the two devices, and stating
   that this tunnel must be functional.  This is the current state of
   SAIN; however, it does not completely capture the intent, which
   might additionally include, for instance, the latency and
   bandwidth requirements of this tunnel.

o  Decomposing the service instance into subservices would result in
   the assurance graph depicted in Figure 2, for instance.

In order for SAIN to be applied, the configuration necessary for each
service instance should be identifiable and thus should come from a
"service-aware" source.  While Figure 1 makes a distinction between
the SAIN orchestrator and a different component providing the service
instance configuration, in practice those two components are most
likely combined.  The internals of the orchestrator are currently out
of the scope of this document.

3.3.  Subservices

A subservice corresponds to a subpart or a feature of the network
system that is needed for a service instance to function properly.
In the context of SAIN, "subservice" is actually a shortcut for
"subservice assurance", that is, the method for assuring that a
subservice behaves correctly.
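To make the notion more concrete, the following sketch (with purely
illustrative names; the actual modeling is defined in the companion
YANG document) shows a subservice identified by a type and by the
parameters that pin it to a specific instance:

   from dataclasses import dataclass
   from typing import Tuple

   @dataclass(frozen=True)
   class Subservice:
       """A subservice instance: a type plus the parameters that pin
       it to a specific feature or subpart of the network system."""
       type: str                                # e.g., "interface"
       parameters: Tuple[Tuple[str, str], ...]  # hashable key/values

   device = Subservice("device", (("deviceId", "peer1"),))
   interface = Subservice("interface", (("deviceId", "peer1"),
                                        ("interfaceId", "ge-0/0/1")))

   # Subservices common to several service instances collapse to a
   # single node, so their health is computed only once (Section 3.1).
   nodes = {device, interface,
            Subservice("device", (("deviceId", "peer1"),))}
   assert len(nodes) == 2  # the duplicate "device" subservice is reused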
Subservices, exactly like services, have high-level parameters that
specify the type and specific instance to be assured.  For example,
assuring a device requires the specific deviceId as a parameter;
assuring an interface requires the specific combination of deviceId
and interfaceId.

A subservice is also characterized by a list of metrics to fetch and
a list of computations to apply to these metrics in order to infer a
health status.

3.4.  Building the Expression Graph from the Assurance Graph

From the assurance graph is derived a so-called global computation
graph.  First, each subservice instance is transformed into a set of
subservice expressions that take metrics and constants as input
(i.e., sources of the DAG) and produce the status of the subservice,
based on some heuristics.  Then, for each service instance, the
service expressions are constructed by combining the subservice
expressions of its dependencies.  The way service expressions are
combined depends on the dependency types (impacting or
informational).  Finally, the global computation graph is built by
combining the service expressions.  In other words, the global
computation graph encodes all the operations needed to produce health
statuses from the collected metrics.

Subservices shall be device independent.  To justify this, let's
consider the interface operational status.  Depending on the device
capabilities, this status can be collected by an industry-accepted
YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
even by a MIB module.  If the subservice were dependent on the
mechanism to collect the operational status, then we would need
multiple subservice definitions in order to support all the different
mechanisms.  This also implies that, while waiting for all the
metrics to be available via standard YANG modules, SAIN agents might
have to retrieve metric values via non-standard YANG models, via MIB
modules, via the Command Line Interface (CLI), etc., effectively
implementing a normalization layer between data models and
information models.

In order to keep subservices independent from the metric collection
method, or, expressed differently, to support multiple combinations
of platforms, OSes, and even vendors, the framework introduces the
concept of "metric engine".  The metric engine maps each device-
independent metric used in the subservices to a list of device-
specific metric implementations that precisely define how to fetch
values for that metric.  The mapping is parameterized by the
characteristics (model, OS version, etc.) of the device from which
the metrics are fetched.
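A possible sketch of such a metric engine, assuming a simple
capability-based lookup (the mapping table and function are
illustrative; only the YANG paths and MIB object shown are real
identifiers):

   from typing import Dict, List

   # Device-independent metric name -> candidate implementations,
   # keyed by the data-model family a device may support.
   METRIC_IMPLEMENTATIONS: Dict[str, Dict[str, str]] = {
       "interface-oper-status": {
           "ietf": "/ietf-interfaces:interfaces-state/interface"
                   "/oper-status",
           "openconfig": "/interfaces/interface/state/oper-status",
           "mib": "IF-MIB::ifOperStatus",
       },
   }

   def resolve_metric(metric: str, capabilities: List[str]) -> str:
       """Map a device-independent metric to the first implementation
       supported by the target device, keeping the subservice
       definitions device independent."""
       for family in ("ietf", "openconfig", "mib"):
           if family in capabilities:
               return METRIC_IMPLEMENTATIONS[metric][family]
       raise LookupError(f"no implementation for {metric}")

   # A device supporting only OpenConfig and SNMP gets the
   # OpenConfig path.
   print(resolve_metric("interface-oper-status", ["openconfig", "mib"]))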
3.5.  Building the Expression from a Subservice

In addition to the list of metrics, each subservice defines a list of
expressions to apply to the metrics in order to compute the health
status of the subservice.  The definition or the standardization of
those expressions (also known as heuristics) is currently out of the
scope of this document.

3.6.  Open Interfaces with YANG Modules

The interfaces between the architecture components are open thanks to
the YANG modules specified in YANG Modules for Service Assurance
[I-D.claise-opsawg-service-assurance-yang]; they specify objects for
assuring network services based on their decomposition into so-called
subservices, according to the SAIN architecture.

This module is intended for the following use cases:

o  Assurance graph configuration:

   *  Subservices: configure a set of subservices to assure, by
      specifying their types and parameters.

   *  Dependencies: configure the dependencies between the
      subservices, along with their types.

o  Assurance telemetry: export the health status of the subservices,
   along with the observed symptoms.

3.7.  Handling Maintenance Windows

Whenever network components are under maintenance, the operator wants
to inhibit the emission of symptoms from those components.  A typical
use case is device maintenance, during which the device is not
supposed to be operational.  As such, symptoms related to the device
health should be ignored, as should symptoms related to the device-
specific subservices, such as the interfaces, as their state changes
are probably a consequence of the maintenance.

To configure network components as "under maintenance" in the SAIN
architecture, the ietf-service-assurance model proposed in
[I-D.claise-opsawg-service-assurance-yang] specifies an "under-
maintenance" flag per service or subservice instance.  When, and only
when, this flag is set, the companion field "maintenance-contact"
must be set to a string that identifies the person or process who
requested the maintenance.  Any symptom produced by a service or
subservice under maintenance, or by one of its dependencies, MUST NOT
be reported.  A service or subservice under maintenance MAY propagate
a symptom "Under Maintenance" towards services or subservices that
depend on it.

We illustrate this mechanism with three independent examples based on
the assurance graph depicted in Figure 2; a sketch of the resulting
symptom inhibition follows the examples:

o  Device maintenance, for instance upgrading the device OS.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Device".  This inhibits the emission of symptoms from
   "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
   Service Instance".  All other subservices are unaffected.

o  Interface maintenance, for instance replacing a broken optic.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Physical Interface".  This inhibits the emission of
   symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
   Instance".  All other subservices are unaffected.

o  Routing protocol maintenance, for instance modifying parameters or
   redistribution.  The operator sets the "under-maintenance" flag
   for the subservice "IS-IS Routing Protocol".  This inhibits the
   emission of symptoms from "IP Connectivity" and "Tunnel Service
   Instance".  All other subservices are unaffected.
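The following sketch illustrates the inhibition logic in the device
maintenance example (the traversal and naming are illustrative; the
normative behavior is the one stated above):

   from typing import Dict, List, Set

   def inhibited(node: str, deps: Dict[str, List[str]],
                 under_maintenance: Set[str]) -> bool:
       """Symptoms of a node are inhibited if the node itself, or any
       node it (transitively) depends on, is under maintenance."""
       if node in under_maintenance:
           return True
       return any(inhibited(d, deps, under_maintenance)
                  for d in deps.get(node, []))

   # Dependencies from Figure 2 (parent -> children).
   deps = {
       "Tunnel Service Instance": ["Peer1 Tunnel Interface",
                                   "Peer2 Tunnel Interface",
                                   "IP Connectivity"],
       "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
       "Peer1 Physical Interface": ["Peer1 Device"],
       "IP Connectivity": ["IS-IS Routing Protocol"],
   }

   # Device maintenance: flag "Peer1 Device" as under maintenance.
   maint = {"Peer1 Device"}
   for n in ("Peer1 Physical Interface", "Peer1 Tunnel Interface",
             "Tunnel Service Instance", "Peer2 Tunnel Interface"):
       print(n, "inhibited:", inhibited(n, deps, maint))
   # Peer1's chain and the service are inhibited; Peer2's is not.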
3.8.  Flexible Architecture

The SAIN architecture is flexible in terms of components.  While the
SAIN architecture in Figure 1 makes a distinction between two
components, the service configuration orchestrator and the SAIN
orchestrator, in practice those two components are most likely
combined.  Similarly, the SAIN agents are displayed in Figure 1 as
separate components.  Practically, the SAIN agents could be either
independent components or directly integrated into the monitored
entities.  A practical example is an agent in a router.

The SAIN architecture is also flexible in terms of services and
subservices.  Most examples in this document deal with the notion of
Network Service YANG modules, with well-known services such as L2VPNs
or tunnels.  However, the concept of service is general enough to
cross into different domains.  One of them is the domain of service
management on network elements, which also requires its own
assurance.  Examples include a DHCP server on a Linux server, a data
plane, an IPFIX export, etc.  The notion of "service" is generic in
this architecture.  Indeed, a configured service can itself be a
service for someone else: a DHCP server, a data plane, or an IPFIX
export can be considered a service for a device, a routing instance
can be considered a service for an L3VPN, and a tunnel can be
considered a service for an application in the cloud.  The assurance
graph is created to be flexible and open, regardless of the
subservice types, locations, or domains.

The SAIN architecture is also flexible in terms of distributed
graphs.  As shown in Figure 1, our architecture comprises several
agents.  Each agent is responsible for handling a subgraph of the
assurance graph.  The collector is responsible for fetching the
subgraphs from the different agents and gluing them together.  As an
example, in the graph from Figure 2, the subservices relative to
Peer1 might be handled by a different agent than the subservices
relative to Peer2, and the IP Connectivity and IS-IS subservices
might be handled by yet another agent.  The agents will export their
partial graphs, and the collector will stitch them together as
dependencies of the service instance.

Finally, the SAIN architecture is flexible in terms of what it
monitors.  Most, if not all, examples in this document refer to
physical components, but this is not a constraint.  Indeed, the
assurance of virtual components would follow the same principles, and
an assurance graph composed of virtualized components (or a mix of
virtualized and physical ones) is entirely possible within this
architecture.

4.  Security Considerations

The SAIN architecture helps operators to reduce the mean time to
detect and the mean time to repair.  As such, it should not cause any
security threats.  However, the SAIN agents must be secure: a
compromised SAIN agent could send wrong root causes or symptoms to
the management systems.

Except for the configuration of telemetry, the agents do not need
"write access" to the devices they monitor.  This configuration is
applied with a YANG module, whose protection is covered by Secure
Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

If a closed-loop system relies on this architecture, then the well-
known issues of such systems also apply, i.e., a lying device or a
compromised agent could trigger partial reconfiguration of the
service or network.  The SAIN architecture neither augments nor
reduces this risk.

5.  IANA Considerations

This document includes no request to IANA.

6.  Open Issues

o  Security Considerations to be completed.

7.  References

7.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References
[I-D.claise-opsawg-service-assurance-yang]
           Claise, B. and J. Quilbeuf, "YANG Modules for Service
           Assurance", draft-claise-opsawg-service-assurance-yang
           (work in progress), February 2020.

[I-D.ietf-opsawg-tacacs]
           Dahm, T., Ota, A., Medway Gash, D., Carrel, D., and
           L. Grant, "The TACACS+ Protocol", draft-ietf-opsawg-
           tacacs-17 (work in progress), November 2019.

[RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
           "Remote Authentication Dial In User Service (RADIUS)",
           RFC 2865, DOI 10.17487/RFC2865, June 2000,
           <https://www.rfc-editor.org/info/rfc2865>.

[RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
           DOI 10.17487/RFC3164, August 2001,
           <https://www.rfc-editor.org/info/rfc3164>.

[RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
           and A. Bierman, Ed., "Network Configuration Protocol
           (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
           <https://www.rfc-editor.org/info/rfc6241>.

[RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
           Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
           <https://www.rfc-editor.org/info/rfc6242>.

[RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
           "Specification of the IP Flow Information Export (IPFIX)
           Protocol for the Exchange of Flow Information", STD 77,
           RFC 7011, DOI 10.17487/RFC7011, September 2013,
           <https://www.rfc-editor.org/info/rfc7011>.

[RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
           RFC 7950, DOI 10.17487/RFC7950, August 2016,
           <https://www.rfc-editor.org/info/rfc7950>.

[RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
           Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
           <https://www.rfc-editor.org/info/rfc8040>.

[RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
           Classification", RFC 8199, DOI 10.17487/RFC8199, July
           2017, <https://www.rfc-editor.org/info/rfc8199>.

[RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
           Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
           <https://www.rfc-editor.org/info/rfc8446>.

[RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
           for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
           September 2019, <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

v00 - v01

o  Terminology clarifications

o  Figure 1 improved

Acknowledgements

The authors would like to thank Stephane Litkowski, Charles Eckel,
Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, and Stefan
Vallin for their reviews and feedback.

Authors' Addresses

Benoit Claise
Cisco Systems, Inc.
De Kleetlaan 6a b1
1831 Diegem
Belgium

Email: bclaise@cisco.com

Jean Quilbeuf
Cisco Systems, Inc.
1, rue Camille Desmoulins
92782 Issy Les Moulineaux
France

Email: jquilbeu@cisco.com

Youssef El Fathi
Orange Business Services
61 rue des archives
75003 Paris
France

Email: io@elfathi.net

Diego R. Lopez
Telefonica I+D
Don Ramon de la Cruz, 82
Madrid 28006
Spain

Email: diego.r.lopez@telefonica.com

Dan Voyer
Bell Canada
Canada

Email: daniel.voyer@bell.ca