OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                       Cisco Systems, Inc.
Expires: January 28, 2021                                    Y. El Fathi
                                                Orange Business Services
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                           July 27, 2020

      Service Assurance for Intent-based Networking Architecture
         draft-claise-opsawg-service-assurance-architecture-03

Abstract

   This document describes an architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are correctly running.  As services rely on
   multiple subservices provided by the underlying network devices,
   getting the assurance of a healthy service is only possible with a
   holistic view of the network devices.  This architecture not only
   helps to correlate a service degradation with its network root
   cause, but also to identify the services impacted when a network
   component fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 28, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Decomposing a Service Instance Configuration into an
           Assurance Graph
     3.2.  Intent and Assurance Graph
     3.3.  Subservices
     3.4.  Building the Expression Graph from the Assurance Graph
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
     3.7.  Handling Maintenance Windows
     3.8.  Flexible Architecture
     3.9.  Timing
     3.10. New Assurance Graph Generation
   4.  Security Considerations
   5.  IANA Considerations
   6.  Open Issues
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SAIN Agent: Component that communicates with a device, a set of
   devices, or another agent to build an expression graph from a
   received assurance graph and perform the corresponding computation.

   Assurance Graph: DAG representing the assurance case for one or
   several service instances.  The nodes (also known as vertices in the
   context of a DAG) are the service instances themselves and the
   subservices; the edges indicate dependency relations.

   SAIN Collector: Component that fetches or receives the computer-
   consumable output of the agent(s) and displays it in a user-friendly
   form or processes it locally.

   DAG: Directed Acyclic Graph.

   ECMP: Equal-Cost Multi-Path.

   Expression Graph: Generic term for a DAG representing a computation
   in SAIN.  More specific terms are:

   o  Subservice Expressions: expression graph representing all the
      computations to execute for a subservice.

   o  Service Expressions: expression graph representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   o  Global Computation Graph: expression graph representing all the
      computations to execute for all service instances (i.e., all
      computations performed).
   Dependency: The directed relationship between subservice instances
   in the assurance graph.

   Informational Dependency: Type of dependency whose score does not
   impact the score of its parent subservice or service instance(s) in
   the assurance graph.  However, the symptoms should be taken into
   account in the parent service instance or subservice instance(s),
   for informational reasons.

   Impacting Dependency: Type of dependency whose score impacts the
   score of its parent subservice or service instance(s) in the
   assurance graph.  The symptoms are taken into account in the parent
   service instance or subservice instance(s), as the impacting
   reasons.

   Metric: Information retrieved from a network device.

   Metric Engine: Maps metrics to a list of candidate metric
   implementations depending on the target model.

   Metric Implementation: Actual way of retrieving a metric from a
   device.

   Network Service YANG Module: Describes the characteristics of a
   service, as agreed upon with consumers of that service [RFC8199].

   Service Instance: A specific instance of a service.

   Service Configuration Orchestrator: Quoting RFC 8199, "Network
   Service YANG Modules describe the characteristics of a service, as
   agreed upon with consumers of that service.  That is, a service
   module does not expose the detailed configuration parameters of all
   participating network elements and features but describes an
   abstract model that allows instances of the service to be decomposed
   into instance data according to the Network Element YANG Modules of
   the participating network elements.  The service-to-element
   decomposition is a separate process; the details depend on how the
   network operator chooses to realize the service.  For the purpose of
   this document, the term "orchestrator" is used to describe a system
   implementing such a process."

   SAIN Orchestrator: Component of SAIN in charge of fetching the
   configuration specific to each service instance and converting it
   into an assurance graph.

   Health Status: Score and symptoms indicating whether a service
   instance or a subservice is healthy.  A non-maximal score MUST
   always be explained by one or more symptoms.

   Health Score: Integer ranging from 0 to 100 indicating the health of
   a subservice.  A score of 0 means that the subservice is broken; a
   score of 100 means that the subservice is perfectly operational.

   Subservice: Part of an assurance graph that assures a specific
   feature or subpart of the network system.

   Symptom: Reason explaining why a service instance or a subservice is
   not completely healthy.

2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting RFC 8199: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."
   In other words, service configuration orchestrators deploy Network
   Service YANG Modules through the configuration of Network Element
   YANG Modules.  Network configuration is based on those YANG data
   models, with protocols/encodings such as NETCONF/XML [RFC6241],
   RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Knowing that a
   configuration has been applied does not imply that the service is
   running correctly (for example, the service might be degraded
   because of a failure in the network); therefore, the network
   operator must monitor the service operational data at the same time
   as the configuration.  The industry has been standardizing on
   telemetry to push network element performance information.

   A network administrator needs to monitor her network and services as
   a whole, independently of the use cases or the management protocols.
   With different protocols come different data models, and different
   ways to model the same type of information.  When network
   administrators deal with multiple protocols, the network management
   system must perform the difficult and time-consuming job of mapping
   data models: the model used for configuration with the model used
   for monitoring.  This problem is compounded by a large, disparate
   set of data sources (MIB modules, YANG models [RFC7950], IPFIX
   information elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [I-D.ietf-opsawg-tacacs], RADIUS [RFC2865], etc.).  In order to
   avoid this data model mapping, the industry converged on model-
   driven telemetry to stream the service operational data, reusing the
   YANG models used for configuration.  Model-driven telemetry greatly
   facilitates the notion of closed-loop automation, whereby events
   from the network drive remediation changes back into the network.

   However, it proves difficult for network operators to correlate the
   service degradation with the network root cause.  For example, why
   does my L3VPN fail to connect?  Why is this specific service slow?
   The reverse, i.e., which services are impacted when a network
   component fails or degrades, is even more interesting for the
   operators.  For example, which service(s) is(are) impacted when this
   specific optical power level (dBm) begins to degrade?  Which
   application is impacted by this ECMP imbalance?  Is that issue
   actually impacting any other customers?

   Intent-based approaches are often declarative, starting from a
   statement such as "The service works correctly" and trying to
   enforce it.  Such approaches are mainly suited for greenfield
   deployments.

   Instead of approaching intent in a declarative way, this framework
   focuses on already defined services and tries to infer the meaning
   of "The service works correctly".  To do so, the framework works
   from an assurance graph, deduced from the service definition and
   from the network configuration.  This assurance graph is decomposed
   into components, which are then assured independently.  The root of
   the assurance graph represents the service to assure, and its
   children represent components identified as its direct dependencies;
   each component can have dependencies as well.  The SAIN architecture
   maintains the correct assurance graph when services are modified or
   when the network conditions change.
   When a service is degraded, the framework will highlight where in
   the assurance graph to look, as opposed to going hop by hop to
   troubleshoot the issue.  Not only can this framework help to
   correlate service degradation with network root cause/symptoms, but
   it can deduce from the assurance graph the number and type of
   services impacted by a component degradation/failure.  This added
   value informs the operational team where to focus its attention for
   maximum return.

   This architecture provides the building blocks to assure both
   physical and virtual entities and is flexible in terms of services
   and subservices, of (distributed) graphs, and of components
   (Section 3.8).

3.  Architecture

   SAIN aims at assuring that service instances are operating correctly
   and, if not, at pinpointing what is wrong.  More precisely, SAIN
   computes a score for each service instance and outputs symptoms
   explaining that score, especially why the score is not maximal.  The
   score augmented with the symptoms is called the health status.

   The SAIN architecture is a generic architecture, applicable to
   multiple environments: obviously wireline, but also wireless,
   including 5G, virtual infrastructure managers (VIMs), and even
   virtual functions.  Thanks to the distributed graph design
   principle, graphs from different environments/orchestrators can be
   combined together.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., a pseudowire).  Such a service would take as
   parameters the two ends of the connection (device, interface or
   subinterface, and address of the other end) and configure both
   devices (and maybe more) so that an L2VPN connection is established
   between the two devices.  Examples of symptoms might be "Interface
   has high error rate", "Interface flapping", or "Device almost out
   of memory".
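   As a non-normative illustration of the health status introduced
   above, the following Python sketch shows one possible representation
   of a score together with its explaining symptoms (the class and
   field names are assumptions of this sketch, not part of the
   architecture):

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class HealthStatus:
          """Health status of a service or subservice instance."""
          score: int  # 0 (broken) to 100 (perfectly operational)
          symptoms: List[str] = field(default_factory=list)

          def __post_init__(self):
              # A non-maximal score MUST be explained by symptoms.
              assert 0 <= self.score <= 100
              assert self.score == 100 or self.symptoms

      # Example: a degraded pseudowire endpoint.
      status = HealthStatus(
          score=70,
          symptoms=["Interface has high error rate",
                    "Interface flapping"])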
   To compute the health status of such a service, the service is
   decomposed into an assurance graph formed by subservices linked
   through dependencies.  Each subservice is then turned into an
   expression graph that details how to fetch metrics from the devices
   and compute the health status of the subservice.  The subservice
   expressions are combined according to the dependencies between the
   subservices in order to obtain the expression graph that computes
   the health status of the service.

   The overall architecture of our solution is presented in Figure 1.
   Based on the service configuration, the SAIN orchestrator deduces
   the assurance graph.  It then sends the assurance graph, along with
   some other configuration options, to the SAIN agents.  The SAIN
   agents are responsible for building the expression graph and
   computing the health statuses in a distributed manner.  The
   collector is in charge of collecting and displaying the current
   inferred health status of the service instances and subservices.
   Finally, the automation loop is closed by having the SAIN Collector
   provide feedback to the network orchestrator.

     +-----------------+
     |     Service     |
     |  Configuration  |<--------------------+
     |  Orchestrator   |                     |
     +-----------------+                     |
        |     |                              |
        |     | Network                      |
        |     | Service                      | Feedback
        |     | Instance                     | Loop
        |     | Configuration                |
        |     |                              |
        |     V                              |
        |  +-----------------+    +-------------------+
        |  |      SAIN       |    |       SAIN        |
        |  |  Orchestrator   |    |     Collector     |
        |  +-----------------+    +-------------------+
        |     |                              ^
        |     | Configuration                | Health Status
        |     | (assurance graph)            | (Score + Symptoms)
        |     V                              | Streamed
        |  +-------------------+             | via Telemetry
        |  |+-------------------+            |
        |  ||+-------------------+           |
        |  +||       SAIN        |-----------+
        |   +|      agent        |
        |    +-------------------+
        |      ^    ^    ^
        |      |    |    |
        |      |    |    |  Metric Collection
        V      V    V    V
   +-------------------------------------------------------------+
   |                     Monitored Entities                      |
   |                                                             |
   +-------------------------------------------------------------+

                      Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks:

   o  Analyze the configuration pushed to the network device(s) for
      configuring the service instance and decide which information is
      needed from the device(s) (such a piece of information being
      called a metric) and which operations to apply to the metrics
      for computing the health status.

   o  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible, or continuously poll otherwise.

   o  Continuously compute the health status of the service instances,
      based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

   In order to structure the assurance of a service instance, the
   service instance is decomposed into so-called subservice instances.
   Each subservice instance focuses on a specific feature or subpart
   of the network system.

   The decomposition into subservices is an important function of this
   architecture, for the following reasons:

   o  The result of this decomposition provides a relational picture
      of a service instance, which can be represented as a graph
      (called the assurance graph) to the operator.

   o  Subservices provide a scope for particular expertise and thereby
      enable contribution from external experts.  For instance, the
      subservice dealing with the optics health should be reviewed and
      extended by an expert in optical interfaces.

   o  Subservices that are common to several service instances are
      reused, reducing the amount of computation needed.

   The assurance graph of a service instance is a DAG representing the
   structure of the assurance case for the service instance.  The
   nodes of this graph are service instances or subservice instances.
   Each edge of this graph indicates a dependency between the two
   nodes at its extremities: the service or subservice at the source
   of the edge depends on the service or subservice at the destination
   of the edge.
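   The following non-normative Python sketch shows one possible
   encoding of such a graph, with the two dependency types defined in
   Section 1 (all names are assumptions of this sketch, not part of
   the architecture):

      from dataclasses import dataclass, field
      from enum import Enum
      from typing import List, Tuple

      class DependencyType(Enum):
          IMPACTING = "impacting"          # affects the parent's score
          INFORMATIONAL = "informational"  # only symptoms propagate

      @dataclass
      class Node:
          """A service or subservice instance in the assurance graph."""
          name: str
          # Pairs (target, type): this node depends on the targets.
          dependencies: List[Tuple["Node", "DependencyType"]] = \
              field(default_factory=list)

          def depends_on(self, target: "Node",
                         dep_type: DependencyType =
                             DependencyType.IMPACTING) -> None:
              self.dependencies.append((target, dep_type))

      # A fragment of the graph of Figure 2:
      tunnel = Node("Tunnel Service Instance")
      peer1_tunnel_if = Node("Peer1 Tunnel Interface")
      peer1_phys_if = Node("Peer1 Physical Interface")
      tunnel.depends_on(peer1_tunnel_if)
      peer1_tunnel_if.depends_on(peer1_phys_if)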
   Figure 2 depicts a simplistic example of the assurance graph for a
   tunnel service.  The node at the top is the service instance, the
   nodes below are its dependencies.  In the example, the tunnel
   service instance depends on the peer1 and peer2 tunnel interfaces,
   which in turn depend on the respective physical interfaces, which
   finally depend on the respective peer1 and peer2 devices.  The
   tunnel service instance also depends on the IP connectivity, which
   depends on the IS-IS routing protocol.

                         +------------------+
                         |      Tunnel      |
                         | Service Instance |
                         +------------------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
       +-------------+     +-------------+     +--------------+
       |    Peer1    |     |    Peer2    |     |      IP      |
       |   Tunnel    |     |   Tunnel    |     | Connectivity |
       |  Interface  |     |  Interface  |     |              |
       +-------------+     +-------------+     +--------------+
              |                   |                   |
       +-------------+     +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |     |    IS-IS    |
       |  Physical   |     |  Physical   |     |   Routing   |
       |  Interface  |     |  Interface  |     |  Protocol   |
       +-------------+     +-------------+     +-------------+
              |                   |
       +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |
       |   Device    |     |   Device    |
       +-------------+     +-------------+

                   Figure 2: Assurance Graph Example

   Depicting the assurance graph helps the operator to understand (and
   assert) the decomposition.  The assurance graph shall be maintained
   during normal operation, with addition, modification, and removal
   of service instances.  A change in the network configuration or
   topology shall be reflected in the assurance graph.  As a first
   example, a change of routing protocol from IS-IS to OSPF would
   change the assurance graph accordingly.  As a second example,
   assume that ECMP is in place for the source router for that
   specific tunnel; in that case, multiple interfaces must now be
   monitored, on top of monitoring the ECMP health itself.

3.2.  Intent and Assurance Graph

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   o  Try to capture the intent of the service instance, i.e., what
      the service instance is trying to achieve;

   o  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance graph.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each device.  Then:

   o  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the
      current state of SAIN; however, it does not completely capture
      the intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   o  Decomposing the service instance into subservices would result
      in the assurance graph depicted in Figure 2, for instance.

   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two
   components are most likely combined.  The internals of the
   orchestrator are currently out of the scope of this document.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   "subservice assurance", that is, the method for assuring that a
   subservice behaves correctly.

   Subservices, just like services, have high-level parameters that
   specify the type and specific instance to be assured.  For example,
   assuring a device requires the specific deviceId as a parameter;
   assuring an interface requires the specific combination of deviceId
   and interfaceId.

   A subservice is also characterized by a list of metrics to fetch
   and a list of computations to apply to these metrics in order to
   infer a health status.
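   As a non-normative sketch, and continuing in Python, a subservice
   could thus be captured by a type, its instance parameters, the
   metrics to fetch, and the computations to apply (the concrete
   parameter, metric, and expression strings below are illustrative
   assumptions, not definitions of this architecture):

      from dataclasses import dataclass
      from typing import Dict, List

      @dataclass
      class Subservice:
          """Assurance of one feature or subpart of the network."""
          type: str                   # e.g., "device" or "interface"
          parameters: Dict[str, str]  # identifies the instance
          metrics: List[str]          # device-independent metrics
          expressions: List[str]      # computations inferring health

      device_assurance = Subservice(
          type="device",
          parameters={"deviceId": "peer1"},
          metrics=["memory-utilization"],
          expressions=[
              "symptom 'Device almost out of memory' if "
              "memory-utilization > 95"])

      interface_assurance = Subservice(
          type="interface",
          parameters={"deviceId": "peer1", "interfaceId": "eth0"},
          metrics=["oper-status", "in-errors"],
          expressions=[
              "symptom 'Interface has high error rate' if "
              "in-errors grows abnormally fast"])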
3.4.  Building the Expression Graph from the Assurance Graph

   From the assurance graph is derived a so-called global computation
   graph.  First, each subservice instance is transformed into a set
   of subservice expressions that take metrics and constants as input
   (i.e., sources of the DAG) and produce the status of the
   subservice, based on some heuristics.  Then, for each service
   instance, the service expressions are constructed by combining the
   subservice expressions of its dependencies.  The way service
   expressions are combined depends on the dependency types (impacting
   or informational).  Finally, the global computation graph is built
   by combining the service expressions.  In other words, the global
   computation graph encodes all the operations needed to produce
   health statuses from the collected metrics.
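   The following non-normative Python sketch illustrates how
   subservice statuses could be combined along the dependency types
   defined in Section 1.  It reuses the illustrative HealthStatus and
   DependencyType classes from the previous sketches; aggregating
   impacting scores with a minimum is an assumption of this sketch,
   not a rule mandated by this architecture:

      from typing import List, Tuple

      def combine(own: "HealthStatus",
                  deps: List[Tuple["HealthStatus", "DependencyType"]]
                  ) -> "HealthStatus":
          """Combine a node's own status with its dependencies'."""
          score = own.score
          symptoms = list(own.symptoms)
          for dep_status, dep_type in deps:
              # Symptoms are propagated for both dependency types.
              symptoms.extend(dep_status.symptoms)
              # Only impacting dependencies affect the score; taking
              # the minimum is one possible aggregation heuristic.
              if dep_type is DependencyType.IMPACTING:
                  score = min(score, dep_status.score)
          return HealthStatus(score=score, symptoms=symptoms)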
   Subservices shall be device independent.  To justify this, let us
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected by an industry-accepted
   YANG module (IETF, OpenConfig), by a vendor-specific YANG module,
   or even by a MIB module.  If the subservice was dependent on the
   mechanism to collect the operational status, then we would need
   multiple subservice definitions in order to support all different
   mechanisms.  This also implies that, while waiting for all the
   metrics to be available via standard YANG modules, SAIN agents
   might have to retrieve metric values via non-standard YANG models,
   via MIB modules, via the Command Line Interface (CLI), etc.,
   effectively implementing a normalization layer between data models
   and information models.

   In order to keep subservices independent from the metric collection
   method, or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the framework introduces the
   concept of "metric engine".  The metric engine maps each device-
   independent metric used in the subservices to a list of device-
   specific metric implementations that precisely define how to fetch
   values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
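   A non-normative Python sketch of such a mapping follows.  The
   device families and the fetch functions are placeholders assumed
   for illustration; the YANG path and MIB object are only examples of
   possible metric implementations:

      from typing import Callable, Dict, Tuple

      def fetch_yang(device: str, path: str):
          """Placeholder: retrieve 'path' from 'device' via NETCONF."""
          raise NotImplementedError

      def fetch_mib(device: str, obj: str):
          """Placeholder: retrieve 'obj' from 'device' via SNMP."""
          raise NotImplementedError

      # (device-independent metric, device family) -> implementation.
      IMPLEMENTATIONS: Dict[Tuple[str, str], Callable] = {
          ("interface-oper-status", "modern-os"): lambda dev:
              fetch_yang(dev, "/ietf-interfaces:interfaces"
                              "/interface/oper-status"),
          ("interface-oper-status", "legacy-os"): lambda dev:
              fetch_mib(dev, "IF-MIB::ifOperStatus"),
      }

      def resolve(metric: str, family: str) -> Callable:
          """Map a metric to the implementation for a device family."""
          return IMPLEMENTATIONS[(metric, family)]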
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of the scope of this document.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to the YANG modules specified in YANG Modules for Service Assurance
   [I-D.claise-opsawg-service-assurance-yang]; they specify objects
   for assuring network services based on their decomposition into so-
   called subservices, according to the SAIN architecture.

   This module is intended for the following use cases:

   o  Assurance graph configuration:

      *  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      *  Dependencies: configure the dependencies between the
         subservices, along with their types.

   o  Assurance telemetry: export the health status of the
      subservices, along with the observed symptoms.

3.7.  Handling Maintenance Windows

   Whenever network components are under maintenance, the operator
   wants to inhibit the emission of symptoms from those components.  A
   typical use case is device maintenance, during which the device is
   not supposed to be operational.  As such, symptoms related to the
   device health should be ignored, as well as symptoms related to the
   device-specific subservices, such as the interfaces, as their state
   changes are probably a consequence of the maintenance.

   To configure network components as "under maintenance" in the SAIN
   architecture, the ietf-service-assurance model proposed in
   [I-D.claise-opsawg-service-assurance-yang] specifies an "under-
   maintenance" flag per service or subservice instance.  When, and
   only when, this flag is set, the companion field "maintenance-
   contact" must be set to a string that identifies the person or
   process who requested the maintenance.  Any symptom produced by a
   service or subservice under maintenance, or by a service or
   subservice depending on it, MUST NOT be reported.  A service or
   subservice under maintenance MAY propagate a symptom "Under
   Maintenance" towards services or subservices that depend on it.
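   The inhibition rule can be sketched as follows (non-normative
   Python; the node structure and field names mirror, as an
   assumption, the "under-maintenance" flag and "maintenance-contact"
   field described above):

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class MaintNode:
          """Illustrative graph node carrying maintenance state."""
          name: str
          under_maintenance: bool = False
          maintenance_contact: str = ""  # mandatory when flag is set
          dependencies: List["MaintNode"] = field(default_factory=list)

      def symptoms_inhibited(node: MaintNode) -> bool:
          """Symptoms of 'node' are not reported if the node itself,
          or any subservice it transitively depends on, is under
          maintenance."""
          if node.under_maintenance:
              return True
          return any(symptoms_inhibited(dep)
                     for dep in node.dependencies)

      # Example: with the graph of Figure 2, setting the flag on the
      # "Peer1 Device" node inhibits, among others, the symptoms of
      # "Tunnel Service Instance", which transitively depends on it.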
   We illustrate this mechanism with three independent examples based
   on the assurance graph depicted in Figure 2:

   o  Device maintenance, for instance upgrading the device OS.  The
      operator sets the "under-maintenance" flag for the subservice
      "Peer1 Device".  This inhibits the emission of symptoms from
      "Peer1 Physical Interface", "Peer1 Tunnel Interface", and
      "Tunnel Service Instance".  All other subservices are
      unaffected.

   o  Interface maintenance, for instance replacing a broken optic.
      The operator sets the "under-maintenance" flag for the
      subservice "Peer1 Physical Interface".  This inhibits the
      emission of symptoms from "Peer1 Tunnel Interface" and "Tunnel
      Service Instance".  All other subservices are unaffected.

   o  Routing protocol maintenance, for instance modifying parameters
      or redistribution.  The operator sets the "under-maintenance"
      flag for the subservice "IS-IS Routing Protocol".  This inhibits
      the emission of symptoms from "IP Connectivity" and "Tunnel
      Service Instance".  All other subservices are unaffected.

3.8.  Flexible Architecture

   The SAIN architecture is flexible in terms of components.  While
   the SAIN architecture in Figure 1 makes a distinction between two
   components, the service configuration orchestrator and the SAIN
   orchestrator, in practice those two components are most likely
   combined.  Similarly, the SAIN agents are displayed in Figure 1 as
   separate components.  Practically, the SAIN agents could be either
   independent components or directly integrated in the monitored
   entities.  A practical example is an agent in a router.

   The SAIN architecture is also flexible in terms of services and
   subservices.  Most examples in this document deal with the notion
   of Network Service YANG modules, with well-known services such as
   L2VPN or tunnels.  However, the concept of service is general
   enough to cross into different domains.  One of them is the domain
   of service management on network elements, which also requires its
   own assurance.  Examples include a DHCP server on a Linux server, a
   data plane, an IPFIX export, etc.  The notion of "service" is
   generic in this architecture.  Indeed, a configured service can
   itself be a service for someone else: a DHCP server, a data plane,
   or an IPFIX export can be considered a service for a device, just
   as a routing instance can be considered a service for an L3VPN, and
   a tunnel a service for an application in the cloud.  The assurance
   graph is created to be flexible and open, regardless of the
   subservice types, locations, or domains.

   The SAIN architecture is also flexible in terms of distributed
   graphs.  As shown in Figure 1, our architecture comprises several
   agents.  Each agent is responsible for handling a subgraph of the
   assurance graph.  The collector is responsible for fetching the
   subgraphs from the different agents and gluing them together.  As
   an example, in the graph of Figure 2, the subservices relative to
   Peer1 might be handled by a different agent than the subservices
   relative to Peer2, and the IP Connectivity and IS-IS subservices
   might be handled by yet another agent.  The agents will export
   their partial graphs and the collector will stitch them together as
   dependencies of the service instance.

   Finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all, examples in this document refer to
   physical components, but this is not a constraint.  Indeed, the
   assurance of virtual components would follow the same principles,
   and an assurance graph composed of virtualized components (or a mix
   of virtualized and physical ones) is entirely possible within this
   architecture.

3.9.  Timing

   The SAIN architecture requires time synchronization, via the
   Network Time Protocol (NTP) [RFC5905], between all elements:
   monitored entities, SAIN agents, the service configuration
   orchestrator, the SAIN Collector, and the SAIN orchestrator.  This
   guarantees that all symptoms in the system can be correlated with
   the right assurance graph version.

   The SAIN agent might have to remove some symptoms for specific
   subservice instances, because they are outdated and no longer
   relevant, or simply because the SAIN agent needs to free up some
   space.  Regardless of the reason, it is important for a SAIN
   Collector (re-)connecting to a SAIN agent to understand the effect
   of this garbage collection.  Therefore, the SAIN agent contains a
   YANG object specifying the date and time at which the symptoms
   history starts for the subservice instances.

3.10.  New Assurance Graph Generation

   The assurance graph will change over time, because services and
   subservices come and go (changing the dependencies between
   subservices), or simply because a subservice is now under
   maintenance.  Therefore, an assurance graph version must be
   maintained, along with the date and time of its last generation.
   The date and time of the last change of a particular subservice
   instance (again, its dependencies or under-maintenance flag) might
   be kept as well.  From a client point of view, an assurance graph
   change is signaled by the values of the assurance-graph-version and
   assurance-graph-last-change YANG leafs.  At that point in time, the
   client (collector) applies the following process:

   o  Keep the previous assurance-graph-last-change value (let us call
      it time T);

   o  Run through all subservice instances and process those for which
      the last-change is newer than time T;

   o  Keep the new assurance-graph-last-change as the new reference
      date and time.
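   A non-normative Python sketch of this client-side process follows.
   The agent and collector interfaces are illustrative assumptions;
   only the leaf names come from
   [I-D.claise-opsawg-service-assurance-yang]:

      def refresh_graph(collector, agent):
          """Process only the subservice instances changed since the
          last known assurance graph generation."""
          # Keep the previous assurance-graph-last-change value (T).
          t = collector.last_change
          # Process subservice instances whose last-change is newer.
          for subservice in agent.subservices():
              if subservice.last_change > t:
                  collector.process(subservice)
          # Keep the new value as the reference for the next refresh.
          collector.last_change = agent.assurance_graph_last_change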
4.  Security Considerations

   The SAIN architecture helps operators to reduce the mean time to
   detect and the mean time to repair.  As such, it should not cause
   any security threats.  However, the SAIN agents must be secure: a
   compromised SAIN agent could send wrong root causes or symptoms to
   the management systems.

   Except for the configuration of telemetry, the agents do not need
   "write access" to the devices they monitor.  This configuration is
   applied with a YANG module, whose protection is covered by Secure
   Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

   The data collected by SAIN could potentially be compromising to the
   network or provide more insight into how the network is designed.
   Considering the data that SAIN requires (including CLI access in
   some cases), one should weigh data access concerns against the
   impact that reduced visibility will have on being able to rapidly
   identify root causes.

   If a closed-loop system relies on this architecture, then the well-
   known issues of such systems also apply, i.e., a lying device or
   compromised agent could trigger partial reconfiguration of the
   service or network.  The SAIN architecture neither augments nor
   reduces this risk.

5.  IANA Considerations

   This document includes no request to IANA.

6.  Open Issues

   Refer to the NMRG document.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and
              Algorithms Specification", RFC 5905,
              DOI 10.17487/RFC5905, June 2010,
              <https://www.rfc-editor.org/info/rfc5905>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [I-D.claise-opsawg-service-assurance-yang]
              Claise, B. and J. Quilbeuf, "YANG Modules for Service
              Assurance", draft-claise-opsawg-service-assurance-yang
              (work in progress), February 2020.

   [I-D.ietf-opsawg-tacacs]
              Dahm, T., Ota, A., Medway Gash, D., Carrel, D., and L.
              Grant, "The TACACS+ Protocol", draft-ietf-opsawg-
              tacacs-18 (work in progress), March 2020.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, DOI 10.17487/RFC2865, June 2000,
              <https://www.rfc-editor.org/info/rfc2865>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration
              Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241,
              June 2011, <https://www.rfc-editor.org/info/rfc6241>.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
              <https://www.rfc-editor.org/info/rfc6242>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

   v02 - v03

   o  Timing concepts

   o  New assurance graph generation

   v01 - v02

   o  Handling maintenance windows

   o  Flexible architecture better explained

   o  Improved the terminology

   o  Notion of mapping information model to data model, while waiting
      for YANG to be everywhere

   o  Started a Security Considerations section

   v00 - v01

   o  Terminology clarifications

   o  Figure 1 improved

Acknowledgements

   The authors would like to thank Stephane Litkowski, Charles Eckel,
   Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan
   Vallin, and Eric Vyncke for their reviews and feedback.

Authors' Addresses

   Benoit Claise
   Cisco Systems, Inc.
   De Kleetlaan 6a b1
   1831 Diegem
   Belgium

   Email: bclaise@cisco.com

   Jean Quilbeuf
   Cisco Systems, Inc.
   1, rue Camille Desmoulins
   92782 Issy Les Moulineaux
   France

   Email: jquilbeu@cisco.com

   Youssef El Fathi
   Orange Business Services
   61 rue des archives
   75003 Paris
   France

   Email: io@elfathi.net

   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid 28006
   Spain

   Email: diego.r.lopez@telefonica.com

   Dan Voyer
   Bell Canada
   Canada

   Email: daniel.voyer@bell.ca