OPSAWG                                                         B. Claise
Internet-Draft                                                    Huawei
Intended status: Informational                               J. Quilbeuf
Expires: October 25, 2021                                    Independent
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                      Cisco Systems, Inc.
                                                          April 23, 2021

       Service Assurance for Intent-based Networking Architecture
         draft-claise-opsawg-service-assurance-architecture-05

Abstract

This document describes an architecture for Service Assurance for
Intent-based Networking (SAIN).  This architecture aims at assuring
that service instances are running correctly.  As services rely on
multiple sub-services provided by the underlying network devices,
getting the assurance of a healthy service is only possible with a
holistic view of the network devices.  This architecture not only
helps to correlate a service degradation with its network root cause
but also to identify the services impacted when a network component
fails or degrades.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 25, 2021.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Code Components extracted from this document must include Simplified
BSD License text as described in Section 4.e of the Trust Legal
Provisions and are provided without warranty as described in the
Simplified BSD License.

Table of Contents

   1.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .  2
   2.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .  5
   3.  Architecture  . . . . . . . . . . . . . . . . . . . . . . . .  6
     3.1.  Decomposing a Service Instance Configuration into an
           Assurance Graph . . . . . . . . . . . . . . . . . . . . .  9
     3.2.  Intent and Assurance Graph  . . . . . . . . . . . . . . . 10
     3.3.  Subservices . . . . . . . . . . . . . . . . . . . . . . . 11
     3.4.  Building the Expression Graph from the Assurance Graph  . 11
     3.5.  Building the Expression from a Subservice . . . . . . . . 12
     3.6.  Open Interfaces with YANG Modules . . . . . . . . . . . . 12
     3.7.  Handling Maintenance Windows  . . . . . . . . . . . . . . 13
     3.8.  Flexible Architecture . . . . . . . . . . . . . . . . . . 14
     3.9.  Timing  . . . . . . . . . . . . . . . . . . . . . . . . . 15
     3.10. New Assurance Graph Generation  . . . . . . . . . . . . . 15
   4.  Security Considerations . . . . . . . . . . . . . . . . . . . 16
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16
   6.  Contributors  . . . . . . . . . . . . . . . . . . . . . . . . 16
   7.  Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . 16
   8.  References  . . . . . . . . . . . . . . . . . . . . . . . . . 16
     8.1.  Normative References  . . . . . . . . . . . . . . . . . . 16
     8.2.  Informative References  . . . . . . . . . . . . . . . . . 17
   Appendix A.  Changes between revisions  . . . . . . . . . . . . . 18
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . . 18
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . . 19

1.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

SAIN Agent: Component that communicates with a device, a set of
devices, or another agent to build an expression graph from a
received assurance graph and perform the corresponding computation.

Assurance Graph: DAG representing the assurance case for one or
several service instances.  The nodes (also known as vertices in the
context of a DAG) are the service instances themselves and the
subservices; the edges indicate dependency relations.

SAIN Collector: Component that fetches or receives the computer-
consumable output of the agent(s) and displays it in a user-friendly
form or processes it locally.

DAG: Directed Acyclic Graph.

ECMP: Equal Cost Multiple Paths.

Expression Graph: Generic term for a DAG representing a computation
in SAIN.  More specific terms are:

o  Subservice Expressions: expression graph representing all the
   computations to execute for a subservice.

o  Service Expressions: expression graph representing all the
   computations to execute for a service instance, i.e., including
   the computations for all dependent subservices.

o  Global Computation Graph: expression graph representing all the
   computations to execute for all service instances (i.e., all
   computations performed).
Dependency: The directed relationship between subservice instances in
the assurance graph.

Informational Dependency: Type of dependency whose score does not
impact the score of its parent subservice or service instance(s) in
the assurance graph.  However, the symptoms should be taken into
account in the parent service instance or subservice instance(s), for
informational reasons.

Impacting Dependency: Type of dependency whose score impacts the
score of its parent subservice or service instance(s) in the
assurance graph.  The symptoms are taken into account in the parent
service instance or subservice instance(s), as the reasons for the
impact.

Metric: Information retrieved from a network device.

Metric Engine: Maps metrics to a list of candidate metric
implementations depending on the target model.

Metric Implementation: Actual way of retrieving a metric from a
device.

Network Service YANG Module: Describes the characteristics of a
service, as agreed upon with consumers of that service [RFC8199].

Service Instance: A specific instance of a service.

Service Configuration Orchestrator: Quoting [RFC8199], "Network
Service YANG Modules describe the characteristics of a service, as
agreed upon with consumers of that service.  That is, a service
module does not expose the detailed configuration parameters of all
participating network elements and features but describes an abstract
model that allows instances of the service to be decomposed into
instance data according to the Network Element YANG Modules of the
participating network elements.  The service-to-element decomposition
is a separate process; the details depend on how the network operator
chooses to realize the service.  For the purpose of this document,
the term "orchestrator" is used to describe a system implementing
such a process."

SAIN Orchestrator: Component of SAIN in charge of fetching the
configuration specific to each service instance and converting it
into an assurance graph.

Health Status: Score and symptoms indicating whether a service
instance or a subservice is healthy.  A non-maximal score MUST always
be explained by one or more symptoms.

Health Score: Integer ranging from 0 to 100 indicating the health of
a subservice.  A score of 0 means that the subservice is broken; a
score of 100 means that the subservice is perfectly operational.

Subservice: Part of an assurance graph that assures a specific
feature or subpart of the network system.

Symptom: Reason explaining why a service instance or a subservice is
not completely healthy.

2.  Introduction

Network Service YANG Modules [RFC8199] describe the configuration,
state data, operations, and notifications of abstract representations
of services implemented on one or multiple network elements.

Quoting [RFC8199]: "Network Service YANG Modules describe the
characteristics of a service, as agreed upon with consumers of that
service.  That is, a service module does not expose the detailed
configuration parameters of all participating network elements and
features but describes an abstract model that allows instances of the
service to be decomposed into instance data according to the Network
Element YANG Modules of the participating network elements.
The service-to-element decomposition is a separate process; the
details depend on how the network operator chooses to realize the
service.  For the purpose of this document, the term "orchestrator"
is used to describe a system implementing such a process."

In other words, service configuration orchestrators deploy Network
Service YANG Modules through the configuration of Network Element
YANG Modules.  Network configuration is based on those YANG data
models, with protocols/encodings such as NETCONF/XML [RFC6241],
RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Knowing that a
configuration is applied does not imply that the service is running
correctly (for example, the service might be degraded because of a
failure in the network); therefore, the network operator must monitor
the service operational data at the same time as the configuration.
The industry has been standardizing on telemetry to push network
element performance information.

A network administrator needs to monitor her network and services as
a whole, independently of the use cases or the management protocols.
With different protocols come different data models, and different
ways to model the same type of information.  When network
administrators deal with multiple protocols, the network management
system must perform the difficult and time-consuming job of mapping
data models: the model used for configuration with the model used for
monitoring.  This problem is compounded by a large, disparate set of
data sources (MIB modules, YANG models [RFC7950], IPFIX information
elements [RFC7011], syslog plain text [RFC3164], TACACS+ [RFC8907],
RADIUS [RFC2865], etc.).  In order to avoid this data model mapping,
the industry converged on model-driven telemetry to stream the
service operational data, reusing the YANG models used for
configuration.  Model-driven telemetry greatly facilitates the notion
of closed-loop automation, whereby events from the network drive
remediation changes back into the network.

However, it proves difficult for network operators to correlate the
service degradation with the network root cause.  For example, why
does my L3VPN fail to connect?  Why is this specific service slow?
The reverse, i.e., which services are impacted when a network
component fails or degrades, is even more interesting for the
operators.  For example, which service(s) is(are) impacted when the
dBm level of this specific optic begins to degrade?  Which
application is impacted by this ECMP imbalance?  Is that issue
actually impacting any other customers?

Intent-based approaches are often declarative, starting from a
statement such as "The service works correctly" and trying to enforce
it.  Such approaches are mainly suited for greenfield deployments.

Instead of approaching intent in a declarative way, this framework
focuses on already defined services and tries to infer the meaning of
"The service works correctly".  To do so, the framework works from an
assurance graph, deduced from the service definition and from the
network configuration.  This assurance graph is decomposed into
components, which are then assured independently.  The root of the
assurance graph represents the service to assure, and its children
represent components identified as its direct dependencies; each
component can have dependencies as well.
The SAIN architecture maintains the correct assurance graph when
services are modified or when the network conditions change.

When a service is degraded, the framework will highlight where in the
assurance graph to look, as opposed to going hop by hop to
troubleshoot the issue.  Not only can this framework help to
correlate service degradation with network root cause/symptoms, but
it can deduce from the assurance graph the number and type of
services impacted by a component degradation/failure.  This added
value informs the operational team where to focus its attention for
maximum return.

This architecture provides the building blocks to assure both
physical and virtual entities and is flexible in terms of services
and subservices, of (distributed) graphs, and of components
(Section 3.8).

3.  Architecture

SAIN aims at assuring that service instances are running correctly
and, if not, at pinpointing what is wrong.  More precisely, SAIN
computes a score for each service instance and outputs symptoms
explaining that score, especially why the score is not maximal.  The
score augmented with the symptoms is called the health status.

The SAIN architecture is generic and applicable to multiple
environments: obviously wireline, but also wireless (including 5G),
virtual infrastructure managers (VIM), and even virtual functions.
Thanks to the distributed graph design principle, graphs from
different environments/orchestrators can be combined.

As an example of a service, let us consider a point-to-point L2VPN
connection (i.e., a pseudowire).  Such a service would take as
parameters the two ends of the connection (device, interface or
subinterface, and address of the other end) and configure both
devices (and maybe more) so that an L2VPN connection is established
between the two devices.  Examples of symptoms might be "Interface
has high error rate", "Interface flapping", or "Device almost out of
memory".

To compute the health status of such a service, the service is
decomposed into an assurance graph formed by subservices linked
through dependencies.  Each subservice is then turned into an
expression graph that details how to fetch metrics from the devices
and compute the health status of the subservice.  The subservice
expressions are combined according to the dependencies between the
subservices in order to obtain the expression graph that computes the
health status of the service.

The overall architecture of our solution is presented in Figure 1.
Based on the service configuration, the SAIN orchestrator deduces the
assurance graph.  It then sends the assurance graph, along with some
other configuration options, to the SAIN agents.  The SAIN agents are
responsible for building the expression graph and computing the
health statuses in a distributed manner.  The SAIN collector is in
charge of collecting and displaying the current inferred health
status of the service instances and subservices.  Finally, the
automation loop is closed by having the SAIN collector provide
feedback to the network orchestrator.
     +-----------------+
     |     Service     |
     |  Configuration  |<--------------------+
     |  Orchestrator   |                     |
     +-----------------+                     |
        |           |                        |
        |           | Network                |
        |           | Service                | Feedback
        |           | Instance               | Loop
        |           | Configuration          |
        |           |                        |
        |           V                        |
        |  +-----------------+     +-------------------+
        |  |      SAIN       |     |       SAIN        |
        |  |  Orchestrator   |     |     Collector     |
        |  +-----------------+     +-------------------+
        |           |                        ^
        |           | Configuration          | Health Status
        |           | (assurance graph)      | (Score + Symptoms)
        |           V                        | Streamed
        |   +-------------------+            | via Telemetry
        |   |+-------------------+           |
        |   ||+-------------------+          |
        |   +||       SAIN        |----------+
        |    +|       agent       |
        |     +-------------------+
        |         ^    ^    ^
        |         |    |    |
        |         |    |    |  Metric Collection
        V         V    V    V
   +-------------------------------------------------------------+
   |                     Monitored Entities                      |
   |                                                             |
   +-------------------------------------------------------------+

                     Figure 1: SAIN Architecture

In order to produce the score assigned to a service instance, the
architecture performs the following tasks:

o  Analyze the configuration pushed to the network device(s) for
   configuring the service instance and decide which information is
   needed from the device(s) (such a piece of information is called a
   metric) and which operations to apply to the metrics in order to
   compute the health status.

o  Stream (via telemetry [RFC8641]) operational and config metric
   values when possible, else continuously poll.

o  Continuously compute the health status of the service instances,
   based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

In order to structure the assurance of a service instance, the
service instance is decomposed into so-called subservice instances.
Each subservice instance focuses on a specific feature or subpart of
the network system.

The decomposition into subservices is an important function of this
architecture, for the following reasons:

o  The result of this decomposition provides a relational picture of
   a service instance that can be represented as a graph (called the
   assurance graph) to the operator.

o  Subservices provide a scope for particular expertise and thereby
   enable contribution from external experts.  For instance, the
   subservice dealing with the optics health should be reviewed and
   extended by an expert in optical interfaces.

o  Subservices that are common to several service instances are
   reused, thereby reducing the amount of computation needed.

The assurance graph of a service instance is a DAG representing the
structure of the assurance case for the service instance.  The nodes
of this graph are service instances or subservice instances.  Each
edge of this graph indicates a dependency between the two nodes at
its extremities: the service or subservice at the source of the edge
depends on the service or subservice at the destination of the edge.
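As an illustration only, the following Python sketch shows one
possible in-memory representation of such an assurance graph.  The
class, field, and identifier names (Node, AssuranceGraph,
DependencyType, etc.) are hypothetical and are not defined by this
architecture.

   # Hypothetical representation of an assurance graph: nodes are
   # service or subservice instances, edges are typed dependencies
   # from a parent to the subservice it depends on.
   from dataclasses import dataclass, field
   from enum import Enum
   from typing import Dict, List, Tuple

   class DependencyType(Enum):
       IMPACTING = "impacting"          # score propagates to the parent
       INFORMATIONAL = "informational"  # only symptoms are reported upward

   @dataclass
   class Node:
       node_id: str           # e.g. "tunnel-service/instance-1"
       node_type: str         # e.g. "service", "interface", "device"
       parameters: Dict[str, str] = field(default_factory=dict)

   @dataclass
   class AssuranceGraph:
       nodes: Dict[str, Node] = field(default_factory=dict)
       # (parent, child, type): the parent depends on the child
       dependencies: List[Tuple[str, str, DependencyType]] = field(
           default_factory=list)

       def add_dependency(self, parent: str, child: str,
                          dep_type: DependencyType) -> None:
           self.dependencies.append((parent, child, dep_type))

       def children(self, parent: str) -> List[Tuple[str, DependencyType]]:
           return [(c, t) for (p, c, t) in self.dependencies if p == parent]

For instance, the tunnel service of Figure 2 below would be encoded
as a "tunnel-service" node with impacting dependencies towards the
two tunnel-interface nodes and the IP-connectivity node.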
Figure 2 depicts a simplistic example of the assurance graph for a
tunnel service.  The node at the top is the service instance, the
nodes below are its dependencies.  In the example, the tunnel service
instance depends on the peer1 and peer2 tunnel interfaces, which in
turn depend on the respective physical interfaces, which finally
depend on the respective peer1 and peer2 devices.  The tunnel service
instance also depends on the IP connectivity, which depends on the
IS-IS routing protocol.

                        +------------------+
                        |      Tunnel      |
                        | Service Instance |
                        +------------------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
       +-------------+     +-------------+     +--------------+
       |    Peer1    |     |    Peer2    |     |      IP      |
       |    Tunnel   |     |    Tunnel   |     | Connectivity |
       |  Interface  |     |  Interface  |     |              |
       +-------------+     +-------------+     +--------------+
              |                   |                   |
       +-------------+     +-------------+     +--------------+
       |    Peer1    |     |    Peer2    |     |    IS-IS     |
       |   Physical  |     |   Physical  |     |   Routing    |
       |  Interface  |     |  Interface  |     |   Protocol   |
       +-------------+     +-------------+     +--------------+
              |                   |
       +-------------+     +-------------+
       |             |     |             |
       |    Peer1    |     |    Peer2    |
       |    Device   |     |    Device   |
       +-------------+     +-------------+

                  Figure 2: Assurance Graph Example

Depicting the assurance graph helps the operator to understand (and
assert) the decomposition.  The assurance graph shall be maintained
during normal operation with addition, modification, and removal of
service instances.  A change in the network configuration or topology
shall be reflected in the assurance graph.  As a first example, a
change of routing protocol from IS-IS to OSPF would change the
assurance graph accordingly.  As a second example, assume that ECMP
is in place for the source router of that specific tunnel; in that
case, multiple interfaces must now be monitored, on top of monitoring
the ECMP health itself.

3.2.  Intent and Assurance Graph

The SAIN orchestrator analyzes the configuration of a service
instance to:

o  Try to capture the intent of the service instance, i.e., what the
   service instance is trying to achieve,

o  Decompose the service instance into subservices representing the
   network features on which the service instance relies.

The SAIN orchestrator must be able to analyze configuration from
various devices and produce the assurance graph.

To schematize what a SAIN orchestrator does, assume that the
configuration for a service instance touches two devices and
configures a virtual tunnel interface on each device.  Then:

o  Capturing the intent would start by detecting that the service
   instance is actually a tunnel between the two devices, and stating
   that this tunnel must be functional.  This is the current state of
   SAIN; however, it does not completely capture the intent, which
   might additionally include, for instance, the latency and
   bandwidth requirements of this tunnel.

o  Decomposing the service instance into subservices would result in
   the assurance graph depicted in Figure 2, for instance.

In order for SAIN to be applied, the configuration necessary for each
service instance should be identifiable and thus should come from a
"service-aware" source.  While Figure 1 makes a distinction between
the SAIN orchestrator and a different component providing the service
instance configuration, in practice those two components are most
likely combined.  The internals of the orchestrator are currently out
of scope of this document.
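To make the decomposition step more concrete, the following sketch
(illustrative only; the function, parameter, and identifier names are
hypothetical and not part of this architecture) shows how a tunnel
service configuration could be turned into the subservice instances
and dependencies of Figure 2.

   # Hypothetical decomposition of a tunnel service configuration
   # into subservice instances and (parent, child) dependency edges,
   # producing the structure of Figure 2.
   from typing import Dict, List, Tuple

   def decompose_tunnel(config: Dict[str, str]) -> Tuple[List[str],
                                                          List[Tuple[str, str]]]:
       svc = "tunnel-service/" + config["name"]
       nodes = [svc, "ip-connectivity", "isis-routing"]
       edges = [(svc, "ip-connectivity"),
                ("ip-connectivity", "isis-routing")]

       for peer in ("peer1", "peer2"):
           device = config[peer + "_device"]
           interface = config[peer + "_interface"]
           dev = "device/" + device
           phy = "interface/" + device + "/" + interface
           tun = "tunnel-interface/" + device + "/" + config["name"]
           nodes += [dev, phy, tun]
           # The service depends on the tunnel interface, which depends
           # on the physical interface, which depends on the device.
           edges += [(svc, tun), (tun, phy), (phy, dev)]
       return nodes, edges

   example = decompose_tunnel({"name": "t1",
                               "peer1_device": "PE1",
                               "peer1_interface": "GigabitEthernet0/0/0",
                               "peer2_device": "PE2",
                               "peer2_interface": "GigabitEthernet0/0/1"})

A real orchestrator would of course derive these parameters from the
pushed configuration itself rather than from a pre-digested
dictionary.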
3.3.  Subservices

A subservice corresponds to a subpart or a feature of the network
system that is needed for a service instance to function properly.
In the context of SAIN, subservice is actually a shortcut for
subservice assurance, that is, the method for assuring that a
subservice behaves correctly.

Subservices, just as services, have high-level parameters that
specify the type and the specific instance to be assured.  For
example, assuring a device requires the specific deviceId as a
parameter; assuring an interface requires the specific combination of
deviceId and interfaceId.

A subservice is also characterized by a list of metrics to fetch and
a list of computations to apply to these metrics in order to infer a
health status.

3.4.  Building the Expression Graph from the Assurance Graph

From the assurance graph is derived a so-called global computation
graph.  First, each subservice instance is transformed into a set of
subservice expressions that take metrics and constants as input
(i.e., the sources of the DAG) and produce the status of the
subservice, based on some heuristics.  Then, for each service
instance, the service expressions are constructed by combining the
subservice expressions of its dependencies.  The way service
expressions are combined depends on the dependency types (impacting
or informational).  Finally, the global computation graph is built by
combining the service expressions.  In other words, the global
computation graph encodes all the operations needed to produce health
statuses from the collected metrics.

Subservices shall be device independent.  To justify this, let's
consider the interface operational status.  Depending on the device
capabilities, this status can be collected by an industry-accepted
YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
even by a MIB module.  If the subservice was dependent on the
mechanism to collect the operational status, then we would need
multiple subservice definitions in order to support all different
mechanisms.  This also implies that, while waiting for all the
metrics to be available via standard YANG modules, SAIN agents might
have to retrieve metric values via non-standard YANG modules, MIB
modules, the Command Line Interface (CLI), etc., effectively
implementing a normalization layer between data models and
information models.

In order to keep subservices independent from the metric collection
method, or, expressed differently, to support multiple combinations
of platforms, OSes, and even vendors, the framework introduces the
concept of "metric engine".  The metric engine maps each device-
independent metric used in the subservices to a list of device-
specific metric implementations that precisely define how to fetch
values for that metric.  The mapping is parameterized by the
characteristics (model, OS version, etc.) of the device from which
the metrics are fetched.

3.5.  Building the Expression from a Subservice

In addition to the list of metrics, each subservice defines a list of
expressions to apply to the metrics in order to compute the health
status of the subservice.  The definition or the standardization of
those expressions (also known as heuristics) is currently out of
scope of this document.
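The following sketch illustrates, for one subservice, what such an
expression could look like and how the scores of impacting and
informational dependencies could be combined.  The thresholds, the
combination rule, and all names are purely illustrative assumptions;
they are not defined by this document.

   # Illustrative subservice expression: turn collected metrics into
   # a health score (0..100) plus symptoms, then combine the health
   # statuses of the dependencies into the parent's health status.
   from typing import Dict, List, Tuple

   HealthStatus = Tuple[int, List[str]]   # (score, symptoms)

   def interface_health(metrics: Dict[str, float]) -> HealthStatus:
       score, symptoms = 100, []
       if metrics.get("oper-status", 0.0) != 1.0:     # 1.0 means "up"
           return 0, ["Interface is down"]
       if metrics.get("in-error-rate", 0.0) > 0.01:   # arbitrary threshold
           score -= 50
           symptoms.append("Interface has high error rate")
       if metrics.get("flap-count", 0.0) > 5:
           score -= 30
           symptoms.append("Interface flapping")
       return max(score, 0), symptoms

   def combine(own: HealthStatus,
               impacting: List[HealthStatus],
               informational: List[HealthStatus]) -> HealthStatus:
       # One possible rule: the parent score is capped by the worst
       # impacting dependency; informational dependencies only
       # contribute their symptoms.
       score, symptoms = own
       for dep_score, dep_symptoms in impacting:
           score = min(score, dep_score)
           symptoms = symptoms + dep_symptoms
       for _, dep_symptoms in informational:
           symptoms = symptoms + dep_symptoms
       return score, symptoms

Whether an implementation caps, averages, or weights the scores of
impacting dependencies is exactly the kind of heuristic that
Section 3.5 leaves out of scope.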
3.6.  Open Interfaces with YANG Modules

The interfaces between the architecture components are open thanks to
the YANG modules specified in YANG Modules for Service Assurance
[I-D.claise-opsawg-service-assurance-yang]; they specify objects for
assuring network services based on their decomposition into so-called
subservices, according to the SAIN architecture.

These modules are intended for the following use cases:

o  Assurance graph configuration:

   *  Subservices: configure a set of subservices to assure, by
      specifying their types and parameters.

   *  Dependencies: configure the dependencies between the
      subservices, along with their types.

o  Assurance telemetry: export the health status of the subservices,
   along with the observed symptoms.

3.7.  Handling Maintenance Windows

Whenever network components are under maintenance, the operator wants
to inhibit the emission of symptoms from those components.  A typical
use case is device maintenance, during which the device is not
supposed to be operational.  As such, symptoms related to the device
health should be ignored, as well as symptoms related to the device-
specific subservices, such as the interfaces, as their state changes
are probably the consequence of the maintenance.

To configure network components as "under maintenance" in the SAIN
architecture, the ietf-service-assurance model proposed in
[I-D.claise-opsawg-service-assurance-yang] specifies an "under-
maintenance" flag per service or subservice instance.  When, and only
when, this flag is set, the companion field "maintenance-contact"
must be set to a string that identifies the person or process who
requested the maintenance.  Any symptom produced by a service or
subservice under maintenance, or by one of its dependencies, MUST NOT
be reported.  A service or subservice under maintenance MAY propagate
a symptom "Under Maintenance" towards services or subservices that
depend on it.

We illustrate this mechanism with three independent examples based on
the assurance graph depicted in Figure 2 (a sketch of the inhibition
rule follows the examples):

o  Device maintenance, for instance upgrading the device OS.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Device".  This inhibits the emission of symptoms from
   "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
   Service Instance".  All other subservices are unaffected.

o  Interface maintenance, for instance replacing a broken optic.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Physical Interface".  This inhibits the emission of
   symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
   Instance".  All other subservices are unaffected.

o  Routing protocol maintenance, for instance modifying parameters or
   redistribution.  The operator sets the "under-maintenance" flag
   for the subservice "IS-IS Routing Protocol".  This inhibits the
   emission of symptoms from "IP Connectivity" and "Tunnel Service
   Instance".  All other subservices are unaffected.
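The following sketch shows one possible implementation of this
inhibition rule; the identifiers and the propagation strategy are
illustrative assumptions, not requirements of this document.

   # Hypothetical implementation of the "under-maintenance" inhibition
   # rule: a node under maintenance, and every node that depends on it
   # directly or transitively, stops reporting its own symptoms.
   from typing import List, Set, Tuple

   def inhibited_nodes(edges: List[Tuple[str, str]],
                       under_maintenance: Set[str]) -> Set[str]:
       """edges are (parent, child) pairs: the parent depends on the child."""
       inhibited = set(under_maintenance)
       changed = True
       while changed:
           changed = False
           for parent, child in edges:
               if child in inhibited and parent not in inhibited:
                   inhibited.add(parent)
                   changed = True
       return inhibited

   # First example above: putting "peer1-device" under maintenance also
   # inhibits the physical interface, the tunnel interface, and the
   # tunnel service instance, but none of the peer2 or IS-IS subservices.
   edges = [("tunnel-service", "peer1-tunnel-if"),
            ("tunnel-service", "peer2-tunnel-if"),
            ("tunnel-service", "ip-connectivity"),
            ("peer1-tunnel-if", "peer1-phys-if"),
            ("peer2-tunnel-if", "peer2-phys-if"),
            ("peer1-phys-if", "peer1-device"),
            ("peer2-phys-if", "peer2-device"),
            ("ip-connectivity", "isis-routing")]
   assert inhibited_nodes(edges, {"peer1-device"}) == {
       "peer1-device", "peer1-phys-if", "peer1-tunnel-if", "tunnel-service"}

Whether the inhibited parents additionally report an "Under
Maintenance" symptom, as allowed by the MAY above, is left to the
implementation.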
3.8.  Flexible Architecture

The SAIN architecture is flexible in terms of components.  While the
SAIN architecture in Figure 1 makes a distinction between two
components, the service configuration orchestrator and the SAIN
orchestrator, in practice those two components are most likely
combined.  Similarly, the SAIN agents are displayed in Figure 1 as
being separate components.  Practically, the SAIN agents could be
either independent components or directly integrated in monitored
entities.  A practical example is an agent in a router.

The SAIN architecture is also flexible in terms of services and
subservices.  Most examples in this document deal with the notion of
Network Service YANG modules, with well-known services such as L2VPN
or tunnels.  However, the concept of service is general enough to
cross into different domains.  One of them is the domain of service
management on network elements, which also requires its own
assurance.  Examples include a DHCP server on a Linux server, a data
plane, an IPFIX export, etc.  The notion of "service" is generic in
this architecture.  Indeed, a configured service can itself be a
service for someone else.  Exactly as a DHCP server, a data plane, or
an IPFIX export can be considered as services for a device, a routing
instance can be considered as a service for an L3VPN, and a tunnel
can be considered as a service for an application in the cloud.  The
assurance graph is created to be flexible and open, regardless of the
subservice types, locations, or domains.

The SAIN architecture is also flexible in terms of distributed
graphs.  As shown in Figure 1, our architecture comprises several
agents.  Each agent is responsible for handling a subgraph of the
assurance graph.  The collector is responsible for fetching the
subgraphs from the different agents and gluing them together.  As an
example, in the graph from Figure 2, the subservices relative to
Peer1 might be handled by a different agent than the subservices
relative to Peer2, and the IP Connectivity and IS-IS subservices
might be handled by yet another agent.  The agents will export their
partial graphs and the collector will stitch them together as
dependencies of the service instance.

And finally, the SAIN architecture is flexible in terms of what it
monitors.  Most, if not all, examples in this document refer to
physical components, but this is not a constraint.  Indeed, the
assurance of virtual components would follow the same principles, and
an assurance graph composed of virtualized components (or a mix of
virtualized and physical ones) is entirely possible within this
architecture.

3.9.  Timing

The SAIN architecture requires the Network Time Protocol (NTP)
[RFC5905] between all elements: monitored entities, SAIN agents, the
Service Configuration Orchestrator, the SAIN Collector, as well as
the SAIN Orchestrator.  This guarantees that all symptoms in the
system can be correlated with each other and with the right assurance
graph version.

The SAIN agent might have to remove some symptoms for specific
subservices, because they are outdated and no longer relevant, or
simply because the SAIN agent needs to free up some space.
Regardless of the reason, it is important for a SAIN collector
(re-)connecting to a SAIN agent to understand the effect of this
garbage collection.  Therefore, the SAIN agent contains a YANG object
specifying the date and time at which the symptom history starts for
the subservice instances.

3.10.  New Assurance Graph Generation

The assurance graph will change over time, because services and
subservices come and go (changing the dependencies between
subservices), or simply because a subservice is now under
maintenance.  Therefore, an assurance graph version must be
maintained, along with the date and time of its last generation.  The
date and time of the last change of a particular subservice instance
(again, because of changed dependencies or maintenance) might also be
kept.  From a client point of view, an assurance graph change is
detected through the values of the assurance-graph-version and
assurance-graph-last-change YANG leafs.  At that point in time, the
client (collector) follows the process below (sketched in the example
that follows):

o  Keep the previous assurance-graph-last-change value (let's call it
   time T).

o  Run through all subservice instances and process those for which
   the last-change is newer than the time T.

o  Keep the new assurance-graph-last-change as the new referenced
   date and time.
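The sketch below illustrates this client-side process.  The attribute
and function names are hypothetical Python constructs, not the actual
YANG node names defined in
[I-D.claise-opsawg-service-assurance-yang].

   # Illustrative client-side handling of an assurance graph change,
   # following the three steps above.
   from dataclasses import dataclass
   from datetime import datetime
   from typing import Dict

   @dataclass
   class SubserviceState:
       last_change: datetime
       # ... type, parameters, dependencies, health status, etc.

   def process_subservice(name: str, state: SubserviceState) -> None:
       print("updating local view of", name, "changed at", state.last_change)

   def refresh_graph(previous_last_change: datetime,
                     graph_last_change: datetime,
                     subservices: Dict[str, SubserviceState]) -> datetime:
       """Return the new reference time (time T for the next run)."""
       if graph_last_change <= previous_last_change:
           return previous_last_change       # no new graph generation
       for name, state in subservices.items():
           if state.last_change > previous_last_change:
               process_subservice(name, state)
       return graph_last_change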
4.  Security Considerations

The SAIN architecture helps operators to reduce the mean time to
detect and the mean time to repair.  As such, it should not cause any
security threats.  However, the SAIN agents must be secure: a
compromised SAIN agent could send wrong root causes or symptoms to
the management systems.

Except for the configuration of telemetry, the agents do not need
"write access" to the devices they monitor.  This configuration is
applied with a YANG module, whose protection is covered by Secure
Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

The data collected by SAIN could potentially be compromising to the
network or provide more insight into how the network is designed.
Considering the data that SAIN requires (including CLI access in some
cases), one should weigh data access concerns with the impact that
reduced visibility will have on being able to rapidly identify root
causes.

If a closed-loop system relies on this architecture, then the well-
known issues of such systems also apply, i.e., a lying device or a
compromised agent could trigger partial reconfiguration of the
service or network.  The SAIN architecture neither augments nor
reduces this risk.

5.  IANA Considerations

This document includes no request to IANA.

6.  Contributors

o  Youssef El Fathi

o  Eric Vyncke

7.  Open Issues

Refer to the Intent-based Networking NMRG documents.

8.  References

8.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
           "Network Time Protocol Version 4: Protocol and Algorithms
           Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
           <https://www.rfc-editor.org/info/rfc5905>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

[I-D.claise-opsawg-service-assurance-yang]
           Claise, B. and J. Quilbeuf, "YANG Modules for Service
           Assurance", February 2020.

[RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
           "Remote Authentication Dial In User Service (RADIUS)",
           RFC 2865, DOI 10.17487/RFC2865, June 2000,
           <https://www.rfc-editor.org/info/rfc2865>.

[RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
           DOI 10.17487/RFC3164, August 2001,
           <https://www.rfc-editor.org/info/rfc3164>.

[RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
           and A. Bierman, Ed., "Network Configuration Protocol
           (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
           <https://www.rfc-editor.org/info/rfc6241>.
[RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
           Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
           <https://www.rfc-editor.org/info/rfc6242>.

[RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
           "Specification of the IP Flow Information Export (IPFIX)
           Protocol for the Exchange of Flow Information", STD 77,
           RFC 7011, DOI 10.17487/RFC7011, September 2013,
           <https://www.rfc-editor.org/info/rfc7011>.

[RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
           RFC 7950, DOI 10.17487/RFC7950, August 2016,
           <https://www.rfc-editor.org/info/rfc7950>.

[RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
           Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
           <https://www.rfc-editor.org/info/rfc8040>.

[RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
           Classification", RFC 8199, DOI 10.17487/RFC8199, July
           2017, <https://www.rfc-editor.org/info/rfc8199>.

[RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
           Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
           <https://www.rfc-editor.org/info/rfc8446>.

[RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
           for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
           September 2019, <https://www.rfc-editor.org/info/rfc8641>.

[RFC8907]  Dahm, T., Ota, A., Medway Gash, D., Carrel, D., and L.
           Grant, "The Terminal Access Controller Access-Control
           System Plus (TACACS+) Protocol", RFC 8907,
           DOI 10.17487/RFC8907, September 2020,
           <https://www.rfc-editor.org/info/rfc8907>.

Appendix A.  Changes between revisions

v02 - v03

o  Timing Concepts

o  New Assurance Graph Generation

v01 - v02

o  Handling maintenance windows

o  Flexible architecture better explained

o  Improved the terminology

o  Notion of mapping information model to data model, while waiting
   for YANG to be everywhere

o  Started a security considerations section

v00 - v01

o  Terminology clarifications

o  Figure 1 improved

Acknowledgements

The authors would like to thank Stephane Litkowski, Charles Eckel,
Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin,
and Eric Vyncke for their reviews and feedback.

Authors' Addresses

Benoit Claise
Huawei

Email: benoit.claise@huawei.com

Jean Quilbeuf
Independent

Email: jean@quilbeuf.net

Diego R. Lopez
Telefonica I+D
Don Ramon de la Cruz, 82
Madrid  28006
Spain

Email: diego.r.lopez@telefonica.com

Dan Voyer
Bell Canada
Canada

Email: daniel.voyer@bell.ca

Thangam Arumugam
Cisco Systems, Inc.
Milpitas (California)
United States of America

Email: tarumuga@cisco.com