OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                                    Huawei
Expires: 30 December 2022                                       D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                      Cisco Systems, Inc.
                                                             28 June 2022

      Service Assurance for Intent-based Networking Architecture
          draft-ietf-opsawg-service-assurance-architecture-06

Abstract

This document describes an architecture that aims at assuring that service instances are running as expected. As services rely upon multiple sub-services provided by a variety of elements including the underlying network devices and functions, getting the assurance of a healthy service is only possible with a holistic view of all involved elements. This architecture not only helps to correlate the service degradation with symptoms of a specific network component but also to list the services impacted by the failure or degradation of a specific network component.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 30 December 2022.

Copyright Notice

Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components 55 extracted from this document must include Revised BSD License text as 56 described in Section 4.e of the Trust Legal Provisions and are 57 provided without warranty as described in the Revised BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 62 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 63 3. A Functional Architecture . . . . . . . . . . . . . . . . . . 6 64 3.1. Inferring a Service Instance Configuration into an 65 Assurance Graph . . . . . . . . . . . . . . . . . . . . . 9 66 3.1.1. Circular Dependencies . . . . . . . . . . . . . . . . 11 67 3.2. Intent and Assurance Graph . . . . . . . . . . . . . . . 15 68 3.3. Subservices . . . . . . . . . . . . . . . . . . . . . . . 16 69 3.4. Building the Expression Graph from the Assurance Graph . 16 70 3.5. Open Interfaces with YANG Modules . . . . . . . . . . . . 18 71 3.6. Handling Maintenance Windows . . . . . . . . . . . . . . 18 72 3.7. Flexible Functional Architecture . . . . . . . . . . . . 19 73 3.8. Timing . . . . . . . . . . . . . . . . . . . . . . . . . 20 74 3.9. New Assurance Graph Generation . . . . . . . . . . . . . 21 75 4. Security Considerations . . . . . . . . . . . . . . . . . . . 21 76 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 77 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 22 78 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 79 7.1. Normative References . . . . . . . . . . . . . . . . . . 22 80 7.2. Informative References . . . . . . . . . . . . . . . . . 22 81 Appendix A. Changes between revisions . . . . . . . . . . . . . 24 82 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 24 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 85 1. Introduction 87 Network service YANG modules [RFC8199] describe the configuration, 88 state data, operations, and notifications of abstract representations 89 of services implemented on one or multiple network elements. 91 Service orchestrators use Network service YANG modules that will 92 infer network-wide configuration and, therefore the invocation of the 93 appropriate device modules (Section 3 of [RFC8969]). Knowing that a 94 configuration is applied doesn't imply that the service is up and 95 running as expected. For instance, the service might be degraded 96 because of a failure in the network, the experience quality is 97 distorted, or a service function may be reachable at the IP level but 98 does not provide its intended function. Thus, the network operator 99 must monitor the service operational data at the same time as the 100 configuration (Section 3.3 of [RFC8969]. To feed that task, the 101 industry has been standardizing on telemetry to push network element 102 performance information. 104 A network administrator needs to monitor their network and services 105 as a whole, independently of the management protocols. With 106 different protocols come different data models, and different ways to 107 model the same type of information. When network administrators deal 108 with multiple management protocols, the network management entities 109 have to perform the difficult and time-consuming job of mapping data 110 models: e.g. the model used for configuration with the model used for 111 monitoring when separate models or protocols are used. 
This problem is compounded by a large, disparate set of data sources (MIB modules, YANG models [RFC7950], IPFIX information elements [RFC7011], syslog plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], etc.). In order to avoid this data model mapping, the industry converged on model-driven telemetry to stream the service operational data, reusing the YANG models used for configuration. Model-driven telemetry greatly facilitates the notion of closed-loop automation, whereby events/status from the network drive remediation changes back into the network.

However, it proves difficult for network operators to correlate the service degradation with the network root cause. For example, "Why does my L3VPN fail to connect?" or "Why is this specific service not highly responsive?". The reverse, i.e., which services are impacted when a network component fails or degrades, is also important for operators. For example, "Which services are impacted when the dBm level of this specific optic begins to degrade?", "Which applications are impacted by this ECMP imbalance?", or "Is that issue actually impacting any other customers?". This task usually falls under the so-called "Service Impact Analysis" functional block.

Intent-based approaches are often declarative, starting from a statement of "The service works as expected" and trying to enforce it. Such approaches are mainly suited for greenfield deployments.

In this document, we propose an architecture implementing Service Assurance for Intent-Based Networking (SAIN). Aligned with Section 3.3 of [RFC7149], instead of approaching intent in a declarative way, this architecture focuses on already defined services and tries to infer the meaning of "The service works as expected". To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance. If the SAIN orchestrator supports it, the service model (Section 2 of [RFC8309]) or the network model (Section 2.1 of [RFC8309]) can also be used to build the assurance graph. In some cases, the assurance graph may also be explicitly completed to add an intent not exposed in the service model itself (e.g., the service must rely upon a backup physical path). This assurance graph is decomposed into components, which are then assured independently. The root of the assurance graph represents the service to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well. The SAIN orchestrator automatically updates the assurance graph when services are modified.

When a service is degraded, the SAIN architecture will highlight where in the assurance graph to look, as opposed to going hop by hop to troubleshoot the issue. More precisely, the SAIN architecture will associate with each service a list of symptoms originating from specific components of the network. These components are good candidates for explaining the source of a service degradation. Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can also deduce from the assurance graph the number and type of services impacted by a component degradation/failure. This added value informs the operational team where to focus its attention for maximum return.
Indeed, the operational team should focus its attention first on the degrading/failing components impacting the highest number of customers, especially those with SLA contracts involving penalties in case of failure.

This architecture provides the building blocks to assure both physical and virtual entities and is flexible in terms of services and subservices, of (distributed) graphs, and of components (Section 3.7).

The architecture presented in this document is completed by a set of YANG modules defined in a companion document [I-D.ietf-opsawg-service-assurance-yang]. These YANG modules properly define the interfaces between the various components of the architecture in order to foster interoperability.

2. Terminology

SAIN agent: A functional component that communicates with a device, a set of devices, or another agent to build an expression graph from a received assurance graph and perform the corresponding computation of the health status and symptoms.

Assurance case: "An assurance case is a structured argument, supported by evidence, intended to justify that a system is acceptably assured relative to a concern (such as safety or security) in the intended operating environment" [Piovesan2017].

Service instance: A specific instance of a service.

Subservice: A part or a functionality of the network system that can be independently assured as a single entity in the assurance graph.

Assurance graph: A Directed Acyclic Graph (DAG) representing the assurance case for one or several service instances. The nodes (also known as vertices in the context of a DAG) are the service instances themselves and the subservices; the edges indicate dependency relations.

SAIN collector: A functional component that fetches or receives the computer-consumable output of the SAIN agent(s) and processes it locally (including displaying it in a user-friendly form).

DAG: Directed Acyclic Graph.

ECMP: Equal-Cost Multipath.

Expression graph: A generic term for a DAG representing a computation in SAIN. More specific terms are:

* Subservice expressions: An expression graph representing all the computations to execute for a subservice.

* Service expressions: An expression graph representing all the computations to execute for a service instance, i.e., including the computations for all dependent subservices.

* Global computation graph: An expression graph representing all the computations to execute for all service instances (i.e., all computations performed).

Dependency: The directed relationship between subservice instances in the assurance graph.

Metric: A piece of information retrieved from the network running the assured service.

Metric engine: A functional component, part of the SAIN agent, that maps metrics to a list of candidate metric implementations depending on the network element.

Metric implementation: The actual way of retrieving a metric from a network element.

Network service YANG module: A YANG module that describes the characteristics of a service as agreed upon with consumers of that service [RFC8199].

Service orchestrator: Quoting [RFC8199], "Network Service YANG Modules describe the characteristics of a service, as agreed upon with consumers of that service.
That is, a service module does not expose the detailed configuration parameters of all participating network elements and features but describes an abstract model that allows instances of the service to be decomposed into instance data according to the Network Element YANG Modules of the participating network elements. The service-to-element decomposition is a separate process; the details depend on how the network operator chooses to realize the service. For the purpose of this document, the term "orchestrator" is used to describe a system implementing such a process."

SAIN orchestrator: A functional component that is in charge of fetching the configuration specific to each service instance and converting it into an assurance graph.

Health status: Score and symptoms indicating whether a service instance or a subservice is "healthy". A non-maximal score must always be explained by one or more symptoms.

Health score: Integer ranging from 0 to 100 indicating the health of a subservice. A score of 0 means that the subservice is broken, while a score of 100 means that the subservice in question is operating as expected.

Strongly connected component: A subset of a directed graph such that there is a (directed) path from any node of the subset to any other node of the subset. A DAG does not contain any strongly connected component with more than one node.

Symptom: Reason explaining why a service instance or a subservice is not completely healthy.

3. A Functional Architecture

The goal of SAIN is to assure that service instances are operating as expected (i.e., the observed service matches the expected service) and, if not, to pinpoint what is wrong. More precisely, SAIN computes a score for each service instance and outputs symptoms explaining that score. The only valid situation where no symptoms are returned is when the score is maximal, indicating that no issues were detected for that service. The score augmented with the symptoms is called the health status.

The SAIN architecture is generic: it applies to multiple environments (e.g., wireline, wireless), to different domains (e.g., 5G, an NFV domain with a Virtual Infrastructure Manager (VIM)), and, as already noted, to physical or virtual devices as well as virtual functions. Thanks to the distributed graph design principle, graphs from different environments/orchestrators can be combined.

As an example of a service, let us consider a point-to-point L2VPN. [RFC8466] specifies the parameters for such a service. Examples of symptoms might be symptoms reported by specific subservices, such as "Interface has high error rate", "Interface flapping", or "Device almost out of memory", as well as symptoms more specific to the service, such as "Site disconnected from VPN".

To compute the health status of such a service, the service definition is decomposed into an assurance graph formed by subservices linked through dependencies. Each subservice is then turned into an expression graph that details how to fetch metrics from the devices and compute the health status of the subservice. The subservice expressions are combined according to the dependencies between the subservices in order to obtain the expression graph which computes the health status of the service.

The overall SAIN architecture is presented in Figure 1.
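Before the architecture of Figure 1 is described in detail, the following minimal sketch illustrates how a health status (a score plus a list of symptoms, as defined in Section 2) could be combined along impacting dependencies. The sketch is purely illustrative: the data structures, the names, and the min()-based combination rule are assumptions of this example, not part of the SAIN specification.

   # Illustrative only: propagate health statuses (score + symptoms)
   # along impacting dependencies of the tunnel example.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class HealthStatus:
       score: int                        # 0 (broken) .. 100 (as expected)
       symptoms: List[str] = field(default_factory=list)

   def combine(own: HealthStatus, deps: List[HealthStatus]) -> HealthStatus:
       # One possible heuristic: a service cannot be healthier than its
       # worst impacting dependency; symptoms are aggregated to explain
       # the resulting score.
       score = min([own.score] + [d.score for d in deps])
       symptoms = own.symptoms + [s for d in deps for s in d.symptoms]
       return HealthStatus(score, symptoms)

   peer1_itf = HealthStatus(50, ["Interface has high error rate"])
   peer2_itf = HealthStatus(100)
   ip_conn = HealthStatus(100)
   tunnel = combine(HealthStatus(100), [peer1_itf, peer2_itf, ip_conn])
   print(tunnel)   # score 50, explained by the interface symptom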
Based on the 315 service configuration provided by the service orchestrator, the SAIN 316 orchestrator decomposes the assurance graph. It then sends to the 317 SAIN agents the assurance graph along some other configuration 318 options. The SAIN agents are responsible for building the expression 319 graph and computing the health statuses in a distributed manner. The 320 collector is in charge of collecting and displaying the current 321 inferred health status of the service instances and subservices. 322 Finally, the automation loop is closed by having the SAIN collector 323 providing feedback to the network/service orchestrator. 325 In order to make agents, orchestrators and collectors from different 326 vendors interoperable, their interface is defined as a YANG model in 327 a companion document [I-D.ietf-opsawg-service-assurance-yang]. In 328 Figure 1, the communications that are normalized by this YANG model 329 are tagged with a "Y". The use of this YANG model is further 330 explained in Section 3.5. 332 +-----------------+ 333 | Service | 334 | Orchestrator |<--------------------+ 335 | | | 336 +-----------------+ | 337 | ^ | 338 | | Network | 339 | | Service | Feedback 340 | | Instance | Loop 341 | | Configuration | 342 | | | 343 | V | 344 | +-----------------+ +-------------------+ 345 | | SAIN | | SAIN | 346 | | Orchestrator | | Collector | 347 | +-----------------+ +-------------------+ 348 | | ^ 349 | Y| Configuration | Health Status 350 | | (assurance graph) Y| (Score + Symptoms) 351 | V | Streamed 352 | +-------------------+ | via Telemetry 353 | |+-------------------+ | 354 | ||+-------------------+ | 355 | +|| SAIN |---------+ 356 | +| agent | 357 | +-------------------+ 358 | ^ ^ ^ 359 | | | | 360 | | | | Metric Collection 361 V V V V 362 +-------------------------------------------------------------+ 363 | Network System | 364 | | 365 +-------------------------------------------------------------+ 367 Figure 1: SAIN Architecture 369 In order to produce the score assigned to a service instance, the 370 various involved components perform the following tasks: 372 * Analyze the configuration pushed to the network device(s) for 373 configuring the service instance and decide: which information is 374 needed from the device(s), such a piece of information being 375 called a metric, which operations to apply to the metrics for 376 computing the health status. 378 * Stream (via telemetry [RFC8641]) operational and config metric 379 values when possible, else continuously poll. 381 * Continuously compute the health status of the service instances, 382 based on the metric values. 384 3.1. Inferring a Service Instance Configuration into an Assurance Graph 386 In order to structure the assurance of a service instance, the SAIN 387 orchestrator decomposes the service instance into so-called 388 subservice instances. Each subservice instance focuses on a specific 389 feature or subpart of the service. 391 The decomposition into subservices is an important function of the 392 architecture, for the following reasons: 394 * The result of this decomposition provides a relational picture of 395 a service instance, that can be represented as a graph (called 396 assurance graph) to the operator. 398 * Subservices provide a scope for particular expertise and thereby 399 enable contribution from external experts. For instance, the 400 subservice dealing with the optics health should be reviewed and 401 extended by an expert in optical interfaces. 
403 * Subservices that are common to several service instances are 404 reused for reducing the amount of computation needed. 406 The assurance graph of a service instance is a DAG representing the 407 structure of the assurance case for the service instance. The nodes 408 of this graph are service instances or subservice instances. Each 409 edge of this graph indicates a dependency between the two nodes at 410 its extremities: the service or subservice at the source of the edge 411 depends on the service or subservice at the destination of the edge. 413 Figure 2 depicts a simplistic example of the assurance graph for a 414 tunnel service. The node at the top is the service instance, the 415 nodes below are its dependencies. In the example, the tunnel service 416 instance depends on the "peer1" and "peer2" tunnel interfaces, which 417 in turn depend on the respective physical interfaces, which finally 418 depend on the respective "peer1" and "peer2" devices. The tunnel 419 service instance also depends on the IP connectivity that depends on 420 the IS-IS routing protocol. 422 +------------------+ 423 | Tunnel | 424 | Service Instance | 425 +------------------+ 426 | 427 +--------------------+-------------------+ 428 | | | 429 v v v 430 +-------------+ +--------------+ +-------------+ 431 | Peer1 | | IP | | Peer2 | 432 | Tunnel | | Connectivity | | Tunnel | 433 | Interface | | | | Interface | 434 +-------------+ +--------------+ +-------------+ 435 | | | 436 | +-------------+--------------+ | 437 | | | | | 438 v v v v v 439 +-------------+ +-------------+ +-------------+ 440 | Peer1 | | IS-IS | | Peer2 | 441 | Physical | | Routing | | Physical | 442 | Interface | | Protocol | | Interface | 443 +-------------+ +-------------+ +-------------+ 444 | | 445 v v 446 +-------------+ +-------------+ 447 | | | | 448 | Peer1 | | Peer2 | 449 | Device | | Device | 450 +-------------+ +-------------+ 452 Figure 2: Assurance Graph Example 454 Depicting the assurance graph helps the operator to understand (and 455 assert) the decomposition. The assurance graph shall be maintained 456 during normal operation with addition, modification and removal of 457 service instances. A change in the network configuration or topology 458 shall automatically be reflected in the assurance graph. As a first 459 example, a change of routing protocol from IS-IS to OSPF would change 460 the assurance graph accordingly. As a second example, assuming that 461 ECMP is in place for the source router for that specific tunnel; in 462 that case, multiple interfaces must now be monitored, on top of the 463 monitoring the ECMP health itself. 465 3.1.1. Circular Dependencies 467 The edges of the assurance graph represent dependencies. An 468 assurance graph is a DAG if and only if there are no circular 469 dependencies among the subservices, and every assurance graph should 470 avoid circular dependencies. However, in some cases, circular 471 dependencies might appear in the assurance graph. 473 First, the assurance graph of a whole system is obtained by combining 474 the assurance graph of every service running on that system. Here 475 combining means that two subservices having the same type and the 476 same parameters are in fact the same subservice and thus a single 477 node in the graph. For instance, the subservice of type "device" 478 with the only parameter (the device id) set to "PE1" will appear only 479 once in the whole assurance graph even if several services rely on 480 that device. 
Now, if two engineers design assurance graphs for two different services, and Engineer A decides that an interface depends on the link it is connected to, but Engineer B decides that the link depends on the interface it is connected to, then when combining the two assurance graphs, we will have a circular dependency interface -> link -> interface.

Another case possibly resulting in circular dependencies is when subservices are not properly identified. Assume that we want to assure a Kubernetes cluster. If we represent the cluster by a subservice and the network service by another subservice, we will likely model that the network service depends on the cluster, because the network service is orchestrated by Kubernetes, and that the cluster depends on the network service because it implements the communications. A finer decomposition might distinguish between the resources for executing containers (a part of our cluster subservice) and the communication between the containers (which could be modelled in the same way as communication between routers).

In any case, it is likely that circular dependencies will show up in the assurance graph. A first step would be to detect circular dependencies as soon as possible in the SAIN architecture. Such a detection could be carried out by the SAIN orchestrator. Whenever a circular dependency is detected, the newly added service would not be monitored until more careful modelling, or alignment between the different teams (Engineers A and B), removes the circular dependency.

As a more elaborate solution, we could consider a graph transformation:

* Decompose the graph into strongly connected components.

* For each strongly connected component:

- Remove all edges between nodes of the strongly connected component

- Add a new "top" node for the strongly connected component

- For each edge pointing to a node in the strongly connected component, change the destination to the "top" node

- Add a dependency from the top node to every node in the strongly connected component.

Such an algorithm would include all symptoms detected by any subservice in one of the strongly connected components and make them available to any subservice that depends on it. Figure 3 shows an example of such a transformation. On the left-hand side, the nodes c, d, e and f form a strongly connected component. The status of a should depend on the status of c, d, e, f, g, and h, but this is hard to compute because of the circular dependency. On the right-hand side, a depends on all these nodes as well, but there the circular dependency has been removed.

   +---+    +---+  |      +---+    +---+
   | a |    | b |  |      | a |    | b |
   +---+    +---+  |      +---+    +---+
     |        |    |        |        |
     v        v    |        v        v
   +---+    +---+  |     +------------+
   | c |--->| d |  |     |    top     |
   +---+    +---+  |     +------------+
     ^        |    |      /    |     |   \
     |        |    |     /     |     |    \
     |        v    |     v     v     v     v
   +---+    +---+  |   +---+ +---+ +---+ +---+
   | f |<---| e |  |   | f | | c | | d | | e |
   +---+    +---+  |   +---+ +---+ +---+ +---+
     |        |    |     |                 |
     v        v    |     v                 v
   +---+    +---+  |   +---+             +---+
   | g |    | h |  |   | g |             | h |
   +---+    +---+  |   +---+             +---+

       Before                     After
   Transformation            Transformation

            Figure 3: Graph transformation

The sketch below illustrates how such a transformation could be implemented; a concrete example then follows.
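The following sketch (in Python) shows one possible implementation of this transformation. It is an illustration only: the graph representation, the node naming, and the use of Tarjan's algorithm for finding strongly connected components are assumptions of the sketch, not requirements of the SAIN architecture.

   # Sketch only: remove circular dependencies from an assurance graph.
   # The graph maps each node to the set of nodes it depends on.

   def strongly_connected_components(graph):
       # Recursive Tarjan algorithm; returns a list of SCCs (as sets).
       index, low, on_stack = {}, {}, set()
       stack, sccs, counter = [], [], [0]

       def visit(v):
           index[v] = low[v] = counter[0]
           counter[0] += 1
           stack.append(v)
           on_stack.add(v)
           for w in graph.get(v, ()):
               if w not in index:
                   visit(w)
                   low[v] = min(low[v], low[w])
               elif w in on_stack:
                   low[v] = min(low[v], index[w])
           if low[v] == index[v]:
               scc = set()
               while True:
                   w = stack.pop()
                   on_stack.discard(w)
                   scc.add(w)
                   if w == v:
                       break
               sccs.append(scc)

       for v in list(graph):
           if v not in index:
               visit(v)
       return sccs

   def break_cycles(graph):
       new_graph = {v: set(deps) for v, deps in graph.items()}
       for n, scc in enumerate(strongly_connected_components(graph)):
           if len(scc) < 2:
               continue                    # no circular dependency here
           top = "top-%d" % n              # new aggregating node
           for v in scc:                   # remove intra-SCC edges
               new_graph[v] -= scc
           for v, deps in new_graph.items():
               if v not in scc and deps & scc:
                   deps -= scc             # redirect incoming edges ...
                   deps.add(top)           # ... to the "top" node
           new_graph[top] = set(scc)       # "top" depends on all SCC nodes
       return new_graph

   # Graph of Figure 3 (before): c -> d -> e -> f -> c is a cycle.
   graph = {"a": {"c"}, "b": {"d"}, "c": {"d"}, "d": {"e"},
            "e": {"f", "h"}, "f": {"c", "g"}, "g": set(), "h": set()}
   print(break_cycles(graph))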
560 Let's assume that Engineer A is building an assurance graph dealing 561 with IS-IS and Engineer B is building an assurance graph dealing with 562 OSPF. The graph from Engineer A could contain the following: 564 +------------+ 565 | IS-IS Link | 566 +------------+ 567 | 568 v 569 +------------+ 570 | Phys. Link | 571 +------------+ 572 | | 573 v v 574 +-------------+ +-------------+ 575 | Interface 1 | | Interface 2 | 576 +-------------+ +-------------+ 578 Figure 4: Fragment of assurance graph from Engineer A 580 The graph from Engineer B could contain the following: 582 +------------+ 583 | OSPF Link | 584 +------------+ 585 | | | 586 v | v 587 +-------------+ | +-------------+ 588 | Interface 1 | | | Interface 2 | 589 +-------------+ | +-------------+ 590 | | | 591 v v v 592 +------------+ 593 | Phys. Link | 594 +------------+ 596 Figure 5: Fragment of assurance graph from Engineer B 598 Each Interface subservice and the Physical Link subservice are common 599 to both fragments above. Each of these subservice appears only once 600 in the graph merging the two fragments. Dependencies from both 601 fragments are included in the merged graph, resulting in a circular 602 dependency: 604 +------------+ +------------+ 605 | IS-IS Link | | OSPF Link |---+ 606 +------------+ +------------+ | 607 | | | | 608 | +-------- + | | 609 v v | | 610 +------------+ | | 611 | Phys. Link |<-------+ | | 612 +------------+ | | | 613 | ^ | | | | 614 | | +-------+ | | | 615 v | v | v | 616 +-------------+ +-------------+ | 617 | Interface 1 | | Interface 2 | | 618 +-------------+ +-------------+ | 619 ^ | 620 | | 621 +------------------------------+ 623 Figure 6: Merging graphs from A and B 625 The solution presented above would result in graph looking as 626 follows, where a new "empty" node is included. Using that 627 transformation, all dependencies are indirectly satisfied for the 628 nodes outside the circular dependency, in the sense that both IS-IS 629 and OSPF links have indirect dependencies to the two interfaces and 630 the link. However, the dependencies between the link and the 631 interfaces are lost as they were causing the circular dependency. 633 +------------+ +------------+ 634 | IS-IS Link | | OSPF Link | 635 +------------+ +------------+ 636 | | 637 v v 638 +------------+ 639 | empty | 640 +------------+ 641 | 642 +-----------+-------------+ 643 | | | 644 v v v 645 +-------------+ +------------+ +-------------+ 646 | Interface 1 | | Phys. Link | | Interface 2 | 647 +-------------+ +------------+ +-------------+ 649 Figure 7: Removing circular dependencies after merging graphs 650 from A and B 652 3.2. Intent and Assurance Graph 654 The SAIN orchestrator analyzes the configuration of a service 655 instance to: 657 * Try to capture the intent of the service instance, i.e., what is 658 the service instance trying to achieve. At least, this requires 659 the SAIN orchestrator to know the YANG modules that are being 660 configured on the devices to enable the service. Note that if the 661 service model or the network model is known to the SAIN 662 orchestrator, the latter can exploit it. In that case, the intent 663 could be directly extracted and include more details, such as the 664 notion of sites for a VPN, which is out of scope of the device 665 configuration. 667 * Decompose the service instance into subservices representing the 668 network features on which the service instance relies. 
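As a minimal illustration of these two steps, the sketch below derives an assurance graph (subservice instances plus dependencies) for the virtual tunnel example discussed below. The subservice types, parameters, and data structures are hypothetical and chosen to mirror Figure 2; they are not the structures defined in [I-D.ietf-opsawg-service-assurance-yang].

   # Hypothetical sketch: decompose a two-device tunnel configuration
   # into an assurance graph similar to Figure 2.

   def decompose_tunnel(service_id, peers):
       nodes = {}            # node id -> (subservice type, parameters)
       dependencies = []     # (dependent node, dependency node)

       def add(node_id, node_type, **parameters):
           nodes[node_id] = (node_type, parameters)
           return node_id

       service = add(service_id, "tunnel-service", peers=list(peers))
       ip = add("ip-connectivity", "ip-connectivity")
       isis = add("is-is", "routing-protocol", protocol="is-is")
       dependencies += [(service, ip), (ip, isis)]

       for device, interface in peers.items():
           dev = add("device:" + device, "device", device=device)
           phys = add("interface:%s/%s" % (device, interface), "interface",
                      device=device, interface=interface)
           tun = add("tunnel-interface:" + device, "tunnel-interface",
                     device=device)
           dependencies += [(service, tun), (tun, phys), (phys, dev)]

       return nodes, dependencies

   nodes, dependencies = decompose_tunnel("tunnel-1",
                                          {"peer1": "eth0/0",
                                           "peer2": "eth0/0"})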
The SAIN orchestrator must be able to analyze the configuration pushed to various devices for configuring a service instance and produce the assurance graph for that service instance.

To schematize what a SAIN orchestrator does, assume that the configuration for a service instance touches two devices and configures a virtual tunnel interface on each device. Then:

* Capturing the intent would start by detecting that the service instance is actually a tunnel between the two devices, and stating that this tunnel must be functional. This solution is minimally invasive as it does not require modifying or even knowing the service model. If the service model or network model is known by the SAIN orchestrator, it can be used to further capture the intent and include more information, such as Service Level Objectives (SLOs), for instance the latency and bandwidth requirements for the tunnel, if present in the service model.

* Decomposing the service instance into subservices would result in the assurance graph depicted in Figure 2, for instance.

To be applied, SAIN requires a mechanism mapping a service instance to the configuration actually required on the devices for that service instance to run. While Figure 1 makes a distinction between the SAIN orchestrator and a different component providing the service instance configuration, in practice those two components are most likely combined. The internals of the orchestrator are currently out of scope of this document.

3.3. Subservices

A subservice corresponds to a subpart or a feature of the network system that is needed for a service instance to function properly. In the context of SAIN, a subservice also defines its assurance, that is, the method for assuring that the subservice behaves correctly.

Subservices, just like services, have high-level parameters that specify the type and specific instance to be assured. For example, assuring a device requires a specific deviceId as a parameter; assuring an interface requires a specific combination of deviceId and interfaceId.

A subservice is also characterized by a list of metrics to fetch and a list of operations to apply to these metrics in order to infer a health status.

3.4. Building the Expression Graph from the Assurance Graph

From the assurance graph is derived a so-called global computation graph. First, each subservice instance is transformed into a set of subservice expressions that take metrics and constants as input (i.e., sources of the DAG) and produce the status of the subservice, based on some heuristics. For instance, the health of an interface is 0 (minimal score) with the symptom "interface admin-down" if the interface is disabled in the configuration. Then, for each service instance, the service expressions are constructed by combining the subservice expressions of its dependencies. The way service expressions are combined depends on the dependency types (impacting or informational). Finally, the global computation graph is built by combining the service expressions. In other words, the global computation graph encodes all the operations needed to produce health statuses from the collected metrics.
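As an illustration of a subservice expression, the following sketch computes the health status of an "interface" subservice from a few collected metrics. The metric names and the scoring heuristic are examples only, not normative definitions.

   # Sketch of a subservice expression for an "interface" subservice:
   # map collected metrics to a health score and a list of symptoms.

   def interface_health(metrics):
       if metrics.get("admin-status") != "up":
           # The heuristic mentioned above: disabled in the configuration.
           return 0, ["interface admin-down"]
       score, symptoms = 100, []
       if metrics.get("oper-status") != "up":
           score, symptoms = 0, ["interface oper-down"]
       if metrics.get("in-error-rate", 0.0) > 0.01:
           score = min(score, 50)
           symptoms.append("interface has high error rate")
       return score, symptoms

   print(interface_health({"admin-status": "up", "oper-status": "up",
                           "in-error-rate": 0.02}))    # (50, [...])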
The two types of dependencies for combining subservices are:

Informational Dependency: Type of dependency whose health score does not impact the health score of its parent subservice or service instance(s) in the assurance graph. However, the symptoms should be taken into account in the parent service instance or subservice instance(s), for informational reasons.

Impacting Dependency: Type of dependency whose score impacts the score of its parent subservice or service instance(s) in the assurance graph. The symptoms are taken into account in the parent service instance or subservice instance(s), as the impacting reasons.

The set of dependency types presented here is not exhaustive. More specific dependency types can be defined by extending the YANG model. Adding these new dependency types requires defining the corresponding operation for combining statuses of subservices.

Subservices shall not be dependent on the protocol used to retrieve the metrics. To justify this, let's consider the interface operational status. Depending on the device capabilities, this status can be collected by an industry-accepted YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or even by a MIB module. If the subservice was dependent on the mechanism to collect the operational status, then we would need multiple subservice definitions in order to support all the different mechanisms. This also implies that, while waiting for all the metrics to be available via standard YANG modules, SAIN agents might have to retrieve metric values via non-standard YANG models, via MIB modules, the Command Line Interface (CLI), etc., effectively implementing a normalization layer between data models and information models.

In order to keep subservices independent from the metric collection method, or, expressed differently, to support multiple combinations of platforms, OSes, and even vendors, the architecture introduces the concept of "metric engine". The metric engine maps each device-independent metric used in the subservices to a list of device-specific metric implementations that precisely define how to fetch values for that metric. The mapping is parameterized by the characteristics (model, OS version, etc.) of the device from which the metrics are fetched. This metric engine is included in the SAIN agent.

3.5. Open Interfaces with YANG Modules

The interfaces between the architecture components are open thanks to the YANG modules specified in [I-D.ietf-opsawg-service-assurance-yang]; they specify objects for assuring network services based on their decomposition into so-called subservices, according to the SAIN architecture.

These modules are intended for the following use cases:

* Assurance graph configuration:

- Subservices: configure a set of subservices to assure, by specifying their types and parameters.

- Dependencies: configure the dependencies between the subservices, along with their types.

* Assurance telemetry: export the health status of the subservices, along with the observed symptoms.

Some examples of YANG instances can be found in Appendix A of [I-D.ietf-opsawg-service-assurance-yang].

3.6. Handling Maintenance Windows

Whenever network components are under maintenance, the operator wants to inhibit the emission of symptoms from those components.
A typical use case is device maintenance, during which the device is not supposed to be operational. As such, symptoms related to the device health should be ignored, as well as symptoms related to the device-specific subservices, such as the interfaces, as their state changes are probably a consequence of the maintenance.

To configure network components as "under maintenance" in the SAIN architecture, the ietf-service-assurance model proposed in [I-D.ietf-opsawg-service-assurance-yang] specifies an "under-maintenance" flag per service or subservice instance. When, and only when, this flag is set, the companion field "maintenance-contact" must be set to a string that identifies the person or process who requested the maintenance. When a service or subservice is flagged as under maintenance, it may report a generic "Under Maintenance" symptom, for propagation towards subservices that depend on this specific subservice; any other symptom from this service or subservice, or from one of its impacting dependencies, must not be reported.

We illustrate this mechanism with three independent examples based on the assurance graph depicted in Figure 2:

* Device maintenance, for instance upgrading the device OS. The operator sets the "under-maintenance" flag for the subservice "Peer1" device. This inhibits the emission of symptoms from "Peer1 Physical Interface", "Peer1 Tunnel Interface" and "Tunnel Service Instance". All other subservices are unaffected.

* Interface maintenance, for instance replacing a broken optic. The operator sets the "under-maintenance" flag for the subservice "Peer1 Physical Interface". This inhibits the emission of symptoms from "Peer1 Tunnel Interface" and "Tunnel Service Instance". All other subservices are unaffected.

* Routing protocol maintenance, for instance modifying parameters or redistribution. The operator sets the "under-maintenance" flag for the subservice "IS-IS Routing Protocol". This inhibits the emission of symptoms from "IP Connectivity" and "Tunnel Service Instance". All other subservices are unaffected.

3.7. Flexible Functional Architecture

The SAIN architecture is flexible in terms of components. While the SAIN architecture in Figure 1 makes a distinction between two components, the service orchestrator and the SAIN orchestrator, in practice those two components are most likely combined. Similarly, the SAIN agents are displayed in Figure 1 as being separate components. Practically, the SAIN agents could be either independent components or directly integrated into monitored entities. A practical example is an agent in a router.

The SAIN architecture is also flexible in terms of services and subservices. In the proposed architecture, the SAIN orchestrator is coupled to a service orchestrator that defines the kinds of services that the architecture handles. Most examples in this document deal with the notion of Network Service YANG modules, with well-known services such as L2VPN or tunnels. However, the concept of services is general enough to cross into different domains. One of them is the domain of service management on network elements, which also requires its own assurance. Examples include a DHCP server on a Linux server, a data plane, an IPFIX export, etc.
The notion of "service" is generic in this architecture and depends on the service orchestrator and underlying network system. In other words, if a main service orchestrator coordinates several lower-level controllers, a service for a controller can be a subservice from the point of view of the main orchestrator, exactly as a DHCP server, a data plane, or an IPFIX export can be considered subservices for a device, a routing instance can be considered a subservice for an L3VPN, a tunnel can be considered a subservice for an application in the cloud, and a service function can be considered a subservice for a service function chain [RFC7665]. The assurance graph is created to be flexible and open, regardless of the subservice types, locations, or domains.

The SAIN architecture is also flexible in terms of distributed graphs. As shown in Figure 1, the architecture comprises several agents. Each agent is responsible for handling a subgraph of the assurance graph. The collector is responsible for fetching the subgraphs from the different agents and gluing them together. As an example, in the graph of Figure 2, the subservices relative to Peer1 might be handled by a different agent than the subservices relative to Peer2, and the IP Connectivity and IS-IS subservices might be handled by yet another agent. The agents will export their partial graphs and the collector will stitch them together as dependencies of the service instance.

And finally, the SAIN architecture is flexible in terms of what it monitors. Most, if not all, examples in this document refer to physical components, but this is not a constraint. Indeed, the assurance of virtual components would follow the same principles, and an assurance graph composed of virtualized components (or a mix of virtualized and physical ones) is entirely possible within this architecture.

3.8. Timing

The SAIN architecture requires time synchronization, with the Network Time Protocol (NTP) [RFC5905] as a candidate, between all elements: monitored entities, SAIN agents, the service orchestrator, the SAIN collector, as well as the SAIN orchestrator. This guarantees that all symptoms in the system can be correlated with each other and with the right assurance graph version.

The SAIN agent might have to remove some symptoms for specific subservice instances, because they are outdated and no longer relevant, or simply because the SAIN agent needs to free up some space. Regardless of the reason, it is important for a SAIN collector (re-)connecting to a SAIN agent to understand the effect of this garbage collection. Therefore, the SAIN agent contains a YANG object specifying the date and time at which the symptoms history starts for the subservice instances.

3.9. New Assurance Graph Generation

The assurance graph will change over time, because services and subservices come and go (changing the dependencies between subservices), or simply because a subservice is now under maintenance. Therefore, an assurance graph version must be maintained, along with the date and time of its last generation. The date and time of the last change to a particular subservice instance (again, a change in its dependencies or maintenance state) might also be kept.
From a client point of view, an 926 assurance graph change is triggered by the value of the assurance- 927 graph-version and assurance-graph-last-change YANG leaves. At that 928 point in time, the client (collector) follows the following process: 930 * Keep the previous assurance-graph-last-change value (let's call it 931 time T) 933 * Run through all subservice instance and process the subservice 934 instances for which the last-change is newer that the time T 936 * Keep the new assurance-graph-last-change as the new referenced 937 date and time 939 4. Security Considerations 941 The SAIN architecture helps operators to reduce the mean time to 942 detect and mean time to repair. As such, it should not cause any 943 security threats. However, the SAIN agents must be secured: a 944 compromised SAIN agent may be sending wrong root causes or symptoms 945 to the management systems. 947 Except for the configuration of telemetry, the agents do not need 948 "write access" to the devices they monitor. This configuration is 949 applied with a YANG module, whose protection is covered by Secure 950 Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF. 952 The data collected by SAIN could potentially be compromising to the 953 network or provide more insight into how the network is designed. 954 Considering the data that SAIN requires (including CLI access in some 955 cases), one should weigh data access concerns with the impact that 956 reduced visibility will have on being able to rapidly identify root 957 causes. 959 If a closed loop system relies on this architecture then the well 960 known issue of those system also applies, i.e., a lying device or 961 compromised agent could trigger partial reconfiguration of the 962 service or network. The SAIN architecture neither augments or 963 reduces this risk. 965 5. IANA Considerations 967 This document includes no request to IANA. 969 6. Contributors 971 * Youssef El Fathi 973 * Eric Vyncke 975 7. References 977 7.1. Normative References 979 [I-D.ietf-opsawg-service-assurance-yang] 980 Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. 981 Arumugam, "YANG Modules for Service Assurance", Work in 982 Progress, Internet-Draft, draft-ietf-opsawg-service- 983 assurance-yang-06, 24 June 2022, 984 . 987 7.2. Informative References 989 [Piovesan2017] 990 Piovesan, A. and E. Griffor, "Reasoning About Safety and 991 Security: The Logic of Assurance", 2017. 993 [RFC2865] Rigney, C., Willens, S., Rubens, A., and W. Simpson, 994 "Remote Authentication Dial In User Service (RADIUS)", 995 RFC 2865, DOI 10.17487/RFC2865, June 2000, 996 . 998 [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, 999 DOI 10.17487/RFC5424, March 2009, 1000 . 1002 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1003 "Network Time Protocol Version 4: Protocol and Algorithms 1004 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1005 . 1007 [RFC6242] Wasserman, M., "Using the NETCONF Protocol over Secure 1008 Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011, 1009 . 1011 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1012 "Specification of the IP Flow Information Export (IPFIX) 1013 Protocol for the Exchange of Flow Information", STD 77, 1014 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1015 . 1017 [RFC7149] Boucadair, M. and C. Jacquenet, "Software-Defined 1018 Networking: A Perspective from within a Service Provider 1019 Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014, 1020 . 1022 [RFC7665] Halpern, J., Ed. 
and C. Pignataro, Ed., "Service Function 1023 Chaining (SFC) Architecture", RFC 7665, 1024 DOI 10.17487/RFC7665, October 2015, 1025 . 1027 [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", 1028 RFC 7950, DOI 10.17487/RFC7950, August 2016, 1029 . 1031 [RFC8199] Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module 1032 Classification", RFC 8199, DOI 10.17487/RFC8199, July 1033 2017, . 1035 [RFC8309] Wu, Q., Liu, W., and A. Farrel, "Service Models 1036 Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018, 1037 . 1039 [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol 1040 Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, 1041 . 1043 [RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG 1044 Data Model for Layer 2 Virtual Private Network (L2VPN) 1045 Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October 1046 2018, . 1048 [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications 1049 for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, 1050 September 2019, . 1052 [RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L. 1053 Grant, "The Terminal Access Controller Access-Control 1054 System Plus (TACACS+) Protocol", RFC 8907, 1055 DOI 10.17487/RFC8907, September 2020, 1056 . 1058 [RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and 1059 L. Geng, "A Framework for Automating Service and Network 1060 Management with YANG", RFC 8969, DOI 10.17487/RFC8969, 1061 January 2021, . 1063 Appendix A. Changes between revisions 1065 v03 - v04 1067 * Address comments from Mohamed Boucadair 1069 v00 - v01 1071 * Cover the feedback received during the WG call for adoption 1073 Acknowledgements 1075 The authors would like to thank Stephane Litkowski, Charles Eckel, 1076 Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, 1077 Eric Vyncke, and Mohamed Boucadair for their reviews and feedback. 1079 Authors' Addresses 1081 Benoit Claise 1082 Huawei 1083 Email: benoit.claise@huawei.com 1085 Jean Quilbeuf 1086 Huawei 1087 Email: jean.quilbeuf@huawei.com 1089 Diego R. Lopez 1090 Telefonica I+D 1091 Don Ramon de la Cruz, 82 1092 Madrid 28006 1093 Spain 1094 Email: diego.r.lopez@telefonica.com 1096 Dan Voyer 1097 Bell Canada 1098 Canada 1099 Email: daniel.voyer@bell.ca 1100 Thangam Arumugam 1101 Cisco Systems, Inc. 1102 Milpitas (California), 1103 United States of America 1104 Email: tarumuga@cisco.com