OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                                    Huawei
Expires: 8 September 2022                                       D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                     Cisco Systems, Inc.
                                                            7 March 2022

       Service Assurance for Intent-based Networking Architecture
            draft-ietf-opsawg-service-assurance-architecture-03

Abstract

   This document describes an architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are running as expected.  As services rely
   upon multiple subservices provided by the underlying network
   devices and functions, getting the assurance of a healthy service is
   only possible with a holistic view of all involved elements.  This
   architecture not only helps to correlate a service degradation with
   its network root cause, but also identifies the services impacted
   when a network component fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 8 September 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.
   All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Inferring a Service Instance Configuration into an
           Assurance Graph
       3.1.1.  Circular Dependencies
     3.2.  Intent and Assurance Graph
     3.3.  Subservices
     3.4.  Building the Expression Graph from the Assurance Graph
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
     3.7.  Handling Maintenance Windows
     3.8.  Flexible Architecture
     3.9.  Timing
     3.10. New Assurance Graph Generation
   4.  Security Considerations
   5.  IANA Considerations
   6.  Contributors
   7.  Open Issues
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SAIN agent: A functional component that communicates with a device,
   a set of devices, or another agent to build an expression graph from
   a received assurance graph and perform the corresponding computation
   of the health status and symptoms.

   Assurance case: According to [Piovesan2017]: "An assurance case is a
   structured argument, supported by evidence, intended to justify that
   a system is acceptably assured relative to a concern (such as safety
   or security) in the intended operating environment."

   Assurance graph: A Directed Acyclic Graph (DAG) representing the
   assurance case for one or several service instances.  The nodes
   (also known as vertices in the context of a DAG) are the service
   instances themselves and the subservices; the edges indicate
   dependency relations.
   SAIN collector: A functional component that fetches or receives the
   computer-consumable output of the SAIN agent(s) and displays it in a
   user-friendly form, or processes it locally.

   DAG: Directed Acyclic Graph.

   ECMP: Equal Cost Multiple Paths.

   Expression graph: A generic term for a DAG representing a
   computation in SAIN.  More specific terms are:

   *  Subservice expressions: An expression graph representing all the
      computations to execute for a subservice.

   *  Service expressions: An expression graph representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   *  Global computation graph: An expression graph representing all
      the computations to execute for all service instances (i.e., all
      computations performed).

   Dependency: The directed relationship between subservice instances
   in the assurance graph.

   Informational Dependency: Type of dependency whose health score does
   not impact the health score of its parent subservice or service
   instance(s) in the assurance graph.  However, the symptoms should be
   taken into account in the parent service instance or subservice
   instance(s), for informational reasons.

   Impacting Dependency: Type of dependency whose health score impacts
   the health score of its parent subservice or service instance(s) in
   the assurance graph.  The symptoms are taken into account in the
   parent service instance or subservice instance(s), as the impacting
   reasons.

   Metric: A piece of information retrieved from the network running
   the assured service.

   Metric engine: A functional component that maps metrics to a list of
   candidate metric implementations, depending on the network element.

   Metric implementation: Actual way of retrieving a metric from a
   network element.

   Network service YANG module: Describes the characteristics of a
   service as agreed upon with consumers of that service [RFC8199].

   Service instance: A specific instance of a service.

   Service configuration orchestrator: Quoting [RFC8199]: "Network
   Service YANG Modules describe the characteristics of a service, as
   agreed upon with consumers of that service.  That is, a service
   module does not expose the detailed configuration parameters of all
   participating network elements and features but describes an
   abstract model that allows instances of the service to be decomposed
   into instance data according to the Network Element YANG Modules of
   the participating network elements.  The service-to-element
   decomposition is a separate process; the details depend on how the
   network operator chooses to realize the service.  For the purpose of
   this document, the term "orchestrator" is used to describe a system
   implementing such a process."

   SAIN orchestrator: A functional component that is in charge of
   fetching the configuration specific to each service instance and
   converting it into an assurance graph.

   Health status: Score and symptoms indicating whether a service
   instance or a subservice is "healthy".  A non-maximal score must
   always be explained by one or more symptoms.

   Health score: Integer ranging from 0 to 100, indicating the health
   of a subservice.  A score of 0 means that the subservice is broken;
   a score of 100 means that the subservice in question is operating as
   expected.
   Subservice: Part or functionality of the network system that can be
   independently assured as a single entity in the assurance graph.

   Strongly connected component: Subset of a directed graph such that
   there is a (directed) path from any node of the subset to any other
   node.  A DAG does not contain any strongly connected component
   spanning more than one node.

   Symptom: Reason explaining why a service instance or a subservice is
   not completely healthy.

2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting [RFC8199]: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."

   Service configuration orchestrators deploy Network Service YANG
   Modules [RFC8199] that will infer network-wide configuration and,
   therefore, the configuration of the appropriate device modules
   (Section 3 of [RFC8969]).  Network configuration is based on these
   device YANG modules, with protocols/encodings such as NETCONF/XML
   [RFC6241], RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.
   Knowing that a configuration has been applied does not imply that
   the service is running as expected (e.g., the service might be
   degraded because of a failure in the network); therefore, the
   network operator must monitor the service operational data at the
   same time as the configuration (Section 3.3 of [RFC8969]).  The
   industry has been standardizing on telemetry to push network element
   performance information.

   A network administrator needs to monitor her network and services as
   a whole, independently of the use cases or the management protocols.
   With different protocols come different data models, and different
   ways to model the same type of information.  When network
   administrators deal with multiple protocols, the network management
   systems must perform the difficult and time-consuming job of mapping
   data models: the model used for configuration with the model used
   for monitoring.  This problem is compounded by a large, disparate
   set of data sources (MIB modules, YANG models [RFC7950], IPFIX
   information elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [RFC8907], RADIUS [RFC2865], etc.).  In order to avoid this data
   model mapping, the industry converged on model-driven telemetry to
   stream the service operational data, reusing the YANG models used
   for configuration.  Model-driven telemetry greatly facilitates the
   notion of closed-loop automation, whereby events/status from the
   network drive remediation changes back into the network.

   However, it proves difficult for network operators to correlate the
   service degradation with the network root cause.  For example, why
   does my L3VPN fail to connect?
   Why is this specific service slow?  The reverse, i.e., which
   services are impacted when a network component fails or degrades, is
   even more interesting for the operators.  For example, which
   services are impacted when the dBm level of this specific optic
   begins to degrade?  Which applications are impacted by this ECMP
   imbalance?  Is that issue actually impacting any other customers?

   Intent-based approaches are often declarative, starting from a
   statement of "The service works as expected" and trying to enforce
   it.  Such approaches are mainly suited for greenfield deployments.

   Aligned with Section 3.3 of [RFC7149], and instead of approaching
   intent in a declarative way, this architecture focuses on already
   defined services and tries to infer the meaning of "The service
   works as expected".  To do so, the architecture works from an
   assurance graph, deduced from the service definition and from the
   network configuration.  In some cases, the assurance graph may also
   be explicitly completed to add an intent not exposed in the service
   model itself (e.g., the service must rely on a backup physical
   path).  This assurance graph is decomposed into components, which
   are then assured independently.  The root of the assurance graph
   represents the service to assure, and its children represent
   components identified as its direct dependencies; each component can
   have dependencies as well.  The SAIN architecture updates the
   assurance graph when services are modified or when the network
   conditions change.

   When a service is degraded, the SAIN architecture will highlight, to
   the best of its knowledge, where in the assurance graph to look, as
   opposed to going hop by hop to troubleshoot the issue.  Not only can
   this architecture help to correlate service degradation with the
   network root cause/symptoms, but it can deduce from the assurance
   graph the number and type of services impacted by a component
   degradation or failure.  This added value informs the operational
   team where to focus its attention for maximum return.  Indeed, the
   operational team should focus its attention on the degrading or
   failing components impacting the highest number of customers,
   especially the ones with SLA contracts involving penalties in case
   of failure.

   This architecture provides the building blocks to assure both
   physical and virtual entities, and is flexible in terms of services
   and subservices, of (distributed) graphs, and of components
   (Section 3.8).

3.  Architecture

   The goal of SAIN is to assure that service instances are operating
   correctly and, if not, to pinpoint what is wrong.  More precisely,
   SAIN computes a score for each service instance and outputs symptoms
   explaining that score, especially why the score is not maximal.  The
   score augmented with the symptoms is called the health status.

   The SAIN architecture is generic and applicable to multiple
   environments: obviously wireline, but also wireless, as well as
   different domains such as 5G or an NFV domain with a Virtual
   Infrastructure Manager (VIM).  As already noted, it applies to
   physical or virtual devices, as well as virtual functions.  Thanks
   to the distributed graph design principle, graphs from different
   environments/orchestrators can be combined together.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., pseudowire).
   Such a service would take as parameters the two ends of the
   connection (device, interface or subinterface, and address of the
   other end) and configure both devices (and maybe more) so that an
   L2VPN connection is established between the two devices.  Examples
   of symptoms might be "Interface has high error rate", "Interface
   flapping", or "Device almost out of memory".

   To compute the health status of such a service, the service
   definition is decomposed into an assurance graph formed by
   subservices linked through dependencies.  Each subservice is then
   turned into an expression graph that details how to fetch metrics
   from the devices and compute the health status of the subservice.
   The subservice expressions are combined according to the
   dependencies between the subservices in order to obtain the
   expression graph which computes the health status of the service.

   The overall SAIN architecture is presented in Figure 1.  Based on
   the service configuration, the SAIN orchestrator decomposes the
   assurance graph, to the best of its knowledge.  It then sends the
   assurance graph, along with some other configuration options, to
   the SAIN agents.  The SAIN agents are responsible for building the
   expression graph and computing the health statuses in a distributed
   manner.  The collector is in charge of collecting and displaying
   the current inferred health status of the service instances and
   subservices.  Finally, the automation loop is closed by having the
   SAIN collector provide feedback to the network/service orchestrator.

   In order to make agents, orchestrators, and collectors from
   different vendors interoperable, their interface is defined as a
   YANG model in a companion document
   [I-D.ietf-opsawg-service-assurance-yang].  In Figure 1, the
   communications that are normalized by this model are tagged with a
   "Y".  The use of these YANG modules is further explained in
   Section 3.6.

        +-----------------+
        |     Service     |
        |  Configuration  |<--------------------+
        |  Orchestrator   |                     |
        +-----------------+                     |
            |       |                           |
            |       | Network                   |
            |       | Service                   | Feedback
            |       | Instance                  | Loop
            |       | Configuration             |
            |       |                           |
            |       V                           |
            |   +-----------------+     +-------------------+
            |   |      SAIN       |     |       SAIN        |
            |   |  Orchestrator   |     |     Collector     |
            |   +-----------------+     +-------------------+
            |       |                           ^
            |      Y| Configuration             | Health Status
            |       | (assurance graph)        Y| (Score + Symptoms)
            |       V                           | Streamed
            |   +-------------------+           | via Telemetry
            |   |+-------------------+          |
            |   ||+-------------------+         |
            |   +||       SAIN        |---------+
            |    +|       agent       |
            |     +-------------------+
            |        ^      ^      ^
            |        |      |      |
            |        |      |      |  Metric Collection
            V        V      V      V
   +-------------------------------------------------------------+
   |                     Monitored Entities                      |
   |                                                             |
   +-------------------------------------------------------------+

                      Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks (a sketch illustrating
   them follows the list):

   *  Analyze the configuration pushed to the network device(s) for
      configuring the service instance, and decide which information
      is needed from the device(s) (such a piece of information is
      called a metric) and which operations to apply to the metrics
      for computing the health status.

   *  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible; otherwise, continuously poll.

   *  Continuously compute the health status of the service instances,
      based on the metric values.
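   To make these tasks concrete, the following non-normative Python
   sketch mimics the health computation for a single "interface"
   subservice.  The metric names, the thresholds, and the scoring
   heuristic are illustrative assumptions of this example; they are
   not defined by this architecture.

   # Illustrative sketch only: metric names, thresholds, and the
   # scoring heuristic are assumptions, not defined by SAIN.
   def interface_health(metrics):
       """Compute a (score, symptoms) health status from metrics."""
       score, symptoms = 100, []
       if metrics["oper-status"] != "up":
           score = 0
           symptoms.append("Interface down")
       if metrics["in-error-rate"] > 0.01:       # assumed threshold
           score = min(score, 50)
           symptoms.append("Interface has high error rate")
       if metrics["flap-count-last-hour"] > 5:   # assumed threshold
           score = min(score, 30)
           symptoms.append("Interface flapping")
       return score, symptoms

   # Metric values streamed via telemetry or polled from the device.
   collected = {"oper-status": "up", "in-error-rate": 0.02,
                "flap-count-last-hour": 0}
   print(interface_health(collected))
   # -> (50, ['Interface has high error rate'])

   Note that, per the definition of the health status in Section 1, a
   non-maximal score is always accompanied by at least one symptom.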
3.1.  Inferring a Service Instance Configuration into an Assurance
      Graph

   In order to structure the assurance of a service instance, the
   service instance is decomposed into so-called subservice instances.
   Each subservice instance focuses on a specific feature or subpart
   of the service.

   The decomposition into subservices is an important function of this
   architecture, for the following reasons:

   *  The result of this decomposition provides a relational picture
      of a service instance that can be represented as a graph (called
      the assurance graph) to the operator.

   *  Subservices provide a scope for particular expertise and thereby
      enable contributions from external experts.  For instance, the
      subservice dealing with the optics health should be reviewed and
      extended by an expert in optical interfaces.

   *  Subservices that are common to several service instances are
      reused, reducing the amount of computation needed.

   The assurance graph of a service instance is a DAG representing the
   structure of the assurance case for the service instance.  The
   nodes of this graph are service instances or subservice instances.
   Each edge of this graph indicates a dependency between the two
   nodes at its extremities: the service or subservice at the source
   of the edge depends on the service or subservice at the destination
   of the edge.

   Figure 2 depicts a simplistic example of the assurance graph for a
   tunnel service.  The node at the top is the service instance; the
   nodes below are its dependencies.  In the example, the tunnel
   service instance depends on the "peer1" and "peer2" tunnel
   interfaces, which in turn depend on the respective physical
   interfaces, which finally depend on the respective "peer1" and
   "peer2" devices.  The tunnel service instance also depends on the
   IP connectivity, which depends on the IS-IS routing protocol.

                       +------------------+
                       |      Tunnel      |
                       | Service Instance |
                       +------------------+
                                |
            +-------------------+-------------------+
            |                   |                   |
            v                   v                   v
     +-------------+     +--------------+     +-------------+
     |    Peer1    |     |      IP      |     |    Peer2    |
     |    Tunnel   |     | Connectivity |     |    Tunnel   |
     |  Interface  |     |              |     |  Interface  |
     +-------------+     +--------------+     +-------------+
            |                   |                   |
            |     +-------------+-------------+     |
            |     |             |             |     |
            v     v             v             v     v
     +-------------+     +-------------+     +-------------+
     |    Peer1    |     |    IS-IS    |     |    Peer2    |
     |   Physical  |     |   Routing   |     |   Physical  |
     |  Interface  |     |   Protocol  |     |  Interface  |
     +-------------+     +-------------+     +-------------+
            |                                       |
            v                                       v
     +-------------+                         +-------------+
     |             |                         |             |
     |    Peer1    |                         |    Peer2    |
     |    Device   |                         |    Device   |
     +-------------+                         +-------------+

                  Figure 2: Assurance Graph Example

   Depicting the assurance graph helps the operator to understand (and
   assert) the decomposition.  The assurance graph shall be maintained
   during normal operation, with addition, modification, and removal
   of service instances.  A change in the network configuration or
   topology shall be reflected in the assurance graph.  As a first
   example, a change of routing protocol from IS-IS to OSPF would
   change the assurance graph accordingly.  As a second example,
   assume that ECMP is in place at the source router for that specific
   tunnel; in that case, multiple interfaces must now be monitored, on
   top of monitoring the ECMP health itself.
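   As an illustration, the assurance graph of Figure 2 can be captured
   as a simple dependency structure.  The following non-normative
   Python sketch, in which the node names are illustrative
   assumptions, builds that structure and verifies that it contains no
   circular dependency (see Section 3.1.1):

   # Figure 2 as a dependency graph: each key depends on the
   # subservices it maps to.  Node names are illustrative.
   deps = {
       "tunnel-service":  ["peer1-tunnel-if", "ip-connectivity",
                           "peer2-tunnel-if"],
       "peer1-tunnel-if": ["peer1-phys-if"],
       "peer2-tunnel-if": ["peer2-phys-if"],
       "ip-connectivity": ["peer1-phys-if", "isis", "peer2-phys-if"],
       "peer1-phys-if":   ["peer1-device"],
       "peer2-phys-if":   ["peer2-device"],
       "isis": [], "peer1-device": [], "peer2-device": [],
   }

   def is_dag(deps):
       """DFS with white/grey/black coloring: reaching a grey node
       again means the graph contains a circular dependency."""
       state = {}
       def visit(n):
           if state.get(n) == "grey":
               return False            # back edge: cycle found
           if state.get(n) == "black":
               return True             # already fully explored
           state[n] = "grey"
           ok = all(visit(m) for m in deps[n])
           state[n] = "black"
           return ok
       return all(visit(n) for n in deps)

   assert is_dag(deps)   # Figure 2 has no circular dependency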
3.1.1.  Circular Dependencies

   The edges of the assurance graph represent dependencies.  An
   assurance graph is a DAG if and only if there are no circular
   dependencies among the subservices, and every assurance graph
   should avoid circular dependencies.  However, in some cases,
   circular dependencies might appear in the assurance graph.

   First, the assurance graph of a whole system is obtained by
   combining the assurance graphs of every service running on that
   system.  Here, combining means that two subservices having the same
   type and the same parameters are in fact the same subservice, and
   thus a single node in the graph.  For instance, the subservice of
   type "device" with its only parameter (the device id) set to "PE1"
   will appear only once in the whole assurance graph, even if several
   services rely on that device.  Now, if two engineers design
   assurance graphs for two different services, and engineer A decides
   that an interface depends on the link it is connected to, but
   engineer B decides that the link depends on the interface it is
   connected to, then combining the two assurance graphs produces a
   circular dependency: interface -> link -> interface.

   Another case possibly resulting in circular dependencies is when
   subservices are not properly identified.  Assume that we want to
   assure a Kubernetes cluster.  If we represent the cluster by a
   subservice and the network service by another subservice, we will
   likely model that the network service depends on the cluster,
   because the network service is orchestrated by Kubernetes, and that
   the cluster depends on the network service, because it implements
   the communications.  A finer decomposition might distinguish
   between the resources for executing containers (a part of our
   cluster subservice) and the communication between the containers
   (which could be modelled in the same way as communication between
   routers).

   In any case, it is likely that circular dependencies will show up
   in the assurance graph.  A first step would be to detect circular
   dependencies as soon as possible in the SAIN architecture.  Such a
   detection could be carried out by the SAIN orchestrator.  Whenever
   a circular dependency is detected, the newly added service would
   not be monitored until more careful modelling, or alignment between
   the different teams (engineers A and B), removes the circular
   dependency.

   As a more elaborate solution, we could consider a graph
   transformation:

   *  Decompose the graph into strongly connected components.

   *  For each strongly connected component:

      -  Remove all edges between nodes of the strongly connected
         component.

      -  Add a new "top" node for the strongly connected component.

      -  For each edge pointing to a node in the strongly connected
         component, change the destination to the "top" node.

      -  Add a dependency from the top node to every node in the
         strongly connected component.

   Such an algorithm would include all symptoms detected by any
   subservice in one of the strongly connected components and make
   them available to any subservice that depends on it.
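   The following non-normative Python sketch is one possible
   realization of this transformation; it computes the strongly
   connected components with Tarjan's algorithm, and the naming scheme
   for the "top" nodes is an illustrative assumption.

   def tarjan_sccs(deps):
       """Return the strongly connected components of the graph."""
       index, low, stack, on_stack, sccs = {}, {}, [], set(), []
       counter = [0]
       def strongconnect(v):
           index[v] = low[v] = counter[0]
           counter[0] += 1
           stack.append(v)
           on_stack.add(v)
           for w in deps[v]:
               if w not in index:
                   strongconnect(w)
                   low[v] = min(low[v], low[w])
               elif w in on_stack:
                   low[v] = min(low[v], index[w])
           if low[v] == index[v]:       # v is the root of an SCC
               scc = []
               while True:
                   w = stack.pop()
                   on_stack.discard(w)
                   scc.append(w)
                   if w == v:
                       break
               sccs.append(scc)
       for v in deps:
           if v not in index:
               strongconnect(v)
       return sccs

   def collapse_cycles(deps):
       """Replace each non-trivial SCC by a 'top' node, as above."""
       for scc in tarjan_sccs(deps):
           if len(scc) < 2:
               continue                 # trivial SCC: nothing to do
           members = set(scc)
           top = "top(" + ",".join(sorted(members)) + ")"
           for v in members:            # drop edges inside the SCC
               deps[v] = [w for w in deps[v] if w not in members]
           for v in list(deps):         # redirect incoming edges
               if v not in members:
                   deps[v] = list(dict.fromkeys(
                       top if w in members else w for w in deps[v]))
           deps[top] = sorted(members)  # top depends on every member
       return deps

   # Graph of Figure 3 (before): c, d, e, and f form an SCC.
   g = {"a": ["c"], "b": ["d"], "c": ["d"], "d": ["e"],
        "e": ["f", "h"], "f": ["c", "g"], "g": [], "h": []}
   collapse_cycles(g)
   # g["a"] == ["top(c,d,e,f)"]
   # g["top(c,d,e,f)"] == ["c", "d", "e", "f"]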
   Figure 3 shows an example of such a transformation.  On the left-
   hand side, the nodes c, d, e, and f form a strongly connected
   component.  The status of a should depend on the status of c, d, e,
   f, g, and h, but this is hard to compute because of the circular
   dependency.  On the right-hand side, a depends on all these nodes
   as well, but the circular dependency has been removed.

     +---+     +---+     |     +---+           +---+
     | a |     | b |     |     | a |           | b |
     +---+     +---+     |     +---+           +---+
       |         |       |       |               |
       v         v       |       v               v
     +---+     +---+     |   +---------------------+
     | c |---->| d |     |   |         top         |
     +---+     +---+     |   +---------------------+
       ^         |       |     |     |     |     |
       |         |       |     v     v     v     v
       |         v       |   +---+ +---+ +---+ +---+
     +---+     +---+     |   | f | | c | | d | | e |
     | f |<----| e |     |   +---+ +---+ +---+ +---+
     +---+     +---+     |     |                 |
       |         |       |     v                 v
       v         v       |   +---+             +---+
     +---+     +---+     |   | g |             | h |
     | g |     | h |     |   +---+             +---+
     +---+     +---+     |

            Before                     After
        Transformation            Transformation

                    Figure 3: Graph transformation

   We consider a concrete example to illustrate this transformation.
   Let's assume that engineer A is building an assurance graph dealing
   with IS-IS and engineer B is building an assurance graph dealing
   with OSPF.  The graph from engineer A could contain the following:

                  +------------+
                  | IS-IS Link |
                  +------------+
                        |
                        v
                  +------------+
                  | Phys. Link |
                  +------------+
                    |        |
              +-----+        +-----+
              v                    v
       +-------------+      +-------------+
       | Interface 1 |      | Interface 2 |
       +-------------+      +-------------+

        Figure 4: Fragment of assurance graph from Engineer A

   The graph from engineer B could contain the following:

                  +------------+
                  | OSPF Link  |
                  +------------+
                   |    |    |
              +----+    |    +----+
              v         |         v
       +-------------+  |  +-------------+
       | Interface 1 |  |  | Interface 2 |
       +-------------+  |  +-------------+
              |         |         |
              +----+    |    +----+
                   v    v    v
                  +------------+
                  | Phys. Link |
                  +------------+

        Figure 5: Fragment of assurance graph from Engineer B

   Each Interface subservice and the Physical Link subservice are
   common to both fragments above.  Each of these subservices appears
   only once in the graph merging the two fragments.  Dependencies
   from both fragments are included in the merged graph, resulting in
   a circular dependency:

       +------------+          +------------+
       | IS-IS Link |          | OSPF Link  |---+
       +------------+          +------------+   |
             |                   |      |       |
             |   +---------------+      |       |
             v   v                      |       |
           +------------+               |       |
           | Phys. Link |<---------+    |       |
           +------------+          |    |       |
             |  ^     |            |    |       |
             |  |     +--------+   |    |       |
             v  |              v   |    v       |
       +-------------+      +-------------+     |
       | Interface 1 |      | Interface 2 |     |
       +-------------+      +-------------+     |
              ^                                 |
              |                                 |
              +---------------------------------+

                Figure 6: Merging graphs from A and B
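   To make the combination step concrete, the following non-normative
   Python sketch merges the two fragments above, identifying each
   subservice by its (type, parameter) pair, and detects the resulting
   circular dependency.  The identifiers are illustrative assumptions.

   # Subservices are identified by (type, parameter); merging two
   # graphs therefore merges nodes with identical identifiers.
   fragment_a = {   # Figure 4
       ("isis-link", "L1"): [("phys-link", "L1")],
       ("phys-link", "L1"): [("interface", "if1"),
                             ("interface", "if2")],
       ("interface", "if1"): [], ("interface", "if2"): [],
   }
   fragment_b = {   # Figure 5
       ("ospf-link", "L1"): [("interface", "if1"),
                             ("interface", "if2"),
                             ("phys-link", "L1")],
       ("interface", "if1"): [("phys-link", "L1")],
       ("interface", "if2"): [("phys-link", "L1")],
       ("phys-link", "L1"): [],
   }

   def merge(g1, g2):
       """Union of nodes and edges, deduplicating identical nodes."""
       merged = {n: list(d) for n, d in g1.items()}
       for n, d in g2.items():
           merged.setdefault(n, [])
           merged[n] += [w for w in d if w not in merged[n]]
       return merged

   def has_cycle(deps):
       """DFS-based detection of circular dependencies."""
       state = {}
       def visit(n):
           if state.get(n) == "grey":
               return True
           if state.get(n) == "black":
               return False
           state[n] = "grey"
           found = any(visit(m) for m in deps[n])
           state[n] = "black"
           return found
       return any(visit(n) for n in deps)

   assert has_cycle(merge(fragment_a, fragment_b))

   Upon such a detection, the SAIN orchestrator would either refuse to
   monitor the newly added service, or apply the transformation
   described at the beginning of this section.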
   The transformation presented above would result in the graph below,
   where a new "empty" (top) node is included.  With that
   transformation, all dependencies are indirectly satisfied for the
   nodes outside the circular dependency, in the sense that both the
   IS-IS and OSPF links have indirect dependencies on the two
   interfaces and the link.  However, the dependencies between the
   link and the interfaces are lost, as they were causing the circular
   dependency.

       +------------+          +------------+
       | IS-IS Link |          | OSPF Link  |
       +------------+          +------------+
             |                       |
             v                       v
            +-------------------------+
            |                         |
            +-------------------------+
                         |
            +------------+------------+
            |            |            |
            v            v            v
    +-------------+ +------------+ +-------------+
    | Interface 1 | | Phys. Link | | Interface 2 |
    +-------------+ +------------+ +-------------+

      Figure 7: Removing circular dependencies after merging graphs
                             from A and B

3.2.  Intent and Assurance Graph

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   *  Try to capture the intent of the service instance, i.e., what
      the service instance is trying to achieve.

   *  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance graph.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each device.  Then:

   *  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the
      current state of SAIN; however, it does not completely capture
      the intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   *  Decomposing the service instance into subservices would result
      in the assurance graph depicted in Figure 2, for instance.

   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two
   components are most likely combined.  The internals of the
   orchestrator are currently out of scope of this document.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   "subservice assurance", that is, the method for assuring that a
   subservice behaves correctly.

   Subservices, just as services, have high-level parameters that
   specify the type and the specific instance to be assured.  For
   example, assuring a device requires the specific deviceId as a
   parameter; assuring an interface requires the specific combination
   of deviceId and interfaceId.

   A subservice is also characterized by a list of metrics to fetch
   and a list of computations to apply to these metrics in order to
   infer a health status.

3.4.  Building the Expression Graph from the Assurance Graph

   From the assurance graph is derived a so-called global computation
   graph.  First, each subservice instance is transformed into a set
   of subservice expressions that take metrics and constants as input
   (i.e., sources of the DAG) and produce the status of the
   subservice, based on some heuristics.  Then, for each service
   instance, the service expressions are constructed by combining the
   subservice expressions of its dependencies.  The way service
   expressions are combined depends on the dependency types (impacting
   or informational).  Finally, the global computation graph is built
   by combining the service expressions.  In other words, the global
   computation graph encodes all the operations needed to produce
   health statuses from the collected metrics.
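   While the combination heuristics are out of scope of this document,
   the following non-normative Python sketch shows one plausible
   convention, assumed only for this example: impacting dependencies
   cap the parent score, whereas informational dependencies only
   contribute their symptoms.

   # Assumed convention for this sketch only: a parent's score is
   # capped by its impacting dependencies; informational dependencies
   # only add their symptoms.
   def combine(own_score, own_symptoms, dependencies):
       """dependencies: list of (dep_type, (score, symptoms))."""
       score, symptoms = own_score, list(own_symptoms)
       for dep_type, (dep_score, dep_symptoms) in dependencies:
           if dep_type == "impacting":
               score = min(score, dep_score)  # degradation propagates
           symptoms += dep_symptoms           # symptoms always surface
       return score, symptoms

   # Tunnel service of Figure 2: all three dependencies impacting.
   peer1_if = (50, ["Interface has high error rate"])
   peer2_if = (100, [])
   ip_conn  = (100, [])
   print(combine(100, [], [("impacting", peer1_if),
                           ("impacting", peer2_if),
                           ("impacting", ip_conn)]))
   # -> (50, ['Interface has high error rate'])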
   Subservices shall be device independent.  To justify this, let's
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected by an industry-accepted
   YANG module (IETF, OpenConfig), by a vendor-specific YANG module,
   or even by a MIB module.  If the subservice was dependent on the
   mechanism to collect the operational status, then we would need
   multiple subservice definitions in order to support all the
   different mechanisms.  This also implies that, while waiting for
   all the metrics to be available via standard YANG modules, SAIN
   agents might have to retrieve metric values via non-standard YANG
   models, via MIB modules, via the Command Line Interface (CLI),
   etc., effectively implementing a normalization layer between data
   models and information models.

   In order to keep subservices independent from the metric collection
   method or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the architecture introduces
   the concept of a "metric engine".  The metric engine maps each
   device-independent metric used in the subservices to a list of
   device-specific metric implementations that precisely define how to
   fetch values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
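   The following non-normative Python sketch illustrates the metric
   engine concept for the interface operational status metric
   discussed above.  The device characteristics and the vendor-
   specific path are illustrative assumptions; the IETF YANG path and
   the IF-MIB OID are given as plausible candidates.

   # Candidate implementations for the device-independent metric
   # "interface-oper-status", tried in order of preference.  The
   # predicates on device characteristics are illustrative.
   CANDIDATES = [
       (lambda dev: dev.get("supports-ietf-interfaces", False),
        {"mechanism": "YANG",
         "path": "/ietf-interfaces:interfaces-state"
                 "/interface/oper-status"}),
       (lambda dev: dev.get("os") == "vendor-os"
                    and dev.get("version", 0) >= 7,
        {"mechanism": "YANG",
         "path": "/vendor-if:interfaces/interface/oper-status"}),
       (lambda dev: True,   # last resort: MIB module (ifOperStatus)
        {"mechanism": "SNMP", "oid": "1.3.6.1.2.1.2.2.1.8"}),
   ]

   def metric_implementation(device):
       """Map the metric to the first supported implementation."""
       for supported, implementation in CANDIDATES:
           if supported(device):
               return implementation
       raise LookupError("no implementation for this device")

   legacy = {"os": "vendor-os", "version": 5}
   print(metric_implementation(legacy))   # falls back to SNMP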
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of scope of this document.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to the YANG modules specified in "YANG Modules for Service
   Assurance" [I-D.ietf-opsawg-service-assurance-yang]; they specify
   objects for assuring network services based on their decomposition
   into so-called subservices, according to the SAIN architecture.

   This module is intended for the following use cases:

   *  Assurance graph configuration:

      -  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      -  Dependencies: configure the dependencies between the
         subservices, along with their types.

   *  Assurance telemetry: export the health status of the
      subservices, along with the observed symptoms.

   Some examples of YANG instances can be found in Appendix A of
   [I-D.ietf-opsawg-service-assurance-yang].

3.7.  Handling Maintenance Windows

   Whenever network components are under maintenance, the operator
   wants to inhibit the emission of symptoms from those components.  A
   typical use case is device maintenance, during which the device is
   not supposed to be operational.  As such, symptoms related to the
   device health should be ignored, as well as symptoms related to the
   device-specific subservices, such as the interfaces, as their state
   changes are probably the consequence of the maintenance.

   To configure network components as "under maintenance" in the SAIN
   architecture, the ietf-service-assurance model proposed in
   [I-D.ietf-opsawg-service-assurance-yang] specifies an
   "under-maintenance" flag per service or subservice instance.  When
   this flag is set, and only when this flag is set, the companion
   field "maintenance-contact" must be set to a string that identifies
   the person or process who requested the maintenance.  When a
   service or subservice is flagged as under maintenance, it may
   report a generic "Under Maintenance" symptom, for propagation
   towards subservices that depend on this specific subservice: any
   other symptom from this service, or from one of its impacting
   dependencies, MUST NOT be reported.

   We illustrate this mechanism with three independent examples based
   on the assurance graph depicted in Figure 2; a sketch of the
   inhibition rule follows the list:

   *  Device maintenance, for instance upgrading the device OS.  The
      operator sets the "under-maintenance" flag for the subservice
      "Peer1" device.  This inhibits the emission of symptoms from
      "Peer1 Physical Interface", "Peer1 Tunnel Interface", and
      "Tunnel Service Instance".  All other subservices are
      unaffected.

   *  Interface maintenance, for instance replacing a broken optic.
      The operator sets the "under-maintenance" flag for the
      subservice "Peer1 Physical Interface".  This inhibits the
      emission of symptoms from "Peer1 Tunnel Interface" and "Tunnel
      Service Instance".  All other subservices are unaffected.

   *  Routing protocol maintenance, for instance modifying parameters
      or redistribution.  The operator sets the "under-maintenance"
      flag for the subservice "IS-IS Routing Protocol".  This inhibits
      the emission of symptoms from "IP Connectivity" and "Tunnel
      Service Instance".  All other subservices are unaffected.
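   The following non-normative Python sketch captures this inhibition
   rule as interpreted from the examples above.  The node names and
   the dependency types are illustrative assumptions; in particular,
   the dependencies of "IP Connectivity" on the physical interfaces
   are assumed to be informational, so that they do not propagate the
   inhibition.

   # Interpretation of the rule above: a node's symptoms are replaced
   # by "Under Maintenance" if the node itself, or any subservice
   # reached through impacting dependencies, is under maintenance.
   impacting_deps = {   # impacting dependencies from Figure 2
       "tunnel-service":  ["peer1-tunnel-if", "ip-connectivity",
                           "peer2-tunnel-if"],
       "peer1-tunnel-if": ["peer1-phys-if"],
       "peer2-tunnel-if": ["peer2-phys-if"],
       "ip-connectivity": ["isis"],  # phys-if deps: informational
       "peer1-phys-if":   ["peer1-device"],
       "peer2-phys-if":   ["peer2-device"],
       "isis": [], "peer1-device": [], "peer2-device": [],
   }
   under_maintenance = {"peer1-device"}   # upgrading Peer1's OS

   def inhibited(node):
       return (node in under_maintenance
               or any(inhibited(d) for d in impacting_deps[node]))

   def reported_symptoms(node, raw_symptoms):
       return ["Under Maintenance"] if inhibited(node) \
              else raw_symptoms

   print(reported_symptoms("peer1-phys-if", ["Interface flapping"]))
   # -> ['Under Maintenance']
   print(reported_symptoms("ip-connectivity", ["High latency"]))
   # -> ['High latency']: unaffected, as in the first example above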
3.8.  Flexible Architecture

   The SAIN architecture is flexible in terms of components.  While
   the SAIN architecture in Figure 1 makes a distinction between two
   components, the service configuration orchestrator and the SAIN
   orchestrator, in practice those two components are most likely
   combined.  Similarly, the SAIN agents are displayed in Figure 1 as
   separate components.  Practically, the SAIN agents could be either
   independent components or directly integrated into monitored
   entities.  A practical example is an agent in a router.

   The SAIN architecture is also flexible in terms of services and
   subservices.  Most examples in this document deal with the notion
   of Network Service YANG modules, with well-known services such as
   L2VPN or tunnels.  However, the concept of services is general
   enough to cross into different domains.  One of them is the domain
   of service management on network elements, which also requires its
   own assurance.  Examples include a DHCP server on a Linux server, a
   data plane, an IPFIX export, etc.  The notion of "service" is
   generic in this architecture.  Indeed, a configured service can
   itself be a subservice for someone else: a DHCP server, a data
   plane, or an IPFIX export can be considered a subservice for a
   device, exactly as a routing instance can be considered a
   subservice for an L3VPN, a tunnel a subservice for an application
   in the cloud, or a service function a subservice for a service
   function chain [RFC7665].  The assurance graph is created to be
   flexible and open, regardless of the subservice types, locations,
   or domains.

   The SAIN architecture is also flexible in terms of distributed
   graphs.  As shown in Figure 1, our architecture comprises several
   agents.  Each agent is responsible for handling a subgraph of the
   assurance graph.  The collector is responsible for fetching the
   subgraphs from the different agents and gluing them together.  As
   an example, in the graph from Figure 2, the subservices relative to
   Peer1 might be handled by a different agent than the subservices
   relative to Peer2, and the IP Connectivity and IS-IS subservices
   might be handled by yet another agent.  The agents will export
   their partial graphs and the collector will stitch them together as
   dependencies of the service instance.

   And finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all, examples in this document refer to
   physical components, but this is not a constraint.  Indeed, the
   assurance of virtual components would follow the same principles,
   and an assurance graph composed of virtualized components (or a mix
   of virtualized and physical ones) is well possible within this
   architecture.

3.9.  Timing

   The SAIN architecture requires time synchronization, with the
   Network Time Protocol (NTP) [RFC5905] as a candidate, between all
   elements: monitored entities, SAIN agents, the service
   configuration orchestrator, the SAIN collector, as well as the SAIN
   orchestrator.  This guarantees that all symptoms in the system are
   correlated with the right assurance graph version.

   The SAIN agent might have to remove some symptoms for specific
   subservices, because they are outdated and no longer relevant, or
   simply because the SAIN agent needs to free up some space.
   Regardless of the reason, it is important for a SAIN collector
   (re-)connecting to a SAIN agent to understand the effect of this
   garbage collection.  Therefore, the SAIN agent contains a YANG
   object specifying the date and time at which the symptoms history
   starts for the subservice instances.

3.10.  New Assurance Graph Generation

   The assurance graph will change over time, because services and
   subservices come and go (changing the dependencies between
   subservices), or simply because a subservice is now under
   maintenance.  Therefore, an assurance graph version must be
   maintained, along with the date and time of its last generation.
   The date and time of a particular subservice instance change
   (again, dependencies or under-maintenance flags) might be kept.
   From a client point of view, an assurance graph change is signaled
   by the value of the assurance-graph-version and assurance-graph-
   last-change YANG leaves.  At that point in time, the client
   (collector) follows this process, sketched below:

   *  Keep the previous assurance-graph-last-change value (let's call
      it time T).

   *  Run through all subservice instances and process the subservice
      instances for which the last-change is newer than time T.

   *  Keep the new assurance-graph-last-change as the new reference
      date and time.
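   The following non-normative Python sketch illustrates this client-
   side process.  The leaf names mirror assurance-graph-version and
   assurance-graph-last-change from
   [I-D.ietf-opsawg-service-assurance-yang]; the surrounding data
   structures are illustrative assumptions.

   def refresh_graph(agent, known_last_change):
       """Process only subservices changed since the last sync."""
       t = known_last_change                  # previous value, time T
       for subservice in agent["subservices"]:
           if subservice["last-change"] > t:  # newer than time T
               process(subservice)            # e.g., update local copy
       return agent["assurance-graph-last-change"]   # new reference

   def process(subservice):
       print("updating", subservice["id"])

   # ISO 8601 timestamps in UTC compare correctly as strings.
   agent = {
       "assurance-graph-version": 42,
       "assurance-graph-last-change": "2022-03-07T10:00:00Z",
       "subservices": [
           {"id": "interface peer1/eth0",
            "last-change": "2022-03-07T10:00:00Z"},
           {"id": "device peer1",
            "last-change": "2022-01-01T00:00:00Z"},
       ],
   }
   last_sync = refresh_graph(agent, "2022-02-01T00:00:00Z")
   # only "interface peer1/eth0" is processed; last_sync holds the
   # new reference date and time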
4.  Security Considerations

   The SAIN architecture helps operators to reduce the mean time to
   detect and the mean time to repair.  As such, it should not cause
   any security threats.  However, the SAIN agents must be secure: a
   compromised SAIN agent could be sending wrong root causes or
   symptoms to the management systems.

   Except for the configuration of telemetry, the agents do not need
   "write access" to the devices they monitor.  This configuration is
   applied with a YANG module, whose protection is covered by Secure
   Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

   The data collected by SAIN could potentially be compromising to the
   network or provide more insight into how the network is designed.
   Considering the data that SAIN requires (including CLI access in
   some cases), one should weigh data access concerns against the
   impact that reduced visibility will have on being able to rapidly
   identify root causes.

   If a closed-loop system relies on this architecture, then the well-
   known issues of such systems also apply, i.e., a lying device or
   compromised agent could trigger partial reconfiguration of the
   service or network.  The SAIN architecture neither augments nor
   reduces this risk.

5.  IANA Considerations

   This document includes no request to IANA.

6.  Contributors

   *  Youssef El Fathi

   *  Eric Vyncke

7.  Open Issues

   Refer to the Intent-Based Networking NMRG documents (intent
   assurance, service intent as a synonym for a custom service model;
   see [I-D.irtf-nmrg-ibn-concepts-definitions] and
   [I-D.irtf-nmrg-ibn-intent-classification]).

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and
              Algorithms Specification", RFC 5905,
              DOI 10.17487/RFC5905, June 2010,
              <https://www.rfc-editor.org/info/rfc5905>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.ietf-opsawg-service-assurance-yang]
              Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and
              T. Arumugam, "YANG Modules for Service Assurance", Work
              in Progress, Internet-Draft, draft-ietf-opsawg-service-
              assurance-yang-02, 4 January 2022,
              <https://datatracker.ietf.org/doc/html/draft-ietf-
              opsawg-service-assurance-yang-02>.

   [I-D.irtf-nmrg-ibn-concepts-definitions]
              Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
              Tantsura, "Intent-Based Networking - Concepts and
              Definitions", Work in Progress, Internet-Draft, draft-
              irtf-nmrg-ibn-concepts-definitions-06, 15 December 2021,
              <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-
              ibn-concepts-definitions-06>.

   [I-D.irtf-nmrg-ibn-intent-classification]
              Li, C., Havel, O., Olariu, A., Martinez-Julia, P.,
              Nobre, J. C., and D. R. Lopez, "Intent Classification",
              Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-
              intent-classification-06, 22 February 2022,
              <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-
              ibn-intent-classification-06>.

   [Piovesan2017]
              Piovesan, A. and E. Griffor, "Reasoning About Safety and
              Security: The Logic of Assurance", 2017.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, DOI 10.17487/RFC2865, June 2000,
              <https://www.rfc-editor.org/info/rfc2865>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration
              Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241,
              June 2011, <https://www.rfc-editor.org/info/rfc6241>.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
              <https://www.rfc-editor.org/info/rfc6242>.
   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7149]  Boucadair, M. and C. Jacquenet, "Software-Defined
              Networking: A Perspective from within a Service Provider
              Environment", RFC 7149, DOI 10.17487/RFC7149, March
              2014, <https://www.rfc-editor.org/info/rfc7149>.

   [RFC7665]  Halpern, J., Ed. and C. Pignataro, Ed., "Service
              Function Chaining (SFC) Architecture", RFC 7665,
              DOI 10.17487/RFC7665, October 2015,
              <https://www.rfc-editor.org/info/rfc7665>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

   [RFC8907]  Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L.
              Grant, "The Terminal Access Controller Access-Control
              System Plus (TACACS+) Protocol", RFC 8907,
              DOI 10.17487/RFC8907, September 2020,
              <https://www.rfc-editor.org/info/rfc8907>.

   [RFC8969]  Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and
              L. Geng, "A Framework for Automating Service and Network
              Management with YANG", RFC 8969, DOI 10.17487/RFC8969,
              January 2021, <https://www.rfc-editor.org/info/rfc8969>.

Appendix A.  Changes between revisions

   v00 - v01

   *  Cover the feedback received during the WG call for adoption.

Acknowledgements

   The authors would like to thank Stephane Litkowski, Charles Eckel,
   Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan
   Vallin, Eric Vyncke, and Mohamed Boucadair for their reviews and
   feedback.

Authors' Addresses

   Benoit Claise
   Huawei
   Email: benoit.claise@huawei.com

   Jean Quilbeuf
   Huawei
   Email: jean.quilbeuf@huawei.com

   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid 28006
   Spain
   Email: diego.r.lopez@telefonica.com

   Dan Voyer
   Bell Canada
   Canada
   Email: daniel.voyer@bell.ca

   Thangam Arumugam
   Cisco Systems, Inc.
   Milpitas, California
   United States of America
   Email: tarumuga@cisco.com