OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                                    Huawei
Expires: January 3, 2022                                        D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                     Cisco Systems, Inc.
                                                            July 2, 2021

      Service Assurance for Intent-based Networking Architecture
           draft-ietf-opsawg-service-assurance-architecture-01

Abstract

   This document describes an architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are running as expected.  As services rely
   upon multiple subservices provided by the underlying network devices
   and functions, getting the assurance of a healthy service is only
   possible with a holistic view of all involved elements.  This
   architecture not only helps to correlate a service degradation with
   its network root cause but also identifies which services are
   impacted when a network component fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Inferring a Service Instance Configuration into an
           Assurance Graph
     3.2.  Intent and Assurance Graph
     3.3.  Subservices
     3.4.  Building the Expression Graph from the Assurance Graph
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
     3.7.  Handling Maintenance Windows
     3.8.  Flexible Architecture
     3.9.  Timing
     3.10. New Assurance Graph Generation
   4.  Security Considerations
   5.  IANA Considerations
   6.  Contributors
   7.  Open Issues
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SAIN agent: A functional component that communicates with a device,
   a set of devices, or another agent to build an expression graph from
   a received assurance graph and perform the corresponding computation
   of the health status and symptoms.

   Assurance case: According to [Piovesan2017]: "An assurance case is a
   structured argument, supported by evidence, intended to justify that
   a system is acceptably assured relative to a concern (such as safety
   or security) in the intended operating environment."

   Assurance graph: A Directed Acyclic Graph (DAG) representing the
   assurance case for one or several service instances.  The nodes
   (also known as vertices in the context of a DAG) are the service
   instances themselves and the subservices; the edges indicate
   dependency relations.
   SAIN collector: A functional component that fetches or receives the
   computer-consumable output of the SAIN agent(s) and displays it in a
   user-friendly form or processes it locally.

   DAG: Directed Acyclic Graph.

   ECMP: Equal-Cost Multipath.

   Expression graph: A generic term for a DAG representing a
   computation in SAIN.  More specific terms are:

   o  Subservice expression: An expression graph representing all the
      computations to execute for a subservice.

   o  Service expression: An expression graph representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   o  Global computation graph: An expression graph representing all
      the computations to execute for all service instances (i.e., all
      computations performed).

   Dependency: The directed relationship between subservice instances
   in the assurance graph.

   Informational Dependency: Type of dependency whose health score does
   not impact the health score of its parent subservice or service
   instance(s) in the assurance graph.  However, the symptoms should be
   taken into account in the parent service instance or subservice
   instance(s), for informational reasons.

   Impacting Dependency: Type of dependency whose health score impacts
   the health score of its parent subservice or service instance(s) in
   the assurance graph.  The symptoms are taken into account in the
   parent service instance or subservice instance(s), as impacting
   reasons.

   Metric: A piece of information retrieved from the network running
   the assured service.

   Metric engine: A functional component that maps metrics to a list of
   candidate metric implementations depending on the network element.

   Metric implementation: Actual way of retrieving a metric from a
   network element.

   Network service YANG module: Describes the characteristics of a
   service as agreed upon with consumers of that service [RFC8199].

   Service instance: A specific instance of a service.

   Service configuration orchestrator: Quoting [RFC8199]: "Network
   Service YANG Modules describe the characteristics of a service, as
   agreed upon with consumers of that service.  That is, a service
   module does not expose the detailed configuration parameters of all
   participating network elements and features but describes an
   abstract model that allows instances of the service to be decomposed
   into instance data according to the Network Element YANG Modules of
   the participating network elements.  The service-to-element
   decomposition is a separate process; the details depend on how the
   network operator chooses to realize the service.  For the purpose of
   this document, the term "orchestrator" is used to describe a system
   implementing such a process."

   SAIN orchestrator: A functional component that is in charge of
   fetching the configuration specific to each service instance and
   converting it into an assurance graph.

   Health status: Score and symptoms indicating whether a service
   instance or a subservice is "healthy".  A non-maximal score must
   always be explained by one or more symptoms.

   Health score: Integer ranging from 0 to 100 indicating the health of
   a subservice.  A score of 0 means that the subservice is broken; a
   score of 100 means that the subservice in question is operating as
   expected.

   Subservice: Part or functionality of the network system that can be
   independently assured as a single entity in the assurance graph.

   Symptom: Reason explaining why a service instance or a subservice is
   not completely healthy.
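   To make these definitions concrete, here is a minimal sketch (in
   Python; illustrative only and not part of any SAIN specification) of
   a health status as defined above: a score between 0 and 100, where
   any non-maximal score must be explained by at least one symptom.

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Symptom:
          """Reason why an instance is not completely healthy."""
          description: str   # e.g., "Interface has high error rate"

      @dataclass
      class HealthStatus:
          """Score and symptoms of a service or subservice instance."""
          score: int                      # 0 = broken, 100 = healthy
          symptoms: List[Symptom] = field(default_factory=list)

          def __post_init__(self):
              if not 0 <= self.score <= 100:
                  raise ValueError("health score must be in [0, 100]")
              # A non-maximal score must be explained by symptoms.
              if self.score < 100 and not self.symptoms:
                  raise ValueError("non-maximal score needs symptoms")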
2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting [RFC8199]: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."

   Service configuration orchestrators deploy Network Service YANG
   Modules [RFC8199] that will infer network-wide configuration and,
   therefore, the configuration of the appropriate device modules
   (Section 3 of [RFC8969]).  Network configuration is based on these
   device YANG modules, with protocols/encodings such as NETCONF/XML
   [RFC6241], RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Since
   knowing that a configuration has been applied does not imply that
   the service is running as expected (e.g., the service might be
   degraded because of a failure in the network), the network operator
   must monitor the service operational data at the same time as the
   configuration (Section 3.3 of [RFC8969]).  The industry has been
   standardizing on telemetry to push network element performance
   information.

   A network administrator needs to monitor her network and services as
   a whole, independently of the use cases or the management protocols.
   With different protocols come different data models, and different
   ways to model the same type of information.  When network
   administrators deal with multiple protocols, the network management
   system must perform the difficult and time-consuming job of mapping
   data models: the model used for configuration with the model used
   for monitoring.  This problem is compounded by a large, disparate
   set of data sources (MIB modules, YANG models [RFC7950], IPFIX
   information elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [RFC8907], RADIUS [RFC2865], etc.).  In order to avoid this data
   model mapping, the industry converged on model-driven telemetry to
   stream the service operational data, reusing the YANG models used
   for configuration.  Model-driven telemetry greatly facilitates the
   notion of closed-loop automation, whereby events/status from the
   network drive remediation changes back into the network.

   However, it proves difficult for network operators to correlate the
   service degradation with the network root cause.  For example, why
   does my L3VPN fail to connect?  Why is this specific service slow?
   The reverse, i.e., which services are impacted when a network
   component fails or degrades, is even more interesting for the
   operators.
   For example, which services are impacted when this specific optic
   dBm begins to degrade?  Which applications are impacted by this ECMP
   imbalance?  Is that issue actually impacting any other customers?

   Intent-based approaches are often declarative, starting from a
   statement of "The service works as expected" and trying to enforce
   it.  Such approaches are mainly suited for greenfield deployments.

   Aligned with Section 3.3 of [RFC7149], and instead of approaching
   intent in a declarative way, this architecture focuses on already
   defined services and tries to infer the meaning of "The service
   works as expected".  To do so, the architecture works from an
   assurance graph, deduced from the service definition and from the
   network configuration.  In some cases, the assurance graph may also
   be explicitly completed to add an intent not exposed in the service
   model itself (e.g., the service must rely on a backup physical
   path).  This assurance graph is decomposed into components, which
   are then assured independently.  The root of the assurance graph
   represents the service to assure, and its children represent
   components identified as its direct dependencies; each component can
   have dependencies as well.  The SAIN architecture updates the
   assurance graph when services are modified or when the network
   conditions change.

   When a service is degraded, the SAIN architecture will highlight, to
   the best of its knowledge, where in the assurance graph to look, as
   opposed to going hop by hop to troubleshoot the issue.  Not only can
   this architecture help to correlate service degradation with the
   network root cause/symptoms, but it can deduce from the assurance
   graph the number and type of services impacted by a component
   degradation/failure.  This added value informs the operational team
   where to focus its attention for maximum return.  Indeed, the
   operational team should prioritize the degrading or failing
   components impacting the highest number of customers, especially
   those with SLA contracts involving penalties in case of failure.

   This architecture provides the building blocks to assure both
   physical and virtual entities, and is flexible in terms of services
   and subservices, (distributed) graphs, and components (Section 3.8).

3.  Architecture

   The goal of SAIN is to assure that service instances are operating
   correctly and, if not, to pinpoint what is wrong.  More precisely,
   SAIN computes a score for each service instance and outputs symptoms
   explaining that score, especially why the score is not maximal.  The
   score augmented with the symptoms is called the health status.

   The SAIN architecture is generic and applicable to multiple
   environments: wireline as well as wireless, and different domains
   such as 5G or an NFV domain with a virtual infrastructure manager
   (VIM).  As already noted, it applies to physical or virtual devices,
   as well as virtual functions.  Thanks to the distributed graph
   design principle, graphs from different environments/orchestrators
   can be combined together.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., pseudowire).
   Such a service would take as parameters the two ends of the
   connection (device, interface or subinterface, and address of the
   other end) and configure both devices (and maybe more) so that an
   L2VPN connection is established between the two devices.  Examples
   of symptoms might be "Interface has high error rate", "Interface
   flapping", or "Device almost out of memory".

   To compute the health status of such a service, the service
   definition is decomposed into an assurance graph formed by
   subservices linked through dependencies.  Each subservice is then
   turned into an expression graph that details how to fetch metrics
   from the devices and compute the health status of the subservice.
   The subservice expressions are combined according to the
   dependencies between the subservices in order to obtain the
   expression graph that computes the health status of the service.

   The overall SAIN architecture is presented in Figure 1.  Based on
   the service configuration, the SAIN orchestrator decomposes the
   service into an assurance graph, to the best of its knowledge.  It
   then sends the assurance graph, along with some other configuration
   options, to the SAIN agents.  The SAIN agents are responsible for
   building the expression graph and computing the health statuses in a
   distributed manner.  The collector is in charge of collecting and
   displaying the current inferred health status of the service
   instances and subservices.  Finally, the automation loop is closed
   by having the SAIN collector provide feedback to the network/service
   orchestrator.

   In order to make agents, orchestrators, and collectors from
   different vendors interoperable, their interface is defined as a
   YANG model in a companion document
   [I-D.ietf-opsawg-service-assurance-yang].  In Figure 1, the
   communications that are normalized by this model are tagged with a
   "Y".  The use of these YANG modules is further explained in
   Section 3.6.

     +-----------------+
     |     Service     |
     |  Configuration  |<--------------------+
     |  Orchestrator   |                     |
     +-----------------+                     |
        |            |                       |
        |            | Network               |
        |            | Service               | Feedback
        |            | Instance              | Loop
        |            | Configuration         |
        |            |                       |
        |            V                       |
        |   +-----------------+     +-------------------+
        |   |      SAIN       |     |       SAIN        |
        |   |  Orchestrator   |     |     Collector     |
        |   +-----------------+     +-------------------+
        |            |                       ^
        |           Y| Configuration         | Health Status
        |            | (assurance graph)    Y| (Score + Symptoms)
        |            V                       |  Streamed
        |   +-------------------+            |  via Telemetry
        |   |+-------------------+           |
        |   ||+-------------------+          |
        |   +||       SAIN        |----------+
        |    +|       agent       |
        |     +-------------------+
        |           ^  ^  ^
        |           |  |  |
        |           |  |  |  Metric Collection
        V           V  V  V
     +-------------------------------------------------------------+
     |                     Monitored Entities                      |
     |                                                             |
     +-------------------------------------------------------------+

                      Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks, illustrated by the sketch
   after this list:

   o  Analyze the configuration pushed to the network device(s) for
      configuring the service instance and decide which information is
      needed from the device(s) (such a piece of information being
      called a metric) and which operations to apply to the metrics for
      computing the health status.

   o  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible, else continuously poll.

   o  Continuously compute the health status of the service instances,
      based on the metric values.
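   The following sketch (hypothetical Python; all names are
   illustrative assumptions, not part of any SAIN specification) shows
   the skeleton of an agent performing the last two tasks: collecting
   metric values, preferring telemetry subscriptions and falling back
   to polling, and continuously recomputing health statuses.

      import time

      POLL_INTERVAL = 30   # seconds; an assumption for this sketch

      def collect(metric):
          """Fetch one metric value, preferring telemetry."""
          if metric.supports_telemetry:
              return metric.last_streamed_value()  # pushed [RFC8641]
          return metric.poll()                     # fallback: polling

      def agent_loop(expression_graph, metrics):
          """Continuously compute health statuses from metrics."""
          while True:
              values = {m.name: collect(m) for m in metrics}
              for service in expression_graph.services():
                  status = service.evaluate(values)  # score + symptoms
                  service.export(status)   # streamed to the collector
              time.sleep(POLL_INTERVAL)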
390 o Continuously compute the health status of the service instances, 391 based on the metric values. 393 3.1. Inferring a Service Instance Configuration into an Assurance Graph 395 In order to structure the assurance of a service instance, the 396 service instance is decomposed into so-called subservice instances. 397 Each subservice instance focuses on a specific feature or subpart of 398 the service. 400 The decomposition into subservices is an important function of this 401 architecture, for the following reasons. 403 o The result of this decomposition provides a relational picture of 404 a service instance, that can be represented as a graph (called 405 assurance graph) to the operator. 407 o Subservices provide a scope for particular expertise and thereby 408 enable contribution from external experts. For instance, the 409 subservice dealing with the optics health should be reviewed and 410 extended by an expert in optical interfaces. 412 o Subservices that are common to several service instances are 413 reused for reducing the amount of computation needed. 415 The assurance graph of a service instance is a DAG representing the 416 structure of the assurance case for the service instance. The nodes 417 of this graph are service instances or subservice instances. Each 418 edge of this graph indicates a dependency between the two nodes at 419 its extremities: the service or subservice at the source of the edge 420 depends on the service or subservice at the destination of the edge. 422 Figure 2 depicts a simplistic example of the assurance graph for a 423 tunnel service. The node at the top is the service instance, the 424 nodes below are its dependencies. In the example, the tunnel service 425 instance depends on the "peer1" and "peer2" tunnel interfaces, which 426 in turn depend on the respective physical interfaces, which finally 427 depend on the respective "peer1" and "peer2" devices. The tunnel 428 service instance also depends on the IP connectivity that depends on 429 the IS-IS routing protocol. 431 +------------------+ 432 | Tunnel | 433 | Service Instance | 434 +------------------+ 435 | 436 +--------------------+-------------------+ 437 | | | 438 +-------------+ +--------------+ +-------------+ 439 | Peer1 | | IP | | Peer2 | 440 | Tunnel | | Connectivity | | Tunnel | 441 | Interface | /| |\ | Interface | 442 +-------------+ / +--------------+ \ +-------------+ 443 | / | \ | 444 +-------------+/ +-------------+ \+-------------+ 445 | Peer1 | | IS-IS | | Peer2 | 446 | Physical | | Routing | | Physical | 447 | Interface | | Protocol | | Interface | 448 +-------------+ +-------------+ +-------------+ 449 | | 450 +-------------+ +-------------+ 451 | | | | 452 | Peer1 | | Peer2 | 453 | Device | | Device | 454 +-------------+ +-------------+ 456 Figure 2: Assurance Graph Example 458 Depicting the assurance graph helps the operator to understand (and 459 assert) the decomposition. The assurance graph shall be maintained 460 during normal operation with addition, modification and removal of 461 service instances. A change in the network configuration or topology 462 shall be reflected in the assurance graph. As a first example, a 463 change of routing protocol from IS-IS to OSPF would change the 464 assurance graph accordingly. As a second example, assuming that ECMP 465 is in place for the source router for that specific tunnel; in that 466 case, multiple interfaces must now be monitored, on top of the 467 monitoring the ECMP health itself. 469 3.2. 
3.2.  Intent and Assurance Graph

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   o  Try to capture the intent of the service instance, i.e., what the
      service instance is trying to achieve.

   o  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance graph.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each device.  Then:

   o  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the current
      state of SAIN; however, it does not completely capture the
      intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   o  Decomposing the service instance into subservices would result in
      the assurance graph depicted in Figure 2, for instance.

   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two components
   are most likely combined.  The internals of the orchestrator are
   currently out of scope of this document.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   "subservice assurance", that is, the method for assuring that a
   subservice behaves correctly.

   Subservices, just like services, have high-level parameters that
   specify the type and specific instance to be assured.  For example,
   assuring a device requires the specific deviceId as parameter;
   assuring an interface requires the specific combination of deviceId
   and interfaceId.

   A subservice is also characterized by a list of metrics to fetch and
   a list of computations to apply to these metrics in order to infer a
   health status.

3.4.  Building the Expression Graph from the Assurance Graph

   From the assurance graph is derived a so-called global computation
   graph.  First, each subservice instance is transformed into a set of
   subservice expressions that take metrics and constants as input
   (i.e., sources of the DAG) and produce the status of the subservice,
   based on some heuristics.  Then, for each service instance, the
   service expressions are constructed by combining the subservice
   expressions of its dependencies.  The way service expressions are
   combined depends on the dependency types (impacting or
   informational).  Finally, the global computation graph is built by
   combining the service expressions.  In other words, the global
   computation graph encodes all the operations needed to produce
   health statuses from the collected metrics.
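   As an illustration of how dependency types could affect the
   combination, the following sketch (hypothetical Python, reusing the
   HealthStatus class from the sketch in Section 1; the "minimum score"
   rule is an assumed heuristic, not mandated by this architecture)
   combines a node's own status with the statuses of its dependencies:

      def combine(own_status, dependencies):
          """Combine a node's status with its dependencies' statuses.

          'dependencies' is a list of (dep_type, status) pairs, where
          dep_type is "impacting" or "informational".
          """
          score = own_status.score
          symptoms = list(own_status.symptoms)
          for dep_type, status in dependencies:
              if dep_type == "impacting":
                  # The dependency's score impacts the parent's score.
                  score = min(score, status.score)
              # In both cases, symptoms are propagated to the parent,
              # as impacting or informational reasons respectively.
              symptoms += status.symptoms
          return HealthStatus(score, symptoms)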
   Subservices shall be device independent.  To justify this, let us
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected by an industry-accepted
   YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
   even by a MIB module.  If the subservice were dependent on the
   mechanism to collect the operational status, then we would need
   multiple subservice definitions in order to support all the
   different mechanisms.  This also implies that, while waiting for all
   the metrics to be available via standard YANG modules, SAIN agents
   might have to retrieve metric values via non-standard YANG models,
   via MIB modules, the Command Line Interface (CLI), etc., effectively
   implementing a normalization layer between data models and
   information models.

   In order to keep subservices independent from the metric collection
   method, or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the architecture introduces
   the concept of "metric engine".  The metric engine maps each device-
   independent metric used in the subservices to a list of device-
   specific metric implementations that precisely define how to fetch
   values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
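   The following sketch (hypothetical Python; the XPaths and the MIB
   object are real data model paths, but the mapping structure itself
   is an illustrative assumption) shows how a metric engine might map
   the device-independent metric "interface-oper-status" to candidate
   implementations:

      # Candidate implementations for one device-independent metric,
      # tried in order of preference.
      METRIC_IMPLEMENTATIONS = {
          "interface-oper-status": [
              {"requires": "ietf-interfaces",        # IETF YANG
               "method": "yang-push",
               "path": "/ietf-interfaces:interfaces/interface"
                       "/oper-status"},
              {"requires": "openconfig-interfaces",  # OpenConfig YANG
               "method": "yang-push",
               "path": "/interfaces/interface/state/oper-status"},
              {"requires": "IF-MIB",                 # SNMP fallback
               "method": "snmp",
               "path": "IF-MIB::ifOperStatus"},
          ],
      }

      def select_implementation(metric_name, device_capabilities):
          """Pick the first implementation the device supports."""
          for impl in METRIC_IMPLEMENTATIONS[metric_name]:
              if impl["requires"] in device_capabilities:
                  return impl
          raise LookupError("no implementation for " + metric_name)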
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of scope of this document.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to the YANG modules specified in "YANG Modules for Service
   Assurance" [I-D.ietf-opsawg-service-assurance-yang]; they specify
   objects for assuring network services based on their decomposition
   into so-called subservices, according to the SAIN architecture.

   These modules are intended for the following use cases:

   o  Assurance graph configuration:

      *  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      *  Dependencies: configure the dependencies between the
         subservices, along with their types.

   o  Assurance telemetry: export the health status of the subservices,
      along with the observed symptoms.

   Some examples of YANG instances can be found in Appendix A of
   [I-D.ietf-opsawg-service-assurance-yang].

3.7.  Handling Maintenance Windows

   Whenever network components are under maintenance, the operator
   wants to inhibit the emission of symptoms from those components.  A
   typical use case is device maintenance, during which the device is
   not supposed to be operational.  As such, symptoms related to the
   device health should be ignored, as well as symptoms related to the
   device-specific subservices, such as the interfaces, as their state
   changes are probably a consequence of the maintenance.

   To configure network components as "under maintenance" in the SAIN
   architecture, the ietf-service-assurance model proposed in
   [I-D.ietf-opsawg-service-assurance-yang] specifies an "under-
   maintenance" flag per service or subservice instance.  When, and
   only when, this flag is set, the companion field "maintenance-
   contact" must be set to a string that identifies the person or
   process who requested the maintenance.  When a service or subservice
   is flagged as under maintenance, it may report a generic "Under
   Maintenance" symptom, for propagation towards the subservices that
   depend on this specific subservice; any other symptom from this
   subservice, or from one of its impacting dependencies, MUST NOT be
   reported.

   We illustrate this mechanism with three independent examples based
   on the assurance graph depicted in Figure 2, followed by a short
   sketch of the inhibition rule:

   o  Device maintenance, for instance upgrading the device OS.  The
      operator sets the "under-maintenance" flag for the subservice
      "Peer1 Device".  This inhibits the emission of symptoms from
      "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
      Service Instance".  All other subservices are unaffected.

   o  Interface maintenance, for instance replacing a broken optic.
      The operator sets the "under-maintenance" flag for the subservice
      "Peer1 Physical Interface".  This inhibits the emission of
      symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
      Instance".  All other subservices are unaffected.

   o  Routing protocol maintenance, for instance modifying parameters
      or redistribution.  The operator sets the "under-maintenance"
      flag for the subservice "IS-IS Routing Protocol".  This inhibits
      the emission of symptoms from "IP Connectivity" and "Tunnel
      Service Instance".  All other subservices are unaffected.
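   The inhibition rule can be sketched as follows (hypothetical Python,
   not part of any SAIN specification; the graph fragment and the
   recursion over impacting dependencies are assumptions made for
   illustration):

      # Fragment of Figure 2, keeping only the impacting dependencies
      # that matter for the device maintenance example (an assumption).
      IMPACTING_DEPS = {
          "tunnel-service":    ["peer1-tunnel-if"],
          "peer1-tunnel-if":   ["peer1-physical-if"],
          "peer1-physical-if": ["peer1-device"],
          "peer1-device":      [],
          "isis-routing":      [],
      }

      def inhibited(node, under_maintenance, deps=IMPACTING_DEPS):
          """True if symptoms from 'node' must not be reported."""
          if node in under_maintenance:
              return True  # only "Under Maintenance" may be reported
          return any(inhibited(d, under_maintenance, deps)
                     for d in deps[node])

      # Flagging "peer1-device" inhibits the whole chain above it:
      flagged = {"peer1-device"}
      assert inhibited("tunnel-service", flagged)
      assert not inhibited("isis-routing", flagged)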
3.8.  Flexible Architecture

   The SAIN architecture is flexible in terms of components.  While the
   SAIN architecture in Figure 1 makes a distinction between two
   components, the service configuration orchestrator and the SAIN
   orchestrator, in practice those two components are most likely
   combined.  Similarly, the SAIN agents are displayed in Figure 1 as
   separate components.  Practically, the SAIN agents could be either
   independent components or directly integrated into monitored
   entities.  A practical example is an agent in a router.

   The SAIN architecture is also flexible in terms of services and
   subservices.  Most examples in this document deal with the notion of
   Network Service YANG modules, with well-known services such as L2VPN
   or tunnels.  However, the concept of service is general enough to
   cross into different domains.  One of them is the domain of service
   management on network elements, which also requires its own
   assurance.  Examples include a DHCP server on a Linux server, a data
   plane, an IPFIX export, etc.  The notion of "service" is generic in
   this architecture.  Indeed, a configured service can itself be a
   subservice of something else: a DHCP server, a data plane, or an
   IPFIX export can be considered subservices of a device; a routing
   instance can be considered a subservice of an L3VPN; a tunnel can be
   considered a subservice of an application in the cloud; and a
   service function can be considered a subservice of a service
   function chain [RFC7665].

   The assurance graph is created to be flexible and open, regardless
   of the subservice types, locations, or domains.

   The SAIN architecture is also flexible in terms of distributed
   graphs.  As shown in Figure 1, the architecture comprises several
   agents.  Each agent is responsible for handling a subgraph of the
   assurance graph.  The collector is responsible for fetching the
   subgraphs from the different agents and gluing them together.  As an
   example, in the graph from Figure 2, the subservices relative to
   Peer1 might be handled by a different agent than the subservices
   relative to Peer2, and the IP Connectivity and IS-IS subservices
   might be handled by yet another agent.  The agents will export their
   partial graphs, and the collector will stitch them together as
   dependencies of the service instance.

   Finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all, examples in this document refer to
   physical components, but this is not a constraint.  Indeed, the
   assurance of virtual components would follow the same principles,
   and an assurance graph composed of virtualized components (or a mix
   of virtualized and physical ones) is entirely possible within this
   architecture.

3.9.  Timing

   The SAIN architecture requires time synchronization, with the
   Network Time Protocol (NTP) [RFC5905] as a candidate, between all
   elements: monitored entities, SAIN agents, the Service Configuration
   Orchestrator, the SAIN collector, as well as the SAIN orchestrator.
   This guarantees that all symptoms in the system can be correlated
   with each other and with the right assurance graph version.

   The SAIN agent might have to remove some symptoms for specific
   subservices, because they are outdated and no longer relevant, or
   simply because the SAIN agent needs to free up some space.
   Regardless of the reason, it is important for a SAIN collector
   (re-)connecting to a SAIN agent to understand the effect of this
   garbage collection.  Therefore, the SAIN agent contains a YANG
   object specifying the date and time at which the symptoms history
   starts for the subservice instances.

3.10.  New Assurance Graph Generation

   The assurance graph will change over time, because services and
   subservices come and go (changing the dependencies between
   subservices), or simply because a subservice is now under
   maintenance.  Therefore, an assurance graph version must be
   maintained, along with the date and time of its last generation.
   The date and time of the last change of a particular subservice
   instance (again, its dependencies or under-maintenance flag) might
   also be kept.  From a client point of view, an assurance graph
   change is detected via the values of the assurance-graph-version and
   assurance-graph-last-change YANG leafs.  At that point in time, the
   client (collector) follows this process, sketched in code after this
   list:

   o  Keep the previous assurance-graph-last-change value (let us call
      it time T).

   o  Run through all subservice instances and process the subservice
      instances for which the last-change is newer than time T.

   o  Keep the new assurance-graph-last-change as the new reference
      date and time.
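   A minimal sketch of this client-side process (hypothetical Python;
   the client API and the process() helper are assumptions made for
   illustration, with leaf names taken from the text above):

      def process(subservice):
          """Update the local copy of one subservice (left abstract)."""
          pass

      def refresh_graph(client, T):
          """Re-synchronize after a graph change; T is the previously
          kept assurance-graph-last-change value."""
          new_last_change = client.get("assurance-graph-last-change")
          if new_last_change == T:
              return T                        # no change detected
          for subservice in client.get("subservice-instances"):
              if subservice["last-change"] > T:
                  process(subservice)         # only newer instances
          return new_last_change              # new reference time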
4.  Security Considerations

   The SAIN architecture helps operators to reduce the mean time to
   detect and the mean time to repair.  As such, it should not cause
   any security threats.  However, the SAIN agents must be secure: a
   compromised SAIN agent could send wrong root causes or symptoms to
   the management systems.

   Except for the configuration of telemetry, the agents do not need
   "write access" to the devices they monitor.  This configuration is
   applied with a YANG module, whose protection is covered by Secure
   Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

   The data collected by SAIN could potentially be compromising to the
   network or provide more insight into how the network is designed.
   Considering the data that SAIN requires (including CLI access in
   some cases), one should weigh data access concerns against the
   impact that reduced visibility will have on being able to rapidly
   identify root causes.

   If a closed-loop system relies on this architecture, then the well-
   known issues of such systems also apply, i.e., a lying device or a
   compromised agent could trigger partial reconfiguration of the
   service or network.  The SAIN architecture neither augments nor
   reduces this risk.

5.  IANA Considerations

   This document includes no request to IANA.

6.  Contributors

   o  Youssef El Fathi

   o  Eric Vyncke

7.  Open Issues

   Refer to the Intent-based Networking NMRG documents (Intent
   Assurance, Service Intent: synonym for custom service model; see
   [I-D.irtf-nmrg-ibn-concepts-definitions] and
   [I-D.irtf-nmrg-ibn-intent-classification]).

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June
              2010, <https://www.rfc-editor.org/info/rfc5905>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.ietf-opsawg-service-assurance-yang]
              Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T.
              Arumugam, "YANG Modules for Service Assurance", draft-
              ietf-opsawg-service-assurance-yang-00 (work in progress),
              May 2021.

   [I-D.irtf-nmrg-ibn-concepts-definitions]
              Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
              Tantsura, "Intent-Based Networking - Concepts and
              Definitions", draft-irtf-nmrg-ibn-concepts-definitions-03
              (work in progress), February 2021.

   [I-D.irtf-nmrg-ibn-intent-classification]
              Li, C., Havel, O., Liu, W., Olariu, A., Martinez-Julia,
              P., Nobre, J. C., and D. R. Lopez, "Intent
              Classification", draft-irtf-nmrg-ibn-intent-
              classification-03 (work in progress), March 2021.

   [Piovesan2017]
              Piovesan, A. and E. Griffor, "Reasoning About Safety and
              Security: The Logic of Assurance", 2017.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, DOI 10.17487/RFC2865, June 2000,
              <https://www.rfc-editor.org/info/rfc2865>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
              <https://www.rfc-editor.org/info/rfc6241>.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
              <https://www.rfc-editor.org/info/rfc6242>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.
   [RFC7149]  Boucadair, M. and C. Jacquenet, "Software-Defined
              Networking: A Perspective from within a Service Provider
              Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014,
              <https://www.rfc-editor.org/info/rfc7149>.

   [RFC7665]  Halpern, J., Ed. and C. Pignataro, Ed., "Service Function
              Chaining (SFC) Architecture", RFC 7665,
              DOI 10.17487/RFC7665, October 2015,
              <https://www.rfc-editor.org/info/rfc7665>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8309]  Wu, Q., Liu, W., and A. Farrel, "Service Models
              Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018,
              <https://www.rfc-editor.org/info/rfc8309>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

   [RFC8907]  Dahm, T., Ota, A., Medway Gash, D., Carrel, D., and L.
              Grant, "The Terminal Access Controller Access-Control
              System Plus (TACACS+) Protocol", RFC 8907,
              DOI 10.17487/RFC8907, September 2020,
              <https://www.rfc-editor.org/info/rfc8907>.

   [RFC8969]  Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and
              L. Geng, "A Framework for Automating Service and Network
              Management with YANG", RFC 8969, DOI 10.17487/RFC8969,
              January 2021, <https://www.rfc-editor.org/info/rfc8969>.

Appendix A.  Changes between revisions

   v00 - v01

   o  Cover the feedback received during the WG call for adoption

Acknowledgements

   The authors would like to thank Stephane Litkowski, Charles Eckel,
   Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin,
   Eric Vyncke, and Mohamed Boucadair for their reviews and feedback.

Authors' Addresses

   Benoit Claise
   Huawei

   Email: benoit.claise@huawei.com

   Jean Quilbeuf
   Huawei

   Email: jean.quilbeuf@huawei.com

   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid  28006
   Spain

   Email: diego.r.lopez@telefonica.com

   Dan Voyer
   Bell Canada
   Canada

   Email: daniel.voyer@bell.ca

   Thangam Arumugam
   Cisco Systems, Inc.
   Milpitas (California)
   United States of America

   Email: tarumuga@cisco.com