idnits 2.17.1

draft-claise-opsawg-service-assurance-architecture-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does
     not match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning.  Boilerplate error?  (The document does seem to
     have the reference to RFC 2119 which the ID-Checklist requires).

  -- The document date (January 2, 2021) is 1209 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- No information found for draft-claise-opsawg-service-assurance-yang -
     is the name correct?

  -- Obsolete informational reference (is this intentional?): RFC 3164
     (Obsoleted by RFC 5424)

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

OPSAWG                                                         B. Claise
Internet-Draft                                       Cisco Systems, Inc.
Intended status: Informational                               J. Quilbeuf
Expires: July 6, 2021                                        Independent
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                      Cisco Systems, Inc.
                                                         January 2, 2021

       Service Assurance for Intent-based Networking Architecture
          draft-claise-opsawg-service-assurance-architecture-04

Abstract

   This document describes an architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are correctly running.  As services rely on
   multiple sub-services provided by the underlying network devices,
   getting the assurance of a healthy service is only possible with a
   holistic view of the network devices.  This architecture not only
   helps to correlate a service degradation with the network root cause
   but also identifies the services impacted when a network component
   fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 6, 2021.
Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Decomposing a Service Instance Configuration into an
           Assurance Graph
     3.2.  Intent and Assurance Graph
     3.3.  Subservices
     3.4.  Building the Expression Graph from the Assurance Graph
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
     3.7.  Handling Maintenance Windows
     3.8.  Flexible Architecture
     3.9.  Timing
     3.10. New Assurance Graph Generation
   4.  Security Considerations
   5.  IANA Considerations
   6.  Contributors
   7.  Open Issues
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SAIN Agent: Component that communicates with a device, a set of
   devices, or another agent to build an expression graph from a
   received assurance graph and perform the corresponding computation.

   Assurance Graph: DAG representing the assurance case for one or
   several service instances.  The nodes (also known as vertices in the
   context of a DAG) are the service instances themselves and the
   subservices; the edges indicate dependency relations.

   SAIN Collector: Component that fetches or receives the computer-
   consumable output of the agent(s) and displays it in a user-friendly
   form or processes it locally.

   DAG: Directed Acyclic Graph.
   ECMP: Equal-Cost Multipath.

   Expression Graph: Generic term for a DAG representing a computation
   in SAIN.  More specific terms are:

   o  Subservice Expressions: expression graph representing all the
      computations to execute for a subservice.

   o  Service Expressions: expression graph representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   o  Global Computation Graph: expression graph representing all the
      computations to execute for all service instances (i.e., all
      computations performed).

   Dependency: The directed relationship between subservice instances
   in the assurance graph.

   Informational Dependency: Type of dependency whose score does not
   impact the score of its parent subservice or service instance(s) in
   the assurance graph.  However, the symptoms should be taken into
   account in the parent service instance or subservice instance(s),
   for informational reasons.

   Impacting Dependency: Type of dependency whose score impacts the
   score of its parent subservice or service instance(s) in the
   assurance graph.  The symptoms are taken into account in the parent
   service instance or subservice instance(s), as the impacting
   reasons.

   Metric: Information retrieved from a network device.

   Metric Engine: Maps metrics to a list of candidate metric
   implementations depending on the target model.

   Metric Implementation: Actual way of retrieving a metric from a
   device.

   Network Service YANG Module: Describes the characteristics of a
   service, as agreed upon with consumers of that service [RFC8199].

   Service Instance: A specific instance of a service.

   Service Configuration Orchestrator: Quoting RFC 8199, "Network
   Service YANG Modules describe the characteristics of a service, as
   agreed upon with consumers of that service.  That is, a service
   module does not expose the detailed configuration parameters of all
   participating network elements and features but describes an
   abstract model that allows instances of the service to be decomposed
   into instance data according to the Network Element YANG Modules of
   the participating network elements.  The service-to-element
   decomposition is a separate process; the details depend on how the
   network operator chooses to realize the service.  For the purpose of
   this document, the term "orchestrator" is used to describe a system
   implementing such a process."

   SAIN Orchestrator: Component of SAIN in charge of fetching the
   configuration specific to each service instance and converting it
   into an assurance graph.

   Health Status: Score and symptoms indicating whether a service
   instance or a subservice is healthy.  A non-maximal score MUST
   always be explained by one or more symptoms.

   Health Score: Integer ranging from 0 to 100 indicating the health of
   a subservice.  A score of 0 means that the subservice is broken; a
   score of 100 means that the subservice is operating perfectly.

   Subservice: Part of an assurance graph that assures a specific
   feature or subpart of the network system.

   Symptom: Reason explaining why a service instance or a subservice is
   not completely healthy.
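   To make the health status and dependency-type definitions above more
   concrete, the following short Python sketch shows one possible in-
   memory representation.  It is purely illustrative and not part of
   any SAIN specification; all names are hypothetical.

      from dataclasses import dataclass, field
      from enum import Enum
      from typing import List

      class DependencyType(Enum):
          """Dependency types defined in the terminology above."""
          IMPACTING = "impacting"          # score affects the parent's score
          INFORMATIONAL = "informational"  # only symptoms are propagated

      @dataclass
      class HealthStatus:
          """Health status of a service or subservice instance.

          The score ranges from 0 (broken) to 100 (perfectly
          operational).  A non-maximal score must be explained by at
          least one symptom.
          """
          score: int
          symptoms: List[str] = field(default_factory=list)

          def __post_init__(self) -> None:
              if not 0 <= self.score <= 100:
                  raise ValueError("health score must be between 0 and 100")
              if self.score < 100 and not self.symptoms:
                  raise ValueError("a non-maximal score requires a symptom")

      # Example: a degraded subservice with an explaining symptom.
      status = HealthStatus(score=40,
                            symptoms=["Interface has high error rate"])
      print(status)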
2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting RFC 8199: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."

   In other words, service configuration orchestrators deploy Network
   Service YANG Modules through the configuration of Network Element
   YANG Modules.  Network configuration is based on those YANG data
   models, with protocols/encodings such as NETCONF/XML [RFC6241],
   RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Since knowing
   that a configuration is applied does not imply that the service is
   running correctly (for example, the service might be degraded
   because of a failure in the network), the network operator must
   monitor the service operational data at the same time as the
   configuration.  The industry has been standardizing on telemetry to
   push network element performance information.

   A network administrator needs to monitor her network and services as
   a whole, independently of the use cases or the management protocols.
   With different protocols come different data models, and different
   ways to model the same type of information.  When network
   administrators deal with multiple protocols, the network management
   systems must perform the difficult and time-consuming job of mapping
   data models: the model used for configuration with the model used
   for monitoring.  This problem is compounded by a large, disparate
   set of data sources (MIB modules, YANG models [RFC7950], IPFIX
   information elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [I-D.ietf-opsawg-tacacs], RADIUS [RFC2865], etc.).  In order to
   avoid this data model mapping, the industry converged on model-
   driven telemetry to stream the service operational data, reusing the
   YANG models used for configuration.  Model-driven telemetry greatly
   facilitates the notion of closed-loop automation, whereby events
   from the network drive remediation changes back into the network.

   However, it proves difficult for network operators to correlate the
   service degradation with the network root cause.  For example, why
   does my L3VPN fail to connect?  Why is this specific service slow?
   The reverse, i.e., which services are impacted when a network
   component fails or degrades, is even more interesting for the
   operators.  For example, which service(s) is(are) impacted when this
   specific optic dBM begins to degrade?  Which application is impacted
   by this ECMP imbalance?  Is that issue actually impacting any other
   customers?
   Intent-based approaches are often declarative, starting from a
   statement such as "The service works correctly" and trying to
   enforce it.  Such approaches are mainly suited for greenfield
   deployments.

   Instead of approaching intent in a declarative way, this framework
   focuses on already defined services and tries to infer the meaning
   of "The service works correctly".  To do so, the framework works
   from an assurance graph, deduced from the service definition and
   from the network configuration.  This assurance graph is decomposed
   into components, which are then assured independently.  The root of
   the assurance graph represents the service to assure, and its
   children represent components identified as its direct
   dependencies; each component can have dependencies as well.  The
   SAIN architecture maintains the correct assurance graph when
   services are modified or when the network conditions change.

   When a service is degraded, the framework will highlight where in
   the assurance graph to look, as opposed to going hop by hop to
   troubleshoot the issue.  Not only can this framework help to
   correlate service degradation with network root cause/symptoms, but
   it can deduce from the assurance graph the number and type of
   services impacted by a component degradation/failure.  This added
   value informs the operational team where to focus its attention for
   maximum return.

   This architecture provides the building blocks to assure both
   physical and virtual entities and is flexible with respect to
   services and subservices, (distributed) graphs, and components
   (Section 3.8).

3.  Architecture

   SAIN aims at assuring that service instances are operating correctly
   and, if not, at pinpointing what is wrong.  More precisely, SAIN
   computes a score for each service instance and outputs symptoms
   explaining that score, especially why the score is not maximal.  The
   score augmented with the symptoms is called the health status.

   The SAIN architecture is a generic architecture, applicable to
   multiple environments: wireline obviously, but also wireless,
   including 5G, virtual infrastructure managers (VIM), and even
   virtual functions.  Thanks to the distributed graph design
   principle, graphs from different environments/orchestrators can be
   combined together.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., a pseudowire).  Such a service would take as
   parameters the two ends of the connection (device, interface or
   subinterface, and address of the other end) and configure both
   devices (and maybe more) so that an L2VPN connection is established
   between the two devices.  Examples of symptoms might be "Interface
   has high error rate", "Interface flapping", or "Device almost out of
   memory".

   To compute the health status of such a service, the service is
   decomposed into an assurance graph formed by subservices linked
   through dependencies.  Each subservice is then turned into an
   expression graph that details how to fetch metrics from the devices
   and compute the health status of the subservice.  The subservice
   expressions are combined according to the dependencies between the
   subservices in order to obtain the expression graph that computes
   the health status of the service.
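   As an illustration of how subservice health statuses might be
   combined according to impacting and informational dependencies, the
   following Python sketch implements one possible heuristic (taking
   the minimum score over impacting dependencies).  This heuristic is
   an assumption of the sketch, not a rule mandated by this
   architecture, and all names are hypothetical.

      from dataclasses import dataclass, field
      from typing import List, Tuple

      IMPACTING = "impacting"
      INFORMATIONAL = "informational"

      @dataclass
      class Status:
          score: int                      # 0 (broken) .. 100 (healthy)
          symptoms: List[str] = field(default_factory=list)

      def combine(own: Status, deps: List[Tuple[str, Status]]) -> Status:
          """Combine a node's own status with those of its dependencies.

          Impacting dependencies may lower the score; informational
          dependencies only contribute their symptoms.
          """
          score = own.score
          symptoms = list(own.symptoms)
          for dep_type, dep_status in deps:
              symptoms.extend(dep_status.symptoms)
              if dep_type == IMPACTING:
                  score = min(score, dep_status.score)
          return Status(score=score, symptoms=symptoms)

      # A tunnel service with a flapping interface (impacting) and a
      # device low on memory (informational, in this toy example).
      tunnel = combine(
          Status(100),
          [(IMPACTING, Status(30, ["Interface flapping"])),
           (INFORMATIONAL, Status(60, ["Device almost out of memory"]))])
      print(tunnel.score, tunnel.symptoms)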
   The overall architecture of our solution is presented in Figure 1.
   Based on the service configuration, the SAIN orchestrator deduces
   the assurance graph.  It then sends the assurance graph, along with
   some other configuration options, to the SAIN agents.  The SAIN
   agents are responsible for building the expression graph and
   computing the health statuses in a distributed manner.  The
   collector is in charge of collecting and displaying the current
   inferred health status of the service instances and subservices.
   Finally, the automation loop is closed by having the SAIN Collector
   provide feedback to the network orchestrator.

     +-----------------+
     |     Service     |
     |  Configuration  |<--------------------+
     |  Orchestrator   |                     |
     +-----------------+                     |
       |     |                               |
       |     | Network                       |
       |     | Service                       | Feedback
       |     | Instance                      | Loop
       |     | Configuration                 |
       |     |                               |
       |     V                               |
       | +-----------------+     +-------------------+
       | |      SAIN       |     |       SAIN        |
       | |  Orchestrator   |     |     Collector     |
       | +-----------------+     +-------------------+
       |     |                             ^
       |     | Configuration               | Health Status
       |     | (assurance graph)           | (Score + Symptoms)
       |     V                             | Streamed
       | +-------------------+             | via Telemetry
       | |+-------------------+            |
       | ||+-------------------+           |
       | +||       SAIN        |-----------+
       |  +|       agent       |
       |   +-------------------+
       |        ^  ^  ^
       |        |  |  |
       |        |  |  |  Metric Collection
       V        V  V  V
     +-------------------------------------------------------------+
     |                     Monitored Entities                      |
     |                                                             |
     +-------------------------------------------------------------+

                       Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks:

   o  Analyze the configuration pushed to the network device(s) for
      configuring the service instance and decide which information is
      needed from the device(s) (such a piece of information is called
      a metric) and which operations to apply to the metrics to compute
      the health status.

   o  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible, else continuously poll.

   o  Continuously compute the health status of the service instances,
      based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

   In order to structure the assurance of a service instance, the
   service instance is decomposed into so-called subservice instances.
   Each subservice instance focuses on a specific feature or subpart of
   the network system.

   The decomposition into subservices is an important function of this
   architecture, for the following reasons:

   o  The result of this decomposition provides a relational picture of
      a service instance, which can be represented as a graph (called
      the assurance graph) to the operator.

   o  Subservices provide a scope for particular expertise and thereby
      enable contribution from external experts.  For instance, the
      subservice dealing with the optics health should be reviewed and
      extended by an expert in optical interfaces.
   o  Subservices that are common to several service instances are
      reused, reducing the amount of computation needed.

   The assurance graph of a service instance is a DAG representing the
   structure of the assurance case for the service instance.  The nodes
   of this graph are service instances or subservice instances.  Each
   edge of this graph indicates a dependency between the two nodes at
   its extremities: the service or subservice at the source of the edge
   depends on the service or subservice at the destination of the edge.

   Figure 2 depicts a simplistic example of the assurance graph for a
   tunnel service.  The node at the top is the service instance, the
   nodes below are its dependencies.  In the example, the tunnel
   service instance depends on the peer1 and peer2 tunnel interfaces,
   which in turn depend on the respective physical interfaces, which
   finally depend on the respective peer1 and peer2 devices.  The
   tunnel service instance also depends on the IP connectivity, which
   depends on the IS-IS routing protocol.

                         +------------------+
                         |      Tunnel      |
                         | Service Instance |
                         +------------------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
       +-------------+     +-------------+     +--------------+
       |    Peer1    |     |    Peer2    |     |      IP      |
       |    Tunnel   |     |    Tunnel   |     | Connectivity |
       |  Interface  |     |  Interface  |     |              |
       +-------------+     +-------------+     +--------------+
              |                   |                   |
       +-------------+     +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |     |    IS-IS    |
       |   Physical  |     |   Physical  |     |   Routing   |
       |  Interface  |     |  Interface  |     |   Protocol  |
       +-------------+     +-------------+     +-------------+
              |                   |
       +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |
       |    Device   |     |    Device   |
       +-------------+     +-------------+

                   Figure 2: Assurance Graph Example

   Depicting the assurance graph helps the operator to understand (and
   assert) the decomposition.  The assurance graph shall be maintained
   during normal operation, with addition, modification, and removal of
   service instances.  A change in the network configuration or
   topology shall be reflected in the assurance graph.  As a first
   example, a change of routing protocol from IS-IS to OSPF would
   change the assurance graph accordingly.  As a second example, assume
   that ECMP is in place for the source router of that specific tunnel;
   in that case, multiple interfaces must now be monitored, on top of
   monitoring the ECMP health itself.
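   As an illustration only, the assurance graph of Figure 2 could be
   represented in memory as the Python structure below.  The node
   identifiers, subservice type names, and parameters are hypothetical,
   and all dependencies are assumed to be impacting for simplicity; the
   actual data model for assurance graphs is specified in
   [I-D.claise-opsawg-service-assurance-yang].

      # Nodes are (sub)service instances identified by a type and
      # parameters; each dependency edge points from the dependent node
      # to the node it depends on, with its dependency type.
      assurance_graph = {
          "nodes": {
              "tunnel-svc":    {"type": "service",
                                "params": {"name": "tunnel1"}},
              "peer1-tun-if":  {"type": "interface",
                                "params": {"device": "peer1", "interface": "tu0"}},
              "peer2-tun-if":  {"type": "interface",
                                "params": {"device": "peer2", "interface": "tu0"}},
              "ip-conn":       {"type": "ip-connectivity",
                                "params": {"src": "peer1", "dst": "peer2"}},
              "peer1-phys-if": {"type": "interface",
                                "params": {"device": "peer1", "interface": "eth0"}},
              "peer2-phys-if": {"type": "interface",
                                "params": {"device": "peer2", "interface": "eth0"}},
              "isis":          {"type": "routing-protocol",
                                "params": {"protocol": "is-is"}},
              "peer1-dev":     {"type": "device", "params": {"device": "peer1"}},
              "peer2-dev":     {"type": "device", "params": {"device": "peer2"}},
          },
          "dependencies": [
              ("tunnel-svc",    "peer1-tun-if",  "impacting"),
              ("tunnel-svc",    "peer2-tun-if",  "impacting"),
              ("tunnel-svc",    "ip-conn",       "impacting"),
              ("peer1-tun-if",  "peer1-phys-if", "impacting"),
              ("peer2-tun-if",  "peer2-phys-if", "impacting"),
              ("ip-conn",       "isis",          "impacting"),
              ("peer1-phys-if", "peer1-dev",     "impacting"),
              ("peer2-phys-if", "peer2-dev",     "impacting"),
          ],
      }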
3.2.  Intent and Assurance Graph

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   o  Try to capture the intent of the service instance, i.e., what the
      service instance is trying to achieve,

   o  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance graph.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each of them.  Then:

   o  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the current
      state of SAIN; however, it does not completely capture the
      intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   o  Decomposing the service instance into subservices would result in
      the assurance graph depicted in Figure 2, for instance.

   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two components
   are most likely combined.  The internals of the orchestrator are
   currently out of scope of this document.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   subservice assurance, that is, the method for assuring that a
   subservice behaves correctly.

   Subservices, just like services, have high-level parameters that
   specify the type and specific instance to be assured.  For example,
   assuring a device requires the specific deviceId as a parameter, and
   assuring an interface requires the specific combination of deviceId
   and interfaceId.

   A subservice is also characterized by a list of metrics to fetch and
   a list of computations to apply to these metrics in order to infer a
   health status.

3.4.  Building the Expression Graph from the Assurance Graph

   From the assurance graph is derived a so-called global computation
   graph.  First, each subservice instance is transformed into a set of
   subservice expressions that take metrics and constants as input
   (i.e., sources of the DAG) and produce the status of the subservice,
   based on some heuristics.  Then, for each service instance, the
   service expressions are constructed by combining the subservice
   expressions of its dependencies.  The way service expressions are
   combined depends on the dependency types (impacting or
   informational).  Finally, the global computation graph is built by
   combining the service expressions.  In other words, the global
   computation graph encodes all the operations needed to produce
   health statuses from the collected metrics.

   Subservices shall be device independent.  To justify this, let's
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected by an industry-accepted
   YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
   even by a MIB module.  If the subservice were dependent on the
   mechanism used to collect the operational status, then we would need
   multiple subservice definitions in order to support all the
   different mechanisms.  This also implies that, while waiting for all
   the metrics to be available via standard YANG modules, SAIN agents
   might have to retrieve metric values via non-standard YANG models,
   via MIB modules, the Command Line Interface (CLI), etc., effectively
   implementing a normalization layer between data models and
   information models.

   In order to keep subservices independent from the metric collection
   method, or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the framework introduces the
   concept of "metric engine".  The metric engine maps each device-
   independent metric used in the subservices to a list of device-
   specific metric implementations that precisely define how to fetch
   values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
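   The following Python sketch illustrates this mapping idea for a
   single device-independent metric.  The platform keys, vendor YANG
   path, and helper names are hypothetical examples, not an actual
   catalog defined by this architecture.

      from dataclasses import dataclass
      from typing import Dict

      @dataclass
      class MetricImplementation:
          """How to fetch one metric from one class of device."""
          protocol: str   # e.g. "netconf", "snmp", "cli"
          locator: str    # e.g. a YANG path, an OID name, a CLI command

      # Device-independent metric name -> per-platform implementations.
      METRIC_CATALOG: Dict[str, Dict[str, MetricImplementation]] = {
          "interface-oper-status": {
              "ietf-yang":   MetricImplementation(
                  "netconf",
                  "/ietf-interfaces:interfaces/interface/oper-status"),
              "vendor-yang": MetricImplementation(
                  "netconf",
                  "/vendor-if:interfaces/interface/state/oper"),
              "mib-only":    MetricImplementation(
                  "snmp", "IF-MIB::ifOperStatus"),
          },
      }

      def resolve(metric: str, platform: str) -> MetricImplementation:
          """Map a device-independent metric to a device-specific
          implementation, based on the characteristics of the target
          device (reduced here to a single 'platform' string)."""
          try:
              return METRIC_CATALOG[metric][platform]
          except KeyError as exc:
              raise LookupError(
                  f"no implementation of {metric!r} for {platform!r}") from exc

      print(resolve("interface-oper-status", "mib-only"))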
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of scope of this document.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to the YANG modules specified in "YANG Modules for Service
   Assurance" [I-D.claise-opsawg-service-assurance-yang]; they specify
   objects for assuring network services based on their decomposition
   into so-called subservices, according to the SAIN architecture.

   This module is intended for the following use cases:

   o  Assurance graph configuration:

      *  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      *  Dependencies: configure the dependencies between the
         subservices, along with their types.

   o  Assurance telemetry: export the health status of the subservices,
      along with the observed symptoms.

3.7.  Handling Maintenance Windows

   Whenever network components are under maintenance, the operator
   wants to inhibit the emission of symptoms from those components.  A
   typical use case is device maintenance, during which the device is
   not supposed to be operational.  As such, symptoms related to the
   device health should be ignored, as well as symptoms related to the
   device-specific subservices, such as the interfaces, as their state
   changes are probably the consequence of the maintenance.

   To configure network components as "under maintenance" in the SAIN
   architecture, the ietf-service-assurance model proposed in
   [I-D.claise-opsawg-service-assurance-yang] specifies an "under-
   maintenance" flag per service or subservice instance.  When, and
   only when, this flag is set, the companion field "maintenance-
   contact" must be set to a string that identifies the person or
   process who requested the maintenance.  Any symptom produced by a
   service or subservice under maintenance, or by one of its
   dependencies, MUST NOT be reported.  A service or subservice under
   maintenance MAY propagate an "Under Maintenance" symptom towards
   services or subservices that depend on it.

   We illustrate this mechanism with three independent examples based
   on the assurance graph depicted in Figure 2:

   o  Device maintenance, for instance upgrading the device OS.  The
      operator sets the "under-maintenance" flag for the subservice
      "Peer1 Device".  This inhibits the emission of symptoms from
      "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
      Service Instance".  All other subservices are unaffected.

   o  Interface maintenance, for instance replacing a broken optic.
      The operator sets the "under-maintenance" flag for the subservice
      "Peer1 Physical Interface".  This inhibits the emission of
      symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
      Instance".  All other subservices are unaffected.

   o  Routing protocol maintenance, for instance modifying parameters
      or redistribution.  The operator sets the "under-maintenance"
      flag for the subservice "IS-IS Routing Protocol".  This inhibits
      the emission of symptoms from "IP Connectivity" and "Tunnel
      Service Instance".  All other subservices are unaffected.
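   The Python sketch below illustrates the behaviour described in the
   examples above: when a node is flagged as under maintenance, the
   symptoms of that node and of the nodes that (transitively) depend on
   it are not reported, and may be replaced by a single "Under
   Maintenance" symptom.  This is an illustrative simplification under
   those assumptions, not the normative behaviour defined in
   [I-D.claise-opsawg-service-assurance-yang]; the node names are taken
   from Figure 2 and the symptoms are invented for the example.

      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class Node:
          symptoms: List[str] = field(default_factory=list)
          under_maintenance: bool = False
          maintenance_contact: str = ""   # who requested the maintenance

      def is_inhibited(name: str, nodes: Dict[str, Node],
                       deps: Dict[str, List[str]]) -> bool:
          """True if the node itself, or any node it (transitively)
          depends on, is under maintenance."""
          if nodes[name].under_maintenance:
              return True
          return any(is_inhibited(d, nodes, deps) for d in deps.get(name, []))

      def reported_symptoms(name: str, nodes: Dict[str, Node],
                            deps: Dict[str, List[str]]) -> List[str]:
          if is_inhibited(name, nodes, deps):
              # Optionally propagate a single "Under Maintenance" symptom.
              return ["Under Maintenance"]
          return list(nodes[name].symptoms)

      # Device maintenance example: upgrading the OS of the Peer1 device.
      nodes = {
          "Tunnel Service Instance": Node(symptoms=["Tunnel down"]),
          "Peer1 Tunnel Interface": Node(symptoms=["Interface down"]),
          "Peer1 Physical Interface": Node(symptoms=["Interface down"]),
          "Peer1 Device": Node(symptoms=["Device unreachable"],
                               under_maintenance=True,
                               maintenance_contact="ops@example.com"),
          "Peer2 Physical Interface":
              Node(symptoms=["Interface has high error rate"]),
      }
      deps = {
          "Tunnel Service Instance": ["Peer1 Tunnel Interface"],
          "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
          "Peer1 Physical Interface": ["Peer1 Device"],
      }
      for n in nodes:
          print(n, "->", reported_symptoms(n, nodes, deps))
      # "Peer2 Physical Interface" is unaffected and keeps its symptom.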
3.8.  Flexible Architecture

   The SAIN architecture is flexible in terms of components.  While the
   SAIN architecture in Figure 1 makes a distinction between two
   components, the service configuration orchestrator and the SAIN
   orchestrator, in practice those two components are most likely
   combined.  Similarly, the SAIN agents are displayed in Figure 1 as
   separate components.  Practically, the SAIN agents could be either
   independent components or directly integrated in the monitored
   entities.  A practical example is an agent in a router.

   The SAIN architecture is also flexible in terms of services and
   subservices.  Most examples in this document deal with the notion of
   Network Service YANG modules, with well-known services such as L2VPN
   or tunnels.  However, the concept of service is general enough to
   cross into different domains.  One of them is the domain of service
   management on network elements, which also requires its own
   assurance.  Examples include a DHCP server on a Linux server, a data
   plane, an IPFIX export, etc.  The notion of "service" is generic in
   this architecture.  Indeed, a configured service can itself be a
   service for someone else.  Exactly like a DHCP server, a data plane,
   or an IPFIX export can be considered as services for a device, a
   routing instance can be considered as a service for an L3VPN, and a
   tunnel can be considered as a service for an application in the
   cloud.  The assurance graph is created to be flexible and open,
   regardless of the subservice types, locations, or domains.

   The SAIN architecture is also flexible in terms of distributed
   graphs.  As shown in Figure 1, our architecture comprises several
   agents.  Each agent is responsible for handling a subgraph of the
   assurance graph.  The collector is responsible for fetching the
   subgraphs from the different agents and gluing them together.  As an
   example, in the graph from Figure 2, the subservices relative to
   Peer1 might be handled by a different agent than the subservices
   relative to Peer2, and the IP Connectivity and IS-IS subservices
   might be handled by yet another agent.  The agents will export their
   partial graphs and the collector will stitch them together as
   dependencies of the service instance.

   And finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all, examples in this document refer to
   physical components, but this is not a constraint.  Indeed, the
   assurance of virtual components would follow the same principles,
   and an assurance graph composed of virtualized components (or a mix
   of virtualized and physical ones) is possible within this
   architecture.
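   As an illustration of the distributed graph handling described
   above, the following Python sketch shows a collector gluing together
   the partial graphs exported by three hypothetical agents (one per
   peer and one for the core subservices).  The agent split, node
   names, and statuses are assumptions made for the example.

      from typing import Dict, List

      # Each SAIN agent exports the subgraph it handles: its nodes (with
      # their health statuses) and the dependency edges between them.
      agent_peer1 = {
          "nodes": {"Peer1 Physical Interface": {"score": 100},
                    "Peer1 Device": {"score": 100}},
          "edges": [("Peer1 Physical Interface", "Peer1 Device",
                     "impacting")],
      }
      agent_peer2 = {
          "nodes": {"Peer2 Physical Interface":
                        {"score": 70,
                         "symptoms": ["Interface has high error rate"]},
                    "Peer2 Device": {"score": 100}},
          "edges": [("Peer2 Physical Interface", "Peer2 Device",
                     "impacting")],
      }
      agent_core = {
          "nodes": {"IP Connectivity": {"score": 100},
                    "IS-IS Routing Protocol": {"score": 100}},
          "edges": [("IP Connectivity", "IS-IS Routing Protocol",
                     "impacting")],
      }

      def stitch(subgraphs: List[Dict]) -> Dict:
          """Glue the partial graphs exported by the agents into one
          graph, as a collector would before attaching them as
          dependencies of the service instance."""
          full = {"nodes": {}, "edges": []}
          for sg in subgraphs:
              full["nodes"].update(sg["nodes"])
              full["edges"].extend(sg["edges"])
          return full

      graph = stitch([agent_peer1, agent_peer2, agent_core])
      print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")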
3.9.  Timing

   The SAIN architecture requires the Network Time Protocol (NTP)
   [RFC5905] between all elements: monitored entities, SAIN agents, the
   Service Configuration Orchestrator, the SAIN Collector, as well as
   the SAIN Orchestrator.  This guarantees that all symptoms in the
   system can be correlated, and correlated with the right assurance
   graph version.

   The SAIN agent might have to remove some symptoms for specific
   subservices, because they are outdated and no longer relevant, or
   simply because the SAIN agent needs to free up some space.
   Regardless of the reason, it is important for a SAIN collector
   (re-)connecting to a SAIN agent to understand the effect of this
   garbage collection.  Therefore, the SAIN agent contains a YANG
   object specifying the date and time at which the symptoms history
   starts for the subservice instances.

3.10.  New Assurance Graph Generation

   The assurance graph will change over time, because services and
   subservices come and go (changing the dependencies between
   subservices), or simply because a subservice is now under
   maintenance.  Therefore, an assurance graph version must be
   maintained, along with the date and time of its last generation.
   The date and time of the last change of a particular subservice
   instance (again, its dependencies or its under-maintenance state)
   might be kept as well.  From a client point of view, an assurance
   graph change is signaled by the values of the assurance-graph-
   version and assurance-graph-last-change YANG leafs.  At that point
   in time, the client (collector) applies the following process,
   illustrated by the sketch after this list:

   o  Keep the previous assurance-graph-last-change value (let's call
      it time T)

   o  Run through all subservice instances and process those whose
      last-change is newer than time T

   o  Keep the new assurance-graph-last-change as the new reference
      date and time
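   The sketch below illustrates this client-side process in Python.
   The data layout and helper names are hypothetical; the actual leafs
   are defined in [I-D.claise-opsawg-service-assurance-yang].

      from dataclasses import dataclass
      from datetime import datetime, timezone
      from typing import Dict

      @dataclass
      class SubserviceInfo:
          last_change: datetime  # last time this subservice instance changed

      def process_graph_update(previous_last_change: datetime,
                               new_graph_last_change: datetime,
                               subservices: Dict[str, SubserviceInfo]
                               ) -> datetime:
          """Keep the previous assurance-graph-last-change value (time T),
          re-read only the subservice instances whose last-change is newer
          than T, and return the new reference date and time."""
          for name, info in subservices.items():
              if info.last_change > previous_last_change:
                  print(f"re-reading subservice {name}")  # placeholder action
          return new_graph_last_change

      # Hypothetical example: only the IP Connectivity subservice changed
      # since the previous synchronization at time T.
      t = datetime(2021, 1, 1, tzinfo=timezone.utc)
      subservices = {
          "Peer1 Device": SubserviceInfo(
              last_change=datetime(2020, 12, 1, tzinfo=timezone.utc)),
          "IP Connectivity": SubserviceInfo(
              last_change=datetime(2021, 1, 2, tzinfo=timezone.utc)),
      }
      t = process_graph_update(
          t, datetime(2021, 1, 2, tzinfo=timezone.utc), subservices)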
4.  Security Considerations

   The SAIN architecture helps operators to reduce the mean time to
   detect and the mean time to repair.  As such, it should not cause
   any security threats.  However, the SAIN agents must be secure: a
   compromised SAIN agent could send wrong root causes or symptoms to
   the management systems.

   Except for the configuration of telemetry, the agents do not need
   "write access" to the devices they monitor.  This configuration is
   applied with a YANG module, whose protection is covered by Secure
   Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

   The data collected by SAIN could potentially be compromising to the
   network or provide more insight into how the network is designed.
   Considering the data that SAIN requires (including CLI access in
   some cases), one should weigh data access concerns against the
   impact that reduced visibility will have on being able to rapidly
   identify root causes.

   If a closed-loop system relies on this architecture, then the well-
   known issues of such systems also apply: a lying device or a
   compromised agent could trigger partial reconfiguration of the
   service or network.  The SAIN architecture neither augments nor
   reduces this risk.

5.  IANA Considerations

   This document includes no request to IANA.

6.  Contributors

   o  Youssef El Fathi

   o  Eric Vyncke

7.  Open Issues

   Refer to the Intent-based Networking NMRG documents.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June
              2010, <https://www.rfc-editor.org/info/rfc5905>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.claise-opsawg-service-assurance-yang]
              Claise, B. and J. Quilbeuf, "YANG Modules for Service
              Assurance", draft-claise-opsawg-service-assurance-yang
              (work in progress), February 2020.

   [I-D.ietf-opsawg-tacacs]
              Dahm, T., Ota, A., dcmgash@cisco.com, d., Carrel, D., and
              L. Grant, "The TACACS+ Protocol", draft-ietf-opsawg-
              tacacs-18 (work in progress), March 2020.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, DOI 10.17487/RFC2865, June 2000,
              <https://www.rfc-editor.org/info/rfc2865>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
              <https://www.rfc-editor.org/info/rfc6241>.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
              <https://www.rfc-editor.org/info/rfc6242>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

   v02 - v03

   o  Timing Concepts

   o  New Assurance Graph Generation

   v01 - v02

   o  Handling maintenance windows

   o  Flexible architecture better explained

   o  Improved the terminology

   o  Notion of mapping information model to data model, while waiting
      for YANG to be everywhere

   o  Started a security considerations section

   v00 - v01

   o  Terminology clarifications

   o  Figure 1 improved

Acknowledgements

   The authors would like to thank Stephane Litkowski, Charles Eckel,
   Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin,
   and Eric Vyncke for their reviews and feedback.
Authors' Addresses

   Benoit Claise
   Cisco Systems, Inc.
   De Kleetlaan 6a b1
   1831 Diegem
   Belgium

   Email: bclaise@cisco.com

   Jean Quilbeuf
   Independent

   Email: jean@quilbeuf.net

   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid  28006
   Spain

   Email: diego.r.lopez@telefonica.com

   Dan Voyer
   Bell Canada
   Canada

   Email: daniel.voyer@bell.ca

   Thangam Arumugam
   Cisco Systems, Inc.
   Milpitas (California)
   United States

   Email: tarumuga@cisco.com