OPSAWG                                                           H. Song
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: 4 June 2022                                        China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                          Rakuten Mobile
                                                                 A. Wang
                                                           China Telecom
                                                         1 December 2021

                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-12

Abstract

   Network telemetry is a technology for gaining network insight and
   facilitating efficient and automated network management. It
   encompasses various techniques for remote data generation,
   collection, correlation, and consumption. This document describes an
   architectural framework for network telemetry, motivated by
   challenges that are encountered as part of the operation of networks
   and by the requirements that ensue. This document clarifies the
   terminology and classifies the modules and components of a network
   telemetry system from different perspectives. The framework and
   taxonomy help to set a common ground for the collection of related
   work and provide guidance for the development of related techniques
   and standards.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 June 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Table of Contents

   1. Introduction
     1.1. Applicability Statement
     1.2. Glossary
   2. Background
     2.1. Telemetry Data Coverage
     2.2. Use Cases
     2.3. Challenges
     2.4. Network Telemetry
     2.5. The Necessity of a Network Telemetry Framework
   3. Network Telemetry Framework
     3.1. Top Level Modules
       3.1.1. Management Plane Telemetry
       3.1.2. Control Plane Telemetry
       3.1.3. Forwarding Plane Telemetry
       3.1.4. External Data Telemetry
     3.2. Second Level Function Components
     3.3. Data Acquisition Mechanism and Type Abstraction
     3.4. Mapping Existing Mechanisms into the Framework
   4. Evolution of Network Telemetry Applications
   5. Security Considerations
   6. IANA Considerations
   7. Contributors
   8. Acknowledgments
   9. Informative References
   Appendix A. A Survey on Existing Network Telemetry Techniques
     A.1. Management Plane Telemetry
       A.1.1. Push Extensions for NETCONF
       A.1.2. gRPC Network Management Interface
     A.2. Control Plane Telemetry
       A.2.1. BGP Monitoring Protocol
     A.3. Data Plane Telemetry
       A.3.1. The Alternate Marking (AM) technology
       A.3.2. Dynamic Network Probe
       A.3.3. IP Flow Information Export (IPFIX) Protocol
       A.3.4. In-Situ OAM
       A.3.5. Postcard Based Telemetry
       A.3.6. Existing OAM for Specific Data Planes
     A.4. External Data and Event Telemetry
       A.4.1. Sources of External Events
       A.4.2. Connectors and Interfaces
   Authors' Addresses

1. Introduction

   Network visibility is the ability of management tools to see the
   state and behavior of a network, which is essential for successful
   network operation.
   Network Telemetry revolves around network data
   that can help provide insights about the current state of the
   network, including network devices and the forwarding, control, and
   management planes. This data can be generated and obtained through a
   variety of techniques, including but not limited to network
   instrumentation and measurements, and can be processed for purposes
   ranging from service assurance to network security using a wide
   variety of data analytical techniques. In this document, Network
   Telemetry refers to both the data itself (i.e., "Network Telemetry
   Data") and the techniques and processes used to generate, export,
   collect, and consume that data for use by potentially automated
   management applications. Network telemetry extends beyond the
   classical network Operations, Administration, and Maintenance (OAM)
   techniques and is expected to support better flexibility,
   scalability, accuracy, coverage, and performance.

   However, the term "network telemetry" lacks an unambiguous
   definition. Its scope and coverage cause confusion and
   misunderstandings. It is beneficial to clarify the concept and
   provide a clear architectural framework for network telemetry, so
   that we can articulate the technical field and better align the
   related techniques and standards work.

   To fulfill such an undertaking, we first discuss some key
   characteristics of network telemetry that set it clearly apart from
   conventional network OAM, and we show that some conventional OAM
   technologies can be considered a subset of the network telemetry
   technologies. We then provide an architectural framework for network
   telemetry that includes four modules, each concerned with a
   different category of telemetry data and corresponding procedures.
   All the modules are internally structured in the same way, including
   components that allow the operator to configure data sources with
   regard to what data to generate and how to make it available to
   client applications, components that instrument the underlying data
   sources, and components that perform the actual rendering, encoding,
   and exporting of the generated data. We show how the network
   telemetry framework can benefit current and future network
   operations. Based on the distinction of modules and function
   components, we can map the existing and emerging techniques and
   protocols into the framework. The framework can also simplify
   designing, maintaining, and understanding a network telemetry system.
   In addition, we outline the evolution stages of the network telemetry
   system and discuss the potential security concerns.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for the
   future development of techniques and standards. To the best of our
   knowledge, this document is the first such effort for network
   telemetry in industry standards organizations. This document does
   not define specific technologies.

1.1. Applicability Statement

   Large-scale network data collection is a major threat to user privacy
   and may be indistinguishable from pervasive monitoring [RFC7258].
   The network telemetry framework presented in this document must not
   be applied to generating, exporting, collecting, analyzing, or
   retaining individual user data or any data that can identify end
   users or characterize their behavior without consent. Based on this
   principle, the network telemetry framework is not applicable to
   networks whose endpoints represent individual users, such as
   general-purpose access networks.

1.2. Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We intentionally differentiate between the
   terms network telemetry and OAM. However, it should be understood
   that there is no hard-line distinction between the two concepts.
   Rather, network telemetry is considered an extension of OAM. It
   covers all the existing OAM protocols but puts more emphasis on the
   newer and emerging techniques and protocols concerning all aspects
   of network data from acquisition to consumption.

   AI: Artificial Intelligence. In the network domain, AI refers to
      machine-learning-based technologies for automated network
      operation and other tasks.

   AM: Alternate Marking, a flow performance measurement method,
      specified in [RFC8321].

   BMP: BGP Monitoring Protocol, specified in [RFC7854].

   DPI: Deep Packet Inspection, referring to techniques that examine
      packets beyond the L3/L4 headers.

   gNMI: gRPC Network Management Interface, a network management
      protocol from the OpenConfig Operator Working Group, mainly
      contributed by Google. See [gnmi] for details.

   GPB: Google Protocol Buffers, an extensible mechanism for
      serializing structured data. See [gpb] for details.

   gRPC: gRPC Remote Procedure Call, an open source high performance
      RPC framework that gNMI is based on. See [grpc] for details.

   IPFIX: IP Flow Information Export Protocol, specified in [RFC7011].

   IOAM: In-situ OAM [I-D.ietf-ippm-ioam-data], a dataplane on-path
      telemetry technique.

   JSON: JavaScript Object Notation, an open standard file format and
      data interchange format that uses human-readable text to store
      and transmit data objects, specified in [RFC8259].

   MIB: Management Information Base, a database used for managing the
      entities in a network.

   NETCONF: Network Configuration Protocol, specified in [RFC6241].

   NetFlow: A Cisco protocol for collecting flow records, described in
      [RFC3954].

   Network Telemetry: The process and instrumentation for acquiring and
      utilizing network data remotely for network monitoring and
      operation. A general term for a large set of network visibility
      techniques and protocols, concerning aspects like data generation,
      collection, correlation, and consumption. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward future intent-driven autonomous networks.

   NMS: Network Management System, referring to applications that allow
      network administrators to manage a network.

   OAM: Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   PBT: Postcard-Based Telemetry, a dataplane on-path telemetry
      technique. A representative technique is described in
      [I-D.ietf-ippm-ioam-direct-export].

   RESTCONF: An HTTP-based protocol that provides a programmatic
      interface for accessing data defined in YANG, using the datastore
      concepts defined in NETCONF, as specified in [RFC8040].

   SMIv2: Structure of Management Information Version 2, defining MIB
      objects, specified in [RFC2578].

   SNMP: Simple Network Management Protocol. Versions 1, 2, and 3 are
      specified in [RFC1157], [RFC3416], and [RFC3414], respectively.

   XML: Extensible Markup Language, a markup language for data
      encoding that is both human-readable and machine-readable,
      specified by W3C [xml].

   YANG: A data modeling language for the definition of data sent over
      network management protocols such as NETCONF and RESTCONF. YANG
      is defined in [RFC6020] and [RFC7950].

   YANG ECA: A YANG model for Event-Condition-Action policies, defined
      in [I-D.wwx-netmod-event-yang].

   YANG-Push: A mechanism that allows subscriber applications to
      request a stream of updates from a YANG datastore on a network
      device. Details are specified in [RFC8641] and [RFC8639].

2. Background

   The term "big data" is used to describe extremely large data sets
   that can be analyzed computationally to reveal patterns, trends, and
   associations. Networks are undoubtedly a source of big data because
   of their scale and the volume of network traffic they forward. When
   a network's endpoints do not represent individual users (e.g., in
   industrial, datacenter, and infrastructure contexts), network
   operations can often benefit from large-scale data collection
   without breaching user privacy.

   Today one can access advanced big data analytics capability through a
   plethora of commercial and open source platforms (e.g., Apache
   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
   learning). Thanks to the advance of computing and storage
   technologies, network big data analytics gives network operators an
   opportunity to gain network insights and move towards network
   autonomy. Some operators have started to explore the application of
   Artificial Intelligence (AI) to make sense of network data. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events. In turn, network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.
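
   As a purely illustrative sketch of the kind of software tool
   mentioned above, the following Python fragment flags samples of a
   telemetry metric that deviate strongly from a rolling baseline. It
   is not part of the framework: the class and function names, the
   window size, the threshold, and the sample values are assumptions
   made for this example, and a simple z-score test stands in for
   whatever analytics an operator would actually deploy.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags samples that deviate strongly from a rolling baseline.

    Hypothetical helper for illustration only; not defined by the draft.
    """

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent "normal" samples
        self.threshold = threshold           # z-score cut-off (assumed)
        self.min_samples = min_samples       # warm-up before judging

    def observe(self, value):
        """Return True if value is anomalous relative to recent history."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = pstdev(self.samples)
            if spread > 0:
                anomalous = abs(value - baseline) / spread > self.threshold
            else:
                # A perfectly flat history: any change counts as anomalous.
                anomalous = value != baseline
        # Learn only from normal samples so that a burst does not
        # distort the baseline used for later decisions.
        if not anomalous:
            self.samples.append(value)
        return anomalous


def detect(values, **kwargs):
    """Return the indices of the anomalous samples in values."""
    detector = RollingAnomalyDetector(**kwargs)
    return [i for i, v in enumerate(values) if detector.observe(v)]
```

   For instance, with made-up link utilization samples,
   detect([10, 11, 9, 10, 12, 10, 11, 95, 10, 9]) returns [7], the
   index of the outlying sample. A real deployment would feed such a
   detector from streamed telemetry data rather than from a list.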

   It is conceivable that an autonomic network [RFC7575] is the logical
   next step for network evolution following Software-Defined Networking
   (SDN), aiming to reduce (or even eliminate) human labor, make more
   efficient use of network resources, and provide better services more
   aligned with customer requirements. The related technique of
   Intent-based Networking (IBN)
   [I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility
   and telemetry data in order to ensure that the network is behaving as
   intended.

   However, while the data processing capability is improved and
   applications require more data to function better, the networks lag
   behind in extracting and translating network data into useful and
   actionable information in efficient ways. The system bottleneck is
   shifting from data consumption to data supply. Both the number of
   network nodes and the traffic bandwidth keep increasing at a fast
   pace. Network configurations and policies change at shorter
   intervals than before. More subtle events and fine-grained data
   from all network planes need to be captured and exported in real
   time. In a nutshell, it is a challenge to get enough high-quality
   data out of the network in a manner that is efficient, timely, and
   flexible. Therefore, we need to survey the existing technologies and
   protocols and identify any potential gaps.

   In the remainder of this section, we first clarify the scope of
   network data (i.e., telemetry data) relevant in this document. Then,
   we discuss several key use cases for today's and future network
   operations. Next, we show why the current network OAM techniques and
   protocols are insufficient for these use cases. The discussion
   underlines the need for new methods, techniques, and protocols, as
   well as extensions of existing ones, which we group under the
   umbrella term "Network Telemetry".

2.1. Telemetry Data Coverage

   Any information that can be extracted from networks (including the
   data plane, control plane, and management plane) and used to gain
   visibility or as a basis for actions is considered telemetry data.
   It includes statistics, event records and logs, snapshots of state,
   configuration data, etc. It also covers the outputs of any active
   and passive measurements [RFC7799]. In some cases, raw data is
   processed in the network before being sent to a data consumer. Such
   processed data is also considered telemetry data. The value of
   telemetry data varies. In some cases, if the cost is acceptable, a
   smaller amount of high-quality data is preferred over a large amount
   of low-quality data. A classification of telemetry data is provided
   in Section 3. To preserve the privacy of end users, no user packet
   content should be collected. Specifically, the data objects
   generated, exported, and collected by a network telemetry
   application should not include any packet payload from traffic
   associated with end-user systems.

2.2. Use Cases

   The following set of use cases is essential for network operations.
   While the list is by no means exhaustive, it is enough to highlight
   the requirements for data velocity, variety, volume, and veracity,
   the attributes of big data, in networks.

   *  Security: Network intrusion detection and prevention systems need
      to monitor network traffic and activities and act upon anomalies.
      Given increasingly sophisticated attack vectors coupled with
      increasingly severe consequences of security breaches, new tools
      and techniques need to be developed, relying on wider and deeper
      visibility into networks. The ultimate goal is to achieve
      security with no, or only minimal, human intervention.

   *  Policy and Intent Compliance: Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. Intent, as defined in
      [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational
      goals that a network should meet and outcomes that a network is
      supposed to deliver, defined in a declarative manner without
      specifying how to achieve or implement them. An intent requires a
      complex translation and mapping process before being applied on
      networks. While a policy or intent is enforced, the compliance
      needs to be verified and monitored continuously by relying on
      visibility that is provided through network telemetry data. Any
      violation must be reported immediately, potentially resulting in
      updates to how the policy or intent is applied in the network to
      ensure that it remains in force, or otherwise alerting the network
      administrator to the policy or intent violation.

   *  SLA Compliance: A Service-Level Agreement (SLA) is a service
      contract between a service provider and a client, which includes
      the metrics for the service measurement and remedy/penalty
      procedures when the service level misses the agreement. Users
      need to check if they get the service as promised, and network
      operators need to evaluate how they can deliver services that
      meet the SLA based on real-time network telemetry data, including
      data from network measurements.

   *  Root Cause Analysis: Many network failures can be the effect of a
      sequence of chained events. Troubleshooting and recovery require
      quick identification of the root cause of any observable issues.
      However, the root cause is not always straightforward to identify,
      especially when the failure is sporadic and the number of event
      messages, both related and unrelated to the same cause, is
      overwhelming. While technologies such as machine learning can be
      used for root cause analysis, it is up to the network to sense and
      provide the relevant diagnostic data, which is either actively
      fed into, or passively retrieved by, the root cause analysis
      applications.

   *  Network Optimization: This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better Return On Investment (ROI) or lower Capital
      Expenditures (CAPEX). The first step is to know the real-time
      network conditions before applying policies for traffic
      manipulation. In some cases, micro-bursts need to be detected in
      a very short time frame so that fine-grained traffic control can
      be applied to avoid network congestion. Long-term planning of
      network capacity and topology requires analysis of real-world
      network telemetry data that is obtained over long periods of time.

   *  Event Tracking and Prediction: The visibility into traffic path
      and performance is critical for services and applications that
      rely on healthy network operation. Numerous related network
      events are of interest to network operators. For example,
      network operators want to learn where and why packets are dropped
      for an application flow. They also want to be warned of issues in
      advance, so proactive actions can be taken to avoid catastrophic
      consequences.

2.3. Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the
   network.
   Some other OAM techniques as described in [RFC7276] are
   also used to facilitate network troubleshooting. These conventional
   techniques are not sufficient to support the above use cases for the
   following reasons:

   *  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time. Poll-based
      low-frequency data collection is ill-suited for these
      applications. Subscription-based streaming data directly pushed
      from the data source (e.g., the forwarding chip) is preferred to
      provide sufficient data quantity and precision at scale.

   *  Comprehensive data is needed from packet processing engines to
      traffic managers, from line cards to main control boards, from
      user flows to control protocol packets, from device
      configurations to operations, and from the physical layer to the
      application layer. Conventional OAM only covers a narrow range of
      data (e.g., SNMP only handles data from the Management
      Information Base (MIB)). Classical network devices cannot provide
      all the necessary probes. More open and programmable network
      devices are therefore needed.

   *  Many application scenarios need to correlate network-wide data
      from multiple sources (i.e., from distributed network devices,
      different components of a network device, or different network
      planes). A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources. The composition of a
      complete solution, as partly proposed by Autonomic Resource
      Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   *  Some conventional OAM techniques (e.g., CLI and Syslog) lack a
      formal data model. Unstructured data hinders tool automation and
      application extensibility. Standardized data models are essential
      to support programmable networks.

   *  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the
      pushed data are limited to only predefined management plane
      warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow).
      Network operators require data of arbitrary source, granularity,
      and precision, which is beyond the capability of the existing
      techniques.

   *  The conventional passive measurement techniques can either consume
      excessive network resources and render excessive redundant data,
      or lead to inaccurate results; on the other hand, the conventional
      active measurement techniques can interfere with the user traffic
      and their results are indirect. Techniques that can collect
      direct and on-demand data from user traffic are more favorable.

   These challenges have been addressed by newer standards and
   techniques (e.g., IPFIX/NetFlow, Packet Sampling (PSAMP), IOAM, and
   YANG-Push), and more are emerging. These standards and techniques
   need to be recognized and accommodated in a new framework.

2.4. Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the network data collection and consumption techniques. Several
   network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and
   gRPC [grpc]) have been widely deployed. Network telemetry allows
   separate entities to acquire data from network devices so that data
   can be visualized and analyzed to support network monitoring and
   operation. Network telemetry covers the conventional network OAM and
   has a wider scope. For instance, it is expected that network
   telemetry can provide the necessary network insight for autonomous
   networks and address the shortcomings of conventional OAM techniques.

   Network telemetry usually assumes machines as data consumers rather
   than human operators.
   Hence, network telemetry can directly
   trigger automated network operation, while in contrast some
   conventional OAM tools were designed and used to help human operators
   to monitor and diagnose the networks and guide manual network
   operations. Such a proposition leads to very different techniques.

   Although new network telemetry techniques are emerging and subject to
   continuous evolution, several characteristics of network telemetry
   have been well accepted. Note that network telemetry is intended to
   be an umbrella term covering a wide spectrum of techniques, so the
   following characteristics are not expected to be held by every
   specific technique.

   *  Push and Streaming: Instead of polling data from network devices,
      telemetry collectors subscribe to streaming data pushed from data
      sources in network devices.

   *  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by human beings. Therefore, the data
      volume can be huge and the processing is optimized for the needs
      of automation in real time.

   *  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. Efforts are made to normalize
      the data representation and unify the protocols, so as to simplify
      data analysis and provide integrated analysis across heterogeneous
      devices and data sources across a network.

   *  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   *  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) based on common naming/ID and needs to be correlated
      to take effect.

   *  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   In addition, an ideal network telemetry solution may also have the
   following features or properties:

   *  In-Network Customization: The data that is generated can be
      customized in the network at run time to cater to the specific
      needs of applications. This needs the support of a programmable
      data plane, which allows probes with custom functions to be
      deployed at flexible locations.

   *  In-Network Data Aggregation and Correlation: Network devices and
      aggregation points can work out which events and what data need
      to be stored, reported, or discarded, thus reducing the load on
      the central collection and processing points while still ensuring
      that the right information is ready to be processed in a timely
      way.

   *  In-Network Processing: Sometimes it is not necessary or feasible
      to gather all information to a central point to be processed and
      acted upon. It is possible for the data processing to be done in
      the network, allowing reactive actions to be taken locally.

   *  Direct Data Plane Export: The data originated from the data plane
      forwarding chips can be directly exported to the data consumer for
      efficiency, especially when the data bandwidth is large and
      real-time processing is required.

   *  In-band Data Collection: In addition to the passive and active
      data collection approaches, a new hybrid approach allows data to
      be collected directly for any target flow along its entire
      forwarding path [I-D.song-opsawg-ifit-framework].

   It is worth noting that a network telemetry system should not be
   intrusive to normal network operations and should avoid the pitfall
   of the "observer effect".
That is, it should not change the network behavior or affect the
forwarding performance. Moreover, high-volume telemetry traffic may
cause network congestion unless proper isolation or traffic
engineering techniques are in place, or congestion control mechanisms
ensure that telemetry traffic backs off if it exceeds the network
capacity. [RFC8084] and [RFC8085] are relevant Best Current
Practices (BCP) in this space.

Although in many cases a system for network telemetry involves a
remote data collecting and consuming entity, it is important to
understand that there are no inherent assumptions about how a system
should be architected. While a network architecture with a
centralized controller (e.g., SDN) seems a natural fit for network
telemetry, network telemetry can work in a distributed fashion as
well. For example, telemetry data producers and consumers can have a
peer-to-peer relationship, in which a network node can be the direct
consumer of telemetry data from other nodes.

2.5. The Necessity of a Network Telemetry Framework

Network data analytics (e.g., machine learning) is applied for
network operation automation, relying on abundant and coherent data
from networks. Data acquisition that is limited to a single source
and static in nature will in many cases not be sufficient to meet an
application's telemetry data needs. As a result, multiple data
sources, involving a variety of techniques and standards, will need
to be integrated. It is desirable to have a framework that
classifies and organizes different telemetry data sources and types,
defines different components of a network telemetry system and their
interactions, and helps coordinate and integrate multiple telemetry
approaches across layers. This allows flexible combinations of data
for different applications, while normalizing and simplifying
interfaces.
In detail, such a framework would benefit the development of network
operation applications for the following reasons:

*  Future networks, autonomous or otherwise, depend on holistic and
   comprehensive network visibility. The use cases and applications
   are better supported uniformly and coherently by an integrated,
   converged mechanism and common telemetry data representations
   wherever feasible. Therefore, the protocols and mechanisms should
   be consolidated into a minimum yet comprehensive set. A telemetry
   framework can help to normalize the technique developments.

*  Network visibility presents multiple viewpoints. For example, the
   device viewpoint takes the network infrastructure as the
   monitoring object from which the network topology and device
   status can be acquired; the traffic viewpoint takes the flows or
   packets as the monitoring object from which the traffic quality
   and path can be acquired. An application may need to switch its
   viewpoint during operation. It may also need to correlate a
   service and its impact on user experience to acquire the
   comprehensive information.

*  Applications require network telemetry to be elastic in order to
   make efficient use of network resources and reduce the impact of
   processing related to network telemetry on network performance.
   For example, routine network monitoring should cover the entire
   network with a low data sampling rate. Only when issues arise or
   critical trends emerge should telemetry data sources be modified
   and telemetry data rates boosted as needed.

*  Efficient data aggregation is critical for applications to reduce
   the overall quantity of data and improve the accuracy of analysis.

A telemetry framework collects together all the telemetry-related
work from different sources and working groups within the IETF.
This makes it possible to assemble a comprehensive network telemetry
system and to avoid repetitious or redundant work. The framework
should cover the concepts and components from the standardization
perspective. This document describes the modules which make up a
network telemetry framework and decomposes the telemetry system into
a set of distinct components that existing and future work can easily
map to.

3. Network Telemetry Framework

The top level network telemetry framework partitions the network
telemetry into four modules based on the telemetry data object source
and represents their relationship. Once the network operation
applications acquire the data from these modules, they can apply data
analytics and take actions. At the next level, the framework
decomposes each module into separate components. Each of the modules
follows the same underlying structure, with one component dedicated
to the configuration of data subscriptions and data sources, a second
component dedicated to encoding and exporting data, and a third
component instrumenting the generation of telemetry related to the
underlying resources. Throughout the framework, the same set of
abstract data acquiring mechanisms and data types (Section 3.3) is
applied. The two-level architecture with the uniform data
abstraction helps accurately pinpoint a protocol or technique to its
position in a network telemetry system or disaggregate a network
telemetry system into manageable parts.

3.1. Top Level Modules

Telemetry can be applied on the forwarding plane, the control plane,
and the management plane in a network, as well as other sources out
of the network, as shown in Figure 1.
Therefore, we categorize the network telemetry into four distinct
modules (management plane, control plane, forwarding plane, and
external data and event telemetry), each having its own interface to
Network Operation Applications.

    +------------------------------+
    |                              |
    |       Network Operation      |<-------+
    |          Applications        |        |
    |                              |        |
    +------------------------------+        |
         ^              ^      ^            |
         |              |      |            |
         V              V      |            V
    +--------------+-----------|---+  +-----------+
    |              | Control   |   |  |           |
    |              | Plane     |   |  | External  |
    |            <--->         |   |  | Data and  |
    |              | Telemetry |   |  | Event     |
    | Management   |       ^   V   |  | Telemetry |
    | Plane        +-------|-------+  |           |
    | Telemetry    |       V       |  +-----------+
    |              | Forwarding    |
    |              | Plane         |
    |            <--->             |
    |              | Telemetry     |
    |              |               |
    +--------------+---------------+

        Figure 1: Modules in Layer Category of NTF

The rationale of this partition lies in the different telemetry data
objects which result in different data sources and export locations.
Such differences have profound implications on in-network data
programming and processing capability, data encoding and transport
protocol, and required data bandwidth and latency. Data can be sent
directly, or proxied via the control and management planes. There
are advantages and disadvantages to both approaches.

Note that in some cases the network controller itself may be the
source of telemetry data that is unique to it or derived from the
telemetry data collected from the network elements. Some of the
principles and taxonomy specific to the control plane and management
plane telemetry could also be applied to the controller when it is
required to provide the telemetry data to Network Operation
Applications hosted outside. This document focuses on network
element telemetry; further details related to controllers are thus
out of scope.

We summarize the major differences of the four modules in the
following table. They are compared from six angles:

*  Data Object

*  Data Export Location

*  Data Model

*  Data Encoding

*  Telemetry Application Protocol

*  Data Transport Method

Data Object is the target and source of each module. Because the
data source varies, the location where data is most conveniently
exported also varies. For example, forwarding plane data mainly
originates as data exported from the forwarding Application-Specific
Integrated Circuits (ASICs), while control plane data mainly
originates from the protocol daemons running on the control CPU(s).
For convenience and efficiency, it is preferred to export the data
off the device from locations near the source. Because the locations
that can export data have different capabilities, different choices
of data model, encoding, and transport method are made to balance the
performance and cost. For example, the forwarding chip has high
throughput but limited capacity for processing complex data and
maintaining state, while the main control CPU is capable of complex
data and state processing, but has limited bandwidth for high
throughput data. As a result, the suitable telemetry protocol for
each module can be different. Some representative techniques are
shown in the corresponding table blocks to highlight the technical
diversity of these modules. Note that the selected techniques just
reflect the de facto state of the art and are by no means exhaustive
(e.g., IPFIX can also be implemented over TCP and SCTP, but that is
not recommended for the forwarding plane). The key point is that one
cannot expect to use a universal protocol to cover all the network
telemetry requirements.
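
To illustrate why the encoding choice depends on the export location,
the following minimal sketch encodes the same interface counter
sample as JSON text and as a fixed-layout binary record and compares
the sizes. The field names and the binary layout are hypothetical
and do not correspond to any protocol named in this document; a real
deployment would use a defined encoding such as GPB or IPFIX.

```python
import json
import struct

# Hypothetical counter sample; field names are illustrative only.
sample = {"if_index": 3, "timestamp": 1638316800, "in_octets": 123456789012}

# Text encoding: self-describing and easy to parse, but larger.
json_bytes = json.dumps(sample, separators=(",", ":")).encode()

# Assumed fixed binary layout in network byte order:
# 32-bit if_index, 64-bit timestamp, 64-bit in_octets.
LAYOUT = struct.Struct("!IQQ")
binary_bytes = LAYOUT.pack(
    sample["if_index"], sample["timestamp"], sample["in_octets"]
)


def decode(buf: bytes) -> dict:
    """Decode the fixed layout back into a dictionary."""
    if_index, timestamp, in_octets = LAYOUT.unpack(buf)
    return {"if_index": if_index, "timestamp": timestamp,
            "in_octets": in_octets}


if __name__ == "__main__":
    # The binary record is smaller but needs a schema agreed in advance.
    print(len(json_bytes), len(binary_bytes))
    print(decode(binary_bytes) == sample)
```

The trade-off shown here is the one described above: a fixed layout
suits a source with high throughput and little processing capacity,
while a self-describing encoding suits a source that can afford the
extra bytes and parsing.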

+-----------+-------------+-------------+--------------+----------+
|Module     |Management   |Control      |Forwarding    |External  |
|           |Plane        |Plane        |Plane         |Data      |
+-----------+-------------+-------------+--------------+----------+
|Object     |config. &    |control      |flow & packet |terminal, |
|           |operation    |protocol &   |QoS, traffic  |social &  |
|           |state        |signaling,   |stat., buffer |environ-  |
|           |             |RIB          |& queue stat.,|mental    |
|           |             |             |ACL, FIB      |          |
+-----------+-------------+-------------+--------------+----------+
|Export     |main control |main control |fwding chip   |various   |
|Location   |CPU          |CPU,         |or linecard   |          |
|           |             |linecard CPU |CPU; main     |          |
|           |             |or forwarding|control CPU   |          |
|           |             |chip         |unlikely      |          |
+-----------+-------------+-------------+--------------+----------+
|Data       |YANG, MIB,   |YANG,        |YANG,         |YANG,     |
|Model      |syslog       |custom       |custom        |custom    |
+-----------+-------------+-------------+--------------+----------+
|Data       |GPB, JSON,   |GPB, JSON,   |plain text    |GPB, JSON,|
|Encoding   |XML          |XML,         |              |XML, plain|
|           |             |plain text   |              |text      |
+-----------+-------------+-------------+--------------+----------+
|Application|gRPC,NETCONF,|gRPC,NETCONF,|IPFIX, traffic|gRPC      |
|Protocol   |RESTCONF     |IPFIX,traffic|mirroring,    |          |
|           |             |mirroring    |gRPC, NETFLOW |          |
+-----------+-------------+-------------+--------------+----------+
|Data       |HTTP(S), TCP |HTTP(S), TCP,|UDP           |HTTP(S),  |
|Transport  |             |UDP          |              |TCP, UDP  |
+-----------+-------------+-------------+--------------+----------+

         Figure 2: Comparison of the Data Object Modules

Note that the interaction with the applications that consume network
telemetry data can be indirect. Some in-device data transfer is
possible. For example, in the management plane telemetry, the
management plane will need to acquire data from the data plane.
Some operational states can only be derived from data plane data
sources such as the interface status and statistics. As another
example, obtaining control plane telemetry data may require the
ability to access the Forwarding Information Base (FIB) of the data
plane.

On the other hand, an application may involve more than one plane and
interact with multiple planes simultaneously. For example, an SLA
compliance application may require both the data plane telemetry and
the control plane telemetry.

The requirements and challenges for each module are summarized as
follows (note that the requirements may pertain across all telemetry
modules; however, we emphasize those that are most pronounced for a
particular plane).

3.1.1. Management Plane Telemetry

The management plane of network elements interacts with the Network
Management System (NMS), and provides information such as performance
data, network logging data, network warning and defect data, and
network statistics and state data. The management plane includes
many protocols, including some that are considered "legacy", such as
SNMP and syslog. Regardless of the protocol, management plane
telemetry must address the following requirements:

*  Convenient Data Subscription: An application should have the
   freedom to choose which data is exported (see Section 3.3) and the
   means and frequency of how that data is exported (e.g., on-change
   or periodic subscription).

*  Structured Data: For automatic network operation, machines will
   replace humans for network data comprehension. Data modeling
   languages, such as YANG, can efficiently describe structured data
   and normalize data encoding and transformation.

*  High Speed Data Transport: In order to keep up with the velocity
   of information, a data source needs to be able to send large
   amounts of data at high frequency.
   Compact encoding formats or data compression schemes are needed
   to reduce the quantity of data and improve the data transport
   efficiency. The subscription mode, by replacing the query mode,
   reduces the interactions between clients and servers and helps to
   improve the data source's efficiency.

*  Network Congestion Avoidance: The application must protect the
   network from congestion by congestion control mechanisms or at
   least circuit breakers. [RFC8084] and [RFC8085] provide some
   solutions in this space.

3.1.2. Control Plane Telemetry

The control plane telemetry refers to the health condition monitoring
of different network control protocols at all layers of the protocol
stack. Keeping track of the operational status of these protocols is
beneficial for detecting, localizing, and even predicting various
network issues, as well as network optimization, in real time and
with fine granularity. Some particular challenges and issues faced
by the control plane telemetry are as follows:

*  One challenging problem for the control plane telemetry is how to
   correlate the End-to-End (E2E) Key Performance Indicators (KPIs)
   to a specific layer's KPIs. For example, IPTV users may describe
   their User Experience (UE) by the video smoothness and definition.
   Then in case of an unusually poor UE KPI or a service
   disconnection, it is non-trivial to delimit and pinpoint the issue
   in the responsible protocol layer (e.g., the Transport Layer or
   the Network Layer), the responsible protocol (e.g., IS-IS or BGP
   at the Network Layer), and finally the responsible device(s) with
   specific reasons.

*  Conventional OAM-based approaches for control plane KPI
   measurement include Ping (L3), Traceroute (L3), Y.1731 [y1731]
   (L2), and so on.
   One common issue behind these methods is that they only measure
   the KPIs instead of reflecting the actual running status of these
   protocols, making them less effective or efficient for control
   plane troubleshooting and network optimization.

*  An example of the control plane telemetry is the BGP Monitoring
   Protocol (BMP), which is currently used for monitoring BGP routes
   and enables rich applications, such as BGP peer analysis, AS
   analysis, prefix analysis, and security analysis. However, the
   monitoring of other layers and protocols and the cross-layer,
   cross-protocol KPI correlations are still in their infancy (e.g.,
   IGP monitoring is not as extensive as BMP) and require further
   research.

*  The requirements and solutions for network congestion avoidance
   are also applicable to the control plane telemetry.

3.1.3. Forwarding Plane Telemetry

An effective forwarding plane telemetry system relies on the data
that the network device can expose. The quality, quantity, and
timeliness of data must meet some stringent requirements. This
raises some challenges for the network data plane devices where the
first-hand data originates.

*  A data plane device's main function is user traffic processing and
   forwarding. While supporting network visibility is important, the
   telemetry is just an auxiliary function, and it should strive to
   not impede normal traffic processing and forwarding (i.e., the
   forwarding behavior should not be altered and the trade-off
   between forwarding performance and telemetry should be
   well-balanced).

*  Network operation applications require end-to-end visibility
   across various sources, which can result in a huge volume of data.
   However, the sheer quantity of data must not exhaust the network
   bandwidth, regardless of the data delivery approach (i.e., whether
   through in-band or out-of-band channels).

*  The data plane devices must provide timely data with the minimum
   possible delay. Long processing, transport, storage, and analysis
   delay can impact the effectiveness of the control loop and even
   render the data useless.

*  The data should be structured and labeled, and easy for
   applications to parse and consume. At the same time, the data
   types needed by applications can vary significantly. The data
   plane devices need to provide enough flexibility and
   programmability to support the precise data provision for
   applications.

*  The data plane telemetry should support incremental deployment and
   work even though some devices are unaware of the system.

*  The requirements and solutions for network congestion avoidance
   are also applicable to the forwarding plane telemetry.

Although not specific to the forwarding plane, these challenges are
more difficult for the forwarding plane because of its limited
resources and flexibility. Data plane programmability is essential
to support network telemetry. Newer data plane forwarding chips are
equipped with advanced telemetry features and provide flexibility to
support customized telemetry functions.

Technique Taxonomy: With regard to how telemetry is instrumented,
the forwarding plane telemetry techniques can be classified along
multiple dimensions.

*  Active, Passive, and Hybrid: This dimension concerns end-to-end
   measurement. Active and passive methods (as well as the hybrid
   types) are well documented in [RFC7799]. Passive methods include
   TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring. These
   methods usually have low data coverage, and improving the coverage
   comes at a very high bandwidth cost. On the other hand, active
   methods include Ping, OWAMP [RFC4656], TWAMP [RFC5357], STAMP
   [RFC8762], and Cisco's SLA Protocol [RFC6812].
   These methods are intrusive and only provide indirect network
   measurements. Hybrid methods, including in-situ OAM
   [I-D.ietf-ippm-ioam-data], Alternate-Marking (AM) [RFC8321], and
   Multipoint Alternate Marking [RFC8889], provide a well-balanced
   and more flexible approach. However, these methods are also more
   complex to implement.

*  In-Band and Out-of-Band: Telemetry data carried in user packets
   before being exported to a data collector is considered in-band
   (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]). Telemetry data
   that is directly exported to a data collector without modifying
   user packets is considered out-of-band (e.g., the postcard-based
   approach described in Appendix A.3.5). It is also possible to
   have hybrid methods, where only the telemetry instruction or
   partial data is carried by user packets (e.g., AM [RFC8321]).

*  End-to-End and In-Network: End-to-End methods start from, and end
   at, the network end hosts (e.g., Ping). In-Network methods work
   in networks and are transparent to end hosts. However, if needed,
   In-Network methods can be easily extended into end hosts.

*  Data Subject: Depending on the telemetry objective, the methods
   can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]),
   path-based (e.g., Traceroute), or node-based (e.g., IPFIX
   [RFC7011]). The various data objects can be packet, flow record,
   measurement, states, and signal.

3.1.4. External Data Telemetry

Events that occur outside the boundaries of the network system are
another important source of network telemetry. Correlating both
internal telemetry data and external events with the requirements of
network systems, as presented in
[I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
functional advantage to management operations.
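
The correlation of internal telemetry data with external events
mentioned above can be sketched as a simple time-window join. This
is only an illustration under assumptions: the record layouts, the
60-second window, and the function name correlate are hypothetical
and are not taken from [I-D.pedro-nmrg-anticipated-adaptation].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalEvent:
    timestamp: float   # seconds, assuming a clock shared with the network
    source: str        # e.g., "seismometer"
    description: str


@dataclass(frozen=True)
class TelemetrySample:
    timestamp: float
    node: str
    metric: str
    value: float


def correlate(events, samples, window=60.0):
    """Pair each external event with the internal samples taken
    within `window` seconds after the event."""
    pairs = []
    for ev in events:
        related = [s for s in samples
                   if 0.0 <= s.timestamp - ev.timestamp <= window]
        pairs.append((ev, related))
    return pairs


if __name__ == "__main__":
    events = [ExternalEvent(100.0, "seismometer", "tremor detected")]
    samples = [
        TelemetrySample(90.0, "r1", "link_loss", 0.0),    # before the event
        TelemetrySample(130.0, "r1", "link_loss", 0.4),   # inside the window
        TelemetrySample(400.0, "r2", "link_loss", 0.0),   # after the window
    ]
    for ev, related in correlate(events, samples):
        print(ev.description, [(s.node, s.value) for s in related])
```

A production system would more likely index samples by time rather
than scan them linearly; the linear scan is kept here for clarity.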

As with other sources of telemetry information, the data and events
must meet strict requirements, especially in terms of timeliness,
which is essential to properly incorporate external event information
into network management applications. The specific challenges are
described as follows:

*  The role of the external event detector can be played by multiple
   elements, including hardware (e.g., physical sensors, such as
   seismometers) and software (e.g., Big Data sources that can
   analyze streams of information, such as Twitter messages). Thus,
   the transmitted data must support different shapes but, at the
   same time, follow a common but extensible schema.

*  Since the main function of the external event detectors is to
   perform the notifications, their timeliness is assumed. However,
   once messages have been dispatched, they must be quickly collected
   and inserted into the control plane with variable priority, which
   is higher for important sources and events and lower for secondary
   ones.

*  The schema used by external detectors must be easily adopted by
   current and future devices and applications. Therefore, it must
   be easily mapped to current data models, such as YANG.

*  As the communication with external entities outside the boundary
   of a provider network may be realized over the Internet, the risk
   of congestion is even more relevant in this context and proper
   counter-measures must be taken. Solutions such as network
   transport circuit breakers are needed as well.

Organizing both internal and external telemetry information together
will be key for the general exploitation of the management
possibilities of current and future network systems, as reflected in
the incorporation of cognitive capabilities into new hardware and
software (virtual) elements.

3.2. Second Level Function Components

The telemetry module at each plane can be further partitioned into
five distinct conceptual components:

*  Data Query, Analysis, and Storage: This component works at the
   network operation application block in Figure 1. It is normally a
   part of the network management system at the receiver side. On
   the one hand, it is responsible for issuing data requirements.
   The data of interest can be modeled data through configuration or
   custom data through programming. The data requirements can be
   queries for one-shot data or subscriptions for events or streaming
   data. On the other hand, it receives, stores, and processes the
   returned data from network devices. Data analysis can be
   interactive to initiate further data queries. This component can
   reside in either network devices or remote controllers. It can be
   centralized or distributed, and involve one or more instances.

*  Data Configuration and Subscription: This component manages data
   queries on devices. It determines the protocol and channel for
   applications to acquire desired data. This component is also
   responsible for configuring the desired data that might not be
   directly available from data sources. The subscription data can
   be described by models, templates, or programs.

*  Data Encoding and Export: This component determines how telemetry
   data is delivered to the data analysis and storage component with
   access control. The data encoding and the transport protocol may
   vary due to the data export location.

*  Data Generation and Processing: The requested data needs to be
   captured, filtered, processed, and formatted in network devices
   from raw data sources. This may involve in-network computing and
   processing on either the fast path or the slow path in network
   devices.

*  Data Object and Source: This component determines the monitoring
   objects and original data sources provisioned in the device. A
   data source usually just provides raw data which needs further
   processing. Each data source can be considered a probe. Some
   data sources can be dynamically installed, while others will be
   more static.

      +----------------------------------------+
    +----------------------------------------+ |
    |                                        | |
    |    Data Query, Analysis, & Storage     | |
    |                                        | +
    +-------+++ -----------------------------+
            |||                   ^^^
            |||                   |||
            ||V                   |||
         +--+V--------------------+++------------+
      +-----V---------------------+------------+ |
    +---------------------+-------+----------+ | |
    | Data Configuration  |                  | | |
    | & Subscription      | Data Encoding    | | |
    | (model, template,   | & Export         | | |
    | & program)          |                  | | |
    +---------------------+------------------| | |
    |                                        | | |
    |           Data Generation              | | |
    |           & Processing                 | | |
    |                                        | | |
    +----------------------------------------| | |
    |                                        | | |
    |        Data Object and Source          | |-+
    |                                        |-+
    +----------------------------------------+

    Figure 3: Components in the Network Telemetry Framework

3.3. Data Acquisition Mechanism and Type Abstraction

Broadly speaking, network data can be acquired through subscription
(push) and query (poll). A subscription is a contract between
publisher and subscriber. After initial setup, the subscribed data
is automatically delivered to registered subscribers until the
subscription expires. There are two variations of subscription. The
subscriptions can be either pre-defined, or the subscribers are
allowed to configure and tailor the published data to their specific
needs.

In contrast, queries are used when a client expects immediate and
one-off feedback from network devices.
The queried data may be directly extracted from some specific data
source, or synthesized and processed from raw data. Queries work
well for interactive network telemetry applications.

In general, data can be pulled (i.e., queried) whenever needed, but
in many cases, pushing the data (i.e., subscription) is more
efficient, and can reduce the latency of a client detecting a change.
From the data consumer point of view, there are four types of data
from network devices that a telemetry data consumer can subscribe to
or query:

*  Simple Data: The data that are steadily available from some
   datastore or static probes in network devices.

*  Derived Data: The data that need to be synthesized or processed in
   network from raw data from one or more network devices. The data
   processing function can be statically or dynamically loaded into
   network devices.

*  Event-triggered Data: The data that are conditionally acquired
   based on the occurrence of some events. An example of
   event-triggered data could be an interface changing operational
   state between up and down. Such data can be actively pushed
   through subscription or passively polled through query. There are
   many ways to model events, including using Finite State Machine
   (FSM) or Event Condition Action (ECA)
   [I-D.wwx-netmod-event-yang].

*  Streaming Data: The data that are continuously generated. They
   can be time series or the dump of databases. For example, an
   interface packet counter is exported every second. The streaming
   data reflect real-time network states and metrics and require
   large bandwidth and processing power. The streaming data are
   always actively pushed to the subscribers.

The above telemetry data types are not mutually exclusive. Rather,
they are often composite.
Derived data is composed of simple data; event-triggered data can be
simple or derived; streaming data can be based on some recurring
event. The relationships of these data types are illustrated in
Figure 4.

    +----------------------+     +-----------------+
    | Event-triggered Data |<----+ Streaming Data  |
    +-------+---+----------+     +-----+---+-------+
            |   |                      |   |
            |   |                      |   |
            |   |   +--------------+   |   |
            |   +-->| Derived Data |<--+   |
            |       +------+-------+       |
            |              |               |
            |              V               |
            |       +--------------+       |
            +------>| Simple Data  |<------+
                    +--------------+

             Figure 4: Data Type Relationship

Subscription usually deals with event-triggered data and streaming
data, and query usually deals with simple data and derived data.
However, other combinations are also possible. Advanced network
telemetry techniques are designed mainly for event-triggered or
streaming data subscription, and derived data query.

3.4. Mapping Existing Mechanisms into the Framework

The following table shows how the existing mechanisms (mainly
published in the IETF and with the emphasis on the latest new
technologies) are positioned in the framework. Given the vast body
of existing work, we cannot provide an exhaustive list, so the
mechanisms in the tables should be considered as just examples.
Also, some comprehensive protocols and techniques may cover multiple
aspects or modules of the framework, so a name in a block only
emphasizes one particular characteristic of it. More details about
some listed mechanisms can be found in Appendix A.

+-------------+-----------------+---------------+--------------+
|             | Management      | Control       | Forwarding   |
|             | Plane           | Plane         | Plane        |
+-------------+-----------------+---------------+--------------+
| data config.| gNMI, NETCONF,  | gNMI, NETCONF,| NETCONF,     |
| & subscribe | RESTCONF, SNMP, | RESTCONF,     | RESTCONF,    |
|             | YANG-Push       | YANG-Push     | YANG-Push    |
+-------------+-----------------+---------------+--------------+
| data gen. & | MIB,            | YANG          | IOAM, PSAMP, |
| process     | YANG            |               | PBT, AM      |
+-------------+-----------------+---------------+--------------+
| data encode.| gRPC, HTTP, TCP | BMP, TCP      | IPFIX, UDP   |
| & export    |                 |               |              |
+-------------+-----------------+---------------+--------------+

                Figure 5: Existing Work Mapping

Although the framework is generally suitable for any network
environment, multi-domain telemetry has some unique challenges that
deserve further architectural consideration, which is out of the
scope of this document.

4. Evolution of Network Telemetry Applications

Network telemetry is an evolving technical area. As networks move
towards automated operation, network telemetry applications undergo
several stages of evolution, each of which adds a new layer of
requirements to the underlying network telemetry techniques. Each
stage is built upon the techniques adopted by the previous stages
plus some new requirements.

Stage 0 - Static Telemetry: The telemetry data source and type are
   determined at design time. The network operator can only
   configure how to use it with limited flexibility.

Stage 1 - Dynamic Telemetry: The custom telemetry data can be
   dynamically programmed or configured at runtime without
   interrupting the network operation, allowing a trade-off among
   resource, performance, flexibility, and coverage.
1213 Stage 2 - Interactive Telemetry: The network operator can 1214 continuously customize and fine tune the telemetry data in real 1215 time to reflect the network operation's visibility requirements. 1216 Compared with Stage 1, the changes are frequent based on the real- 1217 time feedback. At this stage, some tasks can be automated, but 1218 human operators still need to sit in the middle to make decisions. 1220 Stage 3 - Closed-loop Telemetry: The telemetry is free from the 1221 interference of human operators, except for generating the 1222 reports. The intelligent network operation engine automatically 1223 issues the telemetry data requests, analyzes the data, and updates 1224 the network operations in closed control loops. 1226 Existing technologies are ready for stage 0 and stage 1. Individual 1227 stage 2 and stage 3 applications are also possible now. However, the 1228 future autonomic networks may need a comprehensive operation 1229 management system which works at stage 2 and stage 3 to cover all the 1230 network operation tasks. A well-defined network telemetry framework 1231 is the first step towards this direction. 1233 5. Security Considerations 1235 The complexity of network telemetry raises significant security 1236 implications. For example, telemetry data can be manipulated to 1237 exhaust various network resources at each plane as well as the data 1238 consumer; falsified or tampered data can mislead the decision-making 1239 and paralyze networks; wrong configuration and programming for 1240 telemetry is equally harmful. The telemetry data is highly 1241 sensitive, which exposes a lot of information about the network and 1242 its configuration. Some of that information can make designing 1243 attacks against the network much easier (e.g., exact details of what 1244 software and patches have been installed), and allows an attacker to 1245 determine whether a device may be subject to unprotected security 1246 vulnerabilities. 
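As a minimal sketch of one way to detect the falsified or tampered data mentioned above, a data source could attach a keyed message authentication code to each exported record so that the consumer can reject altered records. This document does not prescribe any such mechanism. The function names (sign_record, verify_record), the pre-shared key, and the record fields are all hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON form of the record."""
    # Sorting keys and fixing separators makes the byte encoding deterministic.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, tag: str, key: bytes) -> bool:
    """Compare tags in constant time; False means the record or tag was altered."""
    return hmac.compare_digest(sign_record(record, key), tag)


# Hypothetical key and telemetry record, for illustration only.
key = b"example-shared-key"
record = {"node": "r1", "interface": "eth0", "rx_drops": 12}
tag = sign_record(record, key)

# A record modified in transit no longer matches the original tag.
tampered = dict(record, rx_drops=0)
```

A symmetric key is assumed here only to keep the sketch short. A real deployment would also need key distribution, replay protection, and transport security, none of which are shown.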
1248   Given that this document has proposed a framework for network
1249   telemetry and the telemetry mechanisms discussed are more extensive
1250   (in both message frequency and traffic amount) than the conventional
1251   network OAM concepts, we must also recognize that various new security
1252   considerations may arise. A number of techniques already exist
1253   for securing the forwarding plane, the control plane, and the
1254   management plane in a network, but it is important to consider if any
1255   new threat vectors are now being enabled via the use of network
1256   telemetry procedures and mechanisms.

1258   This document proposes a conceptual architecture for collecting,
1259   transporting, and analyzing a wide variety of data sources in support
1260   of network applications. The protocols, data formats, and
1261   configurations chosen to implement this framework will dictate the
1262   specific security considerations. These considerations may include:

1264   *  Telemetry framework trust and policy model;

1266   *  Role management and access control for enabling and disabling
1267      telemetry capabilities;

1269   *  Protocol transport used for telemetry data and its inherent security
1270      capabilities;

1272   *  Telemetry data stores, storage encryption, methods of access, and
1273      retention practices;

1275   *  Tracking telemetry events and any abnormalities that might
1276      identify malicious attacks using telemetry interfaces;

1278   *  Authentication and signing of telemetry data to make data more
1279      trustworthy;

1281   *  Segregating the telemetry data traffic from the data traffic
1282      carried over the network (e.g., historically management access and
1283      management data may be carried via an independent management
1284      network).

1286   Some security considerations highlighted above may be minimized or
1287   negated with policy management of network telemetry.
In a network
1288   telemetry deployment it would be advantageous to separate telemetry
1289   capabilities into different classes of policies, i.e., Role Based
1290   Access Control and Event-Condition-Action policies. Also, potential
1291   conflicts between network telemetry mechanisms must be detected
1292   accurately and resolved quickly to avoid unnecessary network
1293   telemetry traffic propagation escalating into an unintended or
1294   intended denial-of-service attack.

1296   Further study of the security issues will be required, and it is
1297   expected that security mechanisms and protocols will be developed and
1298   deployed along with a network telemetry system.

1300   6. IANA Considerations

1302   This document includes no request to IANA.

1304   7. Contributors

1306   The other contributors of this document are Tianran Zhou, Zhenbin Li,
1307   Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm.

1309   8. Acknowledgments

1311   We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe
1312   Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe
1313   Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra,
1314   Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin
1315   Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Eric
1316   Vyncke, Jean-Michel Combes, and many others who have provided helpful
1317   comments and suggestions to improve this document.

1319   9. Informative References

1321   [gnmi]     "gNMI - gRPC Network Management Interface",
1322              .

1325   [gpb]      "Google Protocol Buffers",
1326              .

1328   [grpc]     "gRPC, A high performance, open-source universal RPC
1329              framework", .

1331   [I-D.ietf-grow-bmp-local-rib]
1332              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
1333              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
1334              Work in Progress, Internet-Draft,
1335              draft-ietf-grow-bmp-local-rib-13, 31 August 2021,
1336              .

1339   [I-D.ietf-ippm-ioam-data]
1340              Brockners, F., Bhandari, S., and T.
Mizrahi, "Data Fields 1341 for In-situ OAM", Work in Progress, Internet-Draft, draft- 1342 ietf-ippm-ioam-data-16, 8 November 2021, 1343 . 1346 [I-D.ietf-ippm-ioam-direct-export] 1347 Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., 1348 Bhandari, S., Sivakolundu, R., and T. Mizrahi, "In-situ 1349 OAM Direct Exporting", Work in Progress, Internet-Draft, 1350 draft-ietf-ippm-ioam-direct-export-07, 13 October 2021, 1351 . 1354 [I-D.ietf-netconf-distributed-notif] 1355 Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, 1356 "Subscription to Distributed Notifications", Work in 1357 Progress, Internet-Draft, draft-ietf-netconf-distributed- 1358 notif-02, 6 May 2021, . 1361 [I-D.ietf-netconf-udp-notif] 1362 Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., 1363 and P. Lucente, "UDP-based Transport for Configured 1364 Subscriptions", Work in Progress, Internet-Draft, draft- 1365 ietf-netconf-udp-notif-04, 21 October 2021, 1366 . 1369 [I-D.irtf-nmrg-ibn-concepts-definitions] 1370 Clemm, A., Ciavaglia, L., Granville, L. Z., and J. 1371 Tantsura, "Intent-Based Networking - Concepts and 1372 Definitions", Work in Progress, Internet-Draft, draft- 1373 irtf-nmrg-ibn-concepts-definitions-05, 2 September 2021, 1374 . 1377 [I-D.pedro-nmrg-anticipated-adaptation] 1378 Martinez-Julia, P., "Exploiting External Event Detectors 1379 to Anticipate Resource Requirements for the Elastic 1380 Adaptation of SDN/NFV Systems", Work in Progress, 1381 Internet-Draft, draft-pedro-nmrg-anticipated-adaptation- 1382 02, 29 June 2018, . 1385 [I-D.song-ippm-postcard-based-telemetry] 1386 Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, 1387 T., Li, Z., Shin, J., and K. Lee, "In-Situ OAM Marking- 1388 based Direct Export", Work in Progress, Internet-Draft, 1389 draft-song-ippm-postcard-based-telemetry-11, 15 November 1390 2021, . 1393 [I-D.song-opsawg-dnp4iq] 1394 Song, H. and J. 
Gong, "Requirements for Interactive Query 1395 with Dynamic Network Probes", Work in Progress, Internet- 1396 Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017, 1397 . 1400 [I-D.song-opsawg-ifit-framework] 1401 Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "In- 1402 situ Flow Information Telemetry", Work in Progress, 1403 Internet-Draft, draft-song-opsawg-ifit-framework-16, 21 1404 October 2021, . 1407 [I-D.wwx-netmod-event-yang] 1408 Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, 1409 "A YANG Data model for ECA Policy Management", Work in 1410 Progress, Internet-Draft, draft-wwx-netmod-event-yang-10, 1411 1 November 2020, . 1414 [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, 1415 "Simple Network Management Protocol (SNMP)", RFC 1157, 1416 DOI 10.17487/RFC1157, May 1990, 1417 . 1419 [RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. 1420 Schoenwaelder, Ed., "Structure of Management Information 1421 Version 2 (SMIv2)", STD 58, RFC 2578, 1422 DOI 10.17487/RFC2578, April 1999, 1423 . 1425 [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, 1426 DOI 10.17487/RFC2981, October 2000, 1427 . 1429 [RFC3176] Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's 1430 sFlow: A Method for Monitoring Traffic in Switched and 1431 Routed Networks", RFC 3176, DOI 10.17487/RFC3176, 1432 September 2001, . 1434 [RFC3414] Blumenthal, U. and B. Wijnen, "User-based Security Model 1435 (USM) for version 3 of the Simple Network Management 1436 Protocol (SNMPv3)", STD 62, RFC 3414, 1437 DOI 10.17487/RFC3414, December 2002, 1438 . 1440 [RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations 1441 for the Simple Network Management Protocol (SNMP)", 1442 STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002, 1443 . 1445 [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management 1446 Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, 1447 September 2004, . 
1449 [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export 1450 Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, 1451 . 1453 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1454 Zekauskas, "A One-way Active Measurement Protocol 1455 (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, 1456 . 1458 [RFC5085] Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual 1459 Circuit Connectivity Verification (VCCV): A Control 1460 Channel for Pseudowires", RFC 5085, DOI 10.17487/RFC5085, 1461 December 2007, . 1463 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1464 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1465 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1466 . 1468 [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, 1469 DOI 10.17487/RFC5424, March 2009, 1470 . 1472 [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for 1473 the Network Configuration Protocol (NETCONF)", RFC 6020, 1474 DOI 10.17487/RFC6020, October 2010, 1475 . 1477 [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., 1478 and A. Bierman, Ed., "Network Configuration Protocol 1479 (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, 1480 . 1482 [RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, 1483 S., and E. Yedavalli, "Cisco Service-Level Assurance 1484 Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013, 1485 . 1487 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1488 "Specification of the IP Flow Information Export (IPFIX) 1489 Protocol for the Exchange of Flow Information", STD 77, 1490 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1491 . 1493 [RFC7258] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an 1494 Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 1495 2014, . 1497 [RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. 
1498 Weingarten, "An Overview of Operations, Administration, 1499 and Maintenance (OAM) Tools", RFC 7276, 1500 DOI 10.17487/RFC7276, June 2014, 1501 . 1503 [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext 1504 Transfer Protocol Version 2 (HTTP/2)", RFC 7540, 1505 DOI 10.17487/RFC7540, May 2015, 1506 . 1508 [RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., 1509 Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic 1510 Networking: Definitions and Design Goals", RFC 7575, 1511 DOI 10.17487/RFC7575, June 2015, 1512 . 1514 [RFC7799] Morton, A., "Active and Passive Metrics and Methods (with 1515 Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, 1516 May 2016, . 1518 [RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP 1519 Monitoring Protocol (BMP)", RFC 7854, 1520 DOI 10.17487/RFC7854, June 2016, 1521 . 1523 [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", 1524 RFC 7950, DOI 10.17487/RFC7950, August 2016, 1525 . 1527 [RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF 1528 Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017, 1529 . 1531 [RFC8084] Fairhurst, G., "Network Transport Circuit Breakers", 1532 BCP 208, RFC 8084, DOI 10.17487/RFC8084, March 2017, 1533 . 1535 [RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage 1536 Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085, 1537 March 2017, . 1539 [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data 1540 Interchange Format", STD 90, RFC 8259, 1541 DOI 10.17487/RFC8259, December 2017, 1542 . 1544 [RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, 1545 L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, 1546 "Alternate-Marking Method for Passive and Hybrid 1547 Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, 1548 January 2018, . 1550 [RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, 1551 E., and A. 
Tripathy, "Subscription to YANG Notifications", 1552 RFC 8639, DOI 10.17487/RFC8639, September 2019, 1553 . 1555 [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications 1556 for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, 1557 September 2019, . 1559 [RFC8671] Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S. 1560 Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring 1561 Protocol (BMP)", RFC 8671, DOI 10.17487/RFC8671, November 1562 2019, . 1564 [RFC8762] Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple 1565 Two-Way Active Measurement Protocol", RFC 8762, 1566 DOI 10.17487/RFC8762, March 2020, 1567 . 1569 [RFC8889] Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto, 1570 "Multipoint Alternate-Marking Method for Passive and 1571 Hybrid Performance Monitoring", RFC 8889, 1572 DOI 10.17487/RFC8889, August 2020, 1573 . 1575 [RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, 1576 R., and A. Ghanwani, "Service Function Chaining (SFC) 1577 Operations, Administration, and Maintenance (OAM) 1578 Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020, 1579 . 1581 [xml] "Extensible Markup Language (XML) 1.0 (Fifth Edition)", 1582 . 1584 [y1731] "ITU-T Y.1731: OAM Functions and Mechanisms for Ethernet 1585 based networks, 2015", 1586 . 1588 Appendix A. A Survey on Existing Network Telemetry Techniques 1590 In this non-normative appendix, we provide an overview of some 1591 existing techniques and standard proposals for each network telemetry 1592 module. 1594 A.1. Management Plane Telemetry 1596 A.1.1. Push Extensions for NETCONF 1598 NETCONF [RFC6241] is a popular network management protocol 1599 recommended by IETF. Its core strength is for managing 1600 configuration, but can also be used for data collection. YANG-Push 1601 [RFC8641] [RFC8639] extends NETCONF and enables subscriber 1602 applications to request a continuous, customized stream of updates 1603 from a YANG datastore. 
Providing such visibility into changes made 1604 upon YANG configuration and operational objects enables new 1605 capabilities based on the remote mirroring of configuration and 1606 operational state. Moreover, distributed data collection mechanism 1607 [I-D.ietf-netconf-distributed-notif] via UDP based publication 1608 channel [I-D.ietf-netconf-udp-notif] provides enhanced efficiency for 1609 the NETCONF based telemetry. 1611 A.1.2. gRPC Network Management Interface 1613 gRPC Network Management Interface (gNMI) [gnmi] is a network 1614 management protocol based on the gRPC [grpc] RPC (Remote Procedure 1615 Call) framework. With a single gRPC service definition, both 1616 configuration and telemetry can be covered. gRPC is an HTTP/2 1617 [RFC7540]-based open-source micro-service communication framework. 1618 It provides a number of capabilities which are well-suited for 1619 network telemetry, including: 1621 * Full-duplex streaming transport model combined with a binary 1622 encoding mechanism provides good telemetry efficiency. 1624 * gRPC provides higher-level features consistency across platforms 1625 that common HTTP/2 libraries typically do not. This 1626 characteristic is especially valuable for the fact that telemetry 1627 data collectors normally reside on a large variety of platforms. 1629 * The built-in load-balancing and failover mechanism. 1631 A.2. Control Plane Telemetry 1633 A.2.1. BGP Monitoring Protocol 1635 BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP 1636 sessions and is intended to provide a convenient interface for 1637 obtaining route views. 1639 The BGP routing information is collected from the monitored device(s) 1640 to the BMP monitoring station by setting up the BMP TCP session. The 1641 BGP peers are monitored by the BMP Peer Up and Peer Down 1642 Notifications. 
The BGP routes (including Adjacency_RIB_In [RFC7854],
1643   Adjacency_RIB_Out [RFC8671], and Local_RIB
1644   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
1645   Monitoring Message and the BMP Route Mirroring Message, providing
1646   both an initial table dump and real-time route updates. In addition,
1647   BGP statistics are reported through the BMP Stats Report Message,
1648   which could be either timer-triggered or event-driven. Future BMP
1649   extensions could further enrich BGP monitoring applications.

1651   A.3. Data Plane Telemetry

1653   A.3.1. The Alternate Marking (AM) technology

1655   The Alternate Marking method enables efficient measurements of packet
1656   loss, delay, and jitter both in IP and Overlay Networks, as presented
1657   in [RFC8321] and [RFC8889].

1659   This technique can be applied to point-to-point and
1660   multipoint-to-multipoint flows. Alternate Marking creates batches of
1661   packets by alternating the value of 1 bit (or a label) of the packet header.
1662   These batches of packets are unambiguously recognized over the
1663   network and the comparison of packet counters for each batch allows
1664   the packet loss calculation. The same idea can be applied to delay
1665   measurement by selecting ad hoc packets with a marking bit dedicated
1666   for delay measurements.

1668   The Alternate Marking method needs two counters per marking period for
1669   each monitored flow. For instance, with n measurement
1670   points and m monitored flows, the number of packet
1671   counters for each time interval is on the order of n*m*2 (1 per color).

1673   Since networks offer rich sets of network performance measurement
1674   data (e.g., packet counters), conventional approaches run into
1675   limitations. The bottleneck is the generation and export of the data
1676   and the amount of data that can be reasonably collected from the
1677   network.
In addition, management tasks related to determining and 1678 configuring which data to generate lead to significant deployment 1679 challenges. 1681 The Multipoint Alternate Marking approach, described in [RFC8889], 1682 aims to resolve this issue and make the performance monitoring more 1683 flexible in case a detailed analysis is not needed. 1685 An application orchestrates network performance measurements tasks 1686 across the network to allow for optimized monitoring. The 1687 application can choose how roughly or precisely to configure 1688 measurement points depending on the application's requirements. 1690 Using Alternate Marking, it is possible to monitor a Multipoint 1691 Network without in depth examination by using the Network Clustering 1692 (subnetworks that are portions of the entire network that preserve 1693 the same property of the entire network, called clusters). So in the 1694 case that there is packet loss or the delay is too high then the 1695 specific filtering criteria could be applied to gather a more 1696 detailed analysis by using a different combination of clusters up to 1697 a per-flow measurement as described in Alternate-Marking (AM) 1698 [RFC8321]. 1700 In summary, an application can configure end-to-end network 1701 monitoring. If the network does not experience issues, this 1702 approximate monitoring is good enough and is very cheap in terms of 1703 network resources. However, in case of problems, the application 1704 becomes aware of the issues from this approximate monitoring and, in 1705 order to localize the portion of the network that has issues, 1706 configures the measurement points more extensively, allowing more 1707 detailed monitoring to be performed. After the detection and 1708 resolution of the problem, the initial approximate monitoring can be 1709 used again. 1711 A.3.2. 
Dynamic Network Probe 1713 Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] 1714 proposes a programmable means to customize the data that an 1715 application collects from the data plane. A direct benefit of DNP is 1716 the reduction of the exported data. A full DNP solution covers 1717 several components including data source, data subscription, and data 1718 generation. The data subscription needs to define the derived data 1719 which can be composed and derived from the raw data sources. The 1720 data generation takes advantage of the moderate in-network computing 1721 to produce the desired data. 1723 While DNP can introduce unforeseeable flexibility to the data plane 1724 telemetry, it also faces some challenges. It requires a flexible 1725 data plane that can be dynamically reprogrammed at run-time. The 1726 programming API is yet to be defined. 1728 A.3.3. IP Flow Information Export (IPFIX) Protocol 1730 Traffic on a network can be seen as a set of flows passing through 1731 network elements. IP Flow Information Export (IPFIX) [RFC7011] 1732 provides a means of transmitting traffic flow information for 1733 administrative or other purposes. A typical IPFIX enabled system 1734 includes a pool of Metering Processes that collects data packets at 1735 one or more Observation Points, optionally filters them and 1736 aggregates information about these packets. An Exporter then gathers 1737 each of the Observation Points together into an Observation Domain 1738 and sends this information via the IPFIX protocol to a Collector. 1740 A.3.4. In-Situ OAM 1742 Classical passive and active monitoring and measurement techniques 1743 are either inaccurate or resource-consuming. It is preferable to 1744 directly acquire data associated with a flow's packets when the 1745 packets pass through a network. 
In-situ OAM (IOAM)
1746   [I-D.ietf-ippm-ioam-data], a data generation technique, embeds a new
1747   instruction header into user packets, and the instruction directs the
1748   network nodes to add the requested data to the packets. Thus, at the
1749   path end, the packet's experience gained on the entire forwarding
1750   path can be collected. Such firsthand data is invaluable to many
1751   network OAM applications.

1753   However, IOAM also faces some challenges. The issues of performance
1754   impact, security, scalability and overhead limits, encapsulation
1755   difficulties in some protocols, and cross-domain deployment need to
1756   be addressed.

1758   A.3.5. Postcard Based Telemetry

1760   The postcard-based telemetry (PBT), as embodied in IOAM DEX
1761   [I-D.ietf-ippm-ioam-direct-export] and IOAM Marking
1762   [I-D.song-ippm-postcard-based-telemetry], is a complementary
1763   technique to the passport-based IOAM. PBT directly exports data at
1764   each node through an independent packet. At the cost of higher
1765   bandwidth overhead and the need for data correlation, PBT shows
1766   several unique advantages. It can also help to identify the packet drop
1767   location in case a packet is dropped on its forwarding path.

1769   A.3.6. Existing OAM for Specific Data Planes

1771   Various data planes raise unique OAM requirements. The IETF has
1772   published OAM technique and framework documents (e.g., [RFC8924] and
1773   [RFC5085]) targeting different data planes such as Multi-Protocol
1774   Label Switching (MPLS), L2 Virtual Private Network (L2-VPN), Network
1775   Virtualization Overlays (NVO3), Virtual Extensible LAN (VXLAN), Bit
1776   Indexed Explicit Replication (BIER), Service Function Chaining (SFC),
1777   Segment Routing (SR), and Deterministic Networking (DETNET). The
1778   aforementioned data plane telemetry techniques can be used to enhance
1779   the OAM capability on such data planes.

1781   A.4. External Data and Event Telemetry

1783   A.4.1.
Sources of External Events

1785   To ensure that the information provided by external event detectors
1786   and used by the network management solutions is meaningful for
1787   management purposes, the network telemetry framework must ensure that
1788   such detectors (sources) are easily connected to the management
1789   solutions (sinks). This requires specifying a list of
1790   potential external data sources that could be of interest in network
1791   management and matching it to the connectors and/or interfaces required
1792   to connect them.

1794   Categories of external event sources that may be of interest to
1795   network management include:

1797   *  Smart objects and sensors. With the consolidation of the Internet
1798      of Things (IoT), any network system will have many smart objects
1799      attached to its physical surroundings and logical operation
1800      environments. Most of these objects will be essentially based on
1801      sensors of many kinds (e.g., temperature, humidity, presence) and
1802      the information they provide can be very useful for the management
1803      of the network, even when they are not specifically deployed for
1804      such purpose. Elements of this source type will usually provide a
1805      specific protocol for interaction, especially one of those
1806      protocols related to IoT, such as the Constrained Application
1807      Protocol (CoAP).

1809   *  Online news reporters. Several online news services have the
1810      ability to provide an enormous quantity of information about
1811      different events occurring in the world. Some of those events can
1812      impact the network system managed by a specific framework and,
1813      therefore, such information may be of interest to the management
1814      solution. For instance, diverse security reports, such as the
1815      Common Vulnerabilities and Exposures (CVE), can be issued by the
1816      corresponding authority and used by the management solution to
1817      update the managed system if needed.
Instead of a specific
1818      protocol and data format, the sources of this kind of information
1819      usually follow a relaxed but structured format. This format will
1820      be part of both the ontology and information model of the
1821      telemetry framework.

1823   *  Global event analyzers. The advance of Big Data analyzers
1824      provides a huge amount of information and, more interestingly, the
1825      identification of events detected by analyzing many data streams
1826      from different origins. In contrast with the other types of
1827      sources, which are focused on specific events, the detectors of
1828      this source type will detect generic events. For example, during
1829      a sports event, some unexpected movement makes it fascinating and
1830      many people connect to sites that are reporting on the event. The
1831      underlying networks supporting the services that cover the event
1832      can be affected by such a situation, so their management solutions
1833      should be aware of it. In contrast with the other source types, a
1834      new information model, format, and reporting protocol is required
1835      to integrate the detectors of this type with the management
1836      solution.

1838   Additional detector types can be added to the system, but
1839   they will generally be the result of composing the properties offered
1840   by these main classes.

1842   A.4.2. Connectors and Interfaces

1844   For external event detectors to be properly integrated with
1845   other management solutions, both elements must expose interfaces and
1846   protocols that are subject to their particular objective. Since
1847   external event detectors will be focused on providing their
1848   information to their main consumers, which generally will not be
1849   limited to the network management solutions, the framework must
1850   include the definition of the required connectors for ensuring that the
1851   interconnection between detectors (sources) and their consumers
1852   within the management systems (sinks) is effective.
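The connector role described above can be illustrated with a small sketch. It assumes that each source type has an adapter that converts its native payload into one common event record, and that the connector forwards that record to every registered sink. The class and function names (ExternalEvent, Connector, sensor_adapter) and the payload fields are hypothetical; the framework itself does not define such an API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExternalEvent:
    """Common record handed to management-side consumers (hypothetical shape)."""
    source_type: str  # e.g., "sensor", "news", "global-analyzer"
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


class Connector:
    """Maps source-specific payloads to ExternalEvent and forwards them to sinks."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[dict], ExternalEvent]] = {}
        self._sinks: List[Callable[[ExternalEvent], None]] = []

    def register_adapter(self, source_type: str,
                         adapter: Callable[[dict], ExternalEvent]) -> None:
        self._adapters[source_type] = adapter

    def add_sink(self, sink: Callable[[ExternalEvent], None]) -> None:
        self._sinks.append(sink)

    def deliver(self, source_type: str, raw: dict) -> ExternalEvent:
        # An unknown source type raises KeyError rather than being dropped silently.
        event = self._adapters[source_type](raw)
        for sink in self._sinks:
            sink(event)
        return event


def sensor_adapter(raw: dict) -> ExternalEvent:
    """Adapter for an assumed sensor payload with 'kind' and 'value' fields."""
    return ExternalEvent("sensor", raw["kind"], {"value": str(raw["value"])})


# The sink here is just a list; a real sink would be a management application.
received: List[ExternalEvent] = []
connector = Connector()
connector.register_adapter("sensor", sensor_adapter)
connector.add_sink(received.append)
connector.deliver("sensor", {"kind": "temperature", "value": 41.5})
```

Keeping the per-source adapters separate from the sinks means a new detector type only needs a new adapter, while existing consumers continue to receive the same record shape.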
1854 In some situations, the interconnection between the external event 1855 detectors and the management system is via the management plane. For 1856 those situations there will be a special connector that provides the 1857 typical interfaces found in most other elements connected to the 1858 management plane. For instance, the interfaces could accomplish this 1859 with a specific data model (YANG) and specific telemetry protocol, 1860 such as NETCONF, YANG-Push, or gRPC. 1862 Authors' Addresses 1864 Haoyu Song 1865 Futurewei 1866 United States of America 1868 Email: haoyu.song@futurewei.com 1870 Fengwei Qin 1871 China Mobile 1872 P.R. China 1874 Email: qinfengwei@chinamobile.com 1876 Pedro Martinez-Julia 1877 NICT 1878 Japan 1880 Email: pedro@nict.go.jp 1882 Laurent Ciavaglia 1883 Rakuten Mobile 1884 France 1886 Email: laurent.ciavaglia@rakuten.com 1887 Aijun Wang 1888 China Telecom 1889 P.R. China 1891 Email: wangaj.bri@chinatelecom.cn