OPSAWG                                                           H. Song
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: 2 June 2022                                        China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                          Rakuten Mobile
                                                                 A. Wang
                                                           China Telecom
                                                        29 November 2021

                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-11

Abstract

Network telemetry is a technology for gaining network insight and
facilitating efficient and automated network management. It
encompasses various techniques for remote data generation,
collection, correlation, and consumption. This document describes an
architectural framework for network telemetry, motivated by
challenges that are encountered as part of the operation of networks
and by the requirements that ensue. This document clarifies the
terminologies and classifies the modules and components of a network
telemetry system from different perspectives. The framework and
taxonomy help to set a common ground for the collection of related
work and provide guidance for related technique and standard
developments.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current
Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.
It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on 2 June 2022.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Revised BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Revised BSD License.

Table of Contents

1. Introduction
  1.1. Glossary
2. Background
  2.1. Telemetry Data Coverage
  2.2. Use Cases
  2.3. Challenges
  2.4. Network Telemetry
  2.5. The Necessity of a Network Telemetry Framework
3. Network Telemetry Framework
  3.1. Top Level Modules
    3.1.1. Management Plane Telemetry
    3.1.2. Control Plane Telemetry
    3.1.3. Forwarding Plane Telemetry
    3.1.4. External Data Telemetry
  3.2. Second Level Function Components
  3.3. Data Acquisition Mechanism and Type Abstraction
  3.4. Mapping Existing Mechanisms into the Framework
4. Evolution of Network Telemetry Applications
5. Security Considerations
6. IANA Considerations
7. Contributors
8. Acknowledgments
9. Informative References
Appendix A. A Survey on Existing Network Telemetry Techniques
  A.1. Management Plane Telemetry
    A.1.1. Push Extensions for NETCONF
    A.1.2. gRPC Network Management Interface
  A.2. Control Plane Telemetry
    A.2.1. BGP Monitoring Protocol
  A.3. Data Plane Telemetry
    A.3.1. The Alternate Marking (AM) technology
    A.3.2. Dynamic Network Probe
    A.3.3. IP Flow Information Export (IPFIX) Protocol
    A.3.4. In-Situ OAM
    A.3.5. Postcard Based Telemetry
    A.3.6. Existing OAM for Specific Data Planes
  A.4. External Data and Event Telemetry
    A.4.1. Sources of External Events
    A.4.2. Connectors and Interfaces
Authors' Addresses

1. Introduction

Network visibility is the ability of management tools to see the
state and behavior of a network, which is essential for successful
network operation.
Network Telemetry revolves around network data that can help provide
insights about the current state of the network, including network
devices, forwarding, control, and management planes. Such data can be
generated and obtained through a variety of techniques, including but
not limited to network instrumentation and measurements, and can be
processed for purposes ranging from service assurance to network
security using a wide variety of techniques including machine
learning, data analysis, and correlation. In this document, Network
Telemetry refers to both the data itself (i.e., "Network Telemetry
Data") and the techniques and processes used to generate, export,
collect, and consume that data for use by potentially automated
management applications. Network telemetry extends beyond the
historical network Operations, Administration, and Maintenance (OAM)
techniques and is expected to support better flexibility,
scalability, accuracy, coverage, and performance.

However, the term "network telemetry" lacks an unambiguous
definition. Its scope and coverage cause confusion and
misunderstandings. It is beneficial to clarify the concept and
provide a clear architectural framework for network telemetry, so
that we can articulate the technical field and better align the
related techniques and standards work.

To fulfill such an undertaking, we first discuss some key
characteristics of network telemetry that set a clear distinction
from conventional network OAM and show that some conventional OAM
technologies can be considered a subset of the network telemetry
technologies. We then provide an architectural framework for network
telemetry that includes four modules, each concerned with a
different category of telemetry data and corresponding procedures.
All the modules are internally structured in the same way, including
components that configure data sources with regard to what data to
generate and how to make it available to client applications,
components that instrument the underlying data sources, and
components that perform the actual rendering, encoding, and
exporting of the generated data. We show how the network telemetry
framework can benefit current and future network operations.
Based on the distinction of modules and function components, we can
map the existing and emerging techniques and protocols into the
framework. The framework can also simplify the tasks of designing,
maintaining, and understanding a network telemetry system. Finally,
we outline the evolution stages of the network telemetry system and
discuss the potential security concerns.

The purpose of the framework and taxonomy is to set a common ground
for the collection of related work and provide guidance for future
technique and standard developments. To the best of our knowledge,
this document is the first such effort for network telemetry in
industry standards organizations.

1.1. Glossary

Before further discussion, we list some key terminology and acronyms
used in this document. We make an intentional differentiation between
the terms network telemetry and OAM. However, it should be
understood that there is no hard-line distinction between the two
concepts. Rather, network telemetry is considered an extension of
OAM. It covers all the existing OAM protocols but puts more emphasis
on the newer and emerging techniques and protocols concerning all
aspects of network data from acquisition to consumption.

AI: Artificial Intelligence. In the network domain, AI refers to the
   machine-learning-based technologies for automated network
   operation and other tasks.

AM: Alternate Marking, a flow performance measurement method,
   specified in [RFC8321].

BMP: BGP Monitoring Protocol, specified in [RFC7854].

DPI: Deep Packet Inspection, referring to techniques that examine
   packet content beyond the L3/L4 headers.

gNMI: gRPC Network Management Interface, a network management
   protocol from the OpenConfig Operator Working Group, mainly
   contributed by Google. See [gnmi] for details.

GPB: Google Protocol Buffer, an extensible mechanism for serializing
   structured data. See [gpb] for details.

gRPC: gRPC Remote Procedure Call, an open-source, high-performance
   RPC framework on which gNMI is based. See [grpc] for details.

IPFIX: IP Flow Information Export Protocol, specified in [RFC7011].

IOAM: In-situ OAM [I-D.ietf-ippm-ioam-data], a dataplane on-path
   telemetry technique.

JSON: JavaScript Object Notation, an open standard file format and
   data interchange format that uses human-readable text to store and
   transmit data objects, specified in [RFC8259].

MIB: Management Information Base, a database used for managing the
   entities in a network.

NETCONF: Network Configuration Protocol, specified in [RFC6241].

NetFlow: A Cisco protocol for flow record collection, described in
   [RFC3954].

Network Telemetry: The process and instrumentation for acquiring and
   utilizing network data remotely for network monitoring and
   operation. A general term for a large set of network visibility
   techniques and protocols, concerning aspects like data generation,
   collection, correlation, and consumption. Network telemetry
   addresses the current network operation issues and enables smooth
   evolution toward future intent-driven autonomous networks.

NMS: Network Management System, referring to applications that allow
   network administrators to manage a network.

OAM: Operations, Administration, and Maintenance.
   A group of network management functions that provide network fault
   indication, fault localization, performance information, and data
   and diagnosis functions. Most conventional network monitoring
   techniques and protocols belong to network OAM.

PBT: Postcard-Based Telemetry, a dataplane on-path telemetry
   technique. A representative technique is described in
   [I-D.ietf-ippm-ioam-direct-export].

RESTCONF: An HTTP-based protocol that provides a programmatic
   interface for accessing data defined in YANG, using the datastore
   concepts defined in NETCONF, as specified in [RFC8040].

SMIv2: Structure of Management Information Version 2, defining MIB
   objects, specified in [RFC2578].

SNMP: Simple Network Management Protocol. Versions 1, 2, and 3 are
   specified in [RFC1157], [RFC3416], and [RFC3414], respectively.

XML: Extensible Markup Language, a markup language for data
   encoding that is both human-readable and machine-readable,
   specified by W3C [xml].

YANG: A data modeling language for the definition of data sent over
   network management protocols such as NETCONF and RESTCONF. YANG
   is defined in [RFC6020] and [RFC7950].

YANG ECA: A YANG model for Event-Condition-Action policies, defined
   in [I-D.wwx-netmod-event-yang].

YANG-Push: A mechanism that allows subscriber applications to
   request a stream of updates from a YANG datastore on a network
   device. Details are specified in [RFC8641] and [RFC8639].

2. Background

The term "big data" is used to describe extremely large volumes of
data that can be analyzed computationally to reveal patterns,
trends, and associations. Networks are undoubtedly a source of big
data because of their scale and the volume of network traffic they
forward. When a network's endpoints do not represent individual
users (e.g.,
in industrial, datacenter, and infrastructure contexts),
network operations can often benefit from large-scale data collection
without breaching user privacy.

Today one can access advanced big data analytics capability through a
plethora of commercial and open source platforms (e.g., Apache
Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
learning). Thanks to the advance of computing and storage
technologies, network big data analytics gives network operators an
opportunity to gain network insights and move towards network
autonomy. Some operators have started to explore the application of
Artificial Intelligence (AI) to make sense of network data. Software
tools can use the network data to detect and react to network faults,
anomalies, and policy violations, and to predict future events. In
turn, network policy updates for planning, intrusion prevention,
optimization, and self-healing may be applied.

It is conceivable that an autonomic network [RFC7575] is the logical
next step for network evolution following Software-Defined Networking
(SDN), aiming to reduce (or even eliminate) human labor, make more
efficient use of network resources, and provide better services more
aligned with customer requirements. The related technique of
Intent-based Networking (IBN)
[I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility
and telemetry data in order to ensure that the network is behaving as
intended.

However, while the data processing capability has improved and
applications are hungry for more data, the networks lag behind in
extracting and translating network data into useful and actionable
information in efficient ways. The system bottleneck is shifting
from data consumption to data supply. Both the number of network
nodes and the traffic bandwidth keep increasing at a fast pace.
Network configurations and policies change at shorter time intervals
than before. More subtle events and fine-grained data through all
network planes need to be captured and exported in real time. In a
nutshell, it is a challenge to get enough high-quality data out of
the network in a manner that is efficient, timely, and flexible.
Therefore, we need to survey the existing technologies and protocols
and identify any potential gaps.

In the remainder of this section, we first clarify the scope of
network data (i.e., telemetry data) concerned in this context. Then,
we discuss several key use cases for today's and future network
operations. Next, we show why the current network OAM techniques and
protocols are insufficient for these use cases. The discussion
underlines the need for new methods, techniques, and protocols, as
well as extensions of existing ones, which we group under the
umbrella term "Network Telemetry".

2.1. Telemetry Data Coverage

Any information that can be extracted from networks (including data
plane, control plane, and management plane) and used to gain
visibility or as a basis for actions is considered telemetry data. It
includes statistics, event records and logs, snapshots of state,
configuration data, etc. It also covers the outputs of any active
and passive measurements [RFC7799]. In some cases, raw data is
processed in the network before being sent to a data consumer. Such
processed data is also considered telemetry data. The value of
telemetry data varies. A smaller amount of high-quality data is often
better than a large amount of low-quality data. A classification of
telemetry data is provided in Section 3. To preserve user privacy,
the user packet content should not be collected.

2.2. Use Cases

The following set of use cases is essential for network operations.
While the list is by no means exhaustive, it is enough to highlight
the requirements for data velocity, variety, volume, and veracity,
the attributes of big data, in networks.

* Security: Network intrusion detection and prevention systems need
  to monitor network traffic and activities and act upon anomalies.
  Given increasingly sophisticated attack vectors coupled with
  increasingly severe consequences of security breaches, new tools
  and techniques need to be developed, relying on wider and deeper
  visibility into networks. The ultimate goal is to achieve the
  ideal security with no, or only minimal, human intervention.

* Policy and Intent Compliance: Network policies are the rules that
  constrain the services for network access, provide service
  differentiation, or enforce specific treatment on the traffic.
  For example, a service function chain is a policy that requires
  the selected flows to pass through a set of ordered network
  functions. Intent, as defined in
  [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational
  goals that a network should meet and outcomes that a network is
  supposed to deliver, defined in a declarative manner without
  specifying how to achieve or implement them. An intent requires a
  complex translation and mapping process before being applied on
  networks. While a policy or intent is enforced, the compliance
  needs to be verified and monitored continuously by relying on
  visibility that is provided through network telemetry data. Any
  violation must be reported immediately, potentially resulting in
  updates to how the policy or intent is applied in the network to
  ensure that it remains in force, or otherwise alerting the network
  administrator to the policy or intent violation.

* SLA Compliance: A Service-Level Agreement (SLA) defines the level
  of service a user expects from a network operator, which includes
  the metrics for the service measurement and remedy/penalty
  procedures when the service level misses the agreement. Users
  need to check if they get the service as promised, and network
  operators need to evaluate how they can deliver the services that
  can meet the SLA based on real-time network telemetry data,
  including data from network measurements.

* Root Cause Analysis: Any network failure can be the effect of a
  sequence of chained events. Troubleshooting and recovery require
  quick identification of the root cause of any observable issues.
  However, the root cause is not always straightforward to identify,
  especially when the failure is sporadic and the number of event
  messages, both related and unrelated to the same cause, is
  overwhelming. While machine learning technologies can be used for
  root cause analysis, it is up to the network to sense and provide
  the relevant diagnostic data, which are either actively fed into,
  or passively retrieved by, machine learning applications.

* Network Optimization: This covers all short-term and long-term
  network optimization techniques, including load balancing, Traffic
  Engineering (TE), and network planning. Network operators are
  motivated to optimize their network utilization and differentiate
  services for better Return On Investment (ROI) or lower Capital
  Expenditures (CAPEX). The first step is to know the real-time
  network conditions before applying policies for traffic
  manipulation. In some cases, micro-bursts need to be detected in
  a very short time frame so that fine-grained traffic control can
  be applied to avoid network congestion.
  Long-term planning of network capacity and topology requires
  analysis of real-world network telemetry data that is obtained
  over long periods of time.

* Event Tracking and Prediction: The visibility into traffic path
  and performance is critical for services and applications that
  rely on healthy network operation. Numerous related network
  events are of interest to network operators. For example, network
  operators want to learn where and why packets are dropped for an
  application flow. They also want to be warned of issues in
  advance, so proactive actions can be taken to avoid catastrophic
  consequences.

2.3. Challenges

For a long time, network operators have relied upon SNMP [RFC3416],
Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the
network. Some other OAM techniques as described in [RFC7276] are
also used to facilitate network troubleshooting. These conventional
techniques are not sufficient to support the above use cases for the
following reasons:

* Most use cases need to continuously monitor the network and
  dynamically refine the data collection in real time. Poll-based,
  low-frequency data collection is ill-suited for these
  applications. Subscription-based streaming data directly pushed
  from the data source (e.g., the forwarding chip) is preferred to
  provide enough data quantity and precision at scale.

* Comprehensive data is needed from packet processing engine to
  traffic manager, from line cards to main control board, from user
  flows to control protocol packets, from device configurations to
  operations, and from physical layer to application layer.
  Conventional OAM only covers a narrow range of data (e.g., SNMP
  only handles data from the Management Information Base (MIB)).
  Classical network devices cannot provide all the necessary probes.
  More open and programmable network devices are therefore needed.

* Many application scenarios need to correlate network-wide data
  from multiple sources (i.e., from distributed network devices,
  different components of a network device, or different network
  planes). A piecemeal solution often lacks the capability to
  consolidate the data from multiple sources. The composition of a
  complete solution, as partly proposed by the Autonomic Resource
  Control Architecture (ARCA)
  [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
  guided by a comprehensive framework.

* Some conventional OAM techniques (e.g., CLI and Syslog) lack a
  formal data model. Unstructured data hinders tool automation and
  application extensibility. Standardized data models are essential
  to support programmable networks.

* Although some conventional OAM techniques support data push (e.g.,
  SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the
  pushed data are limited to only predefined management plane
  warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow).
  Network operators require data with arbitrary source, granularity,
  and precision, which are beyond the capability of the existing
  techniques.

* The conventional passive measurement techniques can either consume
  excessive network resources and render excessive redundant data,
  or lead to inaccurate results; on the other hand, the conventional
  active measurement techniques can interfere with the user traffic
  and their results are indirect. Techniques that can collect
  direct and on-demand data from user traffic are more favorable.

These challenges have been addressed by newer standards and
techniques (e.g., IPFIX/NetFlow, Packet Sampling (PSAMP), IOAM, and
YANG-Push), and more are emerging. These standards and techniques
need to be recognized and accommodated in a new framework.

2.4. Network Telemetry

Network telemetry has emerged as a mainstream technical term to refer
to the network data collection and consumption techniques. Several
network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and
gRPC [grpc]) have been widely deployed. Network telemetry allows
separate entities to acquire data from network devices so that data
can be visualized and analyzed to support network monitoring and
operation. Network telemetry covers the conventional network OAM and
has a wider scope. It is expected that network telemetry can provide
the necessary network insight for autonomous networks and address the
shortcomings of conventional OAM techniques.

Network telemetry usually assumes machines as data consumers rather
than human operators. Hence, network telemetry can directly trigger
automated network operation, while in contrast some conventional OAM
tools are designed and used to help human operators monitor and
diagnose the networks and guide manual network operations. Such a
proposition leads to very different techniques.

Although new network telemetry techniques are emerging and subject to
continuous evolution, several characteristics of network telemetry
have been well accepted. Note that network telemetry is intended to
be an umbrella term covering a wide spectrum of techniques, so the
following characteristics are not expected to be held by every
specific technique.

* Push and Streaming: Instead of polling data from network devices,
  telemetry collectors subscribe to streaming data pushed from data
  sources in network devices.

* Volume and Velocity: The telemetry data is intended to be consumed
  by machines rather than by human beings. Therefore, the data
  volume can be huge and the processing is optimized for the needs
  of automation in real time.

* Normalization and Unification: Telemetry aims to address the
  overall network automation needs. Efforts are made to normalize
  the data representation and unify the protocols, so as to simplify
  data analysis and provide integrated analysis across heterogeneous
  devices and data sources across a network.

* Model-based: The telemetry data is modeled in advance, which allows
  applications to configure and consume data with ease.

* Data Fusion: The data for a single application can come from
  multiple data sources (e.g., cross-domain, cross-device, and
  cross-layer) based on common naming/ID and needs to be correlated
  to take effect.

* Dynamic and Interactive: Since network telemetry is meant to be
  used in a closed control loop for network automation, it needs to
  run continuously and adapt to the dynamic and interactive queries
  from the network operation controller.

In addition, an ideal network telemetry solution may also have the
following features or properties:

* In-Network Customization: The data that is generated can be
  customized in the network at run time to cater to the specific
  needs of applications. This requires the support of a programmable
  data plane, which allows probes with custom functions to be
  deployed at flexible locations.

* In-Network Data Aggregation and Correlation: Network devices and
  aggregation points can work out which events and what data need
  to be stored, reported, or discarded, thus reducing the load on the
  central collection and processing points while still ensuring that
  the right information is ready to be processed in a timely way.

* In-Network Processing: Sometimes it is not necessary or feasible
  to gather all information to a central point to be processed and
  acted upon. It is possible for the data processing to be done in
  the network, allowing reactive actions to be taken locally.

* Direct Data Plane Export: The data originated from the data plane
  forwarding chips can be directly exported to the data consumer for
  efficiency, especially when the data bandwidth is large and
  real-time processing is required.

* In-band Data Collection: In addition to the passive and active
  data collection approaches, a hybrid approach allows data to be
  collected directly for any target flow along its entire forwarding
  path [I-D.song-opsawg-ifit-framework].

It is worth noting that a network telemetry system should not be
intrusive to normal network operations; it should avoid the pitfall
of the "observer effect". That is, it should not change the network
behavior or affect the forwarding performance. Moreover,
high-volume telemetry traffic may cause network congestion unless
proper isolation or traffic engineering techniques are in place, or
congestion control mechanisms ensure that telemetry traffic backs off
if it exceeds the network capacity. [RFC8084] and [RFC8085] are
relevant Best Current Practices (BCPs) in this space.

Although in many cases a system for network telemetry involves a
remote data collecting and consuming entity, it is important to
understand that there are no inherent assumptions about how a system
should be architected. While a network architecture with a
centralized controller (e.g., SDN) seems a natural fit for network
telemetry, network telemetry can work in distributed fashions as
well. For example, telemetry data producers and consumers can have a
peer-to-peer relationship, in which a network node can be the direct
consumer of telemetry data from other nodes.

2.5. The Necessity of a Network Telemetry Framework

Network data analytics and machine-learning technologies are applied
for network operation automation, relying on abundant and coherent
data from networks.
Data acquisition that is limited to a single 575 source and static in nature will in many cases not be sufficient to 576 meet an application's telemetry data needs. As a result, multiple 577 data sources, involving a variety of techniques and standards, will 578 need to be integrated. It is desirable to have a framework that 579 classifies and organizes different telemetry data sources and types, 580 defines different components of a network telemetry system and their 581 interactions, and helps coordinate and integrate multiple telemetry 582 approaches across layers. This allows flexible combinations of data 583 for different applications, while normalizing and simplifying 584 interfaces. In detail, such a framework would benefit application 585 development for the following reasons: 587 * Future networks, autonomous or otherwise, depend on holistic and 588 comprehensive network visibility. All the use cases and 589 applications are better supported uniformly and coherently 590 under a single intelligent agent using an integrated, converged 591 mechanism and common telemetry data representations wherever 592 feasible. Therefore, the protocols and mechanisms should be 593 consolidated into a minimal yet comprehensive set. A telemetry 594 framework can help to normalize the technique developments. 596 * Network visibility presents multiple viewpoints. For example, the 597 device viewpoint takes the network infrastructure as the 598 monitoring object from which the network topology and device 599 status can be acquired; the traffic viewpoint takes the flows or 600 packets as the monitoring object from which the traffic quality 601 and path can be acquired. An application may need to switch its 602 viewpoint during operation. It may also need to correlate a 603 service and its impact on user experience to acquire 604 comprehensive information.
606 * Applications require network telemetry to be elastic in order to 607 make efficient use of network resources and reduce the impact of 608 processing related to network telemetry on network performance. 609 For example, routine network monitoring should cover the entire 610 network with a low data sampling rate. Only when issues arise or 611 critical trends emerge should telemetry data sources be modified 612 and telemetry data rates boosted as needed. 614 * Efficient data aggregation is critical for applications to reduce 615 the overall quantity of data and improve the accuracy of analysis. 617 A telemetry framework collects together all the telemetry-related 618 work from different sources and working groups within the IETF. This 619 makes it possible to assemble a comprehensive network telemetry 620 system and to avoid repetitious or redundant work. The framework 621 should cover the concepts and components from the standardization 622 perspective. This document describes the modules which make up a 623 network telemetry framework and decomposes the telemetry system into 624 a set of distinct components that existing and future work can easily 625 map to. 627 Disclaimer: large-scale network data collection is a major threat to 628 user privacy [RFC7258]. The network telemetry framework presented in 629 this document should not be applied to collect and retain individual 630 user data or any data that can identify end users without consent. 631 Any data collection or retention using the framework must be tightly 632 limited to protect user privacy. 634 3. Network Telemetry Framework 636 The top level network telemetry framework partitions the network 637 telemetry into four modules based on the telemetry data object source 638 and represents their relationship. At the next level, the framework 639 decomposes each module into separate components.
Each of the modules 640 follows the same underlying structure, with one component dedicated 641 to the configuration of data subscriptions and data sources, a second 642 component dedicated to encoding and exporting data, and a third 643 component instrumenting the generation of telemetry related to the 644 underlying resources. Throughout the framework, the same set of 645 abstract data acquiring mechanisms and data types (Section 3.3) are 646 applied. The two-level architecture with the uniform data 647 abstraction helps accurately pinpoint a protocol or technique to its 648 position in a network telemetry system or disaggregate a network 649 telemetry system into manageable parts. 651 3.1. Top Level Modules 653 Telemetry can be applied on the forwarding plane, the control plane, 654 and the management plane in a network, as well as other sources out 655 of the network, as shown in Figure 1. Therefore, we categorize the 656 network telemetry into four distinct modules with each having its own 657 interface to Network Operation Applications. 659 +------------------------------+ 660 | | 661 | Network Operation |<-------+ 662 | Applications | | 663 | | | 664 +------------------------------+ | 665 ^ ^ ^ | 666 | | | | 667 V V | V 668 +--------------+-----------|---+ +-----------+ 669 | | Control | | | | 670 | | Plane | | | External | 671 | <---> | | | Data and | 672 | | Telemetry | | | Event | 673 | Management | ^ V | | Telemetry | 674 | Plane +-------|-------+ | | 675 | Telemetry | V | +-----------+ 676 | | Forwarding | 677 | | Plane | 678 | <---> | 679 | | Telemetry | 680 | | | 681 +--------------+---------------+ 683 Figure 1: Modules in Layer Category of NTF 685 The rationale of this partition lies in the different telemetry data 686 objects which result in different data source and export locations. 
687 Such differences have profound implications for in-network data 688 programming and processing capability, data encoding and transport 689 protocol, and required data bandwidth and latency. Data can be sent 690 directly, or proxied via the control and management planes. Both 691 approaches have advantages and disadvantages. 693 Note that in some cases the network controller itself may be the 694 source of telemetry data that is unique to it or derived from the 695 telemetry data collected from the network elements. Some of the 696 principles and taxonomy specific to the control plane and management 697 plane telemetry could also be applied to the controller when it is 698 required to provide the telemetry data to Network Operation 699 Applications hosted outside. The scope of the document is focused on 700 network element telemetry and further details related to 701 controllers are thus out of scope. 703 We summarize the major differences of the four modules in the 704 following table. They are compared from six angles: 706 * Data Object 707 * Data Export Location 709 * Data Model 711 * Data Encoding 713 * Telemetry Application Protocol 715 * Data Transport Method 717 Data Object is the target and source of each module. Because the 718 data source varies, the location where data is most conveniently 719 exported also varies. For example, forwarding plane data mainly 720 originates as data exported from the forwarding Application-Specific 721 Integrated Circuits (ASICs), while control plane data mainly 722 originates from the protocol daemons running on the control CPU(s). 723 For convenience and efficiency, it is preferred to export the data 724 off the device from locations near the source. Because the locations 725 that can export data have different capabilities, different choices 726 of data model, encoding, and transport method are made to balance the 727 performance and cost.
For example, the forwarding chip has high 728 throughput but limited capacity for processing complex data and 729 maintaining state, while the main control CPU is capable of complex 730 data and state processing, but has limited bandwidth for high 731 throughput data. As a result, the suitable telemetry protocol for 732 each module can be different. Some representative techniques are 733 shown in the corresponding table blocks to highlight the technical 734 diversity of these modules. Note that the selected techniques just 735 reflect the de facto state of the art and are by no means exhaustive 736 (e.g., IPFIX can also be implemented over TCP and SCTP, but that is 737 not recommended for the forwarding plane). The key point is that one 738 cannot expect to use a universal protocol to cover all the network 739 telemetry requirements.
741 +-----------+-------------+-------------+--------------+----------+
742 | Module    |Management   |Control      |Forwarding    |External  |
743 |           |Plane        |Plane        |Plane         |Data      |
744 +-----------+-------------+-------------+--------------+----------+
745 |Object     |config. &    |control      |flow & packet |terminal, |
746 |           |operation    |protocol &   |QoS, traffic  |social &  |
747 |           |state        |signaling,   |stat., buffer |environ-  |
748 |           |             |RIB          |& queue stat.,|mental    |
749 |           |             |             |ACL, FIB      |          |
750 +-----------+-------------+-------------+--------------+----------+
751 |Export     |main control |main control |fwding chip   |various   |
752 |Location   |CPU          |CPU,         |or linecard   |          |
753 |           |             |linecard CPU |CPU; main     |          |
754 |           |             |or forwarding|control CPU   |          |
755 |           |             |chip         |unlikely      |          |
756 +-----------+-------------+-------------+--------------+----------+
757 |Data       |YANG, MIB,   |YANG,        |YANG,         |YANG,     |
758 |Model      |syslog       |custom       |custom        |custom    |
759 +-----------+-------------+-------------+--------------+----------+
760 |Data       |GPB, JSON,   |GPB, JSON,   |plain text    |GPB, JSON,|
761 |Encoding   |XML          |XML,         |              |XML, plain|
762 |           |             |plain text   |              |text      |
763 +-----------+-------------+-------------+--------------+----------+
764 |Application|gRPC,NETCONF,|gRPC,NETCONF,|IPFIX, traffic|gRPC      |
765 |Protocol   |RESTCONF     |IPFIX,traffic|mirroring,    |          |
766 |           |             |mirroring    |gRPC, NETFLOW |          |
767 +-----------+-------------+-------------+--------------+----------+
768 |Data       |HTTP(S), TCP |HTTP(S), TCP,|UDP           |HTTP(S),  |
769 |Transport  |             |UDP          |              |TCP, UDP  |
770 +-----------+-------------+-------------+--------------+----------+
772 Figure 2: Comparison of the Data Object Modules
774 Note that the interaction with the applications that consume network 775 telemetry data can be indirect. Some in-device data transfer is 776 possible. For example, in the management plane telemetry, the 777 management plane will need to acquire data from the data plane. Some 778 operational states can only be derived from data plane data sources 779 such as the interface status and statistics. As another example, 780 obtaining control plane telemetry data may require the ability to 781 access the Forwarding Information Base (FIB) of the data plane.
783 On the other hand, an application may involve more than one plane and 784 interact with multiple planes simultaneously. For example, an SLA 785 compliance application may require both the data plane telemetry and 786 the control plane telemetry. 788 The requirements and challenges for each module are summarized as 789 follows (note that the requirements may pertain across all telemetry 790 modules; however, we emphasize those that are most pronounced for a 791 particular plane). 793 3.1.1. Management Plane Telemetry 795 The management plane of network elements interacts with the Network 796 Management System (NMS), and provides information such as performance 797 data, network logging data, network warning and defect data, and 798 network statistics and state data. The management plane includes 799 many protocols, including some that are considered "legacy", such as 800 SNMP and syslog. Regardless of the protocol, management plane telemetry 801 must address the following requirements: 803 * Convenient Data Subscription: An application should have the 804 freedom to choose which data is exported (see Section 3.3) and the 805 means and frequency of how that data is exported (e.g., on-change 806 or periodic subscription). 808 * Structured Data: For automatic network operation, machines will 809 replace humans for network data comprehension. Data modeling 810 languages, such as YANG, can efficiently describe structured data 811 and normalize data encoding and transformation. 813 * High Speed Data Transport: In order to keep up with the velocity 814 of information, a data source needs to be able to send large 815 amounts of data at high frequency. Compact encoding formats or 816 data compression schemes are needed to reduce the quantity of data 817 and improve the data transport efficiency. The subscription mode, 818 by replacing the query mode, reduces the interactions between 819 clients and servers and helps to improve the data source's 820 efficiency.
822 * Network Congestion Avoidance: The application must protect the 823 network from congestion by congestion control mechanisms or at 824 least circuit breakers. [RFC8084] and [RFC8085] provide some 825 solutions in this space. 827 3.1.2. Control Plane Telemetry 829 The control plane telemetry refers to the health condition monitoring 830 of different network control protocols at all layers of the protocol 831 stack. Keeping track of the operational status of these protocols is 832 beneficial for detecting, localizing, and even predicting various 833 network issues, as well as network optimization, in real-time and 834 with fine granularity. Some particular challenges and issues faced 835 by the control plane telemetry are as follows: 837 * One challenging problem for the control plane telemetry is how to 838 correlate the End-to-End (E2E) Key Performance Indicators (KPI) to 839 a specific layer's KPIs. For example, IPTV users may describe 840 their User Experience (UE) by the video smoothness and definition. 841 Then in case of an unusually poor UE KPI or a service 842 disconnection, it is non-trivial to delimit and pinpoint the issue 843 in the responsible protocol layer (e.g., the Transport Layer or 844 the Network Layer), the responsible protocol (e.g., ISIS or BGP at 845 the Network Layer), and finally the responsible device(s) with 846 specific reasons. 848 * Conventional OAM-based approaches for control plane KPI 849 measurement include Ping (L3), Traceroute (L3), Y.1731 [y1731] 850 (L2), and so on. One common issue behind these methods is that 851 they only measure the KPIs instead of reflecting the actual 852 running status of these protocols, making them less effective or 853 efficient for control plane troubleshooting and network 854 optimization. 
856 * An example of the control plane telemetry is the BGP Monitoring 857 Protocol (BMP), which is currently used for monitoring BGP routes 858 and enables rich applications, such as BGP peer analysis, AS 859 analysis, prefix analysis, and security analysis. However, the 860 monitoring of other layers and protocols and the cross-layer, cross- 861 protocol KPI correlations are still in their infancy (e.g., IGP 862 monitoring is not as extensive as BMP) and require further 863 research. 865 * The requirements and solutions for network congestion avoidance are 866 also applicable to the control plane telemetry. 868 3.1.3. Forwarding Plane Telemetry 870 An effective forwarding plane telemetry system relies on the data 871 that the network device can expose. The quality, quantity, and 872 timeliness of data must meet some stringent requirements. This 873 poses some challenges for the network data plane devices where the 874 first-hand data originates. 876 * A data plane device's main function is user traffic processing and 877 forwarding. While supporting network visibility is important, the 878 telemetry is just an auxiliary function, and it should strive 879 not to impede normal traffic processing and forwarding (i.e., the 880 forwarding behavior should not be altered and the trade-off 881 between forwarding performance and telemetry should be well- 882 balanced). 884 * Network operation applications require end-to-end visibility 885 across various sources, which can result in a huge volume of data. 886 However, the sheer quantity of data must not exhaust the network 887 bandwidth, regardless of the data delivery approach (i.e., whether 888 through in-band or out-of-band channels). 890 * The data plane devices must provide timely data with the minimum 891 possible delay. Long processing, transport, storage, and analysis 892 delay can impact the effectiveness of the control loop and even 893 render the data useless.
895 * The data should be structured and labeled, and easy for 896 applications to parse and consume. At the same time, the data 897 types needed by applications can vary significantly. The data 898 plane devices need to provide enough flexibility and 899 programmability to support the precise data provision for 900 applications. 902 * The data plane telemetry should support incremental deployment and 903 work even if some devices are unaware of the system. 905 * The requirements and solutions for network congestion avoidance are 906 also applicable to the forwarding plane telemetry. 908 Although not specific to the forwarding plane, these challenges are 909 more difficult for the forwarding plane because of its limited 910 resources and flexibility. Data plane programmability is essential to 911 support network telemetry. Newer data plane forwarding chips are 912 equipped with advanced telemetry features and provide flexibility to 913 support customized telemetry functions. 915 Technique Taxonomy: Depending on how the telemetry is 916 instrumented, forwarding plane telemetry techniques can be classified 917 along multiple dimensions. 919 * Active, Passive, and Hybrid: This dimension concerns 920 end-to-end measurement. Active and passive methods (as well as 921 the hybrid types) are well documented in [RFC7799]. Passive 922 methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic 923 mirroring. These methods usually have low data coverage. Improving 924 the data coverage comes at a very high bandwidth cost. 925 On the other hand, active methods include Ping, OWAMP [RFC4656], 926 TWAMP [RFC5357], STAMP [RFC8762], and Cisco's SLA Protocol 927 [RFC6812]. These methods are intrusive and only provide indirect 928 network measurements.
Hybrid methods, including in-situ OAM 929 [I-D.ietf-ippm-ioam-data], Alternate-Marking (AM) [RFC8321], and 930 Multipoint Alternate Marking [RFC8889], provide a well-balanced 931 and more flexible approach. However, these methods are also more 932 complex to implement. 934 * In-Band and Out-of-Band: Telemetry data carried in user packets 935 before being exported to a data collector is considered in-band 936 (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]). Telemetry data 937 that is directly exported to a data collector without modifying 938 user packets is considered out-of-band (e.g., the postcard-based 939 approach described in Appendix A.3.5). It is also possible to 940 have hybrid methods, where only the telemetry instruction or 941 partial data is carried by user packets (e.g., AM [RFC8321]). 943 * End-to-End and In-Network: End-to-End methods start from, and end 944 at, the network end hosts (e.g., Ping). In-Network methods work 945 in networks and are transparent to end hosts. However, if needed, 946 In-Network methods can be easily extended into end hosts. 948 * Data Subject: Depending on the telemetry objective, the methods 949 can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]), 950 path-based (e.g., Traceroute), and node-based (e.g., IPFIX 951 [RFC7011]). The various data objects can be packet, flow record, 952 measurement, states, and signal. 954 3.1.4. External Data Telemetry 956 Events that occur outside the boundaries of the network system are 957 another important source of network telemetry. Correlating both 958 internal telemetry data and external events with the requirements of 959 network systems, as presented in 960 [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and 961 functional advantage to management operations. 
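As an illustration only, the correlation of external events with internal telemetry data mentioned above can be sketched as a simple time-window match. This is a minimal sketch under stated assumptions, not a mechanism defined by this document: the Record type, the correlate function, the window_s parameter, and the sample records are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical common shape for internal samples and external events."""
    timestamp: float   # seconds on a shared clock (an assumption)
    source: str
    description: str


def correlate(internal: List[Record], external: List[Record],
              window_s: float) -> List[Tuple[Record, Record]]:
    """Pair each external event with the internal records whose
    timestamps fall within window_s seconds of that event."""
    pairs = []
    for event in external:
        for sample in internal:
            if abs(sample.timestamp - event.timestamp) <= window_s:
                pairs.append((event, sample))
    return pairs


# Hypothetical data: one external event and two internal telemetry records.
external = [Record(100.0, "seismometer", "tremor detected")]
internal = [
    Record(40.0, "router-a", "link utilization normal"),
    Record(103.0, "router-b", "interface down"),
]
matches = correlate(internal, external, window_s=5.0)
```

Only the router-b record falls inside the five-second window here. A real deployment would also need the common, extensible schema and the timeliness guarantees that this section lists as challenges.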
963 As with other sources of telemetry information, the data and events 964 must meet strict requirements, especially in terms of timeliness, 965 which is essential to properly incorporate external event information 966 into network management applications. The specific challenges are 967 described as follows: 969 * The role of the external event detector can be played by multiple 970 elements, including hardware (e.g., physical sensors, such as 971 seismometers) and software (e.g., Big Data sources that can 972 analyze streams of information, such as Twitter messages). Thus, 973 the transmitted data must support different shapes but, at the 974 same time, follow a common but extensible schema. 976 * Since the main function of the external event detectors is to 977 perform the notifications, their timeliness is assumed. However, 978 once messages have been dispatched, they must be quickly collected 979 and inserted into the control plane with variable priority, which 980 is higher for important sources and events and lower for secondary 981 ones. 983 * The schema used by external detectors must be easily adopted by 984 current and future devices and applications. Therefore, it must 985 be easily mapped to current data models, such as in terms of YANG. 987 * As the communication with external entities outside the boundary 988 of a provider network may be realized over the Internet, the risk 989 of congestion is even more relevant in this context and proper 990 counter-measures must be taken. Solutions such as network 991 transport circuit breakers are needed as well. 993 Organizing both internal and external telemetry information together 994 will be key for the general exploitation of the management 995 possibilities of current and future network systems, as reflected in 996 the incorporation of cognitive capabilities to new hardware and 997 software (virtual) elements. 999 3.2. 
Second Level Function Components 1001 The telemetry module at each plane can be further partitioned into 1002 five distinct conceptual components: 1004 * Data Query, Analysis, and Storage: This component works at the 1005 application layer. It is normally a part of the network 1006 management system at the receiver side. On the one hand, it is 1007 responsible for issuing data requirements. The data of interest 1008 can be modeled data through configuration or custom data through 1009 programming. The data requirements can be queries for one-shot 1010 data or subscriptions for events or streaming data. On the other 1011 hand, it receives, stores, and processes the returned data from 1012 network devices. Data analysis can be interactive to initiate 1013 further data queries. This component can reside in either network 1014 devices or remote controllers. It can be centralized or 1015 distributed, and involve one or more instances. 1017 * Data Configuration and Subscription: This component manages data 1018 queries on devices. It determines the protocol and channel for 1019 applications to acquire desired data. This component is also 1020 responsible for configuring the desired data that might not be 1021 directly available from data sources. The subscription data can 1022 be described by models, templates, or programs. 1024 * Data Encoding and Export: This component determines how telemetry 1025 data is delivered to the data analysis and storage component with 1026 access control. The data encoding and the transport protocol may 1027 vary depending on the data export location. 1029 * Data Generation and Processing: The requested data needs to be 1030 captured, filtered, processed, and formatted in network devices 1031 from raw data sources. This may involve in-network computing and 1032 processing on either the fast path or the slow path in network 1033 devices.
1035 * Data Object and Source: This component determines the monitoring 1036 objects and original data sources provisioned in the device. A 1037 data source usually just provides raw data which needs further 1038 processing. Each data source can be considered a probe. Some 1039 data sources can be dynamically installed, while others will be 1040 more static. 1042 +----------------------------------------+ 1043 +----------------------------------------+ | 1044 | | | 1045 | Data Query, Analysis, & Storage | | 1046 | | + 1047 +-------+++ -----------------------------+ 1048 ||| ^^^ 1049 ||| ||| 1050 ||V ||| 1051 +--+V--------------------+++------------+ 1052 +-----V---------------------+------------+ | 1053 +---------------------+-------+----------+ | | 1054 | Data Configuration | | | | 1055 | & Subscription | Data Encoding | | | 1056 | (model, template, | & Export | | | 1057 | & program) | | | | 1058 +---------------------+------------------| | | 1059 | | | | 1060 | Data Generation | | | 1061 | & Processing | | | 1062 | | | | 1063 +----------------------------------------| | | 1064 | | | | 1065 | Data Object and Source | |-+ 1066 | |-+ 1067 +----------------------------------------+ 1069 Figure 3: Components in the Network Telemetry Framework 1071 3.3. Data Acquisition Mechanism and Type Abstraction 1073 Broadly speaking, network data can be acquired through subscription 1074 (push) and query (poll). A subscription is a contract between 1075 publisher and subscriber. After initial setup, the subscribed data 1076 is automatically delivered to registered subscribers until the 1077 subscription expires. There are two variations of subscription. The 1078 subscriptions can be either pre-defined, or the subscribers are 1079 allowed to configure and tailor the published data to their specific 1080 needs. 1082 In contrast, queries are used when a client expects immediate and 1083 one-off feedback from network devices. 
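The two acquisition mechanisms introduced so far can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not an API of any protocol named in this document: the TelemetrySource class, its method names, and the path "if0/in-packets" are hypothetical, and the cancel function stands in for subscription expiry.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, int], None]


class TelemetrySource:
    """Hypothetical data source that supports both acquisition modes."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}
        self._subscribers: Dict[str, List[Callback]] = {}

    def query(self, path: str) -> int:
        """Query (poll): one-off, immediate read of the current value."""
        return self._values[path]

    def subscribe(self, path: str, callback: Callback) -> Callable[[], None]:
        """Subscription (push): register once, then receive every update
        until the returned cancel function is called."""
        self._subscribers.setdefault(path, []).append(callback)

        def cancel() -> None:
            self._subscribers[path].remove(callback)

        return cancel

    def update(self, path: str, value: int) -> None:
        """Record a new value and push it to the registered subscribers."""
        self._values[path] = value
        for callback in list(self._subscribers.get(path, [])):
            callback(path, value)


source = TelemetrySource()
received = []
cancel = source.subscribe("if0/in-packets",
                          lambda path, value: received.append((path, value)))
source.update("if0/in-packets", 10)   # pushed to the subscriber
source.update("if0/in-packets", 25)   # pushed to the subscriber
cancel()                              # the subscription ends
source.update("if0/in-packets", 40)   # no longer pushed
polled = source.query("if0/in-packets")
```

The subscriber sees the first two updates without asking for them, while the final value is only visible to a client that issues a query.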
The queried data may be 1084 directly extracted from some specific data source, or synthesized and 1085 processed from raw data. Queries work well for interactive network 1086 telemetry applications. 1088 In general, data can be pulled (i.e., queried) whenever needed, but 1089 in many cases, pushing the data (i.e., subscription) is more 1090 efficient, and can reduce the latency of a client detecting a change. 1091 From the data consumer point of view, there are four types of data 1092 from network devices that a telemetry data consumer can subscribe to or 1093 query: 1095 * Simple Data: The data that are steadily available from some 1096 datastore or static probes in network devices. 1098 * Derived Data: The data that need to be synthesized or processed in 1099 network from raw data from one or more network devices. The data 1100 processing function can be statically or dynamically loaded into 1101 network devices. 1103 * Event-triggered Data: The data that are conditionally acquired based on 1104 the occurrence of some events. An example of event-triggered data 1105 could be an interface changing operational state between up and 1106 down. Such data can be actively pushed through subscription or 1107 passively polled through query. There are many ways to model 1108 events, including using Finite State Machines (FSMs) or Event 1109 Condition Action (ECA) [I-D.wwx-netmod-event-yang]. 1111 * Streaming Data: The data that are continuously generated. They can be 1112 time series or the dump of databases. For example, an interface 1113 packet counter is exported every second. The streaming data 1114 reflect real-time network states and metrics and require large 1115 bandwidth and processing power. The streaming data are always 1116 actively pushed to the subscribers. 1118 The above telemetry data types are not mutually exclusive. Rather, 1119 they are often composite.
Derived data is composed of simple data; 1120 Event-triggered data can be simple or derived; streaming data can be 1121 based on some recurring event. The relationships of these data types 1122 are illustrated in Figure 4. 1124 +----------------------+ +-----------------+ 1125 | Event-triggered Data |<----+ Streaming Data | 1126 +-------+---+----------+ +-----+---+-------+ 1127 | | | | 1128 | | | | 1129 | | +--------------+ | | 1130 | +-->| Derived Data |<--+ | 1131 | +------+------ + | 1132 | | | 1133 | V | 1134 | +--------------+ | 1135 +------>| Simple Data |<------+ 1136 +--------------+ 1138 Figure 4: Data Type Relationship 1140 Subscription usually deals with event-triggered data and streaming 1141 data, and query usually deals with simple data and derived data. But 1142 the other ways are also possible. Advanced network telemetry 1143 techniques are designed mainly for event-triggered or streaming data 1144 subscription, and derived data query. 1146 3.4. Mapping Existing Mechanisms into the Framework 1148 The following table shows how the existing mechanisms (mainly 1149 published in IETF and with the emphasis on the latest new 1150 technologies) are positioned in the framework. Given the vast body 1151 of existing work, we cannot provide an exhaustive list, so the 1152 mechanisms in the tables should be considered as just examples. 1153 Also, some comprehensive protocols and techniques may cover multiple 1154 aspects or modules of the framework, so a name in a block only 1155 emphasizes one particular characteristic of it. More details about 1156 some listed mechanisms can be found in Appendix A. 
1158 +-------------+-----------------+---------------+--------------+
1159 |             | Management      | Control       | Forwarding   |
1160 |             | Plane           | Plane         | Plane        |
1161 +-------------+-----------------+---------------+--------------+
1162 | data config.| gNMI, NETCONF,  | gNMI, NETCONF,| NETCONF,     |
1163 | & subscribe | RESTCONF, SNMP, | RESTCONF,     | RESTCONF,    |
1164 |             | YANG-Push       | YANG-Push     | YANG-Push    |
1165 +-------------+-----------------+---------------+--------------+
1166 | data gen. & | MIB,            | YANG          | IOAM, PSAMP, |
1167 | process     | YANG            |               | PBT, AM      |
1168 +-------------+-----------------+---------------+--------------+
1169 | data encode.| gRPC, HTTP, TCP | BMP, TCP      | IPFIX, UDP   |
1170 | & export    |                 |               |              |
1171 +-------------+-----------------+---------------+--------------+
1172 Figure 5: Existing Work Mapping
1174 Although the framework is generally suitable for any network 1175 environment, multi-domain telemetry has some unique challenges 1176 that deserve further architectural consideration, which is out of 1177 scope for this document. 1179 4. Evolution of Network Telemetry Applications 1181 Network telemetry is an evolving technical area. As networks 1182 move toward automated operation, network telemetry applications 1183 undergo several stages of evolution which add new layers of 1184 requirements to the underlying network telemetry techniques. Each 1185 stage is built upon the techniques adopted by the previous stages 1186 plus some new requirements. 1188 Stage 0 - Static Telemetry: The telemetry data source and type are 1189 determined at design time. The network operator can only 1190 configure how to use it with limited flexibility. 1192 Stage 1 - Dynamic Telemetry: The custom telemetry data can be 1193 dynamically programmed or configured at runtime without 1194 interrupting the network operation, allowing a trade-off among 1195 resources, performance, flexibility, and coverage.
1197 Stage 2 - Interactive Telemetry: The network operator can 1198 continuously customize and fine tune the telemetry data in real 1199 time to reflect the network operation's visibility requirements. 1200 Compared with Stage 1, the changes are frequent based on the real- 1201 time feedback. At this stage, some tasks can be automated, but 1202 human operators still need to sit in the middle to make decisions. 1204 Stage 3 - Closed-loop Telemetry: The telemetry is free from the 1205 interference of human operators, except for generating the 1206 reports. The intelligent network operation engine automatically 1207 issues the telemetry data requests, analyzes the data, and updates 1208 the network operations in closed control loops. 1210 Existing technologies are ready for stage 0 and stage 1. Individual 1211 stage 2 and stage 3 applications are also possible now. However, the 1212 future autonomic networks may need a comprehensive operation 1213 management system which works at stage 2 and stage 3 to cover all the 1214 network operation tasks. A well-defined network telemetry framework 1215 is the first step towards this direction. 1217 5. Security Considerations 1219 The complexity of network telemetry raises significant security 1220 implications. For example, telemetry data can be manipulated to 1221 exhaust various network resources at each plane as well as the data 1222 consumer; falsified or tampered data can mislead the decision-making 1223 and paralyze networks; wrong configuration and programming for 1224 telemetry is equally harmful. The telemetry data is highly 1225 sensitive, which exposes a lot of information about the network and 1226 its configuration. Some of that information can make designing 1227 attacks against the network much easier (e.g., exact details of what 1228 software and patches have been installed), and allows an attacker to 1229 determine whether a device may be subject to unprotected security 1230 vulnerabilities. 
1232 Given that this document has proposed a framework for network 1233 telemetry and the telemetry mechanisms discussed are more extensive 1234 (in both message frequency and traffic amount) than the conventional 1235 network OAM concepts, we must also recognize that various new security 1236 considerations may arise. A number of techniques already exist 1237 for securing the forwarding plane, the control plane, and the 1238 management plane in a network, but it is important to consider whether any 1239 new threat vectors are now being enabled via the use of network 1240 telemetry procedures and mechanisms. 1242 Security considerations for networks that use telemetry methods may 1243 include: 1245 * Telemetry framework trust and policy model; 1247 * Role management and access control for enabling and disabling 1248 telemetry capabilities; 1250 * Protocol transport used for telemetry data and its inherent security 1251 capabilities; 1253 * Telemetry data stores, storage encryption, and methods of access; 1255 * Tracking telemetry events and any abnormalities that might 1256 identify malicious attacks using telemetry interfaces; 1258 * Authentication and signing of telemetry data to make data more 1259 trustworthy; 1261 * Segregating the telemetry data traffic from the data traffic 1262 carried over the network (e.g., historically management access and 1263 management data may be carried via an independent management 1264 network). 1266 Some security considerations highlighted above may be minimized or 1267 negated with policy management of network telemetry. In a network 1268 telemetry deployment, it would be advantageous to separate telemetry 1269 capabilities into different classes of policies, i.e., Role-Based 1270 Access Control and Event-Condition-Action policies.
Also, potential 1271 conflicts between network telemetry mechanisms must be detected 1272 accurately and resolved quickly to avoid unnecessary network 1273 telemetry traffic propagation escalating into an unintended or 1274 intended denial-of-service attack. 1276 Further study of the security issues will be required, and it is 1277 expected that the security mechanisms and protocols will be developed and 1278 deployed along with a network telemetry system. 1280 6. IANA Considerations 1282 This document includes no request to IANA. 1284 7. Contributors 1286 The other contributors to this document are Tianran Zhou, Zhenbin Li, 1287 Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm. 1289 8. Acknowledgments 1291 We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe 1292 Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe 1293 Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, 1294 Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin 1295 Duke, and many others who have provided helpful comments and 1296 suggestions to improve this document. 1298 9. Informative References 1300 [gnmi] "gNMI - gRPC Network Management Interface", 1301 . 1304 [gpb] "Google Protocol Buffers", 1305 . 1307 [grpc] "gRPC, A high performance, open-source universal RPC 1308 framework", . 1310 [I-D.ietf-grow-bmp-local-rib] 1311 Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, 1312 "Support for Local RIB in BGP Monitoring Protocol (BMP)", 1313 Work in Progress, Internet-Draft, draft-ietf-grow-bmp- 1314 local-rib-13, 31 August 2021, 1315 . 1318 [I-D.ietf-ippm-ioam-data] 1319 Brockners, F., Bhandari, S., and T. Mizrahi, "Data Fields 1320 for In-situ OAM", Work in Progress, Internet-Draft, draft- 1321 ietf-ippm-ioam-data-16, 8 November 2021, 1322 . 1325 [I-D.ietf-ippm-ioam-direct-export] 1326 Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., 1327 Bhandari, S., Sivakolundu, R., and T.
Mizrahi, "In-situ 1328 OAM Direct Exporting", Work in Progress, Internet-Draft, 1329 draft-ietf-ippm-ioam-direct-export-07, 13 October 2021, 1330 . 1333 [I-D.ietf-netconf-distributed-notif] 1334 Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, 1335 "Subscription to Distributed Notifications", Work in 1336 Progress, Internet-Draft, draft-ietf-netconf-distributed- 1337 notif-02, 6 May 2021, . 1340 [I-D.ietf-netconf-udp-notif] 1341 Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., 1342 and P. Lucente, "UDP-based Transport for Configured 1343 Subscriptions", Work in Progress, Internet-Draft, draft- 1344 ietf-netconf-udp-notif-04, 21 October 2021, 1345 . 1348 [I-D.irtf-nmrg-ibn-concepts-definitions] 1349 Clemm, A., Ciavaglia, L., Granville, L. Z., and J. 1350 Tantsura, "Intent-Based Networking - Concepts and 1351 Definitions", Work in Progress, Internet-Draft, draft- 1352 irtf-nmrg-ibn-concepts-definitions-05, 2 September 2021, 1353 . 1356 [I-D.pedro-nmrg-anticipated-adaptation] 1357 Martinez-Julia, P., "Exploiting External Event Detectors 1358 to Anticipate Resource Requirements for the Elastic 1359 Adaptation of SDN/NFV Systems", Work in Progress, 1360 Internet-Draft, draft-pedro-nmrg-anticipated-adaptation- 1361 02, 29 June 2018, . 1364 [I-D.song-ippm-postcard-based-telemetry] 1365 Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, 1366 T., Li, Z., Shin, J., and K. Lee, "In-Situ OAM Marking- 1367 based Direct Export", Work in Progress, Internet-Draft, 1368 draft-song-ippm-postcard-based-telemetry-11, 15 November 1369 2021, . 1372 [I-D.song-opsawg-dnp4iq] 1373 Song, H. and J. Gong, "Requirements for Interactive Query 1374 with Dynamic Network Probes", Work in Progress, Internet- 1375 Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017, 1376 . 1379 [I-D.song-opsawg-ifit-framework] 1380 Song, H., Qin, F., Chen, H., Jin, J., and J. 
Shin, "In- 1381 situ Flow Information Telemetry", Work in Progress, 1382 Internet-Draft, draft-song-opsawg-ifit-framework-16, 21 1383 October 2021, . 1386 [I-D.wwx-netmod-event-yang] 1387 Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, 1388 "A YANG Data model for ECA Policy Management", Work in 1389 Progress, Internet-Draft, draft-wwx-netmod-event-yang-10, 1390 1 November 2020, . 1393 [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, 1394 "Simple Network Management Protocol (SNMP)", RFC 1157, 1395 DOI 10.17487/RFC1157, May 1990, 1396 . 1398 [RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. 1399 Schoenwaelder, Ed., "Structure of Management Information 1400 Version 2 (SMIv2)", STD 58, RFC 2578, 1401 DOI 10.17487/RFC2578, April 1999, 1402 . 1404 [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, 1405 DOI 10.17487/RFC2981, October 2000, 1406 . 1408 [RFC3176] Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's 1409 sFlow: A Method for Monitoring Traffic in Switched and 1410 Routed Networks", RFC 3176, DOI 10.17487/RFC3176, 1411 September 2001, . 1413 [RFC3414] Blumenthal, U. and B. Wijnen, "User-based Security Model 1414 (USM) for version 3 of the Simple Network Management 1415 Protocol (SNMPv3)", STD 62, RFC 3414, 1416 DOI 10.17487/RFC3414, December 2002, 1417 . 1419 [RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations 1420 for the Simple Network Management Protocol (SNMP)", 1421 STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002, 1422 . 1424 [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management 1425 Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, 1426 September 2004, . 1428 [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export 1429 Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, 1430 . 1432 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1433 Zekauskas, "A One-way Active Measurement Protocol 1434 (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, 1435 . 
1437 [RFC5085] Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual 1438 Circuit Connectivity Verification (VCCV): A Control 1439 Channel for Pseudowires", RFC 5085, DOI 10.17487/RFC5085, 1440 December 2007, . 1442 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1443 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1444 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1445 . 1447 [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, 1448 DOI 10.17487/RFC5424, March 2009, 1449 . 1451 [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for 1452 the Network Configuration Protocol (NETCONF)", RFC 6020, 1453 DOI 10.17487/RFC6020, October 2010, 1454 . 1456 [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., 1457 and A. Bierman, Ed., "Network Configuration Protocol 1458 (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, 1459 . 1461 [RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, 1462 S., and E. Yedavalli, "Cisco Service-Level Assurance 1463 Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013, 1464 . 1466 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1467 "Specification of the IP Flow Information Export (IPFIX) 1468 Protocol for the Exchange of Flow Information", STD 77, 1469 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1470 . 1472 [RFC7258] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an 1473 Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 1474 2014, . 1476 [RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. 1477 Weingarten, "An Overview of Operations, Administration, 1478 and Maintenance (OAM) Tools", RFC 7276, 1479 DOI 10.17487/RFC7276, June 2014, 1480 . 1482 [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext 1483 Transfer Protocol Version 2 (HTTP/2)", RFC 7540, 1484 DOI 10.17487/RFC7540, May 2015, 1485 . 1487 [RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., 1488 Carpenter, B., Jiang, S., and L. 
Ciavaglia, "Autonomic 1489 Networking: Definitions and Design Goals", RFC 7575, 1490 DOI 10.17487/RFC7575, June 2015, 1491 . 1493 [RFC7799] Morton, A., "Active and Passive Metrics and Methods (with 1494 Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, 1495 May 2016, . 1497 [RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP 1498 Monitoring Protocol (BMP)", RFC 7854, 1499 DOI 10.17487/RFC7854, June 2016, 1500 . 1502 [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", 1503 RFC 7950, DOI 10.17487/RFC7950, August 2016, 1504 . 1506 [RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF 1507 Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017, 1508 . 1510 [RFC8084] Fairhurst, G., "Network Transport Circuit Breakers", 1511 BCP 208, RFC 8084, DOI 10.17487/RFC8084, March 2017, 1512 . 1514 [RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage 1515 Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085, 1516 March 2017, . 1518 [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data 1519 Interchange Format", STD 90, RFC 8259, 1520 DOI 10.17487/RFC8259, December 2017, 1521 . 1523 [RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, 1524 L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, 1525 "Alternate-Marking Method for Passive and Hybrid 1526 Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, 1527 January 2018, . 1529 [RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, 1530 E., and A. Tripathy, "Subscription to YANG Notifications", 1531 RFC 8639, DOI 10.17487/RFC8639, September 2019, 1532 . 1534 [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications 1535 for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, 1536 September 2019, . 1538 [RFC8671] Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S. 1539 Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring 1540 Protocol (BMP)", RFC 8671, DOI 10.17487/RFC8671, November 1541 2019, . 
1543 [RFC8762] Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple 1544 Two-Way Active Measurement Protocol", RFC 8762, 1545 DOI 10.17487/RFC8762, March 2020, 1546 . 1548 [RFC8889] Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto, 1549 "Multipoint Alternate-Marking Method for Passive and 1550 Hybrid Performance Monitoring", RFC 8889, 1551 DOI 10.17487/RFC8889, August 2020, 1552 . 1554 [RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, 1555 R., and A. Ghanwani, "Service Function Chaining (SFC) 1556 Operations, Administration, and Maintenance (OAM) 1557 Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020, 1558 . 1560 [xml] "Extensible Markup Language (XML) 1.0 (Fifth Edition)", 1561 . 1563 [y1731] "ITU-T Y.1731: OAM Functions and Mechanisms for Ethernet 1564 based networks, 2015", 1565 . 1567 Appendix A. A Survey on Existing Network Telemetry Techniques 1569 In this non-normative appendix, we provide an overview of some 1570 existing techniques and standard proposals for each network telemetry 1571 module. 1573 A.1. Management Plane Telemetry 1575 A.1.1. Push Extensions for NETCONF 1577 NETCONF [RFC6241] is a popular network management protocol 1578 recommended by the IETF. Its core strength is managing 1579 configuration, but it can also be used for data collection. YANG-Push 1580 [RFC8641] [RFC8639] extends NETCONF and enables subscriber 1581 applications to request a continuous, customized stream of updates 1582 from a YANG datastore. Providing such visibility into changes made 1583 to YANG configuration and operational objects enables new 1584 capabilities based on the remote mirroring of configuration and 1585 operational state. Moreover, a distributed data collection mechanism 1586 [I-D.ietf-netconf-distributed-notif] via a UDP-based publication 1587 channel [I-D.ietf-netconf-udp-notif] provides enhanced efficiency for 1588 NETCONF-based telemetry. 1590 A.1.2.
gRPC Network Management Interface 1592 gRPC Network Management Interface (gNMI) [gnmi] is a network 1593 management protocol based on the gRPC [grpc] Remote Procedure 1594 Call (RPC) framework. With a single gRPC service definition, both 1595 configuration and telemetry can be covered. gRPC is an HTTP/2 1596 [RFC7540]-based open-source micro-service communication framework. 1597 It provides a number of capabilities that are well suited to 1598 network telemetry, including: 1600 * A full-duplex streaming transport model combined with a binary 1601 encoding mechanism, which provides good telemetry efficiency. 1603 * Consistency of higher-level features across platforms, 1604 which common HTTP/2 libraries typically do not provide. This 1605 characteristic is especially valuable because telemetry 1606 data collectors normally reside on a large variety of platforms. 1608 * Built-in load-balancing and failover mechanisms. 1610 A.2. Control Plane Telemetry 1612 A.2.1. BGP Monitoring Protocol 1614 BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP 1615 sessions and is intended to provide a convenient interface for 1616 obtaining route views. 1618 BGP routing information is collected from the monitored device(s) 1619 and sent to the BMP monitoring station over a BMP TCP session. The 1620 BGP peers are monitored through the BMP Peer Up and Peer Down 1621 Notifications. The BGP routes (including Adj-RIB-In [RFC7854], 1622 Adj-RIB-Out [RFC8671], and Local RIB 1623 [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route 1624 Monitoring Message and the BMP Route Mirroring Message, providing 1625 both an initial table dump and real-time route updates. In addition, 1626 BGP statistics are reported through the BMP Stats Report Message, 1627 which can be either timer-triggered or event-driven. Future BMP 1628 extensions could further enrich BGP monitoring applications. 1630 A.3. Data Plane Telemetry 1632 A.3.1.
The Alternate Marking (AM) Technology 1634 The Alternate Marking method enables efficient measurements of packet 1635 loss, delay, and jitter in both IP and overlay networks, as presented 1636 in [RFC8321] and [RFC8889]. 1638 This technique can be applied to point-to-point and multipoint-to-multipoint 1639 flows. Alternate Marking creates batches of packets by 1640 alternating the value of 1 bit (or a label) of the packet header. 1641 These batches of packets are unambiguously recognized over the 1642 network, and comparing the packet counters for each batch allows 1643 the packet loss to be calculated. The same idea can be applied to delay 1644 measurement by selecting ad hoc packets with a marking bit dedicated 1645 to delay measurements. 1647 The Alternate Marking method needs two counters per marking period for 1648 each monitored flow. For instance, with n measurement 1649 points and m monitored flows, the order of magnitude of the packet 1650 counters for each time interval is n*m*2 (1 per color). 1652 Since networks offer rich sets of network performance measurement 1653 data (e.g., packet counters), conventional approaches run into 1654 limitations. The bottleneck is the generation and export of the data 1655 and the amount of data that can be reasonably collected from the 1656 network. In addition, management tasks related to determining and 1657 configuring which data to generate lead to significant deployment 1658 challenges. 1660 The Multipoint Alternate Marking approach, described in [RFC8889], 1661 aims to resolve this issue and make performance monitoring more 1662 flexible when a detailed analysis is not needed. 1664 An application orchestrates network performance measurement tasks 1665 across the network to allow for optimized monitoring. The 1666 application can choose how roughly or precisely to configure 1667 measurement points depending on the application's requirements.
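As a minimal sketch of the batch-based loss calculation described in this subsection, the following Python fragment compares the packet counters that two measurement points report for the same batch (color) and estimates the number of counters needed per time interval. The names BatchCounters, batch_loss, and counters_per_interval are hypothetical and are not defined by [RFC8321] or [RFC8889]. The sketch also assumes that both points observe the same flow and that the counters of a batch are read only after its marking period has ended.

```python
from dataclasses import dataclass


@dataclass
class BatchCounters:
    """Packet count for one marking period at one measurement point."""
    color: int    # value of the marking bit for this batch (0 or 1)
    packets: int  # number of packets counted with that color


def batch_loss(upstream: BatchCounters, downstream: BatchCounters) -> int:
    """Return the packets lost between two points for a single batch."""
    if upstream.color != downstream.color:
        # Counters of different colors belong to different batches
        # and cannot be compared.
        raise ValueError("counters belong to different batches")
    return upstream.packets - downstream.packets


def counters_per_interval(points: int, flows: int) -> int:
    """Order of magnitude of packet counters per interval: n * m * 2."""
    return points * flows * 2
```

For example, if the upstream point counts 1000 packets of color 0 and the downstream point counts 997, batch_loss reports 3 lost packets. With 4 measurement points and 10 monitored flows, counters_per_interval gives 4*10*2 = 80.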
1669 Using Alternate Marking, it is possible to monitor a multipoint 1670 network without in-depth examination by using network clustering 1671 (clusters are subnetworks, i.e., portions of the entire network, that preserve 1672 the same property as the entire network). If 1673 packet loss occurs or the delay is too high, 1674 specific filtering criteria can be applied to obtain a more 1675 detailed analysis by using a different combination of clusters, up to 1676 a per-flow measurement as described for Alternate Marking (AM) in 1677 [RFC8321]. 1679 In summary, an application can configure end-to-end network 1680 monitoring. If the network does not experience issues, this 1681 approximate monitoring is good enough and is very cheap in terms of 1682 network resources. However, in case of problems, the application 1683 becomes aware of the issues from this approximate monitoring and, in 1684 order to localize the portion of the network that has issues, 1685 configures the measurement points more extensively, allowing more 1686 detailed monitoring to be performed. After the detection and 1687 resolution of the problem, the initial approximate monitoring can be 1688 used again. 1690 A.3.2. Dynamic Network Probe 1692 Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] 1693 proposes a programmable means to customize the data that an 1694 application collects from the data plane. A direct benefit of DNP is 1695 the reduction of the exported data. A full DNP solution covers 1696 several components, including data source, data subscription, and data 1697 generation. The data subscription needs to define the derived data, 1698 which can be composed and derived from the raw data sources. The 1699 data generation takes advantage of moderate in-network computing 1700 to produce the desired data. 1702 While DNP can introduce unforeseeable flexibility to data plane 1703 telemetry, it also faces some challenges.
It requires a flexible 1704 data plane that can be dynamically reprogrammed at runtime. The 1705 programming API is yet to be defined. 1707 A.3.3. IP Flow Information Export (IPFIX) Protocol 1709 Traffic on a network can be seen as a set of flows passing through 1710 network elements. IP Flow Information Export (IPFIX) [RFC7011] 1711 provides a means of transmitting traffic flow information for 1712 administrative or other purposes. A typical IPFIX-enabled system 1713 includes a pool of Metering Processes that collect data packets at 1714 one or more Observation Points, optionally filter them, and 1715 aggregate information about these packets. An Exporter then gathers 1716 each of the Observation Points together into an Observation Domain 1717 and sends this information via the IPFIX protocol to a Collector. 1719 A.3.4. In-Situ OAM 1721 Classical passive and active monitoring and measurement techniques 1722 are either inaccurate or resource-consuming. It is preferable to 1723 directly acquire data associated with a flow's packets when the 1724 packets pass through a network. In-situ OAM (IOAM) 1725 [I-D.ietf-ippm-ioam-data], a data generation technique, embeds a new 1726 instruction header in user packets, and the instruction directs the 1727 network nodes to add the requested data to the packets. Thus, at the 1728 path end, the packet's experience gained on the entire forwarding 1729 path can be collected. Such firsthand data is invaluable to many 1730 network OAM applications. 1732 However, IOAM also faces some challenges. Issues of performance 1733 impact, security, scalability and overhead limits, encapsulation 1734 difficulties in some protocols, and cross-domain deployment need to 1735 be addressed. 1737 A.3.5.
Postcard-Based Telemetry 1739 Postcard-Based Telemetry (PBT), as embodied in IOAM DEX 1740 [I-D.ietf-ippm-ioam-direct-export] and IOAM Marking 1741 [I-D.song-ippm-postcard-based-telemetry], is a complementary 1742 technique to the passport-based IOAM. PBT directly exports data at 1743 each node through an independent packet. At the cost of higher 1744 bandwidth overhead and the need for data correlation, PBT shows 1745 several unique advantages. It can also help to identify the packet drop 1746 location in case a packet is dropped on its forwarding path. 1748 A.3.6. Existing OAM for Specific Data Planes 1750 Various data planes raise unique OAM requirements. The IETF has 1751 published OAM technique and framework documents (e.g., [RFC8924] and 1752 [RFC5085]) targeting different data planes such as Multi-Protocol 1753 Label Switching (MPLS), L2 Virtual Private Network (L2-VPN), Network 1754 Virtualization Overlays (NVO3), Virtual Extensible LAN (VXLAN), Bit 1755 Indexed Explicit Replication (BIER), Service Function Chaining (SFC), 1756 Segment Routing (SR), and Deterministic Networking (DETNET). The 1757 aforementioned data plane telemetry techniques can be used to enhance 1758 the OAM capability on such data planes. 1760 A.4. External Data and Event Telemetry 1762 A.4.1. Sources of External Events 1764 To ensure that the information provided by external event detectors 1765 and used by the network management solutions is meaningful for 1766 management purposes, the network telemetry framework must ensure that 1767 such detectors (sources) are easily connected to the management 1768 solutions (sinks). This requires specifying a list of 1769 potential external data sources that could be of interest in network 1770 management and matching them to the connectors and/or interfaces required 1771 to connect them. 1773 Categories of external event sources that may be of interest to 1774 network management include: 1776 * Smart objects and sensors.
With the consolidation of the Internet 1777 of Things (IoT), any network system will have many smart objects 1778 attached to its physical surroundings and logical operation 1779 environments. Most of these objects will be essentially based on 1780 sensors of many kinds (e.g., temperature, humidity, presence), and 1781 the information they provide can be very useful for the management 1782 of the network, even when they are not specifically deployed for 1783 such a purpose. Elements of this source type will usually provide a 1784 specific protocol for interaction, especially one of those 1785 protocols related to IoT, such as the Constrained Application 1786 Protocol (CoAP). 1788 * Online news reporters. Several online news services have the 1789 ability to provide an enormous quantity of information about 1790 different events occurring in the world. Some of those events can 1791 have an impact on the network system managed by a specific framework, and 1792 therefore, such information may be of interest to the management 1793 solution. For instance, diverse security reports, such as the 1794 Common Vulnerabilities and Exposures (CVE), can be issued by the 1795 corresponding authority and used by the management solution to 1796 update the managed system if needed. Instead of a specific 1797 protocol and data format, the sources of this kind of information 1798 usually follow a relaxed but structured format. This format will 1799 be part of both the ontology and information model of the 1800 telemetry framework. 1802 * Global event analyzers. The advance of Big Data analyzers 1803 provides a huge amount of information and, more interestingly, the 1804 identification of events detected by analyzing many data streams 1805 from different origins. In contrast with the other types of 1806 sources, which are focused on specific events, the detectors of 1807 this source type will detect generic events.
For example, during 1808 a sports event, an unexpected development makes it fascinating, and 1809 many people connect to sites that are reporting on the event. The 1810 underlying networks supporting the services that cover the event 1811 can be affected by such a situation, so their management solutions 1812 should be aware of it. In contrast with the other source types, a 1813 new information model, format, and reporting protocol is required 1814 to integrate the detectors of this type with the management 1815 solution. 1817 Additional detector types can be added to the system, but 1818 they will generally be the result of composing the properties offered 1819 by these main classes. 1821 A.4.2. Connectors and Interfaces 1823 To allow external event detectors to be properly integrated with 1824 other management solutions, both elements must expose interfaces and 1825 protocols that are subject to their particular objective. Since 1826 external event detectors will be focused on providing their 1827 information to their main consumers, which generally will not be 1828 limited to the network management solutions, the framework must 1829 include the definition of the required connectors to ensure that the 1830 interconnection between detectors (sources) and their consumers 1831 within the management systems (sinks) is effective. 1833 In some situations, the interconnection between the external event 1834 detectors and the management system is via the management plane. For 1835 those situations, there will be a special connector that provides the 1836 typical interfaces found in most other elements connected to the 1837 management plane. For instance, the interfaces could accomplish this 1838 with a specific data model (YANG) and a specific telemetry protocol, 1839 such as NETCONF, YANG-Push, or gRPC. 1841 Authors' Addresses 1843 Haoyu Song 1844 Futurewei 1845 United States of America 1847 Email: haoyu.song@futurewei.com 1849 Fengwei Qin 1850 China Mobile 1851 P.R.
China 1853 Email: qinfengwei@chinamobile.com 1855 Pedro Martinez-Julia 1856 NICT 1857 Japan 1859 Email: pedro@nict.go.jp 1861 Laurent Ciavaglia 1862 Rakuten Mobile 1863 France 1865 Email: laurent.ciavaglia@rakuten.com 1866 Aijun Wang 1867 China Telecom 1868 P.R. China 1870 Email: wangaj.bri@chinatelecom.cn