OPSAWG                                                           H. Song
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: April 12, 2021                                     China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                         October 9, 2020


                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-05

Abstract

   Network telemetry is the technology for gaining network insight and
   facilitating efficient and automated network management. It engages
   various techniques for remote data collection, correlation, and
   consumption. This document provides an architectural framework for
   network telemetry, motivated by the network operation challenges and
   requirements. As evidenced by some key characteristics and industry
   practices, network telemetry covers technologies and protocols beyond
   the conventional network Operations, Administration, and Management
   (OAM). It promises better flexibility, scalability, accuracy,
   coverage, and performance and allows automated control loops to suit
   both today's and tomorrow's network operation. This document
   clarifies the terminologies and classifies the modules and components
   of a network telemetry system from several different perspectives.
   The framework and taxonomy help to set a common ground for the
   collection of related work and provide guidance for related technique
   and standard developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 12, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Background
     2.1.  Telemetry Data Coverage
     2.2.  Use Cases
     2.3.  Challenges
     2.4.  Glossary
     2.5.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Top Level Modules
       4.1.1.  Management Plane Telemetry
       4.1.2.  Control Plane Telemetry
       4.1.3.  Data Plane Telemetry
       4.1.4.  External Data Telemetry
     4.2.  Second Level Function Components
     4.3.  Data Acquiring Mechanism and Type Abstraction
     4.4.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Push Extensions for NETCONF
       A.1.2.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  The Alternate Marking technology
       A.3.2.  Dynamic Network Probe
       A.3.3.  IP Flow Information Export (IPFIX) protocol
       A.3.4.  In-Situ OAM
       A.3.5.  Postcard Based Telemetry
     A.4.  External Data and Event Telemetry
       A.4.1.  Sources of External Events
       A.4.2.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is the ability of management tools to see the
   state and behavior of a network. It is essential for successful
   network operation.
   Network telemetry is the process of measuring,
   correlating, recording, and distributing information about the
   behavior of a network. Network telemetry is considered an
   ideal means to gain sufficient network visibility with better
   flexibility, scalability, accuracy, coverage, and performance than
   some conventional network Operations, Administration, and Management
   (OAM) techniques.

   However, the term "network telemetry" lacks a solid and unambiguous
   definition. Its scope and coverage cause confusion and
   misunderstandings. It is beneficial to clarify the concept and
   provide a clear architectural framework for network telemetry, so we
   can articulate the technical field and better align the related
   techniques and standard works.

   To fulfill such an undertaking, we first discuss some key
   characteristics of network telemetry which set a clear distinction
   from the conventional network OAM and show that some conventional OAM
   technologies can be considered a subset of the network telemetry
   technologies. We then provide an architectural framework for network
   telemetry by partitioning a network telemetry system into four
   modules, each with the same building components and data
   abstractions. We show how the network telemetry framework can benefit
   current and future network operations. Based on the distinction of
   modules and function components, we can map the existing and emerging
   techniques and protocols into the framework. The framework can also
   simplify the tasks of designing, maintaining, and understanding a
   network telemetry system. Finally, we outline the evolution stages
   of the network telemetry system and discuss the potential security
   concerns.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments.
   To the best of our knowledge,
   this document is the first such effort for network telemetry in
   industry standards organizations.

2.  Background

   The term "big data" is used to describe the extremely large volume of
   data sets that can be analyzed computationally to reveal patterns,
   trends, and associations. A network is undoubtedly a source of big
   data because of its scale and all the traffic that goes through it.
   It is easy to see that network OAM can benefit from network big data.

   Today one can access advanced big data analytics capability through a
   plethora of commercial and open source platforms (e.g., Apache
   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
   learning). Thanks to the advance of computing and storage
   technologies, network big data analytics gives network operators an
   opportunity to gain network insights and move towards network
   autonomy. Some operators have started to explore the application of
   Artificial Intelligence (AI) to make sense of network data. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events. In turn, the network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.

   It is conceivable that an autonomic network [RFC7575] is the logical
   next step for network evolution following Software Defined Network
   (SDN), aiming to reduce (or even eliminate) human labor, make more
   efficient use of network resources, and provide better services more
   aligned with customer requirements. Intent-based Networking (IBN)
   [I-D.irtf-nmrg-ibn-concepts-definitions] provides the necessary
   mechanisms. Although it takes time to reach the ultimate goal, the
   journey has started nevertheless.

   However, while the data processing capability is improved and
   applications are hungry for more data, the networks lag behind in
   extracting and translating network data into useful and actionable
   information in efficient ways. The system bottleneck is shifting
   from data consumption to data supply. Both the number of network
   nodes and the traffic bandwidth keep increasing at a fast pace. The
   network configuration and policy change at shorter intervals than
   before. More subtle events and fine-grained data through all network
   planes need to be captured and exported in real time. In a nutshell,
   it is a challenge to get enough high-quality data out of the network
   in an efficient, timely, and flexible way. Therefore, we need to
   examine the existing network technologies and protocols, and identify
   any potential technique and standard gaps based on the real network
   and device architectures.

   In the remainder of this section, we first clarify the scope of
   network data (i.e., telemetry data) concerned in the context. Then,
   we discuss several key use cases for today's and future network
   operations. Next, we show why the current network OAM techniques and
   protocols are insufficient for these use cases. The discussion
   underlines the need for new methods, techniques, and protocols, which
   we assign under an umbrella term: network telemetry.

2.1.  Telemetry Data Coverage

   Any information that can be extracted from networks (including data
   plane, control plane, and management plane) and used to gain
   visibility or as a basis for actions is considered telemetry data. It
   includes statistics, event records and logs, snapshots of state,
   configuration data, etc. It also covers the outputs of any active
   and passive measurements [RFC7799]. In particular, raw data can be
   processed in the network before being sent to a data consumer. Such
   processed data are also telemetry data in this context.
   A classification of the telemetry data forms is provided in
   Section 4.

2.2.  Use Cases

   These use cases are essential for network operations. While the list
   is by no means exhaustive, it is enough to highlight the requirements
   for data velocity, variety, volume, and veracity in networks.

   Security:  Network intrusion detection and prevention need to monitor
      network traffic and activities, and act upon anomalies. Given
      increasingly sophisticated attack vectors and the rising toll of
      security breaches, new tools and techniques need to be
      developed, relying on wider and deeper visibility in networks.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. Intent, as defined in
      [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational
      goals that a network should meet and outcomes that a network is
      supposed to deliver, defined in a declarative manner without
      specifying how to achieve or implement them. An intent requires a
      complex translation and mapping process before being applied on
      networks. While a policy or an intent is enforced, the compliance
      needs to be verified and monitored continuously, and any violation
      needs to be reported immediately.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, including the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement.
      Users need to check
      if they get the service as promised and network operators need to
      evaluate how they can deliver the services that can meet the SLA
      based on real-time network measurements.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events. Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues. However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming and interleaved. While
      machine learning technologies can be used for root cause analysis,
      it is up to the network to sense and provide the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better Return On Investment (ROI) or lower Capital
      Expenditures (CAPEX). The first step is to know the real-time
      network conditions before applying policies for traffic
      manipulation. In some cases, micro-bursts need to be detected in
      a very short time frame so that fine-grained traffic control can
      be applied to avoid network congestion. The long-term network
      capacity planning and topology augmentation rely on the
      accumulated data of network operations.

   Event Tracking and Prediction:  The visibility of traffic path and
      performance is critical for services and applications that rely on
      healthy network operation. Numerous related network events are of
      interest to network operators. For example, network operators
      want to learn where and why packets are dropped for an application
      flow.
      They also want to be warned of issues in advance so
      proactive actions can be taken to avoid catastrophic consequences.

2.3.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network. Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons, which explain why new standards and techniques keep
   emerging and the needs remain high:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time. The
      poll-based low-frequency data collection is ill-suited for these
      applications. Subscription-based streaming data directly pushed
      from the data source (e.g., the forwarding chip) is preferred to
      provide enough data quantity and precision at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. More open and programmable network devices are therefore
      needed.

   o  Many application scenarios need to correlate network-wide data
      from multiple sources (i.e., from distributed network devices,
      different components of a network device, or different network
      planes). A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources.
      The composition of a
      complete solution, as partly proposed by Autonomic Resource
      Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model. The unstructured data hinder tool
      automation and application extensibility. Standardized data
      models are essential to support programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). Network
      operators require data of arbitrary source, granularity, and
      precision, which is beyond the capability of the existing
      techniques.

   o  The conventional passive measurement techniques can either consume
      excessive network resources and render excessive redundant data,
      or lead to inaccurate results; on the other hand, the conventional
      active measurement techniques can interfere with the user traffic
      and their results are indirect. Techniques that can collect
      direct and on-demand data from user traffic are more favorable.

2.4.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We make an intentional differentiation between
   network telemetry and network OAM. However, it should be understood
   that there is not a hard-line distinction between the two concepts.
   Rather, some OAM techniques are in the scope of network telemetry.

   AI:  Artificial Intelligence. In the network domain, AI refers to the
      machine-learning based technologies for automated network
      operation and other tasks.

   AM:  Alternate Marking, a flow performance measurement method,
      specified in [RFC8321].

   BMP:  BGP Monitoring Protocol, specified in [RFC7854].

   DNP:  Dynamic Network Probe, referring to programmable in-network
      sensors for network monitoring and measurement.

   DPI:  Deep Packet Inspection, referring to the techniques that
      examine packets beyond the L3/L4 headers.

   gNMI:  gRPC Network Management Interface, a network management
      protocol from OpenConfig Operator Working Group, mainly
      contributed by Google. See [gnmi] for details.

   gRPC:  gRPC Remote Procedure Call, an open source high performance
      RPC framework that gNMI is based on. See [grpc] for details.

   IPFIX:  IP Flow Information Export Protocol, specified in [RFC7011].

   IOAM:  In-situ OAM, a dataplane on-path telemetry technique.

   NETCONF:  Network Configuration Protocol, specified in [RFC6241].

   NetFlow:  A Cisco protocol for flow record collecting, described in
      [RFC3954].

   Network Telemetry:  Acquiring and processing network data remotely
      for network monitoring and operation. A general term for a large
      set of network visibility techniques and protocols, with the
      characteristics defined in this document. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward future intent-driven autonomous networks.

   NMS:  Network Management System, referring to applications that allow
      network administrators to manage a network's software and hardware
      components. It usually records data from a network's remote
      points to carry out central reporting to a system administrator.

   OAM:  Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   PBT:  Postcard-Based Telemetry, a dataplane on-path telemetry
      technique.

   SMIv2:  Structure of Management Information Version 2, specified in
      [RFC2578].

   SNMP:  Simple Network Management Protocol. Versions 1 and 2 are
      specified in [RFC1157] and [RFC3416], respectively.

   YANG:  The abbreviation of "Yet Another Next Generation". YANG is a
      data modeling language for the definition of data sent over
      network management protocols such as NETCONF and RESTCONF.
      YANG is defined in [RFC6020].

   YANG ECA:  A YANG model for Event-Condition-Action policies, defined
      in [I-D.wwx-netmod-event-yang].

   YANG FSM:  A YANG model that describes events, operations, and finite
      state machine of YANG-defined network elements.

   YANG PUSH:  A method to subscribe to data pushed from remote YANG
      datastores on network devices. Details are specified in [RFC8641]
      and [RFC8639].

2.5.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself in some notable ways from the conventional
   network OAM. Several such techniques have been widely deployed. The
   representative techniques and protocols include IPFIX [RFC7011] and
   gRPC [grpc]. Network telemetry allows separate entities to acquire
   data from network devices so that data can be visualized and analyzed
   to support network monitoring and operation. Network telemetry
   overlaps with the conventional network OAM and has a wider scope than
   it. It is expected that network telemetry can provide the necessary
   network insight for autonomous networks and address the shortcomings
   of conventional OAM techniques.

   One difference between network telemetry and the conventional
   network OAM is that in general network telemetry assumes machines
   as data consumers rather than human operators.
   Hence, network
   telemetry can directly trigger the automated network operation, while
   the conventional OAM tools usually help human operators to monitor
   and diagnose the networks and guide manual network operations. The
   difference leads to very different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several characteristics of network
   telemetry have been well accepted. Note that network telemetry is
   intended to be an umbrella term covering a wide spectrum of
   techniques, so the following characteristics are not expected to be
   held by every specific technique.

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans. Therefore, the data
      volume is huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable. Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.
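
   As a rough illustration of the "Push and Streaming" characteristic
   listed above, the following Python sketch models a collector that
   subscribes once to a data path and then receives every update pushed
   by the data source, with no polling. The class names, method names,
   and path strings are hypothetical and do not correspond to any
   particular telemetry protocol; the sketch runs in a single process
   and ignores transport, encoding, and data modeling.

```python
from typing import Callable, Dict, List, Tuple

# Type of the callback a subscriber registers: (path, value) -> None.
Callback = Callable[[str, int], None]


class DataSource:
    """Hypothetical telemetry data source inside a network device."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = {}

    def subscribe(self, path: str, callback: Callback) -> None:
        # The collector registers its interest once, up front.
        self._subscribers.setdefault(path, []).append(callback)

    def update(self, path: str, value: int) -> None:
        # On every change, the source pushes the new value to all
        # subscribers of that path; nothing is sent for other paths.
        for callback in self._subscribers.get(path, []):
            callback(path, value)


class Collector:
    """Hypothetical collector that only stores what is pushed to it."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, int]] = []

    def on_data(self, path: str, value: int) -> None:
        self.received.append((path, value))


source = DataSource()
collector = Collector()
source.subscribe("interfaces/eth0/in-octets", collector.on_data)

source.update("interfaces/eth0/in-octets", 1500)
source.update("interfaces/eth1/in-octets", 900)   # no subscriber
source.update("interfaces/eth0/in-octets", 3000)
```

   In this sketch the data source decides when data is sent, which is
   the property that distinguishes a push model from a poll-based one.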

   In addition, an ideal network telemetry solution may also have the
   following features or properties:

   o  In-Network Customization: The data can be customized in the
      network at run-time to cater to the specific needs of
      applications. This needs the support of a programmable data plane
      which allows probes with custom functions to be deployed at
      flexible locations.

   o  In-Network Data Aggregation and Correlation: Network devices and
      aggregation points can work out which events and what data need
      to be stored, reported, or discarded, thus reducing the load on
      the central collection and processing points while still ensuring
      that the right information is ready to be processed in a timely
      way.

   o  In-Network Processing: Sometimes it is not necessary or feasible
      to gather all information to a central point to be processed and
      acted upon. It is possible for the data processing to be done in
      the network, allowing reactive actions to be taken locally.

   o  Direct Data Plane Export: The data originating from the data plane
      forwarding chips can be directly exported to the data consumer for
      efficiency, especially when the data bandwidth is large and
      real-time processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data
      to be collected directly for any target flow on its entire
      forwarding path [I-D.song-opsawg-ifit-framework].

   It is worth noting that a network telemetry system should not be
   intrusive to normal network operations; it should avoid the pitfall
   of the "observer effect". That is, it should not change the network
   behavior or affect the forwarding performance. Otherwise, the whole
   purpose of network telemetry is defeated.
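
   The in-network data aggregation feature described above can be
   sketched in a few lines of Python. The function below stands in for
   logic running on a device or aggregation point: it summarizes a
   window of raw samples and reports the summary only when the peak
   reaches a threshold, discarding the window otherwise. The function
   name, the threshold rule, and the summary fields are assumptions
   made for illustration and are not defined by this document.

```python
from typing import Dict, Iterable, Optional


def aggregate_and_filter(samples: Iterable[int],
                         threshold: int) -> Optional[Dict[str, float]]:
    """Summarize one window of samples on the device (hypothetical).

    Returns a compact summary if the peak sample reaches the
    threshold, or None when the window can be discarded locally.
    """
    values = list(samples)
    if not values:
        return None
    peak = max(values)
    if peak < threshold:
        # Nothing noteworthy: do not load the central collector.
        return None
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "peak": peak,
    }


# A quiet window produces no report; a window with a spike produces
# one summary record instead of three raw samples.
quiet = aggregate_and_filter([10, 20, 30], threshold=100)
spike = aggregate_and_filter([10, 200, 30], threshold=100)
```

   Reporting one summary record per interesting window is one simple
   way to reduce the load on central collection points; a real device
   would apply whatever policy its data plane and applications
   support.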

   Although in many cases a network telemetry system involves a remote
   data collecting, processing, and reacting entity, it is important to
   understand that network telemetry does not imply the necessity of
   such an entity. Telemetry data producers and consumers can work in
   distributed or peer-to-peer fashions instead. In such cases, a
   network node can be the direct consumer of telemetry data from other
   nodes.

3.  The Necessity of a Network Telemetry Framework

   Network data analytics and machine-learning technologies are applied
   for network operation automation, relying on abundant and coherent
   data from networks. The single-sourced and static data acquisition
   cannot meet the data requirements. The scattered standards and
   diverse techniques are hard to integrate. It is desirable to
   have a framework that classifies and organizes different telemetry
   data sources and types, defines different components of a network
   telemetry system and their interactions, and helps coordinate and
   integrate multiple telemetry approaches from different layers. This
   allows flexible combinations for different applications, while
   normalizing and simplifying interfaces. In detail, such a framework
   would benefit application development for the following reasons:

   o  The future autonomous networks will require a holistic view on
      network visibility. All the use cases and applications need to be
      supported uniformly and coherently under a single intelligent
      agent. Therefore, the protocols and mechanisms should be
      consolidated into a minimum yet comprehensive set. A telemetry
      framework can help to normalize the technique developments.

   o  Network visibility presents multiple viewpoints.
      For example, the
      device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired. An application may need to switch its
      viewpoint during operation. It may also need to correlate a
      service and its impact on network experience to acquire the
      comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use network resources and reduce the performance
      impact. Routine network monitoring covers the entire network with
      a low data sampling rate. When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   A telemetry framework collects together all of the telemetry-related
   works from different sources and working groups within the IETF. This
   makes it possible to assemble a comprehensive network telemetry
   system and to avoid repetitious or redundant work. The framework
   should cover the concepts and components from the standardization
   perspective. This document clarifies the layered modules on which
   the telemetry is exerted and decomposes the telemetry system into a
   set of distinct components that the existing and future work can
   easily map to.

4.  Network Telemetry Framework

   The top level network telemetry framework partitions the network
   telemetry into four modules based on the telemetry data object source
   and represents their relationship. The next level framework reveals
   that each module replicates the same architecture comprising the same
   set of components.
Throughout the framework, the same set of 577 abstract data acquiring mechanisms and data types are applied. The 578 two-level architecture with the uniform data abstraction helps 579 accurately pinpoint a protocol or technique to its position in a 580 network telemetry system or disaggregate a network telemetry system 581 into manageable parts. 583 4.1. Top Level Modules 585 Telemetry can be applied on the forwarding plane, the control plane, 586 and the management plane in a network, as well as other sources out 587 of the network, as shown in Figure 1. Therefore, we categorize the 588 network telemetry into four distinct modules with each having its own 589 interface to Network Operation Applications. 591 +------------------------------+ 592 | | 593 | Network Operation |<-------+ 594 | Applications | | 595 | | | 596 +------------------------------+ | 597 ^ ^ ^ | 598 | | | | 599 V | V V 600 +-----------|---+--------------+ +-----------+ 601 | | | | | | 602 | Control Pl|ane| | | External | 603 | Telemetry | <---> | | Data and | 604 | | | | | Event | 605 | ^ V | Management | | Telemetry | 606 +------|--------+ Plane | | | 607 | V | Telemetry | +-----------+ 608 | Forwarding | | 609 | Plane <---> | 610 | Telemetry | | 611 | | | 612 +---------------+--------------+ 614 Figure 1: Modules in Layer Category of NTF 616 The rationale of this partition lies in the different telemetry data 617 objects which result in different data source and export locations. 618 Such differences have profound implications on in-network data 619 programming and processing capability, data encoding and transport 620 protocol, and data bandwidth and latency. 622 We summarize the major differences of the four modules in the 623 following table. They are compared from six aspects: data object, 624 data export location, data model, data encoding, telemetry protocol, 625 and transport method. Data object is the target and source of each 626 module. 
   Because the data source varies, the data export location
   varies.  For example, the forwarding plane data are mainly from the
   fast path (e.g., forwarding chips) while the control plane data are
   mainly from the slow path (e.g., main control CPU).  For convenience
   and efficiency, it is preferred to export the data from locations
   near the source.  Because each data export location has different
   capability, the proper data model, encoding, and transport method
   cannot be kept the same.  For example, the forwarding chip has high
   throughput but limited capacity for processing complex data and
   maintaining states, while the main control CPU is capable of complex
   data and state processing, but has limited bandwidth for high
   throughput data.  As a result, the suitable telemetry protocol for
   each module can be different.  Some representative techniques are
   shown in the corresponding table blocks to highlight the technical
   diversity of these modules.  The key point is that one cannot expect
   to use a universal protocol to cover all the network telemetry
   requirements.

   +---------+--------------+--------------+--------------+-----------+
   | Module  | Control      | Management   | Forwarding   | External  |
   |         | Plane        | Plane        | Plane        | Data      |
   +---------+--------------+--------------+--------------+-----------+
   |Object   | control      | config. &    | flow & packet| terminal, |
   |         | protocol &   | operation    | QoS, traffic | social &  |
   |         | signaling,   | state, MIB   | stat., buffer| environ-  |
   |         | RIB, ACL     |              | & queue stat.| mental    |
   +---------+--------------+--------------+--------------+-----------+
   |Export   | main control | main control | fwding chip  | various   |
   |Location | CPU,         | CPU          | or linecard  |           |
   |         | linecard CPU |              | CPU; main    |           |
   |         | or fwding    |              | control CPU  |           |
   |         | chip         |              | unlikely     |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | YANG,        | MIB, syslog, | template,    | YANG      |
   |Model    | custom       | YANG,        | YANG,        |           |
   |         |              | custom       | custom       |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | GPB, JSON,   | GPB, JSON,   | plain        | GPB, JSON,|
   |Encoding | XML, plain   | XML          |              | XML, plain|
   +---------+--------------+--------------+--------------+-----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF | IPFIX, mirror| gRPC      |
   |         | IPFIX,mirror |              |              |           |
   +---------+--------------+--------------+--------------+-----------+
   |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | HTTP,TCP, |
   |         | UDP          |              |              | UDP       |
   +---------+--------------+--------------+--------------+-----------+

             Figure 2: Comparison of the Data Object Modules

   Note that the interaction with the network operation applications can
   be indirect.  Some in-device data transfer is possible.  For example,
   in the management plane telemetry, the management plane may need to
   acquire data from the data plane.  Some of the operational states,
   such as the interface status and statistics, can only be derived
   from the data plane.  As another example, the control plane
   telemetry may need to access the Forwarding Information Base (FIB)
   in the data plane.

   On the other hand, an application may involve more than one plane and
   interact with multiple planes simultaneously.
   For example, an SLA
   compliance application may require both the data plane telemetry and
   the control plane telemetry.

   The requirements and challenges for each module are summarized as
   follows.

4.1.1.  Management Plane Telemetry

   The management plane of network elements interacts with the Network
   Management System (NMS), and provides information such as performance
   data, network logging data, network warning and defect data, and
   network statistics and state data.  Some legacy protocols, such as
   SNMP and Syslog, are widely used for the management plane.  However,
   these protocols are insufficient to meet the requirements of the
   future automated network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription:  An application should have the freedom
      to choose the data export means such as the data types and the
      export frequency.

   Structured Data:  For automatic network operation, machines will
      replace humans for network data comprehension.  Schema languages
      such as YANG can efficiently describe structured data and
      normalize data encoding and transformation.

   High Speed Data Transport:  In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency.  The subscription mode, by
      replacing the query mode, reduces the interactions between clients
      and servers and helps to improve the server's efficiency.

4.1.2.  Control Plane Telemetry

   The control plane telemetry refers to the health condition monitoring
   of different network control protocols covering Layer 2 to Layer 7.
   Keeping track of the running status of these protocols is beneficial
   for detecting, localizing, and even predicting various network
   issues, as well as for network optimization, in real time and at
   fine granularity.

   One of the most challenging problems for the control plane telemetry
   is how to correlate the End-to-End (E2E) Key Performance Indicators
   (KPI) to a specific layer's KPIs.  For example, an IPTV user may
   describe their User Experience (UE) by the video fluency and
   definition.  Then in case of an unusually poor UE KPI or a service
   disconnection, it is non-trivial to delimit and pinpoint the issue in
   the responsible protocol layer (e.g., the Transport Layer or the
   Network Layer), the responsible protocol (e.g., ISIS or BGP at the
   Network Layer), and finally the responsible device(s) with specific
   reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on.  One
   common issue behind these methods is that they only measure the KPIs
   instead of reflecting the actual running status of these protocols,
   making them less effective or efficient for control plane
   troubleshooting and network optimization.

   An example of the control plane telemetry is the BGP Monitoring
   Protocol (BMP).  It is currently used to monitor BGP routes and
   enables rich applications, such as BGP peer analysis, AS analysis,
   prefix analysis, security analysis, and so on.  However, the
   monitoring of other layers and protocols, and the cross-layer,
   cross-protocol KPI correlations, are still in their infancy (e.g.,
   IGP monitoring is missing) and require further research.

4.1.3.  Data Plane Telemetry

   An effective data plane telemetry system relies on the data that the
   network device can expose.  The data's quality, quantity, and
   timeliness must meet some stringent requirements.
   This raises some
   challenges to the network data plane devices where the first-hand
   data originate.

   o  A data plane device's main function is user traffic processing and
      forwarding.  While supporting network visibility is important, the
      telemetry is just an auxiliary function, and it should not impede
      normal traffic processing and forwarding (i.e., the performance is
      not lowered and the behavior is not altered due to the telemetry
      functions).

   o  The network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e., through
      in-band or out-of-band channels).

   o  The data plane devices must provide timely data with the minimum
      possible delay.  Long processing, transport, storage, and analysis
      delay can impact the effectiveness of the control loop and even
      render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume.  At the same time, the data
      types needed by applications can vary significantly.  The data
      plane devices need to provide enough flexibility and
      programmability to support the precise data provision for
      applications.

   o  The data plane telemetry should support incremental deployment and
      work even if some devices are unaware of the system.  This
      challenge is highly relevant to the standards and legacy networks.

   The data plane programmability is essential to support network
   telemetry.  Newer data plane forwarding chips are equipped with
   advanced telemetry features and provide flexibility to support
   customized telemetry functions.

4.1.3.1.  Technique Taxonomy

   There can be multiple possible dimensions to classify the data plane
   telemetry techniques.

   Active, Passive, and Hybrid:  The active and passive methods (as well
      as the hybrid types) are well documented in [RFC7799].  The
      passive methods include TCPDUMP, IPFIX [RFC7011], sFlow, and
      traffic mirror.  These methods usually have low data coverage, and
      improving the coverage comes at a very high bandwidth cost.  On
      the other hand, the active methods include Ping, Traceroute, OWAMP
      [RFC4656], TWAMP [RFC5357], and Cisco's SLA Protocol [RFC6812].
      These methods are intrusive and only provide indirect network
      measurement results.  The hybrid methods, including in-situ OAM
      [I-D.ietf-ippm-ioam-data], IPFPM [RFC8321], and Multipoint
      Alternate Marking [I-D.fioccola-ippm-multipoint-alt-mark], provide
      a well-balanced and more flexible approach.  However, these
      methods are also more complex to implement.

   In-Band and Out-of-Band:  The telemetry data, before being exported
      to some collector, can be carried in user packets.  Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.ietf-ippm-ioam-data]).  If the telemetry data is directly
      exported to some collector without modifying the user packets,
      such methods are considered out-of-band (e.g., postcard-based
      INT).  It is possible to have hybrid methods.  For example, only
      the telemetry instruction or partial data is carried by user
      packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network:  Some E2E methods start from and end at the
      network end hosts (e.g., Ping).  The other methods work in
      networks and are transparent to end hosts.  However, if needed,
      the in-network methods can be easily extended into end hosts.

   Information Type:  Depending on the telemetry objective, the methods
      can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]),
      path-based (e.g., Traceroute), or node-based (e.g., IPFIX
      [RFC7011]).  The various data objects can be packet, flow record,
      measurement, states, and signal.
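
   The taxonomy above can be read as a small lookup table.  The
   following non-normative Python sketch records only the
   classifications that this section states explicitly; a dimension
   that the text does not assign to a technique is left as None.  The
   names Technique, TECHNIQUES, and select are hypothetical and exist
   only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Technique:
    """One data plane telemetry technique and its stated classifications."""
    name: str
    method: Optional[str] = None     # "active", "passive", or "hybrid"
    channel: Optional[str] = None    # "in-band", "out-of-band", or "hybrid"
    info_type: Optional[str] = None  # "flow", "path", or "node"


# Only classifications named in the taxonomy text are filled in.
TECHNIQUES = [
    Technique("TCPDUMP", method="passive"),
    Technique("IPFIX", method="passive", info_type="node"),
    Technique("sFlow", method="passive"),
    Technique("traffic mirror", method="passive"),
    Technique("Ping", method="active"),
    Technique("Traceroute", method="active", info_type="path"),
    Technique("OWAMP", method="active"),
    Technique("TWAMP", method="active"),
    Technique("Cisco SLA Protocol", method="active"),
    Technique("in-situ OAM", method="hybrid", channel="in-band",
              info_type="flow"),
    Technique("IPFPM", method="hybrid", channel="hybrid"),
    Technique("Multipoint Alternate Marking", method="hybrid"),
    Technique("postcard-based INT", channel="out-of-band"),
]


def select(**criteria):
    """Return the names of techniques matching every given dimension."""
    return [t.name for t in TECHNIQUES
            if all(getattr(t, key) == value
                   for key, value in criteria.items())]


if __name__ == "__main__":
    print(select(method="hybrid"))
    print(select(channel="in-band"))
```

   Because the dimensions are independent, a technique can be selected
   along any combination of them, for example select(method="active",
   info_type="path").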

4.1.4.  External Data Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of network telemetry.  Correlating both
   internal telemetry data and external events with the requirements of
   network systems, as presented in
   [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
   functional advantage to management operations.

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event information
   into management cycles.  The specific challenges are described as
   follows:

   o  The role of external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages).  Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible schema.

   o  Since the main function of the external event detectors is to
      perform the notifications, their timeliness is assumed.  However,
      once messages have been dispatched, they must be quickly collected
      and inserted into the control plane with variable priority, which
      will be high for important sources and/or important events and low
      for secondary ones.

   o  The schema used by external detectors must be easily adopted by
      current and future devices and applications.  Therefore, it must
      be easily mapped to current information models, such as in terms
      of YANG.

   Organizing together both internal and external telemetry information
   will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected in
   the incorporation of cognitive capabilities to new hardware and
   software (virtual) elements.

4.2.  Second Level Function Components

   Reflecting the best current practice, the telemetry module at each
   plane is further partitioned into five distinct components:

   Data Query, Analysis, and Storage:  This component works at the
      application layer.  On the one hand, it is responsible for issuing
      data requirements.  The data of interest can be modeled data
      through configuration or custom data through programming.  The
      data requirements can be queries for one-shot data or
      subscriptions for events or streaming data.  On the other hand, it
      receives, stores, and processes the returned data from network
      devices.  Data analysis can be interactive to initiate further
      data queries.  This component can reside in either network devices
      or remote controllers.

   Data Configuration and Subscription:  This component deploys data
      queries on devices.  It determines the protocol and channel for
      applications to acquire desired data.  This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources.  The subscription data can
      be described by models, templates, or programs.

   Data Encoding and Export:  This component determines how telemetry
      data are delivered to the data analysis and storage component.
      The data encoding and the transport protocol may vary due to the
      data exporting location.

   Data Generation and Processing:  The requested data needs to be
      captured, processed, and formatted in network devices from raw
      data sources.  This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Object and Source:  This component determines the monitoring
      object and original data source.  The data source usually just
      provides raw data which needs further processing.  A data source
      can be considered a probe.  A probe can be statically installed or
      dynamically installed.
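
   As a non-normative illustration, the division of labor among the
   five components can be sketched as a chain of Python generators.
   Everything here is an assumption made for the example: the function
   names, the byte-counter samples, the derived rate, and the choice of
   JSON as the encoding.  A real deployment would use the protocols and
   encodings appropriate to the export location.

```python
import json


def data_source():
    """Data Object and Source: a static probe yielding raw samples."""
    # (timestamp in seconds, cumulative byte counter); made-up values
    yield from [(0, 0), (10, 5000), (20, 20000)]


def generate(raw, want_rate):
    """Data Generation and Processing: derive requested data from raw."""
    prev = None
    for ts, count in raw:
        if not want_rate:
            yield {"ts": ts, "bytes": count}
        elif prev is not None:
            # A rate needs two samples, so the first one yields nothing.
            rate = (count - prev[1]) / (ts - prev[0])
            yield {"ts": ts, "bytes_per_s": rate}
        prev = (ts, count)


def export(records):
    """Data Encoding and Export: encode each record (JSON assumed)."""
    for record in records:
        yield json.dumps(record, sort_keys=True)


def subscribe(want_rate):
    """Data Configuration and Subscription: deploy the data request."""
    return export(generate(data_source(), want_rate))


def collect(stream):
    """Data Query, Analysis, and Storage: receive and store the data."""
    return [json.loads(message) for message in stream]


stored = collect(subscribe(want_rate=True))

if __name__ == "__main__":
    print(stored)
```

   The application-facing component only calls subscribe() and
   collect(); how the raw counters become a rate and how records are
   encoded stay hidden behind the lower components, which is the
   separation the component list describes.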

   +----------------------------------------+
   |                                        |
   |    Data Query, Analysis, & Storage     |
   |                                        |
   +-------+++------------------------------+
           |||                   ^^^
           |||                   |||
           ||V                   |||
        +--+V--------------------+++------------+
     +-----V---------------------+------------+ |
   +---------------------+-------+----------+ | |
   | Data Configuration  |                  | | |
   | & Subscription      | Data Encoding    | | |
   | (model, template,   | & Export         | | |
   | & program)          |                  | | |
   +---------------------+------------------| | |
   |                                        | | |
   |           Data Generation              | | |
   |           & Processing                 | | |
   |                                        | | |
   +----------------------------------------| | |
   |                                        | | |
   |        Data Object and Source          | |-+
   |                                        |-+
   +----------------------------------------+

      Figure 3: Components in the Network Telemetry Framework

4.3.  Data Acquiring Mechanism and Type Abstraction

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll).  Subscription is a contract between
   publisher and subscriber.  After initial setup, the subscribed data
   is automatically delivered to registered subscribers until the
   subscription expires.  Subscription can be partitioned into two
   sub-modes: the Publish-Subscription (Pub-Sub) mode and the
   Subscription-Publish (Sub-Pub) mode.  In the Pub-Sub mode, a
   publisher publishes pre-defined data and any qualified subscriber
   can subscribe to the data as-is.  In the Sub-Pub mode, a subscriber
   initiates a data request and sends it to a publisher; the publisher
   will deliver the requested data when available.

   In contrast, query is used when a querier expects immediate and
   one-off feedback from network devices.  The queried data may be
   directly extracted from some specific data source, or synthesized
   and processed from raw data.  Query suits interactive network
   telemetry applications.
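
   The three acquisition modes described above can be contrasted in a
   short, non-normative Python sketch.  The Publisher class, its method
   names, the topic string, and the use of an update count to model
   subscription expiry are all assumptions made for this illustration;
   they do not correspond to any specific protocol.

```python
class Publisher:
    """Toy publisher supporting Pub-Sub, Sub-Pub, and query."""

    def __init__(self, predefined_topics):
        self._latest = {}                                  # data store
        self._topic_subs = {t: [] for t in predefined_topics}
        self._request_subs = []  # [predicate, callback, updates_left]

    def subscribe_topic(self, topic, callback):
        """Pub-Sub: subscribe as-is to data the publisher predefined."""
        self._topic_subs[topic].append(callback)  # KeyError if unknown

    def subscribe_request(self, predicate, callback, max_updates):
        """Sub-Pub: the subscriber supplies its own data request."""
        self._request_subs.append([predicate, callback, max_updates])

    def query(self, topic):
        """Query: immediate, one-off answer from the data store."""
        return self._latest.get(topic)

    def update(self, topic, value):
        """New data arrives; push it to every matching subscription."""
        self._latest[topic] = value
        for callback in self._topic_subs.get(topic, []):
            callback(topic, value)
        for sub in self._request_subs:
            predicate, callback, _ = sub
            if predicate(topic, value):
                callback(topic, value)
                sub[2] -= 1
        # Expired subscriptions (no updates left) are dropped.
        self._request_subs = [s for s in self._request_subs if s[2] > 0]


pub = Publisher(predefined_topics=["if/eth0/util"])
pushed, alerts = [], []
pub.subscribe_topic("if/eth0/util", lambda t, v: pushed.append(v))
pub.subscribe_request(lambda t, v: v > 0.8,
                      lambda t, v: alerts.append((t, v)),
                      max_updates=1)
for value in (0.2, 0.9, 0.95):
    pub.update("if/eth0/util", value)

if __name__ == "__main__":
    print(pushed, alerts, pub.query("if/eth0/util"))
```

   In this sketch the Pub-Sub subscriber receives every update, the
   Sub-Pub subscriber receives only the data that satisfies its own
   request until the subscription expires, and the querier reads the
   current value once on demand.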

   There are four types of data from network devices:

   Simple Data:  The data that are steadily available from some data
      store or static probes in network devices.  Such data can be
      specified by a YANG model.

   Complex Data:  The data that need to be synthesized or processed in
      the network from raw data from one or more network devices.  The
      data processing function can be statically or dynamically loaded
      into network devices.

   Event-triggered Data:  The data that are conditionally acquired based
      on the occurrence of some events.  They can be actively pushed
      through subscription or passively polled through query.  There are
      many ways to model events, including using Finite State Machine
      (FSM) or Event Condition Action (ECA)
      [I-D.wwx-netmod-event-yang].

   Streaming Data:  The data that are continuously generated.  They can
      be time series or the dump of databases.  The streaming data
      reflect real-time network states and metrics and require large
      bandwidth and processing power.  The streaming data are always
      actively pushed to the subscribers.

   The above data types are not mutually exclusive.  Rather, they often
   overlap.  For example, event-triggered data can be simple or complex,
   and streaming data can be simple, complex, or triggered by events.
   The relationships of these data types are illustrated in Figure 4.

                      +--------------+
              +------>| Simple Data  |<------+
              |       +--------------+       |
              |              ^               |
              |              |               |
              |       +------+-------+       |
              |   +-->| Complex Data |<--+   |
              |   |   +--------------+   |   |
              |   |                      |   |
              |   |                      |   |
      +-------+---+----------+     +-----+---+-------+
      | Event-triggered Data |<----+ Streaming Data  |
      +----------------------+     +-----------------+

                   Figure 4: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and complex data.
   However, other combinations are also possible.
   The conventional OAM techniques
   are mostly about querying simple data.  While these techniques are
   still useful, more advanced network telemetry techniques are designed
   mainly for event-triggered or streaming data subscription, and
   complex data query.

4.4.  Existing Works Mapped in the Framework

   The following two tables provide a non-exhaustive list of existing
   works (mainly published in IETF and with the emphasis on the latest
   new technologies) and show their positions in the framework.  More
   details can be found in Appendix A.

   The first table is based on the data acquiring mechanisms and data
   types.

   +-----------------+---------------+----------------+
   |                 | Query         | Subscription   |
   |                 |               |                |
   +-----------------+---------------+----------------+
   | Simple Data     | SNMP, NETCONF,| SNMP, NETCONF, |
   |                 | YANG, BMP,    | YANG, gRPC     |
   |                 | SMIv2, gRPC   |                |
   +-----------------+---------------+----------------+
   | Complex Data    | DNP, YANG FSM,| DNP, YANG PUSH,|
   |                 | gRPC, NETCONF | gRPC, NETCONF  |
   +-----------------+---------------+----------------+
   | Event-triggered | DNP, NETCONF, | gRPC, NETCONF, |
   | Data            | YANG FSM      | YANG PUSH, DNP,|
   |                 |               | YANG FSM       |
   +-----------------+---------------+----------------+
   | Streaming Data  |               | gRPC, NETCONF, |
   |                 | N/A           | IOAM, PBT, DNP,|
   |                 |               | IPFIX, IPFPM   |
   +-----------------+---------------+----------------+

                 Figure 5: Existing Work Mapping I

   The second table is based on the telemetry modules and components.
1053 +-------------+-----------------+---------------+--------------+ 1054 | | Management | Control | Forwarding | 1055 | | Plane | Plane | Plane | 1056 +-------------+-----------------+---------------+--------------+ 1057 | data config.| gRPC, NETCONF, | NETCONF/YANG | NETCONF/YANG,| 1058 | & subscribe | SMIv2,YANG PUSH | | YANG FSM | 1059 +-------------+-----------------+---------------+--------------+ 1060 | data gen. & | DNP, | DNP, | IOAM, | 1061 | process | YANG | YANG | PBT, IPFPM, | 1062 | | | | DNP | 1063 +-------------+-----------------+---------------+--------------+ 1064 | data | gRPC, NETCONF | BMP, NETCONF | IPFIX | 1065 | export | YANG PUSH | | | 1066 +-------------+-----------------+---------------+--------------+ 1068 Figure 6: Existing Work Mapping II 1070 5. Evolution of Network Telemetry 1072 Network telemetry is a fast evolving technical area. As the network 1073 moves towards the automated operation, network telemetry undergoes 1074 several stages of evolution. Each stage is built upon the techniques 1075 enabled by previous stages. 1077 Stage 0 - Static Telemetry: The telemetry data source and type are 1078 determined at design time. The network operator can only 1079 configure how to use it with limited flexibility. 1081 Stage 1 - Dynamic Telemetry: The custom telemetry data can be 1082 dynamically programmed or configured at runtime, allowing a 1083 tradeoff among resource, performance, flexibility, and coverage. 1084 DNP is an effort towards this direction. 1086 Stage 2 - Interactive Telemetry: The network operator can 1087 continuously customize the telemetry data in real time to reflect 1088 the network operation's visibility requirements. At this stage, 1089 some tasks can be automated, although ultimately human operators 1090 will still need to sit in the middle to make decisions. 1092 Stage 3 - Closed-loop Telemetry: Human operators are completely 1093 excluded from the control loop. 
      The intelligent network operation
      engine automatically issues the telemetry data requests, analyzes
      the data, and updates the network operations in closed control
      loops.

   Most of the existing technologies belong to stage 0 and stage 1.
   Individual stage 2 and stage 3 applications are also possible now.
   However, the future autonomic networks may need a comprehensive
   operation management system which relies on stage 2 and stage 3
   telemetry to cover all the network operation tasks.  A well-defined
   network telemetry framework is the first step in this direction.

6.  Security Considerations

   The complexity of network telemetry raises significant security
   implications.  For example, telemetry data can be manipulated to
   exhaust various network resources at each plane as well as the data
   consumer; falsified or tampered data can mislead the decision making
   and paralyze networks; wrong configuration and programming for
   telemetry is equally harmful.

   Given that this document has proposed a framework for network
   telemetry and the telemetry mechanisms discussed are distinct (in
   both message frequency and traffic amount) from the conventional
   network OAM concepts, we must also recognize that new security
   considerations may arise.  A number of techniques already exist for
   securing the forwarding plane, the control plane, and the management
   plane in a network, but it is important to consider if any new
   threat vectors are now being enabled via the use of network
   telemetry procedures and mechanisms.

   Security considerations for networks that use telemetry methods may
   include:

   o  Telemetry framework trust and policy model;

   o  Role management and access control for enabling and disabling
      telemetry capabilities;

   o  Protocol transport used for telemetry data and its inherent
      security capabilities;

   o  Telemetry data stores, storage encryption and methods of access;

   o  Tracking telemetry events and any abnormalities that might
      identify malicious attacks using telemetry interfaces.

   Some of the security considerations highlighted above may be
   minimized or negated with policy management of network telemetry.  In
   a network telemetry deployment it would be advantageous to separate
   telemetry capabilities into different classes of policies, i.e., Role
   Based Access Control and Event-Condition-Action policies.  Also,
   potential conflicts between network telemetry mechanisms must be
   detected accurately and resolved quickly to avoid unnecessary network
   telemetry traffic propagation escalating into an unintended or
   intended denial of service attack.

   Further study of the security issues will be required, and it is
   expected that the security mechanisms and protocols are developed and
   deployed along with a network telemetry system.

7.  IANA Considerations

   This document includes no request to IANA.

8.  Contributors

   The other contributors of this document are listed as follows.

   o  Tianran Zhou

   o  Zhenbin Li

   o  Zhenqiang Li

   o  Daniel King

   o  Adrian Farrel

   o  Alexander Clemm

9.  Acknowledgments

   We would like to thank Greg Mirsky, Randy Presuhn, Joe Clarke, Victor
   Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu,
   Parviz Yegani, Young Lee, Qin Wu, and many others who have provided
   helpful comments and suggestions to improve this document.

10.  Informative References

   [gnmi]     "gNMI - gRPC Network Management Interface",
              .

   [grpc]     "gRPC, A high performance, open-source universal RPC
              framework", .

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring", draft-fioccola-ippm-
              multipoint-alt-mark-04 (work in progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work
              in progress), August 2019.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
              draft-ietf-grow-bmp-local-rib-07 (work in progress), May
              2020.

   [I-D.ietf-ippm-ioam-data]
              Brockners, F., Bhandari, S., and T. Mizrahi, "Data Fields
              for In-situ OAM", draft-ietf-ippm-ioam-data-10 (work in
              progress), July 2020.

   [I-D.ietf-netconf-distributed-notif]
              Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois,
              "Subscription to Distributed Notifications", draft-ietf-
              netconf-distributed-notif-00 (work in progress), October
              2020.

   [I-D.ietf-netconf-udp-notif]
              Zheng, G., Zhou, T., Graf, T., Francois, P., and P.
              Lucente, "UDP-based Transport for Configured
              Subscriptions", draft-ietf-netconf-udp-notif-00 (work in
              progress), October 2020.

   [I-D.irtf-nmrg-ibn-concepts-definitions]
              Clemm, A., Ciavaglia, L., Granville, L., and J. Tantsura,
              "Intent-Based Networking - Concepts and Definitions",
              draft-irtf-nmrg-ibn-concepts-definitions-02 (work in
              progress), September 2020.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L.
Ryan, "gRPC 1234 Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in 1235 progress), July 2016. 1237 [I-D.openconfig-rtgwg-gnmi-spec] 1238 Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, 1239 C., and C. Morrow, "gRPC Network Management Interface 1240 (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in 1241 progress), March 2018. 1243 [I-D.pedro-nmrg-anticipated-adaptation] 1244 Martinez-Julia, P., "Exploiting External Event Detectors 1245 to Anticipate Resource Requirements for the Elastic 1246 Adaptation of SDN/NFV Systems", draft-pedro-nmrg- 1247 anticipated-adaptation-02 (work in progress), June 2018. 1249 [I-D.song-ippm-postcard-based-telemetry] 1250 Song, H., Zhou, T., Li, Z., Shin, J., and K. Lee, 1251 "Postcard-based On-Path Flow Data Telemetry", draft-song- 1252 ippm-postcard-based-telemetry-07 (work in progress), April 1253 2020. 1255 [I-D.song-opsawg-dnp4iq] 1256 Song, H. and J. Gong, "Requirements for Interactive Query 1257 with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01 1258 (work in progress), June 2017. 1260 [I-D.song-opsawg-ifit-framework] 1261 Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "In- 1262 situ Flow Information Telemetry", draft-song-opsawg-ifit- 1263 framework-13 (work in progress), October 2020. 1265 [I-D.wwx-netmod-event-yang] 1266 Bierman, A., WU, Q., Bryskin, I., Birkholz, H., Liu, X., 1267 and B. Claise, "A YANG Data model for ECA Policy 1268 Management", draft-wwx-netmod-event-yang-09 (work in 1269 progress), July 2020. 1271 [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, 1272 "Simple Network Management Protocol (SNMP)", RFC 1157, 1273 DOI 10.17487/RFC1157, May 1990, 1274 . 1276 [RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. 1277 Schoenwaelder, Ed., "Structure of Management Information 1278 Version 2 (SMIv2)", STD 58, RFC 2578, 1279 DOI 10.17487/RFC2578, April 1999, 1280 . 1282 [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, 1283 DOI 10.17487/RFC2981, October 2000, 1284 . 
1286 [RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations 1287 for the Simple Network Management Protocol (SNMP)", 1288 STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002, 1289 . 1291 [RFC3594] Duffy, P., "PacketCable Security Ticket Control Sub-Option 1292 for the DHCP CableLabs Client Configuration (CCC) Option", 1293 RFC 3594, DOI 10.17487/RFC3594, September 2003, 1294 . 1296 [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management 1297 Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, 1298 September 2004, . 1300 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1301 Zekauskas, "A One-way Active Measurement Protocol 1302 (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, 1303 . 1305 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1306 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1307 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1308 . 1310 [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for 1311 the Network Configuration Protocol (NETCONF)", RFC 6020, 1312 DOI 10.17487/RFC6020, October 2010, 1313 . 1315 [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., 1316 and A. Bierman, Ed., "Network Configuration Protocol 1317 (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, 1318 . 1320 [RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, 1321 S., and E. Yedavalli, "Cisco Service-Level Assurance 1322 Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013, 1323 . 1325 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1326 "Specification of the IP Flow Information Export (IPFIX) 1327 Protocol for the Exchange of Flow Information", STD 77, 1328 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1329 . 1331 [RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. 1332 Weingarten, "An Overview of Operations, Administration, 1333 and Maintenance (OAM) Tools", RFC 7276, 1334 DOI 10.17487/RFC7276, June 2014, 1335 . 
   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7575]  Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
              Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
              Networking: Definitions and Design Goals", RFC 7575,
              DOI 10.17487/RFC7575, June 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
              L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
              "Alternate-Marking Method for Passive and Hybrid
              Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
              January 2018.

   [RFC8639]  Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard,
              E., and A. Tripathy, "Subscription to YANG Notifications",
              RFC 8639, DOI 10.17487/RFC8639, September 2019.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
              for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
              September 2019.

Appendix A.  A Survey on Existing Network Telemetry Techniques

   In this non-normative appendix, we provide an overview of some
   existing techniques and standard proposals for each network telemetry
   module.

A.1.  Management Plane Telemetry

A.1.1.  Push Extensions for NETCONF

   NETCONF [RFC6241] is a popular network management protocol, which is
   also recommended by the IETF.  Although it can be used for data
   collection, NETCONF is primarily suited to configuration.  YANG Push
   [RFC8641][RFC8639] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore.
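   As a rough illustration of such a subscription request, the Python
   sketch below builds a periodic establish-subscription RPC.  The
   element names and namespaces reflect our reading of the YANG modules
   in [RFC8639] and [RFC8641] and should be checked against those
   documents.  The helper name build_periodic_subscription, the
   message-id value, and the ietf-interfaces path in the usage lines are
   illustrative assumptions, not part of this framework.

```python
import xml.etree.ElementTree as ET

# XML namespaces as we understand them from RFC 6241, RFC 8639, RFC 8641
# and the ietf-datastores module; verify against the RFCs before use.
NC = "urn:ietf:params:xml:ns:netconf:base:1.0"
SN = "urn:ietf:params:xml:ns:yang:ietf-subscribed-notifications"
YP = "urn:ietf:params:xml:ns:yang:ietf-yang-push"
DS = "urn:ietf:params:xml:ns:yang:ietf-datastores"

# Readable prefixes for serialization (the prefix choice is arbitrary).
for _prefix, _uri in (("nc", NC), ("sn", SN), ("yp", YP)):
    ET.register_namespace(_prefix, _uri)


def build_periodic_subscription(xpath, xpath_ns, period_cs, message_id="101"):
    """Hypothetical helper: return an establish-subscription RPC as text.

    xpath      -- XPath filter selecting the datastore nodes to stream
    xpath_ns   -- dict mapping each prefix used in xpath to its namespace
    period_cs  -- push interval in centiseconds
    message_id -- NETCONF message-id (an arbitrary example value)
    """
    rpc = ET.Element(f"{{{NC}}}rpc", {"message-id": message_id})
    sub = ET.SubElement(rpc, f"{{{SN}}}establish-subscription")

    # Target datastore, given as an identity from ietf-datastores.
    datastore = ET.SubElement(sub, f"{{{YP}}}datastore")
    datastore.set("xmlns:ds", DS)
    datastore.text = "ds:operational"

    # XPath filter; its prefixes are declared on the element itself.
    xfilter = ET.SubElement(sub, f"{{{YP}}}datastore-xpath-filter")
    for prefix, uri in xpath_ns.items():
        xfilter.set(f"xmlns:{prefix}", uri)
    xfilter.text = xpath

    # Periodic (rather than on-change) update trigger.
    periodic = ET.SubElement(sub, f"{{{YP}}}periodic")
    ET.SubElement(periodic, f"{{{YP}}}period").text = str(period_cs)
    return ET.tostring(rpc, encoding="unicode")


# Example: request interface statistics every 500 centiseconds.
request = build_periodic_subscription(
    "/if:interfaces/if:interface/if:statistics",
    {"if": "urn:ietf:params:xml:ns:yang:ietf-interfaces"},
    period_cs=500,
)
print(request)
```

   The sketch only constructs the request text; sending it over a
   NETCONF session and handling the resulting push notifications are
   left out.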
   Providing such visibility into changes made to YANG configuration
   and operational objects enables new capabilities based on the remote
   mirroring of configuration and operational state.  Moreover, a
   distributed data collection mechanism
   [I-D.ietf-netconf-distributed-notif] using a UDP-based publication
   channel [I-D.ietf-netconf-udp-notif] improves the efficiency of
   NETCONF-based telemetry.

A.1.2.  gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework.  With a single gRPC service definition,
   both configuration and telemetry can be covered.  gRPC is an open
   source microservice communication framework based on HTTP/2
   [RFC7540].  It provides a number of capabilities that are well suited
   to network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism, which further improves telemetry efficiency.

   o  Consistent higher-level features across platforms, which common
      HTTP/2 libraries typically do not provide.  This characteristic is
      especially valuable because telemetry data collectors normally
      reside on a large variety of platforms.

   o  Built-in load-balancing and failover mechanisms.

A.2.  Control Plane Telemetry

A.2.1.  BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   BGP routing information is sent from the monitored device(s) to the
   BMP monitoring station over a BMP TCP session.  BGP peers are
   monitored through the BMP Peer Up and Peer Down Notifications.
   The BGP routes (including Adjacency_RIB_In [RFC7854],
   Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both an initial table dump and real-time route updates.  In
   addition, BGP statistics are reported through the BMP Stats Report
   Message, which can be either timer-triggered or event-driven.  More
   BMP extensions can be explored to enrich the applications of BGP
   monitoring.

A.3.  Data Plane Telemetry

A.3.1.  The Alternate Marking Technology

   The Alternate Marking method efficiently performs packet loss, delay,
   and jitter measurements in both IP and overlay networks, as presented
   in [RFC8321] and [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and
   multipoint-to-multipoint flows.  Alternate Marking creates batches of
   packets by alternating the value of one bit (or a label) in the
   packet header.  These batches of packets are unambiguously recognized
   across the network, and comparing the packet counters for each batch
   at different measurement points yields the packet loss.  The same
   idea can be applied to delay measurement by selecting ad hoc packets
   with a marking bit dedicated to delay measurements.

   The Alternate Marking method needs two counters per marking period
   for each monitored flow.  For instance, with n measurement points and
   m monitored flows, the number of packet counters for each time
   interval is on the order of n*m*2 (one per color).

   Since networks offer rich sets of network performance measurement
   data (e.g., packet counters), traditional approaches run into
   limitations.  One reason is that the generation and export of the
   data, together with the amount of data that can reasonably be
   collected from the network, become the bottleneck.
   In addition, management tasks related to determining and configuring
   which data to generate lead to significant deployment challenges.

   The Multipoint Alternate Marking approach, described in
   [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
   and makes performance monitoring more flexible when a detailed
   analysis is not needed.

   An application orchestrates network performance measurement tasks
   across the network to optimize monitoring.  It can calibrate how
   detailed the monitoring data obtained from the network is by
   configuring measurement points coarsely or finely.

   Using Alternate Marking, it is possible to monitor a multipoint
   network without examining it in depth by using network clustering:
   the network is divided into clusters, which are subnetworks that
   preserve the same property as the entire network.  If there is packet
   loss or the delay is too high, the filtering criteria can be refined
   to perform a more detailed analysis by using a different combination
   of clusters, up to a per-flow measurement as described in IPFPM
   [RFC8321].

   In summary, an application can configure end-to-end network
   monitoring.  If the network does not experience issues, this
   approximate monitoring is good enough and is very cheap in terms of
   network resources.  However, in case of problems, the application
   becomes aware of the issues from this approximate monitoring and, in
   order to localize the portion of the network that has issues,
   configures the measurement points more exhaustively, so that a new,
   detailed monitoring is performed.  After the problem is detected and
   resolved, the initial approximate monitoring can be used again.

A.3.2.  Dynamic Network Probe

   Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
   provides a programmable means to customize the data that an
   application collects from the data plane.  A direct benefit of DNP is
   the reduction of the exported data.  A full DNP solution covers
   several components, including data source, data subscription, and
   data generation.  The data subscription needs to define the complex
   data that can be composed and derived from the raw data sources.  The
   data generation takes advantage of moderate in-network computing to
   produce the desired data.

   While DNP can bring unprecedented flexibility to data plane
   telemetry, it also faces some challenges.  It requires a flexible
   data plane that can be dynamically reprogrammed at run-time.  The
   programming API is yet to be defined.

A.3.3.  IP Flow Information Export (IPFIX) Protocol

   Traffic on a network can be seen as a set of flows passing through
   network elements.  IP Flow Information Export (IPFIX) [RFC7011]
   provides a means of transmitting traffic flow information for
   administrative or other purposes.  A typical IPFIX-enabled system
   includes a pool of Metering Processes that collect data packets at
   one or more Observation Points, optionally filter them, and aggregate
   information about these packets.  An Exporter then gathers each of
   the Observation Points together into an Observation Domain and sends
   this information via the IPFIX protocol to a Collector.

A.3.4.  In-Situ OAM

   Traditional passive and active monitoring and measurement techniques
   are either inaccurate or resource-consuming.  It is preferable to
   directly acquire data associated with a flow's packets when the
   packets pass through a network.
   In-situ OAM (IOAM) [I-D.ietf-ippm-ioam-data], a data generation
   technique, embeds a new instruction header in user packets; the
   instruction directs the network nodes to add the requested data to
   the packets.  Thus, at the path end, the packet's experience gained
   on the entire forwarding path can be collected.  Such firsthand data
   is invaluable to many network OAM applications.

   However, IOAM also faces some challenges.  Issues of performance
   impact, security, scalability and overhead limits, encapsulation
   difficulties in some protocols, and cross-domain deployment need to
   be addressed.

A.3.5.  Postcard Based Telemetry

   Postcard-Based Telemetry (PBT)
   [I-D.song-ippm-postcard-based-telemetry] is an alternative to IOAM.
   PBT directly exports data at each node through an independent packet.
   PBT solves several issues of IOAM.  It can also help to identify the
   drop location when a packet is dropped on its forwarding path.

A.4.  External Data and Event Telemetry

A.4.1.  Sources of External Events

   To ensure that the information provided by external event detectors
   and used by the network management solutions is meaningful for
   management purposes, the network telemetry framework must ensure that
   such detectors (sources) are easily connected to the management
   solutions (sinks).  This requires specifying a simple taxonomy of
   detectors and matching it to the connectors and/or interfaces
   required to connect them.

   Once detectors are classified in such a taxonomy, their definitions
   are extended with the qualities and other aspects used to handle them
   and are represented in the ontology and information model (e.g.,
   YANG).  Therefore, differentiating several types of detectors as
   potential sources of external events is essential for the integrity
   of the management framework.
   We thus differentiate the following source types of external events:

   o  Smart objects and sensors.  With the consolidation of the Internet
      of Things (IoT), any network system will have many smart objects
      attached to its physical surroundings and logical operation
      environments.  Most of these objects will be essentially based on
      sensors of many kinds (e.g., temperature, humidity, presence), and
      the information they provide can be very useful for the management
      of the network, even when they are not specifically deployed for
      that purpose.  Elements of this source type will usually provide a
      specific protocol for interaction, typically one of the protocols
      related to IoT, such as the Constrained Application Protocol
      (CoAP).  The telemetry framework will use this protocol to
      interact with the relevant objects.

   o  Online news reporters.  Several online news services can provide
      an enormous quantity of information about different events
      occurring in the world.  Some of those events can have an impact
      on the network system managed by a specific framework, which will
      therefore be interested in getting such information.  For
      instance, diverse security reports, such as the Common
      Vulnerabilities and Exposures (CVE), can be issued by the
      corresponding authority and used by the management solution to
      update the managed system if needed.  Instead of a specific
      protocol and data format, the sources of this kind of information
      usually follow a relaxed but structured format.  This format will
      be part of both the ontology and information model of the
      telemetry framework.

   o  Global event analyzers.  The advance of Big Data analyzers
      provides a huge amount of information and, more interestingly, the
      identification of events detected by analyzing many data streams
      from different origins.
      In contrast with the other types of sources, which are focused on
      specific events, the detectors of this source type will detect
      very generic events.  For example, a sports event takes place,
      some unexpected development makes it highly interesting, and many
      people connect to sites that are covering the event.  The systems
      supporting the services that cover the event can be affected by
      such a situation, so their management solutions should be aware of
      it.  In contrast with the other source types, a new information
      model, format, and reporting protocol are required to integrate
      the detectors of this type with the management solution.

   Additional detector types can be added to the system, but they will
   generally be the result of composing the properties offered by these
   main classes.  In any case, future revisions of the network telemetry
   framework will include the required types that cover new
   circumstances and that cannot be obtained by composition.

A.4.2.  Connectors and Interfaces

   To allow external event detectors to be properly integrated with
   other management solutions, both elements must expose interfaces and
   protocols that are suited to their particular objective.  Since
   external event detectors will be focused on providing their
   information to their main consumers, which generally will not be
   limited to the network management solutions, the framework must
   include the definition of the connectors required to ensure that the
   interconnection between detectors (sources) and their consumers
   within the management systems (sinks) is effective.

   In some situations, the interconnection between the external event
   detectors and the management system is via the management plane.
   For those situations, there will be a special connector that provides
   the typical interfaces found in most other elements connected to the
   management plane.  For instance, the interfaces will comply with a
   specific information model (YANG) and a specific telemetry protocol,
   such as NETCONF, SNMP, or gRPC.

Authors' Addresses

   Haoyu Song
   Futurewei
   2330 Central Expressway
   Santa Clara
   USA

   Email: hsong@futurewei.com

   Fengwei Qin
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: qinfengwei@chinamobile.com

   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Email: pedro@nict.go.jp

   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com

   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn