idnits 2.17.1

draft-song-opsawg-ntf-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (March 6, 2019) is 1870 days in the past. Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'RFC1157' is defined on line 916, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-07) exists of
     draft-ietf-grow-bmp-adj-rib-out-03

  == Outdated reference: A later version (-13) exists of
     draft-ietf-grow-bmp-local-rib-02

  == Outdated reference: A later version (-05) exists of
     draft-ietf-netconf-udp-pub-channel-04

  == Outdated reference: A later version (-25) exists of
     draft-ietf-netconf-yang-push-22

  == Outdated reference: A later version (-16) exists of
     draft-song-ippm-postcard-based-telemetry-01

  == Outdated reference: A later version (-10) exists of
     draft-zhou-netconf-multi-stream-originators-03

  -- Obsolete informational reference (is this intentional?): RFC 7540
     (Obsoleted by RFC 9113)

  -- Obsolete informational reference (is this intentional?): RFC 8321
     (Obsoleted by RFC 9341)

     Summary: 0 errors (**), 0 flaws (~~), 9 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                    Huawei
Intended status: Informational                                    ZQ. Li
Expires: September 7, 2019                                  China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                           March 6, 2019

                      Network Telemetry Framework
                        draft-song-opsawg-ntf-03

Abstract

   This document provides an architectural framework for network
   telemetry to address the current and future network operation
   challenges and requirements. As evidenced by the defining
   characteristics and industry practice, network telemetry covers
   technologies and protocols beyond the conventional network
   Operations, Administration, and Management (OAM). Network telemetry
   promises better flexibility, scalability, accuracy, coverage, and
   performance and allows automated control loops to suit both today's
   and tomorrow's network operation requirements. This document
   clarifies the terminologies and classifies the modules and components
   of a network telemetry system. The framework and taxonomy help to
   set a common ground for the collection of related work and provide
   guidance for future technique and standard developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 7, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Motivation
     2.1.  Use Cases
     2.2.  Challenges
     2.3.  Glossary
     2.4.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Data Acquiring Mechanisms
     4.2.  Data Objects
     4.3.  Function Components
     4.4.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Requirements and Challenges
       A.1.2.  Push Extensions for NETCONF
       A.1.3.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  Requirements and Challenges
       A.2.2.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  Requirements and Challenges
       A.3.2.  Technique Taxonomy
       A.3.3.  The IPFPM technology
       A.3.4.  Dynamic Network Probe
       A.3.5.  IP Flow Information Export (IPFIX) protocol
       A.3.6.  In-Situ OAM
       A.3.7.  Postcard Based Telemetry
     A.4.  External Data and Event Telemetry
       A.4.1.  Requirements and Challenges
       A.4.2.  Sources of External Events
       A.4.3.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is essential for network operation.
   Network telemetry has been widely considered an ideal means to gain
   sufficient network visibility with better flexibility, scalability,
   accuracy, coverage, and performance than conventional OAM
   technologies. However, confusion and misunderstandings about
   network telemetry remain (e.g., the scope and coverage of the term).
   We need an unambiguous concept and a clear architectural framework
   for network telemetry so we can better align the related technology
   and standard work.

   First, we show some key characteristics of network telemetry which
   set a clear distinction from the conventional network OAM and show
   that some conventional OAM technologies can be considered a subset of
   the network telemetry technologies. We then provide an architectural
   framework for network telemetry to meet the current and future
   network operation requirements. Following the framework, we classify
   the components of a network telemetry system so we can easily map the
   existing and emerging techniques and protocols into the framework.
   Finally, we outline a roadmap for the evolution of the network
   telemetry system.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119][RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Motivation

   Thanks to the advance of computing and storage technologies,
   today's big data analytics and machine learning-based Artificial
   Intelligence (AI) give network operators an unprecedented opportunity
   to gain network insights and move towards network autonomy. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events. In turn, the network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.

   It is conceivable that an intent-driven autonomous network is the
   logical next step for network evolution following Software Defined
   Network (SDN), aiming to reduce (or even eliminate) human labor, make
   the most efficient use of network resources, and provide better
   services more aligned with customer requirements. Although it takes
   time to reach the ultimate goal, the journey has started
   nevertheless.

   However, the system bottleneck is shifting from data consumption to
   data supply. Both the number of network nodes and the traffic
   bandwidth keep increasing at a fast pace; the network configuration
   and policy change on a much shorter time frame than ever before; and
   more subtle events and fine-grained data through all network planes
   need to be captured and exported in real time. In a nutshell, it is
   challenging to get enough high-quality data out of the network in an
   efficient, timely, and flexible way. Therefore, we need to examine the
   existing network technologies and protocols, and identify any
   potential gaps based on the real network and device architectures.

   In the remainder of this section, we first discuss several key use
   cases for today's and future network operations. Next, we show why
   the current network OAM techniques and protocols are insufficient for
   these use cases.
   The discussion underlines the need for new methods,
   techniques, and protocols which we may assign under an umbrella term:
   network telemetry.

2.1.  Use Cases

   These use cases are essential for network operations. While the list
   is by no means exhaustive, it is enough to highlight the requirements
   for data velocity, variety, and volume in networks.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. An intent is a high-level abstract policy which
      requires a complex translation and mapping process before being
      applied on networks. While a policy is enforced, the compliance
      needs to be verified and monitored continuously.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, which includes the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement. Users need to check
      if they get the service as promised and network operators need to
      evaluate how they can deliver the services that can meet the SLA.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events. Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues. However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming. While machine learning
      technologies can be used for root cause analysis, it is up to the
      network to sense and provide all the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better ROI or lower CAPEX. The first step is to know
      the real-time network conditions before applying policies for
      traffic manipulation. In some cases, micro-bursts need to be
      detected in a very short time frame so that fine-grained traffic
      control can be applied to avoid network congestion. The long-term
      network capacity planning and topology augmentation also rely on
      the accumulated data of the network operations.

   Event Tracking and Prediction:  The visibility of user traffic path
      and performance is critical for healthy network operation.
      Numerous related network events are of interest to network
      operators. For example, network operators always want to learn
      where and why packets are dropped for an application flow. They
      also want to be warned of issues in advance so proactive actions
      can be taken to avoid catastrophic consequences.

2.2.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network. Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real-time and
      interactively. The poll-based low-frequency data collection is
      ill-suited for these applications. Subscription-based streaming
      data directly pushed from the data source (e.g., the forwarding
      chip) is preferred to provide enough data quantity and precision
      at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. An open and programmable network device is therefore
      needed.

   o  Many application scenarios need to correlate data from multiple
      sources (i.e., from distributed network devices, different
      components of a network device, or different network planes). A
      piecemeal solution often lacks the capability to consolidate
      the data from multiple sources. The composition of a complete
      solution, as partly proposed by Autonomic Resource Control
      Architecture (ARCA) [I-D.pedro-nmrg-anticipated-adaptation], will
      be empowered and guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model. The unstructured data hinder tool
      automation and application extensibility. Standardized data
      models are essential to support the programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). We require
      data with arbitrary source, granularity, and precision, which is
      beyond the capability of the existing techniques.

   o  The conventional passive measurement techniques can either consume
      too many network resources and render too much redundant data, or
      lead to inaccurate results; the conventional active measurement
      techniques can interfere with the user traffic and their results
      are indirect. We need techniques that can collect direct and
      on-demand data from user traffic.

2.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We make an intended distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence. Use machine-learning based
      technologies to automate network operation.

   BMP:  BGP Monitoring Protocol

   DNP:  Dynamic Network Probe

   DPI:  Deep Packet Inspection

   gNMI:  gRPC Network Management Interface

   gRPC:  gRPC Remote Procedure Call

   IDN:  Intent-Driven Network

   IPFIX:  IP Flow Information Export Protocol

   IPFPM:  IP Flow Performance Measurement

   IOAM:  In-situ OAM

   NETCONF:  Network Configuration Protocol

   Network Telemetry:  Acquiring network data remotely for network
      monitoring and operation. A general term for a large set of
      network visibility techniques and protocols, with the
      characteristics defined in this document. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward intent-driven autonomous networks.

   NMS:  Network Management System

   OAM:  Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   SNMP:  Simple Network Management Protocol

   YANG:  A data modeling language for NETCONF

   YANG FSM:  A YANG model to define device-side finite state machine

   YANG PUSH:  A method to subscribe to pushed data from remote YANG
      datastore

2.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself from the conventional techniques for network
   OAM. The representative techniques and protocols include IPFIX
   [RFC7011] and gRPC [I-D.kumar-rtgwg-grpc-protocol]. Network telemetry
   allows separate entities to acquire data from network devices so that
   data can be visualized and analyzed to support network monitoring and
   operation. Network telemetry overlaps with the conventional network
   OAM and has a wider scope than it. It is expected that network
   telemetry can provide the necessary network insight for autonomous
   networks, address the shortcomings of conventional OAM techniques,
   and allow for the emergence of new techniques bearing certain
   characteristics.

   One difference between network telemetry and network OAM is
   that network telemetry assumes machines as data consumers, while
   the conventional network OAM usually assumes human operators. Hence,
   network telemetry can directly trigger automated network
   operation, but the conventional OAM tools only help human operators
   to monitor and diagnose the networks and guide manual network
   operations. The difference leads to very different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several characteristics of network
   telemetry have been well accepted (Note that network telemetry is
   intended to be an umbrella term covering a wide spectrum of
   techniques, so the following characteristics are not expected to be
   held by every specific technique):

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans. Therefore, the data volume is
      huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable. Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The telemetry data is modeled in advance which allows
      applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   Note that a technique does not need to have all the above
   characteristics to be qualified as telemetry. An ideal network
   telemetry solution may also have the following features or
   properties:

   o  In-Network Customization: The data can be customized in network at
      run-time to cater to the specific need of applications.
      This needs the support of a programmable data plane which allows
      probes to be deployed at flexible locations.

   o  Direct Data Plane Export: The data originated from the data plane
      can be directly exported to the data consumer for efficiency,
      especially when the data bandwidth is large and real-time
      processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data to
      be collected directly for any target flow on its entire forwarding
      path.

   o  Non-intrusive: The telemetry system should avoid the pitfall of
      the "observer effect". That is, it should not change the network
      behavior or affect the forwarding performance.

3.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks. The single-sourced and static data acquisition cannot
   meet the data requirements. It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers. This
   allows flexible combinations for different applications. The
   framework would benefit application development for the following
   reasons:

   o  The future autonomous networks will require a holistic view on
      network visibility. All the use cases and applications need to be
      supported uniformly and coherently under a single intelligent
      agent. Therefore, the protocols and mechanisms should be
      consolidated into a minimum yet comprehensive set. A telemetry
      framework can help to normalize the technique developments.

   o  Network visibility presents multiple viewpoints.
      For example, the device viewpoint takes the network infrastructure
      as the monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired. An application may need to switch its
      viewpoint during operation. It may also need to correlate a
      service and its impact on network experience to acquire
      comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use the network resource and reduce the performance
      impact. Routine network monitoring covers the entire network with
      low data sampling rate. When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   So far, some telemetry-related work has been done within the IETF.
   However, the work is fragmented and scattered in different working
   groups. The lack of coherence makes it difficult to assemble a
   comprehensive network telemetry system and causes repetitive and
   redundant work.

   A formal network telemetry framework is needed for constructing a
   working system. The framework should cover the concepts and
   components from the standardization perspective. This document
   clarifies the layers on which the telemetry is exerted and decomposes
   the telemetry system into a set of distinct components that the
   existing and future work can easily map to.

4.  Network Telemetry Framework

   Network telemetry techniques can be classified from multiple
   dimensions. In this document, we provide three unique perspectives:
   data acquiring mechanisms, data objects, and function components.

4.1.  Data Acquiring Mechanisms

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll). A subscriber may request that data be
   delivered when it is ready. Subscription follows either a pub-sub
   mode or a sub-pub mode. In the pub-sub mode, pre-defined data are
   published and multiple qualified subscribers can subscribe to the
   data. In the sub-pub mode, a subscriber designates what data are of
   interest and requires the network devices to deliver the data when
   they are available.

   In contrast, a querier expects immediate feedback from network
   devices. It is usually used in a more interactive environment. The
   queried data may be directly extracted from some specific data
   source, or synthesized and processed from raw data.

   There are four types of data from network devices:

   Simple Data:  The data that are steadily available from some data
      store or static probes in network devices. Such data can be
      specified by a YANG model.

   Custom Data:  The data that need to be synthesized or processed from
      raw data from one or more network devices. The data processing
      function can be statically or dynamically loaded into network
      devices.

   Event-triggered Data:  The data are conditionally acquired based on
      the occurrence of some event. An event can be modeled as a Finite
      State Machine (FSM).

   Streaming Data:  The data are continuously or periodically generated.

   The above data types are not mutually exclusive. For example, event-
   triggered data can be simple or custom, and streaming data can be
   event triggered.
   The relationships of these data types are illustrated in Figure 1.

                      +--------------------------+
                      | +----------------------+ |
                      | | +-----------------+  | |
                      | | | +-------------+ |  | |
                      | | | | Simple Data | |  | |
                      | | | +-------------+ |  | |
                      | | |   Custom Data   |  | |
                      | | +-----------------+  | |
                      | | Event-triggered Data | |
                      | +----------------------+ |
                      |      Streaming Data      |
                      +--------------------------+

                    Figure 1: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and custom data. It
   is easy to see that conventional OAM techniques are mostly about
   querying simple data only. While these techniques are still useful,
   advanced network telemetry techniques pay more attention to the other
   three data types, and prefer subscription and custom data query over
   simple data query.

4.2.  Data Objects

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as other sources out
   of the network, as shown in Figure 2. Therefore, we categorize the
   network telemetry into four distinct modules.
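
   As an informal illustration only, the four data types of Section 4.1
   and the four modules named above can be written down as a small
   Python sketch. The names DataType, Module, and TelemetryRequest are
   hypothetical and are not defined by this document or by any protocol.
   The data types are modeled as combinable flags because they are not
   mutually exclusive, and the mapping to a mechanism encodes only the
   "usually" rule stated in Section 4.1.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto


class DataType(Flag):
    """Hypothetical encoding of the four data types; flags can be combined."""
    SIMPLE = auto()
    CUSTOM = auto()
    EVENT_TRIGGERED = auto()
    STREAMING = auto()


class Module(Enum):
    """Hypothetical encoding of the four telemetry modules."""
    FORWARDING_PLANE = "forwarding plane"
    CONTROL_PLANE = "control plane"
    MANAGEMENT_PLANE = "management plane"
    EXTERNAL = "external data and event"


@dataclass(frozen=True)
class TelemetryRequest:
    """What an application asks for: which module, and which kind of data."""
    module: Module
    kind: DataType

    def usual_mechanism(self) -> str:
        # Subscription usually handles event-triggered and streaming data;
        # query usually handles simple and custom data.
        if self.kind & (DataType.EVENT_TRIGGERED | DataType.STREAMING):
            return "subscription"
        return "query"


# Event-triggered data can itself be simple or custom, hence the combination.
drop_alarm = TelemetryRequest(
    Module.FORWARDING_PLANE, DataType.EVENT_TRIGGERED | DataType.CUSTOM
)
if_counters = TelemetryRequest(Module.MANAGEMENT_PLANE, DataType.SIMPLE)

print(drop_alarm.usual_mechanism())   # subscription
print(if_counters.usual_mechanism())  # query
```

   A real system would likely carry more detail per request (for
   example, the model or template describing the data), which this
   sketch leaves out.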
545        +------------------------------+
546        |                              |
547        |       Network Operation      |<-------+
548        |          Applications        |        |
549        |                              |        |
550        +------------------------------+        |
551             ^      ^          ^                |
552             |      |          |                |
553             V      |          V                V
554        +-----------|---+--------------+  +-----------+
555        |           |   |              |  |           |
556        | Control Pl|ane|              |  | External  |
557        | Telemetry | <--->            |  | Data and  |
558        |           |   |              |  | Event     |
559        |      ^    V   |  Management  |  | Telemetry |
560        +------|--------+  Plane       |  |           |
561        |      V        |  Telemetry   |  +-----------+
562        | Forwarding    |              |
563        | Plane       <--->            |
564        | Telemetry     |              |
565        |               |              |
566        +---------------+--------------+

568        Figure 2: Layer Category of the Network Telemetry Framework

570     The rationale for this partition lies in the different telemetry data
571     objects, which result in different data sources and export locations.
572     Such differences have profound implications for in-network data
573     programming and processing capability, data encoding and transport
574     protocol, and data bandwidth and latency.

576     We summarize the major differences among the four modules in the
577     following table. Some representative techniques are shown in some
578     table cells to highlight the technical diversity of these modules.

580     +---------+--------------+--------------+--------------+-----------+
581     | Module  | Control      | Management   | Forwarding   | External  |
582     |         | Plane        | Plane        | Plane        | Data      |
583     +---------+--------------+--------------+--------------+-----------+
584     |Object   | control      | config. &    | flow & packet| terminal, |
585     |         | protocol &   | operation    | QoS, traffic | social &  |
586     |         | signaling,   | state, MIB   | stat., buffer| environ-  |
587     |         | RIB, ACL     |              | & queue stat.| mental    |
588     +---------+--------------+--------------+--------------+-----------+
589     |Export   | main control | main control | fwding chip  | various   |
590     |Location | CPU,         | CPU          | or linecard  |           |
591     |         | linecard CPU |              | CPU; main    |           |
592     |         | or fwding    |              | control CPU  |           |
593     |         | chip         |              | unlikely     |           |
594     +---------+--------------+--------------+--------------+-----------+
595     |Model    | YANG,        | MIB, syslog, | template,    | YANG      |
596     |         | custom       | YANG,        | YANG,        |           |
597     |         |              | custom       | custom       |           |
598     +---------+--------------+--------------+--------------+-----------+
599     |Encoding | GPB, JSON,   | GPB, JSON,   | plain        | GPB, JSON,|
600     |         | XML, plain   | XML          |              | XML, plain|
601     +---------+--------------+--------------+--------------+-----------+
602     |Protocol | gRPC,NETCONF,| gRPC,NETCONF | IPFIX, mirror| gRPC      |
603     |         | IPFIX,mirror |              |              |           |
604     +---------+--------------+--------------+--------------+-----------+
605     |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | TCP, UDP  |
606     |         | UDP          |              |              |           |
607     +---------+--------------+--------------+--------------+-----------+

609        Figure 3: Layer Category of the Network Telemetry Framework

611     Note that the interaction with the network operation applications can
612     be indirect. For example, in management plane telemetry, the
613     management plane may need to acquire data from the data plane. Some
614     operational states, such as the interface status and statistics, can
615     only be derived from the data plane. As another example, control
616     plane telemetry may need to access the FIB in the data plane.
617     On the other hand, an application may involve more than one plane
618     simultaneously. For example, an SLA compliance application may
619     require both data plane telemetry and control plane
620     telemetry.

622  4.3. Function Components

624     At each plane, the telemetry can be further partitioned into five
625     distinct components:

627     Data Query, Analysis, and Storage: This component works at the
628        application layer. On the one hand, it is responsible for issuing
629        data queries. The queries can be for modeled data through
630        configuration or custom data through programming. The queries can
631        be one-shot or subscriptions for events or streaming data. On the
632        other hand, it receives, stores, and processes the returned data
633        from network devices. Data analysis can be interactive to
634        initiate further data queries.

636     Data Configuration and Subscription: This component deploys data
637        queries on devices. It determines the protocol and channel for
638        applications to acquire desired data. This component is also
639        responsible for configuring the desired data that might not be
640        directly available from data sources. The subscription data can
641        be described by models, templates, or programs.

643     Data Encoding and Export: This component determines how telemetry
644        data are delivered to the data analysis and storage component.
645        The data encoding and the transport protocol may vary due to the
646        data exporting location.

648     Data Generation and Processing: The requested data needs to be
649        captured, processed, and formatted in network devices from raw
650        data sources. This may involve in-network computing and
651        processing on either the fast path or the slow path in network
652        devices.

654     Data Object and Source: This component determines the monitoring
655        object and original data source. The data source usually just
656        provides raw data which needs further processing. A data source
657        can be considered a probe. A probe can be statically installed or
658        dynamically installed.
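
As a rough illustration of how a request might flow through these five components, the sketch below models a one-shot query and a subscription in Python. Every name in it (`Request`, `RAW_SOURCE`, `generate`, `export`, `serve`) is hypothetical and chosen for this example only; the framework does not define any such API, and JSON is assumed merely as one possible encoding.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Data Object and Source: raw counters exposed by a hypothetical probe.
RAW_SOURCE: Dict[str, List[int]] = {"if0/rx_packets": [10, 25, 45]}


@dataclass
class Request:
    """Data Configuration and Subscription: what the application asks for."""
    path: str
    subscribe: bool = False  # False: one-shot query; True: streaming data


def generate(req: Request) -> Iterator[int]:
    """Data Generation and Processing: derive per-sample deltas from raw data."""
    samples = RAW_SOURCE[req.path]
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    # A one-shot query returns only the latest delta; a subscription streams all.
    yield from (deltas if req.subscribe else deltas[-1:])


def export(req: Request, value: int) -> str:
    """Data Encoding and Export: serialize one update (JSON assumed here)."""
    return json.dumps({"path": req.path, "delta": value})


def serve(req: Request) -> List[dict]:
    """Data Query, Analysis, and Storage: collect and decode returned data."""
    return [json.loads(export(req, v)) for v in generate(req)]


print(serve(Request("if0/rx_packets")))
print(serve(Request("if0/rx_packets", subscribe=True)))
```

The one-shot request yields a single update, while the subscription yields one update per sample interval; in a real deployment each function would sit in a different place (application, device slow path, device fast path).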

660        +----------------------------------------+
661        |                                        |
662        |    Data Query, Analysis, & Storage     |
663        |                                        |
664        +----------------------------------------+
665                  |                    ^
666                  |                    |
667                  V                    |
668        +---------------------+------------------+
669        | Data Configuration  |                  |
670        | & Subscription      | Data Encoding    |
671        | (model, template,   | & Export         |
672        | & program)          |                  |
673        +---------------------+------------------|
674        |                                        |
675        |           Data Generation              |
676        |           & Processing                 |
677        |                                        |
678        +----------------------------------------|
679        |                                        |
680        |        Data Object and Source          |
681        |                                        |
682        +----------------------------------------+

684        Figure 4: Components in the Network Telemetry Framework

686     Since most existing standards-related work belongs to the first four
687     components, the remainder of the document focuses on these
688     components only.

690  4.4. Existing Works Mapped in the Framework

692     The following two tables provide a non-exhaustive list of existing
693     work (mainly published in the IETF, with an emphasis on the latest
694     technologies) and show its position in the framework.
695     Details about the mentioned work can be found in Appendix A.

697     +-----------------+---------------+----------------+
698     |                 | Query         | Subscription   |
699     |                 |               |                |
700     +-----------------+---------------+----------------+
701     | Simple Data     | SNMP, NETCONF,|                |
702     |                 | YANG, BMP,    |                |
703     |                 | IOAM, PBT     |                |
704     +-----------------+---------------+----------------+
705     | Custom Data     | DNP, YANG FSM,|                |
706     |                 | gRPC, NETCONF |                |
707     +-----------------+---------------+----------------+
708     | Event-triggered |               | gRPC, NETCONF, |
709     | Data            |               | YANG PUSH, DNP,|
710     |                 |               | IOAM, PBT,     |
711     |                 |               | YANG FSM       |
712     +-----------------+---------------+----------------+
713     | Streaming Data  |               | gRPC, NETCONF, |
714     |                 |               | IOAM, PBT, DNP,|
715     |                 |               | IPFIX, IPFPM   |
716     +-----------------+---------------+----------------+

718        Figure 5: Existing Work Mapping I

720     +--------------+---------------+----------------+---------------+
721     |              | Management    | Control        | Forwarding    |
722     |              | Plane         | Plane          | Plane         |
723     +--------------+---------------+----------------+---------------+
724     | data Config. | gRPC, NETCONF,| NETCONF/YANG   | NETCONF/YANG, |
725     | & subscrib.  | YANG PUSH     |                | YANG FSM      |
726     +--------------+---------------+----------------+---------------+
727     | data gen. &  | DNP,          | DNP,           | In-situ OAM,  |
728     | processing   | YANG          | YANG           | PBT, IPFPM,   |
729     |              |               |                | DNP           |
730     +--------------+---------------+----------------+---------------+
731     | data         | gRPC, NETCONF,| BMP, NETCONF   | IPFIX         |
732     | export       | YANG PUSH     |                |               |
733     +--------------+---------------+----------------+---------------+

735        Figure 6: Existing Work Mapping II

737  5. Evolution of Network Telemetry

739     As networks evolve towards automated operation, network
740     telemetry also undergoes several levels of evolution.

742     Level 0 - Static Telemetry: The telemetry data is determined at
743        design time. The network operator can only configure how to use
744        it with limited flexibility.

746     Level 1 - Dynamic Telemetry: The telemetry data can be dynamically
747        programmed or configured at runtime, allowing a tradeoff among
748        resource, performance, flexibility, and coverage. DNP is an
749        effort in this direction.

751     Level 2 - Interactive Telemetry: The network operator can
752        continuously customize the telemetry data in real time to reflect
753        the network operation's visibility requirements. At this level,
754        some tasks can be automated, although ultimately human operators
755        will still need to sit in the middle to make decisions.

757     Level 3 - Closed-loop Telemetry: Human operators are completely
758        excluded from the control loop. The intelligent network operation
759        engine automatically issues the telemetry data request, analyzes
760        the data, and updates the network operations in closed control
761        loops.

763     While most of the existing technologies belong to level 0 and level
764     1, with the help of a clearly defined network telemetry framework, we
765     can assemble the technologies to support level 2 and make solid steps
766     towards level 3.

768  6. Security Considerations

770     Given that this document has proposed a framework for network
771     telemetry and the telemetry mechanisms discussed are distinct (in
772     both message frequency and traffic amount) from the conventional
773     network OAM concepts, we must recognize that various new security
774     considerations may arise. A number of techniques already exist
775     for securing the data plane, control plane, and management plane
776     in a network, but it is important to consider whether any new threat
777     vectors are now being enabled via the use of network telemetry
778     procedures and mechanisms.

780     Security considerations for networks that use telemetry methods may
781     include:

783     o  Telemetry framework trust and policy model;

785     o  Role management and access control for enabling and disabling
786        telemetry capabilities;

788     o  Protocol transport used for telemetry data and its inherent
789        security capabilities;

791     o  Telemetry data stores, storage encryption, and methods of access;

793     o  Tracking telemetry events and any abnormalities that might
794        identify malicious attacks using telemetry interfaces.

796     Some of the security considerations highlighted above may be
797     minimized or negated with policy management of network telemetry. In
798     a network telemetry deployment, it would be advantageous to separate
799     telemetry capabilities into different classes of policies, i.e.,
800     Role-Based Access Control and Event-Condition-Action policies. Also,
801     potential conflicts between network telemetry mechanisms must be
802     detected accurately and resolved quickly to avoid unnecessary network
803     telemetry traffic propagation escalating into an unintended or
804     intended denial-of-service attack.

806     Further discussion of these considerations is required, and it is
807     expected that this security section and a subsequent policy
808     section will be developed further.

810  7. IANA Considerations

812     This document includes no request to IANA.

814  8. Contributors

816     The other major contributors to this document are listed as follows.

818     o  Tianran Zhou

820     o  Zhenbin Li

822     o  Daniel King

824  9. Acknowledgments

826     We would like to thank Adrian Farrel, Randy Presuhn, Victor Liu,
827     James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz
828     Yegani, Young Lee, Alexander Clemm, Joe Clarke, and many others who
829     have provided helpful comments and suggestions to improve this
830     document.

832  10. References

834  10.1.
Normative References

836     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
837                Requirement Levels", BCP 14, RFC 2119,
838                DOI 10.17487/RFC2119, March 1997.

841     [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
842                2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
843                May 2017.

845  10.2. Informative References

847     [I-D.brockners-inband-oam-requirements]
848                Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
849                Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
850                T., Lapukhov, P., and r. remy@barefootnetworks.com,
851                "Requirements for In-situ OAM",
852                draft-brockners-inband-oam-requirements-03 (work in
                   progress), March 2017.

854     [I-D.fioccola-ippm-multipoint-alt-mark]
855                Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
856                "Multipoint Alternate Marking method for passive and
857                hybrid performance monitoring",
858                draft-fioccola-ippm-multipoint-alt-mark-04 (work in
                   progress), June 2018.

860     [I-D.ietf-grow-bmp-adj-rib-out]
861                Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
862                Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
863                Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-03 (work
864                in progress), December 2018.

866     [I-D.ietf-grow-bmp-local-rib]
867                Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
868                "Support for Local RIB in BGP Monitoring Protocol (BMP)",
869                draft-ietf-grow-bmp-local-rib-02 (work in progress),
870                September 2018.

872     [I-D.ietf-netconf-udp-pub-channel]
873                Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
874                Channel for Streaming Telemetry",
875                draft-ietf-netconf-udp-pub-channel-04 (work in progress),
                   October 2018.

877     [I-D.ietf-netconf-yang-push]
878                Clemm, A., Voit, E., Prieto, A., Tripathy, A.,
879                Nilsen-Nygaard, E., Bierman, A., and B. Lengyel,
880                "Subscription to YANG Datastores",
881                draft-ietf-netconf-yang-push-22 (work in progress),
                   February 2019.

883     [I-D.kumar-rtgwg-grpc-protocol]
884                Kumar, A., Kolhe, J., Ghemawat, S., and L.
                   Ryan, "gRPC
885                Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in
886                progress), July 2016.

888     [I-D.openconfig-rtgwg-gnmi-spec]
889                Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
890                C., and C. Morrow, "gRPC Network Management Interface
891                (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
892                progress), March 2018.

894     [I-D.pedro-nmrg-anticipated-adaptation]
895                Martinez-Julia, P., "Exploiting External Event Detectors
896                to Anticipate Resource Requirements for the Elastic
897                Adaptation of SDN/NFV Systems",
898                draft-pedro-nmrg-anticipated-adaptation-02 (work in
                   progress), June 2018.

900     [I-D.song-ippm-postcard-based-telemetry]
901                Song, H., Zhou, T., Li, Z., and J. Shin, "Postcard-based
902                In-band Flow Data Telemetry",
903                draft-song-ippm-postcard-based-telemetry-01 (work in
                   progress), December 2018.

905     [I-D.song-opsawg-dnp4iq]
906                Song, H. and J. Gong, "Requirements for Interactive Query
907                with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
908                (work in progress), June 2017.

910     [I-D.zhou-netconf-multi-stream-originators]
911                Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman,
912                "Subscription to Multiple Stream Originators",
913                draft-zhou-netconf-multi-stream-originators-03 (work in
914                progress), October 2018.

916     [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
917                "Simple Network Management Protocol (SNMP)", RFC 1157,
918                DOI 10.17487/RFC1157, May 1990.

921     [RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981,
922                DOI 10.17487/RFC2981, October 2000.

925     [RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations
926                for the Simple Network Management Protocol (SNMP)",
927                STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

930     [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
931                Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
932                September 2004.

934     [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
935                Zekauskas, "A One-way Active Measurement Protocol
936                (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

939     [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
940                Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
941                RFC 5357, DOI 10.17487/RFC5357, October 2008.

944     [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
945                and A. Bierman, Ed., "Network Configuration Protocol
946                (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

949     [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
950                "Specification of the IP Flow Information Export (IPFIX)
951                Protocol for the Exchange of Flow Information", STD 77,
952                RFC 7011, DOI 10.17487/RFC7011, September 2013.

955     [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
956                Weingarten, "An Overview of Operations, Administration,
957                and Maintenance (OAM) Tools", RFC 7276,
958                DOI 10.17487/RFC7276, June 2014.

961     [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
962                Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
963                DOI 10.17487/RFC7540, May 2015.

966     [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
967                Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
968                May 2016.

970     [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
971                Monitoring Protocol (BMP)", RFC 7854,
972                DOI 10.17487/RFC7854, June 2016.

975     [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
976                L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
977                "Alternate-Marking Method for Passive and Hybrid
978                Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
979                January 2018.

981  Appendix A. A Survey on Existing Network Telemetry Techniques

983     We provide an overview of the challenges and existing solutions for
984     each network telemetry module.

986  A.1. Management Plane Telemetry

988  A.1.1.
Requirements and Challenges

990     The management plane of the network element interacts with the
991     Network Management System (NMS), and provides information such as
992     performance data, network logging data, network warning and defects
993     data, and network statistics and state data. Some legacy protocols
994     are widely used for the management plane, such as SNMP and Syslog.
995     However, these protocols are insufficient to meet the requirements of
996     automatic network operation applications.

998     New management plane telemetry protocols should consider the
999     following requirements:

1001    Convenient Data Subscription: An application should have the freedom
1002       to choose the data export means such as the data types and the
1003       export frequency.

1005    Structured Data: For automatic network operation, machines will
1006       replace humans in comprehending network data. Schema
1007       languages such as YANG can efficiently describe structured data
1008       and normalize data encoding and transformation.

1010    High Speed Data Transport: In order to retain the information, a
1011       server needs to send a large amount of data at high frequency.
1012       Compact encoding formats are needed to compress the data and
1013       improve the data transport efficiency. The push mode, by
1014       replacing the poll mode, can also reduce the interactions between
1015       clients and servers, which helps to improve the server's
1016       efficiency.

1018 A.1.2. Push Extensions for NETCONF

1020    NETCONF [RFC6241] is a popular network management protocol, which
1021    is also recommended by the IETF. Although it can be used for data
1022    collection, NETCONF is best suited to configuration. YANG Push
1024    [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
1025    applications to request a continuous, customized stream of updates
1026    from a YANG datastore.
Providing such visibility into changes made
1027    to YANG configuration and operational objects enables new
1028    capabilities based on the remote mirroring of configuration and
1029    operational state. Moreover, a distributed data collection mechanism
1030    [I-D.zhou-netconf-multi-stream-originators] via a UDP-based
1031    publication channel [I-D.ietf-netconf-udp-pub-channel] provides
1032    enhanced efficiency for NETCONF-based telemetry.

1034 A.1.3. gRPC Network Management Interface

1036    gRPC Network Management Interface (gNMI)
1037    [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
1038    based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
1039    Procedure Call) framework. With a single gRPC service definition,
1040    both configuration and telemetry can be covered. gRPC is an open
1041    source microservice communication framework based on HTTP/2 [RFC7540].
1042    It provides a number of capabilities which are well-suited for
1043    network telemetry, including:

1045    o  A full-duplex streaming transport model combined with a binary
1046       encoding mechanism, which further improves telemetry efficiency.

1048    o  Consistency of higher-level features across platforms, which
1049       common HTTP/2 libraries typically do not provide. This
1050       characteristic is especially valuable because telemetry
1051       data collectors normally reside on a large variety of platforms.

1053    o  A built-in load-balancing and failover mechanism.

1055 A.2. Control Plane Telemetry

1057 A.2.1. Requirements and Challenges

1059    Control plane telemetry refers to the health monitoring
1060    of different network protocols, covering Layer 2 to Layer 7.
1061    Keeping track of the running status of these protocols is beneficial
1062    for detecting, localizing, and even predicting various network
1063    issues, as well as network optimization, in real time and at fine
1064    granularity.

1066    One of the most challenging problems for control plane telemetry
1067    is how to correlate the E2E Key Performance Indicators (KPI) to a
1068    specific layer's KPIs. For example, an IPTV user may describe their
1069    User Experience (UE) by the video fluency and definition. Then, in
1070    case of an unusually poor UE KPI or a service disconnection, it is
1071    non-trivial work to delimit and localize the issue to the responsible
1072    protocol layer (e.g., the Transport Layer or the Network Layer), the
1073    responsible protocol (e.g., ISIS or BGP at the Network Layer), and
1074    finally the responsible device(s) with specific reasons.

1076    Traditional OAM-based approaches for control plane KPI measurement
1077    include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on. One common
1078    issue behind these methods is that they only measure the KPIs instead
1079    of reflecting the actual running status of these protocols, making
1080    them less effective or efficient for control plane troubleshooting
1081    and network optimization. An example of control plane telemetry
1082    is the BGP Monitoring Protocol (BMP), which is currently used to
1083    monitor BGP routes and enables rich applications, such as BGP
1084    peer analysis, AS analysis, prefix analysis, security analysis, and
1085    so on. However, the monitoring of other layers and protocols, and
1086    cross-layer, cross-protocol KPI correlation, are still in their
1087    infancy (e.g., IGP monitoring is missing) and require
1088    substantial further research.

1090 A.2.2. BGP Monitoring Protocol

1092    BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
1093    sessions and is intended to provide a convenient interface for
1094    obtaining route views.

1096    The BGP routing information is collected from the monitored device(s)
1097    to the BMP monitoring station by setting up the BMP TCP session. The
1098    BGP peers are monitored by the BMP Peer Up and Peer Down
1099    Notifications.
The BGP routes (including Adjacency_RIB_In [RFC7854],
1100    Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB
1101    [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
1102    Monitoring Message and the BMP Route Mirroring Message, in the form
1103    of both initial table dump and real-time route update. In addition,
1104    BGP statistics are reported through the BMP Stats Report Message,
1105    which could be either timer-triggered or event-driven. More BMP
1106    extensions can be explored to enrich the applications of BGP
1107    monitoring.

1109 A.3. Data Plane Telemetry

1111 A.3.1. Requirements and Challenges

1113    An effective data plane telemetry system relies on the data that the
1114    network device can expose. The data's quality, quantity, and
1115    timeliness must meet some stringent requirements. This raises some
1116    challenges for the network data plane devices where the firsthand
1117    data originate.

1119    o  A data plane device's main function is user traffic processing and
1120       forwarding. While supporting network visibility is important, the
1121       telemetry is just an auxiliary function, and it should not impede
1122       normal traffic processing and forwarding (i.e., the performance is
1123       not lowered and the behavior is not altered due to the telemetry
1124       functions).

1126    o  Network operation applications require end-to-end visibility
1127       from various sources, which results in a huge volume of data.
1128       However, the sheer data quantity should not stress the network
1129       bandwidth, regardless of the data delivery approach (i.e., through
1130       in-band or out-of-band channels).

1132    o  The data plane devices must provide timely data with the minimum
1133       possible delay. Long processing, transport, storage, and analysis
1134       delay can impact the effectiveness of the control loop and even
1135       render the data useless.

1137    o  The data should be structured and labeled, and easy for
1138       applications to parse and consume.
At the same time, the data
1139       types needed by applications can vary significantly. The data
1140       plane devices need to provide enough flexibility and
1141       programmability to support the precise data provision for
1142       applications.

1144    o  The data plane telemetry should support incremental deployment and
1145       work even when some devices are unaware of the system. This
1146       challenge is highly relevant to the standards and legacy networks.

1148    The industry has agreed that data plane programmability is
1149    essential to support network telemetry. Newer data plane chips are
1150    all equipped with advanced telemetry features and provide flexibility
1151    to support customized telemetry functions.

1153 A.3.2. Technique Taxonomy

1155    There are multiple possible dimensions along which to classify
1156    data plane telemetry techniques.

1158    Active and Passive: The active and passive methods (as well as the
1159       hybrid types) are well documented in [RFC7799]. The passive
1160       methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic
1161       mirroring. These methods usually have low data coverage, and
1162       improving the coverage comes at a very high bandwidth cost.
1163       On the other hand, the active methods include Ping, Traceroute,
1164       OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive
1165       and only provide indirect network measurement results. The hybrid
1166       methods, including in-situ OAM
1168       [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
1169       Multipoint Alternate Marking
1170       [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
1171       and more flexible approach. However, these methods are also more
1172       complex to implement.

1174    In-Band and Out-of-Band: The telemetry data, before being exported
1175       to some collector, can be carried in user packets. Such methods
1176       are considered in-band (e.g., in-situ OAM
1177       [I-D.brockners-inband-oam-requirements]).
If the telemetry data
1178       is directly exported to some collector without modifying the user
1179       packets, such methods are considered out-of-band (e.g.,
1180       postcard-based INT). It is possible to have hybrid methods. For
1181       example, only the telemetry instruction or partial data is carried
1182       by user packets (e.g., IPFPM [RFC8321]).

1184    E2E and In-Network: Some E2E methods start from and end at the
1185       network end hosts (e.g., Ping). Other methods work within the
1186       network and are transparent to end hosts. However, if needed,
1187       the in-network methods can be easily extended into end hosts.

1189    Flow, Path, and Node: Depending on the telemetry objective, the
1190       methods can be flow-based (e.g., in-situ OAM
1191       [I-D.brockners-inband-oam-requirements]), path-based (e.g.,
1192       Traceroute), or node-based (e.g., IPFIX [RFC7011]).

1194 A.3.3. The IPFPM Technology

1196    The Alternate Marking method efficiently performs packet loss,
1197    delay, and jitter measurements in both IP and overlay networks, as
1198    presented in IPFPM [RFC8321] and
1199    [I-D.fioccola-ippm-multipoint-alt-mark].

1201    This technique can be applied to point-to-point and
1202    multipoint-to-multipoint flows. Alternate Marking creates batches of
1203    packets by alternating the value of 1 bit (or a label) of the packet
1204    header. These batches of packets are unambiguously recognized over the
1205    network and the comparison of packet counters for each batch allows
1206    the packet loss calculation. The same idea can be applied to delay
1207    measurement by selecting ad hoc packets with a marking bit dedicated
1208    for delay measurements.

1210    The Alternate Marking method needs two counters per marking period
1211    for each monitored flow. For instance, by considering n measurement
1212    points and m monitored flows, the order of magnitude of the packet
1213    counters for each time interval is n*m*2 (1 per color).
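
The counter comparison described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure specified in [RFC8321]: the function names `loss_per_batch` and `counter_budget` and the sample counter values are hypothetical, and it assumes that two measurement points have each recorded one packet counter per marking batch for the same flow.

```python
from typing import List


def loss_per_batch(upstream: List[int], downstream: List[int]) -> List[int]:
    """Packets lost in each marking batch between two measurement points.

    upstream[i] and downstream[i] are the packet counts that the two points
    recorded for batch i; consecutive batches alternate in color (0, 1, 0, ...).
    """
    if len(upstream) != len(downstream):
        raise ValueError("both points must report the same number of batches")
    return [up - down for up, down in zip(upstream, downstream)]


def counter_budget(points: int, flows: int, colors: int = 2) -> int:
    """Order of magnitude of packet counters per time interval: n * m * 2."""
    return points * flows * colors


# Hypothetical counters for one flow over four marking periods.
sent = [1000, 980, 1010, 995]
received = [1000, 975, 1010, 990]

print(loss_per_batch(sent, received))  # [0, 5, 0, 5]
print(counter_budget(4, 100))          # 800
```

Because each batch is identified by its color, the two points only need to exchange counters, not per-packet records, which is where the n*m*2 counter estimate comes from.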

1215    Since networks offer rich sets of network performance measurement
1216    data (e.g., packet counters), traditional approaches run into
1217    limitations. One reason is that the bottleneck lies in the
1218    generation and export of the data and in the amount of data that can
1219    reasonably be collected from the network. In addition, management
1220    tasks related to determining and configuring which data to generate
1221    lead to significant deployment challenges.

1223    The Multipoint Alternate Marking approach, described in
1224    [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
1225    and makes the performance monitoring more flexible in case a detailed
1226    analysis is not needed.

1228    An application orchestrates network performance measurement tasks
1229    across the network to allow optimized monitoring, and it can
1230    calibrate how deeply monitoring data is obtained from the network
1231    by configuring measurement points coarsely or in detail.

1233    Using Alternate Marking, it is possible to monitor a multipoint
1234    network without examining it in depth by using network clustering
1235    (clusters are subnetworks, i.e., portions of the entire network that
1236    preserve the same property as the entire network). If there is
1237    packet loss or the delay is too high, the filtering
1238    criteria can be made more specific in order to perform a detailed
1239    analysis using a different combination of clusters, up to a
1240    per-flow measurement as described in IPFPM [RFC8321].

1242    In summary, an application can configure end-to-end network
1243    monitoring. If the network does not experience issues, this
1244    approximate monitoring is good enough and is very cheap in terms of
1245    network resources. However, in case of problems, the application
1246    becomes aware of the issues from this approximate monitoring and, in
1247    order to localize the portion of the network that has issues,
1248    configures the measurement points more exhaustively.
More
1249    detailed monitoring is then performed. After the detection and
1250    resolution of the problem, the initial approximate monitoring can be
        used again.

1252 A.3.4. Dynamic Network Probe

1254    Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
1255    provides a programmable means to customize the data that an
1256    application collects from the data plane. A direct benefit of DNP is
1257    the reduction of the exported data. A full DNP solution covers
1258    several components including data source, data subscription, and data
1259    generation. The data subscription needs to define the custom data
1260    which can be composed and derived from the raw data sources. The
1261    data generation takes advantage of moderate in-network computing
1262    to produce the desired data.

1264    While DNP can introduce unprecedented flexibility to data plane
1265    telemetry, it also faces some challenges. It requires a flexible
1266    data plane that can be dynamically reprogrammed at run-time. The
1267    programming API is yet to be defined.

1269 A.3.5. IP Flow Information Export (IPFIX) Protocol

1271    Traffic on a network can be seen as a set of flows passing through
1272    network elements. IP Flow Information Export (IPFIX) [RFC7011]
1273    provides a means of transmitting traffic flow information for
1274    administrative or other purposes. A typical IPFIX-enabled system
1275    includes a pool of Metering Processes that collects data packets at
1276    one or more Observation Points, optionally filters them, and
1277    aggregates information about these packets. An Exporter then gathers
1278    each of the Observation Points together into an Observation Domain
1279    and sends this information via the IPFIX protocol to a Collector.

1281 A.3.6. In-Situ OAM

1283    Traditional passive and active monitoring and measurement techniques
1284    are either inaccurate or resource-consuming.
It is preferable to
1285    directly acquire data associated with a flow's packets when the
1286    packets pass through a network. In-situ OAM (IOAM)
1287    [I-D.brockners-inband-oam-requirements], a data generation technique,
1288    embeds a new instruction header in user packets, and the instruction
1289    directs the network nodes to add the requested data to the packets.
1290    Thus, at the path end, the packet's experience gained on the entire
1291    forwarding path can be collected. Such firsthand data is invaluable
1292    to many network OAM applications.

1294    However, IOAM also faces some challenges. The issues of performance
1295    impact, security, scalability and overhead limits, encapsulation
1296    difficulties in some protocols, and cross-domain deployment need to
1297    be addressed.

1299 A.3.7. Postcard-Based Telemetry

1301    PBT [I-D.song-ippm-postcard-based-telemetry] is an alternative to
1302    IOAM. PBT directly exports data at each node through an independent
1303    packet. PBT solves several issues of IOAM. It can also help to
1304    identify the packet drop location in case a packet is dropped on its
1305    forwarding path.

1307 A.4. External Data and Event Telemetry

1309    Events that occur outside the boundaries of the network system are
1310    another important source of telemetry information. Correlating both
1311    internal telemetry data and external events with the requirements of
1312    network systems, as presented in Exploiting External Event Detectors
1313    to Anticipate Resource Requirements for the Elastic Adaptation of
1314    SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a
1315    strategic and functional advantage to management operations.

1317 A.4.1. Requirements and Challenges

1319    As with other sources of telemetry information, the data and events
1320    must meet strict requirements, especially in terms of timeliness,
1321    which is essential to properly incorporate external event information
1322    into management cycles.
Thus, the specific challenges are described as
1323    follows:

1325    o  The role of an external event detector can be played by multiple
1326       elements, including hardware (e.g., physical sensors, such as
1327       seismometers) and software (e.g., Big Data sources that analyze
1328       streams of information, such as Twitter messages). Thus, the
1329       transmitted data must support different shapes but, at the same
1330       time, follow a common but extensible ontology.

1332    o  Since the main function of the external event detectors is to
1333       perform the notifications, their timeliness is assumed. However,
1334       once messages have been dispatched, they must be quickly collected
1335       and inserted into the control plane with variable priority, which
1336       will be high for important sources and/or important events and low
1337       for secondary ones.

1339    o  The ontology used by external detectors must be easily adopted by
1340       current and future devices and applications. Therefore, it must
1341       be easily mapped to current information models, such as those
1342       expressed in YANG.

1344    Organizing internal and external telemetry information together
1345    will be key for the general exploitation of the management
1346    possibilities of current and future network systems, as reflected in
1347    the incorporation of cognitive capabilities into new hardware and
1348    software (virtual) elements.

1350 A.4.2. Sources of External Events

1352    To ensure that the information provided by external event detectors
1353    and used by the network management solutions is meaningful for
1354    management purposes, the network telemetry framework must ensure that
1355    such detectors (sources) are easily connected to the management
1356    solutions (sinks). This requires specifying a simple
1357    taxonomy of detectors and matching it to the connectors and/or
1358    interfaces required to connect them.

   Once detectors are classified in such a taxonomy, their definitions
   are extended with the qualities and other aspects used to handle
   them, and they are represented in the ontology and information model
   (e.g., YANG). Therefore, differentiating several types of detectors
   as potential sources of external events is essential for the
   integrity of the management framework. We thus differentiate the
   following source types of external events:

   o  Smart objects and sensors. With the consolidation of the Internet
      of Things (IoT), any network system will have many smart objects
      attached to its physical surroundings and logical operation
      environments. Most of these objects will be essentially based on
      sensors of many kinds (e.g., temperature, humidity, presence), and
      the information they provide can be very useful for the management
      of the network, even when they are not specifically deployed for
      that purpose. Elements of this source type will usually provide a
      specific protocol for interaction, typically one of the protocols
      related to IoT, such as the Constrained Application Protocol
      (CoAP). The telemetry framework will use that protocol to
      interact with the relevant objects.

   o  Online news reporters. Several online news services can provide
      an enormous quantity of information about different events
      occurring in the world. Some of those events can affect the
      network system managed by a specific framework, which will
      therefore be interested in obtaining such information. For
      instance, security reports, such as Common Vulnerabilities and
      Exposures (CVE) entries, can be issued by the corresponding
      authority and used by the management solution to update the
      managed system if needed. Instead of a specific protocol and data
      format, the sources of this kind of information usually follow a
      relaxed but structured format.
      This format will be part of both the ontology and information
      model of the telemetry framework.

   o  Global event analyzers. The advance of Big Data analyzers
      provides a huge amount of information and, more interestingly, the
      identification of events detected by analyzing many data streams
      from different origins. In contrast with the other types of
      sources, which are focused on specific events, the detectors of
      this source type will detect very generic events. For example,
      when a sports event takes an unexpected turn that makes it highly
      interesting, many people connect to the sites that are covering
      it. The systems supporting the services that cover the event can
      be affected by such a situation, so their management solutions
      should be aware of it. In contrast with the other source types, a
      new information model, format, and reporting protocol are required
      to integrate the detectors of this type with the management
      solution.

   Additional detector types can be added to the system, but they will
   generally be the result of composing the properties offered by these
   main classes. In any case, future revisions of the network telemetry
   framework will include the required types that cover new
   circumstances and that cannot be obtained by composition.

A.4.3.  Connectors and Interfaces

   To allow external event detectors to be properly integrated with
   other management solutions, both elements must expose interfaces and
   protocols suited to their particular objectives.
   Since external event detectors will be focused on providing their
   information to their main consumers, which generally will not be
   limited to the network management solutions, the framework must
   include the definition of the connectors required to ensure that the
   interconnection between detectors (sources) and their consumers
   within the management systems (sinks) is effective.

   In some situations, the interconnection between the external event
   detectors and the management system is via the management plane. For
   those situations, there will be a special connector that provides the
   typical interfaces found in most other elements connected to the
   management plane. For instance, the interfaces will comply with a
   specific information model (YANG) and a specific telemetry protocol,
   such as NETCONF, SNMP, or gRPC.

Authors' Addresses

   Haoyu Song (editor)
   Huawei
   2330 Central Expressway
   Santa Clara
   USA

   Email: haoyu.song@huawei.com


   Zhenqiang Li
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: lizhenqiang@chinamobile.com


   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Email: pedro@nict.go.jp


   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com


   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn