idnits 2.17.1

draft-song-opsawg-ntf-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text.

  -- The document date (October 19, 2018) is 2009 days in the past. Is this intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'RFC1157' is defined on line 1054, but no explicit reference was found in the text

  == Outdated reference: A later version (-07) exists of draft-ietf-grow-bmp-adj-rib-out-02

  == Outdated reference: A later version (-13) exists of draft-ietf-grow-bmp-local-rib-02

  == Outdated reference: A later version (-05) exists of draft-ietf-netconf-udp-pub-channel-04

  == Outdated reference: A later version (-25) exists of draft-ietf-netconf-yang-push-19

  == Outdated reference: A later version (-10) exists of draft-zhou-netconf-multi-stream-originators-03

  -- Obsolete informational reference (is this intentional?): RFC 7540 (Obsoleted by RFC 9113)

  -- Obsolete informational reference (is this intentional?): RFC 8321 (Obsoleted by RFC 9341)

     Summary: 0 errors (**), 0 flaws (~~), 8 warnings (==), 3 comments (--).
Run idnits with the --verbose option for more detailed information about the items above.

--------------------------------------------------------------------------------

OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                   T. Zhou
Intended status: Informational                                    ZB. Li
Expires: April 22, 2019                                           Huawei
                                                                  ZQ. Li
                                                            China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                        October 19, 2018

                      Network Telemetry Framework
                        draft-song-opsawg-ntf-01

Abstract

This document provides an architectural framework for network telemetry to meet the current and future network operation requirements. The defining characteristics of network telemetry show a clear distinction from the conventional network Operations, Administration, and Management (OAM) concept; hence network telemetry requires new procedures, methods, and protocols. This document clarifies the terminology and classifies the categories and components of a network telemetry framework. The requirements, challenges, existing solutions, and future directions are discussed for each category. The network telemetry framework and the taxonomy help to set a common ground for the collection of related works and put future technique and standard developments into perspective.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 22, 2019.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Introduction
  1.1.  Requirements Language
2.  Motivation
  2.1.  Use Cases
  2.2.  Challenges
  2.3.  Glossary
  2.4.  Network Telemetry
3.  The Necessity of a Network Telemetry Framework
4.  Network Telemetry Framework
  4.1.  Existing Works Mapped in the Framework
  4.2.  Management Plane Telemetry
    4.2.1.  Requirements and Challenges
    4.2.2.  Push Extensions for NETCONF
    4.2.3.  gRPC Network Management Interface
  4.3.  Control Plane Telemetry
    4.3.1.  Requirements and Challenges
    4.3.2.  BGP Monitoring Protocol
  4.4.  Data Plane Telemetry
    4.4.1.  Requirements and Challenges
    4.4.2.  Technique Taxonomy
    4.4.3.  The IPFPM technology
    4.4.4.  Dynamic Network Probe
    4.4.5.  IP Flow Information Export (IPFIX) protocol
    4.4.6.  In-Situ OAM
  4.5.  External Data and Event Telemetry
    4.5.1.  Requirements and Challenges
5.  Evolution of Network Telemetry
6.  Security Considerations
7.  IANA Considerations
8.  Contributors
9.  Acknowledgments
10.  References
  10.1.  Normative References
  10.2.  Informative References
Authors' Addresses

1.  Introduction

Network visibility is essential for network operation. Network telemetry has been widely accepted as the ideal means to gain full network visibility. However, there is still confusion and misunderstanding about what network telemetry means. We need an unambiguous understanding of the concept so we can better align the related technology and standard developments.

First, we show some key characteristics of network telemetry which clearly distinguish it from the conventional network OAM. We then provide an architectural framework for network telemetry to meet the current and future network operation requirements.
Following the framework, we classify the components of a network telemetry system so we can easily map the existing and emerging techniques and protocols into the framework. The requirements, challenges, existing solutions, and future directions are discussed for each framework category. Finally, we outline a roadmap for the evolution of the network telemetry system.

The network telemetry framework and the taxonomy help to set a common ground for the collection of related works and put future technique and standard developments into perspective.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119][RFC8174] when, and only when, they appear in all capitals, as shown here.

2.  Motivation

The advance of Artificial Intelligence (AI) technologies, and specifically Machine Learning (ML), gives networks an unprecedented opportunity to realize network autonomy with closed control loops. An intent-driven autonomous network is the logical next step for network evolution following Software-Defined Networking (SDN), aiming to reduce (or even eliminate) human labor, make the most efficient use of network resources, and provide better services more aligned with customer requirements. Although the ultimate goal is still a long way off, the machine automation journey has already started.

Storage and computing technologies are already mature enough to retain and process a huge amount of data and to make real-time inferences. Tools based on machine learning technologies and big data analytics are powerful in detecting and reacting to network faults, anomalies, and policy violations.
In turn, network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied. Tools exist that profile, classify, and predict future events based on historical data trends. However, improvements are needed to increase the accuracy of these predictive capabilities and to better support autonomous networking. The current network architecture, protocol suite, and system design are not yet ready to provide sufficient high-quality data.

In the remainder of this section, we first identify the key network operation use cases that network operators need the most. These use cases are also the essential functions of the future autonomous networks. Next, we show why the current network OAM techniques and protocols are not sufficient to meet the requirements of these use cases. The discussion underlines the need for new methods, techniques, and protocols, which we group under the umbrella term Network Telemetry.

2.1.  Use Cases

The use cases highlighted here rely on data extracted from the network data plane, as well as from the control plane and the management plane.

Intent and Policy Compliance: Network policies are the rules that constrain the services for network access, provide differentiation within a service, or enforce specific treatment of the traffic. For example, a service function chain is a policy that requires the selected flows to pass through a set of network functions in order. An intent is a high-level abstract policy which requires a complex translation and mapping process before being applied to networks. While a policy is enforced, compliance needs to be verified and monitored continuously.

SLA Compliance: A Service-Level Agreement (SLA) defines the level of service a user expects from a network operator, which includes the metrics for the service measurement and the remedy/penalty procedures when the service level falls short of the agreement. Users need to check whether they get the service as promised, and network operators need to evaluate how they can deliver services that meet the SLA.

Root Cause Analysis: Network failure often involves a sequence of chained events, and the source of the failure is not straightforward to identify, especially when the failure is sporadic. While machine learning or other data analytics technologies can be used for root cause analysis, it is up to the network to provide all the relevant data for analysis.

Load Balancing, Traffic Engineering, and Network Planning: Network operators are motivated to optimize their network utilization for better Return on Investment (ROI) or lower Capital Expenditure (CAPEX), as well as for differentiation across services and/or users of a given service. The first step is to know the real-time network conditions before applying policies to steer the user traffic or adjust the load balancing algorithm. In some cases, network micro-bursts need to be detected in a very short time-frame so that fine-grained traffic control can be applied to avoid possible network congestion. Long-term network capacity planning and topology augmentation also rely on the accumulated data of the network operation.

Event Tracking and Prediction: Network path and performance visibility is critical for healthy network operation. Numerous network events are of interest to network operators. For example, network operators always want to learn where and why packets are dropped for an application flow. They also want to be warned of issues while proactive action can still be taken, before an issue becomes a catastrophic problem such as a component failure.
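
As a rough illustration of why the detection time-frame matters for micro-bursts, the following Python sketch compares fine-grained and coarse sampling of a synthetic link-utilization trace. The trace, the threshold, and the function name `detect_bursts` are assumptions made purely for illustration; none of them comes from this document or from any standard.

```python
from typing import List, Tuple


def detect_bursts(samples: List[float], threshold: float,
                  step: int = 1) -> List[Tuple[int, int]]:
    """Return (first, last) observed indices of runs above `threshold`.

    `step` models the sampling interval: only every step-th sample
    is visible to the observer.
    """
    bursts = []
    start = None
    last_seen = None
    for i in range(0, len(samples), step):
        if samples[i] > threshold:
            if start is None:
                start = i
            last_seen = i
        elif start is not None:
            bursts.append((start, last_seen))
            start = None
    if start is not None:
        bursts.append((start, last_seen))
    return bursts


# Synthetic trace (hypothetical): 1000 samples at 30% utilization with a
# 5-sample burst to 95% starting at index 503.
trace = [0.30] * 1000
for i in range(503, 508):
    trace[i] = 0.95

fine = detect_bursts(trace, threshold=0.9, step=1)      # every sample
coarse = detect_bursts(trace, threshold=0.9, step=100)  # every 100th sample
print("fine-grained:", fine)
print("coarse:", coarse)
```

In this toy trace the coarse observer never samples inside the burst, so it reports nothing, while the fine-grained observer sees the whole burst.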

2.2.  Challenges

The conventional OAM techniques, as described in [RFC7276], are not sufficient to support the above use cases for the following reasons:

o  Most use cases need to continuously monitor the network and dynamically refine the data collection in real-time and interactively. The poll-based low-frequency data collection is ill-suited for these applications. Streaming data directly pushed from the data source is preferred.

o  Various data is needed from any place, ranging from the packet processing engine to the QoS traffic manager. Traditional data plane devices cannot provide the necessary probes. An open and programmable data plane is therefore needed.

o  Many application scenarios need to correlate data from multiple sources (e.g., from distributed nodes or from different network planes). A piecemeal solution often lacks the capability to consolidate the data from multiple sources. The composition of a complete solution, as partly proposed by Autonomic Resource Control Architecture (ARCA) [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and guided by a comprehensive framework.

o  The passive measurement techniques can either consume too many network resources and produce too much redundant data, or lead to inaccurate results. The active measurement techniques are indirect, and they can interfere with the user traffic. We need techniques that can collect direct and on-demand data from user traffic.

2.3.  Glossary

Before further discussion, we list some key terminology and acronyms used in this document. We make a deliberate distinction between network telemetry and network OAM.

AI: Artificial Intelligence. The use of machine-learning-based technologies to automate network operation.

BMP: BGP Monitoring Protocol

DNP: Dynamic Network Probe

DPI: Deep Packet Inspection

gNMI: gRPC Network Management Interface

gRPC: gRPC Remote Procedure Call

IDN: Intent-Driven Network

IPFIX: IP Flow Information Export Protocol

IPFPM: IP Flow Performance Measurement

IOAM: In-situ OAM

NETCONF: Network Configuration Protocol

Network Telemetry: A general term for a new breed of network visibility techniques and protocols, with the characteristics defined in this document. Network telemetry enables smooth evolution toward intent-driven autonomous networks.

NMS: Network Management System

OAM: Operations, Administration, and Maintenance. A group of network management functions that provide network fault indication, fault localization, performance information, and data and diagnosis functions. Most conventional network monitoring techniques and protocols belong to network OAM.

SNMP: Simple Network Management Protocol

YANG: A data modeling language for NETCONF

YANG FSM: A YANG model that defines a device-side finite state machine

YANG PUSH: A method to subscribe to data pushed from a remote YANG datastore

2.4.  Network Telemetry

For a long time, network operators have relied upon SNMP [RFC3416] or Command-Line Interface (CLI) to monitor the network. SNMP and CLI can access limited Management Information Base (MIB) information from the management plane. Most existing implementations are mainly poll-based and support only low data rates with low timing accuracy. Such issues make SNMP and CLI insufficient for today's and tomorrow's network operations.

Network telemetry has emerged as a mainstream technical term to refer to the newer techniques of data collection and consumption, distinguishing itself from the conventional techniques for network OAM.
The representative techniques and protocols include IPFIX [RFC7011] and gRPC [I-D.kumar-rtgwg-grpc-protocol]. SNMP is also evolving to support event notifications [RFC2981][RFC3877]. It is expected that network telemetry can provide the necessary network visibility for autonomous networks, address the shortcomings of conventional OAM techniques, and allow for the emergence of new techniques bearing certain characteristics.

One key difference between network telemetry and network OAM is that network telemetry assumes an intelligent machine in the center of a closed control loop, while network OAM assumes human network operators in the middle of an open control loop. Network telemetry can directly trigger automated network operation; the conventional OAM tools only help human operators to monitor and diagnose the networks and guide manual network operations. The different assumptions lead to very different techniques.

Although the network telemetry techniques are just emerging and subject to continuous evolution, several defining characteristics of network telemetry have been well accepted:

o  Push and Streaming: Instead of polling data from network devices, the telemetry collector subscribes to the streaming data pushed from the data source in network devices.

o  Volume and Velocity: The telemetry data is intended to be consumed by machines rather than by humans. Therefore, the data volume is huge and the processing is often in real time.

o  Normalization and Unification: Telemetry aims to address the overall network automation needs. The piecemeal solutions offered by the conventional OAM approach are no longer suitable. Efforts need to be made to normalize the data representation and unify the protocols.

o  Model-based: The data is model-based, which allows applications to configure and consume data with ease.

o  Data Fusion: The data for a single application can come from multiple data sources (e.g., cross-domain, cross-device, and cross-layer) and needs to be correlated to take effect.

o  Dynamic and Interactive: Since network telemetry is meant to be used in a closed control loop for network automation, it needs to run continuously and adapt to the dynamic and interactive queries from the network operation controller.

The ideal network telemetry solution should also support the following features:

o  In-Network Customization: The data can be customized in the network at run-time to cater to the specific needs of applications. This requires the support of a programmable data plane which allows probes to be deployed at flexible locations.

o  Direct Data Plane Export: The data originating from the data plane can be directly exported to the data consumer for efficiency, especially when the data bandwidth is large and real-time processing is required.

o  In-band Data Collection: In addition to the passive and active data collection approaches, the new hybrid approach allows data to be collected directly for any target flow along its entire forwarding path.

o  Non-intrusive: The telemetry system should not fall into the trap of the "observer effect". That is, it should not change the network behavior or affect the forwarding performance.

3.  The Necessity of a Network Telemetry Framework

Big data analytics and machine-learning based AI technologies are applied for network operation automation, relying on abundant data from networks. The single-sourced and static data acquisition cannot meet the data requirements. It is desirable to have a framework that integrates multiple telemetry approaches from different layers. This allows flexible combinations for different applications.
The framework would benefit application development for the following reasons:

o  The future autonomous networks will require a holistic view on network visibility. All the use cases and applications need to be supported uniformly and coherently under a single intelligent agent. Therefore, the protocols and mechanisms should be consolidated into a minimum yet comprehensive set. A telemetry framework can help to normalize the technique developments.

o  Network visibility presents multiple viewpoints. For example, the device viewpoint takes the network infrastructure as the monitoring object from which the network topology and device status can be acquired; the traffic viewpoint takes the flows or packets as the monitoring object from which the traffic quality and path can be acquired. An application may need to switch its viewpoint during operation. It may also need to correlate a service with its impact on network experience to acquire comprehensive information.

o  Applications require network telemetry to be elastic in order to efficiently use the network resource and reduce the performance impact. Routine network monitoring covers the entire network with a low data sampling rate. When issues arise or trends emerge, the telemetry data source can be modified and the data rate can be boosted.

o  Efficient data fusion is critical for applications to reduce the overall quantity of data and improve the accuracy of analysis.

So far, some telemetry-related work has been done within the IETF. However, this work is fragmented and scattered across different working groups. The lack of coherence makes it difficult to assemble a comprehensive network telemetry system and causes repetitive and redundant work.

A formal network telemetry framework is needed for constructing a working system.
The framework should cover the concepts and components from the standardization perspective. This document clarifies the layers on which the telemetry is applied and decomposes the telemetry system into a set of distinct components that the existing and future work can easily map to.

4.  Network Telemetry Framework

Telemetry can be applied on the data plane, the control plane, and the management plane in a network, as well as other sources out of the network, as shown in Figure 1.

        +------------------------------+
        |                              |
        |       Network Operation      |<-------+
        |          Applications        |        |
        |                              |        |
        +------------------------------+        |
             ^      ^           ^               |
             |      |           |               |
             V      |           V               V
        +-----------|---+--------------+  +-----------+
        |           |   |              |  |           |
        | Control Pl|ane|              |  | External  |
        | Telemetry | <--->            |  | Data and  |
        |           |   |              |  | Event     |
        |      ^    V   |  Management  |  | Telemetry |
        +------|--------+  Plane       |  |           |
        |      V        |  Telemetry   |  +-----------+
        |               |              |
        | Data Plane  <--->            |
        | Telemetry     |              |
        |               |              |
        +---------------+--------------+

      Figure 1: Layer Category of the Network Telemetry Framework

Note that the interaction with the network operation applications can be indirect. For example, in the management plane telemetry, the management plane may need to acquire data from the data plane. Some of the operational states, such as the interface status and statistics, can only be derived from the data plane. As another example, the control plane telemetry may need to access the FIB in the data plane. On the other hand, an application may involve more than one plane simultaneously. For example, an SLA compliance application may require both the data plane telemetry and the control plane telemetry.

At each plane, the telemetry can be further partitioned into five distinct components:

Data Source: Determine where the original data is acquired.
The data source usually just provides raw data which needs further processing. A data source can be considered a probe. A probe can be statically installed or dynamically installed.

Data Subscription: Determine the protocol and channel for applications to acquire desired data. Data subscription is also responsible for defining the desired data that might not be directly available from data sources. The subscription data can be described by a model. The model can be statically installed or dynamically installed.

Data Generation: The original data needs to be processed, encoded, and formatted in network devices to meet application subscription requirements. This may involve in-network computing and processing on either the fast path or the slow path in network devices.

Data Export: Determine how the ready data are delivered to applications.

Data Analysis and Storage: In this final step, data is consumed by applications or stored for future reference. Data analysis can be interactive. It may initiate further data subscription.

        +------------------------------+
        |                              |
        |    Data Analysis/Storage     |
        |                              |
        +------------------------------+
                |               ^
                |               |
                V               |
        +---------------+--------------+
        |               |              |
        | Data          | Data         |
        | Subscription  | Export       |
        |               |              |
        +---------------+--------------+
        |                              |
        |       Data Generation        |
        |                              |
        +------------------------------+
        |                              |
        |         Data Source          |
        |                              |
        +------------------------------+

      Figure 2: Components in the Network Telemetry Framework

Since most existing standard-related work belongs to the first four components, in the remainder of the document, we focus on these components only.

4.1.  Existing Works Mapped in the Framework

The following table provides a non-exhaustive list of existing works (mainly published in the IETF, with an emphasis on the latest technologies) and shows their positions in the framework.

   +-----------+--------------+---------------+--------------+
   |           | Management   | Control       | Data         |
   |           | Plane        | Plane         | Plane        |
   +-----------+--------------+---------------+--------------+
   |           | YANG Data    | Control Proto.| Flow/Packet  |
   | Data      | Store        | Network State | Statistics   |
   | Source    |              |               | States       |
   |           |              |               | DPI          |
   +-----------+--------------+---------------+--------------+
   |           | gRPC         | NETCONF/YANG  | NETCONF/YANG |
   | Data      | YANG PUSH    | BGP           | YANG FSM     |
   | Subscribe |              |               |              |
   |           |              |               |              |
   +-----------+--------------+---------------+--------------+
   |           | Soft DNP     | Soft DNP      | In-situ OAM  |
   | Data      |              |               | IPFPM        |
   | Generation|              |               | Hard DNP     |
   |           |              |               |              |
   +-----------+--------------+---------------+--------------+
   |           | gRPC         | BMP           | IPFIX        |
   | Data      | YANG PUSH    |               | UDP          |
   | Export    | UDP          |               |              |
   |           |              |               |              |
   +-----------+--------------+---------------+--------------+

                     Figure 3: Existing Work

4.2.  Management Plane Telemetry

4.2.1.  Requirements and Challenges

The management plane of the network element interacts with the Network Management System (NMS), and provides information such as performance data, network logging data, network warning and defect data, and network statistics and state data. Some legacy protocols are widely used for the management plane, such as SNMP and Syslog. However, these protocols are insufficient to meet the requirements of the automatic network operation applications.
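
To make the contrast with poll-based legacy protocols more concrete, the following Python sketch models the push (subscription) behavior described in Section 2.4 with a purely in-memory data source. The `DataSource` class, its `poll`, `subscribe`, and `tick` methods, and the counter name are hypothetical and are not taken from SNMP, NETCONF, gRPC, or any other protocol discussed in this document.

```python
from typing import Callable, Dict, List


class DataSource:
    """Hypothetical device-side data source (illustration only)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {"if0/rx_packets": 0}
        self.subscribers: List[Callable[[str, int], None]] = []
        self.requests_served = 0  # client-to-device requests handled

    def poll(self, name: str) -> int:
        """Poll mode: one request/response exchange per reading."""
        self.requests_served += 1
        return self.counters[name]

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        """Push mode: a single request, after which updates are streamed."""
        self.requests_served += 1
        self.subscribers.append(callback)

    def tick(self, name: str, delta: int) -> None:
        """Simulate a counter change and push it to every subscriber."""
        self.counters[name] += delta
        for callback in self.subscribers:
            callback(name, self.counters[name])


# Poll mode: the collector asks after every change.
polled_source = DataSource()
polled: List[int] = []
for _ in range(10):
    polled_source.tick("if0/rx_packets", 5)
    polled.append(polled_source.poll("if0/rx_packets"))

# Push mode: the collector subscribes once and then only receives.
pushed_source = DataSource()
pushed: List[int] = []
pushed_source.subscribe(lambda name, value: pushed.append(value))
for _ in range(10):
    pushed_source.tick("if0/rx_packets", 5)

print("requests in poll mode:", polled_source.requests_served)
print("requests in push mode:", pushed_source.requests_served)
```

Both collectors end up with the same ten readings, but the push-mode source handles a single subscription request instead of one request per reading.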

New management plane telemetry protocols should consider the following requirements:

Convenient Data Subscription: An application should have the freedom to choose the data export means such as the data types and the export frequency.

Structured Data: For automatic network operation, machines will replace humans in comprehending network data. Schema languages such as YANG can efficiently describe structured data and normalize data encoding and transformation.

High Speed Data Transport: In order to retain the information, a server needs to send a large amount of data at high frequency. Compact encoding formats are needed to compress the data and improve the data transport efficiency. The push mode, by replacing the poll mode, can also reduce the interactions between clients and servers, which helps to improve the server's efficiency.

4.2.2.  Push Extensions for NETCONF

NETCONF [RFC6241] is one popular network management protocol, which is also recommended by the IETF. Although it can be used for data collection, NETCONF's strength is configuration. YANG Push [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber applications to request a continuous, customized stream of updates from a YANG datastore. Providing such visibility into changes made to YANG configuration and operational objects enables new capabilities based on the remote mirroring of configuration and operational state. Moreover, a distributed data collection mechanism [I-D.zhou-netconf-multi-stream-originators] via a UDP-based publication channel [I-D.ietf-netconf-udp-pub-channel] provides enhanced efficiency for the NETCONF-based telemetry.

4.2.3.  gRPC Network Management Interface

gRPC Network Management Interface (gNMI) [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote Procedure Call) framework. With a single gRPC service definition, both configuration and telemetry can be covered. gRPC is an HTTP/2 [RFC7540] based open source micro service communication framework. It provides a number of capabilities which are well-suited for network telemetry, including:

o  A full-duplex streaming transport model combined with a binary encoding mechanism, which further improves telemetry efficiency.

o  gRPC provides consistent higher-level features across platforms, which common HTTP/2 libraries typically do not. This characteristic is especially valuable because telemetry data collectors normally reside on a large variety of platforms.

o  Built-in load-balancing and failover mechanisms.

4.3.  Control Plane Telemetry

4.3.1.  Requirements and Challenges

The control plane telemetry refers to the health condition monitoring of different network protocols, which covers Layer 2 to Layer 7. Keeping track of the running status of these protocols is beneficial for detecting, localizing, and even predicting various network issues, as well as network optimization, in real time and at fine granularity.

One of the most challenging problems for the control plane telemetry is how to correlate the E2E Key Performance Indicators (KPI) to a specific layer's KPIs. For example, an IPTV user may describe their User Experience (UE) by the video fluency and definition.
Then, in case of an unusually poor UE KPI or a service disconnection, it is non-trivial work to delimit and localize the issue to the responsible protocol layer (e.g., the Transport Layer or the Network Layer), the responsible protocol (e.g., ISIS or BGP at the Network Layer), and finally the responsible device(s) with specific reasons.

Traditional OAM-based approaches for control plane KPI measurement include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common issue behind these methods is that they only measure the KPIs instead of reflecting the actual running status of these protocols, making them less effective or efficient for control plane troubleshooting and network optimization. An example of the control plane telemetry is the BGP Monitoring Protocol (BMP), which is currently used to monitor the BGP routes and enables rich applications, such as BGP peer analysis, AS analysis, prefix analysis, security analysis, and so on. However, the monitoring of other layers and protocols, and the cross-layer, cross-protocol KPI correlations, are still in their infancy (e.g., the IGP monitoring is missing) and require substantial further research.

4.3.2.  BGP Monitoring Protocol

BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP sessions and is intended to provide a convenient interface for obtaining route views.

The BGP routing information is collected from the monitored device(s) to the BMP monitoring station by setting up the BMP TCP session. The BGP peers are monitored by the BMP Peer Up and Peer Down Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854], Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, in the form of both initial table dump and real-time route update.
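
As a small illustration of how a monitoring station might tell these BMP messages apart, the following Python sketch decodes the BMP common header, assuming the layout given in [RFC7854] (a 1-octet version, a 4-octet message length, and a 1-octet message type). The function name `parse_bmp_common_header` and the sample bytes are made up for this example; a real monitoring station would also need to parse the per-peer header and the message body, which this sketch does not attempt.

```python
import struct

# Message type codes as assigned in RFC 7854.
BMP_MESSAGE_TYPES = {
    0: "Route Monitoring",
    1: "Statistics Report",
    2: "Peer Down Notification",
    3: "Peer Up Notification",
    4: "Initiation",
    5: "Termination",
    6: "Route Mirroring",
}

# Network byte order: version (1 octet), length (4 octets), type (1 octet).
BMP_COMMON_HEADER = struct.Struct("!BIB")


def parse_bmp_common_header(data: bytes) -> dict:
    """Decode the 6-octet BMP common header at the start of `data`."""
    if len(data) < BMP_COMMON_HEADER.size:
        raise ValueError("need at least 6 octets for the BMP common header")
    version, length, msg_type = BMP_COMMON_HEADER.unpack_from(data)
    if version != 3:
        raise ValueError("unsupported BMP version: %d" % version)
    return {
        "version": version,
        "length": length,  # total message length, including this header
        "type": BMP_MESSAGE_TYPES.get(msg_type, "Unknown (%d)" % msg_type),
    }


# Hypothetical sample: version 3, length 6 (header only), type 4.
sample = bytes([3]) + (6).to_bytes(4, "big") + bytes([4])
print(parse_bmp_common_header(sample))
```

The length field lets the station frame messages on the TCP byte stream before dispatching on the message type.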
In addition, BGP statistics are reported through the BMP Stats Report Message, which can be either timer-triggered or event-driven. More BMP extensions can be explored to enrich the applications of BGP monitoring.

4.4.  Data Plane Telemetry

4.4.1.  Requirements and Challenges

An effective data plane telemetry system relies on the data that the network devices can expose. The quality, quantity, and timeliness of the data must meet stringent requirements. This raises challenges for the network data plane devices where the firsthand data originates.

o  The main function of a data plane device is to process and forward user traffic. While supporting network visibility is important, telemetry is just an auxiliary function, and it should not impede normal traffic processing and forwarding (i.e., the telemetry functions must neither lower the performance nor alter the forwarding behavior).

o  Network operation applications require end-to-end visibility from various sources, which results in a huge volume of data. However, the sheer quantity of data should not stress the network bandwidth, regardless of the data delivery approach (i.e., through in-band or out-of-band channels).

o  Data plane devices must provide timely data with the minimum possible delay. Long processing, transport, storage, and analysis delays can impair the effectiveness of the control loop and even render the data useless.

o  The data should be structured, labeled, and easy for applications to parse and consume. At the same time, the data types needed by applications can vary significantly. Data plane devices need to provide enough flexibility and programmability to supply exactly the data that applications need.

o  Data plane telemetry should support incremental deployment and work even when some devices are unaware of the system.
   This challenge is highly relevant to standards and legacy networks.

There is broad industry agreement that data plane programmability is essential to support network telemetry. Newer data plane chips are generally equipped with advanced telemetry features and provide the flexibility to support customized telemetry functions.

4.4.2.  Technique Taxonomy

The data plane telemetry techniques can be classified along multiple dimensions.

Active and Passive:  The active and passive methods (as well as the hybrid types) are well documented in [RFC7799]. The passive methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring. These methods usually have low data coverage, and improving the coverage comes at a very high bandwidth cost. The active methods include Ping, Traceroute, OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive and provide only indirect network measurement results. The hybrid methods, including in-situ OAM [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and Multipoint Alternate Marking [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced and more flexible approach. However, these methods are also more complex to implement.

In-Band and Out-of-Band:  The telemetry data, before being exported to some collector, can be carried in user packets. Such methods are considered in-band (e.g., in-situ OAM [I-D.brockners-inband-oam-requirements]). If the telemetry data is directly exported to some collector without modifying the user packets, such methods are considered out-of-band (e.g., postcard-based INT). Hybrid methods are also possible. For example, only the telemetry instruction or partial data is carried by user packets (e.g., IPFPM [RFC8321]).

E2E and In-Network:  Some E2E methods start from and end at the network end hosts (e.g., Ping).
   The other methods work inside the network and are transparent to end hosts. However, if needed, the in-network methods can easily be extended to end hosts.

Flow, Path, and Node:  Depending on the telemetry objective, the methods can be flow-based (e.g., in-situ OAM [I-D.brockners-inband-oam-requirements]), path-based (e.g., Traceroute), and node-based (e.g., IPFIX [RFC7011]).

4.4.3.  The IPFPM technology

The Alternate Marking method is an efficient way to perform packet loss, delay, and jitter measurements in both IP and overlay networks, as presented in IPFPM [RFC8321] and [I-D.fioccola-ippm-multipoint-alt-mark].

This technique can be applied to point-to-point and multipoint-to-multipoint flows. Alternate Marking creates batches of packets by alternating the value of 1 bit (or a label) of the packet header. These batches of packets are unambiguously recognized over the network, and comparing the packet counters for each batch allows the packet loss to be calculated. The same idea can be applied to delay measurement by selecting ad hoc packets with a marking bit dedicated to delay measurements.

The Alternate Marking method needs two counters per marking period for each monitored flow. For instance, with n measurement points and m monitored flows, the number of packet counters for each time interval is on the order of n*m*2 (one per color).

Since networks offer rich sets of network performance measurement data (e.g., packet counters), traditional approaches run into limitations. One reason is that the bottleneck lies in the generation and export of the data and in the amount of data that can reasonably be collected from the network. In addition, management tasks related to determining and configuring which data to generate lead to significant deployment challenges.
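
As an illustration of the counter arithmetic described earlier in this section, the following is a minimal sketch and not the method as specified in [RFC8321]. It assumes that batches are delimited by a fixed packet count (a fixed timer would work the same way), and the names mark, MeasurementPoint, batch_loss, counters_per_interval, and demo_loss are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict


def mark(seq_no: int, batch_size: int) -> int:
    """Return the marking bit (color) of a packet; it flips every batch."""
    return (seq_no // batch_size) % 2


class MeasurementPoint:
    """Keeps one counter per (flow, color), i.e. two counters per flow."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)

    def observe(self, flow: str, color: int) -> None:
        self.counters[(flow, color)] += 1

    def read_and_reset(self, flow: str, color: int) -> int:
        """Read the counter of a finished batch and clear it for reuse."""
        return self.counters.pop((flow, color), 0)


def batch_loss(upstream: MeasurementPoint, downstream: MeasurementPoint,
               flow: str, color: int) -> int:
    """Packet loss of one finished batch between two measurement points."""
    sent = upstream.read_and_reset(flow, color)
    received = downstream.read_and_reset(flow, color)
    return sent - received


def counters_per_interval(n_points: int, m_flows: int) -> int:
    """Order of magnitude of counters per interval: n*m*2 (one per color)."""
    return n_points * m_flows * 2


def demo_loss() -> tuple:
    """Send two batches of 10 packets and drop three of them in transit."""
    up, down = MeasurementPoint(), MeasurementPoint()
    dropped = {3, 7, 15}  # hypothetical losses between the two points
    for seq in range(20):
        color = mark(seq, batch_size=10)
        up.observe("flow-a", color)
        if seq not in dropped:
            down.observe("flow-a", color)
    return (batch_loss(up, down, "flow-a", 0),
            batch_loss(up, down, "flow-a", 1))
```

In this sketch the counters of a color are only read once its batch has ended, which is what lets the two measurement points compare the same set of packets without any per-packet synchronization.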
The Multipoint Alternate Marking approach, described in [I-D.fioccola-ippm-multipoint-alt-mark], aims to relieve this data generation and export bottleneck and makes performance monitoring more flexible when a detailed analysis is not needed.

An application orchestrates network performance measurement tasks across the network to optimize monitoring, and it can calibrate how deeply monitoring data is obtained from the network by configuring the measurement points coarsely or meticulously.

Using Alternate Marking, it is possible to monitor a multipoint network without examining it in depth by using network clustering (clusters are subnetworks, i.e., portions of the entire network that preserve the same property as the entire network). In case of packet loss or excessive delay, the filtering criteria can be made more specific in order to perform a detailed analysis using a different combination of clusters, up to a per-flow measurement as described in IPFPM [RFC8321].

In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more exhaustively, so that a new, detailed monitoring is performed. After the problem has been detected and resolved, the initial approximate monitoring can be used again.

4.4.4.  Dynamic Network Probe

Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] provides a programmable means to customize the data that an application collects from the data plane. A direct benefit of DNP is the reduction of the exported data.
A full DNP solution covers several components, including data source, data subscription, and data generation. The data subscription defines the custom data, which can be composed and derived from the raw data sources. The data generation takes advantage of moderate in-network computing to produce the desired data.

While DNP can bring unprecedented flexibility to data plane telemetry, it also faces some challenges. It requires a flexible data plane that can be dynamically reprogrammed at runtime. The programming API is yet to be defined.

4.4.5.  IP Flow Information Export (IPFIX) protocol

Traffic on a network can be seen as a set of flows passing through network elements. IP Flow Information Export (IPFIX) [RFC7011] provides a means of transmitting traffic flow information for administrative or other purposes. A typical IPFIX-enabled system includes a pool of Metering Processes that collect data packets at one or more Observation Points, optionally filter them, and aggregate information about these packets. An Exporter then gathers each of the Observation Points together into an Observation Domain and sends this information via the IPFIX protocol to a Collector.

4.4.6.  In-Situ OAM

Traditional passive and active monitoring and measurement techniques are either inaccurate or resource-consuming. It is preferable to directly acquire data associated with a flow's packets when the packets pass through a network. In-situ OAM (iOAM) [I-D.brockners-inband-oam-requirements], a data generation technique, embeds a new instruction header into user packets, and the instruction directs the network nodes to add the requested data to the packets. Thus, at the path end, the packet's experience gained on the entire forwarding path can be collected. Such firsthand data is invaluable to many network OAM applications.
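
The per-hop behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the iOAM encapsulation or data format: the Packet and Node classes, the field names node_id, queue_depth, and timestamp_ns, and the encapsulate and decapsulate helpers are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Hypothetical instruction header: names of the data items that each
    # node on the path is asked to add.
    requested: tuple = ()
    # Per-hop records accumulated along the forwarding path.
    trace: list = field(default_factory=list)


@dataclass
class Node:
    node_id: int
    queue_depth: int

    def forward(self, pkt: Packet, now_ns: int) -> Packet:
        """Add the requested data to the packet, then pass it on."""
        available = {
            "node_id": self.node_id,
            "queue_depth": self.queue_depth,
            "timestamp_ns": now_ns,
        }
        if pkt.requested:
            pkt.trace.append(
                {name: available[name]
                 for name in pkt.requested if name in available}
            )
        return pkt


def encapsulate(payload: bytes, requested) -> Packet:
    """Path head: attach the instruction header to a user packet."""
    return Packet(payload=payload, requested=tuple(requested))


def decapsulate(pkt: Packet):
    """Path end: strip the telemetry and return (user packet, trace)."""
    return Packet(payload=pkt.payload), list(pkt.trace)
```

For example, sending a packet with requested = ("node_id", "queue_depth") through three Node instances yields a trace with three records in path order, while the payload handed to the end host is unchanged. The growth of the trace with every hop also makes the overhead concern for this technique easy to see.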
However, iOAM also faces some challenges. Issues of performance impact, security, scalability and overhead limits, encapsulation difficulties in some protocols, and cross-domain deployment need to be addressed.

4.5.  External Data and Event Telemetry

Events that occur outside the boundaries of the network system are another important source of telemetry information. Correlating both internal telemetry data and external events with the requirements of network systems, as presented in "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems" [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and functional advantage to management operations.

4.5.1.  Requirements and Challenges

As with other sources of telemetry information, the data and events must meet strict requirements, especially in terms of timeliness, which is essential to properly incorporate external event information into management cycles. The specific challenges are as follows:

o  The role of external event detector can be played by multiple elements, including hardware (e.g., physical sensors, such as seismometers) and software (e.g., Big Data sources that analyze streams of information, such as Twitter messages). Thus, the transmitted data must support different shapes but, at the same time, follow a common yet extensible ontology.

o  Since the main function of the external event detectors is to perform the notifications, their timeliness is assumed. However, once messages have been dispatched, they must be quickly collected and inserted into the control plane with variable priority, which will be high for important sources and/or important events and low for secondary ones.

o  The ontology used by external detectors must be easily adopted by current and future devices and applications.
   Therefore, it must be easily mapped to current information models, for example, in terms of YANG.

Organizing internal and external telemetry information together will be key to fully exploiting the management possibilities of current and future network systems, as reflected in the incorporation of cognitive capabilities into new hardware and software (virtual) elements.

5.  Evolution of Network Telemetry

As networks evolve toward automated operation, network telemetry also undergoes several levels of evolution.

Level 0 - Static Telemetry:  The telemetry data is determined at design time. The network operator can only configure how to use it, with limited flexibility.

Level 1 - Dynamic Telemetry:  The telemetry data can be dynamically programmed or configured at runtime, allowing a tradeoff among resource, performance, flexibility, and coverage. DNP is an effort in this direction.

Level 2 - Interactive Telemetry:  The network operator can continuously customize the telemetry data in real time to reflect the visibility requirements of network operation. At this level, some tasks can be automated, although ultimately human operators will still need to sit in the middle to make decisions.

Level 3 - Closed-loop Telemetry:  Human operators are completely excluded from the control loop. The intelligent network operation engine automatically issues the telemetry data requests, analyzes the data, and updates the network operations in closed control loops.

While most of the existing technologies belong to level 0 and level 1, with the help of a clearly defined network telemetry framework, we can assemble the technologies to support level 2 and make solid steps towards level 3.

6.  Security Considerations

TBD

7.  IANA Considerations

This document includes no request to IANA.

8.  Contributors

The other major contributors to this document are listed as follows.

o  Daniel King

o  Yunan Gu

9.  Acknowledgments

We would like to thank Adrian Farrel, Randy Presuhn, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, and many others who have provided helpful comments and suggestions to improve this document.

10.  References

10.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

10.2.  Informative References

[I-D.brockners-inband-oam-requirements]  Brockners, F., Bhandari, S., Dara, S., Pignataro, C., Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi, T., <>, P., and r. remy@barefootnetworks.com, "Requirements for In-situ OAM", draft-brockners-inband-oam-requirements-03 (work in progress), March 2017.

[I-D.fioccola-ippm-multipoint-alt-mark]  Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto, "Multipoint Alternate Marking method for passive and hybrid performance monitoring", draft-fioccola-ippm-multipoint-alt-mark-04 (work in progress), June 2018.

[I-D.ietf-grow-bmp-adj-rib-out]  Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S. Zhuang, "Support for Adj-RIB-Out in BGP Monitoring Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-02 (work in progress), September 2018.

[I-D.ietf-grow-bmp-local-rib]  Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, "Support for Local RIB in BGP Monitoring Protocol (BMP)", draft-ietf-grow-bmp-local-rib-02 (work in progress), September 2018.

[I-D.ietf-netconf-udp-pub-channel]  Zheng, G., Zhou, T., and A.
Clemm, "UDP based Publication Channel for Streaming Telemetry", draft-ietf-netconf-udp-pub-channel-04 (work in progress), October 2018.

[I-D.ietf-netconf-yang-push]  Clemm, A., Voit, E., Prieto, A., Tripathy, A., Nilsen-Nygaard, E., Bierman, A., and B. Lengyel, "YANG Datastore Subscription", draft-ietf-netconf-yang-push-19 (work in progress), September 2018.

[I-D.kumar-rtgwg-grpc-protocol]  Kumar, A., Kolhe, J., Ghemawat, S., and L. Ryan, "gRPC Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in progress), July 2016.

[I-D.openconfig-rtgwg-gnmi-spec]  Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, C., and C. Morrow, "gRPC Network Management Interface (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in progress), March 2018.

[I-D.pedro-nmrg-anticipated-adaptation]  Martinez-Julia, P., "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems", draft-pedro-nmrg-anticipated-adaptation-02 (work in progress), June 2018.

[I-D.song-opsawg-dnp4iq]  Song, H. and J. Gong, "Requirements for Interactive Query with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01 (work in progress), June 2017.

[I-D.zhou-netconf-multi-stream-originators]  Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman, "Subscription to Multiple Stream Originators", draft-zhou-netconf-multi-stream-originators-03 (work in progress), October 2018.

[RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, DOI 10.17487/RFC1157, May 1990.

[RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981, DOI 10.17487/RFC2981, October 2000.

[RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP)", STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

[RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, September 2004.

[RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

[RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008.

[RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

[RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013.

[RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. Weingarten, "An Overview of Operations, Administration, and Maintenance (OAM) Tools", RFC 7276, DOI 10.17487/RFC7276, June 2014.

[RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext Transfer Protocol Version 2 (HTTP/2)", RFC 7540, DOI 10.17487/RFC7540, May 2015.

[RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, May 2016.

[RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP Monitoring Protocol (BMP)", RFC 7854, DOI 10.17487/RFC7854, June 2016.

[RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T.
Mizrahi, "Alternate-Marking Method for Passive and Hybrid Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, January 2018.

Authors' Addresses

Haoyu Song (editor)
Huawei
2330 Central Expressway
Santa Clara
USA

Email: haoyu.song@huawei.com

Tianran Zhou
Huawei
156 Beiqing Road
Beijing, 100095
P.R. China

Email: zhoutianran@huawei.com

Zhenbin Li
Huawei
156 Beiqing Road
Beijing, 100095
P.R. China

Email: lizhenbin@huawei.com

Zhenqiang Li
China Mobile
No. 32 Xuanwumenxi Ave., Xicheng District
Beijing, 100032
P.R. China

Email: lizhenqiang@chinamobile.com

Pedro Martinez-Julia
NICT
4-2-1, Nukui-Kitamachi
Koganei, Tokyo 184-8795
Japan

Email: pedro@nict.go.jp

Laurent Ciavaglia
Nokia
Villarceaux 91460
France

Email: laurent.ciavaglia@nokia.com

Aijun Wang
China Telecom
Beiqijia Town, Changping District
Beijing, 102209
P.R. China

Email: wangaj.bri@chinatelecom.cn