<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com)
     by Daniel M Kohn (private) -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
]>
<rfc category="info" docName="draft-wu-t2trg-network-telemetry-00"
     ipr="trust200902">
  <?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

  <?rfc toc="yes" ?>

  <?rfc symrefs="yes" ?>

  <?rfc sortrefs="yes"?>

  <?rfc iprnotified="no" ?>

  <?rfc strict="yes" ?>

  <front>
    <title abbrev="Network Telemetry and Big Data">Network Telemetry and Big
    Data Analysis</title>

    <author fullname="Qin Wu" initials="Q." surname="Wu">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>101 Software Avenue, Yuhua District</street>

          <city>Nanjing</city>

          <region>Jiangsu</region>

          <code>210012</code>

          <country>China</country>
        </postal>

        <email>bill.wu@huawei.com</email>
      </address>
    </author>

    <author fullname="John Strassner" initials="J." surname="Strassner">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>2230 Central Expressway</street>

          <city>San Jose</city>

          <region>CA</region>

          <code/>

          <country>USA</country>
        </postal>

        <email>john.sc.strassner@huawei.com</email>
      </address>
    </author>

    <author fullname="Adrian Farrel" initials="A." surname="Farrel">
      <organization>Old Dog Consulting</organization>

      <address>
        <email>adrian@olddog.co.uk</email>
      </address>
    </author>

    <author fullname="Liang Zhang" initials="L." surname="Zhang">
      <organization>Huawei</organization>

      <address>
        <email>zhangliang1@huawei.com</email>
      </address>
    </author>

    <date year="2016"/>

    <area/>

    <workgroup>Network Working Group</workgroup>

    <keyword>RFC</keyword>

    <keyword>Request for Comments</keyword>

    <keyword>I-D</keyword>

    <keyword>Internet-Draft</keyword>

    <keyword>Measurement, big Data</keyword>

    <abstract>
      <t>This document focuses on network measurement and analysis in the
      network environment. It first defines network telemetry, then describes
      an exemplary network telemetry architecture and explores the
      characteristics of network telemetry data. It concludes by detailing a
      set of issues with retrieving and processing network telemetry data.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Today, billions of devices can connect to the Internet and to VPNs,
      forming a rich ecosystem of connectivity. Daily life has changed
      greatly as large numbers of IoT and mobile applications are built on
      top of this connectivity (e.g., smart tags on everyday objects,
      wearable health monitoring sensors, smartphones, intelligent cars, and
      smart home appliances). However, the growing number of connected
      devices and the proliferation of web and multimedia services also have
      a great impact on the network. Examples include:<list style="symbols">
          <t>The massive scale and highly dynamic nature of IoT and mobile
          applications (e.g., interaction with other things at any time and
          in any location)</t>

          <t>The increasingly vast amounts of data gathered from the network
          environment at varying speeds and with varying degrees of accuracy,
          and the new communication patterns created</t>

          <t>The disparate types of pre- and post-processing necessary to
          understand the meaning and context (e.g., semantics) of measured
          data</t>
        </list></t>

      <t>As a result, the network may be subject to more network incidents
      and unregulated network changes. Without better network visibility or a
      good view of the available network resources and network topology, it
      is not easy to: <list style="symbols">
          <t>schedule network resources to adapt to near-real-time service
          demands</t>

          <t>measure network performance and assess network quality as a
          whole</t>

          <t>provide quick network diagnosis, prove network innocence when
          application quality gets worse, or identify which parts of the
          network are causing problems when a network glitch or service
          interruption happens</t>
        </list></t>

      <t>In this document, we first define network telemetry in the context of
      the network environment, followed by an exemplary architecture for
      collecting and processing telemetry data. We then explore the
      characteristics of network telemetry data, and end by describing a set
      of issues with retrieving and processing network telemetry data.</t>
    </section>

    <section title="The definition of Network Telemetry">
      <t>Network Telemetry describes how information from various data sources
      can be collected using a set of automated communication processes and
      transmitted to one or more receiving systems for analysis. Analysis
      tasks may include event correlation, anomaly detection, performance
      monitoring, metric calculation, trend analysis, and other related
      processes.</t>
    </section>

    <section title="Network Telemetry architecture">
      <t>A Network Telemetry architecture describes how different types of
      Network Telemetry data are transmitted from different network sources
      and received by different collection entities. In an ideal network
      telemetry architecture, the ability to collect data should be
      independent of any specific application and vendor limitations. This
      means that protocol and data format translation are required, so that a
      normalized form of data can be used to simplify the various analysis and
      processing tasks required.</t>

      <t>The Network Telemetry architecture is made up of the following three
      key functional components: <list style="symbols">
          <t>Data Source: The Data Source can be any type of network device
          that generates data. Examples include management systems that
          access IGP/BGP routing information, network inventory, topology,
          and resource data, as well as other sources that provide data to be
          measured and/or contextual information that helps to better
          understand the network telemetry data.</t>

          <t>Data Collector: The Data Collector may be a part of a control
          and/or management system (e.g., NMS/OSS, SDN Controller, or OAM
          system) and/or a dedicated set of entities. It gathers data from
          various Data Sources, and performs processing tasks to feed raw
          and/or processed data to the Data Analyzer.</t>

          <t>Data Analyzer: The Data Analyzer processes data from various data
          collectors to provide actionable insight. This ranges from
          generating simple statistical metrics to inferring problems to
          recommending solutions to said problems.</t>
        </list></t>

      <t>Figure 1 shows an exemplary architecture for network telemetry and
      analysis.</t>

      <figure title="Network Telemetry and Analysis Architecture">
        <artwork align="center">                   +----------------------+
                   | Policy-based Manager |
                   +----------+-----------+
                             / \
                              |
                              |
             +----------------+----------+-----------------------+
             |                           |                       |
             |                           |                       |
            \ /                         \ /                     \ /
     +----------------+         +--------+-----------+      +----+-----+
     | Data Analyzer, |/       \|    Data Fusion,    |/    \| Decision |
     |   Normalizer,  +---------+     Analytics,     +-----+|  Logic   |
     |  Filter, etc.  |\       /|   and other Apps   |\    /| and Apps |
     +--------+-------+         +---+------------+---+      +----------+
             / \                   / \          / \
              |                     |            |
              |                     |            |
             \ /                   \ /          \ /
    +--------+-------------+   +----+----+  +----+----+
    | Data Abstraction and |   |  Other  |  | Other   |
    |   Modeling Software  |   | OT Data |  | IT Data |
    +------+--------+------+   +---------+  +---------+
          / \      / \
           |        |
           |        |
          \ /      \ /
      +----+--------+-----+
      |  Data Collectors  |
      +----+---------+----+
          / \       / \
           |         |
           |         |
           |        \ /
           |    +----+------------+        +-----------+
           |    | Edge Software   |/      \| Temporary |
           |    |  (analysis &amp;    +--------+   Data    |
           |    | transformation) |\      /|  Storage  |
           |    +------------+----+        +-----------+
           |                / \
           |                 |
           |                 |
          \ /               \ /
   +-------+------+   +------+-------+
   | Data Sources |   | Data Sources |
   +--------------+   +--------------+</artwork>
      </figure>

      <t><list style="symbols">
          <t>Data Abstraction and Modeling Software. This component uses an
          overarching information model to define relevant terms, objects, and
          values that all components in the Network Telemetry Architecture can
          use.</t>

          <t>Edge Software refers to performing compute, storage, and/or
          networking functions on nodes at the edges of a network. This
          enables processing of data to occur at or near the source of the
          data. Figure 1 shows that some information from some Data Sources
          may be sent directly to Data Collectors, while other data may be
          sent first to Edge Software for further processing before it is
          consumed by Data Collectors.</t>

          <t>Policy-based Manager. This component is responsible for managing
          different aspects of the Network Telemetry Architecture in a
          distributed and extensible manner through the use of a set of
          policies that govern the behavior of the system. Examples include
          defining rules that determine what data to collect when, where, and
          how, as well as defining rules that, given a specific context,
          determine how to process collected data.</t>
        </list></t>

      <t>This reference architecture assumes that Data Collectors can choose
      different measurement data formats to gather measurement data, and
      different protocols to transmit said data; the Data Abstraction and
      Modeling Software normalizes collected data into a common form. Both the
      Data Collector and the Data Analyzer may support data filtering,
      correlation, and other types of data processing mechanisms. In the above
      architecture, bi-directional communication is shown for generality. This
      may be implemented in a number of different ways, such as using a
      request-response mechanism, a publish-subscribe mechanism, or even a
      set of uni-directional (e.g., push and pull) requests.</t>
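
      <t>As a minimal sketch of the publish-subscribe option mentioned
      above, the following Python fragment shows a Data Source pushing
      samples to subscribed Data Collectors. The TelemetryBus class, the
      topic names, and the sample fields are hypothetical and are used only
      for illustration; they are not defined by this document or by any
      protocol.</t>

      <figure>
        <artwork><![CDATA[
```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, dict], None]


class TelemetryBus:
    """Hypothetical in-memory broker (not part of this document)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # A Data Collector registers interest once instead of polling.
        self._handlers[topic].append(handler)

    def publish(self, topic: str, sample: dict) -> int:
        # A Data Source pushes a sample to every subscriber of the
        # topic. Returns the number of deliveries made.
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(topic, sample)
        return len(handlers)


received: List[Tuple[str, dict]] = []


def collector(topic: str, sample: dict) -> None:
    received.append((topic, sample))


bus = TelemetryBus()
bus.subscribe("interface/counters", collector)

sample = {"if_name": "eth0", "in_octets": 1200}
delivered = bus.publish("interface/counters", sample)
ignored = bus.publish("cpu/load", {"percent": 17})  # no subscriber
```
]]></artwork>
      </figure>

      <t>In this sketch the collector receives only the topics it subscribed
      to. A request-response design would instead have the collector ask for
      each sample explicitly.</t>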
    </section>

    <section title="Measurement data Characteristics">
      <t>Measurement data is generated from different data sources, and has
      varying characteristics, including (but not limited to): <list
          style="symbols">
          <t>Measurement data can be any of network performance data, network
          logging data, network warning and defect data, network statistics
          and state data, and network resource operation data (e.g.,
          operations on RIBs and FIBs [RFC4984]).</t>

          <t>Most measurement data are monitoring state data rather than
          configuration data. However, on occasion, network configuration data
          may also be included (e.g., to establish context for the measurement
          data).</t>

          <t>In many cases, telemetry data requires real-time delivery through
          high-throughput, multi-channel data collection mechanisms.</t>

          <t>In most cases, the required frequency of access to monitoring
          state data is extremely high.</t>
        </list></t>
    </section>

    <section title="Issues">
      <section title="Data Fetching Efficiency">
        <t>Today, the existing data fetching methods (see Appendix B) are
        proving insufficient due to the following factors:<list style="symbols">
            <t>Existing network management protocols are not dedicated to
            data collection and are not sufficient for it.<list style="symbols">
                <t>E.g., NETCONF focuses mainly on network configuration and
                only retrieves operational data.</t>
              </list></t>

            <t>SNMP relies on periodic fetching, which is not an adequate
            solution for many types of applications.<list
                style="symbols">
                <t>E.g., applications that require frequent updates to the
                stored data</t>
              </list>In addition, periodic fetching adds significant load on
            the participating networks, devices, and applications.</t>

            <t>We increasingly rely on RPC-style interactions [RFC5531] to
            fetch data on demand for applications. However, most applications
            are interested in updates of, or changes to, the data.</t>

            <t>When a data fetching protocol is selected, human-readable
            formats such as XML and JSON for encoding structured data allow
            the data to be parsed without knowing the schema; however, they
            lack efficiency on the wire.</t>
          </list></t>
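
        <t>The trade-off between self-describing text formats and on-the-wire
        efficiency can be illustrated with a small Python sketch. It encodes
        the same sample once as JSON and once with a fixed binary layout.
        The sample fields and the binary layout are assumptions made for this
        example only; they do not correspond to any particular telemetry
        protocol.</t>

        <figure>
          <artwork><![CDATA[
```python
import json
import struct

# Hypothetical interface counter sample; field names are illustrative.
sample = {
    "if_index": 3,
    "in_octets": 123456789,
    "out_octets": 987654321,
}

# Self-describing text encoding: field names travel with every
# sample, so a receiver can parse it without knowing the schema.
text_encoded = json.dumps(sample).encode("utf-8")

# Schema-based binary encoding: one unsigned 32-bit index and two
# unsigned 64-bit counters in network byte order. The receiver must
# know this layout in advance.
LAYOUT = struct.Struct("!IQQ")
binary_encoded = LAYOUT.pack(
    sample["if_index"], sample["in_octets"], sample["out_octets"]
)

# Decoding the binary form requires the schema (order and types).
if_index, in_octets, out_octets = LAYOUT.unpack(binary_encoded)

print(len(text_encoded), len(binary_encoded))
```
]]></artwork>
        </figure>

        <t>The binary form of this sample occupies 20 octets, while the JSON
        form is several times larger because it repeats the field names in
        every sample.</t>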
      </section>

      <section title="Existing Network Level Metrics Inefficiency issue">
        <t>Quality of Service (QoS) and Quality of Experience (QoE) assessment
        [RFC7266] of multimedia services has been well studied in ITU-T SG 12.
        Media quality is commonly expressed in terms of MOS (Mean Opinion
        Score) [RFC3611][G107]. MOS is typically rated on a scale from 1 to 5,
        in which 5 represents excellent and 1 represents unacceptable. When
        multimedia application quality degrades, it is hard to know whether
        this is a network problem or an application-specific problem (e.g.,
        codec type, coding bit rate, packetization scheme, loss recovery
        technique, or the interaction between transport problems and
        application-layer protocols). To rule out a network problem, or to
        know how serious a network event or network interruption is, a
        network health index, a network Key Performance Indicator (KPI), or a
        Key Quality Indicator (KQI) becomes important.</t>

        <t>However, QoS/QoE assessment of network services, whether or not
        they depend on the underlying network technology (e.g., MPLS, IP), is
        not well studied or defined in any body or organization. The
        QoS/QoE of generic network services requires a set of appropriate
        network performance, reliability, or other metric definitions. These
        may take the form of key quality and/or performance indicators,
        ranging from high-level metrics (e.g., dropped calls) to low-level
        metrics (e.g., packet loss, delay, and jitter). IP service performance
        parameters are defined in ITU-T Y.1540 [Y1540]; however, these
        existing network performance metrics are proving insufficient due to
        several factors: <list style="symbols">
            <t>These transport-specific metrics are defined for specific
            technologies. For example, network performance parameters in
            Y.1540 are only designed for IP networks, and do not apply to
            connection-oriented networks, such as an MPLS-TP network.</t>

            <t>Not all the metrics are end-to-end performance metrics at the
            network level. For example, the TE performance metric defined in
            ISIS-TE [RFC5305] is only defined for per-link usage.</t>

            <t>These transport-specific metrics are all single-objective
            metrics; there are no transport-specific metrics defined as
            multi-objective metrics. For example, IP Packet Transfer Delay
            (IPTD) is a single-objective metric and cannot be used to measure
            related and equally important performance behaviors such as IP
            Packet Delay Variation [Y1541].</t>

            <t>Different services have different performance requirements. It
            is hard to measure network QoS to satisfy all possible services
            using a single metric.</t>

            <t>Transport-specific metrics are not applied to the whole
            network, but to a specific flow passing through the network
            corresponding to matched QoS classes.</t>

            <t>If there are multiple paths from source to destination in the
            IP network, then transport-specific metrics change with the path
            selected, and it may also be hard to know which path a packet
            will traverse.</t>
          </list></t>
      </section>

      <section title="Measurement data format consistency issue">
        <t>The data format is typically vendor- and device-specific. This
        means that different commands, with different syntax and semantics
        and carried over different protocols, may have to be issued to
        retrieve the same type of data from different devices.</t>

        <t>The Data Analyzer may need to ingest data in a specific format that
        is not supported by the Data Collectors that service it. For example,
        the ALTO data format used between a data source and a Data Collector
        represents an abstracted network topology that is provided to
        network-aware applications (i.e., a Data Analyzer) over a web-service
        based API [I-D.wu-alto-te-metrics]. In this case, prefix data in the
        network topology information needs to be converted into ALTO Network
        Maps, and TE (topology) data needs to be converted into ALTO Cost
        Maps. To provide better data format mapping, the ALTO Network Map and
        Cost Map need to be modeled in the same way as the prefix data and TE
        data in the network topology information. However, these data use
        different data formats, and do not have a common model structure to
        represent them in a consistent way.</t>

        <t>This is why the architecture shown in Figure 1 has a "Data
        Abstraction and Modeling Software" component. This component
        normalizes all data received into a common format for analysis and
        processing by the Data Analyzer. If this component is not present,
        then the Data Analyzer would have to deal with m vendor devices X n
        versions of software for each device at a minimum. Furthermore,
        different protocols have different capabilities, and may or may not be
        able to transmit and receive different types of data. The Data
        Abstraction and Modeling Software component can provide information
        that defines the structure of data that should be received; this can
        be useful for checking for incomplete collection data as well as
        missing collection data.</t>
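
        <t>The normalization role described above can be sketched as follows.
        The two vendor record layouts, the InterfaceSample model, and all
        field names are hypothetical; they only illustrate how per-vendor
        adapters can map differing inputs onto one common form and how a
        defined structure makes incomplete data detectable.</t>

        <figure>
          <artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class InterfaceSample:
    """Hypothetical common form consumed by the Data Analyzer."""
    device: str
    interface: str
    in_octets: int


def from_vendor_a(raw: dict) -> InterfaceSample:
    # Vendor A: flat record, counter already in octets.
    return InterfaceSample(
        raw["host"], raw["ifName"], int(raw["rxBytes"])
    )


def from_vendor_b(raw: dict) -> InterfaceSample:
    # Vendor B: nested record, counter in units of 1000 octets.
    port = raw["port"]
    return InterfaceSample(
        raw["node"], port["name"], int(port["rx_kilo_octets"]) * 1000
    )


NORMALIZERS: Dict[str, Callable[[dict], InterfaceSample]] = {
    "vendor_a": from_vendor_a,
    "vendor_b": from_vendor_b,
}


def normalize(vendor: str, raw: dict) -> InterfaceSample:
    if vendor not in NORMALIZERS:
        raise ValueError(f"no normalizer for vendor {vendor!r}")
    try:
        return NORMALIZERS[vendor](raw)
    except KeyError as missing:
        # The common model says which fields must be present, so
        # incomplete collection data can be detected here.
        raise ValueError(f"incomplete data: {missing}") from None


a = normalize(
    "vendor_a", {"host": "r1", "ifName": "eth0", "rxBytes": "4000"}
)
b = normalize(
    "vendor_b",
    {"node": "r1", "port": {"name": "eth0", "rx_kilo_octets": 4}},
)
```
]]></artwork>
        </figure>

        <t>Both inputs yield the same normalized sample, so the Data Analyzer
        only has to handle one form. Supporting a new device or software
        version means adding one adapter rather than changing the
        analyzer.</t>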
      </section>

      <section title="Data Correlation issue">
        <t>To provide consistent configuration, reporting, and representation
        for OAM information, the LIME YANG model
        [I-D.draft-ietf-lime-yang-oam-model-01] is proposed to correlate
        defects, faults, and network failures between the different layers,
        regardless of network technology. This helps improve the efficiency
        of fault detection and localization, and provides better OAM
        visibility.</t>

        <t>Today we see large amounts of data collected from different data
        sources. These data can be network log data, network event data,
        network performance data, network fault data, network statistics and
        state data, or network operation state data. However, these data are
        only meaningful if they are correlated in time and space. In
        particular, useful trend analysis and anomaly detection depend on
        proper correlation of the data collected from the different Data
        Sources. In addition, correlating different types of data from
        different Data Sources in time or space can provide better network
        visibility. However, such correlation is still a challenging
        issue.</t>
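
        <t>A minimal sketch of correlation in time is shown below. It selects
        records from other Data Sources that fall within a time window around
        a given record. The Record type, the source names, and the window
        size are assumptions made for illustration, and the sketch assumes
        that timestamps come from synchronized clocks. Correlation in space
        (e.g., by topology) is not covered.</t>

        <figure>
          <artwork><![CDATA[
```python
from typing import List, NamedTuple


class Record(NamedTuple):
    timestamp: float  # seconds; clocks assumed synchronized
    source: str
    kind: str
    detail: str


def correlate(
    records: List[Record], anchor: Record, window: float
) -> List[Record]:
    """Records from other sources within +/- window of the anchor."""
    return sorted(
        r for r in records
        if r.source != anchor.source
        and abs(r.timestamp - anchor.timestamp) <= window
    )


records = [
    Record(100.0, "router-1", "log", "link flap"),
    Record(101.5, "probe-7", "performance", "packet loss rising"),
    Record(102.0, "router-2", "fault", "BGP session down"),
    Record(300.0, "probe-7", "performance", "packet loss normal"),
]

anchor = records[2]
related = correlate(records, anchor, window=5.0)
```
]]></artwork>
        </figure>

        <t>Here the fault reported by router-2 is associated with the log
        entry and the performance sample that occurred within five seconds of
        it, while the much later sample is excluded.</t>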
      </section>

      <section title="Data Synchronization Issues">
        <t>When retrieving data from Data Sources or Data Collectors,
        synchronizing the same type of data between a Data Source and a Data
        Collector, or between a Data Collector and a Data Analyzer, is
        complicated.<list style="symbols">
            <t>Keeping the source and destination synchronized is difficult,
            especially when multiple Data Sources feed one Data Collector, or
            multiple Data Collectors feed one Data Analyzer.</t>

            <t>Aggregating data from different Data Sources and synchronizing
            the aggregated data to the Data Analyzer is also not an easy
            task.</t>
          </list></t>

        <t>The reference architecture of Figure 1 defines a "Policy-based
        Manager" to manage which data are collected, as well as how, when,
        where, and by which devices they are collected. This component
        provides mechanisms that help ensure that needed information is
        collected by the appropriate components of the Network Telemetry
        Architecture. It also facilitates the synchronization of the
        different components that make up the Network Telemetry Architecture,
        since these are likely distributed throughout one or more
        networks.</t>

        <t>It also provides a mechanism for the Data Analyzer, or other
        applications (e.g., the "Data Fusion, Analytics, and other Apps", as
        well as the "Decision Logic and Apps" components in Figure 1) to
        provide information to the Policy-based Manager in the form of
        feedback (e.g., see [I-D.draft-strassner-anima-control-loops-01]).</t>
      </section>
    </section>
  </middle>

  <back>
    <references title="Informative References">
      <?rfc include="reference.RFC.7266.xml"?>

      <?rfc include="reference.RFC.5531.xml"?>

      <?rfc include="reference.RFC.4984.xml"?>

      <?rfc include="reference.RFC.3611.xml"?>

      <?rfc include="reference.RFC.5305.xml"?>

      <?rfc include="reference.RFC.5693.xml"?>

      <?rfc include="reference.I-D.wu-alto-te-metrics.xml"?>

      <?rfc include="reference.I-D.ietf-idr-ls-distribution.xml"?>

      <?rfc include="reference.I-D.ietf-idr-te-pm-bgp.xml"?>

      <?rfc include="reference.I-D.ietf-lime-yang-oam-model"?>

      <?rfc include="reference.I-D.strassner-anima-control-loops"?>

      <reference anchor="Y1540">
        <front>
          <title>Internet protocol data communication service – IP packet
          transfer and availability performance parameters</title>

          <author>
            <organization>ITU-T</organization>
          </author>

          <date month="March" year="2011"/>
        </front>

        <seriesInfo name="ITU-T" value="Recommendation Y.1540"/>
      </reference>

      <reference anchor="Y1541">
        <front>
          <title>Network performance objectives for IP-based services</title>

          <author>
            <organization>ITU-T</organization>
          </author>

          <date month="December" year="2011"/>
        </front>

        <seriesInfo name="ITU-T" value="Recommendation Y.1541"/>
      </reference>

      <reference anchor="G107">
        <front>
          <title>The E-model: a computational model for use in transmission
          planning</title>

          <author>
            <organization>ITU-T</organization>
          </author>

          <date month="June" year="2015"/>
        </front>

        <seriesInfo name="ITU-T" value="Recommendation G.107"/>
      </reference>
    </references>

    <section title="Network Telemetry data source Classification">
      <figure>
        <artwork>+-----------------------------+------------------------------+
|    Data Source Category     |         Information          |
+-----------------------------+------------------------------+
| Network Data                | Usage Records                |
|                             | Performance Monitoring Data  |
|                             | Fault Monitoring Data        |
|                             | Real Time Traffic Data       |
|                             | Real Time Statistics Data    |
|                             | Network Configuration Data   |
|                             | Provision Data               |
+-----------------------------+------------------------------+
| Subscriber Data             | Profile Data                 |
|                             | Network Registry             |
|                             | Operation Data               |
|                             | Billing Data                 |
+-----------------------------+------------------------------+
| Application Data derived    | Traffic Analysis             |
| from interfaces, channels,  | Web, Search, SMS, Email      |
| software, etc.              | Social Media Data            |
|                             | Mobile Apps                  |
+-----------------------------+------------------------------+</artwork>
      </figure>
    </section>

    <section title="Existing Network Data Collection Methods">
      <section title="Network Log Collection">
        <t>There are three typical Log data Collection methods:<list
            style="symbols">
            <t>Text based Collection</t>

            <t>SNMP Trap</t>

            <t>Syslog based Collection</t>
          </list></t>

        <section title="Text based data collection">
          <t>Text-based log data is designed for low-speed networks, so the
          amount of IoT data cannot be too large. It can only be parsed by
          network personnel who have experience with how such logs are
          defined. The log data can be transferred either by email or via
          FTP. The differences between using email and using FTP are:<list
              style="symbols">
              <t>The volume of data transferred by FTP can be much larger than
              via email.</t>

              <t>FTP-based collection is active data collection, while
              email-based collection is passive data collection.</t>
            </list></t>
        </section>

        <section title="SNMP Trap">
          <t>SNMP Trap is a notification mechanism that enables an agent to
          notify the management system of significant events by way of an
          unsolicited SNMP message. When there are a large number of devices,
          each with a large number of objects, SNMP Trap is a more efficient
          way to obtain the data than polling every object on every
          device.</t>
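
          <t>The difference in message volume can be made concrete with a
          rough calculation. The deployment figures below are hypothetical,
          and the sketch assumes one request per polled object; bulk
          retrieval operations would reduce the polling count.</t>

          <figure>
            <artwork><![CDATA[
```python
def polling_requests(
    devices: int, objects_per_device: int,
    interval_s: int, period_s: int,
) -> int:
    """Requests sent when every object is polled once per interval."""
    polls = period_s // interval_s
    return devices * objects_per_device * polls


def trap_messages(events_in_period: int) -> int:
    """Unsolicited notifications: one message per significant event."""
    return events_in_period


# Hypothetical deployment: 1000 devices with 50 objects each, polled
# every 60 seconds for one hour, during which 200 events occur.
polled = polling_requests(1000, 50, 60, 3600)
trapped = trap_messages(200)

print(polled, trapped)
```
]]></artwork>
          </figure>

          <t>Under these assumptions polling generates 3,000,000 requests in
          the hour, compared with 200 trap messages.</t>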
        </section>

        <section title="Syslog based Collection">
          <t>The Syslog protocol is used to convey event notification messages
          and allows the use of any number of transport protocols for
          transmission of syslog messages. It is widely used in network
          devices (e.g., switches and routers).</t>
        </section>
      </section>

      <section title="Network Traffic Collection">
        <t>Network traffic collection is the process of exporting network
        traffic flow information from routers, probes, and other devices. It
        is not concerned with the operational state of a network device, but
        with the traffic flow characteristics on the links between any two
        adjacent network devices. For example, IPFIX is widely adopted in
        routers and switches to export IP traffic flow information to the
        network management system.</t>
      </section>

      <section title="Network Performance Collection">
        <t>Network performance collection is the process of exporting network
        performance information from routers, probes, and other devices. The
        network performance information can be applied to the quality,
        performance, and reliability of data delivery services and
        applications running over the network. It can also be applied to the
        traffic contract agreed between the user and the network service
        provider. Measurement mechanisms defined in the IPPM WG, as well as
        OAM technologies and tools, can be used to perform performance
        measurement.</t>
      </section>

      <section title="Network Faults Collection">
        <t>Network fault collection is the process of exporting network
        faults, failures, warnings, and defects from routers, probes, and
        other devices. It usually adopts OAM technologies, OAM tools, and OAM
        models (e.g., an SNMP MIB or a NETCONF YANG model) to localize
        faults. However, the OAM YANG model is mainly focused on configuring
        OAM functionality on the network element. How to use the OAM YANG
        model to collect more data (e.g., warnings, failures, and defects),
        and how to use these data, need to be further standardized.</t>
      </section>

      <section title="Network Topology data Collection">
        <t>For network topology data collection, routing protocols are an
        important collection method, since every router needs to propagate
        its information throughout the whole network. In addition, an NMS/OSS
        can be used to get network topology data if it has access to the
        network topology database or to the routing protocols.</t>

        <t>Network topology data comprises node information and link
        information. It can be collected in two typical ways. If the network
        topology data is within one IGP area or one AS, IS-IS or OSPF can be
        used to gather the data and write it into the RIB or a topology
        datastore, and the I2RS protocol can then be used to read it. If the
        network topology data extends beyond one IGP area and spans several
        domains, BGP-LS
        [I-D.ietf-idr-ls-distribution][I-D.ietf-idr-te-pm-bgp] can be used to
        collect the network topology data in the different domains and
        aggregate it in a central network topology database.</t>
      </section>

      <section title="Other Data Collection">
        <t>To collect and process large volumes of data in real time or near
        real time, in order to detect subtle events and aid failure
        diagnosis, other efficient data fetching tools can be used. Examples
        include Facebook's Scribe and Chukwa, which is built on top of the
        Hadoop Distributed File System (HDFS). Such tools parse structured
        data out of some of the logs and load it into a datastore.</t>
      </section>
    </section>
  </back>
</rfc>
