opsawg                                                       P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Standards Track                                R. Chang
Expires: December 12, 2016                             Barefoot Networks
                                                           June 10, 2016

           Data-plane probe for in-band telemetry collection
                   draft-lapukhov-dataplane-probe-01

Abstract

   Detecting and isolating network faults in IP networks has
   traditionally been done using tools like ping and traceroute (see
   [RFC7276]) or more complex systems built on similar concepts of
   active probing and path tracing.  While active synthetic probes
   have proven helpful in detecting data-plane faults, isolating the
   fault location is a much harder problem, especially in diverse
   networks with multiple active forwarding planes (e.g. IP and MPLS).
   Moreover, existing end-to-end tools do not generally support
   functionality beyond dealing with packet loss - for example, they are
   hardly useful for detecting and reporting transient (i.e. milli- or
   even micro-second) network congestion.

   Modern network forwarding hardware can allow for more sophisticated
   data-plane functionality that provides substantial improvement to the
   isolation and identification capabilities of network elements.  For
   example, it has become possible to encode a snapshot of a network
   element's state within the packet payload as it transits the device.
   One example of such device/network state would be the queue depth on
   the egress port taken by that specific packet.  When combined with a
   unique device identifier embedded in the same packet, this could
   allow for precise time and topological identification of the
   congested location within the network.

   This document proposes a standard format for requesting and embedding
   telemetry information in active probes, i.e. packets designated for
   actively testing the network while not carrying application traffic.
   These active probes could be conveyed over multiple protocols (ICMP,
   UDP, TCP, etc.) and the document does not prescribe any particular
   transport.  In addition, this document provides recommendations on
   handling the active probes by devices that do not support the
   required data-plane functionality.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 12, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Data plane probe
     2.1.  Probe transport
     2.2.  Probe structure
     2.3.  Header Format
     2.4.  Telemetry Data Frame and Telemetry Data Records
   3.  Telemetry Record Types
     3.1.  Device Identifier
     3.2.  Timestamp
     3.3.  Queueing Delay
     3.4.  Ingress/Egress Port IDs
     3.5.  Opaque State Snapshot
   4.  Operating in loopback mode
   5.  Processing Probe Packet
     5.1.  Detecting a probe
   6.  Non-Capable Devices
   7.  Handling data-plane probes in the MPLS domain
   8.  Multi-chip device considerations
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Detecting and isolating faults in IP networks may involve multiple
   tools and approaches, but by far the two most popular utilities used
   by operators are ping and traceroute.  The ping utility provides the
   basic end-to-end connectivity check by sending a special ICMP packet.
   There are other variants of ping that work using TCP or UDP probes,
   but may require a special responder application (for UDP) on the
   other end of the probed connection.

   This type of active probing approach has its limitations.  First, it
   operates end-to-end and thus it is impossible to tell where in the
   path the fault has happened from simply observing the packet loss
   ratios.  Secondly, in multipath (ECMP) scenarios it can be quite difficult
   to fully and/or deterministically exercise all the possible paths
   connecting two end-points.

   The traceroute utility has multiple variants as well - UDP, ICMP and
   TCP based, for instance, and a special variant for MPLS LSP testing.
   Practically all variants follow the same model of operations: varying
   the TTL field setting in outgoing probes and analyzing the returned
   ICMP time exceeded and unreachable messages.  This does allow
   isolating the fault down to the IP hop that is losing packets, but
   has its own limitations.  As
   with the ping utility, it becomes complicated to explore all possible
   ECMP paths in the network.  This is especially problematic in large
   Clos fabric topologies that are very common in large data-center
   networks.  Next, many network devices limit the rate of outgoing ICMP
   messages as well as the rate of "exception" packets "punted" to the
   control plane processor.  This puts a functional limit on the packet
   rate that the traceroute can probe a given hop with, and hence
   impacts the resolution and time to isolate a fault.  Lastly, the
   treatment for these control packets is often different from the
   packets that take the regular forwarding path: the latter are
   normally not redirected to the control plane processor and are
   handled purely in the data-plane hardware.

   Modern network processing elements (both hardware and software based)
   are capable of packet handling beyond basic forwarding and simple
   header modifications.  Of special interest is the ability to capture
   and embed instantaneous state from the network element and encode
   this state directly into the transit packet.  One example would be to
   record the transit device's name, ingress and egress port
   identifiers, queue depths, queueing delays, timestamps and so on.
   By collecting this state from each network device in the path, it
   becomes trivial
   to trace a probe's path through the network as well as record transit
   device characteristics.  Extending this model, one could build a tool
   that combines the useful properties of ping and traceroute using a
   single packet flight through the network, without the constraints of
   control plane (aka "slow path") processing.  To aid in the
   development of such tooling, this document defines a format for
   requesting and embedding telemetry information in the body of active
   probing packets.

2.  Data plane probe

   This section defines the structure of the active data-plane probe.

2.1.  Probe transport

   This document does not prescribe any specific encapsulation for the
   data-plane probe.  For example, the probe could be embedded inside a
   UDP packet, or within an IPv6 extension header.

2.2.  Probe structure

   The probe consists of a fixed-size "Header" and an arbitrary number
   of variable-length "telemetry data frames" following the header.
   Each frame, in turn, consists of multiple "telemetry record" fields
   defined below in this document.  The records are added by the
   network elements, as requested by the telemetry information
   specified in the header.

   +---------------------------------------------------------+
   |                       Header                            |
   +---------------------------------------------------------+
   |                Telemetry data frame N                   |
   +---------------------------------------------------------+
   |                Telemetry data frame N-1                 |
   +---------------------------------------------------------+
   .                                                         .
   .                                                         .
   .                                                         .
   +---------------------------------------------------------+
   |                Telemetry data frame 1                   |
   +---------------------------------------------------------+

                          Figure 1: Probe layout

   Notice that the first frame is at the end of the packet.  For
   efficient hardware implementation, new frames are pushed onto the
   stack at each hop, immediately after the header.  This eliminates
   the need for the transit network elements to inspect the full packet
   and allows for packets as long as the MTU permits.
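
   The push ordering can be illustrated with a small Python sketch.  It
   models the probe body as a list of already-delimited frames; the
   function names (push_frame, frames_in_path_order) and the list
   representation are assumptions made for this example only and are
   not part of the format.

```python
def push_frame(frames, new_frame):
    """Return a new list with new_frame placed first, i.e. right after the header."""
    return [new_frame] + frames


def frames_in_path_order(frames):
    """Frames sit newest-first in the packet; reverse them to follow the path."""
    return list(reversed(frames))


# Three hops each push one frame while the probe travels sender -> receiver.
frames = []
for hop_frame in (b"hop-1", b"hop-2", b"hop-3"):
    frames = push_frame(frames, hop_frame)
```

   After the loop the frame written by the last hop comes first and the
   frame written by the first hop sits at the end of the packet, so a
   receiver that wants the hops in path order reads the frames in
   reverse.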

2.3.  Header Format

   The probe payload starts with a fixed-size header.  The header
   identifies the packet as a data-plane probe packet, and encodes basic
   information shared by all telemetry records.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Probe Marker (1)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Probe Marker (2)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Version     | Message Type  |             Flags             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Telemetry Request Vector                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Hop Limit   |   Hop Count   |         Must Be Zero          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Maximum Length        |        Current Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Sender's Handle        |        Sequence Number        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Figure 2: Header Format

   (1)   The "Probe Marker" fields are arbitrary 32-bit values generally
         used by the network elements to identify the packet as a probe
         packet.  These fields should be interpreted as unsigned integer
         values, stored in network byte order.  For example, a network
         element may be configured to recognize a UDP packet destined to
         port 31337 and having 0xDEAD 0xBEEF as the values in the "Probe
         Marker" fields as an active probe, and treat it accordingly.

   (2)   "Version Number" is currently set to 1.

   (3)   The "Message Type" field value could be either "1" - "Probe" or
         "2" - "Probe Reply".

   (4)   The "Flags" field is 8 bits, and defines the following flags:

         (1)  "Overflow" (O-bit) (least significant bit).  This bit is
              set by the network element if the number of records in the
              packet is at the maximum limit as specified by the packet,
              i.e. the packet is already "full" of telemetry
              information.

   (5)   "Telemetry Request Vector" is a 32-bit field that requests
         well-known inband telemetry information from the network
         elements on the path.  A bit set in this vector translates to a
         request for a particular type of information.  The following
         types/bits are currently defined, starting with the least
         significant bit:

         (1)  Bit 0: Device identifier.

         (2)  Bit 1: Timestamp.

         (3)  Bit 2: Queueing delay.

         (4)  Bit 3: Ingress/Egress port identifiers.

         (5)  Bit 31: Opaque state snapshot request.

   (6)   "Hop Limit" is defined only for "Message Type" of "1"
         ("Probe").  For "Probe Reply" the "Hop Limit" field must be set
         to zero.  This field is treated as an integer value
         representing the number of network elements.  See Section 4
         for the intended use of the field.

   (7)   The "Hop Count" field specifies the number of capable network
         elements the packet has transited so far.  It begins with zero
         and must be incremented by one for every network element that
         adds a telemetry record.  Combined with the push mechanism,
         this simplifies the work for the subsequent network element
         and the packet receiver: the subsequent network element just
         needs to parse the header and then insert new record(s)
         immediately after it.

   (8)   The "Maximum Length" field specifies the maximum length of the
         telemetry payload in bytes.  Given that the sender knows the
         minimum path MTU, the sender can set the maximum number of
         payload bytes allowed before exceeding the MTU.  Thus, a
         simple comparison between "Current Length" and "Maximum
         Length" is enough to decide whether or not data can be added.

   (9)   The "Current Length" field specifies the current length of data
         stored in the probe.  This field is incremented by each network
         element by the number of bytes it has added with the telemetry
         data frame.

   (10)  The "Sender's Handle" field is set by the sender to allow the
         receiver to identify a particular originator of probe packets.
         Along with "Sequence Number" it allows for tracking of packet
         order and loss within the network.
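
   The header fields described above can be sketched in Python with the
   standard "struct" module.  This is an illustration under stated
   assumptions, not a reference encoder: it assumes a 28-byte header
   with an 8-bit version, an 8-bit message type and a 16-bit flags word
   whose least significant bit is the O-bit, followed by the request
   vector, hop limit, hop count, 16 must-be-zero bits, the two length
   fields, the sender's handle and the sequence number.  All Python
   names and the marker values are chosen for the example only.

```python
import struct
from collections import namedtuple

# Assumed layout, network byte order: two 32-bit markers, version (8),
# message type (8), flags (16), request vector (32), hop limit (8),
# hop count (8), must-be-zero (16), maximum length (16),
# current length (16), sender's handle (16), sequence number (16).
HEADER = struct.Struct("!IIBBHIBBHHHHH")

ProbeHeader = namedtuple(
    "ProbeHeader",
    "marker1 marker2 version msg_type flags request_vector hop_limit "
    "hop_count mbz max_length current_length sender_handle sequence",
)

MSG_PROBE, MSG_PROBE_REPLY = 1, 2
FLAG_OVERFLOW = 0x0001            # O-bit, least significant flag bit

# Telemetry Request Vector bits as listed in the header description.
REQ_DEVICE_ID = 1 << 0
REQ_TIMESTAMP = 1 << 1
REQ_QUEUEING_DELAY = 1 << 2
REQ_PORT_IDS = 1 << 3
REQ_OPAQUE_SNAPSHOT = 1 << 31


def build_probe_header(request_vector, hop_limit, max_length, handle,
                       sequence, markers=(0xDEAD, 0xBEEF)):
    """Build a fresh "Probe" header: no flags set, hop count and length zero."""
    return HEADER.pack(markers[0], markers[1], 1, MSG_PROBE, 0,
                       request_vector, hop_limit, 0, 0,
                       max_length, 0, handle, sequence)


def parse_probe_header(data):
    """Decode the fixed-size header found at the start of data."""
    return ProbeHeader._make(HEADER.unpack_from(data))


raw = build_probe_header(REQ_DEVICE_ID | REQ_TIMESTAMP, hop_limit=8,
                         max_length=1200, handle=7, sequence=42)
hdr = parse_probe_header(raw)
```

   A receiver could then test individual request bits, for example
   "hdr.request_vector & REQ_TIMESTAMP", to learn which records the
   sender asked for.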

2.4.  Telemetry Data Frame and Telemetry Data Records

   Each telemetry data frame is constructed by concatenating multiple
   telemetry data records, per the request in the "Telemetry Request
   Vector" field of the data-plane probe header.  The frame starts with
   a 16-bit length field, which reflects the frame size in bytes,
   excluding the length of the field itself.  Following the "Frame
   Length" field is a "Telemetry Response Vector" field: this vector
   corresponds to the records the network element was capable of
   recording in the frame.  The body of the frame is constructed by
   appending the records corresponding to every bit set in the
   "Telemetry Response Vector".  All of the records, except the one
   requested by bit 31 ("Opaque State Snapshot"), are fixed size, with
   their lengths defined in Section 3.  The order of the records in the
   frame follows the order of the bits in the "Telemetry Request
   Vector" (also reflected in the "Telemetry Response Vector").
   Finally, if requested, a variable-length field is appended at the
   end of the frame, with the length field occupying the first 8 bits.
   This "length" field reflects the length of the opaque data excluding
   the length field itself.

   If inserting a new telemetry record would cause "Current Length" to
   exceed "Maximum Length", no record is added and the overflow "O-bit"
   must be set to "1" in the probe header.
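
   The length check in the previous paragraph amounts to a single
   comparison.  The Python sketch below shows one way to express it; the
   function name, the tuple it returns and the O-bit mask value 0x0001
   are assumptions made for the example.

```python
FLAG_OVERFLOW = 0x0001   # assumed position of the O-bit in the flags field


def account_for_frame(current_length, max_length, flags, frame_len):
    """Return (current_length, flags, accepted) after trying to add a frame.

    If adding frame_len bytes would push the current length past the
    maximum length, nothing is added and the O-bit is set instead.
    """
    if current_length + frame_len > max_length:
        return current_length, flags | FLAG_OVERFLOW, False
    return current_length + frame_len, flags, True
```

   A frame that fills the payload exactly up to the maximum length is
   still accepted here, since the text only forbids exceeding it.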

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Frame Length          |         Must be Zero          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Telemetry Response Vector                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                      Fixed Size Field 0                       .
   .                        (if requested)                         .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                      Fixed Size Field 1                       .
   .                        (if requested)                         .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   ~                                                               ~
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                      Fixed Size Field 30                      .
   .                        (if requested)                         .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Length    |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   .                   Opaque State Snapshot                       .
   .                       (if requested)                          .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 3: Telemetry Frame Format
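
   As a rough illustration of the frame format, the following Python
   sketch builds and parses one telemetry data frame.  It assumes that
   "Frame Length" counts every byte after the 16-bit length field (the
   must-be-zero bits, the response vector and the records), that the
   fixed record sizes are those given in Section 3 (4, 16, 4 and 4
   bytes for bits 0 to 3), and that the opaque snapshot is carried as an
   uninterpreted byte string behind its 8-bit length.  The function
   names and the dictionary-based interface are hypothetical.

```python
import struct

# Assumed record sizes in bytes for request-vector bits 0..3 (Section 3).
RECORD_SIZES = {0: 4, 1: 16, 2: 4, 3: 4}
OPAQUE_BIT = 31


def build_frame(request_vector, available, opaque=None):
    """Build one frame from the requested records this element can supply.

    available maps a bit number to the already-encoded record bytes.
    """
    response, body = 0, b""
    for bit in sorted(RECORD_SIZES):                 # lowest bit first
        if (request_vector >> bit) & 1 and bit in available:
            if len(available[bit]) != RECORD_SIZES[bit]:
                raise ValueError("wrong size for record bit %d" % bit)
            response |= 1 << bit
            body += available[bit]
    if (request_vector >> OPAQUE_BIT) & 1 and opaque is not None:
        if len(opaque) > 255:
            raise ValueError("opaque data does not fit an 8-bit length")
        response |= 1 << OPAQUE_BIT
        body += bytes([len(opaque)]) + opaque
    rest = struct.pack("!HI", 0, response) + body    # MBZ, response, records
    return struct.pack("!H", len(rest)) + rest


def parse_frame(data):
    """Return (records, opaque, remaining_bytes) for the frame at data[0:]."""
    frame_len, _mbz, response = struct.unpack_from("!HHI", data)
    end, pos, records = 2 + frame_len, 8, {}
    for bit in range(OPAQUE_BIT):
        if (response >> bit) & 1:
            size = RECORD_SIZES[bit]   # KeyError for bits unknown to this sketch
            records[bit] = data[pos:pos + size]
            pos += size
    opaque = None
    if (response >> OPAQUE_BIT) & 1:
        length = data[pos]
        opaque = data[pos + 1:pos + 1 + length]
        pos += 1 + length
    if pos != end:
        raise ValueError("frame length does not match its contents")
    return records, opaque, data[end:]
```

   Because every frame carries its own length and response vector, a
   receiver can walk the stack of frames by calling parse_frame
   repeatedly on the remaining bytes.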

3.  Telemetry Record Types

   This section defines some of the telemetry record types that could be
   supported by the network elements.

3.1.  Device Identifier

   This record is used to identify the device reporting telemetry
   information.  This document does not prescribe any specific
   identifier format.  In general, it is expected to be configured by
   the operator.  The length of this record is 32 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Device ID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 4: Device Identifier

3.2.  Timestamp

   This telemetry record encodes the time data associated with the
   packet.  Most existing hardware supports timestamping for IEEE 1588.
   To leverage existing hardware capabilities, the packet receive time
   is stored in a similar format: 48 bits of seconds and 32 bits of
   nanoseconds, in UTC.  The residence time is recorded as 48 bits of
   nanoseconds.  The length of this record is 128 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Receive Seconds [47:16]                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Receive Seconds [15:0]     |  Receive Nanoseconds [31:16]  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Receive Nanoseconds [15:0]   |    Residence Time [47:32]     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Residence Time [31:0]                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                           Figure 5: Timestamp
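
   A minimal Python sketch of the 128-bit timestamp record follows.  It
   assumes the three fields are simply concatenated in network byte
   order: 48 bits of receive seconds, 32 bits of receive nanoseconds
   and 48 bits of residence time in nanoseconds.  The function names
   and the range checks are choices made for the example.

```python
def pack_timestamp(rx_seconds, rx_nanoseconds, residence_ns):
    """Encode receive time and residence time as a 16-byte record."""
    if not 0 <= rx_seconds < 1 << 48:
        raise ValueError("receive seconds must fit in 48 bits")
    if not 0 <= rx_nanoseconds < 10 ** 9:
        raise ValueError("receive nanoseconds must be below one second")
    if not 0 <= residence_ns < 1 << 48:
        raise ValueError("residence time must fit in 48 bits")
    return (rx_seconds.to_bytes(6, "big")
            + rx_nanoseconds.to_bytes(4, "big")
            + residence_ns.to_bytes(6, "big"))


def unpack_timestamp(record):
    """Decode a 16-byte record into (seconds, nanoseconds, residence_ns)."""
    if len(record) != 16:
        raise ValueError("timestamp record must be 16 bytes")
    return (int.from_bytes(record[0:6], "big"),
            int.from_bytes(record[6:10], "big"),
            int.from_bytes(record[10:16], "big"))
```

   Integer byte conversions are used instead of the "struct" module
   because 48-bit fields have no native struct format character.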

3.3.  Queueing Delay

   This record encodes the amount of time that the frame has spent
   queued in the network element.  This is only recorded if the packet
   has been queued, and defines the time spent in memory buffers.  This
   could be helpful to detect queueing-related delays in the network.
   If the queueing delay exceeds the maximum of roughly 2.147 seconds
   (2^31 - 1 nanoseconds) that the 31-bit field can represent, the
   network element must set the overflow "O-bit".  In case of
   cut-through switching operation the delay must be set to zero.  The
   length of this record is 32 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |O|                        Nanoseconds                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 6: Queueing Delay
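
   The queueing delay record could be encoded as sketched below in
   Python.  The 31-bit nanosecond field holds at most 2^31 - 1
   nanoseconds, about 2.147 seconds.  The text does not say what the
   nanosecond field should contain once the O-bit is set; saturating it
   at the maximum value is an assumption of this example, as are the
   function names.

```python
import struct

MAX_DELAY_NS = (1 << 31) - 1     # largest delay the 31-bit field can hold
OVERFLOW_BIT = 1 << 31           # O-bit, most significant bit of the record


def pack_queueing_delay(delay_ns):
    """Encode a queueing delay in nanoseconds as a 4-byte record."""
    if delay_ns < 0:
        raise ValueError("delay cannot be negative")
    if delay_ns > MAX_DELAY_NS:
        # Assumption: saturate the value and flag the overflow.
        word = OVERFLOW_BIT | MAX_DELAY_NS
    else:
        word = delay_ns
    return struct.pack("!I", word)


def unpack_queueing_delay(record):
    """Decode a 4-byte record into (overflowed, delay_ns)."""
    (word,) = struct.unpack("!I", record)
    return bool(word & OVERFLOW_BIT), word & MAX_DELAY_NS
```

   A cut-through hop would simply call pack_queueing_delay(0).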

3.4.  Ingress/Egress Port IDs

   This record stores the ingress and egress physical ports used to
   receive and send packet respectively.  Here, "physical port" means a
   unit with actual MAC and PHY devices associated - not any logical
   subdivision based, for example, on protocol level tags (e.g.  VLAN).
   The port identifiers are opaque, and defined as 16-bit entries.  For
   example, those could be the corresponding SNMP ifIndex values.  The
   length of this record is 32 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ingress Port ID        |        Egress Port ID         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 7: Ingress/Egress Port IDs
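
   For completeness, the port identifier record can be sketched in
   Python as two unsigned 16-bit values in network byte order, ingress
   first.  The function names are chosen for the example.

```python
import struct

PORT_IDS = struct.Struct("!HH")   # ingress port ID, egress port ID


def pack_port_ids(ingress, egress):
    """Encode the two opaque 16-bit port identifiers as a 4-byte record."""
    return PORT_IDS.pack(ingress, egress)


def unpack_port_ids(record):
    """Decode a 4-byte record into (ingress, egress)."""
    return PORT_IDS.unpack(record)
```

   Identifiers that do not fit in 16 bits cannot be carried as-is;
   struct raises an error for values above 65535, so an implementation
   would need its own mapping to 16-bit values in that case.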

3.5.  Opaque State Snapshot

   This record has variable size.  It allows the network element to
   store arbitrary state in the probe, without a pre-defined schema.
   The schema needs to be made known to the analyzer by some out-of-band
   means.  The 16-bit "Schema Id" field in the record lets the analyzer
   know which particular schema to use, and it is expected to be
   configured on the network element by the network operator.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|            Type             |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Schema Id           |         Must Be Zero          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                          Opaque Data                          |
   ~                                                               ~
   .                                                               .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 8: Opaque State

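   A minimal sketch of how an analyzer might use the "Schema Id" to
   interpret the opaque data is given below.  Everything in it is an
   assumption made for illustration: the registry contents, the field
   names, the use of Python "struct" format strings as the schema
   language, network byte order, and a 16-bit "Must Be Zero" field
   directly after the "Schema Id".

```python
import struct

# Hypothetical out-of-band registry: Schema Id -> (field names, struct format).
SCHEMAS = {
    1: (("queue_depth", "drop_count"), "!IH"),
}


def decode_opaque_state(value):
    """Decode a record value: Schema Id, Must Be Zero, then opaque data."""
    schema_id, _must_be_zero = struct.unpack("!HH", value[:4])
    opaque = value[4:]
    if schema_id not in SCHEMAS:
        # Unknown schema: hand back the raw bytes instead of guessing.
        return schema_id, opaque
    names, fmt = SCHEMAS[schema_id]
    fields = struct.unpack(fmt, opaque[:struct.calcsize(fmt)])
    return schema_id, dict(zip(names, fields))
```

   Returning the raw bytes for an unknown "Schema Id" keeps the record
   available for later decoding once the schema has been distributed.
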
4.  Operating in loopback mode

   In "loopback" mode the flow of probes is "turned back" at some
   network element.  The network element that "turns" packets around is
   identified using the "Hop Limit" field.  The network element that
   receives a "Probe" type packet having a "Hop Limit" value equal to
   "Hop Count" is required to perform the following:

      Change the "Message Type" field to "Probe Reply", and keep the
      "Hop Limit" at zero.

      Swap the destination/source IP addresses and port values in the
      transport header to send the packet back to the originator.

      Add a new telemetry record as required, using the data frame
      corresponding to the new forwarding information.

   This way, the original probe is routed back to the originator.
   Notice that the return path may be different from the path that the
   original probe has taken.  This path will be recorded by the network
   elements as the reply is transported back to the sender.  Using this
   technique one may progressively test a path until its breaking
   point.  Unlike the traditional traceroute utility, however, the
   returning packets are the original probes, not ICMP messages.

   If a network element is incapable of redirecting packets back to the
   originator, another option would be exporting those packets to a
   network analyzer device, using some sort of encapsulation header.
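
   The turn-around steps of this section can be sketched as follows.
   This is not a normative implementation: the "ProbePacket" structure,
   its field names, and the numeric "Message Type" code points are
   hypothetical, and the turn-around condition is assumed to be "Hop
   Limit" equal to "Hop Count".

```python
from dataclasses import dataclass, replace

PROBE = 1        # hypothetical Message Type code points
PROBE_REPLY = 2


@dataclass(frozen=True)
class ProbePacket:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    message_type: int
    hop_limit: int
    hop_count: int
    records: tuple = ()


def should_turn_back(pkt):
    """A "Probe" is turned around when Hop Limit equals Hop Count."""
    return pkt.message_type == PROBE and pkt.hop_limit == pkt.hop_count


def turn_back(pkt, local_record):
    """Apply the three loopback steps and return the reply packet."""
    return replace(
        pkt,
        message_type=PROBE_REPLY,                      # step 1: mark as reply
        hop_limit=0,                                   # step 1: Hop Limit at zero
        src_ip=pkt.dst_ip, dst_ip=pkt.src_ip,          # step 2: swap addresses
        src_port=pkt.dst_port, dst_port=pkt.src_port,  # step 2: swap ports
        records=pkt.records + (local_record,),         # step 3: add telemetry
    )
```

   An originator could send probes with increasing "Hop Limit" values
   and compare the records in each reply to locate the hop at which the
   path breaks.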

5.  Processing Probe Packet

5.1.  Detecting a probe

   As mentioned previously, a combination of techniques needs to be
   used to differentiate the active probes.  This may include, but
   should not be limited to, using just the known position of the
   "Probe Marker" fields.
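
   One possible combination of checks is sketched below, purely as an
   example.  The UDP port number, the marker value, and the marker
   offset are all hypothetical placeholders; this document does not
   assign any of them here.

```python
PROBE_UDP_PORT = 5000                             # hypothetical port number
PROBE_MARKER = bytes.fromhex("deadbeefcafef00d")  # hypothetical marker value
MARKER_OFFSET = 0                                 # hypothetical payload offset


def looks_like_probe(udp_dst_port, payload):
    """Combine two independent checks rather than relying on either alone."""
    port_ok = udp_dst_port == PROBE_UDP_PORT
    end = MARKER_OFFSET + len(PROBE_MARKER)
    # A short payload yields a short slice, which simply fails the comparison.
    marker_ok = payload[MARKER_OFFSET:end] == PROBE_MARKER
    return port_ok and marker_ok
```

   Requiring both conditions reduces the chance that ordinary traffic
   which happens to carry the marker bytes is treated as a probe.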

6.  Non-Capable Devices

   Non-capable devices are those that cannot process a probe natively in
   the fast-path data plane.  Further, there could be two types of such
   devices: those that can still process it via the control-plane
   software, and those that cannot.  The control-plane processing
   should be triggered by use of the "Router-Alert" option for IPv4 or
   IPv6 packets (see [RFC2113] or [RFC2711]) added by the originator of
   the probe.  A control-plane capable device is expected to interpret
   and fill in as much telemetry-record data as it possibly could, given
   its limited abilities.

   Network elements that are not capable of processing the data-plane
   probes are expected to perform regular packet forwarding.  If a
   network element receives a packet with the router-alert option set,
   but has no special configuration to detect such probes, it should
   process it according to [RFC6398].  Without the router-alert option,
   devices that are not data-plane capable can only process the probe
   using traditional forwarding.
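
   The cases described in this section can be summarized by the
   following sketch.  The capability names, the action labels, and the
   "probe_detection_configured" flag are hypothetical and serve only to
   make the decision order explicit.

```python
from enum import Enum


class Capability(Enum):
    DATA_PLANE = "data-plane"        # processes probes in the fast path
    CONTROL_PLANE = "control-plane"  # can process probes in software only
    NONE = "none"                    # cannot process probes at all


def handle(capability, has_router_alert, probe_detection_configured=True):
    """Return a label for the action this section describes."""
    if capability is Capability.DATA_PLANE:
        return "process-in-data-plane"
    if not has_router_alert:
        # Without the option there is no trigger for software processing.
        return "forward-normally"
    if capability is Capability.CONTROL_PLANE and probe_detection_configured:
        return "punt-to-control-plane"
    # Router alert present, but nothing configured to detect the probe.
    return "handle-per-rfc6398"
```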

7.  Handling data-plane probes in the MPLS domain

   In general, the payload of an MPLS packet is opaque to the network
   element.  However, in many cases the network element still performs a
   lookup beyond the MPLS label stack, e.g. to obtain information such
   as L4 ports for load balancing.  It may be possible to perform data-
   plane probe classification in the same manner, additionally using the
   "Probe Marker" to distinguish the probe packets.
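
   To illustrate what a lookup beyond the MPLS label stack involves, the
   sketch below walks the 32-bit label stack entries (20-bit label,
   3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL) until
   the bottom-of-stack flag is found.  The function names are
   hypothetical, and the input is assumed to start at the first label
   stack entry.

```python
import struct


def payload_offset(frame):
    """Return the offset of the first byte after the MPLS label stack."""
    offset = 0
    while offset + 4 <= len(frame):
        (entry,) = struct.unpack_from("!I", frame, offset)
        offset += 4
        if (entry >> 8) & 0x1:  # bottom-of-stack bit
            return offset
    raise ValueError("label stack is truncated or has no bottom-of-stack entry")


def label_entry(label, bottom, ttl=64, tc=0):
    """Build one 32-bit label stack entry (helper used for the example)."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl)
```

   Once the payload offset is known, the same classification applied
   to unlabeled packets could be applied to the bytes that follow.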

   In accordance with [RFC6178], Label Edge Routers (LERs) are required
   not to impose an MPLS router-alert label for packets carrying the
   router-alert option.  It may be beneficial to enable such
   translation, so that an end-to-end validation could be performed if a
   control-plane capable MPLS network element is present on the probe's
   path.

8.  Multi-chip device considerations

   TBD

9.  IANA Considerations

   None

10.  Acknowledgements

   The author would like to thank L.J.  Wobker and Changhoom Kim for
   reviewing and providing valuable comments for the initial version of
   this document.

11.  References

11.1.  Normative References

   [RFC2113]  Katz, D., "IP Router Alert Option", RFC 2113,
              DOI 10.17487/RFC2113, February 1997,
              <http://www.rfc-editor.org/info/rfc2113>.

   [RFC2711]  Partridge, C. and A. Jackson, "IPv6 Router Alert Option",
              RFC 2711, DOI 10.17487/RFC2711, October 1999,
              <http://www.rfc-editor.org/info/rfc2711>.

   [RFC6398]  Le Faucheur, F., Ed., "IP Router Alert Considerations and
              Usage", BCP 168, RFC 6398, DOI 10.17487/RFC6398, October
              2011, <http://www.rfc-editor.org/info/rfc6398>.

   [RFC6178]  Smith, D., Mullooly, J., Jaeger, W., and T. Scholl, "Label
              Edge Router Forwarding of IPv4 Option Packets", RFC 6178,
              DOI 10.17487/RFC6178, March 2011,
              <http://www.rfc-editor.org/info/rfc6178>.

11.2.  Informative References

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014,
              <http://www.rfc-editor.org/info/rfc7276>.

Authors' Addresses

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA  94025
   US

   Email: petr@fb.com

   Remy Chang
   Barefoot Networks
   2185 Park Boulevard
   Palo Alto, CA  94306
   US

   Email: remy@barefootnetworks.com