idnits 2.17.1 

draft-irtf-nmrg-autonomic-sla-violation-detection-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 21, 2017) is 2380 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'LMAP' is mentioned on line 538, but not defined

  == Missing Reference: 'IPFIX' is mentioned on line 546, but not defined

  == Missing Reference: 'ALTO' is mentioned on line 555, but not defined

  == Outdated reference: A later version (-30) exists of
     draft-ietf-anima-autonomic-control-plane-09

  -- Obsolete informational reference (is this intentional?): RFC 4148
     (Obsoleted by RFC 6248)


     Summary: 0 errors (**), 0 flaws (~~), 5 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Management Research Group                               J. Nobre
3	Internet-Draft                       University of Vale do Rio dos Sinos
4	Intended status: Informational                              L. Granville
5	Expires: April 24, 2018          Federal University of Rio Grande do Sul
6	                                                                A. Clemm
7	                                                                  Huawei
8	                                                      A. Gonzalez Prieto
9	                                                                  VMware
10	                                                        October 21, 2017

12	     Autonomic Networking Use Case for Distributed Detection of SLA
13	                               Violations
14	          draft-irtf-nmrg-autonomic-sla-violation-detection-12

16	Abstract

18	   This document describes an experimental use case for autonomic
19	   networking concerning monitoring of Service Level Agreements (SLAs).
20	   The use case aims to detect violations of SLAs in a distributed
21	   fashion, striving to optimize and dynamically adapt the autonomic
22	   deployment of active measurement probes in a way that maximizes the
23	   likelihood of detecting service level violations with a given
24	   resource budget to perform active measurements, and is able to do so
25	   without any outside guidance or intervention.

27	   This document is a product of the IRTF Network Management Research
28	   Group (NMRG).  It is published for informational purposes.

30	Status of This Memo

32	   This Internet-Draft is submitted in full conformance with the
33	   provisions of BCP 78 and BCP 79.

35	   Internet-Drafts are working documents of the Internet Engineering
36	   Task Force (IETF).  Note that other groups may also distribute
37	   working documents as Internet-Drafts.  The list of current Internet-
38	   Drafts is at https://datatracker.ietf.org/drafts/current/.

40	   Internet-Drafts are draft documents valid for a maximum of six months
41	   and may be updated, replaced, or obsoleted by other documents at any
42	   time.  It is inappropriate to use Internet-Drafts as reference
43	   material or to cite them other than as "work in progress."

45	   This Internet-Draft will expire on April 24, 2018.

47	Copyright Notice

49	   Copyright (c) 2017 IETF Trust and the persons identified as the
50	   document authors.  All rights reserved.

52	   This document is subject to BCP 78 and the IETF Trust's Legal
53	   Provisions Relating to IETF Documents
54	   (https://trustee.ietf.org/license-info) in effect on the date of
55	   publication of this document.  Please review these documents
56	   carefully, as they describe your rights and restrictions with respect
57	   to this document.  Code Components extracted from this document must
58	   include Simplified BSD License text as described in Section 4.e of
59	   the Trust Legal Provisions and are provided without warranty as
60	   described in the Simplified BSD License.

62	Table of Contents

64	   1.  Introduction
65	   2.  Definitions and Acronyms
66	   3.  Current Approaches
67	   4.  Use Case Description
68	   5.  A Distributed Autonomic Solution
69	   6.  Intended User Experience
70	   7.  Implementation Considerations
71	     7.1.  Device Based Self-Knowledge and Decisions
72	     7.2.  Interaction with other devices
73	   8.  Comparison with current solutions
74	   9.  Related IETF Work
75	   10. Acknowledgements
76	   11. IANA Considerations
77	   12. Security Considerations
78	   13. Informative References
79	   Authors' Addresses

81	1.  Introduction

83	   The Internet has been growing dramatically in terms of size,
84	   capacity, and accessibility in recent years.  Communication
85	   requirements of distributed services and applications running on top
86	   of the Internet have become increasingly demanding.  Some examples
87	   are real-time interactive video or financial trading.  Providing such
88	   services involves stringent requirements in terms of acceptable
89	   latency, loss, or jitter.

91	   Performance requirements lead to the articulation of Service Level
92	   Objectives (SLOs) which must be met.  Those SLOs are part of Service
93	   Level Agreements (SLAs) that define a contract between the provider
94	   and the consumer of a service.  SLOs, in effect, constitute a service
95	   level guarantee that the consumer of the service can expect to
96	   receive (and often has to pay for).  Likewise, the provider of a
97	   service needs to ensure that the service level guarantee and
98	   associated SLOs are met.  Some examples of clauses that relate to
99	   service level objectives can be found in [RFC7297]).

101	   Violations of SLOs can be associated with significant financial loss,
102	   which can be divided into two categories.  For one, there is the loss
103	   that can be incurred by the user of a service when the agreed service
104	   levels are not provided.  For example, a financial brokerage's stock
105	   orders might suffer losses when it is unable to execute stock
106	   transactions in a timely manner.  An electronic retailer may lose
107	   customers when their online presence is perceived by customers as
108	   sluggish.  An online gaming provider may not be able to provide fair
109	   access to online players, resulting in frustrated players who are
110	   lost as customers.  In each case, the failure of a service provider
111	   to meet promised service level guarantees can have a substantial
112	   financial impact on users of the service.  By the same token, there
113	   is the loss that is incurred by the provider of a service who is
114	   unable to meet promised service level objectives.  Those losses can
115	   take several forms, such as penalties for not meeting the service
116	   level and, often more importantly, loss of revenue due to reduced
117	   customer satisfaction.  Hence, service level objectives are a key
118	   concern for the service provider.  In order to ensure that SLOs are
119	   not being violated, service levels need to be continuously monitored
120	   at the network infrastructure layer in order to know, for example,
121	   when mitigating actions need to be taken.  To that end, service level
122	   measurements must take place.

124	   Network measurements can be performed using active or passive
125	   measurement techniques.  In passive measurements, production traffic
126	   is observed and no monitoring traffic is created by the measurement
127	   process itself.  That is, network conditions are checked in a
128	   non-intrusive way.  In the context of IP Flow Information eXport
129	   (IPFIX), several documents were produced that define how to export
130	   data associated with flow records, i.e., data that is collected as
131	   part of passive measurement mechanisms, generally applied against
132	   flows of production traffic (e.g., [RFC7011]).  In addition, it would
133	   be possible to collect real data traffic (not just summarized flow
134	   records) with time-stamped packets, possibly sampled (e.g., per
135	   [RFC5474]), as a means of measuring and inferring service levels.
136	   Active measurements, on the other hand, are more intrusive to the
137	   network in the sense that they involve injecting synthetic test
138	   traffic into the network to measure network service levels, as
139	   opposed to simply observing production traffic.  The IP Performance
140	   Metrics (IPPM) WG produced documents that describe active measurement
141	   mechanisms, such as: One-Way Active Measurement Protocol (OWAMP)
142	   [RFC4656], Two-Way Active Measurement Protocol (TWAMP) [RFC5357], and
143	   Cisco Service Level Assurance Protocol (SLA) [RFC6812].  In addition,
144	   there are some mechanisms that do not cleanly fit into either active
145	   or passive categories, such as Performance and Diagnostic Metrics
146	   Destination Option (PDM) techniques
147	   [draft-ietf-ippm-6man-pdm-option].

149	   Active measurement mechanisms offer a high level of control of what
150	   and how to measure.  They do not require inspecting production
151	   traffic.  Because of this, active measurements usually offer better
152	   accuracy and privacy than passive measurement mechanisms.  Traffic
153	   encryption and regulations that limit the amount of payload
154	   inspection that can occur are non-issues.  Furthermore, active
155	   measurement mechanisms are able to detect end-to-end network
156	   performance problems in a fine-grained way (e.g., simulating the
157	   traffic that must be handled considering specific Service Level
158	   Objectives - SLOs).  As a result, active measurements are often
159	   preferred over passive measurement for SLA monitoring.  Measurement
160	   probes must be hosted in network devices and measurement sessions
161	   must be activated to compute the current network metrics (e.g.,
162	   considering those described in [RFC4148]).  This activation should be
163	   dynamic in order to follow changes in network conditions, such as
164	   those related with routes being added or new customer demands.

166	   While offering many advantages, active measurements are expensive in
167	   terms of network resource consumption.  Active measurements generally
168	   involve measurement probes that generate synthetic test traffic that
169	   is directed at a responder.  The responder needs to timestamp test
170	   traffic it receives and reflect it back to the originating
171	   measurement probe.  The measurement probe subsequently processes the
172	   returned packets along with time stamping information in order to
173	   compute service levels.  Accordingly, active measurements consume
174	   substantial CPU cycles as well as memory of network devices to
175	   generate and process test traffic.  In addition, synthetic traffic
176	   increases network load.  Active measurements thus compete for
177	   resources with other functions, including routing and switching.

179	   The resources required and traffic generated by the active
180	   measurement sessions are to a large part a function of the number of
181	   measured network destinations.  (In addition, the amount of traffic
182	   generated for each measurement plays a role, which in turn influences
183	   the accuracy of the measurement.)  The more destinations are being
184	   measured, the larger the amount of resources consumed and traffic
185	   needed to perform the measurements.  Thus, better monitoring
186	   coverage requires deploying more sessions, which consequently
187	   increases consumed resources.  Conversely, observing just a small
188	   subset of all network flows can lead to insufficient coverage.

191	   Furthermore, while some end-to-end service levels can be determined
192	   by adding up the service levels observed across different path
193	   segments, the same is not true for all service levels.  For example,
194	   the end-to-end delay or packet loss from a node A to a node C routed
195	   via a node B can often be computed simply by adding delays (or loss)
196	   from A to B, and B to C.  This allows decomposing a large set of
197	   end-to-end measurements into a much smaller set of segment
198	   measurements.  However, end-to-end jitter and (for example) Mean
199	   Opinion Scores cannot be decomposed as easily and, for higher
200	   accuracy, must be measured end-to-end.
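
   As a minimal illustration of the decomposition described above, the
   following Python sketch composes segment measurements for the A-to-B
   and B-to-C example.  The function names and sample values are
   hypothetical.  Delay is summed directly.  For loss, the sketch
   assumes that loss on each segment is independent, in which case
   simple addition is only an approximation that holds when loss ratios
   are small.

```python
from math import prod


def compose_delay(segment_delays_ms):
    """End-to-end one-way delay as the sum of per-segment delays."""
    return sum(segment_delays_ms)


def compose_loss(segment_loss_ratios):
    """End-to-end loss ratio, assuming independent loss per segment.

    A packet survives the path only if it survives every segment, so the
    survival probabilities are multiplied.
    """
    return 1.0 - prod(1.0 - p for p in segment_loss_ratios)


# Hypothetical segment measurements: A -> B and B -> C.
delay_a_c = compose_delay([12.0, 8.5])
loss_a_c = compose_loss([0.01, 0.02])
```

   With these sample values, plain addition of the loss ratios would
   give 0.03, slightly above the composed value.  Jitter and Mean
   Opinion Scores have no comparable composition rule, which is why
   they are measured end-to-end.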

202	   Hence, the decision of how to place measurement probes becomes an
203	   important management activity.  The goal is to obtain maximum
204	   benefits of service level monitoring with a limited amount of
205	   measurement overhead.  Specifically, the goal is to maximize the
206	   number of service level violations that are detected with a limited
207	   amount of resources.

209	   The use case and the solution approach described in this document
210	   address an important practical issue.  They are intended to provide a
211	   basis for further experimentation to lead into solutions for wider
212	   deployment.  This document represents the consensus of the IRTF's
213	   Network Management Research Group (NMRG).  It was discussed
214	   extensively and received three separate in-depth reviews.

216	2.  Definitions and Acronyms

218	   Active Measurements: Techniques to measure service levels that
219	   involve generating and observing synthetic test traffic

221	   Passive Measurements: Techniques used to measure service levels based
222	   on observation of production traffic

224	   AN: Autonomic Network; a network containing exclusively autonomic
225	   nodes, requiring no configuration and deriving all required
226	   information through self-knowledge, discovery, or intent.

228	   Autonomic Service Agent (ASA): An agent implemented on an autonomic
229	   node that implements an autonomic function, either in part (in the
230	   case of a distributed function, as in the context of this document),
231	   or whole.

233	   Measurement Session: A communications association between a Probe and
234	   a Responder used to send and reflect synthetic test traffic for
235	   active measurements

237	   Probe: The source of synthetic test traffic in an active measurement

238	   Responder: The destination for synthetic test traffic in an active
239	   measurement

241	   SLA: Service Level Agreement

243	   SLO: Service Level Objective

245	   P2P: Peer-to-Peer

247	   (Note: definitions of AN and ASA are borrowed from [RFC7575]).

249	3.  Current Approaches

251	   In current deployments of active measurement solutions, the best
252	   practice for distributing the available measurement sessions across
253	   the network relies entirely on the expertise of the human
254	   administrator to infer the best locations at which to activate such
255	   sessions.  This is done in several steps.  First, traffic
256	   information is collected in order to understand the traffic matrix.
257	   Then, the administrator uses this information to infer the best
258	   destinations for measurement sessions.  After that, the
259	   administrator activates sessions on the chosen subset of
260	   destinations, considering the available resources.  This practice,
261	   however, does not scale well because it is labor intensive and
262	   error-prone for the administrator to determine which sessions should
263	   be activated given the set of critical flows that need to be
264	   measured.  Even worse, this practice fails completely in networks
265	   whose critical flows are short-lived and whose paths change
266	   dynamically, as in modern cloud environments.  Fast reactions are
267	   necessary to reconfigure the sessions, and administrators are not
268	   fast enough at computing and activating the new set of required
269	   sessions every time the network traffic pattern changes.  Finally,
270	   current active measurement practice usually covers only a fraction
271	   of the network flows that should be observed, which invariably
272	   leads to undetected SLA violations.

275	4.  Use Case Description

277	   The use case involves a service level provider who needs to monitor
278	   the network to detect service level violations using active service
279	   level measurements, and wants to be able to do so with minimal human
280	   intervention.  The goal is to conduct the measurements in an
281	   effective manner maximizing the percentage of detected service level
282	   violations.  The service level provider has a bounded resource budget
283	   with regard to measurements that can be performed, specifically,
284	   with regard to the number of measurements that can be conducted
285	   concurrently from any one network device, and possibly with regard
286	   to the total amount of measurement traffic on the network.  However,
287	   while at any one point in time the number of measurements conducted
288	   is limited, it is possible for a device to change which destinations
289	   to measure over time.  This can be exploited to achieve a balance of
290	   eventually covering all possible destinations using a reasonable
291	   amount of "sampling" where measurement coverage of a destination
292	   cannot be continuous.  The solution needs to be dynamic and be able
293	   to cope with network conditions which may change over time.  The
294	   solution should also be embeddable inside network devices that
295	   control the deployment of active measurement mechanisms.

297	   The goal is to conduct the measurements in a smart manner that
298	   ensures that the network is broadly covered and the likelihood of
299	   detecting service level violations is maximized.  In order to
300	   maximize that likelihood, it is reasonable to focus measurement
301	   resources on destinations that are more likely to incur a violation,
302	   while spending fewer resources on destinations that are more likely
303	   to be in compliance.  To do so, there are various aspects that
304	   can be exploited, including past measurements (destinations close to
305	   a service level threshold requiring more focus than destinations
306	   further from it), complementation with passive measurements such as
307	   flow data (to identify network destinations that are currently
308	   popular and critical), and observations from other parts of the
309	   network.  In addition, measurements can be coordinated among
310	   different network devices to avoid hitting the same destination at
311	   the same time and to be able to share results that may be useful in
312	   future probe placement.
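
   One way to picture the idea of focusing on destinations close to a
   service level threshold is the Python sketch below.  It is an
   illustration only, not the algorithm of the cited work: the Target
   type, the ratio-based risk score, and the sample values are all
   assumptions made for this example, and only a delay SLO is
   considered.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    last_delay_ms: float  # most recent measured delay (hypothetical data)
    slo_delay_ms: float   # delay threshold from the SLO


def violation_risk(target: Target) -> float:
    """Ratio of the last measurement to its threshold.

    Values near 1.0 are close to the threshold; values of 1.0 or more
    mean the last measurement already violated the SLO.
    """
    return target.last_delay_ms / target.slo_delay_ms


def select_targets(targets, budget):
    """Return the names of at most `budget` targets, riskiest first."""
    ranked = sorted(targets, key=violation_risk, reverse=True)
    return [t.name for t in ranked[:budget]]


targets = [
    Target("dst-a", 45.0, 50.0),
    Target("dst-b", 10.0, 50.0),
    Target("dst-c", 95.0, 100.0),
    Target("dst-d", 30.0, 100.0),
]
selected = select_targets(targets, budget=2)
```

   A purely greedy ranking such as this one would never revisit
   destinations that looked healthy in the past, so a practical
   solution would also reserve part of the budget for sampling the
   remaining destinations over time.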

314	   Clearly, static solutions will have severe limitations.  At the same
315	   time, human administrators cannot be in the loop for continuous
316	   dynamic measurement probe reconfigurations.  Accordingly, an
317	   automated or, ideally, autonomic solution is needed in which network
318	   measurements are automatically orchestrated and dynamically
319	   reconfigured from within the network.  This can be accomplished using
320	   an autonomic solution that is distributed, using Autonomic Service
321	   Agents that are implemented on nodes in the network.

323	5.  A Distributed Autonomic Solution

325	   The use of Autonomic Networking (AN) [RFC7575] can help such
326	   detection through an efficient activation of measurement sessions.
327	   Such an approach, along with a detailed assessment confirming its
328	   viability, has been described in [P2PBNM-Nobre-2012].  The problem
329	   to be solved by AN in the present use case is how to steer the
330	   process of Measurement Session activation by a complete solution
331	   that sets all necessary parameters for this activation to operate
332	   efficiently, reliably, and securely, with no required human
333	   intervention other than setting overall policy.

335	   When a node first comes online, it has no information about which
336	   measurements are more critical than others.  In the absence of
337	   information about past measurements and information from measurement
338	   peers, it may start with an initial set of measurement sessions,
339	   possibly randomly seeding a set of starter measurements, perhaps
340	   taking a round robin approach for subsequent measurement rounds.
341	   However, as measurements are collected, a node will gain increasing
342	   information that it can utilize to refine its strategy of selecting
343	   measurement targets going forward.  For one, it may take note of
344	   which targets returned measurement results very close to service
345	   level thresholds that may therefore require closer scrutiny compared
346	   to others.  Second, it may utilize observations that are made by its
347	   measurement peers in order to conclude which measurement targets may
348	   be more critical than others, and in order to ensure that proper
349	   overall measurement coverage is obtained (so that nodes do not all
350	   happen to measure the same targets while other targets are not
351	   measured at all).
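
   The bootstrap behavior described above can be sketched as follows.
   This is a hedged illustration under stated assumptions: the
   TargetSelector class, its explore_slots parameter, and the risk
   scores fed into it are hypothetical.  With no history, the selector
   falls back to round robin over a randomly shuffled target list; once
   results are recorded, most of the budget goes to the riskiest known
   targets while the remaining slots keep rotating through the others.

```python
import random


class TargetSelector:
    def __init__(self, targets, budget, explore_slots=1, seed=None):
        self.targets = list(targets)
        self.budget = budget                # concurrent sessions allowed
        self.explore_slots = explore_slots  # slots kept for round robin
        self.scores = {}                    # target -> last risk score
        self.cursor = 0                     # round-robin position
        # Random starting order, so nodes do not all begin the same way.
        random.Random(seed).shuffle(self.targets)

    def _round_robin(self, count, exclude):
        """Pick up to `count` targets in rotation, skipping `exclude`."""
        picked, scanned = [], 0
        while len(picked) < count and scanned < len(self.targets):
            candidate = self.targets[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.targets)
            scanned += 1
            if candidate not in exclude:
                picked.append(candidate)
        return picked

    def next_round(self):
        """Choose the targets to measure in the next round."""
        exploit_slots = max(self.budget - self.explore_slots, 0)
        risky = sorted(self.scores, key=self.scores.get, reverse=True)
        risky = risky[:exploit_slots]
        rest = self._round_robin(self.budget - len(risky), set(risky))
        return risky + rest

    def record(self, target, risk):
        """Store the latest risk score observed for a target."""
        self.scores[target] = risk


selector = TargetSelector(["a", "b", "c", "d"], budget=2, seed=1)
first = selector.next_round()    # no history yet: pure round robin
for name in first:
    selector.record(name, 0.5)
selector.record(first[0], 0.95)  # first[0] came close to its threshold
second = selector.next_round()   # keeps first[0], rotates the other slot
```

   Scores reported by measurement peers could be fed through the same
   record() call, which is one simple way to let peer observations
   influence local target selection.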

353	   We advocate for embedding Peer-to-Peer (P2P) technology in network
354	   devices in order to conduct the Measurement Session activation
355	   decisions using autonomic control loops.  Specifically, we advocate
356	   for network devices to implement an autonomic function to monitor
357	   service levels for violations of service level objectives,
358	   determining which Measurement Sessions to set up at any given point
359	   in time based on current and past observations of the node, and of
360	   other peer nodes.

362	   By performing these functions locally and autonomically on the device
363	   itself, which measurements to conduct can be modified quickly based
364	   on local observations while taking local resource availability into
365	   account.  This allows a solution to be more robust and react more
366	   dynamically to rapidly changing service levels than a solution that
367	   has to rely on central coordination.  However, in order to optimize
368	   decisions about which measurements to conduct, a node will need to
369	   communicate with other nodes.  This allows a node to take into
370	   account other nodes' observations in addition to its own in its
371	   decisions.

373	   For example, remote destinations whose observed service levels are on
374	   the verge of violating stated objectives may require closer
375	   monitoring than remote destinations that are comfortably within a
376	   range of tolerance.  It also allows nodes to coordinate their probing
377	   decisions to collectively achieve the best possible measurement
378	   coverage.  As the amount of resources available for monitoring and
379	   for exchange of measurement data and coordination with other nodes
380	   are limited, a node may further be interested in identifying other
381	   nodes whose observations are most similar to and correlated with its
382	   own.  This helps a node prioritize the other nodes with which to
383	   coordinate and exchange data.  All of this requires the use of a
384	   P2P overlay.
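
   To make the notion of "similar and correlated" observations more
   concrete, the sketch below ranks peers by the Pearson correlation
   between a node's own measurements and each peer's measurements over
   the destinations both have observed.  The choice of Pearson
   correlation, the min_common cutoff, and all names and values are
   assumptions for illustration; the text does not prescribe a
   particular similarity measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_peers(own, peers, min_common=3):
    """Rank peers by correlation over commonly observed destinations.

    own:   dict mapping destination -> measured value
    peers: dict mapping peer name -> dict of destination -> value
    Peers sharing fewer than `min_common` destinations are skipped.
    """
    ranking = []
    for peer, observed in peers.items():
        common = sorted(set(own) & set(observed))
        if len(common) < min_common:
            continue
        score = pearson([own[d] for d in common],
                        [observed[d] for d in common])
        ranking.append((peer, score))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


own = {"d1": 10.0, "d2": 20.0, "d3": 30.0}
peers = {
    "peer-x": {"d1": 11.0, "d2": 21.0, "d3": 31.0},  # tracks own values
    "peer-y": {"d1": 30.0, "d2": 20.0, "d3": 10.0},  # opposite trend
    "peer-z": {"d1": 5.0},                           # too little overlap
}
ranking = rank_peers(own, peers)
```

   A node with a limited exchange budget could then coordinate
   primarily with the peers at the top of this ranking.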

386	   A P2P overlay is essential for several reasons:

388	   o  It makes it possible for nodes (or, more precisely, the Autonomic
389	      Service Agents that are deployed on those nodes) in the network
390	      to autonomically set up Measurement Sessions, without having to
391	      rely on a central management system or controller to configure
392	      measurement probes and responders.

395	   o  It facilitates the exchange of data between different nodes to
396	      share measurement results so that each node can refine its
397	      measurement strategy based not just on its own observations, but
398	      also on observations from its peers.

400	   o  It allows nodes to coordinate their measurements to obtain the
401	      best possible test coverage and avoid measurements that have a
402	      very low likelihood of detecting service level violations.

404	   The provisioning of the P2P overlay should be transparent for the
405	   network administrator.  An Autonomic Control Plane such as defined in
406	   [I-D.anima-autonomic-control-plane] provides an ideal candidate for
407	   the P2P overlay to run on.

409	   An autonomic solution for the distributed detection of SLA violations
410	   provides several benefits.  First, efficiency: this solution should
411	   optimize the resource consumption and avoid resource starvation on
412	   the network devices.  A device that is "self-aware" of its available
413	   resources will be able to adjust measurement activities rapidly as
414	   needed, without requiring a separate control loop involving resource
415	   monitoring by an external system.  Second, placing the logic for
416	   deciding where to conduct measurements in the node itself enables
417	   rapid control loops in which devices are able to react instantly to
418	   observations and adjust their measurement strategy.  For example, a
419	   device could decide to adjust the amount of synthetic test traffic
420	   being sent during the measurement itself depending on results
421	   observed so far on this and on other concurrent measurement
422	   sessions.  As a result, the solution could decrease the time
423	   necessary to detect SLA violations.  The adaptivity of an autonomic
424	   loop could capture network dynamics faster than a human
425	   administrator or even a central controller.  Finally, the solution
426	   could help to reduce the workload of human administrators or, at
427	   least, relieve them of the need to perform operational tasks.

429	   In practice, these factors combine to maximize the likelihood of SLA
430	   violations being detected while operating within a given resource
431	   budget.  They allow for a continuous measurement strategy that
432	   takes into account past measurement results, observations of other
433	   measures such as link utilization or flow data, sharing of
434	   measurement results between network devices, and coordination of
435	   future measurement activities among nodes.  Combined, this can
436	   result in efficient measurement decisions that balance broad
437	   network coverage against homing in on service level "hot spots".

439	6.  Intended User Experience

441	   The autonomic solution should not require any human intervention in
442	   the distributed detection of SLA violations.  By virtue of the
443	   solution being autonomic, human users will not have to plan which
444	   measurements to conduct in a network, often a very labor intensive
445	   task today that requires detailed analysis of traffic matrices and
446	   network topologies and does not lend itself to dynamic adjustment.
447	   Likewise, they will not have to configure measurement probes and
448	   responders.

450	   There are some ways in which a human administrator may still interact
451	   with the solution.  For one, the human administrator will of course
452	   be notified and obtain reports about service level violations that
453	   are observed.  Second, a human administrator may set policies
454	   regarding how closely to monitor the network for service level
455	   violations and how many resources to spend.  For example, an
456	   administrator may set a resource budget that is assigned to network
457	   devices for measurement operations.  With that given budget, the
458	   number of SLO violations that are detected will be maximized.
459	   Alternatively, an administrator may set a target for the percentage
460	   of SLO violations that must be detected, i.e., a target for the ratio
461	   between the number of detected SLO violations, and the number of
462	   total SLO violations that are actually occurring (some of which might
463	   go undetected).  In that case, the solution will aim to minimize the
464	   resources spent (i.e., the amount of test traffic and Measurement
465	   Sessions) that are required to achieve that target.
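
   The detection target described above is a simple ratio, sketched
   below in Python.  The function names and the sample counts are
   hypothetical.  Note the assumption that the number of violations
   actually occurring is known: a running system only sees what it
   detects, so this figure would have to come from some independent
   ground truth, for example during an evaluation.

```python
def detection_ratio(detected: int, occurred: int) -> float:
    """Fraction of occurring SLO violations that were detected."""
    if occurred == 0:
        # Nothing to detect; treated as full coverage (an assumption).
        return 1.0
    return detected / occurred


def meets_target(detected: int, occurred: int, target: float) -> bool:
    """True if the detection ratio reaches the administrator's target."""
    return detection_ratio(detected, occurred) >= target


# Hypothetical evaluation: 42 of 60 occurring violations were detected.
ratio = detection_ratio(42, 60)
```

   Under the first policy style, the resource budget is fixed and this
   ratio is what the solution tries to maximize.  Under the second, the
   target ratio is fixed and the resources spent are minimized.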

467	7.  Implementation Considerations

469	   The active measurement model assumes that a typical infrastructure
470	   will have multiple network segments and Autonomous Systems (ASs), and
471	   a reasonably large number of routers.  It also considers that
472	   multiple SLOs can be in place at a given time.  Since
473	   interoperability in a heterogeneous network is a goal, features found
474	   in different active measurement mechanisms (e.g., OWAMP, TWAMP, and
475	   IPSLA) and device programmability interfaces (such as Juniper's Junos
476	   API or Cisco's Embedded Event Manager) could be used for the
477	   implementation.  The autonomic solution should include and/or
478	   reference specific algorithms, protocols, metrics and technologies
479	   for the implementation of distributed detection of SLA violations as
480	   a whole.

482	   Finally, it should be noted that there are multiple deployment
483	   scenarios, including deployment scenarios that involve physical
484	   devices hosting autonomic functions, or virtualized infrastructure
485	   hosting the same.  Co-deployment in conjunction with Virtual Network
486	   Functions (VNF) is a possibility for further study.

488	7.1.  Device Based Self-Knowledge and Decisions

490	   Each device has self-knowledge about the local SLA monitoring.  This
491	   could be in the form of historical measurement data and SLOs.
492	   Besides that, the devices would have algorithms that could decide
493	   which probes should be activated at a given time.  The choice of
494	   which algorithm is better for a specific situation would also be
495	   autonomic.

497	7.2.  Interaction with other devices

499	   Network devices should share information about service level
500	   measurement results.  This information can speed up the detection of
501	   SLA violations and increase the number of detected SLA violations.
502	   For example, if one device detects that a remote destination is in
503	   danger of violating an SLO, other devices may conduct additional
504	   measurements to the same destination or other destinations in its
505	   proximity.  For any given network device, the exchange of data may be
506	   more important with some devices (for example, devices in the same
507	   network neighborhood, or devices that are "correlated" by some other
508	   means) than with others.  The definition of network devices that
509	   exchange measurement data, i.e., management peers, creates a new
510	   topology.  Different approaches could be used to define this topology
511	   (e.g., correlated peers [P2PBNM-Nobre-2012]).  To bootstrap peer
512	   selection, each device should use its known endpoint neighbors
513	   (e.g., from its FIB and RIB) as the initial seed to get possible peers.
514	   It should be noted that a solution will benefit if topology
515	   information and network discovery functions are provided by the
516	   underlying autonomic framework.  A solution will need to be able to
517	   discover measurement peers as well as measurement targets,
518	   specifically measurement targets that support active measurement
519	   responders and which will be able to respond to measurement requests
520	   and reflect measurement traffic as needed.
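
   The interaction described above can be sketched as follows, assuming
   a device that seeds its management peers from routing neighbors and
   raises the local priority of a destination when a peer reports that
   it is close to its SLO.  The class PeerView, the report fields, and
   the 20% warning margin are hypothetical; no message format or
   threshold is defined by this memo.

```python
from typing import Dict, Iterable, Set


def seed_peers(local_id: str, next_hops: Iterable[str]) -> Set[str]:
    """Initial candidate management peers: routing neighbors except self."""
    return {hop for hop in next_hops if hop != local_id}


class PeerView:
    """Per-device view of management peers and of shared warnings."""

    def __init__(self, local_id: str, next_hops: Iterable[str],
                 warn_margin: float = 0.2) -> None:
        self.peers: Set[str] = seed_peers(local_id, next_hops)
        self.warn_margin = warn_margin        # assumed threshold
        self.boost: Dict[str, float] = {}     # destination -> extra priority

    def on_report(self, sender: str, destination: str, margin: float) -> bool:
        """Process a peer report and return True if it was accepted.

        `margin` is the fraction of SLO headroom the sender observed
        (0.0 means the destination is at its SLO). Reports from devices
        that are not known peers are ignored.
        """
        if sender not in self.peers:
            return False
        if margin < self.warn_margin:
            # Keep the strongest warning seen so far for this destination.
            previous = self.boost.get(destination, 0.0)
            self.boost[destination] = max(previous, 1.0 - margin)
        return True
```

   The boost values could then be added to a locally computed priority
   when the device decides which probes to activate next, so that
   destinations flagged by peers are measured sooner.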

522	8.  Comparison with current solutions

524	   There is no standardized solution for distributed autonomic detection
525	   of SLA violations.  Current solutions are restricted to ad hoc
526	   scripts running on a per-node basis to automate some administrator
527	   actions.  There are some proposals for passive probe
528	   activation (e.g., DECON and CSAMP), but without the focus on
529	   autonomic features.

531	9.  Related IETF Work

533	   The following paragraphs discuss related IETF work and are provided
534	   for reference.  This section is not exhaustive, rather it provides an
535	   overview of the various initiatives and how they relate to autonomic
536	   distributed detection of SLA violations.

538	   1.  [LMAP]: The Large-Scale Measurement of Broadband Performance
539	       Working Group aims to develop standards for performance
540	       management.  Since its mechanisms also consist of deploying
541	       measurement probes, the autonomic solution could be relevant for
542	       LMAP, especially considering SLA violation screening.  Besides
543	       that, a solution to decrease the workload of human administrators
544	       in service providers is probably highly desirable.

546	   2.  [IPFIX]: IP Flow Information EXport (IPFIX) aims at the
547	       standardization of IP flow export (e.g., NetFlow).  IPFIX uses
548	       measurement probes (i.e., metering exporters) to gather flow
549	       data.  In this context, the autonomic solution for the activation
550	       of active measurement probes could possibly be extended to also
551	       address passive measurement probes.  Besides that, flow
552	       information could be used in the decision making of probe
553	       activation.

555	   3.  [ALTO]: The Application Layer Traffic Optimization Working Group
556	       aims to provide topological information at a higher abstraction
557	       layer, which can be based upon network policy, and with
558	       application-relevant service functions located in it.  Their work
559	       could be leveraged for the definition of the topology regarding
560	       the network devices which exchange measurement data.

562	10.  Acknowledgements

564	   We wish to acknowledge the helpful contributions, comments, and
565	   suggestions that were received from Mohamed Boucadair, Brian
566	   Carpenter, Hanlin Fang, Bruno Klauser, Diego Lopez, Vincent Roca, and
567	   Eric Voit.  In addition, we thank Diego Lopez, Vincent Roca, and
568	   Brian Carpenter for their detailed reviews.

570	11.  IANA Considerations

572	   This memo includes no request to IANA.

574	12.  Security Considerations

576	   Security of the solution hinges on the security of the network
577	   underlay, i.e., the Autonomic Control Plane.  If the Autonomic
578	   Control Plane were to be compromised, an attacker could undermine the
579	   effectiveness of measurement coordination by reporting fraudulent
580	   measurement results to peers.  This would cause measurement probes to
581	   be deployed in an ineffective manner that would increase the
582	   likelihood that violations of service level objectives go undetected.

584	   Likewise, security of the solution hinges on the security of the
585	   deployment mechanism for autonomic functions, in this case, the
586	   autonomic function that conducts the service level measurements.  If
587	   an attacker were able to hijack an autonomic function, it could try
588	   to exhaust or exceed the resources that should be spent on autonomic
589	   measurements in order to deplete network resources, including network
590	   bandwidth due to higher-than-necessary volumes of synthetic test
591	   traffic generated by measurement probes.  Again, it could also lead
592	   to reporting of misleading results, among other things resulting in
593	   non-optimal selection of measurement targets and in turn an increase
594	   in the likelihood that service level violations go undetected.

596	13.  Informative References

598	   [draft-anima-boot]
599	              Pritikin, M., Richardson, M., Behringer, M., Bjarnason,
600	              S., and K. Watsen, "draft-ietf-anima-bootstrapping-
601	              keyinfra", draft-ietf-anima-bootstrapping-keyinfra-06
602	              (work in progress), May 2017.

604	   [draft-ietf-ippm-6man-pdm-option]
605	              Elkins, N., Hamilton, R., and M. Ackermann, "draft-ietf-
606	              ippm-6man-pdm-option", draft-ietf-ippm-6man-pdm-option-11
607	              (work in progress), June 2017.

609	   [I-D.anima-autonomic-control-plane]
610	              Behringer, M., Eckert, T., and S. Bjarnason, "An Autonomic
611	              Control Plane", draft-ietf-anima-autonomic-control-
612	              plane-09 (work in progress), August 2017.

614	   [P2PBNM-Nobre-2012]
615	              Nobre, J., Granville, L., Clemm, A., and A. Gonzalez
616	              Prieto, "Decentralized Detection of SLA Violations Using
617	              P2P Technology", 8th International Conference on Network
618	              and Service Management (CNSM), 2012,
619	              <http://ieeexplore.ieee.org/xpls/
620	              abs_all.jsp?arnumber=6379997>.

622	   [RFC4148]  Stephan, E., "IP Performance Metrics (IPPM) Metrics
623	              Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148, August
624	              2005, <https://www.rfc-editor.org/info/rfc4148>.

626	   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
627	              Zekauskas, "A One-way Active Measurement Protocol
628	              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006,
629	              <https://www.rfc-editor.org/info/rfc4656>.

631	   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
632	              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
633	              RFC 5357, DOI 10.17487/RFC5357, October 2008,
634	              <https://www.rfc-editor.org/info/rfc5357>.

636	   [RFC5474]  Duffield, N., Ed., Chiou, D., Claise, B., Greenberg, A.,
637	              Grossglauser, M., and J. Rexford, "A Framework for Packet
638	              Selection and Reporting", RFC 5474, DOI 10.17487/RFC5474,
639	              March 2009, <https://www.rfc-editor.org/info/rfc5474>.

641	   [RFC6812]  Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
642	              S., and E. Yedavalli, "Cisco Service-Level Assurance
643	              Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013,
644	              <https://www.rfc-editor.org/info/rfc6812>.

646	   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
647	              "Specification of the IP Flow Information Export (IPFIX)
648	              Protocol for the Exchange of Flow Information", STD 77,
649	              RFC 7011, DOI 10.17487/RFC7011, September 2013,
650	              <https://www.rfc-editor.org/info/rfc7011>.

652	   [RFC7297]  Boucadair, M., Jacquenet, C., and N. Wang, "IP
653	              Connectivity Provisioning Profile (CPP)", RFC 7297,
654	              DOI 10.17487/RFC7297, July 2014,
655	              <https://www.rfc-editor.org/info/rfc7297>.

657	   [RFC7575]  Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
658	              Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
659	              Networking: Definitions and Design Goals", RFC 7575,
660	              DOI 10.17487/RFC7575, June 2015,
661	              <https://www.rfc-editor.org/info/rfc7575>.

663	Authors' Addresses
664	   Jeferson Campos Nobre
665	   University of Vale do Rio dos Sinos
666	   Porto Alegre
667	   Brazil

669	   Email: jcnobre@unisinos.br

671	   Lisandro Zambenedetti Granville
672	   Federal University of Rio Grande do Sul
673	   Porto Alegre
674	   Brazil

676	   Email: granville@inf.ufrgs.br

678	   Alexander Clemm
679	   Huawei
680	   Santa Clara, California
681	   USA

683	   Email: ludwig@clemm.org

685	   Alberto Gonzalez Prieto
686	   VMware
687	   Palo Alto, California
688	   USA

690	   Email: agonzalezpri@vmware.com