idnits 2.17.1 

draft-irtf-nmrg-autonomic-sla-violation-detection-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (April 6, 2016) is 2942 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'LMAP' is mentioned on line 313, but not defined

  == Missing Reference: 'IPFIX' is mentioned on line 320, but not defined

  == Missing Reference: 'ALTO' is mentioned on line 326, but not defined

  -- Obsolete informational reference (is this intentional?): RFC 4148
     (Obsoleted by RFC 6248)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Management Research Group                               J. Nobre
3	Internet-Draft                                              L. Granville
4	Intended status: Informational   Federal University of Rio Grande do Sul
5	Expires: October 8, 2016                                        A. Clemm
6	                                                               A. Prieto
7	                                                           Cisco Systems
8	                                                           April 6, 2016

10	     Autonomic Networking Use Case for Distributed Detection of SLA
11	                               Violations
12	          draft-irtf-nmrg-autonomic-sla-violation-detection-03

14	Abstract

16	   This document describes a use case for autonomic networking in
17	   distributed detection of Service Level Agreement (SLA) violations.
18	   It is one of a series of use cases intended to illustrate
19	   requirements for autonomic networking.

21	Status of This Memo

23	   This Internet-Draft is submitted in full conformance with the
24	   provisions of BCP 78 and BCP 79.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF).  Note that other groups may also distribute
28	   working documents as Internet-Drafts.  The list of current Internet-
29	   Drafts is at http://datatracker.ietf.org/drafts/current/.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   This Internet-Draft will expire on October 8, 2016.

38	Copyright Notice

40	   Copyright (c) 2016 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents
45	   (http://trustee.ietf.org/license-info) in effect on the date of
46	   publication of this document.  Please review these documents
47	   carefully, as they describe your rights and restrictions with respect
48	   to this document.  Code Components extracted from this document must
49	   include Simplified BSD License text as described in Section 4.e of
50	   the Trust Legal Provisions and are provided without warranty as
51	   described in the Simplified BSD License.

53	Table of Contents

55	   1.  Introduction
56	   2.  Definitions and Acronyms
57	   3.  Current Approaches
58	   4.  Problem Statement
59	   5.  Benefits of an Autonomic Solution
60	   6.  Intended User and Administrator Experience
61	   7.  Analysis of Parameters and Information Involved
62	     7.1.  Device Based Self-Knowledge and Decisions
63	     7.2.  Interaction with other devices
64	   8.  Comparison with current solutions
65	   9.  Related IETF Work
66	   10. Acknowledgements
67	   11. IANA Considerations
68	   12. Security Considerations
69	   13. References
70	     13.1.  Normative References
71	     13.2.  Informative References
72	   Authors' Addresses

74	1.  Introduction

76	   The Internet has been growing dramatically in terms of size,
77	   capacity, and accessibility in recent years.  Communication
78	   requirements of distributed services and applications running on top
79	   of the Internet have become increasingly demanding.  Some examples
80	   are real-time interactive video or financial trading.  Providing such
81	   services involves stringent requirements in terms of acceptable
82	   latency, loss, or jitter.  Those requirements lead to the
83	   articulation of Service Level Objectives (SLOs) which are to be met.
84	   Those SLOs become part of Service Level Agreements (SLAs) that
85	   articulate a contract between the provider and the consumer of a
86	   service.  To fulfill a service, it must be ensured that the SLOs
87	   are met.  Examples of service fulfillment clauses can be found in
88	   [RFC7297].  Violations of SLOs can be associated with significant
89	   financial loss, which can be divided into two types.  First, there is
90	   the loss incurred by the service users (e.g., the trader whose orders
91	   are not executed in a timely manner).  Second, there is the loss
92	   incurred by the service provider through penalties for missed SLOs
93	   and lost revenue due to reduced customer satisfaction.  Thus, the
94	   service level requirements of critical network services have become a
95	   key concern for network administrators.  To ensure that SLAs are not
96	   being violated, service levels need to be constantly monitored at the
97	   network infrastructure layer.  To that end, network measurements must
98	   take place.

100	   Network measurement mechanisms are performed through either active or
101	   passive measurement techniques.  In passive measurements, production
102	   traffic is observed.  Network conditions are checked in a non-
103	   intrusive way because no monitoring traffic is created by the
104	   measurement process itself.  In the context of IP Flow Information
105	   EXport (IPFIX) WG, several documents were produced to define passive
106	   measurement mechanisms (e.g., flow records specification [RFC3954]).
107	   Active measurement, on the other hand, is intrusive because it
108	   injects synthetic traffic into the network to measure the network
109	   performance.  The IP Performance Metrics (IPPM) WG produced documents
110	   that describe active measurement mechanisms, such as: One-Way Active
111	   Measurement Protocol (OWAMP) [RFC4656], Two-Way Active Measurement
112	   Protocol (TWAMP) [RFC5357], and Cisco Service Level Assurance
113	   Protocol (SLA) [RFC6812].  Besides that, there are some mechanisms
114	   that do not fit into either active or passive categories, such as
115	   Performance and Diagnostic Metrics Destination Option (PDM)
116	   techniques [draft-ietf-ippm-6man-pdm-option].

118	   Active measurement mechanisms offer a high level of control over what
119	   and how to measure.  They also do not require inspecting production
120	   traffic.  Because of this, they usually offer better accuracy and
121	   privacy than passive measurement mechanisms.  Traffic encryption and
122	   regulations that limit the amount of payload inspection that can
123	   occur are non-issues.  Furthermore, active measurement mechanisms are
124	   able to detect end-to-end network performance problems in a fine-
125	   grained way (e.g., simulating the traffic that must be handled
126	   considering specific Service Level Objectives - SLOs).  As a result,
127	   active measurements are often preferred over passive measurement for
128	   SLA monitoring.  Measurement probes must be hosted in network devices
129	   and measurement sessions must be activated to compute the current
130	   network metrics (e.g., considering those described in [RFC4148]).
131	   This activation should be dynamic in order to follow changes in
132	   network conditions, such as those related to routes being added or
133	   new customer demands.
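
	   As a rough illustration of how the results of an active
	   measurement session could be compared against SLOs for latency,
	   loss, and jitter, consider the following Python sketch.  The names
	   (Slo, check_slo) and the jitter definition (mean absolute
	   difference between consecutive samples) are assumptions made for
	   this example; they are not taken from OWAMP, TWAMP, or the Cisco
	   SLA protocol.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Slo:
    """Hypothetical SLO thresholds for one monitored destination."""
    max_latency_ms: float
    max_loss_ratio: float
    max_jitter_ms: float


def check_slo(rtts_ms, sent, slo):
    """Return the list of violated objectives for one probe session.

    rtts_ms: round-trip times (ms) of the probes that were answered.
    sent:    number of probes that were sent.
    """
    received = len(rtts_ms)
    loss = 1.0 - received / sent if sent else 0.0
    # With no answers at all, latency is treated as unbounded.
    latency = mean(rtts_ms) if rtts_ms else float("inf")
    # Assumed jitter metric: mean absolute delta of consecutive samples.
    deltas = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    jitter = mean(deltas) if deltas else 0.0

    violations = []
    if latency > slo.max_latency_ms:
        violations.append("latency")
    if loss > slo.max_loss_ratio:
        violations.append("loss")
    if jitter > slo.max_jitter_ms:
        violations.append("jitter")
    return violations
```

	   A device could run such a check after each session and treat a
	   non-empty result as a detected SLO violation for that destination.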

135	   The activation of active measurement sessions (hosted in senders and
136	   responders considering the architecture described by Cisco [RFC6812])
137	   is expensive in terms of resource consumption (e.g., CPU cycles
138	   and memory footprint), and monitoring functions compete for resources
139	   with other functions, including routing and switching.  Besides that,
140	   the activated sessions also increase the network load because of the
141	   injected traffic.  The resources required and traffic generated by
142	   the active measurement sessions are a function of the number of
143	   measured network destinations, i.e., the more destinations, the
144	   more resources and traffic are needed to deploy the sessions.
145	   Thus, better monitoring coverage requires deploying more sessions,
146	   which in turn increases resource consumption.  Conversely,
147	   enabling the observation of just a small subset of all network
148	   flows can lead to insufficient coverage.
149	   Hence, the decision of how to place measurement probes becomes an
150	   important management activity, so that with a limited amount of
151	   measurement overhead the maximum benefits in terms of service level
152	   monitoring are obtained.
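
	   The trade-off between overhead and coverage can be made concrete
	   with a small Python sketch.  It assumes, purely for illustration,
	   that each session has a fixed cost per resource (a linear model)
	   and that one session covers one destination; the resource names
	   and figures are hypothetical.

```python
def max_sessions(budget, per_session):
    """Largest number of sessions that fits within every resource budget.

    budget and per_session map a resource name to an amount; all
    per-session costs are assumed to be strictly positive.
    """
    return min(int(budget[r] // per_session[r]) for r in per_session)


def coverage(sessions, destinations):
    """Fraction of destinations that can be probed with `sessions`."""
    if destinations == 0:
        return 1.0
    return min(sessions, destinations) / destinations


# Hypothetical budget reserved for monitoring on one device.
budget = {"cpu_pct": 10.0, "mem_kb": 2048, "bw_kbps": 500}
per_session = {"cpu_pct": 0.5, "mem_kb": 64, "bw_kbps": 20}

sessions = max_sessions(budget, per_session)
share = coverage(sessions, 200)
```

	   With these figures the CPU budget is the bottleneck, so only a
	   small share of 200 destinations can be probed at once, which is
	   why the choice of destinations matters.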

154	2.  Definitions and Acronyms

156	   Active Measurements: Techniques to measure service levels that
157	   involve generating and observing synthetic test traffic

159	   Passive Measurements: Techniques used to measure service levels based
160	   on observation of production traffic

162	   SLA: Service Level Agreement

164	   SLO: Service Level Objective

166	   P2P: Peer-to-Peer

168	3.  Current Approaches

170	   The current best practice for distributing the available
171	   measurement sessions across the network in feasible deployments of
172	   active measurement solutions is to rely entirely on the human
173	   administrator's expertise to infer the best locations at which to
174	   activate such sessions.  This is done through several steps.  First,
175	   it is necessary to collect traffic information in order to grasp the
176	   traffic matrix.  Then, the administrator uses this information to
177	   infer which are the best destinations for measurement sessions.
178	   After that, the administrator activates sessions on the chosen subset
179	   of destinations considering the available resources.  This practice,
180	   however, does not scale well because it is labor-intensive and
181	   error-prone for the administrator to compute which sessions should be
182	   activated given the set of critical flows that need to be measured.
183	   Even worse, this practice completely fails in networks whose critical
184	   flows are short-lived and dynamic in terms of the network paths they
185	   traverse, as in modern cloud environments.  This is because
186	   fast reactions are necessary to reconfigure the sessions, and
187	   administrators are not fast enough at computing and activating the
188	   new set of required sessions every time the network traffic pattern
189	   changes.  Finally, the current active measurements practice usually
190	   covers only a fraction of the network flows that should be observed,
191	   which invariably leads to the damaging consequence of undetected SLA
192	   violations.

194	4.  Problem Statement

196	   The problem to solve involves automating the placement of active
197	   measurement probes in the most effective manner possible.
198	   Specifically, assuming a bounded resource budget that is available
199	   for measurements, the problem becomes how to place those measurement
200	   probes such that the likelihood of detecting service level violations
201	   is maximized, and subsequently performing the required
202	   configurations.  The method should be embeddable as management
203	   software inside network devices that controls the deployment of
204	   active measurement mechanisms.  The method shall furthermore be
205	   dynamic and be able to adapt to changing network conditions.
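
	   One simple way to approach this placement problem is a greedy
	   heuristic, sketched below in Python.  It is not the method
	   defined by this document; it only assumes that each candidate
	   destination comes with an estimated violation likelihood and a
	   strictly positive activation cost, both of which are hypothetical
	   inputs here.

```python
def place_probes(candidates, budget):
    """Greedily choose destinations to probe within a resource budget.

    candidates: dict mapping destination -> (likelihood, cost), where
                likelihood estimates how probable an SLA violation is
                and cost (> 0) is the resource cost of one session.
    budget:     total cost that may be spent on sessions.
    """
    # Prefer destinations with the highest likelihood per unit of cost.
    ranked = sorted(
        candidates.items(),
        key=lambda item: item[1][0] / item[1][1],
        reverse=True,
    )
    chosen, spent = [], 0.0
    for destination, (_likelihood, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(destination)
            spent += cost
    return chosen


candidates = {
    "a": (0.9, 2.0),
    "b": (0.5, 1.0),
    "c": (0.2, 1.0),
    "d": (0.3, 3.0),
}
selected = place_probes(candidates, budget=4.0)
```

	   Re-running the selection whenever the likelihood estimates change
	   is one way the placement could follow changing network conditions.
	   A greedy choice is not guaranteed to be optimal, since the
	   underlying problem resembles a knapsack problem.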

207	5.  Benefits of an Autonomic Solution

209	   The use case considered here is the distributed autonomic detection
210	   of SLA violations.  The use of Autonomic Networking (AN) properties
211	   can help such detection through an efficient activation of
212	   measurement sessions [P2PBNM-Nobre-2012].  The problem to be solved
213	   by AN in the present use case is how to steer the process of
214	   measurement session activation by a complete solution that sets all
215	   necessary parameters for this activation to operate efficiently,
216	   reliably and securely, with no required human intervention, while
217	   still allowing for human input.

219	   We advocate for embedding Peer-to-Peer (P2P) technology in network
220	   devices in order to improve the measurement session activation
221	   decisions using autonomic control loops.  The provisioning of the P2P
222	   management overlay should be transparent for the network
223	   administrator.  It would be possible to control the measurement
224	   session activation using local data and logic and to share
225	   measurement results among different network devices.

227	   An autonomic solution for the distributed detection of SLA violations
228	   can provide several benefits.  First, efficiency: this solution could
229	   optimize the resource consumption and avoid resource starvation on
230	   the network devices.  In practice, the solution should maximize the
231	   benefits of SLA monitoring (i.e., maximize the likelihood of SLA
232	   violations being detected) by operating within a given resource
233	   budget.  This optimization comes from different sources: taking into
234	   account past measurement results, taking into account other
235	   observations (such as observations of link utilization and passive
236	   measurements, where available), sharing measurement results between
237	   network devices, better efficiency in the probe activation decisions,
238	   etc.  Second, effectiveness: the number of detected SLA violations
239	   could be increased.  This increase is related to better coverage
240	   of the network.  Third, the solution could decrease the time
241	   necessary to detect SLA violations.  Adaptivity features of an
242	   autonomic loop could capture network dynamics faster than a
243	   human administrator.  Finally, the solution could help to reduce the
244	   workload of human administrators or, at least, spare them the need to
245	   perform operational tasks.

247	6.  Intended User and Administrator Experience

249	   The autonomic solution should not require human intervention in
250	   the distributed detection of SLA violations.  Besides that, it could
251	   enable the control of SLA monitoring by less experienced human
252	   administrators.  However, some information may be provided by the
253	   human administrator.  For example, the human administrator may
254	   provide the SLOs regarding the SLA being monitored.  The
255	   configuration and bootstrapping of network devices using the
256	   autonomic solution should be minimal for the human administrator.
257	   It would probably suffice to provide the address of a device
258	   that is already using the solution; the devices themselves could
259	   then exchange configuration data.

261	7.  Analysis of Parameters and Information Involved

263	   The active measurement model assumes that a typical infrastructure
264	   will have multiple network segments and Autonomous Systems (ASs), and
265	   a reasonably large number of routers and hosts.  It also
266	   considers that multiple SLOs can be in place at a given time.  Since
267	   interoperability in a heterogeneous network is a goal, features found
268	   in different active measurement mechanisms (e.g., OWAMP, TWAMP, and
269	   IPSLA) and programmability interfaces (e.g., Cisco's EEM and onePK)
270	   could be used for the implementation.  The autonomic solution should
271	   include and/or reference specific algorithms, protocols, metrics and
272	   technologies for the implementation of distributed detection of SLA
273	   violations as a whole.

275	7.1.  Device Based Self-Knowledge and Decisions

277	   Each device has self-knowledge about the local SLA monitoring.  This
278	   could be in the form of historical measurement data and SLOs.
279	   Besides that, the devices would have algorithms that could decide
280	   which probes should be activated at a given time.  The choice of
281	   which algorithm is better for a specific situation would also be
282	   autonomic.
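
	   A minimal sketch of such a local decision, in Python, is given
	   below.  It assumes that the device keeps, per destination, a
	   chronological list of past outcomes (violation or not) and ranks
	   destinations with an exponentially weighted moving average; the
	   function names, the smoothing factor, and the scoring rule are
	   assumptions made for illustration only.

```python
def update_score(score, violated, alpha=0.3):
    """Blend the newest outcome into a running violation score."""
    observation = 1.0 if violated else 0.0
    return (1 - alpha) * score + alpha * observation


def pick_probes(history, slots, alpha=0.3):
    """Select the `slots` destinations with the highest scores.

    history: dict mapping destination -> list of booleans, oldest
             first, where True means an SLO violation was observed.
    """
    scores = {}
    for destination, outcomes in history.items():
        score = 0.0
        for violated in outcomes:
            score = update_score(score, violated, alpha)
        scores[destination] = score
    # Recent violations weigh more than old ones.
    return sorted(scores, key=scores.get, reverse=True)[:slots]


history = {
    "r1": [False, False, True],
    "r2": [True, False, False],
    "r3": [False, False, False],
}
active = pick_probes(history, slots=2)
```

	   Because recent outcomes dominate the score, a destination that
	   violated its SLOs in the latest session ranks above one whose
	   violation is older.  A device could keep several such scoring
	   rules and switch between them based on how well each has
	   predicted violations.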

284	7.2.  Interaction with other devices

286	   Network devices should share information about service level
287	   measurement results.  This information can speed up the detection of
288	   SLA violations and increase the number of detected SLA violations.
289	   In any case, it is necessary to ensure that the results from remote
290	   devices have local relevance.  The definition of network devices that
291	   exchange measurement data, i.e., management peers, creates a new
292	   topology.  Different approaches could be used to define this topology
293	   (e.g., correlated peers [P2PBNM-Nobre-2012]).  To bootstrap peer
294	   selection, each device should use its known endpoint neighbors
295	   (e.g., from its FIB and RIB) as the initial seed for possible peers.
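
	   The bootstrap step and the relevance check could be sketched in
	   Python as follows.  The route representation, the function names,
	   and the use of a Jaccard index over measured destinations as a
	   relevance measure are assumptions made for this example; they are
	   not the peer-correlation method of [P2PBNM-Nobre-2012].

```python
def seed_peers(routes, self_addr):
    """Derive initial candidate peers from routing information.

    routes: iterable of (prefix, next_hop) pairs, e.g. taken from a
            device's RIB or FIB (hypothetical representation).
    """
    peers = []
    for _prefix, next_hop in routes:
        if next_hop != self_addr and next_hop not in peers:
            peers.append(next_hop)
    return peers


def relevance(local_destinations, peer_destinations):
    """Jaccard overlap of the destinations two devices measure."""
    local, peer = set(local_destinations), set(peer_destinations)
    union = local | peer
    return len(local & peer) / len(union) if union else 0.0


routes = [
    ("10.0.0.0/24", "192.0.2.1"),
    ("10.0.1.0/24", "192.0.2.2"),
    ("10.0.2.0/24", "192.0.2.1"),
    ("10.0.3.0/24", "192.0.2.9"),
]
peers = seed_peers(routes, self_addr="192.0.2.9")
overlap = relevance({"a", "b", "c"}, {"b", "c", "d"})
```

	   A device could then keep only peers whose overlap exceeds some
	   threshold, so that the results it receives concern destinations
	   it also cares about.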

297	8.  Comparison with current solutions

299	   There is no standardized solution for distributed autonomic detection
300	   of SLA violations.  Current solutions are restricted to ad hoc
301	   scripts running on a per-node basis to automate some
302	   administrator actions.  There are some proposals for passive probe
303	   activation (e.g., DECON and CSAMP), but without the focus on
304	   autonomic features.  It is also worth mentioning a proposal from
305	   Barford et al. to detect and localize links which cause anomalies
306	   along a network path.

308	9.  Related IETF Work

310	   The following paragraphs discuss related IETF work and are provided
311	   for reference.  This section is not exhaustive; rather, it provides
312	   an overview of various initiatives and how they relate to autonomic
313	   distributed detection of SLA violations.  1.  [LMAP]: The Large-Scale
314	   Measurement of Broadband Performance Working Group aims at
315	   standards for performance management.  Since its mechanisms also
316	   consist of deploying measurement probes, the autonomic solution could
317	   be relevant for LMAP, especially considering SLA violation screening.
318	   Besides that, a solution to decrease the workload of human
319	   administrators in service providers is probably highly desirable.  2.
320	   [IPFIX]: IP Flow Information EXport (IPFIX) aims at the process of
321	   standardization of IP flows (i.e., netflows).  IPFIX uses measurement
322	   probes (i.e., metering exporters) to gather flow data.  In this
323	   context, the autonomic solution for the activation of active
324	   measurement probes could possibly be extended to also address passive
325	   measurement probes.  Besides that, flow information could be used in
326	   the decision making of probe activation.  3.  [ALTO]: The Application
327	   Layer Traffic Optimization Working Group aims to provide topological
328	   information at a higher abstraction layer, which can be based upon
329	   network policy, and with application-relevant service functions
330	   located in it.  Their work could be leveraged for the definition of
331	   the topology regarding the network devices which exchange measurement
332	   data.

334	10.  Acknowledgements

336	   We wish to acknowledge the helpful contributions, comments, and
337	   suggestions that were received from Mohamed Boucadair, Bruno Klauser,
338	   Eric Voit, and Hanlin Fang.

340	11.  IANA Considerations

342	   This memo includes no request to IANA.

344	12.  Security Considerations

346	   The bootstrapping of a new device follows the approach proposed by
347	   the ANIMA WG [draft-anima-boot]; thus, to exchange data, a device
348	   should register first.  This registration could be performed by a
349	   "Registrar" device or a cloud service provided by the organization to
350	   facilitate autonomic mechanisms.  The new device sends its own
351	   credentials to the Registrar, and after successful authentication,
352	   receives domain information, to enable subsequent enrolment to the
353	   domain.  The Registrar sends all required information: a device name,
354	   domain name, plus some parameters for the operation.  Measurement
355	   data should be exchanged signed and encrypted among devices since
356	   these data could carry sensitive information about network
357	   infrastructures.  Some attacks should be considered when analyzing
358	   the security of the autonomic solution.  Denial of service (DoS)
359	   attacks could be performed if the solution is tampered with so that
360	   it activates more local probes than the available resources allow.
361	   Besides that, results could be forged by a device (attacker) so that
362	   this device is considered a peer of a specific device (target).  This
363	   could be done to gain information about a network.
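
	   To illustrate how forged or altered measurement results could be
	   rejected, the following Python sketch authenticates each result
	   with an HMAC computed over a canonical JSON encoding.  This is a
	   simplification: it assumes a pre-shared key per peer pair
	   (hypothetical here), uses a message authentication code rather
	   than a digital signature, and omits encryption, which would still
	   be required to protect sensitive data in transit.

```python
import hashlib
import hmac
import json


def protect(result, key):
    """Serialize a measurement result and attach an HMAC-SHA256 tag."""
    payload = json.dumps(result, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify(message, key):
    """Return the decoded result, or None if the tag does not match."""
    expected = hmac.new(
        key, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters match.
    if not hmac.compare_digest(expected, message["tag"]):
        return None
    return json.loads(message["payload"])


key = b"example-pre-shared-key"  # hypothetical; never hard-code real keys
message = protect({"destination": "192.0.2.1", "latency_ms": 12.5}, key)
result = verify(message, key)
```

	   A receiving device would discard any message for which verify()
	   returns None, for example one sent by a device that does not hold
	   the key or one modified in transit.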

365	13.  References

367	13.1.  Normative References

369	   [draft-anima-boot]
370	              Pritikin, M., Richardson, M., Behringer, M., and S.
371	              Bjarnason, "draft-ietf-anima-bootstrapping-keyinfra",
372	              draft-ietf-anima-bootstrapping-keyinfra-02 (work in
373	              progress), March 2016.

375	   [draft-ietf-ippm-6man-pdm-option]
376	              Elkins, N., Hamilton, R., and M. Ackermann, "draft-ietf-
377	              ippm-6man-pdm-option", draft-ietf-ippm-6man-pdm-option-01
378	              (work in progress), October 2015.

380	   [P2PBNM-Nobre-2012]
381	              Nobre, J., Granville, L., Clemm, A., and A. Prieto,
382	              "Decentralized Detection of SLA Violations Using P2P
383	              Technology", 8th International Conference on Network and
384	              Service Management (CNSM), 2012,
385	              <http://ieeexplore.ieee.org/xpls/
386	              abs_all.jsp?arnumber=6379997>.

388	   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
389	              Zekauskas, "A One-way Active Measurement Protocol
390	              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006,
391	              <http://www.rfc-editor.org/info/rfc4656>.

393	   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
394	              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
395	              RFC 5357, DOI 10.17487/RFC5357, October 2008,
396	              <http://www.rfc-editor.org/info/rfc5357>.

398	   [RFC6812]  Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
399	              S., and E. Yedavalli, "Cisco Service-Level Assurance
400	              Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013,
401	              <http://www.rfc-editor.org/info/rfc6812>.

403	   [RFC7297]  Boucadair, M., Jacquenet, C., and N. Wang, "IP
404	              Connectivity Provisioning Profile (CPP)", RFC 7297,
405	              DOI 10.17487/RFC7297, July 2014,
406	              <http://www.rfc-editor.org/info/rfc7297>.

408	13.2.  Informative References

410	   [RFC3954]  Claise, B., Ed., "Cisco Systems NetFlow Services Export
411	              Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004,
412	              <http://www.rfc-editor.org/info/rfc3954>.

414	   [RFC4148]  Stephan, E., "IP Performance Metrics (IPPM) Metrics
415	              Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148, August
416	              2005, <http://www.rfc-editor.org/info/rfc4148>.

418	Authors' Addresses

420	   Jeferson Campos Nobre
421	   Federal University of Rio Grande do Sul
422	   Porto Alegre
423	   Brazil

425	   Email: jcnobre@inf.ufrgs.br
426	   Lisandro Zambenedetti Granville
427	   Federal University of Rio Grande do Sul
428	   Porto Alegre
429	   Brazil

431	   Email: granville@inf.ufrgs.br

433	   Alexander Clemm
434	   Cisco Systems
435	   San Jose
436	   USA

438	   Email: alex@cisco.com

440	   Alberto Gonzalez Prieto
441	   Cisco Systems
442	   San Jose
443	   USA

445	   Email: albertgo@cisco.com