Network Management Research Group                               J. Nobre
Internet-Draft                                              L. Granville
Intended status: Informational   Federal University of Rio Grande do Sul
Expires: October 8, 2016                                        A. Clemm
                                                               A. Prieto
                                                           Cisco Systems
                                                           April 6, 2016


     Autonomic Networking Use Case for Distributed Detection of SLA
                               Violations
          draft-irtf-nmrg-autonomic-sla-violation-detection-03

Abstract

   This document describes a use case for autonomic networking in
   distributed detection of Service Level Agreement (SLA) violations.
   It is one of a series of use cases intended to illustrate
   requirements for autonomic networking.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 8, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions and Acronyms
   3.  Current Approaches
   4.  Problem Statement
   5.  Benefits of an Autonomic Solution
   6.  Intended User and Administrator Experience
   7.  Analysis of Parameters and Information Involved
     7.1.  Device Based Self-Knowledge and Decisions
     7.2.  Interaction with other devices
   8.  Comparison with current solutions
   9.  Related IETF Work
   10. Acknowledgements
   11. IANA Considerations
   12. Security Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Internet has grown dramatically in size, capacity, and
   accessibility in recent years.  Communication requirements of
   distributed services and applications running on top of the Internet
   have become increasingly demanding.  Examples include real-time
   interactive video and financial trading.  Providing such services
   involves stringent requirements in terms of acceptable latency, loss,
   or jitter.  Those requirements lead to the articulation of Service
   Level Objectives (SLOs) which are to be met.  Those SLOs become part
   of Service Level Agreements (SLAs) that articulate a contract between
   the provider and the consumer of a service.  Fulfilling a service
   requires ensuring that its SLOs are met.  Examples of service
   fulfillment clauses can be found in [RFC7297].
   Violations of SLOs can be associated with significant financial
   loss, which can be divided into two types.  First, there is the loss
   incurred by the service users (e.g., the trader whose orders are not
   executed in a timely manner).  Second, there is the loss incurred by
   the service provider, both in penalties for not meeting the agreed
   service levels and in lost revenue due to reduced customer
   satisfaction.  Thus, the service level requirements of critical
   network services have become a key concern for network
   administrators.  To ensure that SLAs are not being violated, service
   levels need to be constantly monitored at the network infrastructure
   layer.  To that end, network measurements must take place.

   Network measurements are performed through either active or passive
   measurement techniques.  In passive measurements, production traffic
   is observed.  Network conditions are checked in a non-intrusive way
   because no monitoring traffic is created by the measurement process
   itself.  Within the IP Flow Information Export (IPFIX) WG, several
   documents were produced to define passive measurement mechanisms
   (e.g., the flow records specification [RFC3954]).  Active
   measurement, on the other hand, is intrusive because it injects
   synthetic traffic into the network to measure network performance.
   The IP Performance Metrics (IPPM) WG produced documents that describe
   active measurement mechanisms, such as the One-Way Active Measurement
   Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement Protocol
   (TWAMP) [RFC5357], and the Cisco Service-Level Assurance (SLA)
   Protocol [RFC6812].  In addition, some mechanisms do not fit into
   either the active or the passive category, such as Performance and
   Diagnostic Metrics Destination Option (PDM) techniques
   [draft-ietf-ippm-6man-pdm-option].

   Active measurement mechanisms offer a high level of control over
   what to measure and how to measure it.
   They also do not require inspecting production traffic.  Because of
   this, they usually offer better accuracy and privacy than passive
   measurement mechanisms.  Traffic encryption and regulations that
   limit the amount of payload inspection that can occur are non-issues.
   Furthermore, active measurement mechanisms are able to detect
   end-to-end network performance problems in a fine-grained way (e.g.,
   by simulating the traffic that must be handled under specific SLOs).
   As a result, active measurements are often preferred over passive
   measurements for SLA monitoring.  Measurement probes must be hosted
   in network devices, and measurement sessions must be activated to
   compute the current network metrics (e.g., those described in
   [RFC4148]).  This activation should be dynamic in order to follow
   changes in network conditions, such as those related to routes being
   added or new customer demands.

   The activation of active measurement sessions (hosted in senders and
   responders, following the architecture described in [RFC6812]) is
   expensive in terms of resource consumption, e.g., CPU cycles and
   memory footprint, and monitoring functions compete for resources
   with other functions, including routing and switching.  In addition,
   the activated sessions increase the network load because of the
   injected traffic.  The resources required and the traffic generated
   by the active measurement sessions are a function of the number of
   measured network destinations, i.e., the more destinations, the
   larger the resources and the traffic needed to deploy the sessions.
   Thus, better monitoring coverage requires deploying more sessions,
   which in turn increases resource consumption.  Conversely, observing
   just a small subset of all network flows can lead to insufficient
   coverage.
   Hence, the decision of where to place measurement probes becomes an
   important management activity, so that the maximum benefit in terms
   of service level monitoring is obtained with a limited amount of
   measurement overhead.

2.  Definitions and Acronyms

   Active Measurements:  Techniques to measure service levels that
      involve generating and observing synthetic test traffic

   Passive Measurements:  Techniques used to measure service levels
      based on observation of production traffic

   SLA:  Service Level Agreement

   SLO:  Service Level Objective

   P2P:  Peer-to-Peer

3.  Current Approaches

   In feasible deployments of active measurement solutions, the current
   best practice for distributing the available measurement sessions
   across the network is to rely entirely on the human administrator's
   expertise to infer the best locations at which to activate such
   sessions.  This is done through several steps.  First, it is
   necessary to collect traffic information in order to grasp the
   traffic matrix.  Then, the administrator uses this information to
   infer which are the best destinations for measurement sessions.
   After that, the administrator activates sessions on the chosen subset
   of destinations, considering the available resources.  This practice,
   however, does not scale well because it is labor intensive and
   error-prone for the administrator to compute which sessions should be
   activated given the set of critical flows that need to be measured.
   Even worse, this practice fails completely in networks whose critical
   flows are short-lived and whose paths change dynamically, as in
   modern cloud environments.  This is because fast reactions are
   necessary to reconfigure the sessions, and administrators are simply
   not fast enough in computing and activating the new set of required
   sessions every time the network traffic pattern changes.
   Finally, the current active measurement practice usually covers only
   a fraction of the network flows that should be observed, which
   invariably leads to the damaging consequence of undetected SLA
   violations.

4.  Problem Statement

   The problem to solve involves automating the placement of active
   measurement probes in the most effective manner possible.
   Specifically, assuming a bounded resource budget that is available
   for measurements, the problem becomes how to place those measurement
   probes such that the likelihood of detecting service level violations
   is maximized, and how to subsequently perform the required
   configurations.  The method should be embeddable as management
   software inside network devices that controls the deployment of
   active measurement mechanisms.  The method shall furthermore be
   dynamic and able to adapt to changing network conditions.

5.  Benefits of an Autonomic Solution

   The use case considered here is the distributed autonomic detection
   of SLA violations.  The use of Autonomic Networking (AN) properties
   can help such detection through an efficient activation of
   measurement sessions [P2PBNM-Nobre-2012].  The problem to be solved
   by AN in the present use case is how to steer the process of
   measurement session activation through a complete solution that sets
   all the parameters necessary for this activation to operate
   efficiently, reliably, and securely, without requiring human
   intervention while still allowing for human input.

   We advocate embedding Peer-to-Peer (P2P) technology in network
   devices in order to improve the measurement session activation
   decisions using autonomic control loops.  The provisioning of the P2P
   management overlay should be transparent to the network
   administrator.
   Measurement session activation could then be controlled using local
   data and logic, and measurement results could be shared among
   different network devices.

   An autonomic solution for the distributed detection of SLA violations
   can provide several benefits.  First, efficiency: this solution could
   optimize resource consumption and avoid resource starvation on the
   network devices.  In practice, the solution should maximize the
   benefits of SLA monitoring (i.e., maximize the likelihood of SLA
   violations being detected) while operating within a given resource
   budget.  This optimization comes from different sources: taking into
   account past measurement results, taking into account other
   observations (such as observations of link utilization and passive
   measurements, where available), sharing measurement results between
   network devices, making better probe activation decisions, etc.
   Second, effectiveness: the number of detected SLA violations could be
   increased.  This increase is related to better coverage of the
   network.  Third, the solution could decrease the time necessary to
   detect SLA violations.  The adaptivity features of an autonomic loop
   could capture network dynamics faster than a human administrator.
   Finally, the solution could help to reduce the workload of human
   administrators or, at least, relieve them of routine operational
   tasks.

6.  Intended User and Administrator Experience

   The autonomic solution should not require human intervention in the
   distributed detection of SLA violations.  In addition, it could
   enable the control of SLA monitoring by less experienced human
   administrators.  However, some information may be provided by the
   human administrator.  For example, the human administrator may
   provide the SLOs regarding the SLA being monitored.
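
   As a purely illustrative sketch, the SLOs supplied by the
   administrator might be expressed as a small set of per-metric bounds
   against which measurement results are compared.  The SLO and
   Measurement records below, their field names, and the units are
   assumptions made for this example; this document does not define an
   SLO format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO record; field names and units are assumptions."""
    name: str
    max_latency_ms: Optional[float] = None  # None means "no bound set"
    max_loss_pct: Optional[float] = None
    max_jitter_ms: Optional[float] = None


@dataclass(frozen=True)
class Measurement:
    """Hypothetical result of one active measurement session."""
    latency_ms: float
    loss_pct: float
    jitter_ms: float


def violated_metrics(slo: SLO, m: Measurement) -> List[str]:
    """Return the names of the metrics in `m` that exceed the SLO bounds."""
    checks = [
        ("latency", slo.max_latency_ms, m.latency_ms),
        ("loss", slo.max_loss_pct, m.loss_pct),
        ("jitter", slo.max_jitter_ms, m.jitter_ms),
    ]
    return [name for name, bound, value in checks
            if bound is not None and value > bound]


# Example values are invented for illustration.
video_slo = SLO("interactive-video", max_latency_ms=150.0,
                max_loss_pct=1.0, max_jitter_ms=30.0)
sample = Measurement(latency_ms=180.0, loss_pct=0.2, jitter_ms=35.0)
print(violated_metrics(video_slo, sample))  # ['latency', 'jitter']
```

   Leaving a bound unset lets an SLO constrain only the metrics that
   matter for a given service.
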
   The configuration and bootstrapping of network devices using the
   autonomic solution should require minimal effort from the human
   administrator.  It would probably suffice to provide the address of
   a device that is already using the solution; the devices themselves
   could then exchange configuration data.

7.  Analysis of Parameters and Information Involved

   The active measurement model assumes that a typical infrastructure
   will have multiple network segments and Autonomous Systems (ASs), and
   a reasonably large number of routers and hosts.  It also considers
   that multiple SLOs can be in place at a given time.  Since
   interoperability in a heterogeneous network is a goal, features found
   in different active measurement mechanisms (e.g., OWAMP, TWAMP, and
   IPSLA) and programmability interfaces (e.g., Cisco's EEM and onePK)
   could be used for the implementation.  The autonomic solution should
   include and/or reference specific algorithms, protocols, metrics, and
   technologies for the implementation of distributed detection of SLA
   violations as a whole.

7.1.  Device Based Self-Knowledge and Decisions

   Each device has self-knowledge about the local SLA monitoring.  This
   could be in the form of historical measurement data and SLOs.  In
   addition, the devices would have algorithms that could decide which
   probes should be activated at a given time.  The choice of which
   algorithm is better for a specific situation would also be
   autonomic.

7.2.  Interaction with other devices

   Network devices should share information about service level
   measurement results.  This information can speed up the detection of
   SLA violations and increase the number of detected SLA violations.
   In any case, it is necessary to ensure that the results from remote
   devices have local relevance.
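
   One way to picture how a device might combine its own measurement
   history with results shared by peers, while staying within a resource
   budget, is sketched below.  The Candidate record, the scoring rule
   (local violation rate plus the peer-reported rate scaled by a
   relevance weight), and the greedy selection are all assumptions made
   for illustration; they are not the algorithm of [P2PBNM-Nobre-2012]
   or of any specific implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-destination summary kept by a device."""
    destination: str   # probe target
    cost: int          # resource units one session consumes (must be > 0)
    local_rate: float  # fraction of past local probes that saw a violation
    peer_rate: float   # same fraction, as reported by management peers
    relevance: float   # 0..1 weight: how relevant the peer data is locally


def score(c: Candidate) -> float:
    """Blend local history with peer reports, discounted by relevance."""
    return c.local_rate + c.relevance * c.peer_rate


def select_probes(candidates: List[Candidate], budget: int) -> List[str]:
    """Greedy choice by score per unit cost, never exceeding the budget."""
    ranked = sorted(candidates, key=lambda c: score(c) / c.cost,
                    reverse=True)
    chosen: List[str] = []
    used = 0
    for c in ranked:
        if used + c.cost <= budget:
            chosen.append(c.destination)
            used += c.cost
    return chosen


# Example values are invented; addresses come from the documentation range.
candidates = [
    Candidate("192.0.2.1", cost=2, local_rate=0.10, peer_rate=0.50,
              relevance=0.9),
    Candidate("192.0.2.2", cost=1, local_rate=0.05, peer_rate=0.40,
              relevance=0.1),
    Candidate("192.0.2.3", cost=1, local_rate=0.30, peer_rate=0.00,
              relevance=0.0),
    Candidate("192.0.2.4", cost=3, local_rate=0.20, peer_rate=0.20,
              relevance=0.5),
]
print(select_probes(candidates, budget=3))  # ['192.0.2.3', '192.0.2.1']
```

   In this example the peer report for 192.0.2.2 carries little weight
   because its relevance is low, so the budget goes to the destination
   with a strong local history and the one with a highly relevant peer
   report.  A greedy ranking is only a heuristic and does not guarantee
   an optimal use of the budget.
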
   The definition of network devices that exchange measurement data,
   i.e., management peers, creates a new topology.  Different approaches
   could be used to define this topology (e.g., correlated peers
   [P2PBNM-Nobre-2012]).  To bootstrap peer selection, each device
   should use its known endpoints and neighbors (e.g., from its FIB and
   RIB tables) as the initial seed to find possible peers.

8.  Comparison with current solutions

   There is no standardized solution for distributed autonomic detection
   of SLA violations.  Current solutions are restricted to ad hoc
   scripts running on a per-node basis to automate some of the
   administrator's actions.  There are some proposals for passive probe
   activation (e.g., DECON and CSAMP), but without a focus on autonomic
   features.  It is also worth mentioning a proposal from Barford et al.
   to detect and localize links that cause anomalies along a network
   path.

9.  Related IETF Work

   The following paragraphs discuss related IETF work and are provided
   for reference.  This section is not exhaustive; rather, it provides
   an overview of the various initiatives and how they relate to
   autonomic distributed detection of SLA violations.

   1.  [LMAP]: The Large-Scale Measurement of Broadband Performance
       Working Group aims at standards for performance management.
       Since its mechanisms also consist of deploying measurement
       probes, the autonomic solution could be relevant for LMAP,
       especially considering SLA violation screening.  In addition, a
       solution that decreases the workload of human administrators in
       service providers is probably highly desirable.

   2.  [IPFIX]: IP Flow Information Export (IPFIX) aims at the
       standardization of IP flow export (i.e., NetFlow-style flows).
       IPFIX uses measurement probes (i.e., metering exporters) to
       gather flow data.
       In this context, the autonomic solution for the activation of
       active measurement probes could possibly be extended to also
       address passive measurement probes.  In addition, flow
       information could be used in the decision making of probe
       activation.

   3.  [ALTO]: The Application Layer Traffic Optimization Working Group
       aims to provide topological information at a higher abstraction
       layer, which can be based upon network policy, and with
       application-relevant service functions located in it.  Their
       work could be leveraged for the definition of the topology
       regarding the network devices which exchange measurement data.

10.  Acknowledgements

   We wish to acknowledge the helpful contributions, comments, and
   suggestions that were received from Mohamed Boucadair, Bruno Klauser,
   Eric Voit, and Hanlin Fang.

11.  IANA Considerations

   This memo includes no request to IANA.

12.  Security Considerations

   The bootstrapping of a new device follows the approach proposed by
   the ANIMA WG [draft-anima-boot]; thus, in order to exchange data, a
   device should register first.  This registration could be performed
   by a "Registrar" device or a cloud service provided by the
   organization to facilitate autonomic mechanisms.  The new device
   sends its own credentials to the Registrar and, after successful
   authentication, receives domain information to enable subsequent
   enrolment to the domain.  The Registrar sends all required
   information: a device name, a domain name, plus some parameters for
   the operation.  Measurement data should be signed and encrypted when
   exchanged among devices, since these data could carry sensitive
   information about network infrastructures.  Some attacks should be
   considered when analyzing the security of the autonomic solution.
   Denial of service (DoS) attacks could be performed if the solution is
   tampered with so as to activate more local probes than the available
   resources allow.
   In addition, results could be forged by a device (the attacker) so
   that it is considered a peer of a specific device (the target).  This
   could be done to gain information about a network.

13.  References

13.1.  Normative References

   [draft-anima-boot]
              Pritikin, M., Richardson, M., Behringer, M., and S.
              Bjarnason, "draft-ietf-anima-bootstrapping-keyinfra",
              draft-ietf-anima-bootstrapping-keyinfra-02 (work in
              progress), March 2016.

   [draft-ietf-ippm-6man-pdm-option]
              Elkins, N., Hamilton, R., and M. Ackermann,
              "draft-ietf-ippm-6man-pdm-option",
              draft-ietf-ippm-6man-pdm-option-01 (work in progress),
              October 2015.

   [P2PBNM-Nobre-2012]
              Nobre, J., Granville, L., Clemm, A., and A. Prieto,
              "Decentralized Detection of SLA Violations Using P2P
              Technology", 8th International Conference on Network and
              Service Management (CNSM), 2012.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6812]  Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
              S., and E. Yedavalli, "Cisco Service-Level Assurance
              Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013.

   [RFC7297]  Boucadair, M., Jacquenet, C., and N. Wang, "IP
              Connectivity Provisioning Profile (CPP)", RFC 7297,
              DOI 10.17487/RFC7297, July 2014.

13.2.  Informative References

   [RFC3954]  Claise, B., Ed., "Cisco Systems NetFlow Services Export
              Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004.

   [RFC4148]  Stephan, E., "IP Performance Metrics (IPPM) Metrics
              Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148,
              August 2005.

Authors' Addresses

   Jeferson Campos Nobre
   Federal University of Rio Grande do Sul
   Porto Alegre
   Brazil

   Email: jcnobre@inf.ufrgs.br


   Lisandro Zambenedetti Granville
   Federal University of Rio Grande do Sul
   Porto Alegre
   Brazil

   Email: granville@inf.ufrgs.br


   Alexander Clemm
   Cisco Systems
   San Jose
   USA

   Email: alex@cisco.com


   Alberto Gonzalez Prieto
   Cisco Systems
   San Jose
   USA

   Email: albertgo@cisco.com