Network Management Research Group                               J. Nobre
Internet-Draft                       University of Vale do Rio dos Sinos
Intended status: Informational                              L. Granville
Expires: August 19, 2017         Federal University of Rio Grande do Sul
                                                                A. Clemm
                                                                  Huawei
                                                      A. Gonzalez Prieto
                                                           Cisco Systems
                                                       February 15, 2017


     Autonomic Networking Use Case for Distributed Detection of SLA
                               Violations
          draft-irtf-nmrg-autonomic-sla-violation-detection-06

Abstract

   This document describes a use case for autonomic networking in
   distributed detection of Service Level Agreement (SLA) violations.
   It is one of a series of use cases intended to illustrate
   requirements for autonomic networking.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 19, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions and Acronyms
   3.  Current Approaches
   4.  Problem Statement
   5.  Benefits of an Autonomic Solution
   6.  Intended User and Administrator Experience
   7.  Analysis of Parameters and Information Involved
     7.1.  Device Based Self-Knowledge and Decisions
     7.2.  Interaction with Other Devices
   8.  Comparison with Current Solutions
   9.  Related IETF Work
   10. Acknowledgements
   11. IANA Considerations
   12. Security Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Internet has been growing dramatically in terms of size,
   capacity, and accessibility in recent years. Communication
   requirements of distributed services and applications running on top
   of the Internet have become increasingly demanding. Some examples
   are real-time interactive video or financial trading. Providing such
   services involves stringent requirements in terms of acceptable
   latency, loss, or jitter. Those requirements lead to the
   articulation of Service Level Objectives (SLOs) which are to be met.
   Those SLOs become part of Service Level Agreements (SLAs) that
   articulate a contract between the provider and the consumer of a
   service. To fulfill a service, it needs to be ensured that the SLOs
   are met. Examples of service fulfillment clauses can be found in
   [RFC7297].
   Violations of SLOs can be associated with significant financial
   loss, which can be divided into two types. First, there is the loss
   incurred by the service users (e.g., the trader whose orders are not
   executed in a timely manner). Second, there is the loss incurred by
   the service provider, in terms of penalties for not meeting the
   service levels and loss of revenue due to reduced customer
   satisfaction. Thus, the service level requirements of critical
   network services have become a key concern for network
   administrators. To ensure that SLAs are not being violated, service
   levels need to be constantly monitored at the network infrastructure
   layer. To that end, network measurements must take place.

   Network measurements are performed through either active or passive
   measurement techniques. In passive measurements, production traffic
   is observed. Network conditions are checked in a non-intrusive way
   because no monitoring traffic is created by the measurement process
   itself. In the context of the IP Flow Information Export (IPFIX) WG,
   several documents were produced to define passive measurement
   mechanisms (e.g., the flow record specification [RFC3954]). Active
   measurement, on the other hand, is intrusive because it injects
   synthetic traffic into the network to measure network performance.
   The IP Performance Metrics (IPPM) WG produced documents that describe
   active measurement mechanisms, such as the One-Way Active Measurement
   Protocol (OWAMP) [RFC4656] and the Two-Way Active Measurement
   Protocol (TWAMP) [RFC5357]. The Cisco Service-Level Assurance
   Protocol [RFC6812] is another active measurement mechanism. Besides
   that, some mechanisms do not fit into either the active or the
   passive category, such as the Performance and Diagnostic Metrics
   (PDM) Destination Option [draft-ietf-ippm-6man-pdm-option].

   Active measurement mechanisms offer a high level of control over
   what to measure and how to measure it.
   They also do not require inspecting production traffic. Because of
   this, they usually offer better accuracy and privacy than passive
   measurement mechanisms. Traffic encryption and regulations that
   limit the amount of payload inspection that can occur are
   non-issues. Furthermore, active measurement mechanisms are able to
   detect end-to-end network performance problems in a fine-grained way
   (e.g., by simulating the traffic that must be handled under specific
   SLOs). As a result, active measurements are often preferred over
   passive measurements for SLA monitoring. Measurement probes must be
   hosted in network devices, and measurement sessions must be
   activated to compute the current network metrics (e.g., those
   described in [RFC4148]). This activation should be dynamic in order
   to follow changes in network conditions, such as those related to
   routes being added or new customer demands.

   The activation of active measurement sessions (hosted in senders and
   responders, following the architecture described in [RFC6812]) is
   expensive in terms of resource consumption, e.g., CPU cycles and
   memory footprint, and monitoring functions compete for resources
   with other functions, including routing and switching. In addition,
   the activated sessions increase the network load because of the
   injected traffic. The resources required and the traffic generated
   by active measurement sessions are a function of the number of
   measured network destinations, i.e., the more destinations are
   measured, the more resources and traffic are needed to deploy the
   sessions. Thus, better monitoring coverage requires deploying more
   sessions, which in turn increases the resources consumed.
   Conversely, observing just a small subset of all network flows can
   lead to insufficient coverage.
   Hence, deciding where to place measurement probes becomes an
   important management activity: the aim is to obtain the maximum
   benefit in terms of service level monitoring with a limited amount
   of measurement overhead.

2.  Definitions and Acronyms

   Active Measurements:  Techniques to measure service levels that
      involve generating and observing synthetic test traffic

   Passive Measurements:  Techniques used to measure service levels
      based on observation of production traffic

   SLA:  Service Level Agreement

   SLO:  Service Level Objective

   P2P:  Peer-to-Peer

3.  Current Approaches

   In feasible deployments of active measurement solutions, the current
   best practice for distributing the available measurement sessions
   across the network relies entirely on the expertise of the human
   administrator to infer the best locations at which to activate such
   sessions. This is done in several steps. First, it is necessary to
   collect traffic information in order to grasp the traffic matrix.
   Then, the administrator uses this information to infer the best
   destinations for measurement sessions. After that, the
   administrator activates sessions on the chosen subset of
   destinations, considering the available resources. This practice,
   however, does not scale well because it is labor intensive and
   error-prone for the administrator to compute which sessions should
   be activated given the set of critical flows that need to be
   measured. Even worse, this practice completely fails in networks
   whose critical flows are short-lived and whose network paths change
   dynamically, as in modern cloud environments. This is because fast
   reactions are necessary to reconfigure the sessions, and
   administrators are simply not fast enough at computing and
   activating the new set of required sessions every time the network
   traffic pattern changes.
   Finally, the current active measurement practice usually covers only
   a fraction of the network flows that should be observed, which
   invariably leads to the damaging consequence of undetected SLA
   violations.

4.  Problem Statement

   The problem to solve involves automating the placement of active
   measurement probes in the most effective manner possible.
   Specifically, assuming a bounded resource budget that is available
   for measurements, the problem becomes how to place those measurement
   probes such that the likelihood of detecting service level
   violations is maximized, and how to subsequently perform the
   required configurations. The method should be embeddable as
   management software inside network devices that controls the
   deployment of active measurement mechanisms. The method shall
   furthermore be dynamic and able to adapt to changing network
   conditions.

5.  Benefits of an Autonomic Solution

   The use case considered here is the distributed autonomic detection
   of SLA violations. The use of Autonomic Networking (AN) properties
   can help such detection through an efficient activation of
   measurement sessions [P2PBNM-Nobre-2012]. The problem to be solved
   by AN in the present use case is how to steer the process of
   measurement session activation through a complete solution that sets
   all the parameters necessary for this activation to operate
   efficiently, reliably, and securely, with no required human
   intervention, while still allowing for human input.

   We advocate embedding Peer-to-Peer (P2P) technology in network
   devices in order to improve the measurement session activation
   decisions using autonomic control loops. The provisioning of the P2P
   management overlay should be transparent to the network
   administrator.
   It would be possible to control the measurement session activation
   using local data and logic and to share measurement results among
   different network devices.

   An autonomic solution for the distributed detection of SLA violations
   can provide several benefits. First, efficiency: this solution could
   optimize resource consumption and avoid resource starvation on the
   network devices. In practice, the solution should maximize the
   benefits of SLA monitoring (i.e., maximize the likelihood of SLA
   violations being detected) while operating within a given resource
   budget. This optimization comes from different sources: taking into
   account past measurement results, taking into account other
   observations (such as observations of link utilization and passive
   measurements, where available), sharing measurement results between
   network devices, making more efficient probe activation decisions,
   etc. Second, effectiveness: the number of detected SLA violations
   could be increased. This increase is related to better coverage of
   the network. Third, the solution could decrease the time necessary
   to detect SLA violations. The adaptivity features of an autonomic
   loop could capture the network dynamics faster than a human
   administrator. Finally, the solution could help to reduce the
   workload of human administrators or, at least, relieve them of the
   need to perform operational tasks.

6.  Intended User and Administrator Experience

   The autonomic solution should not require human intervention in the
   distributed detection of SLA violations. Besides that, it could
   enable the control of SLA monitoring by less experienced human
   administrators. However, some information may be provided by the
   human administrator. For example, the human administrator may
   provide the SLOs regarding the SLA being monitored.
   The configuration and bootstrapping of network devices using the
   autonomic solution should require minimal effort from the human
   administrator. It would probably be necessary only to provide the
   address of a device that is already using the solution, after which
   the devices themselves could exchange configuration data.

7.  Analysis of Parameters and Information Involved

   The active measurement model assumes that a typical infrastructure
   will have multiple network segments and Autonomous Systems (ASs), and
   a reasonably large number of routers and hosts. It also considers
   that multiple SLOs can be in place at a given time. Since
   interoperability in a heterogeneous network is a goal, features found
   on different active measurement mechanisms (e.g., OWAMP, TWAMP, and
   IPSLA) and programmability interfaces (e.g., Cisco's EEM and onePK)
   could be used for the implementation. The autonomic solution should
   include and/or reference specific algorithms, protocols, metrics, and
   technologies for the implementation of distributed detection of SLA
   violations as a whole.

7.1.  Device Based Self-Knowledge and Decisions

   Each device has self-knowledge about the local SLA monitoring. This
   could be in the form of historical measurement data and SLOs.
   Besides that, the devices would have algorithms that could decide
   which probes should be activated at a given time. The choice of
   which algorithm is better for a specific situation would also be
   autonomic.

7.2.  Interaction with Other Devices

   Network devices should share information about service level
   measurement results. This information can speed up the detection of
   SLA violations and increase the number of detected SLA violations.
   In any case, it is necessary to ensure that the results from remote
   devices have local relevance.
   The definition of network devices that exchange measurement data,
   i.e., management peers, creates a new topology. Different approaches
   could be used to define this topology (e.g., correlated peers
   [P2PBNM-Nobre-2012]). To bootstrap peer selection, each device
   should use its known neighboring endpoints (e.g., from its FIB and
   RIB) as the initial seed to find possible peers.

8.  Comparison with Current Solutions

   There is no standardized solution for distributed autonomic detection
   of SLA violations. Current solutions are restricted to ad hoc
   scripts running on a per-node basis to automate some administrator
   actions. There are some proposals for passive probe activation
   (e.g., DECON and CSAMP), but without a focus on autonomic features.
   It is also worth mentioning a proposal from Barford et al. to detect
   and localize links which cause anomalies along a network path.

9.  Related IETF Work

   The following paragraphs discuss related IETF work and are provided
   for reference. This section is not exhaustive; rather, it provides
   an overview of the various initiatives and how they relate to
   autonomic distributed detection of SLA violations.

   1.  LMAP: The Large-Scale Measurement of Broadband Performance
       Working Group aims to develop standards for performance
       management. Since its mechanisms also consist in deploying
       measurement probes, the autonomic solution could be relevant for
       LMAP, especially considering SLA violation screening. Besides
       that, a solution to decrease the workload of human
       administrators in service providers is probably highly
       desirable.

   2.  IPFIX: IP Flow Information Export (IPFIX) aims at the
       standardization of IP flow information export (i.e.,
       NetFlow-like flow records). IPFIX uses measurement probes (i.e.,
       metering exporters) to gather flow data.
       In this context, the autonomic solution for the activation of
       active measurement probes could possibly be extended to also
       address passive measurement probes. Besides that, flow
       information could be used in the decision making of probe
       activation.

   3.  ALTO: The Application Layer Traffic Optimization Working Group
       aims to provide topological information at a higher abstraction
       layer, which can be based upon network policy, and with
       application-relevant service functions located in it. Its work
       could be leveraged for the definition of the topology regarding
       the network devices which exchange measurement data.

10.  Acknowledgements

   We wish to acknowledge the helpful contributions, comments, and
   suggestions that were received from Mohamed Boucadair, Bruno Klauser,
   Eric Voit, and Hanlin Fang.

11.  IANA Considerations

   This memo includes no request to IANA.

12.  Security Considerations

   The bootstrapping of a new device follows the approach proposed in
   the ANIMA WG [draft-anima-boot]; thus, in order to exchange data, a
   device should register first. This registration could be performed
   by a "Registrar" device or by a cloud service provided by the
   organization to facilitate autonomic mechanisms. The new device
   sends its own credentials to the Registrar and, after successful
   authentication, receives domain information to enable subsequent
   enrolment in the domain. The Registrar sends all required
   information: a device name, a domain name, and some parameters for
   the operation. Measurement data should be signed and encrypted when
   exchanged among devices, since these data could carry sensitive
   information about network infrastructures. Some attacks should be
   considered when analyzing the security of the autonomic solution.
   Denial of service (DoS) attacks could be performed if the solution
   is tampered with so as to activate more local probes than the
   available resources allow.
   Besides that, results could be forged by a device (the attacker) so
   that this device is considered a peer of a specific device (the
   target). This could be done to gain information about a network.

13.  References

13.1.  Normative References

   [draft-anima-boot]
              Pritikin, M., Richardson, M., Behringer, M., and S.
              Bjarnason, "draft-ietf-anima-bootstrapping-keyinfra",
              draft-ietf-anima-bootstrapping-keyinfra-03 (work in
              progress), June 2016.

   [draft-ietf-ippm-6man-pdm-option]
              Elkins, N., Hamilton, R., and M. Ackermann,
              "draft-ietf-ippm-6man-pdm-option",
              draft-ietf-ippm-6man-pdm-option-06 (work in progress),
              September 2016.

   [P2PBNM-Nobre-2012]
              Nobre, J., Granville, L., Clemm, A., and A. Gonzalez
              Prieto, "Decentralized Detection of SLA Violations Using
              P2P Technology", 8th International Conference on Network
              and Service Management (CNSM), 2012.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6812]  Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
              S., and E. Yedavalli, "Cisco Service-Level Assurance
              Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013.

   [RFC7297]  Boucadair, M., Jacquenet, C., and N. Wang, "IP
              Connectivity Provisioning Profile (CPP)", RFC 7297,
              DOI 10.17487/RFC7297, July 2014.

13.2.  Informative References

   [RFC3954]  Claise, B., Ed., "Cisco Systems NetFlow Services Export
              Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004.

   [RFC4148]  Stephan, E., "IP Performance Metrics (IPPM) Metrics
              Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148,
              August 2005.
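
Appendix A.  Illustrative Sketch of Budget-Constrained Probe Selection

   The problem statement in Section 4 asks how to place measurement
   probes under a bounded resource budget so that the likelihood of
   detecting service level violations is maximized. The Python sketch
   below shows one possible greedy heuristic for that decision. It is
   offered only as an illustration under stated assumptions: it is not
   claimed to be the method of [P2PBNM-Nobre-2012] and is not part of
   any specification. The Candidate record, its cost and history
   fields, the smoothed violation_score, and the select_probes function
   are hypothetical names introduced here, and session costs are
   assumed to be positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A destination that could host an active measurement session.

    All fields are assumptions made for this sketch.
    """
    destination: str
    cost: float        # resource units one session would consume (> 0)
    violations: int    # past SLO violations seen, local or peer-reported
    measurements: int  # past measurements taken toward this destination


def violation_score(c: Candidate) -> float:
    """Laplace-smoothed share of past measurements that violated an SLO."""
    return (c.violations + 1) / (c.measurements + 2)


def select_probes(candidates, budget):
    """Greedily pick destinations by score per unit cost within budget."""
    ranked = sorted(candidates,
                    key=lambda c: violation_score(c) / c.cost,
                    reverse=True)
    chosen, spent = [], 0.0
    for c in ranked:
        # Skip any session that would exceed the remaining budget.
        if spent + c.cost <= budget:
            chosen.append(c.destination)
            spent += c.cost
    return chosen


# Hypothetical history for four destinations (documentation addresses).
candidates = [
    Candidate("192.0.2.1", cost=2.0, violations=8, measurements=10),
    Candidate("192.0.2.2", cost=1.0, violations=1, measurements=10),
    Candidate("192.0.2.3", cost=1.0, violations=5, measurements=10),
    Candidate("192.0.2.4", cost=3.0, violations=0, measurements=10),
]
print(select_probes(candidates, budget=3.0))
# prints ['192.0.2.3', '192.0.2.1']
```

   A greedy ratio rule only approximates the underlying knapsack-style
   problem, but it is cheap enough to re-run whenever new local or
   peer-shared measurement results arrive, which fits the requirement
   that the method adapt to changing network conditions.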
Authors' Addresses

   Jeferson Campos Nobre
   University of Vale do Rio dos Sinos
   Porto Alegre
   Brazil

   Email: jcnobre@unisinos.br


   Lisandro Zambenedetti Granville
   Federal University of Rio Grande do Sul
   Porto Alegre
   Brazil

   Email: granville@inf.ufrgs.br


   Alexander Clemm
   Huawei
   Santa Clara, California
   USA

   Email: ludwig@clemm.org


   Alberto Gonzalez Prieto
   Cisco Systems
   San Jose
   USA

   Email: albertgo@cisco.com