Network Management Research Group                               J. Nobre
Internet-Draft                                              L. Granville
Intended status: Informational   Federal University of Rio Grande do Sul
Expires: November 8, 2015                                       A. Clemm
                                                               A. Prieto
                                                           Cisco Systems
                                                             May 7, 2015


     Autonomic Networking Use Case for Distributed Detection of SLA
                               Violations
          draft-irtf-nmrg-autonomic-sla-violation-detection-02

Abstract

   This document describes a use case for autonomic networking in
   distributed detection of Service Level Agreement (SLA) violations.
   It is one of a series of use cases intended to illustrate
   requirements for autonomic networking.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 8, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Current Approaches
   3.  Problem Statement
   4.  Benefits of an Autonomic Solution
   5.  Intended User and Administrator Experience
   6.  Analysis of Parameters and Information Involved
     6.1.  Device Based Self-Knowledge and Decisions
     6.2.  Interaction with other devices
   7.  Comparison with current solutions
   8.  Related IETF Work
   9.  Acknowledgements
   10. IANA Considerations
   11. Security Considerations
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Internet has grown dramatically in size, capacity, and
   accessibility in recent years.  At the same time, the communication
   requirements of distributed services and applications running on top
   of the Internet have become increasingly strict, because failing to
   meet them can be costly (e.g., latency in trading can have a high
   cost).  These requirements are therefore included in SLA
   specifications (examples of service fulfillment clauses can be found
   in [RFC7297]).  Violations of these requirements usually cause
   significant financial loss, which can be divided into two types: the
   loss incurred by the service users (e.g., the trader) and the loss
   incurred by the service provider in terms of penalties for not
   meeting the service level.
   Thus, the service level requirements of critical network services
   have become a key concern for network administrators.  To ensure
   that SLAs are not being violated, service levels need to be
   constantly monitored at the network infrastructure layer.  To that
   end, network measurements must take place.

   Network measurements are performed through either active or passive
   measurement techniques.  In passive measurement, network conditions
   are checked in a non-intrusive way because no monitoring traffic is
   created by the measurement process itself.  In the context of the IP
   Flow Information EXport (IPFIX) WG, several documents were produced
   to define passive measurement mechanisms (e.g., flow records
   specification [RFC3954]).  Active measurement, on the other hand, is
   intrusive because it injects synthetic traffic into the network to
   measure the network performance.  The IP Performance Metrics (IPPM)
   WG produced documents that describe active measurement mechanisms,
   such as the One-Way Active Measurement Protocol (OWAMP) [RFC4656]
   and the Two-Way Active Measurement Protocol (TWAMP) [RFC5357].  The
   Cisco Service-Level Assurance Protocol [RFC6812] is another active
   measurement mechanism.  Besides that, there are some mechanisms that
   do not fit into either the active or the passive category, such as
   Performance and Diagnostic Metrics Destination Option (PDM)
   techniques [draft-elkins-ippm-pdm-option].

   Active measurement mechanisms usually offer better accuracy and
   privacy than passive measurement mechanisms.  Furthermore, active
   measurement mechanisms are able to detect end-to-end network
   performance problems in a fine-grained way (e.g., by simulating the
   traffic that must be handled under specific Service Level
   Objectives, or SLOs).  As a result, active measurement is preferred
   over passive measurement for SLA monitoring.
   Measurement probes must be hosted in network devices, and
   measurement sessions must be activated to compute the current
   network metrics (e.g., those described in [RFC4148]).  This
   activation should be dynamic in order to follow changes in network
   conditions, such as routes being added or new customer demands.

   The activation of active measurement sessions (hosted in senders and
   responders, considering the architecture described by Cisco
   [RFC6812]) is expensive in terms of resource consumption, e.g., CPU
   cycles and memory footprint, and monitoring functions compete for
   resources with other functions, including routing and switching.
   Besides that, the activated sessions also increase the network load
   because of the injected traffic.  The resources required and the
   traffic generated by the active measurement sessions are a function
   of the number of measured network destinations, i.e., the more
   destinations are measured, the more resources and traffic are needed
   to deploy the sessions.  Thus, better monitoring coverage requires
   deploying more sessions, which in turn increases the consumed
   resources.  Conversely, observing just a small subset of all network
   flows can lead to insufficient coverage.

2.  Current Approaches

   The current best practice in feasible deployments of active
   measurement solutions relies entirely on the expertise of the human
   administrator to infer the best locations at which to activate the
   available measurement sessions along the network.  This is done
   through several steps.  First, it is necessary to collect traffic
   information in order to grasp the traffic matrix.  Then, the
   administrator uses this information to infer which are the best
   destinations for measurement sessions.
   After that, the administrator activates sessions on the chosen
   subset of destinations, considering the available resources.  This
   practice, however, does not scale well because it is still labor
   intensive and error-prone for the administrator to compute which
   sessions should be activated given the set of critical flows that
   need to be measured.  Even worse, this practice completely fails in
   networks whose critical flows are short-lived and dynamic in terms
   of the network path they traverse, as in modern cloud environments.
   This is because fast reactions are necessary to reconfigure the
   sessions, and administrators are not fast enough at computing and
   activating the new set of required sessions every time the network
   traffic pattern changes.  Finally, the current active measurement
   practice usually covers only a fraction of the network flows that
   should be observed, which invariably leads to undetected SLA
   violations.

3.  Problem Statement

   Management software can be embedded inside network devices to
   control the deployment of active measurement mechanisms.  In fact,
   this is done by some network equipment vendors, especially to avoid
   the starvation of the network devices (e.g., due to configuration
   errors and lack of experience of human administrators).  However,
   the current approaches do not enhance the active measurement
   capabilities in important respects, such as scalability and
   efficiency.  For example, the number of locally available
   measurements (and, consequently, of detected SLA violations) is
   still bounded by the number of activated sessions.  Thus, if the
   number of SLA violations is greater than the number of available
   sessions, only a fraction of the violations will be observed.
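
   The bound on observed violations can be illustrated with a minimal
   sketch.  It assumes a hypothetical device with a fixed session
   budget and a static, order-based choice of probed destinations; the
   function names, destination labels, and budget value are
   illustrative assumptions and are not taken from any measurement
   protocol.

```python
def choose_probes(candidates, session_budget):
    """Hypothetical static policy: probe the first destinations in the
    list until the session budget is exhausted."""
    return list(candidates)[:session_budget]


def observed_violations(violating, probed):
    """Return the violating destinations covered by an active session."""
    return sorted(set(violating) & set(probed))


# Assumed example data, for illustration only.
destinations = ["d1", "d2", "d3", "d4", "d5", "d6"]
violating = ["d2", "d4", "d5", "d6"]  # destinations currently violating an SLO

probed = choose_probes(destinations, session_budget=3)
seen = observed_violations(violating, probed)

# The number of observed violations can never exceed the session budget.
print(f"probed={probed} observed={seen} missed={len(violating) - len(seen)}")
```

   With a budget of three sessions in this sketch, only one of the
   four assumed violations falls on a probed destination; the others
   go unobserved.
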
   Also, devices cannot share resources and knowledge about the
   networking infrastructure in order to take advantage of remote
   management information (e.g., measurement results).

4.  Benefits of an Autonomic Solution

   The use case considered here is the distributed autonomic detection
   of SLA violations.  The use of Autonomic Networking (AN) properties
   can help such detection through an efficient activation of
   measurement sessions [P2PBNM-Nobre-2012].  The problem to be solved
   by AN in the present use case is how to steer the process of
   measurement session activation by a complete solution that sets all
   necessary parameters for this activation to operate efficiently,
   reliably, and securely, with no required human intervention, while
   still allowing human input.

   We advocate embedding Peer-to-Peer (P2P) technology in network
   devices in order to improve the measurement session activation
   decisions using autonomic loops.  The provisioning of the P2P
   management overlay should be transparent to the network
   administrator.  It would be possible to control the measurement
   session activation using local data and logic and to share
   measurement results among different network devices.

   An autonomic solution for the distributed detection of SLA
   violations can provide several benefits.  First, efficiency: this
   solution could optimize the resource consumption and avoid resource
   starvation on the network devices.  This optimization comes from
   different sources: sharing of measurement results, better efficiency
   in the probe activation decisions, etc.  Second, effectiveness: the
   number of detected SLA violations could be increased.  This increase
   results from a better coverage of the network.  Third, the solution
   could decrease the time necessary to detect SLA violations.
   The adaptivity features of an autonomic loop could capture the
   network dynamics faster than a human administrator.  Finally, the
   solution could help to reduce the workload of human administrators,
   or, at least, to avoid their need to perform operational tasks.

5.  Intended User and Administrator Experience

   The autonomic solution should not require human intervention in the
   distributed detection of SLA violations.  Besides that, it could
   enable the control of SLA monitoring by less experienced human
   administrators.  However, some information may be provided by the
   human administrator.  For example, the human administrator may
   provide the SLOs regarding the SLA being monitored.  The
   configuration and bootstrapping of network devices using the
   autonomic solution should require minimal effort from the human
   administrator.  It would probably be enough to provide the address
   of a device that is already using the solution; the devices
   themselves could then exchange configuration data.

6.  Analysis of Parameters and Information Involved

   The active measurement model assumes that a typical infrastructure
   will have multiple network segments and Autonomous Systems (ASs),
   and a reasonably large number of routers and hosts.  It also
   considers that multiple SLOs can be in place at a given time.  Since
   interoperability in a heterogeneous network is a goal, features
   found in different active measurement mechanisms (e.g., OWAMP,
   TWAMP, and IPSLA) and programmability interfaces (e.g., Cisco's EEM
   and onePK) could be used for the implementation.  The autonomic
   solution should include and/or reference specific algorithms,
   protocols, metrics, and technologies for the implementation of
   distributed detection of SLA violations as a whole.

6.1.  Device Based Self-Knowledge and Decisions

   Each device has self-knowledge about the local SLA monitoring.
   This could be in the form of historical measurement data and SLOs.
   Besides that, the devices would have algorithms that could decide
   which probes should be activated at a given time.  The choice of
   which algorithm is better for a specific situation would also be
   autonomic.

6.2.  Interaction with other devices

   Network devices should share information about service level
   measurement results.  This information can speed up the detection of
   SLA violations and increase the number of detected SLA violations.
   In any case, it is necessary to ensure that the results from remote
   devices are locally relevant.  The definition of network devices
   that exchange measurement data, i.e., management peers, creates a
   new topology.  Different approaches could be used to define this
   topology (e.g., correlated peers [P2PBNM-Nobre-2012]).  To bootstrap
   peer selection, each device should use its known endpoint neighbors
   (e.g., from its FIB and RIB tables) as the initial seed to get
   possible peers.

7.  Comparison with current solutions

   There is no standardized solution for distributed autonomic
   detection of SLA violations.  Current solutions are restricted to ad
   hoc scripts running on a per-node basis to automate some of the
   administrator's actions.  There are some proposals for passive probe
   activation (e.g., DECON and CSAMP), but without a focus on autonomic
   features.  It is also worth mentioning a proposal from Barford et
   al. to detect and localize links which cause anomalies along a
   network path.

8.  Related IETF Work

   The following paragraphs discuss related IETF work and are provided
   for reference.  This section is not exhaustive; rather, it provides
   an overview of the various initiatives and how they relate to
   autonomic distributed detection of SLA violations.

   1.  [LMAP]: The Large-Scale Measurement of Broadband Performance
       Working Group aims at the standards for performance management.
       Since their mechanisms also consist of deploying measurement
       probes, the autonomic solution could be relevant for LMAP,
       especially considering SLA violation screening.  Besides that, a
       solution to decrease the workload of human administrators in
       service providers is probably highly desirable.

   2.  [IPFIX]: IP Flow Information EXport (IPFIX) aims at the
       standardization of IP flow export (i.e., NetFlow-like flow
       records).  IPFIX uses measurement probes (i.e., metering
       exporters) to gather flow data.  In this context, the autonomic
       solution for the activation of active measurement probes could
       possibly be extended to also address passive measurement probes.
       Besides that, flow information could be used in the decision
       making of probe activation.

   3.  [ALTO]: The Application Layer Traffic Optimization Working Group
       aims to provide topological information at a higher abstraction
       layer, which can be based upon network policy, and with
       application-relevant service functions located in it.  Their
       work could be leveraged for the definition of the topology
       regarding the network devices which exchange measurement data.

9.  Acknowledgements

   We wish to acknowledge the helpful contributions, comments, and
   suggestions that were received from Mohamed Boucadair, Bruno
   Klauser, Eric Voit, and Hanlin Fang.

10.  IANA Considerations

   This memo includes no request to IANA.

11.  Security Considerations

   The bootstrapping of a new device follows the approach proposed in
   the ANIMA WG [draft-anima-boot]; thus, in order to exchange data, a
   device should register first.  This registration could be performed
   by a "Registrar" device or a cloud service provided by the
   organization to facilitate autonomic mechanisms.  The new device
   sends its own credentials to the Registrar and, after successful
   authentication, receives domain information to enable subsequent
   enrolment in the domain.
   The Registrar sends all required information: a device name, a
   domain name, plus some parameters for the operation.  Measurement
   data should be signed and encrypted when exchanged among devices,
   since these data could carry sensitive information about network
   infrastructures.  Some attacks should be considered when analyzing
   the security of the autonomic solution.  Denial of service (DoS)
   attacks could be performed if the solution is tampered with to
   activate more local probes than the available resources allow.
   Besides that, results could be forged by a device (attacker) so that
   it is considered a peer of a specific device (target).  This could
   be done to gain information about a network.

12.  References

12.1.  Normative References

   [P2PBNM-Nobre-2012]
              Nobre, J., Granville, L., Clemm, A., and A. Prieto,
              "Decentralized Detection of SLA Violations Using P2P
              Technology", 8th International Conference on Network and
              Service Management (CNSM), 2012.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, October 2008.

   [RFC6812]  Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
              S., and E. Yedavalli, "Cisco Service-Level Assurance
              Protocol", RFC 6812, January 2013.

   [RFC7297]  Boucadair, M., Jacquenet, C., and N. Wang, "IP
              Connectivity Provisioning Profile (CPP)", RFC 7297, July
              2014.

   [draft-anima-boot]
              Pritikin, M., Behringer, M., and S. Bjarnason, "draft-
              pritikin-anima-bootstrapping-keyinfra", draft-pritikin-
              anima-bootstrapping-keyinfra-01 (work in progress),
              February 2015.

   [draft-elkins-ippm-pdm-option]
              Elkins, N., Hamilton, R., and M. Ackermann, "draft-elkins-
              ippm-pdm-option", draft-elkins-ippm-pdm-option-02 (work in
              progress), September 2014.

12.2.  Informative References

   [RFC3954]  Claise, B., "Cisco Systems NetFlow Services Export Version
              9", RFC 3954, October 2004.

   [RFC4148]  Stephan, E., "IP Performance Metrics (IPPM) Metrics
              Registry", BCP 108, RFC 4148, August 2005.

Authors' Addresses

   Jeferson Campos Nobre
   Federal University of Rio Grande do Sul
   Porto Alegre
   Brazil

   Email: jcnobre@inf.ufrgs.br


   Lisandro Zambenedetti Granville
   Federal University of Rio Grande do Sul
   Porto Alegre
   Brazil

   Email: granville@inf.ufrgs.br


   Alexander Clemm
   Cisco Systems
   San Jose
   USA

   Email: alex@cisco.com


   Alberto Gonzalez Prieto
   Cisco Systems
   San Jose
   USA

   Email: albertgo@cisco.com