NFVRG                                                        C. Meirosu
Internet Draft                                                 Ericsson
Intended status: Informational                             A. Manzalini
Expires: September 2016                                  Telecom Italia
                                                            R. Steinert
                                                                   SICS
                                                           G. Marchetto
                                                  Politecnico di Torino
                                                            I. Papafili
                                Hellenic Telecommunications Organization
                                                         K. Pentikousis
                                                                   EICT
                                                              S. Wright
                                                                   AT&T

                                                         March 18, 2016

          DevOps for Software-Defined Telecom Infrastructures
                      draft-unify-nfvrg-devops-04.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 20, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   Carrier-grade network management was optimized for environments built
   with monolithic physical nodes and involves significant deployment,
   integration and maintenance efforts from network service providers.
   The introduction of virtualization technologies, from the physical
   layer all the way up to the application layer, however, invalidates
   several well-established assumptions in this domain.
   This draft opens the discussion in NFVRG about challenges related to
   transforming the telecom network infrastructure into an agile,
   model-driven production environment for communication services.  We
   take inspiration from data center DevOps regarding how to simplify
   and automate management processes for a telecom service provider
   software-defined infrastructure (SDI).  Among the identified
   challenges, we consider scalability of observability processes and
   automated inference of monitoring requirements from logical
   forwarding graphs, as well as initial placement (and re-placement)
   of monitoring functionality following changes in flow paths enforced
   by the controllers.  In another category of challenges, verifying
   correctness of behavior for network functions where flow rules are
   no longer necessary and sufficient for determining the forwarding
   state (for example, stateful firewalls or load balancers) is very
   difficult with current technology.  Finally, we introduce challenges
   associated with operationalizing DevOps principles at scale in
   software-defined telecom networks in three areas related to key
   monitoring, verification and troubleshooting processes.

Table of Contents

   1. Introduction
   2. Software-Defined Telecom Infrastructure: Roles and DevOps
      principles
      2.1. Service Developer Role
      2.2. VNF Developer role
      2.3. System Integrator role
      2.4. Operator role
      2.5. Customer role
      2.6. DevOps Principles
   3. Continuous Integration
   4.
      Continuous Delivery
   5. Consistency, Availability and Partitioning Challenges
   6. Stability Challenges
   7. Observability Challenges
   8. Verification Challenges
   9. Troubleshooting Challenges
   10. Programmable network management
   11. DevOps Performance Metrics
   12. Security Considerations
   13. IANA Considerations
   14. References
      14.1. Informative References
   15. Contributors
   16. Acknowledgments
   17. Authors' Addresses

1. Introduction

   Carrier-grade network management was developed as an incremental
   solution once a particular network technology matured and came to be
   deployed in parallel with legacy technologies.  This approach
   requires significant integration efforts when new network services
   are launched.  Both centralized and distributed algorithms have been
   developed in order to solve very specific problems related to
   configuration, performance and fault management.  However, such
   algorithms consider a network that is by and large functionally
   static.  Thus, management processes related to introducing new
   functionality or maintaining existing functionality are complex and
   costly, due to the significant efforts required for verification and
   integration.
   Network virtualization, by means of Software-Defined Networking
   (SDN) and Network Function Virtualization (NFV), creates an
   environment where network functions are no longer static or strictly
   embedded in physical boxes deployed at fixed points.  The
   virtualized network is dynamic and open to fast-paced innovation,
   enabling efficient network management and reduced operating costs
   for network operators.  A significant part of network capabilities
   is expected to become available through interfaces that resemble the
   APIs widespread within datacenters, instead of the traditional
   telecom means of management such as the Simple Network Management
   Protocol, Command Line Interfaces or CORBA.  Such an API-based
   approach, combined with the programmability offered by SDN
   interfaces [RFC7426], opens opportunities for handling
   infrastructure, resources, and Virtual Network Functions (VNFs) as
   code, employing techniques from software engineering.

   The efficiency and integration of existing management techniques in
   virtualized and dynamic network environments are limited, however.
   Monitoring tools, e.g. those based on simple counters, physical
   network taps and active probing, do not scale well and provide only
   a small part of the observability features required in such a
   dynamic environment.  Although huge amounts of monitoring data can
   be collected from the nodes, the typical granularity is rather
   coarse.  Debugging and troubleshooting techniques for software-
   defined environments have gathered interest in the research
   community in recent years.  Still, it is yet to be explored how to
   integrate them into an operational network management system.
   Moreover, research tools developed in academia (such as NetSight
   [H2014], OFRewind [W2011], FlowChecker [S2010], etc.)
   were limited to solving very particular, well-defined problems and
   are oftentimes not built for automation and integration into
   carrier-grade network operations workflows.

   The topics at hand have already attracted several standardization
   organizations to look into the issues arising in this new
   environment.  For example, IETF working groups have activities in
   the area of OAM and Verification for Service Function Chaining
   [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification].  At IRTF,
   [RFC7149] asks a set of relevant questions regarding operations of
   SDNs.  The ETSI NFV ISG defines the MANO interfaces [NFVMANO], and
   TMForum investigates gaps between these interfaces and existing
   specifications in [TR228].  The need for programmatic APIs in the
   orchestration of compute, network and storage resources is discussed
   in [I-D.unify-nfvrg-challenges].

   From a research perspective, problems related to the operation of
   software-defined networks are in part outlined in [SDNsurvey], and
   research referring to both cloud and software-defined networks is
   discussed in [D4.1].

   The purpose of this document is to act as a discussion opener in
   NFVRG by describing a set of principles that are relevant for
   applying DevOps ideas to managing software-defined telecom network
   infrastructures.  We identify a set of challenges related to
   developing tools, interfaces and protocols that would support these
   principles, and discuss how standard APIs could be leveraged to
   simplify management tasks.

2. Software-Defined Telecom Infrastructure: Roles and DevOps principles

   Agile methods used in many software-focused companies aim at
   releasing small increments of code that implement VNFs into a
   production environment with high velocity and high quality.
   Similarly, service providers are interested in releasing incremental
   improvements in the network services that they create from
   virtualized network functions.  The cycle time for DevOps as applied
   in many open source projects is on the order of one quarter of a
   year, or 13 weeks.

   The code needs to undergo a significant amount of automated testing
   and verification with pre-defined templates in a realistic setting.
   From the point of view of infrastructure management, the
   verification of the network configuration resulting from network
   policy decomposition and refinement, as well as of the configuration
   of virtual functions, is one of the most sensitive operations.  When
   troubleshooting the cause of unexpected behavior, fine-grained
   visibility into all resources supporting the virtual functions
   (whether compute- or network-related) is paramount to facilitating
   fast resolution times.  While compute resources are typically very
   well covered by debugging and profiling toolsets based on many years
   of advances in software engineering, programmable network resources
   are still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

   We identify two dimensions of the "developer" role in software-
   defined infrastructure (SDI).  One dimension relates to determining
   which high-level functions should be part of a particular service,
   deciding what logical interconnections are needed between these
   blocks, and defining a set of high-level constraints or goals
   related to parameters that define, for instance, a Service Function
   Chain.  This could be determined by the product owner for a
   particular family of services offered by a telecom provider.  Or, it
   might be a key account representative that adapts an existing
   service template to the requirements of a particular customer by
   adding or removing a small number of functional entities.
   We refer to this person as the Service Developer and, for simplicity
   (access control, training on technical background, etc.), we
   consider the role to be internal to the telecom provider.

2.2. VNF Developer role

   Another dimension of the "developer" role is a person that writes
   the software code for a new virtual network function (VNF).
   Depending on the actual VNF being developed, this person might be
   internal or external (e.g. a traditional equipment vendor) to the
   telecom provider.  We refer to them as VNF Developers.

2.3. System Integrator role

   The System Integrator role is to some extent similar to the Service
   Developer: people in this role need to identify the components of
   the system to be delivered.  However, for the Service Developer, the
   service components are pre-integrated, meaning that they have the
   right interfaces to interact with each other.  In contrast, the
   System Integrator needs to develop the software that makes the
   system components interact with each other.  As such, the System
   Integrator role combines aspects of the Developer roles and adds yet
   another dimension to them.  Compared to the other Developer roles,
   the System Integrator might face additional challenges stemming from
   the fact that they might not have access to the source code of some
   of the components.  This limits, for example, how fast they can
   address issues with the components being integrated, and it may lead
   to uneven workload depending on the release granularity of the
   different components that need to be integrated.

2.4. Operator role

   The role of an Operator in SDI is to ensure that the deployment
   processes were successful and that a set of performance indicators
   associated with a service are met while the service is supported on
   virtual infrastructure within the domain of a telecom provider.

2.5.
   Customer role

   A Customer contracts a telecom operator to provide one or more
   services.  In SDI, the Customer may communicate with the provider
   through an online portal.  Compared to the Service Developer, the
   Customer is external to the operator and may define changes to their
   own service instance only in accordance with policies defined by the
   Service Developer.  In addition to the usual per-service utilization
   statistics, in SDI the portal may enable the customer to trigger
   certain performance management or troubleshooting tools for the
   service.  This, for example, enables the Customer to determine
   whether the root cause of an error or degradation condition that
   they observe is located in the telecom operator domain or not, and
   may facilitate the interaction with the customer support teams.

2.6. DevOps Principles

   In line with the generic DevOps concept outlined in [DevOpsP], we
   consider the following four principles important for adapting DevOps
   ideas to SDI:

   * Deploy with repeatable, reliable processes: Service and VNF
   Developers should be supported by automated build, orchestration and
   deployment processes that are identical in the development, test and
   production environments.  Such processes need to be made reliable
   and trusted in the sense that they should reduce the chance of human
   error and provide visibility at each stage of the process, as well
   as offer the possibility of manual interaction at certain key
   stages.

   * Develop and test against production-like systems: both Service
   Developers and VNF Developers need to have the opportunity to verify
   and debug their respective SDI code in systems whose characteristics
   are very close to the production environment where the code is
   expected to be ultimately deployed.
   Customizations of Service Function Chains or VNFs could thus be
   released frequently to a production environment in compliance with
   policies set by the Operators.  The production environment should
   provide adequate isolation and protection of the services active in
   the infrastructure from services being tested or debugged.

   * Monitor and validate operational quality: Service Developers, VNF
   Developers and Operators must be equipped with tools, automated as
   much as possible, that enable them to continuously monitor the
   operational quality of the services deployed on SDI.  Monitoring
   tools should be complemented by tools for verifying and validating
   the operational quality of the service in line with established
   procedures, which might be standardized (for example, Y.1564
   Ethernet Activation [Y1564]) or defined through best practices
   specific to a particular telecom operator.

   * Amplify development cycle feedback loops: An integral part of the
   DevOps ethos is building a cross-cultural environment that bridges
   the gap between the Developers' desire for continuous change and the
   Operators' demand for stability and reliability of the
   infrastructure.  Feedback from customers is collected and
   transmitted throughout the organization.  From a technical
   perspective, such cultural aspects could be addressed through common
   sets of tools and APIs that provide a shared vocabulary for both
   Developers and Operators, and that simplify the reproduction of
   problematic situations across the development, test and operations
   environments.

   Network operators that would like to adopt agile methods to deploy
   and manage their networks and services face a different environment
   compared to typical software companies, where simplified trust
   relationships between personnel are the norm.
   In software companies, it is not uncommon for the same person to
   rotate between different roles.  In contrast, in a telecom service
   provider there are strong organizational boundaries between
   suppliers (whether in Developer roles for network functions, or in
   Operator roles for outsourced services) and the carrier's own
   personnel, who might also take both Developer and Operator roles.
   How DevOps principles reflect on these trust relationships, and to
   what extent initiatives such as co-creation could transform the
   environment to facilitate closer Dev and Ops integration across
   business boundaries, is an interesting area for business studies,
   but we could not for now identify a specific technological
   challenge.

3. Continuous Integration

   Software integration is the process of bringing together the
   software component subsystems into one software system and ensuring
   that the subsystems function together as a system.  Software
   integration can apply regardless of the size of the software
   components.  The objective of Continuous Integration is to prevent
   integration problems close to the expected release of a software
   development project into a production (operations) environment.
   Continuous Integration is therefore closely coupled with the notion
   of DevOps as a mechanism to ease the transition from development to
   operations.

   Continuous Integration may result in multiple builds per day.  It is
   also typically used in conjunction with test-driven development
   approaches that integrate unit testing into the build process.  The
   unit testing is typically automated through build servers.  Such
   servers may implement a variety of additional static and dynamic
   tests, as well as other quality control and documentation extraction
   functions.  The reduced cycle times of Continuous Integration enable
   improved software quality by applying small efforts frequently.
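   The build-server behavior described above - run the build, unit-test
   and static-analysis stages in order, stopping at the first failure -
   can be sketched as a minimal pipeline runner.  This is an
   illustrative Python sketch; the stage names and the pass/fail gating
   policy are assumptions for the example, not a description of any
   specific CI product.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One pipeline stage: a name plus a callable returning True on success."""
    name: str
    run: Callable[[], bool]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Execute stages in order; stop at the first failure, as a CI gate would."""
    report: List[str] = []
    for step in steps:
        ok = step.run()
        report.append(f"{step.name}: {'PASS' if ok else 'FAIL'}")
        if not ok:
            return False, report
    return True, report


# Toy stand-ins for the build, unit-test and static-analysis stages.
steps = [
    Step("build", lambda: True),
    Step("unit-tests", lambda: 1 + 1 == 2),
    Step("static-analysis", lambda: True),
]
ok, report = run_pipeline(steps)  # ok is True; all three stages pass
```

   A real build server would replace the lambdas with compiler, test-
   runner and linter invocations; the point is only that the gate is
   automated and repeatable across development, test and production
   environments.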
   Continuous Integration applies to VNF Developers as they integrate
   the components that they need to deliver their VNF.  The VNFs may
   contain components developed by different teams within the VNF
   Provider, or may integrate code developed externally, e.g. in
   commercial code libraries or in open source communities.

   Service providers also apply Continuous Integration in the
   development of network services.  Network services comprise various
   aspects, including VNFs, connectivity within and between them, and
   various associated resource authorizations.  The components of the
   network service are all dynamic, and largely represented by software
   that must be integrated regularly to maintain consistency.  Some of
   the software components that Service Providers use may be sourced
   from VNF Providers or from open source communities.  Service
   Providers are increasingly motivated to engage with open source
   communities [OSandS].  Open source interfaces supported by open
   source communities may be more useful than traditional paper
   interface specifications.  Even where Service Providers are deeply
   engaged in an open source community (e.g. OPNFV), many service
   providers may prefer to obtain the code through some software
   provider as a business practice.  Such software providers have the
   same interests in software integration as other VNF providers.

4. Continuous Delivery

   The practice of Continuous Delivery extends Continuous Integration
   by ensuring that the software (either VNF code or code for SDI)
   checked in on the mainline is always in a user-deployable state and
   enables rapid deployment by those users.
   For critical systems such as telecommunications networks, Continuous
   Delivery has the advantage of including a manual trigger before the
   actual deployment in the live system, compared to the Continuous
   Deployment methodology which is also part of DevOps processes in
   software companies.

5. Consistency, Availability and Partitioning Challenges

   The CAP theorem [CAP] states that any networked shared-data system
   can have at most two of the following three properties: 1)
   Consistency (C), equivalent to having a single up-to-date copy of
   the data; 2) high Availability (A) of that data (for updates); and
   3) tolerance to network Partitions (P).

   Looking at a telecom SDI as a distributed computational system
   (routing/forwarding packets can be seen as a computational problem),
   just two of the three CAP properties can hold at the same time:
   choosing CP favors consistency, choosing AP favors availability, and
   CA holds only in the absence of partitions.  This has profound
   implications for technologies that need to be developed in line with
   the "deploy with repeatable, reliable processes" principle for
   configuring SDI states.  Latency or delay and partitioning
   properties are closely related, and this relation becomes more
   important in the case of telecom service providers, where Devs and
   Ops interact with widely distributed infrastructure.

   Limitations of interactions between centralized management and
   distributed control need to be carefully examined in such
   environments.  Traditionally, connectivity was the main concern: C
   and A were about delivering packets to the destination.  The
   features and capabilities of SDN and NFV are changing the concerns:
   for example, in SDN, control plane partitions no longer imply data
   plane partitions, so A does not imply C.
   In practice, CAP reflects the need for a balance between
   local/distributed operations and remote/centralized operations.

   In addition to the CAP aspects of individual protocols,
   interdependencies between CAP choices for both resources and VNFs
   that are interconnected in a forwarding graph need to be considered.
   This is particularly relevant for the "Monitor and Validate
   Operational Quality" principle because, apart from transport
   protocols, most OAM functionality is generally configured in
   processes that are separated from the configuration of the monitored
   entities.  Also, partitioning in a monitoring plane implemented
   through VNFs executed on compute resources does not necessarily mean
   that the dataplane of the monitored VNF was partitioned as well.

6. Stability Challenges

   The dimensions, dynamicity and heterogeneity of networks are growing
   continuously.  Monitoring and managing the network behavior in order
   to meet technical and business objectives is becoming increasingly
   complicated and challenging, especially when considering the need
   for predicting and taming potential instabilities.

   In general, instability in networks may have primary effects that
   both jeopardize the performance and compromise an optimized use of
   resources, even across multiple layers: in fact, instability of end-
   to-end communication paths may depend both on the underlying
   transport network and on the higher-level components specific to
   flow control and dynamic routing.  For example, arguments for
   introducing advanced flow admission control are essentially derived
   from the observation that the network otherwise behaves in an
   inefficient and potentially unstable manner.  Even with over-
   provisioned resources, a network without efficient flow admission
   control has instability regions that can even lead to congestion
   collapse in certain configurations.
   Another example is the instability which is characteristic of any
   dynamically adaptive routing system.  Routing instability, which can
   be (informally) defined as the quick change of network reachability
   and topology information, has a number of possible origins,
   including problems with connections, router failures, high levels of
   congestion, software configuration errors, transient physical and
   data link problems, and software bugs.

   As a matter of fact, the states monitored and used to implement the
   different control and management functions in network nodes are
   governed by several low-level configuration commands (today still
   issued mostly manually).  Further, there are several dependencies
   among these states and the logic updating the states (most of which
   are not kept aligned automatically).  Normally, high-level network
   goals (such as the connectivity matrix, load-balancing, traffic
   engineering goals, survivability requirements, etc.) are translated
   into low-level configuration commands (mostly manually) individually
   executed on the network elements (e.g., forwarding table, packet
   filters, link-scheduling weights, and queue-management parameters,
   as well as tunnels and NAT mappings).  Network instabilities due to
   configuration errors can spread from node to node and propagate
   throughout the network.

   DevOps in the data center is a source of inspiration regarding how
   to simplify and automate management processes for software-defined
   infrastructure.  Although the low-level configuration could be
   automated by DevOps tools such as CFEngine [C2015], Puppet [P2015]
   and Ansible [A2015], the translation of high-level goals into tool-
   specific syntax is still a manual process.
   In addition, while carrier-grade configuration tools using the
   NETCONF protocol support complex atomic transaction management
   (which reduces the potential for instability), Ansible requires
   third-party components to support rollbacks, and Puppet transactions
   are not atomic.

   As a specific example, automated configuration functions are
   expected to take the form of a "control loop" that monitors (i.e.,
   measures) current states of the network, performs a computation, and
   then reconfigures the network.  These types of functions must work
   correctly even in the presence of failures, variable delays in
   communicating with a distributed set of devices, and frequent
   changes in network conditions.  Nevertheless, cascading and nesting
   of automated configuration processes can lead to the emergence of
   non-linear network behaviors, and as such to sudden instabilities
   (i.e., identical local dynamics can give rise to widely different
   global dynamics).

7. Observability Challenges

   Monitoring algorithms need to operate in a scalable manner while
   providing the specified level of observability in the network,
   either for operational purposes (the Ops part) or for debugging in a
   development phase (the Dev part).  We consider the following
   challenges:

   * Scalability - relates to the granularity of network observability,
   computational efficiency, communication overhead, and strategic
   placement of monitoring functions.

   * Distributed operation and information exchange between monitoring
   functions - monitoring functions supported by the nodes may perform
   specific operations (such as aggregation or filtering) locally on
   the collected data, or within a defined data neighborhood, and
   forward only the result to a management system.
   Such operation may require modifications of existing standards and
   the development of protocols for efficient information exchange and
   messaging between monitoring functions.  Different levels of
   granularity may need to be offered for the data exchanged through
   the interfaces, depending on the Dev or Ops role.  Modern messaging
   systems, such as Apache Kafka [AK2015], widely employed in
   datacenter environments, were optimized for messages that are
   considerably larger than a single counter value read (the typical
   usage of an SNMP GET call) - note the throughput vs. record size
   results in [K2014].  It is also debatable to what extent properties
   such as message persistence within the bus are needed in a carrier
   environment, where MIBs already offer a certain level of persistence
   of management data at the node level.  Also, such systems require
   the use of IP addressing, which might not be needed when the
   monitored data is consumed by a function within the same node.

   * Common communication channel between monitoring functions and
   higher layer entities (orchestration, control or management systems)
   - a single communication channel for configuration and measurement
   data of diverse monitoring functions running on heterogeneous
   hardware and software environments.  In telecommunication
   environments, infrastructure assets span not only large geographical
   areas, but also a wide range of technology domains, ranging from
   CPEs, access, aggregation and transport networks, to datacenters.
   This heterogeneity of hardware and software platforms requires
   higher layer entities to utilize various parallel communication
   channels for either configuration or data retrieval of monitoring
   functions within these technology domains.
To address automation and advances in monitoring programmability, software-defined telecommunication infrastructures would benefit from a single flexible communication channel, thereby supporting the dynamicity of virtualized environments. Such a channel should ideally: support propagation of configuration, signalling, and results from monitoring functions; provide carrier-grade operations in terms of availability and multi-tenant features; support highly distributed and hierarchical architectures, keeping messages as local as possible; be lightweight, topology independent, and network address agnostic; and support flexibility in terms of transport mechanisms and programming languages. Existing popular state-of-the-art message queuing systems such as RabbitMQ [R2015] fulfill many of these requirements. However, they utilize centralized brokers, posing a single point of failure and scalability concerns in a vastly distributed NFV environment. Furthermore, transport support is limited to TCP/IP. ZeroMQ [Z2015], on the other hand, lacks advanced features for carrier-grade operations, including high availability, authentication, and tenant isolation.

* Configurability and conditional observability - monitoring functions that go beyond measuring simple metrics (such as delay or packet loss) require expressive monitoring annotation languages for describing the functionality such that it can be programmed by a controller. Monitoring algorithms implementing self-adaptive monitoring behavior relative to local network situations may employ such annotation languages to receive high-level objectives (KPIs controlling tradeoffs between accuracy and measurement frequency, for example) and conditions for varying the measurement intensity.
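As a minimal illustration (purely hypothetical, not tied to any cited tool or annotation language), KPI-driven adaptation of measurement intensity might be sketched as follows, where all function and parameter names are invented for the example:

```python
# Sketch: a monitoring function that adapts its probing interval to a
# high-level stability objective (all names are illustrative).

def next_interval(samples, base_interval, max_interval, threshold):
    """Lengthen the probing interval while readings are stable,
    shorten it as soon as variability exceeds the objective."""
    if len(samples) < 2:
        return base_interval
    mean = sum(samples) / len(samples)
    # Mean absolute deviation as a cheap stability estimate.
    mad = sum(abs(s - mean) for s in samples) / len(samples)
    if mad > threshold:         # condition met: observe intensively
        return base_interval
    return min(max_interval, base_interval * 4)  # relax intensity

# Stable delay readings allow a sparser measurement schedule...
assert next_interval([10.0, 10.1, 9.9], 1.0, 30.0, 0.5) == 4.0
# ...while volatile readings trigger intensive monitoring again.
assert next_interval([10.0, 25.0, 4.0], 1.0, 30.0, 0.5) == 1.0
```

An annotation language as discussed above would express the objective (the threshold and interval bounds) declaratively, leaving the adaptation logic to the monitoring function itself.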
Steps in this direction were taken by DevOps tools such as Splunk [S2015], whose collecting agent can load particular apps that in turn access specific counters or log files. However, such apps are tool specific and may also require deploying additional agents that are specific to the application, library or infrastructure node being monitored. Choosing which objects to monitor in such an environment means deploying a tool-specific script that configures the monitoring app.

* Automation - includes mapping of monitoring functionality from a logical forwarding graph to virtual or physical instances executing in the infrastructure, as well as placement and re-placement of monitoring functionality for required observability coverage and configuration consistency upon updates in a dynamic network environment. Puppet [P2015] manifests or Ansible [A2015] playbooks could be used for automating the deployment of monitoring agents, for example those used by Splunk [S2015]. However, both manifests and playbooks were designed to represent the desired system configuration snapshot at a particular moment in time - they would now need to be generated automatically by the orchestration tools instead of a DevOps person.

* Actionable data

Data produced by observability tools could be utilized in a wide category of processes, ranging from billing and dimensioning to real-time troubleshooting and optimization. In order to allow for data-driven automated decisions and actuations based on these decisions, the data needs to be actionable. We define actionable data as being representative for a particular context or situation and an adequate input towards a decision.
Ensuring actionable data is challenging in a number of ways, including: defining adaptive correlation and sampling windows, filtering and aggregation methods that are adapted to or coordinated with the actual consumer of the data, and developing analytical and predictive methods that account for the uncertainty or incompleteness of the data.

* Data Virtualization

Data is key in helping both Developers and Operators perform their tasks. Traditional Network Management Systems were optimized for using one database that contains the master copy of the operational statistics and logs of network nodes. Ensuring access to this data from across the organization is challenging because strict privacy requirements and business secrets need to be protected. In DevOps-driven environments, data needs to be made available to Developers and their test environments. Data virtualization collectively defines a set of technologies that ensure that restricted copies of the partial data needed for a particular task may be made available while enforcing strict access control. Beyond simple access control, data virtualization needs to address the scalability challenges involved in copying large amounts of operational data, as well as automatically disposing of it when the task authorized to use it has finished.

8. Verification Challenges

Enabling ongoing verification of code is an important goal of continuous integration as part of the data center DevOps concept. In a telecom SDI, service definitions, decompositions and configurations need to be expressed in machine-readable encodings. For example, configuration parameters could be expressed in terms of YANG data models. However, the infrastructure management layers (such as Software-Defined Network Controllers and Orchestration functions) might not always export such machine-readable descriptions of the runtime configuration state.
In this case, the management layer itself could be expected to include a verification process that faces the same challenges as the stand-alone verification processes we outline later in this section. In that sense, verification can be considered a set of features providing gatekeeper functions that verify both the abstract service models and the proposed resource configuration before, or right after, the actual instantiation on the infrastructure layer takes place.

A verification process can involve different layers of the network and service architecture. Starting from a high-level verification of the customer input (for example, a Service Graph as defined in [I-D.unify-nfvrg-challenges]), the verification process could go deeper to reflect on the Service Function Chain configuration. At the lowest layer, the verification would handle the actual set of forwarding rules and other configuration parameters associated with a Service Function Chain instance. This enables the verification of more quantitative properties (e.g. compliance with resource availability), as well as a more detailed and precise verification of the abovementioned topological ones. Existing SDN verification tools could be deployed in this context, but the majority of them only operate on flow space rules, commonly expressed using OpenFlow syntax.

Moreover, such verification tools were designed for networks where the flow rules are necessary and sufficient to determine the forwarding state. This assumption is valid in networks composed only of network functions that forward traffic by analyzing only the packet headers (e.g. simple routers, stateless firewalls, etc.).
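Under that stateless assumption, verification reduces to analyzing the rule set alone; the following toy Python sketch (illustrative only, far simpler than production SDN verification tools, with invented rule and node names) checks reachability by simulating header-only forwarding:

```python
# Toy verifier: forwarding state is fully determined by match/action
# rules over packet headers (the stateless assumption discussed above).

def forward(tables, node, header):
    """Return the next hop this node's rules select for the header."""
    for match, next_hop in tables.get(node, []):
        if all(header.get(k) == v for k, v in match.items()):
            return next_hop
    return None  # no matching rule: packet dropped

def reaches(tables, src, dst, header, max_hops=16):
    """Header-driven reachability check with a loop bound."""
    node = src
    for _ in range(max_hops):
        if node == dst:
            return True
        node = forward(tables, node, header)
        if node is None:
            return False
    return False  # hop budget exhausted (possible forwarding loop)

tables = {
    "A": [({"dst_ip": "10.0.0.2"}, "B")],
    "B": [({"dst_ip": "10.0.0.2"}, "C")],
}
assert reaches(tables, "A", "C", {"dst_ip": "10.0.0.2"})
assert not reaches(tables, "A", "C", {"dst_ip": "10.0.0.9"})
```

Note that the check is sound only because no node changes its rules while forwarding is simulated; it is precisely this property that active network functions break.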
Unfortunately, most real networks contain active network functions, represented by middle-boxes that dynamically change the forwarding path of a flow according to function-local algorithms and an internal state based on the received packets, e.g. load balancers, packet marking modules and intrusion detection systems. Existing verification tools do not consider active network functions because they do not incorporate the dynamic transformation of internal state into the verification process.

Defining a set of verification tools that can account for active network functions is a significant challenge. In order to perform verification based on formal properties of the system, the internal states of an active (virtual or not) network function would need to be represented. Although these states increase the complexity of the verification process (e.g., simple model checking would not be feasible due to state explosion), they help to better represent the forwarding behavior in real networks. A way to address this challenge is to summarize the internal state of an active network function in a way that allows the verification process to finish within a reasonable time interval.

9. Troubleshooting Challenges

One of the problems brought up by the complexity introduced by NFV and SDN is pinpointing the cause of a failure in an infrastructure that is under continuous change. Developing an agile and low-maintenance debugging mechanism for an architecture comprised of multiple layers and discrete components is a particularly challenging task. Verification, observability, and probe-based tools are key to troubleshooting processes, regardless of whether they are employed by Dev or Ops personnel.

* Automated troubleshooting workflows

Failure is a frequently occurring event in network operation.
Therefore, it is crucial to monitor components of the system periodically. Moreover, in the case of failure, the troubleshooting system should search for the cause automatically. If the system follows a multi-layered architecture, monitoring and debugging actions should be performed on components in a chain, from the topmost layer to the bottom layer, and the results of these operations should be reported in reverse order. In this regard, one should be able to define monitoring and debugging actions through a common interface that employs this layer-hopping logic. In addition, this interface should allow fine-grained and automatic on-demand control for the integration of other monitoring and verification mechanisms and tools.

* Troubleshooting with active measurement methods

Besides detecting network changes based on passively collected information, active probes that quantify delay, network utilization and loss rate are important to debug errors and to evaluate the performance of network elements. While tools that are effective in determining such conditions for particular technologies were specified by the IETF and other standardization organizations, their use requires a significant amount of manual labor in terms of both configuration and interpretation of the results.

In contrast, methods that test and debug networks systematically, based on models generated from the router configuration, router interface tables or forwarding tables, would significantly simplify management. They could be made usable by Dev personnel who have little expertise in diagnosing network defects. Such tools naturally lend themselves to integration into complex troubleshooting workflows that could be generated automatically based on the description of a particular service chain. However, there are scalability challenges associated with deploying such tools in a network.
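As an illustration of what such systematic test generation involves (a simplified sketch, not any specific tool), probe minimization can be cast as a set-cover problem over the links that each candidate test packet, derived from the forwarding tables, would traverse; all names and data below are invented for the example:

```python
# Greedy selection of test packets: each candidate packet exercises a
# set of links; pick packets until every link is covered at least once.

def choose_test_packets(candidates):
    """candidates: {packet_id: set of links it traverses}.
    Returns a small (greedy, not guaranteed minimal) covering subset."""
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Pick the packet covering the most still-uncovered links.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining links unreachable by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

paths = {
    "p1": {("A", "B"), ("B", "C")},
    "p2": {("A", "B")},
    "p3": {("C", "D")},
}
# Two packets suffice: p1 covers A-B and B-C, p3 covers C-D.
assert set(choose_test_packets(paths)) == {"p1", "p3"}
```

The scalability concern discussed here is visible even in this sketch: the candidate sets must be recomputed from the forwarding tables whenever they change.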
Some tools may poll each networking device for the forwarding table information in order to calculate the minimum number of test packets to be transmitted in the network. Therefore, as the network size and the forwarding table size increase, forwarding table updates for these tools may put a non-negligible load on the network.

10. Programmable network management

The ability to automate a set of actions to be performed on the infrastructure, be it virtual or physical, is key to the productivity increases that follow from applying DevOps principles. Previous sections in this document touched on different dimensions of programmability:

- Section 7 approached programmability in the context of developing new capabilities for monitoring and for dynamically setting configuration parameters of deployed monitoring functions

- Section 8 reflected on the need to determine the correctness of actions that are to be inflicted on the infrastructure as a result of executing a set of high-level instructions

- Section 9 considered programmability from the perspective of an interface to facilitate dynamic orchestration of troubleshooting steps towards building workflows and reducing the manual steps required in troubleshooting processes

We expect that programmable network management - along the lines of [RFC7426] - will draw more interest as we move forward. For example, in [I-D.unify-nfvrg-challenges], the authors identify the need for presenting programmable interfaces that accept instructions in a standards-supported manner for the Two-Way Active Measurement Protocol (TWAMP). More specifically, an excellent example in this case is traffic measurements, which are extensively used today to determine SLA adherence as well as to debug and troubleshoot pain points in service delivery.
TWAMP is both widely implemented by all established vendors and deployed by most global operators. However, TWAMP management and control today rely solely on diverse and proprietary tools provided by the respective equipment vendors. For large, virtualized, and dynamically instantiated infrastructures, where network functions are placed according to orchestration algorithms, such proprietary mechanisms for managing TWAMP measurements have severe limitations. For example, today's TWAMP implementations are managed through vendor-specific, typically command-line, interfaces (CLIs), which can be scripted on a platform-by-platform basis. As a result, although the control and test measurement protocols are standardized, their respective management is not. This dramatically hinders the integration of such deployed functionality into the SP-DevOps concept. In this particular case, recent efforts in the IPPM WG [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model and effectively increase the programmability of TWAMP deployments in the future.

Data center DevOps tools, such as those surveyed in [D4.1], developed proprietary methods for describing and interacting through interfaces with the managed infrastructure. Within certain communities, they became de-facto standards, in the same way particular CLIs became de-facto standards for Internet professionals. Although open-source components and strong community involvement exist, the diversity of the new languages and interfaces creates a burden both for vendors, in terms of choosing which ones to prioritize for support and then developing the functionality, and for operators, in terms of determining what fits best the requirements of their systems.

11. DevOps Performance Metrics

Defining a set of metrics that can be used as performance indicators is important for service providers to ensure the successful deployment and operation of a service in the software-defined telecom infrastructure.

We identify three types of considerations that are particularly relevant for these metrics: 1) technical considerations directly related to the service provided, 2) process-related considerations regarding the deployment, maintenance and troubleshooting of the service, i.e. concerning the operation of VNFs, and 3) cost-related considerations associated with the benefits of using a Software-Defined Telecom Infrastructure.

First, technical performance metrics shall be service-dependent and service-oriented, and may address, inter alia, service performance in terms of delay, throughput, congestion, energy consumption, availability, etc. Acceptable performance levels should be mapped to SLAs and the requirements of the service users. Metrics in this category were defined in IETF working groups and other standardization organizations with responsibility over particular service or infrastructure descriptions.

Second, process-related metrics shall serve a wider perspective, in the sense that they shall be applicable to multiple types of services. For instance, process-related metrics may include: number of probes for end-to-end QoS monitoring, number of on-site interventions, number of unused alarms, number of configuration mistakes, incident/trouble resolution delay, delay between service order and delivery, or number of self-care operations.

Third, cost-related metrics shall be used to monitor and assess the benefit of employing SDI compared to the usage of legacy hardware infrastructure with respect to operational costs, e.g. possible man-hour reductions, elimination of deployment and configuration mistakes, etc.
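As an illustration of how a process-related metric could be derived automatically from operational event logs (the log format and names below are hypothetical), incident resolution delay can be computed from paired open/close events:

```python
# Sketch: derive an incident resolution delay metric from paired
# open/close events (timestamps in seconds; format is illustrative).

def resolution_delays(events):
    """events: list of (incident_id, "open" or "close", timestamp)."""
    opened = {}
    delays = []
    for incident, kind, ts in events:
        if kind == "open":
            opened[incident] = ts
        elif kind == "close" and incident in opened:
            delays.append(ts - opened.pop(incident))
    return delays

log = [
    ("INC-1", "open", 100), ("INC-2", "open", 130),
    ("INC-1", "close", 400), ("INC-2", "close", 190),
]
delays = resolution_delays(log)
assert delays == [300, 60]
assert sum(delays) / len(delays) == 180  # mean resolution delay
```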
Finally, identifying a number of highly relevant metrics for DevOps, and especially monitoring and measuring them, is highly challenging because of the number and availability of data sources that could be aggregated within one such metric, e.g. calculation of human intervention, or confidential aspects of costs.

12. Security Considerations

TBD

13. IANA Considerations

This memo includes no request to IANA.

14. References

14.1. Informative References

[NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and Orchestration V0.6.1 (draft)", Jul. 2014.

[I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya. "Service Function Chaining Operations, Administration and Maintenance Framework", draft-aldrin-sfc-oam-framework-02 (work in progress), July 2015.

[I-D.lee-sfc-verification] S. Lee and M. Shin. "Service Function Chaining Verification", draft-lee-sfc-verification-00 (work in progress), February 2014.

[RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J. Hadi Salim, D. Meyer, and O. Koufopavlou, "Software-Defined Networking (SDN): Layers and Architecture Terminology", RFC 7426, January 2015.

[RFC7149] M. Boucadair and C. Jacquenet. "Software-Defined Networking: A Perspective from within a Service Provider Environment", RFC 7149, March 2014.

[TR228] TMForum, "Gap Analysis Related to MANO Work", TR228, May 2014.

[I-D.unify-nfvrg-challenges] R. Szabo et al. "Unifying Carrier and Cloud Networks: Problem Statement and Challenges", draft-unify-nfvrg-challenges-03 (work in progress), October 2015.

[I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way Active Measurement Protocol (TWAMP) Data Model", draft-cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015.

[D4.1] W. John et al.
D4.1 Initial requirements for the SP-DevOps concept, universal node capabilities and proposed tools, August 2014.

[SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve Rothenberg, S. Azodolmolky, S. Uhlig. "Software-Defined Networking: A Comprehensive Survey." To appear in Proceedings of the IEEE, 2015.

[DevOpsP] "DevOps, the IBM Approach", 2013. [Online].

[Y1564] ITU-T Recommendation Y.1564: Ethernet service activation test methodology, March 2011.

[CAP] E. Brewer, "CAP twelve years later: How the 'rules' have changed", IEEE Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012.

[H2014] N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N. McKeown; "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks", in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 71-95.

[W2011] A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann; "OFRewind: Enabling Record and Replay Troubleshooting for Networks", in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '11), pp. 327-340.

[S2010] E. Al-Shaer and S. Al-Haj. "FlowChecker: configuration analysis and verification of federated OpenFlow infrastructures", in Proceedings of the 3rd ACM Workshop on Assurable and Usable Security Configuration (SafeConfig '10), pp. 37-44.

[OSandS] S. Wright, D. Druta, "Open Source and Standards: The Role of Open Source in the Dialogue between Research and Standardization", Globecom Workshops (GC Wkshps), 2014, pp. 650-655, 8-12 Dec. 2014.

[C2015] CFEngine. Online: http://cfengine.com/product/what-is-cfengine/, retrieved Sep 23, 2015.

[P2015] Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet, retrieved Sep 23, 2015.

[A2015] Ansible. Online: http://docs.ansible.com/, retrieved Sep 23, 2015.

[AK2015] Apache Kafka.
Online: http://kafka.apache.org/documentation.html, retrieved Sep 23, 2015.

[S2015] Splunk. Online: http://www.splunk.com/en_us/products/splunk-light.html, retrieved Sep 23, 2015.

[K2014] J. Kreps. Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines). Online: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines, retrieved Sep 23, 2015.

[R2015] RabbitMQ. Online: https://www.rabbitmq.com/, retrieved Oct 13, 2015.

[Z2015] ZeroMQ. Online: http://zeromq.org/, retrieved Oct 13, 2015.

15. Contributors

W. John (Ericsson), J. Kim (Deutsche Telekom), S. Sharma (iMinds)

16. Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 619609 - the UNIFY project. The views expressed here are those of the authors only. The European Commission is not liable for any use that may be made of the information in this document.

We would like to thank in particular the UNIFY WP4 contributors, the internal reviewers of the UNIFY WP4 deliverables, and Russ White and Ramki Krishnan for their suggestions.

This document was prepared using 2-Word-v2.0.template.dot.

17. Authors' Addresses

Catalin Meirosu
Ericsson Research
S-16480 Stockholm, Sweden
Email: catalin.meirosu@ericsson.com

Antonio Manzalini
Telecom Italia
Via Reiss Romoli, 274
10148 - Torino, Italy
Email: antonio.manzalini@telecomitalia.it

Juhoon Kim
Deutsche Telekom AG
Winterfeldtstr.
21
10781 Berlin, Germany
Email: J.Kim@telekom.de

Rebecca Steinert
SICS Swedish ICT AB
Box 1263, SE-16429 Kista, Sweden
Email: rebste@sics.se

Sachin Sharma
Ghent University-iMinds
Research group IBCN - Department of Information Technology
Zuiderpoort Office Park, Blok C0
Gaston Crommenlaan 8 bus 201
B-9050 Gent, Belgium
Email: sachin.sharma@intec.ugent.be

Guido Marchetto
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 - Torino, Italy
Email: guido.marchetto@polito.it

Ioanna Papafili
Hellenic Telecommunications Organization
Measurements and Wireless Technologies Section
Laboratories and New Technologies Division
2, Spartis & Pelika str., Maroussi,
GR-15122, Attica, Greece
Building E, Office 102
Email: iopapafi@oteresearch.gr

Kostas Pentikousis
EICT GmbH
Torgauer Strasse 12-15
Berlin 10829
Germany
Email: k.pentikousis@eict.de

Steven Wright
AT&T Services Inc.
1057 Lenox Park Blvd NE, STE 4D28
Atlanta, GA 30319
USA
Email: sw3588@att.com

Wolfgang John
Ericsson Research
S-16480 Stockholm, Sweden
Email: wolfgang.john@ericsson.com