NFVRG                                                        C. Meirosu
Internet Draft                                                 Ericsson
Intended status: Informational                             A. Manzalini
Expires: January 2017                                    Telecom Italia
                                                            R. Steinert
                                                                   SICS
                                                           G. Marchetto
                                                  Politecnico di Torino
                                                         K. Pentikousis
                                                                   EICT
                                                              S. Wright
                                                                   AT&T
                                                               P. Lynch
                                                                   Ixia
                                                                W. John
                                                               Ericsson

                                                           July 8, 2016

          DevOps for Software-Defined Telecom Infrastructures
                    draft-unify-nfvrg-devops-06.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 8, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Abstract

Carrier-grade network management was optimized for environments built with monolithic physical nodes and involves significant deployment, integration and maintenance efforts from network service providers. The introduction of virtualization technologies, from the physical layer all the way up to the application layer, however, invalidates several well-established assumptions in this domain. This draft opens the discussion in NFVRG about challenges related to transforming the telecom network infrastructure into an agile, model-driven environment for communication services. We take inspiration from data center DevOps regarding the simplification and automation of management processes for a telecom service provider software-defined infrastructure (SDI). A number of challenges associated with operationalizing DevOps principles at scale in software-defined telecom networks are identified in three areas related to key programmable management processes.

Table of Contents

1. Introduction...................................................3
2. Software-Defined Telecom Infrastructure: Roles and DevOps principles........................................................5
   2.1. Service Developer Role....................................6
   2.2. VNF Developer role........................................6
   2.3. System Integrator role....................................6
   2.4. Network Service Operator role.............................7
   2.5. Customer role.............................................7
   2.6. DevOps Principles.........................................7
3. Continuous Integration.........................................9
4. Continuous Delivery...........................................10
5. Consistency, Availability and Partitioning Challenges.........10
6. Stability and Real-Time Change Challenges.....................11
7. Observability Challenges......................................13
8. Verification Challenges.......................................15
9. Testing Challenges............................................17
10. Programmable management......................................18
11. Security Considerations......................................20
12. IANA Considerations..........................................20
13. References...................................................20
   13.1. Informative References..................................20
14. Contributors to earlier versions.............................23
15. Acknowledgments..............................................23
16. Authors' Addresses...........................................24

1. Introduction

Carrier-grade network management was developed as an incremental solution once a particular network technology matured and came to be deployed in parallel with legacy technologies. This approach requires significant integration efforts when new network services are launched. Both centralized and distributed algorithms have been developed in order to solve very specific problems related to configuration, performance and fault management. However, such algorithms consider a network that is by and large functionally static. Thus, management processes related to introducing new functionality or maintaining existing functionality are complex and costly due to the significant efforts required for verification and integration.
Network virtualization, by means of Software-Defined Networking (SDN) and Network Function Virtualization (NFV), creates an environment where network functions are no longer static or strictly embedded in physical boxes deployed at fixed points. The virtualized network is dynamic and open to fast-paced innovation, enabling efficient network management and reduction of operating cost for network operators. A significant part of network capabilities is expected to become available through interfaces that resemble the APIs widespread within datacenters, instead of the traditional telecom means of management such as the Simple Network Management Protocol, Command Line Interfaces or CORBA. Such an API-based approach, combined with the programmability offered by SDN interfaces [RFC7426], opens opportunities for handling infrastructure, resources, and Virtual Network Functions (VNFs) as code, employing techniques from software engineering.

The efficiency and integration of existing management techniques in virtualized and dynamic network environments are limited, however. Monitoring tools, e.g. based on simple counters, physical network taps and active probing, do not scale well and provide only a small part of the observability features required in such a dynamic environment. Although huge amounts of monitoring data can be collected from the nodes, the typical granularity is rather static and coarse, and management bandwidths may be limited. Debugging and troubleshooting techniques developed for software-defined environments are a research topic that has gathered interest in the research community in recent years. Still, it is yet to be explored how to integrate them into an operational network management system. Moreover, research tools developed in academia (such as NetSight [H2014], OFRewind [W2011], FlowChecker [S2010], etc.) were limited to solving very particular, well-defined problems, and oftentimes are not built for automation and integration into carrier-grade network operations workflows. As the virtualized network functions, infrastructure software and infrastructure hardware become more dynamic [NFVSWA], the monitoring, management and testing approaches also need to change.

The topics at hand have already attracted several standardization organizations to look into the issues arising in this new environment. For example, IETF working groups have activities in the area of OAM and Verification for Service Function Chaining [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification]. At IRTF, [RFC7149] asks a set of relevant questions regarding operations of SDNs. The ETSI NFV ISG defines the MANO interfaces [NFVMANO], and TMForum investigates gaps between these interfaces and existing specifications in [TR228]. The need for programmatic APIs in the orchestration of compute, network and storage resources is discussed in [I-D.unify-nfvrg-challenges].

From a research perspective, problems related to operations of software-defined networks are in part outlined in [SDNsurvey], and research referring to both cloud and software-defined networks is discussed in [D4.1].
The purpose of this document is to act as a discussion opener in NFVRG by describing a set of principles that are relevant for applying DevOps ideas to managing software-defined telecom network infrastructures. We identify a set of challenges related to developing tools, interfaces and protocols that would support these principles, and discuss how standard APIs could be leveraged for simplifying management tasks.

2. Software-Defined Telecom Infrastructure: Roles and DevOps principles

There is no single list of core principles of DevOps, but it is generally recognized as encompassing:

   - Iterative development / Incremental feature content

   - Continuous deployment

   - Automated processes

   - Holistic/Systemic views of development and deployment/operation.

With Deployment/Operations becoming increasingly linked with software development, and business needs driving more rapid deployments, agile methodologies are assumed as a basis for DevOps. Agile methods used in many software-focused companies aim at releasing small iterations of code to implement VNFs with high velocity and high quality into a production environment. Similarly, service providers are interested in releasing incremental improvements in the network services that they create from virtualized network functions. The cycle time for DevOps as applied in many open source projects is on the order of one quarter year, or 13 weeks.

The code needs to undergo a significant amount of automated testing and verification with pre-defined templates in a realistic setting. From the point of view of software-defined telecom infrastructure management, the network and service configuration is expected to continuously evolve as a result of network policy decomposition and refinement, service evolution, updates, failovers or re-configuration of virtual functions, and additions/upgrades of new infrastructure resources (e.g. whiteboxes, fibers). When troubleshooting the cause of unexpected behavior, fine-grained visibility onto all resources supporting the virtual functions (either compute or network-related) is paramount to facilitating fast resolution times. While compute resources are typically very well covered by debugging and profiling toolsets based on many years of advances in software engineering, programmable network resources are still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

We identify two dimensions of the "developer" role in software-defined infrastructure (SDI). The network service to be developed is captured in a network service descriptor (e.g. [IFA014]). One dimension relates to determining which high-level functions should be part of a particular service, deciding what logical interconnections are needed between these blocks and defining a set of high-level constraints or goals related to parameters that define, for instance, a Service Function Chain. This could be determined by the product owner for a particular family of services offered by a telecom provider. Or, it might be a key account representative that adapts an existing service template to the requirements of a particular customer by adding or removing a small number of functional entities.
We refer to this person as the Service Developer and for simplicity (access control, training on technical background, etc.) we consider the role to be internal to the telecom provider.

2.2. VNF Developer role

Another dimension of the "developer" role is a person that writes the software code for a new virtual network function (VNF). The VNF then needs to be delivered as a package (e.g. [IFA011]) that includes various metadata for ingestion/integration into some service. Note that a VNF may span multiple virtual machines to support design objectives (e.g. for reliability or scalability). Depending on the actual VNF being developed, this person might be internal or external (e.g. a traditional equipment vendor) to the telecom provider. We refer to them as VNF Developers.

2.3. System Integrator role

The System Integrator role is to some extent similar to the Service Developer: people in this role need to identify the components of the system to be delivered. However, for the Service Developer, the service components are pre-integrated, meaning that they have the right interfaces to interact with each other. In contrast, the System Integrator needs to develop the software that makes the system components interact with each other. As such, the System Integrator role combines aspects of the Developer roles and adds yet another dimension to it. Compared to the other Developer roles, the System Integrator might face additional challenges due to the fact that they might not have access to the source code of some of the components. This limits, for example, how fast they can address issues with the components to be integrated, and may result in an uneven workload depending on the release granularity of the different components that need to be integrated. Some system integration activities may take place on an industry basis in collaborative communities (e.g. OPNFV.org).

2.4. Network Service Operator role

The role of a Network Service Operator is to ensure that the deployment processes were successful and that a set of performance indicators associated with a particular network service are met. The network service is supported by a specific set of infrastructure resources that may be owned and operated by that Network Service Operator, or provided under contract from some other infrastructure service provider.

2.5. Customer role

A Customer contracts a telecom operator to provide one or more services. In SDI, the Customer may communicate with the provider in real time through an online portal. From the customer perspective, such portal interfaces become part of the service definition just like the data transfer aspects of the service. Compared to the Service Developer, the Customer is external to the operator and may define changes to their own service instance only in accordance with policies defined by the Service Developer. In addition to the usual per-service utilization statistics, in SDI the portal may enable the customer to trigger certain performance management or troubleshooting tools for the service. This, for example, enables the Customer to determine whether the root cause of a certain error or degradation condition that they observe is located in the telecom operator domain or not, and may facilitate the interaction with the customer support teams.
2.6. DevOps Principles

In line with the generic DevOps concept outlined in [DevOpsP], we consider the following four principles as important for adapting DevOps ideas to SDI:

* Automated processes: Deploy with repeatable, reliable processes: Service and VNF Developers should be supported by automated build, orchestration and deployment processes that are identical in the development, test and production environments. Such processes need to be made reliable and trusted in the sense that they should reduce the chance of human error and provide visibility at each stage of the process, as well as offer the possibility to enable manual interactions in certain key stages.

* Holistic/systemic view: Develop and test against production-like systems: both Service Developers and VNF Developers need to have the opportunity to verify and debug their respective SDI code in systems that have characteristics which are very close to the production environment where the code is expected to be ultimately deployed. Customizations of Service Function Chains or VNFs could thus be released frequently to a production environment in compliance with policies set by the Operators. Adequate isolation and protection of the services active in the infrastructure from services being tested or debugged should be provided by the production environment.

* Continuous: Monitor and validate operational quality: Service Developers, VNF Developers and Operators must be equipped with tools, automated as much as possible, that enable them to continuously monitor the operational quality of the services deployed on SDI. Monitoring tools should be complemented by tools that allow verifying and validating the operational quality of the service in line with established procedures, which might be standardized (for example, Y.1564 Ethernet Activation [Y1564]) or defined through best practices specific to a particular telecom operator.

* Iterative/Incremental: Amplify development cycle feedback loops: An integral part of the DevOps ethos is building a cross-cultural environment that bridges the cultural gap between the desire for continuous change by the Developers and the demand by the Operators for stability and reliability of the infrastructure. Feedback from customers is collected and transmitted throughout the organization. From a technical perspective, such cultural aspects could be addressed through common sets of tools and APIs that are aimed at providing a shared vocabulary for both Developers and Operators, as well as simplifying the reproduction of problematic situations in the development, test and operations environments.

Network operators that would like to move to agile methods to deploy and manage their networks and services face a different environment compared to typical software companies, where simplified trust relationships between personnel are the norm. In software companies, it is not uncommon that the same person may be rotating between different roles. In contrast, in a telecom service provider, there are strong organizational boundaries between suppliers (whether in Developer roles for network functions, or in Operator roles for outsourced services) and the carrier's own personnel that might also take both Developer and Operator roles. Extending DevOps principles across strong organizational boundaries (e.g.
through co-creation or collaborative development in open source communities) may be a commercial challenge rather than a technical issue.

3. Continuous Integration

Software integration is the process of bringing together the software component subsystems into one software system, and ensuring that the subsystems function together as a system. Software integration can apply regardless of the size of the software components. The objective of Continuous Integration is to prevent integration problems close to the expected release of a software development project into a production (operations) environment. Continuous Integration is therefore closely coupled with the notion of DevOps as a mechanism to ease the transition from development to operations.

Continuous Integration may result in multiple builds per day. It is also typically used in conjunction with test-driven development approaches that integrate unit testing into the build process. The unit testing is typically automated through build servers. Such servers may implement a variety of additional static and dynamic tests as well as other quality control and documentation extraction functions. The reduced cycle times of Continuous Integration enable improved software quality by applying small efforts frequently.

Continuous Integration applies to developers of VNFs as they integrate the components that they need to deliver their VNF. The VNFs may contain components developed by different teams within the VNF Provider, or may integrate code developed externally - e.g. in commercial code libraries or in open source communities.

Service Developers also apply Continuous Integration in the development of network services. Network services are comprised of various aspects including VNFs and connectivity within and between them, as well as various associated resource authorizations. The components of the network service are all dynamic, and largely represented by software that must be integrated regularly to maintain consistency.

Some of the software components that Service Developers integrate may be sourced from VNF Providers or from open source communities. Service Developers and Network Service Operators are increasingly motivated to engage with open source communities [OSandS]. Open source interfaces supported by open source communities may be more useful than traditional paper interface specifications. Even where Service Providers are deeply engaged in the open source community (e.g. OPNFV), many service providers may prefer to obtain the code through some software provider as a business practice. Such software providers have the same interests in software integration as other VNF providers. An open source integration community (e.g. OPNFV) may resolve common integration issues across the industry, reducing the need for integration issue resolution specific to particular integrators.
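As a purely illustrative sketch, the fragment below (in Python) shows the kind of automated gate a build server could run on every commit of VNF or network service code: static checks and unit tests are executed first, and an artifact is packaged only if all checks pass. The project layout (vnf_src/, tests/), the commands and the artifact path are assumptions made for the example and would be replaced by whatever toolchain the VNF Developer or Service Developer actually uses.

   #!/usr/bin/env python3
   # Illustrative continuous integration gate; the project layout
   # (vnf_src/, tests/) and the artifact path are assumptions.
   import pathlib
   import subprocess
   import sys
   import tarfile

   def run(step, cmd):
       # Fail fast: a failing step blocks promotion of the build.
       print("CI step:", step)
       return subprocess.run(cmd).returncode == 0

   def main():
       sources = [str(p) for p in pathlib.Path("vnf_src").rglob("*.py")]
       steps = [
           ("static checks", ["python3", "-m", "py_compile"] + sources),
           ("unit tests", ["python3", "-m", "unittest",
                           "discover", "-s", "tests"]),
       ]
       for name, cmd in steps:
           if not run(name, cmd):
               sys.exit("build rejected at step: " + name)
       # Package the VNF code only if all automated checks passed.
       pathlib.Path("artifacts").mkdir(exist_ok=True)
       with tarfile.open("artifacts/vnf_package.tar.gz", "w:gz") as tar:
           tar.add("vnf_src")
       print("artifact ready for the delivery pipeline")

   if __name__ == "__main__":
       main()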
4. Continuous Delivery

The practice of Continuous Delivery extends Continuous Integration by ensuring that the software (either VNF code or code for SDI) checked in on the mainline is always in a user-deployable state and enables rapid deployment by those users. For critical systems such as telecommunications networks, Continuous Delivery may benefit from the inclusion of a manual trigger before the actual deployment in the live system, compared to the Continuous Deployment methodology which is also part of DevOps processes in software companies.

Automated Continuous Deployment systems may exceed 10 updates per day. Assuming an integration of 100 components, each with an average time to upgrade of 180 days, then deployments on the order of every 1.8 days might be expected. The telecom infrastructure is also very distributed - consider the case of cloud RAN use cases where the number of locations for deployment is of the order of the number of cell tower locations (~10^4..10^6). Deployments may need to be incremental across the infrastructure to reduce the risk of large-scale failures. Conversely, there may need to be rapid rollbacks to prior stable deployment configurations in the event of significant failures.

5. Consistency, Availability and Partitioning Challenges

The CAP theorem [CAP] states that any networked shared-data system can have at most two of the following three properties: 1) Consistency (C), equivalent to having a single up-to-date copy of the data; 2) high Availability (A) of that data (for updates); and 3) tolerance to network Partitions (P).

Looking at a telecom SDI as a distributed computational system (routing/forwarding packets can be seen as a computational problem), just two of the three CAP properties will be possible at the same time. The general idea is that 2 of the 3 have to be chosen: CP favors consistency, AP favors availability, and CA assumes that there are no partitions. This has profound implications for technologies that need to be developed in line with the "deploy with repeatable, reliable processes" principle for configuring SDI states. Latency or delay and partitioning properties are closely related, and such a relation becomes more important in the case of telecom service providers where Devs and Ops interact with widely distributed infrastructure. Limitations of interactions between centralized management and distributed control need to be carefully examined in such environments. Traditionally, connectivity was the main concern: C and A were about delivering packets to their destination. The features and capabilities of SDN and NFV are changing the concerns: for example in SDN, control plane Partitions no longer imply data plane Partitions, so A does not imply C. In practice, CAP reflects the need for a balance between local/distributed operations and remote/centralized operations.

In addition to CAP aspects related to individual protocols, interdependencies between CAP choices for both resources and VNFs that are interconnected in a forwarding graph need to be considered. This is particularly relevant for the "Monitor and Validate Operational Quality" principle, as apart from transport protocols, most OAM functionality is generally configured in processes that are separated from the configuration of the monitored entities. Also, partitioning in a monitoring plane implemented through VNFs executed on compute resources does not necessarily mean that the dataplane of the monitored VNF was partitioned as well.
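The following fragment is a purely illustrative sketch of how the CAP trade-off can surface in a simple management operation: when a partition towards the authoritative configuration store is detected, the reader either returns a possibly stale node-local copy (favoring availability) or refuses to answer (favoring consistency). The store and cache interfaces are hypothetical stand-ins introduced only for this example.

   class PartitionError(Exception):
       """Raised when the authoritative store cannot be reached."""

   def read_config(key, store, cache, favour_availability=True):
       # "store" and "cache" are hypothetical: an authoritative
       # database and a node-local copy of previously read values.
       try:
           value = store.get(key)     # consistent, up-to-date answer
           cache[key] = value
           return value
       except PartitionError:
           if favour_availability and key in cache:
               return cache[key]      # possibly stale: A chosen over C
           raise                      # refuse to answer: C chosen over A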
6. Stability and Real-Time Change Challenges

The dimensions, dynamicity and heterogeneity of networks are growing continuously. Monitoring and managing the network behavior in order to meet technical and business objectives is becoming increasingly complicated and challenging, especially when considering the need for predicting and taming potential instabilities.

In general, instability in networks may have primary effects both jeopardizing the performance and compromising an optimized use of resources, even across multiple layers: in fact, instability of end-to-end communication paths may depend both on the underlying transport network and on the higher-level components specific to flow control and dynamic routing. For example, arguments for introducing advanced flow admission control are essentially derived from the observation that the network otherwise behaves in an inefficient and potentially unstable manner. Even with resource over-provisioning, a network without an efficient flow admission control has instability regions that can even lead to congestion collapse in certain configurations. Another example is the instability which is characteristic of any dynamically adaptive routing system. Routing instability, which can be (informally) defined as the quick change of network reachability and topology information, has a number of possible origins, including problems with connections, router failures, high levels of congestion, software configuration errors, transient physical and data link problems, and software bugs.

As a matter of fact, the states monitored and used to implement the different control and management functions in network nodes are governed by several low-level configuration commands. There are several dependencies among these states and the logic updating the states in real time (most of which are not synchronized automatically). Normally, high-level network goals (such as the connectivity matrix, load-balancing, traffic engineering goals, survivability requirements, etc.) are translated into low-level configuration commands (mostly manually) individually executed on the network elements (e.g., forwarding table, packet filters, link-scheduling weights, and queue-management parameters, as well as tunnels and NAT mappings). Network instabilities due to configuration errors can spread from node to node and propagate throughout the network.

DevOps in the data center is a source of inspiration regarding how to simplify and automate management processes for software-defined infrastructure. Although the low-level configuration could be automated by DevOps tools such as CFEngine [C2015], Puppet [P2015] and Ansible [A2015], the high-level goal translation towards tool-specific syntax is still a manual process. In addition, while carrier-grade configuration tools using the NETCONF protocol support complex atomic transaction management (which reduces the potential for instability), Ansible requires third-party components to support rollbacks and the Puppet transactions are not atomic.

As a specific example, automated configuration functions are expected to take the form of a "control loop" that monitors (i.e., measures) current states of the network, performs a computation, and then reconfigures the network. These types of functions must work correctly even in the presence of failures, variable delays in communicating with a distributed set of devices, and frequent changes in network conditions. Nevertheless, cascading and nesting of automated configuration processes can lead to the emergence of non-linear network behaviors and, as such, to sudden instabilities (i.e. identical local dynamics can give rise to widely different global dynamics).
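As a purely illustrative sketch of such a control loop, the fragment below smooths the monitored value and applies a deadband and a hold-down timer before reconfiguring - two common ways of damping the oscillations that cascaded automated reconfigurations may otherwise amplify. The measure() and reconfigure() hooks, the target value and the timers are assumptions for the example only, standing in for whatever monitoring and configuration interfaces the infrastructure exposes.

   import time

   def control_loop(measure, reconfigure, target=0.7, deadband=0.1,
                    hold_down=30.0, alpha=0.2):
       # measure() and reconfigure() are placeholders for the
       # monitoring and configuration interfaces of the infrastructure.
       smoothed = measure()
       last_change = 0.0
       while True:
           # Exponentially smooth the measurement to filter transients.
           smoothed = alpha * measure() + (1 - alpha) * smoothed
           drift = smoothed - target
           # Reconfigure only on significant deviations, and no more
           # often than the hold-down interval allows.
           if abs(drift) > deadband and \
                   time.time() - last_change > hold_down:
               reconfigure(drift)
               last_change = time.time()
           time.sleep(5.0)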
7. Observability Challenges

Monitoring algorithms need to operate in a scalable manner while providing the specified level of observability in the network, either for operation purposes (Ops part) or for debugging in a development phase (Dev part). We consider the following challenges:

* Scalability - relates to the granularity of network observability, computational efficiency, communication overhead, and strategic placement of monitoring functions.

* Distributed operation and information exchange between monitoring functions - monitoring functions supported by the nodes may perform specific operations (such as aggregation or filtering) locally on the collected data or within a defined data neighborhood and forward only the result to a management system. Such operation may require modifications of existing standards and development of protocols for efficient information exchange and messaging between monitoring functions. Different levels of granularity may need to be offered for the data exchanged through the interfaces, depending on the Dev or Ops role. Modern messaging systems, such as Apache Kafka [AK2015], widely employed in datacenter environments, were optimized for messages that are considerably larger than a single counter value read (the typical SNMP GET call usage) - note the throughput vs record size results from [K2014]. It is also debatable to what extent properties such as message persistence within the bus are needed in a carrier environment, where MIBs in practice already offer a certain level of persistence of management data at the node level. Also, such systems require the use of IP addressing, which might not be needed when the monitored data is consumed by a function within the same node.

* Common communication channel between monitoring functions and higher layer entities (orchestration, control or management systems) - a single communication channel for configuration and measurement data of diverse monitoring functions running on heterogeneous hard- and software environments. In telecommunication environments, infrastructure assets span not only large geographical areas, but also a wide range of technology domains, ranging from CPEs, access-, aggregation-, and transport networks, to datacenters. This heterogeneity of hard- and software platforms requires higher layer entities to utilize various parallel communication channels for either configuration or data retrieval of monitoring functions within these technology domains. To address automation and advances in monitoring programmability, software-defined telecommunication infrastructures would benefit from a single flexible communication channel, thereby supporting the dynamicity of virtualized environments.
Such a channel should ideally support propagation of configuration, signalling, and results from monitoring functions; carrier-grade operations in terms of availability and multi-tenant features; support highly distributed and hierarchical architectures, keeping messages as local as possible; be lightweight, topology independent, and network address agnostic; and support flexibility in terms of transport mechanisms and programming language support. Existing popular state-of-the-art message queuing systems such as RabbitMQ [R2015] fulfill many of these requirements. However, they utilize centralized brokers, posing a single point of failure and scalability concerns within a vastly distributed NFV environment. Furthermore, transport support is limited to TCP/IP. ZeroMQ [Z2015], on the other hand, lacks advanced features for carrier-grade operations, including high availability, authentication, and tenant isolation.

* Configurability and conditional observability - monitoring functions that go beyond measuring simple metrics (such as delay, or packet loss) require expressive monitoring annotation languages for describing the functionality such that it can be programmed by a controller. Monitoring algorithms implementing self-adaptive monitoring behavior relative to local network situations may employ such annotation languages to receive high-level objectives (KPIs controlling tradeoffs between accuracy and measurement frequency, for example) and conditions for varying the measurement intensity. Steps in this direction were taken by DevOps tools such as Splunk [S2015], whose collecting agent has the ability to load particular apps that in turn access specific counters or log files. However, such apps are tool-specific and may also require deploying additional agents that are specific to the application, library or infrastructure node being monitored. Choosing which objects to monitor in such an environment means deploying a tool-specific script that configures the monitoring app.

* Automation - includes mapping of monitoring functionality from a logical forwarding graph to virtual or physical instances executing in the infrastructure, as well as placement and re-placement of monitoring functionality for required observability coverage and configuration consistency upon updates in a dynamic network environment. Puppet [P2015] manifests or Ansible [A2015] playbooks could be used for automating the deployment of monitoring agents, for example those used by Splunk [S2015]. However, both manifests and playbooks were designed to represent the desired system configuration snapshot at a particular moment in time - they would now need to be generated automatically by the orchestration tools instead of a DevOps person.

* Actionable data

Data produced by observability tools could be utilized in a wide category of processes, ranging from billing and dimensioning to real-time troubleshooting and optimization. In order to allow for data-driven automated decisions and actuations based on these decisions, the data needs to be actionable. We define actionable data as being representative for a particular context or situation and an adequate input towards a decision.
Ensuring actionable data is challenging in a number of ways, including: defining adaptive correlation and sampling windows, filtering and aggregation methods that are adapted to or coordinated with the actual consumer of the data, and developing analytical and predictive methods that account for the uncertainty or incompleteness of the data.

* Data Virtualization

Data is key in helping both Developers and Operators perform their tasks. Traditional Network Management Systems were optimized for using one database that contains the master copy of the operational statistics and logs of network nodes. Ensuring access to this data from across the organization is challenging because strict privacy requirements and business secrets need to be protected. In DevOps-driven environments, data needs to be made available to Developers and their test environments. Data virtualization collectively defines a set of technologies that ensure that restricted copies of the partial data needed for a particular task may be made available while enforcing strict access control. Beyond simple access control, data virtualization needs to address the scalability challenges involved in copying large amounts of operational data, as well as in automatically disposing of it when the task authorized to use it has finished.

8. Verification Challenges

Enabling ongoing verification of code is an important goal of continuous integration as part of the data center DevOps concept. In a telecom SDI, service definitions, decompositions and configurations need to be expressed in machine-readable encodings. For example, configuration parameters could be expressed in terms of YANG data models. However, the infrastructure management layers (such as Software-Defined Network Controllers and Orchestration functions) might not always export such machine-readable descriptions of the runtime configuration state. In this case, the management layer itself could be expected to include a verification process that has the same challenges as the stand-alone verification processes we outline later in this section. In that sense, verification can be considered as a set of features providing gatekeeper functions to verify both the abstract service models and the proposed resource configuration before or right after the actual instantiation on the infrastructure layer takes place.

A verification process can involve different layers of the network and service architecture. Starting from a high-level verification of the customer input (for example, a Service Graph as defined in [I-D.unify-nfvrg-challenges]), the verification process could go more in depth to reflect on the Service Function Chain configuration. At the lowest layer, the verification would handle the actual set of forwarding rules and other configuration parameters associated with a Service Function Chain instance. This enables the verification of more quantitative properties (e.g. compliance with resource availability), as well as a more detailed and precise verification of the abovementioned topological ones. Existing SDN verification tools could be deployed in this context, but the majority of them only operate on flow space rules commonly expressed using OpenFlow syntax.
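As a minimal illustration of this class of tools, the sketch below checks reachability over a set of static forwarding rules modelled as (node, flow) -> next-node entries. The rule encoding is invented for the example and far simpler than real OpenFlow flow tables, which match on header fields rather than opaque flow identifiers.

   # Toy reachability check over static forwarding rules.
   # rules map (node, flow) -> next node; the encoding is invented
   # and far simpler than real OpenFlow flow tables.
   def reachable(rules, flow, src, dst, max_hops=64):
       node = src
       for _ in range(max_hops):
           if node == dst:
               return True
           node = rules.get((node, flow))
           if node is None:          # flow dropped or unmatched
               return False
       return False                  # loop suspected: hop budget spent

   example_rules = {("a", "f1"): "b", ("b", "f1"): "c"}
   assert reachable(example_rules, "f1", "a", "c")
   assert not reachable(example_rules, "f1", "a", "d")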
Moreover, such verification tools were designed for networks where the flow rules are necessary and sufficient to determine the forwarding state. This assumption is valid in networks composed only of network functions that forward traffic by analyzing only the packet headers (e.g. simple routers, stateless firewalls, etc.). Unfortunately, most real networks contain active network functions, represented by middle-boxes that dynamically change the forwarding path of a flow according to function-local algorithms and an internal state (that is based on the received packets), e.g. load balancers, packet marking modules and intrusion detection systems. The existing verification tools do not consider active network functions because they do not incorporate the dynamic transformation of internal state into the verification process.

Defining a set of verification tools that can account for active network functions is a significant challenge. In order to perform verification based on formal properties of the system, the internal states of an active (virtual or not) network function would need to be represented. Although these states would increase the verification process complexity (e.g., using simple model checking would not be feasible due to state explosion), they help to better represent the forwarding behavior in real networks. A way to address this challenge is by attempting to summarize the internal state of an active network function in a way that allows the verification process to finish within a reasonable time interval.

9. Testing Challenges

Testing in an NFV environment does impact the methodology used. The main challenge is the ability to isolate the Device Under Test (DUT). When testing physical devices, which are dedicated to a specific function, isolation of this function is relatively simple: isolate the DUT by surrounding it with emulations from test devices. This achieves isolation of the DUT, in a black box fashion, for any type of testing. In an NFV environment, the DUT becomes a component of a software infrastructure which cannot be isolated. For example, testing a VNF cannot be achieved without the presence of the NFVI and MANO components. In addition, the NFVI and MANO components can greatly influence the behavior and the performance of the VNF under test.

With this in mind, in NFV, the isolation of the DUT becomes a new concept: the VNF Under Test (VUT) becomes part of an environment that consists of the rest of the necessary architecture components (the test environment). In the previous example, the VNF becomes the VUT, while the MANO and NFVI become the test environment. Then, isolation of the VUT becomes a matter of configuration management, where the configuration of the test environment is kept fixed for each test of the VUT. So the MANO policies for instantiation, scaling, and placement, as well as the NFVI parameters such as the HW used, CPU pinning, etc., must remain fixed for each iterative test of the VNF. Only by keeping the configurations constant can the VNF tests be compared to each other. If any test environment configurations are changed between tests, the behavior of the VNF can be impacted, thus negating any comparison of the results.
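A minimal sketch of this practice is shown below: the test harness records a fingerprint of the test-environment configuration and refuses to compare results across runs whose fingerprints differ. The parameter names (placement policy, CPU pinning, NFVI flavour) are examples only and would be replaced by the actual MANO and NFVI settings relevant to the test.

   import hashlib
   import json

   def environment_fingerprint(config):
       # Hash a canonical encoding of the test-environment settings
       # so that any change between runs is detected.
       canonical = json.dumps(config, sort_keys=True)
       return hashlib.sha256(canonical.encode()).hexdigest()

   baseline = environment_fingerprint({
       "placement_policy": "anti-affinity",   # example keys only
       "cpu_pinning": True,
       "nfvi_flavour": "4vcpu-8G",
   })

   def results_comparable(current_config):
       # Only results obtained under an identical test-environment
       # configuration are compared with the baseline run.
       return environment_fingerprint(current_config) == baseline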
Of course, there are instances of testing where the inverse is desired: the configuration of the test environment is changed between each test, while the VNF configuration is kept constant. As an example, this type of methodology would be used in order to discover the optimum configuration of the NFVI for a particular VNF workload. Another similar but daunting challenge is the introduction of co-located tenants in the same environment as the VNF under test. The workload on these "neighbors" can greatly influence the behavior and performance of the VNF under test, but the test itself is invaluable to understand the impact of such a configuration.

Another challenge is the usage of test devices (traffic generators, emulators) that share the same infrastructure as the VNF under test. This can create a situation as above, where the neighbor competes for resources with the VUT itself, which can negate the test results. If a test architecture such as this is necessary (testing east-west traffic, for example), then care must be taken to configure the test devices such that they are isolated from the SUT in terms of allowed resources, and that they do not impact the SUT's ability to acquire resources to operate in all conditions.

NFV offers new features that did not exist as such previously, or modifies existing mechanisms. Examples of new features are the dynamic scaling of VNFs and network services (NS), standardized acceleration mechanisms, and the presence of the virtualization layer, which includes the vSwitch. An example of a mechanism which changes with NFV is how fault detection and fault recovery are handled. Fault recovery could now be handled by MANO in such a way as to invoke mechanisms such as live migration or snapshots in order to recover the state of a VNF and restore operation quickly. While the end results are expected to be the same as before, since the mechanism is very different, rigorous testing is highly recommended to validate those results.

Dynamic scaling of VNFs is a new concept in NFV. VNFs that require more resources will have them dynamically allocated on demand, and then subsequently released when not needed anymore. This is clearly a benefit arising from SDI. For each type of VNF, specific metrics will be used as input to conditions that will trigger a scaling operation, orchestrated by MANO. Testing this mechanism requires a methodology tailored to the specific operation of the VNF, in order to properly reach the monitored metrics and exercise the conditions leading to a scaling trigger. For example, a firewall VNF will be triggered for scaling on very different metrics than a 3GPP MME, as the two VNFs accomplish different functions. Since there will normally be a collection of metrics that are monitored in order to trigger a scaling operation, the testing methodology must be constructed in such a way as to address all combinations of those metrics. Metrics for a particular VNF may include sessions, session instantiations/second, throughput, etc. These metrics will be observed in relation to the given resources for the VNF.
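The sketch below illustrates, for discussion purposes only, how such a combination of monitored metrics could be evaluated against per-VNF thresholds to produce a scale-out trigger for the orchestrator. The metric names and threshold values are invented for the example and would differ, for instance, between a firewall and an MME.

   # Example-only thresholds for a hypothetical firewall VNF; an MME
   # would be scaled on a different set of metrics.
   FIREWALL_THRESHOLDS = {
       "sessions": 500000,
       "session_setups_per_s": 20000,
       "throughput_gbps": 8.0,
   }

   def scale_out_needed(metrics, thresholds, min_breaches=2):
       # Require several metrics to breach their limits so that a
       # single noisy counter does not cause needless scaling.
       breaches = sum(1 for name, limit in thresholds.items()
                      if metrics.get(name, 0) > limit)
       return breaches >= min_breaches

   sample = {"sessions": 620000, "session_setups_per_s": 25000,
             "throughput_gbps": 5.2}
   print(scale_out_needed(sample, FIREWALL_THRESHOLDS))    # True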
10. Programmable management

The ability to automate a set of actions to be performed on the infrastructure, be it virtual or physical, is key to productivity increases following the application of DevOps principles. Previous sections in this document touched on different dimensions of programmability:

- Section 5 approached programmability in the context of developing new capabilities for monitoring and for dynamically setting configuration parameters of deployed monitoring functions

- Section 7 reflected on the need to determine the correctness of actions that are to be inflicted on the infrastructure as a result of executing a set of high-level instructions

- Section 8 considered programmability in the perspective of an interface to facilitate dynamic orchestration of troubleshooting steps towards building workflows and for reducing the manual steps required in troubleshooting processes

We expect that programmable network management - along the lines of [RFC7426] - will draw more interest as we move forward. For example, in [I-D.unify-nfvrg-challenges], the authors identify the need for presenting programmable interfaces that accept instructions in a standards-supported manner for the Two-Way Active Measurement Protocol (TWAMP). More specifically, an excellent example in this case is traffic measurements, which are extensively used today to determine SLA adherence as well as to debug and troubleshoot pain points in service delivery. TWAMP is both widely implemented by all established vendors and deployed by most global operators. However, TWAMP management and control today relies solely on diverse and proprietary tools provided by the respective vendors of the equipment. For large, virtualized, and dynamically instantiated infrastructures where network functions are placed according to orchestration algorithms, proprietary mechanisms for managing TWAMP measurements have severe limitations. For example, today's TWAMP implementations are managed by vendor-specific, typically command-line interfaces (CLI), which can be scripted on a platform-by-platform basis. As a result, although the control and test measurement protocols are standardized, their respective management is not. This dramatically hinders the possibility of integrating such deployed functionality in the SP-DevOps concept. In this particular case, recent efforts in the IPPM WG [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model and effectively increase the programmability of TWAMP deployments in the future.

Data center DevOps tools, such as those surveyed in [D4.1], developed proprietary methods for describing and interacting through interfaces with the managed infrastructure. Within certain communities, they became de-facto standards in the same way particular CLIs became de-facto standards for Internet professionals. Although open-source components and a strong community involvement exist, the diversity of the new languages and interfaces creates a burden both for vendors, in terms of choosing which ones to prioritize for support and then developing the functionality, and for operators, who need to determine what fits best the requirements of their systems.
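As an illustration of the kind of programmability a standard data model enables, the fragment below assembles a TWAMP measurement request as a structured document that an orchestrator could render into whatever encoding its management interface expects. The field names are invented for the example and do not follow the structure of the model defined in [I-D.cmzrjp-ippm-twamp-yang]; a standards-based deployment would populate that YANG model instead.

   import json

   def twamp_session_request(sender_ip, reflector_ip, dscp=46,
                             padding=128):
       # Invented, example-only field names; a standards-based
       # deployment would populate the TWAMP YANG data model instead.
       return {
           "measurement": {
               "type": "twamp",
               "control-client": {"address": sender_ip},
               "session-reflector": {"address": reflector_ip},
               "parameters": {"dscp": dscp, "padding-bytes": padding},
           }
       }

   request = twamp_session_request("192.0.2.10", "198.51.100.20")
   print(json.dumps(request, indent=2))   # hand over to the chosen NBI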
11. Security Considerations

DevOps principles are typically practiced within the context of a single organization, i.e. a single trust domain. Extending DevOps practices across strong organizational boundaries (e.g. between commercial organizations) requires consideration of additional threat models. Additional validation procedures may be required to ingest and accept code changes arising from outside an organization.

12. IANA Considerations

This memo includes no request to IANA.

13. References

13.1. Informative References

[NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and Orchestration V0.6.1 (draft)", Jul. 2014

[I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya, "Service Function Chaining Operations, Administration and Maintenance Framework", draft-aldrin-sfc-oam-framework-02 (work in progress), July 2015.

[I-D.lee-sfc-verification] S. Lee and M. Shin, "Service Function Chaining Verification", draft-lee-sfc-verification-00 (work in progress), February 2014.

[RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J. Hadi Salim, D. Meyer, and O. Koufopavlou, "Software-Defined Networking (SDN): Layers and Architecture Terminology", RFC 7426, January 2015

[RFC7149] M. Boucadair and C. Jacquenet, "Software-Defined Networking: A Perspective from within a Service Provider Environment", RFC 7149, March 2014.

[TR228] TMForum, "Gap Analysis Related to MANO Work", TR228, May 2014

[I-D.unify-nfvrg-challenges] R. Szabo et al., "Unifying Carrier and Cloud Networks: Problem Statement and Challenges", draft-unify-nfvrg-challenges-03 (work in progress), October 2015

[I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way Active Measurement Protocol (TWAMP) Data Model", draft-cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015.

[D4.1] W. John et al., "D4.1 Initial requirements for the SP-DevOps concept, universal node capabilities and proposed tools", August 2014.

[SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve Rothenberg, S. Azodolmolky, S. Uhlig, "Software-Defined Networking: A Comprehensive Survey", to appear in Proceedings of the IEEE, 2015.

[DevOpsP] "DevOps, the IBM Approach", 2013. [Online].

[Y1564] ITU-T Recommendation Y.1564, "Ethernet service activation test methodology", March 2011

[CAP] E. Brewer, "CAP twelve years later: How the "rules" have changed", IEEE Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012.

[H2014] N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N. McKeown, "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks", in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 71-95

[W2011] A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann, "OFRewind: Enabling Record and Replay Troubleshooting for Networks", in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '11), pp. 327-340

[S2010] E. Al-Shaer and S. Al-Haj, "FlowChecker: configuration analysis and verification of federated OpenFlow infrastructures", in Proceedings of the 3rd ACM workshop on Assurable and usable security configuration (SafeConfig '10), pp. 37-44

[OSandS] S. Wright, D. Druta, "Open Source and Standards: The Role of Open Source in the Dialogue between Research and Standardization", Globecom Workshops (GC Wkshps), 2014, pp. 650-655, 8-12 Dec. 2014

[C2015] CFEngine. Online: http://cfengine.com/product/what-is-cfengine/, retrieved Sep 23, 2015.

[P2015] Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet, retrieved Sep 23, 2015.
[A2015] Ansible. Online: http://docs.ansible.com/, retrieved Sep 23, 2015.

[AK2015] Apache Kafka. Online: http://kafka.apache.org/documentation.html, retrieved Sep 23, 2015.

[S2015] Splunk. Online: http://www.splunk.com/en_us/products/splunk-light.html, retrieved Sep 23, 2015.

[K2014] J. Kreps, "Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)". Online: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines, retrieved Sep 23, 2015.

[R2015] RabbitMQ. Online: https://www.rabbitmq.com/, retrieved Oct 13, 2015

[IFA014] ETSI, "Network Functions Virtualisation (NFV); Management and Orchestration; Network Service Templates Specification", DGS/NFV-IFA014, Work in Progress

[IFA011] ETSI, "Network Functions Virtualisation (NFV); Management and Orchestration; VNF Packaging Specification", DGS/NFV-IFA011, Work in Progress

[NFVSWA] ETSI, "Network Functions Virtualisation (NFV); Virtual Network Functions Architecture", GS NFV-SWA 001 v1.1.1 (2014)

[Z2015] ZeroMQ. Online: http://zeromq.org/, retrieved Oct 13, 2015

14. Contributors to earlier versions

J. Kim (Deutsche Telekom), S. Sharma (iMinds), I. Papafili (OTE)

15. Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 619609 - the UNIFY project. The views expressed here are those of the authors only. The European Commission is not liable for any use that may be made of the information in this document.

We would like to thank in particular the UNIFY WP4 contributors, the internal reviewers of the UNIFY WP4 deliverables, and Russ White and Ramki Krishnan for their suggestions.

This document was prepared using 2-Word-v2.0.template.dot.

16. Authors' Addresses

Catalin Meirosu
Ericsson Research
S-16480 Stockholm, Sweden
Email: catalin.meirosu@ericsson.com

Antonio Manzalini
Telecom Italia
Via Reiss Romoli, 274
10148 - Torino, Italy
Email: antonio.manzalini@telecomitalia.it

Rebecca Steinert
SICS Swedish ICT AB
Box 1263, SE-16429 Kista, Sweden
Email: rebste@sics.se

Guido Marchetto
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 - Torino, Italy
Email: guido.marchetto@polito.it

Kostas Pentikousis
Travelping GmbH
Koernerstrasse 7-10
Berlin 10785
Germany
Email: k.pentikousis@travelping.com

Steven Wright
AT&T Services Inc.
1057 Lenox Park Blvd NE, STE 4D28
Atlanta, GA 30319
USA
Email: sw3588@att.com

Pierre Lynch
Ixia
800 Perimeter Park Drive, Suite A
Morrisville, NC 27560
USA
Email: plynch@ixiacom.com

Wolfgang John
Ericsson Research
S-16480 Stockholm, Sweden
Email: wolfgang.john@ericsson.com