NFVRG                                                        C. Meirosu
Internet Draft                                                 Ericsson
Intended status: Informational                             A. Manzalini
Expires: January 2017                                    Telecom Italia
                                                            R. Steinert
                                                                   SICS
                                                           G. Marchetto
                                                  Politecnico di Torino
                                                         K. Pentikousis
                                                                   EICT
                                                              S. Wright
                                                                   AT&T
                                                               P. Lynch
                                                                   Ixia
                                                                W. John
                                                               Ericsson

                                                           July 8, 2016

          DevOps for Software-Defined Telecom Infrastructures
                    draft-unify-nfvrg-devops-06.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 8, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Abstract

Carrier-grade network management was optimized for environments built with monolithic physical nodes and involves significant deployment, integration and maintenance efforts from network service providers. The introduction of virtualization technologies, from the physical layer all the way up to the application layer, however, invalidates several well-established assumptions in this domain. This draft opens the discussion in NFVRG about challenges related to transforming the telecom network infrastructure into an agile, model-driven environment for communication services. We take inspiration from data center DevOps regarding the simplification and automation of management processes for a telecom service provider software-defined infrastructure (SDI). A number of challenges associated with operationalizing DevOps principles at scale in software-defined telecom networks are identified in three areas related to key programmable management processes.

Table of Contents

1. Introduction...................................................3
2. Software-Defined Telecom Infrastructure: Roles and DevOps principles........................................................5
   2.1. Service Developer Role....................................6
   2.2. VNF Developer role........................................6
   2.3. System Integrator role....................................6
   2.4. Network Service Operator role.............................7
   2.5. Customer role.............................................7
   2.6. DevOps Principles.........................................7
3. Continuous Integration.........................................9
4. Continuous Delivery...........................................10
5. Consistency, Availability and Partitioning Challenges.........10
6. Stability and Real-Time Change Challenges.....................11
7. Observability Challenges......................................13
8. Verification Challenges.......................................15
9. Testing Challenges............................................17
10. Programmable management......................................18
11. Security Considerations......................................20
12. IANA Considerations..........................................20
13. References...................................................20
   13.1. Informative References..................................20
14. Contributors to earlier versions.............................23
15. Acknowledgments..............................................23
16. Authors' Addresses...........................................24

1. Introduction

Carrier-grade network management was developed as an incremental solution once a particular network technology matured and came to be deployed in parallel with legacy technologies. This approach requires significant integration efforts when new network services are launched. Both centralized and distributed algorithms have been developed in order to solve very specific problems related to configuration, performance and fault management. However, such algorithms consider a network that is by and large functionally static. Thus, management processes related to introducing new functionality or maintaining existing functionality are complex and costly due to the significant efforts required for verification and integration.
Network virtualization, by means of Software-Defined Networking (SDN) and Network Function Virtualization (NFV), creates an environment where network functions are no longer static or strictly embedded in physical boxes deployed at fixed points. The virtualized network is dynamic and open to fast-paced innovation, enabling efficient network management and reduction of operating cost for network operators. A significant part of network capabilities is expected to become available through interfaces that resemble the APIs widespread within datacenters, instead of the traditional telecom means of management such as the Simple Network Management Protocol, Command Line Interfaces or CORBA. Such an API-based approach, combined with the programmability offered by SDN interfaces [RFC7426], opens opportunities for handling infrastructure, resources, and Virtual Network Functions (VNFs) as code, employing techniques from software engineering.

The efficiency and integration of existing management techniques in virtualized and dynamic network environments are limited, however. Monitoring tools, e.g. based on simple counters, physical network taps and active probing, do not scale well and provide only a small part of the observability features required in such a dynamic environment. Although huge amounts of monitoring data can be collected from the nodes, the typical granularity is rather static and coarse, and management bandwidths may be limited. Debugging and troubleshooting techniques developed for software-defined environments are a research topic that has gathered interest in the research community in recent years. Still, it is yet to be explored how to integrate them into an operational network management system. Moreover, research tools developed in academia (such as NetSight [H2014], OFRewind [W2011], FlowChecker [S2010], etc.) were limited to solving very particular, well-defined problems, and oftentimes are not built for automation and integration into carrier-grade network operations workflows. As the virtualized network functions, infrastructure software and infrastructure hardware become more dynamic [NFVSWA], the monitoring, management and testing approaches also need to change.

The topics at hand have already attracted several standardization organizations to look into the issues arising in this new environment. For example, IETF working groups have activities in the area of OAM and Verification for Service Function Chaining [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification]. At IRTF, [RFC7149] asks a set of relevant questions regarding operations of SDNs. The ETSI NFV ISG defines the MANO interfaces [NFVMANO], and TMForum investigates gaps between these interfaces and existing specifications in [TR228]. The need for programmatic APIs in the orchestration of compute, network and storage resources is discussed in [I-D.unify-nfvrg-challenges].

From a research perspective, problems related to operations of software-defined networks are in part outlined in [SDNsurvey], and research referring to both cloud and software-defined networks is discussed in [D4.1].
The purpose of this document is to act as a discussion opener in NFVRG by describing a set of principles that are relevant for applying DevOps ideas to managing software-defined telecom network infrastructures. We identify a set of challenges related to developing tools, interfaces and protocols that would support these principles, and discuss how standard APIs could be leveraged for simplifying management tasks.

2. Software-Defined Telecom Infrastructure: Roles and DevOps principles

There is no single list of core principles of DevOps, but it is generally recognized as encompassing:

   - Iterative development / Incremental feature content

   - Continuous deployment

   - Automated processes

   - Holistic/Systemic views of development and deployment/operation.

With Deployment/Operations becoming increasingly linked with software development, and business needs driving more rapid deployments, agile methodologies are assumed as a basis for DevOps. Agile methods used in many software-focused companies aim at releasing small iterations of code to implement VNFs with high velocity and high quality into a production environment. Similarly, service providers are interested in releasing incremental improvements in the network services that they create from virtualized network functions. The cycle time for DevOps as applied in many open source projects is on the order of one quarter year, or 13 weeks.

The code needs to undergo a significant amount of automated testing and verification with pre-defined templates in a realistic setting. From the point of view of software-defined telecom infrastructure management, the network and service configuration is expected to continuously evolve as a result of network policy decomposition and refinement, service evolution, updates, failovers or re-configuration of virtual functions, and additions/upgrades of new infrastructure resources (e.g. whiteboxes, fibers). When troubleshooting the cause of unexpected behavior, fine-grained visibility onto all resources supporting the virtual functions (either compute or network-related) is paramount to facilitating fast resolution times. While compute resources are typically very well covered by debugging and profiling toolsets based on many years of advances in software engineering, programmable network resources are still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

We identify two dimensions of the "developer" role in software-defined infrastructure (SDI). The network service to be developed is captured in a network service descriptor (e.g. [IFA014]). One dimension relates to determining which high-level functions should be part of a particular service, deciding what logical interconnections are needed between these blocks and defining a set of high-level constraints or goals related to parameters that define, for instance, a Service Function Chain. This could be determined by the product owner for a particular family of services offered by a telecom provider. Or, it might be a key account representative that adapts an existing service template to the requirements of a particular customer by adding or removing a small number of functional entities.
We refer to this person as the Service Developer and for simplicity (access control, training on technical background, etc.) we consider the role to be internal to the telecom provider.

2.2. VNF Developer role

Another dimension of the "developer" role is a person that writes the software code for a new virtual network function (VNF). The VNF then needs to be delivered as a package (e.g. [IFA011]) that includes various metadata for ingestion/integration into some service. Note that a VNF may span multiple virtual machines to support design objectives (e.g. for reliability or scalability). Depending on the actual VNF being developed, this person might be internal or external (e.g. a traditional equipment vendor) to the telecom provider. We refer to them as VNF Developers.

2.3. System Integrator role

The System Integrator role is to some extent similar to the Service Developer: people in this role need to identify the components of the system to be delivered. However, for the Service Developer, the service components are pre-integrated, meaning that they have the right interfaces to interact with each other. In contrast, the System Integrator needs to develop the software that makes the system components interact with each other. As such, the System Integrator role combines aspects of the Developer roles and adds yet another dimension to it. Compared to the other Developer roles, the System Integrator might face additional challenges due to the fact that they might not have access to the source code of some of the components. This limits, for example, how fast they can address issues with the components to be integrated, and may result in an uneven workload depending on the release granularity of the different components that need to be integrated. Some system integration activities may take place on an industry basis in collaborative communities (e.g. OPNFV.org).

2.4. Network Service Operator role

The role of a Network Service Operator is to ensure that the deployment processes were successful and that a set of performance indicators associated with a particular network service are met. The network service is supported by a specific set of infrastructure resources that may be owned and operated by that Network Service Operator, or provided under contract from some other infrastructure service provider.

2.5. Customer role

A Customer contracts a telecom operator to provide one or more services. In SDI, the Customer may communicate with the provider in real time through an online portal. From the customer perspective, such portal interfaces become part of the service definition just like the data transfer aspects of the service. Compared to the Service Developer, the Customer is external to the operator and may define changes to their own service instance only in accordance with policies defined by the Service Developer. In addition to the usual per-service utilization statistics, in SDI the portal may enable the customer to trigger certain performance management or troubleshooting tools for the service. This, for example, enables the Customer to determine whether the root cause of a certain error or degradation condition that they observe is located in the telecom operator domain or not, and may facilitate the interaction with the customer support teams.
2.6. DevOps Principles

In line with the generic DevOps concept outlined in [DevOpsP], we consider the following four principles as important for adapting DevOps ideas to SDI:

* Automated processes: Deploy with repeatable, reliable processes: Service and VNF Developers should be supported by automated build, orchestration and deployment processes that are identical in the development, test and production environments. Such processes need to be made reliable and trusted in the sense that they should reduce the chance of human error and provide visibility at each stage of the process, as well as offer the possibility to enable manual interactions in certain key stages.

* Holistic/systemic view: Develop and test against production-like systems: both Service Developers and VNF Developers need to have the opportunity to verify and debug their respective SDI code in systems that have characteristics which are very close to the production environment where the code is expected to be ultimately deployed. Customizations of Service Function Chains or VNFs could thus be released frequently to a production environment in compliance with policies set by the Operators. Adequate isolation and protection of the services active in the infrastructure from services being tested or debugged should be provided by the production environment.

* Continuous: Monitor and validate operational quality: Service Developers, VNF Developers and Operators must be equipped with tools, automated as much as possible, that enable them to continuously monitor the operational quality of the services deployed on SDI. Monitoring tools should be complemented by tools that allow verifying and validating the operational quality of the service in line with established procedures, which might be standardized (for example, Y.1564 Ethernet Activation [Y1564]) or defined through best practices specific to a particular telecom operator.

* Iterative/Incremental: Amplify development cycle feedback loops: An integral part of the DevOps ethos is building a cross-cultural environment that bridges the cultural gap between the desire for continuous change by the Developers and the demand by the Operators for stability and reliability of the infrastructure. Feedback from customers is collected and transmitted throughout the organization. From a technical perspective, such cultural aspects could be addressed through common sets of tools and APIs that are aimed at providing a shared vocabulary for both Developers and Operators, as well as simplifying the reproduction of problematic situations in the development, test and operations environments.

Network operators that would like to move to agile methods to deploy and manage their networks and services face a different environment compared to typical software companies, where simplified trust relationships between personnel are the norm. In software companies, it is not uncommon that the same person may be rotating between different roles. In contrast, in a telecom service provider, there are strong organizational boundaries between suppliers (whether in Developer roles for network functions, or in Operator roles for outsourced services) and the carrier's own personnel that might also take both Developer and Operator roles. Extending DevOps principles across strong organizational boundaries (e.g.
through co-creation or collaborative development in open source communities) may be a commercial challenge rather than a technical issue.

3. Continuous Integration

Software integration is the process of bringing together the software component subsystems into one software system, and ensuring that the subsystems function together as a system. Software integration can apply regardless of the size of the software components. The objective of Continuous Integration is to prevent integration problems close to the expected release of a software development project into a production (operations) environment. Continuous Integration is therefore closely coupled with the notion of DevOps as a mechanism to ease the transition from development to operations.

Continuous Integration may result in multiple builds per day. It is also typically used in conjunction with test-driven development approaches that integrate unit testing into the build process. The unit testing is typically automated through build servers. Such servers may implement a variety of additional static and dynamic tests as well as other quality control and documentation extraction functions. The reduced cycle times of Continuous Integration enable improved software quality by applying small efforts frequently.

Continuous Integration applies to developers of VNFs as they integrate the components that they need to deliver their VNF. The VNFs may contain components developed by different teams within the VNF Provider, or may integrate code developed externally - e.g. in commercial code libraries or in open source communities.

Service Developers also apply Continuous Integration in the development of network services. Network services are comprised of various aspects including VNFs and connectivity within and between them, as well as various associated resource authorizations. The components of the network service are all dynamic, and largely represented by software that must be integrated regularly to maintain consistency.

Some of the software components that Service Developers integrate may be sourced from VNF Providers or from open source communities. Service Developers and Network Service Operators are increasingly motivated to engage with open source communities [OSandS]. Open source interfaces supported by open source communities may be more useful than traditional paper interface specifications. Even where Service Providers are deeply engaged in the open source community (e.g. OPNFV), many service providers may prefer to obtain the code through some software provider as a business practice. Such software providers have the same interests in software integration as other VNF providers. An open source integration community (e.g. OPNFV) may resolve common integration issues across the industry, reducing the need for integration issue resolution specific to particular integrators.
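As a purely illustrative sketch, the fragment below (in Python) shows the kind of automated gate a build server could run on every commit of VNF or network service code: static checks and unit tests are executed first, and an artifact is packaged only if all checks pass. The project layout (vnf_src/, tests/), the commands and the artifact path are assumptions made for the example and would be replaced by whatever toolchain the VNF Developer or Service Developer actually uses.

   #!/usr/bin/env python3
   # Illustrative continuous integration gate; the project layout
   # (vnf_src/, tests/) and the artifact path are assumptions.
   import pathlib
   import subprocess
   import sys
   import tarfile

   def run(step, cmd):
       # Fail fast: a failing step blocks promotion of the build.
       print("CI step:", step)
       return subprocess.run(cmd).returncode == 0

   def main():
       sources = [str(p) for p in pathlib.Path("vnf_src").rglob("*.py")]
       steps = [
           ("static checks", ["python3", "-m", "py_compile"] + sources),
           ("unit tests", ["python3", "-m", "unittest",
                           "discover", "-s", "tests"]),
       ]
       for name, cmd in steps:
           if not run(name, cmd):
               sys.exit("build rejected at step: " + name)
       # Package the VNF code only if all automated checks passed.
       pathlib.Path("artifacts").mkdir(exist_ok=True)
       with tarfile.open("artifacts/vnf_package.tar.gz", "w:gz") as tar:
           tar.add("vnf_src")
       print("artifact ready for the delivery pipeline")

   if __name__ == "__main__":
       main()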
4. Continuous Delivery

The practice of Continuous Delivery extends Continuous Integration by ensuring that the software (either VNF code or code for SDI) checked in on the mainline is always in a user-deployable state and enables rapid deployment by those users. For critical systems such as telecommunications networks, Continuous Delivery may benefit from the inclusion of a manual trigger before the actual deployment in the live system, compared to the Continuous Deployment methodology which is also part of DevOps processes in software companies.

Automated Continuous Deployment systems may exceed 10 updates per day. Assuming an integration of 100 components, each with an average time to upgrade of 180 days, then deployments on the order of every 1.8 days might be expected. The telecom infrastructure is also very distributed - consider the case of cloud RAN use cases where the number of locations for deployment is of the order of the number of cell tower locations (~10^4..10^6). Deployments may need to be incremental across the infrastructure to reduce the risk of large-scale failures. Conversely, there may need to be rapid rollbacks to prior stable deployment configurations in the event of significant failures.

5. Consistency, Availability and Partitioning Challenges

The CAP theorem [CAP] states that any networked shared-data system can have at most two of the following three properties: 1) Consistency (C), equivalent to having a single up-to-date copy of the data; 2) high Availability (A) of that data (for updates); and 3) tolerance to network Partitions (P).

Looking at a telecom SDI as a distributed computational system (routing/forwarding packets can be seen as a computational problem), just two of the three CAP properties will be possible at the same time. The general idea is that 2 of the 3 have to be chosen: CP favors consistency, AP favors availability, and CA assumes that there are no partitions. This has profound implications for technologies that need to be developed in line with the "deploy with repeatable, reliable processes" principle for configuring SDI states. Latency or delay and partitioning properties are closely related, and such a relation becomes more important in the case of telecom service providers where Devs and Ops interact with widely distributed infrastructure. Limitations of interactions between centralized management and distributed control need to be carefully examined in such environments. Traditionally, connectivity was the main concern: C and A were about delivering packets to their destination. The features and capabilities of SDN and NFV are changing the concerns: for example in SDN, control plane Partitions no longer imply data plane Partitions, so A does not imply C. In practice, CAP reflects the need for a balance between local/distributed operations and remote/centralized operations.

In addition to CAP aspects related to individual protocols, interdependencies between CAP choices for both resources and VNFs that are interconnected in a forwarding graph need to be considered. This is particularly relevant for the "Monitor and Validate Operational Quality" principle, as apart from transport protocols, most OAM functionality is generally configured in processes that are separated from the configuration of the monitored entities. Also, partitioning in a monitoring plane implemented through VNFs executed on compute resources does not necessarily mean that the dataplane of the monitored VNF was partitioned as well.
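The following fragment is a purely illustrative sketch of how the CAP trade-off can surface in a simple management operation: when a partition towards the authoritative configuration store is detected, the reader either returns a possibly stale node-local copy (favoring availability) or refuses to answer (favoring consistency). The store and cache interfaces are hypothetical stand-ins introduced only for this example.

   class PartitionError(Exception):
       """Raised when the authoritative store cannot be reached."""

   def read_config(key, store, cache, favour_availability=True):
       # "store" and "cache" are hypothetical: an authoritative
       # database and a node-local copy of previously read values.
       try:
           value = store.get(key)     # consistent, up-to-date answer
           cache[key] = value
           return value
       except PartitionError:
           if favour_availability and key in cache:
               return cache[key]      # possibly stale: A chosen over C
           raise                      # refuse to answer: C chosen over A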
6. Stability and Real-Time Change Challenges

The dimensions, dynamicity and heterogeneity of networks are growing continuously. Monitoring and managing the network behavior in order to meet technical and business objectives is becoming increasingly complicated and challenging, especially when considering the need for predicting and taming potential instabilities.

In general, instability in networks may have primary effects both jeopardizing the performance and compromising an optimized use of resources, even across multiple layers: in fact, instability of end-to-end communication paths may depend both on the underlying transport network and on the higher-level components specific to flow control and dynamic routing. For example, arguments for introducing advanced flow admission control are essentially derived from the observation that the network otherwise behaves in an inefficient and potentially unstable manner. Even with resource over-provisioning, a network without an efficient flow admission control has instability regions that can even lead to congestion collapse in certain configurations. Another example is the instability which is characteristic of any dynamically adaptive routing system. Routing instability, which can be (informally) defined as the quick change of network reachability and topology information, has a number of possible origins, including problems with connections, router failures, high levels of congestion, software configuration errors, transient physical and data link problems, and software bugs.

As a matter of fact, the states monitored and used to implement the different control and management functions in network nodes are governed by several low-level configuration commands. There are several dependencies among these states and the logic updating the states in real time (most of which are not synchronized automatically). Normally, high-level network goals (such as the connectivity matrix, load-balancing, traffic engineering goals, survivability requirements, etc.) are translated into low-level configuration commands (mostly manually) individually executed on the network elements (e.g., forwarding table, packet filters, link-scheduling weights, and queue-management parameters, as well as tunnels and NAT mappings). Network instabilities due to configuration errors can spread from node to node and propagate throughout the network.

DevOps in the data center is a source of inspiration regarding how to simplify and automate management processes for software-defined infrastructure. Although the low-level configuration could be automated by DevOps tools such as CFEngine [C2015], Puppet [P2015] and Ansible [A2015], the high-level goal translation towards tool-specific syntax is still a manual process. In addition, while carrier-grade configuration tools using the NETCONF protocol support complex atomic transaction management (which reduces the potential for instability), Ansible requires third-party components to support rollbacks and the Puppet transactions are not atomic.

As a specific example, automated configuration functions are expected to take the form of a "control loop" that monitors (i.e., measures) current states of the network, performs a computation, and then reconfigures the network. These types of functions must work correctly even in the presence of failures, variable delays in communicating with a distributed set of devices, and frequent changes in network conditions. Nevertheless, cascading and nesting of automated configuration processes can lead to the emergence of non-linear network behaviors and, as such, to sudden instabilities (i.e. identical local dynamics can give rise to widely different global dynamics).
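As a purely illustrative sketch of such a control loop, the fragment below smooths the monitored value and applies a deadband and a hold-down timer before reconfiguring - two common ways of damping the oscillations that cascaded automated reconfigurations may otherwise amplify. The measure() and reconfigure() hooks, the target value and the timers are assumptions for the example only, standing in for whatever monitoring and configuration interfaces the infrastructure exposes.

   import time

   def control_loop(measure, reconfigure, target=0.7, deadband=0.1,
                    hold_down=30.0, alpha=0.2):
       # measure() and reconfigure() are placeholders for the
       # monitoring and configuration interfaces of the infrastructure.
       smoothed = measure()
       last_change = 0.0
       while True:
           # Exponentially smooth the measurement to filter transients.
           smoothed = alpha * measure() + (1 - alpha) * smoothed
           drift = smoothed - target
           # Reconfigure only on significant deviations, and no more
           # often than the hold-down interval allows.
           if abs(drift) > deadband and \
                   time.time() - last_change > hold_down:
               reconfigure(drift)
               last_change = time.time()
           time.sleep(5.0)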
7. Observability Challenges

Monitoring algorithms need to operate in a scalable manner while providing the specified level of observability in the network, either for operation purposes (Ops part) or for debugging in a development phase (Dev part). We consider the following challenges:

* Scalability - relates to the granularity of network observability, computational efficiency, communication overhead, and strategic placement of monitoring functions.

* Distributed operation and information exchange between monitoring functions - monitoring functions supported by the nodes may perform specific operations (such as aggregation or filtering) locally on the collected data or within a defined data neighborhood and forward only the result to a management system. Such operation may require modifications of existing standards and development of protocols for efficient information exchange and messaging between monitoring functions. Different levels of granularity may need to be offered for the data exchanged through the interfaces, depending on the Dev or Ops role. Modern messaging systems, such as Apache Kafka [AK2015], widely employed in datacenter environments, were optimized for messages that are considerably larger than a single counter value read (the typical SNMP GET call usage) - note the throughput vs record size results from [K2014]. It is also debatable to what extent properties such as message persistence within the bus are needed in a carrier environment, where MIBs in practice already offer a certain level of persistence of management data at the node level. Also, such systems require the use of IP addressing, which might not be needed when the monitored data is consumed by a function within the same node.

* Common communication channel between monitoring functions and higher layer entities (orchestration, control or management systems) - a single communication channel for configuration and measurement data of diverse monitoring functions running on heterogeneous hard- and software environments. In telecommunication environments, infrastructure assets span not only large geographical areas, but also a wide range of technology domains, ranging from CPEs, access-, aggregation-, and transport networks, to datacenters. This heterogeneity of hard- and software platforms requires higher layer entities to utilize various parallel communication channels for either configuration or data retrieval of monitoring functions within these technology domains. To address automation and advances in monitoring programmability, software-defined telecommunication infrastructures would benefit from a single flexible communication channel, thereby supporting the dynamicity of virtualized environments.
Such a channel should ideally support propagation of configuration, signalling, and results from monitoring functions; carrier-grade operations in terms of availability and multi-tenant features; support highly distributed and hierarchical architectures, keeping messages as local as possible; be lightweight, topology independent, and network address agnostic; and support flexibility in terms of transport mechanisms and programming language support. Existing popular state-of-the-art message queuing systems such as RabbitMQ [R2015] fulfill many of these requirements. However, they utilize centralized brokers, posing a single point of failure and scalability concerns within a vastly distributed NFV environment. Furthermore, transport support is limited to TCP/IP. ZeroMQ [Z2015], on the other hand, lacks advanced features for carrier-grade operations, including high availability, authentication, and tenant isolation.

* Configurability and conditional observability - monitoring functions that go beyond measuring simple metrics (such as delay, or packet loss) require expressive monitoring annotation languages for describing the functionality such that it can be programmed by a controller. Monitoring algorithms implementing self-adaptive monitoring behavior relative to local network situations may employ such annotation languages to receive high-level objectives (KPIs controlling tradeoffs between accuracy and measurement frequency, for example) and conditions for varying the measurement intensity. Steps in this direction were taken by DevOps tools such as Splunk [S2015], whose collecting agent has the ability to load particular apps that in turn access specific counters or log files. However, such apps are tool-specific and may also require deploying additional agents that are specific to the application, library or infrastructure node being monitored. Choosing which objects to monitor in such an environment means deploying a tool-specific script that configures the monitoring app.

* Automation - includes mapping of monitoring functionality from a logical forwarding graph to virtual or physical instances executing in the infrastructure, as well as placement and re-placement of monitoring functionality for required observability coverage and configuration consistency upon updates in a dynamic network environment. Puppet [P2015] manifests or Ansible [A2015] playbooks could be used for automating the deployment of monitoring agents, for example those used by Splunk [S2015]. However, both manifests and playbooks were designed to represent the desired system configuration snapshot at a particular moment in time - they would now need to be generated automatically by the orchestration tools instead of a DevOps person.

* Actionable data

Data produced by observability tools could be utilized in a wide category of processes, ranging from billing and dimensioning to real-time troubleshooting and optimization. In order to allow for data-driven automated decisions and actuations based on these decisions, the data needs to be actionable. We define actionable data as being representative for a particular context or situation and an adequate input towards a decision.
Ensuring actionable data is challenging in a number of ways, including: defining adaptive correlation and sampling windows, filtering and aggregation methods that are adapted to or coordinated with the actual consumer of the data, and developing analytical and predictive methods that account for the uncertainty or incompleteness of the data.

* Data Virtualization

Data is key in helping both Developers and Operators perform their tasks. Traditional Network Management Systems were optimized for using one database that contains the master copy of the operational statistics and logs of network nodes. Ensuring access to this data from across the organization is challenging because strict privacy requirements and business secrets need to be protected. In DevOps-driven environments, data needs to be made available to Developers and their test environments. Data virtualization collectively defines a set of technologies that ensure that restricted copies of the partial data needed for a particular task may be made available while enforcing strict access control. Beyond simple access control, data virtualization needs to address the scalability challenges involved in copying large amounts of operational data, as well as in automatically disposing of it when the task authorized to use it has finished.

8. Verification Challenges

Enabling ongoing verification of code is an important goal of continuous integration as part of the data center DevOps concept. In a telecom SDI, service definitions, decompositions and configurations need to be expressed in machine-readable encodings. For example, configuration parameters could be expressed in terms of YANG data models. However, the infrastructure management layers (such as Software-Defined Network Controllers and Orchestration functions) might not always export such machine-readable descriptions of the runtime configuration state. In this case, the management layer itself could be expected to include a verification process that has the same challenges as the stand-alone verification processes we outline later in this section. In that sense, verification can be considered as a set of features providing gatekeeper functions to verify both the abstract service models and the proposed resource configuration before or right after the actual instantiation on the infrastructure layer takes place.

A verification process can involve different layers of the network and service architecture. Starting from a high-level verification of the customer input (for example, a Service Graph as defined in [I-D.unify-nfvrg-challenges]), the verification process could go more in depth to reflect on the Service Function Chain configuration. At the lowest layer, the verification would handle the actual set of forwarding rules and other configuration parameters associated with a Service Function Chain instance. This enables the verification of more quantitative properties (e.g. compliance with resource availability), as well as a more detailed and precise verification of the abovementioned topological ones. Existing SDN verification tools could be deployed in this context, but the majority of them only operate on flow space rules commonly expressed using OpenFlow syntax.
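As a minimal illustration of this class of tools, the sketch below checks reachability over a set of static forwarding rules modelled as (node, flow) -> next-node entries. The rule encoding is invented for the example and far simpler than real OpenFlow flow tables, which match on header fields rather than opaque flow identifiers.

   # Toy reachability check over static forwarding rules.
   # rules map (node, flow) -> next node; the encoding is invented
   # and far simpler than real OpenFlow flow tables.
   def reachable(rules, flow, src, dst, max_hops=64):
       node = src
       for _ in range(max_hops):
           if node == dst:
               return True
           node = rules.get((node, flow))
           if node is None:          # flow dropped or unmatched
               return False
       return False                  # loop suspected: hop budget spent

   example_rules = {("a", "f1"): "b", ("b", "f1"): "c"}
   assert reachable(example_rules, "f1", "a", "c")
   assert not reachable(example_rules, "f1", "a", "d")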
Moreover, such verification tools were designed for networks where the flow rules are necessary and sufficient to determine the forwarding state. This assumption is valid in networks composed only of network functions that forward traffic by analyzing only the packet headers (e.g. simple routers, stateless firewalls, etc.). Unfortunately, most real networks contain active network functions, represented by middle-boxes that dynamically change the forwarding path of a flow according to function-local algorithms and an internal state (that is based on the received packets), e.g. load balancers, packet marking modules and intrusion detection systems. The existing verification tools do not consider active network functions because they do not incorporate the dynamic transformation of internal state into the verification process.

Defining a set of verification tools that can account for active network functions is a significant challenge. In order to perform verification based on formal properties of the system, the internal states of an active (virtual or not) network function would need to be represented. Although these states would increase the verification process complexity (e.g., using simple model checking would not be feasible due to state explosion), they help to better represent the forwarding behavior in real networks. A way to address this challenge is by attempting to summarize the internal state of an active network function in a way that allows the verification process to finish within a reasonable time interval.

9. Testing Challenges

Testing in an NFV environment does impact the methodology used. The main challenge is the ability to isolate the Device Under Test (DUT). When testing physical devices, which are dedicated to a specific function, isolation of this function is relatively simple: isolate the DUT by surrounding it with emulations from test devices. This achieves isolation of the DUT, in a black box fashion, for any type of testing. In an NFV environment, the DUT becomes a component of a software infrastructure which cannot be isolated. For example, testing a VNF cannot be achieved without the presence of the NFVI and MANO components. In addition, the NFVI and MANO components can greatly influence the behavior and the performance of the VNF under test.

With this in mind, in NFV, the isolation of the DUT becomes a new concept: the VNF Under Test (VUT) becomes part of an environment that consists of the rest of the necessary architecture components (the test environment). In the previous example, the VNF becomes the VUT, while the MANO and NFVI become the test environment. Then, isolation of the VUT becomes a matter of configuration management, where the configuration of the test environment is kept fixed for each test of the VUT. So the MANO policies for instantiation, scaling, and placement, as well as the NFVI parameters such as the HW used, CPU pinning, etc., must remain fixed for each iterative test of the VNF. Only by keeping the configurations constant can the VNF tests be compared to each other. If any test environment configurations are changed between tests, the behavior of the VNF can be impacted, thus negating any comparison of the results.
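A minimal sketch of this practice is shown below: the test harness records a fingerprint of the test-environment configuration and refuses to compare results across runs whose fingerprints differ. The parameter names (placement policy, CPU pinning, NFVI flavour) are examples only and would be replaced by the actual MANO and NFVI settings relevant to the test.

   import hashlib
   import json

   def environment_fingerprint(config):
       # Hash a canonical encoding of the test-environment settings
       # so that any change between runs is detected.
       canonical = json.dumps(config, sort_keys=True)
       return hashlib.sha256(canonical.encode()).hexdigest()

   baseline = environment_fingerprint({
       "placement_policy": "anti-affinity",   # example keys only
       "cpu_pinning": True,
       "nfvi_flavour": "4vcpu-8G",
   })

   def results_comparable(current_config):
       # Only results obtained under an identical test-environment
       # configuration are compared with the baseline run.
       return environment_fingerprint(current_config) == baseline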
Of course, there are instances of testing where the inverse is desired: the configuration of the test environment is changed between each test, while the VNF configuration is kept constant. As an example, this type of methodology would be used in order to discover the optimum configuration of the NFVI for a particular VNF workload. Another similar but daunting challenge is the introduction of co-located tenants in the same environment as the VNF under test. The workload on these "neighbors" can greatly influence the behavior and performance of the VNF under test, but the test itself is invaluable to understand the impact of such a configuration.

Another challenge is the usage of test devices (traffic generators, emulators) that share the same infrastructure as the VNF under test. This can create a situation as above, where the neighbor competes for resources with the VUT itself, which can negate the test results. If a test architecture such as this is necessary (testing east-west traffic, for example), then care must be taken to configure the test devices such that they are isolated from the SUT in terms of allowed resources, and that they do not impact the SUT's ability to acquire resources to operate in all conditions.

NFV offers new features that did not exist as such previously, or modifies existing mechanisms. Examples of new features are the dynamic scaling of VNFs and network services (NS), standardized acceleration mechanisms, and the presence of the virtualization layer, which includes the vSwitch. An example of a mechanism which changes with NFV is how fault detection and fault recovery are handled. Fault recovery could now be handled by MANO in such a way as to invoke mechanisms such as live migration or snapshots in order to recover the state of a VNF and restore operation quickly. While the end results are expected to be the same as before, since the mechanism is very different, rigorous testing is highly recommended to validate those results.

Dynamic scaling of VNFs is a new concept in NFV. VNFs that require more resources will have them dynamically allocated on demand, and then subsequently released when not needed anymore. This is clearly a benefit arising from SDI. For each type of VNF, specific metrics will be used as input to conditions that will trigger a scaling operation, orchestrated by MANO. Testing this mechanism requires a methodology tailored to the specific operation of the VNF, in order to properly reach the monitored metrics and exercise the conditions leading to a scaling trigger. For example, a firewall VNF will be triggered for scaling on very different metrics than a 3GPP MME, as the two VNFs accomplish different functions. Since there will normally be a collection of metrics that are monitored in order to trigger a scaling operation, the testing methodology must be constructed in such a way as to address all combinations of those metrics. Metrics for a particular VNF may include sessions, session instantiations/second, throughput, etc. These metrics will be observed in relation to the given resources for the VNF.
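The sketch below illustrates, for discussion purposes only, how such a combination of monitored metrics could be evaluated against per-VNF thresholds to produce a scale-out trigger for the orchestrator. The metric names and threshold values are invented for the example and would differ, for instance, between a firewall and an MME.

   # Example-only thresholds for a hypothetical firewall VNF; an MME
   # would be scaled on a different set of metrics.
   FIREWALL_THRESHOLDS = {
       "sessions": 500000,
       "session_setups_per_s": 20000,
       "throughput_gbps": 8.0,
   }

   def scale_out_needed(metrics, thresholds, min_breaches=2):
       # Require several metrics to breach their limits so that a
       # single noisy counter does not cause needless scaling.
       breaches = sum(1 for name, limit in thresholds.items()
                      if metrics.get(name, 0) > limit)
       return breaches >= min_breaches

   sample = {"sessions": 620000, "session_setups_per_s": 25000,
             "throughput_gbps": 5.2}
   print(scale_out_needed(sample, FIREWALL_THRESHOLDS))    # True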
10. Programmable management

The ability to automate a set of actions to be performed on the infrastructure, be it virtual or physical, is key to productivity increases following the application of DevOps principles. Previous sections in this document touched on different dimensions of programmability:

- Section 5 approached programmability in the context of developing new capabilities for monitoring and for dynamically setting configuration parameters of deployed monitoring functions

- Section 7 reflected on the need to determine the correctness of actions that are to be inflicted on the infrastructure as a result of executing a set of high-level instructions

- Section 8 considered programmability in the perspective of an interface to facilitate dynamic orchestration of troubleshooting steps towards building workflows and for reducing the manual steps required in troubleshooting processes

We expect that programmable network management - along the lines of [RFC7426] - will draw more interest as we move forward. For example, in [I-D.unify-nfvrg-challenges], the authors identify the need for presenting programmable interfaces that accept instructions in a standards-supported manner for the Two-Way Active Measurement Protocol (TWAMP). More specifically, an excellent example in this case is traffic measurements, which are extensively used today to determine SLA adherence as well as to debug and troubleshoot pain points in service delivery. TWAMP is both widely implemented by all established vendors and deployed by most global operators. However, TWAMP management and control today relies solely on diverse and proprietary tools provided by the respective vendors of the equipment. For large, virtualized, and dynamically instantiated infrastructures where network functions are placed according to orchestration algorithms, proprietary mechanisms for managing TWAMP measurements have severe limitations. For example, today's TWAMP implementations are managed by vendor-specific, typically command-line interfaces (CLI), which can be scripted on a platform-by-platform basis. As a result, although the control and test measurement protocols are standardized, their respective management is not. This dramatically hinders the possibility of integrating such deployed functionality in the SP-DevOps concept. In this particular case, recent efforts in the IPPM WG [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model and effectively increase the programmability of TWAMP deployments in the future.

Data center DevOps tools, such as those surveyed in [D4.1], developed proprietary methods for describing and interacting through interfaces with the managed infrastructure. Within certain communities, they became de-facto standards in the same way particular CLIs became de-facto standards for Internet professionals. Although open-source components and a strong community involvement exist, the diversity of the new languages and interfaces creates a burden both for vendors, in terms of choosing which ones to prioritize for support and then developing the functionality, and for operators, who need to determine what fits best the requirements of their systems.
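As an illustration of the kind of programmability a standard data model enables, the fragment below assembles a TWAMP measurement request as a structured document that an orchestrator could render into whatever encoding its management interface expects. The field names are invented for the example and do not follow the structure of the model defined in [I-D.cmzrjp-ippm-twamp-yang]; a standards-based deployment would populate that YANG model instead.

   import json

   def twamp_session_request(sender_ip, reflector_ip, dscp=46,
                             padding=128):
       # Invented, example-only field names; a standards-based
       # deployment would populate the TWAMP YANG data model instead.
       return {
           "measurement": {
               "type": "twamp",
               "control-client": {"address": sender_ip},
               "session-reflector": {"address": reflector_ip},
               "parameters": {"dscp": dscp, "padding-bytes": padding},
           }
       }

   request = twamp_session_request("192.0.2.10", "198.51.100.20")
   print(json.dumps(request, indent=2))   # hand over to the chosen NBI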
11. Security Considerations

DevOps principles are typically practiced within the context of a single organization, i.e. a single trust domain. Extending DevOps practices across strong organizational boundaries (e.g. between commercial organizations) requires consideration of additional threat models. Additional validation procedures may be required to ingest and accept code changes arising from outside an organization.

12. IANA Considerations

This memo includes no request to IANA.

13. References

13.1. Informative References

[NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and Orchestration V0.6.1 (draft)", Jul. 2014

[I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya, "Service Function Chaining Operations, Administration and Maintenance Framework", draft-aldrin-sfc-oam-framework-02 (work in progress), July 2015.

[I-D.lee-sfc-verification] S. Lee and M. Shin, "Service Function Chaining Verification", draft-lee-sfc-verification-00 (work in progress), February 2014.

[RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J. Hadi Salim, D. Meyer, and O. Koufopavlou, "Software-Defined Networking (SDN): Layers and Architecture Terminology", RFC 7426, January 2015

[RFC7149] M. Boucadair and C. Jacquenet, "Software-Defined Networking: A Perspective from within a Service Provider Environment", RFC 7149, March 2014.

[TR228] TMForum, "Gap Analysis Related to MANO Work", TR228, May 2014

[I-D.unify-nfvrg-challenges] R. Szabo et al., "Unifying Carrier and Cloud Networks: Problem Statement and Challenges", draft-unify-nfvrg-challenges-03 (work in progress), October 2015

[I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way Active Measurement Protocol (TWAMP) Data Model", draft-cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015.

[D4.1] W. John et al., "D4.1 Initial requirements for the SP-DevOps concept, universal node capabilities and proposed tools", August 2014.

[SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve Rothenberg, S. Azodolmolky, S. Uhlig, "Software-Defined Networking: A Comprehensive Survey", to appear in Proceedings of the IEEE, 2015.

[DevOpsP] "DevOps, the IBM Approach", 2013. [Online].

[Y1564] ITU-T Recommendation Y.1564, "Ethernet service activation test methodology", March 2011

[CAP] E. Brewer, "CAP twelve years later: How the "rules" have changed", IEEE Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012.

[H2014] N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N. McKeown, "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks", in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 71-95

[W2011] A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann, "OFRewind: Enabling Record and Replay Troubleshooting for Networks", in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '11), pp. 327-340

[S2010] E. Al-Shaer and S. Al-Haj, "FlowChecker: configuration analysis and verification of federated OpenFlow infrastructures", in Proceedings of the 3rd ACM workshop on Assurable and usable security configuration (SafeConfig '10), pp. 37-44

[OSandS] S. Wright, D. Druta, "Open Source and Standards: The Role of Open Source in the Dialogue between Research and Standardization", Globecom Workshops (GC Wkshps), 2014, pp. 650-655, 8-12 Dec. 2014

[C2015] CFEngine. Online: http://cfengine.com/product/what-is-cfengine/, retrieved Sep 23, 2015.

[P2015] Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet, retrieved Sep 23, 2015.
[A2015] Ansible. Online: http://docs.ansible.com/, retrieved Sep 23, 2015.

[AK2015] Apache Kafka. Online: http://kafka.apache.org/documentation.html, retrieved Sep 23, 2015.

[S2015] Splunk. Online: http://www.splunk.com/en_us/products/splunk-light.html, retrieved Sep 23, 2015.

[K2014] J. Kreps, "Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)". Online: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines, retrieved Sep 23, 2015.

[R2015] RabbitMQ. Online: https://www.rabbitmq.com/, retrieved Oct 13, 2015

[IFA014] ETSI, "Network Functions Virtualisation (NFV); Management and Orchestration; Network Service Templates Specification", DGS/NFV-IFA014, Work in Progress

[IFA011] ETSI, "Network Functions Virtualisation (NFV); Management and Orchestration; VNF Packaging Specification", DGS/NFV-IFA011, Work in Progress

[NFVSWA] ETSI, "Network Functions Virtualisation (NFV); Virtual Network Functions Architecture", GS NFV-SWA 001 v1.1.1 (2014)

[Z2015] ZeroMQ. Online: http://zeromq.org/, retrieved Oct 13, 2015

14. Contributors to earlier versions

J. Kim (Deutsche Telekom), S. Sharma (iMinds), I. Papafili (OTE)

15. Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 619609 - the UNIFY project. The views expressed here are those of the authors only. The European Commission is not liable for any use that may be made of the information in this document.

We would like to thank in particular the UNIFY WP4 contributors, the internal reviewers of the UNIFY WP4 deliverables, and Russ White and Ramki Krishnan for their suggestions.

This document was prepared using 2-Word-v2.0.template.dot.

16. Authors' Addresses

Catalin Meirosu
Ericsson Research
S-16480 Stockholm, Sweden
Email: catalin.meirosu@ericsson.com

Antonio Manzalini
Telecom Italia
Via Reiss Romoli, 274
10148 - Torino, Italy
Email: antonio.manzalini@telecomitalia.it

Rebecca Steinert
SICS Swedish ICT AB
Box 1263, SE-16429 Kista, Sweden
Email: rebste@sics.se

Guido Marchetto
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 - Torino, Italy
Email: guido.marchetto@polito.it

Kostas Pentikousis
Travelping GmbH
Koernerstrasse 7-10
Berlin 10785
Germany
Email: k.pentikousis@travelping.com

Steven Wright
AT&T Services Inc.
1057 Lenox Park Blvd NE, STE 4D28
Atlanta, GA 30319
USA
Email: sw3588@att.com

Pierre Lynch
Ixia
800 Perimeter Park Drive, Suite A
Morrisville, NC 27560
USA
Email: plynch@ixiacom.com

Wolfgang John
Ericsson Research
S-16480 Stockholm, Sweden
Email: wolfgang.john@ericsson.com