NFVRG                                                        C. Meirosu
Internet Draft                                                 Ericsson
Intended status: Informational                             A. Manzalini
Expires: September 2016                                  Telecom Italia
                                                            R. Steinert
                                                                   SICS
                                                           G. Marchetto
                                                  Politecnico di Torino
                                                            I. Papafili
                                Hellenic Telecommunications Organization
                                                         K. Pentikousis
                                                                   EICT
                                                              S. Wright
                                                                   AT&T

                                                         March 18, 2016

          DevOps for Software-Defined Telecom Infrastructures
                      draft-unify-nfvrg-devops-04.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on September 20, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   Carrier-grade network management was optimized for environments built
   with monolithic physical nodes and involves significant deployment,
   integration and maintenance efforts from network service providers.
   The introduction of virtualization technologies, from the physical
   layer all the way up to the application layer, however, invalidates
   several well-established assumptions in this domain.
   This draft opens the discussion in NFVRG about challenges related to
   transforming the telecom network infrastructure into an agile,
   model-driven production environment for communication services.  We
   take inspiration from data center DevOps regarding how to simplify
   and automate management processes for a telecom service provider
   software-defined infrastructure (SDI).  Among the identified
   challenges, we consider scalability of observability processes and
   automated inference of monitoring requirements from logical
   forwarding graphs, as well as initial placement (and re-placement)
   of monitoring functionality following changes in flow paths enforced
   by the controllers.  In another category of challenges, verifying
   correctness of behavior for network functions where flow rules are
   no longer necessary and sufficient for determining the forwarding
   state (for example, stateful firewalls or load balancers) is very
   difficult with current technology.  Finally, we introduce challenges
   associated with operationalizing DevOps principles at scale in
   software-defined telecom networks in three areas related to key
   monitoring, verification and troubleshooting processes.

Table of Contents

   1. Introduction
   2. Software-Defined Telecom Infrastructure: Roles and DevOps
      principles
      2.1. Service Developer Role
      2.2. VNF Developer role
      2.3. System Integrator role
      2.4. Operator role
      2.5. Customer role
      2.6. DevOps Principles
   3. Continuous Integration
   4.
      Continuous Delivery
   5. Consistency, Availability and Partitioning Challenges
   6. Stability Challenges
   7. Observability Challenges
   8. Verification Challenges
   9. Troubleshooting Challenges
   10. Programmable network management
   11. DevOps Performance Metrics
   12. Security Considerations
   13. IANA Considerations
   14. References
      14.1. Informative References
   15. Contributors
   16. Acknowledgments
   17. Authors' Addresses

1. Introduction

   Carrier-grade network management was developed as an incremental
   solution once a particular network technology matured and came to be
   deployed in parallel with legacy technologies.  This approach
   requires significant integration efforts when new network services
   are launched.  Both centralized and distributed algorithms have been
   developed in order to solve very specific problems related to
   configuration, performance and fault management.  However, such
   algorithms consider a network that is by and large functionally
   static.  Thus, management processes related to introducing new
   functionality or maintaining existing functionality are complex and
   costly, due to the significant efforts required for verification and
   integration.
   Network virtualization, by means of Software-Defined Networking
   (SDN) and Network Function Virtualization (NFV), creates an
   environment where network functions are no longer static or strictly
   embedded in physical boxes deployed at fixed points.  The
   virtualized network is dynamic and open to fast-paced innovation,
   enabling efficient network management and reduced operating costs
   for network operators.  A significant part of network capabilities
   is expected to become available through interfaces that resemble the
   APIs widespread within datacenters, instead of the traditional
   telecom means of management such as the Simple Network Management
   Protocol, Command Line Interfaces or CORBA.  Such an API-based
   approach, combined with the programmability offered by SDN
   interfaces [RFC7426], opens opportunities for handling
   infrastructure, resources, and Virtual Network Functions (VNFs) as
   code, employing techniques from software engineering.

   The efficiency and integration of existing management techniques in
   virtualized and dynamic network environments are limited, however.
   Monitoring tools, e.g. those based on simple counters, physical
   network taps and active probing, do not scale well and provide only
   a small part of the observability features required in such a
   dynamic environment.  Although huge amounts of monitoring data can
   be collected from the nodes, the typical granularity is rather
   coarse.  Debugging and troubleshooting techniques for software-
   defined environments have gathered interest in the research
   community in recent years.  Still, it is yet to be explored how to
   integrate them into an operational network management system.
   Moreover, research tools developed in academia (such as NetSight
   [H2014], OFRewind [W2011], FlowChecker [S2010], etc.)
   were limited to solving very particular, well-defined problems and
   are oftentimes not built for automation and integration into
   carrier-grade network operations workflows.

   The topics at hand have already attracted several standardization
   organizations to look into the issues arising in this new
   environment.  For example, IETF working groups have activities in
   the area of OAM and Verification for Service Function Chaining
   [I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification].  At IRTF,
   [RFC7149] asks a set of relevant questions regarding operations of
   SDNs.  The ETSI NFV ISG defines the MANO interfaces [NFVMANO], and
   TMForum investigates gaps between these interfaces and existing
   specifications in [TR228].  The need for programmatic APIs in the
   orchestration of compute, network and storage resources is discussed
   in [I-D.unify-nfvrg-challenges].

   From a research perspective, problems related to the operation of
   software-defined networks are in part outlined in [SDNsurvey], and
   research referring to both cloud and software-defined networks is
   discussed in [D4.1].

   The purpose of this document is to act as a discussion opener in
   NFVRG by describing a set of principles that are relevant for
   applying DevOps ideas to managing software-defined telecom network
   infrastructures.  We identify a set of challenges related to
   developing tools, interfaces and protocols that would support these
   principles, and discuss how standard APIs could be leveraged to
   simplify management tasks.

2. Software-Defined Telecom Infrastructure: Roles and DevOps principles

   Agile methods used in many software-focused companies aim at
   releasing small increments of code that implement VNFs into a
   production environment with high velocity and high quality.
   Similarly, service providers are interested in releasing incremental
   improvements in the network services that they create from
   virtualized network functions.  The cycle time for DevOps as applied
   in many open source projects is on the order of one quarter of a
   year, or 13 weeks.

   The code needs to undergo a significant amount of automated testing
   and verification with pre-defined templates in a realistic setting.
   From the point of view of infrastructure management, the
   verification of the network configuration resulting from network
   policy decomposition and refinement, as well as of the configuration
   of virtual functions, is one of the most sensitive operations.  When
   troubleshooting the cause of unexpected behavior, fine-grained
   visibility into all resources supporting the virtual functions
   (whether compute- or network-related) is paramount to facilitating
   fast resolution times.  While compute resources are typically very
   well covered by debugging and profiling toolsets based on many years
   of advances in software engineering, programmable network resources
   are still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

   We identify two dimensions of the "developer" role in software-
   defined infrastructure (SDI).  One dimension relates to determining
   which high-level functions should be part of a particular service,
   deciding what logical interconnections are needed between these
   blocks, and defining a set of high-level constraints or goals
   related to parameters that define, for instance, a Service Function
   Chain.  This could be determined by the product owner for a
   particular family of services offered by a telecom provider.  Or, it
   might be a key account representative that adapts an existing
   service template to the requirements of a particular customer by
   adding or removing a small number of functional entities.
   We refer to this person as the Service Developer and, for simplicity
   (access control, training on technical background, etc.), we
   consider the role to be internal to the telecom provider.

2.2. VNF Developer role

   Another dimension of the "developer" role is a person that writes
   the software code for a new virtual network function (VNF).
   Depending on the actual VNF being developed, this person might be
   internal or external (e.g. a traditional equipment vendor) to the
   telecom provider.  We refer to them as VNF Developers.

2.3. System Integrator role

   The System Integrator role is to some extent similar to the Service
   Developer: people in this role need to identify the components of
   the system to be delivered.  However, for the Service Developer, the
   service components are pre-integrated, meaning that they have the
   right interfaces to interact with each other.  In contrast, the
   System Integrator needs to develop the software that makes the
   system components interact with each other.  As such, the System
   Integrator role combines aspects of the Developer roles and adds yet
   another dimension to them.  Compared to the other Developer roles,
   the System Integrator might face additional challenges stemming from
   the fact that they might not have access to the source code of some
   of the components.  This limits, for example, how fast they can
   address issues with the components being integrated, and it may lead
   to uneven workload depending on the release granularity of the
   different components that need to be integrated.

2.4. Operator role

   The role of an Operator in SDI is to ensure that the deployment
   processes were successful and that a set of performance indicators
   associated with a service are met while the service is supported on
   virtual infrastructure within the domain of a telecom provider.

2.5.
   Customer role

   A Customer contracts a telecom operator to provide one or more
   services.  In SDI, the Customer may communicate with the provider
   through an online portal.  Compared to the Service Developer, the
   Customer is external to the operator and may define changes to their
   own service instance only in accordance with policies defined by the
   Service Developer.  In addition to the usual per-service utilization
   statistics, in SDI the portal may enable the customer to trigger
   certain performance management or troubleshooting tools for the
   service.  This, for example, enables the Customer to determine
   whether the root cause of an error or degradation condition that
   they observe is located in the telecom operator domain or not, and
   may facilitate the interaction with the customer support teams.

2.6. DevOps Principles

   In line with the generic DevOps concept outlined in [DevOpsP], we
   consider the following four principles important for adapting DevOps
   ideas to SDI:

   * Deploy with repeatable, reliable processes: Service and VNF
   Developers should be supported by automated build, orchestration and
   deployment processes that are identical in the development, test and
   production environments.  Such processes need to be made reliable
   and trusted in the sense that they should reduce the chance of human
   error and provide visibility at each stage of the process, as well
   as offer the possibility of manual interaction at certain key
   stages.

   * Develop and test against production-like systems: both Service
   Developers and VNF Developers need to have the opportunity to verify
   and debug their respective SDI code in systems whose characteristics
   are very close to the production environment where the code is
   expected to be ultimately deployed.
   Customizations of Service Function Chains or VNFs could thus be
   released frequently to a production environment in compliance with
   policies set by the Operators.  The production environment should
   provide adequate isolation and protection of the services active in
   the infrastructure from services being tested or debugged.

   * Monitor and validate operational quality: Service Developers, VNF
   Developers and Operators must be equipped with tools, automated as
   much as possible, that enable them to continuously monitor the
   operational quality of the services deployed on SDI.  Monitoring
   tools should be complemented by tools for verifying and validating
   the operational quality of the service in line with established
   procedures, which might be standardized (for example, Y.1564
   Ethernet Activation [Y1564]) or defined through best practices
   specific to a particular telecom operator.

   * Amplify development cycle feedback loops: An integral part of the
   DevOps ethos is building a cross-cultural environment that bridges
   the gap between the Developers' desire for continuous change and the
   Operators' demand for stability and reliability of the
   infrastructure.  Feedback from customers is collected and
   transmitted throughout the organization.  From a technical
   perspective, such cultural aspects could be addressed through common
   sets of tools and APIs that provide a shared vocabulary for both
   Developers and Operators, and that simplify the reproduction of
   problematic situations across the development, test and operations
   environments.

   Network operators that would like to adopt agile methods to deploy
   and manage their networks and services face a different environment
   compared to typical software companies, where simplified trust
   relationships between personnel are the norm.
   In software companies, it is not uncommon for the same person to
   rotate between different roles.  In contrast, in a telecom service
   provider there are strong organizational boundaries between
   suppliers (whether in Developer roles for network functions, or in
   Operator roles for outsourced services) and the carrier's own
   personnel, who might also take both Developer and Operator roles.
   How DevOps principles reflect on these trust relationships, and to
   what extent initiatives such as co-creation could transform the
   environment to facilitate closer Dev and Ops integration across
   business boundaries, is an interesting area for business studies,
   but we could not for now identify a specific technological
   challenge.

3. Continuous Integration

   Software integration is the process of bringing together the
   software component subsystems into one software system and ensuring
   that the subsystems function together as a system.  Software
   integration can apply regardless of the size of the software
   components.  The objective of Continuous Integration is to prevent
   integration problems close to the expected release of a software
   development project into a production (operations) environment.
   Continuous Integration is therefore closely coupled with the notion
   of DevOps as a mechanism to ease the transition from development to
   operations.

   Continuous Integration may result in multiple builds per day.  It is
   also typically used in conjunction with test-driven development
   approaches that integrate unit testing into the build process.  The
   unit testing is typically automated through build servers.  Such
   servers may implement a variety of additional static and dynamic
   tests, as well as other quality control and documentation extraction
   functions.  The reduced cycle times of Continuous Integration enable
   improved software quality by applying small efforts frequently.
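   The build-server behavior described above - run the build, unit-test
   and static-analysis stages in order, stopping at the first failure -
   can be sketched as a minimal pipeline runner.  This is an
   illustrative Python sketch; the stage names and the pass/fail gating
   policy are assumptions for the example, not a description of any
   specific CI product.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One pipeline stage: a name plus a callable returning True on success."""
    name: str
    run: Callable[[], bool]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Execute stages in order; stop at the first failure, as a CI gate would."""
    report: List[str] = []
    for step in steps:
        ok = step.run()
        report.append(f"{step.name}: {'PASS' if ok else 'FAIL'}")
        if not ok:
            return False, report
    return True, report


# Toy stand-ins for the build, unit-test and static-analysis stages.
steps = [
    Step("build", lambda: True),
    Step("unit-tests", lambda: 1 + 1 == 2),
    Step("static-analysis", lambda: True),
]
ok, report = run_pipeline(steps)  # ok is True; all three stages pass
```

   A real build server would replace the lambdas with compiler, test-
   runner and linter invocations; the point is only that the gate is
   automated and repeatable across development, test and production
   environments.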
   Continuous Integration applies to VNF Developers as they integrate
   the components that they need to deliver their VNF.  The VNFs may
   contain components developed by different teams within the VNF
   Provider, or may integrate code developed externally, e.g. in
   commercial code libraries or in open source communities.

   Service providers also apply Continuous Integration in the
   development of network services.  Network services comprise various
   aspects, including VNFs, connectivity within and between them, and
   various associated resource authorizations.  The components of the
   network service are all dynamic, and largely represented by software
   that must be integrated regularly to maintain consistency.  Some of
   the software components that Service Providers use may be sourced
   from VNF Providers or from open source communities.  Service
   Providers are increasingly motivated to engage with open source
   communities [OSandS].  Open source interfaces supported by open
   source communities may be more useful than traditional paper
   interface specifications.  Even where Service Providers are deeply
   engaged in an open source community (e.g. OPNFV), many service
   providers may prefer to obtain the code through some software
   provider as a business practice.  Such software providers have the
   same interests in software integration as other VNF providers.

4. Continuous Delivery

   The practice of Continuous Delivery extends Continuous Integration
   by ensuring that the software (either VNF code or code for SDI)
   checked in on the mainline is always in a user-deployable state and
   enables rapid deployment by those users.
   For critical systems such as telecommunications networks, Continuous
   Delivery has the advantage of including a manual trigger before the
   actual deployment in the live system, compared to the Continuous
   Deployment methodology which is also part of DevOps processes in
   software companies.

5. Consistency, Availability and Partitioning Challenges

   The CAP theorem [CAP] states that any networked shared-data system
   can have at most two of the following three properties: 1)
   Consistency (C), equivalent to having a single up-to-date copy of
   the data; 2) high Availability (A) of that data (for updates); and
   3) tolerance to network Partitions (P).

   Looking at a telecom SDI as a distributed computational system
   (routing/forwarding packets can be seen as a computational problem),
   just two of the three CAP properties can hold at the same time:
   choosing CP favors consistency, choosing AP favors availability, and
   CA holds only in the absence of partitions.  This has profound
   implications for technologies that need to be developed in line with
   the "deploy with repeatable, reliable processes" principle for
   configuring SDI states.  Latency or delay and partitioning
   properties are closely related, and this relation becomes more
   important in the case of telecom service providers, where Devs and
   Ops interact with widely distributed infrastructure.

   Limitations of interactions between centralized management and
   distributed control need to be carefully examined in such
   environments.  Traditionally, connectivity was the main concern: C
   and A were about delivering packets to the destination.  The
   features and capabilities of SDN and NFV are changing the concerns:
   for example, in SDN, control plane partitions no longer imply data
   plane partitions, so A does not imply C.
   In practice, CAP reflects the need for a balance between
   local/distributed operations and remote/centralized operations.

   In addition to the CAP aspects of individual protocols,
   interdependencies between CAP choices for both resources and VNFs
   that are interconnected in a forwarding graph need to be considered.
   This is particularly relevant for the "Monitor and Validate
   Operational Quality" principle because, apart from transport
   protocols, most OAM functionality is generally configured in
   processes that are separated from the configuration of the monitored
   entities.  Also, partitioning in a monitoring plane implemented
   through VNFs executed on compute resources does not necessarily mean
   that the dataplane of the monitored VNF was partitioned as well.

6. Stability Challenges

   The dimensions, dynamicity and heterogeneity of networks are growing
   continuously.  Monitoring and managing the network behavior in order
   to meet technical and business objectives is becoming increasingly
   complicated and challenging, especially when considering the need
   for predicting and taming potential instabilities.

   In general, instability in networks may have primary effects that
   both jeopardize the performance and compromise an optimized use of
   resources, even across multiple layers: in fact, instability of end-
   to-end communication paths may depend both on the underlying
   transport network and on the higher-level components specific to
   flow control and dynamic routing.  For example, arguments for
   introducing advanced flow admission control are essentially derived
   from the observation that the network otherwise behaves in an
   inefficient and potentially unstable manner.  Even with over-
   provisioned resources, a network without efficient flow admission
   control has instability regions that can even lead to congestion
   collapse in certain configurations.
   Another example is the instability which is characteristic of any
   dynamically adaptive routing system.  Routing instability, which can
   be (informally) defined as the quick change of network reachability
   and topology information, has a number of possible origins,
   including problems with connections, router failures, high levels of
   congestion, software configuration errors, transient physical and
   data link problems, and software bugs.

   As a matter of fact, the states monitored and used to implement the
   different control and management functions in network nodes are
   governed by several low-level configuration commands (today still
   issued mostly manually).  Further, there are several dependencies
   among these states and the logic updating the states (most of which
   are not kept aligned automatically).  Normally, high-level network
   goals (such as the connectivity matrix, load-balancing, traffic
   engineering goals, survivability requirements, etc.) are translated
   into low-level configuration commands (mostly manually) individually
   executed on the network elements (e.g., forwarding table, packet
   filters, link-scheduling weights, and queue-management parameters,
   as well as tunnels and NAT mappings).  Network instabilities due to
   configuration errors can spread from node to node and propagate
   throughout the network.

   DevOps in the data center is a source of inspiration regarding how
   to simplify and automate management processes for software-defined
   infrastructure.  Although the low-level configuration could be
   automated by DevOps tools such as CFEngine [C2015], Puppet [P2015]
   and Ansible [A2015], the translation of high-level goals into tool-
   specific syntax is still a manual process.
   In addition, while carrier-grade configuration tools using the
   NETCONF protocol support complex atomic transaction management
   (which reduces the potential for instability), Ansible requires
   third-party components to support rollbacks, and Puppet transactions
   are not atomic.

   As a specific example, automated configuration functions are
   expected to take the form of a "control loop" that monitors (i.e.,
   measures) current states of the network, performs a computation, and
   then reconfigures the network.  These types of functions must work
   correctly even in the presence of failures, variable delays in
   communicating with a distributed set of devices, and frequent
   changes in network conditions.  Nevertheless, cascading and nesting
   of automated configuration processes can lead to the emergence of
   non-linear network behaviors, and as such to sudden instabilities
   (i.e., identical local dynamics can give rise to widely different
   global dynamics).

7. Observability Challenges

   Monitoring algorithms need to operate in a scalable manner while
   providing the specified level of observability in the network,
   either for operational purposes (the Ops part) or for debugging in a
   development phase (the Dev part).  We consider the following
   challenges:

   * Scalability - relates to the granularity of network observability,
   computational efficiency, communication overhead, and strategic
   placement of monitoring functions.

   * Distributed operation and information exchange between monitoring
   functions - monitoring functions supported by the nodes may perform
   specific operations (such as aggregation or filtering) locally on
   the collected data, or within a defined data neighborhood, and
   forward only the result to a management system.
   Such operation may require modifications of existing standards and
   the development of protocols for efficient information exchange and
   messaging between monitoring functions.  Different levels of
   granularity may need to be offered for the data exchanged through
   the interfaces, depending on the Dev or Ops role.  Modern messaging
   systems, such as Apache Kafka [AK2015], widely employed in
   datacenter environments, were optimized for messages that are
   considerably larger than a single counter value read (the typical
   usage of an SNMP GET call) - note the throughput vs. record size
   results in [K2014].  It is also debatable to what extent properties
   such as message persistence within the bus are needed in a carrier
   environment, where MIBs already offer a certain level of persistence
   of management data at the node level.  Also, such systems require
   the use of IP addressing, which might not be needed when the
   monitored data is consumed by a function within the same node.

   * Common communication channel between monitoring functions and
   higher layer entities (orchestration, control or management systems)
   - a single communication channel for configuration and measurement
   data of diverse monitoring functions running on heterogeneous
   hardware and software environments.  In telecommunication
   environments, infrastructure assets span not only large geographical
   areas, but also a wide range of technology domains, ranging from
   CPEs, access, aggregation and transport networks, to datacenters.
   This heterogeneity of hardware and software platforms requires
   higher layer entities to utilize various parallel communication
   channels for either configuration or data retrieval of monitoring
   functions within these technology domains.
To address automation and advances in monitoring programmability, software-defined telecommunication infrastructures would benefit from a single flexible communication channel, thereby supporting the dynamicity of virtualized environments. Such a channel should ideally: support propagation of configuration, signalling, and results from monitoring functions; provide carrier-grade operations in terms of availability and multi-tenant features; support highly distributed and hierarchical architectures, keeping messages as local as possible; be lightweight, topology independent, and network address agnostic; and support flexibility in terms of transport mechanisms and programming languages. Existing popular state-of-the-art message queuing systems such as RabbitMQ [R2015] fulfill many of these requirements. However, they utilize centralized brokers, posing a single point of failure and scalability concerns in a vastly distributed NFV environment. Furthermore, transport support is limited to TCP/IP. ZeroMQ [Z2015], on the other hand, lacks advanced features for carrier-grade operations, including high availability, authentication, and tenant isolation.

* Configurability and conditional observability - monitoring functions that go beyond measuring simple metrics (such as delay or packet loss) require expressive monitoring annotation languages for describing the functionality such that it can be programmed by a controller. Monitoring algorithms implementing self-adaptive monitoring behavior relative to local network situations may employ such annotation languages to receive high-level objectives (KPIs controlling tradeoffs between accuracy and measurement frequency, for example) and conditions for varying the measurement intensity.
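As a minimal illustration (purely hypothetical, not tied to any cited tool or annotation language), KPI-driven adaptation of measurement intensity might be sketched as follows, where all function and parameter names are invented for the example:

```python
# Sketch: a monitoring function that adapts its probing interval to a
# high-level stability objective (all names are illustrative).

def next_interval(samples, base_interval, max_interval, threshold):
    """Lengthen the probing interval while readings are stable,
    shorten it as soon as variability exceeds the objective."""
    if len(samples) < 2:
        return base_interval
    mean = sum(samples) / len(samples)
    # Mean absolute deviation as a cheap stability estimate.
    mad = sum(abs(s - mean) for s in samples) / len(samples)
    if mad > threshold:         # condition met: observe intensively
        return base_interval
    return min(max_interval, base_interval * 4)  # relax intensity

# Stable delay readings allow a sparser measurement schedule...
assert next_interval([10.0, 10.1, 9.9], 1.0, 30.0, 0.5) == 4.0
# ...while volatile readings trigger intensive monitoring again.
assert next_interval([10.0, 25.0, 4.0], 1.0, 30.0, 0.5) == 1.0
```

An annotation language as discussed above would express the objective (the threshold and interval bounds) declaratively, leaving the adaptation logic to the monitoring function itself.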
Steps in this direction were taken by DevOps tools such as Splunk [S2015], whose collecting agent can load particular apps that in turn access specific counters or log files. However, such apps are tool specific and may also require deploying additional agents that are specific to the application, library or infrastructure node being monitored. Choosing which objects to monitor in such an environment means deploying a tool-specific script that configures the monitoring app.

* Automation - includes mapping of monitoring functionality from a logical forwarding graph to virtual or physical instances executing in the infrastructure, as well as placement and re-placement of monitoring functionality for required observability coverage and configuration consistency upon updates in a dynamic network environment. Puppet [P2015] manifests or Ansible [A2015] playbooks could be used for automating the deployment of monitoring agents, for example those used by Splunk [S2015]. However, both manifests and playbooks were designed to represent the desired system configuration snapshot at a particular moment in time - they would now need to be generated automatically by the orchestration tools instead of a DevOps person.

* Actionable data

Data produced by observability tools could be utilized in a wide category of processes, ranging from billing and dimensioning to real-time troubleshooting and optimization. In order to allow for data-driven automated decisions and actuations based on these decisions, the data needs to be actionable. We define actionable data as being representative for a particular context or situation and an adequate input towards a decision.
Ensuring actionable data is challenging in a number of ways, including: defining adaptive correlation and sampling windows, filtering and aggregation methods that are adapted to or coordinated with the actual consumer of the data, and developing analytical and predictive methods that account for the uncertainty or incompleteness of the data.

* Data Virtualization

Data is key in helping both Developers and Operators perform their tasks. Traditional Network Management Systems were optimized for using one database that contains the master copy of the operational statistics and logs of network nodes. Ensuring access to this data from across the organization is challenging because strict privacy requirements and business secrets need to be protected. In DevOps-driven environments, data needs to be made available to Developers and their test environments. Data virtualization collectively defines a set of technologies that ensure that restricted copies of the partial data needed for a particular task may be made available while enforcing strict access control. Beyond simple access control, data virtualization needs to address the scalability challenges involved in copying large amounts of operational data, as well as automatically disposing of it when the task authorized to use it has finished.

8. Verification Challenges

Enabling ongoing verification of code is an important goal of continuous integration as part of the data center DevOps concept. In a telecom SDI, service definitions, decompositions and configurations need to be expressed in machine-readable encodings. For example, configuration parameters could be expressed in terms of YANG data models. However, the infrastructure management layers (such as Software-Defined Network Controllers and Orchestration functions) might not always export such machine-readable descriptions of the runtime configuration state.
In this case, the management layer itself could be expected to include a verification process that faces the same challenges as the stand-alone verification processes we outline later in this section. In that sense, verification can be considered a set of features providing gatekeeper functions that verify both the abstract service models and the proposed resource configuration before, or right after, the actual instantiation on the infrastructure layer takes place.

A verification process can involve different layers of the network and service architecture. Starting from a high-level verification of the customer input (for example, a Service Graph as defined in [I-D.unify-nfvrg-challenges]), the verification process could go deeper to reflect on the Service Function Chain configuration. At the lowest layer, the verification would handle the actual set of forwarding rules and other configuration parameters associated with a Service Function Chain instance. This enables the verification of more quantitative properties (e.g. compliance with resource availability), as well as a more detailed and precise verification of the abovementioned topological ones. Existing SDN verification tools could be deployed in this context, but the majority of them only operate on flow space rules, commonly expressed using OpenFlow syntax.

Moreover, such verification tools were designed for networks where the flow rules are necessary and sufficient to determine the forwarding state. This assumption is valid in networks composed only of network functions that forward traffic by analyzing only the packet headers (e.g. simple routers, stateless firewalls, etc.).
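Under that stateless assumption, verification reduces to analyzing the rule set alone; the following toy Python sketch (illustrative only, far simpler than production SDN verification tools, with invented rule and node names) checks reachability by simulating header-only forwarding:

```python
# Toy verifier: forwarding state is fully determined by match/action
# rules over packet headers (the stateless assumption discussed above).

def forward(tables, node, header):
    """Return the next hop this node's rules select for the header."""
    for match, next_hop in tables.get(node, []):
        if all(header.get(k) == v for k, v in match.items()):
            return next_hop
    return None  # no matching rule: packet dropped

def reaches(tables, src, dst, header, max_hops=16):
    """Header-driven reachability check with a loop bound."""
    node = src
    for _ in range(max_hops):
        if node == dst:
            return True
        node = forward(tables, node, header)
        if node is None:
            return False
    return False  # hop budget exhausted (possible forwarding loop)

tables = {
    "A": [({"dst_ip": "10.0.0.2"}, "B")],
    "B": [({"dst_ip": "10.0.0.2"}, "C")],
}
assert reaches(tables, "A", "C", {"dst_ip": "10.0.0.2"})
assert not reaches(tables, "A", "C", {"dst_ip": "10.0.0.9"})
```

Note that the check is sound only because no node changes its rules while forwarding is simulated; it is precisely this property that active network functions break.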
Unfortunately, most real networks contain active network functions, represented by middle-boxes that dynamically change the forwarding path of a flow according to function-local algorithms and an internal state based on the received packets, e.g. load balancers, packet marking modules and intrusion detection systems. Existing verification tools do not consider active network functions because they do not incorporate the dynamic transformation of internal state into the verification process.

Defining a set of verification tools that can account for active network functions is a significant challenge. In order to perform verification based on formal properties of the system, the internal states of an active (virtual or not) network function would need to be represented. Although these states increase the complexity of the verification process (e.g., simple model checking would not be feasible due to state explosion), they help to better represent the forwarding behavior in real networks. A way to address this challenge is to summarize the internal state of an active network function in a way that allows the verification process to finish within a reasonable time interval.

9. Troubleshooting Challenges

One of the problems brought up by the complexity introduced by NFV and SDN is pinpointing the cause of a failure in an infrastructure that is under continuous change. Developing an agile and low-maintenance debugging mechanism for an architecture comprised of multiple layers and discrete components is a particularly challenging task. Verification, observability, and probe-based tools are key to troubleshooting processes, regardless of whether they are employed by Dev or Ops personnel.

* Automated troubleshooting workflows

Failure is a frequently occurring event in network operation.
Therefore, it is crucial to monitor components of the system periodically. Moreover, in the case of failure, the troubleshooting system should search for the cause automatically. If the system follows a multi-layered architecture, monitoring and debugging actions should be performed on components in a chain, from the topmost layer to the bottom layer, and the results of these operations should be reported in reverse order. In this regard, one should be able to define monitoring and debugging actions through a common interface that employs this layer-hopping logic. In addition, this interface should allow fine-grained and automatic on-demand control for the integration of other monitoring and verification mechanisms and tools.

* Troubleshooting with active measurement methods

Besides detecting network changes based on passively collected information, active probes that quantify delay, network utilization and loss rate are important to debug errors and to evaluate the performance of network elements. While tools that are effective in determining such conditions for particular technologies were specified by the IETF and other standardization organizations, their use requires a significant amount of manual labor in terms of both configuration and interpretation of the results.

In contrast, methods that test and debug networks systematically, based on models generated from the router configuration, router interface tables or forwarding tables, would significantly simplify management. They could be made usable by Dev personnel who have little expertise in diagnosing network defects. Such tools naturally lend themselves to integration into complex troubleshooting workflows that could be generated automatically based on the description of a particular service chain. However, there are scalability challenges associated with deploying such tools in a network.
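As an illustration of what such systematic test generation involves (a simplified sketch, not any specific tool), probe minimization can be cast as a set-cover problem over the links that each candidate test packet, derived from the forwarding tables, would traverse; all names and data below are invented for the example:

```python
# Greedy selection of test packets: each candidate packet exercises a
# set of links; pick packets until every link is covered at least once.

def choose_test_packets(candidates):
    """candidates: {packet_id: set of links it traverses}.
    Returns a small (greedy, not guaranteed minimal) covering subset."""
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Pick the packet covering the most still-uncovered links.
        best = max(candidates, key=lambda p: len(candidates[p] & uncovered))
        if not candidates[best] & uncovered:
            break  # remaining links unreachable by any candidate
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

paths = {
    "p1": {("A", "B"), ("B", "C")},
    "p2": {("A", "B")},
    "p3": {("C", "D")},
}
# Two packets suffice: p1 covers A-B and B-C, p3 covers C-D.
assert set(choose_test_packets(paths)) == {"p1", "p3"}
```

The scalability concern discussed here is visible even in this sketch: the candidate sets must be recomputed from the forwarding tables whenever they change.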
Some tools may poll each networking device for the forwarding table information in order to calculate the minimum number of test packets to be transmitted in the network. Therefore, as the network size and the forwarding table size increase, forwarding table updates for these tools may put a non-negligible load on the network.

10. Programmable network management

The ability to automate a set of actions to be performed on the infrastructure, be it virtual or physical, is key to the productivity increases that follow from applying DevOps principles. Previous sections in this document touched on different dimensions of programmability:

- Section 7 approached programmability in the context of developing new capabilities for monitoring and for dynamically setting configuration parameters of deployed monitoring functions

- Section 8 reflected on the need to determine the correctness of actions that are to be inflicted on the infrastructure as a result of executing a set of high-level instructions

- Section 9 considered programmability from the perspective of an interface to facilitate dynamic orchestration of troubleshooting steps towards building workflows and reducing the manual steps required in troubleshooting processes

We expect that programmable network management - along the lines of [RFC7426] - will draw more interest as we move forward. For example, in [I-D.unify-nfvrg-challenges], the authors identify the need for presenting programmable interfaces that accept instructions in a standards-supported manner for the Two-Way Active Measurement Protocol (TWAMP). More specifically, an excellent example in this case is traffic measurements, which are extensively used today to determine SLA adherence as well as to debug and troubleshoot pain points in service delivery.
TWAMP is both widely implemented by all established vendors and deployed by most global operators. However, TWAMP management and control today rely solely on diverse and proprietary tools provided by the respective equipment vendors. For large, virtualized, and dynamically instantiated infrastructures, where network functions are placed according to orchestration algorithms, such proprietary mechanisms for managing TWAMP measurements have severe limitations. For example, today's TWAMP implementations are managed through vendor-specific, typically command-line, interfaces (CLIs), which can be scripted on a platform-by-platform basis. As a result, although the control and test measurement protocols are standardized, their respective management is not. This dramatically hinders the integration of such deployed functionality into the SP-DevOps concept. In this particular case, recent efforts in the IPPM WG [I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model and effectively increase the programmability of TWAMP deployments in the future.

Data center DevOps tools, such as those surveyed in [D4.1], developed proprietary methods for describing and interacting through interfaces with the managed infrastructure. Within certain communities, they became de-facto standards, in the same way particular CLIs became de-facto standards for Internet professionals. Although open-source components and strong community involvement exist, the diversity of the new languages and interfaces creates a burden both for vendors, in terms of choosing which ones to prioritize for support and then developing the functionality, and for operators, in terms of determining what fits best the requirements of their systems.

11. DevOps Performance Metrics

Defining a set of metrics that can be used as performance indicators is important for service providers to ensure the successful deployment and operation of a service in the software-defined telecom infrastructure.

We identify three types of considerations that are particularly relevant for these metrics: 1) technical considerations directly related to the service provided, 2) process-related considerations regarding the deployment, maintenance and troubleshooting of the service, i.e. concerning the operation of VNFs, and 3) cost-related considerations associated with the benefits of using a Software-Defined Telecom Infrastructure.

First, technical performance metrics shall be service-dependent and service-oriented, and may address, inter alia, service performance in terms of delay, throughput, congestion, energy consumption, availability, etc. Acceptable performance levels should be mapped to SLAs and the requirements of the service users. Metrics in this category were defined in IETF working groups and other standardization organizations with responsibility over particular service or infrastructure descriptions.

Second, process-related metrics shall serve a wider perspective, in the sense that they shall be applicable to multiple types of services. For instance, process-related metrics may include: number of probes for end-to-end QoS monitoring, number of on-site interventions, number of unused alarms, number of configuration mistakes, incident/trouble resolution delay, delay between service order and delivery, or number of self-care operations.

Third, cost-related metrics shall be used to monitor and assess the benefit of employing SDI compared to the usage of legacy hardware infrastructure with respect to operational costs, e.g. possible man-hour reductions, elimination of deployment and configuration mistakes, etc.
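As an illustration of how a process-related metric could be derived automatically from operational event logs (the log format and names below are hypothetical), incident resolution delay can be computed from paired open/close events:

```python
# Sketch: derive an incident resolution delay metric from paired
# open/close events (timestamps in seconds; format is illustrative).

def resolution_delays(events):
    """events: list of (incident_id, "open" or "close", timestamp)."""
    opened = {}
    delays = []
    for incident, kind, ts in events:
        if kind == "open":
            opened[incident] = ts
        elif kind == "close" and incident in opened:
            delays.append(ts - opened.pop(incident))
    return delays

log = [
    ("INC-1", "open", 100), ("INC-2", "open", 130),
    ("INC-1", "close", 400), ("INC-2", "close", 190),
]
delays = resolution_delays(log)
assert delays == [300, 60]
assert sum(delays) / len(delays) == 180  # mean resolution delay
```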
Finally, identifying a number of highly relevant metrics for DevOps, and especially monitoring and measuring them, is highly challenging because of the number and availability of data sources that could be aggregated within one such metric, e.g. calculation of human intervention, or confidential aspects of costs.

12. Security Considerations

TBD

13. IANA Considerations

This memo includes no request to IANA.

14. References

14.1. Informative References

[NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and Orchestration V0.6.1 (draft)", Jul. 2014.

[I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya. "Service Function Chaining Operations, Administration and Maintenance Framework", draft-aldrin-sfc-oam-framework-02 (work in progress), July 2015.

[I-D.lee-sfc-verification] S. Lee and M. Shin. "Service Function Chaining Verification", draft-lee-sfc-verification-00 (work in progress), February 2014.

[RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J. Hadi Salim, D. Meyer, and O. Koufopavlou, "Software-Defined Networking (SDN): Layers and Architecture Terminology", RFC 7426, January 2015.

[RFC7149] M. Boucadair and C. Jacquenet. "Software-Defined Networking: A Perspective from within a Service Provider Environment", RFC 7149, March 2014.

[TR228] TMForum, "Gap Analysis Related to MANO Work", TR228, May 2014.

[I-D.unify-nfvrg-challenges] R. Szabo et al. "Unifying Carrier and Cloud Networks: Problem Statement and Challenges", draft-unify-nfvrg-challenges-03 (work in progress), October 2015.

[I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman, R., Jethanandani, M., and K. Pentikousis, "Two-Way Active Measurement Protocol (TWAMP) Data Model", draft-cmzrjp-ippm-twamp-yang-02 (work in progress), October 2015.

[D4.1] W. John et al.
D4.1 Initial requirements for the SP-DevOps concept, universal node capabilities and proposed tools, August 2014.

[SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve Rothenberg, S. Azodolmolky, S. Uhlig. "Software-Defined Networking: A Comprehensive Survey." To appear in Proceedings of the IEEE, 2015.

[DevOpsP] "DevOps, the IBM Approach", 2013. [Online].

[Y1564] ITU-T Recommendation Y.1564: Ethernet service activation test methodology, March 2011.

[CAP] E. Brewer, "CAP twelve years later: How the 'rules' have changed", IEEE Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012.

[H2014] N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, N. McKeown; "I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks", in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 71-95.

[W2011] A. Wundsam, D. Levin, S. Seetharaman, A. Feldmann; "OFRewind: Enabling Record and Replay Troubleshooting for Networks", in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '11), pp. 327-340.

[S2010] E. Al-Shaer and S. Al-Haj. "FlowChecker: configuration analysis and verification of federated OpenFlow infrastructures", in Proceedings of the 3rd ACM Workshop on Assurable and Usable Security Configuration (SafeConfig '10), pp. 37-44.

[OSandS] S. Wright, D. Druta, "Open Source and Standards: The Role of Open Source in the Dialogue between Research and Standardization", Globecom Workshops (GC Wkshps), 2014, pp. 650-655, 8-12 Dec. 2014.

[C2015] CFEngine. Online: http://cfengine.com/product/what-is-cfengine/, retrieved Sep 23, 2015.

[P2015] Puppet. Online: http://puppetlabs.com/puppet/what-is-puppet, retrieved Sep 23, 2015.

[A2015] Ansible. Online: http://docs.ansible.com/, retrieved Sep 23, 2015.

[AK2015] Apache Kafka.
Online: http://kafka.apache.org/documentation.html, retrieved Sep 23, 2015.

[S2015] Splunk. Online: http://www.splunk.com/en_us/products/splunk-light.html, retrieved Sep 23, 2015.

[K2014] J. Kreps. Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines). Online: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines, retrieved Sep 23, 2015.

[R2015] RabbitMQ. Online: https://www.rabbitmq.com/, retrieved Oct 13, 2015.

[Z2015] ZeroMQ. Online: http://zeromq.org/, retrieved Oct 13, 2015.

15. Contributors

W. John (Ericsson), J. Kim (Deutsche Telekom), S. Sharma (iMinds)

16. Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 619609 - the UNIFY project. The views expressed here are those of the authors only. The European Commission is not liable for any use that may be made of the information in this document.

We would like to thank in particular the UNIFY WP4 contributors, the internal reviewers of the UNIFY WP4 deliverables, and Russ White and Ramki Krishnan for their suggestions.

This document was prepared using 2-Word-v2.0.template.dot.

17. Authors' Addresses

Catalin Meirosu
Ericsson Research
S-16480 Stockholm, Sweden
Email: catalin.meirosu@ericsson.com

Antonio Manzalini
Telecom Italia
Via Reiss Romoli, 274
10148 - Torino, Italy
Email: antonio.manzalini@telecomitalia.it

Juhoon Kim
Deutsche Telekom AG
Winterfeldtstr.
21
10781 Berlin, Germany
Email: J.Kim@telekom.de

Rebecca Steinert
SICS Swedish ICT AB
Box 1263, SE-16429 Kista, Sweden
Email: rebste@sics.se

Sachin Sharma
Ghent University-iMinds
Research group IBCN - Department of Information Technology
Zuiderpoort Office Park, Blok C0
Gaston Crommenlaan 8 bus 201
B-9050 Gent, Belgium
Email: sachin.sharma@intec.ugent.be

Guido Marchetto
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 - Torino, Italy
Email: guido.marchetto@polito.it

Ioanna Papafili
Hellenic Telecommunications Organization
Measurements and Wireless Technologies Section
Laboratories and New Technologies Division
2, Spartis & Pelika str., Maroussi,
GR-15122, Attica, Greece
Building E, Office 102
Email: iopapafi@oteresearch.gr

Kostas Pentikousis
EICT GmbH
Torgauer Strasse 12-15
Berlin 10829
Germany
Email: k.pentikousis@eict.de

Steven Wright
AT&T Services Inc.
1057 Lenox Park Blvd NE, STE 4D28
Atlanta, GA 30319
USA
Email: sw3588@att.com

Wolfgang John
Ericsson Research
S-16480 Stockholm, Sweden
Email: wolfgang.john@ericsson.com