NFVRG                                                        C. Meirosu
Internet Draft                                                 Ericsson
Intended status: Informational                             A. Manzalini
Expires: January 2016                                    Telecom Italia
                                                                 J. Kim
                                                       Deutsche Telekom
                                                            R. Steinert
                                                                   SICS
                                                              S. Sharma
                                                                 iMinds
                                                           G. Marchetto
                                                  Politecnico di Torino
                                                            I. Papafili
                                 Hellenic Telecommunications Organization
                                                         K. Pentikousis
                                                                   EICT
                                                              S. Wright
                                                                   AT&T

                                                           July 6, 2015

          DevOps for Software-Defined Telecom Infrastructures
                    draft-unify-nfvrg-devops-02.txt

Abstract

Carrier-grade network management was optimized for environments built
with monolithic physical nodes and involves significant deployment,
integration and maintenance efforts from network service providers.
The introduction of virtualization technologies, from the physical
layer all the way up to the application layer, however, invalidates
several well-established assumptions in this domain. This draft opens
the discussion in NFVRG about challenges related to transforming the
telecom network infrastructure into an agile, model-driven production
environment for communication services. We take inspiration from data
center DevOps regarding how to simplify and automate management
processes for a telecom service provider software-defined
infrastructure (SDI). Finally, we introduce challenges associated with
operationalizing DevOps principles at scale in software-defined
telecom networks in three key areas: monitoring, verification and
troubleshooting processes.

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 6, 2016.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents carefully,
as they describe your rights and restrictions with respect to this
document. Code Components extracted from this document must include
Simplified BSD License text as described in Section 4.e of the Trust
Legal Provisions and are provided without warranty as described in the
Simplified BSD License.

Table of Contents

1. Introduction
2. Software-Defined Telecom Infrastructure: Roles and DevOps Principles
   2.1. Service Developer Role
   2.2. VNF Developer Role
   2.3. Operator Role
   2.4. DevOps Principles
3. Continuous Integration
4. Continuous Delivery
5. Stability Challenges
6. Consistency, Availability and Partitioning Challenges
7. Observability Challenges
8. Verification Challenges
9. Troubleshooting Challenges
10. Programmable Network Management
11. DevOps Performance Metrics
12. Security Considerations
13. IANA Considerations
14. Informative References
15. Acknowledgments

1. Introduction

Carrier-grade network management was developed as an incremental
solution once a particular network technology matured and came to be
deployed in parallel with legacy technologies. This approach requires
significant integration efforts when new network services are
launched. Both centralized and distributed algorithms have been
developed in order to solve very specific problems related to
configuration, performance and fault management. However, such
algorithms consider a network that is by and large functionally
static. Thus, management processes related to introducing new
functionality or maintaining existing functionality are complex and
costly due to the significant efforts required for verification and
integration.

Network virtualization, by means of Software-Defined Networking (SDN)
and Network Function Virtualization (NFV), creates an environment
where network functions are no longer static or strictly embedded in
physical boxes deployed at fixed points. The virtualized network is
dynamic and open to fast-paced innovation, enabling efficient network
management and a reduction of operating costs for network operators. A
significant part of network capabilities is expected to become
available through interfaces that resemble the APIs widespread within
datacenters, instead of traditional telecom means of management such
as the Simple Network Management Protocol, Command Line Interfaces or
CORBA. Such an API-based approach, combined with the programmability
offered by SDN interfaces [RFC7426], opens opportunities for handling
infrastructure, resources, and Virtual Network Functions (VNFs) as
code, employing techniques from software engineering.
The efficiency and integration of existing management techniques in
virtualized and dynamic network environments are limited, however.
Monitoring tools, e.g. based on simple counters, physical network taps
and active probing, do not scale well and provide only a small part of
the observability features required in such a dynamic environment.
Although huge amounts of monitoring data can be collected from the
nodes, the typical granularity is rather coarse. Debugging and
troubleshooting techniques for software-defined environments have
gathered interest in the research community in recent years, but it is
yet to be explored how to integrate them into an operational network
management system. Moreover, research tools developed in academia
(such as NetSight [H2014], OFRewind [W2011], FlowChecker [S2010],
etc.) were limited to solving very particular, well-defined problems
and oftentimes are not built for automation and integration into
carrier-grade network operations workflows.

The topics at hand have already attracted several standardization
organizations to look into the issues arising in this new environment.
For example, IETF working groups have activities in the area of OAM
and verification for Service Function Chaining
[I-D.aldrin-sfc-oam-framework] [I-D.lee-sfc-verification]. [RFC7149]
asks a set of relevant questions regarding the operation of SDNs. The
ETSI NFV ISG defines the MANO interfaces [NFVMANO], and TMForum
investigates gaps between these interfaces and existing specifications
in [TR228]. The need for programmatic APIs in the orchestration of
compute, network and storage resources is discussed in
[I-D.unify-nfvrg-challenges].

From a research perspective, problems related to the operation of
software-defined networks are in part outlined in [SDNsurvey], and
research referring to both cloud and software-defined networks is
discussed in [D4.1].

The purpose of this document is to act as a discussion opener in NFVRG
by describing a set of principles that are relevant for applying
DevOps ideas to managing software-defined telecom network
infrastructures. We identify a set of challenges related to developing
tools, interfaces and protocols that would support these principles,
and discuss how standard APIs could be leveraged to simplify
management tasks.
2. Software-Defined Telecom Infrastructure: Roles and DevOps Principles

Agile methods used in many software-focused companies emphasize
releasing small iterations of code that implement VNFs, with high
velocity and high quality, into a production environment. Similarly,
service providers are interested in releasing incremental improvements
in the network services that they create from virtualized network
functions. The cycle time for DevOps as applied in many open source
projects is on the order of one quarter year, or 13 weeks.

The code needs to undergo a significant amount of automated testing
and verification with pre-defined templates in a realistic setting.
From the point of view of infrastructure management, the verification
of the network configuration resulting from network policy
decomposition and refinement, as well as of the configuration of
virtual functions, is one of the most sensitive operations. When
troubleshooting the cause of unexpected behavior, fine-grained
visibility into all resources supporting the virtual functions (either
compute or network-related) is paramount to facilitating fast
resolution times. While compute resources are typically very well
covered by debugging and profiling toolsets based on many years of
advances in software engineering, programmable network resources are
still a novelty and tools exploiting their potential are scarce.

2.1. Service Developer Role

We identify two dimensions of the "developer" role in software-defined
infrastructure (SDI). One dimension relates to determining which
high-level functions should be part of a particular service, deciding
what logical interconnections are needed between these blocks, and
defining a set of high-level constraints or goals related to
parameters that define, for instance, a Service Function Chain. This
could be determined by the product owner for a particular family of
services offered by a telecom provider, or it might be a key account
representative that adapts an existing service template to the
requirements of a particular customer by adding or removing a small
number of functional entities. We refer to this person as the Service
Developer and, for simplicity (access control, training on technical
background, etc.), we consider the role to be internal to the telecom
provider.

2.2. VNF Developer Role

The other dimension of the "developer" role is a person that writes
the software code for a new virtual network function (VNF). Depending
on the actual VNF being developed, this person might be internal or
external to the telecom provider. We refer to them as VNF Developers.

2.3. Operator Role

The role of an Operator in SDI is to ensure that the deployment
processes are successful and that a set of performance indicators
associated with a service are met while the service is supported on
virtual infrastructure within the domain of a telecom provider.

System integration roles are also important, and we intend to approach
them in a future revision of this draft.
2.4. DevOps Principles

In line with the generic DevOps concept outlined in [DevOpsP], we
consider the following four principles important for adapting DevOps
ideas to SDI:

* Deploy with repeatable, reliable processes: Service and VNF
Developers should be supported by automated build, orchestrate and
deploy processes that are identical in the development, test and
production environments. Such processes need to be made reliable and
trusted in the sense that they should reduce the chance of human error
and provide visibility at each stage of the process, while still
allowing manual interaction at certain key stages.

* Develop and test against production-like systems: both Service
Developers and VNF Developers need to have the opportunity to verify
and debug their respective SDI code in systems that have
characteristics very close to the production environment where the
code is expected to be ultimately deployed. Customizations of Service
Function Chains or VNFs could thus be released frequently to a
production environment in compliance with policies set by the
Operators. The production environment should provide adequate
isolation and protection of the services active in the infrastructure
from the services being tested or debugged.

* Monitor and validate operational quality: Service Developers, VNF
Developers and Operators must be equipped with tools, automated as
much as possible, that enable them to continuously monitor the
operational quality of the services deployed on SDI. Monitoring tools
should be complemented by tools that allow verifying and validating
the operational quality of the service in line with established
procedures, which might be standardized (for example, Y.1564 Ethernet
service activation [Y1564]) or defined through best practices specific
to a particular telecom operator.

* Amplify development cycle feedback loops: An integral part of the
DevOps ethos is building a cross-cultural environment that bridges the
gap between the desire for continuous change by the Developers and the
demand by the Operators for stability and reliability of the
infrastructure. Feedback from customers is collected and transmitted
throughout the organization. From a technical perspective, such
cultural aspects could be addressed through common sets of tools and
APIs that provide a shared vocabulary for both Developers and
Operators, as well as simplify the reproduction of problematic
situations in the development, test and operations environments.

Network operators that would like to move to agile methods to deploy
and manage their networks and services face a different environment
compared to typical software companies, where simplified trust
relationships between personnel are the norm. In such companies, it is
not uncommon that the same person rotates between different roles. In
contrast, in a telecom service provider, there are strong
organizational boundaries between suppliers (whether in Developer
roles for network functions, or in Operator roles for outsourced
services) and the carrier's own personnel, which might also take both
Developer and Operator roles. How DevOps principles reflect on these
trust relationships, and to what extent initiatives such as
co-creation could transform the environment to facilitate closer Dev
and Ops integration across business boundaries, is an interesting area
for business studies, but we could not for now identify a specific
technological challenge here.
3. Continuous Integration

Software integration is the process of bringing together the software
component subsystems into one software system and ensuring that the
subsystems function together as a system. Software integration can
apply regardless of the size of the software components. The objective
of Continuous Integration is to prevent integration problems close to
the expected release of a software development project into a
production (operations) environment. Continuous Integration is
therefore closely coupled with the notion of DevOps as a mechanism to
ease the transition from development to operations.

Continuous Integration may result in multiple builds per day. It is
also typically used in conjunction with test-driven development
approaches that integrate unit testing into the build process. The
unit testing is typically automated through build servers. Such
servers may implement a variety of additional static and dynamic tests
as well as other quality control and documentation extraction
functions. The reduced cycle times of Continuous Integration enable
improved software quality by applying small efforts frequently.

Continuous Integration applies to developers of VNFs as they integrate
the components that they need to deliver their VNFs. The VNFs may
contain components developed by different teams within the VNF
Provider, or may integrate code developed externally, e.g. in
commercial code libraries or in open source communities.

Service Providers also apply Continuous Integration in the development
of network services. Network services comprise various aspects,
including VNFs, connectivity within and between them, as well as
various associated resource authorizations. The components of the
network service are all dynamic and largely represented by software
that must be integrated regularly to maintain consistency. Some of the
software components used by Service Providers may be sourced from VNF
Providers or from open source communities. Service Providers are
increasingly motivated to engage with open source communities
[OSandS]. Open source interfaces supported by open source communities
may be more useful than traditional paper interface specifications.
Even where Service Providers are deeply engaged in an open source
community (e.g. OPNFV), many service providers may prefer to obtain
the code through some software provider as a business practice. Such
software providers have the same interests in software integration as
other VNF providers.

4. Continuous Delivery

The practice of Continuous Delivery extends Continuous Integration by
ensuring that the software checked in on the mainline is always in a
user-deployable state and enables rapid deployment by those users.
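As an illustration of the kind of automated gate that Continuous
Integration and Continuous Delivery rely on, the following minimal
Python sketch uses hypothetical stage names and placeholder checks
(not tied to any particular build server) and promotes a VNF or
service component only when every stage succeeds. It is a sketch of
the principle rather than a definitive implementation.

   # Minimal sketch of an automated CI/CD gate for a VNF component.
   # Stage names and checks are hypothetical; a real pipeline would
   # invoke build servers, test frameworks and an orchestrator.

   from typing import Callable, List, Tuple

   def build() -> bool:
       # Placeholder for compiling/packaging the VNF image.
       return True

   def unit_tests() -> bool:
       # Placeholder for automated unit tests run by the build server.
       return True

   def integration_tests() -> bool:
       # Placeholder for tests against a production-like environment.
       return True

   def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> bool:
       """Run stages in order; stop at the first failure so that only
       fully verified artifacts are promoted further."""
       for name, stage in stages:
           print(f"running stage: {name}")
           if not stage():
               print(f"stage failed: {name} -- artifact not promoted")
               return False
       print("all stages passed -- artifact ready for delivery")
       return True

   if __name__ == "__main__":
       run_pipeline([("build", build),
                     ("unit-tests", unit_tests),
                     ("integration-tests", integration_tests)])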
5. Stability Challenges

The dimensions, dynamicity and heterogeneity of networks are growing
continuously. Monitoring and managing the network behavior in order to
meet technical and business objectives is becoming increasingly
complicated and challenging, especially when considering the need of
predicting and taming potential instabilities.

In general, instability in networks may have primary effects that both
jeopardize performance and compromise an optimized use of resources,
even across multiple layers: in fact, the instability of end-to-end
communication paths may depend both on the underlying transport
network and on the higher-level components specific to flow control
and dynamic routing. For example, arguments for introducing advanced
flow admission control are essentially derived from the observation
that the network otherwise behaves in an inefficient and potentially
unstable manner. Even with resource over-provisioning, a network
without efficient flow admission control has instability regions that
can even lead to congestion collapse in certain configurations.
Another example is the instability which is characteristic of any
dynamically adaptive routing system. Routing instability, which can be
(informally) defined as the quick change of network reachability and
topology information, has a number of possible origins, including
problems with connections, router failures, high levels of congestion,
software configuration errors, transient physical and data link
problems, and software bugs.

As a matter of fact, the states monitored and used to implement the
different control and management functions in network nodes are
governed by several low-level configuration commands (today still
issued mostly manually). Further, there are several dependencies among
these states and the logic updating the states (most of which are not
kept aligned automatically). Normally, high-level network goals (such
as the connectivity matrix, load balancing, traffic engineering goals,
survivability requirements, etc.) are translated into low-level
configuration commands (mostly manually) individually executed on the
network elements (e.g., forwarding tables, packet filters,
link-scheduling weights, and queue-management parameters, as well as
tunnels and NAT mappings). Network instabilities due to configuration
errors can spread from node to node and propagate throughout the
network.

DevOps in the data center is a source of inspiration regarding how to
simplify and automate management processes for software-defined
infrastructure.

As a specific example, automated configuration functions are expected
to take the form of a "control loop" that monitors (i.e., measures)
the current state of the network, performs a computation, and then
reconfigures the network. These types of functions must work correctly
even in the presence of failures, variable delays in communicating
with a distributed set of devices, and frequent changes in network
conditions. Nevertheless, cascading and nesting of automated
configuration processes can lead to the emergence of non-linear
network behavior, and as such to sudden instabilities (i.e., identical
local dynamics can give rise to widely different global dynamics).
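The sketch below illustrates such a control loop in Python; the
measurement and reconfiguration functions are hypothetical
placeholders for monitoring and controller APIs. It also shows one
common safeguard against the instabilities discussed above: a deadband
that suppresses reconfiguration when the measured state is already
close to the target, which damps oscillations when several such loops
are cascaded. This is a minimal sketch under these assumptions, not a
prescription.

   # Minimal sketch of a damped monitor-compute-reconfigure loop.
   # measure_load() and apply_config() are hypothetical stand-ins for
   # the monitoring and configuration interfaces of an SDI controller.

   import random
   import time

   TARGET_LOAD = 0.6   # desired utilization of a monitored resource
   DEADBAND = 0.1      # damping: ignore small deviations from target

   def measure_load() -> float:
       # Placeholder: would query counters or probes in a real system.
       return random.uniform(0.0, 1.0)

   def compute_weight(current: float, weight: float) -> float:
       # Simple proportional adjustment of a routing/scheduling weight.
       return max(0.1, weight - 0.5 * (current - TARGET_LOAD))

   def apply_config(weight: float) -> None:
       # Placeholder: would push configuration through a controller API.
       print(f"reconfiguring with weight={weight:.2f}")

   def control_loop(iterations: int = 5) -> None:
       weight = 1.0
       for _ in range(iterations):
           load = measure_load()
           if abs(load - TARGET_LOAD) > DEADBAND:  # deadband check
               weight = compute_weight(load, weight)
               apply_config(weight)
           time.sleep(0.1)  # measurement interval

   if __name__ == "__main__":
       control_loop()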
6. Consistency, Availability and Partitioning Challenges

The CAP theorem [CAP] states that any networked shared-data system can
have at most two of the following three properties: 1) Consistency
(C), equivalent to having a single up-to-date copy of the data; 2)
high Availability (A) of that data (for updates); and 3) tolerance to
network Partitions (P).

Looking at a telecom SDI as a distributed computational system
(routing/forwarding packets can be seen as a computational problem),
only two of the three CAP properties will be achievable at the same
time: CP systems favor consistency, AP systems favor availability, and
CA systems assume that no partitions occur. This has profound
implications for technologies that need to be developed in line with
the "deploy with repeatable, reliable processes" principle for
configuring SDI states. Latency or delay and partitioning properties
are closely related, and this relation becomes more important in the
case of telecom service providers, where Devs and Ops interact with
widely distributed infrastructure. Limitations of interactions between
centralized management and distributed control need to be carefully
examined in such environments. Traditionally, connectivity was the
main concern: C and A were about delivering packets to the
destination. The features and capabilities of SDN and NFV are changing
the concerns: for example, in SDN, control plane partitions no longer
imply data plane partitions, so A does not imply C. In practice, CAP
reflects the need for a balance between local/distributed operations
and remote/centralized operations.

In addition to CAP aspects related to individual protocols, the
interdependencies between CAP choices for resources and VNFs that are
interconnected in a forwarding graph need to be considered. This is
particularly relevant for the "monitor and validate operational
quality" principle because, apart from transport protocols, most OAM
functionality is generally configured in processes that are separate
from the configuration of the monitored entities. Also, partitioning
in a monitoring plane implemented through VNFs executed on compute
resources does not necessarily mean that the dataplane of the
monitored VNF was partitioned as well.

7. Observability Challenges

Monitoring algorithms need to operate in a scalable manner while
providing the specified level of observability in the network, either
for operation purposes (Ops part) or for debugging in a development
phase (Dev part). We consider the following challenges:

* Scalability - relates to the granularity of network observability,
computational efficiency, communication overhead, and strategic
placement of monitoring functions.

* Distributed operation and information exchange between monitoring
functions - monitoring functions supported by the nodes may perform
specific operations (such as aggregation or filtering) locally on the
collected data, or within a defined data neighborhood, and forward
only the result to a management system. Such operation may require
modifications of existing standards and the development of protocols
for efficient information exchange and messaging between monitoring
functions. Different levels of granularity may need to be offered for
the data exchanged through the interfaces, depending on the Dev or Ops
role.
* Configurability and conditional observability - monitoring functions
that go beyond measuring simple metrics (such as delay or packet loss)
require expressive monitoring annotation languages for describing the
functionality such that it can be programmed by a controller.
Monitoring algorithms implementing self-adaptive monitoring behavior
relative to local network situations may employ such annotation
languages to receive high-level objectives (KPIs controlling tradeoffs
between accuracy and measurement frequency, for example) and
conditions for varying the measurement intensity; a minimal sketch of
such a function is given after this list.

* Automation - includes the mapping of monitoring functionality from a
logical forwarding graph to virtual or physical instances executing in
the infrastructure, as well as the placement and re-placement of
monitoring functionality for required observability coverage and
configuration consistency upon updates in a dynamic network
environment.
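The following Python sketch (with hypothetical sampling and reporting
interfaces) illustrates a node-local monitoring function along the
lines discussed above: raw samples are aggregated locally and only a
summary is forwarded to the management system, and the measurement
interval is adapted depending on how variable the observed metric is.
It is an illustration under these assumptions, not a proposal for a
specific tool.

   # Minimal sketch of a node-local monitoring function that
   # aggregates samples locally, forwards only summaries, and adapts
   # its sampling interval. All interfaces are hypothetical.

   import random
   import statistics

   class LocalMonitor:
       def __init__(self, base_interval: float, threshold: float):
           self.interval = base_interval  # seconds between samples
           self.threshold = threshold     # delay (ms) considered anomalous
           self.samples = []

       def sample_delay(self) -> float:
           # Placeholder: would run an active probe or read a counter.
           return random.gauss(10.0, 2.0)

       def collect(self, n: int) -> None:
           self.samples = [self.sample_delay() for _ in range(n)]

       def adapt(self) -> None:
           # Self-adaptive behavior: sample more often when the metric
           # is volatile, relax the interval when it is stable.
           if statistics.pstdev(self.samples) > 1.0:
               self.interval = max(0.1, self.interval / 2)
           else:
               self.interval = min(10.0, self.interval * 2)

       def report(self) -> dict:
           # Forward only an aggregate, not the raw samples.
           return {"mean_delay_ms": statistics.mean(self.samples),
                   "max_delay_ms": max(self.samples),
                   "violations": sum(s > self.threshold for s in self.samples),
                   "next_interval_s": self.interval}

   if __name__ == "__main__":
       mon = LocalMonitor(base_interval=1.0, threshold=15.0)
       mon.collect(n=20)
       mon.adapt()
       print(mon.report())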
8. Verification Challenges

Enabling ongoing verification of code is an important goal of
Continuous Integration as part of the data center DevOps concept. In a
telecom SDI, service definitions, decompositions and configurations
need to be expressed in machine-readable encodings. For example,
configuration parameters could be expressed in terms of YANG data
models. However, the infrastructure management layers (such as
Software-Defined Network Controllers and orchestration functions)
might not always export such machine-readable descriptions of the
runtime configuration state. In this case, the management layer itself
could be expected to include a verification process that faces the
same challenges as the stand-alone verification processes we outline
later in this section. In that sense, verification can be considered a
set of features providing gatekeeper functions to verify both the
abstract service models and the proposed resource configuration before
or right after the actual instantiation on the infrastructure layer
takes place.

A verification process can involve different layers of the network and
service architecture. Starting from a high-level verification of the
customer input (for example, a Service Graph as defined in
[I-D.unify-nfvrg-challenges]), the verification process could go more
in depth to reflect on the Service Function Chain configuration. At
the lowest layer, the verification would handle the actual set of
forwarding rules and other configuration parameters associated with a
Service Function Chain instance. This enables the verification of more
quantitative properties (e.g. compliance with resource availability),
as well as a more detailed and precise verification of the
above-mentioned topological ones. Existing SDN verification tools
could be deployed in this context, but the majority of them only
operate on flow space rules, commonly expressed using OpenFlow syntax.

Moreover, such verification tools were designed for networks where the
flow rules are necessary and sufficient to determine the forwarding
state. This assumption is valid in networks composed only of network
functions that forward traffic by analyzing only the packet headers
(e.g. simple routers, stateless firewalls, etc.). Unfortunately, most
real networks contain active network functions, represented by
middleboxes that dynamically change the forwarding path of a flow
according to function-local algorithms and an internal state that is
based on the received packets, e.g. load balancers, packet marking
modules and intrusion detection systems. The existing verification
tools do not consider active network functions because they do not
incorporate the dynamic transformation of such internal state into the
verification process.

Defining a set of verification tools that can account for active
network functions is a significant challenge. In order to perform
verification based on formal properties of the system, the internal
states of an active (virtual or not) network function would need to be
represented. Although these states increase the complexity of the
verification process (e.g., using simple model checking would not be
feasible due to state explosion), they help to better represent the
forwarding behavior in real networks. A way to address this challenge
is to attempt to summarize the internal state of an active network
function in a way that allows the verification process to finish
within a reasonable time interval.
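As a toy illustration of such state summarization, the Python sketch
below (with a hypothetical topology) represents a load balancer only
by the set of next hops it may select, instead of by its full internal
state, and then checks a simple property: that every possible
forwarding choice still delivers traffic from the ingress to the
destination. Real verification tools would handle far richer rule sets
and properties.

   # Minimal sketch: verify that traffic entering at 'in' always
   # reaches 'dst' in a forwarding graph containing an active function
   # ('lb', a load balancer) summarized by the set of next hops it may
   # choose. Topology and node names are hypothetical.

   # Each node maps to its set of possible next hops: a singleton for
   # stateless forwarding, several entries for a summarized active
   # network function.
   forwarding = {
       "in":   {"fw"},
       "fw":   {"lb"},
       "lb":   {"srv1", "srv2"},  # summarized balancer state
       "srv1": {"dst"},
       "srv2": {"dst"},
       "dst":  set(),
   }

   def always_reaches(graph: dict, src: str, dst: str, visited=None) -> bool:
       """True if every possible forwarding choice leads from src to dst."""
       if visited is None:
           visited = set()
       if src == dst:
           return True
       if src in visited or not graph.get(src):
           return False  # loop or black hole on some branch
       visited = visited | {src}
       return all(always_reaches(graph, nxt, dst, visited)
                  for nxt in graph[src])

   if __name__ == "__main__":
       print("property holds:", always_reaches(forwarding, "in", "dst"))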
9. Troubleshooting Challenges

One of the problems brought up by the complexity introduced by NFV and
SDN is pinpointing the cause of a failure in an infrastructure that is
under continuous change. Developing an agile and low-maintenance
debugging mechanism for an architecture that is comprised of multiple
layers and discrete components is a particularly challenging task to
carry out. Verification, observability, and probe-based tools are key
to troubleshooting processes, regardless of whether they are followed
by Dev or Ops personnel.

* Automated troubleshooting workflows

Failure is a frequently occurring event in network operation.
Therefore, it is crucial to monitor components of the system
periodically. Moreover, the troubleshooting system should search for
the cause automatically in the case of failure. If the system follows
a multi-layered architecture, monitoring and debugging actions should
be performed on components from the topmost layer to the bottom layer
in a chain. Likewise, the results of these operations should be
reported in reverse order. In this regard, one should be able to
define monitoring and debugging actions through a common interface
that employs such layer-hopping logic. In addition, this interface
should allow fine-grained and automatic on-demand control for the
integration of other monitoring and verification mechanisms and tools.

* Troubleshooting with active measurement methods

Besides detecting network changes based on passively collected
information, active probes to quantify delay, network utilization and
loss rate are important to debug errors and to evaluate the
performance of network elements. While tools that are effective in
determining such conditions for particular technologies were specified
by the IETF and other standardization organizations, their use
requires a significant amount of manual labor in terms of both
configuration and interpretation of the results; see also Section 10.

In contrast, methods that test and debug networks systematically,
based on models generated from the router configuration, router
interface tables or forwarding tables, would significantly simplify
management. They could be made usable by Dev personnel that have
little expertise in diagnosing network defects. Such tools naturally
lend themselves to integration into complex troubleshooting workflows
that could be generated automatically based on the description of a
particular service chain. However, there are scalability challenges
associated with deploying such tools in a network. Some tools may poll
each networking device for the forwarding table information in order
to calculate the minimum number of test packets to be transmitted in
the network. Therefore, as the network size and the forwarding table
size increase, forwarding table updates for the tools may put a
non-negligible load on the network.
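The Python sketch below illustrates one possible shape of such an
automated, layered troubleshooting workflow; the per-layer checks are
hypothetical placeholders for verification, observability and
probe-based tools. Checks run from the topmost layer downwards, the
workflow stops at the first failing layer (one possible policy for
localizing the suspected cause), and the results are reported in
reverse order.

   # Minimal sketch of an automated, layered troubleshooting workflow.
   # The per-layer checks are hypothetical placeholders.

   from typing import Callable, List, Tuple

   def check_service_chain() -> bool:
       return True    # e.g., verify the service graph configuration

   def check_vnf_instances() -> bool:
       return True    # e.g., query VNF health/monitoring functions

   def check_forwarding() -> bool:
       return False   # e.g., active probes along the forwarding path

   def troubleshoot(layers: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
       """Run checks from the topmost layer downwards; stop at the
       first failure and report results in reverse (bottom-up) order."""
       results = []
       for name, check in layers:
           ok = check()
           results.append(f"{name}: {'ok' if ok else 'FAILED'}")
           if not ok:
               break                   # suspected root cause located
       return list(reversed(results))  # notify in reverse order

   if __name__ == "__main__":
       report = troubleshoot([("service chain", check_service_chain),
                              ("VNF instances", check_vnf_instances),
                              ("forwarding layer", check_forwarding)])
       for line in report:
           print(line)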
10. Programmable Network Management

The ability to automate a set of actions to be performed on the
infrastructure, be it virtual or physical, is key to the productivity
increases that follow from applying DevOps principles. Previous
sections in this document touched on different dimensions of
programmability:

- Section 7 approached programmability in the context of developing
  new capabilities for monitoring and for dynamically setting
  configuration parameters of deployed monitoring functions

- Section 8 reflected on the need to determine the correctness of
  actions that are to be inflicted on the infrastructure as the result
  of executing a set of high-level instructions

- Section 9 considered programmability from the perspective of an
  interface to facilitate dynamic orchestration of troubleshooting
  steps towards building workflows and reducing the manual steps
  required in troubleshooting processes

We expect that programmable network management - along the lines of
[RFC7426] - will draw more interest as we move forward. For example,
in [I-D.unify-nfvrg-challenges], the authors identify the need for
presenting programmable interfaces that accept instructions in a
standards-supported manner for the Two-Way Active Measurement Protocol
(TWAMP). More specifically, an excellent example in this case is
traffic measurements, which are extensively used today to determine
SLA adherence as well as to debug and troubleshoot pain points in
service delivery. TWAMP is both widely implemented by all established
vendors and deployed by most global operators. However, TWAMP
management and control today relies solely on diverse and proprietary
tools provided by the respective vendors of the equipment. For large,
virtualized, and dynamically instantiated infrastructures where
network functions are placed according to orchestration algorithms,
proprietary mechanisms for managing TWAMP measurements have severe
limitations. For example, today's TWAMP implementations are managed
through vendor-specific, typically command-line interfaces (CLI),
which can be scripted on a platform-by-platform basis. As a result,
although the control and test measurement protocols are standardized,
their respective management is not. This dramatically hinders the
possibility to integrate such deployed functionality in the SP-DevOps
concept. In this particular case, recent efforts in the IPPM WG
[I-D.cmzrjp-ippm-twamp-yang] aim to define a standard TWAMP data model
and effectively increase the programmability of TWAMP deployments in
the future.

Data center DevOps tools, such as those surveyed in [D4.1], developed
proprietary methods for describing and interacting through interfaces
with the managed infrastructure. Within certain communities, they
became de-facto standards, in the same way particular CLIs became
de-facto standards for Internet professionals. Although open source
components and strong community involvement exist, the diversity of
the new languages and interfaces creates a burden both for vendors, in
terms of choosing which ones to prioritize for support and then
developing the functionality, and for operators, who have to determine
what fits best the requirements of their systems.

11. DevOps Performance Metrics

Defining a set of metrics that are used as performance indicators is
important for service providers to ensure the successful deployment
and operation of a service in the software-defined telecom
infrastructure.

We identify three types of considerations that are particularly
relevant for these metrics: 1) technical considerations directly
related to the service provided, 2) process-related considerations
regarding the deployment, maintenance and troubleshooting of the
service, i.e. concerning the operation of VNFs, and 3) cost-related
considerations associated with the benefits from using a
software-defined telecom infrastructure.

First, technical performance metrics shall be
service-dependent/-oriented and may address, inter alia, service
performance in terms of delay, throughput, congestion, energy
consumption, availability, etc. Acceptable performance levels should
be mapped to SLAs and the requirements of the service users. Metrics
in this category were defined in IETF working groups and other
standardization organizations with responsibility over particular
service or infrastructure descriptions.

Second, process-related metrics shall serve a wider perspective in the
sense that they shall be applicable to multiple types of services. For
instance, process-related metrics may include: number of probes for
end-to-end QoS monitoring, number of on-site interventions, number of
unused alarms, number of configuration mistakes, incident/trouble
resolution delay, delay between service order and delivery, or number
of self-care operations.

Third, cost-related metrics shall be used to monitor and assess the
benefit of employing SDI compared to the usage of legacy hardware
infrastructure with respect to operational costs, e.g. possible
man-hour reductions, elimination of deployment and configuration
mistakes, etc.

Finally, identifying a number of highly relevant metrics for DevOps,
and especially monitoring and measuring them, is highly challenging
because of the amount and availability of the data sources that would
need to be aggregated within one such metric, e.g. the quantification
of human intervention, or confidential aspects of costs.
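As a small illustration of how process-related metrics of this kind
could be derived from operations records, the Python sketch below
(with hypothetical, hand-written records and field names) computes a
mean incident resolution delay and a mean order-to-delivery delay.

   # Minimal sketch: computing two process-related DevOps metrics from
   # hypothetical, timestamped operations records.

   from datetime import datetime
   from statistics import mean

   incidents = [
       {"opened": datetime(2015, 7, 1, 9, 0),
        "resolved": datetime(2015, 7, 1, 11, 30)},
       {"opened": datetime(2015, 7, 2, 14, 0),
        "resolved": datetime(2015, 7, 3, 9, 0)},
   ]

   orders = [
       {"ordered": datetime(2015, 7, 1, 8, 0),
        "delivered": datetime(2015, 7, 1, 8, 20)},
       {"ordered": datetime(2015, 7, 4, 10, 0),
        "delivered": datetime(2015, 7, 4, 12, 0)},
   ]

   def mean_delay_hours(records, start_key, end_key) -> float:
       """Average delay in hours between two timestamps per record."""
       delays = [(r[end_key] - r[start_key]).total_seconds() / 3600.0
                 for r in records]
       return mean(delays)

   if __name__ == "__main__":
       print("mean incident resolution delay (h):",
             round(mean_delay_hours(incidents, "opened", "resolved"), 2))
       print("mean order-to-delivery delay (h):",
             round(mean_delay_hours(orders, "ordered", "delivered"), 2))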
12. Security Considerations

TBD

13. IANA Considerations

This memo includes no request to IANA.

14. Informative References

[NFVMANO] ETSI, "Network Function Virtualization (NFV) Management and
          Orchestration V0.6.1 (draft)", July 2014.

[I-D.aldrin-sfc-oam-framework] S. Aldrin, R. Pignataro, N. Akiya,
          "Service Function Chaining Operations, Administration and
          Maintenance Framework", draft-aldrin-sfc-oam-framework-01
          (work in progress), July 2014.

[I-D.lee-sfc-verification] S. Lee and M. Shin, "Service Function
          Chaining Verification", draft-lee-sfc-verification-00 (work
          in progress), February 2014.

[RFC7426] E. Haleplidis (Ed.), K. Pentikousis (Ed.), S. Denazis, J.
          Hadi Salim, D. Meyer, and O. Koufopavlou, "Software-Defined
          Networking (SDN): Layers and Architecture Terminology", RFC
          7426, January 2015.

[RFC7149] M. Boucadair and C. Jacquenet, "Software-Defined Networking:
          A Perspective from within a Service Provider Environment",
          RFC 7149, March 2014.

[TR228]   TMForum, "Gap Analysis Related to MANO Work", TR228, May
          2014.

[I-D.unify-nfvrg-challenges] R. Szabo et al., "Unifying Carrier and
          Cloud Networks: Problem Statement and Challenges",
          draft-unify-nfvrg-challenges-02 (work in progress), July
          2015.

[I-D.cmzrjp-ippm-twamp-yang] Civil, R., Morton, A., Zheng, L., Rahman,
          R., Jethanandani, M., and K. Pentikousis, "Two-Way Active
          Measurement Protocol (TWAMP) Data Model",
          draft-cmzrjp-ippm-twamp-yang-01 (work in progress), July
          2015.

[D4.1]    W. John et al., "D4.1 Initial requirements for the SP-DevOps
          concept, universal node capabilities and proposed tools",
          August 2014.

[SDNsurvey] D. Kreutz, F. M. V. Ramos, P. Verissimo, C. Esteve
          Rothenberg, S. Azodolmolky, and S. Uhlig, "Software-Defined
          Networking: A Comprehensive Survey", to appear in Proceedings
          of the IEEE, 2015.

[DevOpsP] "DevOps, the IBM Approach", 2013. [Online].

[Y1564]   ITU-T Recommendation Y.1564, "Ethernet service activation
          test methodology", March 2011.

[CAP]     E. Brewer, "CAP twelve years later: How the 'rules' have
          changed", IEEE Computer, vol. 45, no. 2, pp. 23-29, February
          2012.

[H2014]   N. Handigol, B. Heller, V. Jeyakumar, D. Mazieres, and N.
          McKeown, "I Know What Your Packet Did Last Hop: Using Packet
          Histories to Troubleshoot Networks", in Proceedings of the
          11th USENIX Symposium on Networked Systems Design and
          Implementation (NSDI 14), pp. 71-95.

[W2011]   A. Wundsam, D. Levin, S. Seetharaman, and A. Feldmann,
          "OFRewind: Enabling Record and Replay Troubleshooting for
          Networks", in Proceedings of the USENIX Annual Technical
          Conference (USENIX ATC '11), pp. 327-340.

[S2010]   E. Al-Shaer and S. Al-Haj, "FlowChecker: Configuration
          Analysis and Verification of Federated OpenFlow
          Infrastructures", in Proceedings of the 3rd ACM Workshop on
          Assurable and Usable Security Configuration (SafeConfig '10),
          pp. 37-44.

[OSandS]  S. Wright and D. Druta, "Open Source and Standards: The Role
          of Open Source in the Dialogue between Research and
          Standardization", Globecom Workshops (GC Wkshps), pp.
          650-655, 8-12 December 2014.

15. Acknowledgments

The research leading to these results has received funding from the
European Union Seventh Framework Programme FP7/2007-2013 under grant
agreement no. 619609 - the UNIFY project. The views expressed here are
those of the authors only. The European Commission is not liable for
any use that may be made of the information in this document.

We would like to thank in particular the UNIFY WP4 contributors, the
internal reviewers of the UNIFY WP4 deliverables, and Wolfgang John
from Ericsson for the useful discussions and insightful comments.
This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Catalin Meirosu
Ericsson Research
S-16480 Stockholm, Sweden
Email: catalin.meirosu@ericsson.com

Antonio Manzalini
Telecom Italia
Via Reiss Romoli, 274
10148 - Torino, Italy
Email: antonio.manzalini@telecomitalia.it

Juhoon Kim
Deutsche Telekom AG
Winterfeldtstr. 21
10781 Berlin, Germany
Email: J.Kim@telekom.de

Rebecca Steinert
SICS Swedish ICT AB
Box 1263, SE-16429 Kista, Sweden
Email: rebste@sics.se

Sachin Sharma
Ghent University-iMinds
Research group IBCN - Department of Information Technology
Zuiderpoort Office Park, Blok C0
Gaston Crommenlaan 8 bus 201
B-9050 Gent, Belgium
Email: sachin.sharma@intec.ugent.be

Guido Marchetto
Politecnico di Torino
Corso Duca degli Abruzzi 24
10129 - Torino, Italy
Email: guido.marchetto@polito.it

Ioanna Papafili
Hellenic Telecommunications Organization
Measurements and Wireless Technologies Section
Laboratories and New Technologies Division
2, Spartis & Pelika str., Maroussi,
GR-15122, Attica, Greece
Building E, Office 102
Email: iopapafi@oteresearch.gr

Kostas Pentikousis
EICT GmbH
Torgauer Strasse 12-15
Berlin 10829
Germany
Email: k.pentikousis@eict.de

Steven Wright
AT&T Services Inc.
1057 Lenox Park Blvd NE, STE 4D28
Atlanta, GA 30319
USA
Email: sw3588@att.com