NFV RG                                                             L. Mo
Internet-Draft                                        B. Khasnabish, Ed.
Intended status: Informational                             ZTE (TX) Inc.
Expires: April 3, 2016                                   October 1, 2015

                  NFV Reliability using COTS Hardware
            draft-mlk-nfvrg-nfv-reliability-using-cots-00

Abstract

   This draft discusses the results of a recent study on the
   feasibility of using Commercial Off-The-Shelf (COTS) hardware for
   virtualized network functions in telecom equipment.  In particular,
   it explores the conditions under which COTS hardware can be used in
   the NFV (Network Function Virtualization) environment.  The concept
   of silent error probability is introduced in order to take software
   errors and undetectable hardware failures into account.  The silent
   error probability is included in both the theoretical analysis and
   the simulation work.  Because it is difficult to analyze the impact
   of site maintenance and site failure events theoretically,
   simulation is used to evaluate the impact of these site management
   related events, which constitute an undesirable aspect of using
   COTS hardware in the telecom environment.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 3, 2016.
Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
     2.1.  Abbreviations
   3.  Network Reliability
   4.  Network Part of the Availability
   5.  Theoretical Analysis of the Server Part of System Availability
   6.  Simulation Study of the Server Part of Availability
     6.1.  Methodology
     6.2.  Validation of the Simulator
     6.3.  Simulation Results
     6.4.  Multiple Servers Sharing the Load
   7.  Conclusions
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The use of COTS hardware for network functions (e.g., IMS, EPC) has
   drawn considerable attention in recent years.  Some operators have
   legitimate concerns regarding the reliability of COTS hardware,
   given its reduced MTBF (mean time between failures) and a number of
   attributes that are unfamiliar, and undesirable, in the traditional
   telecom industry.

   In previous reliability studies (e.g., GR-77 [1]), the emphasis was
   placed on hardware failures only.  In this work, besides hardware
   failures, which are characterized by the MTBF and the MTTR (mean
   time to repair), the silent error is introduced to account for
   software errors and hardware failures that are undetectable by the
   management system.

   Silent errors affect the system availability in different ways,
   depending on the particular scenario.

   In a typical system, a server performing certain network functions
   will have another dedicated server as a backup.  This is the normal
   master-slave, or 1+1, redundancy configuration of telecom equipment.

   The server performing the network function is called the "master
   server" and the dedicated backup is called the "slave server".  To
   differentiate the 1+1 redundancy scheme from the 1:1 redundancy
   scheme, the slave server is deemed "dedicated" in the 1+1 case.
   In 1:1 redundancy, both servers perform network functions while
   protecting each other at the same time.

   In any protection scheme, assuming a single fault for clarity of
   discussion, the system availability will not be impacted if the
   slave experiences a silent error and that silent error eventually
   becomes observable in its behavior.  In this case, another slave
   will be identified and the master server will continue to serve the
   network function.  Until the new slave server becomes fully
   functional, the system operates with reduced protection capability.

   On the other hand, if the master server experiences a silent error,
   the data transmitted to the slave server could be corrupted.  In
   this case, the system availability will be impacted when the error
   becomes observable.  On detection of such an error, both the master
   server and the slave server need time to recover.  The time for
   such recovery is fixed in the NFV environment and is deemed to be
   the NFV MTTR.  During this time interval, the network function is
   not available, and the interval is counted as downtime in the
   availability calculations.

   Comparing the MTBF of COTS hardware with that of typical telecom-
   grade hardware, COTS hardware may have a lower MTBF due to its
   relaxed design criteria.

   Comparing the MTTR of COTS hardware with that of typical telecom-
   grade hardware, the COTS time to repair is not a random variable;
   it is effectively fixed.  Hence, the COTS MTTR is the time required
   to bring up a replacement server and make it ready to serve.  For
   traditional telecom hardware, the time to repair is a random
   variable, and the MTTR is the mean of this random variable.
   Because manual intervention is normally required in the traditional
   telecom environment, the NFV COTS MTTR is normally assumed to be
   shorter than the traditional telecom equipment MTTR.

   The most obvious difference between the two hardware types (COTS
   hardware and telecom-grade hardware) is related to maintenance
   procedures and practices.  While telecom equipment takes pains to
   minimize the impact of maintenance on system availability, COTS
   hardware is normally maintained in a much more cavalier fashion
   (e.g., reset first and ask questions later).

   In this study, a closed-form solution is available for one or two
   dedicated backup COTS servers in the NFV environment when the site
   and maintenance related issues are absent.  In order to evaluate
   the site and maintenance related issues, a simulator is constructed
   to study the system availability with one or two dedicated backup
   servers.

   It is shown that, with COTS hardware and all of its undesirable
   features, it is still possible to satisfy telecom availability
   requirements under reasonable conditions.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to
   be interpreted as carrying RFC 2119 significance.
2.1.  Abbreviations

   o  A-N: Network Availability

   o  A-S: Server Availability

   o  A-Sys: System Availability

   o  COTS: Commercial Off-The-Shelf

   o  DC: Data Center

   o  MTBF: Mean Time Between Failures

   o  MTTF: Mean Time To Failure

   o  MTTR: Mean Time To Repair

   o  NFV: Network Function Virtualization

   o  PGUP: Protection Group Up Time

   o  PSTN: Public Switched Telephone Network

   o  SDN: Software-Defined Network/Networking

   o  TET: Total Elapsed Time

   o  VM: Virtual Machine

   o  WDT: Weighted Down Time

3.  Network Reliability

   In the NFV environment, the reliability analysis can be divided
   into two distinct parts: the server part and the network part.  The
   network part connects all the servers through the vSwitch, and the
   server part provides the actual network functions.  This is
   illustrated in Figure 1.

      +--------------------+
      | Availability: A-S  |            Availability: A-N
      |                    |
      |                    |            +---------------+
      |  (VM)              |            |               |
      |  COTS...........................|   vSwitch 1   |
      |  Server.............     .......|               |
      |                    | \   /      |(X) (X) .. (X) |
      |                    |  \ /       +---------------+
      |                    |   X
      |                    |  / \       +---------------+
      |  (VM)              | /   \      |   vSwitch 2   |
      |  COTS............../     \......|               |
      |  Server.........................|(X) (X) .. (X) |
      |                    |            |               |
      +--------------------+            +---------------+

     Figure 1: System Availability - Network Part and Server Part

   If the overall system availability is denoted by A-Sys, it is the
   product of the server part of the system availability (A-S) and the
   network part of the system availability (A-N).

   EQ(1) ... ... ... A-Sys = A-S x A-N

   Given that both A-S and A-N are less than 1 (one), we have A-Sys
   less than A-S and A-Sys less than A-N.  In other words, if FIVE 9s
   are required for the system availability, both the server part and
   the network part of the availability need to be better than FIVE 9s
   so that their product can exceed FIVE 9s.

   To improve the network part of the availability, as illustrated in
   Figure 1, the normal 1+1 protection scheme is utilized.  It should
   be noted that the vSwitch may span a long-distance transmission
   network in order to connect multiple data centers.

   The mechanism used in the server part for improving availability is
   not specified.  In this study, it is assumed that one active server
   will be supported by one or two backup servers.  Normally, if the
   active server is faulty, one of the backup server(s) will take over
   the responsibility, and hence there will be no loss of availability
   on the server part.

   There is a significant difference between the NFV environment and
   dedicated traditional telecom equipment with respect to the time
   needed to recover from a server fault.  In the traditional telecom
   equipment case, a manual replacement of some equipment (e.g., a
   faulty board) is normally required, and hence the time for
   restoration after experiencing a fault, normally denoted as the
   MTTR (Mean Time To Repair), is long.

   In the NFV environment, the time for restoration is the time
   required to boot another virtual machine (VM) with the needed
   software and to re-synchronize the data.  Hence, the MTTR in the
   NFV environment can be considered to be shorter than that of
   traditional telecom equipment.
   More importantly, the MTTR in the NFV environment can be considered
   to be a fixed constant.

   It is also understood that multiple servers may be active to share
   the load.  Contrary to common-sense belief, this arrangement will
   neither increase nor decrease the overall availability if those
   active servers are supported by one or two backup servers.  This
   fact will be elaborated in a later section, from both a theoretical
   point of view and by simulations.

4.  Network Part of the Availability

   Traditional analysis can be applied to the network part of the
   availability.  The network part of the availability is determined
   by the availability of the switching elements that make up the
   vSwitch and by the maximum number of hops in the vSwitch.  The
   vSwitch connects the VMs in the NFV environment.

   If A-n denotes the availability of a network element, then for a
   vSwitch with a maximum of h hops, the availability of the vSwitch
   would be (A-n)^h.  Hence, considering the 1+1 configuration of the
   vSwitch, A-N can be expressed as

   EQ(2) ... ... ... A-N = 1 - (1 - (A-n)^h)^2

   The network availability, as a function of the number of hops (h)
   and the per-network-element availability (A-n), is illustrated in
   Figure 2.

   While this 3-D illustration shows the general trend in network
   availability, Table-1 gives more detail on the network availability
   for different hop counts and different network element
   availabilities.

   Table-1: Network Part of the System Availability for Various
   Network Element Availabilities and Hop Counts

   +--------------------+---------+----------+----------+----------+----------+
   | Network Element    |   10    |    16    |    22    |    26    |    30    |
   | Availability \ Hops|         |          |          |          |          |
   +--------------------+---------+----------+----------+----------+----------+
   | 0.99               | 0.99086 | 0.977935 | 0.96065  | 0.94712  | 0.932244 |
   | 0.999              | 0.99990 | 0.999748 | 0.999526 | 0.999341 | 0.999126 |
   | 0.9999             | 0.99999 | 0.999997 | 0.999995 | 0.999993 | 0.999991 |
   | 0.99999            | 1       | 1        | 1        | 1        | 1        |
   +--------------------+---------+----------+----------+----------+----------+

            +------------------------------------------+
           /                                          / |
          /                 Five 9s                  /  |
         /                                          /   |
        /               ... ... -/                 /    |
       /    ... ... ...  .....                    /     |
      +-----------------------------------------+      +..0.99999
      |  .   ....                                |      /
      | ....  ....  ...  . . . . .               |     / 0.9999
      |                                        . |    /      Net Element
      |                                         .|   /       Availability
      |                                          |  / 0.999
      |                                          | /
      +-----------------------------------------+..0.99
        2        8        12        18        24
        ..............Hop Count..........>

    Figure 2: Network Part of the System Availability with Different
          Hop Counts and Different Network Element Availability

   In order to achieve the FIVE 9s availability normally demanded by
   telecommunication operators, the network element availability needs
   to be at least FOUR 9s if the hop count is more than 10.

   In fact, in order to achieve FIVE 9s while the per-network-element
   availability is only THREE 9s, the hop count needs to be less than
   two, which is deemed impractical.
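   The relationships in EQ(1) and EQ(2) can be checked numerically.
   The following short Python sketch is illustrative only (the
   function and variable names are local to this example and are not
   defined by this document); it reproduces the entries of Table-1 and
   shows how a FIVE 9s system target constrains both factors of EQ(1).

   # Illustrative check of EQ(1) and EQ(2); names are chosen for this
   # example only.

   def network_availability(a_n, hops):
       """EQ(2): 1+1 protected vSwitch path of 'hops' elements, each
       with availability 'a_n'."""
       path = a_n ** hops
       return 1.0 - (1.0 - path) ** 2

   def system_availability(a_s, a_n_total):
       """EQ(1): product of the server part and the network part."""
       return a_s * a_n_total

   if __name__ == "__main__":
       # Reproduce Table-1.
       for a_n in (0.99, 0.999, 0.9999, 0.99999):
           row = [round(network_availability(a_n, h), 6)
                  for h in (10, 16, 22, 26, 30)]
           print(a_n, row)

       # Both factors of EQ(1) must exceed the system target: even a
       # FIVE 9s server part with a FIVE 9s network part only yields
       # approximately 0.99998.
       print(system_availability(0.99999, 0.99999))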
5.  Theoretical Analysis of the Server Part of System Availability

   In GR-77 [1], extensive analysis has been performed for systems
   under various conditions.  In the NFV environment, if the server
   availability is denoted by Ax, the server part of the system
   availability (As), for a 1+1 master-slave configuration, is given
   in [1], Part D, Chapter 6, as

   EQ(3) ... ... ... As = 1 - (1-Ax)^2

   In a more practical environment, there will be silent errors
   (errors that cannot be detected by the system under consideration).
   The silent error probability is denoted by Pse.

   We further assume that the silent error only affects the master of
   the system, because the master is the one with the ability to
   corrupt the data.  In practical engineering terms, this assumption
   can be articulated as follows: when an error is detected and there
   is no obvious cause for it, the master-slave configuration will
   assume that the master is correct, while the slave will go through
   an MTTR time to recover.

   The state transitions can be illustrated as in the following
   diagram:

   Figure 3: State Transition for a System with Only One Backup
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the state transition diagram outlined in Figure 3, the system
   availability in a 1+1 master-slave configuration can be expressed
   as follows.

   EQ(4a) ... ... ... As = 1 - [(1-Ax)^2 + Pse(Ax)(1-Ax)]

   EQ(4b) ... ... ... As = (2-Pse)(Ax) - (1-Pse)(Ax)^2

   The following diagram (Figure 4) illustrates the server part of the
   availability for different per-server availabilities and different
   silent error probabilities.

   Figure 4: Server Part of the System Availability with Various
   Server Availabilities and Silent Error Probabilities
   (Note: a dot-and-dash version of the diagram is being developed.)

   While the graphics illustrate the trends, the following data table
   gives precise values for the single-backup (1+1) configuration.

   Table-2: Server Part of the Availability for Different Silent Error
   Probabilities (Pse) and Server Availabilities (Ax) in the Single-
   Backup Configuration

   +----------+---------+-----------+-------------+----------+
   | Pse \ Ax | 0.99000 |  0.99900  |   0.99990   | 0.99999  |
   +----------+---------+-----------+-------------+----------+
   |   0.0    | 0.9999  | 0.999999  | 0.99999999  | 1.0      |
   |   0.1    | 0.99891 | 0.9998991 | 0.999989991 | 0.999999 |
   |   0.2    | 0.99792 | 0.9997992 | 0.999979992 | 0.999998 |
   |   0.3    | 0.99693 | 0.9996993 | 0.999969993 | 0.999997 |
   |   0.4    | 0.99594 | 0.9995994 | 0.999959994 | 0.999996 |
   |   0.5    | 0.99495 | 0.9994995 | 0.999949995 | 0.999995 |
   |   0.6    | 0.99396 | 0.9993996 | 0.999939996 | 0.999994 |
   |   0.7    | 0.99297 | 0.9992997 | 0.999929997 | 0.999993 |
   |   0.8    | 0.99198 | 0.9991998 | 0.999919998 | 0.999992 |
   |   0.9    | 0.99099 | 0.9990999 | 0.999909999 | 0.999991 |
   |   1.0    | 0.99    | 0.999     | 0.9999      | 0.99999  |
   +----------+---------+-----------+-------------+----------+

   The entries at or above 0.99999 in the above table mark the region
   in which FIVE 9s availability is achievable.  As the table shows,
   the server part of the availability deteriorates rapidly with
   increasing silent error probability.  While FIVE 9s of availability
   can be achieved with a server availability of only THREE 9s when
   there is no silent error, FIVE 9s of server availability is
   demanded as soon as the silent error probability reaches 10%.
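   As a cross-check of EQ(4a)/EQ(4b) and Table-2, the short sketch
   below evaluates the single-backup expression for one row of the
   table.  It is illustrative only; the names are local to this
   example.

   # Illustrative evaluation of EQ(4b); not part of any specification.

   def single_backup_availability(ax, pse):
       """EQ(4b): server part of the availability with one dedicated
       backup and silent error probability 'pse' on the master."""
       return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2

   if __name__ == "__main__":
       # Reproduce the Pse = 0.1 row of Table-2.
       for ax in (0.99, 0.999, 0.9999, 0.99999):
           print(ax, round(single_backup_availability(ax, 0.1), 9))
       # Expected: 0.99891, 0.9998991, 0.999989991, ~0.999999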
   While the 1+1 configuration illustrated above seems reasonable for
   the server part of the system availability (As), there may be cases
   demanding more than a 1+1 configuration for reliability.  For
   systems with two backups, the availability, without consideration
   of the silent error, can be expressed as ([1], Part D, Chapter 6)

   EQ(5) ... ... ... As = 1 - (1-Ax)^3

   With the introduction of the silent error probability, the state
   transitions can be expressed in the following diagram:

   Figure 5: Error State Transition for a System with Two Backups
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the introduction of the silent error, and observing the state
   transitions above, assuming that the silent error event and the
   server fault event are independent (e.g., a software error as the
   cause of the silent error and a hardware failure as the cause of
   the server fault), the server part of the availability for the
   dual-backup case is given by

   EQ(6a) ... ... As = 1 - [(1-Ax)^3 + Pse(1-Ax)((Ax)^2 + 2(Ax)(1-Ax))]

   EQ(6b) ... ... As = (3-2Pse)(Ax) - 3(1-Pse)(Ax)^2 + (1-Pse)(Ax)^3

   It should be noted that, when Pse = 1, both EQ(4) and EQ(6) reduce
   to As = Ax; that is, the server part of the system availability and
   the server availability are the same.  This relationship is to be
   expected since, if the master always experiences the silent error,
   the backups are useless because they are corrupted all the time.

   The system availability with dual backups can be illustrated as
   follows for different server availabilities and different silent
   error probabilities, including software malfunctions.

   Figure 6: Server Part of the System Availability with Various
   Silent Error Probabilities and Server Availabilities for a Dual-
   Backup System
   (Note: a dot-and-dash version of the diagram is being developed.)

   As in the previous case, the diagram only illustrates the trend,
   while the following table provides precise data for the system
   availability under different silent error probabilities and server
   availabilities for the dual-backup case.

   Table-3: System Availability for Different Silent Error
   Probabilities (Pse) and Server Availabilities (Ax) in the Dual-
   Backup Configuration

   +----------+-----------+-------------+---------+----------+
   | Pse \ Ax | 0.99000   |   0.99900   | 0.99990 | 0.99999  |
   +----------+-----------+-------------+---------+----------+
   |   0.0    | 0.999999  | 0.999999999 | 1.0     | 1.0      |
   |   0.1    | 0.9989991 | 0.999899999 | 0.99999 | 0.999999 |
   |   0.2    | 0.9979992 | 0.999799999 | 0.99998 | 0.999998 |
   |   0.3    | 0.9969993 | 0.999699999 | 0.99997 | 0.999997 |
   |   0.4    | 0.9959994 | 0.999599999 | 0.99996 | 0.999996 |
   |   0.5    | 0.9949995 | 0.9995      | 0.99995 | 0.999995 |
   |   0.6    | 0.9939996 | 0.9994      | 0.99994 | 0.999994 |
   |   0.7    | 0.9929997 | 0.9993      | 0.99993 | 0.999993 |
   |   0.8    | 0.9919998 | 0.9992      | 0.99992 | 0.999992 |
   |   0.9    | 0.9909999 | 0.9991      | 0.99991 | 0.999991 |
   |   1.0    | 0.99      | 0.999       | 0.9999  | 0.999990 |
   +----------+-----------+-------------+---------+----------+

   As in Table-2, the entries in Table-3 at or above 0.99999 represent
   FIVE 9s capability.  Comparing the two tables, the dual backup is
   of marginal advantage over the single backup except for the case
   where there is no silent error.  In that case, the FIVE 9s server
   part of the system availability can be achieved with a server
   availability of only TWO 9s.
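   The marginal benefit of the second backup in the presence of silent
   errors can be seen by evaluating EQ(4b) and EQ(6b) side by side, as
   in the illustrative sketch below (the names are local to this
   example).

   # Illustrative comparison of EQ(4b) (single backup) and EQ(6b)
   # (dual backup); not part of any specification.

   def single_backup(ax, pse):
       return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2      # EQ(4b)

   def dual_backup(ax, pse):
       return ((3.0 - 2.0 * pse) * ax
               - 3.0 * (1.0 - pse) * ax ** 2
               + (1.0 - pse) * ax ** 3)                     # EQ(6b)

   if __name__ == "__main__":
       ax = 0.999    # THREE 9s server availability
       for pse in (0.0, 0.1, 0.5):
           print(pse, round(single_backup(ax, pse), 9),
                 round(dual_backup(ax, pse), 9))
       # With Pse = 0 the second backup removes another factor of
       # (1-Ax) from the unavailability; with Pse >= 0.1 both schemes
       # are dominated by the Pse x (1-Ax) term and differ only
       # marginally.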
   From the data above, we can conclude that the silent error,
   introduced by software errors or by hardware errors that are not
   detectable by software, plays an important role in the server part
   of the system availability, and hence in the final system
   availability.  In fact, it becomes the dominant element if Pse is
   more than 10%, in which case the difference between a single backup
   and dual backups is not significant.

   Some operators are of the opinion that a new approach to the
   availability requirements is needed.  COTS hardware is assumed to
   have lower availability than traditional telecom hardware.
   However, since each server (or VM) in the NFV environment will only
   affect a small number of users, the traditional FIVE 9s requirement
   could perhaps be relaxed while keeping the same user experience in
   terms of downtime.  In other words, the weighted downtime, in
   proportion to the number of users, might be expected to be reduced
   in the NFV environment, because each server affects only a small
   number of users for a given server reliability.

   Unfortunately, from the theoretical point of view, this is not
   true.  Each server's downtime will indeed affect only a small
   number of users, but multiple active servers also present more
   opportunities for server faults (this is similar to the well-known
   reliability argument for the twin-engine Boeing 777).  As long as
   the protection scheme, or more importantly, the number of backups,
   is the same, the eventual system availability will be the same,
   regardless of what portion of the users each server serves.

6.  Simulation Study of the Server Part of Availability

   In the above theoretical analysis of the server part of the
   availability, the following factors are not considered: (A) site
   maintenance (e.g., software upgrades, patches, etc., affecting the
   whole site), and (B) site failure (earthquake, etc.).

   While traditional telecom-grade equipment invests a great deal of
   emphasis and engineering complexity to ensure smooth migration,
   smooth software upgrades, and smooth patching procedures, COTS
   hardware and its related software are notorious for lacking such
   capabilities.  This is the primary reason for operators to hesitate
   to utilize COTS hardware, even though COTS hardware in the NFV
   environment does have an improved MTTR compared to traditional
   telecom hardware.

   While it is relatively easy to obtain a closed form of the system
   availability for the ideal case without site-related issues, it is
   extremely difficult to obtain an analytical solution when site
   issues are involved.  In this case, we resort to numerical
   simulation under reasonable assumptions [2], [3], [4].

6.1.  Methodology

   In this section, the various assumptions and the outline of the
   simulation mechanisms are discussed.

   A discrete event simulator is constructed to obtain the
   availability of the server part.  In the simulator, an active
   server (the master server, which processes the network traffic) is
   supported by one (single backup) or two (dual backup) servers in
   other site(s).

   For the failure probability of the server, it is common to assume a
   bathtub-shaped failure rate (Weibull distribution).  In practice,
   we need to ensure that the NFV management provides servers that are
   operating on the flat part of the bathtub curve.  In this case, the
   familiar exponential distribution can be utilized, as illustrated
   in the sketch below.
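   The following sketch shows why the exponential distribution
   corresponds to the flat part of the bathtub curve: the Weibull
   hazard rate is decreasing for shape k < 1 (infant mortality),
   constant for k = 1 (useful life, i.e., the exponential case), and
   increasing for k > 1 (wear-out).  It is illustrative only, and the
   parameter names are not taken from this document.

   # Illustrative Weibull hazard-rate sketch; not part of any
   # specification.  h(t) = (k/lam) * (t/lam)**(k-1).

   def weibull_hazard(t, shape_k, scale_lam):
       """Instantaneous failure rate of a Weibull distribution."""
       return (shape_k / scale_lam) * (t / scale_lam) ** (shape_k - 1)

   if __name__ == "__main__":
       scale = 10000.0  # scale parameter; equals the MTBF when k = 1
       for k in (0.5, 1.0, 2.0):   # infant mortality / useful life /
                                   # wear-out
           rates = [weibull_hazard(t, k, scale)
                    for t in (100.0, 1000.0, 5000.0)]
           print(k, [round(r, 8) for r in rates])
       # For k = 1 the hazard is constant at 1/MTBF, which is exactly
       # the memoryless exponential model used in the simulator.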
   In the discrete event simulator, each server is scheduled to work
   for a certain duration of time.  This duration is a random variable
   with an exponential distribution, which is commonly used to model
   server behavior during its useful life, with the mean given by the
   MTBF of the server.

   In fact, the flat part of the bathtub curve can be related to the
   normal server MTBF (mean time between failures) through the failure
   density function, expressed as f(x) = (1/MTBF) x e^(-x/MTBF).

   After the working duration, the server will be down for a fixed
   time duration, which represents the time needed to start another
   virtual machine to replace the one in trouble.  This aspect is
   different from traditional telecom-grade equipment.  Here, the
   assumption is that there will always be another server available to
   replace the one that went down.  Hence, regardless of the nature of
   the fault, the downtime for a server fault is fixed and represents
   the time needed to have another server ready to take over the task.

   The following diagram shows this arrangement for a system with only
   one backup.  It should be noted that, while the server up-time
   duration is variable, the server downtime is fixed.

   Figure 7: The Life of the Servers
   (Note: a dot-and-dash version of the diagram is being developed.)

   The servers are hosted in "sites", which are considered to be data
   centers.  In this simulation, during the initial setup, the servers
   supporting each other for reliability purposes are hosted in
   different sites.  This is to minimize the impact of site failure
   and site maintenance.

   In order to model the system behavior with one or two backups, the
   concept of a protection group is introduced.

   A protection group consists of a "master" server with one or two
   "slave" server(s) in other site(s).  There may be multiple
   protection groups in the network, with each protection group
   serving a fraction of the users.

   A protection group is considered to be "down" if every server in
   the group is dead.  While the protection group is "down", the
   network service is affected, and the network is considered to be
   "down" for the group of users this protection group is responsible
   for.

   The uptime and downtime of the protection group are recorded in the
   discrete event simulator.  The server part of the availability is
   given by (where the total elapsed time is the total simulation time
   in the discrete event simulator)

   EQ(7) ... ... Availability (server part) = PGUP / TET, where

   o  PGUP is the Protection Group Up Time

   o  TET is the Total Elapsed Time

   The concepts of protection group, site, and server can be
   illustrated as follows (Figure 8) for a system with two backups.
   It should be noted that the protection group is an abstract
   concept, and that its portion of the network function is
   unavailable if and only if all the servers in the protection group
   are not functioning.

   Figure 8: Servers, Sites, and Protection Group
   (Note: a dot-and-dash version of the diagram is being developed.)

   A minimal sketch of this simulation approach is given below.
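   The following is a minimal sketch of the discrete event simulation
   described above, restricted to a single 1+1 protection group with
   exponentially distributed up-times, a fixed MTTR, and a silent
   error probability on the master.  It is illustrative only: all
   names and the simplified event handling are assumptions of this
   example rather than the authors' simulator, and site maintenance
   and site failure events are not modelled here.

   # Minimal, illustrative 1+1 protection-group simulator; a
   # simplification of the methodology above, not the authors'
   # simulator.
   import random

   def simulate_one_plus_one(mtbf, mttr, pse, horizon, seed=1):
       """Estimate EQ(7), PGUP/TET, for a single 1+1 protection group."""
       rng = random.Random(seed)

       def draw_uptime():
           return rng.expovariate(1.0 / mtbf)

       up = [True, True]                      # server states
       fail_at = [draw_uptime(), draw_uptime()]
       repair_at = [0.0, 0.0]
       master = 0
       clock, down_since, down_time = 0.0, None, 0.0

       while clock < horizon:
           # Next event: failure of an up server or repair of a down one.
           t, i = min((fail_at[j] if up[j] else repair_at[j], j)
                      for j in (0, 1))
           clock = t
           if up[i]:
               if i == master and rng.random() < pse:
                   # Silent error on the master: both servers need a
                   # full MTTR to recover; the group is down meanwhile.
                   for j in (0, 1):
                       up[j] = False
                       repair_at[j] = clock + mttr
               else:
                   # Observable fault: only the failed server is
                   # replaced after a fixed MTTR; the survivor serves.
                   up[i] = False
                   repair_at[i] = clock + mttr
                   if i == master:
                       master = 1 - i
               if not (up[0] or up[1]) and down_since is None:
                   down_since = clock
           else:
               # A replacement server becomes ready after the fixed MTTR.
               up[i] = True
               fail_at[i] = clock + draw_uptime()
               if down_since is not None:
                   down_time += clock - down_since
                   down_since = None
               if not up[1 - i]:
                   master = i
       if down_since is not None:
           down_time += clock - down_since
       return 1.0 - down_time / clock

   if __name__ == "__main__":
       # MTBF of 10000 and MTTR of 6 (in the same time unit), with 20%
       # silent errors; compare against EQ(4b) with Ax close to
       # MTBF / (MTBF + MTTR).
       print(simulate_one_plus_one(10000.0, 6.0, 0.2, 5.0e7))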
   Even though the simulator allows each site to host a configurable
   number of servers, there is little use for this arrangement.  The
   system availability will not change regardless of how many servers
   per site are used to support the system, as long as there is no
   change in the number of servers in each protection group.
   Increasing the number of servers per site essentially increases the
   number of protection groups.  Over a long time duration, each
   protection group will experience similar downtime for the same up
   time (i.e., it will have the same availability).

   As in the theoretical analysis, the silent error, due to software
   or subtle hardware failures, will only affect the active (or
   master) server.  When the master server fails with a silent error,
   both the master and the "slave" servers will go through an MTTR
   time to recover (e.g., the time to instantiate two VMs
   simultaneously).  During this time, this part of the system (i.e.,
   this protection group) is considered to be under fault.

   In this reliability study, the focus is on the number of backups
   for each protection group, where the 1+1 configuration is the
   typical configuration for a single-backup mechanism.  A load-
   sharing arrangement such as 1:1 can be viewed as two protection
   groups.

   In general, the load-sharing scheme will have lower availability
   because, in the 1:1 case, any server fault results in faults in two
   different protection groups.  This can be extended to the 1:2 case,
   where three protection groups are involved and any server fault
   introduces faults in three different protection groups.  In this
   study, the load-sharing mechanisms will not be elaborated further.

   A site will also go through maintenance work.  Traditional telecom-
   grade equipment and COTS hardware differ mainly in this respect.
   For telecom-grade equipment, minimal impact on system performance
   or system availability is maintained during the maintenance window.
   For COTS hardware, however, the maintenance work may be more
   frequent and more disruptive.

   In order to simulate the maintenance aspect of COTS hardware, the
   simulator puts a site "under maintenance" at random times.  The
   interval during which a site is working is assumed to be an
   exponentially distributed random variable, with a mean that is
   configurable in the simulator.  The duration of the maintenance is
   a uniformly distributed random variable with a configured mean,
   minimum, and maximum.  A sketch of how such site events can be
   drawn is given below.
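   As an illustration of the site model only, site maintenance events
   could be drawn as follows.  The parameter values are those used in
   Section 6.3, and the use of a plain uniform draw between the
   minimum and maximum durations is an assumption of this sketch.

   # Illustrative sampling of site maintenance events; not the
   # authors' simulator.  Times are in hours.
   import random

   rng = random.Random(7)

   MEAN_TIME_BETWEEN_MAINTENANCE = 1000.0        # exponential mean
   MAINTENANCE_MIN, MAINTENANCE_MAX = 4.0, 48.0  # uniform duration

   def next_maintenance_window(now):
       """Return (start, end) of the next maintenance window after
       'now'."""
       start = now + rng.expovariate(1.0 / MEAN_TIME_BETWEEN_MAINTENANCE)
       duration = rng.uniform(MAINTENANCE_MIN, MAINTENANCE_MAX)
       return start, start + duration

   if __name__ == "__main__":
       t = 0.0
       for _ in range(3):
           start, end = next_maintenance_window(t)
           print(round(start, 1), round(end, 1))
           t = end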
   In order to put a site "under maintenance", there must be no fault
   in the network.  All the servers on the site to be put "under
   maintenance" are moved to other sites.  Hence, no traffic is
   impacted during the process of putting the site under maintenance.
   Of course, the resilience against site failure is reduced while a
   site is under maintenance.

   When a site comes back from maintenance, it will attempt to reclaim
   all the server responsibilities that were transferred because of
   the site maintenance.

   o  For each protection group, if every server is working, the
      protection group will re-arrange the protection relationship so
      that each site hosts only one server of the protection group.
      The new server on the site back from maintenance will need an
      MTTR time to be ready as a backup.  In this case, there is no
      loss of service in the system.

   o  For each protection group, if there is at least one server
      working and at least one in a fault condition, one working
      server will be added to the protection group.  The new server on
      the site back from maintenance will need an MTTR time to be
      ready as a backup.  In this case, there is no loss of service in
      the system.

   o  For each protection group, if no server is working, the
      protection group will gain a working server from the site back
      from maintenance.  The new server will need an MTTR time to be
      ready for service.  In this case, the system will provide
      service only after the new server is ready.

   A site can also be under fault (e.g., loss of power, operation
   under reduced capability due to thermal issues, or an earthquake).
   The simulator can also simulate the effect of such events, with the
   site up duration being an exponentially distributed random variable
   with a configurable mean.  The site failure duration is expressed
   as a uniformly distributed random variable with a configurable
   mean, minimum, and maximum.

6.2.  Validation of the Simulator

   In order to verify the correctness of the simulator (e.g., the
   random number generator, the overall program structure, etc.), the
   simulation is performed with various server availabilities and
   various silent error probabilities and compared against the
   closed-form results of Section 5.

   For the single-backup case, the error between the theoretical and
   simulated system availability on the server part is illustrated in
   the following diagram (Figure 9).

   Figure 9: Verification of the Simulator for the Single-Backup Case
   (Note: a dot-and-dash version of the diagram is being developed.)

   As can be seen, the magnitude of the errors is within 10^(-5),
   which is very small considering that the nominal value of the
   system availability for the server part is close to 1.0.  For the
   dual-backup case, the error between the simulated and theoretical
   system availability for different silent error probabilities and
   server availabilities is illustrated as follows (Figure 10).

   Figure 10: Verification of the Simulator for the Dual-Backup Case
   (Note: a dot-and-dash version of the diagram is being developed.)

   This is similar to the single-backup case, where the errors are
   within the same range.  This error information gives us the needed
   confidence in the simulation results for the more complicated cases
   where analytical solutions are elusive.

6.3.  Simulation Results

   The effect of the MTTR in the NFV environment is studied first.
   The combined effect of the MTTR and the silent error probability is
   shown below:

   Figure 11: Availability with Various Silent Error Probabilities for
   Different MTTRs
   (Note: a dot-and-dash version of the diagram is being developed.)

   In the diagram (Figure 11), R6 represents an MTTR of 6 minutes,
   while R60 represents an MTTR of 60 minutes.  The x-axis is the
   silent error probability.  As shown, the MTTR (the time to recover
   from a fault, or the time for a VM rebirth) affects the slope of
   the system availability, which declines as the silent error
   probability increases.  In the above example, the server MTBF is
   assumed to be 10000 hours, which represents a server availability
   of 0.9994 for the R6 case and 0.994 for the R60 case.

   The two curves starting at approximately 1.0 show the system
   availability with dual backups, while the other two show the system
   availability with a single backup.  It should be noted that, for
   the dual-backup case, there is little difference in availability
   for different MTTRs when there is no silent error.  Intuitively,
   this is expected due to the added number of backup servers.
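   For reference, a sketch of one common steady-state relation between
   MTBF, MTTR, and per-server availability is given below.  This
   relation is an assumption of this example (it is not stated in this
   draft), and it reproduces the per-server availabilities quoted
   above when the MTBF and the MTTR are expressed in a common time
   unit.

   # Assumed steady-state relation, not taken from this draft:
   #     Ax = MTBF / (MTBF + MTTR)

   def server_availability(mtbf, mttr):
       """Both arguments must use the same time unit."""
       return mtbf / (mtbf + mttr)

   if __name__ == "__main__":
       print(round(server_availability(10000.0, 6.0), 4))   # ~0.9994
       print(round(server_availability(10000.0, 60.0), 4))  # ~0.994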
   In this simulation, both site failure (with a mean time between
   site failures of 20000 hours) and site maintenance (with a mean
   time between site maintenance events of 1000 hours) are considered.
   The mean site failure duration is assumed to be 12 hours (uniformly
   distributed between 4 hours and 24 hours), and the mean site
   maintenance duration is 24 hours (uniformly distributed between 4
   hours and 48 hours).

   The next step is to evaluate the impact of the site issues (site
   failure and site maintenance).  Consider the rather bad site
   outlined above, for which the mean time between site failures is 2
   times the server MTBF and the mean time between site maintenance
   events is 0.1 times the server MTBF.  The availability of the
   server part for this site is illustrated below for different silent
   error probabilities and server availabilities in the single-backup
   configuration.

   Figure 12: Availability of the Server Part in the Single-Backup
   Configuration
   (Note: a dot-and-dash version of the diagram is being developed.)

   As the data illustrate, in order to achieve high availability, the
   server availability needs to be very high.  In fact, the server
   availability needs to be in the range of FIVE 9s in order to
   achieve a system availability of FIVE 9s under these site-related
   issues.  For the dual-backup system with exactly the same
   configuration, the result is better and can be illustrated as
   follows:

   Figure 13: Availability of the Server Part in the Dual-Backup
   Configuration
   (Note: a dot-and-dash version of the diagram is being developed.)

   With a server availability of FOUR 9s and low silent error
   probabilities, the server part of the availability can achieve FIVE
   9s.  Now consider a site with fewer issues, one whose mean time
   between site failures is 100 times the server MTBF and whose mean
   time between site maintenance events is 0.1 times the server MTBF.
   The mean site failure duration is again assumed to be 12 hours
   (uniformly distributed between 4 hours and 24 hours), and the mean
   site maintenance duration is 24 hours (uniformly distributed
   between 4 hours and 48 hours).  The results for the single-backup
   system are as follows:

   Figure 14: Server Part of the Availability for a Good Site with a
   Single Backup
   (Note: a dot-and-dash version of the diagram is being developed.)

   The following data table (Table-4) gives precise information on
   these simulation results.

   Table-4: Details of the Availability of the Server Part for a
   Single Backup on a Good Site

   +----------+----------+----------+------------+------------+
   | Pse \ Ax | 0.990099 | 0.999001 | 0.99990001 | 0.99999    |
   +----------+----------+----------+------------+------------+
   |   0.0    | 0.998971 | 0.999959 | 0.9999992  | 1.0        |
   |   0.1    | 0.997918 | 0.999857 | 0.99998959 | 0.99999901 |
   |   0.2    | 0.996908 | 0.999771 | 0.99997957 | 0.99999804 |
   |   0.3    | 0.995999 | 0.999674 | 0.99996935 | 0.99999695 |
   +----------+----------+----------+------------+------------+

   As evidenced in the table above, the server part of the system
   availability is impacted by the silent error, and a single
   redundant server provides only marginal improvement unless the
   silent error probability is small.  The corresponding results for
   the dual-backup configuration are shown in Figure 15.

   Figure 15: Server Part of the Availability for a Good Site with a
   Dual Backup
   (Note: a dot-and-dash version of the diagram is being developed.)
   The diagram above gives the general trend in system availability,
   and the following data table provides the precise values.

   Table-5: Details of the Availability of the Server Part for a Dual
   Backup on a Good Site

   +----------+------------+------------+------------+------------+
   | Pse \ Ax | 0.99009901 | 0.999001   | 0.99990001 | 0.99999    |
   +----------+------------+------------+------------+------------+
   |   0.0    | 0.9999939  | 0.99999998 | 1.0        | 1.0        |
   |   0.2    | 0.9981346  | 0.99980209 | 0.99998048 | 0.99999792 |
   |   0.4    | 0.99615083 | 0.99960136 | 0.99996002 | 0.99999594 |
   |   0.5    | 0.99522474 | 0.9995184  | 0.99995225 | 0.99999503 |
   +----------+------------+------------+------------+------------+

   From the tables for the single and dual backups, we can see that
   the dual backup provides only marginal benefit in the face of site
   issues.  Given that site issues are inevitable in practice, a
   geographically distributed single-backup system is recommended for
   simplicity.

6.4.  Multiple Servers Sharing the Load

   In this section, we outline the simulation results for cases in
   which multiple servers carry the active workload.  In such cases,
   the failure of a protection group affects a smaller number of
   users.

   In the simulation, each site has N servers to serve the workload.
   A weighted uptime and a weighted downtime are introduced.  The
   system availability is the weighted uptime divided by the total of
   the weighted uptime and the weighted downtime.

   EQ(8) ... ... Weighted-Availability (server part) = (TET - WDT)/TET,
   where

   o  TET is the Total Elapsed Time

   o  WDT is the Weighted Down Time

   If any protection group i is down, the WDT is updated as follows:

   EQ(9) ... ... WDT = WDT + [Protection Group i Down Time] / N

   For a system with three protection groups (i.e., three servers
   sharing the workload), the availability of each protection group,
   as well as the weighted availability, is obtained as follows
   (Table-6):

   Table-6: Availability of the Protection Groups and the Weighted
   Availability (Dual Backup)

   +-----+-------------+-------------+-------------+-------------+--------------+
   | Pse |     PG 1    |     PG 2    |     PG 3    |   Weighted  |  PG Average  |
   |     |             |             |             |             |  - Weighted  |
   +-----+-------------+-------------+-------------+-------------+--------------+
   | 0.0 | 1.0         | 1.0         | 1.0         | 1.0         | 0.0          |
   | 0.2 | 0.999998015 | 0.999998005 | 0.999997985 | 0.999998001 | 6.66668E-11  |
   | 0.4 | 0.999996027 | 0.999996018 | 0.999995988 | 0.999996011 | -3.33333E-11 |
   +-----+-------------+-------------+-------------+-------------+--------------+

   (Pse: silent error probability; PG n: availability of protection
   group n; Weighted: measured weighted availability; PG Average -
   Weighted: difference between the average of the protection group
   availabilities and the weighted availability.)

   In this case, there is little difference between the different
   protection groups.  The weighted availability is effectively the
   average of the availabilities of all the protection groups.  This
   also illustrates the fact that, regardless of how many servers
   share the active load, the system availability will be the same as
   long as (A) the number of backups is the same, and (B) each
   server's availability is the same.  The weighted-downtime
   bookkeeping of EQ(8) and EQ(9) is sketched below.
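   A minimal sketch of the weighted-downtime bookkeeping of EQ(8) and
   EQ(9) follows.  It is illustrative only; the class and attribute
   names are assumptions of this example.

   # Illustrative bookkeeping for EQ(8)/EQ(9); not the authors'
   # simulator.

   class WeightedDowntime:
       """Accumulates weighted downtime over N load-sharing protection
       groups and reports the weighted availability of the server
       part."""

       def __init__(self, num_groups):
           self.n = num_groups
           self.wdt = 0.0        # Weighted Down Time

       def record_group_downtime(self, duration):
           # EQ(9): each group's downtime counts with weight 1/N.
           self.wdt += duration / self.n

       def weighted_availability(self, total_elapsed_time):
           # EQ(8): (TET - WDT) / TET.
           return (total_elapsed_time - self.wdt) / total_elapsed_time

   if __name__ == "__main__":
       book = WeightedDowntime(num_groups=3)
       tet = 1.0e6                      # total elapsed (simulated) time
       for outage in (2.0, 3.0, 1.0):   # one outage per protection group
           book.record_group_downtime(outage)
       print(book.weighted_availability(tet))   # 0.999998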
7.  Conclusions

   The system availability can be divided into two parts: the
   availability of the network part and the availability of the server
   part.  The final system availability is the product of these two
   parts.

   The network part of the system availability is determined by the
   maximum number of hops and the individual network element
   availability, with the fault-tolerant setup assumed to be 1+1.  The
   server part of the system availability is mainly determined by the
   following parameters:

   o  Availability of each individual server

   o  Silent error probability

   o  Site related issues (maintenance, fault)

   o  Protection scheme (one or two dedicated backups)

   The silent error is introduced to account for software errors and
   for hardware errors that cannot be detected.  The system
   availability of the server part will be dominated by such silent
   errors if the silent error probability is more than 10%.  This is
   shown in both the theoretical work and the simulations.

   It is interesting to note that the dual-backup scheme provides only
   marginal benefits, and the added complexity may not warrant such a
   practice in a real network.

   It is possible for COTS hardware to provide availability as high as
   that of traditional telecom hardware if the server itself has
   reasonably high availability.  The undesirable attributes of COTS
   hardware have been modelled as the site-related issues, such as
   site maintenance and site failure, which are not applicable to
   traditional telecom hardware.  Hence, in calculating the server
   availability itself, the site-related issues are excluded.

   It is critical for the virtualization infrastructure management to
   provide as much hardware failure information as possible in order
   to improve the availability of the application.  As seen in both
   the theoretical work and the simulation, the silent error
   probability becomes a dominant factor in the final availability.
   The silent error probability can be reduced if the virtualization
   infrastructure management is capable of fault isolation.

8.  Security Considerations

   To be determined.

9.  IANA Considerations

   This Internet-Draft includes no request to IANA.

10.  Acknowledgements

   The authors would like to thank the NFV RG chairs (Diego and Ramki)
   for encouraging discussions and guidance.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [I-D.irtf-nfvrg-nfv-policy-arch]
              Figueira, N., Krishnan, R., Lopez, D., Wright, S., and
              D. Krishnaswamy, "Policy Architecture and Framework for
              NFV Infrastructures", draft-irtf-nfvrg-nfv-policy-
              arch-01 (work in progress), August 2015.

   [1]        GR-77, "Applied R&M Manual for Defense Systems", 2012.

11.2.  Informative References

   [2]        Papoulis, A., "Probability, Random Variables, and
              Stochastic Processes", 2002.

   [3]        Bremaud, P., "An Introduction to Probabilistic
              Modeling", 1994.

   [4]        Press, W., et al., "Numerical Recipes in C/C++", 2007.

Authors' Addresses

   Li Mo
   ZTE (TX) Inc.
   2425 N. Central Expressway
   Richardson, TX  75080
   USA

   Phone: +1-972-454-9661
   Email: li.mo@ztetx.com


   Bhumip Khasnabish (editor)
   ZTE (TX) Inc.
   55 Madison Avenue, Suite 160
   Morristown, New Jersey  07960
   USA

   Phone: +001-781-752-8003
   Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com
   URI:   http://tinyurl.com/bhumip/