NFV RG                                                             L. Mo
Internet-Draft                                        B. Khasnabish, Ed.
Intended status: Informational                             ZTE (TX) Inc.
Expires: April 3, 2016                                   October 1, 2015


                  NFV Reliability using COTS Hardware
             draft-mlk-nfvrg-nfv-reliability-using-cots-00

Abstract

   This draft discusses the results of a recent study on the
   feasibility of using Commercial Off-The-Shelf (COTS) hardware for
   virtualized network functions in telecom equipment.  In particular,
   it explores the conditions under which COTS hardware can be used in
   the NFV (Network Function Virtualization) environment.  The concept
   of silent error probability is introduced in order to take software
   errors and undetectable hardware failures into account.  The silent
   error probability is included in both the theoretical work and the
   simulation work.  It is difficult to analyze theoretically the
   impact of site maintenance and site failure events.  Therefore,
   simulation is used for evaluating the impact of these site
   management related events, which constitute the undesirable features
   of using COTS hardware in the telecom environment.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 3, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
     2.1.  Abbreviations
   3.  Network Reliability
   4.  Network Part of the Availability
   5.  Theoretical Analysis of Server Part of System Availability
   6.  Simulation Study of Server Part of Availability
     6.1.  Methodology
     6.2.  Validation of the Simulator
     6.3.  Simulation Results
     6.4.  Multiple Servers Sharing the Load
   7.  Conclusions
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Using COTS hardware for network functions (e.g., IMS, EPC) has drawn
   considerable attention in recent years.  Some operators have
   legitimate concerns regarding the reliability of COTS hardware, with
   its reduced MTBF (mean time between failures) and its many
   attributes that are undesirable and unfamiliar in the traditional
   telecom industry.

   In previous reliability studies (e.g., GR-77 [1]), the emphasis was
   placed on hardware failures only.  In this work, besides hardware
   failures, which are characterized by the MTBF (mean time between
   failures) and the MTTR (mean time to repair), the silent error is
   also introduced to take into account software errors and hardware
   failures that are undetectable by the management system.  The silent
   error affects the system availability in different ways, depending
   on the particular scenario.

   In a typical system, a server performing certain network functions
   will have another dedicated server as backup.  This is the normal
   master-slave or 1+1 redundancy configuration of telecom equipment.
   The server performing the network function is called the "master
   server" and the dedicated backup is called the "slave server."  To
   differentiate the 1+1 redundancy scheme from the 1:1 redundancy
   scheme, the slave server is deemed "dedicated" in the 1+1 case.  In
   1:1 redundancy, both servers perform network functions while
   protecting each other at the same time.

   In any protection scheme, assuming a single fault for clarity of
   discussion, the system availability will not be impacted if the
   slave experiences a silent error and that silent error eventually
   becomes observable in behavior.  In this case, another slave will be
   identified and the master server will continue to serve the network
   function.  Before the new slave server becomes fully functional, the
   system will operate with reduced error correction capabilities.

   On the other hand, if the master server experiences a silent error,
   the data transmitted to the slave server could be corrupted.  In
   this case, the system availability will be impacted when the error
   becomes observable.  On detection of such an error, both the master
   server and the slave server need time to recover.  The time for such
   recovery is fixed in the NFV environment and is deemed to be the NFV
   MTTR.  During this time interval, the network function is not
   available, and the interval is counted as downtime in the
   availability calculations.

   Comparing the MTBF of COTS hardware with that of typical telecom
   grade hardware, COTS hardware may have a lower MTBF due to its
   relaxed design criteria.  Comparing the MTTR of the two, the COTS
   time to repair is not a random variable; it is actually fixed.  The
   COTS MTTR is the time required to bring up a server so that it is
   ready to serve.  For traditional telecom hardware, the time to
   repair is a random variable, and the MTTR is the mean of this random
   variable.
   Because manual intervention is normally required in the traditional
   telecom environment, the NFV COTS MTTR is normally assumed to be
   less than the traditional telecom equipment MTTR.

   The most obvious difference between the two hardware types (COTS
   hardware and telecom grade hardware) is related to maintenance
   procedure and practice.  While telecom equipment takes pains to
   minimize the impact of maintenance on system availability, COTS
   hardware is normally maintained in a cowboy fashion (e.g., reset
   first and ask questions later).

   In this study, a closed-form solution is available, for one or two
   dedicated backup COTS servers in the NFV environment, if the site
   and maintenance related issues are absent.  In order to evaluate the
   site and maintenance related issues, a simulator is constructed to
   study the system availability with one or two dedicated backup
   servers.  It is shown that, with COTS hardware and all its
   undesirable features, it is still possible to satisfy the telecom
   requirements under reasonable conditions.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].
   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

2.1.  Abbreviations

   o  A-N: Network Availability

   o  A-S: Server Availability

   o  A-Sys: System Availability

   o  COTS: Commercial Off-The-Shelf

   o  DC: Data Center

   o  MTBF: Mean Time Between Failures

   o  MTTF: Mean Time To Failure

   o  MTTR: Mean Time To Repair

   o  NFV: Network Function Virtualization

   o  PGUP: Protection Group Up Time

   o  PSTN: Public Switched Telephone Network

   o  SDN: Software-Defined Network/Networking

   o  TET: Total Elapsed Time

   o  VM: Virtual Machine

   o  WDT: Weighted Down Time

3.  Network Reliability

   In the NFV environment, the reliability analysis can be divided into
   two distinct parts: the server part and the network part.  The
   network part connects all the servers through the vSwitch, and the
   server part provides the actual network functions.  This is
   illustrated in Figure 1, where each COTS server (hosting a VM) is
   connected to both vSwitches in a 1+1 arrangement.

   +------------------+               +-----------------+
   |  (VM)            |...............| vSwitch 1       |
   |  COTS Server 1   |...     .......| (X) (X) .. (X)  |
   +------------------+   .   .       +-----------------+
                           . .
                            .
                           . .
   +------------------+   .   .       +-----------------+
   |  (VM)            |...     .......| vSwitch 2       |
   |  COTS Server 2   |...............| (X) (X) .. (X)  |
   +------------------+               +-----------------+

     Availability: A-S                 Availability: A-N

   Figure 1: System Availability - Network Part and Server Part

   If the overall system availability is denoted by the symbol A-Sys,
   the overall system availability is the product of the server part of
   the system availability (A-S) and the network part of the system
   availability (A-N).

   EQ(1) ... ... ... A-Sys = [A-S x A-N]

   Given the fact that both A-S and A-N are less than 1 (one), we have
   A-Sys less than A-S and A-Sys less than A-N.
   In other words, if FIVE 9s are required for the system availability,
   both the server part and the network part of the availability need
   to be better than FIVE 9s so that their product can still meet the
   FIVE 9s requirement.

   To improve the network part of the availability, as illustrated in
   Figure 1, the normal 1+1 protection scheme is utilized.  It shall be
   noted that it is possible for the vSwitch to cover a long distance
   transmission network in order to connect multiple data centers.

   The mechanisms in the server part for improving availability are not
   specified.  In this study, it is assumed that one active server will
   be supported by one or two backup servers.  Normally, if the active
   server is faulty, one of the backup server(s) will take over the
   responsibility, and hence there will be no loss of availability on
   the server part.

   There is a significant difference between the NFV environment and
   dedicated traditional telecom equipment related to the time to
   recover from a server fault.  For traditional telecom equipment, a
   manual change of some equipment (e.g., a faulty board) is normally
   required, and hence the time for restoration after experiencing a
   fault, normally denoted as the MTTR (Mean Time to Repair), is long.

   In the NFV environment, the time for restoration is the time
   required to boot another virtual machine (VM) with the needed
   software and to re-synchronize the data.  Hence the MTTR in the NFV
   environment can be considered shorter than that of traditional
   telecom equipment.  More importantly, the MTTR in the NFV
   environment can be considered a fixed constant.

   It is also understood that multiple servers will be active to share
   the load.  Contrary to common-sense belief, this arrangement will
   neither increase nor decrease the overall network availability if
   those active servers are supported by one or two backup servers.
   This fact will be elaborated in a later section from both the
   theoretical point of view and simulations.

4.  Network Part of the Availability

   The traditional analysis can be applied to the network part of the
   availability.  In fact, the network part of the availability is
   determined by the availability of each switching element within the
   vSwitch and the maximum number of hops across the vSwitch.  The
   vSwitch connects the VMs in the NFV environment.

   If A-n denotes the availability of a network element, for a vSwitch
   with a maximum of h hops, the availability of the vSwitch would be
   (A-n)^h.  Hence, considering the 1+1 configuration of the vSwitch,
   A-N can be expressed by

   EQ(2) ... ... ... A-N = [1 - (1 - (A-n)^h)^2]

   The network availability, as a function of the number of hops (h)
   and the per network element availability (A-n), is illustrated in
   Figure 2.  While the 3-D illustration shows the general trend in the
   network availability, the following data table gives more details
   regarding the network availability for different hop counts and
   different network element availability, as shown in Table-1.

   Table-1: Network Part of System Availability with Various Network
   Element Availability and Hop Counts

   +--------------+----------+----------+----------+----------+----------+
   | Network      |    10    |    16    |    22    |    26    |    30    |
   | Element      |          |          |          |          |          |
   | Availability |          |          |          |          |          |
   | / Hop Count  |          |          |          |          |          |
   +--------------+----------+----------+----------+----------+----------+
   | 0.99         | 0.99086  | 0.977935 | 0.96065  | 0.94712  | 0.932244 |
   | 0.999        | 0.99990  | 0.999748 | 0.999526 | 0.999341 | 0.999126 |
   | 0.9999       | 0.99999  | 0.999997 | 0.999995 | 0.999993 | 0.999991 |
   | 0.99999      | 1        | 1        | 1        | 1        | 1        |
   +--------------+----------+----------+----------+----------+----------+
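   As a quick cross-check of EQ(1) and EQ(2), the following minimal
   sketch (an illustration added for this discussion, not part of the
   original study; the function names are invented for the example)
   evaluates the network part of the availability and reproduces a
   Table-1 entry.

      # Minimal sketch: direct evaluation of EQ(1) and EQ(2).

      def network_availability(a_n, hops):
          """EQ(2): 1+1 protected vSwitch path of 'hops' elements."""
          single_path = a_n ** hops                # serial chain
          return 1.0 - (1.0 - single_path) ** 2    # 1+1 protection

      def system_availability(a_s, a_n_part):
          """EQ(1): product of the server part and the network part."""
          return a_s * a_n_part

      # Spot-check against Table-1: A-n = 0.999, h = 22 -> ~0.999526.
      print(round(network_availability(0.999, 22), 6))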
   (The original ASCII 3-D trend plot is not reproduced here: its
   x-axis is the Hop Count, its y-axis is the Network Element
   Availability from 0.99 to 0.99999, and it marks the region where
   FIVE 9s are achievable; Table-1 carries the same data.)

   Figure 2: Network Part of the System Availability with Different Hop
   Counts and Different Network Element Availability

   In order to achieve the FIVE 9s availability normally demanded by
   telecommunication operators, the network element availability needs
   to be at least FOUR 9s if the hop count is more than 10.  In fact,
   in order to achieve FIVE 9s while the per network element
   availability is only THREE 9s, the hop count needs to be less than
   two, which is deemed non-practical.

5.  Theoretical Analysis of Server Part of System Availability

   In GR-77 [1], extensive analysis has been performed for systems
   under various conditions.  In the NFV environment, if the server
   availability is denoted by the symbol Ax, the server part of the
   system availability (As), with a 1+1 master and slave configuration,
   is given by [1] (Part D, Chapter 6):

   EQ(3) ... ... ... As = [1 - ((1-Ax)^2)]

   In a more practical environment, there will be silent errors (errors
   that cannot be detected by the system under consideration).  The
   silent error probability is expressed as the symbol Pse.

   We further assume that the silent error only affects the master of
   the system, because the master is the one with the ability to
   corrupt the data.  In practical engineering terms, this assumption
   means that, when an error is detected and there is no obvious cause
   of the error, the "master-slave" configuration will assume the
   master is correct while the slave goes through an MTTR time to
   recover.  The state transition can be illustrated as in the
   following diagram:

   Figure 3: State Transition for System with only one Backup ...
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the state transition diagram outlined in Figure 3, the system
   availability in a 1+1 master-slave configuration can be expressed as
   follows:

   EQ(4a) ... ... ... As = [1 - ((1-Ax)^2 + Pse x Ax x (1-Ax))]

   EQ(4b) ... ... ... As = [(2-Pse)Ax - (1-Pse)(Ax)^2]

   The following diagram (Figure 4) illustrates the server part of the
   availability with different per server availability and different
   silent error probability.

   Figure 4: Server Part of the System Availability with Various Server
   Availability and Silent Error Probability ... (Note: a dot-and-dash
   version of the diagram is being developed.)
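   Pending the diagram, a minimal sketch (an illustration, not part of
   the draft; names invented for the example) evaluates EQ(4a) and
   EQ(4b) and reproduces the entries of Table-2 below.

      # Minimal sketch: direct evaluation of EQ(4a)/EQ(4b).

      def as_single_backup(ax, pse):
          """EQ(4a): server part availability, 1+1, with silent errors."""
          return 1.0 - ((1.0 - ax) ** 2 + pse * ax * (1.0 - ax))

      def as_single_backup_expanded(ax, pse):
          """EQ(4b): the expanded form of EQ(4a); both must agree."""
          return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2

      # Spot-check against Table-2: Ax = 0.99, Pse = 0.1 -> 0.99891.
      assert abs(as_single_backup(0.99, 0.1)
                 - as_single_backup_expanded(0.99, 0.1)) < 1e-12
      print(as_single_backup(0.99, 0.1))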
   While the graphics illustrate the trends, the following data table
   gives precise information on the single backup (1+1) configuration.

   Table-2: Server Part of the Availability for Different Silent Error
   Probability and Different Server Availability for the Single Backup
   Configuration

   +-------+---------+-----------+-------------+----------+
   | SEPSA | 0.99000 | 0.99900   | 0.99990     | 0.99999  |
   +-------+---------+-----------+-------------+----------+
   | 0.0   | 0.9999  | 0.999999  | 0.99999999  | 1.0      |
   | 0.1   | 0.99891 | 0.9998991 | 0.999989991 | 0.99999  |
   | 0.2   | 0.99792 | 0.9997992 | 0.999979992 | 0.999998 |
   | 0.3   | 0.99693 | 0.9996993 | 0.999969993 | 0.999997 |
   | 0.4   | 0.99594 | 0.9995994 | 0.999959994 | 0.999996 |
   | 0.5   | 0.99495 | 0.9994995 | 0.999949995 | 0.999995 |
   | 0.6   | 0.99396 | 0.9993996 | 0.999939996 | 0.999994 |
   | 0.7   | 0.99297 | 0.9992997 | 0.999929997 | 0.999993 |
   | 0.8   | 0.99198 | 0.9991998 | 0.999919998 | 0.999992 |
   | 0.9   | 0.99099 | 0.9990999 | 0.999909999 | 0.999991 |
   | 1     | 0.99    | 0.999     | 0.9999      | 0.99999  |
   +-------+---------+-----------+-------------+----------+

   The region of the table with low silent error probability and high
   server availability outlines where FIVE 9s availability is possible.
   As evidenced in the table, the server part of the availability
   deteriorates rapidly with the silent error probability.  While it is
   possible to achieve FIVE 9s of availability with a server
   availability of only THREE 9s when there is no silent error, FIVE 9s
   of server availability is demanded when the silent error probability
   is only 10%.

   While the 1+1 configuration illustrated above seems reasonable for
   the server part of the system availability (As), there may be cases
   demanding more than a 1+1 configuration for reliability.  For
   systems with two backups, the availability, without consideration of
   the silent error, can be expressed as [1] (Part D, Chapter 6):

   EQ(5) ... ... ... As = [1 - ((1-Ax)^3)]

   With the introduction of the silent error probability, the error
   transition can be expressed in the following diagram:

   Figure 5: Error State Transition for System with two Backups ...
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the introduction of the silent error, and observing the error
   transition above, assuming the silent error event and the server
   fault event are independent (e.g., a software error as the cause of
   the silent error and a hardware failure as the cause of the server
   fault), the server part of the availability for the dual backup case
   is given by

   EQ(6a) ... ... As = [1 - ((1-Ax)^3 + Pse x (1-Ax) x ((Ax)^2 +
                        2Ax(1-Ax)))]

   EQ(6b) ... ... As = [(3-2Pse)Ax - 3(1-Pse)(Ax)^2 + (1-Pse)(Ax)^3]

   It should be noted that, when Pse = 1, for both EQ(4) and EQ(6) the
   server part of the system availability (As) and the server
   availability (Ax) are the same.  This relationship is to be expected
   since, if the master always experiences the silent error, the
   backups are useless and will be corrupted all the time.

   The system availability with dual backup can be illustrated as
   follows for different server availability and different silent error
   probability, including software malfunctions.

   Figure 6: Server Part of the System Availability with Various Silent
   Error Probability and Server Availability for a Dual Backup System
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)
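   As with EQ(4), a small sketch (an illustration, not part of the
   draft) evaluates EQ(6a) and reproduces the entries of Table-3 below.

      # Minimal sketch: direct evaluation of EQ(6a).

      def as_dual_backup(ax, pse):
          """EQ(6a): server part availability, two backups, silent errors."""
          all_three_down = (1.0 - ax) ** 3
          # Additional unavailability contributed by the silent error term:
          silent_term = pse * (1.0 - ax) * (ax ** 2 + 2.0 * ax * (1.0 - ax))
          return 1.0 - (all_three_down + silent_term)

      # Spot-check against Table-3: Ax = 0.99, Pse = 0.1 -> 0.9989991.
      print(as_dual_backup(0.99, 0.1))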
   As with the previous case, the diagram only illustrates the trend,
   while the following table provides precise data for the system
   availability under different silent error probabilities and server
   availabilities for the dual backup case.

   Table-3: System Availability with Different Silent Error Probability
   and Server Availability (SEPSA) for the Dual Backup Configuration

   +-------+-----------+-------------+---------+----------+
   | SEPSA | 0.99000   | 0.99900     | 0.99990 | 0.99999  |
   +-------+-----------+-------------+---------+----------+
   | 0.0   | 0.999999  | 0.999999999 | 1.0     | 1.0      |
   | 0.1   | 0.9989991 | 0.999899999 | 0.99999 | 0.999999 |
   | 0.2   | 0.9979992 | 0.999799999 | 0.99998 | 0.999998 |
   | 0.3   | 0.9969993 | 0.999699999 | 0.99997 | 0.999997 |
   | 0.4   | 0.9959994 | 0.999599999 | 0.99996 | 0.999996 |
   | 0.5   | 0.9949995 | 0.9995      | 0.99995 | 0.999995 |
   | 0.6   | 0.9939996 | 0.9994      | 0.99994 | 0.999994 |
   | 0.7   | 0.9929997 | 0.9993      | 0.99993 | 0.999993 |
   | 0.8   | 0.9919998 | 0.9992      | 0.99992 | 0.999992 |
   | 0.9   | 0.9909999 | 0.9991      | 0.99991 | 0.999991 |
   | 1.0   | 0.99      | 0.999       | 0.9999  | 0.999990 |
   +-------+-----------+-------------+---------+----------+

   As in Table-2, the region of Table-3 with low silent error
   probability and high server availability represents the FIVE 9s
   capability.  Comparing the two tables, the dual backup is of
   marginal advantage over the single backup except for the case where
   there is no silent error.  In that case, with only TWO 9s of server
   availability, FIVE 9s for the server part of the system availability
   can be achieved.

   From the data above, we can conclude that the silent error,
   introduced by a software error or a hardware error not detectable by
   software, plays an important role in the server part of the system
   availability and hence in the final system availability.  In fact,
   it becomes the dominant element if Pse is more than 10%, in which
   case the difference between single backup and dual backup is not
   significant.

   Some operators are of the opinion that there needs to be a new
   approach to the availability requirements.  COTS hardware is assumed
   to have lower availability than traditional telecom hardware.  But,
   since each server (or VM) in the NFV environment will only affect a
   small number of users, the traditional FIVE 9s requirement could be
   relaxed while keeping the same user experience of downtime.  In
   other words, the downtime weighted by the proportion of affected
   users might be reduced in the NFV environment, because each server
   affects only a small number of users for a given server reliability.

   Unfortunately, from the theoretical point of view, this is not true.
   Each server downtime will indeed affect only a small number of
   users, but multiple active servers will experience more server fault
   opportunities (this is similar to the famous reliability argument
   for the twin-engine Boeing 777).  As long as the protection scheme,
   or more importantly the number of backup(s), is the same, the
   eventual system availability will be the same, regardless of what
   portion of the users each server serves.

6.  Simulation Study of Server Part of Availability

   In the above theoretical analysis of the server part of
   availability, the following factors are not considered:

   (A) Site maintenance (e.g., software upgrade, patch, etc., affecting
   the whole site)

   (B) Site failure (earthquake, etc.)
   While traditional telecom grade equipment puts a lot of emphasis and
   engineering complexity into ensuring smooth migration, smooth
   software upgrades, and smooth patching procedures, COTS hardware and
   its related software are notorious for lacking such capabilities.
   This is the primary reason for operators to be hesitant about
   utilizing COTS hardware, even though COTS hardware in the NFV
   environment does have an improved MTTR compared to traditional
   telecom hardware.

   While it is relatively easy to obtain a closed form of the system
   availability for the ideal case without site related issues, it is
   extremely difficult to obtain an analytical solution when site
   issues are involved.  In this case, we resort to numerical
   simulation under reasonable assumptions [2, 3, 4].

6.1.  Methodology

   In this section, the various assumptions and the outline of the
   simulation mechanisms are discussed.  A discrete event simulator is
   constructed to obtain the availability for the server part.  In the
   simulator, an active server (the master server, which processes the
   network traffic) is supported by 1 (single backup) or 2 (dual
   backup) servers in other site(s).

   For the failure probability of the server, it is common to assume
   the bathtub probability distribution (Weibull distribution).  In
   practice, we need to enforce that the NFV management provides
   servers which are on the flat part of the bathtub distribution.  In
   this case, the familiar exponential distribution can be utilized.

   In the discrete event simulator, each server is scheduled to work
   for a certain duration of time.  This duration is an exponentially
   distributed random variable, which is the common model for server
   behavior during its useful life cycle, with the mean given by the
   MTBF of the server.  In fact, the flat part of the bathtub
   distribution can be related to the normal server MTBF (mean time
   between failures) with the failure density function expressed as

   f(x) = (1/MTBF) x e^(-x/MTBF)

   After the working duration, the server will be down for a fixed time
   duration, which represents the time needed to start another virtual
   machine to replace the one in trouble.  This part is actually
   different from traditional telecom grade equipment.  Here, the
   assumption is that there will always be another server available to
   replace the one that went down.  Hence, regardless of the nature of
   the fault, the downtime for a server fault is fixed and represents
   the time needed to have another server ready to take over the task.
   The following diagram shows this arrangement for a system with only
   one backup.  It shall be noted that, while the server up time
   duration is variable, the server down time is fixed.

   Figure 7: The Life of the Servers ... (Note: a dot-and-dash version
   of the diagram is being developed.)

   The servers are hosted in "sites", which are considered to be data
   centers.  In this simulation, during the initial setup, the servers
   supporting each other for reliability purposes are hosted in
   different sites.  This is to minimize the impact of site failure and
   site maintenance.  In order to model the system behavior with one or
   two backups, the concept of a protection group is introduced.  A
   protection group consists of a "master" server with one or two
   "slave" server(s) in other site(s).
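   The server life cycle described above (an exponentially distributed
   up time with mean MTBF, followed by a fixed down time) can be
   sketched in a few lines.  This is an illustration under the stated
   assumptions, not the authors' simulator; the names and parameter
   values are invented for the example.

      # Sketch of the single-server life cycle of Figure 7:
      # exponential up time (mean = MTBF), fixed down time (NFV MTTR).
      import random

      def single_server_availability(mtbf_h, mttr_h,
                                     cycles=200000, seed=7):
          rng = random.Random(seed)
          up = sum(rng.expovariate(1.0 / mtbf_h) for _ in range(cycles))
          total = up + cycles * mttr_h   # each cycle ends in a fixed repair
          return up / total

      # The long-run value approaches MTBF / (MTBF + MTTR):
      print(single_server_availability(10000.0, 6.0 / 60.0))  # ~0.99999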
   There may be multiple protection groups inside the network, with
   each protection group serving a fraction of the users.  A protection
   group is considered to be "down" if every server in the group is
   down.  During the time the protection group is "down", the network
   service is affected, and the network is considered "down" for the
   group of users this protection group is responsible for.

   The uptime and downtime of the protection group are recorded in the
   discrete event simulator.  The server part of the availability is
   given by

   EQ(7) ... ... Availability(server part) = [(PGUP)/(TET)], where

   o  PGUP is the Protection Group Up Time

   o  TET is the Total Elapsed Time (the total simulation time in the
      discrete event simulator)

   The concepts of protection group, site, and server can be
   illustrated as follows (Figure 8) for a system with two backups.  It
   shall be noted that the protection group is an abstract concept, and
   its portion of the network function is unavailable if and only if
   all the servers in the protection group are not functioning.

   Figure 8: Servers, Sites, and Protection Group ... (Note: a dot-and-
   dash version of the diagram is being developed.)

   Even though the simulator allows each site to host a configurable
   number of servers, there is little use for this arrangement.  The
   system availability will not change regardless of how many servers
   per site are used to support the system, as long as there is no
   change in the number of servers in the protection group.  Increasing
   the number of servers per site essentially increases the number of
   protection groups.  Over a long time duration, each protection group
   will experience similar downtime for the same up time (i.e., will
   have the same availability).

   As in the theoretical analysis, the silent error, due to software or
   a subtle hardware failure, will only affect the active (or master)
   server.  When the master server fails with a silent error, both the
   master and the "slave" server(s) will go through an MTTR time to
   recover (e.g., the time to instantiate two VMs simultaneously).  In
   this case, this part of the system (or this protection group) is
   considered to be under fault.

   In this reliability study, the focus is the number of backups for
   each protection group, where the 1+1 configuration is the typical
   configuration for a one-backup mechanism.  A load sharing
   arrangement such as 1:1 can be viewed as two protection groups.  In
   general, the load sharing scheme will have lower availability
   because, in the 1:1 case, any server fault will result in two faults
   in different protection groups.  This can be extended to the 1:2
   case, where three protection groups are involved and any server
   fault will introduce three faults in different protection groups.
   In this study, the load sharing mechanisms are not elaborated
   further.

   The site will also go through its maintenance work.  Traditional
   telecom grade equipment and COTS hardware differ mainly on this
   point.  For telecom grade equipment, minimum impact on system
   performance or system availability is maintained during the
   maintenance window.  But, for COTS hardware, the maintenance work
   may be more frequent and more destructive.  In order to simulate the
   maintenance aspect of COTS hardware, the simulator will put a site
   "under maintenance" at random times.
   The interval for which a site is working is also assumed to be an
   exponentially distributed random variable, with the mean
   configurable in the simulator.  The duration of the maintenance is a
   uniformly distributed random variable with a configured mean,
   minimum, and maximum.

   In order to put a site "under maintenance", there shall be no fault
   inside the network.  All the servers on the site to be put "under
   maintenance" will be moved to other sites.  Hence, no traffic is
   impacted during the process of putting the site under maintenance.
   Of course, the resilience against site failure while some site is
   under maintenance will be reduced.

   When a site is back from maintenance, it will attempt to reclaim all
   of its server responsibilities that were transferred due to the site
   maintenance.

   o  For each protection group, if every server is working, the
      protection group will re-arrange the protection relationship so
      that each site will only have one server in the protection group.
      The new server on the site back from maintenance will need an
      MTTR time to be ready for backup.  In this case, there is no loss
      of service in the system.

   o  For each protection group, if there is at least one server
      working and at least one in a fault condition, one working server
      will be added to the protection group.  The new server on the
      site back from maintenance will need an MTTR time to be ready for
      backup.  In this case, there is no loss of service in the system.

   o  For each protection group, if no servers are working, the
      protection group will gain a working server from the site back
      from maintenance.  The new server on the site back from
      maintenance will need an MTTR time to be ready for service.  In
      this case, the system will provide service after the new server
      is ready.

   A site can also be under fault (e.g., loss of power, operation with
   reduced capability due to thermal issues, or an earthquake).  The
   simulator can also simulate the effect of such events, with the site
   up duration being an exponentially distributed random variable with
   a configurable mean.  The site failure duration is expressed as a
   uniformly distributed random variable with a configurable mean,
   minimum, and maximum.

6.2.  Validation of the Simulator

   In order to verify the correctness of the simulator (e.g., the
   random number generator, the whole program structure, etc.), the
   simulation is performed with various server availabilities and
   various silent error probabilities, and the results are compared
   with the theory.  For the single backup case, the error between the
   theoretical data and the simulation data for the system availability
   on the server part can be illustrated by the following diagram
   (Figure 9).

   Figure 9: Verification of the Simulator for the Single Backup Case
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   As we can see, the magnitude of the errors is within 10^(-5), which
   is very small considering that the nominal value of the system
   availability for the server part is close to 1.0.  For the dual
   backup case, the error between the simulated and theoretical system
   availability for different silent error probabilities and server
   availabilities can be illustrated as follows (Figure 10).

   Figure 10: Verification of the Simulator for the Dual Backup Case
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   This is similar to the single backup case, where the errors are
   within the same range.  This error information gives us the needed
   confidence in the simulation results for the complicated cases where
   analytical solutions are elusive.
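   The spirit of this validation can be conveyed by a much-simplified
   Monte Carlo cross-check of EQ(4a) (two servers only, no site
   events).  This is an illustrative sketch, not the authors'
   simulator, and all names are invented for the example; for long
   horizons it is expected to approach EQ(4a) with small residual
   errors of the kind quoted above.

      # Simplified 1+1 cross-check: both servers alternate exponential
      # up times and fixed repairs; a failure of the active server is
      # silent with probability pse, taking the backup down with it for
      # one MTTR, as described in Section 5.
      import random

      def simulate_1plus1(mtbf, mttr, pse, horizon, seed=1):
          rng = random.Random(seed)
          up = [True, True]
          nxt = [rng.expovariate(1.0 / mtbf),
                 rng.expovariate(1.0 / mtbf)]
          active, t, group_down, down_since = 0, 0.0, 0.0, None
          while t < horizon:
              i = 0 if nxt[0] <= nxt[1] else 1
              t = nxt[i]
              if up[i]:                          # failure of server i
                  up[i] = False
                  nxt[i] = t + mttr              # fixed NFV repair time
                  if i == active and up[1 - i]:
                      if rng.random() < pse:     # silent error hits slave
                          up[1 - i] = False
                          nxt[1 - i] = t + mttr
                      else:
                          active = 1 - i         # clean failover
              else:                              # recovery of server i
                  up[i] = True
                  nxt[i] = t + rng.expovariate(1.0 / mtbf)
                  if not up[active]:
                      active = i
              if down_since is None and not (up[0] or up[1]):
                  down_since = t                 # protection group down
              elif down_since is not None and (up[0] or up[1]):
                  group_down += t - down_since
                  down_since = None
          return 1.0 - group_down / t

      # Ax = 99/(99+1) = 0.99; EQ(4a) predicts 0.99891 at Pse = 0.1.
      print(simulate_1plus1(mtbf=99.0, mttr=1.0, pse=0.1, horizon=5e6))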
6.3.  Simulation Results

   The effect of the MTTR in the NFV environment is studied first.  In
   this study, the combined effect of the MTTR and the silent error
   probability is shown below:

   Figure 11: Availability with Various Silent Error Probabilities for
   Different MTTRs ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   In the diagram (Figure 11), R6 represents an MTTR of 6 minutes while
   R60 represents an MTTR of 60 minutes.  The x-axis is the silent
   error probability.  As shown, the MTTR (the time to recover from a
   fault, or the time for VM rebirth) affects the slope of the system
   availability, which declines with increasing silent error
   probability.  In the above example, the server MTBF is assumed to be
   10000 hours, which represents a server availability of 0.9994 for
   the R6 case and 0.994 for the R60 case.  The two curves starting at
   approximately 1.0 are the system availability with dual backups,
   while the other two are the system availability with a single
   backup.  It should be noted that, for the dual backup case, there is
   little difference in availability for different MTTRs when there is
   no silent error.  Intuitively, this is expected due to the added
   number of backup servers.

   In this simulation, both site failure (with a mean time between
   failures of 20000 hours) and site maintenance (with a mean time
   between site maintenance events of 1000 hours) are considered.  The
   mean site failure duration is assumed to be 12 hours (uniformly
   distributed between 4 hours and 24 hours), and the mean site
   maintenance duration is 24 hours (uniformly distributed between 4
   hours and 48 hours).

   The next step is to evaluate the impact of the site issues (site
   failure, maintenance).  For a very bad site as outlined above, the
   mean time between site failures is 2 times the server MTBF, and the
   mean time between site maintenance events is assumed to be 0.1 times
   the server MTBF.  The availability of the server part can be
   illustrated for different silent error probabilities and server
   availabilities for the single backup configuration.

   Figure 12: Availability for the Server Part in the Single Backup
   Configuration ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   As the data illustrates, in order to achieve high availability, the
   server availability needs to be very high.  In fact, the server
   availability needs to be in the range of FIVE 9s in order to achieve
   a system availability of FIVE 9s under the various site related
   issues.

   For dual backup systems in exactly the same configuration, the
   results are better and can be illustrated as follows:

   Figure 13: Availability for the Server Part in the Dual Backup
   Configuration ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   With a server availability of FOUR 9s and low silent error
   probabilities, the server part of the availability can achieve FIVE
   9s.

   Next, consider a site with fewer issues, such as one where the mean
   time between site failures is 100 times the server MTBF and the mean
   time between site maintenance events is 0.1 times the server MTBF.
   The mean site failure duration is again assumed to be 12 hours
   (uniformly distributed between 4 hours and 24 hours), and the mean
   site maintenance duration is 24 hours (uniformly distributed between
   4 hours and 48 hours).
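   For reference, this "good site" scenario can be collected into a
   single parameter set.  The key names below are invented for this
   sketch; the values are those given in the text.

      # Illustrative parameter set for the "good site" scenario.
      GOOD_SITE_SCENARIO = {
          "server_mtbf_h": 10000.0,                # server MTBF
          "site_mtbf_h": 100 * 10000.0,            # 100 x server MTBF
          "site_maint_interval_h": 0.1 * 10000.0,  # 0.1 x server MTBF
          "site_fail_duration_h": {"mean": 12, "min": 4, "max": 24},
          "site_maint_duration_h": {"mean": 24, "min": 4, "max": 48},
      }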
   The results for the single backup system can be shown as follows:

   Figure 14: Server Part of the Availability for a Good Site with a
   Single Backup ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   The following data table (Table-4) gives precise information
   regarding these simulation results.

   Table-4: Details Regarding the Availability of the Server Part for a
   Single Backup on a Good Site

   +--------------+----------+----------+------------+------------+
   | Silent       | 0.990099 | 0.999001 | 0.99990001 | 0.99999    |
   | Error/Server |          |          |            |            |
   | Availability |          |          |            |            |
   +--------------+----------+----------+------------+------------+
   | 0.0          | 0.998971 | 0.999959 | 0.9999992  | 1.0        |
   | 0.1          | 0.997918 | 0.999857 | 0.99998959 | 0.99999901 |
   | 0.2          | 0.996908 | 0.999771 | 0.99997957 | 0.99999804 |
   | 0.3          | 0.995999 | 0.999674 | 0.99996935 | 0.99999695 |
   +--------------+----------+----------+------------+------------+

   As evidenced in the table above, the server part of the system
   availability is impacted by the silent error; a single redundant
   server provides only marginal improvement even when the silent error
   probability is small.

   Figure 15: Server Part of the Availability for a Good Site with Dual
   Backup ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   The diagram above gives the general trend of the system
   availability, and the following data table provides precise values.

   Table-5: Details Regarding the Availability of the Server Part for
   Dual Backup on a Good Site

   +--------------+------------+------------+------------+------------+
   | Silent       | 0.99009901 | 0.999001   | 0.99990001 | 0.99999    |
   | Error/Server |            |            |            |            |
   | Availability |            |            |            |            |
   +--------------+------------+------------+------------+------------+
   | 0.0          | 0.9999939  | 0.99999998 | 1.0        | 1.0        |
   | 0.2          | 0.9981346  | 0.99980209 | 0.99998048 | 0.99999792 |
   | 0.4          | 0.99615083 | 0.99960136 | 0.99996002 | 0.99999594 |
   | 0.5          | 0.99522474 | 0.9995184  | 0.99995225 | 0.99999503 |
   +--------------+------------+------------+------------+------------+

   From the tables for single and dual backup, we can see that the dual
   backup only provides marginal benefit in the face of site issues.
   Given that site issues are inevitable in practice, a geographically
   distributed single backup system is recommended for simplicity.

6.4.  Multiple Servers Sharing the Load

   In this section, we outline the simulation results for the cases
   where there are multiple servers taking care of the active workload.
   In this case, the failure of a protection group will affect a
   smaller number of users.  In the simulation, each site will have N
   servers to serve the work.

   A weighted uptime and a weighted downtime are introduced.  The
   system availability is the weighted uptime divided by the total of
   the weighted uptime and the weighted downtime.

   EQ(8) ... ... Weighted-Availability[Server-Part] = [(TET - WDT)/TET],
   where

   o  TET is the Total Elapsed Time

   o  WDT is the Weighted Down Time

   If any protection group (i) is down, the WDT is updated as follows:

   EQ(9) ... ... WDT = WDT + [Protection Group (i) Down Time]/N
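   The bookkeeping of EQ(8) and EQ(9) can be sketched as follows
   (illustrative names, not the authors' simulator):

      # Minimal sketch of the weighted-downtime bookkeeping.

      def weighted_availability(group_down_times, total_elapsed):
          n = len(group_down_times)                     # N groups
          wdt = sum(dt / n for dt in group_down_times)  # EQ(9) per group
          return (total_elapsed - wdt) / total_elapsed  # EQ(8)

      # Three groups sharing the load: weighting by 1/N makes the
      # result the plain average of the per-group availabilities,
      # consistent with Table-6.
      print(weighted_availability([12.0, 10.0, 14.0], 1.0e6))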
   For a system with three protection groups (i.e., the servers sharing
   the workload), the availability of each protection group, as well as
   the weighted availability, is obtained as follows (Table-6):

   Table-6: Availability of Protection Groups and the Weighted
   Availability (Dual Backup)

   +-------------+--------------+--------------+--------------+--------------+--------------+
   | Silent      | Availability | Availability | Availability | Measured     | Protection   |
   | Error       | of           | of           | of           | Weighted     | Group        |
   | Probability | Protection   | Protection   | Protection   | Availability | Average -    |
   |             | Group 1      | Group 2      | Group 3      |              | Weighted     |
   |             |              |              |              |              | Availability |
   +-------------+--------------+--------------+--------------+--------------+--------------+
   | 0.0         | 1.0          | 1.0          | 1.0          | 1.0          | 0.0          |
   | 0.2         | 0.999998015  | 0.999998005  | 0.999997985  | 0.999998001  | 6.66668E-11  |
   | 0.4         | 0.999996027  | 0.999996018  | 0.999995988  | 0.999996011  | -3.33333E-11 |
   +-------------+--------------+--------------+--------------+--------------+--------------+

   In this case, there is little difference between the protection
   groups.  The weighted availability is effectively the average of the
   availabilities of all the protection groups.  This also illustrates
   the fact that, regardless of how many servers share the active load,
   the system availability will be the same as long as (A) the number
   of backups is the same, and (B) the availability of each server is
   the same.

7.  Conclusions

   The system availability can be divided into two parts: the
   availability from the network and the availability from the servers.
   The final system availability is the product of those two parts.

   The system availability from the network is determined by the
   maximum number of hops and the individual network element
   availability, with the fault-tolerant setup assumed to be 1+1.  The
   system availability from the servers is mainly determined by the
   following parameters:

   o  Availability of each individual server

   o  Silent error probability

   o  Site related issues (maintenance, fault)

   o  Protection scheme (one or two dedicated backups)

   The silent error is introduced to take into account software errors
   and hardware errors that are not detectable by software.  The system
   availability on the server part will be dominated by the silent
   error if the silent error probability is more than 10%.  This is
   shown in both the theoretical work and the simulations.

   It is interesting to note that the dual backup scheme provides only
   marginal benefits, and the added complexity may not warrant such
   practice in a real network.

   It is possible for COTS hardware to provide as high an availability
   as traditional telecom hardware if the server itself is of
   reasonably high availability.  The undesirable attributes of COTS
   hardware have been modelled as the site related issues, such as site
   maintenance and site failure, which are not applicable to
   traditional telecom hardware.  Hence, in calculating the server
   availability, the site related issues are to be excluded.

   It is critical for the virtualization infrastructure management to
   provide as much hardware failure information as possible to improve
   the availability of the application.  As seen in both the
   theoretical work and the simulation, the silent error probability
   becomes a dominant factor in the final availability.  The silent
   error probability can be reduced if the virtualization
   infrastructure management is capable of fault isolation.
8.  Security Considerations

   To be determined.

9.  IANA Considerations

   This Internet-Draft includes no request to IANA.

10.  Acknowledgements

   The authors would like to thank the NFV RG chairs (Diego and Ramki)
   for encouraging discussions and guidance.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [I-D.irtf-nfvrg-nfv-policy-arch]
              Figueira, N., Krishnan, R., Lopez, D., Wright, S., and D.
              Krishnaswamy, "Policy Architecture and Framework for NFV
              Infrastructures", draft-irtf-nfvrg-nfv-policy-arch-01
              (work in progress), August 2015.

   [1]        GR-77, "Applied R&M Manual for Defense Systems", 2012.

11.2.  Informative References

   [2]        Papoulis, A., "Probability, Random Variables, and
              Stochastic Processes", 2002.

   [3]        Bremaud, P., "An Introduction to Probabilistic Modeling",
              1994.

   [4]        Press, W., et al., "Numerical Recipes in C/C++", 2007.

Authors' Addresses

   Li Mo
   ZTE (TX) Inc.
   2425 N. Central Expressway
   Richardson, TX 75080
   USA

   Phone: +1-972-454-9661
   Email: li.mo@ztetx.com


   Bhumip Khasnabish (editor)
   ZTE (TX) Inc.
   55 Madison Avenue, Suite 160
   Morristown, New Jersey 07960
   USA

   Phone: +001-781-752-8003
   Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com
   URI:   http://tinyurl.com/bhumip/