NFV RG                                                             L. Mo
Internet-Draft                                        B. Khasnabish, Ed.
Intended status: Informational                             ZTE (TX) Inc.
Expires: April 3, 2016                                   October 1, 2015


                  NFV Reliability using COTS Hardware
             draft-mlk-nfvrg-nfv-reliability-using-cots-00

Abstract

   This draft discusses the results of a recent study on the
   feasibility of using Commercial Off-The-Shelf (COTS) hardware for
   virtualized network functions in telecom equipment.  In particular,
   it explores the conditions under which COTS hardware can be used in
   the NFV (Network Function Virtualization) environment.  The concept
   of silent error probability is introduced in order to take software
   errors and undetectable hardware failures into account.  The silent
   error probability is included in both the theoretical work and the
   simulation work.  It is difficult to analyze theoretically the
   impact of site maintenance and site failure events.  Therefore,
   simulation is used for evaluating the impact of these site
   management related events, which constitute the undesirable features
   of using COTS hardware in the telecom environment.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 3, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
     2.1.  Abbreviations
   3.  Network Reliability
   4.  Network Part of the Availability
   5.  Theoretical Analysis of Server Part of System Availability
   6.  Simulation Study of Server Part of Availability
     6.1.  Methodology
     6.2.  Validation of the Simulator
     6.3.  Simulation Results
     6.4.  Multiple Servers Sharing the Load
   7.  Conclusions
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Using COTS hardware for network functions (e.g., IMS, EPC) has drawn
   considerable attention in recent years.  Some operators have
   legitimate concerns regarding the reliability of COTS hardware, with
   its reduced MTBF (mean time between failures) and its many
   attributes that are undesirable and unfamiliar in the traditional
   telecom industry.

   In previous reliability studies (e.g., GR-77 [1]), the emphasis was
   placed on hardware failures only.  In this work, besides hardware
   failures, which are characterized by the MTBF (mean time between
   failures) and the MTTR (mean time to repair), the silent error is
   also introduced to take into account software errors and hardware
   failures that are undetectable by the management system.  The silent
   error affects the system availability in different ways, depending
   on the particular scenario.

   In a typical system, a server performing certain network functions
   will have another dedicated server as backup.  This is the normal
   master-slave or 1+1 redundancy configuration of telecom equipment.
   The server performing the network function is called the "master
   server" and the dedicated backup is called the "slave server."  To
   differentiate the 1+1 redundancy scheme from the 1:1 redundancy
   scheme, the slave server is deemed "dedicated" in the 1+1 case.  In
   1:1 redundancy, both servers perform network functions while
   protecting each other at the same time.

   In any protection scheme, assuming a single fault for clarity of
   discussion, the system availability will not be impacted if the
   slave experiences a silent error and that silent error eventually
   becomes observable in behavior.  In this case, another slave will be
   identified and the master server will continue to serve the network
   function.  Before the new slave server becomes fully functional, the
   system will operate with reduced error correction capabilities.

   On the other hand, if the master server experiences a silent error,
   the data transmitted to the slave server could be corrupted.  In
   this case, the system availability will be impacted when the error
   becomes observable.  On detection of such an error, both the master
   server and the slave server need time to recover.  The time for such
   recovery is fixed in the NFV environment and is deemed to be the NFV
   MTTR.  During this time interval, the network function is not
   available, and the interval is counted as downtime in the
   availability calculations.

   Comparing the MTBF of COTS hardware with that of typical telecom
   grade hardware, COTS hardware may have a lower MTBF due to its
   relaxed design criteria.  Comparing the MTTR of the two, the COTS
   time to repair is not a random variable; it is actually fixed.  The
   COTS MTTR is the time required to bring up a server so that it is
   ready to serve.  For traditional telecom hardware, the time to
   repair is a random variable, and the MTTR is the mean of this random
   variable.
   Because manual intervention is normally required in the traditional
   telecom environment, the NFV COTS MTTR is normally assumed to be
   less than the traditional telecom equipment MTTR.

   The most obvious difference between the two hardware types (COTS
   hardware and telecom grade hardware) is related to maintenance
   procedure and practice.  While telecom equipment takes pains to
   minimize the impact of maintenance on system availability, COTS
   hardware is normally maintained in a cowboy fashion (e.g., reset
   first and ask questions later).

   In this study, a closed-form solution is available, for one or two
   dedicated backup COTS servers in the NFV environment, if the site
   and maintenance related issues are absent.  In order to evaluate the
   site and maintenance related issues, a simulator is constructed to
   study the system availability with one or two dedicated backup
   servers.  It is shown that, with COTS hardware and all its
   undesirable features, it is still possible to satisfy the telecom
   requirements under reasonable conditions.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].
   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

2.1.  Abbreviations

   o  A-N: Network Availability

   o  A-S: Server Availability

   o  A-Sys: System Availability

   o  COTS: Commercial Off-The-Shelf

   o  DC: Data Center

   o  MTBF: Mean Time Between Failures

   o  MTTF: Mean Time To Failure

   o  MTTR: Mean Time To Repair

   o  NFV: Network Function Virtualization

   o  PGUP: Protection Group Up Time

   o  PSTN: Public Switched Telephone Network

   o  SDN: Software-Defined Network/Networking

   o  TET: Total Elapsed Time

   o  VM: Virtual Machine

   o  WDT: Weighted Down Time

3.  Network Reliability

   In the NFV environment, the reliability analysis can be divided into
   two distinct parts: the server part and the network part.  The
   network part connects all the servers through the vSwitch, and the
   server part provides the actual network functions.  This is
   illustrated in Figure 1, where each COTS server (hosting a VM) is
   connected to both vSwitches in a 1+1 arrangement.

   +------------------+               +-----------------+
   |  (VM)            |...............| vSwitch 1       |
   |  COTS Server 1   |...     .......| (X) (X) .. (X)  |
   +------------------+   .   .       +-----------------+
                           . .
                            .
                           . .
   +------------------+   .   .       +-----------------+
   |  (VM)            |...     .......| vSwitch 2       |
   |  COTS Server 2   |...............| (X) (X) .. (X)  |
   +------------------+               +-----------------+

     Availability: A-S                 Availability: A-N

   Figure 1: System Availability - Network Part and Server Part

   If the overall system availability is denoted by the symbol A-Sys,
   the overall system availability is the product of the server part of
   the system availability (A-S) and the network part of the system
   availability (A-N).

   EQ(1) ... ... ... A-Sys = [A-S x A-N]

   Given the fact that both A-S and A-N are less than 1 (one), we have
   A-Sys less than A-S and A-Sys less than A-N.
   In other words, if FIVE 9s are required for the system availability,
   both the server part and the network part of the availability need
   to be better than FIVE 9s so that their product can still meet the
   FIVE 9s requirement.

   To improve the network part of the availability, as illustrated in
   Figure 1, the normal 1+1 protection scheme is utilized.  It shall be
   noted that it is possible for the vSwitch to cover a long distance
   transmission network in order to connect multiple data centers.

   The mechanisms in the server part for improving availability are not
   specified.  In this study, it is assumed that one active server will
   be supported by one or two backup servers.  Normally, if the active
   server is faulty, one of the backup server(s) will take over the
   responsibility, and hence there will be no loss of availability on
   the server part.

   There is a significant difference between the NFV environment and
   dedicated traditional telecom equipment related to the time to
   recover from a server fault.  For traditional telecom equipment, a
   manual change of some equipment (e.g., a faulty board) is normally
   required, and hence the time for restoration after experiencing a
   fault, normally denoted as the MTTR (Mean Time to Repair), is long.

   In the NFV environment, the time for restoration is the time
   required to boot another virtual machine (VM) with the needed
   software and to re-synchronize the data.  Hence the MTTR in the NFV
   environment can be considered shorter than that of traditional
   telecom equipment.  More importantly, the MTTR in the NFV
   environment can be considered a fixed constant.

   It is also understood that multiple servers will be active to share
   the load.  Contrary to common-sense belief, this arrangement will
   neither increase nor decrease the overall network availability if
   those active servers are supported by one or two backup servers.
   This fact will be elaborated in a later section from both the
   theoretical point of view and simulations.

4.  Network Part of the Availability

   The traditional analysis can be applied to the network part of the
   availability.  In fact, the network part of the availability is
   determined by the availability of each switching element within the
   vSwitch and the maximum number of hops across the vSwitch.  The
   vSwitch connects the VMs in the NFV environment.

   If A-n denotes the availability of a network element, for a vSwitch
   with a maximum of h hops, the availability of the vSwitch would be
   (A-n)^h.  Hence, considering the 1+1 configuration of the vSwitch,
   A-N can be expressed by

   EQ(2) ... ... ... A-N = [1 - (1 - (A-n)^h)^2]

   The network availability, as a function of the number of hops (h)
   and the per network element availability (A-n), is illustrated in
   Figure 2.  While the 3-D illustration shows the general trend in the
   network availability, the following data table gives more details
   regarding the network availability for different hop counts and
   different network element availability, as shown in Table-1.

   Table-1: Network Part of System Availability with Various Network
   Element Availability and Hop Counts

   +--------------+----------+----------+----------+----------+----------+
   | Network      |    10    |    16    |    22    |    26    |    30    |
   | Element      |          |          |          |          |          |
   | Availability |          |          |          |          |          |
   | / Hop Count  |          |          |          |          |          |
   +--------------+----------+----------+----------+----------+----------+
   | 0.99         | 0.99086  | 0.977935 | 0.96065  | 0.94712  | 0.932244 |
   | 0.999        | 0.99990  | 0.999748 | 0.999526 | 0.999341 | 0.999126 |
   | 0.9999       | 0.99999  | 0.999997 | 0.999995 | 0.999993 | 0.999991 |
   | 0.99999      | 1        | 1        | 1        | 1        | 1        |
   +--------------+----------+----------+----------+----------+----------+
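   As a quick cross-check of EQ(1) and EQ(2), the following minimal
   sketch (an illustration added for this discussion, not part of the
   original study; the function names are invented for the example)
   evaluates the network part of the availability and reproduces a
   Table-1 entry.

      # Minimal sketch: direct evaluation of EQ(1) and EQ(2).

      def network_availability(a_n, hops):
          """EQ(2): 1+1 protected vSwitch path of 'hops' elements."""
          single_path = a_n ** hops                # serial chain
          return 1.0 - (1.0 - single_path) ** 2    # 1+1 protection

      def system_availability(a_s, a_n_part):
          """EQ(1): product of the server part and the network part."""
          return a_s * a_n_part

      # Spot-check against Table-1: A-n = 0.999, h = 22 -> ~0.999526.
      print(round(network_availability(0.999, 22), 6))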
   (The original ASCII 3-D trend plot is not reproduced here: its
   x-axis is the Hop Count, its y-axis is the Network Element
   Availability from 0.99 to 0.99999, and it marks the region where
   FIVE 9s are achievable; Table-1 carries the same data.)

   Figure 2: Network Part of the System Availability with Different Hop
   Counts and Different Network Element Availability

   In order to achieve the FIVE 9s availability normally demanded by
   telecommunication operators, the network element availability needs
   to be at least FOUR 9s if the hop count is more than 10.  In fact,
   in order to achieve FIVE 9s while the per network element
   availability is only THREE 9s, the hop count needs to be less than
   two, which is deemed non-practical.

5.  Theoretical Analysis of Server Part of System Availability

   In GR-77 [1], extensive analysis has been performed for systems
   under various conditions.  In the NFV environment, if the server
   availability is denoted by the symbol Ax, the server part of the
   system availability (As), with a 1+1 master and slave configuration,
   is given by [1] (Part D, Chapter 6):

   EQ(3) ... ... ... As = [1 - ((1-Ax)^2)]

   In a more practical environment, there will be silent errors (errors
   that cannot be detected by the system under consideration).  The
   silent error probability is expressed as the symbol Pse.

   We further assume that the silent error only affects the master of
   the system, because the master is the one with the ability to
   corrupt the data.  In practical engineering terms, this assumption
   means that, when an error is detected and there is no obvious cause
   of the error, the "master-slave" configuration will assume the
   master is correct while the slave goes through an MTTR time to
   recover.  The state transition can be illustrated as in the
   following diagram:

   Figure 3: State Transition for System with only one Backup ...
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the state transition diagram outlined in Figure 3, the system
   availability in a 1+1 master-slave configuration can be expressed as
   follows:

   EQ(4a) ... ... ... As = [1 - ((1-Ax)^2 + Pse x Ax x (1-Ax))]

   EQ(4b) ... ... ... As = [(2-Pse)Ax - (1-Pse)(Ax)^2]

   The following diagram (Figure 4) illustrates the server part of the
   availability with different per server availability and different
   silent error probability.

   Figure 4: Server Part of the System Availability with Various Server
   Availability and Silent Error Probability ... (Note: a dot-and-dash
   version of the diagram is being developed.)
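   Pending the diagram, a minimal sketch (an illustration, not part of
   the draft; names invented for the example) evaluates EQ(4a) and
   EQ(4b) and reproduces the entries of Table-2 below.

      # Minimal sketch: direct evaluation of EQ(4a)/EQ(4b).

      def as_single_backup(ax, pse):
          """EQ(4a): server part availability, 1+1, with silent errors."""
          return 1.0 - ((1.0 - ax) ** 2 + pse * ax * (1.0 - ax))

      def as_single_backup_expanded(ax, pse):
          """EQ(4b): the expanded form of EQ(4a); both must agree."""
          return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2

      # Spot-check against Table-2: Ax = 0.99, Pse = 0.1 -> 0.99891.
      assert abs(as_single_backup(0.99, 0.1)
                 - as_single_backup_expanded(0.99, 0.1)) < 1e-12
      print(as_single_backup(0.99, 0.1))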
   While the graphics illustrate the trends, the following data table
   gives precise information on the single backup (1+1) configuration.

   Table-2: Server Part of the Availability for Different Silent Error
   Probability and Different Server Availability for the Single Backup
   Configuration

   +-------+---------+-----------+-------------+----------+
   | SEPSA | 0.99000 | 0.99900   | 0.99990     | 0.99999  |
   +-------+---------+-----------+-------------+----------+
   | 0.0   | 0.9999  | 0.999999  | 0.99999999  | 1.0      |
   | 0.1   | 0.99891 | 0.9998991 | 0.999989991 | 0.99999  |
   | 0.2   | 0.99792 | 0.9997992 | 0.999979992 | 0.999998 |
   | 0.3   | 0.99693 | 0.9996993 | 0.999969993 | 0.999997 |
   | 0.4   | 0.99594 | 0.9995994 | 0.999959994 | 0.999996 |
   | 0.5   | 0.99495 | 0.9994995 | 0.999949995 | 0.999995 |
   | 0.6   | 0.99396 | 0.9993996 | 0.999939996 | 0.999994 |
   | 0.7   | 0.99297 | 0.9992997 | 0.999929997 | 0.999993 |
   | 0.8   | 0.99198 | 0.9991998 | 0.999919998 | 0.999992 |
   | 0.9   | 0.99099 | 0.9990999 | 0.999909999 | 0.999991 |
   | 1     | 0.99    | 0.999     | 0.9999      | 0.99999  |
   +-------+---------+-----------+-------------+----------+

   The region of the table with low silent error probability and high
   server availability outlines where FIVE 9s availability is possible.
   As evidenced in the table, the server part of the availability
   deteriorates rapidly with the silent error probability.  While it is
   possible to achieve FIVE 9s of availability with a server
   availability of only THREE 9s when there is no silent error, FIVE 9s
   of server availability is demanded when the silent error probability
   is only 10%.

   While the 1+1 configuration illustrated above seems reasonable for
   the server part of the system availability (As), there may be cases
   demanding more than a 1+1 configuration for reliability.  For
   systems with two backups, the availability, without consideration of
   the silent error, can be expressed as [1] (Part D, Chapter 6):

   EQ(5) ... ... ... As = [1 - ((1-Ax)^3)]

   With the introduction of the silent error probability, the error
   transition can be expressed in the following diagram:

   Figure 5: Error State Transition for System with two Backups ...
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the introduction of the silent error, and observing the error
   transition above, assuming the silent error event and the server
   fault event are independent (e.g., a software error as the cause of
   the silent error and a hardware failure as the cause of the server
   fault), the server part of the availability for the dual backup case
   is given by

   EQ(6a) ... ... As = [1 - ((1-Ax)^3 + Pse x (1-Ax) x ((Ax)^2 +
                        2Ax(1-Ax)))]

   EQ(6b) ... ... As = [(3-2Pse)Ax - 3(1-Pse)(Ax)^2 + (1-Pse)(Ax)^3]

   It should be noted that, when Pse = 1, for both EQ(4) and EQ(6) the
   server part of the system availability (As) and the server
   availability (Ax) are the same.  This relationship is to be expected
   since, if the master always experiences the silent error, the
   backups are useless and will be corrupted all the time.

   The system availability with dual backup can be illustrated as
   follows for different server availability and different silent error
   probability, including software malfunctions.

   Figure 6: Server Part of the System Availability with Various Silent
   Error Probability and Server Availability for a Dual Backup System
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)
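   As with EQ(4), a small sketch (an illustration, not part of the
   draft) evaluates EQ(6a) and reproduces the entries of Table-3 below.

      # Minimal sketch: direct evaluation of EQ(6a).

      def as_dual_backup(ax, pse):
          """EQ(6a): server part availability, two backups, silent errors."""
          all_three_down = (1.0 - ax) ** 3
          # Additional unavailability contributed by the silent error term:
          silent_term = pse * (1.0 - ax) * (ax ** 2 + 2.0 * ax * (1.0 - ax))
          return 1.0 - (all_three_down + silent_term)

      # Spot-check against Table-3: Ax = 0.99, Pse = 0.1 -> 0.9989991.
      print(as_dual_backup(0.99, 0.1))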
   As with the previous case, the diagram only illustrates the trend,
   while the following table provides precise data for the system
   availability under different silent error probabilities and server
   availabilities for the dual backup case.

   Table-3: System Availability with Different Silent Error Probability
   and Server Availability (SEPSA) for the Dual Backup Configuration

   +-------+-----------+-------------+---------+----------+
   | SEPSA | 0.99000   | 0.99900     | 0.99990 | 0.99999  |
   +-------+-----------+-------------+---------+----------+
   | 0.0   | 0.999999  | 0.999999999 | 1.0     | 1.0      |
   | 0.1   | 0.9989991 | 0.999899999 | 0.99999 | 0.999999 |
   | 0.2   | 0.9979992 | 0.999799999 | 0.99998 | 0.999998 |
   | 0.3   | 0.9969993 | 0.999699999 | 0.99997 | 0.999997 |
   | 0.4   | 0.9959994 | 0.999599999 | 0.99996 | 0.999996 |
   | 0.5   | 0.9949995 | 0.9995      | 0.99995 | 0.999995 |
   | 0.6   | 0.9939996 | 0.9994      | 0.99994 | 0.999994 |
   | 0.7   | 0.9929997 | 0.9993      | 0.99993 | 0.999993 |
   | 0.8   | 0.9919998 | 0.9992      | 0.99992 | 0.999992 |
   | 0.9   | 0.9909999 | 0.9991      | 0.99991 | 0.999991 |
   | 1.0   | 0.99      | 0.999       | 0.9999  | 0.999990 |
   +-------+-----------+-------------+---------+----------+

   As in Table-2, the region of Table-3 with low silent error
   probability and high server availability represents the FIVE 9s
   capability.  Comparing the two tables, the dual backup is of
   marginal advantage over the single backup except for the case where
   there is no silent error.  In that case, with only TWO 9s of server
   availability, FIVE 9s for the server part of the system availability
   can be achieved.

   From the data above, we can conclude that the silent error,
   introduced by a software error or a hardware error not detectable by
   software, plays an important role in the server part of the system
   availability and hence in the final system availability.  In fact,
   it becomes the dominant element if Pse is more than 10%, in which
   case the difference between single backup and dual backup is not
   significant.

   Some operators are of the opinion that there needs to be a new
   approach to the availability requirements.  COTS hardware is assumed
   to have lower availability than traditional telecom hardware.  But,
   since each server (or VM) in the NFV environment will only affect a
   small number of users, the traditional FIVE 9s requirement could be
   relaxed while keeping the same user experience of downtime.  In
   other words, the downtime weighted by the proportion of affected
   users might be reduced in the NFV environment, because each server
   affects only a small number of users for a given server reliability.

   Unfortunately, from the theoretical point of view, this is not true.
   Each server downtime will indeed affect only a small number of
   users, but multiple active servers will experience more server fault
   opportunities (this is similar to the famous reliability argument
   for the twin-engine Boeing 777).  As long as the protection scheme,
   or more importantly the number of backup(s), is the same, the
   eventual system availability will be the same, regardless of what
   portion of the users each server serves.

6.  Simulation Study of Server Part of Availability

   In the above theoretical analysis of the server part of
   availability, the following factors are not considered:

   (A) Site maintenance (e.g., software upgrade, patch, etc., affecting
   the whole site)

   (B) Site failure (earthquake, etc.)
   While traditional telecom grade equipment puts a lot of emphasis and
   engineering complexity into ensuring smooth migration, smooth
   software upgrades, and smooth patching procedures, COTS hardware and
   its related software are notorious for lacking such capabilities.
   This is the primary reason for operators to be hesitant about
   utilizing COTS hardware, even though COTS hardware in the NFV
   environment does have an improved MTTR compared to traditional
   telecom hardware.

   While it is relatively easy to obtain a closed form of the system
   availability for the ideal case without site related issues, it is
   extremely difficult to obtain an analytical solution when site
   issues are involved.  In this case, we resort to numerical
   simulation under reasonable assumptions [2, 3, 4].

6.1.  Methodology

   In this section, the various assumptions and the outline of the
   simulation mechanisms are discussed.  A discrete event simulator is
   constructed to obtain the availability for the server part.  In the
   simulator, an active server (the master server, which processes the
   network traffic) is supported by 1 (single backup) or 2 (dual
   backup) servers in other site(s).

   For the failure probability of the server, it is common to assume
   the bathtub probability distribution (Weibull distribution).  In
   practice, we need to enforce that the NFV management provides
   servers which are on the flat part of the bathtub distribution.  In
   this case, the familiar exponential distribution can be utilized.

   In the discrete event simulator, each server is scheduled to work
   for a certain duration of time.  This duration is an exponentially
   distributed random variable, which is the common model for server
   behavior during its useful life cycle, with the mean given by the
   MTBF of the server.  In fact, the flat part of the bathtub
   distribution can be related to the normal server MTBF (mean time
   between failures) with the failure density function expressed as

   f(x) = (1/MTBF) x e^(-x/MTBF)

   After the working duration, the server will be down for a fixed time
   duration, which represents the time needed to start another virtual
   machine to replace the one in trouble.  This part is actually
   different from traditional telecom grade equipment.  Here, the
   assumption is that there will always be another server available to
   replace the one that went down.  Hence, regardless of the nature of
   the fault, the downtime for a server fault is fixed and represents
   the time needed to have another server ready to take over the task.
   The following diagram shows this arrangement for a system with only
   one backup.  It shall be noted that, while the server up time
   duration is variable, the server down time is fixed.

   Figure 7: The Life of the Servers ... (Note: a dot-and-dash version
   of the diagram is being developed.)

   The servers are hosted in "sites", which are considered to be data
   centers.  In this simulation, during the initial setup, the servers
   supporting each other for reliability purposes are hosted in
   different sites.  This is to minimize the impact of site failure and
   site maintenance.  In order to model the system behavior with one or
   two backups, the concept of a protection group is introduced.  A
   protection group consists of a "master" server with one or two
   "slave" server(s) in other site(s).
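   The server life cycle described above (an exponentially distributed
   up time with mean MTBF, followed by a fixed down time) can be
   sketched in a few lines.  This is an illustration under the stated
   assumptions, not the authors' simulator; the names and parameter
   values are invented for the example.

      # Sketch of the single-server life cycle of Figure 7:
      # exponential up time (mean = MTBF), fixed down time (NFV MTTR).
      import random

      def single_server_availability(mtbf_h, mttr_h,
                                     cycles=200000, seed=7):
          rng = random.Random(seed)
          up = sum(rng.expovariate(1.0 / mtbf_h) for _ in range(cycles))
          total = up + cycles * mttr_h   # each cycle ends in a fixed repair
          return up / total

      # The long-run value approaches MTBF / (MTBF + MTTR):
      print(single_server_availability(10000.0, 6.0 / 60.0))  # ~0.99999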
   There may be multiple protection groups inside the network, with
   each protection group serving a fraction of the users.  A protection
   group is considered to be "down" if every server in the group is
   down.  During the time the protection group is "down", the network
   service is affected, and the network is considered "down" for the
   group of users this protection group is responsible for.

   The uptime and downtime of the protection group are recorded in the
   discrete event simulator.  The server part of the availability is
   given by

   EQ(7) ... ... Availability(server part) = [(PGUP)/(TET)], where

   o  PGUP is the Protection Group Up Time

   o  TET is the Total Elapsed Time (the total simulation time in the
      discrete event simulator)

   The concepts of protection group, site, and server can be
   illustrated as follows (Figure 8) for a system with two backups.  It
   shall be noted that the protection group is an abstract concept, and
   its portion of the network function is unavailable if and only if
   all the servers in the protection group are not functioning.

   Figure 8: Servers, Sites, and Protection Group ... (Note: a dot-and-
   dash version of the diagram is being developed.)

   Even though the simulator allows each site to host a configurable
   number of servers, there is little use for this arrangement.  The
   system availability will not change regardless of how many servers
   per site are used to support the system, as long as there is no
   change in the number of servers in the protection group.  Increasing
   the number of servers per site essentially increases the number of
   protection groups.  Over a long time duration, each protection group
   will experience similar downtime for the same up time (i.e., will
   have the same availability).

   As in the theoretical analysis, the silent error, due to software or
   a subtle hardware failure, will only affect the active (or master)
   server.  When the master server fails with a silent error, both the
   master and the "slave" server(s) will go through an MTTR time to
   recover (e.g., the time to instantiate two VMs simultaneously).  In
   this case, this part of the system (or this protection group) is
   considered to be under fault.

   In this reliability study, the focus is the number of backups for
   each protection group, where the 1+1 configuration is the typical
   configuration for a one-backup mechanism.  A load sharing
   arrangement such as 1:1 can be viewed as two protection groups.  In
   general, the load sharing scheme will have lower availability
   because, in the 1:1 case, any server fault will result in two faults
   in different protection groups.  This can be extended to the 1:2
   case, where three protection groups are involved and any server
   fault will introduce three faults in different protection groups.
   In this study, the load sharing mechanisms are not elaborated
   further.

   The site will also go through its maintenance work.  Traditional
   telecom grade equipment and COTS hardware differ mainly on this
   point.  For telecom grade equipment, minimum impact on system
   performance or system availability is maintained during the
   maintenance window.  But, for COTS hardware, the maintenance work
   may be more frequent and more destructive.  In order to simulate the
   maintenance aspect of COTS hardware, the simulator will put a site
   "under maintenance" at random times.
   The interval for which a site is working is also assumed to be an
   exponentially distributed random variable, with the mean
   configurable in the simulator.  The duration of the maintenance is a
   uniformly distributed random variable with a configured mean,
   minimum, and maximum.

   In order to put a site "under maintenance", there shall be no fault
   inside the network.  All the servers on the site to be put "under
   maintenance" will be moved to other sites.  Hence, no traffic is
   impacted during the process of putting the site under maintenance.
   Of course, the resilience against site failure while some site is
   under maintenance will be reduced.

   When a site is back from maintenance, it will attempt to reclaim all
   of its server responsibilities that were transferred due to the site
   maintenance.

   o  For each protection group, if every server is working, the
      protection group will re-arrange the protection relationship so
      that each site will only have one server in the protection group.
      The new server on the site back from maintenance will need an
      MTTR time to be ready for backup.  In this case, there is no loss
      of service in the system.

   o  For each protection group, if there is at least one server
      working and at least one in a fault condition, one working server
      will be added to the protection group.  The new server on the
      site back from maintenance will need an MTTR time to be ready for
      backup.  In this case, there is no loss of service in the system.

   o  For each protection group, if no servers are working, the
      protection group will gain a working server from the site back
      from maintenance.  The new server on the site back from
      maintenance will need an MTTR time to be ready for service.  In
      this case, the system will provide service after the new server
      is ready.

   A site can also be under fault (e.g., loss of power, operation with
   reduced capability due to thermal issues, or an earthquake).  The
   simulator can also simulate the effect of such events, with the site
   up duration being an exponentially distributed random variable with
   a configurable mean.  The site failure duration is expressed as a
   uniformly distributed random variable with a configurable mean,
   minimum, and maximum.

6.2.  Validation of the Simulator

   In order to verify the correctness of the simulator (e.g., the
   random number generator, the whole program structure, etc.), the
   simulation is performed with various server availabilities and
   various silent error probabilities, and the results are compared
   with the theory.  For the single backup case, the error between the
   theoretical data and the simulation data for the system availability
   on the server part can be illustrated by the following diagram
   (Figure 9).

   Figure 9: Verification of the Simulator for the Single Backup Case
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   As we can see, the magnitude of the errors is within 10^(-5), which
   is very small considering that the nominal value of the system
   availability for the server part is close to 1.0.  For the dual
   backup case, the error between the simulated and theoretical system
   availability for different silent error probabilities and server
   availabilities can be illustrated as follows (Figure 10).

   Figure 10: Verification of the Simulator for the Dual Backup Case
   ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   This is similar to the single backup case, where the errors are
   within the same range.  This error information gives us the needed
   confidence in the simulation results for the complicated cases where
   analytical solutions are elusive.
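   The spirit of this validation can be conveyed by a much-simplified
   Monte Carlo cross-check of EQ(4a) (two servers only, no site
   events).  This is an illustrative sketch, not the authors'
   simulator, and all names are invented for the example; for long
   horizons it is expected to approach EQ(4a) with small residual
   errors of the kind quoted above.

      # Simplified 1+1 cross-check: both servers alternate exponential
      # up times and fixed repairs; a failure of the active server is
      # silent with probability pse, taking the backup down with it for
      # one MTTR, as described in Section 5.
      import random

      def simulate_1plus1(mtbf, mttr, pse, horizon, seed=1):
          rng = random.Random(seed)
          up = [True, True]
          nxt = [rng.expovariate(1.0 / mtbf),
                 rng.expovariate(1.0 / mtbf)]
          active, t, group_down, down_since = 0, 0.0, 0.0, None
          while t < horizon:
              i = 0 if nxt[0] <= nxt[1] else 1
              t = nxt[i]
              if up[i]:                          # failure of server i
                  up[i] = False
                  nxt[i] = t + mttr              # fixed NFV repair time
                  if i == active and up[1 - i]:
                      if rng.random() < pse:     # silent error hits slave
                          up[1 - i] = False
                          nxt[1 - i] = t + mttr
                      else:
                          active = 1 - i         # clean failover
              else:                              # recovery of server i
                  up[i] = True
                  nxt[i] = t + rng.expovariate(1.0 / mtbf)
                  if not up[active]:
                      active = i
              if down_since is None and not (up[0] or up[1]):
                  down_since = t                 # protection group down
              elif down_since is not None and (up[0] or up[1]):
                  group_down += t - down_since
                  down_since = None
          return 1.0 - group_down / t

      # Ax = 99/(99+1) = 0.99; EQ(4a) predicts 0.99891 at Pse = 0.1.
      print(simulate_1plus1(mtbf=99.0, mttr=1.0, pse=0.1, horizon=5e6))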
6.3.  Simulation Results

   The effect of the MTTR in the NFV environment is studied first.  In
   this study, the combined effect of the MTTR and the silent error
   probability is shown below:

   Figure 11: Availability with Various Silent Error Probabilities for
   Different MTTRs ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   In the diagram (Figure 11), R6 represents an MTTR of 6 minutes while
   R60 represents an MTTR of 60 minutes.  The x-axis is the silent
   error probability.  As shown, the MTTR (the time to recover from a
   fault, or the time for VM rebirth) affects the slope of the system
   availability, which declines with increasing silent error
   probability.  In the above example, the server MTBF is assumed to be
   10000 hours, which represents a server availability of 0.9994 for
   the R6 case and 0.994 for the R60 case.  The two curves starting at
   approximately 1.0 are the system availability with dual backups,
   while the other two are the system availability with a single
   backup.  It should be noted that, for the dual backup case, there is
   little difference in availability for different MTTRs when there is
   no silent error.  Intuitively, this is expected due to the added
   number of backup servers.

   In this simulation, both site failure (with a mean time between
   failures of 20000 hours) and site maintenance (with a mean time
   between site maintenance events of 1000 hours) are considered.  The
   mean site failure duration is assumed to be 12 hours (uniformly
   distributed between 4 hours and 24 hours), and the mean site
   maintenance duration is 24 hours (uniformly distributed between 4
   hours and 48 hours).

   The next step is to evaluate the impact of the site issues (site
   failure, maintenance).  For a very bad site as outlined above, the
   mean time between site failures is 2 times the server MTBF, and the
   mean time between site maintenance events is assumed to be 0.1 times
   the server MTBF.  The availability of the server part can be
   illustrated for different silent error probabilities and server
   availabilities for the single backup configuration.

   Figure 12: Availability for the Server Part in the Single Backup
   Configuration ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   As the data illustrates, in order to achieve high availability, the
   server availability needs to be very high.  In fact, the server
   availability needs to be in the range of FIVE 9s in order to achieve
   a system availability of FIVE 9s under the various site related
   issues.

   For dual backup systems in exactly the same configuration, the
   results are better and can be illustrated as follows:

   Figure 13: Availability for the Server Part in the Dual Backup
   Configuration ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   With a server availability of FOUR 9s and low silent error
   probabilities, the server part of the availability can achieve FIVE
   9s.

   Next, consider a site with fewer issues, such as one where the mean
   time between site failures is 100 times the server MTBF and the mean
   time between site maintenance events is 0.1 times the server MTBF.
   The mean site failure duration is again assumed to be 12 hours
   (uniformly distributed between 4 hours and 24 hours), and the mean
   site maintenance duration is 24 hours (uniformly distributed between
   4 hours and 48 hours).
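   For reference, this "good site" scenario can be collected into a
   single parameter set.  The key names below are invented for this
   sketch; the values are those given in the text.

      # Illustrative parameter set for the "good site" scenario.
      GOOD_SITE_SCENARIO = {
          "server_mtbf_h": 10000.0,                # server MTBF
          "site_mtbf_h": 100 * 10000.0,            # 100 x server MTBF
          "site_maint_interval_h": 0.1 * 10000.0,  # 0.1 x server MTBF
          "site_fail_duration_h": {"mean": 12, "min": 4, "max": 24},
          "site_maint_duration_h": {"mean": 24, "min": 4, "max": 48},
      }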
   The results for the single backup system can be shown as follows:

   Figure 14: Server Part of the Availability for a Good Site with a
   Single Backup ... (Note: a dot-and-dash version of the diagram is
   being developed.)

   The following data table (Table-4) gives precise information
   regarding these simulation results.

   Table-4: Details Regarding the Availability of the Server Part for a
   Single Backup on a Good Site

   +--------------+----------+----------+------------+------------+
   | Silent       | 0.990099 | 0.999001 | 0.99990001 | 0.99999    |
   | Error/Server |          |          |            |            |
   | Availability |          |          |            |            |
   +--------------+----------+----------+------------+------------+
   | 0.0          | 0.998971 | 0.999959 | 0.9999992  | 1.0        |
   | 0.1          | 0.997918 | 0.999857 | 0.99998959 | 0.99999901 |
   | 0.2          | 0.996908 | 0.999771 | 0.99997957 | 0.99999804 |
   | 0.3          | 0.995999 | 0.999674 | 0.99996935 | 0.99999695 |
   +--------------+----------+----------+------------+------------+

   As evidenced in the table above, the server part of the system
   availability is impacted by the silent error; a single redundant
   server provides only marginal improvement even when the silent error
   probability is small.

   Figure 15: Server Part of the Availability for a Good Site with Dual
   Backup ... (Note: a dot-and-dash version of the diagram is being
   developed.)

   The diagram above gives the general trend of the system
   availability, and the following data table provides precise values.

   Table-5: Details Regarding the Availability of the Server Part for
   Dual Backup on a Good Site

   +--------------+------------+------------+------------+------------+
   | Silent       | 0.99009901 | 0.999001   | 0.99990001 | 0.99999    |
   | Error/Server |            |            |            |            |
   | Availability |            |            |            |            |
   +--------------+------------+------------+------------+------------+
   | 0.0          | 0.9999939  | 0.99999998 | 1.0        | 1.0        |
   | 0.2          | 0.9981346  | 0.99980209 | 0.99998048 | 0.99999792 |
   | 0.4          | 0.99615083 | 0.99960136 | 0.99996002 | 0.99999594 |
   | 0.5          | 0.99522474 | 0.9995184  | 0.99995225 | 0.99999503 |
   +--------------+------------+------------+------------+------------+

   From the tables for single and dual backup, we can see that the dual
   backup only provides marginal benefit in the face of site issues.
   Given that site issues are inevitable in practice, a geographically
   distributed single backup system is recommended for simplicity.

6.4.  Multiple Servers Sharing the Load

   In this section, we outline the simulation results for the cases
   where there are multiple servers taking care of the active workload.
   In this case, the failure of a protection group will affect a
   smaller number of users.  In the simulation, each site will have N
   servers to serve the work.

   A weighted uptime and a weighted downtime are introduced.  The
   system availability is the weighted uptime divided by the total of
   the weighted uptime and the weighted downtime.

   EQ(8) ... ... Weighted-Availability[Server-Part] = [(TET - WDT)/TET],
   where

   o  TET is the Total Elapsed Time

   o  WDT is the Weighted Down Time

   If any protection group (i) is down, the WDT is updated as follows:

   EQ(9) ... ... WDT = WDT + [Protection Group (i) Down Time]/N
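   The bookkeeping of EQ(8) and EQ(9) can be sketched as follows
   (illustrative names, not the authors' simulator):

      # Minimal sketch of the weighted-downtime bookkeeping.

      def weighted_availability(group_down_times, total_elapsed):
          n = len(group_down_times)                     # N groups
          wdt = sum(dt / n for dt in group_down_times)  # EQ(9) per group
          return (total_elapsed - wdt) / total_elapsed  # EQ(8)

      # Three groups sharing the load: weighting by 1/N makes the
      # result the plain average of the per-group availabilities,
      # consistent with Table-6.
      print(weighted_availability([12.0, 10.0, 14.0], 1.0e6))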
   For a system with three protection groups (i.e., the servers sharing
   the workload), the availability of each protection group, as well as
   the weighted availability, is obtained as follows (Table-6):

   Table-6: Availability of Protection Groups and the Weighted
   Availability (Dual Backup)

   +-------------+--------------+--------------+--------------+--------------+--------------+
   | Silent      | Availability | Availability | Availability | Measured     | Protection   |
   | Error       | of           | of           | of           | Weighted     | Group        |
   | Probability | Protection   | Protection   | Protection   | Availability | Average -    |
   |             | Group 1      | Group 2      | Group 3      |              | Weighted     |
   |             |              |              |              |              | Availability |
   +-------------+--------------+--------------+--------------+--------------+--------------+
   | 0.0         | 1.0          | 1.0          | 1.0          | 1.0          | 0.0          |
   | 0.2         | 0.999998015  | 0.999998005  | 0.999997985  | 0.999998001  | 6.66668E-11  |
   | 0.4         | 0.999996027  | 0.999996018  | 0.999995988  | 0.999996011  | -3.33333E-11 |
   +-------------+--------------+--------------+--------------+--------------+--------------+

   In this case, there is little difference between the protection
   groups.  The weighted availability is effectively the average of the
   availabilities of all the protection groups.  This also illustrates
   the fact that, regardless of how many servers share the active load,
   the system availability will be the same as long as (A) the number
   of backups is the same, and (B) the availability of each server is
   the same.

7.  Conclusions

   The system availability can be divided into two parts: the
   availability from the network and the availability from the servers.
   The final system availability is the product of those two parts.

   The system availability from the network is determined by the
   maximum number of hops and the individual network element
   availability, with the fault-tolerant setup assumed to be 1+1.  The
   system availability from the servers is mainly determined by the
   following parameters:

   o  Availability of each individual server

   o  Silent error probability

   o  Site related issues (maintenance, fault)

   o  Protection scheme (one or two dedicated backups)

   The silent error is introduced to take into account software errors
   and hardware errors that are not detectable by software.  The system
   availability on the server part will be dominated by the silent
   error if the silent error probability is more than 10%.  This is
   shown in both the theoretical work and the simulations.

   It is interesting to note that the dual backup scheme provides only
   marginal benefits, and the added complexity may not warrant such
   practice in a real network.

   It is possible for COTS hardware to provide as high an availability
   as traditional telecom hardware if the server itself is of
   reasonably high availability.  The undesirable attributes of COTS
   hardware have been modelled as the site related issues, such as site
   maintenance and site failure, which are not applicable to
   traditional telecom hardware.  Hence, in calculating the server
   availability, the site related issues are to be excluded.

   It is critical for the virtualization infrastructure management to
   provide as much hardware failure information as possible to improve
   the availability of the application.  As seen in both the
   theoretical work and the simulation, the silent error probability
   becomes a dominant factor in the final availability.  The silent
   error probability can be reduced if the virtualization
   infrastructure management is capable of fault isolation.
8.  Security Considerations

   To be determined.

9.  IANA Considerations

   This Internet-Draft includes no request to IANA.

10.  Acknowledgements

   The authors would like to thank the NFV RG chairs (Diego and Ramki)
   for encouraging discussions and guidance.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [I-D.irtf-nfvrg-nfv-policy-arch]
              Figueira, N., Krishnan, R., Lopez, D., Wright, S., and D.
              Krishnaswamy, "Policy Architecture and Framework for NFV
              Infrastructures", draft-irtf-nfvrg-nfv-policy-arch-01
              (work in progress), August 2015.

   [1]        GR-77, "Applied R&M Manual for Defense Systems", 2012.

11.2.  Informative References

   [2]        Papoulis, A., "Probability, Random Variables, and
              Stochastic Processes", 2002.

   [3]        Bremaud, P., "An Introduction to Probabilistic Modeling",
              1994.

   [4]        Press, W., et al., "Numerical Recipes in C/C++", 2007.

Authors' Addresses

   Li Mo
   ZTE (TX) Inc.
   2425 N. Central Expressway
   Richardson, TX 75080
   USA

   Phone: +1-972-454-9661
   Email: li.mo@ztetx.com


   Bhumip Khasnabish (editor)
   ZTE (TX) Inc.
   55 Madison Avenue, Suite 160
   Morristown, New Jersey 07960
   USA

   Phone: +001-781-752-8003
   Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com
   URI:   http://tinyurl.com/bhumip/