NFV RG                                                             L. Mo
Internet-Draft                                        B. Khasnabish, Ed.
Intended status: Informational                             ZTE (TX) Inc.
Expires: April 3, 2016                                   October 1, 2015

                  NFV Reliability using COTS Hardware
            draft-mlk-nfvrg-nfv-reliability-using-cots-00

Abstract

   This draft discusses the results of a recent study on the
   feasibility of using Commercial Off-The-Shelf (COTS) hardware for
   virtualized network functions in telecom equipment.  In particular,
   it explores the conditions under which COTS hardware can be used in
   the NFV (Network Function Virtualization) environment.  The concept
   of silent error probability is introduced in order to take software
   errors and undetectable hardware failures into account.  The silent
   error probability is included in both the theoretical analysis and
   the simulation work.  Because it is difficult to analyze the impact
   of site maintenance and site failure events theoretically,
   simulation is used to evaluate the impact of these site management
   related events, which constitute an undesirable aspect of using
   COTS hardware in the telecom environment.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 3, 2016.
Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions used in this document
     2.1.  Abbreviations
   3.  Network Reliability
   4.  Network Part of the Availability
   5.  Theoretical Analysis of the Server Part of System Availability
   6.  Simulation Study of the Server Part of Availability
     6.1.  Methodology
     6.2.  Validation of the Simulator
     6.3.  Simulation Results
     6.4.  Multiple Servers Sharing the Load
   7.  Conclusions
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The use of COTS hardware for network functions (e.g., IMS, EPC) has
   drawn considerable attention in recent years.  Some operators have
   legitimate concerns regarding the reliability of COTS hardware,
   given its reduced MTBF (mean time between failures) and a number of
   attributes that are unfamiliar, and undesirable, in the traditional
   telecom industry.

   In previous reliability studies (e.g., GR-77 [1]), the emphasis was
   placed on hardware failures only.  In this work, besides hardware
   failures, which are characterized by the MTBF and the MTTR (mean
   time to repair), the silent error is introduced to account for
   software errors and hardware failures that are undetectable by the
   management system.

   Silent errors affect the system availability in different ways,
   depending on the particular scenario.

   In a typical system, a server performing certain network functions
   will have another dedicated server as a backup.  This is the normal
   master-slave, or 1+1, redundancy configuration of telecom equipment.

   The server performing the network function is called the "master
   server" and the dedicated backup is called the "slave server".  To
   differentiate the 1+1 redundancy scheme from the 1:1 redundancy
   scheme, the slave server is deemed "dedicated" in the 1+1 case.
   In 1:1 redundancy, both servers perform network functions while
   protecting each other at the same time.

   In any protection scheme, assuming a single fault for clarity of
   discussion, the system availability will not be impacted if the
   slave experiences a silent error and that silent error eventually
   becomes observable in its behavior.  In this case, another slave
   will be identified and the master server will continue to serve the
   network function.  Until the new slave server becomes fully
   functional, the system operates with reduced protection capability.

   On the other hand, if the master server experiences a silent error,
   the data transmitted to the slave server could be corrupted.  In
   this case, the system availability will be impacted when the error
   becomes observable.  On detection of such an error, both the master
   server and the slave server need time to recover.  The time for
   such recovery is fixed in the NFV environment and is deemed to be
   the NFV MTTR.  During this time interval, the network function is
   not available, and the interval is counted as downtime in the
   availability calculations.

   Comparing the MTBF of COTS hardware with that of typical telecom-
   grade hardware, COTS hardware may have a lower MTBF due to its
   relaxed design criteria.

   Comparing the MTTR of COTS hardware with that of typical telecom-
   grade hardware, the COTS time to repair is not a random variable;
   it is effectively fixed.  Hence, the COTS MTTR is the time required
   to bring up a replacement server and make it ready to serve.  For
   traditional telecom hardware, the time to repair is a random
   variable, and the MTTR is the mean of this random variable.
   Because manual intervention is normally required in the traditional
   telecom environment, the NFV COTS MTTR is normally assumed to be
   shorter than the traditional telecom equipment MTTR.

   The most obvious difference between the two hardware types (COTS
   hardware and telecom-grade hardware) is related to maintenance
   procedures and practices.  While telecom equipment takes pains to
   minimize the impact of maintenance on system availability, COTS
   hardware is normally maintained in a much more cavalier fashion
   (e.g., reset first and ask questions later).

   In this study, a closed-form solution is available for one or two
   dedicated backup COTS servers in the NFV environment when the site
   and maintenance related issues are absent.  In order to evaluate
   the site and maintenance related issues, a simulator is constructed
   to study the system availability with one or two dedicated backup
   servers.

   It is shown that, with COTS hardware and all of its undesirable
   features, it is still possible to satisfy telecom availability
   requirements under reasonable conditions.

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower case uses of these words are not to
   be interpreted as carrying RFC 2119 significance.
2.1.  Abbreviations

   o  A-N: Network Availability

   o  A-S: Server Availability

   o  A-Sys: System Availability

   o  COTS: Commercial Off-The-Shelf

   o  DC: Data Center

   o  MTBF: Mean Time Between Failures

   o  MTTF: Mean Time To Failure

   o  MTTR: Mean Time To Repair

   o  NFV: Network Function Virtualization

   o  PGUP: Protection Group Up Time

   o  PSTN: Public Switched Telephone Network

   o  SDN: Software-Defined Network/Networking

   o  TET: Total Elapsed Time

   o  VM: Virtual Machine

   o  WDT: Weighted Down Time

3.  Network Reliability

   In the NFV environment, the reliability analysis can be divided
   into two distinct parts: the server part and the network part.  The
   network part connects all the servers through the vSwitch, and the
   server part provides the actual network functions.  This is
   illustrated in Figure 1.

      +--------------------+
      | Availability: A-S  |            Availability: A-N
      |                    |
      |                    |            +---------------+
      |  (VM)              |            |               |
      |  COTS...........................|   vSwitch 1   |
      |  Server.............     .......|               |
      |                    | \   /      |(X) (X) .. (X) |
      |                    |  \ /       +---------------+
      |                    |   X
      |                    |  / \       +---------------+
      |  (VM)              | /   \      |   vSwitch 2   |
      |  COTS............../     \......|               |
      |  Server.........................|(X) (X) .. (X) |
      |                    |            |               |
      +--------------------+            +---------------+

     Figure 1: System Availability - Network Part and Server Part

   If the overall system availability is denoted by A-Sys, it is the
   product of the server part of the system availability (A-S) and the
   network part of the system availability (A-N).

   EQ(1) ... ... ... A-Sys = A-S x A-N

   Given that both A-S and A-N are less than 1 (one), we have A-Sys
   less than A-S and A-Sys less than A-N.  In other words, if FIVE 9s
   are required for the system availability, both the server part and
   the network part of the availability need to be better than FIVE 9s
   so that their product can exceed FIVE 9s.

   To improve the network part of the availability, as illustrated in
   Figure 1, the normal 1+1 protection scheme is utilized.  It should
   be noted that the vSwitch may span a long-distance transmission
   network in order to connect multiple data centers.

   The mechanism used in the server part for improving availability is
   not specified.  In this study, it is assumed that one active server
   will be supported by one or two backup servers.  Normally, if the
   active server is faulty, one of the backup server(s) will take over
   the responsibility, and hence there will be no loss of availability
   on the server part.

   There is a significant difference between the NFV environment and
   dedicated traditional telecom equipment with respect to the time
   needed to recover from a server fault.  In the traditional telecom
   equipment case, a manual replacement of some equipment (e.g., a
   faulty board) is normally required, and hence the time for
   restoration after experiencing a fault, normally denoted as the
   MTTR (Mean Time To Repair), is long.

   In the NFV environment, the time for restoration is the time
   required to boot another virtual machine (VM) with the needed
   software and to re-synchronize the data.  Hence, the MTTR in the
   NFV environment can be considered to be shorter than that of
   traditional telecom equipment.
   More importantly, the MTTR in the NFV environment can be considered
   to be a fixed constant.

   It is also understood that multiple servers may be active to share
   the load.  Contrary to common-sense belief, this arrangement will
   neither increase nor decrease the overall availability if those
   active servers are supported by one or two backup servers.  This
   fact will be elaborated in a later section, from both a theoretical
   point of view and by simulations.

4.  Network Part of the Availability

   Traditional analysis can be applied to the network part of the
   availability.  The network part of the availability is determined
   by the availability of the switching elements that make up the
   vSwitch and by the maximum number of hops in the vSwitch.  The
   vSwitch connects the VMs in the NFV environment.

   If A-n denotes the availability of a network element, then for a
   vSwitch with a maximum of h hops, the availability of the vSwitch
   would be (A-n)^h.  Hence, considering the 1+1 configuration of the
   vSwitch, A-N can be expressed as

   EQ(2) ... ... ... A-N = 1 - (1 - (A-n)^h)^2

   The network availability, as a function of the number of hops (h)
   and the per-network-element availability (A-n), is illustrated in
   Figure 2.

   While this 3-D illustration shows the general trend in network
   availability, Table-1 gives more detail on the network availability
   for different hop counts and different network element
   availabilities.

   Table-1: Network Part of the System Availability for Various
   Network Element Availabilities and Hop Counts

   +--------------------+---------+----------+----------+----------+----------+
   | Network Element    |   10    |    16    |    22    |    26    |    30    |
   | Availability \ Hops|         |          |          |          |          |
   +--------------------+---------+----------+----------+----------+----------+
   | 0.99               | 0.99086 | 0.977935 | 0.96065  | 0.94712  | 0.932244 |
   | 0.999              | 0.99990 | 0.999748 | 0.999526 | 0.999341 | 0.999126 |
   | 0.9999             | 0.99999 | 0.999997 | 0.999995 | 0.999993 | 0.999991 |
   | 0.99999            | 1       | 1        | 1        | 1        | 1        |
   +--------------------+---------+----------+----------+----------+----------+

            +------------------------------------------+
           /                                          / |
          /                 Five 9s                  /  |
         /                                          /   |
        /               ... ... -/                 /    |
       /    ... ... ...  .....                    /     |
      +-----------------------------------------+      +..0.99999
      |  .   ....                                |      /
      | ....  ....  ...  . . . . .               |     / 0.9999
      |                                        . |    /      Net Element
      |                                         .|   /       Availability
      |                                          |  / 0.999
      |                                          | /
      +-----------------------------------------+..0.99
        2        8        12        18        24
        ..............Hop Count..........>

    Figure 2: Network Part of the System Availability with Different
          Hop Counts and Different Network Element Availability

   In order to achieve the FIVE 9s availability normally demanded by
   telecommunication operators, the network element availability needs
   to be at least FOUR 9s if the hop count is more than 10.

   In fact, in order to achieve FIVE 9s while the per-network-element
   availability is only THREE 9s, the hop count needs to be less than
   two, which is deemed impractical.
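   The relationships in EQ(1) and EQ(2) can be checked numerically.
   The following short Python sketch is illustrative only (the
   function and variable names are local to this example and are not
   defined by this document); it reproduces the entries of Table-1 and
   shows how a FIVE 9s system target constrains both factors of EQ(1).

   # Illustrative check of EQ(1) and EQ(2); names are chosen for this
   # example only.

   def network_availability(a_n, hops):
       """EQ(2): 1+1 protected vSwitch path of 'hops' elements, each
       with availability 'a_n'."""
       path = a_n ** hops
       return 1.0 - (1.0 - path) ** 2

   def system_availability(a_s, a_n_total):
       """EQ(1): product of the server part and the network part."""
       return a_s * a_n_total

   if __name__ == "__main__":
       # Reproduce Table-1.
       for a_n in (0.99, 0.999, 0.9999, 0.99999):
           row = [round(network_availability(a_n, h), 6)
                  for h in (10, 16, 22, 26, 30)]
           print(a_n, row)

       # Both factors of EQ(1) must exceed the system target: even a
       # FIVE 9s server part with a FIVE 9s network part only yields
       # approximately 0.99998.
       print(system_availability(0.99999, 0.99999))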
5.  Theoretical Analysis of the Server Part of System Availability

   In GR-77 [1], extensive analysis has been performed for systems
   under various conditions.  In the NFV environment, if the server
   availability is denoted by Ax, the server part of the system
   availability (As), for a 1+1 master-slave configuration, is given
   in [1], Part D, Chapter 6, as

   EQ(3) ... ... ... As = 1 - (1-Ax)^2

   In a more practical environment, there will be silent errors
   (errors that cannot be detected by the system under consideration).
   The silent error probability is denoted by Pse.

   We further assume that the silent error only affects the master of
   the system, because the master is the one with the ability to
   corrupt the data.  In practical engineering terms, this assumption
   can be articulated as follows: when an error is detected and there
   is no obvious cause for it, the master-slave configuration will
   assume that the master is correct, while the slave will go through
   an MTTR time to recover.

   The state transitions can be illustrated as in the following
   diagram:

   Figure 3: State Transition for a System with Only One Backup
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the state transition diagram outlined in Figure 3, the system
   availability in a 1+1 master-slave configuration can be expressed
   as follows.

   EQ(4a) ... ... ... As = 1 - [(1-Ax)^2 + Pse(Ax)(1-Ax)]

   EQ(4b) ... ... ... As = (2-Pse)(Ax) - (1-Pse)(Ax)^2

   The following diagram (Figure 4) illustrates the server part of the
   availability for different per-server availabilities and different
   silent error probabilities.

   Figure 4: Server Part of the System Availability with Various
   Server Availabilities and Silent Error Probabilities
   (Note: a dot-and-dash version of the diagram is being developed.)

   While the graphics illustrate the trends, the following data table
   gives precise values for the single-backup (1+1) configuration.

   Table-2: Server Part of the Availability for Different Silent Error
   Probabilities (Pse) and Server Availabilities (Ax) in the Single-
   Backup Configuration

   +----------+---------+-----------+-------------+----------+
   | Pse \ Ax | 0.99000 |  0.99900  |   0.99990   | 0.99999  |
   +----------+---------+-----------+-------------+----------+
   |   0.0    | 0.9999  | 0.999999  | 0.99999999  | 1.0      |
   |   0.1    | 0.99891 | 0.9998991 | 0.999989991 | 0.999999 |
   |   0.2    | 0.99792 | 0.9997992 | 0.999979992 | 0.999998 |
   |   0.3    | 0.99693 | 0.9996993 | 0.999969993 | 0.999997 |
   |   0.4    | 0.99594 | 0.9995994 | 0.999959994 | 0.999996 |
   |   0.5    | 0.99495 | 0.9994995 | 0.999949995 | 0.999995 |
   |   0.6    | 0.99396 | 0.9993996 | 0.999939996 | 0.999994 |
   |   0.7    | 0.99297 | 0.9992997 | 0.999929997 | 0.999993 |
   |   0.8    | 0.99198 | 0.9991998 | 0.999919998 | 0.999992 |
   |   0.9    | 0.99099 | 0.9990999 | 0.999909999 | 0.999991 |
   |   1.0    | 0.99    | 0.999     | 0.9999      | 0.99999  |
   +----------+---------+-----------+-------------+----------+

   The entries at or above 0.99999 in the above table mark the region
   in which FIVE 9s availability is achievable.  As the table shows,
   the server part of the availability deteriorates rapidly with
   increasing silent error probability.  While FIVE 9s of availability
   can be achieved with a server availability of only THREE 9s when
   there is no silent error, FIVE 9s of server availability is
   demanded as soon as the silent error probability reaches 10%.
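   As a cross-check of EQ(4a)/EQ(4b) and Table-2, the short sketch
   below evaluates the single-backup expression for one row of the
   table.  It is illustrative only; the names are local to this
   example.

   # Illustrative evaluation of EQ(4b); not part of any specification.

   def single_backup_availability(ax, pse):
       """EQ(4b): server part of the availability with one dedicated
       backup and silent error probability 'pse' on the master."""
       return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2

   if __name__ == "__main__":
       # Reproduce the Pse = 0.1 row of Table-2.
       for ax in (0.99, 0.999, 0.9999, 0.99999):
           print(ax, round(single_backup_availability(ax, 0.1), 9))
       # Expected: 0.99891, 0.9998991, 0.999989991, ~0.999999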
   While the 1+1 configuration illustrated above seems reasonable for
   the server part of the system availability (As), there may be cases
   demanding more than a 1+1 configuration for reliability.  For
   systems with two backups, the availability, without consideration
   of the silent error, can be expressed as ([1], Part D, Chapter 6)

   EQ(5) ... ... ... As = 1 - (1-Ax)^3

   With the introduction of the silent error probability, the state
   transitions can be expressed in the following diagram:

   Figure 5: Error State Transition for a System with Two Backups
   (Note: a dot-and-dash version of the diagram is being developed.)

   With the introduction of the silent error, and observing the state
   transitions above, assuming that the silent error event and the
   server fault event are independent (e.g., a software error as the
   cause of the silent error and a hardware failure as the cause of
   the server fault), the server part of the availability for the
   dual-backup case is given by

   EQ(6a) ... ... As = 1 - [(1-Ax)^3 + Pse(1-Ax)((Ax)^2 + 2(Ax)(1-Ax))]

   EQ(6b) ... ... As = (3-2Pse)(Ax) - 3(1-Pse)(Ax)^2 + (1-Pse)(Ax)^3

   It should be noted that, when Pse = 1, both EQ(4) and EQ(6) reduce
   to As = Ax; that is, the server part of the system availability and
   the server availability are the same.  This relationship is to be
   expected since, if the master always experiences the silent error,
   the backups are useless because they are corrupted all the time.

   The system availability with dual backups can be illustrated as
   follows for different server availabilities and different silent
   error probabilities, including software malfunctions.

   Figure 6: Server Part of the System Availability with Various
   Silent Error Probabilities and Server Availabilities for a Dual-
   Backup System
   (Note: a dot-and-dash version of the diagram is being developed.)

   As in the previous case, the diagram only illustrates the trend,
   while the following table provides precise data for the system
   availability under different silent error probabilities and server
   availabilities for the dual-backup case.

   Table-3: System Availability for Different Silent Error
   Probabilities (Pse) and Server Availabilities (Ax) in the Dual-
   Backup Configuration

   +----------+-----------+-------------+---------+----------+
   | Pse \ Ax | 0.99000   |   0.99900   | 0.99990 | 0.99999  |
   +----------+-----------+-------------+---------+----------+
   |   0.0    | 0.999999  | 0.999999999 | 1.0     | 1.0      |
   |   0.1    | 0.9989991 | 0.999899999 | 0.99999 | 0.999999 |
   |   0.2    | 0.9979992 | 0.999799999 | 0.99998 | 0.999998 |
   |   0.3    | 0.9969993 | 0.999699999 | 0.99997 | 0.999997 |
   |   0.4    | 0.9959994 | 0.999599999 | 0.99996 | 0.999996 |
   |   0.5    | 0.9949995 | 0.9995      | 0.99995 | 0.999995 |
   |   0.6    | 0.9939996 | 0.9994      | 0.99994 | 0.999994 |
   |   0.7    | 0.9929997 | 0.9993      | 0.99993 | 0.999993 |
   |   0.8    | 0.9919998 | 0.9992      | 0.99992 | 0.999992 |
   |   0.9    | 0.9909999 | 0.9991      | 0.99991 | 0.999991 |
   |   1.0    | 0.99      | 0.999       | 0.9999  | 0.999990 |
   +----------+-----------+-------------+---------+----------+

   As in Table-2, the entries in Table-3 at or above 0.99999 represent
   FIVE 9s capability.  Comparing the two tables, the dual backup is
   of marginal advantage over the single backup except for the case
   where there is no silent error.  In that case, the FIVE 9s server
   part of the system availability can be achieved with a server
   availability of only TWO 9s.
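   The marginal benefit of the second backup in the presence of silent
   errors can be seen by evaluating EQ(4b) and EQ(6b) side by side, as
   in the illustrative sketch below (the names are local to this
   example).

   # Illustrative comparison of EQ(4b) (single backup) and EQ(6b)
   # (dual backup); not part of any specification.

   def single_backup(ax, pse):
       return (2.0 - pse) * ax - (1.0 - pse) * ax ** 2      # EQ(4b)

   def dual_backup(ax, pse):
       return ((3.0 - 2.0 * pse) * ax
               - 3.0 * (1.0 - pse) * ax ** 2
               + (1.0 - pse) * ax ** 3)                     # EQ(6b)

   if __name__ == "__main__":
       ax = 0.999    # THREE 9s server availability
       for pse in (0.0, 0.1, 0.5):
           print(pse, round(single_backup(ax, pse), 9),
                 round(dual_backup(ax, pse), 9))
       # With Pse = 0 the second backup removes another factor of
       # (1-Ax) from the unavailability; with Pse >= 0.1 both schemes
       # are dominated by the Pse x (1-Ax) term and differ only
       # marginally.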
   From the data above, we can conclude that the silent error,
   introduced by software errors or by hardware errors that are not
   detectable by software, plays an important role in the server part
   of the system availability, and hence in the final system
   availability.  In fact, it becomes the dominant element if Pse is
   more than 10%, in which case the difference between a single backup
   and dual backups is not significant.

   Some operators are of the opinion that a new approach to the
   availability requirements is needed.  COTS hardware is assumed to
   have lower availability than traditional telecom hardware.
   However, since each server (or VM) in the NFV environment will only
   affect a small number of users, the traditional FIVE 9s requirement
   could perhaps be relaxed while keeping the same user experience in
   terms of downtime.  In other words, the weighted downtime, in
   proportion to the number of users, might be expected to be reduced
   in the NFV environment, because each server affects only a small
   number of users for a given server reliability.

   Unfortunately, from the theoretical point of view, this is not
   true.  Each server's downtime will indeed affect only a small
   number of users, but multiple active servers also present more
   opportunities for server faults (this is similar to the well-known
   reliability argument for the twin-engine Boeing 777).  As long as
   the protection scheme, or more importantly, the number of backups,
   is the same, the eventual system availability will be the same,
   regardless of what portion of the users each server serves.

6.  Simulation Study of the Server Part of Availability

   In the above theoretical analysis of the server part of the
   availability, the following factors are not considered: (A) site
   maintenance (e.g., software upgrades, patches, etc., affecting the
   whole site), and (B) site failure (earthquake, etc.).

   While traditional telecom-grade equipment invests a great deal of
   emphasis and engineering complexity to ensure smooth migration,
   smooth software upgrades, and smooth patching procedures, COTS
   hardware and its related software are notorious for lacking such
   capabilities.  This is the primary reason for operators to hesitate
   to utilize COTS hardware, even though COTS hardware in the NFV
   environment does have an improved MTTR compared to traditional
   telecom hardware.

   While it is relatively easy to obtain a closed form of the system
   availability for the ideal case without site-related issues, it is
   extremely difficult to obtain an analytical solution when site
   issues are involved.  In this case, we resort to numerical
   simulation under reasonable assumptions [2], [3], [4].

6.1.  Methodology

   In this section, the various assumptions and the outline of the
   simulation mechanisms are discussed.

   A discrete event simulator is constructed to obtain the
   availability of the server part.  In the simulator, an active
   server (the master server, which processes the network traffic) is
   supported by one (single backup) or two (dual backup) servers in
   other site(s).

   For the failure probability of the server, it is common to assume a
   bathtub-shaped failure rate (Weibull distribution).  In practice,
   we need to ensure that the NFV management provides servers that are
   operating on the flat part of the bathtub curve.  In this case, the
   familiar exponential distribution can be utilized, as illustrated
   in the sketch below.
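   The following sketch shows why the exponential distribution
   corresponds to the flat part of the bathtub curve: the Weibull
   hazard rate is decreasing for shape k < 1 (infant mortality),
   constant for k = 1 (useful life, i.e., the exponential case), and
   increasing for k > 1 (wear-out).  It is illustrative only, and the
   parameter names are not taken from this document.

   # Illustrative Weibull hazard-rate sketch; not part of any
   # specification.  h(t) = (k/lam) * (t/lam)**(k-1).

   def weibull_hazard(t, shape_k, scale_lam):
       """Instantaneous failure rate of a Weibull distribution."""
       return (shape_k / scale_lam) * (t / scale_lam) ** (shape_k - 1)

   if __name__ == "__main__":
       scale = 10000.0  # scale parameter; equals the MTBF when k = 1
       for k in (0.5, 1.0, 2.0):   # infant mortality / useful life /
                                   # wear-out
           rates = [weibull_hazard(t, k, scale)
                    for t in (100.0, 1000.0, 5000.0)]
           print(k, [round(r, 8) for r in rates])
       # For k = 1 the hazard is constant at 1/MTBF, which is exactly
       # the memoryless exponential model used in the simulator.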
   In the discrete event simulator, each server is scheduled to work
   for a certain duration of time.  This duration is a random variable
   with an exponential distribution, which is commonly used to model
   server behavior during its useful life, with the mean given by the
   MTBF of the server.

   In fact, the flat part of the bathtub curve can be related to the
   normal server MTBF (mean time between failures) through the failure
   density function, expressed as f(x) = (1/MTBF) x e^(-x/MTBF).

   After the working duration, the server will be down for a fixed
   time duration, which represents the time needed to start another
   virtual machine to replace the one in trouble.  This aspect is
   different from traditional telecom-grade equipment.  Here, the
   assumption is that there will always be another server available to
   replace the one that went down.  Hence, regardless of the nature of
   the fault, the downtime for a server fault is fixed and represents
   the time needed to have another server ready to take over the task.

   The following diagram shows this arrangement for a system with only
   one backup.  It should be noted that, while the server up-time
   duration is variable, the server downtime is fixed.

   Figure 7: The Life of the Servers
   (Note: a dot-and-dash version of the diagram is being developed.)

   The servers are hosted in "sites", which are considered to be data
   centers.  In this simulation, during the initial setup, the servers
   supporting each other for reliability purposes are hosted in
   different sites.  This is to minimize the impact of site failure
   and site maintenance.

   In order to model the system behavior with one or two backups, the
   concept of a protection group is introduced.

   A protection group consists of a "master" server with one or two
   "slave" server(s) in other site(s).  There may be multiple
   protection groups in the network, with each protection group
   serving a fraction of the users.

   A protection group is considered to be "down" if every server in
   the group is dead.  While the protection group is "down", the
   network service is affected, and the network is considered to be
   "down" for the group of users this protection group is responsible
   for.

   The uptime and downtime of the protection group are recorded in the
   discrete event simulator.  The server part of the availability is
   given by (where the total elapsed time is the total simulation time
   in the discrete event simulator)

   EQ(7) ... ... Availability (server part) = PGUP / TET, where

   o  PGUP is the Protection Group Up Time

   o  TET is the Total Elapsed Time

   The concepts of protection group, site, and server can be
   illustrated as follows (Figure 8) for a system with two backups.
   It should be noted that the protection group is an abstract
   concept, and that its portion of the network function is
   unavailable if and only if all the servers in the protection group
   are not functioning.

   Figure 8: Servers, Sites, and Protection Group
   (Note: a dot-and-dash version of the diagram is being developed.)

   A minimal sketch of this simulation approach is given below.
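   The following is a minimal sketch of the discrete event simulation
   described above, restricted to a single 1+1 protection group with
   exponentially distributed up-times, a fixed MTTR, and a silent
   error probability on the master.  It is illustrative only: all
   names and the simplified event handling are assumptions of this
   example rather than the authors' simulator, and site maintenance
   and site failure events are not modelled here.

   # Minimal, illustrative 1+1 protection-group simulator; a
   # simplification of the methodology above, not the authors'
   # simulator.
   import random

   def simulate_one_plus_one(mtbf, mttr, pse, horizon, seed=1):
       """Estimate EQ(7), PGUP/TET, for a single 1+1 protection group."""
       rng = random.Random(seed)

       def draw_uptime():
           return rng.expovariate(1.0 / mtbf)

       up = [True, True]                      # server states
       fail_at = [draw_uptime(), draw_uptime()]
       repair_at = [0.0, 0.0]
       master = 0
       clock, down_since, down_time = 0.0, None, 0.0

       while clock < horizon:
           # Next event: failure of an up server or repair of a down one.
           t, i = min((fail_at[j] if up[j] else repair_at[j], j)
                      for j in (0, 1))
           clock = t
           if up[i]:
               if i == master and rng.random() < pse:
                   # Silent error on the master: both servers need a
                   # full MTTR to recover; the group is down meanwhile.
                   for j in (0, 1):
                       up[j] = False
                       repair_at[j] = clock + mttr
               else:
                   # Observable fault: only the failed server is
                   # replaced after a fixed MTTR; the survivor serves.
                   up[i] = False
                   repair_at[i] = clock + mttr
                   if i == master:
                       master = 1 - i
               if not (up[0] or up[1]) and down_since is None:
                   down_since = clock
           else:
               # A replacement server becomes ready after the fixed MTTR.
               up[i] = True
               fail_at[i] = clock + draw_uptime()
               if down_since is not None:
                   down_time += clock - down_since
                   down_since = None
               if not up[1 - i]:
                   master = i
       if down_since is not None:
           down_time += clock - down_since
       return 1.0 - down_time / clock

   if __name__ == "__main__":
       # MTBF of 10000 and MTTR of 6 (in the same time unit), with 20%
       # silent errors; compare against EQ(4b) with Ax close to
       # MTBF / (MTBF + MTTR).
       print(simulate_one_plus_one(10000.0, 6.0, 0.2, 5.0e7))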
   Even though the simulator allows each site to host a configurable
   number of servers, there is little use for this arrangement.  The
   system availability will not change regardless of how many servers
   per site are used to support the system, as long as there is no
   change in the number of servers in each protection group.
   Increasing the number of servers per site essentially increases the
   number of protection groups.  Over a long time duration, each
   protection group will experience similar downtime for the same up
   time (i.e., it will have the same availability).

   As in the theoretical analysis, the silent error, due to software
   or subtle hardware failures, will only affect the active (or
   master) server.  When the master server fails with a silent error,
   both the master and the "slave" servers will go through an MTTR
   time to recover (e.g., the time to instantiate two VMs
   simultaneously).  During this time, this part of the system (i.e.,
   this protection group) is considered to be under fault.

   In this reliability study, the focus is on the number of backups
   for each protection group, where the 1+1 configuration is the
   typical configuration for a single-backup mechanism.  A load-
   sharing arrangement such as 1:1 can be viewed as two protection
   groups.

   In general, the load-sharing scheme will have lower availability
   because, in the 1:1 case, any server fault results in faults in two
   different protection groups.  This can be extended to the 1:2 case,
   where three protection groups are involved and any server fault
   introduces faults in three different protection groups.  In this
   study, the load-sharing mechanisms will not be elaborated further.

   A site will also go through maintenance work.  Traditional telecom-
   grade equipment and COTS hardware differ mainly in this respect.
   For telecom-grade equipment, minimal impact on system performance
   or system availability is maintained during the maintenance window.
   For COTS hardware, however, the maintenance work may be more
   frequent and more disruptive.

   In order to simulate the maintenance aspect of COTS hardware, the
   simulator puts a site "under maintenance" at random times.  The
   interval during which a site is working is assumed to be an
   exponentially distributed random variable, with a mean that is
   configurable in the simulator.  The duration of the maintenance is
   a uniformly distributed random variable with a configured mean,
   minimum, and maximum.  A sketch of how such site events can be
   drawn is given below.
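   As an illustration of the site model only, site maintenance events
   could be drawn as follows.  The parameter values are those used in
   Section 6.3, and the use of a plain uniform draw between the
   minimum and maximum durations is an assumption of this sketch.

   # Illustrative sampling of site maintenance events; not the
   # authors' simulator.  Times are in hours.
   import random

   rng = random.Random(7)

   MEAN_TIME_BETWEEN_MAINTENANCE = 1000.0        # exponential mean
   MAINTENANCE_MIN, MAINTENANCE_MAX = 4.0, 48.0  # uniform duration

   def next_maintenance_window(now):
       """Return (start, end) of the next maintenance window after
       'now'."""
       start = now + rng.expovariate(1.0 / MEAN_TIME_BETWEEN_MAINTENANCE)
       duration = rng.uniform(MAINTENANCE_MIN, MAINTENANCE_MAX)
       return start, start + duration

   if __name__ == "__main__":
       t = 0.0
       for _ in range(3):
           start, end = next_maintenance_window(t)
           print(round(start, 1), round(end, 1))
           t = end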
   In order to put a site "under maintenance", there must be no fault
   in the network.  All the servers on the site to be put "under
   maintenance" are moved to other sites.  Hence, no traffic is
   impacted during the process of putting the site under maintenance.
   Of course, the resilience against site failure is reduced while a
   site is under maintenance.

   When a site comes back from maintenance, it will attempt to reclaim
   all the server responsibilities that were transferred because of
   the site maintenance.

   o  For each protection group, if every server is working, the
      protection group will re-arrange the protection relationship so
      that each site hosts only one server of the protection group.
      The new server on the site back from maintenance will need an
      MTTR time to be ready as a backup.  In this case, there is no
      loss of service in the system.

   o  For each protection group, if there is at least one server
      working and at least one in a fault condition, one working
      server will be added to the protection group.  The new server on
      the site back from maintenance will need an MTTR time to be
      ready as a backup.  In this case, there is no loss of service in
      the system.

   o  For each protection group, if no server is working, the
      protection group will gain a working server from the site back
      from maintenance.  The new server will need an MTTR time to be
      ready for service.  In this case, the system will provide
      service only after the new server is ready.

   A site can also be under fault (e.g., loss of power, operation
   under reduced capability due to thermal issues, or an earthquake).
   The simulator can also simulate the effect of such events, with the
   site up duration being an exponentially distributed random variable
   with a configurable mean.  The site failure duration is expressed
   as a uniformly distributed random variable with a configurable
   mean, minimum, and maximum.

6.2.  Validation of the Simulator

   In order to verify the correctness of the simulator (e.g., the
   random number generator, the overall program structure, etc.), the
   simulation is performed with various server availabilities and
   various silent error probabilities and compared against the
   closed-form results of Section 5.

   For the single-backup case, the error between the theoretical and
   simulated system availability on the server part is illustrated in
   the following diagram (Figure 9).

   Figure 9: Verification of the Simulator for the Single-Backup Case
   (Note: a dot-and-dash version of the diagram is being developed.)

   As can be seen, the magnitude of the errors is within 10^(-5),
   which is very small considering that the nominal value of the
   system availability for the server part is close to 1.0.  For the
   dual-backup case, the error between the simulated and theoretical
   system availability for different silent error probabilities and
   server availabilities is illustrated as follows (Figure 10).

   Figure 10: Verification of the Simulator for the Dual-Backup Case
   (Note: a dot-and-dash version of the diagram is being developed.)

   This is similar to the single-backup case, where the errors are
   within the same range.  This error information gives us the needed
   confidence in the simulation results for the more complicated cases
   where analytical solutions are elusive.

6.3.  Simulation Results

   The effect of the MTTR in the NFV environment is studied first.
   The combined effect of the MTTR and the silent error probability is
   shown below:

   Figure 11: Availability with Various Silent Error Probabilities for
   Different MTTRs
   (Note: a dot-and-dash version of the diagram is being developed.)

   In the diagram (Figure 11), R6 represents an MTTR of 6 minutes,
   while R60 represents an MTTR of 60 minutes.  The x-axis is the
   silent error probability.  As shown, the MTTR (the time to recover
   from a fault, or the time for a VM rebirth) affects the slope of
   the system availability, which declines as the silent error
   probability increases.  In the above example, the server MTBF is
   assumed to be 10000 hours, which represents a server availability
   of 0.9994 for the R6 case and 0.994 for the R60 case.

   The two curves starting at approximately 1.0 show the system
   availability with dual backups, while the other two show the system
   availability with a single backup.  It should be noted that, for
   the dual-backup case, there is little difference in availability
   for different MTTRs when there is no silent error.  Intuitively,
   this is expected due to the added number of backup servers.
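   For reference, a sketch of one common steady-state relation between
   MTBF, MTTR, and per-server availability is given below.  This
   relation is an assumption of this example (it is not stated in this
   draft), and it reproduces the per-server availabilities quoted
   above when the MTBF and the MTTR are expressed in a common time
   unit.

   # Assumed steady-state relation, not taken from this draft:
   #     Ax = MTBF / (MTBF + MTTR)

   def server_availability(mtbf, mttr):
       """Both arguments must use the same time unit."""
       return mtbf / (mtbf + mttr)

   if __name__ == "__main__":
       print(round(server_availability(10000.0, 6.0), 4))   # ~0.9994
       print(round(server_availability(10000.0, 60.0), 4))  # ~0.994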
   In this simulation, both site failure (with a mean time between
   site failures of 20000 hours) and site maintenance (with a mean
   time between site maintenance events of 1000 hours) are considered.
   The mean site failure duration is assumed to be 12 hours (uniformly
   distributed between 4 hours and 24 hours), and the mean site
   maintenance duration is 24 hours (uniformly distributed between 4
   hours and 48 hours).

   The next step is to evaluate the impact of the site issues (site
   failure and site maintenance).  Consider the rather bad site
   outlined above, for which the mean time between site failures is 2
   times the server MTBF and the mean time between site maintenance
   events is 0.1 times the server MTBF.  The availability of the
   server part for this site is illustrated below for different silent
   error probabilities and server availabilities in the single-backup
   configuration.

   Figure 12: Availability of the Server Part in the Single-Backup
   Configuration
   (Note: a dot-and-dash version of the diagram is being developed.)

   As the data illustrate, in order to achieve high availability, the
   server availability needs to be very high.  In fact, the server
   availability needs to be in the range of FIVE 9s in order to
   achieve a system availability of FIVE 9s under these site-related
   issues.  For the dual-backup system with exactly the same
   configuration, the result is better and can be illustrated as
   follows:

   Figure 13: Availability of the Server Part in the Dual-Backup
   Configuration
   (Note: a dot-and-dash version of the diagram is being developed.)

   With a server availability of FOUR 9s and low silent error
   probabilities, the server part of the availability can achieve FIVE
   9s.  Now consider a site with fewer issues, one whose mean time
   between site failures is 100 times the server MTBF and whose mean
   time between site maintenance events is 0.1 times the server MTBF.
   The mean site failure duration is again assumed to be 12 hours
   (uniformly distributed between 4 hours and 24 hours), and the mean
   site maintenance duration is 24 hours (uniformly distributed
   between 4 hours and 48 hours).  The results for the single-backup
   system are as follows:

   Figure 14: Server Part of the Availability for a Good Site with a
   Single Backup
   (Note: a dot-and-dash version of the diagram is being developed.)

   The following data table (Table-4) gives precise information on
   these simulation results.

   Table-4: Details of the Availability of the Server Part for a
   Single Backup on a Good Site

   +----------+----------+----------+------------+------------+
   | Pse \ Ax | 0.990099 | 0.999001 | 0.99990001 | 0.99999    |
   +----------+----------+----------+------------+------------+
   |   0.0    | 0.998971 | 0.999959 | 0.9999992  | 1.0        |
   |   0.1    | 0.997918 | 0.999857 | 0.99998959 | 0.99999901 |
   |   0.2    | 0.996908 | 0.999771 | 0.99997957 | 0.99999804 |
   |   0.3    | 0.995999 | 0.999674 | 0.99996935 | 0.99999695 |
   +----------+----------+----------+------------+------------+

   As evidenced in the table above, the server part of the system
   availability is impacted by the silent error, and a single
   redundant server provides only marginal improvement unless the
   silent error probability is small.  The corresponding results for
   the dual-backup configuration are shown in Figure 15.

   Figure 15: Server Part of the Availability for a Good Site with a
   Dual Backup
   (Note: a dot-and-dash version of the diagram is being developed.)
   The diagram above gives the general trend in system availability,
   and the following data table provides the precise values.

   Table-5: Details of the Availability of the Server Part for a Dual
   Backup on a Good Site

   +----------+------------+------------+------------+------------+
   | Pse \ Ax | 0.99009901 | 0.999001   | 0.99990001 | 0.99999    |
   +----------+------------+------------+------------+------------+
   |   0.0    | 0.9999939  | 0.99999998 | 1.0        | 1.0        |
   |   0.2    | 0.9981346  | 0.99980209 | 0.99998048 | 0.99999792 |
   |   0.4    | 0.99615083 | 0.99960136 | 0.99996002 | 0.99999594 |
   |   0.5    | 0.99522474 | 0.9995184  | 0.99995225 | 0.99999503 |
   +----------+------------+------------+------------+------------+

   From the tables for the single and dual backups, we can see that
   the dual backup provides only marginal benefit in the face of site
   issues.  Given that site issues are inevitable in practice, a
   geographically distributed single-backup system is recommended for
   simplicity.

6.4.  Multiple Servers Sharing the Load

   In this section, we outline the simulation results for cases in
   which multiple servers carry the active workload.  In such cases,
   the failure of a protection group affects a smaller number of
   users.

   In the simulation, each site has N servers to serve the workload.
   A weighted uptime and a weighted downtime are introduced.  The
   system availability is the weighted uptime divided by the total of
   the weighted uptime and the weighted downtime.

   EQ(8) ... ... Weighted-Availability (server part) = (TET - WDT)/TET,
   where

   o  TET is the Total Elapsed Time

   o  WDT is the Weighted Down Time

   If any protection group i is down, the WDT is updated as follows:

   EQ(9) ... ... WDT = WDT + [Protection Group i Down Time] / N

   For a system with three protection groups (i.e., three servers
   sharing the workload), the availability of each protection group,
   as well as the weighted availability, is obtained as follows
   (Table-6):

   Table-6: Availability of the Protection Groups and the Weighted
   Availability (Dual Backup)

   +-----+-------------+-------------+-------------+-------------+--------------+
   | Pse |     PG 1    |     PG 2    |     PG 3    |   Weighted  |  PG Average  |
   |     |             |             |             |             |  - Weighted  |
   +-----+-------------+-------------+-------------+-------------+--------------+
   | 0.0 | 1.0         | 1.0         | 1.0         | 1.0         | 0.0          |
   | 0.2 | 0.999998015 | 0.999998005 | 0.999997985 | 0.999998001 | 6.66668E-11  |
   | 0.4 | 0.999996027 | 0.999996018 | 0.999995988 | 0.999996011 | -3.33333E-11 |
   +-----+-------------+-------------+-------------+-------------+--------------+

   (Pse: silent error probability; PG n: availability of protection
   group n; Weighted: measured weighted availability; PG Average -
   Weighted: difference between the average of the protection group
   availabilities and the weighted availability.)

   In this case, there is little difference between the different
   protection groups.  The weighted availability is effectively the
   average of the availabilities of all the protection groups.  This
   also illustrates the fact that, regardless of how many servers
   share the active load, the system availability will be the same as
   long as (A) the number of backups is the same, and (B) each
   server's availability is the same.  The weighted-downtime
   bookkeeping of EQ(8) and EQ(9) is sketched below.
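   A minimal sketch of the weighted-downtime bookkeeping of EQ(8) and
   EQ(9) follows.  It is illustrative only; the class and attribute
   names are assumptions of this example.

   # Illustrative bookkeeping for EQ(8)/EQ(9); not the authors'
   # simulator.

   class WeightedDowntime:
       """Accumulates weighted downtime over N load-sharing protection
       groups and reports the weighted availability of the server
       part."""

       def __init__(self, num_groups):
           self.n = num_groups
           self.wdt = 0.0        # Weighted Down Time

       def record_group_downtime(self, duration):
           # EQ(9): each group's downtime counts with weight 1/N.
           self.wdt += duration / self.n

       def weighted_availability(self, total_elapsed_time):
           # EQ(8): (TET - WDT) / TET.
           return (total_elapsed_time - self.wdt) / total_elapsed_time

   if __name__ == "__main__":
       book = WeightedDowntime(num_groups=3)
       tet = 1.0e6                      # total elapsed (simulated) time
       for outage in (2.0, 3.0, 1.0):   # one outage per protection group
           book.record_group_downtime(outage)
       print(book.weighted_availability(tet))   # 0.999998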
7.  Conclusions

   The system availability can be divided into two parts: the
   availability of the network part and the availability of the server
   part.  The final system availability is the product of these two
   parts.

   The network part of the system availability is determined by the
   maximum number of hops and the individual network element
   availability, with the fault-tolerant setup assumed to be 1+1.  The
   server part of the system availability is mainly determined by the
   following parameters:

   o  Availability of each individual server

   o  Silent error probability

   o  Site related issues (maintenance, fault)

   o  Protection scheme (one or two dedicated backups)

   The silent error is introduced to account for software errors and
   for hardware errors that cannot be detected.  The system
   availability of the server part will be dominated by such silent
   errors if the silent error probability is more than 10%.  This is
   shown in both the theoretical work and the simulations.

   It is interesting to note that the dual-backup scheme provides only
   marginal benefits, and the added complexity may not warrant such a
   practice in a real network.

   It is possible for COTS hardware to provide availability as high as
   that of traditional telecom hardware if the server itself has
   reasonably high availability.  The undesirable attributes of COTS
   hardware have been modelled as the site-related issues, such as
   site maintenance and site failure, which are not applicable to
   traditional telecom hardware.  Hence, in calculating the server
   availability itself, the site-related issues are excluded.

   It is critical for the virtualization infrastructure management to
   provide as much hardware failure information as possible in order
   to improve the availability of the application.  As seen in both
   the theoretical work and the simulation, the silent error
   probability becomes a dominant factor in the final availability.
   The silent error probability can be reduced if the virtualization
   infrastructure management is capable of fault isolation.

8.  Security Considerations

   To be determined.

9.  IANA Considerations

   This Internet-Draft includes no request to IANA.

10.  Acknowledgements

   The authors would like to thank the NFV RG chairs (Diego and Ramki)
   for encouraging discussions and guidance.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [I-D.irtf-nfvrg-nfv-policy-arch]
              Figueira, N., Krishnan, R., Lopez, D., Wright, S., and
              D. Krishnaswamy, "Policy Architecture and Framework for
              NFV Infrastructures", draft-irtf-nfvrg-nfv-policy-
              arch-01 (work in progress), August 2015.

   [1]        GR-77, "Applied R&M Manual for Defense Systems", 2012.

11.2.  Informative References

   [2]        Papoulis, A., "Probability, Random Variables, and
              Stochastic Processes", 2002.

   [3]        Bremaud, P., "An Introduction to Probabilistic
              Modeling", 1994.

   [4]        Press, W., et al., "Numerical Recipes in C/C++", 2007.

Authors' Addresses

   Li Mo
   ZTE (TX) Inc.
   2425 N. Central Expressway
   Richardson, TX  75080
   USA

   Phone: +1-972-454-9661
   Email: li.mo@ztetx.com


   Bhumip Khasnabish (editor)
   ZTE (TX) Inc.
   55 Madison Avenue, Suite 160
   Morristown, New Jersey  07960
   USA

   Phone: +001-781-752-8003
   Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com
   URI:   http://tinyurl.com/bhumip/