Internet Draft                                           Srinivas_Pitta
Expires: October 2003                                Wipro Technologies
                                                             April 2003


               Redundant Fault Tolerant Configurations
             draft-pitta-redundant-fault-tol-conf-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 20, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   A system that delivers its normal service in the presence of
   unexpected errors, recovers from those errors, and returns to
   normal operation is called a fault tolerant system. Such systems,
   whose behavior is predictable in nearly every possible situation,
   are often built in redundant fault tolerant configurations. This
   document defines the various redundant fault tolerant
   configurations that make these systems dependable systems or high
   availability systems.


Srinivas_Pitta           Expires October 2003                  [Page 1]
Internet-Draft   Redundant Fault Tolerant Configurations     April 2003

Table of Contents

   1.   Introduction
   1.1  Definitions
   2.   Conventions
   3.   Configurations
   3.1  Hot Standby Redundant Fault Tolerant Configuration
   3.2  Warm Standby Redundant Fault Tolerant Configuration
   3.3  Cold Standby Redundant Fault Tolerant Configuration
        Security Considerations
        Normative References
        Author's Addresses
        Intellectual Property and Copyright Statements

1.  Introduction

   This document gives the definitions of Hot, Warm and Cold standby
   redundant fault-tolerant configurations. It makes no assumption
   about how the functionality is achieved. It can be implemented in
   hardware, in software, or in a combination of both.

   The fault-tolerant configurations described in this document do
   not restrict the user from having more than one standby set.

1.1  Definitions

   Component - A component can be a set of hardware or software
      entities or a combination of both.

   up-time - A pre-determined period of time during which a system
      is up and running and able to provide complete functionality
      to the user.

   dependable system - A system that is able to provide full
      services without any downtime for a given period of time
      (up-time).

   Set - A set is a collection of hardware and software components.

2.  Conventions

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
   SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
   they appear in this document, are to be interpreted as described
   in [2].

3.  Configurations

   Due to our inability to produce error-free hardware or software
   components, it is not possible to assure the up-time of even a
   dependable system.
   One of the most important factors in building a dependable
   mission critical system is fault tolerance. The primary purpose of
   a fault-tolerant system is to ensure that the system is
   operational for the period of time (up-time) for which it is
   designed. Failures that occur during this time should not prevent
   the system from rendering its services, and the system should
   still provide the complete functionality for which it is designed.
   For some applications safety is more important than reliability,
   and fault tolerance techniques help those applications prevent
   catastrophes. Applications running over a fault-tolerant system
   are often called High-Availability applications.

   The term redundancy can be defined as the use of resources beyond
   the minimum needed to deliver a specific functionality or service
   to the user. Fault tolerance is achieved through the use of
   redundancy in the hardware and software. This involves the use of
   a duplicate set of hardware and software in addition to the
   primary or principal set. The primary set actively processes the
   inputs entering the system until a failure occurs, after which the
   duplicate or secondary set (standby) takes over the system and
   starts to provide the functionality.

   Redundant fault-tolerant systems come in three different
   configurations:

   1) Hot Standby
   2) Warm Standby
   3) Cold Standby

   Under normal operation the active set (one or more processors and
   software components) controls the system. The active set processes
   all the traffic (including provisioning, data and control) and
   provides complete functionality to the user. If a failure leaves
   the active set unable to deliver the functionality, the standby
   takes over the system and starts to provide the functionality.

   The configurations differ in how the input enters the system and
   where it is processed. They also differ in where the information
   required to make the system fault tolerant is processed and
   maintained.
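
   The switch-over behaviour common to all three configurations can
   be illustrated with a minimal Python sketch. It is not part of any
   specification: the names ComponentSet, RedundantSystem, handle and
   switch_over are hypothetical, and a failure is modelled simply as
   an exception raised by the active set.

```python
from dataclasses import dataclass


@dataclass
class ComponentSet:
    """Hypothetical model of a set of hardware/software components."""
    name: str
    healthy: bool = True

    def process(self, item: str) -> str:
        # A failed set cannot deliver its functionality.
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{self.name}:{item}"


class RedundantSystem:
    """Sends inputs to the active set and falls back to the standby."""

    def __init__(self, primary: ComponentSet, secondary: ComponentSet) -> None:
        self.active = primary
        self.standby = secondary

    def switch_over(self) -> None:
        # The standby takes over; the failed set becomes the standby.
        self.active, self.standby = self.standby, self.active

    def handle(self, item: str) -> str:
        try:
            return self.active.process(item)
        except RuntimeError:
            # Assumption: a single standby set is available.
            self.switch_over()
            return self.active.process(item)


system = RedundantSystem(ComponentSet("set-0"), ComponentSet("set-1"))
first = system.handle("x")       # served by set-0
system.active.healthy = False    # simulate a failure of the active set
second = system.handle("y")      # served by set-1 after switch-over
```

   The sketch leaves open how the standby obtains the state it needs
   to take over, which is exactly where the three configurations
   differ.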

3.1  Hot Standby Redundant Fault Tolerant Configuration

   In Hot standby systems, the active and standby sets both receive
   all the inputs entering the system. Both sets process the input in
   exactly the same way. A selector receives the output from both
   sets. The selector suppresses the output from the standby, and
   only the output from the active set is used.

   When the active set encounters a failure after which it can no
   longer deliver its intended functionality, the standby takes over
   the system and continues to provide the functionality.

   The decision to switch to the standby when a fault occurs can be
   made by the standby or by another processor actively monitoring
   both the active and standby sets. In some cases, if the active set
   determines that some of its components are degraded and that it
   can no longer continue, the active set can give up control of the
   system to the standby set.

   In this configuration all the hardware and software states of the
   active and standby sets are identical, as both receive all the
   input traffic. If the active set fails at any time, the system can
   continue to provide the intended functionality without any
   disruption.

   In this arrangement it is sufficient for the selector function to
   switch to the standby output within the deadline set by the SONET
   standard GR-253-CORE [3].

   If hot insertion needs to be supported for some or all of the
   components in a particular set, the implementation should restore
   the re-inserted component to the same state as the active one.
   The re-inserted component can be declared redundant only after
   the hardware and software states of the current active component
   have been exactly replicated.
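
   The duplication and selection functions of the hot standby
   configuration can be sketched as follows. This is an illustrative
   Python model under stated assumptions, not a prescribed
   implementation: CounterSet and HotStandbySystem are hypothetical
   names, the state of a set is reduced to a running total, and a
   failed set is assumed to produce no output.

```python
class CounterSet:
    """Hypothetical set whose state is a running total of its inputs."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.failed = False

    def process(self, value):
        if self.failed:
            return None  # assumption: a failed set yields no output
        self.total += value
        return self.total


class HotStandbySystem:
    def __init__(self):
        self.active = CounterSet("set-0")
        self.standby = CounterSet("set-1")

    def feed(self, value):
        # Duplication function: both sets receive every input.
        active_out = self.active.process(value)
        standby_out = self.standby.process(value)
        # Selection function: use the active output and suppress
        # the standby output while the active set is working.
        if active_out is not None:
            return active_out
        # The active set failed: switch to the standby, whose output
        # for this very input has already been computed.
        self.active, self.standby = self.standby, self.active
        return standby_out


hot = HotStandbySystem()
out1 = hot.feed(2)
out2 = hot.feed(3)
hot.active.failed = True   # simulate a failure of the active set
out3 = hot.feed(4)         # no input is lost across the switch
```

   Because the standby has processed every input itself, the selector
   only has to change which output it forwards; no state has to be
   transferred at the time of the failure.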

   In this configuration, as both the active and standby sets receive
   exactly the same stimuli or inputs, the hardware and software
   states on both sides will be identical. It is possible, however,
   that the two sets are not in the same state in the following
   cases:

   a) If the processing by the system is dependent on the surrounding
      environment. For example, even though the two sets receive the
      same inputs, if they take into account environment variables
      such as temperature or pressure, the output from the two sets
      will be different and they may not be in the same state as
      expected.

   b) If the two sets have different versions of the hardware and
      software and the system cannot guarantee the same output for a
      given input.

   The rationale for the use of multiple versions is the expectation
   that components built differently (e.g., by different designers,
   with different algorithms or different design tools) should fail
   differently. Therefore, if one version fails on a particular
   input, at least one of the alternate versions should be able to
   provide an appropriate output. But if the two sets with different
   versions cannot guarantee the same output, then the states of the
   two sets may not be as expected.

                              |
                             \|/  Inputs
                              V
                        -------------
                       | Duplication |
                       |  Function   |
                        -------------
                              |
                             \|/
                              V
                  -------------------------
                  |                       |
                 \|/                     \|/
                  V                       V
            -------------           -------------
           |   Active    |         |   Standby   |
           |     Set     |         |     Set     |
            -------------           -------------
                  |                       |
                 \|/                     \|/
                  V                       V
                  -------------------------
                              |
                             \|/
                              V
                        -------------
                       |  Selection  |
                       |  Function   |
                        -------------
                              |
                             \|/  Outputs
                              V

      Fig 1: Hot Standby Redundant Fault-tolerant configuration

3.2  Warm Standby Redundant Fault Tolerant Configuration

   In Warm standby systems, only the active set receives all the
   inputs entering the system.
   The active set processes all the inputs entering the system. Once
   the inputs are processed, the active set sends the resulting
   internal state changes to the standby set. This process of the
   active set propagating its internal state changes to the standby
   set is called state updates or check pointing.

   All the hardware and software states (data base) on the active and
   standby sets will be similar due to the internal state updates
   from the active to the standby set. Since their states are almost
   the same at any given time, the standby is ready to take over the
   system at any time.

   If at any time the active set fails or is no longer able to
   provide the intended functionality to the user, the standby takes
   over the system and continues to provide the functionality.

   The decision to switch to the standby when a fault occurs can be
   made by the standby or by another processor actively monitoring
   both the active and standby sets. In some cases, if the active set
   determines that some of its components are degraded and that it
   can no longer continue, the active set can give up control of the
   system to the standby set.

   In this arrangement it is sufficient for the standby set to take
   over the system within the deadline set by the SONET standard
   GR-253-CORE [3].

   If hot insertion needs to be supported for some or all of the
   components in a particular set, the implementation should
   replicate the hardware and software states to the re-inserted
   (standby) components in the system. This process of replicating
   the states of the active component to the standby component is
   called cloning. The re-inserted component should be declared
   redundant only after the hardware and software states of the
   current active component have been cloned.
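
   Check pointing and cloning as described above can be sketched in
   Python as follows. The sketch is illustrative only: WarmSet,
   WarmStandbySystem and their methods are hypothetical names, the
   state is modelled as a dictionary, and state updates are assumed
   to be delivered synchronously and reliably.

```python
import copy


class WarmSet:
    """Hypothetical set whose state is a key/value data base."""

    def __init__(self, name):
        self.name = name
        self.state = {}

    def process(self, key, value):
        # Process the input and return the resulting state change.
        self.state[key] = value
        return {key: value}

    def apply_update(self, delta):
        # Standby side of a state update (check pointing).
        self.state.update(delta)

    def clone_from(self, other):
        # Cloning: replicate the complete state of the active set.
        self.state = copy.deepcopy(other.state)


class WarmStandbySystem:
    def __init__(self):
        self.active = WarmSet("set-0")
        self.standby = WarmSet("set-1")

    def feed(self, key, value):
        # Only the active set receives the input.
        delta = self.active.process(key, value)
        self.standby.apply_update(delta)

    def fail_over(self):
        self.active, self.standby = self.standby, self.active

    def reinsert_standby(self, name):
        # A re-inserted component is redundant only after cloning.
        fresh = WarmSet(name)
        fresh.clone_from(self.active)
        self.standby = fresh


warm = WarmStandbySystem()
warm.feed("port1", "up")
warm.feed("port2", "down")
warm.fail_over()                    # set-1 takes over with current state
warm.feed("port2", "up")
warm.reinsert_standby("set-2")      # hot insertion followed by cloning
```

   In a real system the update could be lost or delayed between the
   active and standby sets, which is why the states are described as
   almost the same rather than identical.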

   In this configuration, instead of cloning the actual internal
   states from the active to the standby, the actual input (stimuli)
   can be cloned to the standby so that the standby ends up in
   exactly the same state. It is possible, however, that the two sets
   are not in the same state in the following cases:

   a) If the processing by the system is dependent on the surrounding
      environment. For example, even though the two sets receive the
      same inputs, if they take into account environment variables
      such as temperature or pressure, the output from the two sets
      will be different and they may not be in the same state as
      expected.

   b) If the two sets have different versions of the hardware and
      software and the system cannot guarantee the same output for a
      given input.

   The rationale for the use of multiple versions is the expectation
   that components built differently (e.g., by different designers,
   with different algorithms or different design tools) should fail
   differently. Therefore, if one version fails on a particular
   input, at least one of the alternate versions should be able to
   provide an appropriate output. But if the two sets with different
   versions cannot guarantee the same output, then the states of the
   two sets may not be as expected.

             -------------      State      -------------
   Inputs   |   Active    |    Updates    |   Standby   |
   -------->|     Set     |-------------->|     Set     |
             -------------                 -------------
                   |
                   |
                  \|/  Outputs
                   V

      Fig 2: Warm Standby Redundant Fault-tolerant configuration

3.3  Cold Standby Redundant Fault Tolerant Configuration

   In Cold standby systems, only the active set receives all the
   inputs entering the system. The active set processes all the
   inputs entering the system. Once the inputs are processed, the
   active set writes the internal state changes to reliable storage
   media.
   The standby set may or may not be powered up. The standby cannot
   take over the system until it has been populated with all the
   internal state changes from the storage media.

   If at any time the active set fails or is no longer able to
   provide the intended functionality to the user, the standby may be
   brought into the operational state. The standby is populated with
   all the state updates from the storage media, and all the inputs
   are directed to the standby set. From then on the standby provides
   the functionality to the user.

   The decision to switch to the standby when a fault occurs on the
   active set can be made by the standby if the standby is powered
   up, or by another processor actively monitoring the active set if
   the standby is not powered up. In some cases, if the active set
   determines that some of its components are degraded and that it
   can no longer continue, the active set can give up control of the
   system to the standby set.

   In any case, in this configuration the standby set may not be able
   to take over the system within the deadline set by the SONET
   standard GR-253-CORE [3].

   Hot insertion is not of great concern here, as there is little
   interaction between the active set and the standby set.
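
   The cold standby take-over sequence described above can be
   sketched in Python as follows. The sketch is illustrative only:
   ReliableStorage is an in-memory stand-in for the reliable storage
   media, and ColdSet and take_over are hypothetical names that do
   not come from any specification.

```python
class ReliableStorage:
    """In-memory stand-in for reliable storage media (assumption)."""

    def __init__(self):
        self._log = []

    def write(self, update):
        self._log.append(update)

    def read_all(self):
        return list(self._log)


class ColdSet:
    def __init__(self, name):
        self.name = name
        self.powered_up = False
        self.state = {}

    def power_up(self):
        self.powered_up = True

    def process(self, key, value, storage):
        # The active set writes each state change to storage.
        self.state[key] = value
        storage.write((key, value))

    def restore(self, storage):
        # Replay the stored state changes in the order written.
        for key, value in storage.read_all():
            self.state[key] = value


def take_over(standby, storage):
    # The standby may need to be powered up first, and it cannot
    # serve inputs until it is populated from the storage media.
    if not standby.powered_up:
        standby.power_up()
    standby.restore(storage)
    return standby


storage = ReliableStorage()
active = ColdSet("set-0")
active.power_up()
active.process("a", 1, storage)
active.process("a", 2, storage)
active.process("b", 3, storage)

standby = ColdSet("set-1")          # not powered up, holds no state
new_active = take_over(standby, storage)
```

   The time spent powering up and replaying the stored updates is the
   reason the take-over deadline may not be met in this
   configuration.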

           ----------   State    / --------- \   State    -----------
  Inputs  |  Active  |  Updates  | Reliable  |  Updates  |  Standby  |
  ------->|   Set    |---------->|  Storage  |---------->|    Set    |
           ----------            \ --------- /            -----------
               |
               |
              \|/  Outputs
               V

      Fig 3a: Cold Standby Redundant Fault-tolerant configuration
              (Powered Up Scenario)

                  |                           |
                  |                           |
                 \|/  Inputs                 \|/  Inputs
                  V                           V
            -------------               -------------
           |   Active    |             |   Standby   |
           |     Set     |             |     Set     |
            -------------               -------------
                  |                           |
                  |                           |
                 \|/  Outputs                \|/  Outputs
                  V                           V

      Fig 3b: Cold Standby Redundant Fault-tolerant configuration
              (Not Powered Up Scenario)

Security Considerations

   None

Normative References

   [1]  Bradner, S., "The Internet Standards Process -- Revision 3",
        BCP 9, RFC 2026, October 1996.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [3]  GR-253-CORE, "Synchronous Optical Network (SONET) Transport
        Systems: Common Generic Criteria", September 2000.

Author's Addresses

   Srinivas Pitta
   Wipro Technologies
   1300 Crittenden Lane, 2nd Floor,
   Mountain View CA - 94043, USA

   Phone:
   EMail: Srinivas.Pitta@Wipro.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; neither does it represent
   that it has made any effort to identify any such rights.
   Information on the IETF's procedures with respect to rights in
   standards-track and standards-related documentation can be found
   in BCP-11.
   Copies of claims of rights made available for publication and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementors or users of this
   specification can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention
   any copyrights, patents or patent applications, or other
   proprietary rights which may cover technology that may be required
   to practice this standard. Please address the information to the
   IETF Executive Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared,
   copied, published and distributed, in whole or in part, without
   restriction of any kind, provided that the above copyright notice
   and this paragraph are included on all such copies and derivative
   works. However, this document itself may not be modified in any
   way, such as by removing the copyright notice or references to the
   Internet Society or other Internet organizations, except as needed
   for the purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards
   process must be followed, or as required to translate it into
   languages other than English.

   The limited permissions granted above are perpetual and will not
   be revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.