Network Working Group                                        Ralph Droms
INTERNET DRAFT                                               Kim Kinnear
                                                              Mark Stapp
                                                           Cisco Systems
                                                             Bernie Volz
                                                                 IPWorks
                                                            Steve Gonczi
                                                         Network Engines
                                                              Greg Rabil
                                                             Mike Dooley
                                                              Arun Kapur
                                                     Lucent Technologies
                                                               July 2000
                                                    Expires January 2001

                         DHCP Failover Protocol

Status of this Memo

   This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2000). All Rights Reserved.

Droms, et. al.            Expires January 2001                  [Page 1]

Internet Draft           DHCP Failover Protocol                July 2000

Abstract

   DHCP [RFC 2131] allows for multiple servers to be operating on a single network. Some sites are interested in running multiple servers in such a way so as to provide redundancy in case of server failure. In order for this to work reliably, the cooperating primary and secondary servers must maintain a consistent database of the lease information. This implies that servers will need to coordinate any and all lease activity so that this information is synchronized in case of failover.

   This document defines a protocol to provide such synchronization between two servers. One server is designated the "primary" server, the other is the "secondary" server. This document also describes a way to integrate the failover protocol with the DHCP load balancing approach.

   This document is a substantial reorganization as well as a technical and editorial revision of draft-ietf-dhc-failover-05.txt.

Table of Contents

   1. Introduction
   2. Terminology
      2.1. Requirements terminology
      2.2. DHCP and failover terminology
   3. Background and External Requirements
      3.1. Key aspects of the DHCP protocol
      3.2. BOOTP relay agent implementation
      3.3. What does it mean if a server can't communicate with its partner?
      3.4. Challenging scenarios for a Failover protocol
      3.5. Using TCP to detect partner server failure
   4. Design Goals
      4.1. Design goals for this protocol
      4.2. Limitations of this protocol
   5. Protocol Overview
      5.1. Messages and States
      5.2. Fundamental guarantees
      5.3. Load balancing
      5.4. IP address allocations between servers
      5.5. Operating in NORMAL state
      5.6. Operating in COMMUNICATIONS-INTERRUPTED state
      5.7. Operating in PARTNER-DOWN state
      5.8. Operating in RECOVER state
      5.9. Operating in STARTUP state
      5.10. Time synchronization between servers
      5.11. IP address binding-status
      5.12. DNS dynamic update considerations
      5.13. Reservations and failover
      5.14. Dynamic BOOTP and failover
      5.15. Guidelines for selecting MCLT
      5.16. What is sent in response to an UPDREQ or UPDREQALL message?
   6. Common Message Format
      6.1. Message header format
      6.2. Common option format
      6.3. Batching multiple binding update transactions in one BNDUPD mes-
   7. Protocol Messages
      7.1. BNDUPD message [3]
      7.2. BNDACK message [4]
      7.3. UPDREQ message [9]
      7.4. UPDREQALL message [7]
      7.5. UPDDONE message [8]
      7.6. POOLREQ message [1]
      7.7. POOLRESP message [2]
      7.8. CONNECT message [5]
      7.9. CONNECTACK message [6]
      7.10. STATE message [10]
      7.11. CONTACT message [11]
      7.12. DISCONNECT message [12]
   8. Connection Management
      8.1. Connection granularity
      8.2. Creating the TCP connection
      8.3. Using the TCP connection for determining communications status
      8.4. Using the TCP connection for binding data
      8.5. Using the TCP connection for control messages
      8.6. Losing the TCP connection
   9. Failover Endpoint States
      9.1. Server Initialization
      9.2. Server State Transitions
      9.3. STARTUP state
      9.4. PARTNER-DOWN state
      9.5. RECOVER state
      9.6. NORMAL state
      9.7. COMMUNICATIONS-INTERRUPTED State
      9.8. POTENTIAL-CONFLICT state
      9.9. RESOLUTION-INTERRUPTED state
      9.10. CONFLICT-DONE state
      9.12. PAUSED state
      9.13. SHUTDOWN state
   10. Safe Period
   11. Security
      11.1. Simple shared secret
      11.2. TLS
   12. Failover Options
      12.1. addresses-transferred
      12.2. assigned-IP-address
      12.3. binding-status
      12.4. client-identifier
      12.5. client-hardware-address
      12.6. client-last-transaction-time
      12.7. client-reply-options
      12.8. client-request-options
      12.9. DDNS
      12.10. delayed-service-parameter
      12.11. hash-bucket-assignment
      12.12. IP-flags
      12.13. lease-expiration-time
      12.14. max-unacked-bndupd
      12.15. MCLT
      12.16. message
      12.17. message-digest
      12.18. potential-expiration-time
      12.19. receive-timer
      12.20. protocol-version
      12.21. reject-reason
      12.22. sending-server-IP-address
      12.23. server-flags
      12.24. server-state
      12.25. start-time-of-state
      12.26. TLS-reply
      12.27. TLS-request
      12.28. vendor-class-identifier
      12.29. vendor-specific-options
   13. IANA Considerations
   14. Acknowledgments
   15. References
   16. Author's information
   17. Full Copyright Statement

1. Introduction

   DHCP [RFC 2131] allows for multiple servers to be operating on a single network.
   Some sites are interested in running multiple servers in such a way as to provide redundancy in case of server failure, since the DHCP subsystem is in many cases a critical part of the network infrastructure.

   This document defines a protocol to provide synchronization between two servers in order that each can take over for the other should either one fail or become unreachable. One server is designated the "primary" server, the other is the "secondary" server, and most DHCP client requests are sent to each server (see Section 3.1.1 for details). In order to provide a high availability DHCP service, these cooperating primary and secondary servers must maintain a consistent database of lease information. This implies that servers will need to coordinate all lease activity so that this information is synchronized in case failover is required. The protocol messages and processing techniques required to maintain a consistent database are specified in the protocol described here.

   The failover protocol also contains a way to integrate the DHCP load-balancing algorithm described in [LOADB] with the failover protocol.

2. Terminology

   This section discusses both the generic requirements terminology common to many IETF protocol specifications and the specialized terminology specific to DHCP and the failover protocol.

2.1. Requirements terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].

2.2. DHCP and failover terminology

   This document uses the following terms:

   o "available IP address"

      An IP address is "available" if it may be allocated by a specific DHCP server. An IP address is considered (for the purposes of this document) to be available to a single server for allocation unless otherwise noted.
      An IP address available for allocation on a primary server has binding-status FREE, and an IP address available for allocation on a secondary server has binding-status BACKUP.

   o "binding"

      A binding is a collection of configuration parameters, including at least an IP address, associated with or "bound to" a DHCP client. Bindings are managed by DHCP servers.

   o "binding database"

      The collection of bindings managed by a primary and secondary.

   o "binding update transaction"

      A binding update transaction refers to the set of information (contained in options) necessary to perform a binding update for a single IP address. It comprises the assigned-IP-address option and the binding-status option, along with other options as appropriate.

   o "binding-status"

      The binding-status is the status of an IP address with respect to its association with a client. There are specific binding-status values defined for use by the failover protocol, e.g., ACTIVE, FREE, RELEASED, ABANDONED, etc. These are designed to map more or less directly onto the binding-status values used internally in most DHCP server implementations.

      The term binding-status refers to the concept also sometimes known as "lease state" or "IP address state", but in this document the term "state" is reserved for the failover state of a failover endpoint, and binding-status is always used to refer to the state associated with an IP address or lease.

   o "DHCP client" or "client"

      A DHCP client is an Internet host using DHCP to obtain configuration parameters such as a network address. The term "client" used within this document always means a DHCP client, and never one of the two failover servers.

   o "DHCP server" or "server"

      A DHCP server is an Internet host that returns configuration parameters to DHCP clients.

   o "DDNS"

      An abbreviation for "Dynamic DNS", which refers to the capability to update a DNS server's name (actually resource record) database using an on-the-wire protocol defined in [RFC 2136].

   o "DNS"

      An abbreviation for "Domain Name System", a scheme where a central name repository is used to map names to IP addresses and IP addresses to names.

   o "failover endpoint"

      The failover protocol allows for there to be a unique failover endpoint per partner per role (where role is primary or secondary). This failover endpoint can take actions and hold unique states. There are thus a maximum of two failover endpoints per server per partner (one for each partner as a primary and one for that same partner as a secondary).

   o "FQDN"

      An FQDN is a "fully qualified domain name". A fully qualified domain name generally is a host name with at least one zone name; for example, "www.dhcp.org" is a fully qualified domain name.

   o "lazy update"

      Lazy update refers to the requirement placed on a server implementing a failover protocol to update its failover partner whenever the binding database changes. A failover protocol which didn't support lazy update would require the failover partner update to be complete before a DHCP server could respond to a DHCP client request with a DHCPACK. A failover protocol which does support lazy update places no such restriction on the update of the failover partner server, and so a server can allocate an IP address or extend a lease on an IP address and then update its failover partner as time permits. A failover protocol which supports lazy update not only removes the requirement to update the failover partner prior to responding to a DHCP client with a DHCPACK, but also allows gathering up batches of updates from one failover server to its partner.

   o "MCLT"

      The MCLT refers to maximum client lead time.
      This time is configured on the primary server and transmitted from the primary to the secondary server in the CONNECT message. It is the maximum amount of time that one server can extend a lease for a client's binding beyond the time known by the partner server. See section 5.2.1 for details.

   o "partner"

      A "partner", for the purposes of this document, refers to a failover server, typically the other failover server. In many (if not most) cases, the failover protocol is symmetric with respect to the primary or secondary nature of the servers, and so it is often appropriate to discuss "updating the partner server", since it could be a primary server updating a secondary server or a secondary server updating a primary server.

   o "Primary server" or "Primary"

      A DHCP server configured to provide primary service to a set of DHCP clients for a particular set of subnet address pools.

   o "RR"

      "RR" is an abbreviation for "resource record". All records in the DNS are resource records. The resource records of most relevance to this document are the "A" resource record, which maps a DNS name to a particular IP address, the "PTR" resource record, which allows a "reverse map", from the IP address back to a DNS name, and the "KEY" resource record, which is used in ways defined in [DDNS] to tag a DNS name with the identity of the DHCP client with which it is associated.

   o "Secondary server" or "Secondary"

      A DHCP server configured to act as backup to a primary server for a particular set of subnet address pools.

   o "stable storage"

      Every DHCP server is assumed to have some form of what is called "stable storage". Stable storage is used to hold information concerning IP address bindings (among other things) so that this information is not lost in the event of a server failure which requires restart of the server.

   o "state"

      In this document, the term "state" refers exclusively to the state of a failover endpoint, for example: NORMAL, COMMUNICATIONS-INTERRUPTED, PARTNER-DOWN. It is not used to refer to any attributes of an IP address or a binding of an IP address. See "binding-status".

   o "subnet address pool"

      A subnet address pool is the set of IP addresses which is associated with a particular network number and subnet mask. In the simple case, there is a single network number and subnet mask and a set of IP addresses. In the more complex case (sometimes called "secondary subnets", sometimes "superscopes"), several (apparently unrelated) network number and subnet mask combinations with their associated IP addresses may all be configured together into one subnet address pool.

3. Background and External Requirements

   This section highlights key aspects of the DHCP protocol on which the failover protocol depends. It also discusses the requirements that the failover protocol places on other aspects of the network infrastructure, and some general issues surrounding server failure detection. Some failure scenarios that provide particular challenges to a failover protocol are discussed. Finally, the challenges inherent in using a TCP connection as a means to detect failure of a partner server are elaborated.

3.1. Key aspects of the DHCP protocol

   The failover protocol is designed to augment the DHCP protocol as described in RFC 2131 [RFC 2131]. There are several key aspects of the DHCP protocol which are required by the failover protocol in order to successfully meet its design goals.

3.1.1. Broadcast behavior

   There are two aspects of the broadcast behavior of the DHCP protocol which are key to making the failover protocol operate successfully.

   The first is simply that the DHCP protocol requires a DHCP client to broadcast all DHCPDISCOVER and DHCPREQUEST/INIT-REBOOT messages.
   Because of this requirement, a DHCP client that was communicating with one server will automatically be able to communicate with another server if one is available.

   The second aspect of broadcast behavior is similar to the first, but involves the distinction between a DHCPREQUEST/RENEW and DHCPREQUEST/REBINDING. A DHCPREQUEST/RENEW is the message that a DHCP client uses to extend its lease. It is unicast to the DHCP server from which it acquired the lease. However, the DHCP protocol (in a farsighted move) was explicitly designed so that in the event that a DHCP client cannot contact the server from which it received a lease on an IP address using a DHCPREQUEST/RENEW, the client is required to broadcast its renewal using a DHCPREQUEST/REBINDING to any available DHCP server. Since all DHCP clients are required to implement this algorithm, the failover protocol can have a server other than the one that initially granted a lease be the server that renews it. Thus, one server can take over for another with no interruption in the service as experienced by the DHCP client or its associated applications software.

3.1.2. Client responsibility

   In the DHCP protocol the DHCP clients are entrusted with a considerable responsibility. In particular, after they are granted a lease on an IP address, they are enjoined to use that IP address only while their lease is valid. Every DHCP client is expected to stop using an IP address if the expiration time on the lease has passed and if it cannot get an extension on the lease for that IP address from some DHCP server.

   Thus, the correct behavior of every DHCP client in this regard is required to ensure the integrity of the DHCP service. On the other hand, incorrect behavior by a client in this area will tend to adversely affect at most one other DHCP client.
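
   The client behavior on which sections 3.1.1 and 3.1.2 rely can be sketched as a small decision function. This is only an illustration under stated assumptions, not part of the failover protocol: the names Lease, Action, and next_action are hypothetical, and the T1 and T2 fractions (0.5 and 0.875 of the lease duration) are the defaults given in RFC 2131 rather than values defined in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_USING = "keep using the address; nothing to send yet"
    RENEW_UNICAST = "unicast DHCPREQUEST/RENEW to the granting server"
    REBIND_BROADCAST = "broadcast DHCPREQUEST/REBINDING to any server"
    STOP_USING = "lease expired; stop using the address"


@dataclass
class Lease:
    granted_at: float  # client clock, in seconds
    duration: float    # lease length, in seconds

    @property
    def t1(self) -> float:
        # Assumption: RFC 2131 default renewal time (0.5 * lease duration).
        return self.granted_at + 0.5 * self.duration

    @property
    def t2(self) -> float:
        # Assumption: RFC 2131 default rebinding time (0.875 * lease duration).
        return self.granted_at + 0.875 * self.duration

    @property
    def expires_at(self) -> float:
        return self.granted_at + self.duration


def next_action(lease: Lease, now: float) -> Action:
    """Return what a well-behaved client does with its lease at time `now`."""
    if now >= lease.expires_at:
        return Action.STOP_USING        # no extension obtained in time
    if now >= lease.t2:
        return Action.REBIND_BROADCAST  # granting server unreachable; ask anyone
    if now >= lease.t1:
        return Action.RENEW_UNICAST     # normal renewal with the granting server
    return Action.KEEP_USING
```

   For a lease of 1000 seconds granted at time 0, this sketch renews by unicast from 500 seconds, rebinds by broadcast from 875 seconds, and gives up the address at 1000 seconds. The broadcast step is the one that lets a failover partner answer in place of the server that granted the lease.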

   Furthermore, any DHCP client which sends in a DHCPREQUEST/RENEW or DHCPREQUEST/REBINDING to a DHCP server (either unicast for a RENEW or broadcast for a REBINDING) MUST still have time to run on the lease for that IP address. The DHCP server sends the DHCPACK back unicast to the IP address from which the RENEW or REBINDING originated.

   Given the existing responsibility placed on the client to only use an IP address when the lease is valid, and to only send in a RENEW or REBINDING if the lease is valid, the failover protocol relies on DHCP clients to perform responsibly and will, in the absence of conflicting information, believe a DHCP client that is attempting to RENEW or REBIND a lease on an IP address is the legitimate owner of that IP address.

   If clients do not follow these rules, it is possible for an address to be in use by more than one client. For a single server, this happens because the server has leased the expired address to another client and the original client is also attempting to use the address. The server would NAK the renewal request. This is made slightly worse in the failover protocol if the two servers are unable to communicate with each other and one server leases an available address to a new client while the other server receives a renewal from a different client. In this case, both servers lease the same address to different clients for the MCLT time.

   One troublesome issue is that of the DHCP client responsibility when sending in DHCPREQUEST/INIT-REBOOT requests. While the original DHCP RFC was written to require a DHCP client to have time left to run on the lease for an IP address if the client is sending an INIT-REBOOT request, it was sufficiently unclear that some client vendors didn't realize this until recently.
   Since the INIT-REBOOT request was sent with the IP address in the dhcp-requested-address option and not in the ciaddr (for perfectly good reasons), the similarity to the RENEW and REBINDING case was lost on many people.

   At present, the failover protocol does not assume that a client sending in an INIT-REBOOT request necessarily has a valid lease on the IP address appearing in the dhcp-requested-address option in the INIT-REBOOT request.

   The implications of this are as follows: Assume that there is a DHCP client that gets a lease from one server while that server is unable to communicate with its failover partner. Then, assume that after that client reboots it is able to communicate only with the other failover server. If the failover servers have not been able to communicate with each other during this process, then the DHCP client will get a new IP address instead of being able to continue to use its existing IP address. This will affect no applications on the DHCP client, since it is rebooting. However, it will use up an additional IP address in this marginal case.

3.1.3. Stable storage update before DHCPACK

   The DHCP protocol allocates resources, and in order to operate correctly it requires that a DHCP server update some form of stable storage prior to sending a DHCPACK to a DHCP client in order to grant that client a lease on an IP address.

   One of the goals of the failover protocol is that it not add significant additional time to this already time-consuming requirement to update stable storage prior to a DHCPACK. In particular, adding a requirement to communicate with another server prior to sending a DHCPACK would greatly simplify the failover protocol, but it would unacceptably limit the potential scalability of any DHCP server which employed the failover protocol.

3.2. BOOTP relay agent implementation

   Many DHCP clients are not resident on the same network segment as a DHCP server.
   In order to support this form of network architecture, most contemporary routers implement something known as a BOOTP Relay Agent. This capability inside a router listens for all broadcasts at the DHCP port, port 67, and will relay any broadcasts that it receives on to a DHCP server. The IP address of the DHCP server must have been previously configured into the router. As part of the relay process, the relay agent will place the address of the interface on which it received the broadcast into the giaddr field of the DHCP packet.

   Since the failover protocol requires two DHCP servers to receive any broadcast DHCP messages, in order to work with DHCP clients which are not local to the DHCP server, the BOOTP relay agent on the router closest to the DHCP client must be configured to point at more than one DHCP server.

   Most BOOTP relay agent implementations allow this duplication of packets.

   If this is not possible, an administrator might be able to configure the relay agent with a subnet broadcast address, but in this case the primary and secondary DHCP servers in a failover pair must both reside on the same subnet.

3.3. What does it mean if a server can't communicate with its partner?

   In any protocol designed to allow one server to take over some responsibilities from a partner server in the event of "failure" of that partner server, there is an inherent difficulty in determining when that partner server has failed. In fact, it is fundamentally impossible for one server to distinguish a network communications failure from the outright failure of the server with which it is trying to communicate.
   In the case where each server is handing out resources (in this case IP addresses) to a client community, mistaking an inability to communicate with a partner server for failure of that partner server could easily cause both servers to be handing out the same IP addresses to different clients.

   One way that this is sometimes handled is for there to be more than two servers. In the case of an odd number of servers, the servers that can still communicate with a majority of other servers will consider themselves operational, and any server which can't communicate with a majority of other servers must immediately cease operations.

   While this technique works in some domains, having the only server with which a DHCP client can communicate voluntarily shut itself down seems like something worth avoiding.

   The failover protocol will operate correctly while both servers are unable to communicate, whether they are both running or not. At some point there may be resource contention, and if one of the servers is actually down, then the operator can inform the operational server and the operational server will be able to use all of the failed server's resources.

   The protocol also allows detection of an orderly shutdown of a participating server.

3.4. Challenging scenarios for a Failover protocol

   There exist two failure scenarios which provide particular challenges to the correctness guarantees of a failover protocol.

3.4.1. Primary Server crash before "lazy" update:

   In the case where the primary server sends a DHCPACK to a client for a newly allocated IP address and then crashes prior to sending the corresponding update to the secondary server, the secondary server will have no record of the IP address allocation. When the secondary server takes over, it may well try to allocate that IP address to a different client.
   In the case where the first client to receive the IP address is not on the net at the time (yet while there was still time to run on its lease), an ICMP echo (i.e., ping) will not prevent the secondary server from allocating that IP address to a different client.

   The failover protocol deals with this situation by having the primary and secondary servers allocate addresses for new clients from disjoint address pools. See section 5.4 for details.

   A more likely (in that DHCPREQUEST/RENEW messages are presumably more common than DHCPDISCOVER messages) and more subtle version of this problem is where the primary server crashes after extending a client's lease time, and before updating the secondary with a new time using a lazy update. After the secondary takes over, if the client is not connected to the network the secondary will believe the client's lease has expired when, in fact, it has not. In this case as well, the IP address might be reallocated to a different client while the first client is still using it.

   This scenario is handled by the failover protocol through control of the lease time and the use of the maximum client lead time (MCLT). See section 5.2.1 for details.

3.4.2. Network partition where DHCP servers can't communicate but each can talk to clients:

   Several conditions are required for this situation to occur. First, due to a network failure, the primary and secondary servers cannot communicate. As well, some of the DHCP clients must be able to communicate with the primary server, and some of the clients must now only be able to communicate with the secondary server. When this condition occurs, both primary and secondary servers could attempt to allocate IP addresses for new clients from the same pool of available addresses. At some point, then, two clients will end up being allocated the same IP address.
   This will cause problems when the network failure that created this situation is corrected.

   The failover protocol deals with this situation by having the primary and secondary servers allocate addresses for new clients from disjoint address pools. See section 5.4 for details.

3.5. Using TCP to detect partner server failure

   There are several characteristics of TCP that are important to the functioning of the failover protocol, which uses one TCP connection both for bulk data transfer and to assess communications integrity with the other server. Reliable and ordered message delivery are chief among these important characteristics.

   It would be nice to use the capabilities built in to TCP to allow it to determine if communications integrity exists to the failover partner, but this strategy contains some problems which require analysis. There exist three fundamental cases for an open TCP connection that must be examined.

   1. When no data is being sent then no messages are traveling across the TCP connection.

   2. When data is queued to be sent, and the receiver has not blocked the sending of additional data, then messages are flowing across the TCP connection containing the application's data.

   3. When data is queued to be sent, and the receiver has blocked the transmission of additional data, then persist (zero-window probe) messages are flowing from the sender to the receiver so that the sender doesn't miss the receiver opening the window for further transmissions.

   The first case can be turned into the second case by sending application-level keep-alive messages periodically when there is no other data queued to be sent. Note that TCP keep-alive messages might be used as well, but they present additional problems.

   Thus, we can ensure that the TCP connection has messages flowing periodically across the connection fairly easily.
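
   The conversion of the first case into the second can be sketched as a small idle timer on the sending side. This is a minimal illustration under stated assumptions and not the mechanism specified by this document: the class name KeepAliveSender, its methods, and the interval value are all hypothetical.

```python
import time


class KeepAliveSender:
    """Decides when an idle connection needs an application-level keep-alive."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval    # hypothetical idle limit, in seconds
        self._clock = clock         # injectable so the logic can be tested
        self._last_sent = clock()

    def note_sent(self):
        """Record that some message (data or keep-alive) was just written."""
        self._last_sent = self._clock()

    def keepalive_due(self, data_queued):
        """Return True when nothing is queued and the link has been idle too long."""
        if data_queued:
            # Ordinary traffic already keeps messages flowing (case 2 or 3).
            return False
        return self._clock() - self._last_sent >= self.interval
```

   A sending loop would call keepalive_due() on each pass, write a keep-alive message when it returns True, and call note_sent() after every write. This only guarantees that messages flow; deciding that the partner is unreachable still needs a separate timeout on the receiving side.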
The question remains as to what TCP will do if the other end of the connection fails to respond (either because of network partition or because the receiving server crashes). TCP will attempt to retransmit a message with an exponential backoff, and will eventually time out that retransmission. However, the length of that timeout cannot, in general, be set on a per-connection basis, and is frequently as long as nine minutes, though in some cases it may be as short as two minutes. On some systems it can be set system-wide, while on other systems it cannot be changed at all. A value for this timeout that would be appropriate for the failover protocol, say less than 1 minute, could have unpleasant side-effects on other applications running on the same server, assuming that it could be changed at all on the host operating system.

Nine minutes is a long time for the DHCP service to be unavailable to any new clients that were being served by the server which has crashed, when there is another server running that could respond to them as soon as it determines that its partner is not operational.

The conclusion drawn from this analysis is that TCP provides very useful support for the failover protocol in the areas of reliable and ordered message delivery, but cannot by itself be relied upon to detect partner server failure in a fashion acceptable to the needs of the failover protocol.

Additional failover protocol capabilities have been created to support timely detection of partner server failure. See section 8.3 for details on this mechanism.

4. Design Goals

This section lists the design goals and the limitations of the failover protocol.

4.1. Design goals for this protocol

The following is a list of goals that are met by this protocol. They are listed in priority order.

1.
Implementations of this protocol must work with existing DHCP client implementations based on the DHCP protocol [1].

2. Implementations of the protocol must work with existing BOOTP relay agent implementations.

3. The protocol must provide failover redundancy between servers that are not located on the same subnet.

4. Provide for continued service to DHCP clients through an automated mechanism in the event of failure of the primary server.

5. Avoid binding an IP address to a client while that binding is currently valid for another client. In other words, do not allocate the same IP address to two clients.

6. Minimize any need for manual administrative intervention.

7. Introduce no additional delays in server response time as a result of the network communications required to implement the failover protocol, i.e., don't require communications with the partner between the receipt of a DHCPREQUEST and the corresponding DHCPACK.

8. Share IP address ranges between primary and secondary servers; i.e., impose no requirement that the pool of available addresses be manually or permanently divided between servers.

9. Continue to meet the goals and objectives of this protocol in the event of server failure or network partition.

10. Provide graceful reintegration of full protocol service after server failure or network partition.

11. Allow for one computer to act as a secondary server for multiple primary servers. The protocol must allow failover primary and secondary configuration choices to be made at a granularity smaller than "all of the subnets served by a single server", though individual implementations may choose not to allow such flexibility.

12.
Ensure that an existing client can keep its existing IP address binding if it can communicate with either the primary or secondary DHCP server implementing this protocol, not just the server that originally offered it the binding.

13. Ensure that a new client can get an IP address from some server. Ensure that in the face of partition, where servers continue to run but cannot communicate with each other, the above goals and requirements may be met. In addition, when the partition condition is removed, allow graceful automatic re-integration without requiring human intervention.

14. If either primary or secondary server loses all of the information that it has stored in stable storage, ensure that it be able to refresh its stable storage from the other server.

15. Support load balancing between the primary and secondary servers, and allow configuration of the percentage of the client population served by each with a moderately fine granularity.

4.2. Limitations of this protocol

The following are explicit limitations of this protocol.

1. This protocol provides only one level of redundancy through a single secondary server for each primary server.

2. A subset of the address pool is reserved for secondary server use. In order to handle the failure case where both servers are able to communicate with DHCP clients, but unable to communicate with each other, a subset of the IP address pool must be set aside as a private address pool for the secondary server. The secondary can use these to service newly arrived DHCP clients during such a period. The required size of this private pool is based only on the arrival rate of new DHCP clients and the length of expected downtime, and is not influenced in any way by the total number of DHCP clients supported by the server pair.
The failover protocol can be used in a mode where both the primary and secondary servers can share the load between them when both are operating. In this load balancing mode, the addresses allocated by the primary server to the secondary server are not unused, but are used instead to service the portion of the client base to which the secondary server is required to respond. See section 5.3 for more information on load balancing.

3. The primary and secondary servers do not respond to client requests at all while recovering from a failure that could have resulted in duplicate IP assignments (that is, while synchronizing in POTENTIAL-CONFLICT state).

5. Protocol Overview

This section will discuss the failover protocol at a relatively high level. In the event that a description in this section conflicts (or appears to conflict due to the overview nature of this section) with information in later sections of this draft, the information in the later sections should be considered authoritative.

5.1. Messages and States

This protocol is centered around the message exchange used by one server to update the other server of binding database changes resulting from DHCP client activity:

o Communication of binding database changes

   The binding update (BNDUPD) message is used to send the binding database changes to the partner server, and the partner server responds with a binding acknowledgement (BNDACK) message when it has successfully committed those changes to its own stable storage.

All of the other messages involve ancillary issues:

o Management of available IP addresses

   The pool request (POOLREQ) is used by the secondary server to request an allocation of IP addresses from the primary server.
   The pool response (POOLRESP) is used by the primary server to inform the secondary server how many IP addresses were allocated to the secondary server as the result of the pool request.

o Synchronization of the binding databases between the servers after they've been out of communications

   The update request (UPDREQ) message is used by one server to request that its partner send it all binding database information that it has not already seen.

   The update request all (UPDREQALL) message is used by one server to request that all binding database information be sent in order to recover from a total loss of its binding database by the requesting server.

   The update done (UPDDONE) message is used by the responding server to indicate that all requested updates have been sent by the responding server and acked by the requesting server.

o Connection establishment

   The connect (CONNECT) message is used by the primary server to establish a high level connection with the other server, and to transmit several important configuration data items between the servers.

   The connect acknowledgement message (CONNECTACK) is used by the secondary server to respond to a CONNECT message from the primary server.

   The disconnect (DISCONNECT) message is used by either server when closing a connection.

o Server synchronization

   The state change (STATE) message is used by either server to inform the other server of a change of failover state.

o Connection integrity management

   The contact (CONTACT) message is used by either server to ensure that the other server continues to see the connection as operational. It MUST be transmitted periodically over every established connection if other message traffic is not flowing, and it MAY be sent at any time.

5.1.1. Failover endpoints

The proper operation of the failover protocol requires more than the transmission of messages between one server and the other.
Each endpoint might seem to be a single DHCP server, but in fact there are many situations where additional flexibility in configuration is useful.

For instance, there might be several servers which are each primary for a distinct set of address pools, and one server which is secondary for all of those address pools. The situation with the primaries is straightforward, but the secondary will need to maintain a separate failover state, partner state, and communications up/down status for each of the separate primary servers for which it is acting as a secondary.

The failover protocol calls for there to be a unique failover endpoint per partner per role (where role is primary or secondary). This failover endpoint can take actions and hold unique states. There are thus a maximum of two failover endpoints per partner (one for the partner as a primary and one for that same partner as a secondary).

Thus, in the case where there are two primary servers A and B each backed up by a single common secondary server C, there is one failover endpoint on each of A and B, and two different failover endpoints on C. The two different failover endpoints on C each have unique states and independent TCP connections.

This document frequently describes the behavior of the protocol in terms of primary and secondary servers, not primary and secondary failover endpoints. However, it is important to remember that every 'server' described in this document is in reality a failover endpoint that resides in a particular process, and that many failover endpoints may reside in the same process.

It is not the case that there is a unique failover endpoint for each subnet address pool that participates in a failover relationship. On one server, there is one failover endpoint per partner per role, regardless of how many subnet address pools are managed by that combination of partner and role.
Conversely, on a particular server, any given subnet address pool will be associated with exactly one failover endpoint.

When a connection is received from the partner, the unique failover endpoint to which the message is directed is determined solely by the IP address of the partner and the port to which the connection is directed by the partner. See section 8.2.

5.2. Fundamental guarantees

There are several fundamental restrictions this protocol places on what one server can do in the absence of knowledge of the other server. Operating within these restrictions allows certain guarantees to be made to the partner server, and these are key to the correct operation of the protocol.

5.2.1. Control of lease time

The key problem with lazy update is that when a server fails after updating a client with a particular lease time and before updating its partner, the partner will believe that a lease has expired even though the client still retains a valid lease on that IP address.

In order to handle this problem, a period of time known as the "Maximum Client Lead Time" (MCLT) is defined and must be known to both the primary and secondary servers. Proper use of this time interval places an upper bound on the difference allowed between the lease time provided to a DHCP client by a server and the lease time known by that server's partner.

However, the MCLT is typically much less than the lease time that a server has been configured to offer a client, and so some strategy must exist to allow a server to offer the configured lease time to a client. During a lazy update the updating server typically updates its partner with a potential expiration time which is longer than the lease time previously given to the client and which is longer than the lease time that the server has been configured to give a client.
This allows that server to give a longer lease time to the client the next time the client renews its lease, since the time that it will give to the client will not exceed the potential expiration time acknowledged by its partner by more than the MCLT.

The PARTNER-DOWN state exists so that a server can be sure that its partner is, indeed, down. Correct operation while in that state requires (generally) that the server wait the MCLT after anything that happened prior to its transition into PARTNER-DOWN state (or, more accurately, when the other server went down if that is known). Thus, the server MUST wait the MCLT after the partner server went down before allocating any of the partner's addresses which were available for allocation. In the event the partner was not in communication prior to going down, it might have allocated one or more of its FREE addresses to a DHCP client and been unable to inform the server entering PARTNER-DOWN prior to going down itself. By waiting the MCLT after the time the partner went down, the server in PARTNER-DOWN state ensures that any clients which have a lease on one of the partner's FREE addresses will either time out or contact the server in PARTNER-DOWN by the time that period ends.

In addition, once a server has transitioned to PARTNER-DOWN state, it MUST NOT reallocate an IP address from one client to another client until an additional MCLT interval after the lease by the original client expires. (Actually, until the maximum client lead time after what it believes to be the lease expiration time of the client.)

Some optimizations exist for this restriction, in that it only applies to leases that were issued BEFORE entering PARTNER-DOWN. Once a server has entered PARTNER-DOWN and it leases out an address, it need not wait this time as long as it has never communicated with the partner since the lease was given out.
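The two PARTNER-DOWN waiting rules above reduce to simple time arithmetic. The following is a minimal sketch under stated assumptions: the function names, the use of seconds as the time unit, and the boolean flag are hypothetical and chosen for this illustration only.

```python
def partner_free_usable_at(partner_down_time: int, mclt: int) -> int:
    """Earliest time the partner's available addresses may be allocated.

    partner_down_time is when the partner went down if known, otherwise
    the time this server entered PARTNER-DOWN state.
    """
    return partner_down_time + mclt


def reallocation_allowed_at(believed_expiry: int, mclt: int,
                            issued_in_partner_down: bool) -> int:
    """Earliest time a leased address may be given to a different client.

    issued_in_partner_down is True only if this server issued the lease
    after entering PARTNER-DOWN and has not communicated with its partner
    since (the optimization described above).
    """
    if issued_in_partner_down:
        return believed_expiry
    # The partner may have extended the lease by up to the MCLT beyond
    # what this server believes, so wait that long after expiration.
    return believed_expiry + mclt


MCLT = 3600  # one hour, an example value
free_ok = partner_free_usable_at(partner_down_time=1000, mclt=MCLT)
old_lease_ok = reallocation_allowed_at(5000, MCLT, issued_in_partner_down=False)
new_lease_ok = reallocation_allowed_at(5000, MCLT, issued_in_partner_down=True)
```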
The fundamental relationship on which much of the correctness of this protocol depends is that the lease expiration time known to a DHCP client MUST NOT be more than the maximum client lead time greater than the potential expiration time known to a server's partner.

The remainder of this section makes the above fundamental relationship more explicit.

This protocol requires a DHCP server to deal with several different lease intervals and places specific restrictions on their relationships. The purpose of these restrictions is to allow the other server in the pair to be able to make certain assumptions in the absence of an ability to communicate between servers.

The different lease times are:

o desired lease interval

   The desired lease interval is the lease interval that a DHCP server would like to give to a DHCP client in the absence of any restrictions imposed by the Failover protocol. Its determination is outside of the scope of this protocol. Typically this is the result of external configuration of a DHCP server.

o actual lease interval

   The actual lease interval is the lease interval that a DHCP server gives out to a DHCP client in the dhcp-lease-time option of a DHCPACK packet. It may be shorter than the desired client lease interval (as explained below).

o potential lease interval

   The potential lease interval is the lease expiration interval the local server tells to its partner in the potential-expiration-time option of a BNDUPD message.

o acknowledged potential lease interval

   The acknowledged potential lease interval is the potential lease interval the partner server has most recently acknowledged in the potential-expiration-time option of a BNDACK message.
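The way these four intervals interact can be sketched in a few lines. This is only one policy that preserves the fundamental relationship (the same one used in this section's DISCUSSION example), not a required implementation. The function names, the integer-seconds time representation, and the assumption that the renewal interval is half the lease interval are all choices made for this illustration.

```python
from typing import Optional

HOUR = 3600
DAY = 24 * HOUR


def actual_lease_interval(now: int, desired: int,
                          acked_potential_expiration: Optional[int],
                          mclt: int) -> int:
    """Lease interval to give the client in the dhcp-lease-time option."""
    if acked_potential_expiration is None:
        remainder = 0  # new lease: the partner has acknowledged nothing yet
    else:
        remainder = max(0, acked_potential_expiration - now)
    # Never lead the partner's acknowledged time by more than the MCLT.
    return min(desired, remainder + mclt)


def potential_expiration(now: int, actual: int, desired: int) -> int:
    """Absolute time to send in the potential-expiration-time option."""
    renewal = actual // 2  # assumes renewal interval = lease interval / 2
    return now + renewal + desired


mclt, desired = 1 * HOUR, 3 * DAY
t = 0
first = actual_lease_interval(t, desired, None, mclt)       # initial lease
acked = potential_expiration(t, first, desired)             # sent in BNDUPD
t1 = t + first // 2                                         # client renews
renewed = actual_lease_interval(t1, desired, acked, mclt)   # renewal lease
acked2 = potential_expiration(t1, renewed, desired)         # next BNDUPD
```

With an MCLT of one hour and a desired lease of three days, the first lease is one hour and the renewal receives the full three days, because by then the partner has acknowledged a potential expiration that leads the client by enough time.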
The key restriction (and guarantee) that any server makes with respect to lease intervals is that the actual client lease interval never exceeds the acknowledged potential lease interval (if any) by more than a fixed amount. This fixed amount is called the "Maximum Client Lead Time" (MCLT).

The MCLT MAY be configurable on the primary server, but for correct server operation it MUST be the same and known to both the primary and secondary servers. The secondary server determines the MCLT from the MCLT option sent from the primary server to the secondary server in the CONNECT message.

A server MUST record in its stable storage both the actual lease interval and the most recently acknowledged potential lease interval for each IP address binding. It is assumed that the desired client lease interval can be determined through techniques outside of the scope of this protocol. See section 7.1.5 for more details concerning the times that the server MUST record in its stable storage and the way that they interact with the lease time that may be offered to a DHCP client.

Again, the fundamental relationship among these times which MUST be maintained is:

   actual lease interval < ( acknowledged potential lease interval + MCLT )

Figure 5.2.1-1 illustrates an initial lease to a client using the rules discussed in the example which follows it. Note that this is only one example -- as long as the fundamental relationship is preserved, the actual times used could be quite different.

               DHCP                  Primary                     Secondary
    time      Client                 Server                       Server

                | (time in intervals)  |      (absolute time)        |
                |                      |                             |
                | >-DHCPDISCOVER->     |                             |
                | <---DHCPOFFER-<      |                             |
                |                      |                             |
                | >-DHCPREQUEST->      |                             |
                |   (selecting)        |                             |
                |                      |                             |
    t           | <--------DHCPACK-<   |                             |
                |   lease-time=MCLT    |                             |
                |                      | >-BNDUPD-->                 |
                |                      | lease-expiration=t+MCLT     |
                |                      | potential-expiration=t+(MCLT/2)+X
                |                      |                             |
                |                      | <-BNDACK-<                  |
                |                      | potential-expiration=t+(MCLT/2)+X
               ...                    ...                           ...
                |                      |                             |
    t+MCLT/2    | >-DHCPREQUEST->      |                             |
                |   (renew)            |                             |
                |                      |                             |
    t1          | <--------DHCPACK-<   |                             |
                |   lease-time=X       |                             |
                |                      | >-BNDUPD-->                 |
                |                      | lease-expiration=t1+X       |
                |                      | potential-expiration=t1+(X/2)+X
                |                      |                             |
                |                      | <-BNDACK-<                  |
                |                      | potential-expiration=t1+(X/2)+X
               ...                    ...                           ...

            Figure 5.2.1-1: Lazy Update Message Traffic
                    X = Desired Lease Interval
            Assumes renewal interval = lease interval / 2

DISCUSSION:

This protocol mandates only that the above fundamental relationship concerning lease intervals is preserved. In the interests of clarity, however, let's examine a specific example.

The MCLT in this case is 1 hour. The desired lease interval is 3 days, and its renewal time is half the lease interval.

The rules for this example are:

o What to tell the client:

   Take the remainder of the acknowledged potential lease interval. If this is a new lease, then this value will be zero. If this remainder plus the MCLT is greater than the desired lease interval, give the client the desired lease interval; otherwise give the client the remainder plus the MCLT.

o What to tell the failover partner server:

   Take the renewal interval (typically half of the actual client lease interval), add to it the desired lease interval, and add it to the current time to yield the value that goes into the potential-expiration-time option.

   Also tell the failover partner the actual lease interval by adding it to the current time to yield the value that goes into the lease-expiration option.

In operation this might work as follows:

When a server makes an offer for a new lease on an IP address to a DHCP client, it determines the desired lease interval (in this case, 3 days). It then examines the acknowledged potential lease interval (which in this case is zero) and determines the remainder of the time left to run, which is also zero. To this it adds the MCLT.
Since the actual lease interval cannot be allowed to exceed the remainder of the current acknowledged potential lease interval plus the MCLT, the offer made to the client is for the remainder of the current acknowledged potential lease interval (i.e., zero) plus the MCLT. Thus, the actual lease interval is 1 hour.

Once the server has sent the DHCPACK to the DHCP client, it will update the secondary server with the lease information. However, the desired potential lease interval will be composed of one half of the current actual lease interval added to the desired lease interval. Thus, the secondary server is updated with a BNDUPD with a lease interval of 3 days + 1/2 hour specified in the potential-expiration-time option.

When the primary server receives a BNDACK for its update of the secondary server's (partner's) potential lease interval, it records that as the acknowledged potential lease interval. A server MUST NOT send a BNDACK in response to a BNDUPD message until it is sure that the information in the BNDUPD message resides in its stable storage. Thus, the primary server in this case can be sure that the secondary server has recorded the potential lease interval in its stable storage when the primary server receives a BNDACK message from the secondary server.

When the DHCP client attempts to renew at T1 (approximately one half an hour from the start of the lease), the primary server again determines the desired lease interval, which is still 3 days. It then compares this with the remaining acknowledged potential lease interval (3 days + 1/2 hour) and adjusts for the time passed since the secondary was last updated (1/2 hour). Thus the time remaining of the acknowledged potential lease interval is 3 days. Adding the MCLT to this yields 3 days plus 1 hour, which is more than the desired lease interval of 3 days.
So the client is renewed for the desired lease interval -- 3 days.

When the primary DHCP server updates the secondary DHCP server after the DHCP client's renewal ACK is complete, it will calculate the desired potential lease interval as the T1 fraction of the actual client lease interval (1/2 of 3 days this time = 1.5 days). To this it will add the desired client lease interval of 3 days, yielding a total desired partner server lease interval of 4.5 days. In this way, the primary attempts to have the secondary always "lead" the client in its understanding of the client's lease interval so as to be able to always offer the client the desired client lease interval.

Once the initial actual client lease interval of the MCLT is past, the protocol operates effectively like the DHCP protocol does today in its behavior concerning lease intervals. However, the guarantee that the actual client lease interval will never exceed the remaining acknowledged partner server lease interval by more than the MCLT allows full recovery from a variety of failures.

5.2.2. Controlled re-allocation of IP addresses

When in PARTNER-DOWN state there is a waiting period after which an IP address can be re-allocated to another client. For leases which are available when the server enters PARTNER-DOWN state, the period is the MCLT from entry into PARTNER-DOWN state. For IP addresses which are not available when the server enters PARTNER-DOWN state, the period is the MCLT after the lease becomes available. See section 9.4.2 for more details.

In any other state, a server cannot reallocate an address from one client to another without first notifying its partner (through a BNDUPD message) and receiving acknowledgement (through a BNDACK message) that its partner is aware that that first client is not using the address.

This could be modeled in the following way.
Though this specific implementation is in no way required, it may serve to better illustrate the concept.

An "available" IP address on a server may be allocated to any client. An IP address which was leased to a client and which expired or was released by that client would take on a new state, EXPIRED or RELEASED respectively. The partner server would then be notified that this IP address was EXPIRED or RELEASED through a BNDUPD. When the sending server received the BNDACK for that IP address showing it was FREE, it would move the IP address from EXPIRED or RELEASED to FREE, and it would be available for allocation by the primary server to any clients.

A server MAY reallocate an IP address in the EXPIRED or RELEASED state to the same client with no restrictions provided it has not sent a BNDUPD message to its partner. This situation would exist if the lease expired or was released after the transition into PARTNER-DOWN state, for instance.

5.3. Load balancing

In order to implement load balancing between a primary and secondary server pair, each server must respond to DHCPDISCOVER requests from some clients and not from other clients. In order to do this successfully, each server must be able to determine immediately upon receipt of a DHCP client request whether it is to service this request or to ignore it in order to allow the other server to service the request.

In addition, it should be possible to configure the percentage of clients which will be serviced by either the primary or secondary server. This configuration should be more or less continuous, from all clients serviced by the primary through an even split with half serviced by each, to all clients serviced by the secondary.

The technique chosen to support these goals is described in [LOADB]. A bitmap-style Hash Bucket Assignment (as described in [LOADB]) is used to determine which DHCP clients can be processed.

There are two potential HBAs in a failover server -- a server HBA and a failover HBA.
The way that a server acquires a server HBA is outside of the scope of the failover protocol, but both servers in a failover pair MUST have the same server HBA. The failover HBA (which specifies the clients that the secondary is supposed to process) is sent by the primary server to the secondary server whenever a connection is established, using the hash-bucket-assignment option defined in section 12.11.

When using the server HBA (if any) and the failover HBA (if any), to decide whether to process a DHCP request, the server HBA always applies in every failover state, and the failover HBA (which MUST be a subset of the server HBA) is used by the secondary server to decide which packets to process when in NORMAL state.

5.4. IP address allocations between servers

The failover protocol allows a DHCP server which implements it to operate correctly in spite of the uncertainty over whether its partner has failed or whether the communications link to its partner has failed. This is made possible in part by the existence of separate address pools on each server for allocation to newly arrived DHCP clients. Thus, each server has its own pool of available IP addresses.

Note that an IP address is not "owned" by a particular server throughout its entire lifetime. Only an IP address which is available is "owned" by a particular server -- once it has been leased to a DHCP client, it is not owned by either failover partner. When it finally becomes available again, it will be owned initially by the primary server, and it may or may not be allocated to the secondary server by the primary server.

So, the flow of IP address ownership is as follows: initially an IP address is owned by the primary server. It may be allocated to the secondary server if it is available, and then it is owned by the secondary server.
Either server can allocate available IP addresses which it owns to DHCP clients, in which case it ceases to own them. When the DHCP client releases the address or the lease on it expires, it will again become available and will be owned by the primary.

An IP address will not become owned by the server which allocated it initially when it is released or the lease expires because, in general, that server will have had to replenish its pool of available addresses well in advance of any likely lease expirations. Thus, having a particular IP address cycle back to the secondary is as likely to put the secondary more out of balance with respect to the primary as it is to enhance the balance of available addresses between them.

These address pools are used when in COMMUNICATIONS-INTERRUPTED state and while waiting for the MCLT expiration in PARTNER-DOWN state. In addition, when using load balancing, these pools are used when in NORMAL state as well.

The allocation and maintenance of these address pools is an area of some sensitivity, since the goal is to maintain a more or less constant ratio of available addresses between the two servers.

The initial allocation when the servers first integrate is triggered by the POOLREQ message from the secondary to the primary. This is followed by the POOLRESP message where the primary tells the secondary how many IP addresses it allocated to the secondary. Then, the primary sends the allocated IP addresses to the secondary.

The POOLREQ/POOLRESP message is a trigger to the primary to perform a scan of its database and to ensure that the secondary has enough IP addresses (based on some configured ratio). The actual IP addresses are sent to the secondary using the BNDUPD message with a state of BACKUP, which indicates the IP address is now available for allocation by the secondary.
The POOLREQ/POOLRESP message exchange initiated by the secondary is valid at any time, and the primary server SHOULD, whenever it receives the POOLREQ message, scan its database of address pools and determine if the secondary needs more IP addresses from any of the IP address pools.

However, in order to support a reasonably dynamic balance of the IP addresses between the failover partners, the primary server needs to do additional work to ensure that the secondary server has as many IP addresses as it needs (but that it doesn't have *more* than it needs either).

The primary server SHOULD examine the balance of available addresses between the primary and secondary for a particular address pool whenever the number of available addresses for either the primary or secondary changes. The primary server SHOULD adjust the available address balance as required to ensure the configured address balance, excepting that the primary server SHOULD apply some threshold mechanism to such balance adjustments in order to minimize the overhead of maintaining this balance.

An example of a threshold approach is: do not attempt to re-balance the available pools on the primary and secondary until the out of balance value exceeds a configured value.

The primary server can, at any time, send an available IP address to the secondary using a BNDUPD with the state BACKUP. The primary server can attempt to take an available IP address away from the secondary by sending a BNDUPD with the state FREE. If the secondary accepts the BNDUPD, then the IP address is now available to the primary and not available to the secondary. Of course, the secondary MUST reject that BNDUPD if it has already used that IP address for a DHCP client.
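The threshold approach described above might be sketched as follows. This is an illustration only: the function name, the secondary_share parameter (the configured fraction of available addresses intended for the secondary), and the way the out-of-balance value is measured are assumptions, since this document does not prescribe a particular calculation.

```python
def rebalance_delta(primary_free: int, secondary_free: int,
                    secondary_share: float, threshold: int) -> int:
    """Number of available addresses the primary should move.

    A positive result means addresses to send to the secondary (BNDUPD
    with state BACKUP); a negative result means addresses to try to take
    back (BNDUPD with state FREE); zero means leave the pools alone.
    """
    total = primary_free + secondary_free
    target_secondary = int(total * secondary_share)
    delta = target_secondary - secondary_free
    # Threshold: skip small corrections to limit update traffic.
    if abs(delta) <= threshold:
        return 0
    return delta


give = rebalance_delta(90, 10, secondary_share=0.5, threshold=5)
hold = rebalance_delta(52, 48, secondary_share=0.5, threshold=5)
take = rebalance_delta(20, 80, secondary_share=0.5, threshold=5)
```

A negative result is only a request: the secondary may still reject an individual BNDUPD with state FREE if it has already leased that address to a client.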
Whenever the primary server examines the possible available IP addresses which it could send to the secondary server, the primary server SHOULD take into account whether load balancing is in use, and if it is, the primary server SHOULD attempt to send to the secondary any IP addresses whose most recent client would be processed by the secondary under the current load balancing regime in use.

Likewise, when removing available IP addresses from the secondary server when load balancing is in use, the primary server SHOULD first remove those IP addresses whose most recent client would be processed by the primary server under the current load balancing regime in use.

5.5. Operating in NORMAL state

When in NORMAL state, each server services DHCPDISCOVERs and all other DHCP requests other than DHCPREQUEST/RENEWAL or DHCPREQUEST/REBINDING from the client set defined by the load balancing algorithm [LOADB]. Each server services DHCPREQUEST/RENEWAL or DHCPREQUEST/REBINDING requests from any client.

In general, whenever the binding database is changed in stable storage (other than a change resulting from receiving a BNDUPD from the failover partner), then a BNDUPD message is sent with the contents of that change to the partner server. The partner server then writes the information about that binding in its bindings database in stable storage and replies with a BNDACK message.

The binding database in a DHCP server would normally be changed as a result of DHCP protocol activity with a DHCP client (e.g., granting a lease to a DHCP client through the familiar DISCOVER/OFFER/REQUEST/ACK cycle or extending a lease due to a renewal from a DHCP client) or possibly (on some servers) because a lease has expired or undergone another state change that must be recorded in the DHCP binding database. These are the state changes that would be communicated to the partner server using a BNDUPD message.
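The NORMAL state rules above can be summarized in a small, non-normative Python sketch. The function names, the string labels for message kinds, and the client_in_my_set flag (standing in for the result of the load balancing algorithm [LOADB], which is not implemented here) are all assumptions made for illustration.

```python
# Message kinds that either server answers in NORMAL state, regardless
# of which server the load balancing algorithm selects for the client.
SERVED_BY_EITHER = {"DHCPREQUEST/RENEWAL", "DHCPREQUEST/REBINDING"}


def services_in_normal(message_kind, client_in_my_set):
    """Decide whether this server answers a client message in NORMAL state.

    message_kind is a hypothetical label such as "DHCPDISCOVER".
    client_in_my_set is True when the load balancing algorithm assigns
    the client to this server (assumed to be computed elsewhere).
    """
    if message_kind in SERVED_BY_EITHER:
        return True
    return client_in_my_set


def should_send_bndupd(change_source):
    """Decide whether a stable-storage binding change is sent to the partner.

    change_source is a hypothetical label: "BNDUPD" when the change was
    caused by a BNDUPD received from the partner, anything else (for
    example "CLIENT" or "EXPIRY") for locally originated changes.
    """
    return change_source != "BNDUPD"
```

A server would consult services_in_normal before replying to a client, and should_send_bndupd after each committed binding change, so that an update received from the partner is never echoed back.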
Of course, receipt of a BNDUPD message itself will normally cause an update of the binding database for all of the IP addresses contained in the BNDUPD, and a binding database change such as this MUST NOT trigger a corresponding BNDUPD message to the partner.

5.6. Operating in COMMUNICATIONS-INTERRUPTED state

When operating in COMMUNICATIONS-INTERRUPTED state, each server is operating independently, but does not assume that its partner is not operating. The partner server might be operating and simply unable to communicate with this server, or might not be operating. Each server responds to the full range of DHCP client messages that it receives, but in such a way that graceful reintegration is always possible when its partner comes back into contact with it.

5.7. Operating in PARTNER-DOWN state

When operating in PARTNER-DOWN state, a server assumes that its partner is not currently operating, but does make allowances for the possibility that the partner was operating in the past, though possibly out of communication with this server. It responds to all DHCP client requests in PARTNER-DOWN state.

5.8. Operating in RECOVER state

A server operating in RECOVER state assumes that it is reintegrating with a server that has been operating in PARTNER-DOWN state, and that it needs to update its bindings database before it services DHCP client requests. A server may also operate in RECOVER state in order to fully recover its bindings database from its partner server.

5.9. Operating in STARTUP state

A server operating in STARTUP state assumes that failover is operational, and it spends a short time whenever it comes up attempting to contact the partner. During this time (generally a few seconds), the server is unresponsive to DHCP client requests.
This period exists in order to give a server a chance to determine whether its partner has changed state since the two were last in communication, and to react to that changed state (if any) prior to responding to DHCP client requests. The period of time a server remains in STARTUP state SHOULD be long enough to ensure that it will connect to the other server if that server is available for connections.

5.10. Time synchronization between servers

The failover protocol is designed to operate between two servers which have time values which differ by an arbitrarily large amount. A particular implementation MAY choose to only support servers whose time values differ by an arbitrarily small amount.

In any event, whether large or only small differences in time values are supported, every message that is received MUST be tagged with a time value as soon as possible after receipt. This time value is used along with the time value that is sent in every message between the failover partners to develop a delta time between the servers.

This delta time is used during the connection process to establish a baseline delta time between the servers, and upon receipt of each message, the delta time for that message is used to refine the delta time for the server pair.

While the algorithm for this refinement of delta time is not specified as part of this protocol, a server SHOULD allow the delta time value for a pair of failover servers to be periodically updated to account for time drift. In addition, the delta time value between servers SHOULD be smoothed in some fashion, so that transient network delays will not cause it to vary wildly.

A server SHOULD treat a drastic change in the delta time value as an event to be signaled to a network administrator, and SHOULD reset the delta time between the failover partners when such a change occurs.
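Since the refinement algorithm is left to the implementation, the following non-normative Python sketch shows one possible approach: an exponentially weighted moving average with a reset on drastic change. The class name DeltaTime, the smoothing factor alpha, and the drastic threshold (in seconds) are assumptions chosen for illustration; other smoothing schemes would serve equally well.

```python
class DeltaTime:
    """Running estimate of the clock difference between failover partners."""

    def __init__(self, alpha=0.125, drastic=60.0):
        self.alpha = alpha      # assumed smoothing factor, 0 < alpha <= 1
        self.drastic = drastic  # assumed threshold for a "drastic" change
        self.delta = None       # no baseline until the first message

    def observe(self, sent_time, received_time):
        """Fold one message into the estimate.

        sent_time is the time value carried in the message; received_time
        is the local tag applied on receipt. Each sample therefore also
        includes network transit delay, which is why it is smoothed.
        Returns True when the change is drastic, in which case the
        estimate is reset and the caller should alert an administrator.
        """
        sample = received_time - sent_time
        if self.delta is None:
            self.delta = sample  # baseline established at connection time
            return False
        if abs(sample - self.delta) > self.drastic:
            self.delta = sample  # reset rather than smooth
            return True
        self.delta += self.alpha * (sample - self.delta)
        return False
```

A small alpha makes the estimate resistant to transient delays at the cost of tracking drift more slowly; the right trade-off depends on the deployment.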
The specific definitions of a minor or drastic change in delta time as well as the algorithm used to smooth minor changes into the running delta time are implementation issues and are not further addressed in this document.

5.11. IP address binding-status

In most DHCP servers an IP address can take on several different binding-status values, sometimes also called states. While probably no two DHCP servers have exactly the same possible binding-status values, the DHCP RFC enforces some commonality among the general semantics of the binding-status values used by various DHCP server implementations.

In order to transmit binding database updates between one server and another using the failover protocol, some common denominator binding-status values must be defined. It is not expected that these binding-status values correspond with any actual implementation of the DHCP protocol in a DHCP server, but rather that the binding-status values defined in this document should be a common denominator of those in use by many DHCP server implementations. It is a goal of this protocol that any DHCP server can map the various IP address binding-status values that it uses internally into these failover IP address binding-status values on transmission of binding database updates to its partner, and likewise that it can map any failover IP address binding-status values it receives in a binding update into its internal IP address binding-status values.

The IP address binding-status values defined for the failover protocol are listed below. Unless otherwise noted below, there MAY be client information associated with each of these binding-status values.

o  ACTIVE -- Lease is assigned to a client. Client identification MUST appear.

o  EXPIRED -- indicates that a client's binding on an IP address has expired.
   When the partner server ACK's the BNDUPD of an EXPIRED IP address, the server sets its internal state to FREE. It is then available for allocation to any client of the primary server. It may be allocated to the same client on the server where the lease expired if a BNDUPD containing the EXPIRED state has not yet been sent to the partner (e.g., in the event that the servers are not in communication). Client identification SHOULD appear.

o  RELEASED -- indicates that a DHCP client sent in a DHCPRELEASE message. When the partner server ACK's the BNDUPD of a RELEASED IP address, the server sets its internal state to FREE, and it is available for allocation by the primary server to any DHCP client. It may be allocated to the same client if a BNDUPD has not yet been sent to the partner. Client identification SHOULD appear.

o  FREE -- is used when a DHCP server needs to communicate that an IP address is unused by any DHCP client, but it was not just released, expired, or reset by a network administrator. When the partner server ACK's the BNDUPD of a FREE IP address, the server sets its internal state such that it is available for allocation by the primary DHCP server to any DHCP client. (Note that in PARTNER-DOWN state, after waiting the MCLT, the IP address MAY be allocated to a DHCP client by the secondary server.) Note that when an IP address that was allocated by the secondary reverts to the FREE state, it must (like any other IP address) be assigned to the secondary through the POOLREQ/BNDUPD process before the secondary can reallocate it. Client identification MAY appear.

o  ABANDONED -- indicates that an IP address is considered unusable by the DHCP subsystem. An IP address for which a valid PING response was received SHOULD be set to ABANDONED. An IP address for which a DHCPDECLINE was received should be set to ABANDONED. Client identification MUST NOT appear.
o  RESET -- indicates that this IP address was made available by operator command. This is a distinct state so that the reason that the IP address became FREE can be determined. Client identification MAY appear.

o  BACKUP -- indicates that this IP address can be allocated by the secondary server to a DHCP client at any time. When the MCLT has passed after its time of entry into PARTNER-DOWN state, the IP address may be allocated by the primary to any DHCP client. Client identification MAY appear.

These binding-status values are communicated from one failover partner to another using the binding-status option; see section 12.3 for details of this option. Unless otherwise noted above there MAY be client information associated with each of these binding-status values.

An IP address will move between these binding-status values using the following state transition diagram:

                                DHCP client DECLINE or
                                server detected problem
                                    from any state
                        +----------+       V      +---------+
       External >------>|  RESET   |       |      |ABANDONED|
       command          |          |       +----->|         |
                        +----------+              +---------+
                             |
                     Comm w/Partner(1)
                             V
 +---------+  Comm(1)   +----------+   Comm(1)    +---------+
 | EXPIRED |----------->|   FREE   |<-------------| RELEASED|
 |         | w/Partner  |          |  w/Partner   |         |
 +---------+            +----------+              +---------+
  ^     ^                 |       |                       ^
  | Exp. grace            |       +-----------+           |
  | period ends           |       | IP addr   | IP addr   |
  |     |                 |       | alloc. to | reserved  |
  | wait for              |       | sec.(2)   |           |
  | grace period          |       V           |           |
  |     |     IP address  |  +----------+     |           |
  |     |     leased by   |  |  BACKUP  |<----+           |
  |     |     primary     |  +----------+                 |
  | Expired grace         |    | IP addr leased by        |
  | period exists         |    | secondary                |
  |     |                 V    V                          |
  |     |             +----------+                        |
  |     |  Lease on   |  ACTIVE  | DHCPRELEASE            |
  +-----+-IP addr-----|          |------------------------+
          expires     +----------+

Figure 5.10-1: Transitions between binding-status values.
(1) This transition MAY also occur if the server is in PARTNER-DOWN state and the MCLT has passed since entry into the RELEASED, EXPIRED, or RESET state.

(2) This transition MAY occur if the server is the secondary and the MCLT has passed since its entry into PARTNER-DOWN state.

Again, note that a DHCP server implementing the failover protocol does not have to implement this state machine or use these particular binding-status values in its normal operation of allocating IP addresses to DHCP clients. It only needs to map its internal binding-status values onto these "standard" binding-status values, and map these "standard" binding-status values back into its internal binding-status values. For example, a server which implements a grace period for an IP address binding SHOULD simply wait to update its partner server until the grace period on that binding has run out.

The process of setting an IP address to FREE deserves some detailed discussion. When an IP address is moved to the EXPIRED, RELEASED, or RESET binding-status on a server, it will send a BNDUPD with the binding-status of EXPIRED, RELEASED, or RESET to its partner. If its partner agrees that this is acceptable (see sections 7.1.2 and 7.1.3 concerning why a server might not accept a BNDUPD) it will return a BNDACK with no reject-reason, signifying that it accepted the update. As part of the BNDUPD processing, the server returning the BNDACK will set the binding-status of the IP address to FREE, and upon receipt of the BNDACK the server which sent the BNDUPD will set the binding-status of the IP address to FREE. Thus, the EXPIRED, RELEASED, or RESET binding-status is something of a transitory state. This process is encoded in the transition diagram above by "Comm w/Partner".

5.12. DNS dynamic update considerations

DHCP servers (and clients) can use DNS Dynamic Updates as described in [RFC 2136] to maintain DNS name-mappings as they maintain DHCP leases. Many different administrative models for DHCP-DNS integration are possible. Descriptions of several of these models, and guidelines that DHCP servers and clients should follow in carrying them out, are laid out in [DDNS]. The nature of the DHCP failover protocol introduces some issues concerning dynamic DNS updates that are not part of non-failover DHCP environments. This section describes these issues, and defines the information which failover partners should exchange and the protocol which they should follow in order to ensure consistent behavior.

The presence of this section should not be interpreted as requiring that implementations of the DHCP failover protocol must also support DDNS updates. The purpose of this discussion is to clarify the areas where the DHCP failover and DHCP-DDNS protocols intersect for the benefit of implementations which support both protocols, not to introduce a new requirement into the DHCP failover protocol. Thus, a DHCP server which implements the failover protocol MAY also support dynamic DNS updates, but if it does support dynamic DNS updates it SHOULD utilize the techniques described here in order to correctly distribute them between the failover partners.

From the standpoint of the failover protocol, there is no reason why a server which is utilizing the DDNS protocol to update a DNS server should not be a partner with a server which is not utilizing the DDNS protocol to update a DNS server. However, a server which is not able to support DDNS or is not configured to support DDNS SHOULD output a warning message when it receives BNDUPD messages which indicate that its failover partner is configured to support the DDNS protocol to update a DNS server.
An implementation MAY consider this an error and refuse to operate, or it MAY choose to operate anyway, having warned the user of the problem in some way.

5.12.1. Relationship between failover and dynamic DNS update

The failover protocol describes the conditions under which each failover server may renew a lease to its current DHCP client, and describes the conditions under which it may grant a lease to a new DHCP client. An analogous set of conditions determines when a failover server should initiate a DDNS update, and when it should attempt to remove records from the DNS. The failover protocol's conditions are based on the desired external behavior: avoiding duplicate address assignments; allowing clients to continue using leases which they obtained from one failover partner even if they can only communicate with the other partner; and allowing the backup DHCP server to grant new leases even if it is unable to communicate with the primary server.

The desired external DDNS behavior for DHCP failover servers is:

1. Allow timely DDNS updates from the server which grants a client a lease. Recognize that there is often a DDNS update lifecycle which parallels the DHCP lease lifecycle. This is likely to include the addition of records when the lease is granted, and the removal of DNS records when the lease is subsequently made available for allocation to a different client.

2. Communicate enough information between the two failover servers to allow one to complete the DDNS update 'lifecycle' even if the other server originally granted the lease.

3. Avoid redundant or overlapping DDNS updates, where both failover servers are attempting to perform DDNS updates for the same lease-client binding. Avoid situations where one partner is attempting to add RRs related to a lease binding while the other partner is attempting to remove RRs related to the same lease binding.

5.12.2. Use of the DDNS option

In order for either server to be able to complete a DDNS update, or to remove DNS records which were added by its partner, both servers need to know the FQDN associated with the lease-client binding. The FQDN associated with the client's A RR and PTR RR SHOULD be communicated from the server which adds records into the DNS to its partner. The initiating server SHOULD use the DDNS option in the BNDUPD messages to inform the partner server of the status of any DDNS updates associated with a lease binding. Failover servers MAY choose not to include the DDNS option in BNDUPD messages if there has been no change in the status of any DDNS update related to the lease binding.

The partner server receiving BNDUPD messages containing the DDNS option SHOULD compare the status flags and the FQDN contained in the option data with the current DDNS information it has associated with the lease binding, and update its notion of the DDNS status accordingly.

The initiating server MAY send a BNDUPD to its partner before the DDNS update has been successfully completed. If it does so, it SHOULD leave the 'C' bit in the Flags field clear, to indicate to the partner that the DDNS update may not be complete. When the DDNS update has been successfully acknowledged by the DNS server, the initiating DHCP server SHOULD include the DDNS option in its next BNDUPD message about the binding, so that the partner server will be able to record the final status of the DDNS update. The initiating server SHOULD set the 'C' bit in the DDNS option if the DDNS update was successfully accepted by the DNS server.

Some implementations will choose to send a BNDUPD without waiting for the DDNS update to complete, and then will send a second BNDUPD once the DDNS update is complete.
Other implementations will delay sending the partner a BNDUPD until the DDNS update has been acknowledged by the DNS server, or until some time-limit has elapsed, in order to avoid sending a second BNDUPD.

The Domain Name field in the DDNS option contains the FQDN that will be associated with the A RR (if the server is performing an A RR update for the client) and the PTR RR. This FQDN may be composed in any of several ways, depending on server configuration and the information provided by the client in its DHCP messages. The client may supply a hostname which it would like the server to use in forming the FQDN, or it may supply the entire FQDN. The server may be configured to attempt to use the information the client supplies, it may be configured with an FQDN to use for the client, or it may be configured to synthesize an FQDN.

The responsive server SHOULD include the FQDN that it will be using in DDNS updates it initiates when it sends the DDNS option. Since the responsive server may not have completed the DDNS update at the time it sends the first BNDUPD about the lease binding, there may be cases where the FQDN in later BNDUPD messages does not match the FQDN included in earlier messages. For example, the responsive server may be configured to handle situations where two or more DHCP client FQDNs are identical by modifying the most-specific label in the FQDNs of some of the clients in an attempt to generate unique FQDNs for them (a process sometimes called "disambiguation"). Alternatively, at sites which use some or all of the information which clients supply to form the FQDN, it's possible that a client's configuration may be changed so that it begins to supply new data. The responsive server may react by removing the DNS records which it originally added for the client, and replacing them with records that refer to the client's new FQDN.
In such cases, the responsive server SHOULD include the actual FQDN that was used in subsequent DDNS options. The responsive server SHOULD include relevant client-option data in the client-request-options option in its BNDUPD messages. This information may be necessary in order to allow the non-responsive partner to detect client configuration changes that change the hostname or FQDN data which the client includes in its DHCP requests.

5.12.3. Adding RRs to the DNS

A failover server which is going to perform DDNS updates SHOULD initiate the DDNS update when it grants a new lease to a client. The non-responsive partner SHOULD NOT initiate a DDNS update when it receives the BNDUPD after the lease has been granted. The failover protocol ensures that only one of the partners will grant a lease to any individual client, so it follows that this requirement will prevent both partners from initiating updates simultaneously. The server initiating the update SHOULD follow the protocol in [DDNS]. The server may be configured to perform an A RR update on behalf of its clients, or not. Ordinarily, a failover server will not initiate DDNS updates when it renews leases. In two cases, however, a failover server MAY initiate a DDNS update when it renews a lease to its existing client:

1. When the lease was granted before the server was configured to perform DDNS updates, the server MAY be configured to perform updates when it next renews existing leases. Since both servers are responsive to renewals in NORMAL state, it is not enough to simply require the non-responsive server to avoid a DNS update in this case. The server which would be responsive to a DHCPDISCOVER from this client (even though the current request is a DHCPREQUEST/RENEW) is the server which should initiate the DDNS update.

2. If a server is in PARTNER-DOWN state, it can conclude that its partner is no longer attempting to perform an update for the existing client. If the remaining server has not recorded that an update for the binding has been successfully completed, the server MAY initiate a DDNS update. It MAY initiate this update immediately upon entry to PARTNER-DOWN state, it may perform this in the background, or it MAY initiate this update upon next hearing from the DHCP client.

5.12.4. Deleting RRs from the DNS

The failover server which makes an IP address FREE SHOULD initiate any DDNS deletes, if it has recorded that DNS records were added on behalf of the client.

A server not in PARTNER-DOWN state "makes an IP address FREE" when it initiates a BNDUPD with a binding-status of FREE, EXPIRED, or RELEASED. Its partner confirms this status by acking that BNDUPD, and upon receipt of the ACK the server has "made the IP address FREE". Conversely, a server in PARTNER-DOWN state "makes an IP address FREE" when it sets the binding-status to FREE, since in PARTNER-DOWN state no communication is required with the partner.

It is at this point that it should initiate the DDNS operations to delete RRs from the DNS. Its partner SHOULD NOT initiate DDNS deletes for DNS records related to the lease binding as part of sending the BNDACK message. The partner MAY have issued BNDUPD messages with a binding-status of FREE, EXPIRED, or RELEASED previously, but the other server will have NAKed these BNDUPD messages.

The failover protocol ensures that only one of the two partner servers will be able to make a lease FREE. The server making the lease FREE may be doing so while it is in NORMAL communication with its partner, or it may be in PARTNER-DOWN state. If a server is in PARTNER-DOWN state, it may be performing DDNS deletes for RRs which its partner added originally. This allows a single remaining partner server to assume responsibility for all of the DDNS activity which the two servers were undertaking.
Another implication of this approach is that no DDNS RR deletes will be performed while either server is in COMMUNICATIONS-INTERRUPTED state, since no IP addresses are moved into the FREE state during that period.

5.13. Reservations and failover

Some DHCP servers support a capability to offer specific pre-configured IP addresses to DHCP clients. These are real DHCP clients, and they carry out the entire DHCP protocol, but these servers always offer the client a specific pre-configured IP address -- and they offer that IP address to no other clients. Such a capability has several names, but it is sometimes called a "reservation", in that the IP address is reserved for a particular DHCP client.

In a situation where there are two DHCP servers serving the same subnet without using failover, the two DHCP servers need to have disjoint IP address pools, but identical reservations for the DHCP clients.

In a failover context, both servers need to be configured with the proper reservations in an identical manner, but if we stop there problems can occur around the edge conditions where reservations are made for an IP address that has already been leased to a different client. Different servers handle this conflict in different ways, but the goal of the failover protocol is to allow correct operation with any server's approach to the normal processing of the DHCP protocol.

The general solution with regard to reservations is as follows. Whenever a reserved IP address becomes FREE (i.e., when first configured or whenever a client frees it or it expires or is reset), the primary server MUST show that IP address as FREE (and thus available for its own allocation) and it MUST send it to the secondary server with the R bit set in the IP-flags option and the binding-status BACKUP.
Note that this implies that a reserved IP address goes through the normal state changes from FREE to ACTIVE (and possibly back to FREE). The failover protocol supports this approach to reservations, i.e., where the IP address undergoes the normal state changes of any IP address, but it can only be offered to the client for which it is reserved. Other approaches to the support of reservations exist in some DHCP server implementations (e.g., where the IP address is apparently leased to a particular client forever, without any expiration). The goal is for the failover protocol to support any of the usual approaches to reservations, both those that allow an IP address to go through different states when reserved, and those that don't.

From the above, it follows that a reservation solely on the secondary will not necessarily allow the secondary to offer that address to the client for which it is reserved. The reservation must also appear on the primary for the secondary to be able to offer the IP address to the client for which it is reserved.

When the reservation on an IP address is cancelled, if the IP address is currently FREE and the server is the primary, or BACKUP and the server is the secondary, the server MUST send a BNDUPD to the other server with the binding-status FREE.

5.14. Dynamic BOOTP and failover

Some DHCP servers support a capability to offer IP addresses to BOOTP clients without having a particular address previously allocated for those clients. This capability is often called something like "dynamic BOOTP". It is discussed briefly in RFC 1534 [RFC 1534].

This capability has a negative interaction with the fundamental elements of the failover protocol, in that an address handed out to a BOOTP device has no term (or effectively no term, in that usually they are considered leases for "forever").
There is no opportunity to hand out a lease which is only the MCLT long when first hearing from a BOOTP device, because such devices may only interact once with the DHCP server and they have no notion of a lease expiration time. Thus the entire concept of the MCLT and waiting the MCLT after entering PARTNER-DOWN state is defeated when dealing with BOOTP devices.

With some restrictions, however, dynamic BOOTP devices can be supported in a server on a subnet where failover is supported. The only restriction (and it is not small) is that on any portion of the subnet (in any address pool) where dynamic BOOTP devices can be allocated IP addresses, a DHCP server MUST NOT ever use any of the IP addresses which were previously available for allocation by its failover partner. Thus, the addresses allocated by the primary to the secondary for allocation that might have been allocated to BOOTP devices MUST NOT ever be used by the primary server even if it is in PARTNER-DOWN state and has waited the MCLT after entering that state. Conversely, addresses available for allocation by the primary MUST NOT be used by the secondary even if it is in PARTNER-DOWN state. The reason for this is that one of those IP addresses could have been allocated by the partner server to a BOOTP device, and the remaining server would have no way of ever knowing that this had happened.

Whenever a server sends a BNDUPD message to its partner, if the client associated with the IP address is a BOOTP client, then the server MUST set the B bit in the IP-flags option.

5.15. Guidelines for selecting MCLT

There is no one correct value for the MCLT. There is an explicit tradeoff between various factors in selecting an MCLT value.

5.15.1. Short MCLT

A short MCLT value will mean that after entering PARTNER-DOWN state, a server will only have to wait a short time before it can start allocating its partner's IP addresses to DHCP clients.
Furthermore, it will only have to wait a short time after the expiration of a lease on an IP address before it can reallocate that IP address to another DHCP client.

However, the downside of a short MCLT value is that the initial lease interval that will be offered to every new DHCP client will be short, which will cause increased traffic as those clients will need to send in their first renewal after only half of the short MCLT interval.

In addition, the lease extensions that a server in COMMUNICATIONS-INTERRUPTED state can give will be only the MCLT after the server has been in COMMUNICATIONS-INTERRUPTED state for around the desired client lease period. If a server stays in COMMUNICATIONS-INTERRUPTED state for that long, then the leases it hands out will be short and that will increase the load on that server, possibly causing difficulty.

5.15.2. Long MCLT

A long MCLT value will mean that the initial lease period will be longer and the time that a server in COMMUNICATIONS-INTERRUPTED state will be able to extend leases (after it has been in COMMUNICATIONS-INTERRUPTED state for around the desired client lease period) will be longer.

However, a server entering PARTNER-DOWN state will have to wait the longer MCLT before being able to allocate its partner's IP addresses to new DHCP clients. This may mean that additional IP addresses are required in order to cover this time period. Further, the server in PARTNER-DOWN state will have to wait the longer MCLT from every lease expiration before it can reallocate an IP address to a different DHCP client.

5.16. What is sent in response to an UPDREQ or UPDREQALL message?

In section 7.3, the UPDREQ message is defined, and it says that the receiving server sends to the requesting server "all of the binding database information that it has not already seen".
   In section 7.4.2, the UPDREQALL message is defined, and it says that
   the receiving server sends to the requesting server "all binding
   database information".  Both of these statements need further
   elaboration.

   First, for the UPDREQ message, the information to be sent in BNDUPD
   messages concerns "all of the binding database information it has
   not already seen".  Since every BNDUPD is acked by the receiving
   server, the sending server need only keep track of which IP
   addresses have binding database changes not yet seen by the partner,
   and record when those changes are finally acked by the partner.
   Thus, at any time, it knows which IP addresses have unacked binding
   database information.  This is less simple when, across
   reconfigurations of the servers, an IP address can change the
   failover partner with which it is associated.  In that case, it is
   important to reset the indication that the partner has seen this
   binding information.

   Second, in the event that a failover server's binding database
   information is restored from a backup, it will be partially out of
   date.  In this case, its partner's indication of which binding
   database information the restored server has seen will also be out
   of date.  The solution to this problem is for a server which is
   connecting with its partner to check the partner's last communicated
   time and, if it is very much ahead of its own last communicated
   time, to go into RECOVER state and transmit an UPDREQALL to allow it
   to refresh its state.  See section 9.3.2, step 5.  If the partner's
   last communicated time is very much behind its own record of when it
   last communicated with the partner, then it SHOULD invalidate its
   information on which binding database information the partner server
   knows, so that it will send all of its relevant binding database
   information to the partner.
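   The bookkeeping described in the first two points can be sketched as
   follows.  This is only an illustration under stated assumptions: the
   class and function names are hypothetical, and the text does not
   quantify "very much ahead" or "very much behind", so the
   threshold_seconds parameter is an assumed, implementation-chosen
   value.

```python
class UpdateTracker:
    """Tracks which IP addresses have binding changes the partner has not acked.

    Hypothetical helper: a real server would also have to handle a binding
    that changes again between sending a BNDUPD and receiving its BNDACK.
    """

    def __init__(self):
        self.known = set()    # addresses that have ever had a binding change
        self.unacked = set()  # addresses with changes the partner has not acked

    def binding_changed(self, ip):
        self.known.add(ip)
        self.unacked.add(ip)

    def bndack_received(self, ip):
        self.unacked.discard(ip)

    def addresses_for_updreq(self):
        """Addresses whose binding information answers an UPDREQ."""
        return sorted(self.unacked)

    def invalidate(self):
        """Forget what the partner has seen, so everything is sent again."""
        self.unacked = set(self.known)


def action_on_connect(own_last_comm, partner_last_comm, threshold_seconds):
    """Compare last communicated times (seconds) when connecting to the partner."""
    if partner_last_comm > own_last_comm + threshold_seconds:
        # Our database was probably restored from a backup: refresh it.
        return "RECOVER_AND_UPDREQALL"
    if partner_last_comm < own_last_comm - threshold_seconds:
        # The partner was probably restored: resend what it may have lost.
        return "INVALIDATE_PARTNER_KNOWLEDGE"
    return "NONE"
```

   When an IP address is moved to a different failover partner during
   reconfiguration, the same invalidation idea applies to that single
   address.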
   Third, in the event that a server receives an UPDREQALL message,
   what constitutes "all binding database information"?  At first
   glance this would seem to be information on every configured IP
   address in the server.  While this would be technically correct, it
   may impose a serious and unacceptable performance penalty on servers
   which have millions of configured IP addresses.

   What can be done to lessen the data that must be sent for an
   UPDREQALL?  When sending "all binding database information", if the
   sending server sends only information concerning IP addresses which
   have been at some time associated with clients, it will send enough
   information to satisfy the needs of the failover protocol.  It need
   not send information on any IP addresses that have never been used,
   since presumably they will be initialized as available to the
   primary server (i.e., FREE) on any server employing failover.

6.  Common Message Format

   This section discusses the message format that all failover messages
   have in common, including the message header format as well as the
   common option format.  See section 12 for the definitions of the
   specific options used in the failover protocol.

6.1.  Message header format

   The options contained in the payload data section of the failover
   message all use a two byte option number and two byte length format.

   All failover protocol messages are sent over the TCP connection
   between failover endpoints and encoded using a message format
   specific to the failover protocol.

   There exists a common message format for all failover messages,
   which utilizes the options in a way similar to the DHCP protocol.
   For each message type, some options are required and some are
   optional.  In addition, when a message is received, any options that
   are not understood by the receiving server MUST be ignored.
   All of the fields in the fixed portion of the message MUST be filled
   with correct data in every message sent.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      message length (2)       | msg type (1)  |payload off (1)|
   +---------------+---------------+---------------+---------------+
   |                           time (4)                            |
   +---------------------------------------------------------------+
   |                            xid (4)                            |
   +---------------------------------------------------------------+
   |         0 or more additional header bytes (variable)          |
   +---------------------------------------------------------------+
   |                    payload data (variable)                    |
   |                                                               |
   |                formatted as DHCP-style options                |
   |       using a two byte option code and two byte length        |
   |                 See section 6.2 for details.                  |
   +---------------------------------------------------------------+

   message length - 2 bytes, network byte order

      This is the length of the message.  It includes the two byte
      message length itself.  The maximum length is 2048 bytes.  The
      minimum length is 12.

   msg type - 1 byte

      The message type field is used to distinguish between messages.
      The following message types are defined:

      Value   Message Type
      -----   ------------
        0     reserved     not used
        1     POOLREQ      request allocation of addresses
        2     POOLRESP     respond with allocation count
        3     BNDUPD       update partner with binding info
        4     BNDACK       acknowledge receipt of binding update
        5     CONNECT      establish connection with the secondary
        6     CONNECTACK   respond to attempt to establish connection
                           with partner
        7     UPDREQALL    request full transfer of binding info
        8     UPDDONE      ack send and ack of req'd binding info
        9     UPDREQ       req transfer of un-acked binding info
       10     STATE        inform partner of current state or state
                           change
       11     CONTACT      probe communications integrity with partner
       12     DISCONNECT   close a connection

      New message types should be defined in one of two ranges, 0-127
      or 128-255.
      The range of 0-127 is used for messages that MUST be supported by
      every server, and if a server receives a message in the range of
      0-127 that it doesn't understand, it MUST close the TCP
      connection.  The range of 128-255 is used for messages which MAY
      be supported but are not required, and if a server receives a
      message in this range that it does not understand it SHOULD
      ignore the message.

   payload offset - 1 byte

      The byte offset of the Payload Data, from the beginning of the
      failover message header.  The value for the current protocol
      version (version 1) is 8.

   time - 4 bytes, network byte order

      The absolute time in GMT when the message was transmitted,
      represented as seconds elapsed since Jan 1, 1970 (i.e., similar
      to the ANSI C time_t time value representation).  While the ANSI
      C time_t value is signed, the value used in this specification is
      unsigned.

      A server SHOULD set this time as close to the actual transmission
      of the message as possible.

   xid - 4 bytes, network byte order

      This is the transaction id of the failover message.  The sender
      of a failover protocol message is responsible for setting this
      number, and the receiver of the message copies the number over
      into any response message, treating it as opaque data.  The
      sender MUST ensure that every message sent from a particular
      failover endpoint over the associated TCP connection has a unique
      transaction id.  For failover messages that have no corresponding
      response message, the XID value is meaningless, but MUST be
      supplied.  The XID value is used solely by the receiver of a
      response message to determine the corresponding request message.

      Request messages where the XID is used in the corresponding
      response messages are: POOLREQ, BNDUPD, CONNECT, UPDREQALL, and
      UPDREQ.  The corresponding response messages are POOLRESP,
      BNDACK, CONNECTACK, UPDDONE, and UPDDONE, respectively.
      As requests/responses don't survive connection reestablishment,
      XIDs only need to be unique during a specific connection.

   payload data - variable length

      The options are placed after the header, after skipping payload
      offset bytes from the beginning of the message.  The payload data
      options are not preceded by a "cookie" value.  The payload data
      is formatted as DHCP style options using two byte option codes
      and two byte option lengths.  The option codes are in a namespace
      which is unique to the failover protocol.

      The maximum length of the payload data in octets is 2048 less the
      size of the header, i.e., the maximum message length is 2048
      octets.

6.2.  Common option format

   The options contained in the payload data section of the failover
   message all use a two byte option number and two byte length format.
   The option numbers are drawn from an option number space unique to
   the failover protocol.  All of the message types share a common
   option number space and common option definitions, though not all
   options are required or meaningful for every message.

   In contrast to the options which appear in DHCP client and server
   messages, the options in failover messages are ordered.  That is,
   for some messages the order in which the options appear in the
   payload data area is significant.  The messages for which option
   ordering is significant explicitly describe the ordering
   requirements.  If no ordering requirements are mentioned, then the
   order is not significant for that message.

   All options which refer to time use an absolute time in GMT.  Time
   synchronization has already been achieved between the source and the
   target server using the CONNECT message and is updated and refined
   using the time in every packet.  The time value is an unsigned 32
   bit integer in network byte order giving the number of seconds since
   00:00 UTC, 1st January 1970.
   This can be converted to an NTP timestamp by adding decimal
   2208988800.  This time format will not wrap until the year 2106.
   Until sometime in 2038, it is equal to the ANSI C time_t value
   (which is a signed 32 bit value and will overflow into a negative
   number in 2038).

   Options should appear only once in each message (except for BNDUPD
   and BNDACK messages where batching is used; see section 6.3 for
   details).  An option that appears twice is not concatenated, but
   treated as an error.

   Specific option values are described in section 12.  See section 13
   for how to define additional options.

6.3.  Batching multiple binding update transactions in one BNDUPD
      message

   Implementations of this protocol MAY send multiple binding update
   transactions in one BNDUPD message, where a binding update
   transaction is defined as the set of options which are associated
   with the update of a single IP address.  All implementations of this
   protocol MUST be prepared to receive BNDUPD messages which contain
   multiple binding update transactions and respond correctly to them,
   including replying with a BNDACK message which contains status for
   the multiple binding update transactions contained in the BNDUPD
   message.

   In the discussion of sending and receiving BNDUPD messages in
   section 7.1 and BNDACK messages in section 7.2, each BNDUPD message
   and BNDACK message is assumed to contain a single binding update
   transaction in order to reduce the complexity of the discussions in
   section 7.

   Multiple binding update transactions MAY be batched together in one
   BNDUPD protocol message with the data sets for the individual
   transactions delimited by the assigned-IP-address option, which MUST
   appear first in the option set for each transaction.  Ordering of
   options between the assigned-IP-address options is not significant.
   This is illustrated in the following schematic representation:

      Non-IP Address/Non-client specific options first

      assigned-IP-address option for the first IP address
         Options pertaining to first address, including at least the
         binding-status option and others as required.

      assigned-IP-address option for the second IP address
         Options pertaining to second address, including at least the
         binding-status option and others as required.

      ...

      Trailing options (message digest).

   There MUST be a one-to-one correspondence between BNDUPD and BNDACK
   messages, and every BNDACK message MUST contain status for all of
   the binding update transactions in the corresponding BNDUPD message.

   The BNDACK message corresponding to a BNDUPD message MUST contain
   assigned-IP-address options for all of the binding update
   transactions in the BNDUPD message.  Thus, every BNDACK message
   contains exactly the same assigned-IP-address options as does its
   corresponding BNDUPD message.  The order of the assigned-IP-address
   options MAY, however, be different.

   Here is a schematic representation of a BNDACK:

      Non-IP Address/Non-client specific options first

      assigned-IP-address option for the first IP address
         If rejected, reject-reason option and message option.

      assigned-IP-address option for the second IP address
         If rejected, reject-reason option and message option.

      ...

      Trailing options (message digest).

   In case the server chooses to reject some or all of the IP address
   binding information in a BNDUPD message in a BNDACK reply, the
   BNDACK message MUST contain a reject-reason option following every
   assigned-IP-address option in order to indicate that the binding
   update transaction for that IP address was not accepted and why.  As
   with a BNDACK message containing a single binding update
   transaction, an assigned-IP-address option without any associated
   reject-reason option indicates a successful binding update
   transaction.
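   The option layout of section 6.2 and the batching rule above can be
   sketched as follows.  This is a minimal illustration, not a
   definitive implementation: it works on the payload data only, the
   numeric option codes are placeholders chosen for the example (the
   real values are defined in section 12), and the set of options
   treated as "trailing" is an assumption passed in by the caller.

```python
import struct

# Placeholder option codes for illustration only; see section 12 for the
# values actually assigned by the protocol.
ASSIGNED_IP_ADDRESS = 1
BINDING_STATUS = 2
MESSAGE_DIGEST = 3


def encode_option(code, value):
    """Encode one option: two byte code, two byte length, then the value."""
    return struct.pack("!HH", code, len(value)) + value


def decode_options(payload):
    """Return the (code, value) pairs found in payload, preserving order."""
    options = []
    pos = 0
    while pos < len(payload):
        if pos + 4 > len(payload):
            raise ValueError("truncated option header")
        code, length = struct.unpack_from("!HH", payload, pos)
        pos += 4
        if pos + length > len(payload):
            raise ValueError("truncated option value")
        options.append((code, payload[pos:pos + length]))
        pos += length
    return options


def split_transactions(options, trailing_codes=(MESSAGE_DIGEST,)):
    """Group ordered options into (leading, transactions, trailing).

    Each transaction starts at an assigned-IP-address option and runs
    until the next one; options before the first are "leading".
    """
    leading, transactions, trailing = [], [], []
    for code, value in options:
        if code in trailing_codes:
            trailing.append((code, value))
        elif code == ASSIGNED_IP_ADDRESS:
            transactions.append([(code, value)])
        elif transactions:
            transactions[-1].append((code, value))
        else:
            leading.append((code, value))
    return leading, transactions, trailing


# Example: a payload carrying two binding update transactions.
payload = b"".join([
    encode_option(ASSIGNED_IP_ADDRESS, bytes([10, 0, 0, 1])),
    encode_option(BINDING_STATUS, b"\x01"),
    encode_option(ASSIGNED_IP_ADDRESS, bytes([10, 0, 0, 2])),
    encode_option(BINDING_STATUS, b"\x02"),
    encode_option(MESSAGE_DIGEST, b"\x00" * 16),
])
leading, transactions, trailing = split_transactions(decode_options(payload))
```

   A receiver built this way can produce the BNDACK by emitting one
   assigned-IP-address option per entry in transactions, followed by a
   reject-reason option only for those it rejects.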
7.  Protocol Messages

   This section contains the detailed definition of the protocol
   messages, including the information to include when sending the
   message, as well as the actions to take upon receiving the message.
   The message type for each message appears as [n] in the heading for
   the message (see section 6.1).

7.1.  BNDUPD message [3]

   The binding update (BNDUPD) message is used to send the binding
   database changes (known as binding update transactions) to the
   partner server, and the partner server responds with a binding
   acknowledgement (BNDACK) message when it has successfully committed
   those changes to its own stable storage.

   The rest of the failover protocol exists to determine whether the
   partner server is able to communicate or not, and to enable the
   partners to exchange BNDUPD/BNDACK messages in order to keep their
   binding databases in stable storage synchronized.

   The rest of this section is written as though every BNDUPD message
   contains only a single binding update transaction in order to reduce
   the complexity of the discussion.  See section 6.3 for information
   on how to create and process BNDUPD and BNDACK messages which
   contain multiple binding update transactions.  Note that while a
   server MAY generate BNDUPD messages with multiple binding update
   transactions, every server MUST be able to process a BNDUPD message
   which contains multiple binding update transactions and generate the
   corresponding BNDACK messages with status for multiple binding
   update transactions.

   The following table summarizes the various options for the BNDUPD
   message.
                                    binding-status

                                                             BACKUP
                                                             RESET
                                                             ABANDONED
   Option                     ACTIVE   EXPIRED    RELEASED   FREE
   ------                     ------   -------    --------   ----
   assigned-IP-address (3)    MUST     MUST       MUST       MUST
   IP-flags                   MUST(4)  MUST(4)    MUST(4)    MUST(4)
   binding-status             MUST     MUST       MUST       MUST
   client-identifier          MAY      MAY        MAY        MAY(2)
   client-hardware-address    MUST     MUST       MUST       MAY(2)
   lease-expiration-time      MUST     MUST NOT   MUST NOT   MUST NOT
   potential-expiration-time  MUST     MUST NOT   MUST NOT   MUST NOT
   start-time-of-state        SHOULD   SHOULD     SHOULD     SHOULD
   client-last-trans.-time    MUST     SHOULD     MUST       MAY
   DDNS(1)                    SHOULD   SHOULD     SHOULD     SHOULD
   client-request-options     SHOULD   SHOULD NOT SHOULD     SHOULD NOT
   client-reply-options       SHOULD   SHOULD NOT SHOULD NOT SHOULD NOT

   (1) MUST if server is performing dynamic DNS for this IP address,
       else MUST NOT.
   (2) MUST NOT if binding-status is ABANDONED.
   (3) assigned-IP-address MUST be the first option for an IP address.
   (4) IP-flags option MUST appear if any flags are non-zero, else it
       MAY appear.

             Table 7.1-1: Options used in a BNDUPD message

7.1.1.  Sending the BNDUPD message

   A BNDUPD message SHOULD be generated whenever any binding changes.
   A change might be in the binding-status, the lease-expiration-time,
   or even just the last-transaction-time.  In general, any time a DHCP
   server writes its stable storage, a BNDUPD message SHOULD be
   generated.  This will often be the result of the processing of a
   DHCP client request, but it might also be the result of a successful
   dynamic DNS update operation.  Stable storage updates due to BNDUPD
   or BNDACK messages SHOULD NOT result in additional BNDUPD messages.

   BNDUPD (and BNDACK) messages refer to the binding-status of the IP
   address, and this protocol defines a series of binding-statuses,
   discussed in more detail below.  Some servers may not support all of
   these binding-statuses, and so in those cases they will not be sent.
   Upon receipt of a BNDUPD message which contains an unsupported
   binding-status, a reasonable interpretation should be made (see
   section 5.11).
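   As an illustration of how a sender might check a binding update
   transaction against Table 7.1-1 before transmitting it, the sketch
   below encodes only the unconditional MUST and MUST NOT entries plus
   note (2).  It is a hedged example: the function and variable names
   are hypothetical, the option names are simply the strings used in
   the table, and the conditional entries (IP-flags and DDNS) as well
   as the SHOULD and MAY levels are deliberately left out.

```python
# Column of Table 7.1-1 for each binding-status; the last column of the
# table covers BACKUP, RESET, ABANDONED and FREE.
COLUMN = {"ACTIVE": 0, "EXPIRED": 1, "RELEASED": 2,
          "BACKUP": 3, "RESET": 3, "ABANDONED": 3, "FREE": 3}

# Requirement level per column, as read from Table 7.1-1.  Rows whose
# level depends on notes (1) or (4) are not modelled here.
REQUIREMENTS = {
    "assigned-IP-address":       ("MUST", "MUST", "MUST", "MUST"),
    "binding-status":            ("MUST", "MUST", "MUST", "MUST"),
    "client-hardware-address":   ("MUST", "MUST", "MUST", "MAY"),
    "lease-expiration-time":     ("MUST", "MUST NOT", "MUST NOT", "MUST NOT"),
    "potential-expiration-time": ("MUST", "MUST NOT", "MUST NOT", "MUST NOT"),
    "client-last-trans.-time":   ("MUST", "SHOULD", "MUST", "MAY"),
}


def check_bndupd(binding_status, present):
    """Return a list of problems for one binding update transaction.

    binding_status is one of the keys of COLUMN; present is the set of
    option names included in the transaction.
    """
    column = COLUMN[binding_status]
    problems = []
    for option, levels in REQUIREMENTS.items():
        level = levels[column]
        if level == "MUST" and option not in present:
            problems.append(f"missing {option}")
        elif level == "MUST NOT" and option in present:
            problems.append(f"unexpected {option}")
    # Note (2): client identification MUST NOT accompany ABANDONED.
    if binding_status == "ABANDONED":
        for option in ("client-identifier", "client-hardware-address"):
            if option in present:
                problems.append(f"unexpected {option}")
    return problems
```

   For example, a FREE transaction that still carries a
   lease-expiration-time option would be reported as containing an
   unexpected option.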