Network Working Group
INTERNET-DRAFT                                               A. Lior
Category: Informational                          Bridgewater Systems
draft-lior-radius-reliable-accounting-00.txt
Expires: December 23rd 2003

Remote Authentication Dial-In User Service (RADIUS) Reliable Transport

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC2026].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

The Remote Authentication Dial-In User Service (RADIUS) Requests for Comments (RFCs) do not address the reliability of RADIUS message transport. This Informational Internet-Draft describes procedures for retransmission, failover, and failback.

Lior, et al. [Page 1]
RADIUS Reliable Transport February 2003

Table of Contents

1. Introduction
   1.1 Reliable Transport of Authentication and Authorization messages
   1.2 Reliable Transport of Accounting messages
   1.3 Terminology
   1.4 Requirements language
2. RADIUS Transport Today
   2.1 RADIUS Transport of Authentication and Authorization
   2.2 RADIUS Transport of Accounting
   2.3 RADIUS Transportation of Dynamic Authorization messages
3. Model
4. General Requirements
5. General Algorithm
   5.1 Retransmit Algorithm
   5.2 Offline Algorithm
   5.3 Online Algorithm
6. Special Consideration
   6.1 Consideration for Accounting Messages
   6.2 Consideration for Dynamic Authorization Messages
7. Security Considerations
8. Normative References
9. Informative References
10. Acknowledgments
11. Author's Addresses
12. Intellectual Property Statement
13. Full Copyright Statement
14. Expiration Date

1. Introduction

In the context of this document, transport reliability includes the ability to detect failures in communication between two RADIUS entities (client and server), retransmission of messages, failover procedures, and failback procedures.

Transport reliability is not part of the RADIUS specification and has been left up to implementers, resulting in inconsistent approaches that make it difficult to engineer reliable RADIUS-based deployments.
This document recommends an approach to provide a reliable transport for RADIUS messages.

There have been other discussions covering AAA Reliable Transport [AAATransport], and implementations of these, for example Diameter [Diameter]. However, these discussions covered AAA protocols that use connection-oriented protocols (TCP and SCTP), whereas RADIUS uses a connectionless protocol (UDP) as the transport mechanism. Where applicable, this document adopts some of the principles covered by these other sources.

1.1 Reliable Transport of Authentication and Authorization messages

TODO: Motivation for reliable transport during Authentication and Authorization is needed.

1.2 Reliable Transport of Accounting messages

Usage-based billing requires accuracy in billing presentment. Customers should be presented with a bill that is consistent with their usage and contains no errors. When errors in accounting occur, operators often err on the side of the customer. This results in loss of revenue for the operator. Worse, if a customer gets an inconsistent or inaccurate bill, they may call customer support. Support calls are expensive, and customer dissatisfaction must be avoided as well.

The RADIUS protocol does not have a reliable mechanism for delivering accounting messages. Historically, RADIUS has been used to service dialup subscribers, who are generally billed in a very coarse-grained fashion that ranges from monthly or yearly contracts to block-of-time contracts measured in hours. Precision from the accounting records was not required. For example, if the subscriber is using an hourly plan and the accounting records for a particular session are lost, then the operator may lose a few hours of revenue billed at pennies per hour. This coarse granularity meant that accounting records did not have to be reliable.
In systems such as Voice over IP (VoIP) and WiFi LANs, where usage-based billing is used, the loss of accounting records could represent a significant loss of revenue. For example, assume that RADIUS Accounting Interim messages are generated every 60 to 600 seconds (60 is the minimum, 600 is the recommended minimum). If the RADIUS Accounting Stop record is lost, the loss represents 60 to 600 seconds of revenue.

This draft describes a best practice for increasing the reliability of RADIUS Accounting messages.

1.3 Terminology

1.4 Requirements language

In this document, several words are used to signify the requirements of the specification. These words are often capitalized. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. RADIUS Transport Today

2.1 RADIUS Transport of Authentication and Authorization

RADIUS Authentication and Authorization procedures are described in [RFC2865] and [RFC2869].

In the process of authenticating a user, the RADIUS Client (e.g. a NAS) will send one or more Access-Request messages to a proxy server, which may forward the requests to a remote server. Under normal conditions, each Access-Request should result in one of the following responses: Access-Accept, Access-Reject, or Access-Challenge.

Under certain conditions, such as network errors, the RADIUS Client may not get a response back. If a response is not received after some unspecified time, the NAS or proxy RADIUS server will retry and eventually fail over to another RADIUS server. As time progresses, servers that failed need to be retried.

How long to wait, how many times to retry, how to fail over, and how and when to fail back are not covered by the RADIUS specification.

2.2 RADIUS Transport of Accounting

RADIUS accounting messages are described in [RFC2866] and [RFC2869].
A RADIUS client sends Accounting-Request (start) messages at the start of session-segments, Accounting-Request (stop) messages at the end of session-segments and, optionally, Accounting-Request (Interim) messages periodically during the session at a rate controlled by the Accounting Interim Interval. The RADIUS client receives an Accounting-Response message once a RADIUS Accounting Server has received the Accounting packet and taken responsibility for it.

The Accounting messages may traverse zero or more proxy RADIUS Accounting Servers before reaching their destination. As described in [RFC2866], a proxy RADIUS Accounting Server may pass the Accounting messages immediately to the next RADIUS Accounting Server in the proxy chain, or it may store the accounting messages and send them at a later time. If the proxy RADIUS Accounting Server stores the accounting message, it responds to the client with an Accounting-Response message.

The RADIUS specification only provides for an Accounting-Response to acknowledge the successful reception of Accounting packets. There isn't an Accounting NAK message. A receiver of an accounting message will silently discard a bad message. The sender of the message may not know why an acknowledgement was not received: was the request message lost, was the response lost, or was there an error? The sender has no option but to retry.

From RFC2866, "It is recommended that the client continue attempting to send the Accounting-Request packet until it receives an acknowledgement, using some form of backoff. If no response is returned within a length of time, the request is re-sent a number of times. The client can also forward requests to an alternate server or servers in the event that the primary server is down or unreachable.
An alternate server can be used either after a number of tries to the primary server fail, or in a round-robin fashion. Retry and fallback algorithms are the topic of current research and are not specified in detail in this document."

Furthermore, failure handling is made more complex by the presence of proxy servers. Failures can occur at each proxy. The specification is not clear about how failures should be handled at the proxy. Should a proxy silently discard the message and let the originator retry, or should the proxy itself retry? How long should it wait?

The specification is also not clear on whether the three types of accounting messages should be treated equally when failures are detected.

In this Internet-Draft we recommend strategies for dealing with the above shortcomings. We believe that applying these recommendations will go a long way towards making RADIUS accounting reliable enough to be used in applications that demand stringent accounting, such as usage-based billing.

2.3 RADIUS Transportation of Dynamic Authorization messages

Dynamic authorization messages, described in [CHIBA], include Disconnect-Messages and Change-of-Authorization messages. These messages are sent by a RADIUS server to the NAS directly or via intermediaries.

A sender of a Disconnect Message or a Change of Authorization message expects a NAK or ACK response to the message. If the sender does not receive a NAK or ACK, it should retry sending the message. The retransmission and failover procedures are not specified.

In the case of [CHIBA], when the sender receives a NAK message, the sender should examine the Error-Code and, based on its value, may choose to retransmit the message. See further details below.

3. Model

In this section we present a general set of recommendations to address the above issues. The general model used for these discussions is represented in figure 1.
In figure 1, a RADIUS client can be a NAS or another RADIUS server (an intermediary). The RADIUS server can be an intermediary or the end RADIUS server. As well, in consideration of [CHIBA], where the message flow is reversed, that is, from the end RADIUS server to the NAS, the RADIUS Client can be the end RADIUS server and the RADIUS server can represent either an intermediary or the NAS.

In most robust deployments, as is assumed here, a RADIUS Client has two or more RADIUS Servers that it can use to send a message destined to a particular location. We call a collection of zero or more RADIUS Servers that proxy to a given location a Proxy Group.

A RADIUS Client can have more than one Proxy Group. Specifically, a RADIUS Client knows which Proxy Group to route a message to. The routing decision can be based on the type of message (e.g. Access-Request, Accounting) and/or attributes contained in the message (e.g. the NAI, a calling number). The client also keeps state about each of the RADIUS Servers in the Proxy Group. It knows, for example, which RADIUS Servers are available. It may use only one of the available RADIUS Servers all the time, or all of the available RADIUS Servers in the Proxy Group in a round-robin fashion.

In figure 1, Proxy Group A is used to send messages based on NAI = x, and the RADIUS client is using only one RADIUS server. Proxy Group B is used to route messages based on NAI = y, and the RADIUS client is using both RADIUS servers in a round-robin fashion.

A RADIUS Server can exist in more than one Proxy Group. The RADIUS Client keeps a separate state for that RADIUS Server in each group. Therefore, from the client's point of view, a RADIUS Server that appears in two Proxy Groups (services two realms) will appear as two distinct RADIUS Servers. Unless otherwise specified, the term RADIUS Server refers to the logical RADIUS Server in a particular Proxy Group.
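As an illustration only, the per-group state that a RADIUS Client keeps could be held in a structure like the following Python sketch. The names ServerState, ProxyGroup, pick, and round_robin are hypothetical and are not taken from any RADIUS specification. How a message is routed to a Proxy Group (for example by NAI) is left out of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerState:
    """State the client keeps for one logical RADIUS Server in one group."""
    address: str
    online: bool = True


@dataclass
class ProxyGroup:
    """Zero or more RADIUS Servers that proxy to the same location."""
    servers: List[ServerState]
    round_robin: bool = False
    next_index: int = 0  # where the next round-robin search starts

    def pick(self) -> Optional[ServerState]:
        """Return an online server, or None if none is available."""
        count = len(self.servers)
        start = self.next_index if self.round_robin else 0
        for offset in range(count):
            index = (start + offset) % count
            if self.servers[index].online:
                if self.round_robin:
                    self.next_index = (index + 1) % count
                return self.servers[index]
        return None


# The same machine (10.0.0.2, an assumed address) appears in both groups
# but gets a separate ServerState in each, so it is tracked independently.
group_a = ProxyGroup([ServerState("10.0.0.1"), ServerState("10.0.0.2")])
group_b = ProxyGroup([ServerState("10.0.0.2"), ServerState("10.0.0.3")],
                     round_robin=True)
```

Keeping one ServerState object per group membership mirrors the statement that a RADIUS Server in two Proxy Groups appears to the client as two distinct RADIUS Servers.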
                                Proxy Group A
                                +------------+
                                |            |
                                |  +------+  |
                                |  |      |  |
                NAI x           |  |RADIUS|  |
       +------------------------|->|Server|  |
       |                        |  |      |  |
       |                        |  +------+  |
       |                        |            |
       |                        |  +------+  |
       |                        |  |      |  |
       |                        |  |RADIUS|  |
       |                        |  |Server|  |
       |                        |  |      |  |
  +--------+                    |  +------+  |
  |        |                    |            |
  | RADIUS |                    +------------+
  |        |                    Proxy Group B
  | Client |                    +------------+
  |        |                    |            |
  +--------+                    |  +------+  |
       |                        |  |      |  |
       |                        |  |RADIUS|  |
       |                +-------|->|Server|  |
       |                |       |  |      |  |
       |                |       |  +------+  |
       +----------------+       |            |
             NAI y      |       |  +------+  |
                        |       |  |      |  |
                        |       |  |RADIUS|  |
                        +-------|->|Server|  |
                                |  |      |  |
                                |  +------+  |
                                |            |
                                +------------+

                    Figure 1: Basic Architecture.

Note that when discussing these failover procedures, we can choose to perform them only at the originator of the messages and not at the intermediaries, or we can perform them at both the originating server and the intermediaries. To minimize traffic in the network, and to minimize time delays, it is highly desirable that we detect failure conditions and act on them at the intermediaries.

4. General Requirements

Given the above model, the following are the general capabilities that form a reliable transport.

- A RADIUS Client MUST be able to determine when a RADIUS Server is not available. A RADIUS Server is not available because it is not reachable due to a network failure, the machine is not working, or the application is not responding. The RADIUS Client will put a RADIUS Server that is not available in the offline state. A RADIUS Server in the offline state will not be used to send messages.

- A RADIUS Server that has previously been declared offline SHOULD automatically be reinstated into the online state as soon as possible. This is particularly important when load-balancing is used. The algorithm used to bring a RADIUS server online should be conservative. The cost of bringing a RADIUS server online falsely is added traffic and added delays.
- A RADIUS Client that does not receive an acknowledgement will attempt to retransmit. In order to keep the network traffic down, to reduce the number of duplicate requests, and to give a potentially overloaded RADIUS Server a chance to clear its queues, the retransmission algorithm should be conservative.

5. General Algorithm

When a RADIUS Client sends a message, it expects a response. If that response is not received in a given amount of time, the RADIUS Client will retransmit the message to the same RADIUS Server. The algorithm for retransmission is described below.

If the retransmission algorithm fails, the RADIUS Client will attempt to send the message to another RADIUS Server in the Proxy Group. Note that the RADIUS Server is not necessarily brought offline. The Offline Algorithm determines when a RADIUS Server is brought offline. When the failure rate reaches a certain threshold for a number of time periods, the Offline Algorithm places the RADIUS Server in the Proxy Group into the offline state for a period of time.

The Online Algorithm is used to automatically bring RADIUS Servers that are in the Offline state back to the Online state. Normally an offline RADIUS Server is brought out of the Offline state when the Offline-Period expires. However, under certain conditions an offline RADIUS Server will be brought out of that state earlier.

These algorithms are explained in detail in the following sections.

5.1 Retransmit Algorithm

This algorithm describes how a message gets retransmitted if a response is not received in a specified amount of time, T-retry.

1) Set T-retry to its minimum value.
2) Send the message and start the timer.
3) If a response is not received before T-retry expires, double T-retry up to a maximum value, and resend the message.
4) Repeat steps 2 and 3 N times, where N is configurable.

T-retry is associated with each message, not with the RADIUS Server.
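The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name retransmit, the send callable, and the default values for the minimum and maximum T-retry are all hypothetical, and "N times" is read here as one initial transmission followed by up to N retransmissions.

```python
from typing import Callable, Optional


def retransmit(send: Callable[[bytes, float], Optional[bytes]],
               message: bytes,
               t_min: float = 1.0,
               t_max: float = 8.0,
               n: int = 3) -> Optional[bytes]:
    """Send a message with a doubling per-message timeout, T-retry.

    send(message, timeout) is a hypothetical callable that transmits the
    message and returns the response, or None if no response arrives
    within timeout seconds.

    Returns the response, or None once the retries are exhausted, at
    which point the caller would fail over to another RADIUS Server.
    """
    t_retry = t_min                          # step 1: start at the minimum
    for _ in range(1 + n):                   # first send plus N retries
        response = send(message, t_retry)    # step 2: send and wait
        if response is not None:
            return response
        t_retry = min(2 * t_retry, t_max)    # step 3: double, capped
    return None
```

Because T-retry is a local variable, it belongs to the message being sent and not to the RADIUS Server, which matches the last sentence of the algorithm description.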
Once a message has been retried N times, we fail the message over to the next available RADIUS Server in the Proxy Group. Note that we do not place the RADIUS Server in the offline state. The RADIUS Server is placed in the Offline state by the Offline Algorithm described below.

5.2 Offline Algorithm

A RADIUS Client places a RADIUS Server in the Offline state when the RADIUS Client perceives that the RADIUS Server is not responsive.

There are many reasons why a RADIUS Client may not receive a response:

- The network dropped the packet.
- The server has silently discarded the packet due to errors.
- The server is busy.
- The server is dead.

Note that in a proxy situation the failure may have occurred anywhere in the proxy chain. As well, the RADIUS Client may time out while a RADIUS Server further down the proxy chain is performing a retry algorithm. Therefore, using responses to determine whether the immediate RADIUS Server is operational is difficult.

In cases where the RADIUS Server exists in more than one Proxy Group (it is servicing multiple realms), it may be possible to determine whether that RADIUS Server is dead. However, generally, the only way to determine whether the immediate server is alive is to send it an out-of-band message. This approach is outside the scope of the RADIUS protocol and will not be considered here. See [AAATransport].

The process for determining non-responsiveness must also be very carefully considered. A single failure of the retransmit algorithm is not sufficient. A better approach is to use a number of such failures to determine whether or not a RADIUS Server should be placed in the Offline state. This allows us to handle the case where messages may be silently discarded, or lost for other reasons, as may be the case for UDP packets.

The Offline Algorithm is used to determine when a server in the Proxy Group is placed in the Offline state.
The algorithm takes into account consecutive failures, caused when the RADIUS server has completely failed, and also intermittent failures, which may occur when a server is overloaded.

The algorithm uses a number of buckets. Each bucket represents a uniform period of time (for example, one minute). Each bucket consists of two counters: number-of-requests, which counts the total number of requests sent during the time period of the bucket, and number-of-failures, which counts the total number of failures (timeouts) experienced during the time period of the bucket. These counters are used to determine whether there were significant failures during the bucket period.

As messages are sent, number-of-requests is incremented. If a message fails (times out), number-of-failures is incremented.

The algorithm requires four parameters:

a) minimum-request-threshold, which is used to make sure that we have a sufficient number of messages in the bucket to make a sound decision;

b) failure-rate-threshold, which determines at what error rate we declare that the bucket has failed;

c) N, which represents the number of consecutive buckets that need to fail before we put the RADIUS server in the offline state;

d) Offline-Period, which is the length of time that the RADIUS Server should be kept in the offline state.

A RADIUS server is placed in the offline state under either of the following conditions:

1) For a given bucket, the number of requests processed is greater than the minimum-request-threshold and 100% of the requests failed during the bucket period; or

2) N consecutive buckets experienced significant intermittent failures. Note: a bucket that does not contain a sufficient number of requests is simply skipped or ignored. It does not break the continuity of the sample buckets.
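The two conditions can be sketched in Python as shown here. This is only one possible reading: the class name OfflineDetector and its methods record and close_bucket are hypothetical, a bucket is taken to have significant intermittent failures when number-of-failures divided by number-of-requests exceeds failure-rate-threshold, and the handling of Offline-Period (timing the offline state) is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class OfflineDetector:
    """Per-server bucket counters for the Offline Algorithm (a sketch)."""
    minimum_request_threshold: int
    failure_rate_threshold: float
    n: int                       # consecutive failed buckets required
    number_of_requests: int = 0  # counter for the current bucket
    number_of_failures: int = 0  # counter for the current bucket
    failed_buckets: int = 0      # current run of failed buckets

    def record(self, failed: bool) -> None:
        """Count one request sent, and whether it timed out."""
        self.number_of_requests += 1
        if failed:
            self.number_of_failures += 1

    def close_bucket(self) -> bool:
        """End the bucket period; True means 'place the server offline'."""
        requests = self.number_of_requests
        failures = self.number_of_failures
        self.number_of_requests = 0
        self.number_of_failures = 0
        if requests <= self.minimum_request_threshold:
            return False         # too few samples: skip, keep the run
        if failures == requests:
            return True          # condition 1: 100% failures
        if failures / requests > self.failure_rate_threshold:
            self.failed_buckets += 1
        else:
            self.failed_buckets = 0
        return self.failed_buckets >= self.n   # condition 2
```

In this sketch the caller would invoke close_bucket once per bucket period (for example, from a one-minute timer) and would start with a fresh detector when the server is brought back online.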
We say that a bucket has experienced significant intermittent failures if the number of requests processed is greater than the minimum-request-threshold and the error rate exceeds the failure-rate-threshold. That is:

   number-of-failures / number-of-requests > failure-rate-threshold

Once a RADIUS server has been placed in the Offline state, it will remain in that state for the amount of time specified by the Offline-Period parameter.

5.3 Online Algorithm

The Online Algorithm is the procedure used to bring a RADIUS Server in the Offline state back online.

Normally, a RADIUS Server will be placed in the Offline state for a period of time known as the Offline-Period. However, there could be situations where the number of available RADIUS servers in a Proxy Group is too low. If all the RADIUS Servers in a Proxy Group are offline, we would have a service outage for the realm that the Proxy Group is servicing. Therefore we have to make sure that we never run out of RADIUS Servers. Furthermore, if we are using load balancing in a Proxy Group, it may be highly desirable to maintain a certain number of RADIUS Servers, even if we have to bring some of them out of the Offline state earlier.

The Online Algorithm brings RADIUS Servers out of the Offline state when the Offline-Period has expired. As well, if the number of available RADIUS Servers in a Proxy Group falls below a threshold, the Online Algorithm will bring online the RADIUS Server(s) that are closest to the end of their Offline-Period.

Alternatively, the Online Algorithm may also take into account the RADIUS Server's state in other Proxy Groups.

6. Special Consideration

The following section describes special considerations for the different types of messages.
6.1 Consideration for Accounting Messages

Accounting messages are critical. With respect to retransmission, however, we recommend that the retransmission algorithm not be applied to Accounting-Request (Interim) messages. Note that failures should still be reported when Accounting-Response messages are not received for Accounting-Request (Interim) messages.

6.2 Consideration for Dynamic Authorization Messages

When the sender of a Disconnect message or a Change-of-Authorization message receives a NAK, it should examine the Error-Code to determine whether or not it should retransmit.

If the Error-Code is set to "Request Not Routable" (502), the sender should retry sending the message to another RADIUS Server in the Proxy Group.

If the Error-Code is set to "Other Proxy Processing Error" (505), the sender should treat this as a non-response, wait as it normally would, and retransmit the message.

If the RADIUS Server is sending the Disconnect Message or Change-of-Authorization message directly to the NAS where the session resides, then failing over does not make any sense.

7. Security Considerations

This document enhances the existing RADIUS specification by recommending strategies for failure detection, retransmission procedures, failover procedures, and failback procedures. The document does not modify the base protocols, and therefore the security considerations are the same as those discussed in the appropriate documents.

However, this section addresses the security concerns that are introduced by the procedures discussed in this document.

- Effects due to Denial of Service attacks

8. Normative References

[RFC2026]

[RFC2119]

[RFC2865] Rigney, C., Rubens, A., Simpson, W. and S. Willens, "Remote Authentication Dial In User Service (RADIUS)", RFC 2865, June 2000.

[RFC2866] Rigney, C., "RADIUS Accounting", RFC 2866, June 2000.
[RFC2869] Rigney, C., Willats, W., Calhoun, P., "RADIUS Extensions", RFC 2869, June 2000.

[RFC2868]

[CHIBA] Chiba, M., Dommety, G., Eklund, M., Mitton, D., Aboba, B., "Dynamic Authorization Extensions to Remote Authentication Dial In User Service (RADIUS)", draft-chiba-radius-dynamic-authorization-20.txt, Internet draft (work in progress), 15 May 2003.

[RFC2988] Paxson, V., Allman, M., "Computing TCP's Retransmission Timer", RFC 2988, November 2000.

9. Informative References

[RFC3127] Mitton, D., "Authentication, Authorization, and Accounting: Protocol Evaluation", RFC 3127, June 2001.

[AAATransport] Aboba, B. and J. Wood, "Authentication, Authorization and Accounting Transport Profile", draft-ietf-aaa-transport-12.txt, Internet draft (work in progress), January 2003.

10. Acknowledgments

Funding for the RFC Editor function is currently provided by the Internet Society.

The author would like to thank the following people: Yong Li and Helena Mancini from Bridgewater Systems.

11. Author's Addresses

Avi Lior
Bridgewater Systems
303 Terry Fox Drive
Suite 100
Ottawa Ontario
Canada
avi@bridgewatersystems.com

12. Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11.
Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

13. Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

14. Expiration Date

This memo is filed as , and expires December 23, 2003.