| < draft-ietf-dime-overload-reqs-06.txt | draft-ietf-dime-overload-reqs-07.txt > | |||
|---|---|---|---|---|
| Network Working Group E. McMurry | Network Working Group E. McMurry | |||
| Internet-Draft B. Campbell | Internet-Draft B. Campbell | |||
| Intended status: Standards Track Tekelec | Intended status: Informational Tekelec | |||
| Expires: October 19, 2013 April 17, 2013 | Expires: December 8, 2013 June 6, 2013 | |||
| Diameter Overload Control Requirements | Diameter Overload Control Requirements | |||
| draft-ietf-dime-overload-reqs-06 | draft-ietf-dime-overload-reqs-07 | |||
| Abstract | Abstract | |||
| When a Diameter server or agent becomes overloaded, it needs to be | When a Diameter server or agent becomes overloaded, it needs to be | |||
| able to gracefully reduce its load, typically by informing clients to | able to gracefully reduce its load, typically by informing clients to | |||
| reduce sending traffic for some period of time. Otherwise, it must | reduce sending traffic for some period of time. Otherwise, it must | |||
| continue to expend resources parsing and responding to Diameter | continue to expend resources parsing and responding to Diameter | |||
| messages, possibly resulting in congestion collapse. The existing | messages, possibly resulting in congestion collapse. The existing | |||
| Diameter mechanisms, listed in Section 3, are not sufficient for this | Diameter mechanisms, listed in Section 3, are not sufficient for this | |||
| purpose. This document describes the limitations of the existing | purpose. This document describes the limitations of the existing | |||
| skipping to change at page 1, line 38 ¶ | skipping to change at page 1, line 38 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on October 19, 2013. | This Internet-Draft will expire on December 8, 2013. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2013 IETF Trust and the persons identified as the | Copyright (c) 2013 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| skipping to change at page 2, line 29 ¶ | skipping to change at page 2, line 29 ¶ | |||
| 2.2. Agent Scenarios . . . . . . . . . . . . . . . . . . . . . 9 | 2.2. Agent Scenarios . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 2.3. Interconnect Scenario . . . . . . . . . . . . . . . . . . 12 | 2.3. Interconnect Scenario . . . . . . . . . . . . . . . . . . 12 | |||
| 3. Existing Mechanisms . . . . . . . . . . . . . . . . . . . . . 13 | 3. Existing Mechanisms . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 4. Issues with the Current Mechanisms . . . . . . . . . . . . . . 14 | 4. Issues with the Current Mechanisms . . . . . . . . . . . . . . 14 | |||
| 4.1. Problems with Implicit Mechanism . . . . . . . . . . . . . 15 | 4.1. Problems with Implicit Mechanism . . . . . . . . . . . . . 15 | |||
| 4.2. Problems with Explicit Mechanisms . . . . . . . . . . . . 15 | 4.2. Problems with Explicit Mechanisms . . . . . . . . . . . . 15 | |||
| 5. Diameter Overload Case Studies . . . . . . . . . . . . . . . . 16 | 5. Diameter Overload Case Studies . . . . . . . . . . . . . . . . 16 | |||
| 5.1. Overload in Mobile Data Networks . . . . . . . . . . . . . 16 | 5.1. Overload in Mobile Data Networks . . . . . . . . . . . . . 16 | |||
| 5.2. 3GPP Study on Core Network Overload . . . . . . . . . . . 17 | 5.2. 3GPP Study on Core Network Overload . . . . . . . . . . . 17 | |||
| 6. Extensibility and Application Independence . . . . . . . . . . 18 | 6. Extensibility and Application Independence . . . . . . . . . . 18 | |||
| 7. Solution Requirements . . . . . . . . . . . . . . . . . . . . 19 | 7. Solution Requirements . . . . . . . . . . . . . . . . . . . . 18 | |||
| 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 9. Security Considerations . . . . . . . . . . . . . . . . . . . 23 | 9. Security Considerations . . . . . . . . . . . . . . . . . . . 23 | |||
| 9.1. Access Control . . . . . . . . . . . . . . . . . . . . . . 24 | 9.1. Access Control . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 9.2. Denial-of-Service Attacks . . . . . . . . . . . . . . . . 24 | 9.2. Denial-of-Service Attacks . . . . . . . . . . . . . . . . 24 | |||
| 9.3. Replay Attacks . . . . . . . . . . . . . . . . . . . . . . 24 | 9.3. Replay Attacks . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 9.4. Man-in-the-Middle Attacks . . . . . . . . . . . . . . . . 25 | 9.4. Man-in-the-Middle Attacks . . . . . . . . . . . . . . . . 25 | |||
| 9.5. Compromised Hosts . . . . . . . . . . . . . . . . . . . . 25 | 9.5. Compromised Hosts . . . . . . . . . . . . . . . . . . . . 25 | |||
| 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 10.1. Normative References . . . . . . . . . . . . . . . . . . . 25 | 10.1. Normative References . . . . . . . . . . . . . . . . . . . 25 | |||
| 10.2. Informative References . . . . . . . . . . . . . . . . . . 25 | 10.2. Informative References . . . . . . . . . . . . . . . . . . 26 | |||
| Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 26 | Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 26 | |||
| Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 26 | Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 27 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 1. Introduction | 1. Introduction | |||
| When a Diameter [RFC6733] server or agent becomes overloaded, it | When a Diameter [RFC6733] server or agent becomes overloaded, it | |||
| needs to be able to gracefully reduce its load, typically by | needs to be able to gracefully reduce its load, typically by | |||
| informing clients to reduce sending traffic for some period of time. | informing clients to reduce sending traffic for some period of time. | |||
| Otherwise, it must continue to expend resources parsing and | Otherwise, it must continue to expend resources parsing and | |||
| responding to Diameter messages, possibly resulting in congestion | responding to Diameter messages, possibly resulting in congestion | |||
| collapse. The existing mechanisms provided by Diameter are not | collapse. The existing mechanisms provided by Diameter are not | |||
| skipping to change at page 16, line 15 ¶ | skipping to change at page 16, line 15 ¶ | |||
| client should wait before retrying the overloaded destination. If an | client should wait before retrying the overloaded destination. If an | |||
| agent or server supports multiple realms and/or applications, | agent or server supports multiple realms and/or applications, | |||
| DIAMETER_TOO_BUSY offers no way to indicate that it is overloaded for | DIAMETER_TOO_BUSY offers no way to indicate that it is overloaded for | |||
| one application but not another. A DIAMETER_TOO_BUSY error can only | one application but not another. A DIAMETER_TOO_BUSY error can only | |||
| indicate overload at a "whole server" scope. | indicate overload at a "whole server" scope. | |||
| Agent processing of a DIAMETER_TOO_BUSY response is also problematic | Agent processing of a DIAMETER_TOO_BUSY response is also problematic | |||
| as described in the base specification. DIAMETER_TOO_BUSY is defined | as described in the base specification. DIAMETER_TOO_BUSY is defined | |||
| as a protocol error. If an agent receives a protocol error, it may | as a protocol error. If an agent receives a protocol error, it may | |||
| either handle it locally or it may forward the response back towards | either handle it locally or it may forward the response back towards | |||
| the downstream peer. (The Diameter specification is inconsistent | the downstream peer. If a downstream peer receives the | |||
| about whether a protocol error MAY or SHOULD be handled by an agent, | ||||
| rather than forwarded downstream.) If a downstream peer receives the | ||||
| DIAMETER_TOO_BUSY response, it may stop sending all requests to the | DIAMETER_TOO_BUSY response, it may stop sending all requests to the | |||
| agent for some period of time, even though the agent may still be | agent for some period of time, even though the agent may still be | |||
| able to deliver requests to other upstream peers. | able to deliver requests to other upstream peers. | |||
| DIAMETER_UNABLE_TO_DELIVER, or using DPR with cause code BUSY also | DIAMETER_UNABLE_TO_DELIVER, or using DPR with cause code BUSY also | |||
| have no mechanisms for specifying the scope or cause of the failure, | have no mechanisms for specifying the scope or cause of the failure, | |||
| or the durational validity. | or the durational validity. | |||
| The issues with error responses in [RFC6733] extend beyond the | The issues with error responses in [RFC6733] extend beyond the | |||
| particular issues for overload control and have been addressed in an | particular issues for overload control and have been addressed in an | |||
| ad hoc fashion by various implementations. Addressing these in a | ad hoc fashion by various implementations. Addressing these in a | |||
| standard way would be a useful exercise, but it is beyond the scope | standard way would be a useful exercise, but it is beyond the scope | |||
| of this document. | of this document. | |||
| 5. Diameter Overload Case Studies | 5. Diameter Overload Case Studies | |||
| 5.1. Overload in Mobile Data Networks | 5.1. Overload in Mobile Data Networks | |||
| As the number of Third Generation (3G) and Long Term Evolution (LTE) | As the number of Third Generation (3G) and Long Term Evolution (LTE) | |||
| enabled smartphone devices continues to expand in mobility networks, | enabled smartphone devices continues to expand in mobile networks, | |||
| there have been situations where high signaling traffic load led to | there have been situations where high signaling traffic load led to | |||
| overload events at the Diameter-based Home Location Registries (HLR) | overload events at the Diameter-based Home Location Registries (HLR) | |||
| and/or Home Subscriber Servers (HSS) [TR23.843]. The root causes of | and/or Home Subscriber Servers (HSS) [TR23.843]. The root causes of | |||
| the HLR congestion events were manifold but included hardware failure | the HLR congestion events were manifold but included hardware failure | |||
| and procedural errors. The result was high signaling traffic load on | and procedural errors. The result was high signaling traffic load on | |||
| the HLR and HSS. | the HLR and HSS. | |||
| The 3GPP architecture [TS23.002] makes extensive use of Diameter. It | The 3GPP architecture [TS23.002] makes extensive use of Diameter. It | |||
| is used for mobility management [TS29.272] (and others), IP | is used for mobility management [TS29.272] (and others), IP | |||
| Multimedia Subsystem (IMS) [TS29.228] (and others), policy and | Multimedia Subsystem (IMS) [TS29.228] (and others), policy and | |||
| skipping to change at page 17, line 25 ¶ | skipping to change at page 17, line 23 ¶ | |||
| Smartphones, an increasingly large percentage of mobile devices, | Smartphones, an increasingly large percentage of mobile devices, | |||
| contribute much more heavily, relative to non-smartphones, to the | contribute much more heavily, relative to non-smartphones, to the | |||
| continuation of a registration surge due to their very aggressive | continuation of a registration surge due to their very aggressive | |||
| registration algorithms. Smartphone behavior contributes to network | registration algorithms. Smartphone behavior contributes to network | |||
| loading and can contribute to overload conditions. The aggressive | loading and can contribute to overload conditions. The aggressive | |||
| smartphone logic is designed to: | smartphone logic is designed to: | |||
| a. always have voice and data registration, and | a. always have voice and data registration, and | |||
| b. constantly try to be on 3G or LTE data (and thus on 3G voice or | b. constantly try to be on 3G or LTE data (and thus on 3G voice or | |||
| VoLTE) for their added benefits. | VoLTE [IR.92]) for their added benefits. | |||
| Non-smartphones typically have logic to wait for a time period after | Non-smartphones typically have logic to wait for a time period after | |||
| registering successfully on voice and data. | registering successfully on voice and data. | |||
| The smartphone aggressive registration is problematic in two ways: | The smartphone aggressive registration is problematic in two ways: | |||
| o first by generating excessive signaling load towards the HLR that | o first by generating excessive signaling load towards the HSS that | |||
| is ten times that from a non-smartphone, | is ten times that from a non-smartphone, | |||
| o and second by causing continual registration attempts when a | o and second by causing continual registration attempts when a | |||
| network failure affects registrations through the 3G data network. | network failure affects registrations through the 3G data network. | |||
| 5.2. 3GPP Study on Core Network Overload | 5.2. 3GPP Study on Core Network Overload | |||
| A study in 3GPP SA2 on core network overload has produced the | A study in 3GPP SA2 on core network overload has produced the | |||
| technical report [TR23.843]. This enumerates several causes of | technical report [TR23.843]. This enumerates several causes of | |||
| overload in mobile core networks including portions that are signaled | overload in mobile core networks including portions that are signaled | |||
| skipping to change at page 21, line 15 ¶ | skipping to change at page 21, line 15 ¶ | |||
| REQ 17: In a mixed environment with nodes that support the overload | REQ 17: In a mixed environment with nodes that support the overload | |||
| control mechanism and that do not, the mechanism MUST result | control mechanism and that do not, the mechanism MUST result | |||
| in at least as much useful throughput as would have resulted | in at least as much useful throughput as would have resulted | |||
| if the mechanism were not present. It SHOULD result in less | if the mechanism were not present. It SHOULD result in less | |||
| severe congestion in this environment. | severe congestion in this environment. | |||
| REQ 18: In a mixed environment of nodes that support the overload | REQ 18: In a mixed environment of nodes that support the overload | |||
| control mechanism and that do not, the mechanism MUST NOT | control mechanism and that do not, the mechanism MUST NOT | |||
| preclude elements that support overload control from | preclude elements that support overload control from | |||
| treating elements that do not support overload control in an | treating elements that do not support overload control in an | |||
| equitable fashion relative to those that do. users and | equitable fashion relative to those that do. Users and | |||
| operators of nodes that do not support the mechanism MUST | operators of nodes that do not support the mechanism MUST | |||
| NOT unfairly benefit from the mechanism. The mechanism | NOT unfairly benefit from the mechanism. The mechanism | |||
| specification SHOULD provide guidance to implementors for | specification SHOULD provide guidance to implementors for | |||
| dealing with elements not supporting overload control. | dealing with elements not supporting overload control. | |||
| REQ 19: It MUST be possible to use the mechanism between nodes in | REQ 19: It MUST be possible to use the mechanism between nodes in | |||
| different realms and in different administrative domains. | different realms and in different administrative domains. | |||
| REQ 20: Any explicit overload indication MUST be clearly | REQ 20: Any explicit overload indication MUST be clearly | |||
| distinguishable from other errors reported via Diameter. | distinguishable from other errors reported via Diameter. | |||
| REQ 21: In cases where a network node fails, is so overloaded that | REQ 21: In cases where a network node fails, is so overloaded that | |||
| it cannot process messages, or cannot communicate due to a | it cannot process messages, or cannot communicate due to a | |||
| network failure, it may not be able to provide explicit | network failure, it may not be able to provide explicit | |||
| indications of the nature of the failure or its levels of | indications of the nature of the failure or its levels of | |||
| congestion. The mechanism MUST result in at least as much | congestion. The mechanism MUST result in at least as much | |||
| useful throughput as would have resulted if the overload | useful throughput as would have resulted if the overload | |||
| control mechanism was not in place. | control mechanism was not in place. | |||
| REQ 22: The mechanism MUST provide a way for an node to throttle the | REQ 22: The mechanism MUST provide a way for a node to throttle the | |||
| amount of traffic it receives from an peer node. This | amount of traffic it receives from a peer node. This | |||
| throttling SHOULD be graded so that it can be applied | throttling SHOULD be graded so that it can be applied | |||
| gradually as offered load increases. Overload is not a | gradually as offered load increases. Overload is not a | |||
| binary state; there may be degrees of overload. | binary state; there may be degrees of overload. | |||
| REQ 23: The mechanism MUST provide sufficient information to enable | REQ 23: The mechanism MUST provide sufficient information to enable | |||
| a load balancing node to divert messages that are rejected | a load balancing node to divert messages that are rejected | |||
| or otherwise throttled by an overloaded upstream node to | or otherwise throttled by an overloaded upstream node to | |||
| other upstream nodes that are the most likely to have | other upstream nodes that are the most likely to have | |||
| sufficient capacity to process them. | sufficient capacity to process them. | |||
| skipping to change at page 26, line 18 ¶ | skipping to change at page 26, line 23 ¶ | |||
| [TR23.843] | [TR23.843] | |||
| 3GPP, "Study on Core Network Overload Solutions", | 3GPP, "Study on Core Network Overload Solutions", | |||
| TR 23.843 0.6.0, October 2012. | TR 23.843 0.6.0, October 2012. | |||
| [IR.34] GSMA, "Inter-Service Provider IP Backbone Guidelines", | [IR.34] GSMA, "Inter-Service Provider IP Backbone Guidelines", | |||
| IR 34 7.0, January 2012. | IR 34 7.0, January 2012. | |||
| [IR.88] GSMA, "LTE Roaming Guidelines", IR 88 7.0, January 2012. | [IR.88] GSMA, "LTE Roaming Guidelines", IR 88 7.0, January 2012. | |||
| | [IR.92] GSMA, "IMS Profile for Voice and SMS", IR 92 7.0, | |||
| | March 2013. | |||
| [TS23.002] | [TS23.002] | |||
| 3GPP, "Network Architecture", TS 23.002 12.0.0, | 3GPP, "Network Architecture", TS 23.002 12.0.0, | |||
| September 2012. | September 2012. | |||
| [TS29.272] | [TS29.272] | |||
| 3GPP, "Evolved Packet System (EPS); Mobility Management | 3GPP, "Evolved Packet System (EPS); Mobility Management | |||
| Entity (MME) and Serving GPRS Support Node (SGSN) related | Entity (MME) and Serving GPRS Support Node (SGSN) related | |||
| interfaces based on Diameter protocol", TS 29.272 11.4.0, | interfaces based on Diameter protocol", TS 29.272 11.4.0, | |||
| September 2012. | September 2012. | |||
| End of changes. 13 change blocks. | ||||
| 16 lines changed or deleted | 17 lines changed or added | |||