Network Working Group                                      E. M. McMurry
Internet-Draft                                            B. C. Campbell
Intended status: Standards Track                                 Tekelec
Expires: November 18, 2012                                  May 17, 2012


                 Diameter Overload Control Requirements
                  draft-mcmurry-dime-overload-reqs-00

Abstract

   When a Diameter server or agent becomes overloaded, it needs to be
   able to gracefully reduce its load, typically by informing clients to
   reduce sending traffic for some period of time. Otherwise, it must
   continue to expend resources parsing and responding to Diameter
   messages, possibly resulting in congestion collapse. The existing
   mechanisms provided by Diameter are not sufficient for this purpose.
   This document describes the limitations of the existing mechanisms,
   and provides requirements for new overload management mechanisms.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 18, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Causes of Overload
      1.2. Effects of Overload
      1.3. Documentation Conventions
   2. Overload Scenarios
      2.1. Peer to Peer Scenarios
      2.2. Agent Scenarios
   3. Existing Mechanisms
   4. Issues with the Current Mechanisms
      4.1. Problems with Implicit Mechanism
      4.2. Problems with Explicit Mechanisms
   5. 3GPP Study on Core Network Overload
   6. Solution Requirements
   7. IANA Considerations
   8. Security Considerations
      8.1. Access Control
      8.2. Denial-of-Service Attacks
      8.3. Replay Attacks
      8.4. Man-in-the-Middle Attacks
      8.5. Compromised Hosts
   9. References
      9.1. Normative References
      9.2. Informative References
   Appendix A. Contributors
   Appendix B. Acknowledgements
   Authors' Addresses

1. Introduction

   When a Diameter [I-D.ietf-dime-rfc3588bis] server or agent becomes
   overloaded, it needs to be able to gracefully reduce its load,
   typically by informing clients to reduce sending traffic for some
   period of time.
   Otherwise, it must continue to expend resources
   parsing and responding to Diameter messages, possibly resulting in
   congestion collapse. The existing mechanisms provided by Diameter
   are not sufficient for this purpose. This document describes the
   limitations of the existing mechanisms, and provides requirements for
   new overload management mechanisms.

   This document draws on [RFC5390] and the work done on SIP overload
   control as well as on overload practices in SS7 networks and studies
   done by 3GPP.

   Diameter is not typically an end-user protocol; rather it is
   generally used as one component in support of some end-user activity.
   For example, a WiFi access point might use Diameter to authenticate
   and authorize user access via 802.11. Overload in the Diameter
   network will likely spill over into the end-user application network.
   The impact of Diameter overload on the client application (a client
   application may use the Diameter protocol and other protocols to do
   its job) is beyond the scope of this document.

   This document presents non-normative descriptions of causes of
   overload along with related scenarios and studies. Finally, it
   offers a set of normative requirements for an improved overload
   indication mechanism.

1.1. Causes of Overload

   Overload occurs when an element, such as a Diameter server or agent,
   has insufficient resources to successfully process all of the traffic
   it is receiving. Resources include all of the capabilities of the
   element used to process a request, including CPU processing, memory,
   I/O, and disk resources. They can also include external resources
   such as a database or DNS server, in which case the CPU processing,
   memory, I/O, and disk resources of those servers are effectively part
   of the logical element processing the request.
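
   As a rough illustration of the "logical element" idea above, the
   sketch below treats the sustainable request rate of an element as
   the minimum of its own rate and the rates of the external resources
   it depends on. The function names, the requests-per-second units,
   and the assumption that every request touches every dependency are
   hypothetical simplifications, not part of this document.

```python
def logical_capacity(local_rps, dependency_rps):
    """Sustainable request rate of the logical element.

    Assumes (for illustration only) that every request uses every
    dependency, so the slowest resource bounds the whole element.
    """
    return min([local_rps, *dependency_rps.values()])


def is_overloaded(offered_rps, local_rps, dependency_rps):
    """True when offered traffic exceeds the logical capacity."""
    return offered_rps > logical_capacity(local_rps, dependency_rps)


# A server that could handle 10,000 requests/s on its own, but whose
# database currently sustains only 2,000 requests/s.
deps = {"database": 2000, "dns": 50000}
capacity = logical_capacity(10000, deps)
overloaded = is_overloaded(5000, 10000, deps)
```

   In this toy model the server is overloaded at 5,000 requests per
   second even though its own CPU, memory, and I/O are not exhausted.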

   Overload can occur for many reasons, including:

   Inadequate capacity: When designing Diameter networks, it can be
      very difficult to predict all scenarios that may cause elevated
      traffic. It may also be more costly to implement support for some
      scenarios than a network operator may deem worthwhile. This
      results in the likelihood that a Diameter network will not have
      adequate capacity to handle all situations.

   Dependency failures: A Diameter element can become overloaded
      because a resource on which it is dependent has failed or become
      overloaded, greatly reducing the logical capacity of the element.
      In these cases, even minimal traffic might cause the server to go
      into overload. Examples of such dependency overloads include DNS
      servers, databases, disks, and network interfaces.

   Component failures: A Diameter element can become overloaded when it
      is a member of a cluster of servers that each share the load of
      traffic, and one or more of the other members in the cluster fail.
      In this case, the remaining elements take over the work of the
      failed elements. Normally, capacity planning takes such failures
      into account, and servers are typically run with enough spare
      capacity to handle failure of another element. However, unusual
      failure conditions can cause many elements to fail at once. This
      is often the case with software failures, where a bad packet or
      bad database entry hits the same bug in a set of elements in a
      cluster.

   Network Initiated Traffic Flood: Issues with the radio access
      network in a mobile network, such as radio overlays with frequent
      handovers, and operational changes are examples of network events
      that can precipitate a flood of signaling traffic on a Diameter
      network, such as an avalanche restart. Failure of a Diameter
      proxy may also result in a large amount of signaling as
      connections and sessions are reestablished.

   Subscriber Initiated Traffic Flood: Large gatherings of subscribers
      or events that result in many subscribers interacting with the
      network in close time proximity can result in signaling traffic
      floods on Diameter networks. For example, the finale of a large
      fireworks show could be immediately followed by many subscribers
      posting messages, pictures, and videos concentrated on one portion
      of a network.

   DoS attacks: An attacker, wishing to disrupt service in the network,
      can cause a large amount of traffic to be launched at a target
      server. This can be done from a central source of traffic or
      through a distributed DoS attack. In all cases, the volume of
      traffic well exceeds the capacity of the server, sending the
      system into overload.

1.2. Effects of Overload

   Modern Diameter networks may operate at very large transaction
   volumes. If a Diameter node becomes overloaded, or even worse, fails
   completely, a large number of messages may be lost very quickly.
   Even with redundant servers, many messages can be lost in the time it
   takes for failover to complete. While a Diameter client or agent
   should be able to retry such requests, an overloaded peer may cause a
   sudden large increase in the number of transactions needing to be
   retried, rapidly filling local queues or otherwise contributing to
   local overload. Therefore Diameter devices need to be able to shed
   load before critical failures can occur.

      Diameter depends heavily on the "Authentication, Authorization,
      and Accounting (AAA) Transport Profile" [RFC3539], which states
      assumptions about the scale of AAA services which may be incorrect
      for current uses of Diameter. In particular, the document
      suggests that AAA services will typically be low volume and that
      traffic will typically be application-driven. Section 2.1 of that
      document uses an example of a 48-port NAS.
      However, Diameter is
      commonly used in large-scale mobile data environments, where a
      typical client could be a packet gateway that serves millions of
      users, and generates Diameter messages at network-driven rates.

1.3. Documentation Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   The terms "client", "server", "agent", "node", "peer", "upstream",
   and "downstream" are used as defined in [I-D.ietf-dime-rfc3588bis].

2. Overload Scenarios

   Several Diameter deployment scenarios exist that may impact overload
   management. The following scenarios help motivate the requirements
   for an overload management mechanism.

   These scenarios are by no means exhaustive, and are in general
   simplified for the sake of clarity. In particular, the authors
   assume for the sake of clarity that the client sends Diameter
   requests to the server, and the server sends responses to the
   client, even though Diameter supports bidirectional applications.
   Each direction in such an application can be modeled separately.

   In a large scale deployment, many of the nodes represented in these
   scenarios would be deployed as clusters of servers. The authors
   assume that such a cluster is responsible for managing its own
   internal load balancing and overload management so that it appears as
   a single Diameter node. That is, other Diameter nodes can treat it
   as a single, monolithic node for the purposes of overload
   management.

   These scenarios do not illustrate the client application. As
   mentioned in Section 1, Diameter is not typically an end-user
   protocol; rather it is generally used in support of some other client
   application. These scenarios do not consider the impact of Diameter
   overload on the client application.

2.1. Peer to Peer Scenarios

   This section describes Diameter peer-to-peer scenarios. That is,
   scenarios where a Diameter client talks directly with a Diameter
   server, without the use of a Diameter agent.

   Figure 1 illustrates the simplest possible Diameter relationship.
   The client and server share a one-to-one peer-to-peer relationship.
   If the server becomes overloaded, either because the client exceeds
   the server's capacity, or because the server's capacity is reduced
   due to some resource dependency, the client needs to reduce the
   amount of Diameter traffic it sends to the server. Since the client
   cannot forward requests to another server, it must either queue
   requests until the server recovers, or itself become overloaded in
   the context of the client application and other protocols it may also
   use.

                          +------------------+
                          |                  |
                          |                  |
                          |      Server      |
                          |                  |
                          +--------+---------+
                                   |
                                   |
                          +--------+---------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

                 Figure 1: Basic Peer to Peer Scenario

   Figure 2 shows a similar scenario, except in this case the client has
   multiple servers that can handle work for a specific realm and
   application. If server 1 becomes overloaded, the client can forward
   traffic to server 2. Assuming server 2 has sufficient reserve
   capacity to handle the forwarded traffic, the client should be able
   to continue serving client application protocol users. If server 1
   is approaching overload, but can still handle some number of new
   requests, it needs to be able to instruct the client to forward a
   subset of its traffic to server 2.

             +------------------+     +------------------+
             |                  |     |                  |
             |                  |     |                  |
             |     Server 1     |     |     Server 2     |
             |                  |     |                  |
             +--------+-`.------+     +------.'+---------+
                          `.               .'
                            `.           .'
                              `.       .'
                                `.   .'
                          +-------`.'--------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

            Figure 2: Multiple Server Peer to Peer Scenario

   Figure 3 illustrates a peer-to-peer scenario with multiple Diameter
   realm and application combinations. In this example, server 2 can
   handle work for both applications. Each application might have
   different resource dependencies. For example, a server might need to
   access one database for application A, and another for application B.
   This creates a possibility that server 2 could become overloaded for
   application A but not for application B, in which case the client
   would need to divert some part of its application A requests to
   server 1, but should not divert any application B requests. This
   requires server 2 to be able to distinguish between applications when
   it indicates an overload condition to the client.

   On the other hand, it's possible that the servers host many
   applications. If server 2 becomes overloaded for all applications,
   it would be undesirable for it to have to notify the client
   separately for each application. Therefore it also needs a way to
   indicate that it is overloaded for all possible applications.
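
   The two needs described above, a per-application indication and an
   "all applications" indication, can be sketched from the client's
   point of view as follows. This is a minimal illustration only: this
   document defines no overload indication format, and the class name,
   method names, and the use of None to mean "all applications" are
   hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical marker: an indication that names no application is
# taken to cover every application the server hosts.
ALL_APPLICATIONS = None


@dataclass
class ServerOverloadState:
    """What a client has learned about one server's overload scope."""
    overloaded_apps: Set[str] = field(default_factory=set)
    all_overloaded: bool = False

    def report_overload(self, application: Optional[str]) -> None:
        """Record an overload indication received from the server."""
        if application is ALL_APPLICATIONS:
            self.all_overloaded = True
        else:
            self.overloaded_apps.add(application)

    def is_overloaded_for(self, application: str) -> bool:
        """True if requests for this application should be diverted."""
        return self.all_overloaded or application in self.overloaded_apps


# Server 2 reports overload for application A only.
server_2 = ServerOverloadState()
server_2.report_overload("A")
divert_a = server_2.is_overloaded_for("A")  # divert some A requests
divert_b = server_2.is_overloaded_for("B")  # keep sending B requests
```

   A single call to report_overload(ALL_APPLICATIONS) then marks the
   server as overloaded for every application, which avoids one
   notification per application.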

 +----------------------------------------------+
 |    Application A    +------------------------+----------------------+
 |+------------------+ |  +------------------+  |  +------------------+|
 ||                  | |  |                  |  |  |                  ||
 ||                  | |  |                  |  |  |                  ||
 ||     Server 1     | |  |     Server 2     |  |  |     Server 3     ||
 ||                  | |  |                  |  |  |                  ||
 |+--------+---------+ |  +--------+---------+  |  +-+----------------+|
 |         |           |           |            |    |                 |
 +---------+-----------+-----------+------------+    |                 |
           |           |           |                 |                 |
           |           |           |                 |  Application B  |
           |           +-----------+-----------------+-----------------+
           ``-.._                  |                 |
                 `-..__            |             _.-''
                       `--._       |        _.-''
                            ``-.__ |   _.-''
                          +------`-.-''------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

          Figure 3: Multiple Application Peer to Peer Scenario

2.2. Agent Scenarios

   This section describes scenarios that include a Diameter agent,
   either in the form of a Diameter relay or Diameter proxy. These
   scenarios do not consider Diameter redirect agents, since they are
   more readily modeled as end-servers.

   Figure 4 illustrates a simple Diameter agent scenario with a single
   client, agent, and server. In this case, overload can occur at the
   server, at the agent, or both. But in most cases, client behavior is
   the same whether overload occurs at the server or at the agent. From
   the client's perspective, server overload and agent overload are the
   same thing.

                          +------------------+
                          |                  |
                          |                  |
                          |      Server      |
                          |                  |
                          +--------+---------+
                                   |
                                   |
                          +--------+---------+
                          |                  |
                          |                  |
                          |      Agent       |
                          |                  |
                          +--------+---------+
                                   |
                                   |
                          +--------+---------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

                     Figure 4: Basic Agent Scenario

   Figure 5 shows an agent scenario with multiple servers. If server 1
   becomes overloaded, but server 2 has sufficient reserve capacity, the
   agent may be able to transparently divert some or all Diameter
   requests originally bound for server 1 to server 2.

   In most cases, the client does not have detailed knowledge of the
   Diameter topology upstream of the agent. If the agent uses dynamic
   discovery to find eligible servers, the set of eligible servers may
   not be enumerable from the perspective of the client. Therefore, in
   most cases the agent needs to deal with any upstream overload issues
   in a way that is transparent to the client. If one server notifies
   the agent that it has become overloaded, the notification should not
   be passed back to the client in a way where the client could
   mistakenly perceive the agent itself as being overloaded. If the set
   of all possible destinations upstream of the agent no longer has
   sufficient capacity for incoming load, the agent itself becomes
   effectively overloaded.

   On the other hand, there are cases where the client needs to be able
   to select a particular server from behind an agent. For example, if
   a Diameter request is part of a multiple-round-trip authentication,
   or is otherwise part of a Diameter "session", it may have a
   Destination-Host AVP that requires the request to be served by
   server 1. Therefore the agent may need to inform a client that a
   particular upstream server is overloaded or otherwise unavailable.

             +------------------+     +------------------+
             |                  |     |                  |
             |                  |     |                  |
             |     Server 1     |     |     Server 2     |
             |                  |     |                  |
             +--------+-`.------+     +------.'+---------+
                          `.               .'
                            `.           .'
                              `.       .'
                                `.   .'
                          +-------`.'--------+
                          |                  |
                          |                  |
                          |      Agent       |
                          |                  |
                          +--------+---------+
                                   |
                                   |
                                   |
                          +--------+---------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

                Figure 5: Multiple Server Agent Scenario

   Figure 6 shows a scenario where an agent routes requests to a set of
   servers for more than one Diameter realm and application. In this
   scenario, if server 1 becomes overloaded or unavailable, the agent
   may effectively operate at reduced capacity for application A, but at
   full capacity for application B. Therefore, the agent needs to be
   able to report that it is overloaded for one application, but not for
   another.

 +----------------------------------------------+
 |    Application A    +------------------------+----------------------+
 |+------------------+ |  +------------------+  |  +------------------+|
 ||                  | |  |                  |  |  |                  ||
 ||                  | |  |                  |  |  |                  ||
 ||     Server 1     | |  |     Server 2     |  |  |     Server 3     ||
 ||                  | |  |                  |  |  |                  ||
 |+---------+--------+ |  +--------+---------+  |  +--+---------------+|
 |          |          |           |            |     |                |
 +----------+----------+-----------+------------+     |                |
            |          |           |                  |                |
            |          |           |                  |  Application B |
            |          +-----------+------------------+----------------+
            |                      |                  |
            ``--.__                |                 _.
                   ``-.__          |          __.--''
                         `--.._    |    _..--'
                          +-----``-+.-''-----+
                          |                  |
                          |                  |
                          |      Agent       |
                          |                  |
                          +--------+---------+
                                   |
                                   |
                          +--------+---------+
                          |                  |
                          |                  |
                          |      Client      |
                          |                  |
                          +------------------+

             Figure 6: Multiple Application Agent Scenario

3. Existing Mechanisms

   Diameter requires the use of a congestion-managed transport layer,
   currently TCP or SCTP, to mitigate network congestion. But even with
   a congestion-managed transport, a Diameter node can become overloaded
   at the protocol layer due to the causes described in Section 1.1.

   Diameter offers both implicit and explicit mechanisms for a Diameter
   node to learn that a peer is overloaded or unreachable.
   The implicit
   mechanism is simply the lack of responses to requests. If a client
   fails to receive a response in a certain time period, it assumes the
   upstream peer is unavailable, or overloaded to the point of effective
   unavailability. The watchdog mechanism [RFC3539] ensures that a
   certain rate of transaction responses occurs even when there is
   otherwise little or no other Diameter traffic.

   The explicit mechanism involves specific protocol error responses,
   where an agent or server can tell a downstream peer that it is either
   too busy to handle a request (DIAMETER_TOO_BUSY) or unable to route a
   request to an upstream destination (DIAMETER_UNABLE_TO_DELIVER),
   perhaps because that destination itself is overloaded to the point of
   unavailability.

   Once a Diameter node learns that an upstream peer has become
   overloaded via one of these mechanisms, it can then attempt to take
   action to reduce the load. This usually means forwarding traffic to
   an alternate destination, if available. If no alternate destination
   is available, the node must either reduce the number of messages it
   originates (in the case of a client) or inform the client to reduce
   traffic (in the case of an agent).

4. Issues with the Current Mechanisms

   The currently available Diameter mechanisms for indicating an
   overload condition are not adequate to avoid congestion collapse. In
   particular, they do not allow a Diameter agent or server to shed load
   as it approaches overload. At best, a node can only indicate that it
   needs to entirely stop receiving requests, i.e., that it has
   effectively failed. Diameter offers no mechanism to allow a node to
   indicate different overload states for different categories of
   messages, for example, if it is overloaded for one Diameter
   application but not another.

4.1. Problems with Implicit Mechanism

   The implicit mechanism doesn't allow an agent or server to inform the
   client of a problem until it is effectively too late to do anything
   about it. The client does not know to take action until the upstream
   node has effectively failed. A Diameter node has no opportunity to
   shed load early to avoid collapse in the first place.

   Additionally, the implicit mechanism cannot distinguish between
   overload of a Diameter node and network congestion. Diameter treats
   the failure to receive an answer as a transport failure.

4.2. Problems with Explicit Mechanisms

   The Diameter specification is ambiguous on how a client should handle
   receipt of a DIAMETER_TOO_BUSY response. The base specification
   [I-D.ietf-dime-rfc3588bis] indicates that the sending client should
   attempt to send the request to a different peer. It makes no
   suggestion that the receipt of a DIAMETER_TOO_BUSY response should
   affect future Diameter messages in any way.

   The Authentication, Authorization, and Accounting (AAA) Transport
   Profile [RFC3539] recommends that a AAA node that receives a "Busy"
   response fail over all remaining requests to a different agent or
   server. But while the Diameter base specification explicitly depends
   on RFC3539 to define transport behavior, it does not refer to RFC3539
   in the description of behavior on receipt of DIAMETER_TOO_BUSY.
   There's a strong likelihood that at least some implementations will
   continue to send Diameter requests to an upstream peer even after
   receiving a DIAMETER_TOO_BUSY error.

   BCP 41 [RFC2914] describes, among other things, how end-to-end
   application behavior can help avoid congestion collapse. In
   particular, an application should avoid sending messages that will
   never be delivered or processed.
   The DIAMETER_TOO_BUSY behavior as
   described in the Diameter base specification fails at this, since if
   an upstream node becomes overloaded, a client attempts each request,
   and does not discover the need to fail over the request until the
   initial attempt fails.

   The situation is improved if implementations follow the [RFC3539]
   recommendation and keep state about upstream peer overload. But even
   then, the Diameter specification offers no guidance on how long a
   client should wait before retrying the overloaded destination. If an
   agent or server supports multiple realms and/or applications,
   DIAMETER_TOO_BUSY offers no way to indicate that it is overloaded
   for one application but not another. A DIAMETER_TOO_BUSY error can
   only indicate overload at a "whole server" scope.

   Agent processing of a DIAMETER_TOO_BUSY response is also problematic
   as described in the base specification. DIAMETER_TOO_BUSY is defined
   as a protocol error. If an agent receives a protocol error, it may
   either handle it locally or it may forward the response back towards
   the downstream peer. (The Diameter specification is inconsistent
   about whether a protocol error MAY or SHOULD be handled by an agent,
   rather than forwarded downstream.) If a downstream peer receives the
   DIAMETER_TOO_BUSY response, it may stop sending all requests to the
   agent for some period of time, even though the agent may still be
   able to deliver requests to other upstream peers.

5. 3GPP Study on Core Network Overload

   A study in 3GPP SA2 on core network overload has produced the
   technical report [TR23.843]. This enumerates several causes of
   overload in mobile core networks including portions that are signaled
   using Diameter.

   It is common for mobile networks to employ more than one radio
   technology and to do so in an overlay fashion with multiple
   technologies present in the same location (such as GSM or CDMA along
   with LTE). This presents opportunities for traffic storms when
   issues occur on one overlay and not another as all devices that had
   been on the overlay with issues switch. This causes a large amount
   of Diameter traffic as locations and policies are updated.

   Another scenario called out by this study is a flood of registration
   and mobility management events caused by some element in the core
   network failing. This flood of traffic from end elements falls under
   the network initiated traffic flood category. There is likely to
   also be traffic resulting directly from the component failure in this
   case.

   Subscriber initiated traffic floods are also indicated in this study
   as an overload mechanism where a large number of mobile devices
   attempt to access services at the same time, such as in response to
   an entertainment event or a catastrophic event.

   While this study is concerned with the broader effects of these
   scenarios on wireless networks and their elements, they have
   implications specifically for Diameter signaling. One of the goals
   of this document is to provide guidance for a core mechanism that can
   be used to mitigate the scenarios called out by this study.

6. Solution Requirements

   This section proposes requirements for an improved mechanism to
   control Diameter overload, with the goals of improving the issues
   described in Section 4 and supporting the scenarios described in
   Section 2.

   REQ 1:  The overload mechanism MUST provide a communication method
           for Diameter nodes to exchange overload information.

   REQ 2:  The overload mechanism MUST be usable with any existing or
           future Diameter application.
           It MUST NOT require
           specification changes for existing Diameter applications.
           This may be achieved using a mechanism in the Diameter base
           protocol that all applications could make use of.

   REQ 3:  The overload mechanism MUST limit the impact of overload on
           the overall useful throughput of a Diameter server, even
           when the incoming load on the network is far in excess of
           its capacity. The overall useful throughput under load is
           the ultimate measure of the value of an overload control
           mechanism.

   REQ 4:  Diameter allows requests to be sent from either side of a
           connection, and either side of a connection may need to
           provide its overload status. The mechanism MUST allow each
           side of a connection to independently inform the other of
           its overload status.

   REQ 5:  Diameter allows elements to determine their peers via
           dynamic discovery or manual configuration. The mechanism
           MUST work consistently without regard to how peers are
           determined.

   REQ 6:  The mechanism designers SHOULD seek to minimize the amount
           of new configuration required for the mechanism to work.
           For example, it is better to allow peers to advertise or
           negotiate support for the mechanism, rather than to require
           this knowledge to be configured at each node.

   REQ 7:  The overload mechanism MUST ensure that the system remains
           stable. When the offered load drops from above the overall
           capacity of the network to below the overall capacity, the
           throughput MUST stabilize and become equal to the offered
           load.

   REQ 8:  The mechanism MUST allow nodes to shed load without
           introducing oscillations. Note that this requirement
           implies a need for supporting nodes to be able to
           distinguish current overload information from stale
           information, and to make decisions using the most current
           information available.

   REQ 9:  The mechanism MUST function across fully loaded as well as
           quiescent transport connections. This is partially derived
           from the requirements for stability and hysteresis control
           above.

   REQ 10: Consumers of overload state indications MUST be able to
           determine when the overload condition improves or ends.

   REQ 11: The overload mechanism MUST be scalable. That is, it MUST
           be able to operate in different sized networks.

   REQ 12: When a single network element fails, goes into overload, or
           suffers from reduced processing capacity, the mechanism MUST
           make it possible to limit the impact of this on other
           elements in the network. This helps to prevent a small-
           scale failure from becoming a widespread outage.

   REQ 13: The mechanism MUST NOT introduce substantial additional work
           for a node in an overloaded state. For example, a
           requirement for an overloaded node to send overload
           information every time it received a new request would
           introduce substantial work. Existing messaging is likely to
           have the characteristic of increasing as an overload
           condition approaches, allowing for the possibility of
           increased feedback for information piggybacked on it.

   REQ 14: Some scenarios that result in overload involve a rapid
           increase of traffic with little time between normal levels
           and overload inducing levels. The mechanism SHOULD provide
           for increased feedback when traffic levels increase. The
           mechanism MUST NOT do this in such a way that it increases
           the number of messages while at high loads.

   REQ 15: The mechanism MUST NOT interfere with the congestion control
           mechanisms of underlying transport protocols.

   REQ 16: The mechanism MUST operate without malfunction in an
           environment with a mix of elements that do, and elements
           that do not, support the mechanism.

   REQ 17: In a mixed environment with elements that support the
           overload control mechanism and elements that do not, the
           mechanism MUST NOT result in less useful throughput than
           would have resulted if it were not present. It SHOULD result
           in less severe congestion in this environment.

   REQ 18: In a mixed environment of elements that support the overload
           control mechanism and elements that do not, users and
           operators of elements that do not support the mechanism MUST
           NOT benefit from the mechanism more than users and operators
           of elements that support the mechanism.

   REQ 19: It MUST be possible to use the mechanism between nodes in
           different realms and in different administrative domains.

   REQ 20: Any explicit overload indication MUST distinguish actual
           overload from other, non-overload-related failures.

   REQ 21: In cases where a network element fails, is so overloaded
           that it cannot process messages, or cannot communicate due
           to a network failure, it may not be able to provide explicit
           indications of the nature of the failure or its levels of
           congestion. The mechanism MUST properly function in these
           cases.

   REQ 22: The mechanism MUST provide a way for an element to throttle
           the amount of traffic it receives from a peer element.
           This throttling SHOULD be graded so that it can be applied
           gradually as offered load increases. Overload is not a
           binary state; there may be degrees of overload.

   REQ 23: The mechanism MUST enable a supporting node to minimize the
           chance that retries due to an overloaded or failed element
           result in additional traffic to other overloaded elements,
           or cause additional elements to become overloaded.
           Moreover, the mechanism SHOULD provide unambiguous
           directions to clients on when they should retry a request
           and when they should not, considering the various causes of
           overload such as avalanche restart.
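
   The graded throttling called for in REQ 22 can be illustrated with
   a small sketch. It assumes, purely for illustration, that the
   overloaded element asks its peer to reduce traffic by a whole-number
   percentage; the class GradedThrottle and the parameter
   reduction_pct are hypothetical and do not describe any specified
   mechanism.

```python
class GradedThrottle:
    """Admits a share of requests that shrinks as the reduction grows."""

    def __init__(self) -> None:
        self.reduction_pct = 0  # hypothetical: 0 = no throttling, 100 = all
        self._credit = 0        # accumulated admit credit, in percent units

    def set_reduction(self, reduction_pct: int) -> None:
        """Record the reduction most recently requested by the peer."""
        if not 0 <= reduction_pct <= 100:
            raise ValueError("reduction_pct must be between 0 and 100")
        self.reduction_pct = reduction_pct

    def admit(self) -> bool:
        """Return True if the next request should be sent to the peer."""
        # Each request earns (100 - reduction) credit; one full unit of
        # 100 credit pays for one admitted request.
        self._credit += 100 - self.reduction_pct
        if self._credit >= 100:
            self._credit -= 100
            return True
        return False
```

   Because the reduction is a value between 0 and 100 rather than an
   on/off flag, the sender can back off in steps as the reported
   overload deepens and ramp up again as it eases. Integer credit is
   used here so that the admitted share is exact and evenly spread
   rather than random.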

   REQ 24: The mechanism MUST provide sufficient information to enable
           a load balancing node to divert messages that are rejected
           or otherwise throttled by an overloaded upstream element to
           other upstream elements that are the most likely to have
           sufficient capacity to process them.

   REQ 25: The mechanism MUST provide a means of indicating load
           levels even when not in an overloaded condition, to assist
           elements making decisions to prevent overload conditions
           from occurring.

   REQ 26: The specification for the overload mechanism SHOULD offer
           guidance on which message types might be desirable to
           process over others during times of overload, based on
           Diameter-specific considerations. For example, it may be
           more beneficial to process messages for existing sessions
           ahead of new sessions.

   REQ 27: The mechanism MUST NOT prevent a node from prioritizing
           requests based on any local policy, so that certain requests
           are given preferential treatment, given additional
           retransmission, or processed ahead of others.

   REQ 28: The overload mechanism MUST NOT introduce new
           vulnerabilities to malicious attack, or increase the
           severity of any existing vulnerabilities. This includes
           vulnerabilities to DoS and DDoS attacks as well as replay
           and man-in-the-middle attacks.

   REQ 29: The mechanism MUST provide a means to match an overload
           indication with the node that originated it. In particular,
           the mechanism MUST allow a node to distinguish overload at a
           next-hop peer from overload at a node upstream of the peer.
           For example, in Figure 5, the client must not mistake
           overload at server 1 for overload at the agent, whether or
           not the agent supports the mechanism (see REQ 4).

   REQ 30: The mechanism MUST NOT depend on being deployed in
           environments where all Diameter nodes are completely
           trusted.
           It SHOULD operate as effectively as possible in
           environments where other elements are malicious; this
           includes preventing malicious elements from obtaining more
           than a fair share of service. Note that this does not imply
           any responsibility on the mechanism to detect, or take
           countermeasures against, malicious elements.

   REQ 31: It MUST be possible for a supporting node to make
           authorization decisions about what information will be sent
           to peer elements based on the identity of those elements.
           This allows a domain administrator who considers the load of
           their elements to be sensitive information to restrict
           access to that information. Of course, in such cases, there
           is no expectation that the overload mechanism itself will
           help prevent overload from that peer element.

   REQ 32: The mechanism MUST NOT interfere with any Diameter-compliant
           method that a node may use to protect itself from overload
           from non-supporting nodes, or from denial-of-service
           attacks.

   REQ 33: There are multiple situations where a Diameter node may be
           overloaded for some purposes but not others. For example,
           this can happen to an agent or server that supports multiple
           applications, or when a server depends on multiple external
           resources, some of which may become overloaded while others
           are fully available. The mechanism MUST allow Diameter
           nodes to indicate overload with sufficient granularity to
           allow clients to take action based on the overloaded
           resources without forcing available capacity to go unused.
           The mechanism MUST support specification of overload
           information with granularities of at least "Diameter node",
           "realm", "Diameter application", and "Diameter session", and
           SHOULD allow extensibility for others to be added in the
           future.
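
   The granularities listed in REQ 33 can be pictured as a scope that
   an overload indication applies to. The sketch below is one possible
   representation, not a proposed encoding: the classes Request and
   OverloadScope, their field names, and the use of None to mean "any
   value" are all assumptions made for illustration, and the
   application labels are placeholders rather than real Diameter
   Application-Ids.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    node: str         # destination Diameter node
    realm: str        # destination realm
    application: str  # placeholder label, not a real Application-Id
    session: str      # session identifier


@dataclass(frozen=True)
class OverloadScope:
    # Hypothetical convention: None means "matches any value".
    node: Optional[str] = None
    realm: Optional[str] = None
    application: Optional[str] = None
    session: Optional[str] = None

    def covers(self, request: Request) -> bool:
        """Return True if the request falls inside this overload scope."""
        pairs = (
            (self.node, request.node),
            (self.realm, request.realm),
            (self.application, request.application),
            (self.session, request.session),
        )
        return all(want is None or want == have for want, have in pairs)
```

   With a scope such as OverloadScope(node="server1.example.com",
   application="app-a"), a client would throttle only requests for
   that application at that node, leaving the node's remaining
   capacity for other applications in use.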

   REQ 34: The mechanism MUST provide a method for extending the
           information communicated and the algorithms used for
           overload control.

7.  IANA Considerations

   This document makes no requests of IANA.

8.  Security Considerations

   A Diameter overload control mechanism is primarily concerned with the
   load- and overload-related behavior of elements in a Diameter
   network, and the information used to affect that behavior. Load and
   overload information is shared between elements and directly affects
   their behavior, and thus is potentially vulnerable to a number of
   methods of attack.

   Load and overload information may also be sensitive from both
   business and network protection viewpoints. Operators of Diameter
   equipment want to control the visibility of load and overload
   information to keep it from being used for competitive intelligence
   or for targeting attacks. It is also important that the Diameter
   overload control mechanism not introduce any way in which any other
   information carried by Diameter is sent inappropriately.

   This document includes requirements intended to mitigate the effects
   of attacks and to protect the information used by the mechanism.

8.1.  Access Control

   To control the visibility of load and overload information, sending
   should be subject to some form of authentication and authorization of
   the receiver. It is also important to the receivers that they are
   confident the load and overload information they receive is from a
   legitimate source. Note that this implies a certain amount of
   configurability on the elements supporting the Diameter overload
   control mechanism.

8.2.  Denial-of-Service Attacks

   An overload control mechanism provides a very attractive target for
   denial-of-service attacks. A small number of messages may cause a
   large service disruption by falsely reporting overload conditions.
   Alternatively, an attack on servers nearing, or in, overload may be
   facilitated by disrupting their overload indications, potentially
   preventing them from mitigating their overload condition.

   A design goal for the Diameter overload control mechanism is to
   minimize or eliminate the possibility of using the mechanism for this
   type of attack.

   As the intent of some denial-of-service attacks is to induce overload
   conditions, an effective overload control mechanism should help to
   mitigate the effects of such an attack.

8.3.  Replay Attacks

   An attacker that has managed to obtain some messages from the
   overload control mechanism may attempt to affect the behavior of
   elements supporting the mechanism by sending those messages at
   potentially inopportune times. In addition to time shifting, replay
   attacks may send messages to other nodes as well (target shifting).

   A design goal for the Diameter overload control mechanism is to
   minimize or eliminate the possibility of causing disruption by using
   a replay attack on the Diameter overload control mechanism.

8.4.  Man-in-the-Middle Attacks

   By inserting themselves in between two elements supporting the
   Diameter overload control mechanism, an attacker may potentially both
   access and alter the information sent between those elements. This
   can be used for information gathering for business intelligence and
   attack targeting, as well as direct attacks.

   A design goal for the Diameter overload control mechanism is to
   minimize or eliminate the possibility of causing disruption through
   man-in-the-middle attacks on the Diameter overload control mechanism.
   A transport using TLS and/or IPsec may be desirable for this.

8.5.  Compromised Hosts

   A compromised host that supports the Diameter overload control
   mechanism could be used for information gathering as well as for
   sending malicious information to any Diameter element that would
   normally accept information from it. While it is beyond the scope of
   the Diameter overload control mechanism to mitigate any operational
   interruption to the compromised host, a reasonable design goal is to
   minimize the impact that a compromised host can have on other
   elements through the use of the Diameter overload control mechanism.
   Of course, a compromised host could be used to cause damage in a
   number of other ways. This is out of scope for a Diameter overload
   control mechanism.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [I-D.ietf-dime-rfc3588bis]
              Fajardo, V., Arkko, J., Loughney, J., and G. Zorn,
              "Diameter Base Protocol", draft-ietf-dime-rfc3588bis-33
              (work in progress), May 2012.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, September 2000.

   [RFC3539]  Aboba, B. and J. Wood, "Authentication, Authorization and
              Accounting (AAA) Transport Profile", RFC 3539, June 2003.

9.2.  Informative References

   [RFC5390]  Rosenberg, J., "Requirements for Management of Overload in
              the Session Initiation Protocol", RFC 5390, December 2008.

   [TR23.843]
              3GPP, "Study on Core Network Overload Solutions",
              TR 23.843 0.4.0, April 2011.

Appendix A.  Contributors

   Significant contributions to this document were made by Adam Roach
   and Eric Noel.

Appendix B.  Acknowledgements

   Review of, and contributions to, this specification by Martin Dolly,
   Carolyn Johnson, Jianrong Wang, Imtiaz Shaikh, and Robert Sparks were
   most appreciated. We would like to thank them for their time and
   expertise.

Authors' Addresses

   Eric McMurry
   Tekelec
   17210 Campbell Rd.
   Suite 250
   Dallas, TX 75252
   US

   Email: emcmurry@estacado.net


   Ben Campbell
   Tekelec
   17210 Campbell Rd.
   Suite 250
   Dallas, TX 75252
   US

   Email: ben@nostrum.com