2	Network Working Group                                      E. M. McMurry
3	Internet-Draft                                            B. C. Campbell
4	Intended status: Standards Track                                 Tekelec
5	Expires: November 18, 2012                                  May 17, 2012

7	                 Diameter Overload Control Requirements
8	                  draft-mcmurry-dime-overload-reqs-00

10	Abstract

12	   When a Diameter server or agent becomes overloaded, it needs to be
13	   able to gracefully reduce its load, typically by informing clients to
14	   reduce sending traffic for some period of time.  Otherwise, it must
15	   continue to expend resources parsing and responding to Diameter
16	   messages, possibly resulting in congestion collapse.  The existing
17	   mechanisms provided by Diameter are not sufficient for this purpose.
18	   This document describes the limitations of the existing mechanisms,
19	   and provides requirements for new overload management mechanisms.

21	Status of this Memo

23	   This Internet-Draft is submitted in full conformance with the
24	   provisions of BCP 78 and BCP 79.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF).  Note that other groups may also distribute
28	   working documents as Internet-Drafts.  The list of current Internet-
29	   Drafts is at http://datatracker.ietf.org/drafts/current/.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   This Internet-Draft will expire on November 18, 2012.

38	Copyright Notice

40	   Copyright (c) 2012 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents
45	   (http://trustee.ietf.org/license-info) in effect on the date of
46	   publication of this document.  Please review these documents
47	   carefully, as they describe your rights and restrictions with respect
48	   to this document.  Code Components extracted from this document must
49	   include Simplified BSD License text as described in Section 4.e of
50	   the Trust Legal Provisions and are provided without warranty as
51	   described in the Simplified BSD License.

53	Table of Contents

55	   1.  Introduction
56	     1.1.  Causes of Overload
57	     1.2.  Effects of Overload
58	     1.3.  Documentation Conventions
59	   2.  Overload Scenarios
60	     2.1.  Peer to Peer Scenarios
61	     2.2.  Agent Scenarios
62	   3.  Existing Mechanisms
63	   4.  Issues with the Current Mechanisms
64	     4.1.  Problems with Implicit Mechanism
65	     4.2.  Problems with Explicit Mechanisms
66	   5.  3GPP Study on Core Network Overload
67	   6.  Solution Requirements
68	   7.  IANA Considerations
69	   8.  Security Considerations
70	     8.1.  Access Control
71	     8.2.  Denial-of-Service Attacks
72	     8.3.  Replay Attacks
73	     8.4.  Man-in-the-Middle Attacks
74	     8.5.  Compromised Hosts
75	   9.  References
76	     9.1.  Normative References
77	     9.2.  Informative References
78	   Appendix A.  Contributors
79	   Appendix B.  Acknowledgements
80	   Authors' Addresses

82	1.  Introduction

84	   When a Diameter [I-D.ietf-dime-rfc3588bis] server or agent becomes
85	   overloaded, it needs to be able to gracefully reduce its load,
86	   typically by informing clients to reduce sending traffic for some
87	   period of time.  Otherwise, it must continue to expend resources
88	   parsing and responding to Diameter messages, possibly resulting in
89	   congestion collapse.  The existing mechanisms provided by Diameter
90	   are not sufficient for this purpose.  This document describes the
91	   limitations of the existing mechanisms, and provides requirements for
92	   new overload management mechanisms.

94	   This document draws on [RFC5390] and the work done on SIP overload
95	   control as well as on overload practices in SS7 networks and studies
96	   done by 3GPP.

98	   Diameter is not typically an end-user protocol; rather it is
99	   generally used as one component in support of some end-user activity.
100	   For example, a WiFi access point might use Diameter to authenticate
101	   and authorize user access via 802.11.  Overload in the Diameter
102	   network will likely spill over into the end-user application network.
103	   The impact of Diameter overload on the client application (a client
104	   application may use the Diameter protocol and other protocols to do
105	   its job) is beyond the scope of this document.

107	   This document presents non-normative descriptions of causes of
108	   overload along with related scenarios and studies.  Finally, it
109	   offers a set of normative requirements for an improved overload
110	   indication mechanism.

112	1.1.  Causes of Overload

114	   Overload occurs when an element, such as a Diameter server or agent,
115	   has insufficient resources to successfully process all of the traffic
116	   it is receiving.  Resources include all of the capabilities of the
117	   element used to process a request, including CPU processing, memory,
118	   I/O, and disk resources.  It can also include external resources such
119	   as a database or DNS server, in which case the CPU processing,
120	   memory, I/O, and disk resources of those servers are effectively part
121	   of the logical element processing the request.

123	   Overload can occur for many reasons, including:

125	   Inadequate capacity:  When designing Diameter networks, it can be
126	      very difficult to predict all scenarios that may cause elevated
127	      traffic.  It may also be more costly to implement support for some
128	      scenarios than a network operator may deem worthwhile.  This
129	      results in the likelihood that a Diameter network will not have
130	      adequate capacity to handle all situations.

132	   Dependency failures:  A Diameter element can become overloaded
133	      because a resource on which it is dependent has failed or become
134	      overloaded, greatly reducing the logical capacity of the element.
135	      In these cases, even minimal traffic might cause the server to go
136	      into overload.  Examples of such dependencies include DNS
137	      servers, databases, disks, and network interfaces.

139	   Component failures:  A Diameter element can become overloaded when it
140	      is a member of a cluster of servers that each share the load of
141	      traffic, and one or more of the other members in the cluster fail.
142	      In this case, the remaining elements take over the work of the
143	      failed elements.  Normally, capacity planning takes such failures
144	      into account, and servers are typically run with enough spare
145	      capacity to handle failure of another element.  However, unusual
146	      failure conditions can cause many elements to fail at once.  This
147	      is often the case with software failures, where a bad packet or
148	      bad database entry hits the same bug in a set of elements in a
149	      cluster.

151	   Network Initiated Traffic Flood:  Certain network events can
152	      precipitate a flood of signaling traffic on a Diameter network,
153	      such as an avalanche restart.  Examples include issues with the
154	      radio access network in a mobile network (such as radio overlays
155	      with frequent handovers) and operational changes.  Failure of a
156	      Diameter proxy may also result in a large amount of signaling as
157	      connections and sessions are reestablished.

159	   Subscriber Initiated Traffic Flood:  Large gatherings of subscribers
160	      or events that result in many subscribers interacting with the
161	      network in close time proximity can result in signaling traffic
162	      floods on Diameter networks.  For example, the finale of a large
163	      fireworks show could be immediately followed by many subscribers
164	      posting messages, pictures, and videos concentrated on one portion
165	      of a network.

167	   DoS attacks:  An attacker, wishing to disrupt service in the network,
168	      can cause a large amount of traffic to be launched at a target
169	      server.  This can be done from a central source of traffic or
170	      through a distributed DoS attack.  In all cases, the volume of
171	      traffic well exceeds the capacity of the server, sending the
172	      system into overload.

174	1.2.  Effects of Overload

176	   Modern Diameter networks may operate at very large transaction
177	   volumes.  If a Diameter node becomes overloaded, or even worse, fails
178	   completely, a large number of messages may be lost very quickly.
179	   Even with redundant servers, many messages can be lost in the time it
180	   takes for failover to complete.  While a Diameter client or agent
181	   should be able to retry such requests, an overloaded peer may cause a
182	   sudden large increase in the number of transactions
183	   needing to be retried, rapidly filling local queues or otherwise
184	   contributing to local overload.  Therefore Diameter devices need to
185	   be able to shed load before critical failures can occur.

187	      Diameter depends heavily on the "Authentication, Authorization,
188	      and Accounting (AAA) Transport Profile" [RFC3539], which states
189	      assumptions about the scale of AAA services which may be incorrect
190	      for current uses of Diameter.  In particular, the document
191	      suggests that AAA services will typically be low volume and that
192	      traffic will typically be application-driven.  Section 2.1 of that
193	      document uses an example of a 48 port NAS.  However, Diameter is
194	      commonly used in large-scale mobile data environments, where a
195	      typical client could be a packet gateway that serves millions of
196	      users, and generates Diameter messages at network-driven rates.

198	1.3.  Documentation Conventions

200	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
201	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
202	   document are to be interpreted as described in [RFC2119].

204	   The terms "client", "server", "agent", "node", "peer", "upstream",
205	   and "downstream" are used as defined in [I-D.ietf-dime-rfc3588bis].

207	2.  Overload Scenarios

209	   Several Diameter deployment scenarios exist that may impact overload
210	   management.  The following scenarios help motivate the requirements
211	   for an overload management mechanism.

213	   These scenarios are by no means exhaustive, and are in general
214	   simplified for the sake of clarity.  In particular, the authors
215	   assume for the sake of clarity that the client sends Diameter
216	   requests to the server, and the server sends responses to the client,
217	   even though Diameter supports bidirectional applications.  Each
218	   direction in such an application can be modeled separately.

220	   In a large scale deployment, many of the nodes represented in these
221	   scenarios would be deployed as clusters of servers.  The authors
222	   assume that such a cluster is responsible for managing its own
223	   internal load balancing and overload management so that it appears as
224	   a single Diameter node.  That is, other Diameter nodes can treat it
225	   as a single, monolithic node for the purposes of overload management.

227	   These scenarios do not illustrate the client application.  As
228	   mentioned in Section 1, Diameter is not typically an end-user
229	   protocol; rather it is generally used in support of some other client
230	   application.  These scenarios do not consider the impact of Diameter
231	   overload on the client application.

233	2.1.  Peer to Peer Scenarios

235	   This section describes Diameter peer-to-peer scenarios.  That is,
236	   scenarios where a Diameter client talks directly with a Diameter
237	   server, without the use of a Diameter agent.

239	   Figure 1 illustrates the simplest possible Diameter relationship.
240	   The client and server share a one-to-one peer-to-peer relationship.
241	   If the server becomes overloaded, either because the client exceeds
242	   the server's capacity, or because the server's capacity is reduced
243	   due to some resource dependency, the client needs to reduce the
244	   amount of Diameter traffic it sends to the server.  Since the client
245	   cannot forward requests to another server, it must either queue
246	   requests until the server recovers, or itself become overloaded in
247	   the context of the client application and other protocols it may also
248	   use.

250	                         +------------------+
251	                         |                  |
252	                         |                  |
253	                         |     Server       |
254	                         |                  |
255	                         +--------+---------+
256	                                  |
257	                                  |
258	                         +--------+---------+
259	                         |                  |
260	                         |                  |
261	                         |     Client       |
262	                         |                  |
263	                         +------------------+

265	                   Figure 1: Basic Peer to Peer Scenario

267	   Figure 2 shows a similar scenario, except in this case the client has
268	   multiple servers that can handle work for a specific realm and
269	   application.  If server 1 becomes overloaded, the client can forward
270	   traffic to server 2.  Assuming server 2 has sufficient reserve
271	   capacity to handle the forwarded traffic, the client should be able
272	   to continue serving client application protocol users.  If server 1
273	   is approaching overload, but can still handle some number of new
274	   requests, it needs to be able to instruct the client to forward a
275	   subset of its traffic to server 2.

277	           +------------------+     +------------------+
278	           |                  |     |                  |
279	           |                  |     |                  |
280	           |     Server 1     |     |     Server 2     |
281	           |                  |     |                  |
282	           +--------+-`.------+     +------.'+---------+
283	                        `.               .'
284	                          `.           .'
285	                            `.       .'
286	                              `.   .'
287	                        +-------`.'--------+
288	                        |                  |
289	                        |                  |
290	                        |     Client       |
291	                        |                  |
292	                        +------------------+

294	              Figure 2: Multiple Server Peer to Peer Scenario
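
   The partial diversion described for Figure 2 can be sketched as
   follows.  This is a minimal illustration only: Diameter currently
   defines no way for a server to request that a fraction of traffic
   be diverted, so the divert_fraction value, the TrafficSplitter
   class, and the server names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrafficSplitter:
    """Splits new requests between a primary and an alternate server.

    divert_fraction is a hypothetical value that the primary server
    would signal to the client; no such signal exists in Diameter today.
    """
    primary: str
    alternate: str
    divert_fraction: float = 0.0  # share of requests sent to the alternate
    _credit: float = 0.0          # accumulated diversion credit

    def pick(self) -> str:
        # Add one request's worth of diversion credit, and divert
        # whenever a whole request's worth has accumulated.
        self._credit += self.divert_fraction
        if self._credit >= 1.0:
            self._credit -= 1.0
            return self.alternate
        return self.primary


splitter = TrafficSplitter("server1", "server2", divert_fraction=0.25)
picks = [splitter.pick() for _ in range(100)]
```

   With a fraction of 0.25, one request in four goes to server 2 and
   the rest stay on server 1.  A credit counter is used here instead
   of random selection so that the split is deterministic.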

296	   Figure 3 illustrates a peer-to-peer scenario with multiple Diameter
297	   realm and application combinations.  In this example, server 2 can
298	   handle work for both applications.  Each application might have
299	   different resource dependencies.  For example, a server might need to
300	   access one database for application A, and another for application B.
301	   This creates a possibility that server 2 could become overloaded for
302	   application A but not for application B, in which case the client
303	   would need to divert some part of its application A requests to
304	   server 1, but should not divert any application B requests.  This
305	   requires server 2 to be able to distinguish between applications when
306	   it indicates an overload condition to the client.

308	   On the other hand, it's possible that the servers host many
309	   applications.  If server 2 becomes overloaded for all applications,
310	   it would be undesirable for it to have to notify the client
311	   separately for each application.  Therefore it also needs a way to
312	   indicate that it is overloaded for all possible applications.

314	 +----------------------------------------------+
315	 | Application A       +------------------------+----------------------+
316	 |+------------------+ |  +------------------+  |  +------------------+|
317	 ||                  | |  |                  |  |  |                  ||
318	 ||                  | |  |                  |  |  |                  ||
319	 ||     Server 1     | |  |     Server 2     |  |  |     Server 3     ||
320	 ||                  | |  |                  |  |  |                  ||
321	 |+--------+---------+ |  +--------+---------+  |  +-+----------------+|
322	 |         |           |           |            |    |                 |
323	 +---------+-----------+-----------+------------+    |                 |
324	          |           |           |                 |                 |
325	          |           |           |                 |  Application B  |
326	          |           +-----------+-----------------+-----------------+
327	          ``-.._                  |                 |
328	                `-..__            |             _.-''
329	                     `--._        |        _.-''
330	                          ``-.__  |   _.-''
331	                         +------`-.-''------+
332	                         |                  |
333	                         |                  |
334	                         |     Client       |
335	                         |                  |
336	                         +------------------+

338	           Figure 3: Multiple Application Peer to Peer Scenario
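
   The two scopes discussed for Figure 3 (overload for one
   application, and overload for all applications) could be tracked
   by a client roughly as shown below.  This is a sketch under
   assumptions: the OverloadState class, the ALL_APPLICATIONS
   wildcard, and the application labels "A" and "B" are illustrative
   names, not part of any Diameter specification.

```python
ALL_APPLICATIONS = None  # hypothetical wildcard meaning "every application"


class OverloadState:
    """Records which (server, application) pairs were reported overloaded."""

    def __init__(self):
        self._overloaded = set()

    def report(self, server, application=ALL_APPLICATIONS):
        # A report without an application covers the whole server.
        self._overloaded.add((server, application))

    def is_overloaded(self, server, application):
        # A server is overloaded for an application if it reported that
        # application specifically, or reported all applications at once.
        return ((server, application) in self._overloaded
                or (server, ALL_APPLICATIONS) in self._overloaded)


state = OverloadState()
state.report("server2", "A")   # server 2 is overloaded for application A only
state.report("server3")        # server 3 is overloaded for every application
```

   In this sketch the client would divert application A requests away
   from server 2 while continuing to send it application B requests,
   and a single report is enough to mark server 3 as overloaded for
   everything it hosts.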

340	2.2.  Agent Scenarios

342	   This section describes scenarios that include a Diameter agent,
343	   either in the form of a Diameter relay or Diameter proxy.  These
344	   scenarios do not consider Diameter redirect agents, since they are
345	   more readily modeled as end-servers.

347	   Figure 4 illustrates a simple Diameter agent scenario with a single
348	   client, agent, and server.  In this case, overload can occur at the
349	   server, at the agent, or both.  But in most cases, client behavior is
350	   the same whether overload occurs at the server or at the agent.  From
351	   the client's perspective, server overload and agent overload are the
352	   same thing.

354	                       +------------------+
355	                       |                  |
356	                       |                  |
357	                       |     Server       |
358	                       |                  |
359	                       +--------+---------+
360	                                |
361	                                |
362	                       +--------+---------+
363	                       |                  |
364	                       |                  |
365	                       |      Agent       |
366	                       |                  |
367	                       +--------+---------+
368	                                |
369	                                |
370	                       +--------+---------+
371	                       |                  |
372	                       |                  |
373	                       |     Client       |
374	                       |                  |
375	                       +------------------+

377	                      Figure 4: Basic Agent Scenario

379	   Figure 5 shows an agent scenario with multiple servers.  If server 1
380	   becomes overloaded, but server 2 has sufficient reserve capacity, the
381	   agent may be able to transparently divert some or all Diameter
382	   requests originally bound for server 1 to server 2.

384	   In most cases, the client does not have detailed knowledge of the
385	   Diameter topology upstream of the agent.  If the agent uses dynamic
386	   discovery to find eligible servers, the set of eligible servers may
387	   not be enumerable from the perspective of the client.  Therefore, in
388	   most cases the agent needs to deal with any upstream overload issues
389	   in a way that is transparent to the client.  If one server notifies
390	   the agent that it has become overloaded, the notification should not
391	   be passed back to the client in a way where the client could
392	   mistakenly perceive the agent itself as being overloaded.  If the set
393	   of all possible destinations upstream of the agent no longer has
394	   sufficient capacity for incoming load, the agent itself becomes
395	   effectively overloaded.

397	   On the other hand, there are cases where the client needs to be able
398	   to select a particular server from behind an agent.  For example, if
399	   a Diameter request is part of a multiple-round-trip authentication,
400	   or is otherwise part of a Diameter "session", it may have a
401	   Destination-Host AVP that requires the request to be served by server
402	   1.  Therefore the agent may need to inform a client that a particular
403	   upstream server is overloaded or otherwise unavailable.

405	           +------------------+     +------------------+
406	           |                  |     |                  |
407	           |                  |     |                  |
408	           |     Server 1     |     |     Server 2     |
409	           |                  |     |                  |
410	           +--------+-`.------+     +------.'+---------+
411	                        `.               .'
412	                         `.           .'
413	                            `.       .'
414	                              `.   .'
415	                        +-------`.'--------+
416	                        |                  |
417	                        |                  |
418	                        |     Agent        |
419	                        |                  |
420	                        +--------+---------+
421	                                 |
422	                                 |
423	                                 |
424	                        +--------+---------+
425	                        |                  |
426	                        |                  |
427	                        |     Client       |
428	                        |                  |
429	                        +------------------+

431	                 Figure 5: Multiple Server Agent Scenario
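
   The agent behavior described for Figure 5 can be summarized as a
   small decision function.  This is a sketch only: the route()
   function, its return values, and the server names are assumptions
   made for illustration, and the sketch ignores realm and
   application routing entirely.

```python
from typing import List, Optional, Set, Tuple


def route(destination_host: Optional[str],
          servers: List[str],
          overloaded: Set[str]) -> Tuple[str, Optional[str]]:
    """Decide how an agent treats one request, given known overload state."""
    if destination_host is not None:
        # The request is pinned to one server, so the agent cannot
        # divert it; the client needs to learn about that server.
        if destination_host in overloaded:
            return ("report-server-overloaded", destination_host)
        return ("forward", destination_host)

    # Unpinned request: divert transparently to any usable server.
    candidates = [s for s in servers if s not in overloaded]
    if candidates:
        return ("forward", candidates[0])

    # No upstream capacity is left, so the agent is effectively overloaded.
    return ("report-agent-overloaded", None)


servers = ["server1", "server2"]
diverted = route(None, servers, {"server1"})
pinned = route("server1", servers, {"server1"})
exhausted = route(None, servers, {"server1", "server2"})
```

   Only the pinned case and the exhausted case surface anything to
   the client; diversion between servers stays invisible to it.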

433	   Figure 6 shows a scenario where an agent routes requests to a set of
434	   servers for more than one Diameter realm and application.  In this
435	   scenario, if server 1 becomes overloaded or unavailable, the agent
436	   may effectively operate at reduced capacity for application A, but at
437	   full capacity for application B.  Therefore, the agent needs to be
438	   able to report that it is overloaded for one application, but not for
439	   another.

441	 +----------------------------------------------+
442	 | Application A       +------------------------+----------------------+
443	 |+------------------+ |  +------------------+  |  +------------------+|
444	 ||                  | |  |                  |  |  |                  ||
445	 ||                  | |  |                  |  |  |                  ||
446	 ||     Server 1     | |  |     Server 2     |  |  |     Server 3     ||
447	 ||                  | |  |                  |  |  |                  ||
448	 |+---------+--------+ |  +--------+---------+  |  +--+---------------+|
449	 |          |          |           |            |     |                |
450	 +----------+----------+-----------+------------+     |                |
451	            |          |           |                  |                |
452	            |          |           |                  | Application B  |
453	            |          +-----------+------------------+----------------+
454	            |                      |                  |
455	             ``--.__               |                 _.
456	                    ``-.__         |          __.--''
457	                          `--.._   |    _..--'
458	                          +-----``-+.-''-----+
459	                          |                  |
460	                          |                  |
461	                          |     Agent        |
462	                          |                  |
463	                          +--------+---------+
464	                                   |
465	                                   |
466	                          +--------+---------+
467	                          |                  |
468	                          |                  |
469	                          |     Client       |
470	                          |                  |
471	                          +------------------+

473	               Figure 6: Multiple Application Agent Scenario

475	3.  Existing Mechanisms

477	   Diameter requires the use of a congestion-managed transport layer,
478	   currently TCP or SCTP, to mitigate network congestion.  But even with
479	   a congestion-managed transport, a Diameter node can become overloaded
480	   at the protocol layer due to the causes described in Section 1.1.

482	   Diameter offers both implicit and explicit mechanisms for a Diameter
483	   node to learn that a peer is overloaded or unreachable.  The implicit
484	   mechanism is simply the lack of responses to requests.  If a client
485	   fails to receive a response in a certain time period, it assumes the
486	   upstream peer is unavailable, or overloaded to the point of effective
487	   unavailability.  The watchdog mechanism [RFC3539] ensures that a
488	   certain rate of transaction responses occurs even when there is
489	   otherwise little or no other Diameter traffic.

491	   The explicit mechanism involves specific protocol error responses,
492	   where an agent or server can tell a downstream peer that it is either
493	   too busy to handle a request (DIAMETER_TOO_BUSY) or unable to route a
494	   request to an upstream destination (DIAMETER_UNABLE_TO_DELIVER),
495	   perhaps because that destination itself is overloaded to the point of
496	   unavailability.

498	   Once a Diameter node learns that an upstream peer has become
499	   overloaded via one of these mechanisms, it can then attempt to take
500	   action to reduce the load.  This usually means forwarding traffic to
501	   an alternate destination, if available.  If no alternate destination
502	   is available, the node must either reduce the number of messages it
503	   originates (in the case of a client) or inform the client to reduce
504	   traffic (in the case of an agent).
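
   The existing mechanisms and the reactions available to a node can
   be sketched as follows.  The two Result-Code values are those
   assigned by the Diameter base specification; the function names
   and the string labels are assumptions chosen for illustration.

```python
from typing import Optional

# Result-Code values from the Diameter base specification.
DIAMETER_UNABLE_TO_DELIVER = 3002
DIAMETER_TOO_BUSY = 3004


def classify(result_code: Optional[int], timed_out: bool) -> str:
    """Map the outcome of one request to the signal it conveys."""
    if timed_out:
        return "implicit"              # no answer within the timeout
    if result_code == DIAMETER_TOO_BUSY:
        return "peer-busy"             # the answering node is too busy
    if result_code == DIAMETER_UNABLE_TO_DELIVER:
        return "upstream-unreachable"  # the node could not route upstream
    return "none"


def next_action(signal: str, has_alternate: bool, is_client: bool) -> str:
    """Choose a reaction from the options listed in this section."""
    if signal == "none":
        return "continue"
    if has_alternate:
        return "forward-to-alternate"
    # With no alternate, a client throttles itself and an agent has to
    # push the problem back toward the client.
    return "reduce-originated-traffic" if is_client else "inform-client"
```

   Note that every signal in this sketch is binary.  None of them lets
   a node say that it is merely approaching overload.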

506	4.  Issues with the Current Mechanisms

508	   The currently available Diameter mechanisms for indicating an
509	   overload condition are not adequate to avoid congestion collapse.  In
510	   particular, they do not allow a Diameter agent or server to shed load
511	   as it approaches overload.  At best, a node can only indicate that
512	   it needs to entirely stop receiving requests, i.e., that it has
513	   effectively failed.  Diameter offers no mechanism to allow a node to
514	   indicate different overload states for different categories of
515	   messages, for example, if it is overloaded for one Diameter
516	   application but not another.

518	4.1.  Problems with Implicit Mechanism

520	   The implicit mechanism doesn't allow an agent or server to inform the
521	   client of a problem until it is effectively too late to do anything
522	   about it.  The client does not know to take action until the upstream
523	   node has effectively failed.  A Diameter node has no opportunity to
524	   shed load early to avoid collapse in the first place.

526	   Additionally, the implicit mechanism cannot distinguish between
527	   overload of a Diameter node and network congestion.  Diameter treats
528	   the failure to receive an answer as a transport failure.

530	4.2.  Problems with Explicit Mechanisms

532	   The Diameter specification is ambiguous on how a client should handle
533	   receipt of a DIAMETER_TOO_BUSY response.  The base specification
534	   [I-D.ietf-dime-rfc3588bis] indicates that the sending client should
535	   attempt to send the request to a different peer.  It makes no
536	   suggestion that the receipt of a DIAMETER_TOO_BUSY response should
537	   affect future Diameter messages in any way.

539	   The Authentication, Authorization, and Accounting (AAA) Transport
540	   Profile [RFC3539] recommends that a AAA node that receives a "Busy"
541	   response fail over all remaining requests to a different agent or
542	   server.  But while the Diameter base specification explicitly depends
543	   on RFC3539 to define transport behavior, it does not refer to RFC3539
544	   in the description of behavior on receipt of DIAMETER_TOO_BUSY.
545	   There's a strong likelihood that at least some implementations will
546	   continue to send Diameter requests to an upstream peer even after
547	   receiving a DIAMETER_TOO_BUSY error.

549	   BCP 41 [RFC2914] describes, among other things, how end-to-end
550	   application behavior can help avoid congestion collapse.  In
551	   particular, an application should avoid sending messages that will
552	   never be delivered or processed.  The DIAMETER_TOO_BUSY behavior as
553	   described in the Diameter base specification fails at this, since if
554	   an upstream node becomes overloaded, a client attempts each request,
555	   and does not discover the need to fail over the request until the
556	   initial attempt fails.

558	   The situation is improved if implementations follow the [RFC3539]
559	   recommendation and keep state about upstream peer overload.  But even
560	   then, the Diameter specification offers no guidance on how long a
561	   client should wait before retrying the overloaded destination.  If an
562	   agent or server supports multiple realms and/or applications,
563	   DIAMETER_TOO_BUSY offers no way to indicate that it is
564	   overloaded for one application but not another.  A DIAMETER_TOO_BUSY
565	   error can only indicate overload at a "whole server" scope.
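
   As a minimal sketch of the state keeping discussed above, the
   following Python fragment shows a client that remembers which peers
   returned DIAMETER_TOO_BUSY and skips them for a hold-down period.
   The class name, the hold-down value, and the injected clock are
   assumptions made for illustration only; neither the Diameter base
   specification nor [RFC3539] defines them.

```python
import time

DIAMETER_TOO_BUSY = 3004  # Result-Code value from the Diameter base protocol


class BusyPeerTracker:
    """Hypothetical helper that remembers peers reporting DIAMETER_TOO_BUSY."""

    def __init__(self, hold_down_seconds=30.0, clock=time.monotonic):
        # hold_down_seconds is an assumed local policy value; the base
        # specification gives no guidance on how long a client should wait.
        self.hold_down_seconds = hold_down_seconds
        self.clock = clock
        self._busy_until = {}  # peer name -> time at which it may be retried

    def record_answer(self, peer, result_code):
        """Mark the whole peer as busy when it answers DIAMETER_TOO_BUSY."""
        if result_code == DIAMETER_TOO_BUSY:
            self._busy_until[peer] = self.clock() + self.hold_down_seconds

    def is_available(self, peer):
        """Return True if the peer is not in its hold-down period."""
        deadline = self._busy_until.get(peer)
        if deadline is None:
            return True
        if self.clock() >= deadline:
            del self._busy_until[peer]  # hold-down expired, forget the state
            return True
        return False

    def pick_peer(self, candidates):
        """Return the first candidate not marked busy, or None if all are."""
        for peer in candidates:
            if self.is_available(peer):
                return peer
        return None
```

   Note that the tracker can only mark a whole peer as busy.  It has
   no way to record that a peer is overloaded for one application but
   not another, which is the limitation described above.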

567	   Agent processing of a DIAMETER_TOO_BUSY response is also problematic
568	   as described in the base specification.  DIAMETER_TOO_BUSY is defined
569	   as a protocol error.  If an agent receives a protocol error, it may
570	   either handle it locally or it may forward the response back towards
571	   the downstream peer.  (The Diameter specification is inconsistent
572	   about whether a protocol error MAY or SHOULD be handled by an agent,
573	   rather than forwarded downstream.)  If a downstream peer receives the
574	   DIAMETER_TOO_BUSY response, it may stop sending all requests to the
575	   agent for some period of time, even though the agent may still be
576	   able to deliver requests to other upstream peers.
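
   The local-handling option can be sketched as follows.  This is an
   illustration under stated assumptions, not behavior mandated by the
   base specification: the function name and the "send" callback are
   hypothetical, and the agent is assumed to try its upstream peers in
   a fixed order.

```python
DIAMETER_SUCCESS = 2001   # Result-Code values from the Diameter base protocol
DIAMETER_TOO_BUSY = 3004


def route_request(request, upstream_peers, send):
    """Try each upstream peer in turn and handle DIAMETER_TOO_BUSY locally.

    `send` is a hypothetical callback taking (peer, request) and returning
    the Result-Code of the answer.  Returns (peer, result_code) for the
    first peer that does not answer DIAMETER_TOO_BUSY, or
    (None, DIAMETER_TOO_BUSY) if every peer is busy.
    """
    for peer in upstream_peers:
        result_code = send(peer, request)
        if result_code != DIAMETER_TOO_BUSY:
            return peer, result_code
    # Only now would the agent forward DIAMETER_TOO_BUSY downstream.
    return None, DIAMETER_TOO_BUSY
```

   Even in this sketch, a DIAMETER_TOO_BUSY answer that is eventually
   forwarded downstream does not tell the downstream peer whether the
   agent itself or only the servers behind it are overloaded.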

578	5.  3GPP Study on Core Network Overload

580	   A study in 3GPP SA2 on core network overload has produced the
581	   technical report [TR23.843].  This enumerates several causes of
582	   overload in mobile core networks including portions that are signaled
583	   using Diameter.

585	   It is common for mobile networks to employ more than one radio
586	   technology and to do so in an overlay fashion with multiple
587	   technologies present in the same location (such as GSM or CDMA along
588	   with LTE).  This presents opportunities for traffic storms when
589	   issues occur on one overlay and not another, as all devices that had
590	   been on the overlay with issues switch to the other.  This generates
591	   heavy Diameter traffic as locations and policies are updated.

593	   Another scenario called out by this study is a flood of registration
594	   and mobility management events caused by some element in the core
595	   network failing.  This flood of traffic from end elements falls under
596	   the network initiated traffic flood category.  There is likely to
597	   also be traffic resulting directly from the component failure in this
598	   case.

600	   Subscriber initiated traffic floods are also indicated in this study
601	   as an overload mechanism where a large number of mobile devices
602	   attempt to access services at the same time, such as in response
603	   to an entertainment event or a catastrophic event.

605	   While this study is concerned with the broader effects of these
606	   scenarios on wireless networks and their elements, they have
607	   implications specifically for Diameter signaling.  One of the goals
608	   of this document is to provide guidance for a core mechanism that can
609	   be used to mitigate the scenarios called out by this study.

611	6.  Solution Requirements

613	   This section proposes requirements for an improved mechanism to
614	   control Diameter overload, with the goals of improving the issues
615	   described in Section 4 and supporting the scenarios described in
616	   Section 2.

618	   REQ 1:   The overload mechanism MUST provide a communication method
619	            for Diameter nodes to exchange overload information.

621	   REQ 2:   The overload mechanism MUST be useable with any existing or
622	            future Diameter application.  It MUST NOT require
623	            specification changes for existing Diameter applications.
624	            This may be achieved using a mechanism in the Diameter base
625	            protocol that all applications could make use of.

627	   REQ 3:   The overload mechanism MUST limit the impact of overload on
628	            the overall useful throughput of a Diameter server, even
629	            when the incoming load on the network is far in excess of
630	            its capacity.  The overall useful throughput under load is
631	            the ultimate measure of the value of an overload control
632	            mechanism.

634	   REQ 4:   Diameter allows requests to be sent from either side of a
635	            connection and either side of a connection may have need to
636	            provide its overload status.  The mechanism MUST allow each
637	            side of a connection to independently inform the other of
638	            its overload status.

640	   REQ 5:   Diameter allows elements to determine their peers via
641	            dynamic discovery or manual configuration.  The mechanism
642	            MUST work consistently without regard to how peers are
643	            determined.

645	   REQ 6:   The mechanism designers SHOULD seek to minimize the amount
646	            of new configuration required for the mechanism to work.  For
647	            example, it is better to allow peers to advertise or
648	            negotiate support for the mechanism, rather than to require
649	            this knowledge to be configured at each node.

651	   REQ 7:   The overload mechanism MUST ensure that the system remains
652	            stable.  When the offered load drops from above the overall
653	            capacity of the network to below the overall capacity, the
654	            throughput MUST stabilize and become equal to the offered
655	            load.

657	   REQ 8:   The mechanism MUST allow nodes to shed load without
658	            introducing oscillations.  Note that this requirement
659	            implies a need for supporting nodes to be able to
660	            distinguish current overload information from stale
661	            information, and to make decisions using the most currently
662	            available information.

664	   REQ 9:   The mechanism MUST function across fully loaded as well as
665	            quiescent transport connections.  This is partially derived
666	            from the requirements for stability and hysteresis control
667	            above.

669	   REQ 10:  Consumers of overload state indications MUST be able to
670	            determine when the overload condition improves or ends.

672	   REQ 11:  The overload mechanism MUST be scalable.  That is, it MUST
673	            be able to operate in different sized networks.

675	   REQ 12:  When a single network element fails, goes into overload, or
676	            suffers from reduced processing capacity, the mechanism MUST
677	            make it possible to limit the impact of this on other
678	            elements in the network.  This helps to prevent a small-
679	            scale failure from becoming a widespread outage.

681	   REQ 13:  The mechanism MUST NOT introduce substantial additional work
682	            for a node in an overloaded state.  For example, a requirement
683	            for an overloaded node to send overload information every
684	            time it received a new request would introduce substantial
685	            work.  Existing messaging is likely to have the
686	            characteristic of increasing as an overload condition
687	            approaches, allowing for the possibility of increased
688	            feedback for information piggybacked on it.

690	   REQ 14:  Some scenarios that result in overload involve a rapid
691	            increase of traffic with little time between normal levels
692	            and overload inducing levels.  The mechanism SHOULD provide
693	            for increased feedback when traffic levels increase.  The
694	            mechanism MUST NOT do this in such a way that it increases
695	            the number of messages while at high loads.

697	   REQ 15:  The mechanism MUST NOT interfere with the congestion control
698	            mechanisms of underlying transport protocols.

700	   REQ 16:  The mechanism MUST operate without malfunction in an
701	            environment with a mix of elements that do, and elements
702	            that do not, support the mechanism.

704	   REQ 17:  In a mixed environment with elements that support the
705	            overload control mechanism and that do not, the mechanism
706	            MUST NOT result in less useful throughput than would have
707	            resulted if it were not present.  It SHOULD result in less
708	            severe congestion in this environment.

710	   REQ 18:  In a mixed environment of elements that support the overload
711	            control mechanism and that do not, users and operators of
712	            elements that do not support the mechanism MUST NOT benefit
713	            from the mechanism more than users and operators of elements
714	            that support the mechanism.

716	   REQ 19:  It MUST be possible to use the mechanism between nodes in
717	            different realms and in different administrative domains.

719	   REQ 20:  Any explicit overload indication MUST distinguish between
720	            actual overload and other, non-overload-related
721	            failures.

723	   REQ 21:  In cases where a network element fails, is so overloaded
724	            that it cannot process messages, or cannot communicate due
725	            to a network failure, it may not be able to provide explicit
726	            indications of the nature of the failure or its levels of
727	            congestion.  The mechanism MUST properly function in these
728	            cases.

730	   REQ 22:  The mechanism MUST provide a way for an element to throttle
731	            the amount of traffic it receives from a peer element.
732	            This throttling SHOULD be graded so that it can be applied
733	            gradually as offered load increases.  Overload is not a
734	            binary state; there may be degrees of overload.

736	   REQ 23:  The mechanism MUST enable a supporting node to minimize the
737	            chance that retries due to an overloaded or failed element
738	            result in additional traffic to other overloaded elements,
739	            or cause additional elements to become overloaded.
740	            Moreover, the mechanism SHOULD provide unambiguous
741	            directions to clients on when they should retry a request
742	            and when they should not considering the various causes of
743	            overload such as avalanche restart.

745	   REQ 24:  The mechanism MUST provide sufficient information to enable
746	            a load balancing node to divert messages that are rejected
747	            or otherwise throttled by an overloaded upstream element to
748	            other upstream elements that are the most likely to have
749	            sufficient capacity to process them.

751	   REQ 25:  The mechanism MUST provide a mechanism for indicating load
752	            levels even when not in an overloaded condition, to assist
753	            elements making decisions to prevent overload conditions
754	            from occurring.

756	   REQ 26:  The specification for the overload mechanism SHOULD offer
757	            guidance on which message types might be desirable to
758	            process over others during times of overload, based on
759	            Diameter-specific considerations.  For example, it may be
760	            more beneficial to process messages for existing sessions
761	            ahead of new sessions.

763	   REQ 27:  The mechanism MUST NOT prevent a node from prioritizing
764	            requests based on any local policy, so that certain requests
765	            are given preferential treatment, given additional
766	            retransmission, or processed ahead of others.

768	   REQ 28:  The overload mechanism MUST NOT provide new vulnerabilities
769	            to malicious attack, or increase the severity of any
770	            existing vulnerabilities.  This includes vulnerabilities to
771	            DoS and DDoS attacks as well as replay and man-in-the-middle
772	            attacks.

774	   REQ 29:  The mechanism MUST provide a means to match an overload
775	            indication with the node that originated it.  In particular,
776	            the mechanism MUST allow a node to distinguish between
777	            overload at a next-hop peer and overload at a node upstream
778	            of the peer.  For example, in Figure 5, the client must not
779	            mistake overload at server 1 for overload at the agent,
780	            whether or not the agent supports the mechanism (see REQ
781	            4).

783	   REQ 30:  The mechanism MUST NOT depend on being deployed in
784	            environments where all Diameter nodes are completely
785	            trusted.  It SHOULD operate as effectively as possible in
786	            environments where other elements are malicious; this
787	            includes preventing malicious elements from obtaining more
788	            than a fair share of service.  Note that this does not imply
789	            any responsibility on the mechanism to detect, or take
790	            countermeasures against, malicious elements.

792	   REQ 31:  It MUST be possible for a supporting node to make
793	            authorization decisions about what information will be sent
794	            to peer elements based on the identity of those elements.
795	            This allows a domain administrator who considers the load of
796	            their elements to be sensitive information to restrict
797	            access to that information.  Of course, in such cases, there
798	            is no expectation that the overload mechanism itself will
799	            help prevent overload from that peer element.

801	   REQ 32:  The mechanism MUST NOT interfere with any Diameter compliant
802	            method that a node may use to protect itself from overload
803	            from non-supporting nodes, or from denial of service
804	            attacks.

806	   REQ 33:  There are multiple situations where a Diameter node may be
807	            overloaded for some purposes but not others.  For example,
808	            this can happen to an agent or server that supports multiple
809	            applications, or when a server depends on multiple external
810	            resources, some of which may become overloaded while others
811	            are fully available.  The mechanism MUST allow Diameter
812	            nodes to indicate overload with sufficient granularity to
813	            allow clients to take action based on the overloaded
814	            resources without forcing available capacity to go unused.
815	            The mechanism MUST support specification of overload
816	            information with granularities of at least "Diameter node",
817	            "realm", "Diameter application", and "Diameter session", and
818	            SHOULD allow extensibility for others to be added in the
819	            future.

821	   REQ 34:  The mechanism MUST provide a method for extending the
822	            information communicated and the algorithms used for
823	            overload control.
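
   To make a few of these requirements concrete, the following Python
   sketch shows one way a consumer of overload information might
   combine scoped reports (REQ 33), graded throttling (REQ 22),
   rejection of stale information (REQ 8), and detection of the end of
   an overload condition (REQ 10).  All field names, the percentage
   scale, the sequence number, and the expiry time are assumptions
   made for illustration; this document does not define any encoding
   or algorithm.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OverloadReport:
    """Hypothetical overload report; every field is an illustrative assumption."""
    scope: str              # "node", "realm", "application", or "session"
    target: str             # e.g. a host name, realm name, or application id
    reduction_percent: int  # 0 = no throttling requested, 100 = send nothing
    sequence: int           # increases with each new report for this scope
    expires_at: float       # time after which the report no longer applies


class OverloadState:
    """Keeps the most recent report per (scope, target) pair."""

    def __init__(self):
        self._reports = {}

    def update(self, report):
        """Store a report unless it is stale; return True if it was stored."""
        key = (report.scope, report.target)
        current = self._reports.get(key)
        if current is not None and report.sequence <= current.sequence:
            return False  # older than or equal to what we already hold
        self._reports[key] = report
        return True

    def reduction_for(self, now, **attrs):
        """Return the largest requested reduction matching a request.

        `attrs` maps scope names to the request's values, for example
        node="server1", application="4".
        """
        worst = 0
        for key, report in list(self._reports.items()):
            if now >= report.expires_at:
                del self._reports[key]  # overload condition has ended
                continue
            scope, target = key
            if attrs.get(scope) == target:
                worst = max(worst, report.reduction_percent)
        return worst

    def should_send(self, now, rng=random.random, **attrs):
        """Probabilistically drop the requested share of matching traffic."""
        return rng() * 100 >= self.reduction_for(now, **attrs)
```

   In this sketch, a request for an application that is not named in
   any report is never throttled, so available capacity for that
   application does not go unused.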

825	7.  IANA Considerations

827	   This document makes no requests of IANA.

829	8.  Security Considerations

831	   A Diameter overload control mechanism is primarily concerned with the
832	   load and overload related behavior of elements in a Diameter network,
833	   and the information used to affect that behavior.  Load and overload
834	   information is shared between elements and directly affects the
835	   behavior and thus is potentially vulnerable to a number of methods of
836	   attack.

838	   Load and overload information may also be sensitive from both
839	   business and network protection viewpoints.  Operators of Diameter
840	   equipment want to control visibility to load and overload information
841	   to keep it from being used for competitive intelligence or for
842	   targeting attacks.  It is also important that the Diameter overload
843	   control mechanism not introduce any way in which any other
844	   information carried by Diameter is sent inappropriately.

846	   This document includes requirements intended to mitigate the effects
847	   of attacks and to protect the information used by the mechanism.

849	8.1.  Access Control

851	   To control the visibility of load and overload information, sending
852	   should be subject to some form of authentication and authorization of
853	   the receiver.  It is also important to the receivers that they are
854	   confident the load and overload information they receive is from a
855	   legitimate source.  Note that this implies a certain amount of
856	   configurability on the elements supporting the Diameter overload
857	   control mechanism.

859	8.2.  Denial-of-Service Attacks

861	   An overload control mechanism provides a very attractive target for
862	   denial-of-service attacks.  A small number of messages may cause a
863	   large service disruption by falsely reporting overload conditions.
864	   Alternatively, attacking servers nearing, or in, overload may also be
865	   facilitated by disrupting their overload indications, potentially
866	   preventing them from mitigating their overload condition.

868	   A design goal for the Diameter overload control mechanism is to
869	   minimize or eliminate the possibility of using the mechanism for this
870	   type of attack.

872	   As the intent of some denial-of-service attacks is to induce overload
873	   conditions, an effective overload control mechanism should help to
874	   mitigate the effects of such an attack.

876	8.3.  Replay Attacks

878	   An attacker that has managed to obtain some messages from the
879	   overload control mechanism may attempt to affect the behavior of
880	   elements supporting the mechanism by sending those messages at
881	   potentially inopportune times.  In addition to time shifting, replay
882	   attacks may send messages to other nodes as well (target shifting).

884	   A design goal for the Diameter overload control mechanism is to
885	   minimize or eliminate the possibility of causing disruption by using
886	   a replay attack on the Diameter overload control mechanism.
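
   As an illustration of the two replay variants named above, the
   following Python sketch rejects reports that are replayed later
   (time shifting) or replayed to a different node (target shifting).
   The report fields and the use of a per-origin sequence number are
   assumptions, not part of any defined mechanism.  The sketch also
   assumes that the integrity of each report is protected by other
   means, since an attacker able to alter the fields could bypass
   these checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundReport:
    """Hypothetical overload report bound to its sender and receiver."""
    origin: str             # identity of the node that created the report
    intended_receiver: str  # identity of the node it was created for
    sequence: int           # strictly increasing per origin


class ReplayFilter:
    """Accepts each origin's reports only once and only in increasing order."""

    def __init__(self, own_identity):
        self.own_identity = own_identity
        self._last_sequence = {}  # origin -> highest sequence accepted so far

    def accept(self, report):
        """Return True if the report is fresh and addressed to this node."""
        if report.intended_receiver != self.own_identity:
            return False  # target shifting: report was meant for another node
        if report.sequence <= self._last_sequence.get(report.origin, -1):
            return False  # time shifting: report was already seen or is older
        self._last_sequence[report.origin] = report.sequence
        return True
```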

888	8.4.  Man-in-the-Middle Attacks

890	   By inserting themselves in between two elements supporting the
891	   Diameter overload control mechanism, an attacker may potentially both
892	   access and alter the information sent between those elements.  This
893	   can be used for information gathering for business intelligence and
894	   attack targeting, as well as direct attacks.

896	   A design goal for the Diameter overload control mechanism is to
897	   minimize or eliminate the possibility of causing disruption by
898	   man-in-the-middle attacks on the Diameter overload control mechanism.
899	   A transport using TLS and/or IPsec may be desirable for this.

901	8.5.  Compromised Hosts

903	   A compromised host that supports the Diameter overload control
904	   mechanism could be used for information gathering as well as for
905	   sending malicious information to any Diameter element that would
906	   normally accept information from it.  While it is beyond the scope of
907	   the Diameter overload control mechanism to mitigate any operational
908	   interruption to the compromised host, a reasonable design goal is to
909	   minimize the impact that a compromised host can have on other
910	   elements through the use of the Diameter overload control mechanism.
911	   Of course, a compromised host could be used to cause damage in a
912	   number of other ways.  This is out of scope for a Diameter overload
913	   control mechanism.

915	9.  References

917	9.1.  Normative References

919	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
920	              Requirement Levels", BCP 14, RFC 2119, March 1997.

922	   [I-D.ietf-dime-rfc3588bis]
923	              Fajardo, V., Arkko, J., Loughney, J., and G. Zorn,
924	              "Diameter Base Protocol", draft-ietf-dime-rfc3588bis-33
925	              (work in progress), May 2012.

927	   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
928	              RFC 2914, September 2000.

930	   [RFC3539]  Aboba, B. and J. Wood, "Authentication, Authorization and
931	              Accounting (AAA) Transport Profile", RFC 3539, June 2003.

933	9.2.  Informative References

935	   [RFC5390]  Rosenberg, J., "Requirements for Management of Overload in
936	              the Session Initiation Protocol", RFC 5390, December 2008.

938	   [TR23.843]
939	              3GPP, "Study on Core Network Overload Solutions",
940	              TR 23.843 0.4.0, April 2011.

942	Appendix A.  Contributors

944	   Significant contributions to this document were made by Adam Roach
945	   and Eric Noel.

947	Appendix B.  Acknowledgements

949	   Review of, and contributions to, this specification by Martin Dolly,
950	   Carolyn Johnson, Jianrong Wang, Imtiaz Shaikh, and Robert Sparks were
951	   most appreciated.  We would like to thank them for their time and
952	   expertise.

954	Authors' Addresses

956	   Eric McMurry
957	   Tekelec
958	   17210 Campbell Rd.
959	   Suite 250
960	   Dallas, TX  75252
961	   US

963	   Email: emcmurry@estacado.net

965	   Ben Campbell
966	   Tekelec
967	   17210 Campbell Rd.
968	   Suite 250
969	   Dallas, TX  75252
970	   US

972	   Email: ben@nostrum.com