Internet Engineering Task Force                       Curtis Villamizar
INTERNET-DRAFT                                                    UUNET
draft-ietf-ospf-omp-02                                February 24, 1999

                  OSPF Optimized Multipath (OSPF-OMP)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.
Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright (C) The Internet Society (February 24, 1999).  All Rights
Reserved.

Abstract

OSPF may form multiple equal cost paths between points.  This is true
of any link state protocol.  In the absence of any explicit support
to take advantage of this, a path may be chosen arbitrarily.
Techniques have been utilized to divide traffic somewhat evenly among
the available paths.  These techniques have been referred to as Equal
Cost Multipath (ECMP).  An unequal division of traffic among the
available paths is generally preferable.  Routers generally have no
knowledge of traffic loading on distant links and therefore have no
basis to optimize the allocation of traffic.

Optimized Multipath is a compatible extension to OSPF, utilizing the
Opaque LSA to distribute loading information, proposing a means to
adjust forwarding, and providing an algorithm to make the adjustments
gradually enough to ensure stability yet provide reasonably fast
adjustment when needed.

1 Overview

Networks running OSPF are often heavily loaded.  Topologies often
evolve to include multiple paths.  Multiple paths may be initially
designed to provide redundancy but also result from incremental
addition of circuits to accommodate traffic growth.  The redundant
paths provide a potential to distribute traffic loading and reduce
congestion.
Optimized Multipath (OMP) provides a means for OSPF to make better
use of this potential to distribute loading.

1.1 Past Attempts

Early attempts to provide load sensitive routing involved changing
link costs according to loading.  These attempts were doomed to
failure because the adjustment increment was grossly coarse and
oscillation was inevitable [2].  This early experience is largely
responsible for the common belief that any form of load sensitive
routing will fail due to severe oscillations resulting from
instability.

Attempts to use a metric composed of weighted components of delay,
traffic, and fixed costs have also met with very limited success.
The problem again is the granularity of adjustment.  As the
composition of weighted components switches favored paths, large
amounts of traffic are suddenly moved, making the technique prone to
oscillations [3].  The oscillation is damped to some extent by
providing a range of composite metric differences in which composite
metrics are considered equal and equal cost multipath techniques are
used.  Even then the technique still suffers oscillations due to the
coarse adjustments made at equal/unequal metric boundaries.

1.2 Equal Cost Multipath

A widely utilized technique to improve loading is known as Equal Cost
Multipath (ECMP).  ECMP is specified in [5].  In ECMP no attempt is
made to dynamically adjust OSPF costs based on loading, and therefore
ECMP is completely stable.  If the topology is such that equal cost
paths exist, then an attempt is made to divide traffic equally among
the paths.  At least three methods of splitting traffic have been
used.

1. Per packet round robin forwarding.

2. Dividing destination prefixes among available next hops in the
   forwarding entries.

3. Dividing traffic according to a hash function applied to the
   source and destination pair.
The ``per packet round robin forwarding'' technique is only
applicable if the delays on the paths are almost equal.  The delay
difference must be small relative to packet serialization time.
Delay differences greater than three times the packet serialization
time can cause terrible TCP performance.  For example, packets 2, 4,
and 6 may arrive before packet 1, triggering TCP fast retransmit.
The result will be limiting TCP to a very small window and very poor
performance over long delay paths.

The delay differences must be quite small.  A 532 byte packet is
serialized onto a DS1 link in under 2.8 msec.  At DS3 speed,
serialization is accomplished in under 100 usec.  At OC12 it is under
7 usec.  For this reason ``per packet round robin forwarding'' is not
applicable to a high speed WAN.

Dividing destination prefixes among available next hops provides a
very coarse and unpredictable load split.  Very short prefixes are
problematic.  In reaching an end node, the majority of traffic is
often destined to a single prefix.  This technique is applicable to a
high speed WAN, but with the drawbacks just mentioned, better
techniques are needed.

The ``source/destination hash'' based technique was used as far back
as the T1-NSFNET in the IBM RT-PC based routers.  A hash function,
such as CRC-16, is applied over the source address and destination
address.  The hash space is then split evenly among the available
paths by either setting thresholds or performing a modulo operation.
Traffic between any given source and destination remains on the same
path.  Because the technique is based on host addresses, and uses
both the source and destination address, it does not suffer the
coarse granularity problem of the prefix based technique, even when
forwarding to a single prefix.  Source/destination hash is the best
technique available for a high speed WAN.
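The hash-and-split technique just described can be sketched as
follows.  This is an illustrative sketch, not a normative procedure:
the particular CRC-16 polynomial (CRC-16/CCITT here) and the byte
layout of the source/destination key are assumptions, since the text
only names CRC-16 as one possible hash function.

```python
def crc16(data):
    """Bitwise CRC-16/CCITT (polynomial 0x1021); returns 0..65535."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def pick_path(src, dst, n_paths):
    """Map a source/destination IPv4 address pair (as integers) onto
    one of n_paths by hashing and taking a modulo."""
    key = src.to_bytes(4, "big") + dst.to_bytes(4, "big")
    return crc16(key) % n_paths
```

Because the hash is computed over the full source/destination pair,
all packets of a given pair follow the same path, which preserves
ordering within a flow while still splitting aggregate traffic.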
The forwarding decision for the ``source/destination hash'' based
technique is quite simple.  When a packet arrives, look up the
forwarding entry in the radix tree.  The next hop entry can be an
array index into a set of structures, each containing one or more
actual next hops.  If more than one next hop is present, compute a
CRC16 value based on the source and destination addresses.  The
CRC16 can be implemented in hardware and computed in parallel to the
radix tree lookup in high speed implementations, and discarded if not
needed.

                              .----.
                             /      \
                            |   N2   |
                             \      /
                              `----'
                             //    \\
                            //      \\
                           //        \\
                      .----.          .----.
                     /      \        /      \
                    |   N1   |======|   N3   |
                     \      /        \      /
                      `----'          `----'

              Figure 1: A very simple application of ECMP

Each next hop entry in the structure must contain a boundary value
and the next hop itself.  An integer ``less than'' comparison is made
against the boundary value, determining whether to use this next hop
or move on to the next comparison.  In hardware the full set of
comparisons can be made simultaneously for up to some number of next
hops, or a binary search can be performed.  This yields the next hop
to use.

1.3 Optimized Multipath differs from ECMP

For ECMP, the boundary values are set by first dividing one more than
the maximum value that the hash computation can return (65536 for
CRC16) by the number of available next hops and then setting the Nth
boundary to N times that number (with the Nth value fixed at one more
than the maximum value regardless of underflow caused by truncating
during division, 65536 for CRC16).

An equal load split is not always optimal.  Consider the example in
Figure 1 with the offered traffic in Table 1.  If all of the link
costs are set equally, then the link N1---N3 is significantly
overloaded (135.75%) while the path N1---N2---N3 is lightly loaded
(45.25% and 22.62%).
If the cost on the N1---N3 link is equal to the cost of the
N1---N2---N3 path, then N1 will try to split the load destined toward
N3 across the two paths.

                                     No       ECMP      OMP
   Nodes    Node Names               Split    Traffic   Traffic

   n3-n1    Node 3 -> Node 1         60       30        40
   n1-n3    Node 1 -> Node 3         60       30        40
   n3-n2    Node 3 -> Node 2         20       50        40
   n2-n3    Node 2 -> Node 3         20       50        40
   n2-n1    Node 2 -> Node 1         10       40        30
   n1-n2    Node 1 -> Node 2         10       40        30

        Table 1: Traffic loading for the example in Figure 1

Given the offered traffic in Table 1, the loading on N1---N3 is
reduced to 67.87% but the link loading on the path N2---N3 becomes
113.12%.  Ideally node N1 should put 1/3 of the traffic toward N3 on
the path N1---N2---N3 and 2/3 on the path N1---N3.  To know to do
this N1 must know the loading on N2---N3.

This is where Optimized Multipath (OMP) provides additional benefit
over ECMP.  Ignoring for the moment how node N1 knows to put 1/3 of
the traffic toward N3 on the path N1---N2---N3 (described later in
Section 2), the way the distribution of traffic is accomplished from
a forwarding standpoint is to move the boundary in the forwarding
structure from the default value of 1/2 of 65536 to about 1/3 of
65536.  If there is a very large set of source and destination host
address pairs, then the traffic will be split among the 65536
possible hash values.  This provides the means for a very fine
granularity of adjustment.

Having explained how a fine granularity of forwarding adjustment can
be accomplished, what remains is to define how nodes in a large
topology can know what the loading levels are elsewhere in the
topology and to define an algorithm which can allow autonomous
unsynchronized decisions on the parts of many routers in a topology
to quickly converge on a near optimal loading without the risk of
oscillation.  This is covered in the following sections.
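The boundary computation described in Section 1.3 can be sketched as
follows.  The linear scan stands in for the simultaneous hardware
comparisons described earlier; for two paths, ECMP yields boundaries
[32768, 65536], and the OMP 1/3--2/3 split for the Figure 1 example
corresponds to moving the first boundary to about 21845.  This is an
illustrative sketch, not a normative algorithm.

```python
HASH_SPACE = 65536  # one more than the maximum CRC16 value

def ecmp_boundaries(n_paths):
    """Equal split: the Nth boundary is N times the truncated share,
    with the last boundary pinned at HASH_SPACE regardless of any
    underflow caused by truncating during division."""
    share = HASH_SPACE // n_paths
    bounds = [share * n for n in range(1, n_paths + 1)]
    bounds[-1] = HASH_SPACE
    return bounds

def pick_next_hop(hash_value, boundaries):
    """Successive ``less than'' comparisons against boundary values;
    the first boundary exceeding the hash selects the next hop."""
    for i, b in enumerate(boundaries):
        if hash_value < b:
            return i
    raise ValueError("hash_value outside hash space")
```

OMP adjusts loading simply by rewriting the boundary list, so an
arbitrarily fine shift of traffic between paths needs no change to
the lookup procedure itself.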
2 Flooding Loading Information

Loading information is flooded within an OSPF area using Opaque
LSAs [1].  Area local scope (link-state type 10) link state
attributes are flooded containing an ``Opaque Type'' of
LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD.  The type LSA_OMP_LINK_LOAD
Opaque LSA is used to flood link loading information within an area.
The type LSA_OMP_PATH_LOAD Opaque LSA is used to flood loading
information for use with inter-area routes.  Loading information
obtained from an exterior routing protocol may also be considered if
available.  The means of passing loading information in an exterior
routing protocol is beyond the scope of this document.

2.1 Link Loading Information

Within an area link loading is flooded using the type
LSA_OMP_LINK_LOAD Opaque LSA.  The format of this LSA is described
in Appendix A.

The ``Opaque Information'' in the type LSA_OMP_LINK_LOAD Opaque LSA
contains the following.

1. a measure of link loading in each direction as a fraction of link
   capacity,

2. a measure of packets dropped due to queue overflow in each
   direction (if known) expressed as a fraction,

3. the link capacity in kilobits per second (or unity if less than
   1000 bytes per second).

Generally the number of output packets dropped will be known.  In
designs where drops occur on the input, the rate of input queue drops
should be recorded.  These measures of loading and drop are computed
using the interface counters generally maintained for SNMP purposes,
plus a running count of output queue drops if available.  The
counters are sampled every 15 seconds but generally flooded at longer
time intervals.

The previous value of each of the counters is subtracted from the
current value.  The counters required are 1) bytes out, 2) bytes in,
3) packets out, 4) packets in, 5) output queue drops, and 6) input
queue drops.
These counters should already exist to satisfy SNMP requirements.

A value of instantaneous load in each direction is based on byte
count and link capacity.  An instantaneous output queue drop rate is
based on queue drops and packet count.  Some of the values are
filtered as described in Appendix B.1.

The last time that a type LSA_OMP_LINK_LOAD Opaque LSA with the same
Opaque ID was sent is recorded and the values sent are recorded.  For
the purpose of determining when to reflood, an equivalent loading
figure is used.  The computation of equivalent loading is described
in Section 2.3.

The higher of the current equivalent loading computation and the
previous is used when determining whether to send the type
LSA_OMP_LINK_LOAD Opaque LSA.  The type LSA_OMP_LINK_LOAD Opaque LSA
is flooded according to elapsed time since last flooded, the current
equivalent load, and the difference between the current equivalent
load and the previously flooded equivalent load.  The reflooding
decision is described in detail in Appendix B.1.

The point of this graduated reflooding schedule is to reduce the
amount of flooding that is occurring unless links are in trouble or
undergoing a significant traffic shift.  Change may occur in a
quiescent network due to failure external to the network that causes
traffic to take alternate paths.  In this case, the more frequent
flooding will trigger a faster convergence.  Traffic shift may also
occur due to shedding of loading by the OMP algorithm itself as the
algorithm converges in response to an external change.

2.2 Path Loading Information

Path loading information regarding an adjacent area is flooded by an
Area Border Router (ABR) using the type LSA_OMP_PATH_LOAD Opaque LSA.
The format of this LSA is described in Appendix A.

The ``Opaque Information'' in the type LSA_OMP_PATH_LOAD Opaque LSA
contains the following.

1.
the highest loading in the direction toward the destination as a
fraction of link capacity,

2. a measure of total packet drop due to queue overflow in the
   direction toward the destination expressed as a fraction,

3. the smallest link capacity on the path to the destination.

These values are taken from the link on the path from the ABR to the
destination of the summary LSA.  The link with the highest loading
may not be the link with the lowest capacity.  The queue drop value
is one minus the product of the fraction of packets that are not
dropped at each measurement point on the path (input and output in
the direction of the path).  The following computation is used.

   path-loss = 1 - product(1 - link-loss)

The path loading and path loss rate are filtered according to the
algorithm defined in Appendix B.1.  Rather than polling a set of
counters, the current value of the path loading and path loss rate
is used.  An equivalent load is calculated for each path to a
summary LSA destination as described in Section 2.3.  A type
LSA_OMP_PATH_LOAD Opaque LSA is flooded according to the same rate
schedule as described in the prior section and Appendix B.1.

An ABR may be configured to not send type LSA_OMP_PATH_LOAD Opaque
LSA into any given area.  See Appendix C.

2.3 Computing equivalent loading

The equivalent load is the actual fractional loading multiplied by a
factor that provides an estimate, based on loss, of the extent to
which TCP is expected to slow down to avoid congestion.  This
estimate is based on the link bandwidth and loss rate, knowledge of
TCP dynamics, and some assumption about the characteristics of the
TCP flows being passed through the link.  Some of the assumptions
must be configured.

If loss is low or zero, the equivalent load will be equal to the
actual fractional loading (link utilization expressed as a number
between 0 and 1).
If loss is high and loading is at or near 100%, then the equivalent
load calculation provides a means of deciding which links are more
heavily overloaded.  The equivalent load figure is not intended to be
an accurate prediction of offered load, simply a metric for use in
deciding which link to offload.

Mathis et al. provide the following estimate of loss given TCP window
size and round trip time [4].

   p < (MSS / (BW * RTT))**2

The basis for the estimate is that TCP slows down roughly in
proportion to the inverse of the square root of loss.  There is no
way to know how fast TCP would be going if no loss were present if
there are other bottlenecks.  A somewhat arbitrary assumption is made
that TCP would go no faster than if loss were at 0.5%.  If loss is
greater than 0.5% then TCP performance would be reduced.  The
equivalent loading is estimated using the following computation.

   equiv-load = load * K * sqrt(loss)

The inverse of the square root of 1% is 10, so 10 may be used for the
value of ``K''.

The conversion of loss to estimated loading is not at all accurate.
The non-linearity does affect the time to converge, though
convergence still occurs as long as loss is positively correlated to
loading.  This is discussed further in Appendix E.1.

3 Next hop structures

A ``next hop structure'' contains a set of complete paths to a
destination, some of which may share the same immediate next hop.
The name is not meant to imply a single next hop.  A given route can
reference only one next hop structure, which can contain multiple
paths and multiple next hops.  Entries for paths that use the same
next hop are combined before moving information to the forwarding
table.  A next hop structure contains the information necessary to
balance load across a set of next hops.
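The loading computations of Sections 2.2 and 2.3 can be sketched as
follows.  Clamping the multiplier at 1 is an interpretation of the
statement in Section 2.3 that low or zero loss leaves the equivalent
load equal to the fractional loading; the formula alone would drive
the equivalent load to zero at zero loss.

```python
import math

def path_loss(point_losses):
    """path-loss = 1 - product(1 - link-loss), taken over every
    measurement point (input and output) along the path."""
    surviving = 1.0
    for loss in point_losses:
        surviving *= 1.0 - loss
    return 1.0 - surviving

def equiv_load(load, loss, k=10.0):
    """equiv-load = load * K * sqrt(loss), with the loss multiplier
    clamped at 1 so that low or zero loss leaves the fractional
    loading unchanged (an interpretation, see lead-in)."""
    return load * max(1.0, k * math.sqrt(loss))
```

For example, two measurement points each dropping 10% of packets give
a path loss of 19%, not 20%, because the second point only sees the
packets that survived the first.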
For intra-area routes, a separate next hop structure must exist for
each destination router or network.  For inter-area routes (summary
routes), at most one next hop structure is needed for each
combination of ABRs which announce summary routes that are considered
equidistant.  Optimizing inter-area and external routing is discussed
in Section 3.2.

The set of intra-area next hop structures is initialized after the
OSPF SPF calculation is completed.  An additional set of next hops is
then added by relaxing the best path criteria.

The use of the next hop structure and its contents is described in
Section 4.1.

3.1 Relaxing the Best Path Criteria

The exercise of setting link costs to produce the most beneficial set
of equal cost paths is tedious and very difficult for large
topologies.  OSPF as defined in RFC 2328 requires that only the best
path be considered.  For the purpose of Optimized Multipath, this
criterion can be relaxed to allow a greater number of multipaths, but
not to the point of creating routing loops.  Any next hop which is
closer in terms of cost than the current hop and does not cross a
virtual link can be considered a viable next hop for multipath
routing.  If next hops were used where the cost at the next hop is
equal or greater, routing loops would form.

In considering the paths beyond the next hop, only the best paths
should be considered.  There is no way to determine if subsequent
routers have relaxed the best path criterion.  In addition, there is
no need to consider the additional paths if the best path criterion
is relaxed downstream.  If the best path criterion is relaxed
downstream, the best paths must be part of the downstream next hop
structure.  If there are additional paths that the downstream is able
to use to further distribute the load, the entire set of paths will
still converge toward optimal loading.
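The relaxed viability rule of Section 3.1 can be sketched as follows,
assuming the SPF calculation has left a table of costs from every
router to the destination (the `dist` mapping here is an assumed
representation, not a structure defined by this document).

```python
def viable_next_hops(dist, node, dest, neighbors, virtual_links=frozenset()):
    """Return neighbors of 'node' usable as multipath next hops toward
    'dest' under the relaxed criterion: the neighbor's cost to dest
    must be strictly less than ours (strictness is what prevents
    routing loops), and the connecting link must not be virtual."""
    viable = []
    for n in neighbors[node]:
        if (node, n) in virtual_links or (n, node) in virtual_links:
            continue
        if dist[n][dest] < dist[node][dest]:
            viable.append(n)
    return viable
```

Because every forwarding step strictly decreases the remaining cost
to the destination, no sequence of such hops can revisit a router.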
The best path criterion is relaxed only for intra-area routes.  The
best path criterion can also be relaxed when considering the cost to
reach ABRs or ASBRs.  The best path criterion should not be relaxed
when considering the total cost to reach a summary route or external
route.

3.2 Offloading Congestion Outside the OSPF Area

For inter-area routes or external routes, a separate next hop
structure must exist for each such route if it is desirable to reduce
loading outside of the area and the loading within the area is
sufficiently low to safely allow this.

The existing procedures regarding selection of inter-area and
external routes outlined in [5] still apply.  For inter-area routes
the intra-area cost and cost of the summary route are summed.  For
external routes the intra-area cost is summed with a type 1 external
cost and considered before a type 2 external cost.  The best path
criterion is not relaxed when applied to the sum of intra-area cost
and summary route cost or intra-area cost and type 1 external cost.

In order for an ABR or ASBR to be considered as a viable exit point
from the area for a given destination, it must be advertising an
applicable summary route or external route.  The best summary route
or external route must still be chosen.  If a single ABR or ASBR
advertises the best route, multiple paths to that ABR or ASBR may be
used, but traffic cannot be sent toward an ABR or ASBR advertising a
higher cost summary route or external route.  If two or more ABRs or
ASBRs advertise a route at the same cost, then traffic load can be
split among these ABRs or ASBRs.
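The exit-point rule above reduces to a minimum-cost filter: traffic
may be split only among the ABRs or ASBRs advertising the best cost
for the route.  A minimal sketch (the advertisement representation
is an assumption):

```python
def eligible_exits(advertisements):
    """advertisements: (router, cost) pairs for one summary or
    external route.  Only routers advertising the minimum cost are
    viable exit points; load may then be split among them."""
    best = min(cost for _, cost in advertisements)
    return [router for router, cost in advertisements if cost == best]
```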
For inter-area routes, if a type LSA_OMP_PATH_LOAD Opaque LSA exists
for the summary LSA, and more than one ABR is advertising an equally
preferred summary route, and the equivalent load for the summary LSA
is greater than 90%, and the equivalent load within the area is
sufficiently smaller than the inter-area loading, then a next hop
structure can be created specifically to allow offloading of the
inter-area route.  For external routes, if an equivalent loading
exists, and more than one ASBR is advertising an equally preferred
external route, and the equivalent load is greater than 95%, and the
equivalent load within the area is sufficiently smaller than the
external route loading, then a separate structure is used.

Hysteresis must be used in the algorithm for determining if an
equivalent load on a summary LSA or external route is considered
sufficiently larger than the intra-area equivalent load or if an
external route loading is considered sufficiently larger than the
inter-area equivalent load.  For the purpose of describing this
algorithm one equivalent load is referred to as the more external,
and the other as the more internal equivalent load.

If the more external equivalent load exceeds the more internal
equivalent load by 15% and the more internal equivalent load is under
85%, then a separate next hop structure is created.  If the more
external equivalent load falls below 20% of the more internal
equivalent load or the more internal equivalent load exceeds 98%,
then an existing separate next hop structure is marked for removal
and combined with the more internal next hop structure (see
Section 3.3).
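The creation and removal hysteresis just described can be sketched as
follows.  Reading ``falls below 20% of the more internal equivalent
load'' as a 20-point gap is an interpretation, and the threshold
defaults are the configurable values given in the text.

```python
def separate_structure_action(ext_load, int_load, have_separate,
                              create_gap=0.15, destroy_gap=0.20,
                              create_cap=0.85, destroy_cap=0.98):
    """Hysteresis for a separate next hop structure.  ext_load and
    int_load are the more external and more internal equivalent loads
    as fractions.  Returns 'create', 'destroy', or 'keep'."""
    if not have_separate:
        # Create when the external load exceeds the internal load by
        # 15 points and the internal load is still under 85%.
        if ext_load > int_load + create_gap and int_load < create_cap:
            return "create"
    else:
        # Destroy when the external load falls 20 points below the
        # internal load (interpretation) or the internal exceeds 98%.
        if ext_load < int_load - destroy_gap or int_load > destroy_cap:
            return "destroy"
    return "keep"
```

The gap between the create and destroy thresholds is what prevents a
structure from being repeatedly created and torn down as the loads
drift near a single boundary.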
The more external equivalent load should not fall significantly below
the more internal unless either the traffic toward the more external
destination increases or the loading on the more internal increases,
since the more internal equivalent load will become the critical
segment on the separate next hop structure if the load is
sufficiently shifted, but is unlikely to overshoot by 20%.  These
thresholds should be configurable at least per type of routes
(inter-area or external).

The degree to which summary LSA loading and external route loading
will be considered is limited.  This serves two purposes.  First, it
prevents compensating for external congestion to the point of loading
the internal network beyond a fixed threshold.  Second, it prevents
triggering the removal of the next hop structure, which if allowed to
occur could trigger a hysteresis loop.  These mechanisms are
described in Section 3.4 and Appendix C.4.

3.3 Creating and destroying next hop structures

As described in Section 3.2, a separate next hop structure is needed
if the loading indicated by the type LSA_OMP_PATH_LOAD Opaque LSA or
exterior routing protocol is sufficiently high to require separate
balancing for traffic to the summary LSA or exterior route and the
intra-area loading is sufficiently low.

When a separate next hop structure is created, the same available
paths appear in the structure, leading to the same set of ABRs or
ASBRs.  The balance on these available paths should be copied from
the existing more internal next hop structure.  By initializing the
new next hop structure this way, a sudden change in loading is
avoided if a great deal of traffic is destined toward the summary
route or external route.

When a separate next hop structure can be destroyed, the traffic
should be transitioned gradually.  The next hop structure must be
marked for deletion.
The traffic share in this separate next hop structure should be
gradually changed so that it exactly matches the traffic share in the
more internal next hop structure.  The gradual change should follow
the adjustment rate schedule described in Section 4.1, where the move
increment is increased gradually as moves continue in the same
direction.  The only difference is that there is no need to overshoot
when adjusting to match the more internal next hop structure
parameters.  Once the separate next hop structure marked for deletion
matches the more internal next hop structure, the summary route or
external route can be changed to point to the more internal next hop
structure and the deletion can be made.

3.4 Critically loaded segment

For every set of paths, one link or part of the path is identified as
the ``critically loaded'' segment.  This is the part of the path with
the highest equivalent load as defined in Section 2.3.  For an
inter-area route with a separate next hop structure, the critically
loaded segment may be the critically loaded segment for the
intra-area set of paths, or it may be the summary LSA if the
equivalent load on the summary LSA is greater.  For an external route
with a separate next hop structure, the critically loaded segment may
be the critically loaded segment for the internal route or it may be
the external route if the equivalent load on the external route is
greater.  In considering loading reported for summary LSA or external
routes, the loading may be clamped to some configured ceiling (see
Appendix C.4).  If intra-area loading exceeds this ceiling, the
summary LSA loads or external route loads are in effect ignored.

Each next hop structure has exactly one ``critically loaded''
segment.  There may be more than one path in the next hop structure
sharing this critically loaded segment.
A particular Opaque LSA may be the critically loaded segment for no
next hop structures if it is lightly loaded.  Another Opaque LSA may
be the critically loaded segment for many next hop structures if it
is heavily loaded.

3.5 Optimizing Partial Paths

Under some circumstances multiple paths will exist to a destination
where all of the available paths share one or more links.  In some
cases overall system convergence time can be substantially improved
by optimizing a partial path when the most heavily loaded link is
shared by all available paths to a destination.

Computations are actually reduced when partial paths are considered.
The next hop structures kept within the routing process must contain
the full paths used to reach a destination (this is already a
requirement).  After an SPF calculation has changed the next hop
structure and before attempting any optimization, the set of paths is
examined looking for intra-area links which are common to all paths.
If any such links are found, only intra-area links closer than any of
these links can be considered as candidates for the ``critically
loaded'' segment (Section 3.4).  If there is only one immediate hop,
no attempt is made to load balance.

The change in load adjustment parameters should be applied to the
data structures for the full paths even though only a subset of the
links are eligible to be considered as the critically loaded segment.
For the purpose of building the type LSA_OMP_PATH_LOAD Opaque LSA,
loading along the entire path must be considered, including links
shared by all available paths.

4 Adjusting Equal Cost Path Loadings

Next hop structures are described in Section 3.  A next hop structure
contains a set of complete paths to a destination.
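The selection of the critically loaded segment (Section 3.4), with
the partial-path exclusion of Section 3.5 simplified to ``ignore
links common to every path,'' can be sketched as follows.  This is a
deliberate simplification: the full rule only admits links closer
than any shared link, and summary or external loading (with its
configured ceiling) is omitted here.

```python
def critical_segment(paths, equiv_load):
    """paths: list of paths, each a list of link identifiers.
    equiv_load: mapping from link identifier to equivalent load.
    Links shared by every path cannot be offloaded, so they are
    excluded; the remaining link with the highest equivalent load is
    the critically loaded segment, or None if nothing can move."""
    shared = set(paths[0])
    for p in paths[1:]:
        shared &= set(p)
    candidates = [link for p in paths for link in p if link not in shared]
    if not candidates:
        return None
    return max(candidates, key=lambda link: equiv_load[link])
```

With a single path every link is ``shared,'' so the function returns
None, matching the rule that no load balancing is attempted when
there is no alternative.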
Adjustments are made to a next hop structure to reflect differences in loading on the paths as reported by the type LSA_OMP_LINK_LOAD Opaque LSA and type LSA_OMP_PATH_LOAD Opaque LSA. Section 3.4 describes the selection of a ``critically loaded segment'' which is used to determine when to make adjustments and the size of the adjustments. Section 3.5 describes conditions under which some links are excluded from consideration as the ``critically loaded segment''.

An adjustment to the loading of a given set of equal cost paths is made when one of two conditions is true. Either the ``critically loaded segment'' has been reflooded, or a criterion is met involving 1) the difference between the equivalent load of the ``critically loaded segment'' and the lightest loaded path, 2) the equivalent load of the ``critically loaded segment'', 3) the type of destination, intra-area, inter-area, or external, and 4) the amount of time since the last load adjustment. The details of this conditional are described in Appendix B.

The reflooding algorithm is designed to be slightly less aggressive than the adjustment algorithm. This reduces the need to continuously flood small changes except in conditions of overload or substantial change in loading. Some overshoot may occur due to adjustments made in the absence of accurate knowledge of loading.

4.1 Load Adjustment Rate

In order to assure stability the rate of adjustment must be sufficiently limited. An adaptive adjustment rate method is used.

A ``critically loaded'' segment for a next hop structure is determined as described in Section 3.4.
When the type LSA_OMP_LINK_LOAD Opaque LSA or type LSA_OMP_PATH_LOAD Opaque LSA for this segment is updated, or the criteria in Appendix B are met, load is shed from all paths in the next hop structure that include that segment toward all paths in the next hop structure that do not include that segment. A separate set of variables controlling the rate of adjustment is kept for each path receiving load.

The number of paths usually exceeds the number of next hops. The distinction between paths which share a next hop is important if one of the paths sharing a next hop goes down (see Section 4.2). This distinction is only needed in making the computations. When moving the next hop structure into the data structures used for forwarding, paths which share a common next hop may be combined.

The following variables are kept for each path in a next hop structure.

1. The current ``traffic share'' (an integer, the range is 0 to 65535 for a CRC16 hash),

2. The current ``move increment'' used when moving traffic toward this path (an integer, the range is 0 to 65535 for a CRC16 hash),

3. The number of moves in the same direction, referred to as the ``move count''.

If there is no prior history for a path, then the move increment is initialized to a constant, typically about 1% (about 650 for CRC16). The number of moves in the same direction is initialized to 0. No loading adjustment is made on the first iteration.

If the critically loaded segment has changed, all paths now containing the critically loaded segment are first examined. The lowest move increment of any one of these paths is noted.

The move increment is adjusted for each path before any traffic is moved. One of the following actions is taken for each path.

1. If the path contains the critically loaded segment, its move increment is left unchanged.

2. If the path does not contain the critically loaded segment, but the critically loaded segment has changed and the path contains the prior critically loaded segment, then first the move increment is replaced with the lowest move increment from any of the paths containing the critically loaded segment, unless the move increment is already lower. Then, in either case, the move increment is cut in half.

3. If the path does not contain the critically loaded segment, and either the critically loaded segment has not changed or the path does not contain the prior critically loaded segment, then the move increment is increased.

The amount of increase in the move increment is described in Appendix B.4. The increase is designed to minimize the possibility of dramatic overshoot due to too great an increase in adjustment rate.

The move increment is never less than a configured minimum. The increase in move increment is never less than one but is generally constrained to a higher number by virtue of being calculated based on the prior move increment. The configured minimum for the move increment is typically 0.1% (65 for CRC16). The move increment is never allowed to exceed the size of the hash space divided by the number of equal cost paths in the next hop structure.

The dramatic decrease in move increment when the move direction is reversed and the slow increase in move increment when it remains in the same direction keep the algorithm stable. The exponential nature of the increase allows the algorithm to track externally caused changes in traffic loading.
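The halve-on-reversal, grow-in-the-same-direction behaviour described above can be sketched in C. This is a simplification for illustration, not the normative procedure in Appendix B.4: the growth divisor and the function name are assumptions, while the ~1% initial value, the ~0.1% floor, and the hash-space-divided-by-paths ceiling are the constants named in this section.

```c
#include <assert.h>

#define HASH_SPACE 65536  /* CRC16 hash space */
#define MIN_MOVE   65     /* ~0.1%, typical configured minimum */

/* Illustrative only: grow the move increment while moves continue in
 * the same direction, halve it when the direction reverses. */
int adjust_move_increment(int move_increment, int *move_count,
                          int direction_reversed, int npaths)
{
    if (direction_reversed) {
        move_increment /= 2;          /* dramatic decrease on reversal */
        *move_count = 0;
    } else {
        (*move_count)++;
        int inc = move_increment / 4; /* slow, exponential-style growth */
        if (inc < 1)
            inc = 1;                  /* increase is never less than one */
        move_increment += inc;
    }
    if (move_increment < MIN_MOVE)
        move_increment = MIN_MOVE;
    if (move_increment > HASH_SPACE / npaths)
        move_increment = HASH_SPACE / npaths; /* per-path ceiling */
    return move_increment;
}
```

Starting from the typical initial increment of about 650 (~1% of the CRC16 hash space), repeated moves in one direction grow the increment multiplicatively, while a single reversal halves it, which is the source of the stability argued for above.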
The traffic share allocated to a path not containing the critically loaded segment is incremented by the move amount for that path, and the traffic share allocated to the path or paths containing the critically loaded segment is reduced by this amount divided by the number of paths containing the critically loaded segment. This adjustment is described in pseudocode in Appendix B.4.

This adjustment process is repeated for each path in a next hop structure. The new hash space boundaries are then moved to the forwarding engine.

4.2 Dealing with Link Adjacency Changes

Link failures do occur for various reasons. OSPF routing will converge to a new set of paths. Whatever load balance had previously existed will be upset, and the load balancing will have to converge to a new load balanced state. Previous load balancing parameters should remain intact to the extent possible after the SPF calculation has completed. Adjustments for new or deleted paths in the SPF result are described here. These adjustments must be made after the best path criteria are relaxed as described in Section 3.1.

4.2.1 Impact of Link Adjacency Changes

Links which are intermittent may be the most harmful. The OSPF ``Hello'' protocol is inadequate for handling intermittent links. When such a link is up it may draw traffic during periods of high loss, even brief periods of complete loss.

The inadequacies of the OSPF ``Hello'' protocol are well known, and many implementations provide lower level protocol state information to OSPF to indicate a link in the ``down'' state. For example, indications may include carrier loss, excessive framing errors, unavailable seconds, or loss indications from PPP LQM.

Even where the use of a link is avoided by providing indication of lower level link availability, intermittent links are still problematic.
During a brief period immediately after a link state attribute is initially flooded, OSPF state can be inconsistent among routers within the OSPF area. This inconsistency can cause intermittent routing loops and have a severe short term impact on link loading. An oscillating link can cause high levels of loss and is generally better off held in the neighbor adjacency ``down'' state. The algorithm described in [7] can be used when advertising OSPF type 1 or type 2 LSAs (router and network LSAs).

Regardless of whether router and network LSAs are damped, neighbor adjacency state changes will occur and router and network LSAs will have to be handled. The LSA may indicate an up transition or a down transition. In either an up or down transition, when the SPF algorithm is applied, existing paths to specific destinations may no longer be usable and new paths may become usable. In the case of an up transition, some paths may no longer be usable because their cost is no longer among those tied for the best. In the case of a down transition, new paths may become usable because they are now the best paths still available.

4.2.2 Handling the Loss of Paths

When a path becomes unusable, paths which previously had the same cost may remain. This can only occur on an LSA down transition. A new next hop entry should be created in which the proportion of source/destination hash space allocated to the now infeasible path is distributed to the remaining paths proportionally to their prior allocation. Very high loading percentages should result, triggering an increase in the LSA_OMP_LINK_LOAD Opaque LSA flooding rate until convergence is approached.

4.2.3 Handling the Addition of Paths

When a new path becomes usable it may be tied for best with paths carrying existing traffic. This can only occur on an LSA up transition.
A new next hop entry should be created in which the loading on the new path is zero. If such a path were to oscillate, little or no load would be affected. If the path remains usable, the shift of load to this path will accelerate until a balance is reached.

If a completely new set of best paths becomes available, the load should be split across the available paths. The split used in simulations was a share on a given link proportional to 10% of link capacity plus the remaining link bandwidth as determined by prior LSA_OMP_LINK_LOAD Opaque LSA values. The contribution of link capacity in the weighting should be configurable. See Appendix C.5.

Acknowledgements

Numerous individuals have provided valuable comments regarding this work. Dave Ward made a very substantial contribution by pointing out that the best path criteria could be relaxed. Geoffrey Cristallo provided comments on the handling of inter-area and external routes with worked examples which resulted in corrections and clarifications to this document. John Scudder, Tony Li, and Daniel Awduche have also provided particularly valuable review and comments.

References

[1] R. Coltun. The OSPF opaque LSA option. Technical Report RFC 2370, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2370.txt.

[2] Atul Khanna and John Zinky. The revised ARPAnet routing metric. In SIGCOMM Symposium on Communications Architectures and Protocols, pages 45--56, Austin, Texas, September 1989. ACM.

[3] Steven H. Low and P. Varaiya. Dynamic behavior of a class of adaptive routing protocols (IGRP). In Proceedings of the Conference on Computer Communications (IEEE Infocom), pages 610--616, March/April 1993.

[4] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic behavior of the TCP congestion avoidance algorithm. ACM Computer Communication Review, 27(3), July 1997.
[5] J. Moy. OSPF version 2. Technical Report RFC 2328, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2328.txt.

[6] W. Stevens. TCP slow start, congestion avoidance, fast retransmit, and fast recovery algorithms. Technical Report RFC 2001, Internet Engineering Task Force, 1997. ftp://ftp.isi.edu/in-notes/rfc2001.txt.

[7] C. Villamizar, R. Chandra, and R. Govindan. BGP route flap damping. Technical Report RFC 2439, Internet Engineering Task Force, 1998. ftp://ftp.isi.edu/in-notes/rfc2439.txt.

Security Considerations

The use of the Opaque LSAs described in this document does not impact the security of OSPF deployments. In deployments which use a strong OSPF authentication method and require signatures on LSAs from the originating router, no leveraging of a partial compromise beyond a localized disruption of service is possible. In deployments which use a strong OSPF authentication method but do not require signatures on LSAs from the originating router, compromise of a single router can be leveraged to cause significant disruption of service with or without the use of these Opaque LSAs, but disruption of service cannot be achieved without such a compromise. In deployments which use a weak OSPF authentication method, significant disruption of service can be caused through forged protocol interactions with or without the use of these Opaque LSAs.

Author's Addresses

Curtis Villamizar
UUNET Network Architecture Group

Full Copyright Statement

Copyright (C) The Internet Society (February 24, 1999). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an ``AS IS'' basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
A Data Formats

    +----------------+----------------+----------------+----------------+
    |  Link State Advertisement Age   |    Options     |    LSA Type    |
    +----------------+----------------+----------------+----------------+
    |  Opaque Type   |                    Opaque ID                     |
    +----------------+----------------+----------------+----------------+
    |                        Advertising Router                         |
    +----------------+----------------+----------------+----------------+
    |             Link State Advertisement Sequence Number              |
    +----------------+----------------+----------------+----------------+
    |          LSA Checksum           |            LSA length           |
    +----------------+----------------+----------------+----------------+
    |    Version     | Reference Type | Packing Method |    BW Scale    |
    +----------------+----------------+----------------+----------------+
    |      Reference to a Type 1-5 LSA (32 or 64 bits, see below)       |
    +----------------+----------------+----------------+----------------+
    |      Packed Loading Information (variable length, see below)      |
    +----------------+----------------+----------------+----------------+

The ``LSA Age'', ``Options'', and ``LSA Type'' are part of the Link State Advertisement format described in Appendix A of RFC--2328. The LSA Type is 10, signifying an Opaque LSA with area-local scope, as defined in RFC--2370. RFC--2370 splits the Link State ID field into two parts: Opaque Type and Opaque ID. The Opaque Type contains either the value LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD as described in Section 2. The ``Advertising Router'', ``Link State Advertisement Sequence Number'', ``LSA Checksum'', and ``LSA length'' are part of the Link State Advertisement format described in Appendix A of RFC--2328. The remainder of the packet is specific to Opaque Type LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD LSAs.

Opaque Type  The Opaque Type will contain the value LSA_OMP_LINK_LOAD or LSA_OMP_PATH_LOAD as described in Section 2.
Numeric values will be requested from IANA.

Opaque ID  The Opaque ID will contain an integer which will be unique per router and interface, virtual interface, or MAC address for which loading is reported. These numbers are of significance only to the advertising router, except as a means of identification of subsequent LSAs.

    For an LSA_OMP_LINK_LOAD Opaque LSA, the ``Opaque ID'' must contain a 24 bit integer that is unique to the link or virtual link. The method of assignment of these 24 bit integers is a local matter. A router must be capable of uniquely identifying an interface using a 24 bit number.

    For an LSA_OMP_PATH_LOAD Opaque LSA, the ``Opaque ID'' must contain a 24 bit integer that is unique to a summary LSA or AS-external LSA advertised by the same router. The method of assignment of these 24 bit integers is a local matter. A router must be capable of uniquely identifying a summary LSA or AS-external LSA using a 24 bit number.

Version  The version number is 1.

Reference Type  The Reference Type indicates the type of LSA that is being referenced in the ``Reference to a Type 1-5 LSA'' field.

Packing Method  The Packing Method is an integer that indicates how the ``Packed Loading Information'' field is formatted.

BW Scale  Bandwidth numbers must be scaled by shifting the 32 bit integer left by this amount. If this value is non-zero, 64 bit integers or larger should be used to represent the bandwidth.

Reference to a Type 1-5 LSA  This field contains the 32 bit ``Link State ID'' field of a Type 1-5 LSA. Types 1-5 indicate:

    1. Router-LSAs
    2. Network-LSAs
    3. Summary-LSAs (IP network)
    4. Summary-LSAs (ASBR)
    5. AS-external-LSAs

    Loading information for a Type 1 LSA, a Router-LSA, is sent as an LSA_OMP_LINK_LOAD Opaque LSA. For a Type 1 LSA the ``Link State ID'' field is followed by a 32 bit ``Link ID''. This identifies a single link.
There are four types of links.

    1. Point-to-point connection to another router
    2. Connection to a transit network
    3. Connection to a stub network
    4. Virtual link

    Normally loading information is provided for a Link Type 1. Loading information may also be provided for a Link Type 2 or 3. Loading information cannot be provided for a Link Type 4.

    Loading information is not provided for Type 2 LSAs, Network-LSAs.

    Loading information for Type 3 and Type 4 LSAs, Summary-LSAs for IP networks in another area and Summary-LSAs for ASBRs, is sent as an LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For a Type 3 or Type 4 LSA there is no information in the Reference following the ``Link State ID''.

    Loading information for Type 5 LSAs, AS-external-LSAs, is sent as an LSA_OMP_PATH_LOAD Opaque LSA as described in Section 2.2. For a Type 5 LSA there is no information in the Reference following the ``Link State ID''.

Packed Loading Information  The format of the Packed Loading Information depends on the value of the Packing Method field. Currently only the value 1 is defined.

The following format is used when the Packing Method field contains 1. The LSA must be ignored if a value other than 1 is found in Packing Method.
    +----------------+----------------+----------------+----------------+
    |           In Scaled Link Capacity in kilobits per second          |
    +----------------+----------------+----------------+----------------+
    |    In Link Loading Fraction     | In Link Drop Fraction (packets) |
    +----------------+----------------+----------------+----------------+
    |          Out Scaled Link Capacity in kilobits per second          |
    +----------------+----------------+----------------+----------------+
    |   Out Link Loading Fraction     | Out Link Drop Fraction (packets)|
    +----------------+----------------+----------------+----------------+

The Scaled Link Capacity is an unsigned integer in kilobits per second. If this would be larger than a 32 bit integer, the value should be shifted to the right until the top bit is in the MSB of the 32 bit number, and the number of bit shifts recorded in the BW Scale field.

The Link Loading Fraction is a 16 bit unsigned integer from 0 to 0xffff representing the capacity in bytes being used relative to the capacity in bytes per measurement interval. The hex number 0x10000 would represent unity; if full loading has been achieved, or greater than full loading is computed due to counting or truncation error, 0xffff is substituted. The Link Drop Fraction is a 16 bit unsigned integer from 0 to 0xffff representing the number of packets dropped relative to the number of packets received. This value can be derived from the change in two MIB-2 counters: (ifOutDiscard<<16)/ifInPacket. The hex number 0x10000 would represent unity (all of the packets being dropped), so 0xffff must be substituted.

B Concise Statement of the Algorithms

An OSPF router may play one of two roles, or both. The two functions are flooding loading information and load balancing. Interior routers and edge routers will flood loading information.
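One concrete piece of the flooding role is converting counter deltas into the 16 bit fractions carried in the packed loading information above, substituting 0xffff where the ratio reaches or exceeds unity (0x10000). A minimal sketch, assuming 64 bit counter deltas; the helper name is an illustration, not part of this specification.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: convert a (used, total) counter-delta pair
 * into the 16 bit fraction carried in the packed loading
 * information. 0x10000 would represent unity, so values at or above
 * unity are clamped to 0xffff as the format requires. */
uint16_t load_fraction16(uint64_t used, uint64_t total)
{
    uint64_t f;

    if (total == 0)
        return 0;              /* nothing measured this interval */
    f = (used << 16) / total;  /* 64 bit math avoids overflow for
                                  15 second deltas */
    if (f > 0xffff)
        f = 0xffff;
    return (uint16_t)f;
}
```

The same computation applies to both the loading fraction (byte deltas against capacity per interval) and the drop fraction (drop deltas against packet deltas), per the counter pairs named above.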
A router may choose not to flood information if it is believed that there is no way that the interface could become congested, or if it has no way to measure the load, as is the case on a shared broadcast interface. An ingress or interior router will process loading information and, if it has equal cost paths, will balance load across those paths.

The description of the algorithms is broken down into the following subsections.

    Flooding Loading Information     Appendix B.1
    Building Next Hop Structures     Appendix B.2
    Processing Loading Information   Appendix B.3
    Adjusting Loading                Appendix B.4

The algorithms are defined in the following sections in pseudocode.

B.1 Flooding Loading Information

It is assumed that counters are large enough to avoid multiple overflow (i.e., 64 bit counters are used for high speed interfaces) and that counter overflow is easily detected and compensated for in counter deltas. It is assumed that ifInDiscard and ifOutDiscard accurately count all queueing drops.

The following counters are sampled at 15 second intervals: ifInOctets, ifOutOctets, ifInPacket, ifOutPacket, ifInDiscard and ifOutDiscard. The values of ifInSpeed and ifOutSpeed are assumed to be constant. Some state must be stored. The previously used value of each raw counter is needed to compute deltas. The state variables InFilteredUtil, OutFilteredUtil, InLoss, OutLoss, InEquivLoad and OutEquivLoad must be saved. The last time a reflooding occurred must also be stored.

The input and output utilizations are expressed as fractions using ifInOctets, ifOutOctets, ifInSpeed, and ifOutSpeed. Call the raw 15 second fractional utilizations InRawUtil and OutRawUtil. Compute the following filtered values for both In and Out, making sure to save the previous values.
    PrevFilteredUtil = FilteredUtil;
    if (RawUtil < FilteredUtil) {
        FilteredUtil -= (FilteredUtil >> 3);
        FilteredUtil += (RawUtil >> 3);
    } else if (RawUtil > FilteredUtil) {
        FilteredUtil -= (FilteredUtil >> 1);
        FilteredUtil += (RawUtil >> 1);
    }

InLoss and OutLoss are computed using the ratio of the deltas of the Discard to Packet SNMP counters. Next compute an ``equivalent loading'' for the purpose of determining whether to reflood.

    PrevEquivLoad = EquivLoad;
    if (Loss < 0.005) {
        EquivLoad = FilteredUtil;
    } else {
        if (Loss <= 0.09) {
            LossComp = 10 * sqrt(Loss);
        } else {
            LossComp = 3;
        }
        EquivLoad = FilteredUtil * LossComp;
    }

A square root is somewhat time consuming to compute, so a table lookup can be done to avoid this computation. Increments of 0.1% loss would yield a 90 entry table. A 128-512 entry table would be adequate. The table can be sized so a shift and mask can be used to index it. The computation could then be done with a table lookup, a shift, and an integer multiply. At most this needs to be done for links with nonzero loss once every 15 seconds.

Then decide whether to flood based on the greater of the relative change in InEquivLoad or OutEquivLoad and on the time elapsed since the last reflooding (taking care not to divide by zero).
    Diff = max(abs(InEquivLoad - InPrevEquivLoad)
                   / InPrevEquivLoad,
               abs(OutEquivLoad - OutPrevEquivLoad)
                   / OutPrevEquivLoad);
    Load = max(InEquivLoad, OutEquivLoad);
    Elapsed = Now - LastReflood;
    if (((Load > 1.00) && (Diff > 0.05) && (Elapsed >= 30))
        || ((Load > 1.00) && (Diff > 0.02) && (Elapsed >= 60))
        || ((Load > 1.00) && (Diff > 0.01) && (Elapsed >= 90))
        || ((Load > 1.00) && (Elapsed >= 180))
        || ((Load > 0.90) && (Diff > 0.05) && (Elapsed >= 60))
        || ((Load > 0.90) && (Diff > 0.02) && (Elapsed >= 240))
        || ((Load > 0.90) && (Diff > 0.01) && (Elapsed >= 480))
        || ((Load > 0.90) && (Elapsed >= 600))
        || ((Load > 0.70) && (Diff > 0.10) && (Elapsed >= 60))
        || ((Load > 0.70) && (Diff > 0.05) && (Elapsed >= 120))
        || ((Load > 0.70) && (Diff > 0.02) && (Elapsed >= 480))
        || ((Load > 0.70) && (Elapsed >= 900))
        || ((Load > 0.50) && (Diff > 0.10) && (Elapsed >= 60))
        || ((Load > 0.50) && (Diff > 0.05) && (Elapsed >= 300))
        || ((Load > 0.25) && (Diff > 0.25) && (Elapsed >= 120))
        || ((Load > 0.25) && (Elapsed >= 1200))
       ) {
        /* we passed one of the tests so flood it */
        LastReflood = Now;
        ...

If the decision is made to reflood an LSA according to the test above, then the input and output FilteredUtil and Loss must be packed into an LSA and flooded.

The 15 second timer interval should be jittered by a random number in the range of plus or minus 5 seconds (obviously taking the actual time interval into account in computing rates).

B.2 Building Next Hop Structures

The OSPF SPF calculation is done as per RFC--2328. Minor differences in the outcome result from relaxing the best path criteria as described in Section 3.1. The modification to the SPF algorithm is described in Appendix D. The arrival of loading information does not require new SPF calculations since neither reachability nor costs are changed.
The SPF calculation yields the shortest paths from the given node to all other routers and networks in the OSPF area. In some cases multipaths will already exist. For all destinations, every feasible next hop is examined, and paths through next hops that simply provide an improvement are added, as described in Section 3.1.

After completing the SPF calculation and relaxing the best path criteria, intra-area multipath sets are recorded as next hop structures. If a previous SPF had been in use, loadings are transferred to the new set of data structures, and links are added or removed as needed as described in Section 4.2.

After recording the intra-area next hop structures, the existing set of inter-area next hop structures and the set of external route next hop structures are examined. Paths are added to or removed from next hop structures as needed, as described in Section 3, Section 3.3, and Section 4.2.

Inter-area and external routes map onto the intra-area routing. These therefore share the same set of paths and the same next hop structure as the intra-area route to the nearest ABR or ASBR. Next hop structures may be needed to reach any one of a set of ABRs or ASBRs if 1) there are multiple ABRs equally distant to a summary route, 2) there are multiple equally distant ASBRs advertising an external route at the same cost, or 3) relaxing the best path criteria for an intra-area destination results in going to a next hop which does not share the same closest ABR or ASBR.

Next hop structures may also be needed to offload paths in adjacent areas or external paths. The following conditional is used to determine whether a next hop structure should be added for a Summary LSA.

    if (IntraAreaLoad < 85%
        && SummaryLSALoad > 90%
        && SummaryLSALoad - IntraAreaLoad > 15%) {
        /* add a next hop structure */
        ...
The conditional for an external route is the same, except that the intra-area load would be a more internal load, either intra-area or Summary LSA, and the 90% threshold is increased to 95%.

The following conditional is used to determine whether an existing separate next hop structure for a Summary LSA or external route should be deleted.

    if (MoreInternalLoad > 98%
        || MoreInternalLoad - MoreExternalLoad > 20%) {
        /* delete a next hop structure */
        ...

B.3 Processing Loading Information

Adjustments to loading may be triggered by one of two events. When a new loading LSA is received, if the LSA corresponds to the most heavily loaded link for a next hop structure, then the next hop structure should be readjusted immediately. In addition, the time at which each next hop structure was last readjusted must be maintained, and next hop structures must be periodically readjusted. Timer events are handled as follows.

    foreach NextHopStruct ( AllNextHopStructs ) {
        Elapsed = Now - LastReadjust[NextHopStruct];
        MinLoaded = MinEquivLoad(NextHopStruct);
        MaxLoaded = MaxEquivLoad(NextHopStruct);
        AbsDiff = MaxLoaded - MinLoaded;
        if (((Elapsed >= 60)
             && (AbsDiff > 0.045) && (MaxLoaded > 0.95))
            || ((Elapsed >= 90)
                && (AbsDiff > 0.03) && (MaxLoaded > 0.95))
            || ((Elapsed >= 120)
                && (AbsDiff > 0.01) && (MaxLoaded > 0.97))
            || ((Elapsed >= 240)
                && (AbsDiff > 0.005) && (MaxLoaded > 0.98))
            || ((Elapsed >= 90)
                && (AbsDiff > 0.05) && (MaxLoaded > 0.90))
            || ((Elapsed >= 120)
                && (AbsDiff > 0.03) && (MaxLoaded > 0.90))
            || ((Elapsed >= 180)
                && (AbsDiff > 0.01) && (MaxLoaded > 0.90))
            || (Elapsed >= 300)) {
            /* we need to readjust this multipath set */
            ...

This loop and conditional result in a subset of the available next hop structures being adjusted based on the timer.
The same effect may be accomplished by determining when a next hop structure will need to be adjusted if no further flooding changes arrive, and queueing next hop structures on lists according to how long they will remain idle.

B.4 Adjusting Loading

A next hop structure will need to be adjusted when one of the two criteria in the prior section is met. The adjustment procedure itself relies upon the following stored state.

    NextHopStruct {
        LastReadjust;
        PrevCriticalSegment;
        TotalPaths;
        SetofPaths (
            Path;
            bit HasCriticalSegment, HasPrevCriticalSeg;
            TrafficShare;
            MoveIncrement;
            MoveCount;
        );
    };

Before the path move increments are adjusted, a preliminary pass is made over the next hop structure. This pass notes which paths contain the prior critical segment, notes which paths contain the current critical segment, and counts the number of paths containing the current critical segment.

    NumberWithCritical = 0;
    MinRateWithCritical = 65536;
    foreach Path ( SetofPaths ) {
        SetOrClear HasCriticalSegment;
        SetOrClear HasPrevCriticalSeg;
        if (HasCriticalSegment) {
            ++NumberWithCritical;
            if (MoveIncrement < MinRateWithCritical)
                MinRateWithCritical = MoveIncrement;
        }
    }

Next the move increment for each path is adjusted.
   foreach Path ( SetofPaths ) {
       if (HasCriticalSegment)
           continue;
       if (!HasPrevCriticalSeg) {
           ++MoveCount;
           if (MoveCount > 4) {
               Increase = MoveIncrement
                   / (2 * (1 + NumberWithCritical));
           } else {
               Increase = MoveIncrement
                   / (4 * (1 + NumberWithCritical));
           }
           if (Increase < 1)
               Increase = 1;
           MoveIncrement += Increase;
       } else {
           if (MoveIncrement > MinRateWithCritical)
               MoveIncrement = MinRateWithCritical;
           MoveIncrement /= 2;
           MoveCount = 0;
       }
       if (MoveIncrement < MinMoveIncrement)
           MoveIncrement = MinMoveIncrement;
       if (MoveIncrement > 65535)
           MoveIncrement = 65535;
   }

Then traffic share is adjusted.

   foreach Path1 ( SetofPaths ) {
       if (!Path1.HasCriticalSegment)
           continue;
       foreach Path2 ( SetofPaths ) {
           if (Path2.HasCriticalSegment)
               continue;
           Move = Path2.MoveIncrement / NumberWithCritical;
           if (Move < 1)
               Move = 1;
           if (Move > (65536 - Path2.TrafficShare)) {
               Move = 65536 - Path2.TrafficShare;
               Path2.MoveIncrement = Move;
           }
           if (Move > Path1.TrafficShare)
               Move = Path1.TrafficShare;
           Path2.TrafficShare += Move;
           Path1.TrafficShare -= Move;
       }
   }

The traffic shares for paths sharing a common next hop are then
summed and the appropriate information is transferred to the
forwarding data structures.

C Configuration Options

Many of the capabilities described here must be configurable.  The
ability to enable and disable capability subsets is needed.  Many
parameters used by the algorithm should also be configurable.

C.1 Capability Subsets

There should at least be the ability to enable and disable the
following.
   default  description of capability
   ON       Flooding any loading information
   ON       Flooding loading information for specific links
   -        Relaxing best path criteria
   -        Adjusting traffic shares (default to even split)
   OFF      Flooding loading information for Summary LSA
   OFF      Flooding loading information for specific Summary LSA
   OFF      Flooding loading information for external routes
   OFF      Flooding loading information for specific external routes
   OFF      Considering loading information for Summary LSA
   OFF      Considering loading information for specific Summary LSA
   OFF      Considering loading information for external routes
   OFF      Considering loading information for specific external
            routes

Flooding and considering Summary LSA and external route loading
information should be disabled by default.  Flooding loading
information should be enabled by default.  Relaxing best path
criteria and adjusting traffic shares could be enabled or disabled by
default, depending on vendor preference.

C.2 Parameters for Equivalent Load Calculation

The following values affect the computation of equivalent load.

   default  description of parameter
   10       The value of K in ``equiv-load = load * K * sqrt(loss)''
   0.5%     The minimum loss rate at which to compensate for loss
   9%       The maximum loss rate above which compensation is fixed

C.3 Parameters for Relaxing the Best Path Criteria

The following parameter affects the number of next hops and paths
added as a result of relaxing the best path criteria.  For example,
increasing the metric difference to 2 would require the next hop to
be a metric of ``2'' closer than the current distance to the
destination, and would reduce the number of paths added.
   default  description of parameter
   1        The metric difference required to relax the best path
            criteria

C.4 Parameters for Loading Outside of the OSPF Area

The following parameters affect the creation of separate next hop
structures to compensate for loading on Summary LSA and external
routes when those loadings are high and intra-AS loadings are
substantially lower.

   default  description of parameter
   15%      The loading difference required to consider a more
            external load over a more internal load
   85%      The maximum internal loading at which a more external
            load will become eligible for consideration
   90%      The minimum loading at which a Summary LSA will be
            considered over an intra-area loading
   95%      The minimum loading at which an external route will be
            considered over an intra-area loading
   20%      The load difference at which an external load will be
            removed from consideration due to being well under the
            internal load
   94%      The maximum value used in place of loading for a
            Summary LSA when performing traffic share adjustment
   98%      The internal loading at which a Summary LSA will be
            removed from consideration over the internal load
   90%      The maximum value used in place of loading for an
            external route when performing traffic share adjustment
   98%      The internal loading at which an external route will be
            removed from consideration over the internal load

Limiting the compensation made to accommodate external loading is
consistent with the reason for considering external routes.  Rarely
does a business go out of its way to improve the performance of a
competitor's product, a network service or otherwise.  Avoiding
congestion in a peer's network does one's own customers a favor by
not sending their traffic into known areas of congestion, but only if
it does not significantly impact congestion in one's own network.
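The deletion half of this hysteresis (the conditional given at the
end of Appendix B.2) can be sketched as a small predicate.  Loads are
expressed as fractions of capacity; the function name is illustrative
and not from this specification:

```c
#include <stdbool.h>

/* Hysteresis for removing a separate next hop structure created for
 * a Summary LSA or external route: delete it when the more internal
 * loading exceeds 98%, or when the more internal loading exceeds the
 * more external loading by more than 20%.  Loads are fractions of
 * capacity in the range 0.0 to 1.0. */
static bool delete_separate_nexthop(double more_internal,
                                    double more_external)
{
    return more_internal > 0.98
        || (more_internal - more_external) > 0.20;
}
```

For example, with internal loading 0.90 against external loading 0.85
the structure is kept, but against external loading 0.60 it is
deleted, since the external congestion is no longer the dominant
concern.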
Limiting the compensation for Summary LSA loading and external route
loading avoids triggering the hysteresis criteria, where a separate
next hop structure is deleted if an internal loading exceeds a fixed
threshold.  In effect, the Summary LSA loading and external route
loading are ignored if internal loadings exceed a given threshold,
since the Summary LSA loading or external route loading will no
longer be considered the critical segment.  If internal loading
reaches a point where, even with load balancing, internal paths
exceed the higher threshold, the next hop structure will be removed.

C.5 Parameters for Initial Loading of Paths

When determining the initial loading on a new set of paths, where the
destination was previously unreachable or none of the previous paths
appear in the new next hop structure, the following weighting is
used.

   weight = (link-capacity * 0.10)
            + (link-capacity * (1 - utilization))

The contribution of link capacity in the weighting should be
configurable.

   default  description of parameter
   10%      The fraction of total link capacity to consider in
            addition to the reported unused link capacity

C.6 Parameters associated with Flooding and Traffic Share

Parameters associated with the flooding rate, move increment, and
traffic share adjustment should not be configurable by the user, or
should be well hidden so as to be accessible only to developers.
Adjustment of these parameters can slow convergence or increase
overshoot.  If the parameters are sufficiently altered, instability
could result.  While it is likely that some improvements could result
from further tuning, experimentation on live networks is not
encouraged.

D Modified SPF Calculation

The most common implementation of the SPF calculation is Dijkstra's
algorithm.
Most implementations do not yield the full path as a consequence of
the SPF calculation.  Retaining the full path as the algorithm
proceeds is a relatively minor modification.

If the best path criteria are relaxed, the information obtained from
a single Dijkstra calculation is insufficient.  Dijkstra's algorithm
provides a very efficient single-source shortest path calculation.
For the relaxed best path criteria, the cost to any destination from
any immediately adjacent node is needed in addition to the set of
best paths and costs from the current node.

It is believed to be more efficient to compute an SPF using
Dijkstra's algorithm from the standpoint of each adjacent node, in
addition to an SPF from the current node, than to use an algorithm
that computes the costs from any node to any other node.  The former
runs in order N squared, while algorithms to accomplish the latter
run in order N cubed, where N is the number of nodes in the graph.
The amount of computation would be expected to be about equal in the
case where all nodes are immediately adjacent to the current node.

There are likely to be more efficient methods of computing the costs
from a subset of nodes to all destinations than either using multiple
Dijkstra calculations or computing the costs from all nodes to all
others and only making use of a subset of the results.  This
efficiency consideration is left as an exercise for the implementor.

E Algorithm Performance

A number of simulations have been performed to test the OSPF-OMP
algorithms.  In these simulations the algorithm has been demonstrated
to be very stable.  This work has not been formally published yet but
is currently available at http://engr.ans.net/ospf-omp.

The simulations done to date have only modeled the behavior of load
balancing intra-area routes.
This would be applicable to networks in which external routing was
mapped onto IGP routing with a single OSPF area.  Passing loading
information between areas, allowing loading in one area to affect an
adjacent area, has not been simulated.  Similarly, passing loading
information with external routes and affecting loading in a peer AS
has not been simulated.

E.1 Conversion from Loss to Equivalent Load

The current adjustment for loss in the equivalent load is based on
the premise that traffic is dominated by TCP and that TCP flows
sufficiently unconstrained by other bottlenecks and of sufficient
duration exist to keep the aggregate traffic flow elastic.  This is
considered a very safe assumption for Internet traffic.  Enterprise
networks may have a higher contribution of incompressible traffic
(traffic not conforming to a form of congestion avoidance, such as
the TCP congestion avoidance described in RFC 2001).

The assumed relationship between packet loss and link utilization is
based on the work of Mathis et al [4].  The constants in this
relationship cannot be determined, as they depend on the
delay-bandwidth product of TCP flows, the number and duration of TCP
flows, and whether TCP flows are severely constrained elsewhere.

The objective is to estimate the offered load, which cannot be
measured directly when links are at full utilization, using the link
utilization and loss.  The load adjustment algorithm should remain
stable as long as the first derivative of the estimator over offered
load remains positive.  If the first derivative is negative within
some region, then oscillation will occur within that range of
operation.  The greatest risk of this occurring is in routers where
active queue management is not used (i.e., where drop-tail queues are
used) and in particular where buffering is too small.
In such cases, as offered load increases just beyond full
utilization, loss increases somewhat, but utilization can drop
substantially (typically to about 90%) as offered load increases.  In
this region, as the offered load increases, the estimator of offered
load may decrease, causing the link to appear less loaded than
another.  The rather aggressive compensation for loss is intended to
ensure that this effect either does not occur, or occurs only within
a very narrow range of offered load just over full utilization.  If
the derivative is negative within a narrow range, oscillations can
occur only within that range, and the oscillations are well bounded.

E.2 Performance as traffic loads change

This work has considerable advantages over other approaches,
particularly traffic engineering approaches that involve adjustment
of virtual circuit layouts based on historic traffic figures.  The
advantage is the ability to adjust loading gradually as the actual
offered traffic deviates from that expected.  The advantage is even
greater when compared to dynamic virtual circuit layouts, using PNNI
for example, since these algorithms have proven to often converge on
very suboptimal layouts.

Simulations demonstrating behavior in these particular cases can be
found at http://engr.ans.net/ospf-omp/ramp.html.

E.3 Convergence after a major perturbation

Simulations have been performed in which link failures are examined.
Without relaxing the best path criteria, OSPF-OMP is quite dependent
on the set of link metrics to create a set of equal cost paths that
will allow the most heavily loaded portions of the topology to be
offloaded.  When links fail, the set of metrics is often far from
ideal for the remaining topology.  Relaxing the best path criteria
significantly improves the response to link failure.
Simulations are available at http://engr.ans.net/ospf-omp/fail.html.

F Examples

Figure 2 provides a simple topology.  For the purpose of
illustrating how OMP works, only the traffic flow from left to right
between a few pairs of dominant traffic contributors is considered.

The traffic mapped onto the topology in Figure 2 is dominated by the
ingress nodes A and F and the egress nodes E and G.  The capacity of
each link is 1, except link E-G, which has a capacity of 2.  The
loads contributed by the ingress-egress pairs A-E, F-G, and F-E are
each 0.5.  The node pair A-G contributes a load of 1.  Link costs are
all 2, except F-D, which is 3, and F-G, which is 6.

        cost=2        cost=2        cost=2        cost=2
   A ----------> B ----------> C ----------> E ----------> G
   |  load=1.5   |  load=0.5     load=0.5  ^   load<25%  ^
   |             |                         |             |
   |             |   cost=2       cost=2   |             |
   |             \-----------> D ----------/             |
   |              load=0.5     ^  load=1.0               |
   |                           |                         |
   |                  cost=3   |  load=0.5               |
   |      cost=2               |             cost=6      |
   \-------------------------> F ------------------------/
            load=0.5                 load=1.0

                 Figure 2: A Simple Example

If ECMP were used, all the traffic from F to E would take the path
F-D-E.  All the traffic from F to G would take the link F-G.  All
the traffic from A to E would take the hop from A to B, and then at B
it would be split evenly between the paths B-C-E and B-D-E.  Half of
the traffic from A to G would take the hop A-B and half would take
the hop A-F.
   ingress-egress   load   path
   F-E              0.5    0.50  F-D-E
   F-G              0.5    0.50  F-G
   A-E              0.5    0.25  A-B-C-E
                           0.25  A-B-D-E
   A-G              1.0    0.33  A-B-C-E-G
                           0.33  A-B-D-E-G
                           0.33  A-F-G

   link   loading   status
   A-B    1.16      overloaded
   A-F    0.33      underutilized
   B-C    0.58      underutilized
   B-D    0.58      underutilized
   C-E    0.58      underutilized
   D-E    1.08      overloaded
   E-G    0.66      underutilized
   F-D    0.50      underutilized
   F-G    0.83      near capacity

Above is the initial loading for OMP, which differs slightly from
ECMP.  With ECMP, half the traffic from A to G would take link A-F,
where OMP starts out with one third of the A-G traffic on link A-F.

Note that using OMP the path F-D-E-G, with cost 7, is considered
close enough to equal to the path F-G, with cost 6.  This is because
the next hop D is closer to G, with a cost of 4, than F is, with a
cost of 6.  Initially node F would not move load over, because link
D-E, at a loading of 1.08, is in worse shape than link F-G at a
loading of 0.83.

For illustrative purposes, three opportunities for moving load are
considered separately.  These are shown in Figure 3, Figure 4, and
Figure 5.

        cost=2        cost=2        cost=2        cost=2
   A ----------> B ----------> C ----------> E ----------> G
   |  load=1.25  |  load=.75     load=.75  ^   load<25%  ^
   |             |                         |             |
   |             |   cost=2       cost=2   |             |
   |             \-----------> D ----------/             |
   |              load=.25     ^  load=.75               |
   |                           |                         |
   |                  cost=3   |  load=0.5               |
   |      cost=2               |             cost=6      |
   \-------------------------> F ------------------------/
            load=0.75                load=1.25

         Figure 3: First Opportunity for Load Adjustment

Node B has a clear opportunity to reduce the load on the link D-E by
moving traffic from the next hop D to the next hop C.  If this
optimization were applied alone, utilizations on the links B-C, C-E,
and D-E would approach 0.75 and utilization on the link B-D would
approach 0.25.  This is illustrated in Figure 3.
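The initial link loadings in the table above follow directly from
summing the per-path traffic contributions over each link.  A small
sketch in C; the topology and share values are from this example,
while the structure and function names are local to the sketch:

```c
#include <string.h>
#include <stddef.h>

/* A path is a traffic share plus the NULL-terminated list of links
 * it traverses.  The shares are the initial OMP shares from the
 * table above; the A-G traffic is split three ways. */
struct path { double share; const char *links[6]; };

static const struct path initial[] = {
    { 0.50,    { "F-D", "D-E", NULL } },                /* F-E */
    { 0.50,    { "F-G", NULL } },                       /* F-G */
    { 0.25,    { "A-B", "B-C", "C-E", NULL } },         /* A-E */
    { 0.25,    { "A-B", "B-D", "D-E", NULL } },         /* A-E */
    { 1.0 / 3, { "A-B", "B-C", "C-E", "E-G", NULL } },  /* A-G */
    { 1.0 / 3, { "A-B", "B-D", "D-E", "E-G", NULL } },  /* A-G */
    { 1.0 / 3, { "A-F", "F-G", NULL } },                /* A-G */
};

/* Sum the shares of every path that traverses the given link. */
static double link_load(const char *link)
{
    double load = 0.0;
    for (size_t i = 0; i < sizeof initial / sizeof initial[0]; i++)
        for (int j = 0; initial[i].links[j] != NULL; j++)
            if (strcmp(initial[i].links[j], link) == 0)
                load += initial[i].share;
    return load;
}
```

This reproduces the table: link A-B carries 0.25 + 0.25 + 1/3 + 1/3,
about 1.16; D-E carries 0.50 + 0.25 + 1/3, about 1.08; and F-G
carries 0.50 + 1/3, about 0.83.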
Node A can reduce the loss on link A-B by putting more load on link
A-F.  This will initially have the effect of lowering A-B to 1.0 and
raising F-G to 1.0.  The links can only pass 100% of their capacity,
so this would just reduce loss on link A-B at the expense of
increasing loss on link F-G.  The load on link A-F would increase to
0.5.

After node B had moved enough traffic away from link D-E so that its
loading fell below the 1.0 loading of link F-G, node F would begin to
move traffic destined to G away from link F-G.  Load would be added
to link D-E, but node B would continue to move load away from D-E.
Utilizations of B-C, C-E, and D-E would rise.  Utilization of F-D
would also approach 0.5, and utilization of F-G would fall.  This is
illustrated in Figure 4.

Node A would be faced with an overloaded link A-B and a better path
to G of A-F-G, with the worst loading being at link F-G, initially
only slightly over capacity.  Both links A-B and F-G would be
reporting 100% utilization, but link A-B would be expected to report
higher loss.  In addition, as the other optimizations proceed, the
utilization of link F-G would approach 100%.
        cost=2        cost=2        cost=2        cost=2
   A ----------> B ----------> C ----------> E ----------> G
   |  load=1.25  |  load=.92     load=.92  ^   load<25%  ^
   |             |                         |             |
   |             |   cost=2       cost=2   |             |
   |             \-----------> D ----------/             |
   |              load=.08     ^  load=.92               |
   |                           |                         |
   |                  cost=3   |  load=0.92              |
   |      cost=2               |             cost=6      |
   \-------------------------> F ------------------------/
            load=0.75                load=0.92

         Figure 4: Second Opportunity for Load Adjustment

        cost=2        cost=2        cost=2        cost=2
   A ----------> B ----------> C ----------> E ----------> G
   |  load=1.0   |  load=1.0     load=1.0  ^   load=25%  ^
   |             |                         |             |
   |             |   cost=2       cost=2   |             |
   |             \-----------> D ----------/             |
   |              load=0       ^  load=1.0               |
   |                           |                         |
   |                  cost=3   |  load=1.0               |
   |      cost=2               |             cost=6      |
   \-------------------------> F ------------------------/
            load=1.0                 load=1.0

        Figure 5: Another Opportunity for Load Adjustment

Node A would move traffic from the next hop of B to the next hop of
F.  Node F will continue to move load from F-G to F-D-E.  Node B will
continue to move load from B-D-E to B-C-E.  The utilizations of links
A-B, B-C, C-E, D-E, F-D, and F-G will approach 0.83.  Utilization of
link A-F will approach 0.63.  Utilization of link B-D will approach
zero.  This is illustrated in Figure 5.

The following table provides the approximate state of traffic loading
achieved in a brief simulation.  Of the 6 links that could be
balanced at about 0.83, 3 converged to about 0.85 and 3 to about
0.82.  Note that the path F-G-E was used to get from F to E, in
addition to the lower cost F-D-E.
   ingress-egress   load   path
   F-E              0.5    0.25  F-D-E
                           0.25  F-G-E
   F-G              0.5    0.25  F-G
                           0.25  F-D-E-G
   A-E              0.5    0.25  A-B-C-E
                           0.00  A-B-D-E
                           0.12  A-F-D-E
                           0.12  A-F-G-E
   A-G              1.0    0.60  A-B-C-E-G
                           0.00  A-B-D-E-G
                           0.20  A-F-G
                           0.20  A-F-D-E-G

   link   loading
   A-B    0.85
   A-F    0.65
   B-C    0.85
   B-D    0.00
   C-E    0.85
   D-E    0.82
   E-G    0.80
   F-D    0.82
   F-G    0.82

In this example, multiple links are balanced at 82% to 85%
utilization.  Without using OMP it is difficult (it might be
impossible using only ECMP) to avoid applying an offered load that
exceeds link capacity on parts of the topology.  This example is
intended to provide a more advanced tutorial than the trivial three
node example in Figure 1.

This example is among the simulations at
http://engr.ans.net/ospf-omp/tutorial.  More complex topologies and
traffic patterns have been simulated and are available at the same
URL.
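The relaxed best path criterion used in this example can be checked
mechanically with an all-pairs shortest path computation (the order N
cubed approach mentioned in Appendix D) over the directed Figure 2
topology.  A sketch in C; the node and array names are local to the
sketch:

```c
/* All-pairs shortest path costs for the Figure 2 topology, computed
 * with Floyd-Warshall.  Link costs are the ones given in the
 * example: all 2, except F-D (3) and F-G (6). */
enum { A, B, C, D, E, F, G, N };

#define INF 1000            /* larger than any real path cost */

static int dist[N][N];

static void all_pairs(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dist[i][j] = (i == j) ? 0 : INF;

    dist[A][B] = 2; dist[A][F] = 2;
    dist[B][C] = 2; dist[B][D] = 2;
    dist[C][E] = 2; dist[D][E] = 2;
    dist[E][G] = 2; dist[F][D] = 3; dist[F][G] = 6;

    /* Floyd-Warshall relaxation over all intermediate nodes. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}
```

Running this confirms the relaxation applied at node F: dist[F][G] is
6 while dist[D][G] is 4, so D is at least the required metric
difference of 1 closer to G than F is, and the cost-7 path F-D-E-G
becomes eligible alongside the cost-6 link F-G.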