idnits 2.17.1 draft-ietf-ospf-omp-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-25) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 12 instances of too long lines in the document, the longest one being 5 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (March 13, 1998) is 9540 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: '3' is defined on line 736, but no explicit reference was found in the text == Outdated reference: A later version (-05) exists of draft-ietf-ospf-opaque-04 -- Possible downref: Non-RFC (?) normative reference: ref. '2' ** Obsolete normative reference: RFC 2178 (ref. '3') (Obsoleted by RFC 2328) == Outdated reference: A later version (-03) exists of draft-ietf-idr-route-damp-02 Summary: 11 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force Curtis Villamizar 3 INTERNET-DRAFT ANS 4 draft-ietf-ospf-omp-00 March 13, 1998 6 OSPF Optimized Multipath (OSPF-OMP) 8 Status of this Memo 10 This document is an Internet-Draft. Internet-Drafts are working 11 documents of the Internet Engineering Task Force (IETF), its areas, 12 and its working groups. Note that other groups may also distribute 13 working documents as Internet-Drafts. 15 Internet-Drafts are draft documents valid for a maximum of six months 16 and may be updated, replaced, or obsoleted by other documents at any 17 time. It is inappropriate to use Internet- Drafts as reference 18 material or to cite them other than as ``work in progress.'' 20 To view the entire list of current Internet-Drafts, please check the 21 ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow 22 Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), 23 munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or 24 ftp.isi.edu (US West Coast). 26 Abstract 28 OSPF may form multiple equal cost paths between points. This is true 29 of any link state protocol. In the absense of any explicit support to 30 take advantage of this, a path may be chosen arbitrarily. Techniques 31 have been utilized to divide traffic somewhat evenly among the 32 available paths. These techniques have been referred to as Equal Cost 33 Multipath (ECMP). An unequal division of traffic among the available 34 paths is generally preferable. Routers generally have no knowledge of 35 traffic loading on distant links and therefore have no basis to 36 optimize the allocation of traffic. 38 Optimized Mulitpath is a compatible extension to OSPF, utilizing the 39 Opaque LSA to distribute loading information, proposing a means to 40 adjust forwarding, and providing an algorithm to make the adjustments 41 gradually enough to insure stability yet provide reasonably fast 42 adjustment when needed. 44 1 Overview 46 Networks running OSPF are often heavily loaded. Topologies often 47 evolve to include multiple paths. Multiple paths may be initially 48 designed to provide redundancy but also result from incremental 49 addition of circuits to accomodate traffic growth. The redundant 50 paths provide a potential to distribute traffic loading and reduce 51 congestion. Optimized Mulitpath (OMP) provides a means for OSPF to 52 make better use of this potential to distribute loading. 54 Early attempts to provide load sensitive routing involved changing 55 link costs according to loading. These attempts were doomed to 56 failure because the adjustment increment was grossly course and 57 oscillation was inevitable [?]. 59 A widely utilized technique to improve loading is known as Equal Cost 60 Multipath (ECMP). If the topology is such that equal cost paths exist, 61 then an attempt is made to divide traffic equally among the paths. 62 Three methods of splitting traffic have been used. 64 1. Per packet round robin forwarding. 66 2. Dividing destination prefixes among available next hops in the 67 forwarding entries. 68 3. Dividing traffic according to a hash function applied to the source 69 and desination pair. 71 The ``per packet round robin forwarding'' technique is only applicable 72 if the delays on the paths are almost equal. The delay difference 73 must be small relative to packet serialization time. Delay 74 differences greater than three times the packet serialization time can 75 cause terrible TCP performance. for example, packet 2, 4, and 6 may 76 arrive before packet 1, triggering TCP fast retransmit. The result 77 will be limiting TCP to a very small window and very poor performance 78 over long delay paths. 80 The delay differences must be quite small. A 532 byte packet is 81 serialized onto a DS1 link in under 2.8 msec. At DS3 speed, 82 serialization is accomplished in under 100 usec. At OC12 it is under 83 7 usec. For this reason ``per packet round robin forwarding'' is not 84 applicable to a high speed WAN. 86 Dividing destination prefixes among available next hops provides a 87 very course and unpredictable load split. Long prefixes are 88 problematic. In reaching an end node, the majority of traffic is 89 often destined to a single prefix. This technique is applicable to a 90 high speed WAN but with the drawbacks just mentioned better techniques 91 are needed. 93 The ``source/destination hash'' based technique was used as far back 94 as the T3-NSFNET in the NSS routers. A hash function, such as CRC-16, 95 is applied over the source address and destination address. The hash 96 space is then split evenly among the available paths by either setting 97 threshholds or performing a modulo operation. Traffic between any 98 given source and destination remain on the same path. Because the 99 technique is based on host addresses, and uses both the source and 100 destination address, it does not suffer the course granularity problem 101 of the prefix based technique, even when forwarding to a single 102 prefix. Source/destination hash is the best technique available for a 103 high speed WAN. 105 The forwarding decision for the ``source/destination hash'' based 106 technique is quite simple. When a packet arrives, look up the for- 107 warding entry in the radix tree. The next hop entry can be an array 108 index into a set of structures, each containing one or more actual 109 next hops. If more than one next hop is present, compute a CRC16 110 value based on the source and destination addresses. The CRC16 can be 111 implemented in hardware and computed in parallel to the radix tree 112 lookup in high speed implementations, and discarded if not needed. 113 Each next hop entry in the structure must contain a boundary value and 114 the next hop itself. An integer ``less than'' comparison is made 115 against the boundary value determining whether to use this next hop or move 116 to the next a comparison. In hardware the full set of comparisons can be 117 made simultaneously for up to some number of next hops. This yields the 118 next hop to use. 120 For ECMP, the boundary values are set by first dividing one more than 121 the maximum value that the hash computation can return (65536 for 122 CRC16) by the number of available next hops and then setting the Nth 123 boundary to N times that number (with the Nth value fixed at one more 124 than the maximum value regardless of underflow caused by trucating 125 during division, 65536 for CRC16). 127 An equal load split is not always optimal. Consider the example in 128 Figure 1 with the offered traffic in Table 1. If all of the link 129 costs are set equally, then the link N1---N3 is significantly 130 overloaded (135.75%) while the path N1---N2---N3 is lightly loaded 131 (45.25% and 22.62%). If the cost on the N1---N3 link is equal to the 132 cost of the N1---N2---N3 path, then N1 will try to split the load 133 destined toward N3 across the two paths. 135 Given the offered traffic in Table 1 the loading on N1---N3 is reduced 136 to 67.87% but the link loading on the path N2---N3 becomes 113.12%. 137 Ideally node N1 should put 1/3 of the traffic toward N3 on the path 138 N1---N2---N3 and 2/3 on the path N1---N3. To know to do this N1 must 139 know the loading on N2--N3. 141 .----. 142 / \ 143 | N2 | 144 \ / 145 `----' 146 // \\ 147 // \\ 148 // \\ 149 .----. .----. 150 / \ / \ 151 | N1 | ======== | N3 | 152 \ / \ / 153 `----' `----' 155 Figure 1: A very simple application of ECMP 157 Nodes Traffic Node Names 159 n3-n1 60.000 Node 3 -> Node 1 160 n1-n3 60.000 Node 1 -> Node 3 161 n3-n2 20.000 Node 3 -> Node 2 162 n2-n3 20.000 Node 2 -> Node 3 163 n2-n1 10.000 Node 2 -> Node 1 164 n1-n2 10.000 Node 1 -> Node 2 166 Table 1: Traffic loading for the example in Figure 1 168 This is where Optimized Multipath (OMP) provides additional benefit 169 over ECMP. Ignoring for the moment how node N1 knows to put 1/3 of the 170 traffic toward N3 on the path N1---N2---N3, the way this is 171 accomplished from a forwarding standpoint is to move the boundary in 172 the forwarding structure from the default value of 1/2 of 65536 to 173 about 1/3 of 65536. If there are a very large set of source and 174 destination host addresses pairs, then the traffic will be split among 175 the 65536 possible hash values. This provides the means for a very 176 fine granularity of adjustment. 178 Having explained how a fine granularity of forwarding adjustment can 179 be accomplished, what remains is to define how nodes in a large topol- 180 ogy can know what the loading levels are elsewhere in the topology and 181 defining an algorithm which can allow autonomous unsyncronized 182 decisions on the parts of many routers in a topology to quickly 183 converge on a near optimal loading without the risk of oscillation. 185 2 Flooding Loading Information 187 Loading information is flooded within an OSPF area using Opaque LSAs 188 [1]. Area local scope (link-state type 10) link state attributes are 189 flooded containing an ``Opaque Type'' of LSA_OMP_LINK_LOAD or 190 LSA_OMP_PATH_LOAD. The type LSA_OMP_LINK_LOAD Opaque LSA is used to 191 flood link loading information within an area. The type 192 LSA_OMP_PATH_LOAD Opaque LSA is used to flood loading information for 193 use with inter-area routes. Loading information obtained from an 194 exterior routing protocol may also be considered if available. The 195 means of passing loading information in an exterior routing protocol 196 is beyond the scope of this document. 198 2.1 Link Loading Information 200 Within an area link loading is flooded using the type LSA_OMP_LINK_LOAD 201 Opaque LSA. The ``Opaque ID'' must contain a 24 bit integer that is 202 unique to the advertising router and link or virtual link. The method 203 of assignment of these 24 bit integers is a local matter. A router 204 must be capable of being configured to put a fixed value in N of the 205 bits and use the remainin 24-N bits to uniquely identify an interface. 207 The ``Opaque Information'' in the type LSA_OMP_LINK_LOAD Opaque LSA 208 contains the following. 210 1. a measure of link loading in each direction as a fraction of link 211 capacity, 213 2. a measure of packets dropped due to queue overflow in each 214 direction (if known) expressed as a fraction, 216 3. the link capacity in killobits per second (or unity if less than 217 1000 bits per second). 219 Generally the number of ouput packets dropped will be known. In 220 designs where drops occur on the input (generally a bad design 221 practice), the rate of input queue drops should be recorded. These 222 measures of loading and drop are computed using the interface counters 223 generally maintained for SNMP purposes, plus a running count of output 224 queue drops if available. The counters are sampled every 225 OMP_SAMPLE_INTERVAL seconds. 227 The previous value of each of the counters is substracted from the 228 current value. The counters to be sampled are the following. 230 1. bytes out, 232 2. bytes in, 234 3. packets out, 235 4. packets in, 237 5. output queue drops, 239 6. input queue drops. 241 A value of instantaneous load in each direction is based on byte count 242 and link capacity. An instantaneous output queue drop rate is based 243 on queue drops and packet count. Each of these is combined with a 244 running filtered value according to the following method. The running 245 total is shifted down by OMP_SHIFT_BITS bits and subtracted from the 246 running total. The instantaneous value is shifted down by 247 OMP_SHIFT_BITS bits and added to the running total. 249 The last time that a type LSA_OMP_LINK_LOAD Opaque LSA with the same 250 Opaque ID was sent is recorded and the values sent are recorded. For 251 the purpose of determining when to reflood, an equivalent loading 252 figure is used. The computation of equivalent loading is described in 253 Section 2.3. 255 The higher of the current equivalent loading computation and the 256 previous is used when determining whether to send the type 257 LSA_OMP_LINK_LOAD Opaque LSA. The type LSA_OMP_LINK_LOAD Opaque LSA is 258 sent if any of the following is true. 260 1. The equivalent load is over 100% and the change in equivalent load 261 since last resent is over 5% and 30 seconds has elapsed since last 262 sent. 264 2. The equivalent load is over 100% and the change in equivalent load 265 since last resent is over 2% and 90 seconds has elapsed since last 266 sent. 268 3. The equivalent load is over 100% and 3 minutes has elapsed since 269 last sent. 270 4. The equivalent load is over 90% and and the change in equivalent 271 load since last resent is over 5% and 1 minute has elapsed since 272 last sent. 274 5. The equivalent load is over 90% and the change in equivalent load 275 since last resent is over 2% and 4 minutes has elapsed since last 276 sent. 278 6. The equivalent load is over 90% and 10 minutes has elapsed since 279 last sent. 280 7. The equivalent load is over 70% and the change in equivalent load 281 since last resent is over 10% and 1 minute has elapsed since last 282 sent. 284 8. The equivalent load is over 70% and the change in equivalent load 285 since last resent is over 5% and 2 minutes has elapsed since last 286 sent. 288 9. The equivalent load is over 70% and the change in equivalent load 289 since last resent is over 2% and 8 minutes has elapsed since last 290 sent. 291 10. The equivalent load is over 70% and 15 minutes has elapsed since 292 last sent. 294 11. The equivalent load is over 50% and the change in equivalent load 295 since last resent is over 10% and 1 minute has elapsed since last 296 sent. 298 12. The equivalent load is over 50% and the change in equivalent load 299 since last resent is over 5% and 5 minutes has elapsed since last 300 sent. 301 13. The equivalent load is over 25% and the change in equivalent load 302 since last resent is over 25% and 2 minutes has elapsed since last 303 sent. 305 14. The equivalent load is over 25% and 20 minutes has elapsed since 306 last sent. 308 15. The type LSA_OMP_LINK_LOAD Opaque LSA has never been sent. 310 The point of this graduated scale is to reduce the amount of flooding 311 that is occurring unless links are in trouble or undergoing a 312 significant traffic shift. Change may occur in a quiescent network 313 due to failure external to the network that causes traffic to take 314 alternate paths. In this case, the more frequent flooding will 315 trigger a faster convergence. Traffic shift may also occur due to 316 shedding of loading by the OMP algortihm itself as the algorithm 317 converges in response to an external change. 319 2.2 Path Loading Information 321 Path loading information regarding an adjacent area is flooded by an 322 Area Border Router (ABR) using the type LSA_OMP_PATH_LOAD Opaque LSA. 323 The ``Opaque ID'' must contain a 24 bit integer that is unique to the 324 router and an advertised summary LSA. The method of assignment of 325 these 24 bit integers is a local matter. A router must be capable of 326 being configured to put a fixed value in N of the bits and use the 327 remainin 24-N bits to uniquely identify the summary LSA. 329 The ``Opaque Information'' in the type LSA_OMP_PATH_LOAD Opaque LSA 330 contains the following. 332 1. the highest loading in the direction toward the destination as a 333 fraction of link capacity, 335 2. a measure of total packet drop due to queue overflow in the 336 direction toward the destination expressed as a fraction, 337 3. the smallest link capacity on the path to the destination. 339 These values are taken from the link on the path from the ABR to the 340 destination of the summary LSA. The link with the highest loading may 341 not be the link with the lowest capacity. The queue drop value is one 342 minus the product of fraction of packets that are not dropped at each 343 measurement point on the path (input and output in the direction of 344 the path). The following computation is used. 346 path-loss = 1 - product((1 - link-loss-in) * (1 - 347 link-loss-out)) 349 The path loading and path loss rate are filtered according to the same 350 algorithm defined in the prior section. Rather than polling a set of 351 counters the current value of the path loading and path loss rate is 352 used. An equivalent load is calculated for each path to a summary LSA 353 destination as described in Section 2.3. A type LSA_OMP_PATH_LOAD 354 Opaque LSA is flooded according to the same rate schedule as described 355 in the prior section. 357 An ABR may be configured to not send type LSA_OMP_PATH_LOAD Opaque LSA 358 into any given area. 360 2.3 Computing equivalent loading 362 The equivalent load is the actual percent loading multiplied by a 363 factor that provides an estimate of the extent to which TCP would be 364 slowing down to avoid congestion. This estimate is based on the link 365 bandwidth and loss rate, knowledge of TCP dynamics, and some 366 assumption about the characteristics of the TCP flows being passed 367 through the link. Some of the assumptions must be configured. 369 If loss is low, the equivalent load will be equal to the actual 370 percent loading. If loss is high and loading is at or near 100%, then 371 the equivalent load calculation provides a means of deciding which 372 links are more heavily overloaded. The equivalent load figure is not 373 intended to be an accurate pridiction of offerred load, simply a 374 metric for use in deciding which link to offload. 376 Mathis and Mahdavi provide the following estimate of loss given TCP 377 window size and round trip time [2]. 379 p < (MSS / (BW * RTT))**2 381 The basis for the estimate is that TCP slows down roughly in 382 proportion to the inverse of the square root of loss. There is no way 383 to know how fast TCP would be going if no loss were present if there 384 are other bottlenecks. A somewhat arbitrary assumption is made that 385 TCP would go no faster than if loss were at 0.5%. If loss is greater 386 than 0.5% then TCP performance would be reduced. The equivalent 387 loading is estimated using the following computation. 389 equiv-load = load * K * sqrt(1 - ((1 - loss-in) * (1 - 390 loss-out))) 392 The inverse of the square root of 0.1% is 10 so 10 may be used for the 393 value of ``K''. A square root is somewhat time consuming to compute, 394 so a table lookup can be done to avoid this computation. Increments 395 of 0.5% would yield a 200 entry table. The computation could then be 396 done with a table lookup, a shift, and an integer multiply. At most 397 this needs to be done on links with loss once every 398 OMP_SAMPLE_INTERVAL seconds. 400 The conversion of loss to estimated loading is not at all accurate. 401 The non-linearity does affect the time to converge though convergence 402 still occurs as long as loss is positively correlated to loading. 403 This is discussed further in Section D.1. 405 3 Adjusting Equal Cost Path Loadings 407 Adjustments are made to a next hop structure to reflect differences in 408 loading on the paths as reported by the type LSA_OMP_LINK_LOAD Opaque 409 LSA and type LSA_OMP_PATH_LOAD Opaque LSA. Section 3.2 describes the 410 selection of a ``critically loaded segment'' which is used to 411 determine when to make adjustments and the size of the adjustments. 412 An adjustment to loading of a given set of equal cost paths is made 413 when one of the following conditions are true. 415 1. The LSA for the ``critically loaded segment'' has been reflooded. 416 2. The difference between the equivalent load of the ``critically 417 loaded segment'' and the lightest loaded path is greater than 5% 418 and the equivalent load of the ``critically loaded segment'' is 419 greater than 100% and 90 seconds has elapsed since the last 420 adjustment. 422 3. The difference between the equivalent load of the ``critically 423 loaded segment'' and the lightest loaded path is greater than 3% 424 and the equivalent load of the ``critically loaded segment'' is 425 greater than 100% and 9 180 seconds has elapsed since the last 426 adjustment. 428 4. The difference between the equivalent load of the ``critically 429 loaded segment'' and the lightest loaded path is greater than 5% 430 and the equivalent load of the ``critically loaded segment'' is 431 greater than 90% and 120 seconds has elapsed since the last 432 adjustment. 433 5. The difference between the equivalent load of the ``critically 434 loaded segment'' and the lightest loaded path is greater than 3% 435 and the equivalent load of the ``critically loaded segment'' is 436 greater than 90% and 240 seconds has elapsed since the last 437 adjustment. 439 6. 300 seconds has elapsed since the last adjustment. 441 The reflooding algorithm is designed to be slightly less aggressive 442 than the adjustment algorithm. This reduces the need to continuously 443 flood small changes except in conditions of overload or substantial 444 change in loading. Some overshoot may occur due to adjustments made 445 in the absence of accurate knowledge of loading. 447 3.1 Next hop structures 449 For intra-AS routes, a separate next hop structure must exist for each 450 destination router or network. For inter-AS routes, a separate struc- 451 ture must exist for each intra-AS route if a type LSA_OMP_PATH_LOAD 452 Opaque LSA exists for the summary LSA and more than one ABR is 453 advertising the summary route and the equivalent load for the summary 454 LSA is greater than 50% and the equivalent load is not sufficiently 455 smaller than the intra-AS loading. If a separate structure is not 456 used for the intra-AS route, then the next hop structure associated 457 with the reaching the ABR or set of ABRs is used. For external 458 routes, if an equivalent loading exists, and more than one ASBR is ad- 459 vertising the route, and the equivalent load is greater than 50% and the 460 equivalent load is not sufficiently smaller than the internal route load- 461 ing associated with the external next hop, then a separate structure is 462 used. If a separate structure is not used for an external route, then the 463 next hop structure associated with the reaching external next hop is used. 465 Hysterysis must be used in the algorithm for determining if an 466 equivalent load on a summary LSA or external route is considered 467 sufficiently smaller than the intra-AS equivalent load or if an 468 external route is considered sufficiently smaller than the inter-AS 469 equivalent load. For for the purpose of describing this algorithm one 470 euivalent load is referred to as the more external, and the other as 471 the more internal equivalent load. 473 If the more external equivalent load exceeds the more internal equiva- 474 lent load by 5% and the more internal equivalent load is under 90%, 475 then a separate next hop structure is created. If the more external 476 equivalent load falls below 20% of the more internal equivalent load 477 or the more internal equivalent load exceeds 95%, then an existing 478 separate next hop structure is marked for removal and combined with 479 the more internal next hop structure (see Section 3.4). The more 480 external equivalent load should not fall significantly below the more 481 internal unless either the traffic toward the more external destina- 482 tion increases or the loading on the more internal increases, since 483 the more internal equivalent load will become the critical segment on the 484 separate next hop structure if the load is sufficiently shifted but is un- 485 likely to overshoot by 20%. These threshholds should be configurable at 486 least per type of routes (inter-AS or external). 488 3.2 Critcally loaded segment 490 For every set of intra-AS paths, one link or part of the path is 491 identified as the ``critcally loaded'' segment. This is the part of 492 the path with the highest equivalent load as defined in Section 2.3. 493 For an inter-AS route with a separate next hop structure, the 494 critcally loaded segment is the critcally loaded segment for the 495 intra-AS set of paths, or the summary LSA if the equivalent load on 496 the summary LSA is greater. For an external route with a separate 497 next hop structure, the critcally loaded segment is the critcally 498 loaded segment for the internal route or the external route if the 499 equivalent load on the external route is greater. 501 Each next hop structure has exactly one ``critcally loaded'' segment. 502 An Opaque LSA may be the critcally loaded segment for no next hop 503 structures if it is lightly loaded. An Opaque LSA may be the 504 critcally loaded segment for many next hop structures if it is heavily 505 loaded. 507 3.3 Load Adjustment Rate 509 In order to assure stability the rate of adjustment must be 510 sufficiently limited. An adaptive adjustment rate method is used. 512 When the SPF is recalculated paths may disappear and new paths may 513 appear. If a new equal cost path is added to a set of existing set of 514 paths or a single path, the new path would have previously not carried 515 any traffic of the traffic. The next hop structure is initialized or 516 modified if there had previously been equal cost paths such that the 517 new path is unused. The procedures for handling links coming up or 518 going down is covered in Section 4. 520 A ``critcally loaded'' segment for a next hop structure is determined 521 as described in Section 3.2. When the type LSA_OMP_LINK_LOAD Opaque 522 LSA or type LSA_OMP_PATH_LOAD Opaque LSA for this segment is updated, 523 load is shed toward all equal cost paths that do not involve that 524 segment. A separate set of variables controlling rate of adjustment 525 is kept for each alternate path. 527 The number of paths may exceed the number of next hops. The 528 distinction between paths which share a next hop is important if one 529 of the paths sharing a next hop goes down (see Section 4). This 530 distinction is only needed in making the computations. When moving 531 the next hop structure into the data structures used for forwarding, 532 paths which share a common next hop may be combined. The following 533 variables are kept for each path in a next hop structure. 535 1. The current ``traffic share'' (an integer, the range is 0 to 65355 536 for a CRC16 hash), 538 2. The current ``move increment'' (an integer, the range is 0 to 65355 539 for a CRC16 hash), 541 3. The number of moves in the same direction, referred to as the 542 ``move count''. 544 If there is no prior history for a path, then the move increment is 545 initialized to about 1% or 650. The number of moves in the same 546 direction is initialized to 3. No loading adjustment is made on the 547 first iteration. 549 If critcally loaded segment has not changed, or if the current path 550 did not contain the previous critcally loaded segment, then the 551 adjustment is continuing in the same direction. If the critcally 552 loaded segment has just changed and the path being shed load toward 553 contains the prior critcally loaded segment, then the adjustment 554 direction has reversed for this path. 556 If the adjustment direction has reversed, the number of moves in the 557 same direction is set to zero and the move increment is reduced by 558 half. This move increment is then used. 560 If the adjustment is continuing in the same direction, the number of 561 moves in the same direction is considered before increasing the move 562 increment. This value is here referred to as the move count. The 563 move increment is updated according to the following conditions. 565 1. If the move count is greater than 4, the move increment is 566 increased by its current value divided by the number of equal cost 567 paths in the next hop structure. 568 2. If the move count is less than or equal to 4, the move increment is 569 increased by 1/2 its current value divided by the number of equal 570 cost paths in the next hop structure. 572 The move increment is never less than one and the increase in move 573 increment is never less than one. The move increment is never allowed 574 to exceed the size of the hash space divided by the number of equal 575 cost paths in the next hop structure. 577 The dramatic decrease in move increment when move direction is 578 reversed and the slow increase in move increment when it remains in 579 the same direction keeps the algorithm stable. The exponential nature 580 of the increase allows the algorithm to track externally caused 581 changes in traffic loading. 583 The amount of hash space allocated to a path is incremented by the 584 move amount and the amount of hash space allocated to the critcally 585 loaded path or paths are decremented by this amount. 587 This process is repeated for each alternate path. The new hash space 588 boundaries are then moved to the forwarding engine. 590 3.4 Creating and destroying next hop structures 592 Section 3.1 describes the conditions under which a next hop structure 593 would be needed for an inter-AS route or AS external route. Briefly, 594 a separate next hop structure is needed if the loading indicated by 595 the type LSA_OMP_PATH_LOAD Opaque LSA or exterior routing protocol is 596 sufficiently high to require separate balancing for traffic to the 597 summary-LSA or exterior route and the intra-AS loading is sufficiently 598 low. 600 When a separate next hop structure is created, the same available 601 paths appear in the structure, leading to the same set of ABR or ASBR. 602 The balance on these available paths should be copied from the 603 existing more internal next hop structure. By initializing the new 604 next hop structure this way, a sudden change in loading is avoided if 605 the summary route or external route sinks a great deal of traffic. 607 When a separate next hop structure can be destroyed, the traffic 608 should be transitioned gradually. The next hop structure must be 609 marked for deletion. The traffic share in this separate next hop 610 structure should be gradually changed so that it exactly matches the 611 traffic share in the more internal next hop structure. The gradual 612 change should follow the adjustment rate schedule described in 613 Section 3.3 where the move increment is increased gradually as moves 614 continue in the same direction. Once the separate next hop structure 615 marked for deletion matches the more internal next hop structure, the 616 summary route or external route can be changed to point to the more 617 internal next hop structure and the deletion can be made. 619 4 Dealing with Link Failures 621 Link failures do occur for various reasons. OSPF routing will 622 converge to a new set of paths. Whatever load balance had previously 623 existed will be upset and the load balancing will have to converge to 624 a new load balanced state. 626 Links which are intermitent may be the most harmful. The OSPF 627 ``Hello'' protocol is inadequate for handling intermitent links. When 628 such a link is up it may draw traffic during periods of high loss, 629 even brief periods of complete loss. 631 The inadequacies of the OSPF ``Hello'' protocol is well known and many 632 implementations provide lower level protocol state information to OSPF 633 to indicate a link in the ``down'' state. For example, indications 634 may include carrier loss, excessive framing errors, unavailable 635 seconds, or loss indications from PPP LQM. 637 Even where the use of a link is avoided by providing indication of 638 lower level link availability, intermitent links are still 639 problematic. During a brief period immediately after a link state 640 attribute is initially flooded OSPF state can be inconsistent among 641 routers within the AS. This inconsistency can cause intermittent 642 routing loops and have a severe short term impact on link loading. An 643 oscillating link can cause high levels of loss and is generally better 644 off held in the neighbor adjacency ``down'' state. The algorithm 645 described in the [4] can be used when advertising OSPF type 1 or type 646 2 LSA (router and network LSAs). 648 Regardless as to whether router and network LSAs are damped, neighbor 649 adjacency state changes will occur and router and network LSAs will 650 have to be handled. The LSA may indicate an up transition or a down 651 transition. In either an up or down transition, when the SPF 652 algorithm is applied, existing paths from the router doing the SPF to 653 specific destinations may no longer be usable and new paths may become 654 usable. In the case of an up transition, some paths may no longer be 655 usable because their cost is no longer among those tied for the best. 656 In the case of down transitions, new paths may become usable because 657 they are now the best path still available. 659 When a path becomes unusable, paths which previously had the same cost 660 may remain. This can only occur on an LSA down transition. A new 661 next hop entry should be created in which the proportion of 662 source/destination hash space allocated to the now infeasible path is 663 distributed to the remaining paths proportionally to their prior 664 allocation. Very high loading percentages should result, triggering 665 an increase in LSA_OMP_LINK_LOAD Opaque LSA flooding rate until 666 convergence is approached. 668 When a new path becomes usable it may be tied for best with paths car- 669 rying existing traffic. This can only occur on an LSA up transition. 670 A new next hop entry should be created in which the loading on the new 671 path is zero. If such a path were to oscillate, little or no load 672 would be affected. If the path remains usable, the shift of load to 673 this path will accellerate until a balance is reached. 675 If a completely new set of best paths becomes available, the load 676 should be split across the available paths, proportional to 10% of 677 link capacity plus the remaining link bandwidth as determined by prior 678 LSA_OMP_LINK_LOAD Opaque LSA values. 680 A Data Formats 682 @@ This is obviously a very important piece and is missing. 684 B Concise Statement of the Algorithms 686 @@ MIB counters -> pseudo code -- pull code from the simulation and 687 past it here. 689 C Changes to Existing OSPF Implementations 691 @@ BSD and gated would make a nice reference implementation 693 C.1 Forwarding 695 @@ ip_output.c or equiv 697 C.2 Routing process interface to forwarding 699 @@ route socket or equiv 701 C.3 The routing process 703 @@ gated or equiv 705 D Algorithm Performance 707 @@ see http://engr.ans.net/ospf-omp 709 D.1 Conversion from Loss to Equivalent Load 711 @@ black art - consult local TCP guru - very few exist 713 D.2 Performance when tracking traffic load change 715 @@ see http://engr.ans.net/ospf-omp/ramp.html 717 D.3 Performance when traffic loading is constant 719 @@ it converges and oscillates a tiny bit - about 0.5% in simulations. 721 D.4 Convergence after a major perturbation 723 @@ see simulations at http://engr.ans.net/ospf-omp - we're going kill 724 a link and watch the thing converge at some later date. 726 References 728 [1] Rob Coltun. The ospf opaque lsa option. Internet Draft (Work in 729 Progress) draft-ietf-ospf-opaque-04, Internet Engineering Task 730 Force, 2 1998. ftp://ds.internic.net/internet-drafts/draft-ietf- 731 ospf-opaque-04.txt. 733 [2] M. Mathis, J. Semke, J. Mahdavi, and T. Ott. The macroscopic 734 behavior of the TCP congestion avoidance algorithm. ACM Computer 735 Communication Review, 27(3), July 1997. 736 [3] J. Moy. Ospf version 2. Technical Report RFC 2178, Internet Engi- 737 neering Task Force, 1997. ftp://ds.internic.net/rfc/rfc2178.txt. 739 [4] C. Villamizar, R. Govindan, and R. Chandra. Bgp route flap damp- 740 ing. Internet Draft (Work in Progress) draft-ietf-idr-route-damp- 741 02, Internet Engineering Task Force, 2 742 1998. ftp://ds.internic.net/internet-drafts/draft-ietf-idr-route- 743 damp-02.txt. 745 Security Considerations 747 @@ You can't be sure if loading the *.doc version of this draft into 748 your word processor may infect you with a dangerous virus so a *.doc 749 version has not been made available. [Something vaguely serious to go 750 in here at a later date, subject to IESG whim de jour.] @@ 752 Author's Addresses 754 Curtis Villamizar 755 ANS Communications 756