idnits 2.17.1 draft-ietf-grow-bgp-wedgies-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 456. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 433. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 440. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 446. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 255 has weird spacing: '...--+peer peer...' == Line 258 has weird spacing: '...rovider provi...' 
== Line 261 has weird spacing: '...ustomer custo...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 10, 2005) is 6887 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 1771 (Obsoleted by RFC 4271) Summary: 4 errors (**), 0 flaws (~~), 5 warnings (==), 8 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 GROW T. Griffin 3 Internet-Draft University of Cambridge 4 Expires: December 12, 2005 G. Huston 5 APNIC 6 June 10, 2005 8 BGP Wedgies 9 draft-ietf-grow-bgp-wedgies-03.txt 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. 
It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on December 12, 2005. 36 Copyright Notice 38 Copyright (C) The Internet Society (2005). 40 Abstract 42 It has commonly been assumed that the Border Gateway Protocol (BGP) 43 is a tool for distributing reachability information in a manner that 44 creates forwarding paths in a deterministic manner. In this memo we 45 will describe a class of BGP configurations for which there is more 46 than one potential outcome, and where forwarding states other than 47 the intended state are equally stable, and that the stable state 48 where BGP converges may be selected by BGP in a non-deterministic 49 manner. These stable, but unintended, BGP states are termed here 50 "BGP Wedgies". 52 1. Introduction 54 It has commonly been assumed that the Border Gateway Protocol (BGP) 55 [RFC1771] is a tool for distributing reachability information in a 56 manner that creates forwarding paths in a deterministic manner. This 57 is a 'problem statement' memo that describes a class of BGP 58 configurations for which there is more than one stable forwarding 59 state. In this class of configurations forwarding states other than 60 the intended state are equally stable, and the stable state where BGP 61 converges may be selected by BGP in a non-deterministic manner. 63 These stable, but unintended, BGP states are termed here "BGP 64 Wedgies". 66 2. Describing BGP Routing Policy 68 BGP routing policies generally reflect each network administrator's 69 objective to optimize their position with respect to their network's 70 cost, performance and reliability. 
72 With respect to cost optimization, the local network's default 73 routing policy often reflects a local preference to prefer routes 74 learned from a customer to routes learned from some form of peering 75 exchange. In the same vein the local network is often configured to 76 prefer routes learned from a peer or a customer over those learned 77 from a directly connected upstream transit provider. These 78 preferences may be expressed via a local preference configuration 79 setting, where the local preference overrides the AS path length 80 metric of the base BGP operation. 82 In terms of engineering reliability in the inter-domain routing 83 environment it is commonly the case that a service provider may enter 84 into arrangements with two or more upstream transit providers, 85 passing routes to all upstream providers, and receiving traffic from 86 all sources. If the path to one upstream fails the traffic will 87 switch to other links. Once the path is recovered, the traffic 88 should switch back. 90 In such situations of multiple upstream providers it is also 91 commonplace to place a relative preference on the providers, so that 92 one connection is regarded as a preferred, or "primary" connection, 93 and other connections are regarded as less preferred, or "backup" 94 connections. The intent is typically that the backup connections 95 will be used for traffic only for the duration of a failure in the 96 primary connection. 98 It is possible to express this primary / backup policy using local AS 99 path prepending, where the AS path is artificially lengthened towards 100 the backup providers, using additional instances of the local AS. 101 This is not a deterministic selection algorithm, as the selected 102 primary provider may in turn be using AS path prepending to its 103 backup upstream provider, and in certain cases the path through the 104 backup provider may still be selected as the shortest AS path length. 
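The prepending caveat above can be made concrete with a minimal sketch. It assumes a remote network that compares only AS path length and hears the route twice: once through the backup provider, and once through an upstream that the primary provider itself treats as a backup. The AS numbers (taken from the documentation range) and the prepend counts are hypothetical, chosen only for illustration.

```python
def shortest_as_path(candidates):
    """Pick the candidate with the fewest AS hops (a tie keeps the first one)."""
    return min(candidates, key=len)


# Hypothetical AS numbers from the documentation range.
LOCAL, PRIMARY, BACKUP, UPSTREAM = 64500, 64501, 64502, 64510

# The local AS prepends itself twice towards its backup provider.
via_backup = [BACKUP, LOCAL, LOCAL, LOCAL]

# The primary provider regards UPSTREAM as its own backup and prepends
# itself three times towards it, so the "primary" path ends up longer.
via_primary = [UPSTREAM, PRIMARY, PRIMARY, PRIMARY, PRIMARY, LOCAL]

chosen = shortest_as_path([via_primary, via_backup])
print("selected path:", chosen)
```

Under these assumptions the remote network selects the path through the backup provider, even though the local AS prepended specifically to discourage it.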
106 An alternative approach to routing policy specification uses BGP 107 communities [RFC1997]. In this case the provider publishes a set of 108 community values that allows the client to select the provider's 109 local preference setting. The client can use a community to mark a 110 route as "backup only" towards the backup provider, and "primary 111 preferred" to the primary provider, assuming both providers support 112 community values with such semantics. In this case the local 113 preference overrides the AS path length metric, so that if the route 114 is marked "backup only", the route will be selected only when there 115 is no other source of the route. 117 3. BGP Wedgies 119 The richness of local policy expression through the use of 120 communities, when coupled with the behavior of a distance vector 121 protocol like BGP, leads to the observation that certain 122 configurations have more than one "solution", or more than one stable 123 BGP state. An example of such a situation is indicated in Figure 1. 125 +----+peer peer+----+ 126 |AS 3|------------------------|AS 4| 127 +----+ +----+ 128 |provider provider| 129 | | 130 | | 131 |customer | 132 +----+ | 133 |AS 2| | 134 +----+ | 135 |provider | 136 | | 137 | | 138 |customer customer| 139 +---------------+ +----------+ 140 backup service| |primary service 141 +----+ 142 |AS 1| 143 +----+ 145 Figure 1 147 In this case AS1 has marked its advertisement of prefixes to AS2 as 148 "backup only", and its advertisement of prefixes to AS4 as "primary". 149 AS4 will advertise AS1's prefixes to AS3. AS3 will hear AS4's 150 advertisement across the peering link, and select AS1's prefixes with 151 the path "AS4, AS1". AS3 will advertise these prefixes to AS2. AS2 152 will hear two paths to AS1's prefixes: the first is via the direct 153 connection to AS1, and the second is via the path "AS3, AS4, AS1".
154 AS2 will prefer the longer path, as the directly connected routes are 155 marked "backup only", and AS2's local preference decision will prefer 156 the AS3 advertisement over the AS1 advertisement. 158 This is the intended outcome of AS1's policy settings, where in the 159 'normal' state no traffic passes from AS2 to AS1 across the backup 160 link, and AS2 reaches AS1 via a path that transits AS3 and AS4, using 161 the primary link to AS1. 163 This intended outcome is achieved as long as AS1 announces its routes 164 on the primary path to AS4 before announcing its backup routes to 165 AS2. 167 If the AS1 - AS4 path is broken, causing a BGP session failure 168 between AS1 and AS4, then AS4 will withdraw its advertisement of 169 AS1's routes to AS3, who, in turn, will send a withdrawal to AS2. 170 AS2 will then select the backup path to AS1. AS2 will advertise 171 this path to AS3, and AS3 will advertise this path to AS4. Again, 172 this is part of the intended operation of the primary / backup policy 173 setting, and all traffic to AS1 will use the backup path. 175 When connectivity between AS4 and AS1 is restored the BGP state will 176 not revert to the original state. AS4 will learn the primary path to 177 AS1, and readvertise this to AS3 using the path "AS4, AS1". AS3, 178 using a default preference of preferring customer-advertised routes 179 over peer routes, will continue to prefer the "AS2, AS1" path. AS3 180 will not pass any updates to AS2. After the restoration of the AS4 181 to AS1 circuit the traffic from AS3 to AS1 and from AS2 to AS1 will 182 be presented to AS1 via the backup path, even though the primary 183 path via AS4 is back in service. 185 The intended forwarding state can only be restored by AS1 186 deliberately bringing down its eBGP session with AS2, even though it 187 is carrying traffic. This will cause the BGP state to revert to the 188 intended configuration.
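The sequence of events described for Figure 1 can be reproduced with a toy path-vector model. This is only a sketch under stated assumptions: the numeric preference values, the export rule (routes learned from peers or providers are passed only to customers), and the fixed order in which ASes re-run route selection are illustrative choices, not details taken from this memo. The names REL, LOCAL_PREF, BACKUP_ONLY, exports, converge and without are hypothetical.

```python
# REL[(a, b)] is the role that neighbor b plays for AS a in Figure 1.
REL = {
    (1, 2): "provider", (2, 1): "customer",
    (1, 4): "provider", (4, 1): "customer",
    (2, 3): "provider", (3, 2): "customer",
    (3, 4): "peer",     (4, 3): "peer",
}
ALL_LINKS = {frozenset(pair) for pair in REL}
LOCAL_PREF = {"customer": 300, "peer": 200, "provider": 100}  # assumed values
BACKUP_ONLY = {(2, 1): 50}  # AS2 honors AS1's "backup only" community


def exports(sender, receiver, path):
    """Assumed export rule: own and customer routes go to every neighbor,
    routes learned from peers or providers go to customers only."""
    if len(path) == 1:  # route originated by the sender
        return True
    learned_from = REL[(sender, path[1])]
    return learned_from == "customer" or REL[(sender, receiver)] == "customer"


def converge(best, links_up):
    """Re-run route selection from the given state until nothing changes."""
    best = dict(best)
    changed = True
    while changed:
        changed = False
        for asn in (2, 3, 4):
            candidates = []
            for (a, nbr) in REL:
                if a != asn or frozenset((a, nbr)) not in links_up:
                    continue
                path = best.get(nbr)
                # Skip missing routes, AS path loops, and unexported routes.
                if path is None or asn in path or not exports(nbr, asn, path):
                    continue
                pref = BACKUP_ONLY.get((asn, nbr), LOCAL_PREF[REL[(asn, nbr)]])
                # Higher preference wins, then the shorter AS path.
                candidates.append((pref, -len(path), (asn,) + path))
            new = max(candidates)[2] if candidates else None
            if new != best.get(asn):
                best[asn] = new
                changed = True
    return best


def without(a, b):
    """All links of Figure 1 except the one between a and b."""
    return ALL_LINKS - {frozenset((a, b))}


start = {1: (1,)}
# AS1 announces on the primary link first, then on the backup link.
intended = converge(converge(start, without(1, 2)), ALL_LINKS)
failed = converge(intended, without(1, 4))    # primary link breaks
wedged = converge(failed, ALL_LINKS)          # primary link is restored
# AS1 deliberately resets its session with AS2 to recover.
repaired = converge(converge(wedged, without(1, 2)), ALL_LINKS)

for name, state in [("intended", intended), ("failed", failed),
                    ("wedged", wedged), ("repaired", repaired)]:
    print(name, {asn: state[asn] for asn in sorted(state)})
```

In this model the same set of links yields two different stable states: after the primary link is restored, AS2 and AS3 remain on the backup path, and only the deliberate reset of the AS1 to AS2 session returns the model to the intended state.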
190 It is often the case that an AS will attempt to balance incoming 191 traffic across multiple providers, again using the primary / backup 192 mechanism. For some prefixes one link is configured as the primary 193 link, and the others as the backup link, while for other prefixes 194 another link is selected as the primary link. An example is shown in 195 Figure 2. 197 +----+peer peer+----+ 198 |AS 3|--------------------------|AS 4| 199 +----+ +----+ 200 |provider provider| 201 | | 202 | customer| 203 |customer | 204 +----+ +----+ 205 |AS 2| |AS 5| 206 +----+ +----+ 207 |provider provider| 208 | | 209 | | 210 |customer customer| 211 +-----------------+ +----------+ 212 | | 213 backup (192.0.2.0/25) | |primary service (192.0.2.0/25) 214 primary (192.0.2.128/25)| |backup service (192.0.2.128/25) 215 +----+ 216 |AS 1| 217 +----+ 219 Figure 2 221 The intended configuration has all incoming traffic for addresses in 222 the range 192.0.2.0/25 via the link from AS5, and all incoming 223 traffic for addresses in the range 192.0.2.128/25 from AS2. 225 In this case if the link between AS3 and AS4 is reset, AS3 will learn 226 both routes from AS2, and AS4 will learn both routes from AS5. As 227 these customer routes are preferred over peer routes, when the link 228 between AS3 and AS4 is restored, neither AS3 nor AS4 will alter their 229 routing behavior with respect to AS1's routes. This situation is now 230 wedged, in that there is no eBGP peering that can be reset that will 231 flip BGP back to the intended state. This is an instance of a BGP 232 Wedgie. 234 The restoration path here is that AS1 has to withdraw the backup 235 advertisements on both paths and operate for an interval without 236 backup, and then readvertise the backup prefix advertisements. The 237 length of the interval cannot be readily determined in advance, as it 238 has to be sufficiently long so as to allow AS2 and AS5 to learn of an 239 alternate path to AS1. 
At this stage the backup routes can be 240 readvertised. 242 4. Multi-Party BGP Wedgies 244 This situation can be more complex when three or more parties provide 245 upstream transit services to an AS. An example is indicated in 246 Figure 3. 248 +----+ peer peer +----+ 249 |AS 3|------------------------|AS 4| 250 +----+ +----+ 251 ||provider provider| 252 |+----------------+ | 253 | | | 254 |customer |customer | 255 +----+peer peer+----+ | 256 |AS 2|-----------|AS 5| | 257 +----+ +----+ | 258 |provider provider| | 259 | | | 260 | | | 261 |customer customer| customer| 262 +---------------+ |+---------+ 263 backup service| ||primary service 264 +----+ 265 |AS 1| 266 +----+ 268 Figure 3 270 In this example the intended state is that AS2 and AS5 are both 271 backup providers to AS1, and AS4 is the primary provider. When the 272 link between AS1 and AS4 breaks and is subsequently restored, AS3 273 will continue to direct traffic to AS1 via AS2 or AS5. In this case 274 a single reset of the link between AS2 and AS1 will not restore the 275 original intended BGP state, as the BGP-selected best route to AS1 276 will switch to AS5, and AS2 and AS3 will learn a path to AS1 via AS5. 278 What AS1 is observing is incoming traffic on the backup link from 279 AS2. Resetting this connection will not restore traffic back to the 280 primary path, but instead will switch incoming traffic over to AS5. 281 The action required to correct the situation is to simultaneously 282 reset both the link to AS2, and also the link to AS5. This is not 283 necessarily an intuitively obvious solution, as at any point in time 284 only one of these links will be carrying backup traffic, yet both BGP 285 sessions need to be brought down at the same time in order to 286 commence restoration of the intended primary and backup state. 288 5. BGP and Determinism 290 BGP does not behave deterministically in all cases, and, as a 291 consequence, there is intended and unintended non-determinism in BGP.
292 For example, the default final tie break in some implementations of 293 BGP is to prefer the longest-lived route. To achieve determinism in 294 this last step it would be necessary to use a comparison operator 295 that has a predictable outcome, such as a comparison of router 296 identifiers. This class of non-deterministic behavior is termed here 297 "intended" non-determinism, in that the policy interactions are, to 298 some extent, predictable by network administrators. 300 BGP is also able to generate outcomes that can be described as 301 "unintended non-determinism" that can result from unexpected policy 302 interactions. These outcomes do not represent misconfiguration in 303 the standard sense, since all policies may look completely rational 304 locally, but their interaction across multiple routing entities can 305 cause unintended outcomes, and BGP may reach a state that includes 306 such unintended outcomes in a non-deterministic manner. 308 Unintended non-determinism in BGP would not be as critical an issue 309 if all stable routings were guaranteed to be consistent with the 310 policy writer's intent. However, this is not always the case. The 311 above examples indicate that the operation of BGP allows multiple 312 stable states to exist from a single configuration state, where some 313 of these states are not consistent with the policy writer's intent. 314 These particular examples can be described as a form of "route 315 pinning", where the route is pinned to a non-preferred path. 317 The challenge for the network administrator is to ensure that an 318 intended state is maintained. Under certain circumstances this can 319 only be achieved by deliberate service disruption, involving the 320 withdrawal of routes being used to forward traffic, and re- 321 advertising routes in a certain sequence in order to induce an 322 intended BGP state. 
However, for any 323 single network administrator to understand the 324 reason why BGP has stabilized to an unintended state requires BGP 325 policy configuration knowledge of remote networks. In effect there 326 is insufficient local information for any single network 327 administrator to correctly identify the root cause of the unintended 328 BGP state, nor is there sufficient information to allow any single 329 network administrator to undertake a sequence of steps to rectify the 330 situation back to the intended routing state. 332 It is reasonable to anticipate that the density of interconnection 333 will increase, and also that the capability for policy-based preference 334 setting of learned and re-advertised routes will become more 335 expressive. It is therefore reasonable to anticipate that the 336 incidence of unintended BGP states will increase, and the 337 necessary sequence of route withdrawals and re- 338 advertisements will become more challenging to determine in advance. 340 Whether this could lead to the BGP routing system reaching a point where 341 each network consistently cannot direct traffic in a deterministic 342 manner is at this stage a matter of speculation. BGP Wedgies are an 343 illustration that a sufficiently complex interconnection topology, 344 coupled with a sufficiently expressive set of policy constructs, can 345 lead to a number of stable BGP states, rather than a single intended 346 state. As the topology complexity increases it is not possible to 347 deterministically predict which state the BGP routing system may 348 converge to. Paradoxically, the demands of inter-domain traffic 349 engineering appear to require greater levels of expressive 350 capability in policy-based routing directives, operating across 351 denser interconnectivity topologies in a deterministic manner. This 352 may not be a sustainable outcome in BGP-based routing systems. 354 6.
Security Considerations 356 BGP is a relaying protocol, where route information is received, 357 processed and forwarded. BGP contains no specific mechanisms to 358 prevent the unauthorized modification of the information by a 359 forwarding agent, allowing routing information to be modified, 360 deleted or false information to be inserted without the knowledge of 361 the originator of the routing information or any of the recipients. 363 The memo proposes no modifications to the BGP protocol, nor does it 364 propose any changes to the manner of deployment of BGP, and therefore 365 introduces no new factors in terms of the security and integrity of 366 inter-domain routing. 368 The memo illustrates that in attempting to create policy-based 369 outcomes relating to path selection for incoming traffic it is 370 possible to generate BGP configurations where there are multiple 371 stable outcomes, rather than a single outcome. Furthermore, of these 372 instances of multiple outcomes, there are cases where the BGP 373 selection of a particular outcome is not a deterministic selection. 375 This class of behaviour may be exploitable by a hostile third party. 376 A common theme of BGP Wedgies is that starting from an intended or 377 desired forwarding state, the loss and subsequent restoration of an 378 eBGP peering connection can flip the network's forwarding 379 configuration into an unintended and potentially undesired state. 380 Significant administrative effort, based on BGP state and 381 configuration knowledge that may not be locally available, may be 382 required to shift the BGP forwarding configuration back to the 383 intended or desired forwarding state.
If a hostile third party 384 can deliberately cause the BGP session to reset, thereby producing 385 the initial conditions that lead to an unintended forwarding state, 386 the network impacts of the resulting unintended or undesired 387 forwarding state may be long-lived, far outliving the temporary 388 interruption of connectivity that triggered the condition. If these 389 impacts, including potential issues of increased cost, reduction of 390 available bandwidth, increases in overall latency or degradation of 391 service reliability, are significant, then disrupting a BGP session 392 could represent an attractive attack vector to a hostile party. 394 7. IANA Considerations 396 [Note to RFC Editor: Please remove this section prior to publication] 398 This document has no associated IANA actions or considerations. 400 8. References 402 8.1 Normative References 404 [RFC1771] Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 405 (BGP-4)", RFC 1771, March 1995. 407 8.2 Informative References 409 [RFC1997] Chandrasekeran, R., Traina, P., and T. Li, "BGP 410 Communities Attribute", RFC 1997, August 1996. 412 Authors' Addresses 414 Tim Griffin 415 University of Cambridge 417 Email: Timothy.Griffin@cl.cam.ac.uk 419 Geoff Huston 420 Asia Pacific Network Information Centre 422 Email: gih@apnic.net 424 Intellectual Property Statement 426 The IETF takes no position regarding the validity or scope of any 427 Intellectual Property Rights or other rights that might be claimed to 428 pertain to the implementation or use of the technology described in 429 this document or the extent to which any license under such rights 430 might or might not be available; nor does it represent that it has 431 made any independent effort to identify any such rights. Information 432 on the procedures with respect to rights in RFC documents can be 433 found in BCP 78 and BCP 79. 
435 Copies of IPR disclosures made to the IETF Secretariat and any 436 assurances of licenses to be made available, or the result of an 437 attempt made to obtain a general license or permission for the use of 438 such proprietary rights by implementers or users of this 439 specification can be obtained from the IETF on-line IPR repository at 440 http://www.ietf.org/ipr. 442 The IETF invites any interested party to bring to its attention any 443 copyrights, patents or patent applications, or other proprietary 444 rights that may cover technology that may be required to implement 445 this standard. Please address the information to the IETF at 446 ietf-ipr@ietf.org. 448 Disclaimer of Validity 450 This document and the information contained herein are provided on an 451 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 452 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 453 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 454 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 455 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 456 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 458 Copyright Statement 460 Copyright (C) The Internet Society (2005). This document is subject 461 to the rights, licenses and restrictions contained in BCP 78, and 462 except as set forth therein, the authors retain all their rights. 464 Acknowledgment 466 Funding for the RFC Editor function is currently provided by the 467 Internet Society.