Network Working Group                                        P. Lapukhov
Internet-Draft                                               E. Nkposong
Intended status: Informational                     Microsoft Corporation
Expires: March 06, 2014                               September 02, 2013

Centralized Routing Control in BGP Networks Using Link-State Abstraction
                        draft-lapukhov-bgp-sdn-00

Abstract

   Some operators deploy networks consisting of multiple BGP Autonomous
   Systems (ASNs) under the same administrative control.  There are
   also implementations which use only one routing protocol, namely
   BGP, as in [I-D.lapukhov-bgp-routing-large-dc], for example.
   In such designs, inter-AS traffic engineering is commonly
   implemented using BGP policies, by configuring multiple routers at
   the ASN boundaries.  This distributed policy model is difficult to
   manage and scale due to its dependency on complex routing policies
   and the need to develop and maintain a model for per-prefix path
   preference signaling.  One example of such a model is standard BGP
   community-based signaling (see [RFC1997]), which requires careful
   documentation and consistent configuration.  Furthermore,
   automating such policy configuration changes for the purpose of
   centralized management requires additional effort and depends on a
   particular vendor's configuration management interface (CLI
   extensions, NETCONF [RFC6241], etc.).

   This document proposes a method for inter-AS traffic engineering
   for use with the kinds of deployment scenarios outlined above.  No
   protocol changes or additional features are required to implement
   this method.  The key to the proposed methodology is a new software
   entity called the "BGP Controller" - a special-purpose application
   that peers with all eBGP speakers in the managed network.  The
   controller constructs a live state of the underlying BGP ASN graph
   and presents a multi-topology view of this graph via a simple API
   to third-party applications interested in performing network
   traffic engineering.  An example application could be an
   operational tool used to drain traffic from network devices.  In
   response to changes in the logical network topology proposed by
   these applications, the controller computes new routing tables and
   pushes them down to the network devices via the established BGP
   sessions.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute working documents as
   Internet-Drafts.  The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on March 06, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Overview
     2.1.  Use Cases
     2.2.  Architectural Assumptions
     2.3.  BGP Controller
   3.  Link-State Abstraction and Multiple Topologies
     3.1.  Link-State Discovery Process
     3.2.  The Default Topology
     3.3.  Alternate Topologies
     3.4.  Overloading a Vertex
   4.  Implementation Details
     4.1.  Programming Next-Hops
     4.2.  Equal-Cost Multipath Routing
     4.3.  Prefix Discovery Process
     4.4.  Sequenced Device Programming
     4.5.  Mapping Prefixes to Topologies
     4.6.  Autonomous Systems with iBGP Peering Mesh
     4.7.  Minimizing Controller-Injected State
   5.  Handling Failure Scenarios
     5.1.  Underlying Network Failures
     5.2.  BGP Controller Failures
     5.3.  Multiple BGP Controllers
     5.4.  Network Partitioning
   6.  Controller API
     6.1.  Pathnames and Document Names
     6.2.  Encoding of the Documents and Objects
     6.3.  Creating and Deleting State
     6.4.  Reading State
     6.5.  Writing State
     6.6.  Typical API Call Sequence
     6.7.  Limitations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   BGP was intentionally designed as a path-vector protocol, since
   efficiently distributing link-state information for an
   Internet-sized graph is virtually impossible.
   However, some network deployments leverage multiple BGP ASNs to
   separate IGP domains, or simply use BGP as the only routing
   protocol.  See, for example, [I-D.lapukhov-bgp-routing-large-dc],
   which proposes using a BGP AS either per network device or per
   "horizontal" device group within a data center.  In such cases, the
   number of BGP ASNs is very small when compared to the Internet - on
   the order of a few thousand in the largest case.

   Under these assumptions, it becomes possible to build and maintain
   a link-state graph of the complete inter-AS topology and compute
   network paths based on this link-state information.  In
   accomplishing this, it is desirable to avoid adding any protocol
   extensions (such as those described, for example, in [RWHITE2005])
   so that current implementations can leverage the proposed method.
   Instead, this document proposes the use of a centralized agent
   (referred to as the "BGP Controller" or simply "the controller")
   that peers with all eBGP speakers in the underlying network.  The
   BGP Controller is responsible for constructing an up-to-date
   link-state view of the BGP inter-AS graph and pushing down routing
   information (prefixes and their associated next-hops) to the
   network devices via BGP updates.  The new routing information
   reflects the results of link-state path computations performed by
   the controller.  Such a routing information push is possible
   because BGP supports the next-hop attribute, which can be
   recursively resolved via either IGP or BGP.  Notice that while the
   controller pushes routing information to a device, the underlying
   BGP process also computes best-paths for the same prefixes using
   the regular path-vector logic.  However, the BGP Controller can
   override this information by manipulating the BGP attributes of
   injected routes, such as LOCAL_PREF, to make its own advertisements
   more preferred.
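
   To make the preceding description concrete, the sketch below shows
   how a controller could compute equal-cost next-hops from a
   link-state graph using Dijkstra's algorithm.  This is illustrative
   only, not part of the proposal; the five-ASN graph, unit metrics,
   and all names are hypothetical.

```python
import heapq

def spf(graph, root):
    """Dijkstra shortest-path distances from root over a metric-weighted
    adjacency map {vertex: {neighbor: metric}}."""
    dist = {root: 0}
    queue = [(0, root)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, metric in graph[u].items():
            if d + metric < dist.get(v, float("inf")):
                dist[v] = d + metric
                heapq.heappush(queue, (d + metric, v))
    return dist

def ecmp_next_hops(graph, src, dst):
    """Equal-cost first hops from src toward dst: every neighbor lying
    on some shortest path.  Assumes symmetric link metrics."""
    if src == dst:
        return {"Self"}
    to_dst = spf(graph, dst)  # distance from every vertex to dst
    return {n for n, metric in graph[src].items()
            if metric + to_dst[n] == to_dst[src]}

# Hypothetical five-ASN graph, one vertex per ASN, unit metrics.
GRAPH = {
    "AS1": {"AS2": 1, "AS3": 1},
    "AS2": {"AS1": 1, "AS4": 1, "AS5": 1},
    "AS3": {"AS1": 1, "AS5": 1},
    "AS4": {"AS2": 1, "AS5": 1},
    "AS5": {"AS2": 1, "AS3": 1, "AS4": 1},
}
```

   In this toy graph, a prefix attached to AS5 is reachable from AS1
   via two equal-cost first hops, AS2 and AS3; the controller would
   translate such vertex-level next-hops into actual next-hop IP
   addresses before injecting routes.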

   Third-party applications can influence routing computations by
   creating logical alterations of the network link-state graph, e.g.,
   changing the cost of links from the BGP Controller's point of view.
   This document refers to those constructs as "alternate topologies"
   (or simply "topologies" for short), while the original, unaltered
   link-state graph is referred to as the "default topology".  The
   controller uses these alternate topologies to make routing
   decisions different from those that BGP would have made based on
   available information.  It is possible to create multiple alternate
   topologies and associate different prefixes with every topology,
   with the restriction that each prefix maps to one and only one
   topology.  Once this mapping is defined, the BGP Controller
   performs autonomously, detecting network faults and reacting by
   re-computing routing information as needed, based on the effect
   that the failure has across all instantiated topologies.

   In many aspects, the proposed method was inspired by and is similar
   to the "Routing Control Platform" [RCP], but differs in that
   link-state discovery is done using BGP mechanics only, and overall
   BGP is the only protocol used to build the system.

2.  Overview

2.1.  Use Cases

   The primary intended use case of the BGP Controller is inter-AS
   traffic engineering.  This includes, but is not limited to, the
   following:

   o  Link/device overloading for the purpose of drying out traffic
      from a device.  A link, or group of links, connecting one ASN to
      another could be declared as having "infinite" cost from the
      controller's viewpoint, causing the latter to re-compute paths
      and instruct the network devices to bypass those links.  Notice
      that this does not include "internal" overload (inside an ASN),
      which may need to be handled using IGP techniques.

   o  Traffic load-sharing among multiple links, e.g., links
      connecting two different ASNs.  Multiple alternate topologies
      could be created where the same link is given a different cost
      in each topology.  These topologies would then have subsets of
      prefixes mapped to them, thus engineering different inter-AS
      paths for these prefixes.  Notice that for accurate
      load-sharing, knowledge of the traffic matrix may be required,
      but this requirement equally applies to any traffic engineering
      solution.  Load-sharing could also be accomplished using
      weighted Equal-Cost Multipath (ECMP), accounting for link
      capacities as "weights" to distribute different proportions of
      egress traffic to the peering points.  See [KVALBEIN2007] for
      more information on multi-topology techniques in general and
      [I-D.ietf-idr-link-bandwidth] for information on weighted ECMP
      signaling in BGP.

   The main benefit of the proposed approach is centralized control of
   the above functions.  There is no need to configure policies on
   multiple devices, as all routing changes can be made using a
   uniform, light-weight API to the controller.  This ensures ease of
   automation and consistency of changes.  Furthermore, such a
   centralized model should be deployed to augment the classical
   distributed routing policy configuration.  The advantage is that
   centralized control can be disabled at any time, reverting the
   network to the "traditional" BGP decision model and thus providing
   a safe state to roll back to.  Next, knowing the link-state of the
   network may allow avoiding the BGP path-hunting problem and improve
   global BGP convergence timing in a large group of heavily meshed
   ASNs.  Additionally, to avoid the phenomenon of routing
   micro-loops, the controller could enforce a certain ordering of the
   network device programming sequence.
   Specifically, every time a link-state change is proposed to the
   controller, the devices in the network are programmed starting with
   those farthest away from the change in terms of the metric of the
   existing graph.  The same logic applies to link-down conditions
   detected by the controller via the health probing mechanism
   described below.

2.2.  Architectural Assumptions

   Firstly, the devices in the network are assumed (but not required)
   to have minimal BGP policy applied - enough for them to exchange
   routing information and compute best-paths based on shortest
   AS_PATH length.  This means that the configured policy should not
   override the best-path selection process using LOCAL_PREF or any
   other BGP attribute to enforce a custom routing policy.  The
   assumption of "minimal policy" allows the BGP Controller's update
   logic to be less intrusive, as described further in Section 4.7.
   Next, every device is assumed to advertise a locally bound prefix
   into BGP for the purpose of BGP peering with the controller.  That
   is, the controller peers "inband" with the devices it controls -
   either by initiating iBGP sessions to all devices or by passively
   accepting the sessions from the devices.  As will be shown in
   Section 5, the inband peering requirement is important to avoid
   inconsistencies between multiple controllers programming the same
   network.

   Another major assumption concerns how the link-state graph vertices
   are defined.  From the BGP Controller perspective, there are two
   types of vertices:

   o  Type 1, Individual Devices: BGP speakers that have the SAME BGP
      ASN configured, with the restriction that none of these speakers
      peers with another inside this ASN.  This could also be a single
      speaker in its own ASN.  Each of these speakers is treated as a
      vertex on its own.  Peering with other ASNs is not restricted.
      Notice how this is different from the traditional notion of a
      BGP ASN, where all speakers are assumed to be part of the same
      iBGP mesh.

   o  Type 2, Complete BGP ASN: BGP speakers in the SAME BGP ASN with
      the normal requirement that they ALL exchange their BGP views
      via iBGP, using either a full mesh or any other approach for
      full internal BGP state synchronization.  All of these BGP
      speakers are grouped into a single graph vertex.

   Figure 1 illustrates this concept:

      Legend
      ------- eBGP
      ....... iBGP

              eBGP Peering
                   |
             +-----+-----+
             |     |     |
             |   +-+-+   |
             |   |R3 |   |
             |   +-+-+   |
             |     |     |
             +-----+-----+
                   |

              eBGP Peering

           |            |
      +----+------------+----+
      |    |    AS1     |    |
      |  +-+-+        +-+-+  |
      |  |R1 |        |R2 |  |
      |  +-+-+        +-+-+  |
      |    |            |    |
      +----+------------+----+
           |            |

      Type 1: Each device is an individual graph vertex
              (three vertices, each with two edges).

           |            |
      +----+------------+----+
      |    |            |    |
      |  +-+-+        +-+-+  |
      |  |R1 |........|R2 |  |
      |  +---+.      .+---+  |
      |    .   .    .   .    |
      |    .    .  .    .    |
      |    .     ..     .    |
      |    .    .  .    .    |
      |  +---+ .    . +---+  |
      |  |R3 |........|R4 |  |
      |  +-+-+        +---+  |
      |    |            |    |
      +----+------------+----+
           |            |

      Type 2: All devices above are grouped into a
              single vertex with four edges.

                      Figure 1: Graph Vertices

   Routing information can be associated with a graph vertex either by
   means of static binding or dynamic discovery; this process is
   described in detail in Section 4.3.  When programming the network
   prefixes into the devices, the controller does not inject a prefix
   back into the vertex the prefix is associated with.

   The BGP Controller decision logic is independent of the address
   family and applies to both IPv4 and IPv6 prefixes equally.  It is
   possible to run two independent controllers, one for each address
   family.
   This allows for full "fate decoupling" between the address
   families, though it may result in duplication of the link-state
   information.

   The edges of the constructed link-state graph may have two
   attributes: metric, which is additive, and capacity (bandwidth),
   which is non-additive.  The former is used to compute shortest
   paths, and the latter can be used to compute ECMP weight values in
   cases where multiple equal-cost paths exist to the same vertex.
   For every ECMP path, the minimum capacity value that occurs along
   that path will be used as its weight by the controller, if the
   underlying network supports weighted ECMP functionality.

2.3.  BGP Controller

   Figure 2 demonstrates the BGP Controller peering with the network
   devices.  Multiple managed devices peer via eBGP following the
   traditional BGP design.  For simplicity, we assume that every
   device belongs to its own ASN - see Section 4.6 for more
   information on handling the "compound" Type-2 vertices consisting
   of multiple BGP speakers interconnected with an iBGP mesh.
   Prefixes P3, P4 and P5 are associated with the devices (vertices)
   in ASNs 3, 4, and 5 respectively, using techniques described in
   Section 4.3.  The remaining vertices are assumed to be purely
   transit for the purpose of this discussion.

   These devices exchange routing information in the usual manner, and
   the BGP Controller establishes iBGP peering sessions with every
   device.  It uses the technique described in Section 3.1 to build
   the inter-AS link-state graph.  For now, it is sufficient to say
   that the discovery process uses special "beacon" prefixes
   dynamically injected into the network and relayed back to the
   controller to discover the state of the links interconnecting the
   graph vertices.

   Legend:

   ------- iBGP (controller to network)
   ....... eBGP (ASN to ASN)

                      BGP Controller
                        +-------+
                        |       |
                        +-------+
                       ||   |   ||
                       ||   |   ||
        +-------------+|    |   |+-------------+
        |        +----+     |    +----+        |
        |        |          |         |        |
        |        v          |         v        |
        |      +---+        |       +---+      |
        |      |AS1|........|.......|AS2|      |
        v      +---+        |       +---+      v
      +---+      .          |      .    .    +---+
   P3 |AS3|.......          |     .      ....|AS4| P4
      +---+      .          |    .           +---+
         .        .         V   .              .
         .         .      +---+.               .
         .................|AS5|................
                          +---+
                           P5

                    Figure 2: BGP Controller

   At this point, the BGP Controller has knowledge of the link-state
   graph as well as the prefixes associated with every vertex, and can
   now run Dijkstra's SPF algorithm to compute shortest paths between
   vertices.  The result of this computation is a routing table built
   for every vertex.  Section 3.2 below demonstrates the adjacency
   list built by the controller for the above topology, as well as the
   routing tables computed for every vertex.  The next-hops in the
   routing tables presented there are simply the vertices to send the
   packets to.  When programming the network devices, the actual IP
   addresses of the next-hops are computed as described in
   Section 4.1.  This routing state corresponds to the unaltered
   (default) topology.

3.  Link-State Abstraction and Multiple Topologies

   This section provides detailed information on the link-state
   abstractions used by the controller and how those are used to
   perform traffic engineering in the underlying network.

3.1.  Link-State Discovery Process

   The network devices that the controller peers with establish eBGP
   peering sessions with each other.  The one-to-one correspondence
   between eBGP sessions and underlying IP links allows using the
   state of an eBGP session as an indication of the health of the
   corresponding IP link.
   Specifically, this is accomplished by injecting special "beacon"
   prefixes into every vertex (which could be a device or a collection
   of devices interconnected with an iBGP mesh) and expecting those
   beacons to be re-advertised back to the controller by every vertex
   adjacent to the point of injection.  If a particular BGP session is
   down, the injected prefix will not be re-advertised by the affected
   peer back to the controller, allowing us to conclude that the
   corresponding link is down.

   Figure 3 demonstrates this process.  For simplicity, we assume that
   every device belongs to its own BGP ASN.  The BGP Controller
   injects prefix X into device R1 and expects to hear this prefix
   back from device R2.  At the same time, it is desirable to prevent
   this prefix from leaking any farther than one hop away from R1,
   i.e., to make sure it is not re-advertised to R3.  To accomplish
   this, prefix X could be tagged with a special community value,
   which is replaced with the well-known community "no-export" when
   advertising over an eBGP session.  Because of this policy, the
   prefix will be announced back to the controller, as it uses an iBGP
   session for peering, but not any further to the eBGP peers of
   router R2 in our case.  An alternative to using standard BGP
   communities could be leveraging wide communities to limit the scope
   of the announced prefixes - see [I-D.raszuk-wide-bgp-communities]
   for more details on this technique.

   ------- iBGP (controller to network)
   ....... eBGP (ASN to ASN)

              +------------+
       +------| Controller |<------+
       |      +------------+       |
       X                           X
       |                           |
       V                           |
     +---+                       +---+
     |R1 |...........X..........>|R2 |
     +---+                       +---+
                                   .
               +---+              .
               |R3 |...............
               +---+

                Figure 3: Link-State Discovery

   Using this technique, the controller is able to build a view of the
   links connecting the graph vertices.
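
   The inference step can be sketched as a toy simulation (the router
   names are hypothetical, and this is an illustration of the logic
   rather than the draft's implementation):

```python
def beacon_reports(adjacency, up_sessions):
    """Simulate one discovery round: the controller injects a beacon
    prefix at every vertex; each neighbor whose eBGP session to the
    origin is up hears the beacon and relays it back to the controller
    over iBGP (tagged so it propagates no further).  Returns the
    (reporter, origin) pairs the controller would observe."""
    reports = set()
    for origin, neighbors in adjacency.items():
        for peer in neighbors:
            if frozenset((origin, peer)) in up_sessions:
                reports.add((peer, origin))
    return reports

def infer_links(adjacency, reports):
    """Classify each known adjacency: a link is considered up only if
    the far-side vertex reported the near side's beacon."""
    up, down = set(), set()
    for origin, neighbors in adjacency.items():
        for peer in neighbors:
            edge = frozenset((origin, peer))
            (up if (peer, origin) in reports else down).add(edge)
    return up, down
```

   For example, if the R1-R3 session is down, R3 never relays R1's
   beacon, so the controller marks the R1-R3 link down while the R1-R2
   link stays up.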
   Notice that if two parallel links connect a pair of vertices, this
   method will not be able to differentiate between them.  For
   simplicity, the proposal is that such parallel links should be
   grouped into a single logical IP link using, for example,
   [IEEE8023AD] technology.

3.2.  The Default Topology

   When the controller starts, it discovers the current network graph
   and computes the routing table, assuming that all links have the
   same metric value.  Figure 4 illustrates the adjacency list
   describing the graph taken from Figure 2, along with the routing
   table computed for every vertex/ASN.  The numbers on the graph
   edges designate the link costs.

   Inter-AS Graph and Prefixes

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+
                    P5

   Inter-AS Graph Adjacency List    Per-ASN Routing Table

   +-----+--------------+   +-----+----------------------------+
   | Src | Dst ASNs     |   | AS  | Prefix:Next-Hop(s)         |
   +-----+--------------+   +-----+----------------------------+
   | AS1 | AS2,AS3      |   | AS1 | P3:AS3,P4:AS2,P5:[AS2,AS3] |
   +-----+--------------+   +-----+----------------------------+
   | AS2 | AS1,AS4,AS5  |   | AS2 | P3:AS1,P4:AS4,P5:AS5       |
   +-----+--------------+   +-----+----------------------------+
   | AS3 | AS1,AS5      |   | AS3 | P3:Self,P4:AS5,P5:AS5      |
   +-----+--------------+   +-----+----------------------------+
   | AS4 | AS2,AS5      |   | AS4 | P3:AS5,P4:Self,P5:AS5      |
   +-----+--------------+   +-----+----------------------------+
   | AS5 | AS4,AS2,AS3  |   | AS5 | P3:AS3,P4:AS4,P5:Self      |
   +-----+--------------+   +-----+----------------------------+

                  Figure 4: Unaltered Routing State

3.3.  Alternate Topologies

   Assume the following TE requirements for illustrative purposes:

   o  Traffic from AS4 to P5 needs to traverse AS2.

   o  Traffic to P4 from AS5 needs to be load-shared (ECMP) over two
      paths: direct and via AS2.

   o  Traffic from AS3 to P5 must not use the direct path.

   These requirements could be satisfied with two different
   topologies:

   o  Topology 1 has a "very large" metric assigned to the links
      between AS4 and AS5 and between AS3 and AS5.

   o  Topology 2 has a metric value of 2 assigned to the link between
      AS4 and AS5.

   The prefixes map to the topologies as follows: P5 -> Topology 1 and
   P4 -> Topology 2.  P3 retains its mapping to the default
   (unaltered) topology, which we will call Topology 0 so that all
   topologies can be referred to by number.  The assumption of a "very
   large" metric is important - the path containing such a link could
   still be used if all alternate paths are down because of physical
   failures.  For simplicity, we assume "very large" equals 100 in the
   case under consideration.  The set of topologies and associated
   prefixes would look as in Figure 5, where the numbers on the links
   designate their metrics.

   [Topology 0]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+

   [Topology 1]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ..(100)..|AS5|..(100).....
                   +---+
                    P5

   [Topology 2]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
    |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(2)......
                   +---+

                  Figure 5: Alternate Topologies

   Based on the set of topologies presented above, the BGP Controller
   will compute the routing tables shown in Figure 6, which reflect
   the desired traffic engineering goals defined previously.  The
   entries that differ from the routing decisions in the unaltered
   topology are highlighted with asterisk (*) characters.  Notice that
   AS3 now sees P4 as ECMP-reachable via AS1 and AS5, because of the
   metric change in Topology 2.  The original traffic engineering
   policy requirements did not call for that, but this result appears
   because of the change made between AS4 and AS5, which is a natural
   effect with shortest-path, destination-based forwarding techniques.

   Per-ASN Routing Table

   +-----+--------------------------------+
   | AS  | Prefix:Next-Hop(s)             |
   +-----+--------------------------------+
   | AS1 | P3:AS3,P4:AS2,*P5:AS2*         |
   +-----+--------------------------------+
   | AS2 | P3:AS1,P4:AS4,P5:AS5           |
   +-----+--------------------------------+
   | AS3 | P3:Self,*P4:[AS5,AS1]*,*P5:AS1*|
   +-----+--------------------------------+
   | AS4 | P3:AS5,P4:Self,*P5:AS2*        |
   +-----+--------------------------------+
   | AS5 | P3:AS3,*P4:[AS4,AS2]*,P5:Self  |
   +-----+--------------------------------+

              Figure 6: Multi-Topology Routing Tables

   The controller will push the computed routing tables to the network
   devices using higher LOCAL_PREF values to ensure that the new
   information overrides the routing decisions that the "traditional"
   BGP processes running on the BGP speakers have already made.  It is
   possible to use other attributes to signal better preference, but
   LOCAL_PREF has the benefit of being evaluated very early in the BGP
   tie-breaking process.

3.4.  Overloading a Vertex

   This section illustrates a special but important practical case of
   "overloading" a graph vertex, such that all traffic bypasses the
   vertex.
   This operation could be used in a scenario in which a particular
   network device needs an upgrade and requires all traffic to be
   dried out of it.  Figure 7 demonstrates the implementation of this
   policy with respect to the AS5 vertex.  Topology 0 has no prefixes
   mapped to it; all prefixes are mapped to Topology 2 instead.  This
   topology has a cost of 100 assigned to all links connected to AS5,
   which forces all traffic to avoid transiting AS5.

   [Topology 0]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
    |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+

   [Topology 2]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .      (100)        +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ..(100)..|AS5|..(100).....
                   +---+
                    P5

   Per-ASN Routing Table

   +-----+--------------------------------+
   | AS  | Prefix:Next-Hop(s)             |
   +-----+--------------------------------+
   | AS1 | P3:AS3,P4:AS2,P5:[AS2,AS3]     |
   +-----+--------------------------------+
   | AS2 | P3:AS1,P4:AS4,P5:AS5           |
   +-----+--------------------------------+
   | AS3 | P3:Self,*P4:AS1*,P5:AS5        |
   +-----+--------------------------------+
   | AS4 | P3:*AS2*,P4:Self,P5:AS5        |
   +-----+--------------------------------+
   | AS5 | P3:AS3,P4:AS4,P5:Self          |
   +-----+--------------------------------+

                 Figure 7: Overloading a Vertex

4.  Implementation Details

4.1.  Programming Next-Hops

   As mentioned previously, the prefixes that the controller injects
   into the network need to have their next-hops properly resolved.
   In the simplest case, the next-hops could be the remote IP
   addresses of the links directly connected to the device programmed
   by the controller.
684 This, however, adds certain complexities due to the IP address 685 variability on the point-to-point links connecting the network 686 devices. An alternative could be injecting pre-generated next-hops 687 into the devices - one per device - and resolving them recursively 688 via BGP. 690 Specifically, every graph vertex would have a host route (either IPv4 691 or IPv6) associated with it. The controller would inject this prefix 692 into the respective device(s) associated with this 693 vertex (see Section 4.6), tagged with the special community value discussed in 694 Section 3.1. Moreover, for simplicity, it is possible to re- 695 use the same prefix used for link-state discovery as the value of the 696 next-hop attribute, thus reducing the amount of supplementary routing 697 state injected by the controller. 699 Next, it is easy to notice that using the special BGP community to 700 limit the beacon/next-hop prefix propagation is not strictly 701 necessary. Indeed, the controller may simply discard all "special" 702 prefixes whose AS_PATH contains more than one AS-hop. However, this 703 will result in unneeded routing state propagated in the network, 704 which is not desirable from a manageability perspective. 706 4.2. Equal-Cost Multipath Routing 708 In many practical topologies, the controller may find multiple equal- 709 cost paths from one vertex to another. It may then proceed 710 to program multiple paths for the prefixes affected by this 711 decision. Either of the following two methods could accomplish the multiple-path 712 programming requirement: 714 o Using the BGP Add-Path extension [I-D.ietf-idr-add-paths], 715 specifying multiple next-hop values. 717 o Using the Diverse Path Advertisement method presented in [RFC6774] 718 to inject multiple paths. 720 Furthermore, it is possible to implement weighted ECMP functionality 721 with this approach, relying on [I-D.ietf-idr-link-bandwidth] for 722 weight signaling.
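The equal-cost test itself is simple: a neighbor is a valid ECMP next-hop if the cost of the connecting edge plus that neighbor's shortest distance to the destination equals the source's own shortest distance. A minimal sketch, with a hypothetical topology and function names of our own invention:

```python
import heapq

def dists(adj, root):
    """Dijkstra distances from root; adj is {u: {v: cost}}."""
    dist, heap = {root: 0}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def ecmp_next_hops(adj, src, dst):
    """All neighbors of src that lie on some shortest path to dst."""
    to_dst = dists(adj, dst)
    return sorted(n for n, w in adj[src].items()
                  if w + to_dst[n] == to_dst[src])

# Hypothetical "square" topology: two equal-cost routes from A to D.
adj = {"A": {"B": 1, "C": 1}, "B": {"A": 1, "D": 1},
       "C": {"A": 1, "D": 1}, "D": {"B": 1, "C": 1}}
print(ecmp_next_hops(adj, "A", "D"))   # ['B', 'C']
```

Each next-hop discovered this way would then be signaled to the device using Add-Path or Diverse Path, as listed above.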
The graph edges could have weights associated with 723 them, and a given path's weight computed as the minimum weight value 724 along the path, as mentioned previously. The logic behind the weight 725 selection is outside the scope of this document. 727 4.3. Prefix Discovery Process 729 In order to build routing state information, the controller needs to 730 know the "leaf" prefixes associated with the graph vertices. There 731 are two ways of accomplishing this: either by defining a static mapping 732 of prefixes to vertices in the BGP controller configuration, or by 733 letting the controller learn those prefixes in a dynamic fashion. In 734 both cases, the assumption is that the network reachability 735 information is already advertised into BGP, such that the regular "in- 736 band" routing model is working. 738 The controller may dynamically associate a prefix with a vertex 739 using two criteria: firstly, by observing an empty AS_PATH in the 740 prefix received from the managed device; and secondly, by filtering out 741 prefixes injected for the purpose of network health discovery and 742 next-hop programming. The controller treats everything that matches 743 these two criteria as the routing information associated with the 744 respective vertex. 746 4.4. Sequenced Device Programming 748 Distributed routing systems are susceptible to transient 749 inconsistencies when the network state changes in a way that 750 requires a new best-path election. Since a topological event 751 (e.g. a link flap) is not propagated in an instant, devices that are 752 closer to the origin of the event would update their forwarding 753 tables faster, as compared to others. The devices directly adjacent 754 to those that have their tables already updated would still be using 755 old forwarding state. This would create transient routing loops for 756 the time it takes to fully synchronize the forwarding state of all 757 devices.
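A controller that knows the whole graph can order its pushes to avoid these transient loops, programming the devices farthest from the failed element first. A sketch of such an ordering, using a hypothetical adjacency structure and plain hop count as the distance metric:

```python
from collections import deque

def update_order(adj, event_vertex):
    """Order devices for programming: farthest from the event first,
    so new state "implodes" toward the change instead of exploding
    outward from it."""
    # Breadth-first search gives hop distances from the event.
    hops = {event_vertex: 0}
    queue = deque([event_vertex])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    # Farthest-first; ties broken by name for determinism.
    return sorted(hops, key=lambda v: (-hops[v], v))

adj = {"AS1": ["AS2", "AS3"], "AS2": ["AS1", "AS4", "AS5"],
       "AS3": ["AS1", "AS5"], "AS4": ["AS2", "AS5"],
       "AS5": ["AS2", "AS3", "AS4"]}
print(update_order(adj, "AS5"))  # ['AS1', 'AS2', 'AS3', 'AS4', 'AS5']
```

Here a failure at AS5 would be programmed last, after the two-hop-distant AS1 and the directly attached neighbors.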
759 Since the controller is aware of the full network topology, it may 760 avoid the above scenario by pushing the routing updates in proper 761 sequence - starting with the vertices that are farthest away from the 762 location of the event. This way the newly programmed state will 763 "implode" toward the change, as opposed to "exploding" from the 764 event's point of occurrence. Such sequencing is similar to the 765 process outlined in [RFC6976], but relies on centralized programming, 766 which makes it very simple to implement. 768 4.5. Mapping Prefixes to Topologies 770 The controller needs a manageable way of associating discovered 771 prefixes with any of the topologies defined by the third-party 772 applications. As mentioned previously, all prefixes are by default 773 mapped to the default topology, which corresponds to the actual 774 network state. Once an alternate topology has been defined, prefixes 775 could be mapped to this new topology. One possible way of 776 implementing such a mapping table could be by maintaining a radix tree 777 data-structure, which associates a prefix with the corresponding 778 topology. Using longest-match lookup in this table for each 779 discovered prefix would then yield the topology that this prefix 780 belongs to. This allows for easy and natural grouping of prefix-to- 781 topology mappings, while maintaining familiar semantics of longest- 782 match routing lookups. To implement the default mapping, the 783 prefixes 0.0.0.0/0 and ::/0 should always be in the radix tree, 784 pointing to one of the defined topologies. When those prefixes are 785 deleted per application request, the BGP controller would need to re- 786 insert them, linking back to the default topology again. 788 4.6. Autonomous Systems with iBGP Peering Mesh 790 The BGP Controller treats BGP ASNs that have a form of internal BGP 791 mesh differently than systems that do not peer over iBGP.
Such 792 systems are perceived as an atomic opaque graph vertex for the 793 purpose of next-hop and beacon prefix injection. The routing inside 794 such an ASN is not defined by the controller, but rather relies on some 795 other mechanism, such as an IGP. The controller only defines egress 796 points out of the ASN, and can possibly specify weights associated 797 with the exit points, to allow for weighted ECMP load-distribution. This 798 treatment naturally arises from the fact that iBGP-injected beacon 799 prefixes are not relayed to iBGP peers. Furthermore, the beacon 800 prefixes learned from eBGP neighbors are propagated to all iBGP 801 peers, but not relayed back to the BGP Controller when learned over an 802 iBGP session. Thus, the controller will discover peering links of 803 every "edge" router in such a BGP ASN with all external peers, but will 804 not be able to see the internal iBGP peering mesh. 806 If the underlying ASN implements iBGP route reflection or BGP 807 Confederations, only the routers that form eBGP sessions with 808 external ASNs need to have the routing information injected into 809 them. The routing information will disseminate to the internal 810 speakers by means of the normal BGP replication process, with unmodified 811 next-hops and LOCAL_PREF attribute value, thus ensuring that it 812 overrides the normal "in-band" routing information. 814 When programming ECMP paths, it may happen that the egress points 815 specified by the controller do not satisfy iBGP requirements for 816 multipath (e.g. IGP costs to reach the egress points could be 817 different). In such a case, normal BGP tie-breaking will occur and 818 only ECMP-equivalent paths will be installed in the RIB. 819 Alternatively, if the underlying ASN implements tunneling techniques, 820 it is possible to perform load sharing even if the IGP costs toward 821 the BGP next-hops are different. 823 4.7.
Minimizing Controller-Injected State 825 The BGP Controller can push down all of the prefixes it computes 826 paths for: that is, all prefixes known in the network. This means 827 that for every prefix present in the "regular" eBGP interconnected 828 topology the controller will inject the same prefix with different 829 attributes. It is also possible for the controller to push down only 830 the "delta": the prefixes that need their next-hops/paths 831 changed, based on the supplied policy. This mode of operation 832 requires that the underlying network find the best-paths between the 833 graph vertices using the "shortest-path logic", where the path length 834 equals the length of the AS_PATH attribute. This is equivalent to 835 running Dijkstra's SPF algorithm on the graph with unit metric values assigned 836 to the edges. This is needed since the controller performs path 837 computation using SPF logic, and BGP could elect different paths if 838 some policies are present. Ensuring that both the underlying network 839 and the controller perform the same computations effectively allows 840 for the "delta" mode of operation. 842 Publishing only the "delta" state to the network means more 843 "intelligent" work on the controller side and special requirements on 844 the network policies. However, the benefit is significantly reduced 845 intervention in the regular forwarding, since the majority of the state is 846 not likely to change in many cases. Once again, it is possible to 847 implement the mode where the controller overrides all routing 848 information. 850 5. Handling Failure Scenarios 852 This section reviews two different types of failure scenarios: 853 failures in the underlying network and the controller failures. 855 5.1. Underlying Network Failures 857 Either a vertex (a device) or a graph edge (a network link) may 858 fail.
For the BGP Controller, an underlying failure, be it an edge or a 859 vertex, is visible only after all eBGP sessions interconnecting two 860 vertices have failed. This could be driven either by an event, such 861 as a link-down condition, which is typically fast, or by BGP keepalive 862 timer expiration, which is naturally slower. When this happens, the 863 BGP processes withdraw the corresponding beacon prefixes and the 864 controller will declare the corresponding edge down. This will 865 result in a re-run of SPF for all active topologies and a push of new 866 routing information down to the network. Since the central 867 controller is involved in reconvergence, the restoration time will be 868 longer, compared to the restoration process driven purely by 869 underlying BGP processes. Indeed, the restoration time now includes the 870 failure detection time, SPF re-computations, and the push of new prefixes. 871 However, it could be observed that such centralized reconvergence is 872 free from the BGP Path-Hunting problem, and hence improvements could 873 be noticed in complex meshed topologies. 875 Furthermore, recovery could be faster if multiple paths (ECMP) exist 876 for a prefix, and only a single path fails. In this case, the BGP 877 process will simply invalidate the failed path even before the 878 controller has signaled removal, and will continue using only 879 the active paths. The details of this reconvergence are complicated, 880 as changing ECMP is a hardware-dependent operation. Furthermore, 881 some implementations may support the "consistent hashing" technique 882 that minimizes the impact of ECMP group size changes on flow 883 affinities, as described in [RFC2992]. 885 5.2. BGP Controller Failures 887 Under normal circumstances, an operator may shut down a controller 888 for maintenance or other reasons. In this case, it is expected that 889 BGP sessions be closed following the normal BGP procedure, that is, by sending 890 a BGP Notification message and terminating the TCP session.
As a 891 result, all routers will withdraw the prefixes injected by the 892 controller and recalculate their best-paths. 894 If the controller fails abnormally, e.g. if the process crashes, the TCP 895 sessions that connect it to the underlying devices will either be 896 torn down, or be closed upon expiration of the BGP keepalive timer. The 897 latter will cause some delay before prefixes announced by the 898 failed controller are withdrawn. For the duration of that time, 899 the network will be forwarding traffic using possibly stale 900 information. Link/device failures will be handled locally, and in 901 some cases may cause traffic black-holes, if the only programmed path 902 fails. The duration of this "stale" period is equal to the time it 903 takes to detect the controller failure and update the BGP Loc-RIB, 904 followed by RIB/FIB reprogramming. 906 It is possible to use a single BGP controller along with the BGP route 907 persistence feature to maintain the injected paths even after the 908 BGP Controller failure (see [I-D.uttaro-idr-bgp-persistence]). After 909 the controller restarts, it will simply refresh the "stale" routing 910 information. In this scenario, forcing the network to revert to the 911 traditional BGP-based routing could be accomplished by instructing 912 the controller to inject its paths with a LOCAL_PREF value lower 913 than the default used in the network. The possible risk is that the 914 controller may fail in such a fashion that it will not be able to 915 inject any information into the network.
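The override-and-fallback behavior rests entirely on ordinary best-path selection. A toy model of the idea (hypothetical attribute values; only the first two tie-breaking steps of the real BGP decision process are modeled):

```python
def best_path(paths):
    """Toy best-path selection: higher LOCAL_PREF wins, then shorter
    AS_PATH.  The real BGP decision process has many more steps."""
    return max(paths, key=lambda p: (p["local_pref"], -len(p["as_path"])))

# A natively learned eBGP path versus a controller-injected one.
native = {"source": "ebgp", "local_pref": 100, "as_path": ["AS5", "AS3"]}
injected = {"source": "controller", "local_pref": 200, "as_path": ["AS3"]}

# While the controller session is up, its higher LOCAL_PREF wins...
print(best_path([native, injected])["source"])   # controller
# ...and once its routes are withdrawn, routing reverts automatically.
print(best_path([native])["source"])             # ebgp
```

The withdrawal of the injected path leaves the native path as the only candidate, so fallback requires no extra signaling.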
To maintain resilience, it is possible to run multiple 923 parallel BGP Controllers, assuming that they supply the network with 924 the same routing information, and differentiate themselves as 925 'primary' and 'backup'. The latter property could be accomplished by 926 using different LOCAL_PREF attribute values for primary/secondary 927 controllers - this allows having multiple controllers, backing up 928 each other. 930 With multiple BGP Controllers, it becomes critical for all of them to 931 make the same routing decisions. Even though only one controller 932 is programming the network, the backup paths injected by the others 933 must be consistent with the primary. To accomplish that, all 934 controllers must: 936 o Have the same view of the underlying network topology - i.e. build 937 the same link-state graph. In the simplest case, this could be 938 accomplished by relying on eventual consistency, that is, assuming 939 that under a non-partitioned scenario the controllers will 940 eventually receive the same link-state probe prefixes and build 941 the same resulting link-state database. Alternatively, a 942 consensus protocol, e.g. [PAXOS], could be executed amongst the 943 members of the redundant group to synchronize the link-state 944 database of the master process with the secondary processes. This 945 would ensure strong consistency of the link-state database, but 946 could be overbearing in terms of the amount of state that may need to be 947 reliably replicated. 949 o Maintain the same topology definition database and prefix-to- 950 topology mapping table - as commanded by external applications. 951 This is similar to the previous approach, but would involve much 952 less state to synchronize. Specifically, the topology definitions 953 (e.g. new link costs) and prefix-to-topology mapping information 954 need to be distributed. This state is submitted to the 955 controllers via an API defined for the third-party applications.
956 As before, it could be made the responsibility of an external 957 application to program all controllers with the same state and 958 ensure consistency. Alternatively, another strongly consistent 959 database could be used, leveraging the same consensus protocol. 961 5.4. Network Partitioning 963 This section reviews the possible "partitioning" scenarios, where 964 parts of the network may become managed by different controllers. 965 Situations like this are possible if the controllers are deployed 966 diversely, and may end up in a situation where one or more of the 967 controllers lose iBGP peering sessions with some network devices. 968 The main concern in such situations is programming the devices with 969 inconsistent information that may cause routing loops. 971 Firstly, notice that if device A can learn the "peering source" 972 prefix announced by device B, and the BGP Controller can peer with A, 973 then by transitivity the controller can also peer with B. This means 974 that either the controller and device A cannot learn any routing 975 information from B, or both of them can - excluding transient 976 situations. This property ensures that under proper configuration a 977 set of devices is either completely managed, or completely unmanaged 978 - that is, they share the same fate. This eliminates the scenario 979 where device A is programmed by controller X, device B is 980 programmed by controller Y, and the devices can reach each other 981 in-band. 983 Secondly, consider the transient case where A and B have in-band 984 connectivity, but for some time A is programmed by X and B is 985 programmed by Y. Recall that the absence of an iBGP session to a 986 device means that the device is declared as 987 having "infinite" costs in the link-state database. Thus, X will 988 always bypass B and Y will always bypass A, and hence a routing loop 989 can never form between A and B. 991 6.
Controller API 993 This section provides a set of requirements and guidance for the BGP 994 Controller API. The general recommendation is to base the API on 995 stateless principles, such as those found in the [REST] model. This approach 996 is efficient since no real-time event passing between the controller 997 and third-party application is needed, e.g. for the purpose of active 998 reaction to network failure events. The proposed controller model 999 assumes those events are handled by the message exchange in the 1000 network-controller loop. The following sections are structured 1001 around the "CRUD" operations - Create, Read, Update, Delete - commonly used 1002 in the REST model, and use HTTP verbs and pathnames for illustration. 1003 Furthermore, in the text below applications will be referred to as clients and the BGP 1004 Controller as the server, though 1005 the API could be implemented by a module separate from the main BGP 1006 Controller logic. 1008 6.1. Pathnames and Document Names 1010 The server presents the following pathnames to group various objects: 1012 o "/lsdb" - This is the document that stores the currently 1013 discovered inter-AS graph link state (link-state database). This 1014 document cannot be modified, only read. The LSDB data structure 1015 is a graph, represented in one of the common formats - e.g. as two 1016 collections: vertices and edges, where edges have associated 1017 states and weight (capacity). 1019 o "/topologies/" - This is a directory that stores documents 1020 corresponding to different topologies. Every document contains a 1021 topology definition. 1023 o "/mappings/ipv4" - This is the document that stores the IPv4 1024 prefix mappings to the topologies. Notice that if the 0.0.0.0/0 prefix 1025 is not found in this file, it is implicitly mapped to the default 1026 topology.
Internally in the BGP Controller this is stored as an 1027 efficient radix-tree, but the document represents the mappings as 1028 a collection of prefixes and associated topologies. 1030 o "/mappings/ipv6" - This is the document that stores the IPv6 1031 prefix mappings to the topologies. Same as the IPv4 mappings, 1032 except for the different address family. As with the IPv4 case, if the 1033 ::/0 prefix is not found in this document, it is implicitly mapped 1034 to the default topology. 1036 6.2. Encoding of the Documents and Objects 1038 Either JSON or XML is an acceptable format for encoding the document 1039 contents for programmability. JSON is preferred due to its 1040 lightweight nature and simpler semantics for transporting data 1041 structures. The documents passed with RESTful calls will contain 1042 logical descriptions of the graph vertices and edges. A vertex is 1043 uniquely identified by an opaque name, e.g. a text string. The 1044 mapping between this identifier and the underlying network devices is 1045 to be done elsewhere in the controller data structures, and does not 1046 need to be exposed to the applications. 1048 6.3. Creating & Deleting State 1050 The only state that could be created is the collection of topology 1051 definitions, under the "/topologies/" directory. The topology objects 1052 are to be created using the "POST" HTTP operation - supplying some 1053 basic content, e.g. an empty set of links and associated costs using 1054 the appropriate encoding. Correspondingly, a topology could be 1055 deleted using the DELETE operation. Notice that the default topology 1056 is not present in this directory, and thus could never be deleted. 1057 Notice that the separate "mapping" documents will be referencing the 1058 topology names, and when a topology is deleted, such mappings will 1059 become invalid. It is up to the implementation to handle such 1060 referential integrity - e.g.
by ignoring such entries in the mapping 1061 document, or disallowing the topology file to be deleted as long as 1062 active references are present. 1064 6.4. Reading State 1066 Every document described above could be read and transported to the 1067 client using an HTTP GET request. The document is transported 1068 completely in the corresponding encoding. It is up to the controller 1069 to implement proper read/write locking to avoid inconsistencies in 1070 data when multiple clients are present. No locking API should ever 1071 be exposed to the client, since that would affect the stateless 1072 nature of the communications. Notice that reading the link-state 1073 database is mostly informative to the client, since handling of the 1074 network failures is performed by the BGP Controller. 1076 6.5. Writing State 1078 The topology definition documents and the IPv4/IPv6 mapping tables 1079 could be fully re-written using the HTTP PUT verb. This means that 1080 with every operation, the client must supply the full new document, 1081 not an incremental change. It's up to the client to perform the 1082 merge of the new change with the already existing information. If 1083 consistency across multiple writers is required, it should be 1084 implemented by the clients, possibly via the use of an external 1085 shared locking API. Referential integrity checks could be 1086 implemented in the controller, e.g. to validate that the topology 1087 references in the mapping actually exist, or alternatively could be 1088 left to the client. 1090 It is possible to implement incremental changes using the HTTP PATCH 1091 verb semantics (see [RFC5789]) in the server. In this case, it's up 1092 to the server to perform a proper merge of the incremental change and 1093 ensure there are no conflicts or duplicates. This is a more complex 1094 model as compared to the simple "PUT" logic. 1096 6.6.
Typical API Call Sequence 1098 A typical sequence of actions for a client wishing to perform traffic 1099 engineering could be as follows (assuming the absence of the PATCH 1100 operation): 1102 o Decide which prefixes are to be affected by this operation. 1104 o Create a topology to perform the link-state operation, or re-use 1105 the one previously created by this application. Verify topology 1106 existence using the GET operation in the "/topologies" directory. 1108 o Add new links with the desired costs to the topology. If the 1109 topology already exists, read it first using a GET operation, and 1110 then perform a merge on the client side, later submitting the 1111 updated topology using a PUT operation. 1113 o Obtain current prefix mappings for the desired address family 1114 using the GET operation. Parse the mappings and perform any 1115 consistency checks required, followed by adding the entries for 1116 prefixes to act upon, mapping them to the topology created/updated 1117 above. 1119 o HTTP PUT the new mappings file, replacing the one existing on 1120 the server as a whole. 1122 6.7. Limitations 1124 The API is purposely focused only on routing information 1125 manipulation, and does not provide any way to verify that the requested 1126 operation has been accomplished. Such monitoring should be done 1127 separately, using either mechanisms available in BGP (e.g. by learning 1128 of the prefixes' new paths via a separate session) or outside of BGP, 1129 e.g. the BGP Monitoring Protocol ([I-D.ietf-grow-bmp]) or the Multi- 1130 Threaded Routing Toolkit ([RFC6396]). 1132 7. Security Considerations 1134 The design of the BGP Controller in its simplest form assumes no 1135 access control in the API presented to the third-party 1136 applications. Access could be limited at the transport level, e.g. 1137 by using protocol (HTTP) authentication or access control 1138 capabilities, but the API itself does not provide any logic to 1139 segregate applications - i.e.
there is currently no way to limit an 1140 application to manipulating only a certain subset of the IP address 1141 space. 1143 8. Acknowledgements 1145 The authors would like to thank Robert Raszuk for reviewing the 1146 document and providing valuable feedback. 1148 9. References 1150 9.1. Normative References 1152 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 1153 Protocol 4 (BGP-4)", RFC 4271, January 2006. 1155 [RFC5789] Dusseault, L. and J. Snell, "PATCH Method for HTTP", RFC 1156 5789, March 2010. 1158 [RFC1997] Chandrasekeran, R., Traina, P., and T. Li, "BGP 1159 Communities Attribute", RFC 1997, August 1996. 1161 9.2. Informative References 1163 [I-D.lapukhov-bgp-routing-large-dc] 1164 Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP for 1165 routing in large-scale data centers", draft-lapukhov-bgp- 1166 routing-large-dc-06 (work in progress), August 2013. 1168 [I-D.ietf-grow-bmp] 1169 Scudder, J., Fernando, R., and S. Stuart, "BGP Monitoring 1170 Protocol", draft-ietf-grow-bmp-07 (work in progress), 1171 October 2012. 1173 [RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast 1174 Services", BCP 126, RFC 4786, December 2006. 1176 [RFC6774] Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K. 1177 Kumaki, "Distribution of Diverse BGP Paths", RFC 6774, 1178 November 2012. 1180 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 1181 Francois, P., and O. Bonaventure, "Framework for Loop-Free 1182 Convergence Using the Ordered Forwarding Information Base 1183 (oFIB) Approach", RFC 6976, July 2013. 1185 [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path 1186 Algorithm", RFC 2992, November 2000. 1188 [RFC6241] Enns, R., Bjorklund, M., Schoenwaelder, J., and A. 1189 Bierman, "Network Configuration Protocol (NETCONF)", RFC 1190 6241, June 2011. 1192 [RFC6396] Blunk, L., Karir, M., and C. Labovitz, "Multi-Threaded 1193 Routing Toolkit (MRT) Routing Information Export Format", 1194 RFC 6396, October 2011.
1196 [I-D.ietf-idr-add-paths] 1197 Walton, D., Retana, A., Chen, E., and J. Scudder, 1198 "Advertisement of Multiple Paths in BGP", draft-ietf-idr- 1199 add-paths-08 (work in progress), December 2012. 1201 [I-D.ietf-idr-link-bandwidth] 1202 Mohapatra, P. and R. Fernando, "BGP Link Bandwidth 1203 Extended Community", draft-ietf-idr-link-bandwidth-06 1204 (work in progress), January 2013. 1206 [I-D.raszuk-wide-bgp-communities] 1207 Raszuk, R., Haas, J., Amante, S., Steenbergen, R., 1208 Decraene, B., and P. Jakma, "Wide BGP Communities 1209 Attribute", draft-raszuk-wide-bgp-communities-03 (work in 1210 progress), July 2012. 1212 [I-D.uttaro-idr-bgp-persistence] 1213 Uttaro, J., Chen, E., Decraene, B., and J. Scudder, 1214 "Support for Long-lived BGP Graceful Restart", draft- 1215 uttaro-idr-bgp-persistence-02 (work in progress), July 1216 2013. 1218 [JAKMA2008] 1219 Jakma, P., "BGP Path Hunting", 2008, . 1222 [PAXOS] Wikipedia, ., "Paxos", , 1223 . 1225 [REST] Wikipedia, ., "Representational state transfer", , . 1228 [RWHITE2005] 1229 White, R., "Graph Overlays on Path Vector: A Possible Next 1230 Step in BGP", June 2005, . 1233 [KVALBEIN2007] 1234 Kvalbein, A. and O. Lysne, "How can Multi-Topology Routing 1235 be used for Intradomain Traffic Engineering?", 2007. 1237 [IEEE8023AD] 1238 IEEE 802.3ad, ., "IEEE Standard for Link aggregation for 1239 parallel links", October 2000. 1241 [RCP] Caesar, M., Caldwell, D., Feamster, N., and J. Rexford, 1242 "Design and Implementation of a Routing Control Platform 1243 ", March 2005, 1244 . 1246 Authors' Addresses 1247 Petr Lapukhov 1248 Microsoft Corporation 1249 One Microsoft Way 1250 Redmond, WA 98052 1251 US 1253 Phone: +1 425 7032723 1254 Email: petrlapu@microsoft.com 1255 URI: http://microsoft.com/ 1257 Edet Nkposong 1258 Microsoft Corporation 1259 One Microsoft Way 1260 Redmond, WA 98052 1261 US 1263 Phone: +1 425 7071045 1264 Email: edetn@microsoft.com 1265 URI: http://microsoft.com/