Network Working Group                                        P. Lapukhov
Internet-Draft                                               E. Nkposong
Intended status: Informational                     Microsoft Corporation
Expires: March 06, 2014                               September 02, 2013

Centralized Routing Control in BGP Networks Using Link-State Abstraction
                        draft-lapukhov-bgp-sdn-00

Abstract

   Some operators deploy networks consisting of multiple BGP Autonomous
   Systems (ASNs) under the same administrative control.  There are
   also implementations which use only one routing protocol, namely
   BGP, as in [I-D.lapukhov-bgp-routing-large-dc], for example.
   In such designs, inter-AS traffic engineering is commonly
   implemented using BGP policies, by configuring multiple routers at
   the ASN boundaries.  This distributed policy model is difficult to
   manage and scale due to its dependency on complex routing policies
   and the need to develop and maintain a model for per-prefix path
   preference signaling.  One example of such a model is standard BGP
   community-based signaling (see [RFC1997]), which requires careful
   documentation and consistent configuration.  Furthermore,
   automating such policy configuration changes for the purpose of
   centralized management requires additional effort and depends on a
   particular vendor's configuration management interface (CLI
   extensions, NETCONF [RFC6241], etc.).

   This document proposes a method for inter-AS traffic engineering
   for use with the kinds of deployment scenarios outlined above.  No
   protocol changes or additional features are required to implement
   this method.  The key to the proposed methodology is a new software
   entity called the "BGP Controller" - a special-purpose application
   that peers with all eBGP speakers in the managed network.  The
   controller constructs a live state of the underlying BGP ASN graph
   and presents a multi-topology view of this graph via a simple API
   to third-party applications interested in performing network
   traffic engineering.  An example application could be an
   operational tool used to drain traffic from network devices.  In
   response to changes in the logical network topology proposed by
   these applications, the controller computes new routing tables and
   pushes them down to the network devices via the established BGP
   sessions.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute working documents as
   Internet-Drafts.  The list of current Internet-Drafts is at
   http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on March 06, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Overview
     2.1.  Use Cases
     2.2.  Architectural Assumptions
     2.3.  BGP Controller
   3.  Link-State Abstraction and Multiple Topologies
     3.1.  Link-State Discovery Process
     3.2.  The Default Topology
     3.3.  Alternate Topologies
     3.4.  Overloading a Vertex
   4.  Implementation Details
     4.1.  Programming Next-Hops
     4.2.  Equal-Cost Multipath Routing
     4.3.  Prefix Discovery Process
     4.4.  Sequenced Device Programming
     4.5.  Mapping Prefixes to Topologies
     4.6.  Autonomous Systems with iBGP Peering Mesh
     4.7.  Minimizing Controller-Injected State
   5.  Handling Failure Scenarios
     5.1.  Underlying Network Failures
     5.2.  BGP Controller Failures
     5.3.  Multiple BGP Controllers
     5.4.  Network Partitioning
   6.  Controller API
     6.1.  Pathnames and Document Names
     6.2.  Encoding of the Documents and Objects
     6.3.  Creating and Deleting State
     6.4.  Reading State
     6.5.  Writing State
     6.6.  Typical API Call Sequence
     6.7.  Limitations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   BGP was intentionally designed as a path-vector protocol, since
   efficiently distributing link-state information for an
   Internet-sized graph is virtually impossible.
   However, some network deployments leverage multiple BGP ASNs to
   separate IGP domains, or simply use BGP as the only routing
   protocol.  See, for example, [I-D.lapukhov-bgp-routing-large-dc],
   which proposes using a BGP AS either per network device or per
   "horizontal" device group within a data center.  In such cases, the
   number of BGP ASNs is very small when compared to the Internet - on
   the order of a few thousand in the largest case.

   Under these assumptions, it becomes possible to build and maintain
   a link-state graph of the complete inter-AS topology and compute
   network paths based on this link-state information.  In
   accomplishing this, it is desirable to avoid adding any protocol
   extensions (such as those described, for example, in [RWHITE2005])
   so that current implementations can leverage the proposed method.
   Instead, this document proposes the use of a centralized agent
   (referred to as the "BGP Controller" or simply "the controller")
   that peers with all eBGP speakers in the underlying network.  The
   BGP Controller is responsible for constructing an up-to-date
   link-state view of the BGP inter-AS graph and pushing down routing
   information (prefixes and their associated next-hops) to the
   network devices via BGP updates.  The new routing information
   reflects the results of link-state path computations performed by
   the controller.  Such a routing information push is possible
   because BGP supports the next-hop attribute, which can be
   recursively resolved via either IGP or BGP.  Notice that while the
   controller pushes routing information to a device, the underlying
   BGP process also computes best-paths for the same prefixes using
   the regular path-vector logic.  However, the BGP Controller can
   override this information by manipulating the BGP attributes of
   injected routes, such as LOCAL_PREF, to make its own advertisements
   more preferred.
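
   To make the preceding description concrete, the sketch below shows
   how a controller could compute equal-cost next-hops from a
   link-state graph using Dijkstra's algorithm.  This is illustrative
   only, not part of the proposal; the five-ASN graph, unit metrics,
   and all names are hypothetical.

```python
import heapq

def spf(graph, root):
    """Dijkstra shortest-path distances from root over a metric-weighted
    adjacency map {vertex: {neighbor: metric}}."""
    dist = {root: 0}
    queue = [(0, root)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, metric in graph[u].items():
            if d + metric < dist.get(v, float("inf")):
                dist[v] = d + metric
                heapq.heappush(queue, (d + metric, v))
    return dist

def ecmp_next_hops(graph, src, dst):
    """Equal-cost first hops from src toward dst: every neighbor lying
    on some shortest path.  Assumes symmetric link metrics."""
    if src == dst:
        return {"Self"}
    to_dst = spf(graph, dst)  # distance from every vertex to dst
    return {n for n, metric in graph[src].items()
            if metric + to_dst[n] == to_dst[src]}

# Hypothetical five-ASN graph, one vertex per ASN, unit metrics.
GRAPH = {
    "AS1": {"AS2": 1, "AS3": 1},
    "AS2": {"AS1": 1, "AS4": 1, "AS5": 1},
    "AS3": {"AS1": 1, "AS5": 1},
    "AS4": {"AS2": 1, "AS5": 1},
    "AS5": {"AS2": 1, "AS3": 1, "AS4": 1},
}
```

   In this toy graph, a prefix attached to AS5 is reachable from AS1
   via two equal-cost first hops, AS2 and AS3; the controller would
   translate such vertex-level next-hops into actual next-hop IP
   addresses before injecting routes.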

   Third-party applications can influence routing computations by
   creating logical alterations of the network link-state graph, e.g.,
   changing the cost of links from the BGP Controller's point of view.
   This document refers to those constructs as "alternate topologies"
   (or simply "topologies" for short), while the original, unaltered
   link-state graph is referred to as the "default topology".  The
   controller uses these alternate topologies to make routing
   decisions different from those that BGP would have made based on
   available information.  It is possible to create multiple alternate
   topologies and associate different prefixes with every topology,
   with the restriction that each prefix maps to one and only one
   topology.  Once this mapping is defined, the BGP Controller
   performs autonomously, detecting network faults and reacting by
   re-computing routing information as needed, based on the effect
   that the failure has across all instantiated topologies.

   In many aspects, the proposed method was inspired by and is similar
   to the "Routing Control Platform" [RCP], but differs in that
   link-state discovery is done using BGP mechanics only, and overall
   BGP is the only protocol used to build the system.

2.  Overview

2.1.  Use Cases

   The primary intended use case of the BGP Controller is inter-AS
   traffic engineering.  This includes, but is not limited to, the
   following:

   o  Link/device overloading for the purpose of drying out traffic
      from a device.  A link, or group of links, connecting one ASN to
      another could be declared as having "infinite" cost from the
      controller's viewpoint, causing the latter to re-compute paths
      and instruct the network devices to bypass those links.  Notice
      that this does not include "internal" overload (inside an ASN),
      which may need to be handled using IGP techniques.

   o  Traffic load-sharing among multiple links, e.g., links
      connecting two different ASNs.  Multiple alternate topologies
      could be created where the same link is given a different cost
      in each topology.  These topologies would then have subsets of
      prefixes mapped to them, thus engineering different inter-AS
      paths for these prefixes.  Notice that for accurate
      load-sharing, knowledge of the traffic matrix may be required,
      but this requirement equally applies to any traffic engineering
      solution.  Load-sharing could also be accomplished using
      weighted Equal-Cost Multipath (ECMP), accounting for link
      capacities as "weights" to distribute different proportions of
      egress traffic to the peering points.  See [KVALBEIN2007] for
      more information on multi-topology techniques in general and
      [I-D.ietf-idr-link-bandwidth] for information on weighted ECMP
      signaling in BGP.

   The main benefit of the proposed approach is centralized control of
   the above functions.  There is no need to configure policies on
   multiple devices, as all routing changes can be made using a
   uniform, light-weight API to the controller.  This ensures ease of
   automation and consistency of changes.  Furthermore, such a
   centralized model should be deployed to augment the classical
   distributed routing policy configuration.  The advantage is that
   centralized control can be disabled at any time, reverting the
   network to the "traditional" BGP decision model and thus providing
   a safe state to roll back to.  Next, knowing the link-state of the
   network may allow avoiding the BGP path-hunting problem and improve
   global BGP convergence timing in a large group of heavily meshed
   ASNs.  Additionally, to avoid the phenomenon of routing
   micro-loops, the controller could enforce a certain ordering of the
   network device programming sequence.
   Specifically, every time a link-state change is proposed to the
   controller, the devices in the network are programmed starting with
   those farthest away from the change in terms of the metric of the
   existing graph.  The same logic applies to link-down conditions
   detected by the controller via the health probing mechanism
   described below.

2.2.  Architectural Assumptions

   Firstly, the devices in the network are assumed (but not required)
   to have minimal BGP policy applied - enough for them to exchange
   routing information and compute best-paths based on shortest
   AS_PATH length.  This means that the configured policy should not
   override the best-path selection process using LOCAL_PREF or any
   other BGP attribute to enforce a custom routing policy.  The
   assumption of "minimal policy" allows the BGP Controller's update
   logic to be less intrusive, as described further in Section 4.7.
   Next, every device is assumed to advertise a locally bound prefix
   into BGP for the purpose of BGP peering with the controller.  That
   is, the controller peers "inband" with the devices it controls -
   either by initiating iBGP sessions to all devices or by passively
   accepting the sessions from the devices.  As will be shown in
   Section 5, the inband peering requirement is important to avoid
   inconsistencies between multiple controllers programming the same
   network.

   Another major assumption concerns how the link-state graph vertices
   are defined.  From the BGP Controller perspective, there are two
   types of vertices:

   o  Type 1, Individual Devices: BGP speakers that have the SAME BGP
      ASN configured, with the restriction that none of these speakers
      peers with another inside this ASN.  This could also be a single
      speaker in its own ASN.  Each of these speakers is treated as a
      vertex on its own.  Peering with other ASNs is not restricted.
      Notice how this is different from the traditional notion of a
      BGP ASN, where all speakers are assumed to be part of the same
      iBGP mesh.

   o  Type 2, Complete BGP ASN: BGP speakers in the SAME BGP ASN with
      the normal requirement that they ALL exchange their BGP views
      via iBGP, using either a full mesh or any other approach for
      full internal BGP state synchronization.  All of these BGP
      speakers are grouped into a single graph vertex.

   Figure 1 illustrates this concept:

      Legend
      ------- eBGP
      ....... iBGP

              eBGP Peering
                   |
             +-----+-----+
             |     |     |
             |   +-+-+   |
             |   |R3 |   |
             |   +-+-+   |
             |     |     |
             +-----+-----+
                   |

              eBGP Peering

           |            |
      +----+------------+----+
      |    |    AS1     |    |
      |  +-+-+        +-+-+  |
      |  |R1 |        |R2 |  |
      |  +-+-+        +-+-+  |
      |    |            |    |
      +----+------------+----+
           |            |

      Type 1: Each device is an individual graph vertex
              (three vertices, each with two edges).

           |            |
      +----+------------+----+
      |    |            |    |
      |  +-+-+        +-+-+  |
      |  |R1 |........|R2 |  |
      |  +---+.      .+---+  |
      |    .   .    .   .    |
      |    .    .  .    .    |
      |    .     ..     .    |
      |    .    .  .    .    |
      |  +---+ .    . +---+  |
      |  |R3 |........|R4 |  |
      |  +-+-+        +---+  |
      |    |            |    |
      +----+------------+----+
           |            |

      Type 2: All devices above are grouped into a
              single vertex with four edges.

                      Figure 1: Graph Vertices

   Routing information can be associated with a graph vertex either by
   means of static binding or dynamic discovery; this process is
   described in detail in Section 4.3.  When programming the network
   prefixes into the devices, the controller does not inject a prefix
   back into the vertex the prefix is associated with.

   The BGP Controller decision logic is independent of the address
   family and applies to both IPv4 and IPv6 prefixes equally.  It is
   possible to run two independent controllers, one for each address
   family.
   This allows for full "fate decoupling" between the address
   families, though it may result in duplication of the link-state
   information.

   The edges of the constructed link-state graph may have two
   attributes: metric, which is additive, and capacity (bandwidth),
   which is non-additive.  The former is used to compute shortest
   paths, and the latter can be used to compute ECMP weight values in
   cases where multiple equal-cost paths exist to the same vertex.
   For every ECMP path, the minimum capacity value that occurs along
   that path will be used as its weight by the controller, if the
   underlying network supports weighted ECMP functionality.

2.3.  BGP Controller

   Figure 2 demonstrates the BGP Controller peering with the network
   devices.  Multiple managed devices peer via eBGP following the
   traditional BGP design.  For simplicity, we assume that every
   device belongs to its own ASN - see Section 4.6 for more
   information on handling the "compound" Type-2 vertices consisting
   of multiple BGP speakers interconnected with an iBGP mesh.
   Prefixes P3, P4 and P5 are associated with the devices (vertices)
   in ASNs 3, 4, and 5 respectively, using techniques described in
   Section 4.3.  The remaining vertices are assumed to be purely
   transit for the purpose of this discussion.

   These devices exchange routing information in the usual manner, and
   the BGP Controller establishes iBGP peering sessions with every
   device.  It uses the technique described in Section 3.1 to build
   the inter-AS link-state graph.  For now, it is sufficient to say
   that the discovery process uses special "beacon" prefixes
   dynamically injected into the network and relayed back to the
   controller to discover the state of the links interconnecting the
   graph vertices.

   Legend:

   ------- iBGP (controller to network)
   ....... eBGP (ASN to ASN)

                      BGP Controller
                        +-------+
                        |       |
                        +-------+
                       ||   |   ||
                       ||   |   ||
        +-------------+|    |   |+-------------+
        |        +----+     |    +----+        |
        |        |          |         |        |
        |        v          |         v        |
        |      +---+        |       +---+      |
        |      |AS1|........|.......|AS2|      |
        v      +---+        |       +---+      v
      +---+      .          |      .    .    +---+
   P3 |AS3|.......          |     .      ....|AS4| P4
      +---+      .          |    .           +---+
         .        .         V   .              .
         .         .      +---+.               .
         .................|AS5|................
                          +---+
                           P5

                    Figure 2: BGP Controller

   At this point, the BGP Controller has knowledge of the link-state
   graph as well as the prefixes associated with every vertex, and can
   now run Dijkstra's SPF algorithm to compute shortest paths between
   vertices.  The result of this computation is a routing table built
   for every vertex.  Section 3.2 below demonstrates the adjacency
   list built by the controller for the above topology, as well as the
   routing tables computed for every vertex.  The next-hops in the
   routing tables presented there are simply the vertices to send the
   packets to.  When programming the network devices, the actual IP
   addresses of the next-hops are computed as described in
   Section 4.1.  This routing state corresponds to the unaltered
   (default) topology.

3.  Link-State Abstraction and Multiple Topologies

   This section provides detailed information on the link-state
   abstractions used by the controller and how those are used to
   perform traffic engineering in the underlying network.

3.1.  Link-State Discovery Process

   The network devices that the controller peers with establish eBGP
   peering sessions with each other.  The one-to-one correspondence
   between eBGP sessions and underlying IP links allows using the
   state of an eBGP session as an indication of the health of the
   corresponding IP link.
   Specifically, this is accomplished by injecting special "beacon"
   prefixes into every vertex (which could be a device or a collection
   of devices interconnected with an iBGP mesh) and expecting those
   beacons to be re-advertised back to the controller by every vertex
   adjacent to the point of injection.  If a particular BGP session is
   down, the injected prefix will not be re-advertised by the affected
   peer back to the controller, allowing us to conclude that the
   corresponding link is down.

   Figure 3 demonstrates this process.  For simplicity, we assume that
   every device belongs to its own BGP ASN.  The BGP Controller
   injects prefix X into device R1 and expects to hear this prefix
   back from device R2.  At the same time, it is desirable to prevent
   this prefix from leaking any farther than one hop away from R1,
   i.e., to make sure it is not re-advertised to R3.  To accomplish
   this, prefix X could be tagged with a special community value,
   which is replaced with the well-known community "no-export" when
   advertising over an eBGP session.  Because of this policy, the
   prefix will be announced back to the controller, as it uses an iBGP
   session for peering, but not any further to the eBGP peers of
   router R2 in our case.  An alternative to using standard BGP
   communities could be leveraging wide communities to limit the scope
   of the announced prefixes - see [I-D.raszuk-wide-bgp-communities]
   for more details on this technique.

   ------- iBGP (controller to network)
   ....... eBGP (ASN to ASN)

              +------------+
       +------| Controller |<------+
       |      +------------+       |
       X                           X
       |                           |
       V                           |
     +---+                       +---+
     |R1 |...........X..........>|R2 |
     +---+                       +---+
                                   .
               +---+              .
               |R3 |...............
               +---+

                Figure 3: Link-State Discovery

   Using this technique, the controller is able to build a view of the
   links connecting the graph vertices.
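
   The inference step can be sketched as a toy simulation (the router
   names are hypothetical, and this is an illustration of the logic
   rather than the draft's implementation):

```python
def beacon_reports(adjacency, up_sessions):
    """Simulate one discovery round: the controller injects a beacon
    prefix at every vertex; each neighbor whose eBGP session to the
    origin is up hears the beacon and relays it back to the controller
    over iBGP (tagged so it propagates no further).  Returns the
    (reporter, origin) pairs the controller would observe."""
    reports = set()
    for origin, neighbors in adjacency.items():
        for peer in neighbors:
            if frozenset((origin, peer)) in up_sessions:
                reports.add((peer, origin))
    return reports

def infer_links(adjacency, reports):
    """Classify each known adjacency: a link is considered up only if
    the far-side vertex reported the near side's beacon."""
    up, down = set(), set()
    for origin, neighbors in adjacency.items():
        for peer in neighbors:
            edge = frozenset((origin, peer))
            (up if (peer, origin) in reports else down).add(edge)
    return up, down
```

   For example, if the R1-R3 session is down, R3 never relays R1's
   beacon, so the controller marks the R1-R3 link down while the R1-R2
   link stays up.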
   Notice that if two parallel links connect a pair of vertices, this
   method will not be able to differentiate between them.  For
   simplicity, the proposal is that such parallel links should be
   grouped into a single logical IP link using, for example,
   [IEEE8023AD] technology.

3.2.  The Default Topology

   When the controller starts, it discovers the current network graph
   and computes the routing table, assuming that all links have the
   same metric value.  Figure 4 illustrates the adjacency list
   describing the graph taken from Figure 2, along with the routing
   table computed for every vertex/ASN.  The numbers on the graph
   edges designate the link costs.

   Inter-AS Graph and Prefixes

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+
                    P5

   Inter-AS Graph Adjacency List    Per-ASN Routing Table

   +-----+--------------+   +-----+----------------------------+
   | Src | Dst ASNs     |   | AS  | Prefix:Next-Hop(s)         |
   +-----+--------------+   +-----+----------------------------+
   | AS1 | AS2,AS3      |   | AS1 | P3:AS3,P4:AS2,P5:[AS2,AS3] |
   +-----+--------------+   +-----+----------------------------+
   | AS2 | AS1,AS4,AS5  |   | AS2 | P3:AS1,P4:AS4,P5:AS5       |
   +-----+--------------+   +-----+----------------------------+
   | AS3 | AS1,AS5      |   | AS3 | P3:Self,P4:AS5,P5:AS5      |
   +-----+--------------+   +-----+----------------------------+
   | AS4 | AS2,AS5      |   | AS4 | P3:AS5,P4:Self,P5:AS5      |
   +-----+--------------+   +-----+----------------------------+
   | AS5 | AS4,AS2,AS3  |   | AS5 | P3:AS3,P4:AS4,P5:Self      |
   +-----+--------------+   +-----+----------------------------+

                  Figure 4: Unaltered Routing State

3.3.  Alternate Topologies

   Assume the following TE requirements for illustrative purposes:

   o  Traffic from AS4 to P5 needs to traverse AS2.

   o  Traffic to P4 from AS5 needs to be load-shared (ECMP) over two
      paths: direct and via AS2.

   o  Traffic from AS3 to P5 must not use the direct path.

   These requirements could be satisfied with two different
   topologies:

   o  Topology 1 has a "very large" metric assigned to the links
      between AS4 and AS5 and between AS3 and AS5.

   o  Topology 2 has a metric value of 2 assigned to the link between
      AS4 and AS5.

   The prefixes map to the topologies as follows: P5 -> Topology 1 and
   P4 -> Topology 2.  P3 retains its mapping to the default
   (unaltered) topology, which we will call Topology 0 so that all
   topologies can be referred to by number.  The assumption of a "very
   large" metric is important - the path containing such a link could
   still be used if all alternate paths are down because of physical
   failures.  For simplicity, we assume "very large" equals 100 in the
   case under consideration.  The set of topologies and associated
   prefixes would look as in Figure 5, where the numbers on the links
   designate their metrics.

   [Topology 0]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+

   [Topology 1]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ..(100)..|AS5|..(100).....
                   +---+
                    P5

   [Topology 2]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
    |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(2)......
                   +---+

                  Figure 5: Alternate Topologies

   Based on the set of topologies presented above, the BGP Controller
   will compute the routing tables shown in Figure 6, which reflect
   the desired traffic engineering goals defined previously.  The
   entries that differ from the routing decisions in the unaltered
   topology are highlighted with asterisk (*) characters.  Notice that
   AS3 now sees P4 as ECMP-reachable via AS1 and AS5, because of the
   metric change in Topology 2.  The original traffic engineering
   policy requirements did not call for that, but this result appears
   because of the change made between AS4 and AS5, which is a natural
   effect with shortest-path, destination-based forwarding techniques.

   Per-ASN Routing Table

   +-----+--------------------------------+
   | AS  | Prefix:Next-Hop(s)             |
   +-----+--------------------------------+
   | AS1 | P3:AS3,P4:AS2,*P5:AS2*         |
   +-----+--------------------------------+
   | AS2 | P3:AS1,P4:AS4,P5:AS5           |
   +-----+--------------------------------+
   | AS3 | P3:Self,*P4:[AS5,AS1]*,*P5:AS1*|
   +-----+--------------------------------+
   | AS4 | P3:AS5,P4:Self,*P5:AS2*        |
   +-----+--------------------------------+
   | AS5 | P3:AS3,*P4:[AS4,AS2]*,P5:Self  |
   +-----+--------------------------------+

              Figure 6: Multi-Topology Routing Tables

   The controller will push the computed routing tables to the network
   devices using higher LOCAL_PREF values to ensure that the new
   information overrides the routing decisions that the "traditional"
   BGP processes running on the BGP speakers have already made.  It is
   possible to use other attributes to signal better preference, but
   LOCAL_PREF has the benefit of being evaluated very early in the BGP
   tie-breaking process.

3.4.  Overloading a Vertex

   This section illustrates a special but important practical case of
   "overloading" a graph vertex, such that all traffic bypasses the
   vertex.
   This operation could be used in a scenario in which a particular
   network device needs an upgrade and requires all traffic to be
   dried out of it.  Figure 7 demonstrates the implementation of this
   policy with respect to the AS5 vertex.  Topology 0 has no prefixes
   mapped to it; all prefixes are mapped to Topology 2 instead.  This
   topology has a cost of 100 assigned to all links connected to AS5,
   which forces all traffic to avoid transiting AS5.

   [Topology 0]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
    |AS3|..(1)..         .     ..(1)..|AS4|
    +---+       .       (1)         +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ...(1)...|AS5|...(1)......
                   +---+

   [Topology 2]

             +---+         +---+
             |AS1|...(1)...|AS2|
             +---+         +---+
    +---+   .             .   .     +---+
 P3 |AS3|..(1)..         .     ..(1)..|AS4| P4
    +---+       .      (100)        +---+
       .         .       .            .
        .         .      .           .
         .         +---+...         .
          ..(100)..|AS5|..(100).....
                   +---+
                    P5

   Per-ASN Routing Table

   +-----+--------------------------------+
   | AS  | Prefix:Next-Hop(s)             |
   +-----+--------------------------------+
   | AS1 | P3:AS3,P4:AS2,P5:[AS2,AS3]     |
   +-----+--------------------------------+
   | AS2 | P3:AS1,P4:AS4,P5:AS5           |
   +-----+--------------------------------+
   | AS3 | P3:Self,*P4:AS1*,P5:AS5        |
   +-----+--------------------------------+
   | AS4 | P3:*AS2*,P4:Self,P5:AS5        |
   +-----+--------------------------------+
   | AS5 | P3:AS3,P4:AS4,P5:Self          |
   +-----+--------------------------------+

                 Figure 7: Overloading a Vertex

4.  Implementation Details

4.1.  Programming Next-Hops

   As mentioned previously, the prefixes that the controller injects
   into the network need to have their next-hops properly resolved.
   In the simplest case, the next-hops could be the remote IP
   addresses of the links directly connected to the device programmed
   by the controller.
684 This, however, adds certain complexities due to the IP address 685 variability on the point-to-point links connecting the network 686 devices. An alternative could be injecting pre-generated next-hops 687 into the devices - one per device - and resolving them recursively 688 via BGP. 690 Specifically, every graph vertex would have a host route (either IPv4 691 or IPv6) associated with it. The controller would inject this prefix 692 into the respective device(s) associated with this 693 vertex (see Section 4.6), tagged with the special community value discussed in 694 Section 3.1. Moreover, for simplicity, it is possible to re- 695 use the same prefix used for link-state discovery as the value of the 696 next-hop attribute, thus reducing the amount of supplementary routing 697 state injected by the controller. 699 Next, it is easy to notice that using the special BGP community to 700 limit the beacon/next-hop prefix propagation is not strictly 701 necessary. Indeed, the controller may simply discard all "special" 702 prefixes whose AS_PATH contains more than one AS-hop. However, this 703 will result in unneeded routing state propagated in the network, 704 which is not desirable from a manageability perspective. 706 4.2. Equal-Cost Multipath Routing 708 In many practical topologies, the controller may find multiple equal- 709 cost paths from one vertex to another. It may then proceed 710 to program multiple paths for the prefixes affected by this 711 decision. Either of the following two methods could accomplish the multiple-path 712 programming requirement: 714 o Using the BGP Add-Path extension [I-D.ietf-idr-add-paths], 715 specifying multiple next-hop values. 717 o Using the Diverse Path Advertisement method presented in [RFC6774] 718 to inject multiple paths. 720 Furthermore, it is possible to implement weighted ECMP functionality 721 with this approach, relying on [I-D.ietf-idr-link-bandwidth] for 722 weight signaling.
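The equal-cost test itself is simple: a neighbor is a valid ECMP next-hop if the cost of the connecting edge plus that neighbor's shortest distance to the destination equals the source's own shortest distance. A minimal sketch, with a hypothetical topology and function names of our own invention:

```python
import heapq

def dists(adj, root):
    """Dijkstra distances from root; adj is {u: {v: cost}}."""
    dist, heap = {root: 0}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def ecmp_next_hops(adj, src, dst):
    """All neighbors of src that lie on some shortest path to dst."""
    to_dst = dists(adj, dst)
    return sorted(n for n, w in adj[src].items()
                  if w + to_dst[n] == to_dst[src])

# Hypothetical "square" topology: two equal-cost routes from A to D.
adj = {"A": {"B": 1, "C": 1}, "B": {"A": 1, "D": 1},
       "C": {"A": 1, "D": 1}, "D": {"B": 1, "C": 1}}
print(ecmp_next_hops(adj, "A", "D"))   # ['B', 'C']
```

Each next-hop discovered this way would then be signaled to the device using Add-Path or Diverse Path, as listed above.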
The graph edges could have weights associated with 723 them, and a given path's weight computed as the minimum weight value 724 along the path, as mentioned previously. The logic behind the weight 725 selection is outside the scope of this document. 727 4.3. Prefix Discovery Process 729 In order to build routing state information, the controller needs to 730 know the "leaf" prefixes associated with the graph vertices. There 731 are two ways of accomplishing this: either by defining a static mapping 732 of prefixes to vertices in the BGP controller configuration, or by 733 letting the controller learn those prefixes in a dynamic fashion. In 734 both cases, the assumption is that the network reachability 735 information is already advertised into BGP, such that the regular "in- 736 band" routing model is working. 738 The controller may dynamically associate a prefix with a vertex 739 using two criteria: firstly, by observing an empty AS_PATH in the 740 prefix received from the managed device; and secondly, by filtering out 741 prefixes injected for the purpose of network health discovery and 742 next-hop programming. The controller treats everything that matches 743 these two criteria as the routing information associated with the 744 respective vertex. 746 4.4. Sequenced Device Programming 748 Distributed routing systems are susceptible to transient 749 inconsistencies when the network state changes in a way that 750 requires a new best-path election. Since a topological event 751 (e.g. a link flap) is not propagated in an instant, devices that are 752 closer to the origin of the event would update their forwarding 753 tables faster, as compared to others. The devices directly adjacent 754 to those that have their tables already updated would still be using 755 old forwarding state. This would create transient routing loops for 756 the time it takes to fully synchronize the forwarding state of all 757 devices.
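A controller that knows the whole graph can order its pushes to avoid these transient loops, programming the devices farthest from the failed element first. A sketch of such an ordering, using a hypothetical adjacency structure and plain hop count as the distance metric:

```python
from collections import deque

def update_order(adj, event_vertex):
    """Order devices for programming: farthest from the event first,
    so new state "implodes" toward the change instead of exploding
    outward from it."""
    # Breadth-first search gives hop distances from the event.
    hops = {event_vertex: 0}
    queue = deque([event_vertex])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    # Farthest-first; ties broken by name for determinism.
    return sorted(hops, key=lambda v: (-hops[v], v))

adj = {"AS1": ["AS2", "AS3"], "AS2": ["AS1", "AS4", "AS5"],
       "AS3": ["AS1", "AS5"], "AS4": ["AS2", "AS5"],
       "AS5": ["AS2", "AS3", "AS4"]}
print(update_order(adj, "AS5"))  # ['AS1', 'AS2', 'AS3', 'AS4', 'AS5']
```

Here a failure at AS5 would be programmed last, after the two-hop-distant AS1 and the directly attached neighbors.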
759 Since the controller is aware of the full network topology, it may 760 avoid the above scenario by pushing the routing updates in proper 761 sequence - starting with the vertices that are farthest away from the 762 location of the event. This way the newly programmed state will 763 "implode" toward the change, as opposed to "exploding" from the 764 event's point of occurrence. Such sequencing is similar to the 765 process outlined in [RFC6976], but relies on centralized programming, 766 which makes it very simple to implement. 768 4.5. Mapping Prefixes to Topologies 770 The controller needs a manageable way of associating discovered 771 prefixes with any of the topologies defined by the third-party 772 applications. As mentioned previously, all prefixes are by default 773 mapped to the default topology, which corresponds to the actual 774 network state. Once an alternate topology has been defined, prefixes 775 could be mapped to this new topology. One possible way of 776 implementing such a mapping table could be by maintaining a radix tree 777 data-structure, which associates a prefix with the corresponding 778 topology. Using longest-match lookup in this table for each 779 discovered prefix would then yield the topology that this prefix 780 belongs to. This allows for easy and natural grouping of prefix-to- 781 topology mappings, while maintaining familiar semantics of longest- 782 match routing lookups. To implement the default mapping, the 783 prefixes 0.0.0.0/0 and ::/0 should always be in the radix tree, 784 pointing to one of the defined topologies. When those prefixes are 785 deleted per application request, the BGP controller would need to re- 786 insert them, linking back to the default topology again. 788 4.6. Autonomous Systems with iBGP Peering Mesh 790 The BGP Controller treats BGP ASNs that have a form of internal BGP 791 mesh differently than systems that do not peer over iBGP.
Such 792 systems are perceived as an atomic opaque graph vertex for the 793 purpose of next-hop and beacon prefix injection. The routing inside 794 such an ASN is not defined by the controller, but rather relies on some 795 other mechanism, such as an IGP. The controller only defines egress 796 points out of the ASN, and can possibly specify weights associated 797 with the exit points, to allow for weighted ECMP load-distribution. This 798 treatment naturally arises from the fact that iBGP-injected beacon 799 prefixes are not relayed to iBGP peers. Furthermore, the beacon 800 prefixes learned from eBGP neighbors are propagated to all iBGP 801 peers, but not relayed back to the BGP Controller when learned over an 802 iBGP session. Thus, the controller will discover peering links of 803 every "edge" router in such a BGP ASN with all external peers, but will 804 not be able to see the internal iBGP peering mesh. 806 If the underlying ASN implements iBGP route reflection or BGP 807 Confederations, only the routers that form eBGP sessions with 808 external ASNs need to have the routing information injected into 809 them. The routing information will disseminate to the internal 810 speakers by means of the normal BGP replication process, with unmodified 811 next-hops and LOCAL_PREF attribute value, thus ensuring that it 812 overrides the normal "in-band" routing information. 814 When programming ECMP paths, it may happen that the egress points 815 specified by the controller do not satisfy iBGP requirements for 816 multipath (e.g. IGP costs to reach the egress points could be 817 different). In such a case, normal BGP tie-breaking will occur and 818 only ECMP-equivalent paths will be installed in the RIB. 819 Alternatively, if the underlying ASN implements tunneling techniques, 820 it is possible to perform load sharing even if the IGP costs toward 821 the BGP next-hops are different. 823 4.7.
Minimizing Controller-Injected State 825 The BGP Controller can push down all of the prefixes it computes 826 paths for: that is, all prefixes known in the network. This means 827 that for every prefix present in the "regular" eBGP interconnected 828 topology the controller will inject the same prefix with different 829 attributes. It is also possible for the controller to push down only 830 the "delta": the prefixes that need their next-hops/paths 831 changed, based on the supplied policy. This mode of operation 832 requires that the underlying network find the best-paths between the 833 graph vertices using the "shortest-path logic", where the path length 834 equals the length of the AS_PATH attribute. This is equivalent to 835 running Dijkstra's SPF algorithm on the graph with unit metric values assigned 836 to the edges. This is needed since the controller performs path 837 computation using SPF logic, and BGP could elect different paths if 838 some policies are present. Ensuring that both the underlying network 839 and the controller perform the same computations effectively allows 840 for the "delta" mode of operation. 842 Publishing only the "delta" state to the network means more 843 "intelligent" work on the controller side and special requirements on 844 the network policies. However, the benefit is significantly reduced 845 intervention in the regular forwarding, since the majority of the state is 846 not likely to change in many cases. Once again, it is possible to 847 implement the mode where the controller overrides all routing 848 information. 850 5. Handling Failure Scenarios 852 This section reviews two different types of failure scenarios: 853 failures in the underlying network and the controller failures. 855 5.1. Underlying Network Failures 857 Either a vertex (a device) or a graph edge (a network link) may 858 fail.
For the BGP Controller, an underlying failure, be it an edge or a 859 vertex, is visible only after all eBGP sessions interconnecting two 860 vertices have failed. This could be driven either by an event, such 861 as a link-down condition, which is typically fast, or by BGP keepalive 862 timer expiration, which is naturally slower. When this happens, the 863 BGP processes withdraw the corresponding beacon prefixes and the 864 controller will declare the corresponding edge down. This will 865 result in a re-run of SPF for all active topologies and a push of new 866 routing information down to the network. Since the central 867 controller is involved in reconvergence, the restoration time will be 868 longer, compared to the restoration process driven purely by 869 underlying BGP processes. Indeed, the restoration time now includes the 870 failure detection time, SPF re-computations, and the push of new prefixes. 871 However, it could be observed that such centralized reconvergence is 872 free from the BGP Path-Hunting problem, and hence improvements could 873 be noticed in complex meshed topologies. 875 Furthermore, recovery could be faster if multiple paths (ECMP) exist 876 for a prefix, and only a single path fails. In this case, the BGP 877 process will simply invalidate the failed path even before the 878 controller has signaled removal, and will continue using only 879 the active paths. The details of this reconvergence are complicated, 880 as changing ECMP is a hardware-dependent operation. Furthermore, 881 some implementations may support the "consistent hashing" technique 882 that minimizes the impact of ECMP group size changes on flow 883 affinities, as described in [RFC2992]. 885 5.2. BGP Controller Failures 887 Under normal circumstances, an operator may shut down a controller 888 for maintenance or other reasons. In this case, it is expected that 889 BGP sessions be closed following the normal BGP procedure, that is, by sending 890 a BGP Notification message and terminating the TCP session.
As a 891 result, all routers will withdraw the prefixes injected by the 892 controller and recalculate their best-paths. 894 If the controller fails abnormally, e.g. if the process crashes, the TCP 895 sessions that connect it to the underlying devices will either be 896 torn down, or be closed upon expiration of the BGP keepalive timer. The 897 latter will cause some delay before prefixes announced by the 898 failed controller are withdrawn. For the duration of that time, 899 the network will be forwarding traffic using possibly stale 900 information. Link/device failures will be handled locally, and in 901 some cases may cause traffic black-holes, if the only programmed path 902 fails. The duration of this "stale" period is equal to the time it 903 takes to detect the controller failure and update the BGP Loc-RIB, 904 followed by RIB/FIB reprogramming. 906 It is possible to use a single BGP controller along with the BGP route 907 persistence feature to maintain the injected paths even after the 908 BGP Controller failure (see [I-D.uttaro-idr-bgp-persistence]). After 909 the controller restarts, it will simply refresh the "stale" routing 910 information. In this scenario, forcing the network to revert to the 911 traditional BGP-based routing could be accomplished by instructing 912 the controller to inject its paths with a LOCAL_PREF value lower 913 than the default used in the network. The possible risk is that the 914 controller may fail in such a fashion that it will not be able to 915 inject any information into the network.
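The override-and-fallback behavior rests entirely on ordinary best-path selection. A toy model of the idea (hypothetical attribute values; only the first two tie-breaking steps of the real BGP decision process are modeled):

```python
def best_path(paths):
    """Toy best-path selection: higher LOCAL_PREF wins, then shorter
    AS_PATH.  The real BGP decision process has many more steps."""
    return max(paths, key=lambda p: (p["local_pref"], -len(p["as_path"])))

# A natively learned eBGP path versus a controller-injected one.
native = {"source": "ebgp", "local_pref": 100, "as_path": ["AS5", "AS3"]}
injected = {"source": "controller", "local_pref": 200, "as_path": ["AS3"]}

# While the controller session is up, its higher LOCAL_PREF wins...
print(best_path([native, injected])["source"])   # controller
# ...and once its routes are withdrawn, routing reverts automatically.
print(best_path([native])["source"])             # ebgp
```

The withdrawal of the injected path leaves the native path as the only candidate, so fallback requires no extra signaling.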
To maintain resilience, it is possible to run multiple 923 parallel BGP Controllers, assuming that they supply the network with 924 the same routing information, and differentiate themselves as 925 'primary' and 'backup'. The latter property could be accomplished by 926 using different LOCAL_PREF attribute values for primary/secondary 927 controllers - this allows having multiple controllers, backing up 928 each other. 930 With multiple BGP Controllers, it becomes critical for all of them to 931 make the same routing decisions. Even though only one controller 932 is programming the network, the backup paths injected by the others 933 must be consistent with the primary. To accomplish that, all 934 controllers must: 936 o Have the same view of the underlying network topology - i.e. build 937 the same link-state graph. In the simplest case, this could be 938 accomplished by relying on eventual consistency, that is, assuming 939 that under a non-partitioned scenario the controllers will 940 eventually receive the same link-state probe prefixes and build 941 the same resulting link-state database. Alternatively, a 942 consensus protocol, e.g. [PAXOS], could be executed amongst the 943 members of the redundant group to synchronize the link-state 944 database of the master process with the secondary processes. This 945 would ensure strong consistency of the link-state database, but 946 could be overbearing in terms of the amount of state that may need to be 947 reliably replicated. 949 o Maintain the same topology definition database and prefix-to- 950 topology mapping table - as commanded by external applications. 951 This is similar to the previous approach, but would involve much 952 less state to synchronize. Specifically, the topology definitions 953 (e.g. new link costs) and prefix-to-topology mapping information 954 need to be distributed. This state is submitted to the 955 controllers via an API defined for the third-party applications.
956 As before, it could be made the responsibility of an external 957 application to program all controllers with the same state and 958 ensure consistency. Alternatively, another strongly consistent 959 database could be used, leveraging the same consensus protocol. 961 5.4. Network Partitioning 963 This section reviews the possible "partitioning" scenarios, where 964 parts of the network may become managed by different controllers. 965 Situations like this are possible if the controllers are deployed 966 diversely, and may end up in a situation where one or more of the 967 controllers lose iBGP peering sessions with some network devices. 968 The main concern in such situations is programming the devices with 969 inconsistent information that may cause routing loops. 971 Firstly, notice that if device A can learn the "peering source" 972 prefix announced by device B, and the BGP Controller can peer with A, 973 then by transitivity the controller can also peer with B. This means 974 that either the controller and device A cannot learn any routing 975 information from B, or both of them can - excluding transient 976 situations. This property ensures that under proper configuration a 977 set of devices is either completely managed, or completely unmanaged 978 - that is, they share the same fate. This eliminates the scenario 979 where device A is programmed by controller X, device B is 980 programmed by controller Y, and the devices can reach each other 981 in-band. 983 Secondly, consider the transient case where A and B have in-band 984 connectivity, but for some time A is programmed by X and B is 985 programmed by Y. Recall that the absence of an iBGP session to a 986 device means that the device is declared as 987 having "infinite" costs in the link-state database. Thus, X will 988 always bypass B and Y will always bypass A, and hence a routing loop 989 can never form between A and B. 991 6.
Controller API 993 This section provides a set of requirements and guidance for the BGP 994 Controller API. The general recommendation is to base the API on 995 stateless principles, such as those found in the [REST] model. This approach 996 is efficient since no real-time event passing between the controller 997 and third-party application is needed, e.g. for the purpose of active 998 reaction to network failure events. The proposed controller model 999 assumes those events are handled by the message exchange in the 1000 network-controller loop. The following sections are structured 1001 around the "CRUD" operations - Create, Read, Update, Delete - commonly used 1002 in the REST model, and use HTTP verbs and pathnames for illustration. 1003 Furthermore, in the text below applications will be referred to as clients and the BGP 1004 Controller as the server, though 1005 the API could be implemented by a module separate from the main BGP 1006 Controller logic. 1008 6.1. Pathnames and Document Names 1010 The server presents the following pathnames to group various objects: 1012 o "/lsdb" - This is the document that stores the currently 1013 discovered inter-AS graph link state (link-state database). This 1014 document cannot be modified, only read. The LSDB data structure 1015 is a graph, represented in one of the common formats - e.g. as two 1016 collections: vertices and edges, where edges have associated 1017 states and weight (capacity). 1019 o "/topologies/" - This is a directory that stores documents 1020 corresponding to different topologies. Every document contains a 1021 topology definition. 1023 o "/mappings/ipv4" - This is the document that stores the IPv4 1024 prefix mappings to the topologies. Notice that if the 0.0.0.0/0 prefix 1025 is not found in this file, it is implicitly mapped to the default 1026 topology.
Internally in the BGP Controller this is stored as an 1027 efficient radix-tree, but the document represents the mappings as 1028 a collection of prefixes and associated topologies. 1030 o "/mappings/ipv6" - This is the document that stores the IPv6 1031 prefix mappings to the topologies. Same as the IPv4 mappings, 1032 except for the different address family. As with the IPv4 case, if the 1033 ::/0 prefix is not found in this document, it is implicitly mapped 1034 to the default topology. 1036 6.2. Encoding of the Documents and Objects 1038 Either JSON or XML is an acceptable format for encoding the document 1039 contents for programmability. JSON is preferred due to its 1040 lightweight nature and simpler semantics for transporting data 1041 structures. The documents passed with RESTful calls will contain 1042 logical descriptions of the graph vertices and edges. A vertex is 1043 uniquely identified by an opaque name, e.g. a text string. The 1044 mapping between this identifier and the underlying network devices is 1045 to be done elsewhere in the controller data structures, and does not 1046 need to be exposed to the applications. 1048 6.3. Creating & Deleting State 1050 The only state that could be created is the collection of topology 1051 definitions, under the "/topologies/" directory. The topology objects 1052 are to be created using the "POST" HTTP operation - supplying some 1053 basic content, e.g. an empty set of links and associated costs using 1054 the appropriate encoding. Correspondingly, a topology could be 1055 deleted using the DELETE operation. Notice that the default topology 1056 is not present in this directory, and thus could never be deleted. 1057 Notice that the separate "mapping" documents will be referencing the 1058 topology names, and when a topology is deleted, such mappings will 1059 become invalid. It is up to the implementation to handle such 1060 referential integrity - e.g.
by ignoring such entries in the mapping 1061 document, or disallowing the topology file to be deleted as long as 1062 active references are present. 1064 6.4. Reading State 1066 Every document described above could be read and transported to the 1067 client using an HTTP GET request. The document is transported 1068 completely in the corresponding encoding. It is up to the controller 1069 to implement proper read/write locking to avoid inconsistencies in 1070 data when multiple clients are present. No locking API should ever 1071 be exposed to the client, since that would affect the stateless 1072 nature of the communications. Notice that reading the link-state 1073 database is mostly informative to the client, since handling of the 1074 network failures is performed by the BGP Controller. 1076 6.5. Writing State 1078 The topology definition documents and the IPv4/IPv6 mapping tables 1079 could be fully re-written using the HTTP PUT verb. This means that 1080 with every operation, the client must supply the full new document, 1081 not an incremental change. It's up to the client to perform the 1082 merge of the new change with the already existing information. If 1083 consistency across multiple writers is required, it should be 1084 implemented by the clients, possibly via the use of an external 1085 shared locking API. Referential integrity checks could be 1086 implemented in the controller, e.g. to validate that the topology 1087 references in the mapping actually exist, or alternatively could be 1088 left to the client. 1090 It is possible to implement incremental changes using the HTTP PATCH 1091 verb semantics (see [RFC5789]) in the server. In this case, it's up 1092 to the server to perform a proper merge of the incremental change and 1093 ensure there are no conflicts or duplicates. This is a more complex 1094 model as compared to the simple "PUT" logic. 1096 6.6.
Typical API Call Sequence 1098 A typical sequence of actions for a client wishing to perform traffic 1099 engineering could be as follows (assuming the absence of the PATCH 1100 operation): 1102 o Decide which prefixes are to be affected by this operation. 1104 o Create a topology to perform the link-state operation, or re-use 1105 the one previously created by this application. Verify topology 1106 existence using the GET operation in the "/topologies" directory. 1108 o Add new links with the desired costs to the topology. If the 1109 topology already exists, read it first using a GET operation, and 1110 then perform a merge on the client side, later submitting the 1111 updated topology using a PUT operation. 1113 o Obtain current prefix mappings for the desired address family 1114 using the GET operation. Parse the mappings and perform any 1115 consistency checks required, followed by adding the entries for 1116 prefixes to act upon, mapping them to the topology created/updated 1117 above. 1119 o HTTP PUT the new mappings file, replacing the one existing on 1120 the server as a whole. 1122 6.7. Limitations 1124 The API is purposely focused only on routing information 1125 manipulation, and does not provide any way to verify that the requested 1126 operation has been accomplished. Such monitoring should be done 1127 separately, using either mechanisms available in BGP (e.g. by learning 1128 of the prefixes' new paths via a separate session) or outside of BGP, 1129 e.g. the BGP Monitoring Protocol ([I-D.ietf-grow-bmp]) or the Multi- 1130 Threaded Routing Toolkit ([RFC6396]). 1132 7. Security Considerations 1134 The design of the BGP Controller in its simplest form assumes no 1135 access control in the API presented to the third-party 1136 applications. Access could be limited at the transport level, e.g. 1137 by using protocol (HTTP) authentication or access control 1138 capabilities, but the API itself does not provide any logic to 1139 segregate applications - i.e.
there is currently no way to limit an 1140 application to manipulating only a certain subset of the IP address 1141 space. 1143 8. Acknowledgements 1145 The authors would like to thank Robert Raszuk for reviewing the 1146 document and providing valuable feedback. 1148 9. References 1150 9.1. Normative References 1152 [RFC4271] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway 1153 Protocol 4 (BGP-4)", RFC 4271, January 2006. 1155 [RFC5789] Dusseault, L. and J. Snell, "PATCH Method for HTTP", RFC 1156 5789, March 2010. 1158 [RFC1997] Chandrasekeran, R., Traina, P., and T. Li, "BGP 1159 Communities Attribute", RFC 1997, August 1996. 1161 9.2. Informative References 1163 [I-D.lapukhov-bgp-routing-large-dc] 1164 Lapukhov, P., Premji, A., and J. Mitchell, "Use of BGP for 1165 routing in large-scale data centers", draft-lapukhov-bgp- 1166 routing-large-dc-06 (work in progress), August 2013. 1168 [I-D.ietf-grow-bmp] 1169 Scudder, J., Fernando, R., and S. Stuart, "BGP Monitoring 1170 Protocol", draft-ietf-grow-bmp-07 (work in progress), 1171 October 2012. 1173 [RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast 1174 Services", BCP 126, RFC 4786, December 2006. 1176 [RFC6774] Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K. 1177 Kumaki, "Distribution of Diverse BGP Paths", RFC 6774, 1178 November 2012. 1180 [RFC6976] Shand, M., Bryant, S., Previdi, S., Filsfils, C., 1181 Francois, P., and O. Bonaventure, "Framework for Loop-Free 1182 Convergence Using the Ordered Forwarding Information Base 1183 (oFIB) Approach", RFC 6976, July 2013. 1185 [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path 1186 Algorithm", RFC 2992, November 2000. 1188 [RFC6241] Enns, R., Bjorklund, M., Schoenwaelder, J., and A. 1189 Bierman, "Network Configuration Protocol (NETCONF)", RFC 1190 6241, June 2011. 1192 [RFC6396] Blunk, L., Karir, M., and C. Labovitz, "Multi-Threaded 1193 Routing Toolkit (MRT) Routing Information Export Format", 1194 RFC 6396, October 2011.
1196 [I-D.ietf-idr-add-paths] 1197 Walton, D., Retana, A., Chen, E., and J. Scudder, 1198 "Advertisement of Multiple Paths in BGP", draft-ietf-idr- 1199 add-paths-08 (work in progress), December 2012. 1201 [I-D.ietf-idr-link-bandwidth] 1202 Mohapatra, P. and R. Fernando, "BGP Link Bandwidth 1203 Extended Community", draft-ietf-idr-link-bandwidth-06 1204 (work in progress), January 2013. 1206 [I-D.raszuk-wide-bgp-communities] 1207 Raszuk, R., Haas, J., Amante, S., Steenbergen, R., 1208 Decraene, B., and P. Jakma, "Wide BGP Communities 1209 Attribute", draft-raszuk-wide-bgp-communities-03 (work in 1210 progress), July 2012. 1212 [I-D.uttaro-idr-bgp-persistence] 1213 Uttaro, J., Chen, E., Decraene, B., and J. Scudder, 1214 "Support for Long-lived BGP Graceful Restart", draft- 1215 uttaro-idr-bgp-persistence-02 (work in progress), July 1216 2013. 1218 [JAKMA2008] 1219 Jakma, P., "BGP Path Hunting", 2008, . 1222 [PAXOS] Wikipedia, ., "Paxos", , 1223 . 1225 [REST] Wikipedia, ., "Representational state transfer", , . 1228 [RWHITE2005] 1229 White, R., "Graph Overlays on Path Vector: A Possible Next 1230 Step in BGP", June 2005, . 1233 [KVALBEIN2007] 1234 Kvalbein, A. and O. Lysne, "How can Multi-Topology Routing 1235 be used for Intradomain Traffic Engineering?", 2007. 1237 [IEEE8023AD] 1238 IEEE 802.3ad, ., "IEEE Standard for Link aggregation for 1239 parallel links", October 2000. 1241 [RCP] Caesar, M., Caldwell, D., Feamster, N., and J. Rexford, 1242 "Design and Implementation of a Routing Control Platform 1243 ", March 2005, 1244 . 1246 Authors' Addresses 1247 Petr Lapukhov 1248 Microsoft Corporation 1249 One Microsoft Way 1250 Redmond, WA 98052 1251 US 1253 Phone: +1 425 7032723 1254 Email: petrlapu@microsoft.com 1255 URI: http://microsoft.com/ 1257 Edet Nkposong 1258 Microsoft Corporation 1259 One Microsoft Way 1260 Redmond, WA 98052 1261 US 1263 Phone: +1 425 7071045 1264 Email: edetn@microsoft.com 1265 URI: http://microsoft.com/