idnits 2.17.1

draft-chen-idr-ctr-availability-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (September 9, 2020) is 1323 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC4271' is defined on line 479, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC4760' is defined on line 484, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5492' is defined on line 489, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC8283' is defined on line 495, but no explicit
     reference was found in the text

     Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Network Working Group                                            H. Chen
3    Internet-Draft                                                 Futurewei
4    Intended status: Standards Track                                  Y. Fan
5    Expires: March 13, 2021                                     Casa Systems
6                                                                     A. Wang
7                                                               China Telecom
8                                                                      L. Liu
9                                                                     Fujitsu
10                                                                     X. Liu
11                                                             Volta Networks
12                                                          September 9, 2020

14                       BGP for Network High Availability
15                      draft-chen-idr-ctr-availability-01

17   Abstract

19      This document describes protocol extensions to BGP for improving the
20      reliability or availability of a network controlled by a controller
21      cluster.

23   Requirements Language

25      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
26      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
27      document are to be interpreted as described in RFC 2119 [RFC2119].

29   Status of This Memo

31      This Internet-Draft is submitted in full conformance with the
32      provisions of BCP 78 and BCP 79.

34      Internet-Drafts are working documents of the Internet Engineering
35      Task Force (IETF).  Note that other groups may also distribute
36      working documents as Internet-Drafts.  The list of current Internet-
37      Drafts is at https://datatracker.ietf.org/drafts/current/.

39      Internet-Drafts are draft documents valid for a maximum of six months
40      and may be updated, replaced, or obsoleted by other documents at any
41      time.  It is inappropriate to use Internet-Drafts as reference
42      material or to cite them other than as "work in progress."

44      This Internet-Draft will expire on March 13, 2021.

46   Copyright Notice

48      Copyright (c) 2020 IETF Trust and the persons identified as the
49      document authors.  All rights reserved.

51      This document is subject to BCP 78 and the IETF Trust's Legal
52      Provisions Relating to IETF Documents
53      (https://trustee.ietf.org/license-info) in effect on the date of
54      publication of this document.  Please review these documents
55      carefully, as they describe your rights and restrictions with respect
56      to this document.  Code Components extracted from this document must
57      include Simplified BSD License text as described in Section 4.e of
58      the Trust Legal Provisions and are provided without warranty as
59      described in the Simplified BSD License.

61   Table of Contents

63      1.  Introduction
64      2.  Terminologies
65      3.  BGP for Controller Cluster Reliability
66        3.1.  Overview of Mechanism
67        3.2.  Example
68      4.  Extensions to BGP
69        4.1.  Capability
70        4.2.  Controller NLRI
71      5.  Recovery Procedure
72      6.  IANA Considerations
73      7.  Security Considerations
74      8.  Acknowledgements
75      9.  References
76        9.1.  Normative References
77        9.2.  Informative References
78      Authors' Addresses

80   1.  Introduction

82      More and more networks are controlled by central controllers or
83      controller clusters.  A controller cluster appears as a single controller
84      externally.  It normally consists of two or more controllers
85      internally, working together as a single controller externally to
86      control a network, i.e., every network element (NE) in the network.
87      The reliability or availability of a network is heavily dependent on
88      its controller cluster.  Issues or failures in the controller
89      cluster may impact the reliability or availability of the network
90      greatly.

92      For a controller cluster comprising two or more controllers (i.e.,
93      a primary controller, a secondary controller, and so on), failures in
94      the cluster may split the cluster into a few separated controller
95      groups.  These groups do not know about each other and may be out of
96      synchronization.  Two or more groups may be elected as primary groups
97      to control the network at the same time, which may cause some issues.

99      This document proposes some procedures and extensions to BGP for the
100     separated controllers or controller groups to learn about each other and
101     thus elect one new primary controller or controller group correctly when
102     the cluster is split because of failures in the cluster.

104  2.  Terminologies

106     The following terms are used in this document.

108     BGP:  Border Gateway Protocol

110     NE:  Network Element

112     CE:  Customer Edge

114     PE:  Provider Edge

116  3.  BGP for Controller Cluster Reliability

118     This section outlines the mechanism for controller cluster reliability
119     or availability using BGP, and illustrates some details through a
120     simple example.

122  3.1.  Overview of Mechanism

124     When a cluster of controllers is split into a few separated groups
125     because of failures in the cluster, the live controllers are still
126     actually connected to the network (i.e., network elements).  Through
127     some of these connections, each group can get information about
128     the other groups.  A new primary controller or controller group can
129     then be correctly elected to control the network based on this information.

131     Each controller has a BGP session with each of a given number of the
132     same NEs in the network, and the session is established and maintained
133     over an IP path between the controller and the NE.  The session is a
134     BGP session with extensions.

136     In one example or configuration, the given number of NEs is one: the NE
137     with the highest BGP ID.  Suppose that node PE2, as an NE, has the highest
138     BGP ID.  The session between the primary controller (e.g., A) and the
139     NE (e.g., PE2) is a BGP session with extensions.  Each of the
140     non-primary controllers (e.g., B, C, ...) creates and maintains a BGP
141     session with this NE (e.g., PE2).

143     In normal operation, the cluster has all its controllers connected.
144     They are the primary controller controlling the network, the
145     secondary controller, and so on.  They have current positions 1, 2,
146     and so on, respectively.  The primary controller advertises the
147     information about the controllers via its BGP sessions to the given
148     number of the same NEs.

150     For example, it sends the information in a BGP message to the NE
151     (e.g., PE2), which transfers the information to each of the other
152     controllers via the BGP sessions to the other controllers.

154     When the cluster is split into a few separated groups of controllers,
155     each group elects an intent primary controller, secondary controller,
156     and so on from the group, which have intent positions 1, 2, and so on,
157     respectively.  The intent primary controller in each group advertises
158     the information about the controllers in its group.

160     The information advertised by the (intent) primary controller
161     includes its current (intent) position, its old position, its
162     priority to become a primary controller, the number of controllers in its
163     group or cluster, and the IDs of the controllers, ordered by
164     their (intent) positions.  In addition, a flag C indicating
165     whether it is Controlling the network (i.e., it is the current active
166     primary controller) is included.

168  3.2.  Example

170     Figure 1 shows a controller cluster comprising two controllers: the
171     primary controller and the secondary controller.  Each controller has
172     a BGP session with the same NE, which is NE4.

174      +---------------------------------------------------+
175      |                Controller Cluster                 |
176      |                                                   |
177      |    +------------+               +------------+    |
178      |    |Controller A|  Synchronize  |Controller B|    |
179      |    |(Primary)   +---------------+(Secondary) |    |
180      |    +------------+               +-----------++    |
181      |          ^                                  |     |
182      |          |_______________                   |     |
183      |                          |                  |     |
184      |                          v                  |     |
185      +-----------------Channels to Network---------|-----+
186                                / \                 |
187          Session ---->        /   \____            |
188          between             /         \  \____    | <--Session
189          A and NEi         /\ .---. .---+      \   |    between
190          (i=1,2,..)       |  \(   '     |'.---. |  |    B and NE4
191                           |---\ Network |      '+. |
192                          (o NE1\        |       | ) /
193                           (     |       |       o) /
194                            (    |       |        ) NE4
195                             (   o NE2   o NE3.-'
196                              '             )
197                               '---._.-.   )
198                                       '---'

200             Figure 1: Controller Cluster of 2 Controllers

202     The primary BGP controller (i.e., A) has a BGP session with each NE
203     in the network, including NE4.  The secondary controller (i.e., B)
204     has a BGP session with the same NE4 in the network, and the session is
205     established and maintained over an IP path between B and NE4.

207     In normal operation, controller A (Primary) sends NE4 a BGP message
208     containing the information about the controllers connected to it.
209     NE4 transfers the information to controller B (Secondary).  The
210     information includes:

212        C = 1, A's current Position = 1, A's OldPosition = 1, A's Priority,
213        NoControllers = 2, A's ID, B's ID

215     When failures happen in the cluster, the live controllers act as
216     follows:

218     For the live secondary controller (e.g., B): if the primary
219     controller is dead, it promotes itself to be the new primary controller;
220     if the primary controller is alive but separated from the secondary
221     controller, the secondary controller does not promote itself to be a
222     new primary controller.

224     For the primary controller (e.g., A), if it is alive, it continues to
225     be the primary controller.

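The information that A advertises in normal operation, and the promotion rule for the secondary controller described above, can be sketched as follows.  This is a minimal illustration and not part of the draft: the `ControllerInfo` class, its field names, the `secondary_should_promote` function, and the priority value 100 are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControllerInfo:
    """Information advertised by a primary controller (hypothetical model)."""
    c_flag: int                 # C: 1 if controlling the network
    position: int               # current position, 1 = primary
    old_position: int           # OldPosition
    priority: int               # priority to become primary
    controller_ids: List[str]   # controller IDs ordered by position

    @property
    def no_controllers(self) -> int:
        # NoControllers is taken to be the number of IDs listed.
        return len(self.controller_ids)


def secondary_should_promote(primary_alive: bool) -> bool:
    """The secondary promotes itself only if the primary is dead.

    A primary that is alive but separated from the secondary keeps
    its role, so the secondary does not promote itself.
    """
    return not primary_alive


# What A (Primary) sends to NE4 in normal operation; 100 is an assumed priority.
normal = ControllerInfo(c_flag=1, position=1, old_position=1,
                        priority=100, controller_ids=["A", "B"])
print(normal.no_controllers)                          # 2
print(secondary_should_promote(primary_alive=True))   # False
print(secondary_should_promote(primary_alive=False))  # True
```

The sketch takes the liveness of the primary as an input and does not model how the secondary determines it.
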
227     With the extensions to BGP, the secondary controller can determine
228     the status of the primary controller based on the information about
229     the primary controller received.  The situation in which the primary
230     controller is alive but separated from the secondary controller
231     (i.e., condition a: the connection between the primary controller and
232     the secondary controller in the cluster has failed, and condition b: the
233     two controllers are alive) can be determined by the secondary
234     controller as follows:

236     For condition a, when the heartbeat from the primary stops, the
237     secondary knows that the connection between the primary and secondary
238     controller has failed.

240     For condition b, it checks whether the information about the primary
241     controller is updated within a given time.  If so, the primary
242     controller is alive; otherwise, it is dead.

244  4.  Extensions to BGP

246     This section describes extensions to BGP.

248  4.1.  Capability

250     During BGP session establishment, BGP speakers advertise their
251     support for BGP extensions for network reliability, especially the
252     High Availability of Controller cluster (HAC).  A new Controller HA
253     Support Capability Triple is defined for HAC below.  A BGP speaker
254     indicates its support for HAC by including the triple in the
255     Capabilities Optional Parameter in its OPEN message if it supports
256     HAC.

258       0                   1                   2                   3
259       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
260      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
261      |Cap Code (TBD1)|    Length     |
262      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
263      |                            Flags                            |C|
264      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

266            Figure 2: Controller HA Support Capability Triple

268     Cap Code (8 bits):  TBD1 is to be assigned by IANA.

270     Length:  It indicates the length of the Capability value portion in
271        octets, which is 4.

273     Flags (32 bits):  One flag bit, C-bit, is defined.  When it is set to
274        one, it indicates that the BGP speaker supports the high
275        availability of controller cluster as a Controller.  When it is
276        set to zero, it indicates that the BGP speaker supports the high
277        availability of controller cluster as a network element (NE).

279     When two BGP speakers establish a BGP session between them, each of
280     the speakers indicates its support for HAC by including a Controller
281     HA Support Capability Triple in the Capabilities Optional Parameter
282     in the OPEN message if it supports HAC.

284     For a BGP speaker supporting HAC, if it receives the Controller
285     HA Support Capability Triple in the OPEN message from the other BGP
286     speaker over the BGP session, it records that the other BGP speaker
287     (i.e., the other/remote end of the session) supports HAC;
288     otherwise, it records that the other speaker does not.  Thus for all
289     its BGP sessions, it knows whether each session's remote end BGP
290     speaker supports HAC.  If the C-bit in the Triple is set to one,
291     the sending BGP speaker is a controller; otherwise, it is an NE.

293     A BGP speaker that is a controller supporting HAC acts on the
294     information about the controllers in its cluster or group as follows:

296     It sends the information in a BGP UPDATE message to each of a given
297     set of NEs that run BGP with HAC support whenever the information
298     changes.  The given set of NEs may be the one NE with the highest BGP
299     ID.

301     It adjusts the positions of the controllers accordingly whenever
302     there is a change in the information about the controllers received
303     from the NE supporting HAC.

305     An NE running BGP with HAC support receives the information about the
306     controllers from a BGP speaker that is a controller supporting HAC, and
307     sends the information to every BGP speaker that is a controller supporting
308     HAC and has a BGP session with the NE, except for the one from which
309     the information is received.

311  4.2.  Controller NLRI

313     A new Address Family Identifier (AFI) and Subsequent Address Family
314     Identifier (SAFI), called Controllers AFI and SAFI, are defined to
315     carry the information about controllers with Network Layer
316     Reachability Information (NLRI).  Under the AFI and SAFI, a new NLRI,
317     called Controllers NLRI, is defined to contain the information.  A
318     controller in a cluster may advertise the information in a BGP UPDATE
319     message containing a Controllers NLRI of the following format.

321       0                   1                   2                   3
322       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
323      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
324      |             Type              |            Length             |
325      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
326      |    Flags    |C|   Position    |  OldPosition  |   Priority    |
327      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
328      |                   Reserved                    | NoControllers |
329      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
330      |                   Connected Controller 1 ID                   |
331      :                               :                               |
332      |                   Connected Controller n ID                   |
333      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

335                        Figure 3: Controllers NLRI

337     Type (16 bits):  TBD2 is to be assigned by IANA.

339     Length (16 bits):  It indicates the length of the value portion in
340        octets.

342     Flags (8 bits):  One flag bit, C-bit, is defined.  When set, it
343        indicates that the position is the position of the current active
344        primary controller.  In this case, C = 1 and Position = 1, which
345        indicate that the controller is the current active primary
346        controller controlling the network.

348     Position (8 bits):  It indicates the current/intent position of the
349        controller in the controller cluster or group.  1: primary (first)
350        controller, 2: secondary controller, 3: third controller, and so
351        on (i.e., Controller Position of value n: n-th controller in the
352        cluster or group).

354     OldPosition (8 bits):  It indicates the old position of the
355        controller in the controller cluster before it is split.

357     Priority (8 bits):  It indicates the priority of the controller to be
358        elected as a primary controller.

360     Reserved (24 bits):  Reserved field; must be set to zero on
361        transmission and ignored on reception.

363     NoControllers (8 bits):  It indicates the number of controllers
364        connected to the controller advertising the TLV.

366     Connected Controller i ID (32 bits):  It represents the identifier (ID) of
367        controller i at position i (i = 1, ..., n) in the cluster or
368        group.

370  5.  Recovery Procedure

372     This section describes the recovery procedure for a controller
373     cluster of n (n > 2) controllers, which are the primary controller A,
374     the secondary controller B, ..., the n-th controller N.

376     When failures happen in the cluster, it may be split into a few
377     separated groups of controllers.  In one policy, the group with the
378     maximum number of controllers is responsible for controlling the
379     network as the primary group of the cluster, in which the new primary
380     controller, secondary controller, and so on are elected.

382     For each separated group of controllers, the intent primary
383     controller, secondary controller, and so on are elected.  The intent
384     primary controller of the group advertises the information about its
385     group.  The information includes its intent position, its old
386     position, its priority to become a primary controller, the number of
387     controllers in the group, and identifiers of the controllers in the
388     group.  The identifiers of the controllers are ordered according to
389     their positions.  The identifier of the intent primary controller,
390     which has position 1, is the first one; the identifier of the intent
391     secondary controller, which has position 2, is the second one; and so
392     on.  Thus every separated group has the information about the other
393     groups and can determine which group has the maximum number of
394     controllers.

396     In the case of a tie (i.e., two or more groups have the same maximum
397     number of controllers), the group with the highest old position
398     controller (e.g., the old primary controller) wins in one policy.  In
399     another policy, the group with the highest priority controller wins.

401     Some details of the recovery procedure in the current or intent
402     primary controller of a controller cluster or group are as follows.

404     In normal operation, it advertises the information about controllers
405     containing:

407        C = 1, Position = 1, OldPosition = 1, Primary Controller's priority,
408        NoControllers = n, Primary Controller's ID, secondary controller's
409        ID, ..., and n-th Controller's ID.

411     When failures cause the cluster to split, it advertises the information
412     about controllers containing:

414        C = 0, Position = 1, OldPosition = 1, Intent Primary Controller's
415        priority, NoControllers = m (m is the number of controllers in the
416        group that the primary controller is connected to after the failures),
417        Intent Primary Controller's ID, IDs of the other controllers
418        connected.

420     Then after a given time, it checks if the group is elected as the
421     primary group.  If so, it advertises the information about
422     controllers containing:

424        C = 1, Position = 1, OldPosition = 1, its Priority, NoControllers =
425        m, the IDs of the controllers in the group.

427     One example is that failures split the cluster into two separated
428     groups: group 1 comprising A and C, group 2 consisting of B and N.
429     Each group elects its intent primary controller, secondary
430     controller, and so on.  Suppose that controllers A and C are elected
431     as the intent primary and secondary controllers, respectively, in group
432     1; controllers B and N are elected as the intent primary and secondary
433     controllers, respectively, in group 2.

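The group election policies described above (the largest group wins, with ties broken by old position or by priority) can be sketched as follows.  The sketch is illustrative only: `GroupInfo`, `elect_primary_group`, the `tie_break` parameter, and the priority values 10 and 20 are assumptions made for the example.  The draft does not say whether a numerically larger or smaller Priority counts as higher; larger is assumed here.  A higher old position is taken to mean a numerically smaller OldPosition (1 is the old primary).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroupInfo:
    """What one intent primary controller advertises (hypothetical model)."""
    old_position: int          # OldPosition of the intent primary
    priority: int              # its Priority (assumed: larger is higher)
    controller_ids: List[str]  # IDs in the group, ordered by intent position


def elect_primary_group(groups: List[GroupInfo],
                        tie_break: str = "old_position") -> GroupInfo:
    """Pick the group with the most controllers; break ties by policy."""
    def rank(g: GroupInfo) -> Tuple[int, int]:
        size = len(g.controller_ids)
        if tie_break == "old_position":
            # A smaller OldPosition value means a higher old position.
            return (size, -g.old_position)
        if tie_break == "priority":
            return (size, g.priority)
        raise ValueError(f"unknown tie-break policy: {tie_break}")
    return max(groups, key=rank)


# Groups from the example: {A, C} led by A, and {B, N} led by B.
group1 = GroupInfo(old_position=1, priority=10, controller_ids=["A", "C"])
group2 = GroupInfo(old_position=2, priority=20, controller_ids=["B", "N"])

winner = elect_primary_group([group1, group2])
print(winner.controller_ids)  # ['A', 'C']
```

If groups still tie after the chosen tie-break, `max` returns the first such group in the list; the draft does not define that case, so a real implementation would need an additional rule.
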
435     Each of the intent primary controllers A and B advertises the
436     information about the controllers in its group.  The information
437     advertised by A includes:

439        C = 0, Position = 1, OldPosition = 1, A's Priority, NoControllers =
440        2, A's ID, C's ID.

442     The information advertised by B includes:

444        C = 0, Position = 1, OldPosition = 2, B's Priority, NoControllers =
445        2, B's ID, N's ID.

447     Groups 1 and 2 have the same number of controllers, which is 2.  But
448     the old position in group 1 (1) is higher than that in group 2 (2), so group 1 is
449     elected as the primary group, and the intent primary controller A in
450     the primary group is determined as the current primary controller.
451     After the determination, the information about the controllers in
452     group 1 (i.e., the primary group) is changed.  The updated
453     information advertised by A includes:

455        C = 1, Position = 1, OldPosition = 1, A's Priority, NoControllers =
456        2, A's ID, C's ID.

458  6.  IANA Considerations

460     TBD

462  7.  Security Considerations

464     TBD

466  8.  Acknowledgements

468     TBD

470  9.  References

472  9.1.  Normative References

474     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
475                Requirement Levels", BCP 14, RFC 2119,
476                DOI 10.17487/RFC2119, March 1997.

479     [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
480                Border Gateway Protocol 4 (BGP-4)", RFC 4271,
481                DOI 10.17487/RFC4271, January 2006.

484     [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
485                "Multiprotocol Extensions for BGP-4", RFC 4760,
486                DOI 10.17487/RFC4760, January 2007.

489     [RFC5492]  Scudder, J. and R. Chandra, "Capabilities Advertisement
490                with BGP-4", RFC 5492, DOI 10.17487/RFC5492, February
491                2009.

493  9.2.  Informative References

495     [RFC8283]  Farrel, A., Ed., Zhao, Q., Ed., Li, Z., and C. Zhou, "An
496                Architecture for Use of PCE and the PCE Communication
497                Protocol (PCEP) in a Network with Central Control",
498                RFC 8283, DOI 10.17487/RFC8283, December 2017.

501  Authors' Addresses
502     Huaimo Chen
503     Futurewei
504     Boston, MA
505     USA

507     Email: Huaimo.chen@futurewei.com

509     Yanhe Fan
510     Casa Systems
511     USA

513     Email: yfan@casa-systems.com

515     Aijun Wang
516     China Telecom
517     Beiqijia Town, Changping District
518     Beijing, 102209
519     China

521     Email: wangaj3@chinatelecom.cn

523     Lei Liu
524     Fujitsu

526     USA

528     Email: liulei.kddi@gmail.com

530     Xufeng Liu
531     Volta Networks

533     McLean, VA
534     USA

536     Email: xufeng.liu.ietf@gmail.com