Network Working Group                                            H. Chen
Internet-Draft                                                 Futurewei
Intended status: Standards Track                                  Y. Fan
Expires: 19 March 2022                                      Casa Systems
                                                                 A. Wang
                                                           China Telecom
                                                                  L. Liu
                                                                 Fujitsu
                                                                  X. Liu
                                                          Volta Networks
                                                       15 September 2021


                   BGP for Network High Availability
                   draft-chen-idr-ctr-availability-03

Abstract

   This document describes protocol extensions to BGP for improving the
   reliability or availability of a network controlled by a controller
   cluster.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 19 March 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminologies
   3.  BGP for Controller Cluster Reliability
     3.1.  Overview of Mechanism
     3.2.  Example
   4.  Extensions to BGP
     4.1.  Capability
     4.2.  Controller NLRI
   5.  Recovery Procedure
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   More and more networks are controlled by central controllers or
   controller clusters.  A controller cluster normally consists of two
   or more controllers that work together internally and appear
   externally as a single controller controlling a network, i.e., every
   network element (NE) in the network.  The reliability or
   availability of a network depends heavily on its controller cluster,
   so issues or failures in the cluster may greatly reduce the
   reliability or availability of the network.

   For a controller cluster comprising two or more controllers (i.e.,
   primary controller, secondary controller, and so on), failures in
   the cluster may split it into several separate controller groups.
   These groups are unaware of each other and may be out of
   synchronization.  Two or more groups may each be elected as the
   primary group and control the network at the same time, which may
   cause conflicts in controlling the network.

   This document proposes procedures and extensions to BGP that let the
   separated controllers or controller groups learn about each other
   and thus correctly elect one new primary controller or controller
   group when the cluster is split because of failures in the cluster.

2.  Terminologies

   The following terms are used in this document.

   BGP:  Border Gateway Protocol

   NE:  Network Element

   CE:  Customer Edge

   PE:  Provider Edge

3.  BGP for Controller Cluster Reliability

   This section outlines the mechanism for controller cluster
   reliability or availability using BGP and illustrates some details
   through a simple example.

3.1.  Overview of Mechanism

   When a cluster of controllers is split into several separate groups
   because of failures in the cluster, the live controllers are still
   connected to the network (i.e., to network elements).  Through some
   of these connections, each group can obtain information about the
   other groups.  Based on this information, a new primary controller
   or controller group is correctly elected to control the network.

   Each controller has a BGP session with each of a given number of the
   same NEs in the network.  Each session is established and maintained
   over an IP path between the controller and the NE, and is a session
   of BGP with extensions.

   In one example or configuration, the given number of NEs is one: the
   NE with the highest BGP ID.  Suppose that node PE2 as NE has the
   highest BGP ID.
   The session between the primary controller (e.g., A) and the NE
   (e.g., PE2) is a session of BGP with extensions.  Each of the
   non-primary controllers (e.g., B, C, ...) creates and maintains a
   BGP session with this NE (e.g., PE2).

   In normal operations, the cluster has all its controllers connected.
   They are the primary controller controlling the network, the
   secondary controller, and so on, with current positions 1, 2, and so
   on respectively.  The primary controller advertises the information
   about the controllers via its BGP sessions to the given number of
   the same NEs.

   For example, it sends the information in a BGP message to the NE
   (e.g., PE2), which forwards the information to each of the other
   controllers via its BGP sessions with them.

   When the cluster is split into several separate groups of
   controllers, each group elects an intent primary controller, intent
   secondary controller, and so on from the group, with intent
   positions 1, 2, and so on respectively.  The intent primary
   controller in each group advertises the information about the
   controllers in its group.

   The information advertised by the (intent) primary controller
   includes its current (intent) position, its old position, its
   priority to become a primary controller, the number of controllers
   in its group or cluster, and the IDs of the controllers ordered by
   their (intent) positions.  In addition, a flag C is included,
   indicating whether the advertiser is Controlling the network (i.e.,
   whether it is the current active primary controller).

3.2.  Example

   Figure 1 shows a controller cluster comprising two controllers: the
   primary controller and the secondary controller.  Each controller
   has a BGP session with the same NE, which is NE4.

   +---------------------------------------------------+
   |                Controller Cluster                 |
   |                                                   |
   |    +------------+               +------------+    |
   |    |Controller A|  Synchronize  |Controller B|    |
   |    |(Primary)   +---------------+(Secondary) |    |
   |    +------------+               +-----------++    |
   |          ^                                  |     |
   |          |_______________                   |     |
   |                          |                  |     |
   |                          v                  |     |
   +-----------------Channels to Network---------|-----+
                             / \                 |
        Session ---->       /   \____            |
        between            /    \    \____       | <--Session
        A and NEi         /\ .---.  .---+ \      |    between
        (i=1,2,..)       | \(   ' |'.---.  |     |    B and NE4
                         |---\ Network |  '+.    |
                        (o NE1\  |     |      ) /
                        (        |     |    o) /
                         (       |     |     ) NE4
                          (      o NE2 o NE3.-'
                           '               )
                            '---._.-.     )
                                     '---'

              Figure 1: Controller Cluster of 2 Controllers

   The primary controller (i.e., A) has a BGP session with each NE in
   the network, including NE4.  The secondary controller (i.e., B) has
   a BGP session with the same NE4, established and maintained over an
   IP path between B and NE4.

   In normal operations, controller A (Primary) sends NE4 a BGP message
   containing the information about the controllers connected to it.
   NE4 forwards the information to controller B (Secondary).  The
   information includes:

      C = 1, A's current Position = 1, A's OldPosition = 1, A's
      Priority, NoControllers = 2, A's ID, B's ID

   When failures happen in the cluster, the live controllers act as
   follows:

   For the secondary controller (e.g., B), if it is alive and the
   primary controller is dead, it promotes itself to be the new primary
   controller; if the primary controller is alive but separated from
   the secondary controller, the secondary controller does not promote
   itself.

   For the primary controller (e.g., A), if it is alive, it continues
   to be the primary controller.
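
   The failure-handling rule just described can be sketched as follows.
   This is a minimal illustration, not part of the protocol: the names
   Role and next_role and the primary_alive input are assumptions
   introduced here, and the sketch does not show how a controller
   decides whether the primary is alive.

```python
from enum import Enum


class Role(Enum):
    """Roles in a two-controller cluster (hypothetical names)."""
    PRIMARY = 1
    SECONDARY = 2


def next_role(current: Role, primary_alive: bool) -> Role:
    """Return the role a live controller takes after a cluster failure.

    'primary_alive' is what this controller has concluded about the
    primary controller; for the primary itself it is always True.
    """
    if current is Role.PRIMARY:
        # A live primary continues to be the primary controller.
        return Role.PRIMARY
    if not primary_alive:
        # The primary is dead: the secondary promotes itself.
        return Role.PRIMARY
    # The primary is alive but separated: the secondary does not promote.
    return Role.SECONDARY
```

   Keeping the decision in one pure function makes the point of the
   rule visible: a secondary becomes primary only when it has concluded
   that the primary is dead, so a mere loss of the cluster-internal
   connection never yields two primaries.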

   With the extensions to BGP, the secondary controller can determine
   the status of the primary controller based on the information
   received about the primary controller.  The primary controller is
   alive but separated from the secondary controller when two
   conditions both hold: (a) the connection between the primary
   controller and the secondary controller in the cluster has failed,
   and (b) both controllers are alive.  The secondary controller
   determines these conditions as follows:

   For condition a, when the heartbeat from the primary stops, the
   secondary knows that the connection between the primary and
   secondary controllers has failed.

   For condition b, it checks whether the information about the primary
   controller has been updated within a given time.  If so, the primary
   controller is alive; otherwise, it is dead.

4.  Extensions to BGP

   This section describes extensions to BGP.

4.1.  Capability

   During BGP session establishment, BGP speakers advertise their
   support for the BGP extensions for network reliability, specifically
   the High Availability of Controller cluster (HAC).  A new Controller
   HA Support Capability Triple is defined for HAC below.  A BGP
   speaker that supports HAC indicates this by including the triple in
   the Capabilities Optional Parameter of its OPEN message.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Cap Code (TBD1)|    Length     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Flags                            |C|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

           Figure 2: Controller HA Support Capability Triple

   Cap Code (8 bits):  TBD1 is to be assigned by IANA.

   Length (8 bits):  It indicates the length of the Capability value
      portion in octets, which is 4.

   Flags (32 bits):  One flag bit, C-bit, is defined.
      When it is set to one, it indicates that the BGP speaker supports
      the high availability of controller cluster as a controller.
      When it is set to zero, it indicates that the BGP speaker
      supports the high availability of controller cluster as a network
      element (NE).

   When two BGP speakers establish a BGP session between them, each
   speaker that supports HAC indicates this by including a Controller
   HA Support Capability Triple in the Capabilities Optional Parameter
   of its OPEN message.

   If a BGP speaker supporting HAC receives the Controller HA Support
   Capability Triple in the OPEN message from the other BGP speaker on
   the session, it records that the other speaker (i.e., the remote end
   of the session) supports HAC; otherwise, it records that the other
   speaker does not.  Thus, for each of its BGP sessions, it knows
   whether the remote end supports HAC.  If the C-bit in the received
   Triple is set to one, the remote BGP speaker is a controller;
   otherwise, it is an NE.

   A BGP speaker acting as a controller and supporting HAC handles the
   information about the controllers in its cluster or group as
   follows:

   It sends the information in a BGP UPDATE message to each of a given
   set of NEs that run BGP with HAC support whenever the information
   changes.  The given set of NEs may be the one NE with the highest
   BGP ID.

   It adjusts the positions of the controllers accordingly whenever the
   information about the controllers received from an NE supporting HAC
   changes.

   An NE running BGP with HAC support receives the information about
   the controllers from a controller supporting HAC and sends it to
   every other controller that supports HAC and has a BGP session with
   the NE, i.e., to all such controllers except the one from which the
   information was received.

4.2.  Controller NLRI

   A new Address Family Identifier (AFI) and Subsequent Address Family
   Identifier (SAFI), called Controllers AFI and SAFI, are defined to
   carry the information about controllers in Network Layer
   Reachability Information (NLRI).  Under this AFI and SAFI, a new
   NLRI, called Controllers NLRI, is defined to contain the
   information.  A controller in a cluster may advertise the
   information in a BGP UPDATE message containing a Controllers NLRI of
   the following format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Type              |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Flags    |C|   Position    |  OldPosition  |   Priority    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Reserved                    | NoControllers |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Connected Controller 1 ID                   |
   :                               :                               :
   |                   Connected Controller n ID                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 3: Controllers NLRI

   Type (16 bits):  TBD2 is to be assigned by IANA.

   Length (16 bits):  It indicates the length of the value portion in
      octets.

   Flags (8 bits):  One flag bit, C-bit, is defined.  When set, it
      indicates that the position is the position of the current active
      primary controller.  In this case, C = 1 and Position = 1, which
      indicate that the controller is the current active primary
      controller controlling the network.

   Position (8 bits):  It indicates the current/intent position of the
      controller in the controller cluster or group.  1: primary
      (first) controller, 2: secondary controller, 3: third controller,
      and so on (i.e., a Position of value n indicates the n-th
      controller in the cluster or group).

   OldPosition (8 bits):  It indicates the old position of the
      controller in the controller cluster before it was split.

   Priority (8 bits):  It indicates the priority of the controller to
      be elected as a primary controller.

   Reserved (24 bits):  Reserved field; it must be set to zero on
      transmission and ignored on reception.

   NoControllers (8 bits):  It indicates the number of controllers
      connected to the controller advertising the Controllers NLRI.

   Connected Controller i ID (32 bits):  It represents the identifier
      (ID) of the controller at position i (i = 1, ..., n) in the
      cluster or group.

5.  Recovery Procedure

   This section describes the recovery procedure for a controller
   cluster of n (n > 2) controllers, which are the primary controller
   A, the secondary controller B, ..., and the n-th controller N.

   When failures happen in the cluster, it may be split into several
   separate groups of controllers.  In one policy, the group with the
   largest number of controllers is responsible for controlling the
   network as the primary group of the cluster, in which the new
   primary controller, secondary controller, and so on are elected.

   For each separate group of controllers, the intent primary
   controller, intent secondary controller, and so on are elected.  The
   intent primary controller of the group advertises the information
   about its group.  The information includes its intent position, its
   old position, its priority to become a primary controller, the
   number of controllers in the group, and the identifiers of the
   controllers in the group.  The identifiers of the controllers are
   ordered according to their positions: the identifier of the intent
   primary controller, which has position 1, is the first one; the
   identifier of the intent secondary controller, which has position 2,
   is the second one; and so on.
   Thus every separate group has the information about the other
   groups and can determine which group has the largest number of
   controllers.

   In the case of a tie (i.e., two or more groups have the same largest
   number of controllers), under one policy the group containing the
   controller with the highest old position (e.g., the old primary
   controller, whose OldPosition is 1) wins.  Under another policy, the
   group with the highest-priority controller wins.

   Some details of the recovery procedure in the current or intent
   primary controller of a controller cluster or group are as follows.

   In normal operations, it advertises the information about
   controllers containing:

      C = 1, Position = 1, OldPosition = 1, Primary Controller's
      Priority, NoControllers = n, Primary Controller's ID, Secondary
      Controller's ID, ..., and n-th Controller's ID.

   When failures cause the cluster to split, it advertises the
   information about controllers containing:

      C = 0, Position = 1, OldPosition = 1, Intent Primary Controller's
      Priority, NoControllers = m (m is the number of controllers in
      the group to which the primary controller is connected after the
      failures), Intent Primary Controller's ID, IDs of the other
      controllers connected.

   Then, after a given time, it checks whether its group has been
   elected as the primary group.  If so, it advertises the information
   about controllers containing:

      C = 1, Position = 1, OldPosition = 1, its Priority,
      NoControllers = m, the IDs of the controllers in the group.

   As an example, suppose that failures split the cluster into two
   separate groups: group 1 comprising A and C, and group 2 comprising
   B and N.  Each group elects its intent primary controller, intent
   secondary controller, and so on.  Suppose that controllers A and C
   are elected as the intent primary and secondary controllers
   respectively in group 1, and controllers B and N are elected as the
   intent primary and secondary controllers respectively in group 2.
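
   The election policies described in this section can be sketched as
   follows.  This is an illustrative sketch under stated assumptions,
   not a normative algorithm: GroupInfo, elect_primary_group, and the
   tie_break parameter are hypothetical names, controller IDs are shown
   as strings for readability, and a numerically larger Priority is
   assumed to mean a higher priority, which this document does not
   specify.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GroupInfo:
    """Information advertised by a group's intent primary controller."""
    position: int                    # intent position (1 = primary)
    old_position: int                # position before the split
    priority: int                    # priority to become primary
    controller_ids: Tuple[str, ...]  # IDs ordered by intent position

    @property
    def no_controllers(self) -> int:
        """Number of controllers in the group (NoControllers)."""
        return len(self.controller_ids)


def elect_primary_group(groups: Iterable[GroupInfo],
                        tie_break: str = "old_position") -> GroupInfo:
    """Pick the primary group: most controllers first, then tie-break.

    If groups are still tied after the tie-break, the first such group
    in 'groups' is returned; the document does not cover that case.
    """
    def rank(group: GroupInfo) -> Tuple[int, int]:
        if tie_break == "old_position":
            # Position 1 is the highest, so a smaller value must win.
            return (group.no_controllers, -group.old_position)
        # Assumption: a larger Priority value means higher priority.
        return (group.no_controllers, group.priority)

    return max(groups, key=rank)


# The two groups of the example: A (old primary) leads group 1 and
# B (old secondary) leads group 2.  Priorities are placeholder values.
group1 = GroupInfo(1, 1, 0, ("A", "C"))
group2 = GroupInfo(1, 2, 0, ("B", "N"))
winner = elect_primary_group([group1, group2])
```

   For the two groups of the example, both have two controllers, so the
   old-position tie-break decides and the sketch selects the group led
   by A.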

   Each of the intent primary controllers A and B advertises the
   information about the controllers in its group.  The information
   advertised by A includes:

      C = 0, Position = 1, OldPosition = 1, A's Priority,
      NoControllers = 2, A's ID, C's ID.

   The information advertised by B includes:

      C = 0, Position = 1, OldPosition = 2, B's Priority,
      NoControllers = 2, B's ID, N's ID.

   Groups 1 and 2 have the same number of controllers, which is 2, but
   the OldPosition advertised for group 1 (1, the old primary) ranks
   higher than that for group 2 (2, the old secondary).  Group 1 is
   therefore elected as the primary group, and the intent primary
   controller A in the primary group becomes the current primary
   controller.  After this determination, the information about the
   controllers in group 1 (i.e., the primary group) changes.  The
   updated information advertised by A includes:

      C = 1, Position = 1, OldPosition = 1, A's Priority,
      NoControllers = 2, A's ID, C's ID.

6.  IANA Considerations

   TBD

7.  Security Considerations

   TBD

8.  Acknowledgements

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              "Multiprotocol Extensions for BGP-4", RFC 4760,
              DOI 10.17487/RFC4760, January 2007.

   [RFC5492]  Scudder, J. and R. Chandra, "Capabilities Advertisement
              with BGP-4", RFC 5492, DOI 10.17487/RFC5492, February
              2009.

9.2.  Informative References

   [RFC8283]  Farrel, A., Ed., Zhao, Q., Ed., Li, Z., and C. Zhou, "An
              Architecture for Use of PCE and the PCE Communication
              Protocol (PCEP) in a Network with Central Control",
              RFC 8283, DOI 10.17487/RFC8283, December 2017.

Authors' Addresses

   Huaimo Chen
   Futurewei
   Boston, MA
   United States of America

   Email: Huaimo.chen@futurewei.com


   Yanhe Fan
   Casa Systems
   United States of America

   Email: yfan@casa-systems.com


   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing
   102209
   China

   Email: wangaj3@chinatelecom.cn


   Lei Liu
   Fujitsu
   United States of America

   Email: liulei.kddi@gmail.com


   Xufeng Liu
   Volta Networks
   McLean, VA
   United States of America

   Email: xufeng.liu.ietf@gmail.com