idnits 2.17.1 draft-sl-rtgwg-far-dcn-14.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 6 instances of too long lines in the document, the longest one being 41 characters in excess of 72. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 59 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet has text resembling RFC 2119 boilerplate text. -- The document date (May 26, 2020) is 1402 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'FAT-TREE' is defined on line 1506, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: November 27, 2020 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 ZTE TX Inc. 10 May 26, 2020 12 Generic Fault-Avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-14 15 Abstract 17 This draft describes a generic routing method and protocol for a 18 regular data center network, named the Fault-Avoidance Routing (FAR) 19 protocol. The FAR protocol provides a generic routing method for all 20 types of regular topology network architectures that have been 21 proposed for large-scale cloud-based data centers over the past few 22 years. The FAR protocol is designed to leverage any regularity in 23 the topology and compute its routing table in a simplistic manner. 24 Fat-tree is taken as an example architecture to illustrate how the 25 FAR protocol can be applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on November 27, 2020. 44 Copyright Notice 46 Copyright (c) 2020 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 4 63 2. Conventions used in this document . . . . . . . . . . . . . . 5 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 5 66 3.2. Dilemma of conventional routing methods in a large-scale 67 network with giant number nodes of routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 8 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 9 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 9 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 9 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 10 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 11 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 11 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 17 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 17 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 18 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 18 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 18 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 19 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 19 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 19 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 20 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 22 86 9. Topology identification and broadcast storm suppression . . . 23 87 10. Application Example . . . . . . . . . . . . . . . . . . . . . 23 88 10.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 25 89 10.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 26 90 10.2.1. Single Link Failure . . . . . . . . . . . . . . . . 26 91 10.2.2. A Group of Link Failures . . . . . . . . . . . . . . 27 92 10.2.3. Node Failures . . . . . . . . . . . . . . . . . . . 28 93 10.3. Routing Procedure . . . . . . . . . . . . . . . . . . . 28 94 10.4. FAR's Performance in Large-scale Networks . . . . . . . 30 95 10.4.1. The number of control messages required by FAR . . . 30 96 10.4.2. The Calculating Time of Routing Tables . . . . . . . 30 97 10.4.3. The Size of Routing Tables . . . . . . . . . . . . . 30 98 11. Implementations Examples . . . . . . . . . . . . . . . . . . 31 99 12. Security Considerations . . . . . . . . . . . . . . . . . . . 33 100 13. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 34 101 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 34 102 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 34 103 15.1. Normative References . . . . . . . . . . . . . . . . . . 34 104 15.2. Informative References . . . . . . . . . . . . . . . . . 34 105 16. Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . 34 106 16.1. Application Area of the Solution . . . . . . . . . . . . 34 107 16.2. Technical evolution roadmap . . . . . . . . . . . . . . 35 108 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 35 110 1. Introduction 112 In recent years, with the rapid development of cloud computing 113 technologies, the widely deployed cloud services, such as Amazon EC2 114 and Google search, bring about huge challenges to data center 115 networking (DCN). Today's cloud-based data centers (DCs) require 116 large-scale networks with larger internal bandwidth and smaller 117 transfer delay. However, conventional networks cannot meet such 118 requirements due to limitations in their network architecture. In 119 order to satisfy the requirements of cloud computing services, many 120 new network architectures have been proposed for data centers, such 121 as Fat-tree, MatrixDCN[MatrixDCN], and BCube[BCube]. These new 122 architectures can support non-blocking large-scale datacenter 123 networks with more than tens of thousands of physical servers. 125 All of these architectures have regular topologies, which are common 126 features. The regular topology refers to the network topology 127 structure with obvious regularity and symmetry, which is conducive to 128 automatic configuration of the network, such as the Fat-tree network. 129 In a regular topology, each network node such as a switch or router 130 can be addressed by its location and through a node's address, the 131 node's connections to its neighbors in a network can be determined, 132 and furthermore, the route to the node from other nodes in the 133 network can be determined. So nodes can compute route entries 134 without learning topology. 136 This document describes a generic routing method and protocol, the 137 Fault-Avoidance Routing (FAR) protocol, for DCNs. This method 138 leverages the regularity in the topologies of data center networks to 139 simplify routing learning and accelerate the query of routing tables. 140 This routing method has a better fault tolerance and can be applied 141 to any DCN with a regular topology. 143 FAR is not a routing protocol to replace generic routing protocols 144 such as OSPF(Open Shortest Path First)[RFC2328] and IS- 145 IS(Intermediate System-to-Intermediate System). It cannot be used in 146 general local networks whose topological structures are arbitrary, 147 and whose scales are also not very large. OSPF and IS-IS works very 148 well in such a network. But in a large-scale network with regular 149 topology, FAR has better performance. Compared with OSPF and IS-IS, 150 FAR has shorter time of network convergence and lower PDU(Protocol 151 Data Unit) overhead. Furthermore, FAR requires less computing and 152 storage resources, which lets FAR routers to run at a lower cost of 153 production than the generic routers. 155 In addition, for each type of network architecture, researchers 156 designed a routing algorithm according to the features of its 157 topology. Because these routing algorithms are different and lack 158 compatibility with each other, it is very difficult to develop a 159 routing protocol for network routers supporting multiple routing 160 algorithms. FAR has better adaptability than these specified routing 161 methods. 163 FAR consists of three components, i.e., link state learning unit, 164 routing table building unit and routing table querying unit. In the 165 link state learning unit, FAR exchanges link failures among routers 166 to establish a consistent knowledge of the entire network. In this 167 stage, the regularity in topology is exploited to infer failed links 168 and routers. In the routing table building unit, FAR builds up two 169 routing tables, i.e., a basic routing table (BRT) and a negative 170 routing table (NRT), for each router according to the network 171 topology and link states. In the last component, routers forward 172 incoming packets by looking up the two routing tables. The matched 173 entries in BRT minus the matched entries in NRT are the final route 174 entries to be used to forward an incoming packet. 176 The remainder of this draft is organized as follows. The problem to 177 be addressed by FAR is described in Section 3. The framework of FAR 178 routing protocol is described in Section 4. Section 5 and 6 179 introduce FAR's data format FAR and modules in detail. Section 7 180 describe how FAR works by finite state machine (FSM). In Section 8, 181 we discussed how FAR works with variable network architectures. 182 Section 9 takes Fat-tree network as an example to illuminate how FAR 183 works. 185 1.1. Acronyms & Definitions 187 DCN - Data Center Network 189 FAR - Fault-Avoidance Routing 191 BRT - Basic Routing Table 192 NRT - Negative Routing Table 194 NDT - Neighbor Devices Table 196 ADT - All Devices Table 198 LFT - Link Failure Table 200 DA - Device Announcement 202 LFA - Link Failure Announcement 204 DLR - Device and Link Request 206 IP - Internet Protocol 208 VM - Virtual Machine 210 2. Conventions used in this document 212 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 213 NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 214 this document are to be interpreted as described in RFC-2119 215 [RFC2119]. In this document, these words will appear with that 216 interpretation only when in ALL CAPS. Lower case uses of these words 217 are not to be interpreted as carrying RFC-2119 significance. 219 3. Problem Statement 221 The problem to be addressed by FAR as proposed in this draft is 222 described in this section. The expansion of Cloud data center 223 networks has brought significant challenges to the existing routing 224 technologies. FAR mainly solves a series of routing problems faced 225 by large-scale data center networks. 227 3.1. The Impact of Large-scale Networks on Route Calculation 229 In a large-scale cloud data center network, there may be thousands of 230 routers. Running OSPF and other routing protocols in such network 231 will encounter these two challenges: a) Network convergence time 232 would be too long, which will cause a longer time to elapse for 233 creating and updating the routes. The response time to network 234 failures may be excessively long; b) a large number of routing 235 protocol packets need to be sent. The routing information consumes 236 too much network bandwidth and CPU(central processing unit) 237 resources, which easily leads to packet loss and makes the problem 238 (a) more prominent. 240 In order to solve these problems, a common practice is to partition a 241 large network into some small areas, where the route calculation runs 242 independently within different areas. However, nowadays the cloud 243 data centers typically require very large internal bandwidth. To 244 meet this requirement, a large number of parallel equivalent links 245 are deployed in a network, such as a Fat-tree network. Partitioning 246 such a network will affect the utilization of routing algorithm on 247 equivalent multi-path and reduce internal network bandwidth 248 requirements. 250 In the FAR routing calculation process, a Basic Routing Table (BRT) 251 is built on local network topology leveraging the regularity of the 252 network topologies. In addition to BRT, FAR also builds a Negative 253 Routing Table (NRT). FAR gradually builds NRT in the process of 254 learning network link failure information, which does not require 255 learning a complete network fault information. FAR does not need to 256 wait for the completion of the network convergence in the process of 257 building these two tables. Therefore, it avoids the problem of 258 excessive network convergence overheads in the route calculation 259 process. In addition, FAR only needs to exchange a small amount of 260 link change information between routers, and hence consumes less 261 network bandwidth. 263 3.2. Dilemma of conventional routing methods in a large-scale network 264 with giant number nodes of routers 266 There are many real world scenario where tens of thousands of 267 nodes(or much more nodes) need to be deployed in a flat area, such as 268 infiniband routing and switching system, high-performance computer 269 network, and many IDC(Internet Data Center) networks in China. The 270 similar problems have been existed long ago. People have solved the 271 problems through similar solutions, such as the traditional regular 272 topology-based RFC3619 protocol, the routing protocols of infiniband 273 routing and switching system, and high-performance computer network 274 routing protocol. 275 Infiniband defines a switch-based network to interconnect processing 276 nodes and the I/O nodes. Infiniband can support very large scale 277 networks, use the regularity in topology to simplify its routing 278 algorithm, which is just the same to what we do in FAR. 280 Why OSPF and other conventional routing methods do not work well in a 281 large-scale network with giant number nodes of routers? 283 As everyone knows, the OSPF protocol uses multiple databases, more 284 topological exchange information (as seen in the following example) 285 and complicated algorithm. It requires routers to consume more 286 memory and CPU processing capability. But the processing rate of CPU 287 on the protocol message per second is very limited. When the network 288 expands, CPU will quickly approach its processing limits, and at this 289 time OSPF can not continue to expand the scale of the management. 290 The SPF(Shortest Path First) algorithm itself does not thoroughly 291 solve these problems. 293 On the contrary, the FAR protocol does not need to calculate SPF, 294 which saves calculation time and resources, so FAR does not have the 295 convergence time delay and the additional CPU overheads, which SPF 296 requires. Because in the initial stage, FAR already knows the 297 regular information of the whole network topology and does not need 298 to periodically do SPF operation. 300 One of the examples of "more topological exchange information": 301 In the OSPF protocol, LSA(Link-State Advertisement) floods every 1800 302 seconds. Especially in the larger network, the occupation of CPU and 303 band bandwidth will soon reach the router's performance bottleneck. 304 In order to reduce these adverse effects, OSPF introduced the concept 305 of Area, which still has not solved the problem thoroughly. By 306 dividing the OSPF Area into several areas, the routers in the same 307 area do not need to know the topological details outside their area. 308 (In comparison with FAR, after OSPF introducing the concept of Area, 309 the equivalent paths cannot be selected in the whole network scope) 311 OSPF can achieve the following results by Area : 1) Routers only need 312 to maintain the same link state databases as other routers within the 313 same Area, without the necessity of maintaining the same link state 314 database as all routers in the whole OSPF domain. 2) The reduction 315 of the link state databases means dealing with relatively fewer LSA, 316 which reduces the CPU consumption of routers; 3) The large number of 317 LSAs flood only within the same Area. But, its negative effect is 318 that the smaller number of routers which can be managed in each OSPF 319 area. On the contrary, because FAR does not have the above 320 disadvantages, FAR can also manage large-scale network even without 321 dividing Areas. 323 The aging time of OSPF is set in order to adapt to routing 324 transformation and protocol message exchange happened frequently in 325 the irregular topology. Its negative effect is: when the network 326 does not change, the LSA needs to be refreshed every 1800 seconds to 327 reset the aging time. In the regular topology, as the routings are 328 fixed, it does not need the complex protocol message exchange and 329 aging rules to reflect the routing changes, as long as LFA mechanism 330 in the FAR is enough. 332 Compared with the LSVR(Link State Vector Routing) protocol, the LSVR 333 protocol has no special requirements for the network topology 334 structure, however, the FAR draft is applicable to the regular 335 topology network architecture and simplifies unnecessary processing. 337 It is a solution proposed to greatly improve the routing efficiency 338 of the regular network topology. The FAR solution is more efficient 339 than the general methods such as LSVR in regular topology. 341 Therefore, in FAR, we can omit many unnecessary processing and the 342 packet exchange. The benefits are fast convergence speed and much 343 larger network scale than other dynamic routing protocol. Now there 344 are some successful implementations of simplified routings in the 345 regular topology in the HPC(High Performance Computing) environment. 346 Conclusion: As FAR needs few routing entries and the topology is 347 regular, the database does not need to be updated regularly. Without 348 the need for aging, there is no need for CPU and bandwidth overhead 349 brought by LSA flood every 30 minutes, so the expansion of the 350 network has no obvious effect on the performance of FAR, which is 351 contrary to OSPF. 353 Comparison of convergence time: The settings of OSPF spf_delay and 354 spf_hold_time can affect the change of convergence time. The 355 convergence time of the network with 2480 nodes is about 15-20 356 seconds; while the FAR does not need to calculate the SPF, so there 357 is no such convergence time. 358 These issues still exist in rapid convergence technology of OSPF 359 ,ISIS (such as I-SPF, Incremental SPF) and LSVR. The convergence 360 speed and network scale constraint each other. FAR does not have the 361 above problems, and the convergence time is almost negligible. 362 Can FRR(Fast Reroute) solve these problems? IP FRR has some 363 limitations. The establishment of IP FRR backup scheme will not 364 affect the original topology and traffic forwarding which are 365 established by protocol, however, we can not get the information of 366 whereabouts and status when the traffic is switched to an alternate 367 next hop. 369 3.3. Network Addressing Issues 371 Routers are typically configured with multiple network interfaces, 372 each connected to a subnet. OSPF and other routing algorithms 373 require that each interface of a router must be configured with an IP 374 address. A large-scale data center network may contain thousands of 375 routers and each router has dozens of network interfaces, thus, there 376 are tens of thousands of IP addresses needed to be configured in a 377 data center. It will be very complex to configure and manage a large 378 number of network interfaces and will be difficult to troubleshoot 379 network problems, then network maintenance will be costly and error- 380 prone. 382 In FAR, the device position information is encoded in the IP address 383 of the router. Each router only needs to be assigned a unique IP 384 address according its location, which greatly solves complex network 385 addressing issues in large-scale networks. 387 3.4. Big Routing Table Issues 389 There are a large number of subnets in the large-scale data center 390 network. A router may build a routing entry for each subnet, and 391 therefore the size of routing tables on each router may be very 392 large. It will increase a router's cost and reduce the querying 393 speed of the routing table. 395 FAR uses two measures to reduce the size of its routing tables: a)It 396 builds a BRT on the regularity of the network topologies; b)It 397 introduces a new routing table, i.e., a NRT. In this way FAR can 398 reduce the size of routing tables to only a few dozen routing 399 entries. 401 3.5. Adaptivity Issues for Routing Algorithms 403 To implement efficient routing in large-scale datacenters, besides 404 FAR, some other routing methods are proposed for some specific 405 network architectures, such as Fat-tree and BCube. These routing 406 methods are different(from both design and implementation viewpoints) 407 and not compatible with the conventional routing methods, which 408 brings big troubles to network equipment providers to develop new 409 routers supporting various new routing methods. 411 FAR is a generic routing method. With slight modification, FAR 412 method can be applied to most of regular datacenter networks. 413 Furthermore, the structure of routing tables and querying a routing 414 table in FAR are the same as conventional routing method. If FAR is 415 adopted, the workload of developing a new type of router will be 416 significantly decreased. 418 3.6. Virtual Machine Migration Issues 420 Supporting VM migration is very important for cloud-based datacenter 421 networks. However, in order to support layer-3 routing, routing 422 methods including OSPF and FAR require limiting VM migration within a 423 subnet. For this paradox, the mainstream methods still utilize 424 layer-3 routing on routers or switches, transmit packets encapsulated 425 by IPinIP or MACinIP between hosts by tunnels passing through network 426 to the destination access switch, and then extract original packet 427 out and send it to the destination host. 429 By utilizing the aforementioned methods, FAR can be applied to Fat- 430 tree, MatrixDCN or BCube networks for supporting VM migration in 431 entire network. 433 4. The FAR Framework 435 FAR requires that a DCN has a regular topology, and network devices, 436 including routers, switches, and servers, are assigned IP addresses 437 according to their locations in the network. In other word, we can 438 locate a device in the network according to its IP address. 440 FAR is a distributed routing method. In order to support FAR, each 441 router needs to have a routing module that implements the FAR 442 algorithm. FAR algorithm is composed of three parts, i.e., link- 443 state learning, routing table building and routing table querying, as 444 shown in Fig. 1. 446 Link-state Learning |Routing Table | Routing Table 447 | Building 448 | | Querying 449 +--------+ /---------------\ | +--------+ | Packets 450 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | 451 |Learning| | Link Detection| | |Building|\ | | 452 +--------+ \---------------/ | +--------+ \ | \|/ 453 | | | \ |+--------------+ 454 | | | /|| 7 Querying | 455 \|/ \|/ | / || Routing Table| 456 +-----------------------+| / |+--------------+ 457 |3 Invisible Neighbor & || / | 458 |Link Failure Inferring || +--------- | 459 +-----------------------+|/| 6 NRT | | 460 | / |Building| | 461 \|/ /| +--------+ | 462 +--------------+ / | | 463 |4 Link Failure| / | | 464 | Learning | | | 465 +--------------+ | | 466 | | 468 Figure 1: The FAR framework 470 Link-state learning is responsible for a router to detect the states 471 of its connected links and learn the states of all the other links in 472 the entire network. The second part builds two routing tables, a 473 basic routing table (BRT) and an negative routing table (NRT), 474 according to the learned link states in the first part. The third 475 part queries the BRT and the NRT to decide a next forwarding hop for 476 the received (ingress) packets. 478 5. Data Format 480 5.1. Data Tables 482 Some data tables are maintained on each router in FAR. They are: 484 Neighbor Device Table (NDT): To store neighbor routers and related 485 links. 487 All Devices Table (ADT): To store all routers in the entire network. 489 Link Failures Table (LFT): To store all link failures in the entire 490 network. 492 Basic Routing Table (BRT): To store the candidate routes. 494 Negative Routing Table(NRT): To store the avoiding routes. 496 The format of NDT 498 ---------------------------------------------------------- 499 Device ID | Device IP | Port ID | Link State | Update Time 500 ---------------------------------------------------------- 502 Device ID: The ID of a neighbor router. 504 Device IP: The IP address of a neighbor router. 506 Port ID: The port ID that a neighbor router is attached to. 508 Link State: The state of the link between a router and its neighbor 509 router. There are two states: Up and Down. 511 Update Time: The time of updating the entry. 513 The format of ADT 515 -------------------------------------------------- 516 Device ID | Device IP | Type | State | Update Time 517 -------------------------------------------------- 519 Device ID: The ID of a neighbor router. 521 Device IP: The IP address of a neighbor router. 523 Type: The type of a neighbor router. 525 State: The state of a neighbor router. There are two states: Up and 526 Down. 528 Update Time: The time of updating the entry. 530 The format of LFT 532 -------------------------------------------- 533 No | Router 1 IP | Router 2 IP | Timestamp 534 -------------------------------------------- 536 No: The entry number. 538 Router 1 IP: The IP address of one router that a failed link connects 539 to. 541 Router 2 IP: The IP address of another router that a failed link 542 connects to. 544 Timestamp: It identifies when the entry is created. 546 The format of BRT 548 ------------------------------------------------------- 549 Destination | Mask | Next Hop | Interface | Update Time 550 ------------------------------------------------------- 552 Destination: A destination network 554 Mask: The subnet mask of a destination network. 556 Next Hop: The IP address of a next hop for a destination. 558 Interface: The interface related to a next hop. 560 Update Time: The time of updating the entry. 562 The format of NRT 564 ------------------------------------------------------------------- 565 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 566 ------------------------------------------------------------------- 568 Destination: A destination network. 570 Mask: The subnet mask of a destination network. 572 Next Hop: The IP address of a next hop that should be avoided for a 573 destination. 575 Interface: The interface related to a next hop that should be 576 avoided. 578 Failed Link No: A group of failed link numbers divided by "/", for 579 example 1/2/3. 581 Timestamp: The time of updating the entry. 583 5.2. Messages 585 Some protocol messages are exchanged between routers in FAR. 587 Hello Message: This message is exchanged between neighbor routers to 588 learn adjacency. 590 Device Announcement (DA): Synchronize the knowledge of routers 591 between routers. 593 Link Failure Announcement (LFA): Synchronize link failures between 594 routers. 596 Device and Link Request (DLR): When a router starts, it requests the 597 knowledge of routers and links from its neighbors by a DLR message. 599 A FAR Message is directly encapsulated in an IP packet. The protocol 600 field of IP header indicates an IP packet is an FAR message. 602 The four types of FAR messages have same format of packet header, 603 called FAR header (as shown in Figure 2). 605 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 606 +-----------+------------+------------------------+ 607 | Version |Message Type| Message Length | 608 +-----------+------------+------------------------+ 609 | Checksum | AuType | 610 +------------------------+------------------------+ 611 | Authentication | 612 +-------------------------------------------------+ 613 | Authentication | 614 +-------------------------------------------------+ 615 | Timestamp | 616 +-------------------------------------------------+ 618 Figure 2: The format of FAR header 620 Version: FAR version 622 Message Type: The type of FAR message. 624 Packet Length: The packet length of the total FAR message. 626 Checksum: The checksum of an entire FAR message. 628 AuType:Authentication type. 0: no authentication, 1: Plaintext 629 Authentication, 2: MD5 Authentication. 631 Authentication: Authentication information. 0: undefined, 1: Key, 2: 632 key ID, MD5 data length and packet number. MD5 data is appended to 633 the backend of the packet. 635 AuType and Authentication can refer to the definition of OSPF packet. 637 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 638 +-----------+------------+------------------------+ 639 | Version |Message Type| Message Length | 640 +-----------+------------+------------------------+ 641 | Checksum | AuType | 642 +------------------------+------------------------+ 643 | Authentication | 644 +-------------------------------------------------+ 645 | Authentication | 646 +-------------------------------------------------+ 647 | Timestamp | 648 +-------------------------------------------------+ 649 | Router IP | 650 +------------------------+------------------------+ 651 | HelloInterval | HelloDeadInterval | 652 +------------------------+------------------------+ 653 | Neighbor Router IP | 654 +-------------------------------------------------+ 655 | ... | 656 +-------------------------------------------------+ 658 Figure 3: The Format of Hello Messages 660 For Hello messages, the Message Type in FAR header is set to 661 1.Besides FAR header, a Hello message(Fig. 3) requires the following 662 fields: 664 Router IP: The router IP address. 666 HelloInterval: The interval of sending Hello messages to neighbor 667 routers. 669 RouterDeadInterval: The interval to set a neighbor router dead(out- 670 of-service). If in the interval time, a router doesn't receive a 671 Hello message from its neighbor router, the neighbor router is 672 treated as dead. 674 Neighbor Router IP: The IP address of a neighbor router. All the 675 neighbor router's addresses should be included in a Hello message. 677 |<----1---->| <----1---->|<----------2----------->| 678 +-----------+------------+------------------------+ 679 | Version |Message Type| Message Length | 680 +-----------+------------+------------------------+ 681 | Checksum | AuType | 682 +------------------------+------------------------+ 683 | Authentication | 684 +-------------------------------------------------+ 685 | Authentication | 686 +-------------------------------------------------+ 687 | Timestamp | 688 +------------------------+------------------------+ 689 | Router1 IP | 690 +-------------------------------------------------+ 691 | ... | 692 +-------------------------------------------------+ 694 Figure 4: The Format of DA Messages 696 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 697 Besides FAR header, a DA message includes IP addresses of all the 698 announced routers. 700 |<----1---->| <----1---->|<----------2----------->| 701 +-----------+------------+------------------------+ 702 | Version |Message Type| Message Length | 703 +-----------+------------+------------------------+ 704 | Checksum | AuType | 705 +------------------------+------------------------+ 706 | Authentication | 707 +-------------------------------------------------+ 708 | Authentication | 709 +-------------------------------------------------+ 710 | Timestamp | 711 +------------------------+------------------------+ 712 | Left IP | 713 +-------------------------------------------------+ 714 | Right IP | 715 +------------------------+------------------------+ 716 | State | 717 +-------------------------------------------------+ 718 | ... | 719 +-------------------------------------------------+ 721 Figure 5: The Format of LFA Messages 723 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 724 Besides FAR header, a LFA message includes all the announced link 725 failures. 727 Left IP: The IP address of the left endpoint router of a link. 729 Right IP: The IP address of the right endpoint router of a link. 731 State: Link state. 0: Up, 1: down 733 |<----1---->| <----1---->|<----------2----------->| 734 +-----------+------------+------------------------+ 735 | Version |Message Type| Message Length | 736 +-----------+------------+------------------------+ 737 | Checksum | AuType | 738 +------------------------+------------------------+ 739 | Authentication | 740 +-------------------------------------------------+ 741 | Authentication | 742 +-------------------------------------------------+ 743 | Timestamp | 744 +-------------------------------------------------+ 746 Figure 6: The Format of DLR Messages 748 For DLR messages(Fig. 6), the Message Type in FAR header is set to 749 1.Except for FAR header, DLR has no additional fields. 751 6. FAR Modules 753 6.1. Neighbor and Link Detection Module(M1) 755 M1 is responsible for sending and receiving Hello messages, and 756 detecting directly-connected links and neighbor routers. Each Hello 757 message is encapsulated in a IP packet. M1 sends Hello messages 758 periodically to all the active router ports and receives Hello 759 messages from its neighbor routers. M1 detects neighbor routers and 760 directly-connected links according to received Hello Messages and 761 stores these neighbors and links into a Neighbor Devices Table (NDT). 762 Additionally, M1 also stores neighbor routers into an All Devices 763 Table (ADT). 765 6.2. Device Learning Module(M2) 767 M2 is responsible for sending, receiving, and forwarding device 768 announcement (DA) messages, learning all the routers in the whole 769 network, and deducing faulted routers. When a router starts, it 770 sends a DA message announcing itself to its neighbors and a DLR 771 message requesting the knowledge of routers and links from its 772 neighbors. If M2 module of a router receives a DA message, it checks 773 whether the router encapsulated in the message is in an ADT. If the 774 router is not in the ADT, M2 puts this router into the ADT and 775 forwards this DA message to all the active ports except for the 776 incoming one, otherwise, M2 discards this message directly. If M2 777 module of a router receives a DLR message, it replies a DA message 778 that encapsulates all of the learned routers. 780 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 782 M3 is responsible for inferring invisible neighbors of the current 783 router by means of the ADT. If the link between a router A and its 784 neighbor B breaks, which results in that M1 module of A cannot detect 785 the existence of B, then B is an invisible neighbor of A. Since a 786 device's location is coded into its IP address, it can be judged 787 whether two routers are adjacent, according to their IP addresses. 788 Based on this idea, M3 infers all of the invisible neighbors of the 789 current router and the related link failures. The results are stored 790 into a NDT. Moreover, link failures also are added into a link- 791 failure table (LFT). LFT stores all of the failed links in the 792 entire network. 794 6.4. Link Failure Learning Module(M4) 796 M4 is responsible for sending, receiving and forwarding link failure 797 announcement (LFA) and learning all the link failures in the whole 798 network. M4 broadcasts each newly inferred link failure to all the 799 routers in the network. Each link failure is encapsulated in a LFA 800 message and one link failure is broadcasted only once. If a router 801 receives a DLR request from its neighbor, it will reply a LFA message 802 that encapsulates all the learned link failures through M4 module. 803 If M4 receives a LFA message, it checks whether the link failure 804 encapsulated in the message is in a LFT by comparing two link ends 805 and timestamp. If the link failure is not in the LFT or timestamp is 806 different, M4 puts this link failure into the LFT (or update 807 timestamp only) and forwards this LFA message to all the active ports 808 except for the incoming one, otherwise, M4 discards this message 809 directly. 811 There is a special case a router will rebroadcast a link failure. If 812 a router receives a data packet and must forward the packet going 813 ahead to destination through a failed link, it means some previous 814 router should avoid this failed link according to its NRT but it 815 doesn't. In this case, maybe the previous router missed the LFA 816 message of the link failure due to some uncertain reasons. So the 817 forwarding router rebroadcasts the LFA message. 819 6.5. BRT Building Module(M5) 821 M5 is responsible for building a BRT for the current router. By 822 leveraging the regularity in topology, M5 can calculate the routing 823 paths for any destination without the knowledge of the topology of 824 whole network, and then build the BRT based on a NDT. Since the IP 825 addresses of network devices are continuous, M5 only creates one 826 route entry for a group of destination addresses that have the same 827 network prefix by means of route aggregation technology. Usually, 828 the size of a BRT is very small. The detail of how to build a BRT is 829 described in section 5. 831 6.6. NRT Building Module(M6) 833 M6 is responsible for building a NRT for the current router. Because 834 M5 builds a BRT without considering link failures in network, the 835 routing paths calculated by the BRT cannot avoid failed links. To 836 solve this problem, a NRT is used to exclude the routing paths that 837 include some failed links from the paths calculated by a BRT. M6 838 calculate the routing paths that include failed links and stored them 839 into the NRT. The details of how to build a NRT is described in 840 section 5. 842 6.7. Routing Table Lookup(M7) 844 M7 is responsible for querying routing tables and selecting the next 845 hop for forwarding the packets. Firstly, M7 takes the destination 846 address of a forwarding packet as a criterion to look up route 847 entries in a BRT based on longest prefix match. All of the matched 848 entries are composed of a candidate hops list. Secondly, M7 look up 849 negative route entries in a NRT taking the destination address of the 850 forwarding packet as criteria. This lookup is not limited to the 851 longest prefix match, any entry that matches the criteria would be 852 selected and composed of an avoiding hops list. Thirdly, the 853 candidate hops minus avoiding hops are composed of an applicable hops 854 list. At last, M7 sends the forwarding packet to any one of the 855 applicable hops. If the applicable list is empty, the forwarding 856 packet will be dropped. 858 7. How a FAR Router Works 860 Figure 7 shows how a FAR router works by its FSM. 862 /---------------\ 863 | Start | 864 | | 865 \---------------/ 866 | 867 +--------+ |1 868 | | | 869 | \|/ \|/ 870 | +--------------------+ 871 |2 | ND | 872 | | | 873 | +--------------------+ 874 | | | 875 | | |3 876 +--------+ | 877 \|/ 878 +--------------+ 879 | ND-FIN | 880 | | 881 +--------------+ 882 | 883 |4 884 | 885 \|/ 886 ________10_______\+--------------+/_______11________ 887 | /| |\ | 888 | | Listen | | 889 | ____9_______\| |/_______12___ | 890 | | /| |\ | | 891 | | +--------------+ | | 892 | | 5/ | | \8 | | 893 | | |/_ | | _\| | | 894 | | +------------+ 6| |7 +------------+ | | 895 | --| HELLO-RECV | | | | LFA-RECV |-- | 896 | +------------+ | | +------------+ | 897 | ______| |______ | 898 | | | | 899 | \|/ \|/ | 900 | +------------+ +------------+ | 901 |___________| DLR-RECV | | DA-RECV |__________| 902 +------------+ +------------+ 903 _________ 904 | | 905 | | 906 \|/ | 907 +--------------+ |13 908 | Hello Thread | | 909 +--------------+ | 910 | | 911 |_________| 913 _________ 914 | | 915 | | 916 \|/ | 917 +--------------+ |14 918 | LFD Thread | | 919 +--------------+ | 920 | | 921 |_________| 923 _________ 924 | | 925 | | 926 \|/ | 927 +--------------+ |15 928 | DA Thread | | 929 +--------------+ | 930 | | 931 |_________| 933 _________ 934 | | 935 | | 936 \|/ | 937 +--------------+ |16 938 | DFD Thread | | 939 +--------------+ | 940 | | 941 |_________| 943 Figure 7: The Finite State Machine of FAR Router 945 1)When a router starts up, it starts a Hello thread and then starts 946 ND (neighbor detection) timer (3 seconds). Next the router goes into 947 ND (neighbor detection) state. 948 2)In the ND state, if a router received a Hello message, then it 949 performs a Hello-message processing and goes back to the ND state. 951 3)When the ND timer is over, a router goes into ND-FIN (neighbor 952 detection finished) state. 953 4)A router starts the LFD (link failure detection) thread and DFD 954 (device failure detection) state, and sends DA message and DLR 955 message to all of its active ports. Then the router goes into Listen 956 state. 957 5) If a router receives a Hello message, then goes into HELLO-RECV 958 state. 959 6) If a router receives a DLR message, then goes into DLR-RECV state. 960 7) If a router receives a DA message, then goes into DA-RECV state. 961 8) If a router receives a LFA message, then goes into LFA-RECV state. 962 9) A router performs the Hello-message processing. After that, it 963 goes back to Listen state. 964 10) A router performs the DLR-message processing. After that, it 965 goes back to Listen state. 966 11) A router performs the DA-message processing. After that, it goes 967 back to Listen state. 968 12) A router performs the LFA-message processing. After that, it 969 goes back to Listen state. 970 13) Hello thread produces and sends Hello messages to all its ports 971 periodically. 972 14) LFD thread calls link-failure-detection processing to check link 973 failures in all links periodically 974 15) DA thread produces and sends DA messages periodically (30 975 minutes). 976 16) When DFD thread starts up, it sleep a short time (30 seconds) to 977 wait for a router learning all the active routers in the network. 978 Then the thread calls the device-failure-detection processing to 979 check device failures periodically (30 minutes). 981 8. Compatible Architecture 983 As a generic routing protocol, FAR can be run in various DCNs with 984 regular topology. Up to now, we have implemented the FAR protocol 985 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 987 For different network architectures, most processing of FAR is same 988 besides calculation of routing tables. BRT routing tables are 989 calculated based on Hello messages and NRT routing tables are 990 calculated based on LFA messages in FAR. To extend FAR to support a 991 new network architecture, only processing of Hello and LFA messages 992 need providing to build BRT and NRT routing tables. 994 In this protocol, FAR can support maximally 12 network architectures 995 and at least support 1 built-in network architecture, such as Fat- 996 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 997 a unique number from 1 to 12. For example, if the 1 built-in 998 architectures are assigned 1, and other customized architectures are 999 assigned 2 to 12. 1000 1: Fat-tree 1001 2: BCube 1002 3: MatrixDCN. 1003 4: xxx. 1004 ...... 1005 12: xxx. 1007 9. Topology identification and broadcast storm suppression 1009 In this design, the initial topology discovery process is not a 1010 mandatory option for a FAR routing protocol. The recommended 1011 solution here is to use a pre-configured configuration file, which 1012 contains topology parameters of the current system, each node device 1013 as long as according to these configuration parameters will be able 1014 to know the topology information. In this way, we do not have to 1015 deal with complex topology discovery processes, nor do we need to 1016 calculate the shortest path, because the optimal path can be 1017 calculated from the parameters. This protocol also allows the 1018 formation of configuration files to be submitted to the topology 1019 discovery protocol, allowing for a variety of different 1020 implementation options. 1022 Regarding the flood suppression processing of broadcast packets, it 1023 has been considered in the previous content. Since the hello packets 1024 is only transmitted between the two nodes, it cannot be spread out. 1025 The link error message is only sent to the CPU, and are not forwarded 1026 to the nodes in layer 2 broadcasting. Moreover, each node will 1027 discard the repeated error messages when the node receives them. In 1028 this way, the broadcast storm can be suppressed. If a link is 1029 unstable and repeatedly up or down, the system will not send new 1030 messages after sending notifications, and the system will not 1031 oscillate repeatedly. The topology is updated only when the link is 1032 later detected to be stable for a long time. 1034 10. Application Example 1036 In this section, we take a Fat-tree network(Fig. 7) as an example to 1037 describe how to apply FAR routing. Since M1 to M4 are very simple, 1038 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 1039 network. 1041 A Fat-tree network is composed of 4 layers. The top layer is core 1042 layer, and the other layers are aggregation layer, edge layer and 1043 server layer. There are k pods, each one containing two layers of 1044 k/2 switches. Each k-port switch in the edge layer is directly 1045 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 1046 of the k-port switches in the aggregation layer. There are (k/2)2 1047 k-port core switches. Each core switch has one port connected to 1048 each of the k pods. 1050 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 1051 +--+ +--+ +--+ +--+ 1052 | | | | | | | | 1053 +--+,_ +--+', +--+ ,,+--+ 1054 |`,`',`-., / | \ `. .'` - .-'``.` /| 1055 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1056 | \ `', `-., .` / | `, .` ,' | 1057 | `, `'. `'-,_ .'` ,' | ', / | 1058 | . `'. `-.,-` / | \ 1059 | \ `'., .` `'., ` | `. 1060 | `, .'`, ,`'., | ', 1061 | . ,-` '., - `'-,_ | `. 1062 | \ .` ,'., `|., . 1063 | .'` / `-, | `'., `. 1064 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1065 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1066 | | | | | | | | | | | | | | | | 1067 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1068 | \/ | | \/ | | \/ | | \/ | 1069 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1070 | | | | | | | | | | | | | | | | 1071 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1072 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1073 / | | \ / | | \ / | | \ / | | | 1074 / | | \ / | | \ / | | \ / | | | 1075 / | | \ / | | \ / | | \ / | | | 1076 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1077 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1078 10.1.2.2 10.3.1.3 1080 Figure 8: Fat-tree Network 1082 Aggregation switches are given addresses of the form 10.pod.0.switch, 1083 where pod denotes the pod number, and switch denotes the position of 1084 that switch in the upper pod (in [1, k/2]). Edge switches are given 1085 addresses of the form 10.pod.switch.1, where pod denotes the pod 1086 number, and switch denotes the position of that switch in the lower 1087 pod (in [1, k/2]). The core switches are given addresses of the form 1088 10.0.j.i, where j and i denote that switch's coordinates in the 1089 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1090 The address of a host follows the pod switch to which it is connected 1091 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1092 the host's position in that subnet (in [2, k/2+1], starting from left 1093 to the right). 1095 10.1. BRT Building Procedure 1097 By leveraging the topology's regularity, every switch clearly knows 1098 how it forwards a packet. When a packet arrives at an edge switch, 1099 if the destination of the packet lies in the same subnet with the 1100 switch, then the switch directly forwards the packet to the 1101 destination server through layer-2 switching. Otherwise, the switch 1102 forwards the packet to any of aggregation switches in the same pod. 1103 When a packet arrives at an aggregation switch, if the destination of 1104 the packet lies in the same pod, the switch forwards the packet to 1105 the corresponding edge switch. Otherwise, the switch forwards the 1106 packet to any of core switches that it is connected to. If a core 1107 switch receives a packet, it forwards the packet to the corresponding 1108 aggregation switch that lies in the destination pod. 1110 The forwarding policy discussed above is easily expressed through a 1111 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1112 following entries: 1114 Destination/Mask Next hop 1115 10.0.0.0/255.0.0.0 10.1.0.1 1116 10.0.0.0/255.0.0.0 10.1.0.2 1118 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1119 the following entries: 1121 Destination/Mask Next hop 1122 10.1.1.0/255 255.255.0 10.1.1.1 1123 10.1.2.0/255.255.255.0 10.1.2.1 1124 10.0.0.0/255.0.0.0 10.0.1.1 1125 10.0.0.0/255.0.0.0 10.0.1.2 1127 The BRT of a core switch, such as 10.0.1.1, is composed of the 1128 following entries: 1130 Destination/Mask Next hop 1131 10.1.0.0/255 255.0.0 10.1.0.1 1132 10.2.0.0/255.255.0.0 10.2.0.1 1133 10.3.0.0/255.255.0.0 10.3.0.1 1134 10.4.0.0/255.255.0.0 10.4.0.1 1136 10.2. NRT Building Procedure 1138 The route entries in an NRT are related with link and node failures. 1139 We summarize all types of cases into three (3) catalogs. 1141 10.2.1. Single Link Failure 1143 In Fat-tree, Links can be classified as 3 types by their locations: 1144 1) servers to edge switches; 2) edge to aggregation switches; 3) 1145 aggregation to core switches. Link failures between servers to edge 1146 switches only affect the communication of the corresponding servers 1147 and don't affect the routing tables of any switch, so we only discuss 1148 the second and third type of links failures. 1150 Edge to Aggregation Switches 1152 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1153 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1154 failure may affect 3 types of communications. 1156 o Sources lie in the same subnet with A, and destinations do not. In 1157 this case, the link failure will only affect the routing tables of A. 1158 As this link is attached to A directly, A only needs to delete the 1159 route entries whose next hop is B in its BRT and add no entries to 1160 its NRT when A's M6 module detect the link failure. 1162 o Destinations lie in the same subnet with A, and sources lie in 1163 another subnet of the same pod. In this case, the link failure will 1164 affect the routing tables of all the edge switches in the same pod 1165 except for A. When an edge switch, such as 10.1.1.1, learns the link 1166 failure, it will add a route entry to its NRT: 1168 Destination/Mask Next hop 1169 10.1.2.0/255.255.255.0 10.1.0.1 1171 o Destinations lie in the same subnet with A, sources lie in another 1172 pod. In this case, the link failure will affect the routing tables 1173 of all the edge switches in the other pods. When an edge switch in 1174 one other pod, such as 10.3.1.1, learns the link failure, because all 1175 the routings that pass through 10.3.0.1 to A will certainly pass 1176 through the link between A and B, 10.3.1.1 need add a route entry to 1177 its NRT: 1179 Destination/Mask Next hop 1180 10.1.2.0/255.255.255.0 10.3.0.1 1181 Aggregation to Core Switches 1183 Suppose that the link between an aggregation switch, such as 10.1.0.1 1184 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1185 failure may affect 2 types of communications. 1187 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1188 the other pods. In this case, the link failure will only affect the 1189 routing tables of A. As this link is attached to A directly, A only 1190 need to delete the route entries whose next hop is B in its BRT and 1191 add no entries to its NRT when A's M6 module detect the link failure. 1193 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1194 another pod. In this case, the link failure will affect the routing 1195 tables of all the aggregation switches in other pods except for pod 1196 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1197 learns the link failure, because all the routings that pass through 1198 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1199 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1201 Destination/Mask Next hop 1202 10.1.0.0/255.255.0.0 10.0.1.2 1204 10.2.2. A Group of Link Failures 1206 If all the uplinks of an aggregation switch fail, then this switch 1207 cannot forward packets, which will affect the routing of every edge 1208 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1209 fail, it will affect two types of communications. 1211 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1212 the other pods. In this case, the link failures will affect the 1213 routing of the edge switches in the Pod of A. To avoid the node A, 1214 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1215 10.1.0.1" in which the next hop is the node A. 1217 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1218 other pods. In this case, the link failures will affect the routing 1219 of edge switches in other pods. For example, if the edge switch 1220 10.3.1.1 communicates with some node in the pod of A, it should avoid 1221 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1222 pod of A will pass through the node A. So a route entry should be 1223 added to 10.3.1.1: 1225 Destination/Mask Next hop 1226 10.1.0.0/255.255.0.0 10.3.0.1 1228 10.2.3. Node Failures 1230 At last, we discuss the effect of node failures to a NRT. There are 1231 3 types of node failures: the failure of edge, aggregation and core 1232 switches. 1234 o An edge switch fails. The failure doesn't affect the routing table 1235 of any switch. 1237 o A core switch fails. Only when all the core switches connected to 1238 the same aggregation switch fail, they will affect the routing of 1239 other switches. This case is equal to the case that all the uplinks 1240 of an aggregation switch fail, so the process of link failures can 1241 cover it. 1243 o An aggregation switch fails. This case is similar to the case that 1244 all the uplinks of an aggregation switch fail. It affects the 1245 routing of edge switches in other pods, but doesn't affect the 1246 routing of edge switches in pod of the failed switch. The process of 1247 this failure is same to the second case in section 6.2.2. 1249 10.3. Routing Procedure 1251 FAR decides a routing by looking up its BRT and NRT. We illuminate 1252 the routing procedure by an example. In this example, we suppose 1253 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1254 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1255 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1256 (destination). 1258 Step 1: The source 10.3.1.3 sends packets to its default router 1259 10.3.1.1 1261 Step 2: The routing of 10.3.1.1. 1263 1) Calculate candidate hops 1265 10.3.1.1 looks up its BRT and gets the following matched entries: 1267 Destination/Mask Next hop 1268 10.0.0.0/255.0.0.0 10.3.0.1 1270 So the candidate hops = {10.3.0.1} 1272 2) Calculate avoiding hops 1274 Its NRT is empty, so the set of avoiding hop is empty too. 1276 3) Calculate applicable hops 1278 The applicable hops are candidate hops minus avoiding hops, so: 1280 The applicable hops = {10.3.0.1} 1282 4) Forward packets to 10.3.0.1 1284 Step 3: The routing of 10.3.0.1 1286 1) Calculate candidate hops. 1288 10.3. 0.1 looks up its BRT and gets the following matched entries: 1290 Destination/Mask Next hop 1291 10.1.0.0/255.255.0.0 10.0.1.1 1292 10.1.0.0/255.255.0.0 10.0.1.2 1294 So the candidate hops = {10.0.1.1, 10.0.1.2} 1296 2) Calculate avoiding hops 1298 Destination/Mask Next hop 1299 10.1.0.0/255.255.0.0 10.0.1.2 1301 So the avoiding hops = {10.0.1.2} 1303 3) Calculate applicable hops 1305 The applicable hops are candidate hops minus avoiding hops, so: 1307 The applicable hops = {10.0.1.1} 1309 4) Forward packets to 10.0.1.1 1311 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1312 routing tables. 1314 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1315 routing tables. 1317 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1318 layer-2 switching. 1320 10.4. FAR's Performance in Large-scale Networks 1322 FAR has good performance to support large-scale networks. In this 1323 section, we take a Fat-tree network composed of 2,880 48-port 1324 switches and 27,648 servers as an example to show FAR's performance. 1326 10.4.1. The number of control messages required by FAR 1328 FAR exchanges a few messages between routers and only consumes a 1329 little network bandwidth. Tab. 1 shows the required messages in the 1330 example Fat-tree network. 1331 Table 1:Required messages in a Fat-tree network. 1332 _____________________________________________________________________ 1333 Message Type| Scope | size(bytes) | Rate | Bandwidth 1334 --------------------------------------------------------------------- 1335 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1336 kbps 1337 --------------------------------------------------------------------- 1338 DLR |adjacent switches| less than 48 | (1) |48bytes 1339 --------------------------------------------------------------------- 1340 DA |entire network| less than 48 | (2) |1.106M 1341 --------------------------------------------------------------------- 1342 LFA |entire network| less than 48 | (3) |48 bytes 1343 _____________________________________________________________________ 1344 (1)Produce one when a router starts 1345 (2)The number of switches(2,880) in a period 1346 (3)Produce one when a link fails or recovers 1348 10.4.2. The Calculating Time of Routing Tables 1350 A BRT is calculated according to the states of its neighbor routers 1351 and attached links. An NRT is calculated according to device and 1352 link failures in the entire network. So FAR does not calculate 1353 network topology and has no problem of network convergence, which 1354 greatly reduces the calculating time of routing tables. The 1355 detection and spread time of link failures is very short in FAR. 1356 Detection time is up to the interval of sending Hello message. In 1357 FAR, the interval is set to 100ms, and a link failure will be 1358 detected in 200ms. The spread time between any pair of routers is 1359 less than 200ms.If a link fails in a data center network, FAR can 1360 detect it, spread it to all the routers, and calculate routing tables 1361 in no more than 500ms. 1363 10.4.3. The Size of Routing Tables 1365 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1366 in Tab. 2. 1368 Table 2: The size of routing tables in FAR 1369 _____________________________________________________________________ 1370 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1371 --------------------------------------------------------------------- 1372 BRT | 48 | 48 | 24 1373 --------------------------------------------------------------------- 1374 NRT | 0 | 14 | 333 1375 _____________________________________________________________________ 1377 The BRT's size at a switch is determined by the number of its 1378 neighbor switches. In the example network, a core switch has 48 1379 neighbor switches (aggregation switch), so it has 48 entries in its 1380 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1381 switch is related to the number of link failures in the network. 1382 Suppose that there are 1000 link failures in the example network, the 1383 number of failed links is 1.2% of total links, which is a very high 1384 failure ratio. We suppose that link failures are uniformly 1385 distributed in the entire network. The NRT size at an edge switch is 1386 about 333 and the NRT size of an aggregation switch is about 14in 1387 average. 1389 11. Implementations Examples 1391 In the FAR draft scenario, Fat-Tree topology has only three layers of 1392 routers. To expand the network scale is achieved through horizontal 1393 expansion: increase the number of core switches, and increase the 1394 number of aggregation switches and edge switches in pod. 1396 For example, the following two scenarios. 1398 10.0.1.1 10.0.2.1 10.0.3.1 10.0.24.1 1399 +--+ +--+ +--+ +--+ 1400 | |24devices | |24devices | |24devices | | 24devices 1401 +--+,_ +--+', +--+ ,,+--+ 1402 |`,`',`-., / | \ `. .'` - .-'``.` /| 1403 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1404 | \ `', `-., .` / | `, .` ,' | 1405 | `, `'. `'-,_ .'` ,' | ', / | 1406 | . `'. `-.,-` / | \ 1407 | \ `'., .` `'., ` | `. 1408 | `, .'`, ,`'., | ', 1409 | . ,-` '., - `'-,_ | `. 1410 | \ .` ,'., `|., . 1411 | .'` / `-, | `'., `. 1412 10.1.0.1 ,-` . .' `'., ', 1413 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1414 | |..| | | |..| | | |..| | | |..| |A total of 1152 aggregation switches 1415 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1416 | \/ | | \/ | | \/ | | \/ | 1417 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1418 | |..| | | |..| | | |..| | | |..| |A total of 1152 access switches 1419 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1420 /| 10.1.24.1 /| |\ |\ /| | | 1421 / | | \ / | | \ / | | \ / | | | 1422 / | | \ / | | \ / | | \ / | | | 1423 / | | \ / | | \ / | | \ / | | | 1424 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1425 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1427 Figure 9: 48 pods, each of which has 24 aggregation switches and 24 1428 access switches 1430 In the Fat-tree network of Figure 9, there are a total of 48 pods, 1431 each of which has 24 aggregation switches and 24 access switches, and 1432 each access switch is connected to 24 servers. 576 core switches, 1433 1152 aggregation switches, and 1152 access switches are required, for 1434 a total of 2880 switches, which can accommodate 27,648 servers. 1436 96 10 gigabit core switches 1437 +--+ +--+ +--+ +--+ 1438 | |4devices | |4devices | |4devices | |4devices 1439 +--+,_ +--+', +--+ ,,+--+ 1440 |`,`',`-., / | \ `. .'` - .-'``.` /| 1441 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1442 | \ `', `-., .` / | `, .` ,' | 1443 | `, `'. `'-,_ .'` ,' | ', / | 1444 | . `'. `-.,-` / | \ 1445 | \ `'., .` `'., ` | `. 1446 | `, .'`, ,`'., | ', 1447 | . ,-` '., - `'-,_ | `. 1448 | \ .` ,'., `|., . 1449 | .'` / `-, | `'., `. 1450 10.1.0.1 ,-` . .' `'., ', 1451 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1452 | |..| | | |..| | | |..| | | |..| |A total of 192 10 gigabit aggregation switches 1453 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1454 | \/ | | \/ | | \/ | | \/ | 1455 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1456 | |..| | | |..| | | |..| | | |..| |A total of 1152 access switches 1457 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1458 /| 10.1.4.1 /| |\ |\ /| | | 1459 / | | \ / | | \ / | | \ / | | | 1460 / | | \ / | | \ / | | \ / | | | 1461 / | | \ / | | \ / | | \ / | | | 1462 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1463 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1465 Figure 10: 48 pods. Each pod has 4 10G aggregation switches and 24 1466 Gigabit access switches 1468 In the Fat-Tree network of Figure 10, there are a total of 48 pods. 1469 Each pod has 4 10G aggregation switches and 24 Gigabit access 1470 switches. Each access switch is connected to 40 servers. Requires 1471 96 core switches, 192 aggregation switches, and 1152 access switches, 1472 for a total of 1,440 switches, which can accommodate 46,080 servers. 1474 12. Security Considerations 1476 The security considerations will be discussed in a future version of 1477 this document. 1479 13. Conclusions 1481 This draft introduces FAR protocol, a generic routing method and 1482 protocol, for data centers that have a regular topology. It uses two 1483 routing tables, a BRT and an NRT, to store the normal routing paths 1484 and the forbidden (to-be-avoided) routing paths, respectively. This 1485 makes the FAR protocol very simple and efficient. The sizes of these 1486 two tables are very small. Usually, a BRT has only several tens of 1487 entries and an NRT has only several or about a dozen entries. 1489 14. Acknowledgments 1491 This document is supported by ZTE Enterprise-University-Research 1492 Joint Project. 1494 15. References 1496 15.1. Normative References 1498 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1499 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1500 1997. 1502 [RFC2328] J. Moy, "OSPF Version 2", BCP 14, RFC2328, April 1998. 1504 15.2. Informative References 1506 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1507 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1508 2008. 1510 [BCube] Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., ... Lu, 1511 S. (2009, August). BCube: a high performance, server-centric network 1512 architecture for modular data centers. In Proceedings of the ACM 1513 SIGCOMM 2009 conference on Data communication (pp. 63-74). 1515 [MatrixDCN] Sun, Y., Chen, M., Peng, L., Hassan, M. M., Alelaiwi, A. 1516 (2016). MatrixDCN: a high performance network architecture for 1517 large-scale cloud data centers. Wireless Communications and Mobile 1518 Computing, 16(8), 942-959. 1520 16. Appendix 1522 16.1. Application Area of the Solution 1524 According to the horizontal expansion mode of the above scenarios, 1525 the whole Fat-Tree network does not need to be expanded to 4 layers 1526 (4 order Fat-Tree) even if it is expanded. Using a standard three- 1527 tier fattree network, we can scale the network to meet all the 1528 problems of commercial network applications. This scheme is suitable 1529 for non-SDN distributed Fat-Tree network architecture. 1531 16.2. Technical evolution roadmap 1533 In this draft, we should design different rules for FAR switches in 1534 different regular networks to calculate routing tables, which limits 1535 FAR's extensibility. Fortunately, the latest SDN technology make it 1536 is easy to update the control plane of switches, since all the 1537 function of control plane are centralized to a controller in SDN. We 1538 are designing the next generation routing scheme for regular networks 1539 based on SDN. In the new scheme, we design a regular ToPology 1540 Description Language (TPDL) to descript a regular network. In TPDL, 1541 the distance between different type of node groups is defined by a 1542 group of distance formulas and the number of formulas is finite and 1543 fixed without increasing by the scale of a network. And then, 1544 switches learn the topology of a network by taking advantage of TPDL 1545 and generate flow table entries to forward packets without help of 1546 the SDN controller. If no entry is found for a forwarding packet, 1547 switches transport the packet to the SDN controller, and the 1548 controller recalculates a new routing path using A* algorithm by 1549 taking TPDL's distance formulas as a heuristic function and dispatch 1550 flow table entries down to related switches on the path. 1552 Authors' Addresses 1554 Bin Liu 1555 ZTE Inc., ZTE Plaza 1556 No.19 East Huayuan Road,Hai Dian District 1557 Beijing 100191 1558 China 1560 Phone: +86 -010-59932039 1561 Email: 13683610386@139.com 1563 Yantao Sun 1564 Beijing Jiaotong University 1565 No.3 Shang Yuan Cun, Hai Dian District 1566 Beijing 100044 1567 China 1569 Email: ytsun@bjtu.edu.cn 1570 Jing Cheng 1571 Beijing Jiaotong University 1572 No.3 Shang Yuan Cun, Hai Dian District 1573 Beijing 100044 1574 China 1576 Email: yourney.j@gmail.com 1578 Yichen Zhang 1579 Beijing Jiaotong University 1580 No.3 Shang Yuan Cun, Hai Dian District 1581 Beijing 100044 1582 China 1584 Email: snowfall_dan@sina.com 1586 Bhumip Khasnabish 1587 ZTE TX Inc. 1588 55 Madison Avenue, Suite 160 1589 Morristown, New Jersey 07960 1590 USA 1592 Phone: +001-781-752-8003 1593 Email: vumip1@gmail.com 1594 URI: http://tinyurl.com/bhumip/