idnits 2.17.1 draft-sl-rtgwg-far-dcn-16.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 6 instances of too long lines in the document, the longest one being 41 characters in excess of 72. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 59 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (May 29, 2021) is 1062 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Missing Reference: 'RFC8174' is mentioned on line 226, but not defined == Unused Reference: 'FAT-TREE' is defined on line 1526, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 6 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: November 30, 2021 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 Individual contributor 10 May 29, 2021 12 Generic Fault-Avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-16 15 Abstract 17 This document describes a generic routing method and protocol for a 18 regular data center network, named the Fault-Avoidance Routing (FAR) 19 protocol. The FAR protocol provides a generic routing method for all 20 types of regular topology network architectures that have been 21 proposed for large-scale cloud-based data centers over the past few 22 years. The FAR protocol is designed to leverage any regularity in 23 the topology and compute its routing table in a concise manner. Fat- 24 tree is taken as an example architecture to illustrate how the FAR 25 protocol can be applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on November 30, 2021. 44 Copyright Notice 46 Copyright (c) 2021 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 5 63 2. Conventions used in this document . . . . . . . . . . . . . . 5 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 6 66 3.2. Issues of Conventional Routing Methods in a Large-scale 67 Network with Giant Number Nodes of Routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 9 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 9 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 9 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 10 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 10 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 11 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 11 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 14 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 18 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 18 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 18 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 19 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 19 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 19 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 20 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 20 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 20 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 23 86 9. Topology identification and broadcast storm suppression . . . 23 87 10. Application Example . . . . . . . . . . . . . . . . . . . . . 24 88 10.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 26 89 10.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 27 90 10.2.1. Single Link Failure . . . . . . . . . . . . . . . . 27 91 10.2.2. A Group of Link Failures . . . . . . . . . . . . . . 28 92 10.2.3. Node Failures . . . . . . . . . . . . . . . . . . . 29 93 10.3. Routing Procedure . . . . . . . . . . . . . . . . . . . 29 94 10.4. FAR's Performance in Large-scale Networks . . . . . . . 31 95 10.4.1. The number of control messages required by FAR . . . 31 96 10.4.2. The Calculating Time of Routing Tables . . . . . . . 31 97 10.4.3. The Size of Routing Tables . . . . . . . . . . . . . 31 98 11. Implementations Examples . . . . . . . . . . . . . . . . . . 32 99 12. Security Considerations . . . . . . . . . . . . . . . . . . . 34 100 13. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 35 101 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 35 102 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 103 15.1. Normative References . . . . . . . . . . . . . . . . . . 35 104 15.2. Informative References . . . . . . . . . . . . . . . . . 35 105 16. Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . 36 106 16.1. Application Area of the Solution . . . . . . . . . . . . 36 107 16.2. Technical evolution roadmap . . . . . . . . . . . . . . 36 108 16.3. Updating roadmap . . . . . . . . . . . . . . . . . . . . 36 109 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36 111 1. Introduction 113 In recent years, with the rapid development of cloud computing 114 technologies, the widely deployed cloud services, such as Amazon EC2 115 and Google search, bring about huge challenges to data center 116 networking (DCN). Today's cloud-based data centers (DCs) require 117 large-scale networks with larger internal bandwidth and smaller 118 transfer delay. However, conventional networks cannot meet such 119 requirements due to limitations in their network architecture. In 120 order to satisfy the requirements of cloud computing services, many 121 new network architectures have been proposed for data centers, such 122 as Fat-tree, MatrixDCN[MatrixDCN], and BCube[BCube]. These new 123 architectures can support non-blocking large-scale datacenter 124 networks with more than tens of thousands of physical servers. 126 All of these architectures have regular topologies, which are common 127 features. The regular topology refers to the network topology 128 structure with obvious regularity and symmetry, which is conducive to 129 automatic configuration of the network, such as the Fat-tree network. 130 In a regular topology, each network node such as a switch or router 131 can be addressed by its location and through a node's address, the 132 node's connections to its neighbors in a network can be determined, 133 and furthermore, the route to the node from other nodes in the 134 network can be determined. So nodes can compute route entries 135 without learning topology. 137 This document describes a generic routing method and protocol, the 138 Fault-Avoidance Routing (FAR) protocol, for DCNs. This method 139 leverages the regularity in the topologies of data center networks to 140 simplify routing learning and accelerate the query of routing tables. 141 This routing method has a better fault tolerance and can be applied 142 to any DCN with a regular topology. 144 FAR is not a routing protocol to replace generic routing protocols 145 such as OSPF(Open Shortest Path First)[RFC2328] and IS- 146 IS(Intermediate System-to-Intermediate System). It cannot be used in 147 general local networks whose topological structures are arbitrary, 148 and whose scales are also not very large. OSPF and IS-IS work very 149 well in such a network. But in a large-scale network with regular 150 topology, FAR has better performance. Compared with OSPF and IS-IS, 151 FAR has shorter time of network convergence and lower PDU(Protocol 152 Data Unit) overhead. Furthermore, FAR requires less computing and 153 storage resources, which lets FAR routers to run at a lower cost of 154 production than the generic routers. 156 In addition, for each type of network architecture, researchers 157 designed a routing algorithm according to the features of its 158 topology. Because these routing algorithms are different and lack 159 compatibility with each other, it is very difficult to develop a 160 routing protocol for network routers supporting multiple routing 161 algorithms. FAR has better adaptability than these specified routing 162 methods. 164 FAR consists of three components, i.e., link state learning unit, 165 routing table building unit and routing table querying unit. In the 166 link state learning unit, FAR exchanges link failures among routers 167 to establish a consistent knowledge of the entire network. In this 168 stage, the regularity in topology is exploited to infer failed links 169 and routers. In the routing table building unit, FAR builds up two 170 routing tables, i.e., a basic routing table (BRT) and a negative 171 routing table (NRT), for each router according to the network 172 topology and link states. In the last component, routers forward 173 incoming packets by looking up the two routing tables. The matched 174 entries in BRT minus the matched entries in NRT are the final route 175 entries to be used to forward an incoming packet. 177 This document describes a protocol developed by ZTE and Beijing 178 Jiaotong University. It is just presented here to record the work 179 and to make it available for use in later IETF work if desirable. 181 The remainder of this draft is organized as follows. The problem to 182 be addressed by FAR is described in Section 3. The framework of FAR 183 routing protocol is described in Section 4. Section 5 and 6 184 introduce FAR's data format FAR and modules in detail. Section 7 185 describe how FAR works by finite state machine (FSM). In Section 8, 186 we discussed how FAR works with variable network architectures. 187 Section 9 takes Fat-tree network as an example to illuminate how FAR 188 works. 190 1.1. Acronyms & Definitions 192 DCN - Data Center Network 194 FAR - Fault-Avoidance Routing 196 BRT - Basic Routing Table 198 NRT - Negative Routing Table 200 NDT - Neighbor Devices Table 202 ADT - All Devices Table 204 LFT - Link Failure Table 206 DA - Device Announcement 208 LFA - Link Failure Announcement 210 DLR - Device and Link Request 212 IP - Internet Protocol 214 VM - Virtual Machine 216 2. Conventions used in this document 218 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 219 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 220 "OPTIONAL" in this document are to be interpreted as described in BCP 221 14 [RFC2119] [RFC8174] when, and only when, they appear in all 222 capitals, as shown here. In this document, these words will appear 223 with that interpretation only when in ALL CAPS. Lower case uses of 224 these words are not to be interpreted as carrying special 225 significance. 227 3. Problem Statement 229 The problem to be addressed by FAR as proposed in this draft is 230 described in this section. The expansion of Cloud data center 231 networks has brought significant challenges to the existing routing 232 technologies. FAR mainly solves a series of routing problems faced 233 by large-scale data center networks. 235 3.1. The Impact of Large-scale Networks on Route Calculation 237 In a large-scale cloud data center network, there may be thousands of 238 routers. Running OSPF or IS-IS in such network will encounter these 239 two challenges: 241 (1) Network convergence time would be too long, which will cause a 242 longer time to elapse for creating and updating the routes. The 243 response time to network failures may be excessively long; 245 (2) High resource consumption. Since a large number of routing 246 protocol packets need to be sent, it causes the routing information 247 consuming too much network bandwidth and CPU(central processing unit) 248 resources, and easily leads to packet loss and makes the challenge 249 (1) more prominent. 251 In order to solve these two challenges, a common practice is to 252 partition a large network into some small areas, where the route 253 calculation runs independently within different areas. However, 254 nowadays the cloud data centers typically require very large internal 255 bandwidth. To meet this requirement, a large number of parallel 256 equivalent links are deployed in a network, such as a Fat-tree 257 network. Partitioning such a network will affect the utilization of 258 routing algorithm on equivalent multi-path and reduce internal 259 network bandwidth requirements. 261 In the FAR routing calculation process, a Basic Routing Table (BRT) 262 is built on local network topology leveraging the regularity of the 263 network topologies. In addition to BRT, FAR also builds a Negative 264 Routing Table (NRT). FAR gradually builds NRT in the process of 265 learning network link failure information, which does not require 266 learning a complete network fault information. FAR does not need to 267 wait for the completion of the network convergence in the process of 268 building these two tables. Therefore, it avoids the problem of 269 excessive network convergence overheads in the route calculation 270 process. In addition, FAR only needs to exchange a small amount of 271 link change information between routers, and hence consumes less 272 network bandwidth. 274 3.2. Issues of Conventional Routing Methods in a Large-scale Network 275 with Giant Number Nodes of Routers 277 There are many real world scenario where tens of thousands of 278 nodes(or much more nodes) need to be deployed in a flat area, such as 279 infiniband routing and switching system, high-performance computer 280 network, and many IDC(Internet Data Center) networks in China. The 281 similar problems have been existed long ago. People have solved the 282 problems through similar solutions, such as the traditional regular 283 topology-based RFC3619[RFC3619] protocol, the routing protocols of 284 infiniband routing and switching system, and high-performance 285 computer network routing protocol. 286 Infiniband defines a switch-based network to interconnect processing 287 nodes and the I/O nodes. Infiniband can support very large scale 288 networks, use the regularity in topology to simplify its routing 289 algorithm, which is just the same to what we do in FAR. 291 Why OSPF and IS-IS do not work well in a large-scale network with 292 giant number nodes of routers? 294 As we know, the OSPF protocol uses multiple databases, more 295 topological exchange information (as seen in the following example) 296 and complicated algorithm. It requires routers to consume more 297 memory and CPU processing capability. But the processing rate of CPU 298 on the protocol message per second is very limited. When the network 299 expands, CPU will quickly approach its processing limits, and at this 300 time OSPF can not continue to expand the scale of the management. 301 The SPF(Shortest Path First) algorithm itself does not thoroughly 302 solve these problems. 304 On the contrary, the FAR protocol does not need to calculate SPF, 305 which saves calculation time and resources, so FAR does not have the 306 convergence time delay and the additional CPU overheads, which SPF 307 requires. Because in the initial stage, FAR already knows the 308 regular information of the whole network topology and does not need 309 to periodically do SPF operation. 311 One of the examples of "more topological exchange information": 312 In the OSPF protocol, LSA(Link-State Advertisement) floods every 1800 313 seconds. Especially in the larger network, the occupation of CPU and 314 band bandwidth will soon reach the router's performance bottleneck. 315 In order to reduce these adverse effects, OSPF introduced the concept 316 of Area, which still has not solved the problem thoroughly. By 317 dividing the OSPF Area into several areas, the routers in the same 318 area do not need to know the topological details outside their area. 319 (In comparison with FAR, after OSPF introducing the concept of Area, 320 the equivalent paths cannot be selected in the whole network scope) 322 OSPF can achieve the following results by Area : 1) Routers only need 323 to maintain the same link state databases as other routers within the 324 same Area, without the necessity of maintaining the same link state 325 database as all routers in the whole OSPF domain. 2) The reduction 326 of the link state databases means dealing with relatively fewer LSA, 327 which reduces the CPU consumption of routers; 3) The large number of 328 LSAs flood only within the same Area. But, its negative effect is 329 that the smaller number of routers which can be managed in each OSPF 330 area. On the contrary, because FAR does not have the above 331 disadvantages, FAR can also manage large-scale network even without 332 dividing Areas. 334 The aging time of OSPF is set in order to adapt to routing 335 transformation and protocol message exchange happened frequently in 336 the irregular topology. Its negative effect is: when the network 337 does not change, the LSA needs to be refreshed every 1800 seconds to 338 reset the aging time. In the regular topology, as the routings are 339 fixed, it does not need the complex protocol message exchange and 340 aging rules to reflect the routing changes, as long as LFA mechanism 341 in the FAR is enough. 343 Compared with the LSVR(Link State Vector Routing) protocol, the LSVR 344 protocol has no special requirements for the network topology 345 structure, however, the FAR draft is applicable to the regular 346 topology network architecture and simplifies unnecessary processing. 347 It is a solution proposed to greatly improve the routing efficiency 348 of the regular network topology. The FAR solution is more efficient 349 than the general methods such as LSVR in regular topology. 351 Therefore, in FAR, we can omit many unnecessary processing and the 352 packet exchange. The benefits are fast convergence speed and much 353 larger network scale than other dynamic routing protocol. Now there 354 are some successful implementations of simplified routings in the 355 regular topology in the HPC(High Performance Computing) environment. 356 Conclusion: As FAR needs few routing entries and the topology is 357 regular, the database does not need to be updated regularly. Without 358 the need for aging, there is no need for CPU and bandwidth overhead 359 brought by LSA flood every 30 minutes, so the expansion of the 360 network has no obvious effect on the performance of FAR, which is 361 contrary to OSPF. 363 Comparison of convergence time: The settings of OSPF spf_delay and 364 spf_hold_time can affect the change of convergence time. The 365 convergence time of the network with 2480 nodes is about 15-20 366 seconds; while the FAR does not need to calculate the SPF, so there 367 is no such convergence time. 368 These issues still exist in rapid convergence technology of OSPF 369 ,ISIS (such as I-SPF, Incremental SPF) and LSVR. The convergence 370 speed and network scale constraint each other. FAR does not have the 371 above problems, and the convergence time is almost negligible. 372 Can FRR(Fast Reroute) solve these problems? IP FRR has some 373 limitations. The establishment of IP FRR backup scheme will not 374 affect the original topology and traffic forwarding which are 375 established by protocol, however, we can not get the information of 376 whereabouts and status when the traffic is switched to an alternate 377 next hop. 379 3.3. Network Addressing Issues 381 Routers are typically configured with multiple network interfaces, 382 each connected to a subnet. OSPF and other routing algorithms 383 require that each interface of a router must be configured with an IP 384 address. A large-scale data center network may contain thousands of 385 routers and each router has dozens of network interfaces, thus, there 386 are tens of thousands of IP addresses needed to be configured in a 387 data center. It will be very complex to configure and manage a large 388 number of network interfaces and will be difficult to troubleshoot 389 network problems, then network maintenance will be costly and error- 390 prone. 392 In FAR, the device position information is encoded in the IP address 393 of the router. Each router only needs to be assigned a unique IP 394 address according its location, which greatly solves complex network 395 addressing issues in large-scale networks. 397 3.4. Big Routing Table Issues 399 There are a large number of subnets in the large-scale data center 400 network. A router may build a routing entry for each subnet, and 401 therefore the size of routing tables on each router may be very 402 large. It will increase a router's cost and reduce the querying 403 speed of the routing table. 405 FAR uses two measures to reduce the size of its routing tables: a)It 406 builds a BRT on the regularity of the network topologies; b)It 407 introduces a new routing table, i.e., a NRT. In this way FAR can 408 reduce the size of routing tables to only a few dozen routing 409 entries. 411 3.5. Adaptivity Issues for Routing Algorithms 413 To implement efficient routing in large-scale datacenters, besides 414 FAR, some other routing methods are proposed for some specific 415 network architectures, such as Fat-tree and BCube. These routing 416 methods are different(from both design and implementation viewpoints) 417 and not compatible with the conventional routing methods, which 418 brings big troubles to network equipment providers to develop new 419 routers supporting various new routing methods. 421 FAR is a generic routing method. With slight modification, FAR 422 method can be applied to most of regular datacenter networks. 423 Furthermore, the structure of routing tables and querying a routing 424 table in FAR are the same as conventional routing method. If FAR is 425 adopted, the workload of developing a new type of router will be 426 significantly decreased. 428 3.6. Virtual Machine Migration Issues 430 Supporting VM migration is very important for cloud-based datacenter 431 networks. However, in order to support layer-3 routing, routing 432 methods including OSPF and FAR require limiting VM migration within a 433 subnet. For this paradox, the mainstream methods still utilize 434 layer-3 routing on routers or switches, transmit packets encapsulated 435 by IPinIP or MACinIP between hosts by tunnels passing through network 436 to the destination access switch, and then extract original packet 437 out and send it to the destination host. 439 By utilizing the aforementioned methods, FAR can be applied to Fat- 440 tree, MatrixDCN or BCube networks for supporting VM migration in 441 entire network. 443 4. The FAR Framework 445 FAR requires that a DCN has a regular topology, and network devices, 446 including routers, switches, and servers, are assigned IP addresses 447 according to their locations in the network. In other word, we can 448 locate a device in the network according to its IP address. 450 FAR is a distributed routing method. In order to support FAR, each 451 router needs to have a routing module that implements the FAR 452 algorithm. FAR algorithm is composed of three parts, i.e., link- 453 state learning, routing table building and routing table querying, as 454 shown in Fig. 1. 456 Link-state Learning |Routing Table | Routing Table 457 | Building | Querying 458 | | 459 +--------+ /---------------\ | +--------+ | 460 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | Packets 461 |Learning| | Link Detection| | |Building|\ | | 462 +--------+ \---------------/ | +--------+ \ | \|/ 463 | | | \ |+--------------+ 464 | | | /|| 7 Querying | 465 \|/ \|/ | / || Routing Table| 466 +-----------------------+| / |+--------------+ 467 |3 Invisible Neighbor & || / | 468 |Link Failure Inferring || +--------- | 469 +-----------------------+|/| 6 NRT | | 470 | / |Building| | 471 \|/ /| +--------+ | 472 +--------------+ / | | 473 |4 Link Failure| / | | 474 | Learning | | | 475 +--------------+ | | 476 | | 478 Figure 1: The FAR framework 480 1:Neighbor and Link Detection Module(M1) 481 2:Device Learning Module(M2) 482 3:Invisible Neighbor and Link Failure Inferring Module(M3) 483 4:Link Failure Learning Module(M4) 484 5:BRT Building Module(M5) 485 6:NRT Building Module(M6) 486 7:Routing Table Lookup(M7) 487 The meanings of M1-M7 are explained in detail in section 6. 488 Link-state learning is responsible for a router to detect the states 489 of its connected links and learn the states of all the other links in 490 the entire network. The second part builds two routing tables, a 491 basic routing table (BRT) and an negative routing table (NRT), 492 according to the learned link states in the first part. The third 493 part queries the BRT and the NRT to decide a next forwarding hop for 494 the received (ingress) packets. 496 5. Data Format 498 5.1. Data Tables 500 Some data tables are maintained on each router in FAR. They are: 502 Neighbor Device Table (NDT): To store neighbor routers and related 503 links. 505 All Devices Table (ADT): To store all routers in the entire network. 507 Link Failures Table (LFT): To store all link failures in the entire 508 network. 510 Basic Routing Table (BRT): To store the candidate routes. 512 Negative Routing Table(NRT): To store the avoiding routes. 514 The format of NDT 516 ---------------------------------------------------------- 517 Device ID | Device IP | Port ID | Link State | Update Time 518 ---------------------------------------------------------- 520 Device ID: The ID of a neighbor router. 522 Device IP: The IP address of a neighbor router. 524 Port ID: The port ID that a neighbor router is attached to. 526 Link State: The state of the link between a router and its neighbor 527 router. There are two states: Up and Down. 529 Update Time: The time of updating the entry. 531 The format of ADT 533 -------------------------------------------------- 534 Device ID | Device IP | Type | State | Update Time 535 -------------------------------------------------- 537 Device ID: The ID of a neighbor router. 539 Device IP: The IP address of a neighbor router. 541 Type: The type of a neighbor router. 543 State: The state of a neighbor router. There are two states: Up and 544 Down. 546 Update Time: The time of updating the entry. 548 The format of LFT 549 -------------------------------------------- 550 No | Router 1 IP | Router 2 IP | Timestamp 551 -------------------------------------------- 553 No: The entry number. 555 Router 1 IP: The IP address of one router that a failed link connects 556 to. 558 Router 2 IP: The IP address of another router that a failed link 559 connects to. 561 Timestamp: It identifies when the entry is created. 563 The format of BRT 565 ------------------------------------------------------- 566 Destination | Mask | Next Hop | Interface | Update Time 567 ------------------------------------------------------- 569 Destination: A destination network 571 Mask: The subnet mask of a destination network. 573 Next Hop: The IP address of a next hop for a destination. 575 Interface: The interface related to a next hop. 577 Update Time: The time of updating the entry. 579 The format of NRT 581 ------------------------------------------------------------------- 582 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 583 ------------------------------------------------------------------- 585 Destination: A destination network. 587 Mask: The subnet mask of a destination network. 589 Next Hop: The IP address of a next hop that should be avoided for a 590 destination. 592 Interface: The interface related to a next hop that should be 593 avoided. 595 Failed Link No: A group of failed link numbers divided by "/", for 596 example 1/2/3. 598 Timestamp: The time of updating the entry. 600 5.2. Messages 602 Some protocol messages are exchanged between routers in FAR. 604 Hello Message: This message is exchanged between neighbor routers to 605 learn adjacency. 607 Device Announcement (DA): Synchronize the knowledge of routers 608 between routers. 610 Link Failure Announcement (LFA): Synchronize link failures between 611 routers. 613 Device and Link Request (DLR): When a router starts, it requests the 614 knowledge of routers and links from its neighbors by a DLR message. 616 A FAR Message is directly encapsulated in an IP packet. The protocol 617 field of IP header indicates an IP packet is an FAR message. 619 The four types of FAR messages have same format of packet header, 620 called FAR header (as shown in Figure 2). 622 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 623 +-----------+------------+------------------------+ 624 | Version |Message Type| Message Length | 625 +-----------+------------+------------------------+ 626 | Checksum | AuType | 627 +------------------------+------------------------+ 628 | Authentication | 629 +-------------------------------------------------+ 630 | Authentication | 631 +-------------------------------------------------+ 632 | Timestamp | 633 +-------------------------------------------------+ 635 Figure 2: The format of FAR header 637 Version: FAR version 639 Message Type: The type of FAR message. 641 Packet Length: The packet length of the total FAR message. 643 Checksum: The checksum of an entire FAR message. 645 AuType:Authentication type. 0: no authentication, 1: Plaintext 646 Authentication, 2: MD5 Authentication. 648 Authentication: Authentication information. 0: undefined, 1: Key, 2: 649 key ID, MD5 data length and packet number. MD5 data is appended to 650 the backend of the packet. 652 AuType and Authentication can refer to the definition of OSPF packet. 654 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 655 +-----------+------------+------------------------+ 656 | Version |Message Type| Message Length | 657 +-----------+------------+------------------------+ 658 | Checksum | AuType | 659 +------------------------+------------------------+ 660 | Authentication | 661 +-------------------------------------------------+ 662 | Authentication | 663 +-------------------------------------------------+ 664 | Timestamp | 665 +-------------------------------------------------+ 666 | Router IP | 667 +------------------------+------------------------+ 668 | HelloInterval | HelloDeadInterval | 669 +------------------------+------------------------+ 670 | Neighbor Router IP | 671 +-------------------------------------------------+ 672 | ... | 673 +-------------------------------------------------+ 675 Figure 3: The Format of Hello Messages 677 For Hello messages, the Message Type in FAR header is set to 678 1.Besides FAR header, a Hello message(Fig. 3) requires the following 679 fields: 681 Router IP: The router IP address. 683 HelloInterval: The interval of sending Hello messages to neighbor 684 routers. 686 RouterDeadInterval: The interval to set a neighbor router dead(out- 687 of-service). If in the interval time, a router doesn't receive a 688 Hello message from its neighbor router, the neighbor router is 689 treated as dead. 691 Neighbor Router IP: The IP address of a neighbor router. All the 692 neighbor router's addresses should be included in a Hello message. 694 |<----1---->| <----1---->|<----------2----------->| 695 +-----------+------------+------------------------+ 696 | Version |Message Type| Message Length | 697 +-----------+------------+------------------------+ 698 | Checksum | AuType | 699 +------------------------+------------------------+ 700 | Authentication | 701 +-------------------------------------------------+ 702 | Authentication | 703 +-------------------------------------------------+ 704 | Timestamp | 705 +------------------------+------------------------+ 706 | Router1 IP | 707 +-------------------------------------------------+ 708 | ... | 709 +-------------------------------------------------+ 711 Figure 4: The Format of DA Messages 713 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 714 Besides FAR header, a DA message includes IP addresses of all the 715 announced routers. 717 |<----1---->| <----1---->|<----------2----------->| 718 +-----------+------------+------------------------+ 719 | Version |Message Type| Message Length | 720 +-----------+------------+------------------------+ 721 | Checksum | AuType | 722 +------------------------+------------------------+ 723 | Authentication | 724 +-------------------------------------------------+ 725 | Authentication | 726 +-------------------------------------------------+ 727 | Timestamp | 728 +------------------------+------------------------+ 729 | Left IP | 730 +-------------------------------------------------+ 731 | Right IP | 732 +------------------------+------------------------+ 733 | State | 734 +-------------------------------------------------+ 735 | ... | 736 +-------------------------------------------------+ 738 Figure 5: The Format of LFA Messages 740 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 741 Besides FAR header, a LFA message includes all the announced link 742 failures. 744 Left IP: The IP address of the left endpoint router of a link. 746 Right IP: The IP address of the right endpoint router of a link. 748 State: Link state. 0: Up, 1: down 749 |<----1---->| <----1---->|<----------2----------->| 750 +-----------+------------+------------------------+ 751 | Version |Message Type| Message Length | 752 +-----------+------------+------------------------+ 753 | Checksum | AuType | 754 +------------------------+------------------------+ 755 | Authentication | 756 +-------------------------------------------------+ 757 | Authentication | 758 +-------------------------------------------------+ 759 | Timestamp | 760 +-------------------------------------------------+ 762 Figure 6: The Format of DLR Messages 764 For DLR messages(Fig. 6), the Message Type in FAR header is set to 765 1.Except for FAR header, DLR has no additional fields. 767 6. FAR Modules 769 6.1. Neighbor and Link Detection Module(M1) 771 M1 is responsible for sending and receiving Hello messages, and 772 detecting directly-connected links and neighbor routers. Each Hello 773 message is encapsulated in an IP packet. M1 sends Hello messages 774 periodically to all the active router ports and receives Hello 775 messages from its neighbor routers. M1 detects neighbor routers and 776 directly-connected links according to received Hello Messages and 777 stores these neighbors and links into a Neighbor Devices Table (NDT). 778 Additionally, M1 also stores neighbor routers into an All Devices 779 Table (ADT). 781 6.2. Device Learning Module(M2) 783 M2 is responsible for sending, receiving, and forwarding device 784 announcement (DA) messages, learning all the routers in the whole 785 network, and deducing faulted routers. When a router starts, it 786 sends a DA message announcing itself to its neighbors and a DLR 787 message requesting the knowledge of routers and links from its 788 neighbors. If M2 module of a router receives a DA message, it checks 789 whether the router encapsulated in the message is in an ADT. If the 790 router is not in the ADT, M2 puts this router into the ADT and 791 forwards this DA message to all the active ports except for the 792 incoming one, otherwise, M2 discards this message directly. If M2 793 module of a router receives a DLR message, it replies a DA message 794 that encapsulates all of the learned routers. 796 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 798 M3 is responsible for inferring invisible neighbors of the current 799 router by means of the ADT. If the link between a router A and its 800 neighbor B breaks, which results in that M1 module of A cannot detect 801 the existence of B, then B is an invisible neighbor of A. Since a 802 device's location is coded into its IP address, it can be judged 803 whether two routers are adjacent, according to their IP addresses. 804 Based on this idea, M3 infers all of the invisible neighbors of the 805 current router and the related link failures. The results are stored 806 into an NDT. Moreover, link failures also are added into a link- 807 failure table (LFT). LFT stores all of the failed links in the 808 entire network. 810 6.4. Link Failure Learning Module(M4) 812 M4 is responsible for sending, receiving and forwarding link failure 813 announcement (LFA) and learning all the link failures in the whole 814 network. M4 broadcasts each newly inferred link failure to all the 815 routers in the network. Each link failure is encapsulated in a LFA 816 message and one link failure is broadcasted only once. If a router 817 receives a DLR request from its neighbor, it will reply a LFA message 818 that encapsulates all the learned link failures through M4 module. 819 If M4 receives a LFA message, it checks whether the link failure 820 encapsulated in the message is in a LFT by comparing two link ends 821 and timestamp. If the link failure is not in the LFT or timestamp is 822 different, M4 puts this link failure into the LFT (or update 823 timestamp only) and forwards this LFA message to all the active ports 824 except for the incoming one, otherwise, M4 discards this message 825 directly. 827 There is a special case a router will rebroadcast a link failure. If 828 a router receives a data packet and must forward the packet going 829 ahead to destination through a failed link, it means some previous 830 router should avoid this failed link according to its NRT but it 831 doesn't. In this case, maybe the previous router missed the LFA 832 message of the link failure due to some uncertain reasons. So the 833 forwarding router rebroadcasts the LFA message. 835 6.5. BRT Building Module(M5) 837 M5 is responsible for building a BRT for the current router. By 838 leveraging the regularity in topology, M5 can calculate the routing 839 paths for any destination without the knowledge of the topology of 840 whole network, and then build the BRT based on an NDT. Since the IP 841 addresses of network devices are continuous, M5 only creates one 842 route entry for a group of destination addresses that have the same 843 network prefix by means of route aggregation technology. Usually, 844 the size of a BRT is very small. The detail of how to build a BRT is 845 described in section 5. 847 6.6. NRT Building Module(M6) 849 M6 is responsible for building a NRT for the current router. Because 850 M5 builds a BRT without considering link failures in network, the 851 routing paths calculated by the BRT cannot avoid failed links. To 852 solve this problem, a NRT is used to exclude the routing paths that 853 include some failed links from the paths calculated by a BRT. M6 854 calculate the routing paths that include failed links and stored them 855 into the NRT. The details of how to build a NRT is described in 856 section 5. 858 6.7. Routing Table Lookup(M7) 860 M7 is responsible for querying routing tables and selecting the next 861 hop for forwarding the packets. Firstly, M7 takes the destination 862 address of a forwarding packet as a criterion to look up route 863 entries in a BRT based on longest prefix match. All of the matched 864 entries are composed of a candidate hops list. Secondly, M7 look up 865 negative route entries in a NRT taking the destination address of the 866 forwarding packet as criteria. This lookup is not limited to the 867 longest prefix match, any entry that matches the criteria would be 868 selected and composed of an avoiding hops list. Thirdly, the 869 candidate hops minus avoiding hops are composed of an applicable hops 870 list. At last, M7 sends the forwarding packet to any one of the 871 applicable hops. If the applicable list is empty, the forwarding 872 packet will be dropped. 874 7. How a FAR Router Works 876 Figure 7 shows how a FAR router works by its FSM. 878 /---------------\ 879 | Start | 880 | | 881 \---------------/ 882 | 883 +--------+ |1 884 | | | 885 | \|/ \|/ 886 | +--------------------+ 887 |2 | ND | 888 | | | 889 | +--------------------+ 890 | | | 891 | | |3 892 +--------+ | 893 \|/ 894 +--------------+ 895 | ND-FIN | 896 | | 897 +--------------+ 898 | 899 |4 900 | 901 \|/ 902 ________10_______\+--------------+/_______11________ 903 | /| |\ | 904 | | Listen | | 905 | ____9_______\| |/_______12___ | 906 | | /| |\ | | 907 | | +--------------+ | | 908 | | 5/ | | \8 | | 909 | | |/_ | | _\| | | 910 | | +------------+ 6| |7 +------------+ | | 911 | --| HELLO-RECV | | | | LFA-RECV |-- | 912 | +------------+ | | +------------+ | 913 | ______| |______ | 914 | | | | 915 | \|/ \|/ | 916 | +------------+ +------------+ | 917 |___________| DLR-RECV | | DA-RECV |__________| 918 +------------+ +------------+ 920 _________ 921 | | 922 | | 923 \|/ | 924 +--------------+ |13 925 | Hello Thread | | 926 +--------------+ | 927 | | 928 |_________| 930 _________ 931 | | 932 | | 933 \|/ | 934 +--------------+ |14 935 | LFD Thread | | 936 +--------------+ | 937 | | 938 |_________| 940 _________ 941 | | 942 | | 943 \|/ | 944 +--------------+ |15 945 | DA Thread | | 946 +--------------+ | 947 | | 948 |_________| 950 _________ 951 | | 952 | | 953 \|/ | 954 +--------------+ |16 955 | DFD Thread | | 956 +--------------+ | 957 | | 958 |_________| 960 Figure 7: The Finite State Machine of FAR Router 962 1)When a router starts up, it starts a Hello thread and then starts 963 ND (neighbor detection) timer (3 seconds). Next the router goes into 964 ND (neighbor detection) state. 965 2)In the ND state, if a router received a Hello message, then it 966 performs a Hello-message processing and goes back to the ND state. 967 3)When the ND timer is over, a router goes into ND-FIN (neighbor 968 detection finished) state. 969 4)A router starts the LFD (link failure detection) thread and DFD 970 (device failure detection) state, and sends DA message and DLR 971 message to all of its active ports. Then the router goes into Listen 972 state. 973 5) If a router receives a Hello message, then goes into HELLO-RECV 974 state. 975 6) If a router receives a DLR message, then goes into DLR-RECV state. 976 7) If a router receives a DA message, then goes into DA-RECV state. 977 8) If a router receives a LFA message, then goes into LFA-RECV state. 978 9) A router performs the Hello-message processing. After that, it 979 goes back to Listen state. 980 10) A router performs the DLR-message processing. After that, it 981 goes back to Listen state. 982 11) A router performs the DA-message processing. After that, it goes 983 back to Listen state. 985 12) A router performs the LFA-message processing. After that, it 986 goes back to Listen state. 987 13) Hello thread produces and sends Hello messages to all its ports 988 periodically. 989 14) LFD thread calls link-failure-detection processing to check link 990 failures in all links periodically 991 15) DA thread produces and sends DA messages periodically (30 992 minutes). 993 16) When DFD thread starts up, it sleep a short time (30 seconds) to 994 wait for a router learning all the active routers in the network. 995 Then the thread calls the device-failure-detection processing to 996 check device failures periodically (30 minutes). 998 8. Compatible Architecture 1000 As a generic routing protocol, FAR can be run in various DCNs with 1001 regular topology. Up to now, we have implemented the FAR protocol 1002 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 1004 For different network architectures, most processing of FAR is same 1005 besides calculation of routing tables. BRT routing tables are 1006 calculated based on Hello messages and NRT routing tables are 1007 calculated based on LFA messages in FAR. To extend FAR to support a 1008 new network architecture, only processing of Hello and LFA messages 1009 need providing to build BRT and NRT routing tables. 1011 In this protocol, FAR can support maximally 12 network architectures 1012 and at least support 1 built-in network architecture, such as Fat- 1013 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 1014 a unique number from 1 to 12. For example, if the 1 built-in 1015 architectures are assigned 1, and other customized architectures are 1016 assigned 2 to 12. 1017 1: Fat-tree 1018 2: BCube 1019 3: MatrixDCN. 1020 4: xxx. 1021 ...... 1022 12: xxx. 1024 9. Topology identification and broadcast storm suppression 1026 In this design, the initial topology discovery process is not a 1027 mandatory option for a FAR routing protocol. The recommended 1028 solution here is to use a pre-configured configuration file, which 1029 contains topology parameters of the current system, each node device 1030 as long as according to these configuration parameters will be able 1031 to know the topology information. In this way, we do not have to 1032 deal with complex topology discovery processes, nor do we need to 1033 calculate the shortest path, because the optimal path can be 1034 calculated from the parameters. This protocol also allows the 1035 formation of configuration files to be submitted to the topology 1036 discovery protocol, allowing for a variety of different 1037 implementation options. 1039 Regarding the flood suppression processing of broadcast packets, it 1040 has been considered in the previous content. Since the hello packets 1041 is only transmitted between the two nodes, it cannot be spread out. 1042 The link error message is only sent to the CPU, and are not forwarded 1043 to the nodes in layer 2 broadcasting. Moreover, each node will 1044 discard the repeated error messages when the node receives them. In 1045 this way, the broadcast storm can be suppressed. If a link is 1046 unstable and repeatedly up or down, the system will not send new 1047 messages after sending notifications, and the system will not 1048 oscillate repeatedly. The topology is updated only when the link is 1049 later detected to be stable for a long time. 1051 10. Application Example 1053 In this section, we take a Fat-tree network(Fig. 7) as an example to 1054 describe how to apply FAR routing. Since M1 to M4 are very simple, 1055 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 1056 network. 1058 A Fat-tree network is composed of 4 layers. The top layer is core 1059 layer, and the other layers are aggregation layer, edge layer and 1060 server layer. There are k pods, each one containing two layers of 1061 k/2 switches. Each k-port switch in the edge layer is directly 1062 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 1063 of the k-port switches in the aggregation layer. There are (k/2)2 1064 k-port core switches. Each core switch has one port connected to 1065 each of the k pods. 1067 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 1068 +--+ +--+ +--+ +--+ 1069 | | | | | | | | 1070 +--+,_ +--+', +--+ ,,+--+ 1071 |`,`',`-., / | \ `. .'` - .-'``.` /| 1072 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1073 | \ `', `-., .` / | `, .` ,' | 1074 | `, `'. `'-,_ .'` ,' | ', / | 1075 | . `'. `-.,-` / | \ 1076 | \ `'., .` `'., ` | `. 1077 | `, .'`, ,`'., | ', 1078 | . ,-` '., - `'-,_ | `. 1079 | \ .` ,'., `|., . 1080 | .'` / `-, | `'., `. 1081 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1082 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1083 | | | | | | | | | | | | | | | | 1084 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1085 | \/ | | \/ | | \/ | | \/ | 1086 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1087 | | | | | | | | | | | | | | | | 1088 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1089 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1090 / | | \ / | | \ / | | \ / | | | 1091 / | | \ / | | \ / | | \ / | | | 1092 / | | \ / | | \ / | | \ / | | | 1093 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1094 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1095 10.1.2.2 10.3.1.3 1097 Figure 8: Fat-tree Network 1099 Aggregation switches are given addresses of the form 10.pod.0.switch, 1100 where pod denotes the pod number, and switch denotes the position of 1101 that switch in the upper pod (in [1, k/2]). Edge switches are given 1102 addresses of the form 10.pod.switch.1, where pod denotes the pod 1103 number, and switch denotes the position of that switch in the lower 1104 pod (in [1, k/2]). The core switches are given addresses of the form 1105 10.0.j.i, where j and i denote that switch's coordinates in the 1106 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1107 The address of a host follows the pod switch to which it is connected 1108 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1109 the host's position in that subnet (in [2, k/2+1], starting from left 1110 to the right). 1112 10.1. BRT Building Procedure 1114 By leveraging the topology's regularity, every switch clearly knows 1115 how it forwards a packet. When a packet arrives at an edge switch, 1116 if the destination of the packet lies in the same subnet with the 1117 switch, then the switch directly forwards the packet to the 1118 destination server through layer-2 switching. Otherwise, the switch 1119 forwards the packet to any of aggregation switches in the same pod. 1120 When a packet arrives at an aggregation switch, if the destination of 1121 the packet lies in the same pod, the switch forwards the packet to 1122 the corresponding edge switch. Otherwise, the switch forwards the 1123 packet to any of core switches that it is connected to. If a core 1124 switch receives a packet, it forwards the packet to the corresponding 1125 aggregation switch that lies in the destination pod. 1127 The forwarding policy discussed above is easily expressed through a 1128 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1129 following entries: 1131 Destination/Mask Next hop 1132 10.0.0.0/255.0.0.0 10.1.0.1 1133 10.0.0.0/255.0.0.0 10.1.0.2 1135 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1136 the following entries: 1138 Destination/Mask Next hop 1139 10.1.1.0/255 255.255.0 10.1.1.1 1140 10.1.2.0/255.255.255.0 10.1.2.1 1141 10.0.0.0/255.0.0.0 10.0.1.1 1142 10.0.0.0/255.0.0.0 10.0.1.2 1144 The BRT of a core switch, such as 10.0.1.1, is composed of the 1145 following entries: 1147 Destination/Mask Next hop 1148 10.1.0.0/255 255.0.0 10.1.0.1 1149 10.2.0.0/255.255.0.0 10.2.0.1 1150 10.3.0.0/255.255.0.0 10.3.0.1 1151 10.4.0.0/255.255.0.0 10.4.0.1 1153 10.2. NRT Building Procedure 1155 The route entries in an NRT are related with link and node failures. 1156 We summarize all types of cases into three (3) catalogs. 1158 10.2.1. Single Link Failure 1160 In Fat-tree, Links can be classified as 3 types by their locations: 1161 1) servers to edge switches; 2) edge to aggregation switches; 3) 1162 aggregation to core switches. Link failures between servers to edge 1163 switches only affect the communication of the corresponding servers 1164 and don't affect the routing tables of any switch, so we only discuss 1165 the second and third type of links failures. 1167 Edge to Aggregation Switches 1169 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1170 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1171 failure may affect 3 types of communications. 1173 o Sources lie in the same subnet with A, and destinations do not. In 1174 this case, the link failure will only affect the routing tables of A. 1175 As this link is attached to A directly, A only needs to delete the 1176 route entries whose next hop is B in its BRT and add no entries to 1177 its NRT when A's M6 module detect the link failure. 1179 o Destinations lie in the same subnet with A, and sources lie in 1180 another subnet of the same pod. In this case, the link failure will 1181 affect the routing tables of all the edge switches in the same pod 1182 except for A. When an edge switch, such as 10.1.1.1, learns the link 1183 failure, it will add a route entry to its NRT: 1185 Destination/Mask Next hop 1186 10.1.2.0/255.255.255.0 10.1.0.1 1188 o Destinations lie in the same subnet with A, sources lie in another 1189 pod. In this case, the link failure will affect the routing tables 1190 of all the edge switches in the other pods. When an edge switch in 1191 one other pod, such as 10.3.1.1, learns the link failure, because all 1192 the routings that pass through 10.3.0.1 to A will certainly pass 1193 through the link between A and B, 10.3.1.1 need add a route entry to 1194 its NRT: 1196 Destination/Mask Next hop 1197 10.1.2.0/255.255.255.0 10.3.0.1 1198 Aggregation to Core Switches 1200 Suppose that the link between an aggregation switch, such as 10.1.0.1 1201 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1202 failure may affect 2 types of communications. 1204 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1205 the other pods. In this case, the link failure will only affect the 1206 routing tables of A. As this link is attached to A directly, A only 1207 need to delete the route entries whose next hop is B in its BRT and 1208 add no entries to its NRT when A's M6 module detect the link failure. 1210 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1211 another pod. In this case, the link failure will affect the routing 1212 tables of all the aggregation switches in other pods except for pod 1213 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1214 learns the link failure, because all the routings that pass through 1215 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1216 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1218 Destination/Mask Next hop 1219 10.1.0.0/255.255.0.0 10.0.1.2 1221 10.2.2. A Group of Link Failures 1223 If all the uplinks of an aggregation switch fail, then this switch 1224 cannot forward packets, which will affect the routing of every edge 1225 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1226 fail, it will affect two types of communications. 1228 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1229 the other pods. In this case, the link failures will affect the 1230 routing of the edge switches in the Pod of A. To avoid the node A, 1231 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1232 10.1.0.1" in which the next hop is the node A. 1234 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1235 other pods. In this case, the link failures will affect the routing 1236 of edge switches in other pods. For example, if the edge switch 1237 10.3.1.1 communicates with some node in the pod of A, it should avoid 1238 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1239 pod of A will pass through the node A. So a route entry should be 1240 added to 10.3.1.1: 1242 Destination/Mask Next hop 1243 10.1.0.0/255.255.0.0 10.3.0.1 1245 10.2.3. Node Failures 1247 At last, we discuss the effect of node failures to a NRT. There are 1248 3 types of node failures: the failure of edge, aggregation and core 1249 switches. 1251 o An edge switch fails. The failure doesn't affect the routing table 1252 of any switch. 1254 o A core switch fails. Only when all the core switches connected to 1255 the same aggregation switch fail, they will affect the routing of 1256 other switches. This case is equal to the case that all the uplinks 1257 of an aggregation switch fail, so the process of link failures can 1258 cover it. 1260 o An aggregation switch fails. This case is similar to the case that 1261 all the uplinks of an aggregation switch fail. It affects the 1262 routing of edge switches in other pods, but doesn't affect the 1263 routing of edge switches in pod of the failed switch. The process of 1264 this failure is same to the second case in section 6.2.2. 1266 10.3. Routing Procedure 1268 FAR decides a routing by looking up its BRT and NRT. We illuminate 1269 the routing procedure by an example. In this example, we suppose 1270 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1271 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1272 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1273 (destination). 1275 Step 1: The source 10.3.1.3 sends packets to its default router 1276 10.3.1.1 1278 Step 2: The routing of 10.3.1.1. 1280 1) Calculate candidate hops 1282 10.3.1.1 looks up its BRT and gets the following matched entries: 1284 Destination/Mask Next hop 1285 10.0.0.0/255.0.0.0 10.3.0.1 1287 So the candidate hops = {10.3.0.1} 1289 2) Calculate avoiding hops 1291 Its NRT is empty, so the set of avoiding hop is empty too. 1293 3) Calculate applicable hops 1295 The applicable hops are candidate hops minus avoiding hops, so: 1297 The applicable hops = {10.3.0.1} 1299 4) Forward packets to 10.3.0.1 1301 Step 3: The routing of 10.3.0.1 1303 1) Calculate candidate hops. 1305 10.3. 0.1 looks up its BRT and gets the following matched entries: 1307 Destination/Mask Next hop 1308 10.1.0.0/255.255.0.0 10.0.1.1 1309 10.1.0.0/255.255.0.0 10.0.1.2 1311 So the candidate hops = {10.0.1.1, 10.0.1.2} 1313 2) Calculate avoiding hops 1315 Destination/Mask Next hop 1316 10.1.0.0/255.255.0.0 10.0.1.2 1318 So the avoiding hops = {10.0.1.2} 1320 3) Calculate applicable hops 1322 The applicable hops are candidate hops minus avoiding hops, so: 1324 The applicable hops = {10.0.1.1} 1326 4) Forward packets to 10.0.1.1 1328 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1329 routing tables. 1331 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1332 routing tables. 1334 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1335 layer-2 switching. 1337 10.4. FAR's Performance in Large-scale Networks 1339 FAR has good performance to support large-scale networks. In this 1340 section, we take a Fat-tree network composed of 2,880 48-port 1341 switches and 27,648 servers as an example to show FAR's performance. 1343 10.4.1. The number of control messages required by FAR 1345 FAR exchanges a few messages between routers and only consumes a 1346 little network bandwidth. Tab. 1 shows the required messages in the 1347 example Fat-tree network. 1348 Table 1:Required messages in a Fat-tree network. 1349 _____________________________________________________________________ 1350 Message Type| Scope | size(bytes) | Rate | Bandwidth 1351 --------------------------------------------------------------------- 1352 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1353 kbps 1354 --------------------------------------------------------------------- 1355 DLR |adjacent switches| less than 48 | (1) |48bytes 1356 --------------------------------------------------------------------- 1357 DA |entire network| less than 48 | (2) |1.106M 1358 --------------------------------------------------------------------- 1359 LFA |entire network| less than 48 | (3) |48 bytes 1360 _____________________________________________________________________ 1361 (1)Produce one when a router starts 1362 (2)The number of switches(2,880) in a period 1363 (3)Produce one when a link fails or recovers 1365 10.4.2. The Calculating Time of Routing Tables 1367 A BRT is calculated according to the states of its neighbor routers 1368 and attached links. An NRT is calculated according to device and 1369 link failures in the entire network. So FAR does not calculate 1370 network topology and has no problem of network convergence, which 1371 greatly reduces the calculating time of routing tables. The 1372 detection and spread time of link failures is very short in FAR. 1373 Detection time is up to the interval of sending Hello message. In 1374 FAR, the interval is set to 100ms, and a link failure will be 1375 detected in 200ms. The spread time between any pair of routers is 1376 less than 200ms.If a link fails in a data center network, FAR can 1377 detect it, spread it to all the routers, and calculate routing tables 1378 in no more than 500ms. 1380 10.4.3. The Size of Routing Tables 1382 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1383 in Tab. 2. 1385 Table 2: The size of routing tables in FAR 1386 _____________________________________________________________________ 1387 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1388 --------------------------------------------------------------------- 1389 BRT | 48 | 48 | 24 1390 --------------------------------------------------------------------- 1391 NRT | 0 | 14 | 333 1392 _____________________________________________________________________ 1394 The BRT's size at a switch is determined by the number of its 1395 neighbor switches. In the example network, a core switch has 48 1396 neighbor switches (aggregation switch), so it has 48 entries in its 1397 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1398 switch is related to the number of link failures in the network. 1399 Suppose that there are 1000 link failures in the example network, the 1400 number of failed links is 1.2% of total links, which is a very high 1401 failure ratio. We suppose that link failures are uniformly 1402 distributed in the entire network. The NRT size at an edge switch is 1403 about 333 and the NRT size of an aggregation switch is about 14in 1404 average. 1406 11. Implementations Examples 1408 In the FAR draft scenario, Fat-Tree topology has only three layers of 1409 routers. To expand the network scale is achieved through horizontal 1410 expansion: increase the number of core switches, and increase the 1411 number of aggregation switches and edge switches in pod. 1413 For example, the following two scenarios. 1415 10.0.1.1 10.0.2.1 10.0.3.1 10.0.24.1 1416 +--+ +--+ +--+ +--+ 1417 | |24devices | |24devices | |24devices | | 24devices 1418 +--+,_ +--+', +--+ ,,+--+ 1419 |`,`',`-., / | \ `. .'` - .-'``.` /| 1420 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1421 | \ `', `-., .` / | `, .` ,' | 1422 | `, `'. `'-,_ .'` ,' | ', / | 1423 | . `'. `-.,-` / | \ 1424 | \ `'., .` `'., ` | `. 1425 | `, .'`, ,`'., | ', 1426 | . ,-` '., - `'-,_ | `. 1427 | \ .` ,'., `|., . 1428 | .'` / `-, | `'., `. 1429 10.1.0.1 ,-` . .' `'., ', 1430 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1431 | |..| | | |..| | | |..| | | |..| |A total of 1152 aggregation switches 1432 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1433 | \/ | | \/ | | \/ | | \/ | 1434 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1435 | |..| | | |..| | | |..| | | |..| |A total of 1152 access switches 1436 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1437 /| 10.1.24.1 /| |\ |\ /| | | 1438 / | | \ / | | \ / | | \ / | | | 1439 / | | \ / | | \ / | | \ / | | | 1440 / | | \ / | | \ / | | \ / | | | 1441 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1442 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1444 Figure 9: 48 pods, each of which has 24 aggregation switches and 24 1445 access switches 1447 In the Fat-tree network of Figure 9, there are a total of 48 pods, 1448 each of which has 24 aggregation switches and 24 access switches, and 1449 each access switch is connected to 24 servers. 576 core switches, 1450 1152 aggregation switches, and 1152 access switches are required, for 1451 a total of 2880 switches, which can accommodate 27,648 servers. 1453 96 10 gigabit core switches 1454 +--+ +--+ +--+ +--+ 1455 | |4devices | |4devices | |4devices | |4devices 1456 +--+,_ +--+', +--+ ,,+--+ 1457 |`,`',`-., / | \ `. .'` - .-'``.` /| 1458 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1459 | \ `', `-., .` / | `, .` ,' | 1460 | `, `'. `'-,_ .'` ,' | ', / | 1461 | . `'. `-.,-` / | \ 1462 | \ `'., .` `'., ` | `. 1463 | `, .'`, ,`'., | ', 1464 | . ,-` '., - `'-,_ | `. 1465 | \ .` ,'., `|., . 1466 | .'` / `-, | `'., `. 1467 10.1.0.1 ,-` . .' `'., ', 1468 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1469 | |..| | | |..| | | |..| | | |..| |A total of 192 10 gigabit aggregation switches 1470 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1471 | \/ | | \/ | | \/ | | \/ | 1472 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1473 | |..| | | |..| | | |..| | | |..| |A total of 1152 access switches 1474 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1475 /| 10.1.4.1 /| |\ |\ /| | | 1476 / | | \ / | | \ / | | \ / | | | 1477 / | | \ / | | \ / | | \ / | | | 1478 / | | \ / | | \ / | | \ / | | | 1479 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1480 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1482 Figure 10: 48 pods. Each pod has 4 10G aggregation switches and 24 1483 Gigabit access switches 1485 In the Fat-Tree network of Figure 10, there are a total of 48 pods. 1486 Each pod has 4 10G aggregation switches and 24 Gigabit access 1487 switches. Each access switch is connected to 40 servers. Requires 1488 96 core switches, 192 aggregation switches, and 1152 access switches, 1489 for a total of 1,440 switches, which can accommodate 46,080 servers. 1491 12. Security Considerations 1493 The security considerations will be discussed in a future version of 1494 this document. 1496 13. Conclusions 1498 This draft introduces FAR protocol, a generic routing method and 1499 protocol, for data centers that have a regular topology. It uses two 1500 routing tables, a BRT and an NRT, to store the normal routing paths 1501 and the forbidden (to-be-avoided) routing paths, respectively. This 1502 makes the FAR protocol very simple and efficient. The sizes of these 1503 two tables are very small. Usually, a BRT has only several tens of 1504 entries and an NRT has only several or about a dozen entries. 1506 14. Acknowledgments 1508 This document is supported by ZTE Enterprise-University-Research 1509 Joint Project. 1511 15. References 1513 15.1. Normative References 1515 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1516 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1517 1997. 1519 [RFC2328] J. Moy, "OSPF Version 2", BCP 14, RFC2328, April 1998. 1521 [RFC3619] SHAH, S.; YIP, M. RFC3619: Extreme Networks' Ethernet 1522 Automatic Protection Switching (EAPS) Version 1. 2003. 1524 15.2. Informative References 1526 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1527 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1528 2008. 1530 [BCube] Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., ... Lu, 1531 S. (2009, August). BCube: a high performance, server-centric network 1532 architecture for modular data centers. In Proceedings of the ACM 1533 SIGCOMM 2009 conference on Data communication (pp. 63-74). 1535 [MatrixDCN] Sun, Y., Chen, M., Peng, L., Hassan, M. M., Alelaiwi, A. 1536 (2016). MatrixDCN: a high performance network architecture for 1537 large-scale cloud data centers. Wireless Communications and Mobile 1538 Computing, 16(8), 942-959. 1540 16. Appendix 1542 16.1. Application Area of the Solution 1544 According to the horizontal expansion mode of the above scenarios, 1545 the whole Fat-Tree network does not need to be expanded to 4 layers 1546 (4 order Fat-Tree) even if it is expanded. Using a standard three- 1547 tier Fat-tree network, we can scale the network to meet all the 1548 problems of commercial network applications. This scheme is suitable 1549 for non-SDN distributed Fat-Tree network architecture. 1551 16.2. Technical evolution roadmap 1553 In this draft, we should design different rules for FAR switches in 1554 different regular networks to calculate routing tables, which limits 1555 FAR's extensibility. Fortunately, the latest SDN technology make it 1556 is easy to update the control plane of switches, since all the 1557 function of control plane are centralized to a controller in SDN. We 1558 are designing the next generation routing scheme for regular networks 1559 based on SDN. In the new scheme, we design a regular ToPology 1560 Description Language (TPDL) to descript a regular network. In TPDL, 1561 the distance between different type of node groups is defined by a 1562 group of distance formulas and the number of formulas is finite and 1563 fixed without increasing by the scale of a network. And then, 1564 switches learn the topology of a network by taking advantage of TPDL 1565 and generate flow table entries to forward packets without help of 1566 the SDN controller. If no entry is found for a forwarding packet, 1567 switches transport the packet to the SDN controller, and the 1568 controller recalculates a new routing path using A* algorithm by 1569 taking TPDL's distance formulas as a heuristic function and dispatch 1570 flow table entries down to related switches on the path. 1572 16.3. Updating roadmap 1574 In the next version, we will continue to advance the remaining issues 1575 such as protocol fields, section 3.2 simplification, etc. 1577 Authors' Addresses 1579 Bin Liu 1580 ZTE Inc., ZTE Plaza 1581 No.19 East Huayuan Road,Hai Dian District 1582 Beijing 100191 1583 China 1585 Phone: +86 -010-59932039 1586 Email: 13683610386@139.com 1587 Yantao Sun 1588 Beijing Jiaotong University 1589 No.3 Shang Yuan Cun, Hai Dian District 1590 Beijing 100044 1591 China 1593 Email: ytsun@bjtu.edu.cn 1595 Jing Cheng 1596 Beijing Jiaotong University 1597 No.3 Shang Yuan Cun, Hai Dian District 1598 Beijing 100044 1599 China 1601 Email: yourney.j@gmail.com 1603 Yichen Zhang 1604 Beijing Jiaotong University 1605 No.3 Shang Yuan Cun, Hai Dian District 1606 Beijing 100044 1607 China 1609 Email: snowfall_dan@sina.com 1611 Bhumip Khasnabish 1612 Individual contributor 1613 55 Madison Avenue, Suite 160 1614 Morristown, New Jersey 07960 1615 USA 1617 Phone: +001-781-752-8003 1618 Email: vumip1@gmail.com 1619 URI: http://tinyurl.com/bhumip/