idnits 2.17.1 draft-sl-rtgwg-far-dcn-13.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 54 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet has text resembling RFC 2119 boilerplate text. -- The document date (Nov 23, 2019) is 1615 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'FAT-TREE' is defined on line 1404, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: May 26, 2020 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 ZTE TX Inc. 10 Nov 23, 2019 12 Generic Fault-avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-13 15 Abstract 17 This draft describes a generic routing method and protocol for a 18 regular data center network, named the Fault-Avoidance Routing (FAR) 19 protocol. The FAR protocol provides a generic routing method for all 20 types of regular topology network architectures that have been 21 proposed for large-scale cloud-based data centers over the past few 22 years. The FAR protocol is designed to leverage any regularity in 23 the topology and compute its routing table in a simplistic manner. 24 Fat-tree is taken as an example architecture to illustrate how the 25 FAR protocol can be applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on May 26, 2020. 44 Copyright Notice 46 Copyright (c) 2019 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 4 63 2. Conventions used in this document . . . . . . . . . . . . . . 5 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 5 66 3.2. Dilemma of conventional routing methods in a large-scale 67 network with giant number nodes of routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 8 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 8 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 9 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 9 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 9 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 10 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 10 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 17 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 17 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 17 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 18 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 18 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 18 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 19 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 19 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 19 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 22 86 9. Topology identification and broadcast storm suppression . . . 22 87 10. Application Example . . . . . . . . . . . . . . . . . . . . . 23 88 10.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 25 89 10.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 26 90 10.2.1. Single Link Failure . . . . . . . . . . . . . . . . 26 91 10.2.2. A Group of Link Failures . . . . . . . . . . . . . . 27 92 10.2.3. Node Failures . . . . . . . . . . . . . . . . . . . 28 93 10.3. Routing Procedure . . . . . . . . . . . . . . . . . . . 28 94 10.4. FAR's Performance in Large-scale Networks . . . . . . . 30 95 10.4.1. The number of control messages required by FAR . . . 30 96 10.4.2. The Calculating Time of Routing Tables . . . . . . . 30 97 10.4.3. The Size of Routing Tables . . . . . . . . . . . . . 30 98 11. Security Considerations . . . . . . . . . . . . . . . . . . . 31 99 12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 31 100 13. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 31 101 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 32 104 1. Introduction 106 In recent years, with the rapid development of cloud computing 107 technologies, the widely deployed cloud services, such as Amazon EC2 108 and Google search, bring about huge challenges to data center 109 networking (DCN). Today's cloud-based data centers (DCs) require 110 large-scale networks with larger internal bandwidth and smaller 111 transfer delay. However, conventional networks cannot meet such 112 requirements due to limitations in their network architecture. In 113 order to satisfy the requirements of cloud computing services, many 114 new network architectures have been proposed for data centers, such 115 as Fat-tree, MatrixDCN, and BCube. These new architectures can 116 support non-blocking large-scale datacenter networks with more than 117 tens of thousands of physical servers. 119 All of these architectures have regular topologies, which are common 120 features. The regular topology refers to the network topology 121 structure with obvious regularity and symmetry, which is conducive to 122 automatic configuration of the network, such as the Fat-tree network. 123 In a regular topology, each network node such as a switch or router 124 can be addressed by its location and through a node's address, the 125 node's connections to its neighbors in a network can be determined, 126 and furthermore, the route to the node from other nodes in the 127 network can be determined. So nodes can compute route entries 128 without learning topology. 130 This document describes a generic routing method and protocol, the 131 Fault-Avoidance Routing (FAR) protocol, for DCNs. This method 132 leverages the regularity in the topologies of data center networks to 133 simplify routing learning and accelerate the query of routing tables. 134 This routing method has a better fault tolerance and can be applied 135 to any DCN with a regular topology. 137 FAR is not a routing protocol to replace generic routing protocols 138 such as OSPF[RFC2328] and IS-IS. It cannot be used in general local 139 networks whose topological structures are arbitrary, and whose scales 140 are also not very large. OSPF and IS-IS works very well in such a 141 network. But in a large-scale network with regular topology, FAR has 142 better performance. Compared with OSPF and IS-IS, FAR has shorter 143 time of network convergence and lower PDU overhead. Furthermore, FAR 144 requires less computing and storage resources, which lets FAR routers 145 to run at a lower cost of production than the generic routers. 147 In addition, for each type of network architecture, researchers 148 designed a routing algorithm according to the features of its 149 topology. Because these routing algorithms are different and lack 150 compatibility with each other, it is very difficult to develop a 151 routing protocol for network routers supporting multiple routing 152 algorithms. FAR has better adaptability than these specified routing 153 methods. 155 FAR consists of three components, i.e., link state learning unit, 156 routing table building unit and routing table querying unit. In the 157 link state learning unit, FAR exchanges link failures among routers 158 to establish a consistent knowledge of the entire network. In this 159 stage, the regularity in topology is exploited to infer failed links 160 and routers. In the routing table building unit, FAR builds up two 161 routing tables, i.e., a basic routing table (BRT) and a negative 162 routing table (NRT), for each router according to the network 163 topology and link states. In the last component, routers forward 164 incoming packets by looking up the two routing tables. The matched 165 entries in BRT minus the matched entries in NRT are the final route 166 entries to be used to forward an incoming packet. 168 The remainder of this draft is organized as follows. The problem to 169 be addressed by FAR is described in Section 3. The framework of FAR 170 routing protocol is described in Section 4. Section 5 and 6 171 introduce FAR's data format FAR and modules in detail. Section 7 172 describe how FAR works by finite state machine (FSM). In Section 8, 173 we discussed how FAR works with variable network architectures. 174 Section 9 takes Fat-tree network as an example to illuminate how FAR 175 works. 177 1.1. Acronyms & Definitions 179 DCN - Data Center Network 181 FAR - Fault-Avoidance Routing 183 BRT - Basic Routing Table 185 NRT - Negative Routing Table 187 NDT - Neighbor Devices Table 189 ADT - All Devices Table 191 LFT - Link Failure Table 192 DA - Device Announcement 194 LFA - Link Failure Announcement 196 DLR - Device and Link Request 198 IP - Internet Protocol 200 VM - Virtual Machine 202 2. Conventions used in this document 204 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 205 NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 206 this document are to be interpreted as described in RFC-2119 207 [RFC2119]. In this document, these words will appear with that 208 interpretation only when in ALL CAPS. Lower case uses of these words 209 are not to be interpreted as carrying RFC-2119 significance. 211 3. Problem Statement 213 The problem to be addressed by FAR as proposed in this draft is 214 described in this section. The expansion of Cloud data center 215 networks has brought significant challenges to the existing routing 216 technologies. FAR mainly solves a series of routing problems faced 217 by large-scale data center networks. 219 3.1. The Impact of Large-scale Networks on Route Calculation 221 In a large-scale cloud data center network, there may be thousands of 222 routers. Running OSPF and other routing protocols in such network 223 will encounter these two challenges: a) Network convergence time 224 would be too long, which will cause a longer time to elapse for 225 creating and updating the routes. The response time to network 226 failures may be excessively long; b) a large number of routing 227 protocol packets need to be sent. The routing information consumes 228 too much network bandwidth and CPU resources, which easily leads to 229 packet loss and makes the problem (a) more prominent. 231 In order to solve these problems, a common practice is to partition a 232 large network into some small areas, where the route calculation runs 233 independently within different areas. However, nowadays the cloud 234 data centers typically require very large internal bandwidth. To 235 meet this requirement, a large number of parallel equivalent links 236 are deployed in a network, such as a Fat-tree network. Partitioning 237 such a network will affect the utilization of routing algorithm on 238 equivalent multi-path and reduce internal network bandwidth 239 requirements. 241 In the FAR routing calculation process, a Basic Routing Table (BRT) 242 is built on local network topology leveraging the regularity of the 243 network topologies. In addition to BRT, FAR also builds a Negative 244 Routing Table (NRT). FAR gradually builds NRT in the process of 245 learning network link failure information, which does not require 246 learning a complete network fault information. FAR does not need to 247 wait for the completion of the network convergence in the process of 248 building these two tables. Therefore, it avoids the problem of 249 excessive network convergence overheads in the route calculation 250 process. In addition, FAR only needs to exchange a small amount of 251 link change information between routers, and hence consumes less 252 network bandwidth. 254 3.2. Dilemma of conventional routing methods in a large-scale network 255 with giant number nodes of routers 257 There are many real world scenario where tens of thousands of 258 nodes(or much more nodes) need to be deployed in a flat area, such as 259 infiniband routing and switching system, high-performance computer 260 network, and many IDC networks in China. The similar problems have 261 been existed long ago. People have solved the problems through 262 similar solutions, such as the traditional regular topology-based 263 RFC3619 protocol, the routing protocols of infiniband routing and 264 switching system, and high-performance computer network routing 265 protocol. 266 Infiniband defines a switch-based network to interconnect processing 267 nodes and the I/O nodes. Infiniband can support very large scale 268 networks, use the regularity in topology to simplify its routing 269 algorithm, which is just the same to what we do in FAR. 271 Why OSPF and other conventional routing methods do not work well in a 272 large-scale network with giant number nodes of routers? 274 As everyone knows, the OSPF protocol uses multiple databases, more 275 topological exchange information (as seen in the following example) 276 and complicated algorithm. It requires routers to consume more 277 memory and CPU processing capability. But the processing rate of CPU 278 on the protocol message per second is very limited. When the network 279 expands, CPU will quickly approach its processing limits, and at this 280 time OSPF can not continue to expand the scale of the management. 281 The SPF algorithm itself does not thoroughly solve these problems. 283 On the contrary, FAR does not have the convergence time delay and the 284 additional CPU overheads, which SPF requires. Because in the initial 285 stage, FAR already knows the regular information of the whole network 286 topology and does not need to periodically do SPF operation. 288 One of the examples of "more topological exchange information": 290 In the OSPF protocol, LSA floods every 1800 seconds. Especially in 291 the larger network, the occupation of CPU and band bandwidth will 292 soon reach the router's performance bottleneck. In order to reduce 293 these adverse effects, OSPF introduced the concept of Area, which 294 still has not solved the problem thoroughly. By dividing the OSPF 295 Area into several areas, the routers in the same area do not need to 296 know the topological details outside their area. (In comparison with 297 FAR, after OSPF introducing the concept of Area, the equivalent paths 298 cannot be selected in the whole network scope) 300 OSPF can achieve the following results by Area : 1) Routers only need 301 to maintain the same link state databases as other routers within the 302 same Area, without the necessity of maintaining the same link state 303 database as all routers in the whole OSPF domain. 2) The reduction 304 of the link state databases means dealing with relatively fewer LSA, 305 which reduces the CPU consumption of routers; 3) The large number of 306 LSAs flood only within the same Area. But, its negative effect is 307 that the smaller number of routers which can be managed in each OSPF 308 area. On the contrary, because FAR does not have the above 309 disadvantages, FAR can also manage large-scale network even without 310 dividing Areas. 312 The aging time of OSPF is set in order to adapt to routing 313 transformation and protocol message exchange happened frequently in 314 the irregular topology. Its negative effect is: when the network 315 does not change, the LSA needs to be refreshed every 1800 seconds to 316 reset the aging time. In the regular topology, as the routings are 317 fixed, it does not need the complex protocol message exchange and 318 aging rules to reflect the routing changes, as long as LFA mechanism 319 in the FAR is enough. 321 Compared with the LSVR protocol, the LSVR protocol has no special 322 requirements for the network topology structure, however, the FAR 323 draft is applicable to the regular topology network architecture and 324 simplifies unnecessary processing. It is a solution proposed to 325 greatly improve the routing efficiency of the regular network 326 topology. The FAR solution is more efficient than the general 327 methods such as LSVR in regular topology. 329 Therefore, in FAR, we can omit many unnecessary processing and the 330 packet exchange. The benefits are fast convergence speed and much 331 larger network scale than other dynamic routing protocol. Now there 332 are some successful implementations of simplified routings in the 333 regular topology in the HPC environment. Conclusion: As FAR needs 334 few routing entries and the topology is regular, the database does 335 not need to be updated regularly. Without the need for aging, there 336 is no need for CPU and bandwidth overhead brought by LSA flood every 337 30 minutes, so the expansion of the network has no obvious effect on 338 the performance of FAR, which is contrary to OSPF. 340 Comparison of convergence time: The settings of OSPF spf_delay and 341 spf_hold_time can affect the change of convergence time. The 342 convergence time of the network with 2480 nodes is about 15-20 343 seconds; while the FAR does not need to calculate the SFP, so there 344 is no such convergence time. 345 These issues still exist in rapid convergence technology of OSPF 346 ,ISIS (such as I-SPF) and LSVR. The convergence speed and network 347 scale constraint each other. FAR does not have the above problems, 348 and the convergence time is almost negligible. 349 Can FRR solve these problems? IP FRR has some limitations. The 350 establishment of IP FRR backup scheme will not affect the original 351 topology and traffic forwarding which are established by protocol, 352 however, we can not get the information of whereabouts and status 353 when the traffic is switched to an alternate next hop. 355 3.3. Network Addressing Issues 357 Routers are typically configured with multiple network interfaces, 358 each connected to a subnet. OSPF and other routing algorithms 359 require that each interface of a router must be configured with an IP 360 address. A large-scale data center network may contain thousands of 361 routers and each router has dozens of network interfaces, thus, there 362 are tens of thousands of IP addresses needed to be configured in a 363 data center. It will be very complex to configure and manage a large 364 number of network interfaces and will be difficult to troubleshoot 365 network problems, then network maintenance will be costly and error- 366 prone. 368 In FAR, the device position information is encoded in the IP address 369 of the router. Each router only needs to be assigned a unique IP 370 address according its location, which greatly solves complex network 371 addressing issues in large-scale networks. 373 3.4. Big Routing Table Issues 375 There are a large number of subnets in the large-scale data center 376 network. A router may build a routing entry for each subnet, and 377 therefore the size of routing tables on each router may be very 378 large. It will increase a router's cost and reduce the querying 379 speed of the routing table. 381 FAR uses two measures to reduce the size of its routing tables: a)It 382 builds a BRT on the regularity of the network topologies; b)It 383 introduces a new routing table, i.e., a NRT. In this way FAR can 384 reduce the size of routing tables to only a few dozen routing 385 entries. 387 3.5. Adaptivity Issues for Routing Algorithms 389 To implement efficient routing in large-scale datacenters, besides 390 FAR, some other routing methods are proposed for some specific 391 network architectures, such as Fat-tree and BCube. These routing 392 methods are different(from both design and implementation viewpoints) 393 and not compatible with the conventional routing methods, which 394 brings big troubles to network equipment providers to develop new 395 routers supporting various new routing methods. 397 FAR is a generic routing method. With slight modification, FAR 398 method can be applied to most of regular datacenter networks. 399 Furthermore, the structure of routing tables and querying a routing 400 table in FAR are the same as conventional routing method. If FAR is 401 adopted, the workload of developing a new type of router will be 402 significantly decreased. 404 3.6. Virtual Machine Migration Issues 406 Supporting VM migration is very important for cloud-based datacenter 407 networks. However, in order to support layer-3 routing, routing 408 methods including OSPF and FAR require limiting VM migration within a 409 subnet. For this paradox, the mainstream methods still utilize 410 layer-3 routing on routers or switches, transmit packets encapsulated 411 by IPinIP or MACinIP between hosts by tunnels passing through network 412 to the destination access switch, and then extract original packet 413 out and send it to the destination host. 415 By utilizing the aforementioned methods, FAR can be applied to Fat- 416 tree, MatrixDCN or BCube networks for supporting VM migration in 417 entire network. 419 4. The FAR Framework 421 FAR requires that a DCN has a regular topology, and network devices, 422 including routers, switches, and servers, are assigned IP addresses 423 according to their locations in the network. In other word, we can 424 locate a device in the network according to its IP address. 426 FAR is a distributed routing method. In order to support FAR, each 427 router needs to have a routing module that implements the FAR 428 algorithm. FAR algorithm is composed of three parts, i.e., link- 429 state learning, routing table building and routing table querying, as 430 shown in Fig. 1. 432 Link-state Learning |Routing Table | Routing Table 433 | Building 434 | | Querying 435 +--------+ /---------------\ | +--------+ | Packets 436 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | 437 |Learning| | Link Detection| | |Building|\ | | 438 +--------+ \---------------/ | +--------+ \ | \|/ 439 | | | \ |+--------------+ 440 | | | /|| 7 Querying | 441 \|/ \|/ | / || Routing Table| 442 +-----------------------+| / |+--------------+ 443 |3 Invisible Neighbor & || / | 444 |Link Failure Inferring || +--------- | 445 +-----------------------+|/| 6 NRT | | 446 | / |Building| | 447 \|/ /| +--------+ | 448 +--------------+ / | | 449 |4 Link Failure| / | | 450 | Learning | | | 451 +--------------+ | | 452 | | 454 Figure 1: The FAR framework 456 Link-state learning is responsible for a router to detect the states 457 of its connected links and learn the states of all the other links in 458 the entire network. The second part builds two routing tables, a 459 basic routing table (BRT) and an negative routing table (NRT), 460 according to the learned link states in the first part. The third 461 part queries the BRT and the NRT to decide a next forwarding hop for 462 the received (ingress) packets. 464 5. Data Format 466 5.1. Data Tables 468 Some data tables are maintained on each router in FAR. They are: 470 Neighbor Device Table (NDT): To store neighbor routers and related 471 links. 473 All Devices Table (ADT): To store all routers in the entire network. 475 Link Failures Table (LFT): To store all link failures in the entire 476 network. 478 Basic Routing Table (BRT): To store the candidate routes. 480 Negative Routing Table(NRT): To store the avoiding routes. 482 The format of NDT 484 ---------------------------------------------------------- 485 Device ID | Device IP | Port ID | Link State | Update Time 486 ---------------------------------------------------------- 488 Device ID: The ID of a neighbor router. 490 Device IP: The IP address of a neighbor router. 492 Port ID: The port ID that a neighbor router is attached to. 494 Link State: The state of the link between a router and its neighbor 495 router. There are two states: Up and Down. 497 Update Time: The time of updating the entry. 499 The format of ADT 501 -------------------------------------------------- 502 Device ID | Device IP | Type | State | Update Time 503 -------------------------------------------------- 505 Device ID: The ID of a neighbor router. 507 Device IP: The IP address of a neighbor router. 509 Type: The type of a neighbor router. 511 State: The state of a neighbor router. There are two states: Up and 512 Down. 514 Update Time: The time of updating the entry. 516 The format of LFT 518 -------------------------------------------- 519 No | Router 1 IP | Router 2 IP | Timestamp 520 -------------------------------------------- 522 No: The entry number. 524 Router 1 IP: The IP address of one router that a failed link connects 525 to. 527 Router 2 IP: The IP address of another router that a failed link 528 connects to. 530 Timestamp: It identifies when the entry is created. 532 The format of BRT 534 ------------------------------------------------------- 535 Destination | Mask | Next Hop | Interface | Update Time 536 ------------------------------------------------------- 538 Destination: A destination network 540 Mask: The subnet mask of a destination network. 542 Next Hop: The IP address of a next hop for a destination. 544 Interface: The interface related to a next hop. 546 Update Time: The time of updating the entry. 548 The format of NRT 550 ------------------------------------------------------------------- 551 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 552 ------------------------------------------------------------------- 554 Destination: A destination network. 556 Mask: The subnet mask of a destination network. 558 Next Hop: The IP address of a next hop that should be avoided for a 559 destination. 561 Interface: The interface related to a next hop that should be 562 avoided. 564 Failed Link No: A group of failed link numbers divided by "/", for 565 example 1/2/3. 567 Timestamp: The time of updating the entry. 569 5.2. Messages 571 Some protocol messages are exchanged between routers in FAR. 573 Hello Message: This message is exchanged between neighbor routers to 574 learn adjacency. 576 Device Announcement (DA): Synchronize the knowledge of routers 577 between routers. 579 Link Failure Announcement (LFA): Synchronize link failures between 580 routers. 582 Device and Link Request (DLR): When a router starts, it requests the 583 knowledge of routers and links from its neighbors by a DLR message. 585 A FAR Message is directly encapsulated in an IP packet. The protocol 586 field of IP header indicates an IP packet is an FAR message. The 587 protocol of IP for FAR should be assigned by IANA. 589 The four types of FAR messages have same format of packet header, 590 called FAR header (as shown in Figure 2). 592 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 593 +-----------+------------+------------------------+ 594 | Version |Message Type| Message Length | 595 +-----------+------------+------------------------+ 596 | Checksum | AuType | 597 +------------------------+------------------------+ 598 | Authentication | 599 +-------------------------------------------------+ 600 | Authentication | 601 +-------------------------------------------------+ 602 | Timestamp | 603 +-------------------------------------------------+ 605 Figure 2: The format of FAR header 607 Version: FAR version 609 Message Type: The type of FAR message. 611 Packet Length: The packet length of the total FAR message. 613 Checksum: The checksum of an entire FAR message. 615 AuType:Authentication type. 0: no authentication, 1: Plaintext 616 Authentication, 2: MD5 Authentication. 618 Authentication: Authentication information. 0: undefined, 1: Key, 2: 619 key ID, MD5 data length and packet number. MD5 data is appended to 620 the backend of the packet. 622 AuType and Authentication can refer to the definition of OSPF packet. 624 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 625 +-----------+------------+------------------------+ 626 | Version |Message Type| Message Length | 627 +-----------+------------+------------------------+ 628 | Checksum | AuType | 629 +------------------------+------------------------+ 630 | Authentication | 631 +-------------------------------------------------+ 632 | Authentication | 633 +-------------------------------------------------+ 634 | Timestamp | 635 +-------------------------------------------------+ 636 | Router IP | 637 +------------------------+------------------------+ 638 | HelloInterval | HelloDeadInterval | 639 +------------------------+------------------------+ 640 | Neighbor Router IP | 641 +-------------------------------------------------+ 642 | ... | 643 +-------------------------------------------------+ 645 Figure 3: The Format of Hello Messages 647 For Hello messages, the Message Type in FAR header is set to 648 1.Besides FAR header, a Hello message(Fig. 3) requires the following 649 fields: 651 Router IP: The router IP address. 653 HelloInterval: The interval of sending Hello messages to neighbor 654 routers. 656 RouterDeadInterval: The interval to set a neighbor router dead(out- 657 of-service). If in the interval time, a router doesn't receive a 658 Hello message from its neighbor router, the neighbor router is 659 treated as dead. 661 Neighbor Router IP: The IP address of a neighbor router. All the 662 neighbor router's addresses should be included in a Hello message. 664 |<----1---->| <----1---->|<----------2----------->| 665 +-----------+------------+------------------------+ 666 | Version |Message Type| Message Length | 667 +-----------+------------+------------------------+ 668 | Checksum | AuType | 669 +------------------------+------------------------+ 670 | Authentication | 671 +-------------------------------------------------+ 672 | Authentication | 673 +-------------------------------------------------+ 674 | Timestamp | 675 +------------------------+------------------------+ 676 | Router1 IP | 677 +-------------------------------------------------+ 678 | ... | 679 +-------------------------------------------------+ 681 Figure 4: The Format of DA Messages 683 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 684 Besides FAR header, a DA message includes IP addresses of all the 685 announced routers. 687 |<----1---->| <----1---->|<----------2----------->| 688 +-----------+------------+------------------------+ 689 | Version |Message Type| Message Length | 690 +-----------+------------+------------------------+ 691 | Checksum | AuType | 692 +------------------------+------------------------+ 693 | Authentication | 694 +-------------------------------------------------+ 695 | Authentication | 696 +-------------------------------------------------+ 697 | Timestamp | 698 +------------------------+------------------------+ 699 | Left IP | 700 +-------------------------------------------------+ 701 | Right IP | 702 +------------------------+------------------------+ 703 | State | 704 +-------------------------------------------------+ 705 | ... | 706 +-------------------------------------------------+ 708 Figure 5: The Format of LFA Messages 710 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 711 Besides FAR header, a LFA message includes all the announced link 712 failures. 714 Left IP: The IP address of the left endpoint router of a link. 716 Right IP: The IP address of the right endpoint router of a link. 718 State: Link state. 0: Up, 1: down 719 |<----1---->| <----1---->|<----------2----------->| 720 +-----------+------------+------------------------+ 721 | Version |Message Type| Message Length | 722 +-----------+------------+------------------------+ 723 | Checksum | AuType | 724 +------------------------+------------------------+ 725 | Authentication | 726 +-------------------------------------------------+ 727 | Authentication | 728 +-------------------------------------------------+ 729 | Timestamp | 730 +-------------------------------------------------+ 732 Figure 6: The Format of DLR Messages 734 For DLR messages(Fig. 6), the Message Type in FAR header is set to 735 1.Except for FAR header, DLR has no additional fields. 737 6. FAR Modules 739 6.1. Neighbor and Link Detection Module(M1) 741 M1 is responsible for sending and receiving Hello messages, and 742 detecting directly-connected links and neighbor routers. Each Hello 743 message is encapsulated in a IP packet. M1 sends Hello messages 744 periodically to all the active router ports and receives Hello 745 messages from its neighbor routers. M1 detects neighbor routers and 746 directly-connected links according to received Hello Messages and 747 stores these neighbors and links into a Neighbor Devices Table (NDT). 748 Additionally, M1 also stores neighbor routers into an All Devices 749 Table (ADT). 751 6.2. Device Learning Module(M2) 753 M2 is responsible for sending, receiving, and forwarding device 754 announcement (DA) messages, learning all the routers in the whole 755 network, and deducing faulted routers. When a router starts, it 756 sends a DA message announcing itself to its neighbors and a DLR 757 message requesting the knowledge of routers and links from its 758 neighbors. If M2 module of a router receives a DA message, it checks 759 whether the router encapsulated in the message is in an ADT. If the 760 router is not in the ADT, M2 puts this router into the ADT and 761 forwards this DA message to all the active ports except for the 762 incoming one, otherwise, M2 discards this message directly. If M2 763 module of a router receives a DLR message, it replies a DA message 764 that encapsulates all of the learned routers. 766 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 768 M3 is responsible for inferring invisible neighbors of the current 769 router by means of the ADT. If the link between a router A and its 770 neighbor B breaks, which results in that M1 module of A cannot detect 771 the existence of B, then B is an invisible neighbor of A. Since a 772 device's location is coded into its IP address, it can be judged 773 whether two routers are adjacent, according to their IP addresses. 774 Based on this idea, M3 infers all of the invisible neighbors of the 775 current router and the related link failures. The results are stored 776 into a NDT. Moreover, link failures also are added into a link- 777 failure table (LFT). LFT stores all of the failed links in the 778 entire network. 780 6.4. Link Failure Learning Module(M4) 782 M4 is responsible for sending, receiving and forwarding link failure 783 announcement (LFA) and learning all the link failures in the whole 784 network. M4 broadcasts each newly inferred link failure to all the 785 routers in the network. Each link failure is encapsulated in a LFA 786 message and one link failure is broadcasted only once. If a router 787 receives a DLR request from its neighbor, it will reply a LFA message 788 that encapsulates all the learned link failures through M4 module. 789 If M4 receives a LFA message, it checks whether the link failure 790 encapsulated in the message is in a LFT by comparing two link ends 791 and timestamp. If the link failure is not in the LFT or timestamp is 792 different, M4 puts this link failure into the LFT (or update 793 timestamp only) and forwards this LFA message to all the active ports 794 except for the incoming one, otherwise, M4 discards this message 795 directly. 797 There is a special case a router will rebroadcast a link failure. If 798 a router receives a data packet and must forward the packet going 799 ahead to destination through a failed link, it means some previous 800 router should avoid this failed link according to its NRT but it 801 doesn't. In this case, maybe the previous router missed the LFA 802 message of the link failure due to some uncertain reasons. So the 803 forwarding router rebroadcasts the LFA message. 805 6.5. BRT Building Module(M5) 807 M5 is responsible for building a BRT for the current router. By 808 leveraging the regularity in topology, M5 can calculate the routing 809 paths for any destination without the knowledge of the topology of 810 whole network, and then build the BRT based on a NDT. Since the IP 811 addresses of network devices are continuous, M5 only creates one 812 route entry for a group of destination addresses that have the same 813 network prefix by means of route aggregation technology. Usually, 814 the size of a BRT is very small. The detail of how to build a BRT is 815 described in section 5. 817 6.6. NRT Building Module(M6) 819 M6 is responsible for building a NRT for the current router. Because 820 M5 builds a BRT without considering link failures in network, the 821 routing paths calculated by the BRT cannot avoid failed links. To 822 solve this problem, a NRT is used to exclude the routing paths that 823 include some failed links from the paths calculated by a BRT. M6 824 calculate the routing paths that include failed links and stored them 825 into the NRT. The details of how to build a NRT is described in 826 section 5. 828 6.7. Routing Table Lookup(M7) 830 M7 is responsible for querying routing tables and selecting the next 831 hop for forwarding the packets. Firstly, M7 takes the destination 832 address of a forwarding packet as a criterion to look up route 833 entries in a BRT based on longest prefix match. All of the matched 834 entries are composed of a candidate hops list. Secondly, M7 look up 835 negative route entries in a NRT taking the destination address of the 836 forwarding packet as criteria. This lookup is not limited to the 837 longest prefix match, any entry that matches the criteria would be 838 selected and composed of an avoiding hops list. Thirdly, the 839 candidate hops minus avoiding hops are composed of an applicable hops 840 list. At last, M7 sends the forwarding packet to any one of the 841 applicable hops. If the applicable list is empty, the forwarding 842 packet will be dropped. 844 7. How a FAR Router Works 846 Figure 7 shows how a FAR router works by its FSM. 848 /---------------\ 849 | Start | 850 | | 851 \---------------/ 852 | 853 +--------+ |1 854 | | | 855 | \|/ \|/ 856 | +--------------------+ 857 |2 | ND | 858 | | | 859 | +--------------------+ 860 | | | 861 | | |3 862 +--------+ | 863 \|/ 864 +--------------+ 865 | ND-FIN | 866 | | 867 +--------------+ 868 | 869 |4 870 | 871 \|/ 872 ________10_______\+--------------+/_______11________ 873 | /| |\ | 874 | | Listen | | 875 | ____9_______\| |/_______12___ | 876 | | /| |\ | | 877 | | +--------------+ | | 878 | | 5/ | | \8 | | 879 | | |/_ | | _\| | | 880 | | +------------+ 6| |7 +------------+ | | 881 | --| HELLO-RECV | | | | LFA-RECV |-- | 882 | +------------+ | | +------------+ | 883 | ______| |______ | 884 | | | | 885 | \|/ \|/ | 886 | +------------+ +------------+ | 887 |___________| DLR-RECV | | DA-RECV |__________| 888 +------------+ +------------+ 890 _________ 891 | | 892 | | 893 \|/ | 894 +--------------+ |13 895 | Hello Thread | | 896 +--------------+ | 897 | | 898 |_________| 900 _________ 901 | | 902 | | 903 \|/ | 904 +--------------+ |14 905 | LFD Thread | | 906 +--------------+ | 907 | | 908 |_________| 910 _________ 911 | | 912 | | 913 \|/ | 914 +--------------+ |15 915 | DA Thread | | 916 +--------------+ | 917 | | 918 |_________| 920 _________ 921 | | 922 | | 923 \|/ | 924 +--------------+ |16 925 | DFD Thread | | 926 +--------------+ | 927 | | 928 |_________| 930 Figure 7: The Finite State Machine of FAR Router 932 1)When a router starts up, it starts a Hello thread and then starts 933 ND (neighbor detection) timer (3 seconds). Next the router goes into 934 ND (neighbor detection) state. 935 2)In the ND state, if a router received a Hello message, then it 936 performs a Hello-message processing and goes back to the ND state. 937 3)When the ND timer is over, a router goes into ND-FIN (neighbor 938 detection finished) state. 939 4)A router starts the LFD (link failure detection) thread and DFD 940 (device failure detection) state, and sends DA message and DLR 941 message to all of its active ports. Then the router goes into Listen 942 state. 943 5) If a router receives a Hello message, then goes into HELLO-RECV 944 state. 945 6) If a router receives a DLR message, then goes into DLR-RECV state. 946 7) If a router receives a DA message, then goes into DA-RECV state. 947 8) If a router receives a LFA message, then goes into LFA-RECV state. 948 9) A router performs the Hello-message processing. After that, it 949 goes back to Listen state. 950 10) A router performs the DLR-message processing. After that, it 951 goes back to Listen state. 952 11) A router performs the DA-message processing. After that, it goes 953 back to Listen state. 955 12) A router performs the LFA-message processing. After that, it 956 goes back to Listen state. 957 13) Hello thread produces and sends Hello messages to all its ports 958 periodically. 959 14) LFD thread calls link-failure-detection processing to check link 960 failures in all links periodically 961 15) DA thread produces and sends DA messages periodically (30 962 minutes). 963 16) When DFD thread starts up, it sleep a short time (30 seconds) to 964 wait for a router learning all the active routers in the network. 965 Then the thread calls the device-failure-detection processing to 966 check device failures periodically (30 minutes). 968 8. Compatible Architecture 970 As a generic routing protocol, FAR can be run in various DCNs with 971 regular topology. Up to now, we have implemented the FAR protocol 972 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 974 For different network architectures, most processing of FAR is same 975 besides calculation of routing tables. BRT routing tables are 976 calculated based on Hello messages and NRT routing tables are 977 calculated based on LFA messages in FAR. To extend FAR to support a 978 new network architecture, only processing of Hello and LFA messages 979 need providing to build BRT and NRT routing tables. 981 In this protocol, FAR can support maximally 12 network architectures 982 and at least support 1 built-in network architecture, such as Fat- 983 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 984 a unique number from 1 to 12. For example, if the 1 built-in 985 architectures are assigned 1, and other customized architectures are 986 assigned 2 to 12. 987 1: Fat-tree 988 2: BCube 989 3: MatrixDCN. 990 4: xxx. 991 ...... 992 12: xxx. 994 9. Topology identification and broadcast storm suppression 996 In this design, the initial topology discovery process is not a 997 mandatory option for a FAR routing protocol. The recommended 998 solution here is to use a pre-configured configuration file, which 999 contains topology parameters of the current system, each node device 1000 as long as according to these configuration parameters will be able 1001 to know the topology information. In this way, we do not have to 1002 deal with complex topology discovery processes, nor do we need to 1003 calculate the shortest path, because the optimal path can be 1004 calculated from the parameters. This protocol also allows the 1005 formation of configuration files to be submitted to the topology 1006 discovery protocol, allowing for a variety of different 1007 implementation options. 1009 Regarding the flood suppression processing of broadcast packets, it 1010 has been considered in the previous content. Since the hello packets 1011 is only transmitted between the two nodes, it cannot be spread out. 1012 The link error message is only sent to the CPU, and are not forwarded 1013 to the nodes in layer 2 broadcasting. Moreover, each node will 1014 discard the repeated error messages when the node receives them. In 1015 this way, the broadcast storm can be suppressed. If a link is 1016 unstable and repeatedly up or down, the system will not send new 1017 messages after sending notifications, and the system will not 1018 oscillate repeatedly. The topology is updated only when the link is 1019 later detected to be stable for a long time. 1021 10. Application Example 1023 In this section, we take a Fat-tree network(Fig. 7) as an example to 1024 describe how to apply FAR routing. Since M1 to M4 are very simple, 1025 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 1026 network. 1028 A Fat-tree network is composed of 4 layers. The top layer is core 1029 layer, and the other layers are aggregation layer, edge layer and 1030 server layer. There are k pods, each one containing two layers of 1031 k/2 switches. Each k-port switch in the edge layer is directly 1032 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 1033 of the k-port switches in the aggregation layer. There are (k/2)2 1034 k-port core switches. Each core switch has one port connected to 1035 each of the k pods. 1037 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 1038 +--+ +--+ +--+ +--+ 1039 | | | | | | | | 1040 +--+,_ +--+', +--+ ,,+--+ 1041 |`,`',`-., / | \ `. .'` - .-'``.` /| 1042 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1043 | \ `', `-., .` / | `, .` ,' | 1044 | `, `'. `'-,_ .'` ,' | ', / | 1045 | . `'. `-.,-` / | \ 1046 | \ `'., .` `'., ` | `. 1047 | `, .'`, ,`'., | ', 1048 | . ,-` '., - `'-,_ | `. 1049 | \ .` ,'., `|., . 1050 | .'` / `-, | `'., `. 1051 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1052 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1053 | | | | | | | | | | | | | | | | 1054 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1055 | \/ | | \/ | | \/ | | \/ | 1056 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1057 | | | | | | | | | | | | | | | | 1058 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1059 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1060 / | | \ / | | \ / | | \ / | | | 1061 / | | \ / | | \ / | | \ / | | | 1062 / | | \ / | | \ / | | \ / | | | 1063 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1064 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1065 10.1.2.2 10.3.1.3 1067 Figure 8: Fat-tree Network 1069 Aggregation switches are given addresses of the form 10.pod.0.switch, 1070 where pod denotes the pod number, and switch denotes the position of 1071 that switch in the upper pod (in [1, k/2]). Edge switches are given 1072 addresses of the form 10.pod.switch.1, where pod denotes the pod 1073 number, and switch denotes the position of that switch in the lower 1074 pod (in [1, k/2]). The core switches are given addresses of the form 1075 10.0.j.i, where j and i denote that switch's coordinates in the 1076 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1077 The address of a host follows the pod switch to which it is connected 1078 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1079 the host's position in that subnet (in [2, k/2+1], starting from left 1080 to the right). 1082 10.1. BRT Building Procedure 1084 By leveraging the topology's regularity, every switch clearly knows 1085 how it forwards a packet. When a packet arrives at an edge switch, 1086 if the destination of the packet lies in the same subnet with the 1087 switch, then the switch directly forwards the packet to the 1088 destination server through layer-2 switching. Otherwise, the switch 1089 forwards the packet to any of aggregation switches in the same pod. 1090 When a packet arrives at an aggregation switch, if the destination of 1091 the packet lies in the same pod, the switch forwards the packet to 1092 the corresponding edge switch. Otherwise, the switch forwards the 1093 packet to any of core switches that it is connected to. If a core 1094 switch receives a packet, it forwards the packet to the corresponding 1095 aggregation switch that lies in the destination pod. 1097 The forwarding policy discussed above is easily expressed through a 1098 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1099 following entries: 1101 Destination/Mask Next hop 1102 10.0.0.0/255.0.0.0 10.1.0.1 1103 10.0.0.0/255.0.0.0 10.1.0.2 1105 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1106 the following entries: 1108 Destination/Mask Next hop 1109 10.1.1.0/255 255.255.0 10.1.1.1 1110 10.1.2.0/255.255.255.0 10.1.2.1 1111 10.0.0.0/255.0.0.0 10.0.1.1 1112 10.0.0.0/255.0.0.0 10.0.1.2 1114 The BRT of acore switch, such as 10.0.1.1, is composed of the 1115 following entries: 1117 Destination/Mask Next hop 1118 10.1.0.0/255 255.0.0 10.1.0.1 1119 10.2.0.0/255.255.0.0 10.2.0.1 1120 10.3.0.0/255.255.0.0 10.3.0.1 1121 10.4.0.0/255.255.0.0 10.4.0.1 1123 10.2. NRT Building Procedure 1125 The route entries in an NRT are related with link and node failures. 1126 We summarize all types of cases into three (3) catalogs. 1128 10.2.1. Single Link Failure 1130 In Fat-tree, Links can be classified as 3 types by their locations: 1131 1) servers to edge switches; 2) edge to aggregation switches; 3) 1132 aggregation to core switches. Link failures between servers to edge 1133 switches only affect the communication of the corresponding servers 1134 and don't affect the routing tables of any switch, so we only discuss 1135 the second and third type of links failures. 1137 Edge to Aggregation Switches 1139 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1140 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1141 failure may affect 3 types of communications. 1143 o Sources lie in the same subnet with A, and destinations do not. In 1144 this case, the link failure will only affect the routing tables of A. 1145 As this link is attached to A directly, A only needs to delete the 1146 route entries whose next hop is B in its BRT and add no entries to 1147 its NRT when A's M6 module detect the link failure. 1149 o Destinations lie in the same subnet with A, and sources lie in 1150 another subnet of the same pod. In this case, the link failure will 1151 affect the routing tables of all the edge switches in the same pod 1152 except for A. When an edge switch, such as 10.1.1.1, learns the link 1153 failure, it will add a route entry to its NRT: 1155 Destination/Mask Next hop 1156 10.1.2.0/255.255.255.0 10.1.0.1 1158 o Destinations lie in the same subnet with A, sources lie in another 1159 pod. In this case, the link failure will affect the routing tables 1160 of all the edge switches in the other pods. When an edge switch in 1161 one other pod, such as 10.3.1.1, learns the link failure, because all 1162 the routings that pass through 10.3.0.1 to A will certainly pass 1163 through the link between A and B, 10.3.1.1 need add a route entry to 1164 its NRT: 1166 Destination/Mask Next hop 1167 10.1.2.0/255.255.255.0 10.3.0.1 1168 Aggregation to Core Switches 1170 Suppose that the link between an aggregation switch, such as 10.1.0.1 1171 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1172 failure may affect 2 types of communications. 1174 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1175 the other pods. In this case, the link failure will only affect the 1176 routing tables of A. As this link is attached to A directly, A only 1177 need to delete the route entries whose next hop is B in its BRT and 1178 add no entries to its NRT when A's M6 module detect the link failure. 1180 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1181 another pod. In this case, the link failure will affect the routing 1182 tables of all the aggregation switches in other pods except for pod 1183 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1184 learns the link failure, because all the routings that pass through 1185 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1186 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1188 Destination/Mask Next hop 1189 10.1.0.0/255.255.0.0 10.0.1.2 1191 10.2.2. A Group of Link Failures 1193 If all the uplinks of an aggregation switch fail, then this switch 1194 cannot forward packets, which will affect the routing of every edge 1195 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1196 fail, it will affect two types of communications. 1198 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1199 the other pods. In this case, the link failures will affect the 1200 routing of the edge switches in the Pod of A. To avoid the node A, 1201 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1202 10.1.0.1" in which the next hop is the node A. 1204 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1205 other pods. In this case, the link failures will affect the routing 1206 of edge switches in other pods. For example, if the edge switch 1207 10.3.1.1 communicates with some node in the pod of A, it should avoid 1208 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1209 pod of A will pass through the node A. So a route entry should be 1210 added to 10.3.1.1: 1212 Destination/Mask Next hop 1213 10.1.0.0/255.255.0.0 10.3.0.1 1215 10.2.3. Node Failures 1217 At last, we discuss the effect of node failures to a NRT. There are 1218 3 types of node failures: the failure of edge, aggregation and core 1219 switches. 1221 o An edge switch fails. The failure doesn't affect the routing table 1222 of any switch. 1224 o A core switch fails. Only when all the core switches connected to 1225 the same aggregation switch fail, they will affect the routing of 1226 other switches. This case is equal to the case that all the uplinks 1227 of an aggregation switch fail, so the process of link failures can 1228 cover it. 1230 o An aggregation switch fails. This case is similar to the case that 1231 all the uplinks of an aggregation switch fail. It affects the 1232 routing of edge switches in other pods, but doesn't affect the 1233 routing of edge switches in pod of the failed switch. The process of 1234 this failure is same to the second case in section 6.2.2. 1236 10.3. Routing Procedure 1238 FAR decides a routing by looking up its BRT and NRT. We illuminate 1239 the routing procedure by an example. In this example, we suppose 1240 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1241 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1242 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1243 (destination). 1245 Step 1: The source 10.3.1.3 sends packets to its default router 1246 10.3.1.1 1248 Step 2: The routing of 10.3.1.1. 1250 1) Calculate candidate hops 1252 10.3.1.1 looks up its BRT and gets the following matched entries: 1254 Destination/Mask Next hop 1255 10.0.0.0/255.0.0.0 10.3.0.1 1257 So the candidate hops = {10.3.0.1} 1259 2) Calculate avoiding hops 1261 Its NRT is empty, so the set of avoiding hop is empty too. 1263 3) Calculate applicable hops 1265 The applicable hops are candidate hops minus avoiding hops, so: 1267 The applicable hops = {10.3.0.1} 1269 4) Forward packets to 10.3.0.1 1271 Step 3: The routing of 10.3.0.1 1273 1) Calculate candidate hops. 1275 10.3. 0.1 looks up its BRT and gets the following matched entries: 1277 Destination/Mask Next hop 1278 10.1.0.0/255.255.0.0 10.0.1.1 1279 10.1.0.0/255.255.0.0 10.0.1.2 1281 So the candidate hops = {10.0.1.1, 10.0.1.2} 1283 2) Calculate avoiding hops 1285 Destination/Mask Next hop 1286 10.1.0.0/255.255.0.0 10.0.1.2 1288 So the avoiding hops = {10.0.1.2} 1290 3) Calculate applicable hops 1292 The applicable hops are candidate hops minus avoiding hops, so: 1294 The applicable hops = {10.0.1.1} 1296 4) Forward packets to 10.0.1.1 1298 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1299 routing tables. 1301 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1302 routing tables. 1304 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1305 layer-2 switching. 1307 10.4. FAR's Performance in Large-scale Networks 1309 FAR has good performance to support large-scale networks. In this 1310 section, we take a Fat-tree network composed of 2,880 48-port 1311 switches and 27,648 servers as an example to show FAR's performance. 1313 10.4.1. The number of control messages required by FAR 1315 FAR exchanges a few messages between routers and only consumes a 1316 little network bandwidth. Tab. 1 shows the required messages in the 1317 example Fat-tree network. 1318 Table 1:Required messages in a Fat-tree network. 1319 _____________________________________________________________________ 1320 Message Type| Scope | size(bytes) | Rate | Bandwidth 1321 --------------------------------------------------------------------- 1322 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1323 kbps 1324 --------------------------------------------------------------------- 1325 DLR |adjacent switches| less than 48 | (1) |48bytes 1326 --------------------------------------------------------------------- 1327 DA |entire network| less than 48 | (2) |1.106M 1328 --------------------------------------------------------------------- 1329 LFA |entire network| less than 48 | (3) |48 bytes 1330 _____________________________________________________________________ 1331 (1)Produce one when a router starts 1332 (2)The number of switches(2,880) in a period 1333 (3)Produce one when a link fails or recovers 1335 10.4.2. The Calculating Time of Routing Tables 1337 A BRT is calculated according to the states of its neighbor routers 1338 and attached links. An NRT is calculated according to device and 1339 link failures in the entire network. So FAR does not calculate 1340 network topology and has no problem of network convergence, which 1341 greatly reduces the calculating time of routing tables. The 1342 detection and spread time of link failures is very short in FAR. 1343 Detection time is up to the interval of sending Hello message. In 1344 FAR, the interval is set to 100ms, and a link failure will be 1345 detected in 200ms. The spread time between any pair of routers is 1346 less than 200ms.If a link fails in a data center network, FAR can 1347 detect it, spread it to all the routers, and calculate routing tables 1348 in no more than 500ms. 1350 10.4.3. The Size of Routing Tables 1352 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1353 in Tab. 2. 1355 Table 2: The size of routing tables in FAR 1356 _____________________________________________________________________ 1357 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1358 --------------------------------------------------------------------- 1359 BRT | 48 | 48 | 24 1360 --------------------------------------------------------------------- 1361 NRT | 0 | 14 | 333 1362 _____________________________________________________________________ 1364 The BRT's size at a switch is determined by the number of its 1365 neighbor switches. In the example network, a core switch has 48 1366 neighbor switches (aggregation switch), so it has 48 entries in its 1367 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1368 switch is related to the number of link failures in the network. 1369 Suppose that there are 1000 link failures in the example network, the 1370 number of failed links is 1.2% of total links, which is a very high 1371 failure ratio. We suppose that link failures are uniformly 1372 distributed in the entire network. The NRT size at an edge switch is 1373 about 333 and the NRT size of an aggregation switch is about 14in 1374 average. 1376 11. Security Considerations 1378 The security considerations will be discussed in a future version of 1379 this document. 1381 12. Conclusions 1383 This draft introduces FAR protocol, a generic routing method and 1384 protocol, for data centers that have a regular topology. It uses two 1385 routing tables, a BRT and an NRT, to store the normal routing paths 1386 and the forbidden (to-be-avoided) routing paths, respectively. This 1387 makes the FAR protocol very simple and efficient. The sizes of these 1388 two tables are very small. Usually, a BRT has only several tens of 1389 entries and an NRT has only several or about a dozen entries. 1391 13. Acknowledgments 1393 This document is supported by ZTE Enterprise-University-Research 1394 Joint Project. 1396 14. References 1398 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1399 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1400 1997. 1402 [RFC2328] J. Moy, "OSPF Version 2", BCP 14, RFC2328, April 1998. 1404 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1405 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1406 2008. 1408 Authors' Addresses 1410 Bin Liu 1411 ZTE Inc., ZTE Plaza 1412 No.19 East Huayuan Road,Hai Dian District 1413 Beijing 100191 1414 China 1416 Phone: +86 -010-59932039 1417 Email: 13683610386@139.com 1419 Yantao Sun 1420 Beijing Jiaotong University 1421 No.3 Shang Yuan Cun, Hai Dian District 1422 Beijing 100044 1423 China 1425 Email: ytsun@bjtu.edu.cn 1427 Jing Cheng 1428 Beijing Jiaotong University 1429 No.3 Shang Yuan Cun, Hai Dian District 1430 Beijing 100044 1431 China 1433 Email: yourney.j@gmail.com 1435 Yichen Zhang 1436 Beijing Jiaotong University 1437 No.3 Shang Yuan Cun, Hai Dian District 1438 Beijing 100044 1439 China 1441 Email: snowfall_dan@sina.com 1442 Bhumip Khasnabish 1443 ZTE TX Inc. 1444 55 Madison Avenue, Suite 160 1445 Morristown, New Jersey 07960 1446 USA 1448 Phone: +001-781-752-8003 1449 Email: vumip1@gmail.com 1450 URI: http://tinyurl.com/bhumip/