idnits 2.17.1 draft-sl-rtgwg-far-dcn-09.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 54 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet has text resembling RFC 2119 boilerplate text. -- The document date (November 15, 2017) is 2325 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'FAT-TREE' is defined on line 1369, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: May 19, 2018 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 ZTE TX Inc. 10 November 15, 2017 12 Generic Fault-avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-09 15 Abstract 17 This draft proposes a generic routing method and protocol for a 18 regular data center network, named as fault-avoidance routing (FAR) 19 protocol. FAR protocol provides a generic routing method for all 20 types of network architectures that are proposed for large-scale 21 cloud-based data centers over the past few years. FAR protocol is 22 well designed to fully leverage the regularity in the topology and 23 compute its routing table in a simplistic manner. Fat-tree is taken 24 as an example architecture to illustrate how FAR protocol can be 25 applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on May 19, 2018. 44 Copyright Notice 46 Copyright (c) 2017 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 4 63 2. Conventions used in this document . . . . . . . . . . . . . . 5 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 5 66 3.2. Dilemma of conventional routing methods in a large-scale 67 network with giant number nodes of routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 8 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 8 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 8 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 9 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 9 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 10 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 10 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 17 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 17 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 17 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 18 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 18 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 18 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 19 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 19 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 19 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 22 86 9. Application Example . . . . . . . . . . . . . . . . . . . . . 22 87 9.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 24 88 9.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 25 89 9.2.1. Single Link Failure . . . . . . . . . . . . . . . . . 25 90 9.2.2. A Group of Link Failures . . . . . . . . . . . . . . 26 91 9.2.3. Node Failures . . . . . . . . . . . . . . . . . . . . 27 92 9.3. Routing Procedure . . . . . . . . . . . . . . . . . . . . 27 93 9.4. FAR's Performance in Large-scale Networks . . . . . . . . 29 94 9.4.1. The number of control messages required by FAR . . . 29 95 9.4.2. The Calculating Time of Routing Tables . . . . . . . 29 96 9.4.3. The Size of Routing Tables . . . . . . . . . . . . . 30 98 10. Security Considerations . . . . . . . . . . . . . . . . . . . 30 99 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 30 100 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 31 101 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 104 1. Introduction 106 In recent years, with the rapid development of cloud computing 107 technologies, the widely deployed cloud services, such as Amazon EC2 108 and Google search, bring about huge challenges to data center 109 networking (DCN). Today's cloud-based data centers (DCs) require 110 large-scale networks with larger internal bandwidth and smaller 111 transfer delay. However, conventional networks cannot meet such 112 requirements due to limitations in their network architecture. In 113 order to satisfy the requirements of cloud computing services, many 114 new network architectures have been proposed for data centers, such 115 as Fat-tree, MatrixDCN, and BCube. These new architectures can 116 support non-blocking large-scale datacenter networks with more than 117 tens of thousands of physical servers. 119 All of these architectures have a common feature that is with a 120 regular topology. Here a regular topology is not a mathematical or 121 definite conception, which means a non-arbitrary network fabric, or 122 an inerratic network fabric. In a regular topology, each network 123 node such as a switch or router can be addressed by its location and 124 through a node's address, the node's connections to its neighbors in 125 a network can be determined, and furthermore, the route to the node 126 from other nodes in the network can be determined. So nodes can 127 compute route entries without learning topology. 129 This draft proposes a generic routing method and protocol, fault- 130 avoidance routing (FAR) protocol, for DCNs. This method leverages 131 the regularity in the topologies of data center networks to simplify 132 routing learning and accelerate the query of routing tables. This 133 routing method has a better fault tolerance and can be applied to any 134 DCN with a regular topology. 136 FAR is not a routing protocol to replace generic routing protocols 137 such as OSPF and IS-IS. It cannot be used in general local networks 138 whose topological structures are arbitrary, and whose scales are also 139 not very large. OSPF works very well in such a network. But in a 140 large-scale network with regular topology, FAR has a better 141 performance. Compared with OSPF and IS-IS, FAR has shorter time of 142 network convergence and lower PDU overhead. Furthermore, FAR 143 requires less computing and storage resources, which let FAR routers 144 to run at a lower cost of production than the generic routers. 146 In addition, for each type of network architecture, researchers 147 designed a routing algorithm according to the features of its 148 topology. Because these routing algorithms are different and lack 149 compatibility with each other, it is very difficult to develop a 150 routing protocol for network routers supporting multiple routing 151 algorithms. FAR has better adaptability than these specified routing 152 methods. 154 FAR consists of three components, i.e., link state learning unit, 155 routing table building unit and routing table querying unit. In the 156 link state learning unit, FAR exchanges link failures among routers 157 to establish a consistent knowledge of the entire network. In this 158 stage, the regularity in topology is exploited to infer failed links 159 and routers. In the routing table building unit, FAR builds up two 160 routing tables, i.e., a basic routing table (BRT) and a negative 161 routing table (NRT), for each router according to the network 162 topology and link states. In the last component, routers forward 163 incoming packets by looking up the two routing tables. The matched 164 entries in BRT minus the matched entries in NRT are the final route 165 entries to be used to forward an incoming packet. 167 The remainder of this draft is organized as follows. The problem to 168 be addressed by FAR is described in Section 3. The framework of FAR 169 routing protocol is described in Section 4. Section 5 and 6 170 introduce FAR's data format FAR and modules in detail. Section 7 171 describe how FAR works by finite state machine (FSM). In Section 8, 172 we discussed how FAR works with variable network architectures. 173 Section 9 takes Fat-tree network as an example to illuminate how FAR 174 works. 176 1.1. Acronyms & Definitions 178 DCN - Data Center Network 180 FAR - Fault-Avoidance Routing 182 BRT - Basic Routing Table 184 NRT - Negative Routing Table 186 NDT - Neighbor Devices Table 188 ADT - All Devices Table 190 LFT - Link Failure Table 192 DA - Device Announcement 193 LFA - Link Failure Announcement 195 DLR - Device and Link Request 197 IP - Internet Protocol 199 UDP - User Datagram Protocol 201 VM - Virtual Machine 203 2. Conventions used in this document 205 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 206 NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 207 this document are to be interpreted as described in RFC-2119 208 [RFC2119]. In this document, these words will appear with that 209 interpretation only when in ALL CAPS. Lower case uses of these words 210 are not to be interpreted as carrying RFC-2119 significance. 212 3. Problem Statement 214 The problem to be addressed by FAR as proposed in this draft is 215 described in this section. The expansion of Cloud data center 216 networks has brought significant challenges to the existing routing 217 technologies. FAR mainly solves a series of routing problems faced 218 by large-scale data center networks. 220 3.1. The Impact of Large-scale Networks on Route Calculation 222 In a large-scale cloud data center network, there may be thousands of 223 routers. Running OSPF and other routing protocols in such network 224 will encounter these two challenges: a) Network convergence time 225 would be too long, which will cause a longer time to elapse for 226 creating and updating the routes. The response time to network 227 failures may be excessively long; b) a large number of routing 228 protocol packets need to be sent. The routing information consumes 229 too much network bandwidth and CPU resources, which easily leads to 230 packet loss and makes the problem (a) more prominent. 232 In order to solve these problems, a common practice is to partition a 233 large network into some small areas, where the route calculation runs 234 independently within different areas. However, nowadays the cloud 235 data centers typically require very large internal bandwidth. To 236 meet this requirement, a large number of parallel equivalent links 237 are deployed in a network, such as a Fat-tree network. Partitioning 238 such a network will affect the utilization of routing algorithm on 239 equivalent multi-path and reduce internal network bandwidth 240 requirements. 242 In the FAR routing calculation process, a Basic Routing Table (BRT) 243 is built on local network topology leveraging the regularity of the 244 network topologies. In addition to BRT, FAR also builds a Negative 245 Routing Table (NRT). FAR gradually builds NRT in the process of 246 learning network link failure information, which does not require 247 learning a complete network fault information. FAR does not need to 248 wait for the completion of the network convergence in the process of 249 building these two tables. Therefore, it avoids the problem of 250 excessive network convergence overheads in the route calculation 251 process. In addition, FAR only needs to exchange a small amount of 252 link change information between routers, and hence consumes less 253 network bandwidth. 255 3.2. Dilemma of conventional routing methods in a large-scale network 256 with giant number nodes of routers 258 There are many real world scenario where tens of thousands of 259 nodes(or much more nodes) need to be deployed in a flat area, such as 260 infiniband routing and switching system, high-performance computer 261 network, and many IDC networks in China. The similar problems have 262 been existed long ago. People have solved the problems through 263 similar solutions, such as the traditional regular topology-based 264 RFC3619 protocol, the routing protocols of infiniband routing and 265 switching system, and high-performance computer network routing 266 protocol. 267 Infiniband defines a switch-based network to interconnect processing 268 nodes and the I/O nodes. Infiniband can support very large scale 269 networks, use the regularity in topology to simplify its routing 270 algorithm, which is just the same to what we do in FAR. 272 Why OSPF and other conventional routing methods do not work well in a 273 large-scale network with giant number nodes of routers? 275 As everyone knows, the OSPF protocol uses multiple databases, more 276 topological exchange information (as seen in the following example) 277 and complicated algorithm. It requires routers to consume more 278 memory and CPU processing capability. But the processing rate of CPU 279 on the protocol message per second is very limited. When the network 280 expands, CPU will quickly approach its processing limits, and at this 281 time OSPF can not continue to expand the scale of the management. 282 The SPF algorithm itself does not thoroughly solve these problems. 284 On the contrary, FAR does not have the convergence time delay and the 285 additional CPU overheads, which SPF requires. Because in the initial 286 stage, FAR already knows the regular information of the whole network 287 topology and does not need to periodically do SPF operation. 289 One of the examples of "more topological exchange information": 291 In the OSPF protocol, LSA floods every 1800 seconds. Especially in 292 the larger network, the occupation of CPU and band bandwidth will 293 soon reach the router's performance bottleneck. In order to reduce 294 these adverse effects, OSPF introduced the concept of Area, which 295 still has not solved the problem thoroughly. By dividing the OSPF 296 Area into several areas, the routers in the same area do not need to 297 know the topological details outside their area. (In comparison with 298 FAR, after OSPF introducing the concept of Area, the equivalent paths 299 cannot be selected in the whole network scope) 301 OSPF can achieve the following results by Area : 1) Routers only need 302 to maintain the same link state databases as other routers within the 303 same Area, without the necessity of maintaining the same link state 304 database as all routers in the whole OSPF domain. 2) The reduction 305 of the link state databases means dealing with relatively fewer LSA, 306 which reduces the CPU consumption of routers; 3) The large number of 307 LSAs flood only within the same Area. But, its negative effect is 308 that the smaller number of routers which can be managed in each OSPF 309 area. On the contrary, because FAR does not have the above 310 disadvantages, FAR can also manage large-scale network even without 311 dividing Areas. 313 The aging time of OSPF is set in order to adapt to routing 314 transformation and protocol message exchange happened frequently in 315 the irregular topology. Its negative effect is: when the network 316 does not change, the LSA needs to be refreshed every 1800 seconds to 317 reset the aging time. In the regular topology, as the routings are 318 fixed, it does not need the complex protocol message exchange and 319 aging rules to reflect the routing changes, as long as LFA mechanism 320 in the FAR is enough. 322 Therefore, in FAR, we can omit many unnecessary processing and the 323 packet exchange. The benefits are fast convergence speed and much 324 larger network scale than other dynamic routing protocol. Now there 325 are some successful implementations of simplified routings in the 326 regular topology in the HPC environment. Conclusion: As FAR needs 327 few routing entries and the topology is regular, the database does 328 not need to be updated regularly. Without the need for aging, there 329 is no need for CPU and bandwidth overhead brought by LSA flood every 330 30 minutes, so the expansion of the network has no obvious effect on 331 the performance of FAR, which is contrary to OSPF. 333 Comparison of convergence time: The settings of OSPF spf_delay and 334 spf_hold_time can affect the change of convergence time. The 335 convergence time of the network with 2480 nodes is about 15-20 336 seconds; while the FAR does not need to calculate the SFP, so there 337 is no such convergence time. 339 These issues still exist in rapid convergence technology of OSPF and 340 ISIS (such as I-SPF). The convergence speed and network scale 341 constraint each other. FAR does not have the above problems, and the 342 convergence time is almost negligible. 343 Can FRR solve these problems? IP FRR has some limitations. The 344 establishment of IP FRR backup scheme will not affect the original 345 topology and traffic forwarding which are established by protocol, 346 however, we can not get the information of whereabouts and status 347 when the traffic is switched to an alternate next hop. 349 3.3. Network Addressing Issues 351 Routers are typically configured with multiple network interfaces, 352 each connected to a subnet. OSPF and other routing algorithms 353 require that each interface of a router must be configured with an IP 354 address. A large-scale data center network may contain thousands of 355 routers and each router has dozens of network interfaces, thus, there 356 are tens of thousands of IP addresses needed to be configured in a 357 data center. It will be very complex to configure and manage a large 358 number of network interfaces and will be difficult to troubleshoot 359 network problems, then network maintenance will be costly and error- 360 prone. 362 In FAR, the device position information is encoded in the IP address 363 of the router. Each router only needs to be assigned a unique IP 364 address according its location, which greatly solves complex network 365 addressing issues in large-scale networks. 367 3.4. Big Routing Table Issues 369 There are a large number of subnets in the large-scale data center 370 network. A router may build a routing entry for each subnet, and 371 therefore the size of routing tables on each router may be very 372 large. It will increase a router's cost and reduce the querying 373 speed of the routing table. 375 FAR uses two measures to reduce the size of its routing tables: a)It 376 builds a BRT on the regularity of the network topologies; b)It 377 introduces a new routing table, i.e., a NRT. In this way FAR can 378 reduce the size of routing tables to only a few dozen routing 379 entries. 381 3.5. Adaptivity Issues for Routing Algorithms 383 To implement efficient routing in large-scale datacenters, besides 384 FAR, some other routing methods are proposed for some specific 385 network architectures, such as Fat-tree and BCube. These routing 386 methods are different(from both design and implementation viewpoints) 387 and not compatible with the conventional routing methods, which 388 brings big troubles to network equipment providers to develop new 389 routers supporting various new routing methods. 391 FAR is a generic routing method. With slight modification, FAR 392 method can be applied to most of regular datacenter networks. 393 Furthermore, the structure of routing tables and querying a routing 394 table in FAR are the same as conventional routing method. If FAR is 395 adopted, the workload of developing a new type of router will be 396 significantly decreased. 398 3.6. Virtual Machine Migration Issues 400 Supporting VM migration is very important for cloud-based datacenter 401 networks. However, in order to support layer-3 routing, routing 402 methods including OSPF and FAR require limiting VM migration within a 403 subnet. For this paradox, the mainstream methods still utilize 404 layer-3 routing on routers or switches, transmit packets encapsulated 405 by IPinIP or MACinIP between hosts by tunnels passing through network 406 to the destination access switch, and then extract original packet 407 out and send it to the destination host. 409 By utilizing the aforementioned methods, FAR can be applied to Fat- 410 tree, MatrixDCN or BCube networks for supporting VM migration in 411 entire network. 413 4. The FAR Framework 415 FAR requires that a DCN has a regular topology, and network devices, 416 including routers, switches, and servers, are assigned IP addresses 417 according to their locations in the network. In other word, we can 418 locate a device in the network according to its IP address. 420 FAR is a distributed routing method. In order to support FAR, each 421 router needs to have a routing module that implements the FAR 422 algorithm. FAR algorithm is composed of three parts, i.e., link- 423 state learning, routing table building and routing table querying, as 424 shown in Fig. 1. 426 Link-state Learning |Routing Table | Routing Table 427 | Building 428 | | Querying 429 +--------+ /---------------\ | +--------+ | Packets 430 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | 431 |Learning| | Link Detection| | |Building|\ | | 432 +--------+ \---------------/ | +--------+ \ | \|/ 433 | | | \ |+--------------+ 434 | | | /|| 7 Querying | 435 \|/ \|/ | / || Routing Table| 436 +-----------------------+| / |+--------------+ 437 |3 Invisible Neighbor & || / | 438 |Link Failure Inferring || +--------- | 439 +-----------------------+|/| 6 NRT | | 440 | / |Building| | 441 \|/ /| +--------+ | 442 +--------------+ / | | 443 |4 Link Failure| / | | 444 | Learning | | | 445 +--------------+ | | 446 | | 448 Figure 1: The FAR framework 450 Link-state learning is responsible for a router to detect the states 451 of its connected links and learn the states of all the other links in 452 the entire network. The second part builds two routing tables, a 453 basic routing table (BRT) and an negative routing table (NRT), 454 according to the learned link states in the first part. The third 455 part queries the BRT and the NRT to decide a next forwarding hop for 456 the received (ingress) packets. 458 5. Data Format 460 5.1. Data Tables 462 Some data tables are maintained on each router in FAR. They are: 464 Neighbor Device Table (NDT): To store neighbor routers and related 465 links. 467 All Devices Table (ADT): To store all routers in the entire network. 469 Link Failures Table (LFT): To store all link failures in the entire 470 network. 472 Basic Routing Table (BRT): To store the candidate routes. 474 Negative Routing Table(NRT): To store the avoiding routes. 476 The format of NDT 478 ---------------------------------------------------------- 479 Device ID | Device IP | Port ID | Link State | Update Time 480 ---------------------------------------------------------- 482 Device ID: The ID of a neighbor router. 484 Device IP: The IP address of a neighbor router. 486 Port ID: The port ID that a neighbor router is attached to. 488 Link State: The state of the link between a router and its neighbor 489 router. There are two states: Up and Down. 491 Update Time: The time of updating the entry. 493 The format of ADT 495 -------------------------------------------------- 496 Device ID | Device IP | Type | State | Update Time 497 -------------------------------------------------- 499 Device ID: The ID of a neighbor router. 501 Device IP: The IP address of a neighbor router. 503 Type: The type of a neighbor router. 505 State: The state of a neighbor router. There are two states: Up and 506 Down. 508 Update Time: The time of updating the entry. 510 The format of LFT 512 -------------------------------------------- 513 No | Router 1 IP | Router 2 IP | Timestamp 514 -------------------------------------------- 516 No: The entry number. 518 Router 1 IP: The IP address of one router that a failed link connects 519 to. 521 Router 2 IP: The IP address of another router that a failed link 522 connects to. 524 Timestamp: It identifies when the entry is created. 526 The format of BRT 528 ------------------------------------------------------- 529 Destination | Mask | Next Hop | Interface | Update Time 530 ------------------------------------------------------- 532 Destination: A destination network 534 Mask: The subnet mask of a destination network. 536 Next Hop: The IP address of a next hop for a destination. 538 Interface: The interface related to a next hop. 540 Update Time: The time of updating the entry. 542 The format of NRT 544 ------------------------------------------------------------------- 545 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 546 ------------------------------------------------------------------- 548 Destination: A destination network. 550 Mask: The subnet mask of a destination network. 552 Next Hop: The IP address of a next hop that should be avoided for a 553 destination. 555 Interface: The interface related to a next hop that should be 556 avoided. 558 Failed Link No: A group of failed link numbers divided by "/", for 559 example 1/2/3. 561 Timestamp: The time of updating the entry. 563 5.2. Messages 565 Some protocol messages are exchanged between routers in FAR. 567 Hello Message: This message is exchanged between neighbor routers to 568 learn adjacency. 570 Device Announcement (DA): Synchronize the knowledge of routers 571 between routers. 573 Link Failure Announcement (LFA): Synchronize link failures between 574 routers. 576 Device and Link Request (DLR): When a router starts, it requests the 577 knowledge of routers and links from its neighbors by a DLR message. 579 A FAR Message is directly encapsulated in an IP packet. The protocol 580 field of IP header indicates an IP packet is an FAR message. The 581 protocol of IP for FAR should be assigned by IANA. 583 The four types of FAR messages have same format of packet header, 584 called FAR header (as shown in Figure 2). 586 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 587 +-----------+------------+------------------------+ 588 | Version |Message Type| Message Length | 589 +-----------+------------+------------------------+ 590 | Checksum | AuType | 591 +------------------------+------------------------+ 592 | Authentication | 593 +-------------------------------------------------+ 594 | Authentication | 595 +-------------------------------------------------+ 596 | Timestamp | 597 +-------------------------------------------------+ 599 Figure 2: The format of FAR header 601 Version: FAR version 603 Message Type: The type of FAR message. 605 Packet Length: The packet length of the total FAR message. 607 Checksum: The checksum of an entire FAR message. 609 AuType:Authentication type. 0: no authentication, 1: Plaintext 610 Authentication, 2: MD5 Authentication. 612 Authentication: Authentication information. 0: undefined, 1: Key, 2: 613 key ID, MD5 data length and packet number. MD5 data is appended to 614 the backend of the packet. 616 AuType and Authentication can refer to the definition of OSPF packet. 618 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 619 +-----------+------------+------------------------+ 620 | Version |Message Type| Message Length | 621 +-----------+------------+------------------------+ 622 | Checksum | AuType | 623 +------------------------+------------------------+ 624 | Authentication | 625 +-------------------------------------------------+ 626 | Authentication | 627 +-------------------------------------------------+ 628 | Timestamp | 629 +-------------------------------------------------+ 630 | Router IP | 631 +------------------------+------------------------+ 632 | HelloInterval | HelloDeadInterval | 633 +------------------------+------------------------+ 634 | Neighbor Router IP | 635 +-------------------------------------------------+ 636 | ... | 637 +-------------------------------------------------+ 639 Figure 3: The Format of Hello Messages 641 For Hello messages, the Message Type in FAR header is set to 642 1.Besides FAR header, a Hello message(Fig. 3) requires the following 643 fields: 645 Router IP: The router IP address. 647 HelloInterval: The interval of sending Hello messages to neighbor 648 routers. 650 RouterDeadInterval: The interval to set a neighbor router dead(out- 651 of-service). If in the interval time, a router doesn't receive a 652 Hello message from its neighbor router, the neighbor router is 653 treated as dead. 655 Neighbor Router IP: The IP address of a neighbor router. All the 656 neighbor router's addresses should be included in a Hello message. 658 |<----1---->| <----1---->|<----------2----------->| 659 +-----------+------------+------------------------+ 660 | Version |Message Type| Message Length | 661 +-----------+------------+------------------------+ 662 | Checksum | AuType | 663 +------------------------+------------------------+ 664 | Authentication | 665 +-------------------------------------------------+ 666 | Authentication | 667 +-------------------------------------------------+ 668 | Timestamp | 669 +------------------------+------------------------+ 670 | Router1 IP | 671 +-------------------------------------------------+ 672 | ... | 673 +-------------------------------------------------+ 675 Figure 4: The Format of DA Messages 677 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 678 Besides FAR header, a DA message includes IP addresses of all the 679 announced routers. 681 |<----1---->| <----1---->|<----------2----------->| 682 +-----------+------------+------------------------+ 683 | Version |Message Type| Message Length | 684 +-----------+------------+------------------------+ 685 | Checksum | AuType | 686 +------------------------+------------------------+ 687 | Authentication | 688 +-------------------------------------------------+ 689 | Authentication | 690 +-------------------------------------------------+ 691 | Timestamp | 692 +------------------------+------------------------+ 693 | Left IP | 694 +-------------------------------------------------+ 695 | Right IP | 696 +------------------------+------------------------+ 697 | State | 698 +-------------------------------------------------+ 699 | ... | 700 +-------------------------------------------------+ 702 Figure 5: The Format of LFA Messages 704 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 705 Besides FAR header, a LFA message includes all the announced link 706 failures. 708 Left IP: The IP address of the left endpoint router of a link. 710 Right IP: The IP address of the right endpoint router of a link. 712 State: Link state. 0: Up, 1: down 713 |<----1---->| <----1---->|<----------2----------->| 714 +-----------+------------+------------------------+ 715 | Version |Message Type| Message Length | 716 +-----------+------------+------------------------+ 717 | Checksum | AuType | 718 +------------------------+------------------------+ 719 | Authentication | 720 +-------------------------------------------------+ 721 | Authentication | 722 +-------------------------------------------------+ 723 | Timestamp | 724 +-------------------------------------------------+ 726 Figure 6: The Format of DLR Messages 728 For DLR messages(Fig. 6), the Message Type in FAR header is set to 729 1.Except for FAR header, DLR has no additional fields. 731 6. FAR Modules 733 6.1. Neighbor and Link Detection Module(M1) 735 M1 is responsible for sending and receiving Hello messages, and 736 detecting directly-connected links and neighbor routers. Each Hello 737 message is encapsulated in a UDP packet. M1 sends Hello messages 738 periodically to all the active router ports and receives Hello 739 messages from its neighbor routers. M1 detects neighbor routers and 740 directly-connected links according to received Hello Messages and 741 stores these neighbors and links into a Neighbor Devices Table (NDT). 742 Additionally, M1 also stores neighbor routers into an All Devices 743 Table (ADT). 745 6.2. Device Learning Module(M2) 747 M2 is responsible for sending, receiving, and forwarding device 748 announcement (DA) messages, learning all the routers in the whole 749 network, and deducing faulted routers. When a router starts, it 750 sends a DA message announcing itself to its neighbors and a DLR 751 message requesting the knowledge of routers and links from its 752 neighbors. If M2 module of a router receives a DA message, it checks 753 whether the router encapsulated in the message is in an ADT. If the 754 router is not in the ADT, M2 puts this router into the ADT and 755 forwards this DA message to all the active ports except for the 756 incoming one, otherwise, M2 discards this message directly. If M2 757 module of a router receives a DLR message, it replies a DA message 758 that encapsulates all of the learned routers. 760 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 762 M3 is responsible for inferring invisible neighbors of the current 763 router by means of the ADT. If the link between a router A and its 764 neighbor B breaks, which results in that M1 module of A cannot detect 765 the existence of B, then B is an invisible neighbor of A. Since a 766 device's location is coded into its IP address, it can be judged 767 whether two routers are adjacent, according to their IP addresses. 768 Based on this idea, M3 infers all of the invisible neighbors of the 769 current router and the related link failures. The results are stored 770 into a NDT. Moreover, link failures also are added into a link- 771 failure table (LFT). LFT stores all of the failed links in the 772 entire network. 774 6.4. Link Failure Learning Module(M4) 776 M4 is responsible for sending, receiving and forwarding link failure 777 announcement (LFA) and learning all the link failures in the whole 778 network. M4 broadcasts each newly inferred link failure to all the 779 routers in the network. Each link failure is encapsulated in a LFA 780 message and one link failure is broadcasted only once. If a router 781 receives a DLR request from its neighbor, it will reply a LFA message 782 that encapsulates all the learned link failures through M4 module. 783 If M4 receives a LFA message, it checks whether the link failure 784 encapsulated in the message is in a LFT by comparing two link ends 785 and timestamp. If the link failure is not in the LFT or timestamp is 786 different, M4 puts this link failure into the LFT (or update 787 timestamp only) and forwards this LFA message to all the active ports 788 except for the incoming one, otherwise, M4 discards this message 789 directly. 791 There is a special case a router will rebroadcast a link failure. If 792 a router receives a data packet and must forward the packet going 793 ahead to destination through a failed link, it means some previous 794 router should avoid this failed link according to its NRT but it 795 doesn't. In this case, maybe the previous router missed the LFA 796 message of the link failure due to some uncertain reasons. So the 797 forwarding router rebroadcasts the LFA message. 799 6.5. BRT Building Module(M5) 801 M5 is responsible for building a BRT for the current router. By 802 leveraging the regularity in topology, M5 can calculate the routing 803 paths for any destination without the knowledge of the topology of 804 whole network, and then build the BRT based on a NDT. Since the IP 805 addresses of network devices are continuous, M5 only creates one 806 route entry for a group of destination addresses that have the same 807 network prefix by means of route aggregation technology. Usually, 808 the size of a BRT is very small. The detail of how to build a BRT is 809 described in section 5. 811 6.6. NRT Building Module(M6) 813 M6 is responsible for building a NRT for the current router. Because 814 M5 builds a BRT without considering link failures in network, the 815 routing paths calculated by the BRT cannot avoid failed links. To 816 solve this problem, a NRT is used to exclude the routing paths that 817 include some failed links from the paths calculated by a BRT. M6 818 calculate the routing paths that include failed links and stored them 819 into the NRT. The details of how to build a NRT is described in 820 section 5. 822 6.7. Routing Table Lookup(M7) 824 M7 is responsible for querying routing tables and selecting the next 825 hop for forwarding the packets. Firstly, M7 takes the destination 826 address of a forwarding packet as a criterion to look up route 827 entries in a BRT based on longest prefix match. All of the matched 828 entries are composed of a candidate hops list. Secondly, M7 look up 829 negative route entries in a NRT taking the destination address of the 830 forwarding packet as criteria. This lookup is not limited to the 831 longest prefix match, any entry that matches the criteria would be 832 selected and composed of an avoiding hops list. Thirdly, the 833 candidate hops minus avoiding hops are composed of an applicable hops 834 list. At last, M7 sends the forwarding packet to any one of the 835 applicable hops. If the applicable list is empty, the forwarding 836 packet will be dropped. 838 7. How a FAR Router Works 840 Figure 7 shows how a FAR router works by its FSM. 842 /---------------\ 843 | Start | 844 | | 845 \---------------/ 846 | 847 +--------+ |1 848 | | | 849 | \|/ \|/ 850 | +--------------------+ 851 |2 | ND | 852 | | | 853 | +--------------------+ 854 | | | 855 | | |3 856 +--------+ | 857 \|/ 858 +--------------+ 859 | ND-FIN | 860 | | 861 +--------------+ 862 | 863 |4 864 | 865 \|/ 866 ________10_______\+--------------+/_______11________ 867 | /| |\ | 868 | | Listen | | 869 | ____9_______\| |/_______12___ | 870 | | /| |\ | | 871 | | +--------------+ | | 872 | | 5/ | | \8 | | 873 | | |/_ | | _\| | | 874 | | +------------+ 6| |7 +------------+ | | 875 | --| HELLO-RECV | | | | LFA-RECV |-- | 876 | +------------+ | | +------------+ | 877 | ______| |______ | 878 | | | | 879 | \|/ \|/ | 880 | +------------+ +------------+ | 881 |___________| DLR-RECV | | DA-RECV |__________| 882 +------------+ +------------+ 884 _________ 885 | | 886 | | 887 \|/ | 888 +--------------+ |13 889 | Hello Thread | | 890 +--------------+ | 891 | | 892 |_________| 894 _________ 895 | | 896 | | 897 \|/ | 898 +--------------+ |14 899 | LFD Thread | | 900 +--------------+ | 901 | | 902 |_________| 904 _________ 905 | | 906 | | 907 \|/ | 908 +--------------+ |15 909 | DA Thread | | 910 +--------------+ | 911 | | 912 |_________| 914 _________ 915 | | 916 | | 917 \|/ | 918 +--------------+ |16 919 | DFD Thread | | 920 +--------------+ | 921 | | 922 |_________| 924 Figure 7: The Finite State Machine of FAR Router 926 1)When a router starts up, it starts a Hello thread and then starts 927 ND (neighbor detection) timer (3 seconds). Next the router goes into 928 ND (neighbor detection) state. 929 2)In the ND state, if a router received a Hello message, then it 930 performs a Hello-message processing and goes back to the ND state. 931 3)When the ND timer is over, a router goes into ND-FIN (neighbor 932 detection finished) state. 933 4)A router starts the LFD (link failure detection) thread and DFD 934 (device failure detection) state, and sends DA message and DLR 935 message to all of its active ports. Then the router goes into Listen 936 state. 937 5) If a router receives a Hello message, then goes into HELLO-RECV 938 state. 939 6) If a router receives a DLR message, then goes into DLR-RECV state. 940 7) If a router receives a DA message, then goes into DA-RECV state. 941 8) If a router receives a LFA message, then goes into LFA-RECV state. 942 9) A router performs the Hello-message processing. After that, it 943 goes back to Listen state. 944 10) A router performs the DLR-message processing. After that, it 945 goes back to Listen state. 946 11) A router performs the DA-message processing. After that, it goes 947 back to Listen state. 949 12) A router performs the LFA-message processing. After that, it 950 goes back to Listen state. 951 13) Hello thread produces and sends Hello messages to all its ports 952 periodically. 953 14) LFD thread calls link-failure-detection processing to check link 954 failures in all links periodically 955 15) DA thread produces and sends DA messages periodically (30 956 minutes). 957 16) When DFD thread starts up, it sleep a short time (30 seconds) to 958 wait for a router learning all the active routers in the network. 959 Then the thread calls the device-failure-detection processing to 960 check device failures periodically (30 minutes). 962 8. Compatible Architecture 964 As a generic routing protocol, FAR can be run in various DCNs with 965 regular topology. Up to now, we have implemented the FAR protocol 966 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 968 For different network architectures, most processing of FAR is same 969 besides calculation of routing tables. BRT routing tables are 970 calculated based on Hello messages and NRT routing tables are 971 calculated based on LFA messages in FAR. To extend FAR to support a 972 new network architecture, only processing of Hello and LFA messages 973 need providing to build BRT and NRT routing tables. 975 In this protocol, FAR can support maximally 12 network architectures 976 and at least support 1 built-in network architecture, such as Fat- 977 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 978 a unique number from 1 to 12. For example, if the 1 built-in 979 architectures are assigned 1, and other customized architectures are 980 assigned 2 to 12. 981 1: Fat-tree 982 2: BCube 983 3: MatrixDCN. 984 4: xxx. 985 ...... 986 12: xxx. 988 9. Application Example 990 In this section, we take a Fat-tree network(Fig. 7) as an example to 991 describe how to apply FAR routing. Since M1 to M4 are very simple, 992 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 993 network. 995 A Fat-tree network is composed of 4 layers. The top layer is core 996 layer, and the other layers are aggregation layer, edge layer and 997 server layer. There are k pods, each one containing two layers of 998 k/2 switches. Each k-port switch in the edge layer is directly 999 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 1000 of the k-port switches in the aggregation layer. There are (k/2)2 1001 k-port core switches. Each core switch has one port connected to 1002 each of the k pods. 1004 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 1005 +--+ +--+ +--+ +--+ 1006 | | | | | | | | 1007 +--+,_ +--+', +--+ ,,+--+ 1008 |`,`',`-., / | \ `. .'` - .-'``.` /| 1009 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1010 | \ `', `-., .` / | `, .` ,' | 1011 | `, `'. `'-,_ .'` ,' | ', / | 1012 | . `'. `-.,-` / | \ 1013 | \ `'., .` `'., ` | `. 1014 | `, .'`, ,`'., | ', 1015 | . ,-` '., - `'-,_ | `. 1016 | \ .` ,'., `|., . 1017 | .'` / `-, | `'., `. 1018 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1019 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1020 | | | | | | | | | | | | | | | | 1021 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1022 | \/ | | \/ | | \/ | | \/ | 1023 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1024 | | | | | | | | | | | | | | | | 1025 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1026 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1027 / | | \ / | | \ / | | \ / | | | 1028 / | | \ / | | \ / | | \ / | | | 1029 / | | \ / | | \ / | | \ / | | | 1030 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1031 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1032 10.1.2.2 10.3.1.3 1034 Figure 8: Fat-tree Network 1036 Aggregation switches are given addresses of the form 10.pod.0.switch, 1037 where pod denotes the pod number, and switch denotes the position of 1038 that switch in the upper pod (in [1, k/2]). Edge switches are given 1039 addresses of the form 10.pod.switch.1, where pod denotes the pod 1040 number, and switch denotes the position of that switch in the lower 1041 pod (in [1, k/2]). The core switches are given addresses of the form 1042 10.0.j.i, where j and i denote that switch's coordinates in the 1043 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1044 The address of a host follows the pod switch to which it is connected 1045 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1046 the host's position in that subnet (in [2, k/2+1], starting from left 1047 to the right). 1049 9.1. BRT Building Procedure 1051 By leveraging the topology's regularity, every switch clearly knows 1052 how it forwards a packet. When a packet arrives at an edge switch, 1053 if the destination of the packet lies in the same subnet with the 1054 switch, then the switch directly forwards the packet to the 1055 destination server through layer-2 switching. Otherwise, the switch 1056 forwards the packet to any of aggregation switches in the same pod. 1057 When a packet arrives at an aggregation switch, if the destination of 1058 the packet lies in the same pod, the switch forwards the packet to 1059 the corresponding edge switch. Otherwise, the switch forwards the 1060 packet to any of core switches that it is connected to. If a core 1061 switch receives a packet, it forwards the packet to the corresponding 1062 aggregation switch that lies in the destination pod. 1064 The forwarding policy discussed above is easily expressed through a 1065 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1066 following entries: 1068 Destination/Mask Next hop 1069 10.0.0.0/255.0.0.0 10.1.0.1 1070 10.0.0.0/255.0.0.0 10.1.0.2 1072 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1073 the following entries: 1075 Destination/Mask Next hop 1076 10.1.1.0/255 255.255.0 10.1.1.1 1077 10.1.2.0/255.255.255.0 10.1.2.1 1078 10.0.0.0/255.0.0.0 10.0.1.1 1079 10.0.0.0/255.0.0.0 10.0.1.2 1081 The BRT of acore switch, such as 10.0.1.1, is composed of the 1082 following entries: 1084 Destination/Mask Next hop 1085 10.1.0.0/255 255.0.0 10.1.0.1 1086 10.2.0.0/255.255.0.0 10.2.0.1 1087 10.3.0.0/255.255.0.0 10.3.0.1 1088 10.4.0.0/255.255.0.0 10.4.0.1 1090 9.2. NRT Building Procedure 1092 The route entries in an NRT are related with link and node failures. 1093 We summarize all types of cases into three (3) catalogs. 1095 9.2.1. Single Link Failure 1097 In Fat-tree, Links can be classified as 3 types by their locations: 1098 1) servers to edge switches; 2) edge to aggregation switches; 3) 1099 aggregation to core switches. Link failures between servers to edge 1100 switches only affect the communication of the corresponding servers 1101 and don't affect the routing tables of any switch, so we only discuss 1102 the second and third type of links failures. 1104 Edge to Aggregation Switches 1106 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1107 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1108 failure may affect 3 types of communications. 1110 o Sources lie in the same subnet with A, and destinations do not. In 1111 this case, the link failure will only affect the routing tables of A. 1112 As this link is attached to A directly, A only needs to delete the 1113 route entries whose next hop is B in its BRT and add no entries to 1114 its NRT when A's M6 module detect the link failure. 1116 o Destinations lie in the same subnet with A, and sources lie in 1117 another subnet of the same pod. In this case, the link failure will 1118 affect the routing tables of all the edge switches in the same pod 1119 except for A. When an edge switch, such as 10.1.1.1, learns the link 1120 failure, it will add a route entry to its NRT: 1122 Destination/Mask Next hop 1123 10.1.2.0/255.255.255.0 10.1.0.1 1125 o Destinations lie in the same subnet with A, sources lie in another 1126 pod. In this case, the link failure will affect the routing tables 1127 of all the edge switches in the other pods. When an edge switch in 1128 one other pod, such as 10.3.1.1, learns the link failure, because all 1129 the routings that pass through 10.3.0.1 to A will certainly pass 1130 through the link between A and B, 10.3.1.1 need add a route entry to 1131 its NRT: 1133 Destination/Mask Next hop 1134 10.1.2.0/255.255.255.0 10.3.0.1 1136 Aggregation to Core Switches 1138 Suppose that the link between an aggregation switch, such as 10.1.0.1 1139 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1140 failure may affect 2 types of communications. 1142 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1143 the other pods. In this case, the link failure will only affect the 1144 routing tables of A. As this link is attached to A directly, A only 1145 need to delete the route entries whose next hop is B in its BRT and 1146 add no entries to its NRT when A's M6 module detect the link failure. 1148 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1149 another pod. In this case, the link failure will affect the routing 1150 tables of all the aggregation switches in other pods except for pod 1151 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1152 learns the link failure, because all the routings that pass through 1153 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1154 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1156 Destination/Mask Next hop 1157 10.1.0.0/255.255.0.0 10.0.1.2 1159 9.2.2. A Group of Link Failures 1161 If all the uplinks of an aggregation switch fail, then this switch 1162 cannot forward packets, which will affect the routing of every edge 1163 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1164 fail, it will affect two types of communications. 1166 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1167 the other pods. In this case, the link failures will affect the 1168 routing of the edge switches in the Pod of A. To avoid the node A, 1169 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1170 10.1.0.1" in which the next hop is the node A. 1172 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1173 other pods. In this case, the link failures will affect the routing 1174 of edge switches in other pods. For example, if the edge switch 1175 10.3.1.1 communicates with some node in the pod of A, it should avoid 1176 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1177 pod of A will pass through the node A. So a route entry should be 1178 added to 10.3.1.1: 1180 Destination/Mask Next hop 1181 10.1.0.0/255.255.0.0 10.3.0.1 1183 9.2.3. Node Failures 1185 At last, we discuss the effect of node failures to a NRT. There are 1186 3 types of node failures: the failure of edge, aggregation and core 1187 switches. 1189 o An edge switch fails. The failure doesn't affect the routing table 1190 of any switch. 1192 o A core switch fails. Only when all the core switches connected to 1193 the same aggregation switch fail, they will affect the routing of 1194 other switches. This case is equal to the case that all the uplinks 1195 of an aggregation switch fail, so the process of link failures can 1196 cover it. 1198 o An aggregation switch fails. This case is similar to the case that 1199 all the uplinks of an aggregation switch fail. It affects the 1200 routing of edge switches in other pods, but doesn't affect the 1201 routing of edge switches in pod of the failed switch. The process of 1202 this failure is same to the second case in section 6.2.2. 1204 9.3. Routing Procedure 1206 FAR decides a routing by looking up its BRT and NRT. We illuminate 1207 the routing procedure by an example. In this example, we suppose 1208 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1209 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1210 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1211 (destination). 1213 Step 1: The source 10.3.1.3 sends packets to its default router 1214 10.3.1.1 1216 Step 2: The routing of 10.3.1.1. 1218 1) Calculate candidate hops 1220 10.3.1.1 looks up its BRT and gets the following matched entries: 1222 Destination/Mask Next hop 1223 10.0.0.0/255.0.0.0 10.3.0.1 1225 So the candidate hops = {10.3.0.1} 1227 2) Calculate avoiding hops 1229 Its NRT is empty, so the set of avoiding hop is empty too. 1231 3) Calculate applicable hops 1233 The applicable hops are candidate hops minus avoiding hops, so: 1235 The applicable hops = {10.3.0.1} 1237 4) Forward packets to 10.3.0.1 1239 Step 3: The routing of 10.3.0.1 1241 1) Calculate candidate hops. 1243 10.3. 0.1 looks up its BRT and gets the following matched entries: 1245 Destination/Mask Next hop 1246 10.1.0.0/255.255.0.0 10.0.1.1 1247 10.1.0.0/255.255.0.0 10.0.1.2 1249 So the candidate hops = {10.0.1.1, 10.0.1.2} 1251 2) Calculate avoiding hops 1253 Destination/Mask Next hop 1254 10.1.0.0/255.255.0.0 10.0.1.2 1256 So the avoiding hops = {10.0.1.2} 1258 3) Calculate applicable hops 1260 The applicable hops are candidate hops minus avoiding hops, so: 1262 The applicable hops = {10.0.1.1} 1264 4) Forward packets to 10.0.1.1 1265 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1266 routing tables. 1268 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1269 routing tables. 1271 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1272 layer-2 switching. 1274 9.4. FAR's Performance in Large-scale Networks 1276 FAR has good performance to support large-scale networks. In this 1277 section, we take a Fat-tree network composed of 2,880 48-port 1278 switches and 27,648 servers as an example to show FAR's performance. 1280 9.4.1. The number of control messages required by FAR 1282 FAR exchanges a few messages between routers and only consumes a 1283 little network bandwidth. Tab. 1 shows the required messages in the 1284 example Fat-tree network. 1285 Table 1:Required messages in a Fat-tree network. 1286 _____________________________________________________________________ 1287 Message Type| Scope | size(bytes) | Rate | Bandwidth 1288 --------------------------------------------------------------------- 1289 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1290 kbps 1291 --------------------------------------------------------------------- 1292 DLR |adjacent switches| less than 48 | (1) |48bytes 1293 --------------------------------------------------------------------- 1294 DA |entire network| less than 48 | (2) |1.106M 1295 --------------------------------------------------------------------- 1296 LFA |entire network| less than 48 | (3) |48 bytes 1297 _____________________________________________________________________ 1298 (1)Produce one when a router starts 1299 (2)The number of switches(2,880) in a period 1300 (3)Produce one when a link fails or recovers 1302 9.4.2. The Calculating Time of Routing Tables 1304 A BRT is calculated according to the states of its neighbor routers 1305 and attached links. An NRT is calculated according to device and 1306 link failures in the entire network. So FAR does not calculate 1307 network topology and has no problem of network convergence, which 1308 greatly reduces the calculating time of routing tables. The 1309 detection and spread time of link failures is very short in FAR. 1310 Detection time is up to the interval of sending Hello message. In 1311 FAR, the interval is set to 100ms, and a link failure will be 1312 detected in 200ms. The spread time between any pair of routers is 1313 less than 200ms.If a link fails in a data center network, FAR can 1314 detect it, spread it to all the routers, and calculate routing tables 1315 in no more than 500ms. 1317 9.4.3. The Size of Routing Tables 1319 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1320 in Tab. 2. 1322 Table 2: The size of routing tables in FAR 1323 _____________________________________________________________________ 1324 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1325 --------------------------------------------------------------------- 1326 BRT | 48 | 48 | 24 1327 --------------------------------------------------------------------- 1328 NRT | 0 | 14 | 333 1329 _____________________________________________________________________ 1331 The BRT's size at a switch is determined by the number of its 1332 neighbor switches. In the example network, a core switch has 48 1333 neighbor switches (aggregation switch), so it has 48 entries in its 1334 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1335 switch is related to the number of link failures in the network. 1336 Suppose that there are 1000 link failures in the example network, the 1337 number of failed links is 1.2% of total links, which is a very high 1338 failure ratio. We suppose that link failures are uniformly 1339 distributed in the entire network. The NRT size at an edge switch is 1340 about 333 and the NRT size of an aggregation switch is about 14in 1341 average. 1343 10. Security Considerations 1345 The security considerations will be discussed in a future version of 1346 this document. 1348 11. Conclusions 1350 This draft introduces FAR protocol, a generic routing method and 1351 protocol, for data centers that have a regular topology. It uses two 1352 routing tables, a BRT and an NRT, to store the normal routing paths 1353 and the forbidden (to-be-avoided) routing paths, respectively. This 1354 makes the FAR protocol very simple and efficient. The sizes of these 1355 two tables are very small. Usually, a BRT has only several tens of 1356 entries and an NRT has only several or about a dozen entries. 1358 12. Acknowledgments 1360 This document is supported by ZTE Enterprise-University-Research 1361 Joint Project. 1363 13. References 1365 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1366 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1367 1997. 1369 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1370 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1371 2008. 1373 Authors' Addresses 1375 Bin Liu 1376 ZTE Inc., ZTE Plaza 1377 No.19 East Huayuan Road,Hai Dian District 1378 Beijing 100191 1379 China 1381 Phone: +86 -010-59932039 1382 Email: 13683610386@139.com 1384 Yantao Sun 1385 Beijing Jiaotong University 1386 No.3 Shang Yuan Cun, Hai Dian District 1387 Beijing 100044 1388 China 1390 Email: ytsun@bjtu.edu.cn 1392 Jing Cheng 1393 Beijing Jiaotong University 1394 No.3 Shang Yuan Cun, Hai Dian District 1395 Beijing 100044 1396 China 1398 Email: yourney.j@gmail.com 1399 Yichen Zhang 1400 Beijing Jiaotong University 1401 No.3 Shang Yuan Cun, Hai Dian District 1402 Beijing 100044 1403 China 1405 Email: snowfall_dan@sina.com 1407 Bhumip Khasnabish 1408 ZTE TX Inc. 1409 55 Madison Avenue, Suite 160 1410 Morristown, New Jersey 07960 1411 USA 1413 Phone: +001-781-752-8003 1414 Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com 1415 URI: http://tinyurl.com/bhumip/