Routing Area Working Group                                       Bin Liu
Internet-Draft                                                  ZTE Inc.
Intended status: Informational                                Yantao Sun
Expires: November 1, 2015                                     Jing Cheng
                                                            Yichen Zhang
                                            Beijing Jiaotong University
                                                       Bhumip Khasnabish
                                                             ZTE TX Inc.
                                                          April 30, 2015

   Generic Fault-avoidance Routing Protocol for Data Center Networks
                       draft-sl-rtgwg-far-dcn-03

Abstract

This draft proposes a generic routing method and protocol for a
regular data center network, named the fault-avoidance routing (FAR)
protocol.  The FAR protocol provides a generic routing method for all
types of network architectures that have been proposed for large-scale
cloud-based data centers over the past few years.  The protocol is
designed to fully leverage the regularity of such topologies and to
compute its routing tables in a simple manner.  Fat-tree is taken as an
example architecture to illustrate how the FAR protocol can be applied
in real operational scenarios.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF).  Note that other groups may also distribute working
documents as Internet-Drafts.  The list of current Internet-Drafts is
at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

This Internet-Draft will expire on November 1, 2015.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors.  All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Acronyms & Definitions . . . . . . . . . . . . . . . . .   3
   2.  Conventions used in this document  . . . . . . . . . . . . .   4
   3.  Problem Statement  . . . . . . . . . . . . . . . . . . . . .   4
     3.1.  The Impact of Large-scale Networks on Route Calculation    4
     3.2.  Dilemma of conventional routing methods in a large-scale
           network with a giant number of router nodes  . . . . . .   5
     3.3.  Network Addressing Issues  . . . . . . . . . . . . . . .   7
     3.4.  Big Routing Table Issues . . . . . . . . . . . . . . . .   8
     3.5.  Adaptivity Issues for Routing Algorithms . . . . . . . .   8
     3.6.  Virtual Machine Migration Issues . . . . . . . . . . . .   8
   4.  The FAR Framework  . . . . . . . . . . . . . . . . . . . . .   9
   5.  Data Format  . . . . . . . . . . . . . . . . . . . . . . . .  10
     5.1.  Data Tables  . . . . . . . . . . . . . . . . . . . . . .  10
     5.2.  Messages . . . . . . . . . . . . . . . . . . . . . . . .  12
   6.  FAR Modules  . . . . . . . . . . . . . . . . . . . . . . . .  16
     6.1.  Neighbor and Link Detection Module (M1)  . . . . . . . .  16
     6.2.  Device Learning Module (M2)  . . . . . . . . . . . . . .  17
     6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)  17
     6.4.  Link Failure Learning Module (M4)  . . . . . . . . . . .  17
     6.5.  BRT Building Module (M5) . . . . . . . . . . . . . . . .
. . . . .  18
     6.6.  NRT Building Module (M6) . . . . . . . . . . . . . . . .  18
     6.7.  Routing Table Lookup (M7)  . . . . . . . . . . . . . . .  18
   7.  How a FAR Router Works . . . . . . . . . . . . . . . . . . .  18
   8.  Compatible Architecture  . . . . . . . . . . . . . . . . . .  21
   9.  Application Example  . . . . . . . . . . . . . . . . . . . .  22
     9.1.  BRT Building Procedure . . . . . . . . . . . . . . . . .  24
     9.2.  NRT Building Procedure . . . . . . . . . . . . . . . . .  25
       9.2.1.  Single Link Failure  . . . . . . . . . . . . . . . .  25
       9.2.2.  A Group of Link Failures . . . . . . . . . . . . . .  26
       9.2.3.  Node Failures  . . . . . . . . . . . . . . . . . . .  27
     9.3.  Routing Procedure  . . . . . . . . . . . . . . . . . . .  27
     9.4.  FAR's Performance in Large-scale Networks  . . . . . . .  29
       9.4.1.  The number of control messages required by FAR . . .  29
       9.4.2.  The Calculating Time of Routing Tables . . . . . . .  29
       9.4.3.  The Size of Routing Tables . . . . . . . . . . . . .  29
   10. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . .  30
   11. References . . . . . . . . . . . . . . . . . . . . . . . . .  30
   12. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . .  30
   13. Security Considerations  . . . . . . . . . . . . . . . . . .  30
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  31

1.  Introduction

In recent years, with the rapid development of cloud computing
technologies, widely deployed cloud services such as Amazon EC2 and
Google search have brought huge challenges to data center networking
(DCN).  Today's cloud-based data centers (DCs) require large-scale
networks with larger internal bandwidth and smaller transfer delay.
However, conventional networks cannot meet such requirements due to
limitations in their network architecture.
In order to satisfy the requirements of cloud computing services, many
new network architectures have been proposed for data centers, such as
Fat-tree, MatrixDCN, and BCube.  These new architectures can support
non-blocking large-scale datacenter networks with more than tens of
thousands of physical servers.

Generic routing protocols such as OSPF and IS-IS cannot be scaled to
support a very large-scale datacenter network because of the problems
of network convergence and PDU overhead.  For each type of
architecture, researchers have designed a routing algorithm according
to the features of its topology.  Because these routing algorithms are
different and incompatible with each other, it is very difficult to
develop a routing protocol for network routers supporting multiple
routing algorithms.  Furthermore, the fault-tolerance mechanisms in
those routing algorithms are very complicated and inefficient.

This draft proposes a generic routing method and protocol, the
fault-avoidance routing (FAR) protocol, for DCNs.  This method
leverages the regularity in the topologies of data center networks to
simplify route learning and accelerate the querying of routing tables.
This routing method has better fault-tolerance properties and can be
applied to any DCN with a regular topology.

1.1.  Acronyms & Definitions

DCN - Data Center Network

FAR - Fault-avoidance Routing

BRT - Basic Routing Table

NRT - Negative Routing Table

NDT - Neighbor Devices Table

ADT - All Devices Table

LFT - Link Failure Table

DA - Device Announcement

LFA - Link Failure Announcement

DLR - Device and Link Request

IP - Internet Protocol

UDP - User Datagram Protocol

VM - Virtual Machine

2.
Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].  In
this document, these words will appear with that interpretation only
when in ALL CAPS.  Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

3.  Problem Statement

The problem to be addressed by FAR as proposed in this draft is
described in this section.  The expansion of cloud data center
networks has brought significant challenges to the existing routing
technologies.  FAR mainly solves a series of routing problems faced by
large-scale data center networks.

3.1.  The Impact of Large-scale Networks on Route Calculation

In a large-scale cloud data center network, there may be thousands of
routers.  Running OSPF or other routing protocols in such a network
will encounter two challenges: a) the network convergence time would
be too long, so creating and updating routes takes longer and the
response to network failures may be excessively slow; b) a large
number of routing protocol packets need to be sent.  The routing
information consumes a lot of network bandwidth and CPU resources,
which easily leads to packet loss and makes problem (a) more
prominent.

In order to solve these problems, a common practice is partitioning
the large network into small areas, where route calculation runs
independently within different areas.  However, nowadays cloud data
centers typically require very large internal bandwidth.  To meet this
requirement, a large number of parallel equivalent links are deployed
in the network, as in the Fat-tree network architecture.
Partitioning the network prevents the routing algorithm from fully
utilizing these equivalent multi-paths and reduces the internal
network bandwidth that can be provided.

In the FAR routing calculation process, a Basic Routing Table (BRT) is
built on the local network topology, leveraging the regularity of the
network topology.  In addition to the BRT, FAR also builds a Negative
Routing Table (NRT).  FAR gradually builds the NRT in the process of
learning network link failure information, which does not require
learning the complete network fault information.  FAR does not need to
wait for the completion of network convergence in the process of
building these two tables.  Therefore, it avoids the problem of
excessive network convergence overheads in the route calculation
process.  In addition, FAR only needs to exchange a small amount of
link change information between routers, and hence consumes less
network bandwidth.

3.2.  Dilemma of conventional routing methods in a large-scale network
      with a giant number of router nodes

There are many real-world scenarios where tens of thousands of nodes
(or many more) need to be deployed in a flat area, such as InfiniBand
routing and switching systems, high-performance computer networks, and
many IDC networks in China.  Similar problems have existed for a long
time, and people have solved them through similar solutions, such as
the traditional regular-topology-based RFC 3619 protocol, the routing
protocols of InfiniBand routing and switching systems, and
high-performance computer network routing protocols.

InfiniBand defines a switch-based network to interconnect processing
nodes and I/O nodes.  The network is composed of HCAs, Switches and
SMs.  SMs are responsible for the discovery, configuration,
activation, and management of the entire subnet.  SMs can reside on
any subnet node (such as Switches, Routers or HCAs).
SMs exchange control management packets through subnet management
interfaces (SMIs) and subnet management agents (SMAs).  These control
packets are called subnet management packets (SMPs).  SMPs are sent
using an unreliable datagram service.  SMPs are divided into
LID-routed and directed-route SMPs, and directed-route SMPs are used
for network topology discovery before network initialization.
InfiniBand can support very large-scale networks and uses the
regularity in the topology to simplify its routing algorithm, which is
just what we do in FAR.

Why do OSPF and other conventional routing methods not work well in a
large-scale network with a giant number of router nodes?

As is well known, the OSPF protocol uses multiple databases, a large
amount of topological exchange information (as seen in the following
example) and a complicated algorithm.  It requires routers to consume
more memory and CPU processing capability, but the number of protocol
messages a CPU can process per second is very limited.  When the
network expands, the CPU quickly approaches its processing limits, and
at that point OSPF cannot continue to expand the scale it manages.
The SPF algorithm itself does not thoroughly solve these problems.

On the contrary, FAR does not have the convergence time delay and the
additional CPU overheads that SPF requires, because in the initial
stage FAR already knows the regular structure of the whole network
topology and does not need to periodically run SPF.

One example of "a large amount of topological exchange information":
in the OSPF protocol, LSAs flood every 1800 seconds.  Especially in
larger networks, the occupation of CPU and bandwidth will soon reach
the router's performance bottleneck.  In order to reduce these adverse
effects, OSPF introduced the concept of Area, which still has not
solved the problem thoroughly.
By dividing the OSPF domain into several areas, routers in the same
area do not need to know the topological details outside their area.
(In comparison with FAR: after OSPF introduces the concept of Area,
equivalent paths can no longer be selected across the whole network
scope.)

OSPF achieves the following results through Areas: 1) routers only
need to maintain the same link-state databases as other routers within
the same Area, without the necessity of maintaining the same
link-state database as all routers in the whole OSPF domain; 2) the
reduction of the link-state databases means dealing with relatively
fewer LSAs, which reduces the CPU consumption of routers; 3) the large
number of LSAs flood only within the same Area.  However, the negative
effect is that fewer routers can be managed in each OSPF area.  In
contrast, because FAR does not have the above disadvantages, FAR can
manage a large-scale network even without dividing it into Areas.

The aging time of OSPF is set in order to adapt to the routing changes
and frequent protocol message exchanges of irregular topologies.  Its
negative effect is that even when the network does not change, each
LSA needs to be refreshed every 1800 seconds to reset its aging time.
In a regular topology, as the routes are fixed, complex protocol
message exchange and aging rules are not needed to reflect routing
changes; the LFA mechanism in FAR is enough.

Therefore, FAR can omit much unnecessary processing and packet
exchange.  The benefits are fast convergence and a much larger network
scale than other dynamic routing protocols.  There are already some
successful implementations of simplified routing on regular topologies
in HPC environments.
Conclusion: as FAR needs few routing entries and the topology is
regular, the database does not need to be updated regularly.  Without
the need for aging, there is no CPU and bandwidth overhead brought by
LSA flooding every 30 minutes, so the expansion of the network has no
obvious effect on the performance of FAR, contrary to OSPF.

Comparison of convergence time: the settings of OSPF spf_delay and
spf_hold_time affect the convergence time.  The convergence time of a
network with 2480 nodes is about 15-20 seconds (as seen in the
following pages), while FAR does not need to calculate the SPF, so
there is no such convergence time.  These issues still exist in the
rapid convergence technologies of OSPF and IS-IS (such as i-SPF):
convergence speed and network scale constrain each other.  FAR does
not have the above problems, and its convergence time is almost
negligible.

Can FRR solve these problems?  IP FRR has some limitations.  The
establishment of an IP FRR backup scheme does not affect the original
topology and traffic forwarding established by the protocol; however,
we cannot obtain the whereabouts and status of the traffic once it is
switched to an alternate next hop.

3.3.  Network Addressing Issues

Routers typically configure multiple network interfaces, each
connected to a subnet.  OSPF and other routing algorithms require that
each interface of a router be configured with an IP address.  A
large-scale data center network may contain thousands of routers, each
with dozens of network interfaces, so tens of thousands of IP
addresses may need to be configured.  It is a very complex issue to
configure and manage such a large number of network interfaces.
Network maintenance is usually costly and error-prone, and it is
difficult to troubleshoot problems.
In FAR, the device position information is encoded in the IP address
of the router.  Each router only needs to be assigned a unique IP
address according to its location, which greatly simplifies the
complex network addressing issues in large-scale networks.

3.4.  Big Routing Table Issues

There are a large number of subnets in a large-scale data center
network.  Routers may build a routing entry for each subnet, and
therefore the size of the routing table on each router may be very
large.  This increases equipment cost and reduces the querying speed
of the routing table.

FAR uses two measures to reduce the size of the routing tables: a) it
builds a BRT based on the regularity of the network topology; b) it
introduces a new routing table, i.e., an NRT.  In this way, FAR can
reduce the size of routing tables to only a few dozen routing entries.

3.5.  Adaptivity Issues for Routing Algorithms

To implement efficient routing in large-scale datacenters, besides
FAR, some other routing methods have been proposed for specific
network architectures, such as Fat-tree and BCube.  These routing
methods are different (from both design and implementation viewpoints)
and are not compatible with conventional routing methods, which makes
it very troublesome for network equipment providers to develop new
routers supporting the various new routing methods.

FAR is a generic routing method.  With slight modification, the FAR
method can be applied to most regular datacenter networks.
Furthermore, the structure of routing tables and the querying of a
routing table in FAR are the same as in conventional routing methods.
If FAR is adopted, the workload of developing a new type of router
will be significantly decreased.

3.6.  Virtual Machine Migration Issues

Supporting VM migration is very important for cloud-based datacenter
networks.
However, in order to support layer-3 routing, routing methods
including OSPF and FAR require limiting VM migration to within a
subnet.  To resolve this paradox, the mainstream methods still utilize
layer-3 routing on routers or switches, encapsulate packets with
IPinIP or MACinIP and transmit them between hosts through tunnels
passing through the network to the destination access switch, which
then extracts the original packet and sends it to the destination
host.

By utilizing the aforementioned methods, FAR can be applied to
Fat-tree, MatrixDCN or BCube networks while supporting VM migration in
the entire network.

4.  The FAR Framework

FAR requires that a DCN has a regular topology and that network
devices, including routers, switches, and servers, are assigned IP
addresses according to their locations in the network.  In other
words, we can locate a device in the network according to its IP
address.

FAR is a distributed routing method.  In order to support FAR, each
router needs to have a routing module that implements the FAR
algorithm.  The FAR algorithm is composed of three parts, i.e.,
link-state learning, routing table building and routing table
querying, as shown in Fig. 1.
   Link-state Learning            |  Routing Table  |  Routing Table
                                  |    Building     |    Querying

   +--------+   /---------------\     +--------+
   |2 Device|<--| 1 Neighbor &  |---->| 5 BRT  |---+        Packets
   |Learning|   | Link Detection|     |Building|   |           |
   +--------+   \---------------/     +--------+   |          \|/
       |               |                           |   +---------------+
       |               |                           +-->| 7 Querying    |
      \|/             \|/                          +-->| Routing Table |
   +------------------------+                      |   +---------------+
   |3 Invisible Neighbor &  |                      |
   |Link Failure Inferring  |         +--------+   |
   +------------------------+         | 6 NRT  |---+
       |                              |Building|
      \|/                             +--------+
   +----------------+                    /|\
   |4 Link Failure  |                     |
   |  Learning      |---------------------+
   +----------------+

                     Figure 1: The FAR framework

Link-state learning enables a router to detect the states of its
directly connected links and to learn the states of all the other
links in the entire network.  The second part builds two routing
tables, a basic routing table (BRT) and a negative routing table
(NRT), according to the link states learned in the first part.  The
third part queries the BRT and the NRT to decide the next forwarding
hop for received (ingress) packets.

5.  Data Format

5.1.  Data Tables

In FAR, some data tables are maintained on each router.  They are:

Neighbor Devices Table (NDT): stores neighbor routers and related
links.

All Devices Table (ADT): stores all routers in the entire network.

Link Failures Table (LFT): stores all link failures in the entire
network.

Basic Routing Table (BRT): stores the candidate routes.

Negative Routing Table (NRT): stores the routes to be avoided.

The format of the NDT:

----------------------------------------------------------
Device ID | Device IP | Port ID | Link State | Update Time
----------------------------------------------------------

Device ID: The ID of a neighbor router.

Device IP: The IP address of a neighbor router.
Port ID: The ID of the port to which the neighbor router is attached.

Link State: The state of the link between this router and the
neighbor router.  There are two states: Up and Down.

Update Time: The time at which the entry was last updated.

The format of the ADT:

--------------------------------------------------
Device ID | Device IP | Type | State | Update Time
--------------------------------------------------

Device ID: The ID of a router.

Device IP: The IP address of a router.

Type: The type of a router.

State: The state of a router.  There are two states: Up and Down.

Update Time: The time at which the entry was last updated.

The format of the LFT:

--------------------------------------------
No | Router 1 IP | Router 2 IP | Update Time
--------------------------------------------

No: The entry number.

Router 1 IP: The IP address of one router that a failed link connects
to.

Router 2 IP: The IP address of the other router that the failed link
connects to.

Update Time: The time at which the entry was last updated.

The format of the BRT:

-------------------------------------------------------
Destination | Mask | Next Hop | Interface | Update Time
-------------------------------------------------------

Destination: A destination network.

Mask: The subnet mask of the destination network.

Next Hop: The IP address of the next hop for a destination.

Interface: The interface related to the next hop.

Update Time: The time at which the entry was last updated.

The format of the NRT:

----------------------------------------------------------------------
Destination | Mask | Next Hop | Interface | Failed Link No | Timestamp
----------------------------------------------------------------------

Destination: A destination network.

Mask: The subnet mask of the destination network.

Next Hop: The IP address of a next hop that should be avoided for the
destination.
Interface: The interface related to the next hop that should be
avoided.

Failed Link No: A group of failed-link numbers separated by "/", for
example 1/2/3.

Timestamp: The time at which the entry was last updated.

5.2.  Messages

Some protocol messages are exchanged between routers in FAR:

Hello Message: Exchanged between neighbor routers to learn adjacency.

Device Announcement (DA): Synchronizes the knowledge of routers
between routers.

Link Failure Announcement (LFA): Synchronizes link failures between
routers.

Device and Link Request (DLR): When a router starts, it requests the
knowledge of routers and links from its neighbors with a DLR message.

A FAR message is directly encapsulated in an IP packet.  The protocol
field of the IP header indicates that an IP packet is a FAR message.
The IP protocol number for FAR should be assigned by IANA.

The four types of FAR messages share the same packet header format,
called the FAR header (as shown in Figure 2).

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+

          Figure 2: The format of the FAR header

Version: The FAR version.

Message Type: The type of the FAR message.

Message Length: The total length of the FAR message.

Checksum: The checksum of the entire FAR message.

AuType: Authentication type.  0: no authentication, 1: plaintext
authentication, 2: MD5 authentication.

Authentication: Authentication information.
0: undefined; 1: key; 2: key ID, MD5 data length and packet number.
The MD5 data is appended to the end of the packet.

The AuType and Authentication fields follow the definitions of the
OSPF packet header.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                    Router IP                    |
+------------------------+------------------------+
|      HelloInterval     |   RouterDeadInterval   |
+------------------------+------------------------+
|               Neighbor Router IP                |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

          Figure 3: The Format of Hello Messages

For Hello messages, the Message Type in the FAR header is set to 1.
Besides the FAR header, a Hello message (Fig. 3) requires the
following fields:

Router IP: The router's IP address.

HelloInterval: The interval at which Hello messages are sent to
neighbor routers.

RouterDeadInterval: The interval after which a neighbor router is
declared dead (out-of-service).  If a router does not receive a Hello
message from a neighbor router within this interval, the neighbor
router is treated as dead.

Neighbor Router IP: The IP address of a neighbor router.  All of the
neighbor routers' addresses should be included in a Hello message.
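As an informal illustration of the header and Hello layout above, the
following sketch packs a Hello message with Python's struct module.
The field widths (1-byte Version and Message Type, 2-byte lengths, an
8-byte Authentication field, 4-byte Timestamp and addresses), the
network byte order, and the zeroed checksum are assumptions inferred
from Figures 2 and 3, not normative definitions.

```python
# Sketch: pack a FAR Hello message (header per Figure 2, body per
# Figure 3).  Field sizes and byte order are inferred, not normative.
import socket
import struct

FAR_VERSION = 1
MSG_HELLO = 1  # Message Type 1 = Hello

def pack_hello(router_ip, hello_interval, dead_interval, neighbor_ips,
               timestamp=0, autype=0, auth=b"\x00" * 8):
    """Build a Hello message: 20-byte FAR header + Hello fields."""
    neighbors = b"".join(socket.inet_aton(ip) for ip in neighbor_ips)
    body = (socket.inet_aton(router_ip)
            + struct.pack("!HH", hello_interval, dead_interval)
            + neighbors)
    length = 20 + len(body)  # header: 1+1+2+2+2+8+4 bytes
    header = struct.pack("!BBHHH8sI", FAR_VERSION, MSG_HELLO, length,
                         0, autype, auth, timestamp)  # checksum left 0
    return header + body

msg = pack_hello("10.1.1.1", 10, 40, ["10.1.1.2", "10.1.2.1"])
print(len(msg))  # 36: 20-byte header + 8 fixed body bytes + 2 * 4
```

A receiver would unpack the same layout with `struct.unpack("!BBHHH8sI",
msg[:20])` before interpreting the type-specific body.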
|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                    Router1 IP                   |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

           Figure 4: The Format of DA Messages

For DA messages (Fig. 4), the Message Type in the FAR header is set to
2.  Besides the FAR header, a DA message includes the IP addresses of
all the announced routers.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                     Left IP                     |
+-------------------------------------------------+
|                     Right IP                    |
+-------------------------------------------------+
|                      State                      |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

           Figure 5: The Format of LFA Messages

For LFA messages (Fig. 5), the Message Type in the FAR header is set
to 3.  Besides the FAR header, an LFA message includes all the
announced link failures.

Left IP: The IP address of the left endpoint router of a link.
Right IP: The IP address of the right endpoint router of a link.

State: The link state.  0: Up, 1: Down.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+

           Figure 6: The Format of DLR Messages

For DLR messages (Fig. 6), the Message Type in the FAR header is set
to 4.  Except for the FAR header, a DLR message has no additional
fields.

6.  FAR Modules

6.1.  Neighbor and Link Detection Module (M1)

M1 is responsible for sending and receiving Hello messages and for
detecting directly connected links and neighbor routers.  Each Hello
message is encapsulated in a UDP packet.  M1 sends Hello messages
periodically on all the active router ports and receives Hello
messages from its neighbor routers.  M1 detects neighbor routers and
directly connected links according to the received Hello messages and
stores these neighbors and links in a Neighbor Devices Table (NDT).
Additionally, M1 also stores the neighbor routers in an All Devices
Table (ADT).

6.2.  Device Learning Module (M2)

M2 is responsible for sending, receiving, and forwarding device
announcement (DA) messages, learning all the routers in the whole
network, and deducing failed routers.  When a router starts, it sends
a DA message announcing itself to its neighbors and a DLR message
requesting the knowledge of routers and links from its neighbors.  If
the M2 module of a router receives a DA message, it checks whether the
router encapsulated in the message is in its ADT.
If the 723 router is not in the ADT, M2 puts this router into the ADT and 724 forwards the DA message to all the active ports except the 725 incoming one; otherwise, M2 discards the message. If the M2 726 module of a router receives a DLR message, it replies with a DA message 727 that encapsulates all of the learned routers. 729 6.3. Invisible Neighbor and Link Failure Inferring Module (M3) 731 M3 is responsible for inferring invisible neighbors of the current 732 router by means of the ADT. If the link between router A and its 733 neighbor B breaks, so that A's M1 module cannot detect 734 the existence of B, then B is an invisible neighbor of A. Since a 735 device's location is encoded in its IP address, whether two routers 736 are adjacent can be judged from their IP addresses. 737 Based on this idea, M3 infers all of the invisible neighbors of the 738 current router and the related link failures. The results are stored 739 in the NDT. Moreover, link failures are also added to a link- 740 failure table (LFT), which stores all of the failed links in the 741 entire network. 743 6.4. Link Failure Learning Module (M4) 745 M4 is responsible for sending, receiving, and forwarding link failure 746 announcement (LFA) messages and learning all the link failures in the whole 747 network. M4 broadcasts each newly inferred link failure to all the 748 routers in the network. Each link failure is encapsulated in an LFA 749 message, and each link failure is broadcast only once. If a router 750 receives a DLR request from its neighbor, it replies, through its M4 module, 751 with an LFA message that encapsulates all the learned link failures. 752 If M4 receives an LFA message, it checks whether the link failure 753 encapsulated in the message is already in its LFT.
If the link failure is not 754 in the LFT, M4 puts this link failure into the LFT and forwards the 755 LFA message to all the active ports except the incoming one; 756 otherwise, M4 discards the message. 758 6.5. BRT Building Module (M5) 760 M5 is responsible for building a BRT for the current router. By 761 leveraging the regularity of the topology, M5 can calculate the routing 762 paths for any destination without knowledge of the topology of the 763 whole network, and then builds the BRT based on the NDT. Since the IP 764 addresses of network devices are contiguous, M5 creates only one 765 route entry for a group of destination addresses that share the same 766 network prefix, by means of route aggregation. Usually, 767 the size of a BRT is very small. The details of how to build a BRT are 768 described in Section 5. 770 6.6. NRT Building Module (M6) 772 M6 is responsible for building an NRT for the current router. Because 773 M5 builds the BRT without considering link failures in the network, the 774 routing paths calculated from the BRT cannot avoid failed links. To 775 solve this problem, an NRT is used to exclude, from the paths calculated 776 by the BRT, the routing paths that traverse failed links. M6 777 calculates the routing paths that include failed links and stores them 778 in the NRT. The details of how to build an NRT are described in 779 Section 5. 781 6.7. Routing Table Lookup (M7) 783 M7 is responsible for querying the routing tables and selecting the next 784 hop for forwarding packets. First, M7 uses the destination 785 address of the packet being forwarded as the key to look up route 786 entries in the BRT based on longest-prefix match. All of the matched 787 entries together form a candidate hop list. Second, M7 looks up 788 negative route entries in the NRT using the destination address of the 789 packet as the key.
This lookup is not limited to the 790 longest-prefix match; any entry that matches the destination is 791 selected and added to an avoiding hop list. Third, the 792 candidate hops minus the avoiding hops form an applicable hop 793 list. Finally, M7 sends the packet to any one of the 794 applicable hops. If the applicable hop list is empty, the 795 packet is dropped. 797 7. How a FAR Router Works 799 Figure 7 shows how a FAR router works, as a finite state machine (FSM). 801 /---------------\ 802 | Start | 803 | | 804 \---------------/ 805 | 806 +--------+ |1 807 | | | 808 | \|/ \|/ 809 | +--------------------+ 810 |2 | ND | 811 | | | 812 | +--------------------+ 813 | | | 814 | | |3 815 +--------+ | 816 \|/ 817 +--------------+ 818 | ND-FIN | 819 | | 820 +--------------+ 821 | 822 |4 823 | 824 \|/ 825 ________10_______\+--------------+/_______11________ 826 | /| |\ | 827 | | Listen | | 828 | ____9_______\| |/_______12___ | 829 | | /| |\ | | 830 | | +--------------+ | | 831 | | 5/ | | \8 | | 832 | | |/_ | | _\| | | 833 | | +------------+ 6| |7 +------------+ | | 834 | --| HELLO-RECV | | | | LFA-RECV |-- | 835 | +------------+ | | +------------+ | 836 | ______| |______ | 837 | | | | 838 | \|/ \|/ | 839 | +------------+ +------------+ | 840 |___________| DLR-RECV | | DA-RECV |__________| 841 +------------+ +------------+ 843 _________ 844 | | 845 | | 846 \|/ | 847 +--------------+ |13 848 | Hello Thread | | 849 +--------------+ | 850 | | 851 |_________| 853 _________ 854 | | 855 | | 856 \|/ | 857 +--------------+ |14 858 | LFD Thread | | 859 +--------------+ | 860 | | 861 |_________| 863 _________ 864 | | 865 | | 866 \|/ | 867 +--------------+ |15 868 | DA Thread | | 869 +--------------+ | 870 | | 871 |_________| 873 _________ 874 | | 875 | | 876 \|/ | 877 +--------------+ |16 878 | DFD Thread | | 879 +--------------+ | 880 | | 881 |_________| 883 Figure 7: The Finite State Machine of FAR Router 885 1) When a router starts up, it starts a Hello thread
and then starts the 886 ND (neighbor detection) timer (3 seconds). Next, the router goes into the 887 ND state. 888 2) In the ND state, if a router receives a Hello message, it 889 performs Hello-message processing and goes back to the ND state. 890 3) When the ND timer expires, the router goes into the ND-FIN (neighbor 891 detection finished) state. 892 4) The router starts the LFD (link failure detection) and DFD 893 (device failure detection) threads, and sends a DA message and a DLR 894 message to all of its active ports. Then the router goes into the Listen 895 state. 896 5) If a router receives a Hello message, it goes into the HELLO-RECV 897 state. 899 6) If a router receives a DLR message, it goes into the DLR-RECV state. 900 7) If a router receives a DA message, it goes into the DA-RECV state. 901 8) If a router receives an LFA message, it goes into the LFA-RECV state. 902 9) The router performs Hello-message processing. After that, it 903 goes back to the Listen state. 904 10) The router performs DLR-message processing. After that, it 905 goes back to the Listen state. 906 11) The router performs DA-message processing. After that, it goes 907 back to the Listen state. 908 12) The router performs LFA-message processing. After that, it 909 goes back to the Listen state. 910 13) The Hello thread produces and sends Hello messages to all ports 911 periodically. 912 14) The LFD thread calls link-failure-detection processing to check 913 all links for failures periodically. 914 15) The DA thread produces and sends DA messages periodically (every 30 915 minutes). 916 16) When the DFD thread starts up, it sleeps for a short time (30 seconds) 917 to allow the router to learn all the active routers in the network. 918 Then the thread calls device-failure-detection processing to 919 check for device failures periodically (every 30 minutes). 921 8. Compatible Architecture 923 As a generic routing protocol, FAR can run in various DCNs with a 924 regular topology.
Up to now, we have implemented the FAR protocol 925 for four types of DCN: Fat-tree, BCube, MatrixDCN, and Diamond. 927 Across different network architectures, most of FAR's processing is the 928 same; only the calculation of routing tables differs. In FAR, BRTs are 929 calculated from Hello messages and NRTs are 930 calculated from LFA messages. Therefore, to extend FAR to support a 931 new network architecture, only the Hello- and LFA-message processing 932 that builds the BRT and NRT routing tables needs to be provided. 934 FAR can support at most 12 network architectures, 935 at least one of which is built in, such as Fat- 936 tree, BCube, or MatrixDCN. Each network architecture is assigned 937 a unique number from 1 to 12. For example, the built-in 938 architectures may be assigned the lowest numbers, with customized 939 architectures taking the remaining numbers up to 12: 940 1: Fat-tree 941 2: BCube 942 3: MatrixDCN 943 4: xxx 944 ...... 945 12: xxx 947 9. Application Example 949 In this section, we take a Fat-tree network (Fig. 8) as an example to 950 describe how to apply FAR routing. Since M1 to M4 are very simple, 951 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 952 network. 954 A Fat-tree network is composed of four layers. The top layer is the core 955 layer, and the layers below it are the aggregation layer, the edge layer, 956 and the server layer. There are k pods, each one containing two layers of 957 k/2 switches. Each k-port switch in the edge layer is directly 958 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 959 of the k-port switches in the aggregation layer. There are (k/2)^2 960 k-port core switches. Each core switch has one port connected to 961 each of the k pods.

       10.0.1.1    10.0.1.2    10.0.2.1    10.0.2.2
        +----+      +----+      +----+      +----+
        |core|      |core|      |core|      |core|
        +----+      +----+      +----+      +----+
          (each core switch has one link into every
               pod; these links are not drawn)

          Pod 1                        Pod 3
    10.1.0.1  10.1.0.2    ...    10.3.0.1  10.3.0.2
     +----+    +----+             +----+    +----+
     |aggr|    |aggr|     ...     |aggr|    |aggr|
     +----+    +----+             +----+    +----+
       |  \    /  |                 |  \    /  |
       |   \  /   |                 |   \  /   |
       |    \/    |                 |    \/    |
       |    /\    |                 |    /\    |
    10.1.1.1  10.1.2.1           10.3.1.1  10.3.2.1
     +----+    +----+             +----+    +----+
     |edge|    |edge|     ...     |edge|    |edge|
     +----+    +----+             +----+    +----+
      /  \      /  \               /  \      /  \
     ..  ..  10.1.2.2 ..      10.3.1.3 ..   ..  ..

993 Figure 8: Fat-tree Network 995 Aggregation switches are given addresses of the form 10.pod.0.switch, 996 where pod denotes the pod number, and switch denotes the position of 997 that switch in the upper pod (in [1, k/2]). Edge switches are given 998 addresses of the form 10.pod.switch.1, where pod denotes the pod 999 number, and switch denotes the position of that switch in the lower 1000 pod (in [1, k/2]). The core switches are given addresses of the form 1001 10.0.j.i, where j and i denote that switch's coordinates in the 1002 (k/2)^2 core switch grid (each in [1, k/2], starting from top-left). 1003 The address of a host follows that of the edge switch to which it is 1004 connected; hosts have addresses of the form 10.pod.switch.ID, where ID is 1005 the host's position in that subnet (in [2, k/2+1], starting from left 1006 to right). 1008 9.1. BRT Building Procedure 1010 By leveraging the topology's regularity, every switch knows exactly 1011 how to forward a packet. When a packet arrives at an edge switch, 1012 if the destination of the packet lies in the same subnet as the 1013 switch, the switch directly forwards the packet to the 1014 destination server through layer-2 switching.
Otherwise, the switch 1015 forwards the packet to any of the aggregation switches in the same pod. 1016 When a packet arrives at an aggregation switch, if the destination of 1017 the packet lies in the same pod, the switch forwards the packet to 1018 the corresponding edge switch. Otherwise, the switch forwards the 1019 packet to any of the core switches that it is connected to. If a core 1020 switch receives a packet, it forwards the packet to the corresponding 1021 aggregation switch that lies in the destination pod. 1023 The forwarding policy discussed above is easily expressed through a 1024 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1025 following entries: 1027 Destination/Mask Next hop 1028 10.0.0.0/255.0.0.0 10.1.0.1 1029 10.0.0.0/255.0.0.0 10.1.0.2 1031 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1032 the following entries: 1034 Destination/Mask Next hop 1035 10.1.1.0/255.255.255.0 10.1.1.1 1036 10.1.2.0/255.255.255.0 10.1.2.1 1037 10.0.0.0/255.0.0.0 10.0.1.1 1038 10.0.0.0/255.0.0.0 10.0.1.2 1040 The BRT of a core switch, such as 10.0.1.1, is composed of the 1041 following entries: 1043 Destination/Mask Next hop 1044 10.1.0.0/255.255.0.0 10.1.0.1 1045 10.2.0.0/255.255.0.0 10.2.0.1 1046 10.3.0.0/255.255.0.0 10.3.0.1 1047 10.4.0.0/255.255.0.0 10.4.0.1 1049 9.2. NRT Building Procedure 1051 The route entries in an NRT are related to link and node failures. 1052 We summarize all the cases into three (3) categories. 1054 9.2.1. Single Link Failure 1056 In Fat-tree, links can be classified into 3 types by their locations: 1057 1) servers to edge switches; 2) edge to aggregation switches; 3) 1058 aggregation to core switches. Failures of links between servers and edge 1059 switches only affect the communication of the corresponding servers 1060 and do not affect the routing tables of any switch, so we only discuss 1061 the second and third types of link failures.
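Because a device's location is encoded in its address (Section 9), both a device's role and a link's type can be derived purely from IP addresses. The following Python sketch illustrates this; the function names and parsing details are illustrative assumptions, not part of the FAR protocol itself:

```python
def role(ip):
    """Infer a Fat-tree device's role from its 10.pod.switch.ID address."""
    _, pod, sw, host = (int(x) for x in ip.split("."))
    if pod == 0:
        return "core"          # 10.0.j.i
    if sw == 0:
        return "aggregation"   # 10.pod.0.switch
    if host == 1:
        return "edge"          # 10.pod.switch.1
    return "server"            # 10.pod.switch.ID, ID >= 2

def link_type(a, b):
    """Classify a link (1, 2 or 3, as in Section 9.2.1) by its endpoints."""
    ends = {role(a), role(b)}
    if ends == {"server", "edge"}:
        return 1   # server to edge switch
    if ends == {"edge", "aggregation"}:
        return 2   # edge to aggregation switch
    if ends == {"aggregation", "core"}:
        return 3   # aggregation to core switch
    raise ValueError("not a valid Fat-tree link")

print(link_type("10.1.2.1", "10.1.0.1"))  # edge-to-aggregation link -> 2
```

The same address arithmetic is what lets module M3 decide, without any topology database, whether two routers should have been adjacent.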
1063 Edge to Aggregation Switches 1065 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1066 and an aggregation switch, such as 10.1.0.1 (B), fails. This link 1067 failure may affect 3 types of communications. 1069 o Sources lie in the same subnet as A, and destinations do not. In 1070 this case, the link failure only affects the routing tables of A. 1071 As this link is attached to A directly, when A's M6 module detects the 1072 link failure, A only needs to delete from its BRT the route entries 1073 whose next hop is B; it adds no entries to its NRT. 1075 o Destinations lie in the same subnet as A, and sources lie in 1076 another subnet of the same pod. In this case, the link failure affects 1077 the routing tables of all the edge switches in the same pod 1078 except A. When an edge switch, such as 10.1.1.1, learns of the link 1079 failure, it adds a route entry to its NRT: 1081 Destination/Mask Next hop 1082 10.1.2.0/255.255.255.0 10.1.0.1 1084 o Destinations lie in the same subnet as A, and sources lie in another 1085 pod. In this case, the link failure affects the routing tables 1086 of all the edge switches in the other pods. When an edge switch in 1087 another pod, such as 10.3.1.1, learns of the link failure, it needs to 1088 add a route entry to its NRT, because every path through 10.3.0.1 to A 1089 necessarily traverses the link between A and B: 1092 Destination/Mask Next hop 1093 10.1.2.0/255.255.255.0 10.3.0.1 1094 Aggregation to Core Switches 1096 Suppose that the link between an aggregation switch, such as 10.1.0.1 1097 (A), and a core switch, such as 10.0.1.2 (B), fails. This link 1098 failure may affect 2 types of communications. 1100 o Sources lie in the same pod (pod 1) as A, and destinations lie in 1101 the other pods. In this case, the link failure only affects the 1102 routing tables of A.
As this link is attached to A directly, when A's M6 module 1103 detects the link failure, A only needs to delete from its BRT the route 1104 entries whose next hop is B; it adds no entries to its NRT. 1106 o Destinations lie in the same pod (pod 1) as A, and sources lie in 1107 another pod. In this case, the link failure affects the routing 1108 tables of all the aggregation switches in pods other than pod 1109 1. When an aggregation switch in another pod, such as 10.3.0.1, 1110 learns of the link failure, it needs to add a route entry to its NRT, 1111 because every path through 10.0.1.2 to pod 1, where A lies, 1112 necessarily traverses the link between A and B: 1114 Destination/Mask Next hop 1115 10.1.0.0/255.255.0.0 10.0.1.2 1117 9.2.2. A Group of Link Failures 1119 If all the uplinks of an aggregation switch fail, this switch 1120 cannot forward packets, which affects the routing of every edge 1121 switch. Suppose that all the uplinks of node A (10.1.0.1) 1122 fail; this affects two types of communications. 1124 o Sources lie in the same pod (pod 1) as A, and destinations lie in 1125 the other pods. In this case, the link failures affect the 1126 routing of the edge switches in A's pod. To avoid node A, 1127 each of these edge switches should remove the route entry "10.0.0.0/255.0.0.0 1128 10.1.0.1", whose next hop is node A. 1130 o Destinations lie in the same pod (pod 1) as A, and sources lie in 1131 other pods. In this case, the link failures affect the routing 1132 of edge switches in other pods. For example, if the edge switch 1133 10.3.1.1 communicates with some node in A's pod, it should avoid 1134 node 10.3.0.1, because any communication through 10.3.0.1 to 1135 A's pod will pass through node A. So a route entry should be 1136 added at 10.3.1.1: 1138 Destination/Mask Next hop 1139 10.1.0.0/255.255.0.0 10.3.0.1 1141 9.2.3.
Node Failures 1143 Finally, we discuss the effect of node failures on an NRT. There are 1144 3 types of node failures: failures of edge, aggregation, and core 1145 switches. 1147 o An edge switch fails. The failure does not affect the routing table 1148 of any switch. 1150 o A core switch fails. Only when all the core switches connected to 1151 the same aggregation switch fail does this affect the routing of 1152 other switches. This case is equivalent to all the uplinks 1153 of an aggregation switch failing, so the handling of link failures 1154 covers it. 1156 o An aggregation switch fails. This case is similar to the case in which 1157 all the uplinks of an aggregation switch fail. It affects the 1158 routing of edge switches in other pods, but does not affect the 1159 routing of edge switches in the pod of the failed switch. The handling 1160 of this failure is the same as in the second case of Section 9.2.2. 1162 9.3. Routing Procedure 1164 FAR decides a route by looking up its BRT and NRT. We illustrate 1165 the routing procedure with an example. In this example, we suppose 1166 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1167 10.1.2.1 and 10.1.0.2 have failed. We then walk through the routing 1168 of a communication from 10.3.1.3 (source) to 10.1.2.2 1169 (destination). 1171 Step 1: The source 10.3.1.3 sends packets to its default router, 1172 10.3.1.1. 1174 Step 2: The routing at 10.3.1.1. 1176 1) Calculate candidate hops 1178 10.3.1.1 looks up its BRT and gets the following matched entries: 1180 Destination/Mask Next hop 1181 10.0.0.0/255.0.0.0 10.3.0.1 1183 So the candidate hops = {10.3.0.1} 1185 2) Calculate avoiding hops 1187 Its NRT is empty, so the set of avoiding hops is empty too. 1189 3) Calculate applicable hops 1191 The applicable hops are the candidate hops minus the avoiding hops, so: 1193 The applicable hops = {10.3.0.1} 1195 4) Forward packets to 10.3.0.1 1197 Step 3: The routing at 10.3.0.1 1199 1) Calculate candidate hops.
1201 10.3.0.1 looks up its BRT and gets the following matched entries: 1203 Destination/Mask Next hop 1204 10.1.0.0/255.255.0.0 10.0.1.1 1205 10.1.0.0/255.255.0.0 10.0.1.2 1207 So the candidate hops = {10.0.1.1, 10.0.1.2} 1209 2) Calculate avoiding hops 1211 Its NRT contains the following matched entry: Destination/Mask Next hop 1212 10.1.0.0/255.255.0.0 10.0.1.2 1214 So the avoiding hops = {10.0.1.2} 1216 3) Calculate applicable hops 1218 The applicable hops are the candidate hops minus the avoiding hops, so: 1220 The applicable hops = {10.0.1.1} 1222 4) Forward packets to 10.0.1.1 1224 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1225 routing tables. 1227 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1228 routing tables. 1230 Step 6: 10.1.2.1 forwards packets to the destination 10.1.2.2 by 1231 layer-2 switching. 1233 9.4. FAR's Performance in Large-scale Networks 1235 FAR performs well in large-scale networks. In this 1236 section, we take a Fat-tree network composed of 2,880 48-port switches 1237 and 27,648 servers as an example to show FAR's performance. 1239 9.4.1. The Number of Control Messages Required by FAR 1241 FAR exchanges only a few messages between routers and consumes 1242 little network bandwidth. Table 1 shows the required messages in the 1243 example Fat-tree network. 1244 Table 1: Required messages in a Fat-tree network.
1245 _____________________________________________________________________ 1246 Message Type | Scope | Size (bytes) | Rate | Bandwidth 1247 --------------------------------------------------------------------- 1248 Hello | adjacent switches | less than 48 | 10 messages/sec | less than 4 kbps 1249 --------------------------------------------------------------------- 1250 DLR | adjacent switches | less than 48 | (1) | 48 bytes 1251 --------------------------------------------------------------------- 1252 DA | entire network | less than 48 | (2) | 1.106M 1253 --------------------------------------------------------------------- 1254 LFA | entire network | less than 48 | (3) | 48 bytes 1255 _____________________________________________________________________ 1256 (1) One message is produced when a router starts 1257 (2) One message per switch (2,880 switches) in each period 1258 (3) One message is produced when a link fails or recovers 1260 9.4.2. The Calculating Time of Routing Tables 1262 A BRT is calculated from the states of a router's neighbor routers 1263 and attached links. An NRT is calculated from the device and 1264 link failures in the entire network. Thus, FAR does not compute the 1265 network topology and has no network-convergence problem, which 1266 greatly reduces the time needed to calculate routing tables. The 1267 detection and spread time of link failures is very short in FAR. 1268 The detection time depends on the interval at which Hello messages are sent. In 1269 FAR, the interval is set to 100 ms, and a link failure is 1270 detected within 200 ms. The spread time between any pair of routers is 1271 less than 200 ms. If a link fails in a data center network, FAR can 1272 detect it, spread it to all the routers, and recalculate routing tables 1273 in no more than 500 ms. 1275 9.4.3. The Size of Routing Tables 1277 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1278 in Table 2.
1280 Table 2: The size of routing tables in FAR 1281 _____________________________________________________________________ 1282 Routing Table | Core Switch | Aggregation Switch | Edge Switch 1283 --------------------------------------------------------------------- 1284 BRT | 48 | 48 | 24 1285 --------------------------------------------------------------------- 1286 NRT | 0 | 14 | 333 1287 _____________________________________________________________________ 1289 The size of a BRT at a switch is determined by the number of its 1290 neighbor switches. In the example network, a core switch has 48 1291 neighbor switches (aggregation switches), so it has 48 entries in its 1292 BRT. Only aggregation and edge switches have NRTs. The NRT size at a 1293 switch is related to the number of link failures in the network. 1294 Suppose that there are 1,000 link failures in the example network; the 1295 number of failed links is then 1.2% of all links, which is a very high 1296 failure ratio. We suppose that link failures are uniformly 1297 distributed over the entire network. The NRT size at an edge switch is 1298 then about 333 entries, and the NRT size at an aggregation switch is 1299 about 14 entries on average. 1301 10. Conclusion 1303 This draft introduces the FAR protocol, a generic routing method and 1304 protocol for data centers that have a regular topology. It uses two 1305 routing tables, a BRT and an NRT, to store the normal routing paths 1306 and the routing paths to avoid, respectively, which makes FAR very simple 1307 and efficient. The two tables are very small. Usually, a 1308 BRT has only a few tens of entries, and an NRT has from a few to 1309 about a dozen entries. 1311 11. Reference 1313 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat, "A 1314 Scalable, Commodity Data Center Network Architecture", ACM SIGCOMM 1315 2008. 1317 12. Acknowledgments 1319 This document is supported by the ZTE Enterprise-University-Research 1320 Joint Project. 1322 13.
Security Considerations 1323 Authors' Addresses 1325 Bin Liu 1326 ZTE Inc. 1327 ZTE Plaza, No.19 East Huayuan Road,Haidian District 1328 Beijing 100191 1329 China 1331 Phone: +86 -010-59932039 1332 Email: 13683610386@139.com 1334 Yantao Sun 1335 Beijing Jiaotong University 1336 No.3 Shang Yuan Cun, Hai Dian District 1337 Beijing 100044 1338 China 1340 Email: ytsun@bjtu.edu.cn 1342 Jing Cheng 1343 Beijing Jiaotong University 1344 No.3 Shang Yuan Cun, Hai Dian District 1345 Beijing 100044 1346 China 1348 Email: yourney.j@gmail.com 1350 Yichen Zhang 1351 Beijing Jiaotong University 1352 No.3 Shang Yuan Cun, Hai Dian District 1353 Beijing 100044 1354 China 1356 Email: snowfall_dan@sina.com 1358 Bhumip Khasnabish 1359 ZTE TX Inc. 1360 55 Madison Avenue, Suite 160 1361 Morristown, New Jersey 07960 1362 USA 1364 Phone: +001-781-752-8003 1365 Email: bhumip.khasnabish@ztetx.com