idnits 2.17.1 draft-sl-rtgwg-far-dcn-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 54 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet has text resembling RFC 2119 boilerplate text. -- The document date (May 23, 2019) is 1793 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'FAT-TREE' is defined on line 1377, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: November 24, 2019 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 ZTE TX Inc. 10 May 23, 2019 12 Generic Fault-avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-12 15 Abstract 17 This draft proposes a generic routing method and protocol for a 18 regular data center network, named as fault-avoidance routing (FAR) 19 protocol. FAR protocol provides a generic routing method for all 20 types of regular topology network architectures that are proposed for 21 large-scale cloud-based data centers over the past few years. FAR 22 protocol is well designed to fully leverage the regularity in the 23 topology and compute its routing table in a simplistic manner. Fat- 24 tree is taken as an example architecture to illustrate how FAR 25 protocol can be applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on November 24, 2019. 44 Copyright Notice 46 Copyright (c) 2019 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 4 63 2. Conventions used in this document . . . . . . . . . . . . . . 5 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 5 66 3.2. Dilemma of conventional routing methods in a large-scale 67 network with giant number nodes of routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 8 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 8 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 9 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 9 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 9 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 10 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 10 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 17 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 17 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 17 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 18 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 18 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 18 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 19 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 19 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 19 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 22 86 9. Application Example . . . . . . . . . . . . . . . . . . . . . 22 87 9.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 24 88 9.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 25 89 9.2.1. Single Link Failure . . . . . . . . . . . . . . . . . 25 90 9.2.2. A Group of Link Failures . . . . . . . . . . . . . . 26 91 9.2.3. Node Failures . . . . . . . . . . . . . . . . . . . . 27 92 9.3. Routing Procedure . . . . . . . . . . . . . . . . . . . . 27 93 9.4. FAR's Performance in Large-scale Networks . . . . . . . . 29 94 9.4.1. The number of control messages required by FAR . . . 29 95 9.4.2. The Calculating Time of Routing Tables . . . . . . . 29 96 9.4.3. The Size of Routing Tables . . . . . . . . . . . . . 30 98 10. Security Considerations . . . . . . . . . . . . . . . . . . . 30 99 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 30 100 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 31 101 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 104 1. Introduction 106 In recent years, with the rapid development of cloud computing 107 technologies, the widely deployed cloud services, such as Amazon EC2 108 and Google search, bring about huge challenges to data center 109 networking (DCN). Today's cloud-based data centers (DCs) require 110 large-scale networks with larger internal bandwidth and smaller 111 transfer delay. However, conventional networks cannot meet such 112 requirements due to limitations in their network architecture. In 113 order to satisfy the requirements of cloud computing services, many 114 new network architectures have been proposed for data centers, such 115 as Fat-tree, MatrixDCN, and BCube. These new architectures can 116 support non-blocking large-scale datacenter networks with more than 117 tens of thousands of physical servers. 119 All of these architectures have regular topologies, which are common 120 features. The regular topology refers to the network topology 121 structure with obvious regularity and symmetry, which is conducive to 122 automatic configuration of the network, such as the Fattree network. 123 In a regular topology, each network node such as a switch or router 124 can be addressed by its location and through a node's address, the 125 node's connections to its neighbors in a network can be determined, 126 and furthermore, the route to the node from other nodes in the 127 network can be determined. So nodes can compute route entries 128 without learning topology. 130 This draft proposes a generic routing method and protocol, fault- 131 avoidance routing (FAR) protocol, for DCNs. This method leverages 132 the regularity in the topologies of data center networks to simplify 133 routing learning and accelerate the query of routing tables. This 134 routing method has a better fault tolerance and can be applied to any 135 DCN with a regular topology. 137 FAR is not a routing protocol to replace generic routing protocols 138 such as OSPF and IS-IS. It cannot be used in general local networks 139 whose topological structures are arbitrary, and whose scales are also 140 not very large. OSPF works very well in such a network. But in a 141 large-scale network with regular topology, FAR has a better 142 performance. Compared with OSPF and IS-IS, FAR has shorter time of 143 network convergence and lower PDU overhead. Furthermore, FAR 144 requires less computing and storage resources, which let FAR routers 145 to run at a lower cost of production than the generic routers. 147 In addition, for each type of network architecture, researchers 148 designed a routing algorithm according to the features of its 149 topology. Because these routing algorithms are different and lack 150 compatibility with each other, it is very difficult to develop a 151 routing protocol for network routers supporting multiple routing 152 algorithms. FAR has better adaptability than these specified routing 153 methods. 155 FAR consists of three components, i.e., link state learning unit, 156 routing table building unit and routing table querying unit. In the 157 link state learning unit, FAR exchanges link failures among routers 158 to establish a consistent knowledge of the entire network. In this 159 stage, the regularity in topology is exploited to infer failed links 160 and routers. In the routing table building unit, FAR builds up two 161 routing tables, i.e., a basic routing table (BRT) and a negative 162 routing table (NRT), for each router according to the network 163 topology and link states. In the last component, routers forward 164 incoming packets by looking up the two routing tables. The matched 165 entries in BRT minus the matched entries in NRT are the final route 166 entries to be used to forward an incoming packet. 168 The remainder of this draft is organized as follows. The problem to 169 be addressed by FAR is described in Section 3. The framework of FAR 170 routing protocol is described in Section 4. Section 5 and 6 171 introduce FAR's data format FAR and modules in detail. Section 7 172 describe how FAR works by finite state machine (FSM). In Section 8, 173 we discussed how FAR works with variable network architectures. 174 Section 9 takes Fat-tree network as an example to illuminate how FAR 175 works. 177 1.1. Acronyms & Definitions 179 DCN - Data Center Network 181 FAR - Fault-Avoidance Routing 183 BRT - Basic Routing Table 185 NRT - Negative Routing Table 187 NDT - Neighbor Devices Table 189 ADT - All Devices Table 191 LFT - Link Failure Table 193 DA - Device Announcement 194 LFA - Link Failure Announcement 196 DLR - Device and Link Request 198 IP - Internet Protocol 200 UDP - User Datagram Protocol 202 VM - Virtual Machine 204 2. Conventions used in this document 206 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 207 NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 208 this document are to be interpreted as described in RFC-2119 209 [RFC2119]. In this document, these words will appear with that 210 interpretation only when in ALL CAPS. Lower case uses of these words 211 are not to be interpreted as carrying RFC-2119 significance. 213 3. Problem Statement 215 The problem to be addressed by FAR as proposed in this draft is 216 described in this section. The expansion of Cloud data center 217 networks has brought significant challenges to the existing routing 218 technologies. FAR mainly solves a series of routing problems faced 219 by large-scale data center networks. 221 3.1. The Impact of Large-scale Networks on Route Calculation 223 In a large-scale cloud data center network, there may be thousands of 224 routers. Running OSPF and other routing protocols in such network 225 will encounter these two challenges: a) Network convergence time 226 would be too long, which will cause a longer time to elapse for 227 creating and updating the routes. The response time to network 228 failures may be excessively long; b) a large number of routing 229 protocol packets need to be sent. The routing information consumes 230 too much network bandwidth and CPU resources, which easily leads to 231 packet loss and makes the problem (a) more prominent. 233 In order to solve these problems, a common practice is to partition a 234 large network into some small areas, where the route calculation runs 235 independently within different areas. However, nowadays the cloud 236 data centers typically require very large internal bandwidth. To 237 meet this requirement, a large number of parallel equivalent links 238 are deployed in a network, such as a Fat-tree network. Partitioning 239 such a network will affect the utilization of routing algorithm on 240 equivalent multi-path and reduce internal network bandwidth 241 requirements. 243 In the FAR routing calculation process, a Basic Routing Table (BRT) 244 is built on local network topology leveraging the regularity of the 245 network topologies. In addition to BRT, FAR also builds a Negative 246 Routing Table (NRT). FAR gradually builds NRT in the process of 247 learning network link failure information, which does not require 248 learning a complete network fault information. FAR does not need to 249 wait for the completion of the network convergence in the process of 250 building these two tables. Therefore, it avoids the problem of 251 excessive network convergence overheads in the route calculation 252 process. In addition, FAR only needs to exchange a small amount of 253 link change information between routers, and hence consumes less 254 network bandwidth. 256 3.2. Dilemma of conventional routing methods in a large-scale network 257 with giant number nodes of routers 259 There are many real world scenario where tens of thousands of 260 nodes(or much more nodes) need to be deployed in a flat area, such as 261 infiniband routing and switching system, high-performance computer 262 network, and many IDC networks in China. The similar problems have 263 been existed long ago. People have solved the problems through 264 similar solutions, such as the traditional regular topology-based 265 RFC3619 protocol, the routing protocols of infiniband routing and 266 switching system, and high-performance computer network routing 267 protocol. 268 Infiniband defines a switch-based network to interconnect processing 269 nodes and the I/O nodes. Infiniband can support very large scale 270 networks, use the regularity in topology to simplify its routing 271 algorithm, which is just the same to what we do in FAR. 273 Why OSPF and other conventional routing methods do not work well in a 274 large-scale network with giant number nodes of routers? 276 As everyone knows, the OSPF protocol uses multiple databases, more 277 topological exchange information (as seen in the following example) 278 and complicated algorithm. It requires routers to consume more 279 memory and CPU processing capability. But the processing rate of CPU 280 on the protocol message per second is very limited. When the network 281 expands, CPU will quickly approach its processing limits, and at this 282 time OSPF can not continue to expand the scale of the management. 283 The SPF algorithm itself does not thoroughly solve these problems. 285 On the contrary, FAR does not have the convergence time delay and the 286 additional CPU overheads, which SPF requires. Because in the initial 287 stage, FAR already knows the regular information of the whole network 288 topology and does not need to periodically do SPF operation. 290 One of the examples of "more topological exchange information": 292 In the OSPF protocol, LSA floods every 1800 seconds. Especially in 293 the larger network, the occupation of CPU and band bandwidth will 294 soon reach the router's performance bottleneck. In order to reduce 295 these adverse effects, OSPF introduced the concept of Area, which 296 still has not solved the problem thoroughly. By dividing the OSPF 297 Area into several areas, the routers in the same area do not need to 298 know the topological details outside their area. (In comparison with 299 FAR, after OSPF introducing the concept of Area, the equivalent paths 300 cannot be selected in the whole network scope) 302 OSPF can achieve the following results by Area : 1) Routers only need 303 to maintain the same link state databases as other routers within the 304 same Area, without the necessity of maintaining the same link state 305 database as all routers in the whole OSPF domain. 2) The reduction 306 of the link state databases means dealing with relatively fewer LSA, 307 which reduces the CPU consumption of routers; 3) The large number of 308 LSAs flood only within the same Area. But, its negative effect is 309 that the smaller number of routers which can be managed in each OSPF 310 area. On the contrary, because FAR does not have the above 311 disadvantages, FAR can also manage large-scale network even without 312 dividing Areas. 314 The aging time of OSPF is set in order to adapt to routing 315 transformation and protocol message exchange happened frequently in 316 the irregular topology. Its negative effect is: when the network 317 does not change, the LSA needs to be refreshed every 1800 seconds to 318 reset the aging time. In the regular topology, as the routings are 319 fixed, it does not need the complex protocol message exchange and 320 aging rules to reflect the routing changes, as long as LFA mechanism 321 in the FAR is enough. 323 Compared with the LSVR protocol, the LSVR protocol has no special 324 requirements for the network topology structure, however, the FAR 325 draft is applicable to the regular topology network architecture and 326 simplifies unnecessary processing. It is a solution proposed to 327 greatly improve the routing efficiency of the regular network 328 topology. The FAR solution is more efficient than the general 329 methods such as LSVR in regular topology. 331 Therefore, in FAR, we can omit many unnecessary processing and the 332 packet exchange. The benefits are fast convergence speed and much 333 larger network scale than other dynamic routing protocol. Now there 334 are some successful implementations of simplified routings in the 335 regular topology in the HPC environment. Conclusion: As FAR needs 336 few routing entries and the topology is regular, the database does 337 not need to be updated regularly. Without the need for aging, there 338 is no need for CPU and bandwidth overhead brought by LSA flood every 339 30 minutes, so the expansion of the network has no obvious effect on 340 the performance of FAR, which is contrary to OSPF. 342 Comparison of convergence time: The settings of OSPF spf_delay and 343 spf_hold_time can affect the change of convergence time. The 344 convergence time of the network with 2480 nodes is about 15-20 345 seconds; while the FAR does not need to calculate the SFP, so there 346 is no such convergence time. 347 These issues still exist in rapid convergence technology of OSPF 348 ,ISIS (such as I-SPF) and LSVR. The convergence speed and network 349 scale constraint each other. FAR does not have the above problems, 350 and the convergence time is almost negligible. 351 Can FRR solve these problems? IP FRR has some limitations. The 352 establishment of IP FRR backup scheme will not affect the original 353 topology and traffic forwarding which are established by protocol, 354 however, we can not get the information of whereabouts and status 355 when the traffic is switched to an alternate next hop. 357 3.3. Network Addressing Issues 359 Routers are typically configured with multiple network interfaces, 360 each connected to a subnet. OSPF and other routing algorithms 361 require that each interface of a router must be configured with an IP 362 address. A large-scale data center network may contain thousands of 363 routers and each router has dozens of network interfaces, thus, there 364 are tens of thousands of IP addresses needed to be configured in a 365 data center. It will be very complex to configure and manage a large 366 number of network interfaces and will be difficult to troubleshoot 367 network problems, then network maintenance will be costly and error- 368 prone. 370 In FAR, the device position information is encoded in the IP address 371 of the router. Each router only needs to be assigned a unique IP 372 address according its location, which greatly solves complex network 373 addressing issues in large-scale networks. 375 3.4. Big Routing Table Issues 377 There are a large number of subnets in the large-scale data center 378 network. A router may build a routing entry for each subnet, and 379 therefore the size of routing tables on each router may be very 380 large. It will increase a router's cost and reduce the querying 381 speed of the routing table. 383 FAR uses two measures to reduce the size of its routing tables: a)It 384 builds a BRT on the regularity of the network topologies; b)It 385 introduces a new routing table, i.e., a NRT. In this way FAR can 386 reduce the size of routing tables to only a few dozen routing 387 entries. 389 3.5. Adaptivity Issues for Routing Algorithms 391 To implement efficient routing in large-scale datacenters, besides 392 FAR, some other routing methods are proposed for some specific 393 network architectures, such as Fat-tree and BCube. These routing 394 methods are different(from both design and implementation viewpoints) 395 and not compatible with the conventional routing methods, which 396 brings big troubles to network equipment providers to develop new 397 routers supporting various new routing methods. 399 FAR is a generic routing method. With slight modification, FAR 400 method can be applied to most of regular datacenter networks. 401 Furthermore, the structure of routing tables and querying a routing 402 table in FAR are the same as conventional routing method. If FAR is 403 adopted, the workload of developing a new type of router will be 404 significantly decreased. 406 3.6. Virtual Machine Migration Issues 408 Supporting VM migration is very important for cloud-based datacenter 409 networks. However, in order to support layer-3 routing, routing 410 methods including OSPF and FAR require limiting VM migration within a 411 subnet. For this paradox, the mainstream methods still utilize 412 layer-3 routing on routers or switches, transmit packets encapsulated 413 by IPinIP or MACinIP between hosts by tunnels passing through network 414 to the destination access switch, and then extract original packet 415 out and send it to the destination host. 417 By utilizing the aforementioned methods, FAR can be applied to Fat- 418 tree, MatrixDCN or BCube networks for supporting VM migration in 419 entire network. 421 4. The FAR Framework 423 FAR requires that a DCN has a regular topology, and network devices, 424 including routers, switches, and servers, are assigned IP addresses 425 according to their locations in the network. In other word, we can 426 locate a device in the network according to its IP address. 428 FAR is a distributed routing method. In order to support FAR, each 429 router needs to have a routing module that implements the FAR 430 algorithm. FAR algorithm is composed of three parts, i.e., link- 431 state learning, routing table building and routing table querying, as 432 shown in Fig. 1. 434 Link-state Learning |Routing Table | Routing Table 435 | Building 436 | | Querying 437 +--------+ /---------------\ | +--------+ | Packets 438 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | 439 |Learning| | Link Detection| | |Building|\ | | 440 +--------+ \---------------/ | +--------+ \ | \|/ 441 | | | \ |+--------------+ 442 | | | /|| 7 Querying | 443 \|/ \|/ | / || Routing Table| 444 +-----------------------+| / |+--------------+ 445 |3 Invisible Neighbor & || / | 446 |Link Failure Inferring || +--------- | 447 +-----------------------+|/| 6 NRT | | 448 | / |Building| | 449 \|/ /| +--------+ | 450 +--------------+ / | | 451 |4 Link Failure| / | | 452 | Learning | | | 453 +--------------+ | | 454 | | 456 Figure 1: The FAR framework 458 Link-state learning is responsible for a router to detect the states 459 of its connected links and learn the states of all the other links in 460 the entire network. The second part builds two routing tables, a 461 basic routing table (BRT) and an negative routing table (NRT), 462 according to the learned link states in the first part. The third 463 part queries the BRT and the NRT to decide a next forwarding hop for 464 the received (ingress) packets. 466 5. Data Format 468 5.1. Data Tables 470 Some data tables are maintained on each router in FAR. They are: 472 Neighbor Device Table (NDT): To store neighbor routers and related 473 links. 475 All Devices Table (ADT): To store all routers in the entire network. 477 Link Failures Table (LFT): To store all link failures in the entire 478 network. 480 Basic Routing Table (BRT): To store the candidate routes. 482 Negative Routing Table(NRT): To store the avoiding routes. 484 The format of NDT 486 ---------------------------------------------------------- 487 Device ID | Device IP | Port ID | Link State | Update Time 488 ---------------------------------------------------------- 490 Device ID: The ID of a neighbor router. 492 Device IP: The IP address of a neighbor router. 494 Port ID: The port ID that a neighbor router is attached to. 496 Link State: The state of the link between a router and its neighbor 497 router. There are two states: Up and Down. 499 Update Time: The time of updating the entry. 501 The format of ADT 503 -------------------------------------------------- 504 Device ID | Device IP | Type | State | Update Time 505 -------------------------------------------------- 507 Device ID: The ID of a neighbor router. 509 Device IP: The IP address of a neighbor router. 511 Type: The type of a neighbor router. 513 State: The state of a neighbor router. There are two states: Up and 514 Down. 516 Update Time: The time of updating the entry. 518 The format of LFT 520 -------------------------------------------- 521 No | Router 1 IP | Router 2 IP | Timestamp 522 -------------------------------------------- 524 No: The entry number. 526 Router 1 IP: The IP address of one router that a failed link connects 527 to. 529 Router 2 IP: The IP address of another router that a failed link 530 connects to. 532 Timestamp: It identifies when the entry is created. 534 The format of BRT 536 ------------------------------------------------------- 537 Destination | Mask | Next Hop | Interface | Update Time 538 ------------------------------------------------------- 540 Destination: A destination network 542 Mask: The subnet mask of a destination network. 544 Next Hop: The IP address of a next hop for a destination. 546 Interface: The interface related to a next hop. 548 Update Time: The time of updating the entry. 550 The format of NRT 552 ------------------------------------------------------------------- 553 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 554 ------------------------------------------------------------------- 556 Destination: A destination network. 558 Mask: The subnet mask of a destination network. 560 Next Hop: The IP address of a next hop that should be avoided for a 561 destination. 563 Interface: The interface related to a next hop that should be 564 avoided. 566 Failed Link No: A group of failed link numbers divided by "/", for 567 example 1/2/3. 569 Timestamp: The time of updating the entry. 571 5.2. Messages 573 Some protocol messages are exchanged between routers in FAR. 575 Hello Message: This message is exchanged between neighbor routers to 576 learn adjacency. 578 Device Announcement (DA): Synchronize the knowledge of routers 579 between routers. 581 Link Failure Announcement (LFA): Synchronize link failures between 582 routers. 584 Device and Link Request (DLR): When a router starts, it requests the 585 knowledge of routers and links from its neighbors by a DLR message. 587 A FAR Message is directly encapsulated in an IP packet. The protocol 588 field of IP header indicates an IP packet is an FAR message. The 589 protocol of IP for FAR should be assigned by IANA. 591 The four types of FAR messages have same format of packet header, 592 called FAR header (as shown in Figure 2). 594 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 595 +-----------+------------+------------------------+ 596 | Version |Message Type| Message Length | 597 +-----------+------------+------------------------+ 598 | Checksum | AuType | 599 +------------------------+------------------------+ 600 | Authentication | 601 +-------------------------------------------------+ 602 | Authentication | 603 +-------------------------------------------------+ 604 | Timestamp | 605 +-------------------------------------------------+ 607 Figure 2: The format of FAR header 609 Version: FAR version 611 Message Type: The type of FAR message. 613 Packet Length: The packet length of the total FAR message. 615 Checksum: The checksum of an entire FAR message. 617 AuType:Authentication type. 0: no authentication, 1: Plaintext 618 Authentication, 2: MD5 Authentication. 620 Authentication: Authentication information. 0: undefined, 1: Key, 2: 621 key ID, MD5 data length and packet number. MD5 data is appended to 622 the backend of the packet. 624 AuType and Authentication can refer to the definition of OSPF packet. 626 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 627 +-----------+------------+------------------------+ 628 | Version |Message Type| Message Length | 629 +-----------+------------+------------------------+ 630 | Checksum | AuType | 631 +------------------------+------------------------+ 632 | Authentication | 633 +-------------------------------------------------+ 634 | Authentication | 635 +-------------------------------------------------+ 636 | Timestamp | 637 +-------------------------------------------------+ 638 | Router IP | 639 +------------------------+------------------------+ 640 | HelloInterval | HelloDeadInterval | 641 +------------------------+------------------------+ 642 | Neighbor Router IP | 643 +-------------------------------------------------+ 644 | ... | 645 +-------------------------------------------------+ 647 Figure 3: The Format of Hello Messages 649 For Hello messages, the Message Type in FAR header is set to 650 1.Besides FAR header, a Hello message(Fig. 3) requires the following 651 fields: 653 Router IP: The router IP address. 655 HelloInterval: The interval of sending Hello messages to neighbor 656 routers. 658 RouterDeadInterval: The interval to set a neighbor router dead(out- 659 of-service). If in the interval time, a router doesn't receive a 660 Hello message from its neighbor router, the neighbor router is 661 treated as dead. 663 Neighbor Router IP: The IP address of a neighbor router. All the 664 neighbor router's addresses should be included in a Hello message. 666 |<----1---->| <----1---->|<----------2----------->| 667 +-----------+------------+------------------------+ 668 | Version |Message Type| Message Length | 669 +-----------+------------+------------------------+ 670 | Checksum | AuType | 671 +------------------------+------------------------+ 672 | Authentication | 673 +-------------------------------------------------+ 674 | Authentication | 675 +-------------------------------------------------+ 676 | Timestamp | 677 +------------------------+------------------------+ 678 | Router1 IP | 679 +-------------------------------------------------+ 680 | ... | 681 +-------------------------------------------------+ 683 Figure 4: The Format of DA Messages 685 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 686 Besides FAR header, a DA message includes IP addresses of all the 687 announced routers. 689 |<----1---->| <----1---->|<----------2----------->| 690 +-----------+------------+------------------------+ 691 | Version |Message Type| Message Length | 692 +-----------+------------+------------------------+ 693 | Checksum | AuType | 694 +------------------------+------------------------+ 695 | Authentication | 696 +-------------------------------------------------+ 697 | Authentication | 698 +-------------------------------------------------+ 699 | Timestamp | 700 +------------------------+------------------------+ 701 | Left IP | 702 +-------------------------------------------------+ 703 | Right IP | 704 +------------------------+------------------------+ 705 | State | 706 +-------------------------------------------------+ 707 | ... | 708 +-------------------------------------------------+ 710 Figure 5: The Format of LFA Messages 712 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 713 Besides FAR header, a LFA message includes all the announced link 714 failures. 716 Left IP: The IP address of the left endpoint router of a link. 718 Right IP: The IP address of the right endpoint router of a link. 720 State: Link state. 0: Up, 1: down 721 |<----1---->| <----1---->|<----------2----------->| 722 +-----------+------------+------------------------+ 723 | Version |Message Type| Message Length | 724 +-----------+------------+------------------------+ 725 | Checksum | AuType | 726 +------------------------+------------------------+ 727 | Authentication | 728 +-------------------------------------------------+ 729 | Authentication | 730 +-------------------------------------------------+ 731 | Timestamp | 732 +-------------------------------------------------+ 734 Figure 6: The Format of DLR Messages 736 For DLR messages(Fig. 6), the Message Type in FAR header is set to 737 1.Except for FAR header, DLR has no additional fields. 739 6. FAR Modules 741 6.1. Neighbor and Link Detection Module(M1) 743 M1 is responsible for sending and receiving Hello messages, and 744 detecting directly-connected links and neighbor routers. Each Hello 745 message is encapsulated in a UDP packet. M1 sends Hello messages 746 periodically to all the active router ports and receives Hello 747 messages from its neighbor routers. M1 detects neighbor routers and 748 directly-connected links according to received Hello Messages and 749 stores these neighbors and links into a Neighbor Devices Table (NDT). 750 Additionally, M1 also stores neighbor routers into an All Devices 751 Table (ADT). 753 6.2. Device Learning Module(M2) 755 M2 is responsible for sending, receiving, and forwarding device 756 announcement (DA) messages, learning all the routers in the whole 757 network, and deducing faulted routers. When a router starts, it 758 sends a DA message announcing itself to its neighbors and a DLR 759 message requesting the knowledge of routers and links from its 760 neighbors. If M2 module of a router receives a DA message, it checks 761 whether the router encapsulated in the message is in an ADT. If the 762 router is not in the ADT, M2 puts this router into the ADT and 763 forwards this DA message to all the active ports except for the 764 incoming one, otherwise, M2 discards this message directly. If M2 765 module of a router receives a DLR message, it replies a DA message 766 that encapsulates all of the learned routers. 768 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 770 M3 is responsible for inferring invisible neighbors of the current 771 router by means of the ADT. If the link between a router A and its 772 neighbor B breaks, which results in that M1 module of A cannot detect 773 the existence of B, then B is an invisible neighbor of A. Since a 774 device's location is coded into its IP address, it can be judged 775 whether two routers are adjacent, according to their IP addresses. 776 Based on this idea, M3 infers all of the invisible neighbors of the 777 current router and the related link failures. The results are stored 778 into a NDT. Moreover, link failures also are added into a link- 779 failure table (LFT). LFT stores all of the failed links in the 780 entire network. 782 6.4. Link Failure Learning Module(M4) 784 M4 is responsible for sending, receiving and forwarding link failure 785 announcement (LFA) and learning all the link failures in the whole 786 network. M4 broadcasts each newly inferred link failure to all the 787 routers in the network. Each link failure is encapsulated in a LFA 788 message and one link failure is broadcasted only once. If a router 789 receives a DLR request from its neighbor, it will reply a LFA message 790 that encapsulates all the learned link failures through M4 module. 791 If M4 receives a LFA message, it checks whether the link failure 792 encapsulated in the message is in a LFT by comparing two link ends 793 and timestamp. If the link failure is not in the LFT or timestamp is 794 different, M4 puts this link failure into the LFT (or update 795 timestamp only) and forwards this LFA message to all the active ports 796 except for the incoming one, otherwise, M4 discards this message 797 directly. 799 There is a special case a router will rebroadcast a link failure. If 800 a router receives a data packet and must forward the packet going 801 ahead to destination through a failed link, it means some previous 802 router should avoid this failed link according to its NRT but it 803 doesn't. In this case, maybe the previous router missed the LFA 804 message of the link failure due to some uncertain reasons. So the 805 forwarding router rebroadcasts the LFA message. 807 6.5. BRT Building Module(M5) 809 M5 is responsible for building a BRT for the current router. By 810 leveraging the regularity in topology, M5 can calculate the routing 811 paths for any destination without the knowledge of the topology of 812 whole network, and then build the BRT based on a NDT. Since the IP 813 addresses of network devices are continuous, M5 only creates one 814 route entry for a group of destination addresses that have the same 815 network prefix by means of route aggregation technology. Usually, 816 the size of a BRT is very small. The detail of how to build a BRT is 817 described in section 5. 819 6.6. NRT Building Module(M6) 821 M6 is responsible for building a NRT for the current router. Because 822 M5 builds a BRT without considering link failures in network, the 823 routing paths calculated by the BRT cannot avoid failed links. To 824 solve this problem, a NRT is used to exclude the routing paths that 825 include some failed links from the paths calculated by a BRT. M6 826 calculate the routing paths that include failed links and stored them 827 into the NRT. The details of how to build a NRT is described in 828 section 5. 830 6.7. Routing Table Lookup(M7) 832 M7 is responsible for querying routing tables and selecting the next 833 hop for forwarding the packets. Firstly, M7 takes the destination 834 address of a forwarding packet as a criterion to look up route 835 entries in a BRT based on longest prefix match. All of the matched 836 entries are composed of a candidate hops list. Secondly, M7 look up 837 negative route entries in a NRT taking the destination address of the 838 forwarding packet as criteria. This lookup is not limited to the 839 longest prefix match, any entry that matches the criteria would be 840 selected and composed of an avoiding hops list. Thirdly, the 841 candidate hops minus avoiding hops are composed of an applicable hops 842 list. At last, M7 sends the forwarding packet to any one of the 843 applicable hops. If the applicable list is empty, the forwarding 844 packet will be dropped. 846 7. How a FAR Router Works 848 Figure 7 shows how a FAR router works by its FSM. 850 /---------------\ 851 | Start | 852 | | 853 \---------------/ 854 | 855 +--------+ |1 856 | | | 857 | \|/ \|/ 858 | +--------------------+ 859 |2 | ND | 860 | | | 861 | +--------------------+ 862 | | | 863 | | |3 864 +--------+ | 865 \|/ 866 +--------------+ 867 | ND-FIN | 868 | | 869 +--------------+ 870 | 871 |4 872 | 873 \|/ 874 ________10_______\+--------------+/_______11________ 875 | /| |\ | 876 | | Listen | | 877 | ____9_______\| |/_______12___ | 878 | | /| |\ | | 879 | | +--------------+ | | 880 | | 5/ | | \8 | | 881 | | |/_ | | _\| | | 882 | | +------------+ 6| |7 +------------+ | | 883 | --| HELLO-RECV | | | | LFA-RECV |-- | 884 | +------------+ | | +------------+ | 885 | ______| |______ | 886 | | | | 887 | \|/ \|/ | 888 | +------------+ +------------+ | 889 |___________| DLR-RECV | | DA-RECV |__________| 890 +------------+ +------------+ 892 _________ 893 | | 894 | | 895 \|/ | 896 +--------------+ |13 897 | Hello Thread | | 898 +--------------+ | 899 | | 900 |_________| 902 _________ 903 | | 904 | | 905 \|/ | 906 +--------------+ |14 907 | LFD Thread | | 908 +--------------+ | 909 | | 910 |_________| 912 _________ 913 | | 914 | | 915 \|/ | 916 +--------------+ |15 917 | DA Thread | | 918 +--------------+ | 919 | | 920 |_________| 922 _________ 923 | | 924 | | 925 \|/ | 926 +--------------+ |16 927 | DFD Thread | | 928 +--------------+ | 929 | | 930 |_________| 932 Figure 7: The Finite State Machine of FAR Router 934 1)When a router starts up, it starts a Hello thread and then starts 935 ND (neighbor detection) timer (3 seconds). Next the router goes into 936 ND (neighbor detection) state. 937 2)In the ND state, if a router received a Hello message, then it 938 performs a Hello-message processing and goes back to the ND state. 939 3)When the ND timer is over, a router goes into ND-FIN (neighbor 940 detection finished) state. 941 4)A router starts the LFD (link failure detection) thread and DFD 942 (device failure detection) state, and sends DA message and DLR 943 message to all of its active ports. Then the router goes into Listen 944 state. 945 5) If a router receives a Hello message, then goes into HELLO-RECV 946 state. 947 6) If a router receives a DLR message, then goes into DLR-RECV state. 948 7) If a router receives a DA message, then goes into DA-RECV state. 949 8) If a router receives a LFA message, then goes into LFA-RECV state. 950 9) A router performs the Hello-message processing. After that, it 951 goes back to Listen state. 952 10) A router performs the DLR-message processing. After that, it 953 goes back to Listen state. 954 11) A router performs the DA-message processing. After that, it goes 955 back to Listen state. 957 12) A router performs the LFA-message processing. After that, it 958 goes back to Listen state. 959 13) Hello thread produces and sends Hello messages to all its ports 960 periodically. 961 14) LFD thread calls link-failure-detection processing to check link 962 failures in all links periodically 963 15) DA thread produces and sends DA messages periodically (30 964 minutes). 965 16) When DFD thread starts up, it sleep a short time (30 seconds) to 966 wait for a router learning all the active routers in the network. 967 Then the thread calls the device-failure-detection processing to 968 check device failures periodically (30 minutes). 970 8. Compatible Architecture 972 As a generic routing protocol, FAR can be run in various DCNs with 973 regular topology. Up to now, we have implemented the FAR protocol 974 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 976 For different network architectures, most processing of FAR is same 977 besides calculation of routing tables. BRT routing tables are 978 calculated based on Hello messages and NRT routing tables are 979 calculated based on LFA messages in FAR. To extend FAR to support a 980 new network architecture, only processing of Hello and LFA messages 981 need providing to build BRT and NRT routing tables. 983 In this protocol, FAR can support maximally 12 network architectures 984 and at least support 1 built-in network architecture, such as Fat- 985 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 986 a unique number from 1 to 12. For example, if the 1 built-in 987 architectures are assigned 1, and other customized architectures are 988 assigned 2 to 12. 989 1: Fat-tree 990 2: BCube 991 3: MatrixDCN. 992 4: xxx. 993 ...... 994 12: xxx. 996 9. Application Example 998 In this section, we take a Fat-tree network(Fig. 7) as an example to 999 describe how to apply FAR routing. Since M1 to M4 are very simple, 1000 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 1001 network. 1003 A Fat-tree network is composed of 4 layers. The top layer is core 1004 layer, and the other layers are aggregation layer, edge layer and 1005 server layer. There are k pods, each one containing two layers of 1006 k/2 switches. Each k-port switch in the edge layer is directly 1007 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 1008 of the k-port switches in the aggregation layer. There are (k/2)2 1009 k-port core switches. Each core switch has one port connected to 1010 each of the k pods. 1012 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 1013 +--+ +--+ +--+ +--+ 1014 | | | | | | | | 1015 +--+,_ +--+', +--+ ,,+--+ 1016 |`,`',`-., / | \ `. .'` - .-'``.` /| 1017 | . `', `'., / | ' ' ,-`,' |`. .` ' | 1018 | \ `', `-., .` / | `, .` ,' | 1019 | `, `'. `'-,_ .'` ,' | ', / | 1020 | . `'. `-.,-` / | \ 1021 | \ `'., .` `'., ` | `. 1022 | `, .'`, ,`'., | ', 1023 | . ,-` '., - `'-,_ | `. 1024 | \ .` ,'., `|., . 1025 | .'` / `-, | `'., `. 1026 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1027 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1028 | | | | | | | | | | | | | | | | 1029 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1030 | \/ | | \/ | | \/ | | \/ | 1031 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1032 | | | | | | | | | | | | | | | | 1033 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1034 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1035 / | | \ / | | \ / | | \ / | | | 1036 / | | \ / | | \ / | | \ / | | | 1037 / | | \ / | | \ / | | \ / | | | 1038 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1039 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1040 10.1.2.2 10.3.1.3 1042 Figure 8: Fat-tree Network 1044 Aggregation switches are given addresses of the form 10.pod.0.switch, 1045 where pod denotes the pod number, and switch denotes the position of 1046 that switch in the upper pod (in [1, k/2]). Edge switches are given 1047 addresses of the form 10.pod.switch.1, where pod denotes the pod 1048 number, and switch denotes the position of that switch in the lower 1049 pod (in [1, k/2]). The core switches are given addresses of the form 1050 10.0.j.i, where j and i denote that switch's coordinates in the 1051 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1052 The address of a host follows the pod switch to which it is connected 1053 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1054 the host's position in that subnet (in [2, k/2+1], starting from left 1055 to the right). 1057 9.1. BRT Building Procedure 1059 By leveraging the topology's regularity, every switch clearly knows 1060 how it forwards a packet. When a packet arrives at an edge switch, 1061 if the destination of the packet lies in the same subnet with the 1062 switch, then the switch directly forwards the packet to the 1063 destination server through layer-2 switching. Otherwise, the switch 1064 forwards the packet to any of aggregation switches in the same pod. 1065 When a packet arrives at an aggregation switch, if the destination of 1066 the packet lies in the same pod, the switch forwards the packet to 1067 the corresponding edge switch. Otherwise, the switch forwards the 1068 packet to any of core switches that it is connected to. If a core 1069 switch receives a packet, it forwards the packet to the corresponding 1070 aggregation switch that lies in the destination pod. 1072 The forwarding policy discussed above is easily expressed through a 1073 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1074 following entries: 1076 Destination/Mask Next hop 1077 10.0.0.0/255.0.0.0 10.1.0.1 1078 10.0.0.0/255.0.0.0 10.1.0.2 1080 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1081 the following entries: 1083 Destination/Mask Next hop 1084 10.1.1.0/255 255.255.0 10.1.1.1 1085 10.1.2.0/255.255.255.0 10.1.2.1 1086 10.0.0.0/255.0.0.0 10.0.1.1 1087 10.0.0.0/255.0.0.0 10.0.1.2 1089 The BRT of acore switch, such as 10.0.1.1, is composed of the 1090 following entries: 1092 Destination/Mask Next hop 1093 10.1.0.0/255 255.0.0 10.1.0.1 1094 10.2.0.0/255.255.0.0 10.2.0.1 1095 10.3.0.0/255.255.0.0 10.3.0.1 1096 10.4.0.0/255.255.0.0 10.4.0.1 1098 9.2. NRT Building Procedure 1100 The route entries in an NRT are related with link and node failures. 1101 We summarize all types of cases into three (3) catalogs. 1103 9.2.1. Single Link Failure 1105 In Fat-tree, Links can be classified as 3 types by their locations: 1106 1) servers to edge switches; 2) edge to aggregation switches; 3) 1107 aggregation to core switches. Link failures between servers to edge 1108 switches only affect the communication of the corresponding servers 1109 and don't affect the routing tables of any switch, so we only discuss 1110 the second and third type of links failures. 1112 Edge to Aggregation Switches 1114 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1115 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1116 failure may affect 3 types of communications. 1118 o Sources lie in the same subnet with A, and destinations do not. In 1119 this case, the link failure will only affect the routing tables of A. 1120 As this link is attached to A directly, A only needs to delete the 1121 route entries whose next hop is B in its BRT and add no entries to 1122 its NRT when A's M6 module detect the link failure. 1124 o Destinations lie in the same subnet with A, and sources lie in 1125 another subnet of the same pod. In this case, the link failure will 1126 affect the routing tables of all the edge switches in the same pod 1127 except for A. When an edge switch, such as 10.1.1.1, learns the link 1128 failure, it will add a route entry to its NRT: 1130 Destination/Mask Next hop 1131 10.1.2.0/255.255.255.0 10.1.0.1 1133 o Destinations lie in the same subnet with A, sources lie in another 1134 pod. In this case, the link failure will affect the routing tables 1135 of all the edge switches in the other pods. When an edge switch in 1136 one other pod, such as 10.3.1.1, learns the link failure, because all 1137 the routings that pass through 10.3.0.1 to A will certainly pass 1138 through the link between A and B, 10.3.1.1 need add a route entry to 1139 its NRT: 1141 Destination/Mask Next hop 1142 10.1.2.0/255.255.255.0 10.3.0.1 1144 Aggregation to Core Switches 1146 Suppose that the link between an aggregation switch, such as 10.1.0.1 1147 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1148 failure may affect 2 types of communications. 1150 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1151 the other pods. In this case, the link failure will only affect the 1152 routing tables of A. As this link is attached to A directly, A only 1153 need to delete the route entries whose next hop is B in its BRT and 1154 add no entries to its NRT when A's M6 module detect the link failure. 1156 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1157 another pod. In this case, the link failure will affect the routing 1158 tables of all the aggregation switches in other pods except for pod 1159 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1160 learns the link failure, because all the routings that pass through 1161 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1162 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1164 Destination/Mask Next hop 1165 10.1.0.0/255.255.0.0 10.0.1.2 1167 9.2.2. A Group of Link Failures 1169 If all the uplinks of an aggregation switch fail, then this switch 1170 cannot forward packets, which will affect the routing of every edge 1171 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1172 fail, it will affect two types of communications. 1174 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1175 the other pods. In this case, the link failures will affect the 1176 routing of the edge switches in the Pod of A. To avoid the node A, 1177 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1178 10.1.0.1" in which the next hop is the node A. 1180 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1181 other pods. In this case, the link failures will affect the routing 1182 of edge switches in other pods. For example, if the edge switch 1183 10.3.1.1 communicates with some node in the pod of A, it should avoid 1184 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1185 pod of A will pass through the node A. So a route entry should be 1186 added to 10.3.1.1: 1188 Destination/Mask Next hop 1189 10.1.0.0/255.255.0.0 10.3.0.1 1191 9.2.3. Node Failures 1193 At last, we discuss the effect of node failures to a NRT. There are 1194 3 types of node failures: the failure of edge, aggregation and core 1195 switches. 1197 o An edge switch fails. The failure doesn't affect the routing table 1198 of any switch. 1200 o A core switch fails. Only when all the core switches connected to 1201 the same aggregation switch fail, they will affect the routing of 1202 other switches. This case is equal to the case that all the uplinks 1203 of an aggregation switch fail, so the process of link failures can 1204 cover it. 1206 o An aggregation switch fails. This case is similar to the case that 1207 all the uplinks of an aggregation switch fail. It affects the 1208 routing of edge switches in other pods, but doesn't affect the 1209 routing of edge switches in pod of the failed switch. The process of 1210 this failure is same to the second case in section 6.2.2. 1212 9.3. Routing Procedure 1214 FAR decides a routing by looking up its BRT and NRT. We illuminate 1215 the routing procedure by an example. In this example, we suppose 1216 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1217 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1218 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1219 (destination). 1221 Step 1: The source 10.3.1.3 sends packets to its default router 1222 10.3.1.1 1224 Step 2: The routing of 10.3.1.1. 1226 1) Calculate candidate hops 1228 10.3.1.1 looks up its BRT and gets the following matched entries: 1230 Destination/Mask Next hop 1231 10.0.0.0/255.0.0.0 10.3.0.1 1233 So the candidate hops = {10.3.0.1} 1235 2) Calculate avoiding hops 1237 Its NRT is empty, so the set of avoiding hop is empty too. 1239 3) Calculate applicable hops 1241 The applicable hops are candidate hops minus avoiding hops, so: 1243 The applicable hops = {10.3.0.1} 1245 4) Forward packets to 10.3.0.1 1247 Step 3: The routing of 10.3.0.1 1249 1) Calculate candidate hops. 1251 10.3. 0.1 looks up its BRT and gets the following matched entries: 1253 Destination/Mask Next hop 1254 10.1.0.0/255.255.0.0 10.0.1.1 1255 10.1.0.0/255.255.0.0 10.0.1.2 1257 So the candidate hops = {10.0.1.1, 10.0.1.2} 1259 2) Calculate avoiding hops 1261 Destination/Mask Next hop 1262 10.1.0.0/255.255.0.0 10.0.1.2 1264 So the avoiding hops = {10.0.1.2} 1266 3) Calculate applicable hops 1268 The applicable hops are candidate hops minus avoiding hops, so: 1270 The applicable hops = {10.0.1.1} 1272 4) Forward packets to 10.0.1.1 1273 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1274 routing tables. 1276 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1277 routing tables. 1279 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1280 layer-2 switching. 1282 9.4. FAR's Performance in Large-scale Networks 1284 FAR has good performance to support large-scale networks. In this 1285 section, we take a Fat-tree network composed of 2,880 48-port 1286 switches and 27,648 servers as an example to show FAR's performance. 1288 9.4.1. The number of control messages required by FAR 1290 FAR exchanges a few messages between routers and only consumes a 1291 little network bandwidth. Tab. 1 shows the required messages in the 1292 example Fat-tree network. 1293 Table 1:Required messages in a Fat-tree network. 1294 _____________________________________________________________________ 1295 Message Type| Scope | size(bytes) | Rate | Bandwidth 1296 --------------------------------------------------------------------- 1297 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1298 kbps 1299 --------------------------------------------------------------------- 1300 DLR |adjacent switches| less than 48 | (1) |48bytes 1301 --------------------------------------------------------------------- 1302 DA |entire network| less than 48 | (2) |1.106M 1303 --------------------------------------------------------------------- 1304 LFA |entire network| less than 48 | (3) |48 bytes 1305 _____________________________________________________________________ 1306 (1)Produce one when a router starts 1307 (2)The number of switches(2,880) in a period 1308 (3)Produce one when a link fails or recovers 1310 9.4.2. The Calculating Time of Routing Tables 1312 A BRT is calculated according to the states of its neighbor routers 1313 and attached links. An NRT is calculated according to device and 1314 link failures in the entire network. So FAR does not calculate 1315 network topology and has no problem of network convergence, which 1316 greatly reduces the calculating time of routing tables. The 1317 detection and spread time of link failures is very short in FAR. 1318 Detection time is up to the interval of sending Hello message. In 1319 FAR, the interval is set to 100ms, and a link failure will be 1320 detected in 200ms. The spread time between any pair of routers is 1321 less than 200ms.If a link fails in a data center network, FAR can 1322 detect it, spread it to all the routers, and calculate routing tables 1323 in no more than 500ms. 1325 9.4.3. The Size of Routing Tables 1327 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1328 in Tab. 2. 1330 Table 2: The size of routing tables in FAR 1331 _____________________________________________________________________ 1332 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1333 --------------------------------------------------------------------- 1334 BRT | 48 | 48 | 24 1335 --------------------------------------------------------------------- 1336 NRT | 0 | 14 | 333 1337 _____________________________________________________________________ 1339 The BRT's size at a switch is determined by the number of its 1340 neighbor switches. In the example network, a core switch has 48 1341 neighbor switches (aggregation switch), so it has 48 entries in its 1342 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1343 switch is related to the number of link failures in the network. 1344 Suppose that there are 1000 link failures in the example network, the 1345 number of failed links is 1.2% of total links, which is a very high 1346 failure ratio. We suppose that link failures are uniformly 1347 distributed in the entire network. The NRT size at an edge switch is 1348 about 333 and the NRT size of an aggregation switch is about 14in 1349 average. 1351 10. Security Considerations 1353 The security considerations will be discussed in a future version of 1354 this document. 1356 11. Conclusions 1358 This draft introduces FAR protocol, a generic routing method and 1359 protocol, for data centers that have a regular topology. It uses two 1360 routing tables, a BRT and an NRT, to store the normal routing paths 1361 and the forbidden (to-be-avoided) routing paths, respectively. This 1362 makes the FAR protocol very simple and efficient. The sizes of these 1363 two tables are very small. Usually, a BRT has only several tens of 1364 entries and an NRT has only several or about a dozen entries. 1366 12. Acknowledgments 1368 This document is supported by ZTE Enterprise-University-Research 1369 Joint Project. 1371 13. References 1373 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1374 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1375 1997. 1377 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1378 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1379 2008. 1381 Authors' Addresses 1383 Bin Liu 1384 ZTE Inc., ZTE Plaza 1385 No.19 East Huayuan Road,Hai Dian District 1386 Beijing 100191 1387 China 1389 Phone: +86 -010-59932039 1390 Email: 13683610386@139.com 1392 Yantao Sun 1393 Beijing Jiaotong University 1394 No.3 Shang Yuan Cun, Hai Dian District 1395 Beijing 100044 1396 China 1398 Email: ytsun@bjtu.edu.cn 1400 Jing Cheng 1401 Beijing Jiaotong University 1402 No.3 Shang Yuan Cun, Hai Dian District 1403 Beijing 100044 1404 China 1406 Email: yourney.j@gmail.com 1407 Yichen Zhang 1408 Beijing Jiaotong University 1409 No.3 Shang Yuan Cun, Hai Dian District 1410 Beijing 100044 1411 China 1413 Email: snowfall_dan@sina.com 1415 Bhumip Khasnabish 1416 ZTE TX Inc. 1417 55 Madison Avenue, Suite 160 1418 Morristown, New Jersey 07960 1419 USA 1421 Phone: +001-781-752-8003 1422 Email: vumip1@gmail.com 1423 URI: http://tinyurl.com/bhumip/