idnits 2.17.1 draft-sl-rtgwg-far-dcn-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the document. == There are 54 instances of lines with private range IPv4 addresses in the document. If these are generic example addresses, they should be changed to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x, 198.51.100.x or 203.0.113.x. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet has text resembling RFC 2119 boilerplate text. -- The document date (November 11, 2015) is 3088 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'FAT-TREE' is defined on line 1354, but no explicit reference was found in the text Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Routing Area Working Group Bin Liu 3 Internet-Draft ZTE Inc., ZTE Plaza 4 Intended status: Informational Yantao Sun 5 Expires: May 14, 2016 Jing Cheng 6 Yichen Zhang 7 Beijing Jiaotong University 8 Bhumip Khasnabish 9 ZTE TX Inc. 10 November 11, 2015 12 Generic Fault-avoidance Routing Protocol for Data Center Networks 13 draft-sl-rtgwg-far-dcn-05 15 Abstract 17 This draft proposes a generic routing method and protocol for a 18 regular data center network, named as the fault-avoidance routing 19 (FAR) protocol. FAR protocol provides a generic routing method for 20 all types of network architectures that are proposed for large-scale 21 cloud-based data centers over the past few years. FAR protocol is 22 well designed to fully leverage the regularity in the topology and 23 compute its routing table in a simplistic manner. Fat-tree is taken 24 as an example architecture to illustrate how FAR protocol can be 25 applied in real operational scenarios. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at http://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on May 14, 2016. 44 Copyright Notice 46 Copyright (c) 2015 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Acronyms & Definitions . . . . . . . . . . . . . . . . . 4 63 2. Conventions used in this document . . . . . . . . . . . . . . 4 64 3. Problem Statement . . . . . . . . . . . . . . . . . . . . . . 5 65 3.1. The Impact of Large-scale Networks on Route Calculation . 5 66 3.2. Dilemma of conventional routing methods in a large-scale 67 network with giant number nodes of routers . . . . . . . 6 68 3.3. Network Addressing Issues . . . . . . . . . . . . . . . . 8 69 3.4. Big Routing Table Issues . . . . . . . . . . . . . . . . 8 70 3.5. Adaptivity Issues for Routing Algorithms . . . . . . . . 8 71 3.6. Virtual Machine Migration Issues . . . . . . . . . . . . 9 72 4. The FAR Framework . . . . . . . . . . . . . . . . . . . . . . 9 73 5. Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 10 74 5.1. Data Tables . . . . . . . . . . . . . . . . . . . . . . . 10 75 5.2. Messages . . . . . . . . . . . . . . . . . . . . . . . . 13 76 6. FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . . 17 77 6.1. Neighbor and Link Detection Module(M1) . . . . . . . . . 17 78 6.2. Device Learning Module(M2) . . . . . . . . . . . . . . . 17 79 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 18 80 6.4. Link Failure Learning Module(M4) . . . . . . . . . . . . 18 81 6.5. BRT Building Module(M5) . . . . . . . . . . . . . . . . . 18 82 6.6. NRT Building Module(M6) . . . . . . . . . . . . . . . . . 19 83 6.7. Routing Table Lookup(M7) . . . . . . . . . . . . . . . . 19 84 7. How a FAR Router Works . . . . . . . . . . . . . . . . . . . 19 85 8. Compatible Architecture . . . . . . . . . . . . . . . . . . . 22 86 9. Application Example . . . . . . . . . . . . . . . . . . . . . 22 87 9.1. BRT Building Procedure . . . . . . . . . . . . . . . . . 24 88 9.2. NRT Building Procedure . . . . . . . . . . . . . . . . . 25 89 9.2.1. Single Link Failure . . . . . . . . . . . . . . . . . 25 90 9.2.2. A Group of Link Failures . . . . . . . . . . . . . . 26 91 9.2.3. Node Failures . . . . . . . . . . . . . . . . . . . . 27 92 9.3. Routing Procedure . . . . . . . . . . . . . . . . . . . . 27 93 9.4. FAR's Performance in Large-scale Networks . . . . . . . . 29 94 9.4.1. The number of control messages required by FAR . . . 29 95 9.4.2. The Calculating Time of Routing Tables . . . . . . . 29 96 9.4.3. The Size of Routing Tables . . . . . . . . . . . . . 30 98 10. Security Considerations . . . . . . . . . . . . . . . . . . . 30 99 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 30 100 12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 31 101 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 31 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 31 104 1. Introduction 106 In recent years, with the rapid development of cloud computing 107 technologies, the widely deployed cloud services, such as Amazon EC2 108 and Google search, bring about huge challenges to data center 109 networking (DCN). Today's cloud-based data centers (DCs) require 110 large-scale networks with larger internal bandwidth and smaller 111 transfer delay. However, conventional networks cannot meet such 112 requirements due to limitations in their network architecture. In 113 order to satisfy the requirements of cloud computing services, many 114 new network architectures have been proposed for data centers, such 115 as Fat-tree, MatrixDCN, and BCube. These new architectures can 116 support non-blocking large-scale datacenter networks with more than 117 tens of thousands of physical servers. 119 All of these architectures have a common feature that is with a 120 regular topology. Here a regular topology is not a mathematical or 121 definite conception, which means a non-arbitrary network fabric, an 122 inerratic network fabric. In a regular topology, each network node 123 such as a switch or router can be addressed by its location and 124 through its address the node's connections in a network can be 125 determined, and furthermore, the route to the node from other nodes 126 in the network can be determined. 128 This draft proposes a generic routing method and protocol, fault- 129 avoidance routing (FAR) protocol, for DCNs. This method leverages 130 the regularity in the topologies of data center networks to simplify 131 routing learning and accelerate the query of routing tables. This 132 routing method has a better fault tolerance and can be applied to any 133 DCN with a regular topology. 135 FAR is not a routing protocol to replace generic routing protocols 136 such as OSPF and IS-IS. It cannot be used in general local networks 137 whose topological structures are arbitrary, and whose scales are also 138 not very large. OSPF works very well in such a network. But in a 139 large-scale network with regular topology, FAR has a better 140 performance. Compared with OSPF and IS-IS, FAR has shorter time of 141 network convergence and lower PDU overhead. Furthermore, FAR 142 requires less computing and storage resources, which let FAR routers 143 to run at a lower cost of production than the generic routers. 145 In addition, for each type of architecture, researchers designed a 146 routing algorithm according to the features of its topology. Because 147 these routing algorithms are different and lack compatibility with 148 each other, it is very difficult to develop a routing protocol for 149 network routers supporting multiple routing algorithms. FAR has 150 better adaptability than these specified routing methods. 152 1.1. Acronyms & Definitions 154 DCN - Data Center Network 156 FAR - Fault-Avoidance Routing 158 BRT - Basic Routing Table 160 NRT - Negative Routing Table 162 NDT - Neighbor Devices Table 164 ADT - All Devices Table 166 LFT - Link Failure Table 168 DA - Device Announcement 170 LFA - Link Failure Announcement 172 DLR - Device and Link Request 174 IP - Internet Protocol 176 UDP - User Datagram Protocol 178 VM - Virtual Machine 180 2. Conventions used in this document 182 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 183 NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 184 this document are to be interpreted as described in RFC-2119 185 [RFC2119]. In this document, these words will appear with that 186 interpretation only when in ALL CAPS. Lower case uses of these words 187 are not to be interpreted as carrying RFC-2119 significance. 189 3. Problem Statement 191 The problem to be addressed by FAR as proposed in this draft is 192 described in this section. The expansion of Cloud data center 193 networks has brought significant challenges to the existing routing 194 technologies. FAR mainly solves a series of routing problems faced 195 by large-scale data center networks. 197 3.1. The Impact of Large-scale Networks on Route Calculation 199 In a large-scale cloud data center network, there may be thousands of 200 routers. Running OSPF and other routing protocols in such network 201 will encounter these two challenges: a) Network convergence time 202 would be too long, which will cause a longer time to elapse for 203 creating and updating the routes. The response time to network 204 failures may be excessively high; b) a large number of routing 205 protocol packets need to be sent. The routing information consumes a 206 lot of network bandwidth and CPU resources, which easily leads to 207 packet loss and makes the problem (a) more prominent. 209 In order to solve these problems, a common practice is partitioning 210 the large network into some small areas, where the route calculation 211 runs independently within different areas. However, nowadays the 212 cloud data centers typically require very large internal bandwidth. 213 To meet this requirement, a large number of parallel equivalent links 214 are deployed in the network, such as the Fat-tree network 215 architecture. Partitioning the network will affect the utilization 216 of routing algorithm on equivalent multi-path and reduce internal 217 network bandwidth requirements. 219 In the FAR routing calculation process, a Basic Routing Table (BRT) 220 is built on local network topology leveraging the regularity of the 221 network topologies. In addition to BRT, FAR also builds a Negative 222 Routing Table (NRT). FAR gradually builds NRT in the process of 223 learning network link failure information, which does not require 224 learning the complete network fault information. FAR does not need 225 to wait for the completion of the network convergence in the process 226 of building these two tables. Therefor, it avoids the problem of 227 excessive network convergence overheads in the route calculation 228 process. In addition, FAR only needs to exchange a small amount of 229 link change information between routers, and hence consumes less 230 network bandwidth. 232 3.2. Dilemma of conventional routing methods in a large-scale network 233 with giant number nodes of routers 235 There are many real world scenario where tens of thousands of 236 nodes(or much more nodes) need to be deployed in a flat area, such as 237 infiniband routing and switching system, high-performance computer 238 network, and many IDC networks in China. The similar problems have 239 been existed long ago. People have solved the problems through 240 similar solutions, such as the traditional regular topology-based 241 RFC3619 protocol, the routing protocols of infiniband routing and 242 switching system, and high-performance computer network routing 243 protocol. 244 Infiniband defines a switch-based network to interconnect processing 245 nodes and the I/O nodes. The network is composed of HCAs, Switches 246 and SMs. SMs are responsible for the discovery, configuration, 247 activation, and management of the entire subnets. SMs can be on the 248 nodes of any subnets (such as Switches, Routers or HCAs). SMs 249 exchange control management packets through subnet management 250 interfaces SMIs and subnet management agents SMAs. These control 251 packets are called subnet management packets SMPs. SMPs use 252 unreliable datagram service to send. SMPs are divided into the lid 253 routing and directional routing, and SMPs use directional routing for 254 network topology discovery before the network initialization. 255 Infiniband can support very large scale networks, use the regularity 256 in topology to simplify its routing algorithm, which is just the same 257 to what we do in FAR. 259 Why OSPF and other conventional routing methods do not work well in a 260 large-scale network with giant number nodes of routers? 262 As everyone knows, the OSPF protocol uses multiple databases, more 263 topological exchange information (as seen in the following example) 264 and complicated algorithm. It requires routers to consume more 265 memory and CPU processing capability. But the processing rate of CPU 266 on the protocol message per second is very limited. When the network 267 expands, CPU will quickly approach its processing limits, and at this 268 time OSPF can not continue to expand the scale of the management. 269 The SPF algorithm itself does not thoroughly solve these problems. 271 On the contrary, FAR does not have the convergence time delay and the 272 additional CPU overheads, which SPF requires. Because in the initial 273 stage, FAR already knows the regular information of the whole network 274 topology and does not need to periodically do SPF operation. 276 One of the examples of "more topological exchange information": 277 In the OSPF protocol, LSA floods every 1800 seconds. Especially in 278 the larger network, the occupation of CPU and band bandwidth will 279 soon reach the router's performance bottleneck. In order to reduce 280 these adverse effects, OSPF introduced the concept of Area, which 281 still has not solved the problem thoroughly). By dividing the OSPF 282 Area into several areas, the routers in the same area do not need to 283 know the topological details outside their area. (In comparison with 284 FAR, after OSPF introducing the concept of Area, the equivalent paths 285 cannot be selected in the whole network scope) 287 OSPF can achieve the following results by Area : 1) Routers only need 288 to maintain the same link state databases as other routers within the 289 same Area, without the necessity of maintaining the same link state 290 database as all routers in the whole OSPF domain. 2) The reduction 291 of the link state databases means dealing with relatively fewer LSA, 292 which reduces the CPU consumption of routers; 3) The large number of 293 LSAs flood only within the same Area. But, its negative effect is 294 that the smaller number of routers which can be managed in each OSPF 295 area. On the contrary, because FAR does not have the above 296 disadvantages, FAR can also manage large-scale network even without 297 dividing Areas. 299 The aging time of OSPF is set in order to adapt to routing 300 transformation and protocol message exchange happened frequently in 301 the irregular topology. Its negative effect is: when the network 302 does not change, the LSA needs to be refreshed every 1800 seconds to 303 reset the aging time. In the regular topology, as the routings are 304 fixed, it does not need the complex protocol message exchange and 305 aging rules to reflect the routing changes, as long as LFA mechanism 306 in the FAR is enough. 308 Therefore, in FAR, we can omit many unnecessary processing and the 309 packet exchange. The benefits are fast convergence speed and much 310 larger network scale than other dynamic routing protocol. Now there 311 are some successful implementations of simplified routings in the 312 regular topology in the HPC environment. Conclusion: As FAR needs 313 few routing entries and the topology is regular, the database does 314 not need to be updated regularly. Without the need for aging, there 315 is no need for CPU and bandwidth overhead brought by LSA flood every 316 30 minutes, so the expansion of the network has no obvious effect on 317 the performance of FAR, which is contrary to OSPF. 319 Comparison of convergence time: The settings of OSPF spf_delay and 320 spf_hold_time can affect the change of convergence time. The 321 convergence time of the network with 2480 nodes is about 15-20 322 seconds(as seen in the following pages); while the FAR does not need 323 to calculate the SFP, so there is no such convergence time. 324 These issues still exist in rapid convergence technology of OSPF and 325 ISIS (such as I-SPF). The convergence speed and network scale 326 constraint each other. FAR does not have the above problems, and the 327 convergence time is almost negligible. 329 Can FRR solve these problems? IP FRR has some limitations. The 330 establishment of IP FRR backup scheme will not affect the original 331 topology and traffic forwarding which are established by protocol, 332 however, we can not get the information of whereabouts and status 333 when the traffic is switched to an alternate next hop. 335 3.3. Network Addressing Issues 337 Routers typically configure multiple network interfaces, each 338 connected to a subnet. OSPF and other routing algorithms require 339 that each interface of the router must be configured with an IP 340 address. A large-scale data center network may contain thousands of 341 routers. Tens of thousands of IP addresses may be needed to 342 configure for each router with dozens of network interfaces. It will 343 be a very complex issue to configure and manage a large number of 344 network interfaces. Network maintenance is usually costly and error- 345 prone. It will be difficult to troubleshoot the problems. 347 In FAR, the device position information is encoded in the IP address 348 of the router. Each router only needs to be assigned a unique IP 349 address according its location, which greatly solves complex network 350 addressing issues in large-scale networks. 352 3.4. Big Routing Table Issues 354 There are a large number of subnets in the large-scale data center 355 network. Routers may build a routing entry for each subnet, and 356 therefore the size of the routing tables on each router may be very 357 large. It will increase equipment cost and reduce the querying speed 358 of the routing table. 360 FAR uses two measures to reduce the size of the routing tables: a) 361 Builds a BRT on the regularity of the network topologies; b) 362 introduces a new routing table, i.e., a NRT. In this way FAR can 363 reduce the size of routing tables to only a few dozen routing 364 entries. 366 3.5. Adaptivity Issues for Routing Algorithms 368 To implement efficient routing in large-scale datacenters, besides 369 FAR, some other routing methods are proposed for some specific 370 network architectures, such as Fat-tree and BCube. These routing 371 methods are different(from both design and implementation viewpoints) 372 and not compatibile with the conventional routing methods, which 373 brings big troubles to network equipment providers to develop new 374 routers supporting various new routing methods. 376 FAR is a generic routing method. With slight modification, FAR 377 method can be applied to most of regular datacenter networks. 378 Furthermore, the structure of routing tables and querying a routing 379 table in FAR are the same as conventional routing method. If FAR is 380 adopted, the workload of developing a new type of router will be 381 significantly decreased. 383 3.6. Virtual Machine Migration Issues 385 Supporting VM migration is very important for cloud-based datacenter 386 networks. However, in order to support layer-3 routing, routing 387 methods including OSPF and FAR require limiting VM migration within a 388 subnet. For this paradox, the mainstream methods still utilize 389 layer-3 routing on routers or switches, transmit packets encapsulated 390 by IPinIP or MACinIP between hosts by tunnels passing through network 391 to the destination access switch, and then extract original packet 392 out and send it to the destination host. 394 By utilizing the aforementioned methods, FAR can be applied to Fat- 395 tree, MatrixDCN or BCube networks for supporting VM migration in 396 entire network. 398 4. The FAR Framework 400 FAR requires that a DCN has a regular topology, and network devices, 401 including routers, switches, and servers, are assigned IP addresses 402 according to their locations in the network. In other word, we can 403 locate a device in the network according to its IP address. 405 FAR is a distributed routing method. In order to support FAR, each 406 router needs to have a routing module that implements the FAR 407 algorithm. FAR algorithm is composed of three parts, i.e., link- 408 state learning, routing table building and routing table querying, as 409 shown in Fig. 1. 411 Link-state Learning |Routing Table | Routing Table 412 | Building 413 | | Querying 414 +--------+ /---------------\ | +--------+ | Packets 415 |2 Device|<--| 1 Neighbor & |----->| 5 BRT \ | 416 |Learning| | Link Detection| | |Building|\ | | 417 +--------+ \---------------/ | +--------+ \ | \|/ 418 | | | \ |+--------------+ 419 | | | /|| 7 Querying | 420 \|/ \|/ | / || Routing Table| 421 +-----------------------+| / |+--------------+ 422 |3 Invisible Neighbor & || / | 423 |Link Failure Inferring || +--------- | 424 +-----------------------+|/| 6 NRT | | 425 | / |Building| | 426 \|/ /| +--------+ | 427 +--------------+ / | | 428 |4 Link Failure| / | | 429 | Learning | | | 430 +--------------+ | | 431 | | 433 Figure 1: The FAR framework 435 Link-state learning is responsible for a router to detect the states 436 of its connected links and learn the states of all the other links in 437 the entire network. The second part builds two routing tables, a 438 basic routing table (BRT) and an negative routing table (NRT), 439 according to the learned link states in the first part. The third 440 part queries the BRT and the NRT to decide a next forwarding hop for 441 the received (ingress) packets. 443 5. Data Format 445 5.1. Data Tables 447 Some data tables are maintained on each router in FAR. They are: 449 Neighbor Device Table (NDT): To store neighbor routers and related 450 links. 452 All Devices Table (ADT): To store all routers in the entire network. 454 Link Failures Table (LFT): To store all link failures in the entire 455 network. 457 Basic Routing Table (BRT): To store the candidate routes. 459 Negative Routing Table(NRT): To store the avoiding routes. 461 The format of NDT 463 ---------------------------------------------------------- 464 Device ID | Device IP | Port ID | Link State | Update Time 465 ---------------------------------------------------------- 467 Device ID: The ID of a neighbor router. 469 Device IP: The IP address of a neighbor router. 471 Port ID: The port ID that a neighbor router is attached to. 473 Link State: The state of the link between a router and its neighbor 474 router. There are two states: Up and Down. 476 Update Time: The time of updating the entry. 478 The format of ADT 480 -------------------------------------------------- 481 Device ID | Device IP | Type | State | Update Time 482 -------------------------------------------------- 484 Device ID: The ID of a neighbor router. 486 Device IP: The IP address of a neighbor router. 488 Type: The type of a neighbor router. 490 State: The state of a neighbor router. There are two states: Up and 491 Down. 493 Update Time: The time of updating the entry. 495 The format of LFT 497 -------------------------------------------- 498 No | Router 1 IP | Router 2 IP | Timestamp 499 -------------------------------------------- 501 No: The entry number. 503 Router 1 IP: The IP address of one router that a failed link connects 504 to. 506 Router 2 IP: The IP address of another router that a failed link 507 connects to. 509 Timestamp: It identifies when the entry is created. 511 The format of BRT 513 ------------------------------------------------------- 514 Destination | Mask | Next Hop | Interface | Update Time 515 ------------------------------------------------------- 517 Destination: A destination network 519 Mask: The subnet mask of a destination network. 521 Next Hop: The IP address of a next hop for a destination. 523 Interface: The interface related to a next hop. 525 Update Time: The time of updating the entry. 527 The format of NRT 529 ------------------------------------------------------------------- 530 Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp 531 ------------------------------------------------------------------- 533 Destination: A destination network. 535 Mask: The subnet mask of a destination network. 537 Next Hop: The IP address of a next hop that should be avoided for a 538 destination. 540 Interface: The interface related to a next hop that should be 541 avoided. 543 Failed Link No: A group of failed link numbers divided by "/", for 544 example 1/2/3. 546 Timestamp: The time of updating the entry. 548 5.2. Messages 550 Some protocol messages are exchanged between routers in FAR. 552 Hello Message: This message is exchanged between neighbor routers to 553 learn adjacency. 555 Device Announcement (DA): Synchronize the knowledge of routers 556 between routers. 558 Link Failure Announcement (LFA): Synchronize link failures between 559 routers. 561 Device and Link Request (DLR): When a router starts, it requests the 562 knowledge of routers and links from its neighbors by a DLR message. 564 A FAR Message is directly encapsulated in an IP packet. The protocol 565 field of IP header indicates an IP packet is an FAR message. The 566 protocol of IP for FAR should be assigned by IANA. 568 The four types of FAR messages have same format of packet header, 569 called FAR header (as shown in Figure 2). 571 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 572 +-----------+------------+------------------------+ 573 | Version |Message Type| Message Length | 574 +-----------+------------+------------------------+ 575 | Checksum | AuType | 576 +------------------------+------------------------+ 577 | Authentication | 578 +-------------------------------------------------+ 579 | Authentication | 580 +-------------------------------------------------+ 581 | Timestamp | 582 +-------------------------------------------------+ 584 Figure 2: The format of FAR header 586 Version: FAR version 588 Message Type: The type of FAR message. 590 Packet Length: The packet length of the total FAR message. 592 Checksum: The checksum of an entire FAR message. 594 AuType:Authentication type. 0: no authentication, 1: Plaintext 595 Authentication, 2: MD5 Authentication. 597 Authentication: Authentication information. 0: undefined, 1: Key, 2: 598 key ID, MD5 data length and packet number. MD5 data is appended to 599 the backend of the packet. 601 AuType and Authentication can refer to the definition of OSPF packet. 603 |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->| 604 +-----------+------------+------------------------+ 605 | Version |Message Type| Message Length | 606 +-----------+------------+------------------------+ 607 | Checksum | AuType | 608 +------------------------+------------------------+ 609 | Authentication | 610 +-------------------------------------------------+ 611 | Authentication | 612 +-------------------------------------------------+ 613 | Timestamp | 614 +-------------------------------------------------+ 615 | Router IP | 616 +------------------------+------------------------+ 617 | HelloInterval | HelloDeadInterval | 618 +------------------------+------------------------+ 619 | Neighbor Router IP | 620 +-------------------------------------------------+ 621 | ... | 622 +-------------------------------------------------+ 624 Figure 3: The Format of Hello Messages 626 For Hello messages, the Message Type in FAR header is set to 627 1.Besides FAR header, a Hello message(Fig. 3) requires the following 628 fields: 630 Router IP: The router IP address. 632 HelloInterval: The interval of sending Hello messages to neighbor 633 routers. 635 RouterDeadInterval: The interval to set a neighbor router dead(out- 636 of-service). If in the interval time, a router doesn't receive a 637 Hello message from its neighbor router, the neighbor router is 638 treated as dead. 640 Neighbor Router IP: The IP address of a neighbor router. All the 641 neighbor router's addresses should be included in a Hello message. 643 |<----1---->| <----1---->|<----------2----------->| 644 +-----------+------------+------------------------+ 645 | Version |Message Type| Message Length | 646 +-----------+------------+------------------------+ 647 | Checksum | AuType | 648 +------------------------+------------------------+ 649 | Authentication | 650 +-------------------------------------------------+ 651 | Authentication | 652 +-------------------------------------------------+ 653 | Timestamp | 654 +------------------------+------------------------+ 655 | Router1 IP | 656 +-------------------------------------------------+ 657 | ... | 658 +-------------------------------------------------+ 660 Figure 4: The Format of DA Messages 662 For DA messages(Fig. 4), the Message Type in FAR header is set to 2. 663 Besides FAR header, a DA message includes IP addresses of all the 664 announced routers. 666 |<----1---->| <----1---->|<----------2----------->| 667 +-----------+------------+------------------------+ 668 | Version |Message Type| Message Length | 669 +-----------+------------+------------------------+ 670 | Checksum | AuType | 671 +------------------------+------------------------+ 672 | Authentication | 673 +-------------------------------------------------+ 674 | Authentication | 675 +-------------------------------------------------+ 676 | Timestamp | 677 +------------------------+------------------------+ 678 | Left IP | 679 +-------------------------------------------------+ 680 | Right IP | 681 +------------------------+------------------------+ 682 | State | 683 +-------------------------------------------------+ 684 | ... | 685 +-------------------------------------------------+ 687 Figure 5: The Format of LFA Messages 689 For LFA messages(Fig. 5), the Message Type in FAR header is set to 3. 690 Besides FAR header, a LFA message includes all the announced link 691 failures. 693 Left IP: The IP address of the left endpoint router of a link. 695 Right IP: The IP address of the right endpoint router of a link. 697 State: Link state. 0: Up, 1: down 698 |<----1---->| <----1---->|<----------2----------->| 699 +-----------+------------+------------------------+ 700 | Version |Message Type| Message Length | 701 +-----------+------------+------------------------+ 702 | Checksum | AuType | 703 +------------------------+------------------------+ 704 | Authentication | 705 +-------------------------------------------------+ 706 | Authentication | 707 +-------------------------------------------------+ 708 | Timestamp | 709 +-------------------------------------------------+ 711 Figure 6: The Format of DLR Messages 713 For DLR messages(Fig. 6), the Message Type in FAR header is set to 714 1.Except for FAR header, DLR has no additional fields. 716 6. FAR Modules 718 6.1. Neighbor and Link Detection Module(M1) 720 M1 is responsible for sending and receiving Hello messages, and 721 detecting directly-connected links and neighbor routers. Each Hello 722 message is encapsulated in a UDP packet. M1 sends Hello messages 723 periodically to all the active router ports and receives Hello 724 messages from its neighbor routers. M1 detects neighbor routers and 725 directly-connected links according to received Hello Messages and 726 stores these neighbors and links into a Neighbor Devices Table (NDT). 727 Additionally, M1 also stores the neighbor routers into an All Devices 728 Table (ADT). 730 6.2. Device Learning Module(M2) 732 M2 is responsible for sending, receiving, and forwarding device 733 announcement (DA) messages, learning all the routers in the whole 734 network, and deducing faulted routers. When a router starts, it 735 sends a DA message announcing itself to its neighbors and a DLR 736 message requesting the knowledge of routers and links from its 737 neighbors. If M2 module of a router receives a DA message, it checks 738 whether the router encapsulated in the message is in an ADT. If the 739 router is not in the ADT, M2 puts this router into the ADT and 740 forwards this DA message to all the active ports except for the 741 incoming one, otherwise, M2 discards this message directly. If M2 742 module of a router receives a DLR message, it replies a DA message 743 that encapsulates all of the learned routers. 745 6.3. Invisible Neighbor and Link Failure Inferring Module(M3) 747 M3 is responsible for inferring invisible neighbors of the current 748 router by means of the ADT. If the link between the router A and its 749 neighbor B breaks, which results in that M1 module of A cannot detect 750 the existence of B, then B is an invisible neighbor of A. Since a 751 device's location is coded into its IP address, it can be judged 752 whether two routers are adjacent, according to their IP addresses. 753 Based on this idea, M3 infers all of the invisible neighbors of the 754 current router and the related link failures. The results are stored 755 into a NDT. Moreover, link failures also are added into a link- 756 failure table (LFT). LFT stores all of the failed links in the 757 entire network. 759 6.4. Link Failure Learning Module(M4) 761 M4 is responsible for sending, receiving and forwarding link failure 762 announcement (LFA) and learning all the link failures in the whole 763 network. M4 broadcasts each newly inferred link failure to all the 764 routers in the network. Each link failure is encapsulated in a LFA 765 message and one link failure is broadcasted only once. If a router 766 receives a DLR request from its neighbor, it will reply a LFA message 767 that encapsulates all the learned link failures through M4 module. 768 If M4 receives a LFA message, it checks whether the link failure 769 encapsulated in the message is in a LFT by comparing two link ends 770 and timestamp. If the link failure is not in the LFT or timestamp is 771 different, M4 puts this link failure into the LFT (or update 772 timestamp only) and forwards this LFA message to all the active ports 773 except for the incoming one, otherwise, M4 discards this message 774 directly. 776 There is a special case a router will rebroadcast a link failure. If 777 a router receives a data packet and must forward the packet going 778 ahead to destination through a failed link, it means some previous 779 router should avoid this failed link according to its NRT but it 780 doesn't. In this case, maybe the previous router missed the LFA 781 message of the link failure due to some uncertain reasons. So the 782 forwarding router rebroadcasts the LFA message. 784 6.5. BRT Building Module(M5) 786 M5 is responsible for building a BRT for the current router. By 787 leveraging the regularity in topology, M5 can calculate the routing 788 paths for any destination without the knowledge of the topology of 789 whole network, and then build the BRT based on a NDT. Since the IP 790 addresses of network devices are continuous, M5 only creates one 791 route entry for a group of destination addresses that have the same 792 network prefix by means of route aggregation technology. Usually, 793 the size of a BRT is very small. The detail of how to build a BRT is 794 described in section 5. 796 6.6. NRT Building Module(M6) 798 M6 is responsible for building a NRT for the current router. Because 799 M5 builds a BRT without considering link failures in network, the 800 routing paths calculated by the BRT cannot avoid failed links. To 801 solve this problem, a NRT is used to exclude the routing paths that 802 include some failed links from the paths calculated by a BRT. M6 803 calculate the routing paths that include failed links and stored them 804 into the NRT. The details of how to build a NRT is described in 805 section 5. 807 6.7. Routing Table Lookup(M7) 809 M7 is responsible for querying routing tables and selecting the next 810 hop for forwarding the packets. Firstly, M7 takes the destination 811 address of a forwarding packet as a criterion to look up route 812 entries in a BRT based on longest prefix match. All of the matched 813 entries are composed of a candidate hops list. Secondly, M7 look up 814 negative route entries in a NRT taking the destination address of the 815 forwarding packet as criteria. This lookup is not limited to the 816 longest prefix match, any entry that matches the criteria would be 817 selected and composed of an avoiding hops list. Thirdly, the 818 candidate hops minus avoiding hops are composed of an applicable hops 819 list. At last, M7 sends the forwarding packet to any one of the 820 applicable hops. If the applicable list is empty, the forwarding 821 packet will be dropped. 823 7. How a FAR Router Works 825 Figure 7 shows how a FAR router works by its FSM. 827 /---------------\ 828 | Start | 829 | | 830 \---------------/ 831 | 832 +--------+ |1 833 | | | 834 | \|/ \|/ 835 | +--------------------+ 836 |2 | ND | 837 | | | 838 | +--------------------+ 839 | | | 840 | | |3 841 +--------+ | 842 \|/ 843 +--------------+ 844 | ND-FIN | 845 | | 846 +--------------+ 847 | 848 |4 849 | 850 \|/ 851 ________10_______\+--------------+/_______11________ 852 | /| |\ | 853 | | Listen | | 854 | ____9_______\| |/_______12___ | 855 | | /| |\ | | 856 | | +--------------+ | | 857 | | 5/ | | \8 | | 858 | | |/_ | | _\| | | 859 | | +------------+ 6| |7 +------------+ | | 860 | --| HELLO-RECV | | | | LFA-RECV |-- | 861 | +------------+ | | +------------+ | 862 | ______| |______ | 863 | | | | 864 | \|/ \|/ | 865 | +------------+ +------------+ | 866 |___________| DLR-RECV | | DA-RECV |__________| 867 +------------+ +------------+ 869 _________ 870 | | 871 | | 872 \|/ | 873 +--------------+ |13 874 | Hello Thread | | 875 +--------------+ | 876 | | 877 |_________| 879 _________ 880 | | 881 | | 882 \|/ | 883 +--------------+ |14 884 | LFD Thread | | 885 +--------------+ | 886 | | 887 |_________| 889 _________ 890 | | 891 | | 892 \|/ | 893 +--------------+ |15 894 | DA Thread | | 895 +--------------+ | 896 | | 897 |_________| 899 _________ 900 | | 901 | | 902 \|/ | 903 +--------------+ |16 904 | DFD Thread | | 905 +--------------+ | 906 | | 907 |_________| 909 Figure 7: The Finite State Machine of FAR Router 911 1)When a router starts up, it starts a Hello thread and then starts 912 ND (neighbor detection) timer (3 seconds). Next the router goes into 913 ND (neighbor detection) state. 914 2)In the ND state, if a router received a Hello message, then it 915 performs a Hello-message processing and goes back to the ND state. 916 3)When the ND timer is over, a router goes into ND-FIN (neighbor 917 detection finished) state. 918 4)A router starts the LFD (link failure detection) thread and DFD 919 (device failure detection) state, and sends DA message and DLR 920 message to all of its active ports. Then the router goes into Listen 921 state. 922 5) If a router receives a Hello message, then goes into HELLO-RECV 923 state. 924 6) If a router receives a DLR message, then goes into DLR-RECV state. 925 7) If a router receives a DA message, then goes into DA-RECV state. 926 8) If a router receives a LFA message, then goes into LFA-RECV state. 927 9) A router performs the Hello-message processing. After that, it 928 goes back to Listen state. 929 10) A router performs the DLR-message processing. After that, it 930 goes back to Listen state. 931 11) A router performs the DA-message processing. After that, it goes 932 back to Listen state. 934 12) A router performs the LFA-message processing. After that, it 935 goes back to Listen state. 936 13) Hello thread produces and sends Hello messages to all its ports 937 periodically. 938 14) LFD thread calls link-failure-detection processing to check link 939 failures in all links periodically 940 15) DA thread produces and sends DA messages periodically (30 941 minutes). 942 16) When DFD thread starts up, it sleep a short time (30 seconds) to 943 wait for a router learning all the active routers in the network. 944 Then the thread calls the device-failure-detection processing to 945 check device failures periodically (30 minutes). 947 8. Compatible Architecture 949 As a generic routing protocol, FAR can be run in various DCNs with 950 regular topology. Up to now, we have implemented the FAR protocol 951 for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond. 953 For different network architectures, most processing of FAR is same 954 besides calculation of routing tables. BRT routing tables are 955 calculated based on Hello messages and NRT routing tables are 956 calculated based on LFA messages in FAR. To extend FAR to support a 957 new network architecture, only processing of Hello and LFA messages 958 need providing to build BRT and NRT routing tables. 960 In this protocol, FAR can support maximally 12 network architectures 961 and at least support 1 built-in network architecture, such as Fat- 962 tree, BCube and MatrixDCN,etc. Each network architecture is assigned 963 a unique number from 1 to 12. For example, if the 1 built-in 964 architectures are assigned 1, and other customized architectures are 965 assigned 2 to 12. 966 1: Fat-tree 967 2: BCube 968 3: MatrixDCN. 969 4: xxx. 970 ...... 971 12: xxx. 973 9. Application Example 975 In this section, we take a Fat-tree network(Fig. 7) as an example to 976 describe how to apply FAR routing. Since M1 to M4 are very simple, 977 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 978 network. 980 A Fat-tree network is composed of 4 layers. The top layer is core 981 layer, and the other layers are aggregation layer, edge layer and 982 server layer. There are k pods, each one containing two layers of 983 k/2 switches. Each k-port switch in the edge layer is directly 984 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 985 of the k-port switches in the aggregation layer. There are (k/2)2 986 k-port core switches. Each core switch has one port connected to 987 each of the k pods. 989 10.0.1.1 10.0.1.2 10.0.2.1 10.0.2.2 990 +--+ +--+ +--+ +--+ 991 | | | | | | | | 992 +--+,_ +--+', +--+ ,,+--+ 993 |`,`',`-., / | \ `. .'` - .-'``.` /| 994 | . `', `'., / | ' ' ,-`,' |`. .` ' | 995 | \ `', `-., .` / | `, .` ,' | 996 | `, `'. `'-,_ .'` ,' | ', / | 997 | . `'. `-.,-` / | \ 998 | \ `'., .` `'., ` | `. 999 | `, .'`, ,`'., | ', 1000 | . ,-` '., - `'-,_ | `. 1001 | \ .` ,'., `|., . 1002 | .'` / `-, | `'., `. 1003 10.1.0.1 ,-` . .' 10.3.0.1 10.3.0.2`'., ', 1004 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1005 | | | | | | | | | | | | | | | | 1006 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1007 | \/ | | \/ | | \/ | | \/ | 1008 +--+/\+--+ +--+/\+--+ +--+/\+--+ +--+/\+--+ 1009 | | | | | | | | | | | | | | | | 1010 +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+ 1011 /| 10.1.2.1 /| |\ 10 3.1.3 |\ /| | | 1012 / | | \ / | | \ / | | \ / | | | 1013 / | | \ / | | \ / | | \ / | | | 1014 / | | \ / | | \ / | | \ / | | | 1015 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1016 ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ 1017 10.1.2.2 10.3.1.3 1019 Figure 8: Fat-tree Network 1021 Aggregation switches are given addresses of the form 10.pod.0.switch, 1022 where pod denotes the pod number, and switch denotes the position of 1023 that switch in the upper pod (in [1, k/2]). Edge switches are given 1024 addresses of the form 10.pod.switch.1, where pod denotes the pod 1025 number, and switch denotes the position of that switch in the lower 1026 pod (in [1, k/2]). The core switches are given addresses of the form 1027 10.0.j.i, where j and i denote that switch's coordinates in the 1028 (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left). 1029 The address of a host follows the pod switch to which it is connected 1030 to; hosts have addresses of the form: 10.pod.switch.ID, where ID is 1031 the host's position in that subnet (in [2, k/2+1], starting from left 1032 to the right). 1034 9.1. BRT Building Procedure 1036 By leveraging the topology's regularity, every switch clearly knows 1037 how it forwards a packet. When a packet arrives at an edge switch, 1038 if the destination of the packet lies in the same subnet with the 1039 switch, then the switch directly forwards the packet to the 1040 destination server through layer-2 switching. Otherwise, the switch 1041 forwards the packet to any of aggregation switches in the same pod. 1042 When a packet arrives at an aggregation switch, if the destination of 1043 the packet lies in the same pod, the switch forwards the packet to 1044 the corresponding edge switch. Otherwise, the switch forwards the 1045 packet to any of core switches that it is connected to. If a core 1046 switch receives a packet, it forwards the packet to the corresponding 1047 aggregation switch that lies in the destination pod. 1049 The forwarding policy discussed above is easily expressed through a 1050 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1051 following entries: 1053 Destination/Mask Next hop 1054 10.0.0.0/255.0.0.0 10.1.0.1 1055 10.0.0.0/255.0.0.0 10.1.0.2 1057 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1058 the following entries: 1060 Destination/Mask Next hop 1061 10.1.1.0/255 255.255.0 10.1.1.1 1062 10.1.2.0/255.255.255.0 10.1.2.1 1063 10.0.0.0/255.0.0.0 10.0.1.1 1064 10.0.0.0/255.0.0.0 10.0.1.2 1066 The BRT of acore switch, such as 10.0.1.1, is composed of the 1067 following entries: 1069 Destination/Mask Next hop 1070 10.1.0.0/255 255.0.0 10.1.0.1 1071 10.2.0.0/255.255.0.0 10.2.0.1 1072 10.3.0.0/255.255.0.0 10.3.0.1 1073 10.4.0.0/255.255.0.0 10.4.0.1 1075 9.2. NRT Building Procedure 1077 The route entries in an NRT are related with link and node failures. 1078 We summarize all types of cases into three (3) catalogs. 1080 9.2.1. Single Link Failure 1082 In Fat-tree, Links can be classified as 3 types by their locations: 1083 1) servers to edge switches; 2) edge to aggregation switches; 3) 1084 aggregation to core switches. Link failures between servers to edge 1085 switches only affect the communication of the corresponding servers 1086 and don't affect the routing tables of any switch, so we only discuss 1087 the second and third type of links failures. 1089 Edge to Aggregation Switches 1091 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1092 and an aggregation switch, such as 10.1.0.1(B),fails. This link 1093 failure may affect 3 types of communications. 1095 o Sources lie in the same subnet with A, and destinations do not. In 1096 this case, the link failure will only affect the routing tables of A. 1097 As this link is attached to A directly, A only needs to delete the 1098 route entries whose next hop is B in its BRT and add no entries to 1099 its NRT when A's M6 module detect the link failure. 1101 o Destinations lie in the same subnet with A, and sources lie in 1102 another subnet of the same pod. In this case, the link failure will 1103 affect the routing tables of all the edge switches in the same pod 1104 except for A. When an edge switch, such as 10.1.1.1, learns the link 1105 failure, it will add a route entry to its NRT: 1107 Destination/Mask Next hop 1108 10.1.2.0/255.255.255.0 10.1.0.1 1110 o Destinations lie in the same subnet with A, sources lie in another 1111 pod. In this case, the link failure will affect the routing tables 1112 of all the edge switches in the other pods. When an edge switch in 1113 one other pod, such as 10.3.1.1, learns the link failure, because all 1114 the routings that pass through 10.3.0.1 to A will certainly pass 1115 through the link between A and B, 10.3.1.1 need add a route entry to 1116 its NRT: 1118 Destination/Mask Next hop 1119 10.1.2.0/255.255.255.0 10.3.0.1 1121 Aggregation to Core Switches 1123 Suppose that the link between an aggregation switch, such as 10.1.0.1 1124 (A), and a core switch, such as 10.0.1.2(B), fails. This link 1125 failure may affect 2 types of communications. 1127 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1128 the other pods. In this case, the link failure will only affect the 1129 routing tables of A. As this link is attached to A directly, A only 1130 need to delete the route entries whose next hop is B in its BRT and 1131 add no entries to its NRT when A's M6 module detect the link failure. 1133 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1134 another pod. In this case, the link failure will affect the routing 1135 tables of all the aggregation switches in other pods except for pod 1136 1. When an aggregation switch in one other pod, such as 10.3.0.1, 1137 learns the link failure, because all the routings that pass through 1138 10.0.1.2 to the pod 1 where A lies will certainly pass through the 1139 link between A and B, 10.3.0.1 need add a route entry to its NRT: 1141 Destination/Mask Next hop 1142 10.1.0.0/255.255.0.0 10.0.1.2 1144 9.2.2. A Group of Link Failures 1146 If all the uplinks of an aggregation switch fail, then this switch 1147 cannot forward packets, which will affect the routing of every edge 1148 switches. Suppose that all the uplinks of the node A (10.1.0.1) 1149 fail, it will affect two types of communications. 1151 o Sources lie in the same pod (pod 1) with A, and destinations lie in 1152 the other pods. In this case, the link failures will affect the 1153 routing of the edge switches in the Pod of A. To avoid the node A, 1154 each edge switch should remove the route entry "10.0.0.0/255.0.0.0 1155 10.1.0.1" in which the next hop is the node A. 1157 o Destinations lie in the same pod (pod 1) with A, and sources lie in 1158 other pods. In this case, the link failures will affect the routing 1159 of edge switches in other pods. For example, if the edge switch 1160 10.3.1.1 communicates with some node in the pod of A, it should avoid 1161 the node 10.3.0.1, because any communication through 10.3.0.1 to the 1162 pod of A will pass through the node A. So a route entry should be 1163 added to 10.3.1.1: 1165 Destination/Mask Next hop 1166 10.1.0.0/255.255.0.0 10.3.0.1 1168 9.2.3. Node Failures 1170 At last, we discuss the effect of node failures to a NRT. There are 1171 3 types of node failures: the failure of edge, aggregation and core 1172 switches. 1174 o An edge switch fails. The failure doesn't affect the routing table 1175 of any switch. 1177 o A core switch fails. Only when all the core switches connected to 1178 the same aggregation switch fail, they will affect the routing of 1179 other switches. This case is equal to the case that all the uplinks 1180 of an aggregation switch fail, so the process of link failures can 1181 cover it. 1183 o An aggregation switch fails. This case is similar to the case that 1184 all the uplinks of an aggregation switch fail. It affects the 1185 routing of edge switches in other pods, but doesn't affect the 1186 routing of edge switches in pod of the failed switch. The process of 1187 this failure is same to the second case in section 6.2.2. 1189 9.3. Routing Procedure 1191 FAR decides a routing by looking up its BRT and NRT. We illuminate 1192 the routing procedure by an example. In this example, we suppose 1193 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1194 10.1.2.1 and 10.1.0.2 have failed. Then we look into the routing 1195 procedure of a communication from 10.3.1.3 (source) to 10.1.2.2 1196 (destination). 1198 Step 1: The source 10.3.1.3 sends packets to its default router 1199 10.3.1.1 1201 Step 2: The routing of 10.3.1.1. 1203 1) Calculate candidate hops 1205 10.3.1.1 looks up its BRT and gets the following matched entries: 1207 Destination/Mask Next hop 1208 10.0.0.0/255.0.0.0 10.3.0.1 1210 So the candidate hops = {10.3.0.1} 1212 2) Calculate avoiding hops 1214 Its NRT is empty, so the set of avoiding hop is empty too. 1216 3) Calculate applicable hops 1218 The applicable hops are candidate hops minus avoiding hops, so: 1220 The applicable hops = {10.3.0.1} 1222 4) Forward packets to 10.3.0.1 1224 Step 3: The routing of 10.3.0.1 1226 1) Calculate candidate hops. 1228 10.3. 0.1 looks up its BRT and gets the following matched entries: 1230 Destination/Mask Next hop 1231 10.1.0.0/255.255.0.0 10.0.1.1 1232 10.1.0.0/255.255.0.0 10.0.1.2 1234 So the candidate hops = {10.0.1.1, 10.0.1.2} 1236 2) Calculate avoiding hops 1238 Destination/Mask Next hop 1239 10.1.0.0/255.255.0.0 10.0.1.2 1241 So the avoiding hops = {10.0.1.2} 1243 3) Calculate applicable hops 1245 The applicable hops are candidate hops minus avoiding hops, so: 1247 The applicable hops = {10.0.1.1} 1249 4) Forward packets to 10.0.1.1 1250 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1251 routing tables. 1253 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1254 routing tables. 1256 Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by 1257 layer-2 switching. 1259 9.4. FAR's Performance in Large-scale Networks 1261 FAR has good performance to support large-scale networks. In this 1262 section, we take a Fat-tree network composed of 2,880 48-port 1263 switches and 27,648 servers as an example to show FAR's performance. 1265 9.4.1. The number of control messages required by FAR 1267 FAR exchanges a few messages between routers and only consumes a 1268 little network bandwidth. Tab. 1 shows the required messages in the 1269 example Fat-tree network. 1270 Table 1:Required messages in a Fat-tree network. 1271 _____________________________________________________________________ 1272 Message Type| Scope | size(bytes) | Rate | Bandwidth 1273 --------------------------------------------------------------------- 1274 Hello |adjacent switches|less than 48|10 messages/sec|less than 4 1275 kbps 1276 --------------------------------------------------------------------- 1277 DLR |adjacent switches| less than 48 | (1) |48bytes 1278 --------------------------------------------------------------------- 1279 DA |entire network| less than 48 | (2) |1.106M 1280 --------------------------------------------------------------------- 1281 LFA |entire network| less than 48 | (3) |48 bytes 1282 _____________________________________________________________________ 1283 (1)Produce one when a router starts 1284 (2)The number of switches(2,880) in a period 1285 (3)Produce one when a link fails or recovers 1287 9.4.2. The Calculating Time of Routing Tables 1289 A BRT is calculated according to the states of its neighbor routers 1290 and attached links. An NRT is calculated according to device and 1291 link failures in the entire network. So FAR does not calculate 1292 network topology and has no problem of network convergence, which 1293 greatly reduces the calculating time of routing tables. The 1294 detection and spread time of link failures is very short in FAR. 1295 Detection time is up to the interval of sending Hello message. In 1296 FAR, the interval is set to 100ms, and a link failure will be 1297 detected in 200ms. The spread time between any pair of routers is 1298 less than 200ms.If a link fails in a data center network, FAR can 1299 detect it, spread it to all the routers, and calculate routing tables 1300 in no more than 500ms. 1302 9.4.3. The Size of Routing Tables 1304 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1305 in Tab. 2. 1307 Table 2: The size of routing tables in FAR 1308 _____________________________________________________________________ 1309 Routing Table| Core Switch | Aggregation Switch | Edge Switch | 1310 --------------------------------------------------------------------- 1311 BRT | 48 | 48 | 24 1312 --------------------------------------------------------------------- 1313 NRT | 0 | 14 | 333 1314 _____________________________________________________________________ 1316 The BRT's size at a switch is determined by the number of its 1317 neighbor switches. In the example network, a core switch has 48 1318 neighbor switches (aggregation switch), so it has 48 entries in its 1319 BRT.Only aggregation and edge switches have NRTs. The NRT size at a 1320 switch is related to the number of link failures in the network. 1321 Suppose that there are 1000 link failures in the example network, the 1322 number of failed links is 1.2% of total links, which is a very high 1323 failure ratio. We suppose that link failures are uniformly 1324 distributed in the entire network. The NRT size at an edge switch is 1325 about 333 and the NRT size of an aggregation switch is about 14in 1326 average. 1328 10. Security Considerations 1330 The security considerations will be discussed in a future version of 1331 this document. 1333 11. Conclusions 1335 This draft introduces FAR protocol, a generic routing method and 1336 protocol, for data centers that have a regular topology. It uses two 1337 routing tables, a BRT and an NRT, to store the normal routing paths 1338 and the forbidden (to-be-avoided) routing paths, respectively. This 1339 makes the FAR protocol very simple and efficient. The sizes of these 1340 two tables are very small. Usually, a BRT has only several tens of 1341 entries and an NRT has only several or about a dozen entries. 1343 12. Acknowledgments 1345 This document is supported by ZTE Enterprise-University-Research 1346 Joint Project. 1348 13. References 1350 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1351 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1352 1997. 1354 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat."A 1355 Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM 1356 2008. 1358 Authors' Addresses 1360 Bin Liu 1361 ZTE Inc., ZTE Plaza 1362 No.19 East Huayuan Road,Hai Dian District 1363 Beijing 100191 1364 China 1366 Phone: +86 -010-59932039 1367 Email: 13683610386@139.com 1369 Yantao Sun 1370 Beijing Jiaotong University 1371 No.3 Shang Yuan Cun, Hai Dian District 1372 Beijing 100044 1373 China 1375 Email: ytsun@bjtu.edu.cn 1377 Jing Cheng 1378 Beijing Jiaotong University 1379 No.3 Shang Yuan Cun, Hai Dian District 1380 Beijing 100044 1381 China 1383 Email: yourney.j@gmail.com 1384 Yichen Zhang 1385 Beijing Jiaotong University 1386 No.3 Shang Yuan Cun, Hai Dian District 1387 Beijing 100044 1388 China 1390 Email: snowfall_dan@sina.com 1392 Bhumip Khasnabish 1393 ZTE TX Inc. 1394 55 Madison Avenue, Suite 160 1395 Morristown, New Jersey 07960 1396 USA 1398 Phone: +001-781-752-8003 1399 Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com 1400 URI: http://tinyurl.com/bhumip/