Routing Area Working Group                                        Bin Liu
Internet-Draft                                                   ZTE Inc.
Intended status: Informational                                 Yantao Sun
Expires: December 23, 2014                                     Jing Cheng
                                                             Yichen Zhang
                                              Beijing Jiaotong University
                                                            June 21, 2014

    Generic Fault-avoidance Routing Protocol for Data Center Networks
                        draft-sl-rtgwg-far-dcn-01

Abstract

   This draft proposes a generic routing method and protocol for
   regular data center networks, named the fault-avoidance routing
   (FAR) protocol.  The FAR protocol provides a generic routing method
   for all types of network architectures that have been proposed for
   large-scale cloud-based data centers over the past few years.  FAR
   is designed to fully leverage the regularity in such topologies and
   to compute its routing tables in a simple manner.  Fat-tree is taken
   as an example architecture to illustrate how the FAR protocol can be
   applied in real operational scenarios.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 23, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Acronyms & Definitions
   2.  Conventions used in this document
   3.  Problem Statement
     3.1.  The Impact of Large-scale Networks on Route Calculation
     3.2.  Dilemma of conventional routing methods in a large-scale
           network with a giant number of router nodes
     3.3.  Network Addressing Issues
     3.4.  Big Routing Table Issues
     3.5.  Adaptivity Issues for Routing Algorithms
     3.6.  Virtual Machine Migration Issues
   4.  The FAR Framework
   5.  Data Format
     5.1.  Data Tables
     5.2.  Messages
   6.  FAR Modules
     6.1.  Neighbor and Link Detection Module (M1)
     6.2.  Device Learning Module (M2)
     6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)
     6.4.  Link Failure Learning Module (M4)
     6.5.  BRT Building Module (M5)
     6.6.  NRT Building Module (M6)
     6.7.  Routing Table Lookup (M7)
   7.  Application Example
     7.1.  BRT Building Procedure
     7.2.  NRT Building Procedure
       7.2.1.  Single Link Failure
       7.2.2.  A Group of Link Failures
       7.2.3.  Node Failures
     7.3.  Routing Procedure
   8.  Conclusion
   9.  Reference
   10. Acknowledgments
   11. Security Considerations
   Authors' Addresses

1.  Introduction

   In recent years, with the rapid development of cloud computing
   technologies, widely deployed cloud services such as Amazon EC2 and
   Google search have brought huge challenges to data center networking
   (DCN).  Today's cloud-based data centers (DCs) require large-scale
   networks with larger internal bandwidth and smaller transfer delay;
   however, conventional networks cannot meet these requirements due to
   limitations in their network architecture.  In order to satisfy the
   requirements of cloud computing services, many new network
   architectures have been proposed for data centers, such as Fat-tree,
   MatrixDCN, and BCube.  These new architectures can support non-
   blocking large-scale data center networks with more than tens of
   thousands of physical servers.

   Generic routing protocols such as OSPF and IS-IS cannot scale to
   support a very large data center network because of the problems of
   network convergence and PDU overhead.
   For each type of architecture, researchers have designed a routing
   algorithm according to the features of its topology.  Because these
   routing algorithms differ and are incompatible with one another, it
   is very difficult to develop a routing protocol for network routers
   that supports multiple routing algorithms.  Furthermore, the fault-
   tolerance mechanisms in those routing algorithms are complicated and
   inefficient.

   This draft proposes a generic routing method and protocol, the
   fault-avoidance routing (FAR) protocol, for DCNs.  This method
   leverages the regularity in the topologies of data center networks
   to simplify route learning and accelerate routing table queries.
   The method has good fault-tolerance properties and can be applied to
   any DCN with a regular topology.

1.1.  Acronyms & Definitions

   DCN - Data Center Network

   FAR - Fault-avoidance Routing

   BRT - Basic Routing Table

   NRT - Negative Routing Table

   NDT - Neighbor Devices Table

   ADT - All Devices Table

   LFT - Link Failure Table

   DA - Device Announcement

   LFA - Link Failure Announcement

   DLR - Device and Link Request

   IP - Internet Protocol

   UDP - User Datagram Protocol

   VM - Virtual Machine

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].
   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower-case uses of these words are not to be
   interpreted as carrying RFC 2119 significance.

3.  Problem Statement

   This section describes the problem to be addressed by FAR as
   proposed in this draft.  The expansion of cloud data center networks
   has brought significant challenges to existing routing technologies.
   FAR mainly solves a series of routing problems faced by large-scale
   data center networks.

3.1.  The Impact of Large-scale Networks on Route Calculation

   In a large-scale cloud data center network, there may be thousands
   of routers.  Running OSPF or other routing protocols in such a
   network encounters two challenges: a) the network convergence time
   would be too long, so creating and updating routes takes longer and
   the response to network failures may be excessively slow; b) a large
   number of routing protocol packets need to be sent, and this routing
   information consumes a lot of network bandwidth and CPU resources,
   which easily leads to packet loss and makes problem (a) more
   prominent.

   In order to solve these problems, a common practice is to partition
   the large network into small areas, where route calculation runs
   independently within each area.  However, today's cloud data centers
   typically require very large internal bandwidth.  To meet this
   requirement, a large number of parallel equivalent links are
   deployed in the network, as in the Fat-tree network architecture.
   Partitioning the network prevents the routing algorithm from fully
   utilizing these equivalent multi-paths and thus reduces the internal
   network bandwidth that can actually be delivered.

   In the FAR route calculation process, a Basic Routing Table (BRT) is
   built from the local network topology by leveraging the regularity
   of the network topology.  In addition to the BRT, FAR also builds a
   Negative Routing Table (NRT).  FAR builds the NRT gradually in the
   process of learning network link failure information, and it does
   not require learning the complete network fault information.  FAR
   does not need to wait for network convergence to complete while
   building these two tables.
   Therefore, it avoids the problem of excessive network convergence
   overheads in the route calculation process.  In addition, FAR only
   needs to exchange a small amount of link change information between
   routers, and hence consumes less network bandwidth.

3.2.  Dilemma of conventional routing methods in a large-scale network
      with a giant number of router nodes

   There are many real-world scenarios where tens of thousands of
   nodes (or many more) need to be deployed in a flat area, such as
   InfiniBand routing and switching systems, high-performance computer
   networks, and many IDC networks in China.  Similar problems have
   existed for a long time and have been solved through similar
   solutions, such as the traditional regular-topology-based RFC 3619
   protocol, the routing protocols of InfiniBand routing and switching
   systems, and high-performance computer network routing protocols.

   Why do OSPF and other conventional routing methods not work well in
   a large-scale network with a giant number of router nodes?

   As is well known, the OSPF protocol uses multiple databases, more
   topological exchange information (as seen in the following example),
   and a complicated algorithm.  It requires routers to consume more
   memory and CPU processing capability, but the rate at which a CPU
   can process protocol messages per second is very limited.  When the
   network expands, the CPU quickly approaches its processing limits,
   and at that point OSPF cannot continue to expand the scale it
   manages.  The SPF algorithm itself does not thoroughly solve these
   problems.

   On the contrary, FAR does not have the convergence delay and the
   additional CPU overheads that SPF requires, because in the initial
   stage FAR already knows the regular structure of the whole network
   topology and does not need to run the SPF computation periodically.
   One example of "more topological exchange information":

   In the OSPF protocol, LSAs are flooded every 1800 seconds.
   Especially in larger networks, the occupation of CPU and bandwidth
   will soon reach the router's performance bottleneck.  In order to
   reduce these adverse effects, OSPF introduced the concept of Area,
   which still does not solve the problem thoroughly.  By dividing an
   OSPF domain into several areas, the routers in the same area do not
   need to know the topological details outside their area.  (In
   comparison with FAR, after OSPF introduces Areas, equivalent paths
   can no longer be selected across the whole network scope.)

   OSPF achieves the following results through Areas: 1) routers only
   need to maintain the same link-state databases as other routers
   within the same Area, without the necessity of maintaining the same
   link-state database as all routers in the whole OSPF domain; 2) the
   reduction of the link-state databases means dealing with relatively
   fewer LSAs, which reduces the CPU consumption of routers; 3) the
   large number of LSAs flood only within the same Area.  However, the
   negative effect is that only a smaller number of routers can be
   managed in each OSPF area.  On the contrary, because FAR does not
   have the above disadvantages, FAR can manage a large-scale network
   even without dividing it into Areas.

   The aging time of OSPF is set in order to adapt to the routing
   changes and frequent protocol message exchanges of irregular
   topologies.  Its negative effect is that even when the network does
   not change, every LSA must be refreshed every 1800 seconds to reset
   its aging time.  In a regular topology, since the routes are fixed,
   such complex protocol message exchanges and aging rules are not
   needed to reflect routing changes; the LFA mechanism in FAR is
   sufficient.
   Therefore, FAR can omit much unnecessary processing and packet
   exchange.  The benefits are fast convergence and a much larger
   network scale than other dynamic routing protocols.  There are
   already some successful implementations of simplified routing over
   regular topologies in HPC environments.  Conclusion: since FAR needs
   few routing entries and the topology is regular, the database does
   not need to be updated regularly.  Without the need for aging, there
   is no CPU and bandwidth overhead from LSA floods every 30 minutes,
   so expanding the network has no obvious effect on the performance of
   FAR, contrary to OSPF.

   Comparison of convergence time: the settings of OSPF spf_delay and
   spf_hold_time affect the convergence time.  The convergence time of
   a network with 2480 nodes is about 15-20 seconds (as seen in the
   following pages), while FAR does not need to calculate the SPF, so
   there is no such convergence time.

   These issues still exist in the rapid convergence technologies of
   OSPF and IS-IS (such as i-SPF); convergence speed and network scale
   constrain each other.  FAR does not have the above problems, and its
   convergence time is almost negligible.

   Can FRR solve these problems?  IP FRR has some limitations.
   Establishing an IP FRR backup scheme does not affect the original
   topology and traffic forwarding established by the protocol;
   however, we cannot obtain the whereabouts and status of the traffic
   once it is switched to an alternate next hop.

3.3.  Network Addressing Issues

   Routers typically have multiple network interfaces, each connected
   to a subnet.  OSPF and other routing algorithms require that each
   interface of a router be configured with an IP address.  A large-
   scale data center network may contain thousands of routers.
   With dozens of network interfaces per router, tens of thousands of
   IP addresses may need to be configured.  Configuring and managing
   such a large number of network interfaces is a very complex issue.
   Network maintenance is usually costly and error-prone, and
   troubleshooting becomes difficult.

   In FAR, the device's position information is encoded in the IP
   address of the router.  Each router only needs to be assigned one
   unique IP address according to its location, which greatly
   simplifies the complex network addressing issues in large-scale
   networks.

3.4.  Big Routing Table Issues

   There are a large number of subnets in a large-scale data center
   network.  Routers may build a routing entry for each subnet, and
   therefore the routing tables on each router may be very large.
   This increases equipment cost and reduces the speed of routing
   table lookups.

   FAR uses two measures to reduce the size of the routing tables:
   a) it builds a BRT based on the regularity of the network topology;
   b) it introduces a new routing table, the NRT.  In this way FAR can
   reduce the size of routing tables to only a few dozen routing
   entries.

3.5.  Adaptivity Issues for Routing Algorithms

   To implement efficient routing in large-scale data centers, besides
   FAR, other routing methods have been proposed for specific network
   architectures, such as Fat-tree and BCube.  These routing methods
   are different (from both design and implementation viewpoints) and
   not compatible with conventional routing methods, which creates
   significant difficulty for network equipment providers that must
   develop new routers supporting various new routing methods.

   FAR is a generic routing method.  With slight modification, the FAR
   method can be applied to most regular data center networks.
   Furthermore, the structure of routing tables and the procedure for
   querying them in FAR are the same as in conventional routing
   methods.  If FAR is adopted, the workload of developing a new type
   of router will be significantly decreased.

3.6.  Virtual Machine Migration Issues

   Supporting VM migration is very important for cloud-based data
   center networks.  However, in order to support layer-3 routing,
   routing methods including OSPF and FAR must limit VM migration to
   within a subnet.  To resolve this paradox, the mainstream methods
   still utilize layer-3 routing on routers or switches, transmit
   packets encapsulated with IP-in-IP or MAC-in-IP between hosts
   through tunnels passing through the network to the destination
   access switch, and then extract the original packet and send it to
   the destination host.

   By utilizing the aforementioned methods, FAR can be applied to Fat-
   tree, MatrixDCN, or BCube networks to support VM migration across
   the entire network.

4.  The FAR Framework

   FAR requires that a DCN have a regular topology and that network
   devices, including routers, switches, and servers, be assigned IP
   addresses according to their locations in the network.  In other
   words, we can locate a device in the network according to its IP
   address.

   FAR is a distributed routing method.  In order to support FAR, each
   router needs a routing module that implements the FAR algorithm.
   The FAR algorithm is composed of three parts, i.e., link-state
   learning, routing table building, and routing table querying, as
   shown in Fig. 1.
   Link-state Learning            |Routing Table|  Routing Table
                                  |  Building   |  Querying
   +--------+  /---------------\  | +--------+  |  Packets
   |2 Device|<-| 1 Neighbor &  |--->| 5 BRT  |\ |     |
   |Learning|  | Link Detection|  | |Building| \|    \|/
   +--------+  \---------------/  | +--------+  +--------------+
       |              |           |             | 7 Querying   |
       |              |           |             | Routing Table|
      \|/            \|/          |             +--------------+
   +-----------------------+      |            /|
   |3 Invisible Neighbor & |      | +--------+/ |
   |Link Failure Inferring |      | | 6 NRT  |  |
   +-----------------------+      | |Building|  |
       |                          | +--------+  |
      \|/                         |             |
   +--------------+               |             |
   |4 Link Failure|               |             |
   |  Learning    |               |             |
   +--------------+               |             |

                    Figure 1: The FAR framework

   Link-state learning is responsible for a router detecting the
   states of its directly connected links and learning the states of
   all the other links in the entire network.  The second part builds
   two routing tables, a basic routing table (BRT) and a negative
   routing table (NRT), according to the link states learned in the
   first part.  The third part queries the BRT and the NRT to decide
   the next forwarding hop for received (ingress) packets.

5.  Data Format

5.1.  Data Tables

   In FAR, several data tables are maintained on each router:

   Neighbor Devices Table (NDT): stores neighbor routers and related
   links.

   All Devices Table (ADT): stores all routers in the entire network.

   Link Failures Table (LFT): stores all link failures in the entire
   network.

   Basic Routing Table (BRT): stores the candidate routes.

   Negative Routing Table (NRT): stores the routes to avoid.

   The format of the NDT:

   ----------------------------------------------------------
   Device ID | Device IP | Port ID | Link State | Update Time
   ----------------------------------------------------------

   Device ID: The ID of a neighbor router.

   Device IP: The IP address of a neighbor router.
   Port ID: The local port that the neighbor router is attached to.

   Link State: The state of the link between this router and the
   neighbor router.  There are two states: Up and Down.

   Update Time: The time the entry was last updated.

   The format of the ADT:

   --------------------------------------------------
   Device ID | Device IP | Type | State | Update Time
   --------------------------------------------------

   Device ID: The ID of a router.

   Device IP: The IP address of a router.

   Type: The type of a router.

   State: The state of a router.  There are two states: Up and Down.

   Update Time: The time the entry was last updated.

   The format of the LFT:

   --------------------------------------------
   No | Router 1 IP | Router 2 IP | Update Time
   --------------------------------------------

   No: The entry number.

   Router 1 IP: The IP address of one router that a failed link
   connects to.

   Router 2 IP: The IP address of the other router that the failed
   link connects to.

   Update Time: The time the entry was last updated.

   The format of the BRT:

   -------------------------------------------------------
   Destination | Mask | Next Hop | Interface | Update Time
   -------------------------------------------------------

   Destination: A destination network.

   Mask: The subnet mask of a destination network.

   Next Hop: The IP address of a next hop for a destination.

   Interface: The interface related to a next hop.

   Update Time: The time the entry was last updated.

   The format of the NRT:

   ------------------------------------------------------------------
   Destination | Mask | Next Hop | Interface | Failed Link No |
   Timestamp
   ------------------------------------------------------------------

   Destination: A destination network.

   Mask: The subnet mask of a destination network.

   Next Hop: The IP address of a next hop that should be avoided for a
   destination.
   Interface: The interface related to a next hop that should be
   avoided.

   Failed Link No: A group of failed link numbers separated by '/',
   for example 1/2/3.

   Timestamp: The time the entry was last updated.

5.2.  Messages

   Several protocol messages are exchanged between routers in FAR:

   Hello Message: exchanged between neighbor routers to learn
   adjacency.

   Device Announcement (DA): synchronizes the knowledge of routers
   between routers.

   Link Failure Announcement (LFA): synchronizes link failures between
   routers.

   Device and Link Request (DLR): when a router starts, it requests
   the knowledge of routers and links from its neighbors with a DLR
   message.

   A FAR message is encapsulated directly in an IP packet; the
   protocol field of the IP header indicates that an IP packet is a
   FAR message.  The IP protocol number for FAR should be assigned by
   IANA.

   The four types of FAR messages have the same packet header format,
   called the FAR header (shown in Figure 2).

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+

              Figure 2: The format of the FAR header

   Version: The FAR version.

   Message Type: The type of FAR message.

   Message Length: The total length of the FAR message.

   Checksum: The checksum of the entire FAR message.

   AuType: Authentication type.  0: no authentication, 1: plaintext
   authentication, 2: MD5 authentication.

   Authentication: Authentication information.
   0: undefined; 1: key; 2: key ID, MD5 data length, and packet
   number.  The MD5 data is appended to the end of the packet.

   AuType and Authentication follow the definitions of the OSPF
   packet.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                   Router IP                    |
   +-----------------------+------------------------+
   |    HelloInterval      |   HelloDeadInterval    |
   +-----------------------+------------------------+
   |              Neighbor Router IP                |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

             Figure 3: The format of Hello messages

   For Hello messages, the Message Type in the FAR header is set to 1.
   Besides the FAR header, a Hello message (Fig. 3) requires the
   following fields:

   Router IP: The router's IP address.

   HelloInterval: The interval at which Hello messages are sent to
   neighbor routers.

   HelloDeadInterval: The interval after which a neighbor router is
   declared dead (out of service).  If a router does not receive a
   Hello message from a neighbor router within this interval, the
   neighbor router is treated as dead.

   Neighbor Router IP: The IP address of a neighbor router.  All the
   neighbor routers' addresses should be included in a Hello message.
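   As an illustration of the field widths in Figures 2 and 3, the
   following Python sketch encodes and decodes a Hello message.  It is
   a minimal sketch, not a normative implementation: it assumes
   network (big-endian) byte order, leaves the Checksum zeroed, and
   treats the 8-byte Authentication field as opaque; the function
   names are illustrative only.

```python
import socket
import struct

# Version(1), Message Type(1), Message Length(2), Checksum(2),
# AuType(2), Authentication(8), Timestamp(4) -- per Figure 2.
FAR_HEADER = "!BBHHH8sI"
HEADER_LEN = struct.calcsize(FAR_HEADER)   # 20 bytes

def build_hello(router_ip, hello_interval, dead_interval,
                neighbor_ips, timestamp=0):
    """Encode a Hello message (Message Type 1) after the FAR header."""
    body = socket.inet_aton(router_ip)
    body += struct.pack("!HH", hello_interval, dead_interval)
    for ip in neighbor_ips:
        body += socket.inet_aton(ip)
    length = HEADER_LEN + len(body)
    # Checksum left as 0 here; a real implementation would compute it
    # over the entire message, as OSPF does for its packets.
    header = struct.pack(FAR_HEADER, 1, 1, length, 0, 0,
                         b"\x00" * 8, timestamp)
    return header + body

def parse_hello(packet):
    """Decode the fields of a Hello message built as above."""
    (ver, mtype, length, cksum, autype,
     auth, ts) = struct.unpack(FAR_HEADER, packet[:HEADER_LEN])
    body = packet[HEADER_LEN:length]
    router_ip = socket.inet_ntoa(body[:4])
    hello, dead = struct.unpack("!HH", body[4:8])
    neighbors = [socket.inet_ntoa(body[i:i + 4])
                 for i in range(8, len(body), 4)]
    return {"type": mtype, "router_ip": router_ip,
            "hello_interval": hello, "dead_interval": dead,
            "neighbors": neighbors}
```

   For example, a Hello from a router with two neighbors occupies 20
   header bytes plus a 16-byte body (4 + 2 + 2 + 2 x 4).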
   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                   Router1 IP                   |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

              Figure 4: The format of DA messages

   For DA messages (Fig. 4), the Message Type in the FAR header is set
   to 2.  Besides the FAR header, a DA message includes the IP
   addresses of all the announced routers.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                    Left IP                     |
   +------------------------------------------------+
   |                    Right IP                    |
   +------------------------------------------------+
   |                     State                      |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

              Figure 5: The format of LFA messages

   For LFA messages (Fig. 5), the Message Type in the FAR header is
   set to 3.  Besides the FAR header, an LFA message includes all the
   announced link failures.

   Left IP: The IP address of the left endpoint router of a link.
   Right IP: The IP address of the right endpoint router of a link.

   State: The link state.  0: Up, 1: Down.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+

              Figure 6: The format of DLR messages

   For DLR messages (Fig. 6), the Message Type in the FAR header is
   set to 4.  Besides the FAR header, a DLR message has no additional
   fields.

6.  FAR Modules

6.1.  Neighbor and Link Detection Module (M1)

   M1 is responsible for sending and receiving Hello messages and for
   detecting directly connected links and neighbor routers.  Each
   Hello message is encapsulated in a UDP packet.  M1 sends Hello
   messages periodically out of all the active router ports and
   receives Hello messages from its neighbor routers.  M1 detects
   neighbor routers and directly connected links according to the
   received Hello messages and stores these neighbors and links in a
   Neighbor Devices Table (NDT).  Additionally, M1 also stores the
   neighbor routers in an All Devices Table (ADT).

6.2.  Device Learning Module (M2)

   M2 is responsible for sending, receiving, and forwarding device
   announcement (DA) messages, learning all the routers in the whole
   network, and deducing failed routers.  When a router starts, it
   sends a DA message announcing itself to its neighbors and a DLR
   message requesting the knowledge of routers and links from its
   neighbors.  When the M2 module of a router receives a DA message,
   it checks whether the router encapsulated in the message is in the
   ADT.
If the router is not in the ADT, M2 puts this router into the ADT and
702  forwards this DA message out of all the active ports except the
703  incoming one; otherwise, M2 discards the message directly.  If the M2
704  module of a router receives a DLR message, it replies with a DA
705  message that encapsulates all of the learned routers.

707  6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)

709  M3 is responsible for inferring invisible neighbors of the current
710  router by means of the ADT.  If the link between router A and its
711  neighbor B breaks, so that the M1 module of A cannot detect
712  the existence of B, then B is an invisible neighbor of A.  Since a
713  device's location is encoded in its IP address, whether two routers
714  are adjacent can be judged from their IP addresses.
715  Based on this idea, M3 infers all of the invisible neighbors of the
716  current router and the related link failures.  The results are stored
717  in the NDT.  Moreover, link failures are also added to a link-
718  failure table (LFT).  The LFT stores all of the failed links in the
719  entire network.

721  6.4.  Link Failure Learning Module (M4)

723  M4 is responsible for sending, receiving, and forwarding link failure
724  announcement (LFA) messages and learning all the link failures in the
725  whole network.  M4 broadcasts each newly inferred link failure to all
726  the routers in the network.  Each link failure is encapsulated in an
727  LFA message, and one link failure is broadcast only once.  If a router
728  receives a DLR request from its neighbor, it replies with an LFA
729  message that encapsulates all the learned link failures through the
730  M4 module.  If M4 receives an LFA message, it checks whether the link
731  failure encapsulated in the message is already in its LFT.
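The duplicate-suppressing check-and-flood behavior shared by M2 (DA messages against the ADT) and M4 (LFA messages against the LFT) can be sketched as follows.  This is an illustrative sketch only; the function and parameter names are assumptions, not part of the protocol:

```python
def handle_announcement(item, known_set, active_ports, in_port, send):
    """Generic check-and-flood used by M2 (DA/ADT) and M4 (LFA/LFT).

    item         -- the announced item (a router IP, or a failed link)
    known_set    -- the ADT or LFT, modeled as a set of learned items
    active_ports -- all active ports of this router
    in_port      -- the port the announcement arrived on
    send         -- callable(port, item) that forwards the announcement
    Returns True if the item was new and was re-flooded.
    """
    if item in known_set:          # already learned: discard silently
        return False
    known_set.add(item)            # learn the new router / link failure
    for port in active_ports:      # re-flood on every active port
        if port != in_port:        #   except the incoming one
            send(port, item)
    return True
```

Because each item is flooded at most once per router, an announcement traverses every link a bounded number of times and the flood terminates.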
If the link failure is not in the LFT, M4 puts the link failure into
733  the LFT and forwards the LFA message out of all the active ports
734  except the incoming one; otherwise, M4 discards the message directly.

736  6.5.  BRT Building Module (M5)

738  M5 is responsible for building a BRT for the current router.  By
739  leveraging the regularity in the topology, M5 can calculate the
740  routing paths for any destination without knowledge of the whole
741  network's topology, and then build the BRT based on the NDT.  Since
742  the IP addresses of network devices are contiguous, M5 creates only
743  one route entry for a group of destination addresses that share the
744  same network prefix, by means of route aggregation.  Usually,
745  the size of a BRT is very small.  The details of how to build a BRT
746  are described in Section 5.

748  6.6.  NRT Building Module (M6)

750  M6 is responsible for building an NRT for the current router.  Because
751  M5 builds a BRT without considering link failures in the network, the
752  routing paths calculated from the BRT cannot avoid failed links.  To
753  solve this problem, an NRT is used to exclude the routing paths that
754  traverse failed links from the paths calculated from the BRT.  M6
755  calculates the routing paths that traverse failed links and stores
756  them in the NRT.  The details of how to build an NRT are described in
757  Section 5.

759  6.7.  Routing Table Lookup (M7)

761  M7 is responsible for querying the routing tables and selecting the
762  next hop for forwarding packets.  Firstly, M7 takes the destination
763  address of the packet being forwarded as the criterion to look up
764  route entries in the BRT based on longest prefix match.  All of the
765  matched entries compose a candidate-hop list.  Secondly, M7 looks up
766  negative route entries in the NRT, again taking the packet's
767  destination address as the criterion.
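The whole selection procedure, whose remaining steps are described in the text that follows, can be sketched end to end as below.  The list-of-pairs table representation and the function name are illustrative assumptions, not the protocol's actual data structures:

```python
import ipaddress
import random

def select_next_hop(dst, brt, nrt):
    """Sketch of M7's selection: BRT longest-prefix matches minus NRT matches.

    brt, nrt -- lists of (prefix, next_hop) pairs, e.g.
                ("10.1.0.0/255.255.0.0", "10.0.1.1")
    Returns one applicable next hop, or None (the packet is dropped).
    """
    addr = ipaddress.ip_address(dst)
    # Step 1: candidate hops = BRT entries matching dst with the longest prefix.
    matches = [(ipaddress.ip_network(p), nh) for p, nh in brt
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    longest = max(net.prefixlen for net, _ in matches)
    candidates = {nh for net, nh in matches if net.prefixlen == longest}
    # Step 2: avoiding hops = every matching NRT entry, not only the longest.
    avoiding = {nh for p, nh in nrt if addr in ipaddress.ip_network(p)}
    # Step 3: applicable hops = candidates minus avoiding; Step 4: pick any one.
    applicable = sorted(candidates - avoiding)
    return random.choice(applicable) if applicable else None
```

For example, with the BRT {10.1.0.0/255.255.0.0 via 10.0.1.1 and via 10.0.1.2} and the NRT {10.1.0.0/255.255.0.0 via 10.0.1.2}, a packet to 10.1.2.2 is sent to 10.0.1.1.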
This lookup is not limited to the longest prefix match: every entry
769  that matches the criterion is selected, and together they compose an
770  avoiding-hop list.  Thirdly, the candidate hops minus the avoiding
771  hops compose an applicable-hop list.  Finally, M7 sends the packet to
772  any one of the applicable hops.  If the applicable-hop list is empty,
773  the packet is dropped.

775  7.  Application Example

777  In this section, we take a Fat-tree network (Fig. 7) as an example to
778  describe how to apply FAR routing.  Since M1 to M4 are very simple,
779  we only introduce how the modules M5, M6, and M7 work in a Fat-tree
780  network.

782  A Fat-tree network is composed of 4 layers.  The top layer is the
783  core layer; the layers below it are the aggregation layer, the edge
784  layer, and the server layer.  There are k pods, each one containing
785  two layers of k/2 switches.  Each k-port switch in the edge layer is
786  directly connected to k/2 hosts.  The remaining k/2 ports are
787  connected to k/2 of the k-port switches in the aggregation layer.
788  There are (k/2)^2 k-port core switches.  Each core switch has one
789  port connected to each of the k pods.

791      10.0.1.1     10.0.1.2        10.0.2.1     10.0.2.2
792       +--+         +--+            +--+         +--+
793       |  |         |  |            |  |         |  |
794       +--+,_       +--+',          +--+       ,,+--+
795       |`,`',`-., / | \ `. .'` - .-'``.` /|
796       | . `', `'., / | ' ' ,-`,' |`. .` ' |
797       | \ `', `-., .` / | `, .` ,' |
798       | `, `'. `'-,_ .'` ,' | ', / |
799       | . `'. `-.,-` / | \
800       | \ `'., .` `'., ` | `.
801       | `, .'`, ,`'., | ',
802       | . ,-` '., - `'-,_ | `.
803       | \ .` ,'., `|., .
804       | .'` / `-, | `'., `.
805   10.1.0.1 ,-` . .'          10.3.0.1   10.3.0.2`'., ',
806   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
807   |  |  |  |    |  |  |  |    |  |  |  |    |  |  |  |
808   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
809   | \/ |        | \/ |        | \/ |        | \/ |
810   +--+/\+--+    +--+/\+--+    +--+/\+--+    +--+/\+--+
811   |  |  |  |    |  |  |  |    |  |  |  |    |  |  |  |
812   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
813   /| 10.1.2.1   /|    |\      10.3.1.3 |\   /|    |  |
814   / |  | \      / |    | \    / |    | \    / |    |  |
815   / |  | \      / |    | \    / |    | \    / |    |  |
816   / |  | \      / |    | \    / |    | \    / |    |  |
817   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++
818   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++
819   10.1.2.2                    10.3.1.3

821  Figure 7: Fat-tree Network

823  Aggregation switches are given addresses of the form 10.pod.0.switch,
824  where pod denotes the pod number, and switch denotes the position of
825  that switch in the upper pod (in [1, k/2]).  Edge switches are given
826  addresses of the form 10.pod.switch.1, where pod denotes the pod
827  number, and switch denotes the position of that switch in the lower
828  pod (in [1, k/2]).  The core switches are given addresses of the form
829  10.0.j.i, where j and i denote that switch's coordinates in the
830  (k/2)^2 core switch grid (each in [1, k/2], starting from top-left).
831  The address of a host follows that of the edge switch to which it is
832  connected; hosts have addresses of the form 10.pod.switch.ID, where
833  ID is the host's position in that subnet (in [2, k/2+1], starting
834  from left to right).

836  7.1.  BRT Building Procedure

838  By leveraging the topology's regularity, every switch knows exactly
839  how to forward a packet.  When a packet arrives at an edge switch,
840  if the destination of the packet lies in the same subnet as the
841  switch, the switch directly forwards the packet to the
842  destination server through layer-2 switching.  Otherwise, the switch
843  forwards the packet to any of the aggregation switches in the same pod.
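For an edge switch, this rule produces a very small BRT that can be derived purely from the addressing regularity; a minimal sketch, assuming the 10.pod.0.switch aggregation addressing above (the function name is illustrative):

```python
def build_edge_brt(pod, k=4):
    """Sketch: BRT of an edge switch in pod `pod` of a k-ary Fat-tree.

    Any traffic leaving the edge switch's own subnet may go to any of
    the k/2 aggregation switches 10.pod.0.1 .. 10.pod.0.(k/2) in the
    same pod, so one aggregate route per uplink suffices.
    """
    return [("10.0.0.0/255.0.0.0", "10.%d.0.%d" % (pod, a))
            for a in range(1, k // 2 + 1)]
```

For pod 1 with k=4 this yields the next hops 10.1.0.1 and 10.1.0.2, matching the edge-switch BRT shown in the following text.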
844  When a packet arrives at an aggregation switch, if the destination of
845  the packet lies in the same pod, the switch forwards the packet to
846  the corresponding edge switch.  Otherwise, the switch forwards the
847  packet to any of the core switches that it is connected to.  When a
848  core switch receives a packet, it forwards the packet to the
849  aggregation switch that lies in the destination pod.

851  The forwarding policy discussed above is easily expressed through a
852  BRT.  The BRT of an edge switch, such as 10.1.1.1, is composed of the
853  following entries:

855  Destination/Mask        Next hop
856  10.0.0.0/255.0.0.0      10.1.0.1
857  10.0.0.0/255.0.0.0      10.1.0.2

859  The BRT of an aggregation switch, such as 10.1.0.1, is composed of
860  the following entries:

862  Destination/Mask          Next hop
863  10.1.1.0/255.255.255.0    10.1.1.1
864  10.1.2.0/255.255.255.0    10.1.2.1
865  10.0.0.0/255.0.0.0        10.0.1.1
866  10.0.0.0/255.0.0.0        10.0.1.2

868  The BRT of a core switch, such as 10.0.1.1, is composed of the
869  following entries:

871  Destination/Mask        Next hop
872  10.1.0.0/255.255.0.0    10.1.0.1
873  10.2.0.0/255.255.0.0    10.2.0.1
874  10.3.0.0/255.255.0.0    10.3.0.1
875  10.4.0.0/255.255.0.0    10.4.0.1

877  7.2.  NRT Building Procedure

879  The route entries in an NRT are related to link and node failures.
880  We summarize all of the cases into three categories.

882  7.2.1.  Single Link Failure

884  In a Fat-tree, links can be classified into 3 types by their
885  locations: 1) server to edge switch; 2) edge to aggregation switch;
886  3) aggregation to core switch.  Failures of links between servers and
887  edge switches only affect the communication of the corresponding
888  servers and don't affect the routing table of any switch, so we only
889  discuss the second and third types of link failures.

891  Edge to Aggregation Switches

893  Suppose that the link between an edge switch, such as 10.1.2.1 (A),
894  and an aggregation switch, such as 10.1.0.1 (B), fails.
This link failure may affect 3 types of communications.

897  o Sources lie in the same subnet as A, and destinations do not.  In
898  this case, the link failure will only affect the routing tables of A.
899  As this link is attached to A directly, A only needs to delete the
900  route entries whose next hop is B from its BRT, and adds no entries
901  to its NRT, when A's M6 module detects the link failure.

903  o Destinations lie in the same subnet as A, and sources lie in
904  another subnet of the same pod.  In this case, the link failure will
905  affect the routing tables of all the edge switches in the same pod
906  except A.  When an edge switch, such as 10.1.1.1, learns of the link
907  failure, it adds a route entry to its NRT:

909  Destination/Mask          Next hop
910  10.1.2.0/255.255.255.0    10.1.0.1

912  o Destinations lie in the same subnet as A, and sources lie in
913  another pod.  In this case, the link failure will affect the routing
914  tables of all the edge switches in the other pods.  When an edge
915  switch in another pod, such as 10.3.1.1, learns of the link failure,
916  it needs to add a route entry to its NRT, because every path that
917  passes through 10.3.0.1 to A certainly passes through the link
918  between A and B:

920  Destination/Mask          Next hop
921  10.1.2.0/255.255.255.0    10.3.0.1

922  Aggregation to Core Switches

924  Suppose that the link between an aggregation switch, such as 10.1.0.1
925  (A), and a core switch, such as 10.0.1.2 (B), fails.  This link
926  failure may affect 2 types of communications.

928  o Sources lie in the same pod (pod 1) as A, and destinations lie in
929  the other pods.  In this case, the link failure will only affect the
930  routing tables of A.  As this link is attached to A directly, A only
931  needs to delete the route entries whose next hop is B from its BRT,
932  and adds no entries to its NRT, when A's M6 module detects the link
933  failure.

935  o Destinations lie in the same pod (pod 1) as A, and sources lie in
936  another pod.
In this case, the link failure will affect the routing
937  tables of all the aggregation switches in the other pods.  When an
938  aggregation switch in another pod, such as 10.3.0.1, learns of the
939  link failure, it needs to add a route entry to its NRT, because every
940  path that passes through 10.0.1.2 to pod 1, where A lies, certainly
941  passes through the link between A and B:

943  Destination/Mask        Next hop
944  10.1.0.0/255.255.0.0    10.0.1.2

946  7.2.2.  A Group of Link Failures

948  If all the uplinks of an aggregation switch fail, this switch
949  cannot forward packets, which affects the routing of every edge
950  switch.  Suppose that all the uplinks of node A (10.1.0.1)
951  fail; this affects two types of communications.

953  o Sources lie in the same pod (pod 1) as A, and destinations lie in
954  the other pods.  In this case, the link failures will affect the
955  routing of the edge switches in the pod of A.  To avoid node A,
956  each edge switch should remove the route entry
957  "10.0.0.0/255.0.0.0 -> 10.1.0.1", in which the next hop is
958  node A.

960  o Destinations lie in the same pod (pod 1) as A, and sources lie in
961  other pods.  In this case, the link failures will affect the routing
962  of the edge switches in other pods.  For example, if the edge switch
963  10.3.1.1 communicates with some node in the pod of A, it should avoid
964  the node 10.3.0.1, because any communication through 10.3.0.1 to the
965  pod of A will pass through node A.  So a route entry should be
966  added to 10.3.1.1:

968  Destination/Mask        Next hop
969  10.1.0.0/255.255.0.0    10.3.0.1

971  7.2.3.  Node Failures

973  Finally, we discuss the effect of node failures on an NRT.  There are
974  3 types of node failures: failures of edge, aggregation, and core
975  switches.

977  o An edge switch fails.  The failure doesn't affect the routing table
978  of any switch.

980  o A core switch fails.
Such a failure affects the routing of
981  other switches only when all the core switches connected to the
982  same aggregation switch fail.  This case is equivalent to the case
983  in which all the uplinks of an aggregation switch fail, so the
984  processing of link failures covers it.

986  o An aggregation switch fails.  This case is similar to the case in
987  which all the uplinks of an aggregation switch fail.  It affects the
988  routing of edge switches in other pods, but doesn't affect the
989  routing of edge switches in the pod of the failed switch.  The
990  processing of this failure is the same as for the second case in
991  Section 7.2.2.

993  7.3.  Routing Procedure

994  FAR decides a route by looking up its BRT and NRT.  We illustrate
995  the routing procedure with an example.  In this example, we suppose
996  that the link between 10.3.1.1 and 10.3.0.2 and the link between
997  10.1.2.1 and 10.1.0.2 have failed.  Then we look into the routing
998  procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
999  (destination).

1001 Step 1: The source 10.3.1.3 sends packets to its default router
1002 10.3.1.1.

1004 Step 2: The routing of 10.3.1.1.

1006 1) Calculate candidate hops

1008 10.3.1.1 looks up its BRT and gets the following matched entries:

1010 Destination/Mask      Next hop
1011 10.0.0.0/255.0.0.0    10.3.0.1

1013 So the candidate hops = {10.3.0.1}

1014 2) Calculate avoiding hops

1016 Its NRT is empty, so the set of avoiding hops is empty too.

1018 3) Calculate applicable hops

1020 The applicable hops are the candidate hops minus the avoiding hops,
     so:

1022 The applicable hops = {10.3.0.1}

1024 4) Forward packets to 10.3.0.1

1026 Step 3: The routing of 10.3.0.1

1028 1) Calculate candidate hops

1030 10.3.0.1 looks up its BRT and gets the following matched entries:

1032 Destination/Mask        Next hop
1033 10.1.0.0/255.255.0.0    10.0.1.1
1034 10.1.0.0/255.255.0.0    10.0.1.2

1036 So the candidate hops = {10.0.1.1, 10.0.1.2}

1038 2) Calculate avoiding hops

1039 10.3.0.1 looks up its NRT and gets the following matched entry:

1040 Destination/Mask        Next hop
1041 10.1.0.0/255.255.0.0    10.0.1.2

1043 So the avoiding hops = {10.0.1.2}

1045 3) Calculate applicable hops

1047 The applicable hops are the candidate hops minus the avoiding hops,
     so:

1049 The applicable hops = {10.0.1.1}

1051 4) Forward packets to 10.0.1.1

1053 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
1054 routing tables.

1056 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
1057 routing tables.

1059 Step 6: 10.1.2.1 forwards packets to the destination 10.1.2.2 by
1060 layer-2 switching.

1062 8.  Conclusion

1064 This draft introduces the FAR protocol, a generic routing method and
1065 protocol for data centers that have a regular topology.  FAR uses two
1066 routing tables, a BRT and an NRT, to store the normal routing paths
1067 and the routing paths to avoid, respectively, which makes FAR very
1068 simple and efficient.  The sizes of the two tables are very small:
1069 usually, a BRT has only a few tens of entries, and an NRT has only a
1070 few to about a dozen entries.

1072 9.  References

1074 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable,
1075 Commodity Data Center Network Architecture", ACM SIGCOMM,
1076 2008.

1078 10.  Acknowledgments

1080 This document is supported by the ZTE Enterprise-University-Research
1081 Joint Project.

1083 11.  Security Considerations

1085 Authors' Addresses

1087 Bin Liu
1088 ZTE Inc.
1089 ZTE Plaza, No.19 East Huayuan Road, Haidian District
1090 Beijing 100191
1091 China

1093 Email: liu.bin21@zte.com.cn

1095 Yantao Sun
1096 Beijing Jiaotong University
1097 No.3 Shang Yuan Cun, Hai Dian District
1098 Beijing 100044
1099 China

1101 Email: ytsun@bjtu.edu.cn

1102 Jing Cheng
1103 Beijing Jiaotong University
1104 No.3 Shang Yuan Cun, Hai Dian District
1105 Beijing 100044
1106 China

1108 Email: yourney.j@gmail.com

1110 Yichen Zhang
1111 Beijing Jiaotong University
1112 No.3 Shang Yuan Cun, Hai Dian District
1113 Beijing 100044
1114 China

1116 Email: snowfall_dan@sina.com