idnits 2.17.1 

draft-sl-rtgwg-far-dcn-16.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There are 6 instances of too long lines in the document, the longest one
     being 41 characters in excess of 72.

  == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  == There are 59 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (May 29, 2021) is 1062 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'RFC8174' is mentioned on line 226, but not defined

  == Unused Reference: 'FAT-TREE' is defined on line 1526, but no explicit
     reference was found in the text


     Summary: 2 errors (**), 0 flaws (~~), 6 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Routing Area Working Group                                       Bin Liu
3	Internet-Draft                                       ZTE Inc., ZTE Plaza
4	Intended status: Informational                                Yantao Sun
5	Expires: November 30, 2021                                    Jing Cheng
6	                                                            Yichen Zhang
7	                                             Beijing Jiaotong University
8	                                                       Bhumip Khasnabish
9	                                                  Individual contributor
10	                                                            May 29, 2021

12	   Generic Fault-Avoidance Routing Protocol for Data Center Networks
13	                       draft-sl-rtgwg-far-dcn-16

15	Abstract

17	   This document describes a generic routing method and protocol for a
18	   regular data center network, named the Fault-Avoidance Routing (FAR)
19	   protocol.  The FAR protocol provides a generic routing method for all
20	   types of regular topology network architectures that have been
21	   proposed for large-scale cloud-based data centers over the past few
22	   years.  The FAR protocol is designed to leverage any regularity in
23	   the topology and compute its routing table in a concise manner.  Fat-
24	   tree is taken as an example architecture to illustrate how the FAR
25	   protocol can be applied in real operational scenarios.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at https://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on November 30, 2021.

44	Copyright Notice

46	   Copyright (c) 2021 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (https://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
62	     1.1.  Acronyms & Definitions  . . . . . . . . . . . . . . . . .   5
63	   2.  Conventions used in this document . . . . . . . . . . . . . .   5
64	   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   5
65	     3.1.  The Impact of Large-scale Networks on Route Calculation .   6
66	     3.2.  Issues of Conventional Routing Methods in a Large-scale
67	           Network with Giant Number Nodes of Routers  . . . . . . .   6
68	     3.3.  Network Addressing Issues . . . . . . . . . . . . . . . .   9
69	     3.4.  Big Routing Table Issues  . . . . . . . . . . . . . . . .   9
70	     3.5.  Adaptivity Issues for Routing Algorithms  . . . . . . . .   9
71	     3.6.  Virtual Machine Migration Issues  . . . . . . . . . . . .  10
72	   4.  The FAR Framework . . . . . . . . . . . . . . . . . . . . . .  10
73	   5.  Data Format . . . . . . . . . . . . . . . . . . . . . . . . .  11
74	     5.1.  Data Tables . . . . . . . . . . . . . . . . . . . . . . .  11
75	     5.2.  Messages  . . . . . . . . . . . . . . . . . . . . . . . .  14
76	   6.  FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . .  18
77	     6.1.  Neighbor and Link Detection Module(M1)  . . . . . . . . .  18
78	     6.2.  Device Learning Module(M2)  . . . . . . . . . . . . . . .  18
79	     6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)   19
80	     6.4.  Link Failure Learning Module(M4)  . . . . . . . . . . . .  19
81	     6.5.  BRT Building Module(M5) . . . . . . . . . . . . . . . . .  19
82	     6.6.  NRT Building Module(M6) . . . . . . . . . . . . . . . . .  20
83	     6.7.  Routing Table Lookup(M7)  . . . . . . . . . . . . . . . .  20
84	   7.  How a FAR Router Works  . . . . . . . . . . . . . . . . . . .  20
85	   8.  Compatible Architecture . . . . . . . . . . . . . . . . . . .  23
86	   9.  Topology identification and broadcast storm suppression . . .  23
87	   10. Application Example . . . . . . . . . . . . . . . . . . . . .  24
88	     10.1.  BRT Building Procedure . . . . . . . . . . . . . . . . .  26
89	     10.2.  NRT Building Procedure . . . . . . . . . . . . . . . . .  27
90	       10.2.1.  Single Link Failure  . . . . . . . . . . . . . . . .  27
91	       10.2.2.  A Group of Link Failures . . . . . . . . . . . . . .  28
92	       10.2.3.  Node Failures  . . . . . . . . . . . . . . . . . . .  29
93	     10.3.  Routing Procedure  . . . . . . . . . . . . . . . . . . .  29
94	     10.4.  FAR's Performance in Large-scale Networks  . . . . . . .  31
95	       10.4.1.  The number of control messages required by FAR . . .  31
96	       10.4.2.  The Calculating Time of Routing Tables . . . . . . .  31
97	       10.4.3.  The Size of Routing Tables . . . . . . . . . . . . .  31
98	   11. Implementations Examples  . . . . . . . . . . . . . . . . . .  32
99	   12. Security Considerations . . . . . . . . . . . . . . . . . . .  34
100	   13. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  35
101	   14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  35
102	   15. References  . . . . . . . . . . . . . . . . . . . . . . . . .  35
103	     15.1.  Normative References . . . . . . . . . . . . . . . . . .  35
104	     15.2.  Informative References . . . . . . . . . . . . . . . . .  35
105	   16. Appendix  . . . . . . . . . . . . . . . . . . . . . . . . . .  36
106	     16.1.  Application Area of the Solution . . . . . . . . . . . .  36
107	     16.2.  Technical evolution roadmap  . . . . . . . . . . . . . .  36
108	     16.3.  Updating roadmap . . . . . . . . . . . . . . . . . . . .  36
109	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  36

111	1.  Introduction

113	   In recent years, with the rapid development of cloud computing
114	   technologies, the widely deployed cloud services, such as Amazon EC2
115	   and Google search, bring about huge challenges to data center
116	   networking (DCN).  Today's cloud-based data centers (DCs) require
117	   large-scale networks with larger internal bandwidth and smaller
118	   transfer delay.  However, conventional networks cannot meet such
119	   requirements due to limitations in their network architecture.  In
120	   order to satisfy the requirements of cloud computing services, many
121	   new network architectures have been proposed for data centers, such
122	   as Fat-tree, MatrixDCN[MatrixDCN], and BCube[BCube].  These new
123	   architectures can support non-blocking large-scale datacenter
124	   networks with more than tens of thousands of physical servers.

126	   All of these architectures have regular topologies, which are common
127	   features.  The regular topology refers to the network topology
128	   structure with obvious regularity and symmetry, which is conducive to
129	   automatic configuration of the network, such as the Fat-tree network.
130	   In a regular topology, each network node such as a switch or router
131	   can be addressed by its location and through a node's address, the
132	   node's connections to its neighbors in a network can be determined,
133	   and furthermore, the route to the node from other nodes in the
134	   network can be determined.  So nodes can compute route entries
135	   without learning topology.

137	   This document describes a generic routing method and protocol, the
138	   Fault-Avoidance Routing (FAR) protocol, for DCNs.  This method
139	   leverages the regularity in the topologies of data center networks to
140	   simplify routing learning and accelerate the query of routing tables.
141	   This routing method has a better fault tolerance and can be applied
142	   to any DCN with a regular topology.

144	   FAR is not a routing protocol to replace generic routing protocols
145	   such as OSPF(Open Shortest Path First)[RFC2328] and IS-
146	   IS(Intermediate System-to-Intermediate System).  It cannot be used in
147	   general local networks whose topological structures are arbitrary,
148	   and whose scales are also not very large.  OSPF and IS-IS work very
149	   well in such a network.  But in a large-scale network with regular
150	   topology, FAR has better performance.  Compared with OSPF and IS-IS,
151	   FAR has shorter time of network convergence and lower PDU(Protocol
152	   Data Unit) overhead.  Furthermore, FAR requires less computing and
153	   storage resources, which lets FAR routers to run at a lower cost of
154	   production than the generic routers.

156	   In addition, for each type of network architecture, researchers
157	   designed a routing algorithm according to the features of its
158	   topology.  Because these routing algorithms are different and lack
159	   compatibility with each other, it is very difficult to develop a
160	   routing protocol for network routers supporting multiple routing
161	   algorithms.  FAR has better adaptability than these specified routing
162	   methods.

164	   FAR consists of three components, i.e., link state learning unit,
165	   routing table building unit and routing table querying unit.  In the
166	   link state learning unit, FAR exchanges link failures among routers
167	   to establish a consistent knowledge of the entire network.  In this
168	   stage, the regularity in topology is exploited to infer failed links
169	   and routers.  In the routing table building unit, FAR builds up two
170	   routing tables, i.e., a basic routing table (BRT) and a negative
171	   routing table (NRT), for each router according to the network
172	   topology and link states.  In the last component, routers forward
173	   incoming packets by looking up the two routing tables.  The matched
174	   entries in BRT minus the matched entries in NRT are the final route
175	   entries to be used to forward an incoming packet.

177	   This document describes a protocol developed by ZTE and Beijing
178	   Jiaotong University.  It is just presented here to record the work
179	   and to make it available for use in later IETF work if desirable.

181	   The remainder of this draft is organized as follows.  The problem to
182	   be addressed by FAR is described in Section 3.  The framework of FAR
183	   routing protocol is described in Section 4.  Section 5 and 6
184	   introduce FAR's data format FAR and modules in detail.  Section 7
185	   describe how FAR works by finite state machine (FSM).  In Section 8,
186	   we discussed how FAR works with variable network architectures.
187	   Section 9 takes Fat-tree network as an example to illuminate how FAR
188	   works.

190	1.1.  Acronyms & Definitions

192	   DCN - Data Center Network

194	   FAR - Fault-Avoidance Routing

196	   BRT - Basic Routing Table

198	   NRT - Negative Routing Table

200	   NDT - Neighbor Devices Table

202	   ADT - All Devices Table

204	   LFT - Link Failure Table

206	   DA - Device Announcement

208	   LFA - Link Failure Announcement

210	   DLR - Device and Link Request

212	   IP - Internet Protocol

214	   VM - Virtual Machine

216	2.  Conventions used in this document

218	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
219	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
220	   "OPTIONAL" in this document are to be interpreted as described in BCP
221	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
222	   capitals, as shown here.  In this document, these words will appear
223	   with that interpretation only when in ALL CAPS.  Lower case uses of
224	   these words are not to be interpreted as carrying special
225	   significance.

227	3.  Problem Statement

229	   The problem to be addressed by FAR as proposed in this draft is
230	   described in this section.  The expansion of Cloud data center
231	   networks has brought significant challenges to the existing routing
232	   technologies.  FAR mainly solves a series of routing problems faced
233	   by large-scale data center networks.

235	3.1.  The Impact of Large-scale Networks on Route Calculation

237	   In a large-scale cloud data center network, there may be thousands of
238	   routers.  Running OSPF or IS-IS in such network will encounter these
239	   two challenges:

241	   (1) Network convergence time would be too long, which will cause a
242	   longer time to elapse for creating and updating the routes.  The
243	   response time to network failures may be excessively long;

245	   (2) High resource consumption.  Since a large number of routing
246	   protocol packets need to be sent, it causes the routing information
247	   consuming too much network bandwidth and CPU(central processing unit)
248	   resources, and easily leads to packet loss and makes the challenge
249	   (1) more prominent.

251	   In order to solve these two challenges, a common practice is to
252	   partition a large network into some small areas, where the route
253	   calculation runs independently within different areas.  However,
254	   nowadays the cloud data centers typically require very large internal
255	   bandwidth.  To meet this requirement, a large number of parallel
256	   equivalent links are deployed in a network, such as a Fat-tree
257	   network.  Partitioning such a network will affect the utilization of
258	   routing algorithm on equivalent multi-path and reduce internal
259	   network bandwidth requirements.

261	   In the FAR routing calculation process, a Basic Routing Table (BRT)
262	   is built on local network topology leveraging the regularity of the
263	   network topologies.  In addition to BRT, FAR also builds a Negative
264	   Routing Table (NRT).  FAR gradually builds NRT in the process of
265	   learning network link failure information, which does not require
266	   learning a complete network fault information.  FAR does not need to
267	   wait for the completion of the network convergence in the process of
268	   building these two tables.  Therefore, it avoids the problem of
269	   excessive network convergence overheads in the route calculation
270	   process.  In addition, FAR only needs to exchange a small amount of
271	   link change information between routers, and hence consumes less
272	   network bandwidth.

274	3.2.  Issues of Conventional Routing Methods in a Large-scale Network
275	      with Giant Number Nodes of Routers

277	   There are many real world scenario where tens of thousands of
278	   nodes(or much more nodes) need to be deployed in a flat area, such as
279	   infiniband routing and switching system, high-performance computer
280	   network, and many IDC(Internet Data Center) networks in China.  The
281	   similar problems have been existed long ago.  People have solved the
282	   problems through similar solutions, such as the traditional regular
283	   topology-based RFC3619[RFC3619] protocol, the routing protocols of
284	   infiniband routing and switching system, and high-performance
285	   computer network routing protocol.
286	   Infiniband defines a switch-based network to interconnect processing
287	   nodes and the I/O nodes.  Infiniband can support very large scale
288	   networks, use the regularity in topology to simplify its routing
289	   algorithm, which is just the same to what we do in FAR.

291	   Why OSPF and IS-IS do not work well in a large-scale network with
292	   giant number nodes of routers?

294	   As we know, the OSPF protocol uses multiple databases, more
295	   topological exchange information (as seen in the following example)
296	   and complicated algorithm.  It requires routers to consume more
297	   memory and CPU processing capability.  But the processing rate of CPU
298	   on the protocol message per second is very limited.  When the network
299	   expands, CPU will quickly approach its processing limits, and at this
300	   time OSPF can not continue to expand the scale of the management.
301	   The SPF(Shortest Path First) algorithm itself does not thoroughly
302	   solve these problems.

304	   On the contrary, the FAR protocol does not need to calculate SPF,
305	   which saves calculation time and resources, so FAR does not have the
306	   convergence time delay and the additional CPU overheads, which SPF
307	   requires.  Because in the initial stage, FAR already knows the
308	   regular information of the whole network topology and does not need
309	   to periodically do SPF operation.

311	   One of the examples of "more topological exchange information":
312	   In the OSPF protocol, LSA(Link-State Advertisement) floods every 1800
313	   seconds.  Especially in the larger network, the occupation of CPU and
314	   band bandwidth will soon reach the router's performance bottleneck.
315	   In order to reduce these adverse effects, OSPF introduced the concept
316	   of Area, which still has not solved the problem thoroughly.  By
317	   dividing the OSPF Area into several areas, the routers in the same
318	   area do not need to know the topological details outside their area.
319	   (In comparison with FAR, after OSPF introducing the concept of Area,
320	   the equivalent paths cannot be selected in the whole network scope)

322	   OSPF can achieve the following results by Area : 1) Routers only need
323	   to maintain the same link state databases as other routers within the
324	   same Area, without the necessity of maintaining the same link state
325	   database as all routers in the whole OSPF domain.  2) The reduction
326	   of the link state databases means dealing with relatively fewer LSA,
327	   which reduces the CPU consumption of routers; 3) The large number of
328	   LSAs flood only within the same Area.  But, its negative effect is
329	   that the smaller number of routers which can be managed in each OSPF
330	   area.  On the contrary, because FAR does not have the above
331	   disadvantages, FAR can also manage large-scale network even without
332	   dividing Areas.

334	   The aging time of OSPF is set in order to adapt to routing
335	   transformation and protocol message exchange happened frequently in
336	   the irregular topology.  Its negative effect is: when the network
337	   does not change, the LSA needs to be refreshed every 1800 seconds to
338	   reset the aging time.  In the regular topology, as the routings are
339	   fixed, it does not need the complex protocol message exchange and
340	   aging rules to reflect the routing changes, as long as LFA mechanism
341	   in the FAR is enough.

343	   Compared with the LSVR(Link State Vector Routing) protocol, the LSVR
344	   protocol has no special requirements for the network topology
345	   structure, however, the FAR draft is applicable to the regular
346	   topology network architecture and simplifies unnecessary processing.
347	   It is a solution proposed to greatly improve the routing efficiency
348	   of the regular network topology.  The FAR solution is more efficient
349	   than the general methods such as LSVR in regular topology.

351	   Therefore, in FAR, we can omit many unnecessary processing and the
352	   packet exchange.  The benefits are fast convergence speed and much
353	   larger network scale than other dynamic routing protocol.  Now there
354	   are some successful implementations of simplified routings in the
355	   regular topology in the HPC(High Performance Computing) environment.
356	   Conclusion: As FAR needs few routing entries and the topology is
357	   regular, the database does not need to be updated regularly.  Without
358	   the need for aging, there is no need for CPU and bandwidth overhead
359	   brought by LSA flood every 30 minutes, so the expansion of the
360	   network has no obvious effect on the performance of FAR, which is
361	   contrary to OSPF.

363	   Comparison of convergence time: The settings of OSPF spf_delay and
364	   spf_hold_time can affect the change of convergence time.  The
365	   convergence time of the network with 2480 nodes is about 15-20
366	   seconds; while the FAR does not need to calculate the SPF, so there
367	   is no such convergence time.
368	   These issues still exist in rapid convergence technology of OSPF
369	   ,ISIS (such as I-SPF, Incremental SPF) and LSVR.  The convergence
370	   speed and network scale constraint each other.  FAR does not have the
371	   above problems, and the convergence time is almost negligible.
372	   Can FRR(Fast Reroute) solve these problems?  IP FRR has some
373	   limitations.  The establishment of IP FRR backup scheme will not
374	   affect the original topology and traffic forwarding which are
375	   established by protocol, however, we can not get the information of
376	   whereabouts and status when the traffic is switched to an alternate
377	   next hop.

379	3.3.  Network Addressing Issues

381	   Routers are typically configured with multiple network interfaces,
382	   each connected to a subnet.  OSPF and other routing algorithms
383	   require that each interface of a router must be configured with an IP
384	   address.  A large-scale data center network may contain thousands of
385	   routers and each router has dozens of network interfaces, thus, there
386	   are tens of thousands of IP addresses needed to be configured in a
387	   data center.  It will be very complex to configure and manage a large
388	   number of network interfaces and will be difficult to troubleshoot
389	   network problems, then network maintenance will be costly and error-
390	   prone.

392	   In FAR, the device position information is encoded in the IP address
393	   of the router.  Each router only needs to be assigned a unique IP
394	   address according its location, which greatly solves complex network
395	   addressing issues in large-scale networks.

397	3.4.  Big Routing Table Issues

399	   There are a large number of subnets in the large-scale data center
400	   network.  A router may build a routing entry for each subnet, and
401	   therefore the size of routing tables on each router may be very
402	   large.  It will increase a router's cost and reduce the querying
403	   speed of the routing table.

405	   FAR uses two measures to reduce the size of its routing tables: a)It
406	   builds a BRT on the regularity of the network topologies; b)It
407	   introduces a new routing table, i.e., a NRT.  In this way FAR can
408	   reduce the size of routing tables to only a few dozen routing
409	   entries.

411	3.5.  Adaptivity Issues for Routing Algorithms

413	   To implement efficient routing in large-scale datacenters, besides
414	   FAR, some other routing methods are proposed for some specific
415	   network architectures, such as Fat-tree and BCube.  These routing
416	   methods are different(from both design and implementation viewpoints)
417	   and not compatible with the conventional routing methods, which
418	   brings big troubles to network equipment providers to develop new
419	   routers supporting various new routing methods.

421	   FAR is a generic routing method.  With slight modification, FAR
422	   method can be applied to most of regular datacenter networks.
423	   Furthermore, the structure of routing tables and querying a routing
424	   table in FAR are the same as conventional routing method.  If FAR is
425	   adopted, the workload of developing a new type of router will be
426	   significantly decreased.

428	3.6.  Virtual Machine Migration Issues

430	   Supporting VM migration is very important for cloud-based datacenter
431	   networks.  However, in order to support layer-3 routing, routing
432	   methods including OSPF and FAR require limiting VM migration within a
433	   subnet.  For this paradox, the mainstream methods still utilize
434	   layer-3 routing on routers or switches, transmit packets encapsulated
435	   by IPinIP or MACinIP between hosts by tunnels passing through network
436	   to the destination access switch, and then extract original packet
437	   out and send it to the destination host.

439	   By utilizing the aforementioned methods, FAR can be applied to Fat-
440	   tree, MatrixDCN or BCube networks for supporting VM migration in
441	   entire network.

443	4.  The FAR Framework

445	   FAR requires that a DCN has a regular topology, and network devices,
446	   including routers, switches, and servers, are assigned IP addresses
447	   according to their locations in the network.  In other word, we can
448	   locate a device in the network according to its IP address.

450	   FAR is a distributed routing method.  In order to support FAR, each
451	   router needs to have a routing module that implements the FAR
452	   algorithm.  FAR algorithm is composed of three parts, i.e., link-
453	   state learning, routing table building and routing table querying, as
454	   shown in Fig. 1.

456	               Link-state Learning    |Routing Table  | Routing Table
457	                                      |  Building     |   Querying
458	                                      |               |
459	    +--------+   /---------------\    | +--------+    |
460	    |2 Device|<--| 1 Neighbor &  |----->| 5 BRT  \    |    Packets
461	    |Learning|   | Link Detection|    | |Building|\   |       |
462	    +--------+   \---------------/    | +--------+ \  |      \|/
463	             |           |            |             \ |+--------------+
464	             |           |            |              /||  7 Querying  |
465	            \|/         \|/           |             / || Routing Table|
466	             +-----------------------+|            /  |+--------------+
467	             |3 Invisible Neighbor & ||           /   |
468	             |Link Failure Inferring || +---------    |
469	             +-----------------------+|/| 6 NRT  |    |
470	                         |            / |Building|    |
471	                        \|/          /| +--------+    |
472	                  +--------------+  / |               |
473	                  |4 Link Failure| /  |               |
474	                  |  Learning    |    |               |
475	                  +--------------+    |               |
476	                                      |               |

478	                        Figure 1: The FAR framework

480	   1:Neighbor and Link Detection Module(M1)
481	   2:Device Learning Module(M2)
482	   3:Invisible Neighbor and Link Failure Inferring Module(M3)
483	   4:Link Failure Learning Module(M4)
484	   5:BRT Building Module(M5)
485	   6:NRT Building Module(M6)
486	   7:Routing Table Lookup(M7)
487	   The meanings of M1-M7 are explained in detail in section 6.
488	   Link-state learning is responsible for a router to detect the states
489	   of its connected links and learn the states of all the other links in
490	   the entire network.  The second part builds two routing tables, a
491	   basic routing table (BRT) and an negative routing table (NRT),
492	   according to the learned link states in the first part.  The third
493	   part queries the BRT and the NRT to decide a next forwarding hop for
494	   the received (ingress) packets.

496	5.  Data Format

498	5.1.  Data Tables

500	   Some data tables are maintained on each router in FAR.  They are:

502	   Neighbor Device Table (NDT): To store neighbor routers and related
503	   links.

505	   All Devices Table (ADT): To store all routers in the entire network.

507	   Link Failures Table (LFT): To store all link failures in the entire
508	   network.

510	   Basic Routing Table (BRT): To store the candidate routes.

512	   Negative Routing Table(NRT): To store the avoiding routes.

514	   The format of NDT

516	    ----------------------------------------------------------
517	    Device ID | Device IP | Port ID | Link State | Update Time
518	    ----------------------------------------------------------

520	   Device ID: The ID of a neighbor router.

522	   Device IP: The IP address of a neighbor router.

524	   Port ID: The port ID that a neighbor router is attached to.

526	   Link State: The state of the link between a router and its neighbor
527	   router.  There are two states: Up and Down.

529	   Update Time: The time of updating the entry.

531	   The format of ADT

533	    --------------------------------------------------
534	    Device ID | Device IP | Type | State | Update Time
535	    --------------------------------------------------

537	   Device ID: The ID of a neighbor router.

539	   Device IP: The IP address of a neighbor router.

541	   Type: The type of a neighbor router.

543	   State: The state of a neighbor router.  There are two states: Up and
544	   Down.

546	   Update Time: The time of updating the entry.

548	   The format of LFT
549	    --------------------------------------------
550	    No | Router 1 IP | Router 2 IP | Timestamp
551	    --------------------------------------------

553	   No: The entry number.

555	   Router 1 IP: The IP address of one router that a failed link connects
556	   to.

558	   Router 2 IP: The IP address of another router that a failed link
559	   connects to.

561	   Timestamp: It identifies when the entry is created.

563	   The format of BRT

565	    -------------------------------------------------------
566	    Destination | Mask | Next Hop | Interface | Update Time
567	    -------------------------------------------------------

569	   Destination: A destination network

571	   Mask: The subnet mask of a destination network.

573	   Next Hop: The IP address of a next hop for a destination.

575	   Interface: The interface related to a next hop.

577	   Update Time: The time of updating the entry.

579	   The format of NRT

581	    -------------------------------------------------------------------
582	    Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp
583	    -------------------------------------------------------------------

585	   Destination: A destination network.

587	   Mask: The subnet mask of a destination network.

589	   Next Hop: The IP address of a next hop that should be avoided for a
590	   destination.

592	   Interface: The interface related to a next hop that should be
593	   avoided.

595	   Failed Link No: A group of failed link numbers divided by "/", for
596	   example 1/2/3.

598	   Timestamp: The time of updating the entry.

600	5.2.  Messages

602	   Some protocol messages are exchanged between routers in FAR.

604	   Hello Message: This message is exchanged between neighbor routers to
605	   learn adjacency.

607	   Device Announcement (DA): Synchronize the knowledge of routers
608	   between routers.

610	   Link Failure Announcement (LFA): Synchronize link failures between
611	   routers.

613	   Device and Link Request (DLR): When a router starts, it requests the
614	   knowledge of routers and links from its neighbors by a DLR message.

616	   A FAR Message is directly encapsulated in an IP packet.  The protocol
617	   field of IP header indicates an IP packet is an FAR message.

619	   The four types of FAR messages have same format of packet header,
620	   called FAR header (as shown in Figure 2).

622	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
623	            +-----------+------------+------------------------+
624	            |  Version  |Message Type|    Message Length      |
625	            +-----------+------------+------------------------+
626	            |        Checksum        |       AuType           |
627	            +------------------------+------------------------+
628	            |                  Authentication                 |
629	            +-------------------------------------------------+
630	            |                  Authentication                 |
631	            +-------------------------------------------------+
632	            |                   Timestamp                     |
633	            +-------------------------------------------------+

635	                    Figure 2: The format of FAR header

637	   Version: FAR version

639	   Message Type: The type of FAR message.

641	   Packet Length: The packet length of the total FAR message.

643	   Checksum: The checksum of an entire FAR message.

645	   AuType:Authentication type. 0: no authentication, 1: Plaintext
646	   Authentication, 2: MD5 Authentication.

648	   Authentication: Authentication information. 0: undefined, 1: Key, 2:
649	   key ID, MD5 data length and packet number.  MD5 data is appended to
650	   the backend of the packet.

652	   AuType and Authentication can refer to the definition of OSPF packet.

654	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
655	            +-----------+------------+------------------------+
656	            |  Version  |Message Type|    Message Length      |
657	            +-----------+------------+------------------------+
658	            |        Checksum        |       AuType           |
659	            +------------------------+------------------------+
660	            |                  Authentication                 |
661	            +-------------------------------------------------+
662	            |                  Authentication                 |
663	            +-------------------------------------------------+
664	            |                   Timestamp                     |
665	            +-------------------------------------------------+
666	            |                    Router IP                    |
667	            +------------------------+------------------------+
668	            |     HelloInterval      |     HelloDeadInterval  |
669	            +------------------------+------------------------+
670	            |                Neighbor Router IP               |
671	            +-------------------------------------------------+
672	            |                       ...                       |
673	            +-------------------------------------------------+

675	                  Figure 3: The Format of Hello Messages

677	   For Hello messages, the Message Type in FAR header is set to
678	   1.Besides FAR header, a Hello message(Fig. 3) requires the following
679	   fields:

681	   Router IP: The router IP address.

683	   HelloInterval: The interval of sending Hello messages to neighbor
684	   routers.

686	   RouterDeadInterval: The interval to set a neighbor router dead(out-
687	   of-service).  If in the interval time, a router doesn't receive a
688	   Hello message from its neighbor router, the neighbor router is
689	   treated as dead.

691	   Neighbor Router IP: The IP address of a neighbor router.  All the
692	   neighbor router's addresses should be included in a Hello message.

694	            |<----1---->| <----1---->|<----------2----------->|
695	            +-----------+------------+------------------------+
696	            |  Version  |Message Type|    Message Length      |
697	            +-----------+------------+------------------------+
698	            |        Checksum        |       AuType           |
699	            +------------------------+------------------------+
700	            |                  Authentication                 |
701	            +-------------------------------------------------+
702	            |                  Authentication                 |
703	            +-------------------------------------------------+
704	            |                   Timestamp                     |
705	            +------------------------+------------------------+
706	            |                   Router1 IP                    |
707	            +-------------------------------------------------+
708	            |                       ...                       |
709	            +-------------------------------------------------+

711	                    Figure 4: The Format of DA Messages

713	   For DA messages(Fig. 4), the Message Type in FAR header is set to 2.
714	   Besides FAR header, a DA message includes IP addresses of all the
715	   announced routers.

717	            |<----1---->| <----1---->|<----------2----------->|
718	            +-----------+------------+------------------------+
719	            |  Version  |Message Type|    Message Length      |
720	            +-----------+------------+------------------------+
721	            |        Checksum        |       AuType           |
722	            +------------------------+------------------------+
723	            |                  Authentication                 |
724	            +-------------------------------------------------+
725	            |                  Authentication                 |
726	            +-------------------------------------------------+
727	            |                   Timestamp                     |
728	            +------------------------+------------------------+
729	            |                     Left IP                     |
730	            +-------------------------------------------------+
731	            |                    Right IP                     |
732	            +------------------------+------------------------+
733	            |                      State                      |
734	            +-------------------------------------------------+
735	            |                       ...                       |
736	            +-------------------------------------------------+

738	                   Figure 5: The Format of LFA Messages

740	   For LFA messages(Fig. 5), the Message Type in FAR header is set to 3.
741	   Besides FAR header, a LFA message includes all the announced link
742	   failures.

744	   Left IP: The IP address of the left endpoint router of a link.

746	   Right IP: The IP address of the right endpoint router of a link.

748	   State: Link state. 0: Up, 1: down
749	            |<----1---->| <----1---->|<----------2----------->|
750	            +-----------+------------+------------------------+
751	            |  Version  |Message Type|    Message Length      |
752	            +-----------+------------+------------------------+
753	            |        Checksum        |       AuType           |
754	            +------------------------+------------------------+
755	            |                  Authentication                 |
756	            +-------------------------------------------------+
757	            |                  Authentication                 |
758	            +-------------------------------------------------+
759	            |                   Timestamp                     |
760	            +-------------------------------------------------+

762	                   Figure 6: The Format of DLR Messages

764	   For DLR messages(Fig. 6), the Message Type in FAR header is set to
765	   1.Except for FAR header, DLR has no additional fields.

767	6.  FAR Modules

769	6.1.  Neighbor and Link Detection Module(M1)

771	   M1 is responsible for sending and receiving Hello messages, and
772	   detecting directly-connected links and neighbor routers.  Each Hello
773	   message is encapsulated in an IP packet.  M1 sends Hello messages
774	   periodically to all the active router ports and receives Hello
775	   messages from its neighbor routers.  M1 detects neighbor routers and
776	   directly-connected links according to received Hello Messages and
777	   stores these neighbors and links into a Neighbor Devices Table (NDT).
778	   Additionally, M1 also stores neighbor routers into an All Devices
779	   Table (ADT).

781	6.2.  Device Learning Module(M2)

783	   M2 is responsible for sending, receiving, and forwarding device
784	   announcement (DA) messages, learning all the routers in the whole
785	   network, and deducing faulted routers.  When a router starts, it
786	   sends a DA message announcing itself to its neighbors and a DLR
787	   message requesting the knowledge of routers and links from its
788	   neighbors.  If M2 module of a router receives a DA message, it checks
789	   whether the router encapsulated in the message is in an ADT.  If the
790	   router is not in the ADT, M2 puts this router into the ADT and
791	   forwards this DA message to all the active ports except for the
792	   incoming one, otherwise, M2 discards this message directly.  If M2
793	   module of a router receives a DLR message, it replies a DA message
794	   that encapsulates all of the learned routers.

796	6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)

798	   M3 is responsible for inferring invisible neighbors of the current
799	   router by means of the ADT.  If the link between a router A and its
800	   neighbor B breaks, which results in that M1 module of A cannot detect
801	   the existence of B, then B is an invisible neighbor of A.  Since a
802	   device's location is coded into its IP address, it can be judged
803	   whether two routers are adjacent, according to their IP addresses.
804	   Based on this idea, M3 infers all of the invisible neighbors of the
805	   current router and the related link failures.  The results are stored
806	   into an NDT.  Moreover, link failures also are added into a link-
807	   failure table (LFT).  LFT stores all of the failed links in the
808	   entire network.

810	6.4.  Link Failure Learning Module(M4)

812	   M4 is responsible for sending, receiving and forwarding link failure
813	   announcement (LFA) and learning all the link failures in the whole
814	   network.  M4 broadcasts each newly inferred link failure to all the
815	   routers in the network.  Each link failure is encapsulated in a LFA
816	   message and one link failure is broadcasted only once.  If a router
817	   receives a DLR request from its neighbor, it will reply a LFA message
818	   that encapsulates all the learned link failures through M4 module.
819	   If M4 receives a LFA message, it checks whether the link failure
820	   encapsulated in the message is in a LFT by comparing two link ends
821	   and timestamp.  If the link failure is not in the LFT or timestamp is
822	   different, M4 puts this link failure into the LFT (or update
823	   timestamp only) and forwards this LFA message to all the active ports
824	   except for the incoming one, otherwise, M4 discards this message
825	   directly.

827	   There is a special case a router will rebroadcast a link failure.  If
828	   a router receives a data packet and must forward the packet going
829	   ahead to destination through a failed link, it means some previous
830	   router should avoid this failed link according to its NRT but it
831	   doesn't.  In this case, maybe the previous router missed the LFA
832	   message of the link failure due to some uncertain reasons.  So the
833	   forwarding router rebroadcasts the LFA message.

835	6.5.  BRT Building Module(M5)

837	   M5 is responsible for building a BRT for the current router.  By
838	   leveraging the regularity in topology, M5 can calculate the routing
839	   paths for any destination without the knowledge of the topology of
840	   whole network, and then build the BRT based on an NDT.  Since the IP
841	   addresses of network devices are continuous, M5 only creates one
842	   route entry for a group of destination addresses that have the same
843	   network prefix by means of route aggregation technology.  Usually,
844	   the size of a BRT is very small.  The detail of how to build a BRT is
845	   described in section 5.

847	6.6.  NRT Building Module(M6)

849	   M6 is responsible for building a NRT for the current router.  Because
850	   M5 builds a BRT without considering link failures in network, the
851	   routing paths calculated by the BRT cannot avoid failed links.  To
852	   solve this problem, a NRT is used to exclude the routing paths that
853	   include some failed links from the paths calculated by a BRT.  M6
854	   calculate the routing paths that include failed links and stored them
855	   into the NRT.  The details of how to build a NRT is described in
856	   section 5.

858	6.7.  Routing Table Lookup(M7)

860	   M7 is responsible for querying routing tables and selecting the next
861	   hop for forwarding the packets.  Firstly, M7 takes the destination
862	   address of a forwarding packet as a criterion to look up route
863	   entries in a BRT based on longest prefix match.  All of the matched
864	   entries are composed of a candidate hops list.  Secondly, M7 look up
865	   negative route entries in a NRT taking the destination address of the
866	   forwarding packet as criteria.  This lookup is not limited to the
867	   longest prefix match, any entry that matches the criteria would be
868	   selected and composed of an avoiding hops list.  Thirdly, the
869	   candidate hops minus avoiding hops are composed of an applicable hops
870	   list.  At last, M7 sends the forwarding packet to any one of the
871	   applicable hops.  If the applicable list is empty, the forwarding
872	   packet will be dropped.

874	7.  How a FAR Router Works

876	   Figure 7 shows how a FAR router works by its FSM.

878	                            /---------------\
879	                            |     Start     |
880	                            |               |
881	                            \---------------/
882	                                    |
883	                     +--------+     |1
884	                     |        |     |
885	                     |       \|/   \|/
886	                     |    +--------------------+
887	                     |2   |         ND         |
888	                     |    |                    |
889	                     |    +--------------------+
890	                     |        |      |
891	                     |        |      |3
892	                     +--------+      |
893	                                    \|/
894	                              +--------------+
895	                              |    ND-FIN    |
896	                              |              |
897	                              +--------------+
898	                                     |
899	                                     |4
900	                                     |
901	                                    \|/
902	            ________10_______\+--------------+/_______11________
903	           |                 /|              |\                |
904	           |                  |    Listen    |                 |
905	           |     ____9_______\|              |/_______12___    |
906	           |    |            /|              |\            |   |
907	           |    |             +--------------+             |   |
908	           |    |               5/  |  |    \8             |   |
909	           |    |              |/_  |  |    _\|            |   |
910	           |    |  +------------+  6|  |7   +------------+ |   |
911	           |     --| HELLO-RECV |   |  |    | LFA-RECV   |--   |
912	           |       +------------+   |  |    +------------+     |
913	           |                  ______|  |______                 |
914	           |                 |                |                |
915	           |                \|/              \|/               |
916	           |           +------------+  +------------+          |
917	           |___________|  DLR-RECV  |  |  DA-RECV   |__________|
918	                       +------------+  +------------+

920	                                  _________
921	                                 |         |
922	                                 |         |
923	                                \|/        |
924	                         +--------------+  |13
925	                         | Hello Thread |  |
926	                         +--------------+  |
927	                                 |         |
928	                                 |_________|

930	                                  _________
931	                                 |         |
932	                                 |         |
933	                                \|/        |
934	                         +--------------+  |14
935	                         |  LFD Thread  |  |
936	                         +--------------+  |
937	                                 |         |
938	                                 |_________|

940	                                  _________
941	                                 |         |
942	                                 |         |
943	                                \|/        |
944	                         +--------------+  |15
945	                         |  DA Thread   |  |
946	                         +--------------+  |
947	                                 |         |
948	                                 |_________|

950	                                  _________
951	                                 |         |
952	                                 |         |
953	                                \|/        |
954	                         +--------------+  |16
955	                         |  DFD Thread  |  |
956	                         +--------------+  |
957	                                 |         |
958	                                 |_________|

960	             Figure 7: The Finite State Machine of FAR Router

962	   1)When a router starts up, it starts a Hello thread and then starts
963	   ND (neighbor detection) timer (3 seconds).  Next the router goes into
964	   ND (neighbor detection) state.
965	   2)In the ND state, if a router received a Hello message, then it
966	   performs a Hello-message processing and goes back to the ND state.
967	   3)When the ND timer is over, a router goes into ND-FIN (neighbor
968	   detection finished) state.
969	   4)A router starts the LFD (link failure detection) thread and DFD
970	   (device failure detection) state, and sends DA message and DLR
971	   message to all of its active ports.  Then the router goes into Listen
972	   state.
973	   5) If a router receives a Hello message, then goes into HELLO-RECV
974	   state.
975	   6) If a router receives a DLR message, then goes into DLR-RECV state.
976	   7) If a router receives a DA message, then goes into DA-RECV state.
977	   8) If a router receives a LFA message, then goes into LFA-RECV state.
978	   9) A router performs the Hello-message processing.  After that, it
979	   goes back to Listen state.
980	   10) A router performs the DLR-message processing.  After that, it
981	   goes back to Listen state.
982	   11) A router performs the DA-message processing.  After that, it goes
983	   back to Listen state.

985	   12) A router performs the LFA-message processing.  After that, it
986	   goes back to Listen state.
987	   13) Hello thread produces and sends Hello messages to all its ports
988	   periodically.
989	   14) LFD thread calls link-failure-detection processing to check link
990	   failures in all links periodically
991	   15) DA thread produces and sends DA messages periodically (30
992	   minutes).
993	   16) When DFD thread starts up, it sleep a short time (30 seconds) to
994	   wait for a router learning all the active routers in the network.
995	   Then the thread calls the device-failure-detection processing to
996	   check device failures periodically (30 minutes).

998	8.  Compatible Architecture

1000	   As a generic routing protocol, FAR can be run in various DCNs with
1001	   regular topology.  Up to now, we have implemented the FAR protocol
1002	   for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond.

1004	   For different network architectures, most processing of FAR is same
1005	   besides calculation of routing tables.  BRT routing tables are
1006	   calculated based on Hello messages and NRT routing tables are
1007	   calculated based on LFA messages in FAR.  To extend FAR to support a
1008	   new network architecture, only processing of Hello and LFA messages
1009	   need providing to build BRT and NRT routing tables.

1011	   In this protocol, FAR can support maximally 12 network architectures
1012	   and at least support 1 built-in network architecture, such as Fat-
1013	   tree, BCube and MatrixDCN,etc.  Each network architecture is assigned
1014	   a unique number from 1 to 12.  For example, if the 1 built-in
1015	   architectures are assigned 1, and other customized architectures are
1016	   assigned 2 to 12.
1017	   1: Fat-tree
1018	   2: BCube
1019	   3: MatrixDCN.
1020	   4: xxx.
1021	   ......
1022	   12: xxx.

1024	9.  Topology identification and broadcast storm suppression

1026	   In this design, the initial topology discovery process is not a
1027	   mandatory option for a FAR routing protocol.  The recommended
1028	   solution here is to use a pre-configured configuration file, which
1029	   contains topology parameters of the current system, each node device
1030	   as long as according to these configuration parameters will be able
1031	   to know the topology information.  In this way, we do not have to
1032	   deal with complex topology discovery processes, nor do we need to
1033	   calculate the shortest path, because the optimal path can be
1034	   calculated from the parameters.  This protocol also allows the
1035	   formation of configuration files to be submitted to the topology
1036	   discovery protocol, allowing for a variety of different
1037	   implementation options.

1039	   Regarding the flood suppression processing of broadcast packets, it
1040	   has been considered in the previous content.  Since the hello packets
1041	   is only transmitted between the two nodes, it cannot be spread out.
1042	   The link error message is only sent to the CPU, and are not forwarded
1043	   to the nodes in layer 2 broadcasting.  Moreover, each node will
1044	   discard the repeated error messages when the node receives them.  In
1045	   this way, the broadcast storm can be suppressed.  If a link is
1046	   unstable and repeatedly up or down, the system will not send new
1047	   messages after sending notifications, and the system will not
1048	   oscillate repeatedly.  The topology is updated only when the link is
1049	   later detected to be stable for a long time.

1051	10.  Application Example

1053	   In this section, we take a Fat-tree network(Fig. 7) as an example to
1054	   describe how to apply FAR routing.  Since M1 to M4 are very simple,
1055	   we only introduce how the modules M5, M6, and M7 work in a Fat-tree
1056	   network.

1058	   A Fat-tree network is composed of 4 layers.  The top layer is core
1059	   layer, and the other layers are aggregation layer, edge layer and
1060	   server layer.  There are k pods, each one containing two layers of
1061	   k/2 switches.  Each k-port switch in the edge layer is directly
1062	   connected to k/2 hosts.  The remaining k/2 ports are connected to k/2
1063	   of the k-port switches in the aggregation layer.  There are (k/2)2
1064	   k-port core switches.  Each core switch has one port connected to
1065	   each of the k pods.

1067	       10.0.1.1          10.0.1.2            10.0.2.1         10.0.2.2
1068	         +--+              +--+                +--+               +--+
1069	         |  |              |  |                |  |               |  |
1070	         +--+,_            +--+',              +--+             ,,+--+
1071	         |`,`',`-.,       / | \  `.           .'` -        .-'``.` /|
1072	         |  .  `', `'.,  /  |  '   '       ,-`,'  |`.         .`  ' |
1073	         |   \    `',  `-.,              .`  /    |  `,     .`  ,'  |
1074	         |    `,     `'.   `'-,_      .'`  ,'     |    ',      /    |
1075	         |      .       `'.     `-.,-`    /       |      \
1076	         |       \         `'.,  .` `'., `        |       `.
1077	         |        `,          .'`,     ,`'.,      |         ',
1078	         |          .      ,-`    '., -     `'-,_ |           `.
1079	         |           \   .`         ,'.,         `|.,           .
1080	         |            .'`          /    `-,       |  `'.,        `.
1081	     10.1.0.1      ,-`  .        .'    10.3.0.1  10.3.0.2`'.,      ',
1082	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1083	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1084	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1085	          |  \/ |         |  \/ |           |  \/ |           |  \/ |
1086	         +--+/\+--+      +--+/\+--+        +--+/\+--+        +--+/\+--+
1087	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1088	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1089	          /|   10.1.2.1   /|    |\     10 3.1.3   |\          /|   |  |
1090	         / |    | \      / |    | \        / |    | \        / |   |  |
1091	        /  |    |  \    /  |    |  \      /  |    |  \      /  |   |  |
1092	       /   |    |   \  /   |    |   \    /   |    |   \    /   |   |  |
1093	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1094	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1095	            10.1.2.2                           10.3.1.3

1097	                        Figure 8: Fat-tree Network

1099	   Aggregation switches are given addresses of the form 10.pod.0.switch,
1100	   where pod denotes the pod number, and switch denotes the position of
1101	   that switch in the upper pod (in [1, k/2]).  Edge switches are given
1102	   addresses of the form 10.pod.switch.1, where pod denotes the pod
1103	   number, and switch denotes the position of that switch in the lower
1104	   pod (in [1, k/2]).  The core switches are given addresses of the form
1105	   10.0.j.i, where j and i denote that switch's coordinates in the
1106	   (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left).
1107	   The address of a host follows the pod switch to which it is connected
1108	   to; hosts have addresses of the form: 10.pod.switch.ID, where ID is
1109	   the host's position in that subnet (in [2, k/2+1], starting from left
1110	   to the right).

1112	10.1.  BRT Building Procedure

1114	   By leveraging the topology's regularity, every switch clearly knows
1115	   how it forwards a packet.  When a packet arrives at an edge switch,
1116	   if the destination of the packet lies in the same subnet with the
1117	   switch, then the switch directly forwards the packet to the
1118	   destination server through layer-2 switching.  Otherwise, the switch
1119	   forwards the packet to any of aggregation switches in the same pod.
1120	   When a packet arrives at an aggregation switch, if the destination of
1121	   the packet lies in the same pod, the switch forwards the packet to
1122	   the corresponding edge switch.  Otherwise, the switch forwards the
1123	   packet to any of core switches that it is connected to.  If a core
1124	   switch receives a packet, it forwards the packet to the corresponding
1125	   aggregation switch that lies in the destination pod.

1127	   The forwarding policy discussed above is easily expressed through a
1128	   BRT.  The BRT of an edge switch, such as 10.1.1.1, is composed of the
1129	   following entries:

1131	   Destination/Mask       Next hop
1132	   10.0.0.0/255.0.0.0     10.1.0.1
1133	   10.0.0.0/255.0.0.0     10.1.0.2

1135	   The BRT of an aggregation switch, such as 10.1.0.1, is composed of
1136	   the following entries:

1138	   Destination/Mask            Next hop
1139	   10.1.1.0/255 255.255.0      10.1.1.1
1140	   10.1.2.0/255.255.255.0      10.1.2.1
1141	   10.0.0.0/255.0.0.0          10.0.1.1
1142	   10.0.0.0/255.0.0.0          10.0.1.2

1144	   The BRT of a core switch, such as 10.0.1.1, is composed of the
1145	   following entries:

1147	   Destination/Mask            Next hop
1148	   10.1.0.0/255 255.0.0        10.1.0.1
1149	   10.2.0.0/255.255.0.0        10.2.0.1
1150	   10.3.0.0/255.255.0.0        10.3.0.1
1151	   10.4.0.0/255.255.0.0        10.4.0.1

1153	10.2.  NRT Building Procedure

1155	   The route entries in an NRT are related with link and node failures.
1156	   We summarize all types of cases into three (3) catalogs.

1158	10.2.1.  Single Link Failure

1160	   In Fat-tree, Links can be classified as 3 types by their locations:
1161	   1) servers to edge switches; 2) edge to aggregation switches; 3)
1162	   aggregation to core switches.  Link failures between servers to edge
1163	   switches only affect the communication of the corresponding servers
1164	   and don't affect the routing tables of any switch, so we only discuss
1165	   the second and third type of links failures.

1167	   Edge to Aggregation Switches

1169	   Suppose that the link between an edge switch, such as 10.1.2.1 (A),
1170	   and an aggregation switch, such as 10.1.0.1(B),fails.  This link
1171	   failure may affect 3 types of communications.

1173	   o Sources lie in the same subnet with A, and destinations do not.  In
1174	   this case, the link failure will only affect the routing tables of A.
1175	   As this link is attached to A directly, A only needs to delete the
1176	   route entries whose next hop is B in its BRT and add no entries to
1177	   its NRT when A's M6 module detect the link failure.

1179	   o Destinations lie in the same subnet with A, and sources lie in
1180	   another subnet of the same pod.  In this case, the link failure will
1181	   affect the routing tables of all the edge switches in the same pod
1182	   except for A.  When an edge switch, such as 10.1.1.1, learns the link
1183	   failure, it will add a route entry to its NRT:

1185	   Destination/Mask            Next hop
1186	   10.1.2.0/255.255.255.0      10.1.0.1

1188	   o Destinations lie in the same subnet with A, sources lie in another
1189	   pod.  In this case, the link failure will affect the routing tables
1190	   of all the edge switches in the other pods.  When an edge switch in
1191	   one other pod, such as 10.3.1.1, learns the link failure, because all
1192	   the routings that pass through 10.3.0.1 to A will certainly pass
1193	   through the link between A and B, 10.3.1.1 need add a route entry to
1194	   its NRT:

1196	   Destination/Mask            Next hop
1197	   10.1.2.0/255.255.255.0      10.3.0.1
1198	   Aggregation to Core Switches

1200	   Suppose that the link between an aggregation switch, such as 10.1.0.1
1201	   (A), and a core switch, such as 10.0.1.2(B), fails.  This link
1202	   failure may affect 2 types of communications.

1204	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1205	   the other pods.  In this case, the link failure will only affect the
1206	   routing tables of A.  As this link is attached to A directly, A only
1207	   need to delete the route entries whose next hop is B in its BRT and
1208	   add no entries to its NRT when A's M6 module detect the link failure.

1210	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1211	   another pod.  In this case, the link failure will affect the routing
1212	   tables of all the aggregation switches in other pods except for pod
1213	   1.  When an aggregation switch in one other pod, such as 10.3.0.1,
1214	   learns the link failure, because all the routings that pass through
1215	   10.0.1.2 to the pod 1 where A lies will certainly pass through the
1216	   link between A and B, 10.3.0.1 need add a route entry to its NRT:

1218	   Destination/Mask            Next hop
1219	   10.1.0.0/255.255.0.0        10.0.1.2

1221	10.2.2.  A Group of Link Failures

1223	   If all the uplinks of an aggregation switch fail, then this switch
1224	   cannot forward packets, which will affect the routing of every edge
1225	   switches.  Suppose that all the uplinks of the node A (10.1.0.1)
1226	   fail, it will affect two types of communications.

1228	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1229	   the other pods.  In this case, the link failures will affect the
1230	   routing of the edge switches in the Pod of A.  To avoid the node A,
1231	   each edge switch should remove the route entry "10.0.0.0/255.0.0.0
1232	   10.1.0.1" in which the next hop is the node A.

1234	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1235	   other pods.  In this case, the link failures will affect the routing
1236	   of edge switches in other pods.  For example, if the edge switch
1237	   10.3.1.1 communicates with some node in the pod of A, it should avoid
1238	   the node 10.3.0.1, because any communication through 10.3.0.1 to the
1239	   pod of A will pass through the node A.  So a route entry should be
1240	   added to 10.3.1.1:

1242	   Destination/Mask            Next hop
1243	   10.1.0.0/255.255.0.0        10.3.0.1

1245	10.2.3.  Node Failures

1247	   At last, we discuss the effect of node failures to a NRT.  There are
1248	   3 types of node failures: the failure of edge, aggregation and core
1249	   switches.

1251	   o An edge switch fails.  The failure doesn't affect the routing table
1252	   of any switch.

1254	   o A core switch fails.  Only when all the core switches connected to
1255	   the same aggregation switch fail, they will affect the routing of
1256	   other switches.  This case is equal to the case that all the uplinks
1257	   of an aggregation switch fail, so the process of link failures can
1258	   cover it.

1260	   o An aggregation switch fails.  This case is similar to the case that
1261	   all the uplinks of an aggregation switch fail.  It affects the
1262	   routing of edge switches in other pods, but doesn't affect the
1263	   routing of edge switches in pod of the failed switch.  The process of
1264	   this failure is same to the second case in section 6.2.2.

1266	10.3.  Routing Procedure

1268	   FAR decides a routing by looking up its BRT and NRT.  We illuminate
1269	   the routing procedure by an example.  In this example, we suppose
1270	   that the link between 10.3.1.1 and 10.3.0.2 and the link between
1271	   10.1.2.1 and 10.1.0.2 have failed.  Then we look into the routing
1272	   procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
1273	   (destination).

1275	   Step 1: The source 10.3.1.3 sends packets to its default router
1276	   10.3.1.1

1278	   Step 2: The routing of 10.3.1.1.

1280	   1) Calculate candidate hops

1282	   10.3.1.1 looks up its BRT and gets the following matched entries:

1284	   Destination/Mask            Next hop
1285	   10.0.0.0/255.0.0.0          10.3.0.1

1287	   So the candidate hops = {10.3.0.1}

1289	   2) Calculate avoiding hops

1291	   Its NRT is empty, so the set of avoiding hop is empty too.

1293	   3) Calculate applicable hops

1295	   The applicable hops are candidate hops minus avoiding hops, so:

1297	   The applicable hops = {10.3.0.1}

1299	   4) Forward packets to 10.3.0.1

1301	   Step 3: The routing of 10.3.0.1

1303	   1) Calculate candidate hops.

1305	   10.3. 0.1 looks up its BRT and gets the following matched entries:

1307	   Destination/Mask           Next hop
1308	   10.1.0.0/255.255.0.0       10.0.1.1
1309	   10.1.0.0/255.255.0.0       10.0.1.2

1311	   So the candidate hops = {10.0.1.1, 10.0.1.2}

1313	   2) Calculate avoiding hops

1315	   Destination/Mask           Next hop
1316	   10.1.0.0/255.255.0.0       10.0.1.2

1318	   So the avoiding hops = {10.0.1.2}

1320	   3) Calculate applicable hops

1322	   The applicable hops are candidate hops minus avoiding hops, so:

1324	   The applicable hops = {10.0.1.1}

1326	   4) Forward packets to 10.0.1.1

1328	   Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
1329	   routing tables.

1331	   Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
1332	   routing tables.

1334	   Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by
1335	   layer-2 switching.

1337	10.4.  FAR's Performance in Large-scale Networks

1339	   FAR has good performance to support large-scale networks.  In this
1340	   section, we take a Fat-tree network composed of 2,880 48-port
1341	   switches and 27,648 servers as an example to show FAR's performance.

1343	10.4.1.  The number of control messages required by FAR

1345	   FAR exchanges a few messages between routers and only consumes a
1346	   little network bandwidth.  Tab. 1 shows the required messages in the
1347	   example Fat-tree network.
1348	   Table 1:Required messages in a Fat-tree network.
1349	   _____________________________________________________________________
1350	   Message Type| Scope | size(bytes) | Rate | Bandwidth
1351	   ---------------------------------------------------------------------
1352	   Hello |adjacent switches|less than 48|10 messages/sec|less than 4
1353	   kbps
1354	   ---------------------------------------------------------------------
1355	   DLR |adjacent switches| less than 48 | (1) |48bytes
1356	   ---------------------------------------------------------------------
1357	   DA |entire network| less than 48 | (2) |1.106M
1358	   ---------------------------------------------------------------------
1359	   LFA |entire network| less than 48 | (3) |48 bytes
1360	   _____________________________________________________________________
1361	   (1)Produce one when a router starts
1362	   (2)The number of switches(2,880) in a period
1363	   (3)Produce one when a link fails or recovers

1365	10.4.2.  The Calculating Time of Routing Tables

1367	   A BRT is calculated according to the states of its neighbor routers
1368	   and attached links.  An NRT is calculated according to device and
1369	   link failures in the entire network.  So FAR does not calculate
1370	   network topology and has no problem of network convergence, which
1371	   greatly reduces the calculating time of routing tables.  The
1372	   detection and spread time of link failures is very short in FAR.
1373	   Detection time is up to the interval of sending Hello message.  In
1374	   FAR, the interval is set to 100ms, and a link failure will be
1375	   detected in 200ms.  The spread time between any pair of routers is
1376	   less than 200ms.If a link fails in a data center network, FAR can
1377	   detect it, spread it to all the routers, and calculate routing tables
1378	   in no more than 500ms.

1380	10.4.3.  The Size of Routing Tables

1382	   For the test Fat-tree network, the sizes of BRTs and NRTs are shown
1383	   in Tab. 2.

1385	   Table 2: The size of routing tables in FAR
1386	   _____________________________________________________________________
1387	   Routing Table| Core Switch | Aggregation Switch | Edge Switch |
1388	   ---------------------------------------------------------------------
1389	   BRT | 48 | 48 | 24
1390	   ---------------------------------------------------------------------
1391	   NRT | 0 | 14 | 333
1392	   _____________________________________________________________________

1394	   The BRT's size at a switch is determined by the number of its
1395	   neighbor switches.  In the example network, a core switch has 48
1396	   neighbor switches (aggregation switch), so it has 48 entries in its
1397	   BRT.Only aggregation and edge switches have NRTs.  The NRT size at a
1398	   switch is related to the number of link failures in the network.
1399	   Suppose that there are 1000 link failures in the example network, the
1400	   number of failed links is 1.2% of total links, which is a very high
1401	   failure ratio.  We suppose that link failures are uniformly
1402	   distributed in the entire network.  The NRT size at an edge switch is
1403	   about 333 and the NRT size of an aggregation switch is about 14in
1404	   average.

1406	11.  Implementations Examples

1408	   In the FAR draft scenario, Fat-Tree topology has only three layers of
1409	   routers.  To expand the network scale is achieved through horizontal
1410	   expansion: increase the number of core switches, and increase the
1411	   number of aggregation switches and edge switches in pod.

1413	   For example, the following two scenarios.

1415	   10.0.1.1          10.0.2.1            10.0.3.1         10.0.24.1
1416	     +--+              +--+                +--+               +--+
1417	     |  |24devices     |  |24devices       |  |24devices      |  | 24devices
1418	     +--+,_            +--+',              +--+             ,,+--+
1419	     |`,`',`-.,       / | \  `.           .'` -        .-'``.` /|
1420	     |  .  `', `'.,  /  |  '   '       ,-`,'  |`.         .`  ' |
1421	     |   \    `',  `-.,              .`  /    |  `,     .`  ,'  |
1422	     |    `,     `'.   `'-,_      .'`  ,'     |    ',      /    |
1423	     |      .       `'.     `-.,-`    /       |      \
1424	     |       \         `'.,  .` `'., `        |       `.
1425	     |        `,          .'`,     ,`'.,      |         ',
1426	     |          .      ,-`    '., -     `'-,_ |           `.
1427	     |           \   .`         ,'.,         `|.,           .
1428	     |            .'`          /    `-,       |  `'.,        `.
1429	 10.1.0.1      ,-`  .        .'                      `'.,      ',
1430	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1431	     |  |..|  |      |  |..|  |        |  |..|  |        |  |..|  |A total of 1152 aggregation switches
1432	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1433	      |  \/ |         |  \/ |           |  \/ |           |  \/ |
1434	     +--+/\+--+      +--+/\+--+        +--+/\+--+        +--+/\+--+
1435	     |  |..|  |      |  |..|  |        |  |..|  |        |  |..|  |A total of 1152 access switches
1436	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1437	      /|   10.1.24.1  /|    |\                |\          /|   |  |
1438	     / |    | \      / |    | \        / |    | \        / |   |  |
1439	    /  |    |  \    /  |    |  \      /  |    |  \      /  |   |  |
1440	   /   |    |   \  /   |    |   \    /   |    |   \    /   |   |  |
1441	   ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1442	   ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++

1444	    Figure 9: 48 pods, each of which has 24 aggregation switches and 24
1445	                              access switches

1447	   In the Fat-tree network of Figure 9, there are a total of 48 pods,
1448	   each of which has 24 aggregation switches and 24 access switches, and
1449	   each access switch is connected to 24 servers. 576 core switches,
1450	   1152 aggregation switches, and 1152 access switches are required, for
1451	   a total of 2880 switches, which can accommodate 27,648 servers.

1453	                         96 10 gigabit core switches
1454	     +--+              +--+                +--+               +--+
1455	     |  |4devices      |  |4devices        |  |4devices       |  |4devices
1456	     +--+,_            +--+',              +--+             ,,+--+
1457	     |`,`',`-.,       / | \  `.           .'` -        .-'``.` /|
1458	     |  .  `', `'.,  /  |  '   '       ,-`,'  |`.         .`  ' |
1459	     |   \    `',  `-.,              .`  /    |  `,     .`  ,'  |
1460	     |    `,     `'.   `'-,_      .'`  ,'     |    ',      /    |
1461	     |      .       `'.     `-.,-`    /       |      \
1462	     |       \         `'.,  .` `'., `        |       `.
1463	     |        `,          .'`,     ,`'.,      |         ',
1464	     |          .      ,-`    '., -     `'-,_ |           `.
1465	     |           \   .`         ,'.,         `|.,           .
1466	     |            .'`          /    `-,       |  `'.,        `.
1467	 10.1.0.1      ,-`  .        .'                      `'.,      ',
1468	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1469	     |  |..|  |      |  |..|  |        |  |..|  |        |  |..|  |A total of 192 10 gigabit aggregation switches
1470	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1471	      |  \/ |         |  \/ |           |  \/ |           |  \/ |
1472	     +--+/\+--+      +--+/\+--+        +--+/\+--+        +--+/\+--+
1473	     |  |..|  |      |  |..|  |        |  |..|  |        |  |..|  |A total of 1152 access switches
1474	     +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1475	      /|   10.1.4.1   /|    |\                |\          /|   |  |
1476	     / |    | \      / |    | \        / |    | \        / |   |  |
1477	    /  |    |  \    /  |    |  \      /  |    |  \      /  |   |  |
1478	   /   |    |   \  /   |    |   \    /   |    |   \    /   |   |  |
1479	   ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1480	   ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++

1482	    Figure 10: 48 pods.  Each pod has 4 10G aggregation switches and 24
1483	                          Gigabit access switches

1485	   In the Fat-Tree network of Figure 10, there are a total of 48 pods.
1486	   Each pod has 4 10G aggregation switches and 24 Gigabit access
1487	   switches.  Each access switch is connected to 40 servers.  Requires
1488	   96 core switches, 192 aggregation switches, and 1152 access switches,
1489	   for a total of 1,440 switches, which can accommodate 46,080 servers.

1491	12.  Security Considerations

1493	   The security considerations will be discussed in a future version of
1494	   this document.

1496	13.  Conclusions

1498	   This draft introduces FAR protocol, a generic routing method and
1499	   protocol, for data centers that have a regular topology.  It uses two
1500	   routing tables, a BRT and an NRT, to store the normal routing paths
1501	   and the forbidden (to-be-avoided) routing paths, respectively.  This
1502	   makes the FAR protocol very simple and efficient.  The sizes of these
1503	   two tables are very small.  Usually, a BRT has only several tens of
1504	   entries and an NRT has only several or about a dozen entries.

1506	14.  Acknowledgments

1508	   This document is supported by ZTE Enterprise-University-Research
1509	   Joint Project.

1511	15.  References

1513	15.1.  Normative References

1515	   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1516	   Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March
1517	   1997.

1519	   [RFC2328] J.  Moy, "OSPF Version 2", BCP 14, RFC2328, April 1998.

1521	   [RFC3619] SHAH, S.; YIP, M.  RFC3619: Extreme Networks' Ethernet
1522	   Automatic Protection Switching (EAPS) Version 1. 2003.

1524	15.2.  Informative References

1526	   [FAT-TREE] M.  Al-Fares, A.  Loukissas, and A.  Vahdat."A
1527	   Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM
1528	   2008.

1530	   [BCube] Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., ...  Lu,
1531	   S. (2009, August).  BCube: a high performance, server-centric network
1532	   architecture for modular data centers.  In Proceedings of the ACM
1533	   SIGCOMM 2009 conference on Data communication (pp. 63-74).

1535	   [MatrixDCN] Sun, Y., Chen, M., Peng, L., Hassan, M.  M., Alelaiwi, A.
1536	   (2016).  MatrixDCN: a high performance network architecture for
1537	   large-scale cloud data centers.  Wireless Communications and Mobile
1538	   Computing, 16(8), 942-959.

1540	16.  Appendix

1542	16.1.  Application Area of the Solution

1544	   According to the horizontal expansion mode of the above scenarios,
1545	   the whole Fat-Tree network does not need to be expanded to 4 layers
1546	   (4 order Fat-Tree) even if it is expanded.  Using a standard three-
1547	   tier Fat-tree network, we can scale the network to meet all the
1548	   problems of commercial network applications.  This scheme is suitable
1549	   for non-SDN distributed Fat-Tree network architecture.

1551	16.2.  Technical evolution roadmap

1553	   In this draft, we should design different rules for FAR switches in
1554	   different regular networks to calculate routing tables, which limits
1555	   FAR's extensibility.  Fortunately, the latest SDN technology make it
1556	   is easy to update the control plane of switches, since all the
1557	   function of control plane are centralized to a controller in SDN.  We
1558	   are designing the next generation routing scheme for regular networks
1559	   based on SDN.  In the new scheme, we design a regular ToPology
1560	   Description Language (TPDL) to descript a regular network.  In TPDL,
1561	   the distance between different type of node groups is defined by a
1562	   group of distance formulas and the number of formulas is finite and
1563	   fixed without increasing by the scale of a network.  And then,
1564	   switches learn the topology of a network by taking advantage of TPDL
1565	   and generate flow table entries to forward packets without help of
1566	   the SDN controller.  If no entry is found for a forwarding packet,
1567	   switches transport the packet to the SDN controller, and the
1568	   controller recalculates a new routing path using A* algorithm by
1569	   taking TPDL's distance formulas as a heuristic function and dispatch
1570	   flow table entries down to related switches on the path.

1572	16.3.  Updating roadmap

1574	   In the next version, we will continue to advance the remaining issues
1575	   such as protocol fields, section 3.2 simplification, etc.

1577	Authors' Addresses

1579	   Bin Liu
1580	   ZTE Inc., ZTE Plaza
1581	   No.19 East Huayuan Road,Hai Dian District
1582	   Beijing  100191
1583	   China

1585	   Phone: +86 -010-59932039
1586	   Email: 13683610386@139.com
1587	   Yantao Sun
1588	   Beijing Jiaotong University
1589	   No.3 Shang Yuan Cun, Hai Dian District
1590	   Beijing  100044
1591	   China

1593	   Email: ytsun@bjtu.edu.cn

1595	   Jing Cheng
1596	   Beijing Jiaotong University
1597	   No.3 Shang Yuan Cun, Hai Dian District
1598	   Beijing  100044
1599	   China

1601	   Email: yourney.j@gmail.com

1603	   Yichen Zhang
1604	   Beijing Jiaotong University
1605	   No.3 Shang Yuan Cun, Hai Dian District
1606	   Beijing  100044
1607	   China

1609	   Email: snowfall_dan@sina.com

1611	   Bhumip Khasnabish
1612	   Individual contributor
1613	   55 Madison Avenue, Suite 160
1614	   Morristown, New Jersey    07960
1615	   USA

1617	   Phone: +001-781-752-8003
1618	   Email: vumip1@gmail.com
1619	   URI:   http://tinyurl.com/bhumip/