idnits 2.17.1 

draft-sl-rtgwg-far-dcn-13.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  == There are 54 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet has text resembling
     RFC 2119 boilerplate text.

  -- The document date (Nov 23, 2019) is 1615 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'FAT-TREE' is defined on line 1404, but no explicit
     reference was found in the text


     Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Routing Area Working Group                                       Bin Liu
3	Internet-Draft                                       ZTE Inc., ZTE Plaza
4	Intended status: Informational                                Yantao Sun
5	Expires: May 26, 2020                                         Jing Cheng
6	                                                            Yichen Zhang
7	                                             Beijing Jiaotong University
8	                                                       Bhumip Khasnabish
9	                                                             ZTE TX Inc.
10	                                                            Nov 23, 2019

12	   Generic Fault-avoidance Routing Protocol for Data Center Networks
13	                       draft-sl-rtgwg-far-dcn-13

15	Abstract

17	   This draft describes a generic routing method and protocol for a
18	   regular data center network, named the Fault-Avoidance Routing (FAR)
19	   protocol.  The FAR protocol provides a generic routing method for all
20	   types of regular topology network architectures that have been
21	   proposed for large-scale cloud-based data centers over the past few
22	   years.  The FAR protocol is designed to leverage any regularity in
23	   the topology and compute its routing table in a simplistic manner.
24	   Fat-tree is taken as an example architecture to illustrate how the
25	   FAR protocol can be applied in real operational scenarios.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at https://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on May 26, 2020.

44	Copyright Notice

46	   Copyright (c) 2019 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (https://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
62	     1.1.  Acronyms & Definitions  . . . . . . . . . . . . . . . . .   4
63	   2.  Conventions used in this document . . . . . . . . . . . . . .   5
64	   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   5
65	     3.1.  The Impact of Large-scale Networks on Route Calculation .   5
66	     3.2.  Dilemma of conventional routing methods in a large-scale
67	           network with giant number nodes of routers  . . . . . . .   6
68	     3.3.  Network Addressing Issues . . . . . . . . . . . . . . . .   8
69	     3.4.  Big Routing Table Issues  . . . . . . . . . . . . . . . .   8
70	     3.5.  Adaptivity Issues for Routing Algorithms  . . . . . . . .   9
71	     3.6.  Virtual Machine Migration Issues  . . . . . . . . . . . .   9
72	   4.  The FAR Framework . . . . . . . . . . . . . . . . . . . . . .   9
73	   5.  Data Format . . . . . . . . . . . . . . . . . . . . . . . . .  10
74	     5.1.  Data Tables . . . . . . . . . . . . . . . . . . . . . . .  10
75	     5.2.  Messages  . . . . . . . . . . . . . . . . . . . . . . . .  13
76	   6.  FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . .  17
77	     6.1.  Neighbor and Link Detection Module(M1)  . . . . . . . . .  17
78	     6.2.  Device Learning Module(M2)  . . . . . . . . . . . . . . .  17
79	     6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)   18
80	     6.4.  Link Failure Learning Module(M4)  . . . . . . . . . . . .  18
81	     6.5.  BRT Building Module(M5) . . . . . . . . . . . . . . . . .  18
82	     6.6.  NRT Building Module(M6) . . . . . . . . . . . . . . . . .  19
83	     6.7.  Routing Table Lookup(M7)  . . . . . . . . . . . . . . . .  19
84	   7.  How a FAR Router Works  . . . . . . . . . . . . . . . . . . .  19
85	   8.  Compatible Architecture . . . . . . . . . . . . . . . . . . .  22
86	   9.  Topology identification and broadcast storm suppression . . .  22
87	   10. Application Example . . . . . . . . . . . . . . . . . . . . .  23
88	     10.1.  BRT Building Procedure . . . . . . . . . . . . . . . . .  25
89	     10.2.  NRT Building Procedure . . . . . . . . . . . . . . . . .  26
90	       10.2.1.  Single Link Failure  . . . . . . . . . . . . . . . .  26
91	       10.2.2.  A Group of Link Failures . . . . . . . . . . . . . .  27
92	       10.2.3.  Node Failures  . . . . . . . . . . . . . . . . . . .  28
93	     10.3.  Routing Procedure  . . . . . . . . . . . . . . . . . . .  28
94	     10.4.  FAR's Performance in Large-scale Networks  . . . . . . .  30
95	       10.4.1.  The number of control messages required by FAR . . .  30
96	       10.4.2.  The Calculating Time of Routing Tables . . . . . . .  30
97	       10.4.3.  The Size of Routing Tables . . . . . . . . . . . . .  30
98	   11. Security Considerations . . . . . . . . . . . . . . . . . . .  31
99	   12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  31
100	   13. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  31
101	   14. References  . . . . . . . . . . . . . . . . . . . . . . . . .  31
102	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  32

104	1.  Introduction

106	   In recent years, with the rapid development of cloud computing
107	   technologies, the widely deployed cloud services, such as Amazon EC2
108	   and Google search, bring about huge challenges to data center
109	   networking (DCN).  Today's cloud-based data centers (DCs) require
110	   large-scale networks with larger internal bandwidth and smaller
111	   transfer delay.  However, conventional networks cannot meet such
112	   requirements due to limitations in their network architecture.  In
113	   order to satisfy the requirements of cloud computing services, many
114	   new network architectures have been proposed for data centers, such
115	   as Fat-tree, MatrixDCN, and BCube.  These new architectures can
116	   support non-blocking large-scale datacenter networks with more than
117	   tens of thousands of physical servers.

119	   All of these architectures have regular topologies, which are common
120	   features.  The regular topology refers to the network topology
121	   structure with obvious regularity and symmetry, which is conducive to
122	   automatic configuration of the network, such as the Fat-tree network.
123	   In a regular topology, each network node such as a switch or router
124	   can be addressed by its location and through a node's address, the
125	   node's connections to its neighbors in a network can be determined,
126	   and furthermore, the route to the node from other nodes in the
127	   network can be determined.  So nodes can compute route entries
128	   without learning topology.

130	   This document describes a generic routing method and protocol, the
131	   Fault-Avoidance Routing (FAR) protocol, for DCNs.  This method
132	   leverages the regularity in the topologies of data center networks to
133	   simplify routing learning and accelerate the query of routing tables.
134	   This routing method has a better fault tolerance and can be applied
135	   to any DCN with a regular topology.

137	   FAR is not a routing protocol to replace generic routing protocols
138	   such as OSPF[RFC2328] and IS-IS.  It cannot be used in general local
139	   networks whose topological structures are arbitrary, and whose scales
140	   are also not very large.  OSPF and IS-IS works very well in such a
141	   network.  But in a large-scale network with regular topology, FAR has
142	   better performance.  Compared with OSPF and IS-IS, FAR has shorter
143	   time of network convergence and lower PDU overhead.  Furthermore, FAR
144	   requires less computing and storage resources, which lets FAR routers
145	   to run at a lower cost of production than the generic routers.

147	   In addition, for each type of network architecture, researchers
148	   designed a routing algorithm according to the features of its
149	   topology.  Because these routing algorithms are different and lack
150	   compatibility with each other, it is very difficult to develop a
151	   routing protocol for network routers supporting multiple routing
152	   algorithms.  FAR has better adaptability than these specified routing
153	   methods.

155	   FAR consists of three components, i.e., link state learning unit,
156	   routing table building unit and routing table querying unit.  In the
157	   link state learning unit, FAR exchanges link failures among routers
158	   to establish a consistent knowledge of the entire network.  In this
159	   stage, the regularity in topology is exploited to infer failed links
160	   and routers.  In the routing table building unit, FAR builds up two
161	   routing tables, i.e., a basic routing table (BRT) and a negative
162	   routing table (NRT), for each router according to the network
163	   topology and link states.  In the last component, routers forward
164	   incoming packets by looking up the two routing tables.  The matched
165	   entries in BRT minus the matched entries in NRT are the final route
166	   entries to be used to forward an incoming packet.

168	   The remainder of this draft is organized as follows.  The problem to
169	   be addressed by FAR is described in Section 3.  The framework of FAR
170	   routing protocol is described in Section 4.  Section 5 and 6
171	   introduce FAR's data format FAR and modules in detail.  Section 7
172	   describe how FAR works by finite state machine (FSM).  In Section 8,
173	   we discussed how FAR works with variable network architectures.
174	   Section 9 takes Fat-tree network as an example to illuminate how FAR
175	   works.

177	1.1.  Acronyms & Definitions

179	   DCN - Data Center Network

181	   FAR - Fault-Avoidance Routing

183	   BRT - Basic Routing Table

185	   NRT - Negative Routing Table

187	   NDT - Neighbor Devices Table

189	   ADT - All Devices Table

191	   LFT - Link Failure Table
192	   DA - Device Announcement

194	   LFA - Link Failure Announcement

196	   DLR - Device and Link Request

198	   IP - Internet Protocol

200	   VM - Virtual Machine

202	2.  Conventions used in this document

204	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
205	   NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
206	   this document are to be interpreted as described in RFC-2119
207	   [RFC2119].  In this document, these words will appear with that
208	   interpretation only when in ALL CAPS.  Lower case uses of these words
209	   are not to be interpreted as carrying RFC-2119 significance.

211	3.  Problem Statement

213	   The problem to be addressed by FAR as proposed in this draft is
214	   described in this section.  The expansion of Cloud data center
215	   networks has brought significant challenges to the existing routing
216	   technologies.  FAR mainly solves a series of routing problems faced
217	   by large-scale data center networks.

219	3.1.  The Impact of Large-scale Networks on Route Calculation

221	   In a large-scale cloud data center network, there may be thousands of
222	   routers.  Running OSPF and other routing protocols in such network
223	   will encounter these two challenges: a) Network convergence time
224	   would be too long, which will cause a longer time to elapse for
225	   creating and updating the routes.  The response time to network
226	   failures may be excessively long; b) a large number of routing
227	   protocol packets need to be sent.  The routing information consumes
228	   too much network bandwidth and CPU resources, which easily leads to
229	   packet loss and makes the problem (a) more prominent.

231	   In order to solve these problems, a common practice is to partition a
232	   large network into some small areas, where the route calculation runs
233	   independently within different areas.  However, nowadays the cloud
234	   data centers typically require very large internal bandwidth.  To
235	   meet this requirement, a large number of parallel equivalent links
236	   are deployed in a network, such as a Fat-tree network.  Partitioning
237	   such a network will affect the utilization of routing algorithm on
238	   equivalent multi-path and reduce internal network bandwidth
239	   requirements.

241	   In the FAR routing calculation process, a Basic Routing Table (BRT)
242	   is built on local network topology leveraging the regularity of the
243	   network topologies.  In addition to BRT, FAR also builds a Negative
244	   Routing Table (NRT).  FAR gradually builds NRT in the process of
245	   learning network link failure information, which does not require
246	   learning a complete network fault information.  FAR does not need to
247	   wait for the completion of the network convergence in the process of
248	   building these two tables.  Therefore, it avoids the problem of
249	   excessive network convergence overheads in the route calculation
250	   process.  In addition, FAR only needs to exchange a small amount of
251	   link change information between routers, and hence consumes less
252	   network bandwidth.

254	3.2.  Dilemma of conventional routing methods in a large-scale network
255	      with giant number nodes of routers

257	   There are many real world scenario where tens of thousands of
258	   nodes(or much more nodes) need to be deployed in a flat area, such as
259	   infiniband routing and switching system, high-performance computer
260	   network, and many IDC networks in China.  The similar problems have
261	   been existed long ago.  People have solved the problems through
262	   similar solutions, such as the traditional regular topology-based
263	   RFC3619 protocol, the routing protocols of infiniband routing and
264	   switching system, and high-performance computer network routing
265	   protocol.
266	   Infiniband defines a switch-based network to interconnect processing
267	   nodes and the I/O nodes.  Infiniband can support very large scale
268	   networks, use the regularity in topology to simplify its routing
269	   algorithm, which is just the same to what we do in FAR.

271	   Why OSPF and other conventional routing methods do not work well in a
272	   large-scale network with giant number nodes of routers?

274	   As everyone knows, the OSPF protocol uses multiple databases, more
275	   topological exchange information (as seen in the following example)
276	   and complicated algorithm.  It requires routers to consume more
277	   memory and CPU processing capability.  But the processing rate of CPU
278	   on the protocol message per second is very limited.  When the network
279	   expands, CPU will quickly approach its processing limits, and at this
280	   time OSPF can not continue to expand the scale of the management.
281	   The SPF algorithm itself does not thoroughly solve these problems.

283	   On the contrary, FAR does not have the convergence time delay and the
284	   additional CPU overheads, which SPF requires.  Because in the initial
285	   stage, FAR already knows the regular information of the whole network
286	   topology and does not need to periodically do SPF operation.

288	   One of the examples of "more topological exchange information":

290	   In the OSPF protocol, LSA floods every 1800 seconds.  Especially in
291	   the larger network, the occupation of CPU and band bandwidth will
292	   soon reach the router's performance bottleneck.  In order to reduce
293	   these adverse effects, OSPF introduced the concept of Area, which
294	   still has not solved the problem thoroughly.  By dividing the OSPF
295	   Area into several areas, the routers in the same area do not need to
296	   know the topological details outside their area.  (In comparison with
297	   FAR, after OSPF introducing the concept of Area, the equivalent paths
298	   cannot be selected in the whole network scope)

300	   OSPF can achieve the following results by Area : 1) Routers only need
301	   to maintain the same link state databases as other routers within the
302	   same Area, without the necessity of maintaining the same link state
303	   database as all routers in the whole OSPF domain.  2) The reduction
304	   of the link state databases means dealing with relatively fewer LSA,
305	   which reduces the CPU consumption of routers; 3) The large number of
306	   LSAs flood only within the same Area.  But, its negative effect is
307	   that the smaller number of routers which can be managed in each OSPF
308	   area.  On the contrary, because FAR does not have the above
309	   disadvantages, FAR can also manage large-scale network even without
310	   dividing Areas.

312	   The aging time of OSPF is set in order to adapt to routing
313	   transformation and protocol message exchange happened frequently in
314	   the irregular topology.  Its negative effect is: when the network
315	   does not change, the LSA needs to be refreshed every 1800 seconds to
316	   reset the aging time.  In the regular topology, as the routings are
317	   fixed, it does not need the complex protocol message exchange and
318	   aging rules to reflect the routing changes, as long as LFA mechanism
319	   in the FAR is enough.

321	   Compared with the LSVR protocol, the LSVR protocol has no special
322	   requirements for the network topology structure, however, the FAR
323	   draft is applicable to the regular topology network architecture and
324	   simplifies unnecessary processing.  It is a solution proposed to
325	   greatly improve the routing efficiency of the regular network
326	   topology.  The FAR solution is more efficient than the general
327	   methods such as LSVR in regular topology.

329	   Therefore, in FAR, we can omit many unnecessary processing and the
330	   packet exchange.  The benefits are fast convergence speed and much
331	   larger network scale than other dynamic routing protocol.  Now there
332	   are some successful implementations of simplified routings in the
333	   regular topology in the HPC environment.  Conclusion: As FAR needs
334	   few routing entries and the topology is regular, the database does
335	   not need to be updated regularly.  Without the need for aging, there
336	   is no need for CPU and bandwidth overhead brought by LSA flood every
337	   30 minutes, so the expansion of the network has no obvious effect on
338	   the performance of FAR, which is contrary to OSPF.

340	   Comparison of convergence time: The settings of OSPF spf_delay and
341	   spf_hold_time can affect the change of convergence time.  The
342	   convergence time of the network with 2480 nodes is about 15-20
343	   seconds; while the FAR does not need to calculate the SFP, so there
344	   is no such convergence time.
345	   These issues still exist in rapid convergence technology of OSPF
346	   ,ISIS (such as I-SPF) and LSVR.  The convergence speed and network
347	   scale constraint each other.  FAR does not have the above problems,
348	   and the convergence time is almost negligible.
349	   Can FRR solve these problems?  IP FRR has some limitations.  The
350	   establishment of IP FRR backup scheme will not affect the original
351	   topology and traffic forwarding which are established by protocol,
352	   however, we can not get the information of whereabouts and status
353	   when the traffic is switched to an alternate next hop.

355	3.3.  Network Addressing Issues

357	   Routers are typically configured with multiple network interfaces,
358	   each connected to a subnet.  OSPF and other routing algorithms
359	   require that each interface of a router must be configured with an IP
360	   address.  A large-scale data center network may contain thousands of
361	   routers and each router has dozens of network interfaces, thus, there
362	   are tens of thousands of IP addresses needed to be configured in a
363	   data center.  It will be very complex to configure and manage a large
364	   number of network interfaces and will be difficult to troubleshoot
365	   network problems, then network maintenance will be costly and error-
366	   prone.

368	   In FAR, the device position information is encoded in the IP address
369	   of the router.  Each router only needs to be assigned a unique IP
370	   address according its location, which greatly solves complex network
371	   addressing issues in large-scale networks.

373	3.4.  Big Routing Table Issues

375	   There are a large number of subnets in the large-scale data center
376	   network.  A router may build a routing entry for each subnet, and
377	   therefore the size of routing tables on each router may be very
378	   large.  It will increase a router's cost and reduce the querying
379	   speed of the routing table.

381	   FAR uses two measures to reduce the size of its routing tables: a)It
382	   builds a BRT on the regularity of the network topologies; b)It
383	   introduces a new routing table, i.e., a NRT.  In this way FAR can
384	   reduce the size of routing tables to only a few dozen routing
385	   entries.

387	3.5.  Adaptivity Issues for Routing Algorithms

389	   To implement efficient routing in large-scale datacenters, besides
390	   FAR, some other routing methods are proposed for some specific
391	   network architectures, such as Fat-tree and BCube.  These routing
392	   methods are different(from both design and implementation viewpoints)
393	   and not compatible with the conventional routing methods, which
394	   brings big troubles to network equipment providers to develop new
395	   routers supporting various new routing methods.

397	   FAR is a generic routing method.  With slight modification, FAR
398	   method can be applied to most of regular datacenter networks.
399	   Furthermore, the structure of routing tables and querying a routing
400	   table in FAR are the same as conventional routing method.  If FAR is
401	   adopted, the workload of developing a new type of router will be
402	   significantly decreased.

404	3.6.  Virtual Machine Migration Issues

406	   Supporting VM migration is very important for cloud-based datacenter
407	   networks.  However, in order to support layer-3 routing, routing
408	   methods including OSPF and FAR require limiting VM migration within a
409	   subnet.  For this paradox, the mainstream methods still utilize
410	   layer-3 routing on routers or switches, transmit packets encapsulated
411	   by IPinIP or MACinIP between hosts by tunnels passing through network
412	   to the destination access switch, and then extract original packet
413	   out and send it to the destination host.

415	   By utilizing the aforementioned methods, FAR can be applied to Fat-
416	   tree, MatrixDCN or BCube networks for supporting VM migration in
417	   entire network.

419	4.  The FAR Framework

421	   FAR requires that a DCN has a regular topology, and network devices,
422	   including routers, switches, and servers, are assigned IP addresses
423	   according to their locations in the network.  In other word, we can
424	   locate a device in the network according to its IP address.

426	   FAR is a distributed routing method.  In order to support FAR, each
427	   router needs to have a routing module that implements the FAR
428	   algorithm.  FAR algorithm is composed of three parts, i.e., link-
429	   state learning, routing table building and routing table querying, as
430	   shown in Fig. 1.

432	               Link-state Learning    |Routing Table  | Routing Table
433	                                      |  Building
434	                                      |               |   Querying
435	    +--------+   /---------------\    | +--------+    |    Packets
436	    |2 Device|<--| 1 Neighbor &  |----->| 5 BRT  \    |
437	    |Learning|   | Link Detection|    | |Building|\   |       |
438	    +--------+   \---------------/    | +--------+ \  |      \|/
439	             |           |            |             \ |+--------------+
440	             |           |            |              /||  7 Querying  |
441	            \|/         \|/           |             / || Routing Table|
442	             +-----------------------+|            /  |+--------------+
443	             |3 Invisible Neighbor & ||           /   |
444	             |Link Failure Inferring || +---------    |
445	             +-----------------------+|/| 6 NRT  |    |
446	                         |            / |Building|    |
447	                        \|/          /| +--------+    |
448	                  +--------------+  / |               |
449	                  |4 Link Failure| /  |               |
450	                  |  Learning    |    |               |
451	                  +--------------+    |               |
452	                                      |               |

454	                        Figure 1: The FAR framework

456	   Link-state learning is responsible for a router to detect the states
457	   of its connected links and learn the states of all the other links in
458	   the entire network.  The second part builds two routing tables, a
459	   basic routing table (BRT) and an negative routing table (NRT),
460	   according to the learned link states in the first part.  The third
461	   part queries the BRT and the NRT to decide a next forwarding hop for
462	   the received (ingress) packets.

464	5.  Data Format

466	5.1.  Data Tables

468	   Some data tables are maintained on each router in FAR.  They are:

470	   Neighbor Device Table (NDT): To store neighbor routers and related
471	   links.

473	   All Devices Table (ADT): To store all routers in the entire network.

475	   Link Failures Table (LFT): To store all link failures in the entire
476	   network.

478	   Basic Routing Table (BRT): To store the candidate routes.

480	   Negative Routing Table(NRT): To store the avoiding routes.

482	   The format of NDT

484	    ----------------------------------------------------------
485	    Device ID | Device IP | Port ID | Link State | Update Time
486	    ----------------------------------------------------------

488	   Device ID: The ID of a neighbor router.

490	   Device IP: The IP address of a neighbor router.

492	   Port ID: The port ID that a neighbor router is attached to.

494	   Link State: The state of the link between a router and its neighbor
495	   router.  There are two states: Up and Down.

497	   Update Time: The time of updating the entry.

499	   The format of ADT

501	    --------------------------------------------------
502	    Device ID | Device IP | Type | State | Update Time
503	    --------------------------------------------------

505	   Device ID: The ID of a neighbor router.

507	   Device IP: The IP address of a neighbor router.

509	   Type: The type of a neighbor router.

511	   State: The state of a neighbor router.  There are two states: Up and
512	   Down.

514	   Update Time: The time of updating the entry.

516	   The format of LFT

518	    --------------------------------------------
519	    No | Router 1 IP | Router 2 IP | Timestamp
520	    --------------------------------------------

522	   No: The entry number.

524	   Router 1 IP: The IP address of one router that a failed link connects
525	   to.

527	   Router 2 IP: The IP address of another router that a failed link
528	   connects to.

530	   Timestamp: It identifies when the entry is created.

532	   The format of BRT

534	    -------------------------------------------------------
535	    Destination | Mask | Next Hop | Interface | Update Time
536	    -------------------------------------------------------

538	   Destination: A destination network

540	   Mask: The subnet mask of a destination network.

542	   Next Hop: The IP address of a next hop for a destination.

544	   Interface: The interface related to a next hop.

546	   Update Time: The time of updating the entry.

548	   The format of NRT

550	    -------------------------------------------------------------------
551	    Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp
552	    -------------------------------------------------------------------

554	   Destination: A destination network.

556	   Mask: The subnet mask of a destination network.

558	   Next Hop: The IP address of a next hop that should be avoided for a
559	   destination.

561	   Interface: The interface related to a next hop that should be
562	   avoided.

564	   Failed Link No: A group of failed link numbers divided by "/", for
565	   example 1/2/3.

567	   Timestamp: The time of updating the entry.

569	5.2.  Messages

571	   Some protocol messages are exchanged between routers in FAR.

573	   Hello Message: This message is exchanged between neighbor routers to
574	   learn adjacency.

576	   Device Announcement (DA): Synchronize the knowledge of routers
577	   between routers.

579	   Link Failure Announcement (LFA): Synchronize link failures between
580	   routers.

582	   Device and Link Request (DLR): When a router starts, it requests the
583	   knowledge of routers and links from its neighbors by a DLR message.

585	   A FAR Message is directly encapsulated in an IP packet.  The protocol
586	   field of IP header indicates an IP packet is an FAR message.  The
587	   protocol of IP for FAR should be assigned by IANA.

589	   The four types of FAR messages have same format of packet header,
590	   called FAR header (as shown in Figure 2).

592	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
593	            +-----------+------------+------------------------+
594	            |  Version  |Message Type|    Message Length      |
595	            +-----------+------------+------------------------+
596	            |        Checksum        |       AuType           |
597	            +------------------------+------------------------+
598	            |                  Authentication                 |
599	            +-------------------------------------------------+
600	            |                  Authentication                 |
601	            +-------------------------------------------------+
602	            |                   Timestamp                     |
603	            +-------------------------------------------------+

605	                    Figure 2: The format of FAR header

607	   Version: FAR version

609	   Message Type: The type of FAR message.

611	   Packet Length: The packet length of the total FAR message.

613	   Checksum: The checksum of an entire FAR message.

615	   AuType:Authentication type. 0: no authentication, 1: Plaintext
616	   Authentication, 2: MD5 Authentication.

618	   Authentication: Authentication information. 0: undefined, 1: Key, 2:
619	   key ID, MD5 data length and packet number.  MD5 data is appended to
620	   the backend of the packet.

622	   AuType and Authentication can refer to the definition of OSPF packet.

624	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
625	            +-----------+------------+------------------------+
626	            |  Version  |Message Type|    Message Length      |
627	            +-----------+------------+------------------------+
628	            |        Checksum        |       AuType           |
629	            +------------------------+------------------------+
630	            |                  Authentication                 |
631	            +-------------------------------------------------+
632	            |                  Authentication                 |
633	            +-------------------------------------------------+
634	            |                   Timestamp                     |
635	            +-------------------------------------------------+
636	            |                    Router IP                    |
637	            +------------------------+------------------------+
638	            |     HelloInterval      |     HelloDeadInterval  |
639	            +------------------------+------------------------+
640	            |                Neighbor Router IP               |
641	            +-------------------------------------------------+
642	            |                       ...                       |
643	            +-------------------------------------------------+

645	                  Figure 3: The Format of Hello Messages

647	   For Hello messages, the Message Type in FAR header is set to
648	   1.Besides FAR header, a Hello message(Fig. 3) requires the following
649	   fields:

651	   Router IP: The router IP address.

653	   HelloInterval: The interval of sending Hello messages to neighbor
654	   routers.

656	   RouterDeadInterval: The interval to set a neighbor router dead(out-
657	   of-service).  If in the interval time, a router doesn't receive a
658	   Hello message from its neighbor router, the neighbor router is
659	   treated as dead.

661	   Neighbor Router IP: The IP address of a neighbor router.  All the
662	   neighbor router's addresses should be included in a Hello message.

664	            |<----1---->| <----1---->|<----------2----------->|
665	            +-----------+------------+------------------------+
666	            |  Version  |Message Type|    Message Length      |
667	            +-----------+------------+------------------------+
668	            |        Checksum        |       AuType           |
669	            +------------------------+------------------------+
670	            |                  Authentication                 |
671	            +-------------------------------------------------+
672	            |                  Authentication                 |
673	            +-------------------------------------------------+
674	            |                   Timestamp                     |
675	            +------------------------+------------------------+
676	            |                   Router1 IP                    |
677	            +-------------------------------------------------+
678	            |                       ...                       |
679	            +-------------------------------------------------+

681	                    Figure 4: The Format of DA Messages

683	   For DA messages(Fig. 4), the Message Type in FAR header is set to 2.
684	   Besides FAR header, a DA message includes IP addresses of all the
685	   announced routers.

687	            |<----1---->| <----1---->|<----------2----------->|
688	            +-----------+------------+------------------------+
689	            |  Version  |Message Type|    Message Length      |
690	            +-----------+------------+------------------------+
691	            |        Checksum        |       AuType           |
692	            +------------------------+------------------------+
693	            |                  Authentication                 |
694	            +-------------------------------------------------+
695	            |                  Authentication                 |
696	            +-------------------------------------------------+
697	            |                   Timestamp                     |
698	            +------------------------+------------------------+
699	            |                     Left IP                     |
700	            +-------------------------------------------------+
701	            |                    Right IP                     |
702	            +------------------------+------------------------+
703	            |                      State                      |
704	            +-------------------------------------------------+
705	            |                       ...                       |
706	            +-------------------------------------------------+

708	                   Figure 5: The Format of LFA Messages

710	   For LFA messages(Fig. 5), the Message Type in FAR header is set to 3.
711	   Besides FAR header, a LFA message includes all the announced link
712	   failures.

714	   Left IP: The IP address of the left endpoint router of a link.

716	   Right IP: The IP address of the right endpoint router of a link.

718	   State: Link state. 0: Up, 1: down
719	            |<----1---->| <----1---->|<----------2----------->|
720	            +-----------+------------+------------------------+
721	            |  Version  |Message Type|    Message Length      |
722	            +-----------+------------+------------------------+
723	            |        Checksum        |       AuType           |
724	            +------------------------+------------------------+
725	            |                  Authentication                 |
726	            +-------------------------------------------------+
727	            |                  Authentication                 |
728	            +-------------------------------------------------+
729	            |                   Timestamp                     |
730	            +-------------------------------------------------+

732	                   Figure 6: The Format of DLR Messages

734	   For DLR messages(Fig. 6), the Message Type in FAR header is set to
735	   1.Except for FAR header, DLR has no additional fields.

737	6.  FAR Modules

739	6.1.  Neighbor and Link Detection Module(M1)

741	   M1 is responsible for sending and receiving Hello messages, and
742	   detecting directly-connected links and neighbor routers.  Each Hello
743	   message is encapsulated in a IP packet.  M1 sends Hello messages
744	   periodically to all the active router ports and receives Hello
745	   messages from its neighbor routers.  M1 detects neighbor routers and
746	   directly-connected links according to received Hello Messages and
747	   stores these neighbors and links into a Neighbor Devices Table (NDT).
748	   Additionally, M1 also stores neighbor routers into an All Devices
749	   Table (ADT).

751	6.2.  Device Learning Module(M2)

753	   M2 is responsible for sending, receiving, and forwarding device
754	   announcement (DA) messages, learning all the routers in the whole
755	   network, and deducing faulted routers.  When a router starts, it
756	   sends a DA message announcing itself to its neighbors and a DLR
757	   message requesting the knowledge of routers and links from its
758	   neighbors.  If M2 module of a router receives a DA message, it checks
759	   whether the router encapsulated in the message is in an ADT.  If the
760	   router is not in the ADT, M2 puts this router into the ADT and
761	   forwards this DA message to all the active ports except for the
762	   incoming one, otherwise, M2 discards this message directly.  If M2
763	   module of a router receives a DLR message, it replies a DA message
764	   that encapsulates all of the learned routers.

766	6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)

768	   M3 is responsible for inferring invisible neighbors of the current
769	   router by means of the ADT.  If the link between a router A and its
770	   neighbor B breaks, which results in that M1 module of A cannot detect
771	   the existence of B, then B is an invisible neighbor of A.  Since a
772	   device's location is coded into its IP address, it can be judged
773	   whether two routers are adjacent, according to their IP addresses.
774	   Based on this idea, M3 infers all of the invisible neighbors of the
775	   current router and the related link failures.  The results are stored
776	   into a NDT.  Moreover, link failures also are added into a link-
777	   failure table (LFT).  LFT stores all of the failed links in the
778	   entire network.

780	6.4.  Link Failure Learning Module(M4)

782	   M4 is responsible for sending, receiving and forwarding link failure
783	   announcement (LFA) and learning all the link failures in the whole
784	   network.  M4 broadcasts each newly inferred link failure to all the
785	   routers in the network.  Each link failure is encapsulated in a LFA
786	   message and one link failure is broadcasted only once.  If a router
787	   receives a DLR request from its neighbor, it will reply a LFA message
788	   that encapsulates all the learned link failures through M4 module.
789	   If M4 receives a LFA message, it checks whether the link failure
790	   encapsulated in the message is in a LFT by comparing two link ends
791	   and timestamp.  If the link failure is not in the LFT or timestamp is
792	   different, M4 puts this link failure into the LFT (or update
793	   timestamp only) and forwards this LFA message to all the active ports
794	   except for the incoming one, otherwise, M4 discards this message
795	   directly.

797	   There is a special case a router will rebroadcast a link failure.  If
798	   a router receives a data packet and must forward the packet going
799	   ahead to destination through a failed link, it means some previous
800	   router should avoid this failed link according to its NRT but it
801	   doesn't.  In this case, maybe the previous router missed the LFA
802	   message of the link failure due to some uncertain reasons.  So the
803	   forwarding router rebroadcasts the LFA message.

805	6.5.  BRT Building Module(M5)

807	   M5 is responsible for building a BRT for the current router.  By
808	   leveraging the regularity in topology, M5 can calculate the routing
809	   paths for any destination without the knowledge of the topology of
810	   whole network, and then build the BRT based on a NDT.  Since the IP
811	   addresses of network devices are continuous, M5 only creates one
812	   route entry for a group of destination addresses that have the same
813	   network prefix by means of route aggregation technology.  Usually,
814	   the size of a BRT is very small.  The detail of how to build a BRT is
815	   described in section 5.

817	6.6.  NRT Building Module(M6)

819	   M6 is responsible for building a NRT for the current router.  Because
820	   M5 builds a BRT without considering link failures in network, the
821	   routing paths calculated by the BRT cannot avoid failed links.  To
822	   solve this problem, a NRT is used to exclude the routing paths that
823	   include some failed links from the paths calculated by a BRT.  M6
824	   calculate the routing paths that include failed links and stored them
825	   into the NRT.  The details of how to build a NRT is described in
826	   section 5.

828	6.7.  Routing Table Lookup(M7)

830	   M7 is responsible for querying routing tables and selecting the next
831	   hop for forwarding the packets.  Firstly, M7 takes the destination
832	   address of a forwarding packet as a criterion to look up route
833	   entries in a BRT based on longest prefix match.  All of the matched
834	   entries are composed of a candidate hops list.  Secondly, M7 look up
835	   negative route entries in a NRT taking the destination address of the
836	   forwarding packet as criteria.  This lookup is not limited to the
837	   longest prefix match, any entry that matches the criteria would be
838	   selected and composed of an avoiding hops list.  Thirdly, the
839	   candidate hops minus avoiding hops are composed of an applicable hops
840	   list.  At last, M7 sends the forwarding packet to any one of the
841	   applicable hops.  If the applicable list is empty, the forwarding
842	   packet will be dropped.

844	7.  How a FAR Router Works

846	   Figure 7 shows how a FAR router works by its FSM.

848	                            /---------------\
849	                            |     Start     |
850	                            |               |
851	                            \---------------/
852	                                    |
853	                     +--------+     |1
854	                     |        |     |
855	                     |       \|/   \|/
856	                     |    +--------------------+
857	                     |2   |         ND         |
858	                     |    |                    |
859	                     |    +--------------------+
860	                     |        |      |
861	                     |        |      |3
862	                     +--------+      |
863	                                    \|/
864	                              +--------------+
865	                              |    ND-FIN    |
866	                              |              |
867	                              +--------------+
868	                                     |
869	                                     |4
870	                                     |
871	                                    \|/
872	            ________10_______\+--------------+/_______11________
873	           |                 /|              |\                |
874	           |                  |    Listen    |                 |
875	           |     ____9_______\|              |/_______12___    |
876	           |    |            /|              |\            |   |
877	           |    |             +--------------+             |   |
878	           |    |               5/  |  |    \8             |   |
879	           |    |              |/_  |  |    _\|            |   |
880	           |    |  +------------+  6|  |7   +------------+ |   |
881	           |     --| HELLO-RECV |   |  |    | LFA-RECV   |--   |
882	           |       +------------+   |  |    +------------+     |
883	           |                  ______|  |______                 |
884	           |                 |                |                |
885	           |                \|/              \|/               |
886	           |           +------------+  +------------+          |
887	           |___________|  DLR-RECV  |  |  DA-RECV   |__________|
888	                       +------------+  +------------+

890	                                  _________
891	                                 |         |
892	                                 |         |
893	                                \|/        |
894	                         +--------------+  |13
895	                         | Hello Thread |  |
896	                         +--------------+  |
897	                                 |         |
898	                                 |_________|

900	                                  _________
901	                                 |         |
902	                                 |         |
903	                                \|/        |
904	                         +--------------+  |14
905	                         |  LFD Thread  |  |
906	                         +--------------+  |
907	                                 |         |
908	                                 |_________|

910	                                  _________
911	                                 |         |
912	                                 |         |
913	                                \|/        |
914	                         +--------------+  |15
915	                         |  DA Thread   |  |
916	                         +--------------+  |
917	                                 |         |
918	                                 |_________|

920	                                  _________
921	                                 |         |
922	                                 |         |
923	                                \|/        |
924	                         +--------------+  |16
925	                         |  DFD Thread  |  |
926	                         +--------------+  |
927	                                 |         |
928	                                 |_________|

930	             Figure 7: The Finite State Machine of FAR Router

932	   1)When a router starts up, it starts a Hello thread and then starts
933	   ND (neighbor detection) timer (3 seconds).  Next the router goes into
934	   ND (neighbor detection) state.
935	   2)In the ND state, if a router received a Hello message, then it
936	   performs a Hello-message processing and goes back to the ND state.
937	   3)When the ND timer is over, a router goes into ND-FIN (neighbor
938	   detection finished) state.
939	   4)A router starts the LFD (link failure detection) thread and DFD
940	   (device failure detection) state, and sends DA message and DLR
941	   message to all of its active ports.  Then the router goes into Listen
942	   state.
943	   5) If a router receives a Hello message, then goes into HELLO-RECV
944	   state.
945	   6) If a router receives a DLR message, then goes into DLR-RECV state.
946	   7) If a router receives a DA message, then goes into DA-RECV state.
947	   8) If a router receives a LFA message, then goes into LFA-RECV state.
948	   9) A router performs the Hello-message processing.  After that, it
949	   goes back to Listen state.
950	   10) A router performs the DLR-message processing.  After that, it
951	   goes back to Listen state.
952	   11) A router performs the DA-message processing.  After that, it goes
953	   back to Listen state.

955	   12) A router performs the LFA-message processing.  After that, it
956	   goes back to Listen state.
957	   13) Hello thread produces and sends Hello messages to all its ports
958	   periodically.
959	   14) LFD thread calls link-failure-detection processing to check link
960	   failures in all links periodically
961	   15) DA thread produces and sends DA messages periodically (30
962	   minutes).
963	   16) When DFD thread starts up, it sleep a short time (30 seconds) to
964	   wait for a router learning all the active routers in the network.
965	   Then the thread calls the device-failure-detection processing to
966	   check device failures periodically (30 minutes).

968	8.  Compatible Architecture

970	   As a generic routing protocol, FAR can be run in various DCNs with
971	   regular topology.  Up to now, we have implemented the FAR protocol
972	   for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond.

974	   For different network architectures, most processing of FAR is same
975	   besides calculation of routing tables.  BRT routing tables are
976	   calculated based on Hello messages and NRT routing tables are
977	   calculated based on LFA messages in FAR.  To extend FAR to support a
978	   new network architecture, only processing of Hello and LFA messages
979	   need providing to build BRT and NRT routing tables.

981	   In this protocol, FAR can support maximally 12 network architectures
982	   and at least support 1 built-in network architecture, such as Fat-
983	   tree, BCube and MatrixDCN,etc.  Each network architecture is assigned
984	   a unique number from 1 to 12.  For example, if the 1 built-in
985	   architectures are assigned 1, and other customized architectures are
986	   assigned 2 to 12.
987	   1: Fat-tree
988	   2: BCube
989	   3: MatrixDCN.
990	   4: xxx.
991	   ......
992	   12: xxx.

994	9.  Topology identification and broadcast storm suppression

996	   In this design, the initial topology discovery process is not a
997	   mandatory option for a FAR routing protocol.  The recommended
998	   solution here is to use a pre-configured configuration file, which
999	   contains topology parameters of the current system, each node device
1000	   as long as according to these configuration parameters will be able
1001	   to know the topology information.  In this way, we do not have to
1002	   deal with complex topology discovery processes, nor do we need to
1003	   calculate the shortest path, because the optimal path can be
1004	   calculated from the parameters.  This protocol also allows the
1005	   formation of configuration files to be submitted to the topology
1006	   discovery protocol, allowing for a variety of different
1007	   implementation options.

1009	   Regarding the flood suppression processing of broadcast packets, it
1010	   has been considered in the previous content.  Since the hello packets
1011	   is only transmitted between the two nodes, it cannot be spread out.
1012	   The link error message is only sent to the CPU, and are not forwarded
1013	   to the nodes in layer 2 broadcasting.  Moreover, each node will
1014	   discard the repeated error messages when the node receives them.  In
1015	   this way, the broadcast storm can be suppressed.  If a link is
1016	   unstable and repeatedly up or down, the system will not send new
1017	   messages after sending notifications, and the system will not
1018	   oscillate repeatedly.  The topology is updated only when the link is
1019	   later detected to be stable for a long time.

1021	10.  Application Example

1023	   In this section, we take a Fat-tree network(Fig. 7) as an example to
1024	   describe how to apply FAR routing.  Since M1 to M4 are very simple,
1025	   we only introduce how the modules M5, M6, and M7 work in a Fat-tree
1026	   network.

1028	   A Fat-tree network is composed of 4 layers.  The top layer is core
1029	   layer, and the other layers are aggregation layer, edge layer and
1030	   server layer.  There are k pods, each one containing two layers of
1031	   k/2 switches.  Each k-port switch in the edge layer is directly
1032	   connected to k/2 hosts.  The remaining k/2 ports are connected to k/2
1033	   of the k-port switches in the aggregation layer.  There are (k/2)2
1034	   k-port core switches.  Each core switch has one port connected to
1035	   each of the k pods.

1037	       10.0.1.1          10.0.1.2            10.0.2.1         10.0.2.2
1038	         +--+              +--+                +--+               +--+
1039	         |  |              |  |                |  |               |  |
1040	         +--+,_            +--+',              +--+             ,,+--+
1041	         |`,`',`-.,       / | \  `.           .'` -        .-'``.` /|
1042	         |  .  `', `'.,  /  |  '   '       ,-`,'  |`.         .`  ' |
1043	         |   \    `',  `-.,              .`  /    |  `,     .`  ,'  |
1044	         |    `,     `'.   `'-,_      .'`  ,'     |    ',      /    |
1045	         |      .       `'.     `-.,-`    /       |      \
1046	         |       \         `'.,  .` `'., `        |       `.
1047	         |        `,          .'`,     ,`'.,      |         ',
1048	         |          .      ,-`    '., -     `'-,_ |           `.
1049	         |           \   .`         ,'.,         `|.,           .
1050	         |            .'`          /    `-,       |  `'.,        `.
1051	     10.1.0.1      ,-`  .        .'    10.3.0.1  10.3.0.2`'.,      ',
1052	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1053	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1054	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1055	          |  \/ |         |  \/ |           |  \/ |           |  \/ |
1056	         +--+/\+--+      +--+/\+--+        +--+/\+--+        +--+/\+--+
1057	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1058	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1059	          /|   10.1.2.1   /|    |\     10 3.1.3   |\          /|   |  |
1060	         / |    | \      / |    | \        / |    | \        / |   |  |
1061	        /  |    |  \    /  |    |  \      /  |    |  \      /  |   |  |
1062	       /   |    |   \  /   |    |   \    /   |    |   \    /   |   |  |
1063	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1064	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1065	            10.1.2.2                           10.3.1.3

1067	                        Figure 8: Fat-tree Network

1069	   Aggregation switches are given addresses of the form 10.pod.0.switch,
1070	   where pod denotes the pod number, and switch denotes the position of
1071	   that switch in the upper pod (in [1, k/2]).  Edge switches are given
1072	   addresses of the form 10.pod.switch.1, where pod denotes the pod
1073	   number, and switch denotes the position of that switch in the lower
1074	   pod (in [1, k/2]).  The core switches are given addresses of the form
1075	   10.0.j.i, where j and i denote that switch's coordinates in the
1076	   (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left).
1077	   The address of a host follows the pod switch to which it is connected
1078	   to; hosts have addresses of the form: 10.pod.switch.ID, where ID is
1079	   the host's position in that subnet (in [2, k/2+1], starting from left
1080	   to the right).

1082	10.1.  BRT Building Procedure

1084	   By leveraging the topology's regularity, every switch clearly knows
1085	   how it forwards a packet.  When a packet arrives at an edge switch,
1086	   if the destination of the packet lies in the same subnet with the
1087	   switch, then the switch directly forwards the packet to the
1088	   destination server through layer-2 switching.  Otherwise, the switch
1089	   forwards the packet to any of aggregation switches in the same pod.
1090	   When a packet arrives at an aggregation switch, if the destination of
1091	   the packet lies in the same pod, the switch forwards the packet to
1092	   the corresponding edge switch.  Otherwise, the switch forwards the
1093	   packet to any of core switches that it is connected to.  If a core
1094	   switch receives a packet, it forwards the packet to the corresponding
1095	   aggregation switch that lies in the destination pod.

1097	   The forwarding policy discussed above is easily expressed through a
1098	   BRT.  The BRT of an edge switch, such as 10.1.1.1, is composed of the
1099	   following entries:

1101	   Destination/Mask       Next hop
1102	   10.0.0.0/255.0.0.0     10.1.0.1
1103	   10.0.0.0/255.0.0.0     10.1.0.2

1105	   The BRT of an aggregation switch, such as 10.1.0.1, is composed of
1106	   the following entries:

1108	   Destination/Mask            Next hop
1109	   10.1.1.0/255 255.255.0      10.1.1.1
1110	   10.1.2.0/255.255.255.0      10.1.2.1
1111	   10.0.0.0/255.0.0.0          10.0.1.1
1112	   10.0.0.0/255.0.0.0          10.0.1.2

1114	   The BRT of acore switch, such as 10.0.1.1, is composed of the
1115	   following entries:

1117	   Destination/Mask            Next hop
1118	   10.1.0.0/255 255.0.0        10.1.0.1
1119	   10.2.0.0/255.255.0.0        10.2.0.1
1120	   10.3.0.0/255.255.0.0        10.3.0.1
1121	   10.4.0.0/255.255.0.0        10.4.0.1

1123	10.2.  NRT Building Procedure

1125	   The route entries in an NRT are related with link and node failures.
1126	   We summarize all types of cases into three (3) catalogs.

1128	10.2.1.  Single Link Failure

1130	   In Fat-tree, Links can be classified as 3 types by their locations:
1131	   1) servers to edge switches; 2) edge to aggregation switches; 3)
1132	   aggregation to core switches.  Link failures between servers to edge
1133	   switches only affect the communication of the corresponding servers
1134	   and don't affect the routing tables of any switch, so we only discuss
1135	   the second and third type of links failures.

1137	   Edge to Aggregation Switches

1139	   Suppose that the link between an edge switch, such as 10.1.2.1 (A),
1140	   and an aggregation switch, such as 10.1.0.1(B),fails.  This link
1141	   failure may affect 3 types of communications.

1143	   o Sources lie in the same subnet with A, and destinations do not.  In
1144	   this case, the link failure will only affect the routing tables of A.
1145	   As this link is attached to A directly, A only needs to delete the
1146	   route entries whose next hop is B in its BRT and add no entries to
1147	   its NRT when A's M6 module detect the link failure.

1149	   o Destinations lie in the same subnet with A, and sources lie in
1150	   another subnet of the same pod.  In this case, the link failure will
1151	   affect the routing tables of all the edge switches in the same pod
1152	   except for A.  When an edge switch, such as 10.1.1.1, learns the link
1153	   failure, it will add a route entry to its NRT:

1155	   Destination/Mask            Next hop
1156	   10.1.2.0/255.255.255.0      10.1.0.1

1158	   o Destinations lie in the same subnet with A, sources lie in another
1159	   pod.  In this case, the link failure will affect the routing tables
1160	   of all the edge switches in the other pods.  When an edge switch in
1161	   one other pod, such as 10.3.1.1, learns the link failure, because all
1162	   the routings that pass through 10.3.0.1 to A will certainly pass
1163	   through the link between A and B, 10.3.1.1 need add a route entry to
1164	   its NRT:

1166	   Destination/Mask            Next hop
1167	   10.1.2.0/255.255.255.0      10.3.0.1
1168	   Aggregation to Core Switches

1170	   Suppose that the link between an aggregation switch, such as 10.1.0.1
1171	   (A), and a core switch, such as 10.0.1.2(B), fails.  This link
1172	   failure may affect 2 types of communications.

1174	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1175	   the other pods.  In this case, the link failure will only affect the
1176	   routing tables of A.  As this link is attached to A directly, A only
1177	   need to delete the route entries whose next hop is B in its BRT and
1178	   add no entries to its NRT when A's M6 module detect the link failure.

1180	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1181	   another pod.  In this case, the link failure will affect the routing
1182	   tables of all the aggregation switches in other pods except for pod
1183	   1.  When an aggregation switch in one other pod, such as 10.3.0.1,
1184	   learns the link failure, because all the routings that pass through
1185	   10.0.1.2 to the pod 1 where A lies will certainly pass through the
1186	   link between A and B, 10.3.0.1 need add a route entry to its NRT:

1188	   Destination/Mask            Next hop
1189	   10.1.0.0/255.255.0.0        10.0.1.2

1191	10.2.2.  A Group of Link Failures

1193	   If all the uplinks of an aggregation switch fail, then this switch
1194	   cannot forward packets, which will affect the routing of every edge
1195	   switches.  Suppose that all the uplinks of the node A (10.1.0.1)
1196	   fail, it will affect two types of communications.

1198	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1199	   the other pods.  In this case, the link failures will affect the
1200	   routing of the edge switches in the Pod of A.  To avoid the node A,
1201	   each edge switch should remove the route entry "10.0.0.0/255.0.0.0
1202	   10.1.0.1" in which the next hop is the node A.

1204	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1205	   other pods.  In this case, the link failures will affect the routing
1206	   of edge switches in other pods.  For example, if the edge switch
1207	   10.3.1.1 communicates with some node in the pod of A, it should avoid
1208	   the node 10.3.0.1, because any communication through 10.3.0.1 to the
1209	   pod of A will pass through the node A.  So a route entry should be
1210	   added to 10.3.1.1:

1212	   Destination/Mask            Next hop
1213	   10.1.0.0/255.255.0.0        10.3.0.1

1215	10.2.3.  Node Failures

1217	   At last, we discuss the effect of node failures to a NRT.  There are
1218	   3 types of node failures: the failure of edge, aggregation and core
1219	   switches.

1221	   o An edge switch fails.  The failure doesn't affect the routing table
1222	   of any switch.

1224	   o A core switch fails.  Only when all the core switches connected to
1225	   the same aggregation switch fail, they will affect the routing of
1226	   other switches.  This case is equal to the case that all the uplinks
1227	   of an aggregation switch fail, so the process of link failures can
1228	   cover it.

1230	   o An aggregation switch fails.  This case is similar to the case that
1231	   all the uplinks of an aggregation switch fail.  It affects the
1232	   routing of edge switches in other pods, but doesn't affect the
1233	   routing of edge switches in pod of the failed switch.  The process of
1234	   this failure is same to the second case in section 6.2.2.

1236	10.3.  Routing Procedure

1238	   FAR decides a routing by looking up its BRT and NRT.  We illuminate
1239	   the routing procedure by an example.  In this example, we suppose
1240	   that the link between 10.3.1.1 and 10.3.0.2 and the link between
1241	   10.1.2.1 and 10.1.0.2 have failed.  Then we look into the routing
1242	   procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
1243	   (destination).

1245	   Step 1: The source 10.3.1.3 sends packets to its default router
1246	   10.3.1.1

1248	   Step 2: The routing of 10.3.1.1.

1250	   1) Calculate candidate hops

1252	   10.3.1.1 looks up its BRT and gets the following matched entries:

1254	   Destination/Mask            Next hop
1255	   10.0.0.0/255.0.0.0          10.3.0.1

1257	   So the candidate hops = {10.3.0.1}

1259	   2) Calculate avoiding hops

1261	   Its NRT is empty, so the set of avoiding hop is empty too.

1263	   3) Calculate applicable hops

1265	   The applicable hops are candidate hops minus avoiding hops, so:

1267	   The applicable hops = {10.3.0.1}

1269	   4) Forward packets to 10.3.0.1

1271	   Step 3: The routing of 10.3.0.1

1273	   1) Calculate candidate hops.

1275	   10.3. 0.1 looks up its BRT and gets the following matched entries:

1277	   Destination/Mask           Next hop
1278	   10.1.0.0/255.255.0.0       10.0.1.1
1279	   10.1.0.0/255.255.0.0       10.0.1.2

1281	   So the candidate hops = {10.0.1.1, 10.0.1.2}

1283	   2) Calculate avoiding hops

1285	   Destination/Mask           Next hop
1286	   10.1.0.0/255.255.0.0       10.0.1.2

1288	   So the avoiding hops = {10.0.1.2}

1290	   3) Calculate applicable hops

1292	   The applicable hops are candidate hops minus avoiding hops, so:

1294	   The applicable hops = {10.0.1.1}

1296	   4) Forward packets to 10.0.1.1

1298	   Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
1299	   routing tables.

1301	   Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
1302	   routing tables.

1304	   Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by
1305	   layer-2 switching.

1307	10.4.  FAR's Performance in Large-scale Networks

1309	   FAR has good performance to support large-scale networks.  In this
1310	   section, we take a Fat-tree network composed of 2,880 48-port
1311	   switches and 27,648 servers as an example to show FAR's performance.

1313	10.4.1.  The number of control messages required by FAR

1315	   FAR exchanges a few messages between routers and only consumes a
1316	   little network bandwidth.  Tab. 1 shows the required messages in the
1317	   example Fat-tree network.
1318	   Table 1:Required messages in a Fat-tree network.
1319	   _____________________________________________________________________
1320	   Message Type| Scope | size(bytes) | Rate | Bandwidth
1321	   ---------------------------------------------------------------------
1322	   Hello |adjacent switches|less than 48|10 messages/sec|less than 4
1323	   kbps
1324	   ---------------------------------------------------------------------
1325	   DLR |adjacent switches| less than 48 | (1) |48bytes
1326	   ---------------------------------------------------------------------
1327	   DA |entire network| less than 48 | (2) |1.106M
1328	   ---------------------------------------------------------------------
1329	   LFA |entire network| less than 48 | (3) |48 bytes
1330	   _____________________________________________________________________
1331	   (1)Produce one when a router starts
1332	   (2)The number of switches(2,880) in a period
1333	   (3)Produce one when a link fails or recovers

1335	10.4.2.  The Calculating Time of Routing Tables

1337	   A BRT is calculated according to the states of its neighbor routers
1338	   and attached links.  An NRT is calculated according to device and
1339	   link failures in the entire network.  So FAR does not calculate
1340	   network topology and has no problem of network convergence, which
1341	   greatly reduces the calculating time of routing tables.  The
1342	   detection and spread time of link failures is very short in FAR.
1343	   Detection time is up to the interval of sending Hello message.  In
1344	   FAR, the interval is set to 100ms, and a link failure will be
1345	   detected in 200ms.  The spread time between any pair of routers is
1346	   less than 200ms.If a link fails in a data center network, FAR can
1347	   detect it, spread it to all the routers, and calculate routing tables
1348	   in no more than 500ms.

1350	10.4.3.  The Size of Routing Tables

1352	   For the test Fat-tree network, the sizes of BRTs and NRTs are shown
1353	   in Tab. 2.

1355	   Table 2: The size of routing tables in FAR
1356	   _____________________________________________________________________
1357	   Routing Table| Core Switch | Aggregation Switch | Edge Switch |
1358	   ---------------------------------------------------------------------
1359	   BRT | 48 | 48 | 24
1360	   ---------------------------------------------------------------------
1361	   NRT | 0 | 14 | 333
1362	   _____________________________________________________________________

1364	   The BRT's size at a switch is determined by the number of its
1365	   neighbor switches.  In the example network, a core switch has 48
1366	   neighbor switches (aggregation switch), so it has 48 entries in its
1367	   BRT.Only aggregation and edge switches have NRTs.  The NRT size at a
1368	   switch is related to the number of link failures in the network.
1369	   Suppose that there are 1000 link failures in the example network, the
1370	   number of failed links is 1.2% of total links, which is a very high
1371	   failure ratio.  We suppose that link failures are uniformly
1372	   distributed in the entire network.  The NRT size at an edge switch is
1373	   about 333 and the NRT size of an aggregation switch is about 14in
1374	   average.

1376	11.  Security Considerations

1378	   The security considerations will be discussed in a future version of
1379	   this document.

1381	12.  Conclusions

1383	   This draft introduces FAR protocol, a generic routing method and
1384	   protocol, for data centers that have a regular topology.  It uses two
1385	   routing tables, a BRT and an NRT, to store the normal routing paths
1386	   and the forbidden (to-be-avoided) routing paths, respectively.  This
1387	   makes the FAR protocol very simple and efficient.  The sizes of these
1388	   two tables are very small.  Usually, a BRT has only several tens of
1389	   entries and an NRT has only several or about a dozen entries.

1391	13.  Acknowledgments

1393	   This document is supported by ZTE Enterprise-University-Research
1394	   Joint Project.

1396	14.  References

1398	   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1399	   Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March
1400	   1997.

1402	   [RFC2328] J.  Moy, "OSPF Version 2", BCP 14, RFC2328, April 1998.

1404	   [FAT-TREE] M.  Al-Fares, A.  Loukissas, and A.  Vahdat."A
1405	   Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM
1406	   2008.

1408	Authors' Addresses

1410	   Bin Liu
1411	   ZTE Inc., ZTE Plaza
1412	   No.19 East Huayuan Road,Hai Dian District
1413	   Beijing  100191
1414	   China

1416	   Phone: +86 -010-59932039
1417	   Email: 13683610386@139.com

1419	   Yantao Sun
1420	   Beijing Jiaotong University
1421	   No.3 Shang Yuan Cun, Hai Dian District
1422	   Beijing  100044
1423	   China

1425	   Email: ytsun@bjtu.edu.cn

1427	   Jing Cheng
1428	   Beijing Jiaotong University
1429	   No.3 Shang Yuan Cun, Hai Dian District
1430	   Beijing  100044
1431	   China

1433	   Email: yourney.j@gmail.com

1435	   Yichen Zhang
1436	   Beijing Jiaotong University
1437	   No.3 Shang Yuan Cun, Hai Dian District
1438	   Beijing  100044
1439	   China

1441	   Email: snowfall_dan@sina.com
1442	   Bhumip Khasnabish
1443	   ZTE TX Inc.
1444	   55 Madison Avenue, Suite 160
1445	   Morristown, New Jersey    07960
1446	   USA

1448	   Phone: +001-781-752-8003
1449	   Email: vumip1@gmail.com
1450	   URI:   http://tinyurl.com/bhumip/