idnits 2.17.1 

draft-sl-rtgwg-far-dcn-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  == There are 54 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet has text resembling
     RFC 2119 boilerplate text.

  -- The document date (May 23, 2019) is 1793 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'FAT-TREE' is defined on line 1377, but no explicit
     reference was found in the text


     Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Routing Area Working Group                                       Bin Liu
3	Internet-Draft                                       ZTE Inc., ZTE Plaza
4	Intended status: Informational                                Yantao Sun
5	Expires: November 24, 2019                                    Jing Cheng
6	                                                            Yichen Zhang
7	                                             Beijing Jiaotong University
8	                                                       Bhumip Khasnabish
9	                                                             ZTE TX Inc.
10	                                                            May 23, 2019

12	   Generic Fault-avoidance Routing Protocol for Data Center Networks
13	                       draft-sl-rtgwg-far-dcn-12

15	Abstract

17	   This draft proposes a generic routing method and protocol for a
18	   regular data center network, named as fault-avoidance routing (FAR)
19	   protocol.  FAR protocol provides a generic routing method for all
20	   types of regular topology network architectures that are proposed for
21	   large-scale cloud-based data centers over the past few years.  FAR
22	   protocol is well designed to fully leverage the regularity in the
23	   topology and compute its routing table in a simplistic manner.  Fat-
24	   tree is taken as an example architecture to illustrate how FAR
25	   protocol can be applied in real operational scenarios.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at https://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on November 24, 2019.

44	Copyright Notice

46	   Copyright (c) 2019 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (https://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
62	     1.1.  Acronyms & Definitions  . . . . . . . . . . . . . . . . .   4
63	   2.  Conventions used in this document . . . . . . . . . . . . . .   5
64	   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   5
65	     3.1.  The Impact of Large-scale Networks on Route Calculation .   5
66	     3.2.  Dilemma of conventional routing methods in a large-scale
67	           network with giant number nodes of routers  . . . . . . .   6
68	     3.3.  Network Addressing Issues . . . . . . . . . . . . . . . .   8
69	     3.4.  Big Routing Table Issues  . . . . . . . . . . . . . . . .   8
70	     3.5.  Adaptivity Issues for Routing Algorithms  . . . . . . . .   9
71	     3.6.  Virtual Machine Migration Issues  . . . . . . . . . . . .   9
72	   4.  The FAR Framework . . . . . . . . . . . . . . . . . . . . . .   9
73	   5.  Data Format . . . . . . . . . . . . . . . . . . . . . . . . .  10
74	     5.1.  Data Tables . . . . . . . . . . . . . . . . . . . . . . .  10
75	     5.2.  Messages  . . . . . . . . . . . . . . . . . . . . . . . .  13
76	   6.  FAR Modules . . . . . . . . . . . . . . . . . . . . . . . . .  17
77	     6.1.  Neighbor and Link Detection Module(M1)  . . . . . . . . .  17
78	     6.2.  Device Learning Module(M2)  . . . . . . . . . . . . . . .  17
79	     6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)   18
80	     6.4.  Link Failure Learning Module(M4)  . . . . . . . . . . . .  18
81	     6.5.  BRT Building Module(M5) . . . . . . . . . . . . . . . . .  18
82	     6.6.  NRT Building Module(M6) . . . . . . . . . . . . . . . . .  19
83	     6.7.  Routing Table Lookup(M7)  . . . . . . . . . . . . . . . .  19
84	   7.  How a FAR Router Works  . . . . . . . . . . . . . . . . . . .  19
85	   8.  Compatible Architecture . . . . . . . . . . . . . . . . . . .  22
86	   9.  Application Example . . . . . . . . . . . . . . . . . . . . .  22
87	     9.1.  BRT Building Procedure  . . . . . . . . . . . . . . . . .  24
88	     9.2.  NRT Building Procedure  . . . . . . . . . . . . . . . . .  25
89	       9.2.1.  Single Link Failure . . . . . . . . . . . . . . . . .  25
90	       9.2.2.  A Group of Link Failures  . . . . . . . . . . . . . .  26
91	       9.2.3.  Node Failures . . . . . . . . . . . . . . . . . . . .  27
92	     9.3.  Routing Procedure . . . . . . . . . . . . . . . . . . . .  27
93	     9.4.  FAR's Performance in Large-scale Networks . . . . . . . .  29
94	       9.4.1.  The number of control messages required by FAR  . . .  29
95	       9.4.2.  The Calculating Time of Routing Tables  . . . . . . .  29
96	       9.4.3.  The Size of Routing Tables  . . . . . . . . . . . . .  30

98	   10. Security Considerations . . . . . . . . . . . . . . . . . . .  30
99	   11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  30
100	   12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  31
101	   13. References  . . . . . . . . . . . . . . . . . . . . . . . . .  31
102	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  31

104	1.  Introduction

106	   In recent years, with the rapid development of cloud computing
107	   technologies, the widely deployed cloud services, such as Amazon EC2
108	   and Google search, bring about huge challenges to data center
109	   networking (DCN).  Today's cloud-based data centers (DCs) require
110	   large-scale networks with larger internal bandwidth and smaller
111	   transfer delay.  However, conventional networks cannot meet such
112	   requirements due to limitations in their network architecture.  In
113	   order to satisfy the requirements of cloud computing services, many
114	   new network architectures have been proposed for data centers, such
115	   as Fat-tree, MatrixDCN, and BCube.  These new architectures can
116	   support non-blocking large-scale datacenter networks with more than
117	   tens of thousands of physical servers.

119	   All of these architectures have regular topologies, which are common
120	   features.  The regular topology refers to the network topology
121	   structure with obvious regularity and symmetry, which is conducive to
122	   automatic configuration of the network, such as the Fattree network.
123	   In a regular topology, each network node such as a switch or router
124	   can be addressed by its location and through a node's address, the
125	   node's connections to its neighbors in a network can be determined,
126	   and furthermore, the route to the node from other nodes in the
127	   network can be determined.  So nodes can compute route entries
128	   without learning topology.

130	   This draft proposes a generic routing method and protocol, fault-
131	   avoidance routing (FAR) protocol, for DCNs.  This method leverages
132	   the regularity in the topologies of data center networks to simplify
133	   routing learning and accelerate the query of routing tables.  This
134	   routing method has a better fault tolerance and can be applied to any
135	   DCN with a regular topology.

137	   FAR is not a routing protocol to replace generic routing protocols
138	   such as OSPF and IS-IS.  It cannot be used in general local networks
139	   whose topological structures are arbitrary, and whose scales are also
140	   not very large.  OSPF works very well in such a network.  But in a
141	   large-scale network with regular topology, FAR has a better
142	   performance.  Compared with OSPF and IS-IS, FAR has shorter time of
143	   network convergence and lower PDU overhead.  Furthermore, FAR
144	   requires less computing and storage resources, which let FAR routers
145	   to run at a lower cost of production than the generic routers.

147	   In addition, for each type of network architecture, researchers
148	   designed a routing algorithm according to the features of its
149	   topology.  Because these routing algorithms are different and lack
150	   compatibility with each other, it is very difficult to develop a
151	   routing protocol for network routers supporting multiple routing
152	   algorithms.  FAR has better adaptability than these specified routing
153	   methods.

155	   FAR consists of three components, i.e., link state learning unit,
156	   routing table building unit and routing table querying unit.  In the
157	   link state learning unit, FAR exchanges link failures among routers
158	   to establish a consistent knowledge of the entire network.  In this
159	   stage, the regularity in topology is exploited to infer failed links
160	   and routers.  In the routing table building unit, FAR builds up two
161	   routing tables, i.e., a basic routing table (BRT) and a negative
162	   routing table (NRT), for each router according to the network
163	   topology and link states.  In the last component, routers forward
164	   incoming packets by looking up the two routing tables.  The matched
165	   entries in BRT minus the matched entries in NRT are the final route
166	   entries to be used to forward an incoming packet.

168	   The remainder of this draft is organized as follows.  The problem to
169	   be addressed by FAR is described in Section 3.  The framework of FAR
170	   routing protocol is described in Section 4.  Section 5 and 6
171	   introduce FAR's data format FAR and modules in detail.  Section 7
172	   describe how FAR works by finite state machine (FSM).  In Section 8,
173	   we discussed how FAR works with variable network architectures.
174	   Section 9 takes Fat-tree network as an example to illuminate how FAR
175	   works.

177	1.1.  Acronyms & Definitions

179	   DCN - Data Center Network

181	   FAR - Fault-Avoidance Routing

183	   BRT - Basic Routing Table

185	   NRT - Negative Routing Table

187	   NDT - Neighbor Devices Table

189	   ADT - All Devices Table

191	   LFT - Link Failure Table

193	   DA - Device Announcement
194	   LFA - Link Failure Announcement

196	   DLR - Device and Link Request

198	   IP - Internet Protocol

200	   UDP - User Datagram Protocol

202	   VM - Virtual Machine

204	2.  Conventions used in this document

206	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
207	   NOT","SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
208	   this document are to be interpreted as described in RFC-2119
209	   [RFC2119].  In this document, these words will appear with that
210	   interpretation only when in ALL CAPS.  Lower case uses of these words
211	   are not to be interpreted as carrying RFC-2119 significance.

213	3.  Problem Statement

215	   The problem to be addressed by FAR as proposed in this draft is
216	   described in this section.  The expansion of Cloud data center
217	   networks has brought significant challenges to the existing routing
218	   technologies.  FAR mainly solves a series of routing problems faced
219	   by large-scale data center networks.

221	3.1.  The Impact of Large-scale Networks on Route Calculation

223	   In a large-scale cloud data center network, there may be thousands of
224	   routers.  Running OSPF and other routing protocols in such network
225	   will encounter these two challenges: a) Network convergence time
226	   would be too long, which will cause a longer time to elapse for
227	   creating and updating the routes.  The response time to network
228	   failures may be excessively long; b) a large number of routing
229	   protocol packets need to be sent.  The routing information consumes
230	   too much network bandwidth and CPU resources, which easily leads to
231	   packet loss and makes the problem (a) more prominent.

233	   In order to solve these problems, a common practice is to partition a
234	   large network into some small areas, where the route calculation runs
235	   independently within different areas.  However, nowadays the cloud
236	   data centers typically require very large internal bandwidth.  To
237	   meet this requirement, a large number of parallel equivalent links
238	   are deployed in a network, such as a Fat-tree network.  Partitioning
239	   such a network will affect the utilization of routing algorithm on
240	   equivalent multi-path and reduce internal network bandwidth
241	   requirements.

243	   In the FAR routing calculation process, a Basic Routing Table (BRT)
244	   is built on local network topology leveraging the regularity of the
245	   network topologies.  In addition to BRT, FAR also builds a Negative
246	   Routing Table (NRT).  FAR gradually builds NRT in the process of
247	   learning network link failure information, which does not require
248	   learning a complete network fault information.  FAR does not need to
249	   wait for the completion of the network convergence in the process of
250	   building these two tables.  Therefore, it avoids the problem of
251	   excessive network convergence overheads in the route calculation
252	   process.  In addition, FAR only needs to exchange a small amount of
253	   link change information between routers, and hence consumes less
254	   network bandwidth.

256	3.2.  Dilemma of conventional routing methods in a large-scale network
257	      with giant number nodes of routers

259	   There are many real world scenario where tens of thousands of
260	   nodes(or much more nodes) need to be deployed in a flat area, such as
261	   infiniband routing and switching system, high-performance computer
262	   network, and many IDC networks in China.  The similar problems have
263	   been existed long ago.  People have solved the problems through
264	   similar solutions, such as the traditional regular topology-based
265	   RFC3619 protocol, the routing protocols of infiniband routing and
266	   switching system, and high-performance computer network routing
267	   protocol.
268	   Infiniband defines a switch-based network to interconnect processing
269	   nodes and the I/O nodes.  Infiniband can support very large scale
270	   networks, use the regularity in topology to simplify its routing
271	   algorithm, which is just the same to what we do in FAR.

273	   Why OSPF and other conventional routing methods do not work well in a
274	   large-scale network with giant number nodes of routers?

276	   As everyone knows, the OSPF protocol uses multiple databases, more
277	   topological exchange information (as seen in the following example)
278	   and complicated algorithm.  It requires routers to consume more
279	   memory and CPU processing capability.  But the processing rate of CPU
280	   on the protocol message per second is very limited.  When the network
281	   expands, CPU will quickly approach its processing limits, and at this
282	   time OSPF can not continue to expand the scale of the management.
283	   The SPF algorithm itself does not thoroughly solve these problems.

285	   On the contrary, FAR does not have the convergence time delay and the
286	   additional CPU overheads, which SPF requires.  Because in the initial
287	   stage, FAR already knows the regular information of the whole network
288	   topology and does not need to periodically do SPF operation.

290	   One of the examples of "more topological exchange information":

292	   In the OSPF protocol, LSA floods every 1800 seconds.  Especially in
293	   the larger network, the occupation of CPU and band bandwidth will
294	   soon reach the router's performance bottleneck.  In order to reduce
295	   these adverse effects, OSPF introduced the concept of Area, which
296	   still has not solved the problem thoroughly.  By dividing the OSPF
297	   Area into several areas, the routers in the same area do not need to
298	   know the topological details outside their area.  (In comparison with
299	   FAR, after OSPF introducing the concept of Area, the equivalent paths
300	   cannot be selected in the whole network scope)

302	   OSPF can achieve the following results by Area : 1) Routers only need
303	   to maintain the same link state databases as other routers within the
304	   same Area, without the necessity of maintaining the same link state
305	   database as all routers in the whole OSPF domain.  2) The reduction
306	   of the link state databases means dealing with relatively fewer LSA,
307	   which reduces the CPU consumption of routers; 3) The large number of
308	   LSAs flood only within the same Area.  But, its negative effect is
309	   that the smaller number of routers which can be managed in each OSPF
310	   area.  On the contrary, because FAR does not have the above
311	   disadvantages, FAR can also manage large-scale network even without
312	   dividing Areas.

314	   The aging time of OSPF is set in order to adapt to routing
315	   transformation and protocol message exchange happened frequently in
316	   the irregular topology.  Its negative effect is: when the network
317	   does not change, the LSA needs to be refreshed every 1800 seconds to
318	   reset the aging time.  In the regular topology, as the routings are
319	   fixed, it does not need the complex protocol message exchange and
320	   aging rules to reflect the routing changes, as long as LFA mechanism
321	   in the FAR is enough.

323	   Compared with the LSVR protocol, the LSVR protocol has no special
324	   requirements for the network topology structure, however, the FAR
325	   draft is applicable to the regular topology network architecture and
326	   simplifies unnecessary processing.  It is a solution proposed to
327	   greatly improve the routing efficiency of the regular network
328	   topology.  The FAR solution is more efficient than the general
329	   methods such as LSVR in regular topology.

331	   Therefore, in FAR, we can omit many unnecessary processing and the
332	   packet exchange.  The benefits are fast convergence speed and much
333	   larger network scale than other dynamic routing protocol.  Now there
334	   are some successful implementations of simplified routings in the
335	   regular topology in the HPC environment.  Conclusion: As FAR needs
336	   few routing entries and the topology is regular, the database does
337	   not need to be updated regularly.  Without the need for aging, there
338	   is no need for CPU and bandwidth overhead brought by LSA flood every
339	   30 minutes, so the expansion of the network has no obvious effect on
340	   the performance of FAR, which is contrary to OSPF.

342	   Comparison of convergence time: The settings of OSPF spf_delay and
343	   spf_hold_time can affect the change of convergence time.  The
344	   convergence time of the network with 2480 nodes is about 15-20
345	   seconds; while the FAR does not need to calculate the SFP, so there
346	   is no such convergence time.
347	   These issues still exist in rapid convergence technology of OSPF
348	   ,ISIS (such as I-SPF) and LSVR.  The convergence speed and network
349	   scale constraint each other.  FAR does not have the above problems,
350	   and the convergence time is almost negligible.
351	   Can FRR solve these problems?  IP FRR has some limitations.  The
352	   establishment of IP FRR backup scheme will not affect the original
353	   topology and traffic forwarding which are established by protocol,
354	   however, we can not get the information of whereabouts and status
355	   when the traffic is switched to an alternate next hop.

357	3.3.  Network Addressing Issues

359	   Routers are typically configured with multiple network interfaces,
360	   each connected to a subnet.  OSPF and other routing algorithms
361	   require that each interface of a router must be configured with an IP
362	   address.  A large-scale data center network may contain thousands of
363	   routers and each router has dozens of network interfaces, thus, there
364	   are tens of thousands of IP addresses needed to be configured in a
365	   data center.  It will be very complex to configure and manage a large
366	   number of network interfaces and will be difficult to troubleshoot
367	   network problems, then network maintenance will be costly and error-
368	   prone.

370	   In FAR, the device position information is encoded in the IP address
371	   of the router.  Each router only needs to be assigned a unique IP
372	   address according its location, which greatly solves complex network
373	   addressing issues in large-scale networks.

375	3.4.  Big Routing Table Issues

377	   There are a large number of subnets in the large-scale data center
378	   network.  A router may build a routing entry for each subnet, and
379	   therefore the size of routing tables on each router may be very
380	   large.  It will increase a router's cost and reduce the querying
381	   speed of the routing table.

383	   FAR uses two measures to reduce the size of its routing tables: a)It
384	   builds a BRT on the regularity of the network topologies; b)It
385	   introduces a new routing table, i.e., a NRT.  In this way FAR can
386	   reduce the size of routing tables to only a few dozen routing
387	   entries.

389	3.5.  Adaptivity Issues for Routing Algorithms

391	   To implement efficient routing in large-scale datacenters, besides
392	   FAR, some other routing methods are proposed for some specific
393	   network architectures, such as Fat-tree and BCube.  These routing
394	   methods are different(from both design and implementation viewpoints)
395	   and not compatible with the conventional routing methods, which
396	   brings big troubles to network equipment providers to develop new
397	   routers supporting various new routing methods.

399	   FAR is a generic routing method.  With slight modification, FAR
400	   method can be applied to most of regular datacenter networks.
401	   Furthermore, the structure of routing tables and querying a routing
402	   table in FAR are the same as conventional routing method.  If FAR is
403	   adopted, the workload of developing a new type of router will be
404	   significantly decreased.

406	3.6.  Virtual Machine Migration Issues

408	   Supporting VM migration is very important for cloud-based datacenter
409	   networks.  However, in order to support layer-3 routing, routing
410	   methods including OSPF and FAR require limiting VM migration within a
411	   subnet.  For this paradox, the mainstream methods still utilize
412	   layer-3 routing on routers or switches, transmit packets encapsulated
413	   by IPinIP or MACinIP between hosts by tunnels passing through network
414	   to the destination access switch, and then extract original packet
415	   out and send it to the destination host.

417	   By utilizing the aforementioned methods, FAR can be applied to Fat-
418	   tree, MatrixDCN or BCube networks for supporting VM migration in
419	   entire network.

421	4.  The FAR Framework

423	   FAR requires that a DCN has a regular topology, and network devices,
424	   including routers, switches, and servers, are assigned IP addresses
425	   according to their locations in the network.  In other word, we can
426	   locate a device in the network according to its IP address.

428	   FAR is a distributed routing method.  In order to support FAR, each
429	   router needs to have a routing module that implements the FAR
430	   algorithm.  FAR algorithm is composed of three parts, i.e., link-
431	   state learning, routing table building and routing table querying, as
432	   shown in Fig. 1.

434	               Link-state Learning    |Routing Table  | Routing Table
435	                                      |  Building
436	                                      |               |   Querying
437	    +--------+   /---------------\    | +--------+    |    Packets
438	    |2 Device|<--| 1 Neighbor &  |----->| 5 BRT  \    |
439	    |Learning|   | Link Detection|    | |Building|\   |       |
440	    +--------+   \---------------/    | +--------+ \  |      \|/
441	             |           |            |             \ |+--------------+
442	             |           |            |              /||  7 Querying  |
443	            \|/         \|/           |             / || Routing Table|
444	             +-----------------------+|            /  |+--------------+
445	             |3 Invisible Neighbor & ||           /   |
446	             |Link Failure Inferring || +---------    |
447	             +-----------------------+|/| 6 NRT  |    |
448	                         |            / |Building|    |
449	                        \|/          /| +--------+    |
450	                  +--------------+  / |               |
451	                  |4 Link Failure| /  |               |
452	                  |  Learning    |    |               |
453	                  +--------------+    |               |
454	                                      |               |

456	                        Figure 1: The FAR framework

458	   Link-state learning is responsible for a router to detect the states
459	   of its connected links and learn the states of all the other links in
460	   the entire network.  The second part builds two routing tables, a
461	   basic routing table (BRT) and an negative routing table (NRT),
462	   according to the learned link states in the first part.  The third
463	   part queries the BRT and the NRT to decide a next forwarding hop for
464	   the received (ingress) packets.

466	5.  Data Format

468	5.1.  Data Tables

470	   Some data tables are maintained on each router in FAR.  They are:

472	   Neighbor Device Table (NDT): To store neighbor routers and related
473	   links.

475	   All Devices Table (ADT): To store all routers in the entire network.

477	   Link Failures Table (LFT): To store all link failures in the entire
478	   network.

480	   Basic Routing Table (BRT): To store the candidate routes.

482	   Negative Routing Table(NRT): To store the avoiding routes.

484	   The format of NDT

486	    ----------------------------------------------------------
487	    Device ID | Device IP | Port ID | Link State | Update Time
488	    ----------------------------------------------------------

490	   Device ID: The ID of a neighbor router.

492	   Device IP: The IP address of a neighbor router.

494	   Port ID: The port ID that a neighbor router is attached to.

496	   Link State: The state of the link between a router and its neighbor
497	   router.  There are two states: Up and Down.

499	   Update Time: The time of updating the entry.

501	   The format of ADT

503	    --------------------------------------------------
504	    Device ID | Device IP | Type | State | Update Time
505	    --------------------------------------------------

507	   Device ID: The ID of a neighbor router.

509	   Device IP: The IP address of a neighbor router.

511	   Type: The type of a neighbor router.

513	   State: The state of a neighbor router.  There are two states: Up and
514	   Down.

516	   Update Time: The time of updating the entry.

518	   The format of LFT

520	    --------------------------------------------
521	    No | Router 1 IP | Router 2 IP | Timestamp
522	    --------------------------------------------

524	   No: The entry number.

526	   Router 1 IP: The IP address of one router that a failed link connects
527	   to.

529	   Router 2 IP: The IP address of another router that a failed link
530	   connects to.

532	   Timestamp: It identifies when the entry is created.

534	   The format of BRT

536	    -------------------------------------------------------
537	    Destination | Mask | Next Hop | Interface | Update Time
538	    -------------------------------------------------------

540	   Destination: A destination network

542	   Mask: The subnet mask of a destination network.

544	   Next Hop: The IP address of a next hop for a destination.

546	   Interface: The interface related to a next hop.

548	   Update Time: The time of updating the entry.

550	   The format of NRT

552	    -------------------------------------------------------------------
553	    Destination| Mask| Next Hop| Interface| Failed Link No| Timestamp
554	    -------------------------------------------------------------------

556	   Destination: A destination network.

558	   Mask: The subnet mask of a destination network.

560	   Next Hop: The IP address of a next hop that should be avoided for a
561	   destination.

563	   Interface: The interface related to a next hop that should be
564	   avoided.

566	   Failed Link No: A group of failed link numbers divided by "/", for
567	   example 1/2/3.

569	   Timestamp: The time of updating the entry.

571	5.2.  Messages

573	   Some protocol messages are exchanged between routers in FAR.

575	   Hello Message: This message is exchanged between neighbor routers to
576	   learn adjacency.

578	   Device Announcement (DA): Synchronize the knowledge of routers
579	   between routers.

581	   Link Failure Announcement (LFA): Synchronize link failures between
582	   routers.

584	   Device and Link Request (DLR): When a router starts, it requests the
585	   knowledge of routers and links from its neighbors by a DLR message.

587	   A FAR Message is directly encapsulated in an IP packet.  The protocol
588	   field of IP header indicates an IP packet is an FAR message.  The
589	   protocol of IP for FAR should be assigned by IANA.

591	   The four types of FAR messages have same format of packet header,
592	   called FAR header (as shown in Figure 2).

594	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
595	            +-----------+------------+------------------------+
596	            |  Version  |Message Type|    Message Length      |
597	            +-----------+------------+------------------------+
598	            |        Checksum        |       AuType           |
599	            +------------------------+------------------------+
600	            |                  Authentication                 |
601	            +-------------------------------------------------+
602	            |                  Authentication                 |
603	            +-------------------------------------------------+
604	            |                   Timestamp                     |
605	            +-------------------------------------------------+

607	                    Figure 2: The format of FAR header

609	   Version: FAR version

611	   Message Type: The type of FAR message.

613	   Packet Length: The packet length of the total FAR message.

615	   Checksum: The checksum of an entire FAR message.

617	   AuType:Authentication type. 0: no authentication, 1: Plaintext
618	   Authentication, 2: MD5 Authentication.

620	   Authentication: Authentication information. 0: undefined, 1: Key, 2:
621	   key ID, MD5 data length and packet number.  MD5 data is appended to
622	   the backend of the packet.

624	   AuType and Authentication can refer to the definition of OSPF packet.

626	            |<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
627	            +-----------+------------+------------------------+
628	            |  Version  |Message Type|    Message Length      |
629	            +-----------+------------+------------------------+
630	            |        Checksum        |       AuType           |
631	            +------------------------+------------------------+
632	            |                  Authentication                 |
633	            +-------------------------------------------------+
634	            |                  Authentication                 |
635	            +-------------------------------------------------+
636	            |                   Timestamp                     |
637	            +-------------------------------------------------+
638	            |                    Router IP                    |
639	            +------------------------+------------------------+
640	            |     HelloInterval      |     HelloDeadInterval  |
641	            +------------------------+------------------------+
642	            |                Neighbor Router IP               |
643	            +-------------------------------------------------+
644	            |                       ...                       |
645	            +-------------------------------------------------+

647	                  Figure 3: The Format of Hello Messages

649	   For Hello messages, the Message Type in FAR header is set to
650	   1.Besides FAR header, a Hello message(Fig. 3) requires the following
651	   fields:

653	   Router IP: The router IP address.

655	   HelloInterval: The interval of sending Hello messages to neighbor
656	   routers.

658	   RouterDeadInterval: The interval to set a neighbor router dead(out-
659	   of-service).  If in the interval time, a router doesn't receive a
660	   Hello message from its neighbor router, the neighbor router is
661	   treated as dead.

663	   Neighbor Router IP: The IP address of a neighbor router.  All the
664	   neighbor router's addresses should be included in a Hello message.

666	            |<----1---->| <----1---->|<----------2----------->|
667	            +-----------+------------+------------------------+
668	            |  Version  |Message Type|    Message Length      |
669	            +-----------+------------+------------------------+
670	            |        Checksum        |       AuType           |
671	            +------------------------+------------------------+
672	            |                  Authentication                 |
673	            +-------------------------------------------------+
674	            |                  Authentication                 |
675	            +-------------------------------------------------+
676	            |                   Timestamp                     |
677	            +------------------------+------------------------+
678	            |                   Router1 IP                    |
679	            +-------------------------------------------------+
680	            |                       ...                       |
681	            +-------------------------------------------------+

683	                    Figure 4: The Format of DA Messages

685	   For DA messages(Fig. 4), the Message Type in FAR header is set to 2.
686	   Besides FAR header, a DA message includes IP addresses of all the
687	   announced routers.

689	            |<----1---->| <----1---->|<----------2----------->|
690	            +-----------+------------+------------------------+
691	            |  Version  |Message Type|    Message Length      |
692	            +-----------+------------+------------------------+
693	            |        Checksum        |       AuType           |
694	            +------------------------+------------------------+
695	            |                  Authentication                 |
696	            +-------------------------------------------------+
697	            |                  Authentication                 |
698	            +-------------------------------------------------+
699	            |                   Timestamp                     |
700	            +------------------------+------------------------+
701	            |                     Left IP                     |
702	            +-------------------------------------------------+
703	            |                    Right IP                     |
704	            +------------------------+------------------------+
705	            |                      State                      |
706	            +-------------------------------------------------+
707	            |                       ...                       |
708	            +-------------------------------------------------+

710	                   Figure 5: The Format of LFA Messages

712	   For LFA messages(Fig. 5), the Message Type in FAR header is set to 3.
713	   Besides FAR header, a LFA message includes all the announced link
714	   failures.

716	   Left IP: The IP address of the left endpoint router of a link.

718	   Right IP: The IP address of the right endpoint router of a link.

720	   State: Link state. 0: Up, 1: down
721	            |<----1---->| <----1---->|<----------2----------->|
722	            +-----------+------------+------------------------+
723	            |  Version  |Message Type|    Message Length      |
724	            +-----------+------------+------------------------+
725	            |        Checksum        |       AuType           |
726	            +------------------------+------------------------+
727	            |                  Authentication                 |
728	            +-------------------------------------------------+
729	            |                  Authentication                 |
730	            +-------------------------------------------------+
731	            |                   Timestamp                     |
732	            +-------------------------------------------------+

734	                   Figure 6: The Format of DLR Messages

736	   For DLR messages(Fig. 6), the Message Type in FAR header is set to
737	   1.Except for FAR header, DLR has no additional fields.

739	6.  FAR Modules

741	6.1.  Neighbor and Link Detection Module(M1)

743	   M1 is responsible for sending and receiving Hello messages, and
744	   detecting directly-connected links and neighbor routers.  Each Hello
745	   message is encapsulated in a UDP packet.  M1 sends Hello messages
746	   periodically to all the active router ports and receives Hello
747	   messages from its neighbor routers.  M1 detects neighbor routers and
748	   directly-connected links according to received Hello Messages and
749	   stores these neighbors and links into a Neighbor Devices Table (NDT).
750	   Additionally, M1 also stores neighbor routers into an All Devices
751	   Table (ADT).

753	6.2.  Device Learning Module(M2)

755	   M2 is responsible for sending, receiving, and forwarding device
756	   announcement (DA) messages, learning all the routers in the whole
757	   network, and deducing faulted routers.  When a router starts, it
758	   sends a DA message announcing itself to its neighbors and a DLR
759	   message requesting the knowledge of routers and links from its
760	   neighbors.  If M2 module of a router receives a DA message, it checks
761	   whether the router encapsulated in the message is in an ADT.  If the
762	   router is not in the ADT, M2 puts this router into the ADT and
763	   forwards this DA message to all the active ports except for the
764	   incoming one, otherwise, M2 discards this message directly.  If M2
765	   module of a router receives a DLR message, it replies a DA message
766	   that encapsulates all of the learned routers.

768	6.3.  Invisible Neighbor and Link Failure Inferring Module(M3)

770	   M3 is responsible for inferring invisible neighbors of the current
771	   router by means of the ADT.  If the link between a router A and its
772	   neighbor B breaks, which results in that M1 module of A cannot detect
773	   the existence of B, then B is an invisible neighbor of A.  Since a
774	   device's location is coded into its IP address, it can be judged
775	   whether two routers are adjacent, according to their IP addresses.
776	   Based on this idea, M3 infers all of the invisible neighbors of the
777	   current router and the related link failures.  The results are stored
778	   into a NDT.  Moreover, link failures also are added into a link-
779	   failure table (LFT).  LFT stores all of the failed links in the
780	   entire network.

782	6.4.  Link Failure Learning Module(M4)

784	   M4 is responsible for sending, receiving and forwarding link failure
785	   announcement (LFA) and learning all the link failures in the whole
786	   network.  M4 broadcasts each newly inferred link failure to all the
787	   routers in the network.  Each link failure is encapsulated in a LFA
788	   message and one link failure is broadcasted only once.  If a router
789	   receives a DLR request from its neighbor, it will reply a LFA message
790	   that encapsulates all the learned link failures through M4 module.
791	   If M4 receives a LFA message, it checks whether the link failure
792	   encapsulated in the message is in a LFT by comparing two link ends
793	   and timestamp.  If the link failure is not in the LFT or timestamp is
794	   different, M4 puts this link failure into the LFT (or update
795	   timestamp only) and forwards this LFA message to all the active ports
796	   except for the incoming one, otherwise, M4 discards this message
797	   directly.

799	   There is a special case a router will rebroadcast a link failure.  If
800	   a router receives a data packet and must forward the packet going
801	   ahead to destination through a failed link, it means some previous
802	   router should avoid this failed link according to its NRT but it
803	   doesn't.  In this case, maybe the previous router missed the LFA
804	   message of the link failure due to some uncertain reasons.  So the
805	   forwarding router rebroadcasts the LFA message.

807	6.5.  BRT Building Module(M5)

809	   M5 is responsible for building a BRT for the current router.  By
810	   leveraging the regularity in topology, M5 can calculate the routing
811	   paths for any destination without the knowledge of the topology of
812	   whole network, and then build the BRT based on a NDT.  Since the IP
813	   addresses of network devices are continuous, M5 only creates one
814	   route entry for a group of destination addresses that have the same
815	   network prefix by means of route aggregation technology.  Usually,
816	   the size of a BRT is very small.  The detail of how to build a BRT is
817	   described in section 5.

819	6.6.  NRT Building Module(M6)

821	   M6 is responsible for building a NRT for the current router.  Because
822	   M5 builds a BRT without considering link failures in network, the
823	   routing paths calculated by the BRT cannot avoid failed links.  To
824	   solve this problem, a NRT is used to exclude the routing paths that
825	   include some failed links from the paths calculated by a BRT.  M6
826	   calculate the routing paths that include failed links and stored them
827	   into the NRT.  The details of how to build a NRT is described in
828	   section 5.

830	6.7.  Routing Table Lookup(M7)

832	   M7 is responsible for querying routing tables and selecting the next
833	   hop for forwarding the packets.  Firstly, M7 takes the destination
834	   address of a forwarding packet as a criterion to look up route
835	   entries in a BRT based on longest prefix match.  All of the matched
836	   entries are composed of a candidate hops list.  Secondly, M7 look up
837	   negative route entries in a NRT taking the destination address of the
838	   forwarding packet as criteria.  This lookup is not limited to the
839	   longest prefix match, any entry that matches the criteria would be
840	   selected and composed of an avoiding hops list.  Thirdly, the
841	   candidate hops minus avoiding hops are composed of an applicable hops
842	   list.  At last, M7 sends the forwarding packet to any one of the
843	   applicable hops.  If the applicable list is empty, the forwarding
844	   packet will be dropped.

846	7.  How a FAR Router Works

848	   Figure 7 shows how a FAR router works by its FSM.

850	                            /---------------\
851	                            |     Start     |
852	                            |               |
853	                            \---------------/
854	                                    |
855	                     +--------+     |1
856	                     |        |     |
857	                     |       \|/   \|/
858	                     |    +--------------------+
859	                     |2   |         ND         |
860	                     |    |                    |
861	                     |    +--------------------+
862	                     |        |      |
863	                     |        |      |3
864	                     +--------+      |
865	                                    \|/
866	                              +--------------+
867	                              |    ND-FIN    |
868	                              |              |
869	                              +--------------+
870	                                     |
871	                                     |4
872	                                     |
873	                                    \|/
874	            ________10_______\+--------------+/_______11________
875	           |                 /|              |\                |
876	           |                  |    Listen    |                 |
877	           |     ____9_______\|              |/_______12___    |
878	           |    |            /|              |\            |   |
879	           |    |             +--------------+             |   |
880	           |    |               5/  |  |    \8             |   |
881	           |    |              |/_  |  |    _\|            |   |
882	           |    |  +------------+  6|  |7   +------------+ |   |
883	           |     --| HELLO-RECV |   |  |    | LFA-RECV   |--   |
884	           |       +------------+   |  |    +------------+     |
885	           |                  ______|  |______                 |
886	           |                 |                |                |
887	           |                \|/              \|/               |
888	           |           +------------+  +------------+          |
889	           |___________|  DLR-RECV  |  |  DA-RECV   |__________|
890	                       +------------+  +------------+

892	                                  _________
893	                                 |         |
894	                                 |         |
895	                                \|/        |
896	                         +--------------+  |13
897	                         | Hello Thread |  |
898	                         +--------------+  |
899	                                 |         |
900	                                 |_________|

902	                                  _________
903	                                 |         |
904	                                 |         |
905	                                \|/        |
906	                         +--------------+  |14
907	                         |  LFD Thread  |  |
908	                         +--------------+  |
909	                                 |         |
910	                                 |_________|

912	                                  _________
913	                                 |         |
914	                                 |         |
915	                                \|/        |
916	                         +--------------+  |15
917	                         |  DA Thread   |  |
918	                         +--------------+  |
919	                                 |         |
920	                                 |_________|

922	                                  _________
923	                                 |         |
924	                                 |         |
925	                                \|/        |
926	                         +--------------+  |16
927	                         |  DFD Thread  |  |
928	                         +--------------+  |
929	                                 |         |
930	                                 |_________|

932	             Figure 7: The Finite State Machine of FAR Router

934	   1)When a router starts up, it starts a Hello thread and then starts
935	   ND (neighbor detection) timer (3 seconds).  Next the router goes into
936	   ND (neighbor detection) state.
937	   2)In the ND state, if a router received a Hello message, then it
938	   performs a Hello-message processing and goes back to the ND state.
939	   3)When the ND timer is over, a router goes into ND-FIN (neighbor
940	   detection finished) state.
941	   4)A router starts the LFD (link failure detection) thread and DFD
942	   (device failure detection) state, and sends DA message and DLR
943	   message to all of its active ports.  Then the router goes into Listen
944	   state.
945	   5) If a router receives a Hello message, then goes into HELLO-RECV
946	   state.
947	   6) If a router receives a DLR message, then goes into DLR-RECV state.
948	   7) If a router receives a DA message, then goes into DA-RECV state.
949	   8) If a router receives a LFA message, then goes into LFA-RECV state.
950	   9) A router performs the Hello-message processing.  After that, it
951	   goes back to Listen state.
952	   10) A router performs the DLR-message processing.  After that, it
953	   goes back to Listen state.
954	   11) A router performs the DA-message processing.  After that, it goes
955	   back to Listen state.

957	   12) A router performs the LFA-message processing.  After that, it
958	   goes back to Listen state.
959	   13) Hello thread produces and sends Hello messages to all its ports
960	   periodically.
961	   14) LFD thread calls link-failure-detection processing to check link
962	   failures in all links periodically
963	   15) DA thread produces and sends DA messages periodically (30
964	   minutes).
965	   16) When DFD thread starts up, it sleep a short time (30 seconds) to
966	   wait for a router learning all the active routers in the network.
967	   Then the thread calls the device-failure-detection processing to
968	   check device failures periodically (30 minutes).

970	8.  Compatible Architecture

972	   As a generic routing protocol, FAR can be run in various DCNs with
973	   regular topology.  Up to now, we have implemented the FAR protocol
974	   for 4 types of DCN, including Fat-tree, BCube, MatrixDCN and Diamond.

976	   For different network architectures, most processing of FAR is same
977	   besides calculation of routing tables.  BRT routing tables are
978	   calculated based on Hello messages and NRT routing tables are
979	   calculated based on LFA messages in FAR.  To extend FAR to support a
980	   new network architecture, only processing of Hello and LFA messages
981	   need providing to build BRT and NRT routing tables.

983	   In this protocol, FAR can support maximally 12 network architectures
984	   and at least support 1 built-in network architecture, such as Fat-
985	   tree, BCube and MatrixDCN,etc.  Each network architecture is assigned
986	   a unique number from 1 to 12.  For example, if the 1 built-in
987	   architectures are assigned 1, and other customized architectures are
988	   assigned 2 to 12.
989	   1: Fat-tree
990	   2: BCube
991	   3: MatrixDCN.
992	   4: xxx.
993	   ......
994	   12: xxx.

996	9.  Application Example

998	   In this section, we take a Fat-tree network(Fig. 7) as an example to
999	   describe how to apply FAR routing.  Since M1 to M4 are very simple,
1000	   we only introduce how the modules M5, M6, and M7 work in a Fat-tree
1001	   network.

1003	   A Fat-tree network is composed of 4 layers.  The top layer is core
1004	   layer, and the other layers are aggregation layer, edge layer and
1005	   server layer.  There are k pods, each one containing two layers of
1006	   k/2 switches.  Each k-port switch in the edge layer is directly
1007	   connected to k/2 hosts.  The remaining k/2 ports are connected to k/2
1008	   of the k-port switches in the aggregation layer.  There are (k/2)2
1009	   k-port core switches.  Each core switch has one port connected to
1010	   each of the k pods.

1012	       10.0.1.1          10.0.1.2            10.0.2.1         10.0.2.2
1013	         +--+              +--+                +--+               +--+
1014	         |  |              |  |                |  |               |  |
1015	         +--+,_            +--+',              +--+             ,,+--+
1016	         |`,`',`-.,       / | \  `.           .'` -        .-'``.` /|
1017	         |  .  `', `'.,  /  |  '   '       ,-`,'  |`.         .`  ' |
1018	         |   \    `',  `-.,              .`  /    |  `,     .`  ,'  |
1019	         |    `,     `'.   `'-,_      .'`  ,'     |    ',      /    |
1020	         |      .       `'.     `-.,-`    /       |      \
1021	         |       \         `'.,  .` `'., `        |       `.
1022	         |        `,          .'`,     ,`'.,      |         ',
1023	         |          .      ,-`    '., -     `'-,_ |           `.
1024	         |           \   .`         ,'.,         `|.,           .
1025	         |            .'`          /    `-,       |  `'.,        `.
1026	     10.1.0.1      ,-`  .        .'    10.3.0.1  10.3.0.2`'.,      ',
1027	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1028	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1029	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1030	          |  \/ |         |  \/ |           |  \/ |           |  \/ |
1031	         +--+/\+--+      +--+/\+--+        +--+/\+--+        +--+/\+--+
1032	         |  |  |  |      |  |  |  |        |  |  |  |        |  |  |  |
1033	         +--+  +--+      +--+  +--+        +--+  +--+        +--+  +--+
1034	          /|   10.1.2.1   /|    |\     10 3.1.3   |\          /|   |  |
1035	         / |    | \      / |    | \        / |    | \        / |   |  |
1036	        /  |    |  \    /  |    |  \      /  |    |  \      /  |   |  |
1037	       /   |    |   \  /   |    |   \    /   |    |   \    /   |   |  |
1038	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1039	       ++  ++  ++  ++ ++  ++    ++  ++  ++  ++    ++  ++   ++  ++ ++ ++
1040	            10.1.2.2                           10.3.1.3

1042	                        Figure 8: Fat-tree Network

1044	   Aggregation switches are given addresses of the form 10.pod.0.switch,
1045	   where pod denotes the pod number, and switch denotes the position of
1046	   that switch in the upper pod (in [1, k/2]).  Edge switches are given
1047	   addresses of the form 10.pod.switch.1, where pod denotes the pod
1048	   number, and switch denotes the position of that switch in the lower
1049	   pod (in [1, k/2]).  The core switches are given addresses of the form
1050	   10.0.j.i, where j and i denote that switch's coordinates in the
1051	   (k/2)2 core switch grid (each in[1, (k/2)], starting from top-left).
1052	   The address of a host follows the pod switch to which it is connected
1053	   to; hosts have addresses of the form: 10.pod.switch.ID, where ID is
1054	   the host's position in that subnet (in [2, k/2+1], starting from left
1055	   to the right).

1057	9.1.  BRT Building Procedure

1059	   By leveraging the topology's regularity, every switch clearly knows
1060	   how it forwards a packet.  When a packet arrives at an edge switch,
1061	   if the destination of the packet lies in the same subnet with the
1062	   switch, then the switch directly forwards the packet to the
1063	   destination server through layer-2 switching.  Otherwise, the switch
1064	   forwards the packet to any of aggregation switches in the same pod.
1065	   When a packet arrives at an aggregation switch, if the destination of
1066	   the packet lies in the same pod, the switch forwards the packet to
1067	   the corresponding edge switch.  Otherwise, the switch forwards the
1068	   packet to any of core switches that it is connected to.  If a core
1069	   switch receives a packet, it forwards the packet to the corresponding
1070	   aggregation switch that lies in the destination pod.

1072	   The forwarding policy discussed above is easily expressed through a
1073	   BRT.  The BRT of an edge switch, such as 10.1.1.1, is composed of the
1074	   following entries:

1076	   Destination/Mask       Next hop
1077	   10.0.0.0/255.0.0.0     10.1.0.1
1078	   10.0.0.0/255.0.0.0     10.1.0.2

1080	   The BRT of an aggregation switch, such as 10.1.0.1, is composed of
1081	   the following entries:

1083	   Destination/Mask            Next hop
1084	   10.1.1.0/255 255.255.0      10.1.1.1
1085	   10.1.2.0/255.255.255.0      10.1.2.1
1086	   10.0.0.0/255.0.0.0          10.0.1.1
1087	   10.0.0.0/255.0.0.0          10.0.1.2

1089	   The BRT of acore switch, such as 10.0.1.1, is composed of the
1090	   following entries:

1092	   Destination/Mask            Next hop
1093	   10.1.0.0/255 255.0.0        10.1.0.1
1094	   10.2.0.0/255.255.0.0        10.2.0.1
1095	   10.3.0.0/255.255.0.0        10.3.0.1
1096	   10.4.0.0/255.255.0.0        10.4.0.1

1098	9.2.  NRT Building Procedure

1100	   The route entries in an NRT are related with link and node failures.
1101	   We summarize all types of cases into three (3) catalogs.

1103	9.2.1.  Single Link Failure

1105	   In Fat-tree, Links can be classified as 3 types by their locations:
1106	   1) servers to edge switches; 2) edge to aggregation switches; 3)
1107	   aggregation to core switches.  Link failures between servers to edge
1108	   switches only affect the communication of the corresponding servers
1109	   and don't affect the routing tables of any switch, so we only discuss
1110	   the second and third type of links failures.

1112	   Edge to Aggregation Switches

1114	   Suppose that the link between an edge switch, such as 10.1.2.1 (A),
1115	   and an aggregation switch, such as 10.1.0.1(B),fails.  This link
1116	   failure may affect 3 types of communications.

1118	   o Sources lie in the same subnet with A, and destinations do not.  In
1119	   this case, the link failure will only affect the routing tables of A.
1120	   As this link is attached to A directly, A only needs to delete the
1121	   route entries whose next hop is B in its BRT and add no entries to
1122	   its NRT when A's M6 module detect the link failure.

1124	   o Destinations lie in the same subnet with A, and sources lie in
1125	   another subnet of the same pod.  In this case, the link failure will
1126	   affect the routing tables of all the edge switches in the same pod
1127	   except for A.  When an edge switch, such as 10.1.1.1, learns the link
1128	   failure, it will add a route entry to its NRT:

1130	   Destination/Mask            Next hop
1131	   10.1.2.0/255.255.255.0      10.1.0.1

1133	   o Destinations lie in the same subnet with A, sources lie in another
1134	   pod.  In this case, the link failure will affect the routing tables
1135	   of all the edge switches in the other pods.  When an edge switch in
1136	   one other pod, such as 10.3.1.1, learns the link failure, because all
1137	   the routings that pass through 10.3.0.1 to A will certainly pass
1138	   through the link between A and B, 10.3.1.1 need add a route entry to
1139	   its NRT:

1141	   Destination/Mask            Next hop
1142	   10.1.2.0/255.255.255.0      10.3.0.1

1144	   Aggregation to Core Switches

1146	   Suppose that the link between an aggregation switch, such as 10.1.0.1
1147	   (A), and a core switch, such as 10.0.1.2(B), fails.  This link
1148	   failure may affect 2 types of communications.

1150	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1151	   the other pods.  In this case, the link failure will only affect the
1152	   routing tables of A.  As this link is attached to A directly, A only
1153	   need to delete the route entries whose next hop is B in its BRT and
1154	   add no entries to its NRT when A's M6 module detect the link failure.

1156	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1157	   another pod.  In this case, the link failure will affect the routing
1158	   tables of all the aggregation switches in other pods except for pod
1159	   1.  When an aggregation switch in one other pod, such as 10.3.0.1,
1160	   learns the link failure, because all the routings that pass through
1161	   10.0.1.2 to the pod 1 where A lies will certainly pass through the
1162	   link between A and B, 10.3.0.1 need add a route entry to its NRT:

1164	   Destination/Mask            Next hop
1165	   10.1.0.0/255.255.0.0        10.0.1.2

1167	9.2.2.  A Group of Link Failures

1169	   If all the uplinks of an aggregation switch fail, then this switch
1170	   cannot forward packets, which will affect the routing of every edge
1171	   switches.  Suppose that all the uplinks of the node A (10.1.0.1)
1172	   fail, it will affect two types of communications.

1174	   o Sources lie in the same pod (pod 1) with A, and destinations lie in
1175	   the other pods.  In this case, the link failures will affect the
1176	   routing of the edge switches in the Pod of A.  To avoid the node A,
1177	   each edge switch should remove the route entry "10.0.0.0/255.0.0.0
1178	   10.1.0.1" in which the next hop is the node A.

1180	   o Destinations lie in the same pod (pod 1) with A, and sources lie in
1181	   other pods.  In this case, the link failures will affect the routing
1182	   of edge switches in other pods.  For example, if the edge switch
1183	   10.3.1.1 communicates with some node in the pod of A, it should avoid
1184	   the node 10.3.0.1, because any communication through 10.3.0.1 to the
1185	   pod of A will pass through the node A.  So a route entry should be
1186	   added to 10.3.1.1:

1188	   Destination/Mask            Next hop
1189	   10.1.0.0/255.255.0.0        10.3.0.1

1191	9.2.3.  Node Failures

1193	   At last, we discuss the effect of node failures to a NRT.  There are
1194	   3 types of node failures: the failure of edge, aggregation and core
1195	   switches.

1197	   o An edge switch fails.  The failure doesn't affect the routing table
1198	   of any switch.

1200	   o A core switch fails.  Only when all the core switches connected to
1201	   the same aggregation switch fail, they will affect the routing of
1202	   other switches.  This case is equal to the case that all the uplinks
1203	   of an aggregation switch fail, so the process of link failures can
1204	   cover it.

1206	   o An aggregation switch fails.  This case is similar to the case that
1207	   all the uplinks of an aggregation switch fail.  It affects the
1208	   routing of edge switches in other pods, but doesn't affect the
1209	   routing of edge switches in pod of the failed switch.  The process of
1210	   this failure is same to the second case in section 6.2.2.

1212	9.3.  Routing Procedure

1214	   FAR decides a routing by looking up its BRT and NRT.  We illuminate
1215	   the routing procedure by an example.  In this example, we suppose
1216	   that the link between 10.3.1.1 and 10.3.0.2 and the link between
1217	   10.1.2.1 and 10.1.0.2 have failed.  Then we look into the routing
1218	   procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
1219	   (destination).

1221	   Step 1: The source 10.3.1.3 sends packets to its default router
1222	   10.3.1.1

1224	   Step 2: The routing of 10.3.1.1.

1226	   1) Calculate candidate hops

1228	   10.3.1.1 looks up its BRT and gets the following matched entries:

1230	   Destination/Mask            Next hop
1231	   10.0.0.0/255.0.0.0          10.3.0.1

1233	   So the candidate hops = {10.3.0.1}

1235	   2) Calculate avoiding hops

1237	   Its NRT is empty, so the set of avoiding hop is empty too.

1239	   3) Calculate applicable hops

1241	   The applicable hops are candidate hops minus avoiding hops, so:

1243	   The applicable hops = {10.3.0.1}

1245	   4) Forward packets to 10.3.0.1

1247	   Step 3: The routing of 10.3.0.1

1249	   1) Calculate candidate hops.

1251	   10.3. 0.1 looks up its BRT and gets the following matched entries:

1253	   Destination/Mask           Next hop
1254	   10.1.0.0/255.255.0.0       10.0.1.1
1255	   10.1.0.0/255.255.0.0       10.0.1.2

1257	   So the candidate hops = {10.0.1.1, 10.0.1.2}

1259	   2) Calculate avoiding hops

1261	   Destination/Mask           Next hop
1262	   10.1.0.0/255.255.0.0       10.0.1.2

1264	   So the avoiding hops = {10.0.1.2}

1266	   3) Calculate applicable hops

1268	   The applicable hops are candidate hops minus avoiding hops, so:

1270	   The applicable hops = {10.0.1.1}

1272	   4) Forward packets to 10.0.1.1
1273	   Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
1274	   routing tables.

1276	   Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
1277	   routing tables.

1279	   Step 6:10.1.2.1 forwards packets to the destination 10.1.2.2 by
1280	   layer-2 switching.

1282	9.4.  FAR's Performance in Large-scale Networks

1284	   FAR has good performance to support large-scale networks.  In this
1285	   section, we take a Fat-tree network composed of 2,880 48-port
1286	   switches and 27,648 servers as an example to show FAR's performance.

1288	9.4.1.  The number of control messages required by FAR

1290	   FAR exchanges a few messages between routers and only consumes a
1291	   little network bandwidth.  Tab. 1 shows the required messages in the
1292	   example Fat-tree network.
1293	   Table 1:Required messages in a Fat-tree network.
1294	   _____________________________________________________________________
1295	   Message Type| Scope | size(bytes) | Rate | Bandwidth
1296	   ---------------------------------------------------------------------
1297	   Hello |adjacent switches|less than 48|10 messages/sec|less than 4
1298	   kbps
1299	   ---------------------------------------------------------------------
1300	   DLR |adjacent switches| less than 48 | (1) |48bytes
1301	   ---------------------------------------------------------------------
1302	   DA |entire network| less than 48 | (2) |1.106M
1303	   ---------------------------------------------------------------------
1304	   LFA |entire network| less than 48 | (3) |48 bytes
1305	   _____________________________________________________________________
1306	   (1)Produce one when a router starts
1307	   (2)The number of switches(2,880) in a period
1308	   (3)Produce one when a link fails or recovers

1310	9.4.2.  The Calculating Time of Routing Tables

1312	   A BRT is calculated according to the states of its neighbor routers
1313	   and attached links.  An NRT is calculated according to device and
1314	   link failures in the entire network.  So FAR does not calculate
1315	   network topology and has no problem of network convergence, which
1316	   greatly reduces the calculating time of routing tables.  The
1317	   detection and spread time of link failures is very short in FAR.
1318	   Detection time is up to the interval of sending Hello message.  In
1319	   FAR, the interval is set to 100ms, and a link failure will be
1320	   detected in 200ms.  The spread time between any pair of routers is
1321	   less than 200ms.If a link fails in a data center network, FAR can
1322	   detect it, spread it to all the routers, and calculate routing tables
1323	   in no more than 500ms.

1325	9.4.3.  The Size of Routing Tables

1327	   For the test Fat-tree network, the sizes of BRTs and NRTs are shown
1328	   in Tab. 2.

1330	   Table 2: The size of routing tables in FAR
1331	   _____________________________________________________________________
1332	   Routing Table| Core Switch | Aggregation Switch | Edge Switch |
1333	   ---------------------------------------------------------------------
1334	   BRT | 48 | 48 | 24
1335	   ---------------------------------------------------------------------
1336	   NRT | 0 | 14 | 333
1337	   _____________________________________________________________________

1339	   The BRT's size at a switch is determined by the number of its
1340	   neighbor switches.  In the example network, a core switch has 48
1341	   neighbor switches (aggregation switch), so it has 48 entries in its
1342	   BRT.Only aggregation and edge switches have NRTs.  The NRT size at a
1343	   switch is related to the number of link failures in the network.
1344	   Suppose that there are 1000 link failures in the example network, the
1345	   number of failed links is 1.2% of total links, which is a very high
1346	   failure ratio.  We suppose that link failures are uniformly
1347	   distributed in the entire network.  The NRT size at an edge switch is
1348	   about 333 and the NRT size of an aggregation switch is about 14in
1349	   average.

1351	10.  Security Considerations

1353	   The security considerations will be discussed in a future version of
1354	   this document.

1356	11.  Conclusions

1358	   This draft introduces FAR protocol, a generic routing method and
1359	   protocol, for data centers that have a regular topology.  It uses two
1360	   routing tables, a BRT and an NRT, to store the normal routing paths
1361	   and the forbidden (to-be-avoided) routing paths, respectively.  This
1362	   makes the FAR protocol very simple and efficient.  The sizes of these
1363	   two tables are very small.  Usually, a BRT has only several tens of
1364	   entries and an NRT has only several or about a dozen entries.

1366	12.  Acknowledgments

1368	   This document is supported by ZTE Enterprise-University-Research
1369	   Joint Project.

1371	13.  References

1373	   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1374	   Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March
1375	   1997.

1377	   [FAT-TREE] M.  Al-Fares, A.  Loukissas, and A.  Vahdat."A
1378	   Scalable,Commodity, Data Center Network Architecture",In ACM SIGCOMM
1379	   2008.

1381	Authors' Addresses

1383	   Bin Liu
1384	   ZTE Inc., ZTE Plaza
1385	   No.19 East Huayuan Road,Hai Dian District
1386	   Beijing  100191
1387	   China

1389	   Phone: +86 -010-59932039
1390	   Email: 13683610386@139.com

1392	   Yantao Sun
1393	   Beijing Jiaotong University
1394	   No.3 Shang Yuan Cun, Hai Dian District
1395	   Beijing  100044
1396	   China

1398	   Email: ytsun@bjtu.edu.cn

1400	   Jing Cheng
1401	   Beijing Jiaotong University
1402	   No.3 Shang Yuan Cun, Hai Dian District
1403	   Beijing  100044
1404	   China

1406	   Email: yourney.j@gmail.com
1407	   Yichen Zhang
1408	   Beijing Jiaotong University
1409	   No.3 Shang Yuan Cun, Hai Dian District
1410	   Beijing  100044
1411	   China

1413	   Email: snowfall_dan@sina.com

1415	   Bhumip Khasnabish
1416	   ZTE TX Inc.
1417	   55 Madison Avenue, Suite 160
1418	   Morristown, New Jersey    07960
1419	   USA

1421	   Phone: +001-781-752-8003
1422	   Email: vumip1@gmail.com
1423	   URI:   http://tinyurl.com/bhumip/