Routing Area Working Group                                       Bin Liu
Internet-Draft                                                  ZTE Inc.
Intended status: Informational                                Yantao Sun
Expires: November 1, 2015                                     Jing Cheng
                                                            Yichen Zhang
                                            Beijing Jiaotong University
                                                       Bhumip Khasnabish
                                                             ZTE TX Inc.
                                                          April 30, 2015

   Generic Fault-avoidance Routing Protocol for Data Center Networks
                       draft-sl-rtgwg-far-dcn-03

Abstract

This draft proposes a generic routing method and protocol for a
regular data center network, named the fault-avoidance routing (FAR)
protocol.  The FAR protocol provides a generic routing method for all
types of network architectures that have been proposed for large-scale
cloud-based data centers over the past few years.  The protocol is
designed to fully leverage the regularity of such topologies and to
compute its routing tables in a simple manner.  Fat-tree is taken as an
example architecture to illustrate how the FAR protocol can be applied
in real operational scenarios.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF).  Note that other groups may also distribute working
documents as Internet-Drafts.  The list of current Internet-Drafts is
at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

This Internet-Draft will expire on November 1, 2015.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the
document authors.  All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Acronyms & Definitions . . . . . . . . . . . . . . . . .   3
   2.  Conventions used in this document  . . . . . . . . . . . . .   4
   3.  Problem Statement  . . . . . . . . . . . . . . . . . . . . .   4
     3.1.  The Impact of Large-scale Networks on Route Calculation    4
     3.2.  Dilemma of conventional routing methods in a large-scale
           network with a giant number of router nodes  . . . . . .   5
     3.3.  Network Addressing Issues  . . . . . . . . . . . . . . .   7
     3.4.  Big Routing Table Issues . . . . . . . . . . . . . . . .   8
     3.5.  Adaptivity Issues for Routing Algorithms . . . . . . . .   8
     3.6.  Virtual Machine Migration Issues . . . . . . . . . . . .   8
   4.  The FAR Framework  . . . . . . . . . . . . . . . . . . . . .   9
   5.  Data Format  . . . . . . . . . . . . . . . . . . . . . . . .  10
     5.1.  Data Tables  . . . . . . . . . . . . . . . . . . . . . .  10
     5.2.  Messages . . . . . . . . . . . . . . . . . . . . . . . .  12
   6.  FAR Modules  . . . . . . . . . . . . . . . . . . . . . . . .  16
     6.1.  Neighbor and Link Detection Module (M1)  . . . . . . . .  16
     6.2.  Device Learning Module (M2)  . . . . . . . . . . . . . .  17
     6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)  17
     6.4.  Link Failure Learning Module (M4)  . . . . . . . . . . .  17
     6.5.  BRT Building Module (M5) . . . . . . . . . . . . . . . .
. . . . .  18
     6.6.  NRT Building Module (M6) . . . . . . . . . . . . . . . .  18
     6.7.  Routing Table Lookup (M7)  . . . . . . . . . . . . . . .  18
   7.  How a FAR Router Works . . . . . . . . . . . . . . . . . . .  18
   8.  Compatible Architecture  . . . . . . . . . . . . . . . . . .  21
   9.  Application Example  . . . . . . . . . . . . . . . . . . . .  22
     9.1.  BRT Building Procedure . . . . . . . . . . . . . . . . .  24
     9.2.  NRT Building Procedure . . . . . . . . . . . . . . . . .  25
       9.2.1.  Single Link Failure  . . . . . . . . . . . . . . . .  25
       9.2.2.  A Group of Link Failures . . . . . . . . . . . . . .  26
       9.2.3.  Node Failures  . . . . . . . . . . . . . . . . . . .  27
     9.3.  Routing Procedure  . . . . . . . . . . . . . . . . . . .  27
     9.4.  FAR's Performance in Large-scale Networks  . . . . . . .  29
       9.4.1.  The number of control messages required by FAR . . .  29
       9.4.2.  The Calculating Time of Routing Tables . . . . . . .  29
       9.4.3.  The Size of Routing Tables . . . . . . . . . . . . .  29
   10. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . .  30
   11. References . . . . . . . . . . . . . . . . . . . . . . . . .  30
   12. Acknowledgments  . . . . . . . . . . . . . . . . . . . . . .  30
   13. Security Considerations  . . . . . . . . . . . . . . . . . .  30
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  31

1.  Introduction

In recent years, with the rapid development of cloud computing
technologies, widely deployed cloud services such as Amazon EC2 and
Google search have brought huge challenges to data center networking
(DCN).  Today's cloud-based data centers (DCs) require large-scale
networks with larger internal bandwidth and smaller transfer delay.
However, conventional networks cannot meet such requirements due to
limitations in their network architecture.
In order to satisfy the requirements of cloud computing services, many
new network architectures have been proposed for data centers, such as
Fat-tree, MatrixDCN, and BCube.  These new architectures can support
non-blocking large-scale datacenter networks with more than tens of
thousands of physical servers.

Generic routing protocols such as OSPF and IS-IS cannot be scaled to
support a very large-scale datacenter network because of the problems
of network convergence and PDU overhead.  For each type of
architecture, researchers have designed a routing algorithm according
to the features of its topology.  Because these routing algorithms are
different and incompatible with each other, it is very difficult to
develop a routing protocol for network routers supporting multiple
routing algorithms.  Furthermore, the fault-tolerance mechanisms in
those routing algorithms are very complicated and inefficient.

This draft proposes a generic routing method and protocol, the
fault-avoidance routing (FAR) protocol, for DCNs.  This method
leverages the regularity in the topologies of data center networks to
simplify route learning and accelerate the querying of routing tables.
This routing method has better fault-tolerance properties and can be
applied to any DCN with a regular topology.

1.1.  Acronyms & Definitions

DCN - Data Center Network

FAR - Fault-avoidance Routing

BRT - Basic Routing Table

NRT - Negative Routing Table

NDT - Neighbor Devices Table

ADT - All Devices Table

LFT - Link Failure Table

DA - Device Announcement

LFA - Link Failure Announcement

DLR - Device and Link Request

IP - Internet Protocol

UDP - User Datagram Protocol

VM - Virtual Machine

2.
Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].  In
this document, these words will appear with that interpretation only
when in ALL CAPS.  Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

3.  Problem Statement

The problem to be addressed by FAR as proposed in this draft is
described in this section.  The expansion of cloud data center
networks has brought significant challenges to the existing routing
technologies.  FAR mainly solves a series of routing problems faced by
large-scale data center networks.

3.1.  The Impact of Large-scale Networks on Route Calculation

In a large-scale cloud data center network, there may be thousands of
routers.  Running OSPF or other routing protocols in such a network
will encounter two challenges: a) the network convergence time would
be too long, so creating and updating routes takes longer and the
response to network failures may be excessively slow; b) a large
number of routing protocol packets need to be sent.  The routing
information consumes a lot of network bandwidth and CPU resources,
which easily leads to packet loss and makes problem (a) more
prominent.

In order to solve these problems, a common practice is partitioning
the large network into small areas, where route calculation runs
independently within different areas.  However, nowadays cloud data
centers typically require very large internal bandwidth.  To meet this
requirement, a large number of parallel equivalent links are deployed
in the network, as in the Fat-tree network architecture.
Partitioning the network prevents the routing algorithm from fully
utilizing these equivalent multi-paths and reduces the internal
network bandwidth that can be provided.

In the FAR routing calculation process, a Basic Routing Table (BRT) is
built on the local network topology, leveraging the regularity of the
network topology.  In addition to the BRT, FAR also builds a Negative
Routing Table (NRT).  FAR gradually builds the NRT in the process of
learning network link failure information, which does not require
learning the complete network fault information.  FAR does not need to
wait for the completion of network convergence in the process of
building these two tables.  Therefore, it avoids the problem of
excessive network convergence overheads in the route calculation
process.  In addition, FAR only needs to exchange a small amount of
link change information between routers, and hence consumes less
network bandwidth.

3.2.  Dilemma of conventional routing methods in a large-scale network
      with a giant number of router nodes

There are many real-world scenarios where tens of thousands of nodes
(or many more) need to be deployed in a flat area, such as InfiniBand
routing and switching systems, high-performance computer networks, and
many IDC networks in China.  Similar problems have existed for a long
time, and people have solved them through similar solutions, such as
the traditional regular-topology-based RFC 3619 protocol, the routing
protocols of InfiniBand routing and switching systems, and
high-performance computer network routing protocols.

InfiniBand defines a switch-based network to interconnect processing
nodes and I/O nodes.  The network is composed of HCAs, Switches and
SMs.  SMs are responsible for the discovery, configuration,
activation, and management of the entire subnet.  SMs can reside on
any subnet node (such as Switches, Routers or HCAs).
SMs exchange control management packets through subnet management
interfaces (SMIs) and subnet management agents (SMAs).  These control
packets are called subnet management packets (SMPs).  SMPs are sent
using an unreliable datagram service.  SMPs are divided into
LID-routed and directed-route SMPs, and directed-route SMPs are used
for network topology discovery before network initialization.
InfiniBand can support very large-scale networks and uses the
regularity in the topology to simplify its routing algorithm, which is
just what we do in FAR.

Why do OSPF and other conventional routing methods not work well in a
large-scale network with a giant number of router nodes?

As is well known, the OSPF protocol uses multiple databases, a large
amount of topological exchange information (as seen in the following
example) and a complicated algorithm.  It requires routers to consume
more memory and CPU processing capability, but the number of protocol
messages a CPU can process per second is very limited.  When the
network expands, the CPU quickly approaches its processing limits, and
at that point OSPF cannot continue to expand the scale it manages.
The SPF algorithm itself does not thoroughly solve these problems.

On the contrary, FAR does not have the convergence time delay and the
additional CPU overheads that SPF requires, because in the initial
stage FAR already knows the regular structure of the whole network
topology and does not need to periodically run SPF.

One example of "a large amount of topological exchange information":
in the OSPF protocol, LSAs flood every 1800 seconds.  Especially in
larger networks, the occupation of CPU and bandwidth will soon reach
the router's performance bottleneck.  In order to reduce these adverse
effects, OSPF introduced the concept of Area, which still has not
solved the problem thoroughly.
By dividing the OSPF domain into several areas, routers in the same
area do not need to know the topological details outside their area.
(In comparison with FAR: after OSPF introduces the concept of Area,
equivalent paths can no longer be selected across the whole network
scope.)

OSPF achieves the following results through Areas: 1) routers only
need to maintain the same link-state databases as other routers within
the same Area, without the necessity of maintaining the same
link-state database as all routers in the whole OSPF domain; 2) the
reduction of the link-state databases means dealing with relatively
fewer LSAs, which reduces the CPU consumption of routers; 3) the large
number of LSAs flood only within the same Area.  However, the negative
effect is that fewer routers can be managed in each OSPF area.  In
contrast, because FAR does not have the above disadvantages, FAR can
manage a large-scale network even without dividing it into Areas.

The aging time of OSPF is set in order to adapt to the routing changes
and frequent protocol message exchanges of irregular topologies.  Its
negative effect is that even when the network does not change, each
LSA needs to be refreshed every 1800 seconds to reset its aging time.
In a regular topology, as the routes are fixed, complex protocol
message exchange and aging rules are not needed to reflect routing
changes; the LFA mechanism in FAR is enough.

Therefore, FAR can omit much unnecessary processing and packet
exchange.  The benefits are fast convergence and a much larger network
scale than other dynamic routing protocols.  There are already some
successful implementations of simplified routing on regular topologies
in HPC environments.
Conclusion: as FAR needs few routing entries and the topology is
regular, the database does not need to be updated regularly.  Without
the need for aging, there is no CPU and bandwidth overhead brought by
LSA flooding every 30 minutes, so the expansion of the network has no
obvious effect on the performance of FAR, contrary to OSPF.

Comparison of convergence time: the settings of OSPF spf_delay and
spf_hold_time affect the convergence time.  The convergence time of a
network with 2480 nodes is about 15-20 seconds (as seen in the
following pages), while FAR does not need to calculate the SPF, so
there is no such convergence time.  These issues still exist in the
rapid convergence technologies of OSPF and IS-IS (such as i-SPF):
convergence speed and network scale constrain each other.  FAR does
not have the above problems, and its convergence time is almost
negligible.

Can FRR solve these problems?  IP FRR has some limitations.  The
establishment of an IP FRR backup scheme does not affect the original
topology and traffic forwarding established by the protocol; however,
we cannot obtain the whereabouts and status of the traffic once it is
switched to an alternate next hop.

3.3.  Network Addressing Issues

Routers typically configure multiple network interfaces, each
connected to a subnet.  OSPF and other routing algorithms require that
each interface of a router be configured with an IP address.  A
large-scale data center network may contain thousands of routers, each
with dozens of network interfaces, so tens of thousands of IP
addresses may need to be configured.  It is a very complex issue to
configure and manage such a large number of network interfaces.
Network maintenance is usually costly and error-prone, and it is
difficult to troubleshoot problems.
In FAR, the device position information is encoded in the IP address
of the router.  Each router only needs to be assigned a unique IP
address according to its location, which greatly simplifies the
complex network addressing issues in large-scale networks.

3.4.  Big Routing Table Issues

There are a large number of subnets in a large-scale data center
network.  Routers may build a routing entry for each subnet, and
therefore the size of the routing table on each router may be very
large.  This increases equipment cost and reduces the querying speed
of the routing table.

FAR uses two measures to reduce the size of the routing tables: a) it
builds a BRT based on the regularity of the network topology; b) it
introduces a new routing table, i.e., an NRT.  In this way, FAR can
reduce the size of routing tables to only a few dozen routing entries.

3.5.  Adaptivity Issues for Routing Algorithms

To implement efficient routing in large-scale datacenters, besides
FAR, some other routing methods have been proposed for specific
network architectures, such as Fat-tree and BCube.  These routing
methods are different (from both design and implementation viewpoints)
and are not compatible with conventional routing methods, which makes
it very troublesome for network equipment providers to develop new
routers supporting the various new routing methods.

FAR is a generic routing method.  With slight modification, the FAR
method can be applied to most regular datacenter networks.
Furthermore, the structure of routing tables and the querying of a
routing table in FAR are the same as in conventional routing methods.
If FAR is adopted, the workload of developing a new type of router
will be significantly decreased.

3.6.  Virtual Machine Migration Issues

Supporting VM migration is very important for cloud-based datacenter
networks.
However, in order to support layer-3 routing, routing methods
including OSPF and FAR require limiting VM migration to within a
subnet.  To resolve this paradox, the mainstream methods still utilize
layer-3 routing on routers or switches, encapsulate packets with
IPinIP or MACinIP and transmit them between hosts through tunnels
passing through the network to the destination access switch, which
then extracts the original packet and sends it to the destination
host.

By utilizing the aforementioned methods, FAR can be applied to
Fat-tree, MatrixDCN or BCube networks while supporting VM migration in
the entire network.

4.  The FAR Framework

FAR requires that a DCN has a regular topology and that network
devices, including routers, switches, and servers, are assigned IP
addresses according to their locations in the network.  In other
words, we can locate a device in the network according to its IP
address.

FAR is a distributed routing method.  In order to support FAR, each
router needs to have a routing module that implements the FAR
algorithm.  The FAR algorithm is composed of three parts, i.e.,
link-state learning, routing table building and routing table
querying, as shown in Fig. 1.
   Link-state Learning            |  Routing Table  |  Routing Table
                                  |    Building     |    Querying

   +--------+   /---------------\     +--------+
   |2 Device|<--| 1 Neighbor &  |---->| 5 BRT  |---+        Packets
   |Learning|   | Link Detection|     |Building|   |           |
   +--------+   \---------------/     +--------+   |          \|/
       |               |                           |   +---------------+
       |               |                           +-->| 7 Querying    |
      \|/             \|/                          +-->| Routing Table |
   +------------------------+                      |   +---------------+
   |3 Invisible Neighbor &  |                      |
   |Link Failure Inferring  |         +--------+   |
   +------------------------+         | 6 NRT  |---+
       |                              |Building|
      \|/                             +--------+
   +----------------+                    /|\
   |4 Link Failure  |                     |
   |  Learning      |---------------------+
   +----------------+

                     Figure 1: The FAR framework

Link-state learning enables a router to detect the states of its
directly connected links and to learn the states of all the other
links in the entire network.  The second part builds two routing
tables, a basic routing table (BRT) and a negative routing table
(NRT), according to the link states learned in the first part.  The
third part queries the BRT and the NRT to decide the next forwarding
hop for received (ingress) packets.

5.  Data Format

5.1.  Data Tables

In FAR, some data tables are maintained on each router.  They are:

Neighbor Devices Table (NDT): stores neighbor routers and related
links.

All Devices Table (ADT): stores all routers in the entire network.

Link Failures Table (LFT): stores all link failures in the entire
network.

Basic Routing Table (BRT): stores the candidate routes.

Negative Routing Table (NRT): stores the routes to be avoided.

The format of the NDT:

----------------------------------------------------------
Device ID | Device IP | Port ID | Link State | Update Time
----------------------------------------------------------

Device ID: The ID of a neighbor router.

Device IP: The IP address of a neighbor router.
Port ID: The ID of the port to which the neighbor router is attached.

Link State: The state of the link between this router and the
neighbor router.  There are two states: Up and Down.

Update Time: The time at which the entry was last updated.

The format of the ADT:

--------------------------------------------------
Device ID | Device IP | Type | State | Update Time
--------------------------------------------------

Device ID: The ID of a router.

Device IP: The IP address of a router.

Type: The type of a router.

State: The state of a router.  There are two states: Up and Down.

Update Time: The time at which the entry was last updated.

The format of the LFT:

--------------------------------------------
No | Router 1 IP | Router 2 IP | Update Time
--------------------------------------------

No: The entry number.

Router 1 IP: The IP address of one router that a failed link connects
to.

Router 2 IP: The IP address of the other router that the failed link
connects to.

Update Time: The time at which the entry was last updated.

The format of the BRT:

-------------------------------------------------------
Destination | Mask | Next Hop | Interface | Update Time
-------------------------------------------------------

Destination: A destination network.

Mask: The subnet mask of the destination network.

Next Hop: The IP address of the next hop for a destination.

Interface: The interface related to the next hop.

Update Time: The time at which the entry was last updated.

The format of the NRT:

----------------------------------------------------------------------
Destination | Mask | Next Hop | Interface | Failed Link No | Timestamp
----------------------------------------------------------------------

Destination: A destination network.

Mask: The subnet mask of the destination network.

Next Hop: The IP address of a next hop that should be avoided for the
destination.
Interface: The interface related to the next hop that should be
avoided.

Failed Link No: A group of failed-link numbers separated by "/", for
example 1/2/3.

Timestamp: The time at which the entry was last updated.

5.2.  Messages

Some protocol messages are exchanged between routers in FAR:

Hello Message: Exchanged between neighbor routers to learn adjacency.

Device Announcement (DA): Synchronizes the knowledge of routers
between routers.

Link Failure Announcement (LFA): Synchronizes link failures between
routers.

Device and Link Request (DLR): When a router starts, it requests the
knowledge of routers and links from its neighbors with a DLR message.

A FAR message is directly encapsulated in an IP packet.  The protocol
field of the IP header indicates that an IP packet is a FAR message.
The IP protocol number for FAR should be assigned by IANA.

The four types of FAR messages share the same packet header format,
called the FAR header (as shown in Figure 2).

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+

          Figure 2: The format of the FAR header

Version: The FAR version.

Message Type: The type of the FAR message.

Message Length: The total length of the FAR message.

Checksum: The checksum of the entire FAR message.

AuType: Authentication type.  0: no authentication, 1: plaintext
authentication, 2: MD5 authentication.

Authentication: Authentication information.
0: undefined; 1: key; 2: key ID, MD5 data length and packet number.
The MD5 data is appended to the end of the packet.

The AuType and Authentication fields follow the definitions of the
OSPF packet header.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                    Router IP                    |
+------------------------+------------------------+
|      HelloInterval     |   RouterDeadInterval   |
+------------------------+------------------------+
|               Neighbor Router IP                |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

          Figure 3: The Format of Hello Messages

For Hello messages, the Message Type in the FAR header is set to 1.
Besides the FAR header, a Hello message (Fig. 3) requires the
following fields:

Router IP: The router's IP address.

HelloInterval: The interval at which Hello messages are sent to
neighbor routers.

RouterDeadInterval: The interval after which a neighbor router is
declared dead (out-of-service).  If a router does not receive a Hello
message from a neighbor router within this interval, the neighbor
router is treated as dead.

Neighbor Router IP: The IP address of a neighbor router.  All of the
neighbor routers' addresses should be included in a Hello message.
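As an informal illustration of the header and Hello layout above, the
following sketch packs a Hello message with Python's struct module.
The field widths (1-byte Version and Message Type, 2-byte lengths, an
8-byte Authentication field, 4-byte Timestamp and addresses), the
network byte order, and the zeroed checksum are assumptions inferred
from Figures 2 and 3, not normative definitions.

```python
# Sketch: pack a FAR Hello message (header per Figure 2, body per
# Figure 3).  Field sizes and byte order are inferred, not normative.
import socket
import struct

FAR_VERSION = 1
MSG_HELLO = 1  # Message Type 1 = Hello

def pack_hello(router_ip, hello_interval, dead_interval, neighbor_ips,
               timestamp=0, autype=0, auth=b"\x00" * 8):
    """Build a Hello message: 20-byte FAR header + Hello fields."""
    neighbors = b"".join(socket.inet_aton(ip) for ip in neighbor_ips)
    body = (socket.inet_aton(router_ip)
            + struct.pack("!HH", hello_interval, dead_interval)
            + neighbors)
    length = 20 + len(body)  # header: 1+1+2+2+2+8+4 bytes
    header = struct.pack("!BBHHH8sI", FAR_VERSION, MSG_HELLO, length,
                         0, autype, auth, timestamp)  # checksum left 0
    return header + body

msg = pack_hello("10.1.1.1", 10, 40, ["10.1.1.2", "10.1.2.1"])
print(len(msg))  # 36: 20-byte header + 8 fixed body bytes + 2 * 4
```

A receiver would unpack the same layout with `struct.unpack("!BBHHH8sI",
msg[:20])` before interpreting the type-specific body.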
|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                    Router1 IP                   |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

           Figure 4: The Format of DA Messages

For DA messages (Fig. 4), the Message Type in the FAR header is set to
2.  Besides the FAR header, a DA message includes the IP addresses of
all the announced routers.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+
|                     Left IP                     |
+-------------------------------------------------+
|                     Right IP                    |
+-------------------------------------------------+
|                      State                      |
+-------------------------------------------------+
|                       ...                       |
+-------------------------------------------------+

           Figure 5: The Format of LFA Messages

For LFA messages (Fig. 5), the Message Type in the FAR header is set
to 3.  Besides the FAR header, an LFA message includes all the
announced link failures.

Left IP: The IP address of the left endpoint router of a link.
Right IP: The IP address of the right endpoint router of a link.

State: The link state.  0: Up, 1: Down.

|<--- 1 --->| <--- 1 --->|<--------- 2 ---------->|
+-----------+------------+------------------------+
|  Version  |Message Type|     Message Length     |
+-----------+------------+------------------------+
|        Checksum        |         AuType         |
+------------------------+------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                 Authentication                  |
+-------------------------------------------------+
|                    Timestamp                    |
+-------------------------------------------------+

           Figure 6: The Format of DLR Messages

For DLR messages (Fig. 6), the Message Type in the FAR header is set
to 4.  Except for the FAR header, a DLR message has no additional
fields.

6.  FAR Modules

6.1.  Neighbor and Link Detection Module (M1)

M1 is responsible for sending and receiving Hello messages and for
detecting directly connected links and neighbor routers.  Each Hello
message is encapsulated in a UDP packet.  M1 sends Hello messages
periodically on all the active router ports and receives Hello
messages from its neighbor routers.  M1 detects neighbor routers and
directly connected links according to the received Hello messages and
stores these neighbors and links in a Neighbor Devices Table (NDT).
Additionally, M1 also stores the neighbor routers in an All Devices
Table (ADT).

6.2.  Device Learning Module (M2)

M2 is responsible for sending, receiving, and forwarding device
announcement (DA) messages, learning all the routers in the whole
network, and deducing failed routers.  When a router starts, it sends
a DA message announcing itself to its neighbors and a DLR message
requesting the knowledge of routers and links from its neighbors.  If
the M2 module of a router receives a DA message, it checks whether the
router encapsulated in the message is in its ADT.
If the 723 router is not in the ADT, M2 puts this router into the ADT and 724 forwards the DA message to all the active ports except the 725 incoming one; otherwise, M2 discards the message. If the M2 726 module of a router receives a DLR message, it replies with a DA message 727 that encapsulates all of the learned routers. 729 6.3. Invisible Neighbor and Link Failure Inferring Module (M3) 731 M3 is responsible for inferring invisible neighbors of the current 732 router by means of the ADT. If the link between router A and its 733 neighbor B breaks, so that A's M1 module cannot detect 734 the existence of B, then B is an invisible neighbor of A. Since a 735 device's location is encoded in its IP address, whether two routers 736 are adjacent can be judged from their IP addresses. 737 Based on this idea, M3 infers all of the invisible neighbors of the 738 current router and the related link failures. The results are stored 739 in the NDT. Moreover, link failures are also added to a link- 740 failure table (LFT), which stores all of the failed links in the 741 entire network. 743 6.4. Link Failure Learning Module (M4) 745 M4 is responsible for sending, receiving, and forwarding link failure 746 announcement (LFA) messages and learning all the link failures in the whole 747 network. M4 broadcasts each newly inferred link failure to all the 748 routers in the network. Each link failure is encapsulated in an LFA 749 message, and each link failure is broadcast only once. If a router 750 receives a DLR request from its neighbor, it replies, through its M4 module, 751 with an LFA message that encapsulates all the learned link failures. 752 If M4 receives an LFA message, it checks whether the link failure 753 encapsulated in the message is already in its LFT.
If the link failure is not 754 in the LFT, M4 puts this link failure into the LFT and forwards the 755 LFA message to all the active ports except the incoming one; 756 otherwise, M4 discards the message. 758 6.5. BRT Building Module (M5) 760 M5 is responsible for building a BRT for the current router. By 761 leveraging the regularity of the topology, M5 can calculate the routing 762 paths for any destination without knowledge of the topology of the 763 whole network, and then builds the BRT based on the NDT. Since the IP 764 addresses of network devices are contiguous, M5 creates only one 765 route entry for a group of destination addresses that share the same 766 network prefix, by means of route aggregation. Usually, 767 the size of a BRT is very small. The details of how to build a BRT are 768 described in Section 5. 770 6.6. NRT Building Module (M6) 772 M6 is responsible for building an NRT for the current router. Because 773 M5 builds the BRT without considering link failures in the network, the 774 routing paths calculated from the BRT cannot avoid failed links. To 775 solve this problem, an NRT is used to exclude, from the paths calculated 776 by the BRT, the routing paths that traverse failed links. M6 777 calculates the routing paths that include failed links and stores them 778 in the NRT. The details of how to build an NRT are described in 779 Section 5. 781 6.7. Routing Table Lookup (M7) 783 M7 is responsible for querying the routing tables and selecting the next 784 hop for forwarding packets. First, M7 uses the destination 785 address of the packet being forwarded as the key to look up route 786 entries in the BRT based on longest-prefix match. All of the matched 787 entries together form a candidate hop list. Second, M7 looks up 788 negative route entries in the NRT using the destination address of the 789 packet as the key.
This lookup is not limited to the 790 longest-prefix match; any entry that matches the destination is 791 selected and added to an avoiding hop list. Third, the 792 candidate hops minus the avoiding hops form an applicable hop 793 list. Finally, M7 sends the packet to any one of the 794 applicable hops. If the applicable hop list is empty, the 795 packet is dropped. 797 7. How a FAR Router Works 799 Figure 7 shows how a FAR router works, as a finite state machine (FSM). 801 /---------------\ 802 | Start | 803 | | 804 \---------------/ 805 | 806 +--------+ |1 807 | | | 808 | \|/ \|/ 809 | +--------------------+ 810 |2 | ND | 811 | | | 812 | +--------------------+ 813 | | | 814 | | |3 815 +--------+ | 816 \|/ 817 +--------------+ 818 | ND-FIN | 819 | | 820 +--------------+ 821 | 822 |4 823 | 824 \|/ 825 ________10_______\+--------------+/_______11________ 826 | /| |\ | 827 | | Listen | | 828 | ____9_______\| |/_______12___ | 829 | | /| |\ | | 830 | | +--------------+ | | 831 | | 5/ | | \8 | | 832 | | |/_ | | _\| | | 833 | | +------------+ 6| |7 +------------+ | | 834 | --| HELLO-RECV | | | | LFA-RECV |-- | 835 | +------------+ | | +------------+ | 836 | ______| |______ | 837 | | | | 838 | \|/ \|/ | 839 | +------------+ +------------+ | 840 |___________| DLR-RECV | | DA-RECV |__________| 841 +------------+ +------------+ 843 _________ 844 | | 845 | | 846 \|/ | 847 +--------------+ |13 848 | Hello Thread | | 849 +--------------+ | 850 | | 851 |_________| 853 _________ 854 | | 855 | | 856 \|/ | 857 +--------------+ |14 858 | LFD Thread | | 859 +--------------+ | 860 | | 861 |_________| 863 _________ 864 | | 865 | | 866 \|/ | 867 +--------------+ |15 868 | DA Thread | | 869 +--------------+ | 870 | | 871 |_________| 873 _________ 874 | | 875 | | 876 \|/ | 877 +--------------+ |16 878 | DFD Thread | | 879 +--------------+ | 880 | | 881 |_________| 883 Figure 7: The Finite State Machine of FAR Router 885 1) When a router starts up, it starts a Hello thread
and then starts the 886 ND (neighbor detection) timer (3 seconds). Next, the router goes into the 887 ND state. 888 2) In the ND state, if a router receives a Hello message, it 889 performs Hello-message processing and goes back to the ND state. 890 3) When the ND timer expires, the router goes into the ND-FIN (neighbor 891 detection finished) state. 892 4) The router starts the LFD (link failure detection) and DFD 893 (device failure detection) threads, and sends a DA message and a DLR 894 message to all of its active ports. Then the router goes into the Listen 895 state. 896 5) If a router receives a Hello message, it goes into the HELLO-RECV 897 state. 899 6) If a router receives a DLR message, it goes into the DLR-RECV state. 900 7) If a router receives a DA message, it goes into the DA-RECV state. 901 8) If a router receives an LFA message, it goes into the LFA-RECV state. 902 9) The router performs Hello-message processing. After that, it 903 goes back to the Listen state. 904 10) The router performs DLR-message processing. After that, it 905 goes back to the Listen state. 906 11) The router performs DA-message processing. After that, it goes 907 back to the Listen state. 908 12) The router performs LFA-message processing. After that, it 909 goes back to the Listen state. 910 13) The Hello thread produces and sends Hello messages to all ports 911 periodically. 912 14) The LFD thread calls link-failure-detection processing to check 913 all links for failures periodically. 914 15) The DA thread produces and sends DA messages periodically (every 30 915 minutes). 916 16) When the DFD thread starts up, it sleeps for a short time (30 seconds) 917 to allow the router to learn all the active routers in the network. 918 Then the thread calls device-failure-detection processing to 919 check for device failures periodically (every 30 minutes). 921 8. Compatible Architecture 923 As a generic routing protocol, FAR can run in various DCNs with a 924 regular topology.
Up to now, we have implemented the FAR protocol 925 for four types of DCN: Fat-tree, BCube, MatrixDCN, and Diamond. 927 Across different network architectures, most of FAR's processing is the 928 same; only the calculation of routing tables differs. In FAR, BRTs are 929 calculated from Hello messages and NRTs are 930 calculated from LFA messages. Therefore, to extend FAR to support a 931 new network architecture, only the Hello- and LFA-message processing 932 that builds the BRT and NRT routing tables needs to be provided. 934 FAR can support at most 12 network architectures, 935 at least one of which is built in, such as Fat- 936 tree, BCube, or MatrixDCN. Each network architecture is assigned 937 a unique number from 1 to 12. For example, the built-in 938 architectures may be assigned the lowest numbers, with customized 939 architectures taking the remaining numbers up to 12: 940 1: Fat-tree 941 2: BCube 942 3: MatrixDCN 943 4: xxx 944 ...... 945 12: xxx 947 9. Application Example 949 In this section, we take a Fat-tree network (Fig. 8) as an example to 950 describe how to apply FAR routing. Since M1 to M4 are very simple, 951 we only introduce how the modules M5, M6, and M7 work in a Fat-tree 952 network. 954 A Fat-tree network is composed of four layers. The top layer is the core 955 layer, and the layers below it are the aggregation layer, the edge layer, 956 and the server layer. There are k pods, each one containing two layers of 957 k/2 switches. Each k-port switch in the edge layer is directly 958 connected to k/2 hosts. The remaining k/2 ports are connected to k/2 959 of the k-port switches in the aggregation layer. There are (k/2)^2 960 k-port core switches. Each core switch has one port connected to 961 each of the k pods.

       10.0.1.1    10.0.1.2    10.0.2.1    10.0.2.2
        +----+      +----+      +----+      +----+
        |core|      |core|      |core|      |core|
        +----+      +----+      +----+      +----+
          (each core switch has one link into every
               pod; these links are not drawn)

          Pod 1                        Pod 3
    10.1.0.1  10.1.0.2    ...    10.3.0.1  10.3.0.2
     +----+    +----+             +----+    +----+
     |aggr|    |aggr|     ...     |aggr|    |aggr|
     +----+    +----+             +----+    +----+
       |  \    /  |                 |  \    /  |
       |   \  /   |                 |   \  /   |
       |    \/    |                 |    \/    |
       |    /\    |                 |    /\    |
    10.1.1.1  10.1.2.1           10.3.1.1  10.3.2.1
     +----+    +----+             +----+    +----+
     |edge|    |edge|     ...     |edge|    |edge|
     +----+    +----+             +----+    +----+
      /  \      /  \               /  \      /  \
     ..  ..  10.1.2.2 ..      10.3.1.3 ..   ..  ..

993 Figure 8: Fat-tree Network 995 Aggregation switches are given addresses of the form 10.pod.0.switch, 996 where pod denotes the pod number, and switch denotes the position of 997 that switch in the upper pod (in [1, k/2]). Edge switches are given 998 addresses of the form 10.pod.switch.1, where pod denotes the pod 999 number, and switch denotes the position of that switch in the lower 1000 pod (in [1, k/2]). The core switches are given addresses of the form 1001 10.0.j.i, where j and i denote that switch's coordinates in the 1002 (k/2)^2 core switch grid (each in [1, k/2], starting from top-left). 1003 The address of a host follows that of the edge switch to which it is 1004 connected; hosts have addresses of the form 10.pod.switch.ID, where ID is 1005 the host's position in that subnet (in [2, k/2+1], starting from left 1006 to right). 1008 9.1. BRT Building Procedure 1010 By leveraging the topology's regularity, every switch knows exactly 1011 how to forward a packet. When a packet arrives at an edge switch, 1012 if the destination of the packet lies in the same subnet as the 1013 switch, the switch directly forwards the packet to the 1014 destination server through layer-2 switching.
Otherwise, the switch 1015 forwards the packet to any of the aggregation switches in the same pod. 1016 When a packet arrives at an aggregation switch, if the destination of 1017 the packet lies in the same pod, the switch forwards the packet to 1018 the corresponding edge switch. Otherwise, the switch forwards the 1019 packet to any of the core switches that it is connected to. If a core 1020 switch receives a packet, it forwards the packet to the corresponding 1021 aggregation switch that lies in the destination pod. 1023 The forwarding policy discussed above is easily expressed through a 1024 BRT. The BRT of an edge switch, such as 10.1.1.1, is composed of the 1025 following entries: 1027 Destination/Mask Next hop 1028 10.0.0.0/255.0.0.0 10.1.0.1 1029 10.0.0.0/255.0.0.0 10.1.0.2 1031 The BRT of an aggregation switch, such as 10.1.0.1, is composed of 1032 the following entries: 1034 Destination/Mask Next hop 1035 10.1.1.0/255.255.255.0 10.1.1.1 1036 10.1.2.0/255.255.255.0 10.1.2.1 1037 10.0.0.0/255.0.0.0 10.0.1.1 1038 10.0.0.0/255.0.0.0 10.0.1.2 1040 The BRT of a core switch, such as 10.0.1.1, is composed of the 1041 following entries: 1043 Destination/Mask Next hop 1044 10.1.0.0/255.255.0.0 10.1.0.1 1045 10.2.0.0/255.255.0.0 10.2.0.1 1046 10.3.0.0/255.255.0.0 10.3.0.1 1047 10.4.0.0/255.255.0.0 10.4.0.1 1049 9.2. NRT Building Procedure 1051 The route entries in an NRT are related to link and node failures. 1052 We summarize all the cases into three (3) categories. 1054 9.2.1. Single Link Failure 1056 In Fat-tree, links can be classified into 3 types by their locations: 1057 1) servers to edge switches; 2) edge to aggregation switches; 3) 1058 aggregation to core switches. Failures of links between servers and edge 1059 switches only affect the communication of the corresponding servers 1060 and do not affect the routing tables of any switch, so we only discuss 1061 the second and third types of link failures.
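Because a device's location is encoded in its address (Section 9), both a device's role and a link's type can be derived purely from IP addresses. The following Python sketch illustrates this; the function names and parsing details are illustrative assumptions, not part of the FAR protocol itself:

```python
def role(ip):
    """Infer a Fat-tree device's role from its 10.pod.switch.ID address."""
    _, pod, sw, host = (int(x) for x in ip.split("."))
    if pod == 0:
        return "core"          # 10.0.j.i
    if sw == 0:
        return "aggregation"   # 10.pod.0.switch
    if host == 1:
        return "edge"          # 10.pod.switch.1
    return "server"            # 10.pod.switch.ID, ID >= 2

def link_type(a, b):
    """Classify a link (1, 2 or 3, as in Section 9.2.1) by its endpoints."""
    ends = {role(a), role(b)}
    if ends == {"server", "edge"}:
        return 1   # server to edge switch
    if ends == {"edge", "aggregation"}:
        return 2   # edge to aggregation switch
    if ends == {"aggregation", "core"}:
        return 3   # aggregation to core switch
    raise ValueError("not a valid Fat-tree link")

print(link_type("10.1.2.1", "10.1.0.1"))  # edge-to-aggregation link -> 2
```

The same address arithmetic is what lets module M3 decide, without any topology database, whether two routers should have been adjacent.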
1063 Edge to Aggregation Switches 1065 Suppose that the link between an edge switch, such as 10.1.2.1 (A), 1066 and an aggregation switch, such as 10.1.0.1 (B), fails. This link 1067 failure may affect 3 types of communications. 1069 o Sources lie in the same subnet as A, and destinations do not. In 1070 this case, the link failure only affects the routing tables of A. 1071 As this link is attached to A directly, when A's M6 module detects the 1072 link failure, A only needs to delete from its BRT the route entries 1073 whose next hop is B; it adds no entries to its NRT. 1075 o Destinations lie in the same subnet as A, and sources lie in 1076 another subnet of the same pod. In this case, the link failure affects 1077 the routing tables of all the edge switches in the same pod 1078 except A. When an edge switch, such as 10.1.1.1, learns of the link 1079 failure, it adds a route entry to its NRT: 1081 Destination/Mask Next hop 1082 10.1.2.0/255.255.255.0 10.1.0.1 1084 o Destinations lie in the same subnet as A, and sources lie in another 1085 pod. In this case, the link failure affects the routing tables 1086 of all the edge switches in the other pods. When an edge switch in 1087 another pod, such as 10.3.1.1, learns of the link failure, it needs to 1088 add a route entry to its NRT, because every path through 10.3.0.1 to A 1089 necessarily traverses the link between A and B: 1092 Destination/Mask Next hop 1093 10.1.2.0/255.255.255.0 10.3.0.1 1094 Aggregation to Core Switches 1096 Suppose that the link between an aggregation switch, such as 10.1.0.1 1097 (A), and a core switch, such as 10.0.1.2 (B), fails. This link 1098 failure may affect 2 types of communications. 1100 o Sources lie in the same pod (pod 1) as A, and destinations lie in 1101 the other pods. In this case, the link failure only affects the 1102 routing tables of A.
As this link is attached to A directly, when A's M6 module 1103 detects the link failure, A only needs to delete from its BRT the route 1104 entries whose next hop is B; it adds no entries to its NRT. 1106 o Destinations lie in the same pod (pod 1) as A, and sources lie in 1107 another pod. In this case, the link failure affects the routing 1108 tables of all the aggregation switches in pods other than pod 1109 1. When an aggregation switch in another pod, such as 10.3.0.1, 1110 learns of the link failure, it needs to add a route entry to its NRT, 1111 because every path through 10.0.1.2 to pod 1, where A lies, 1112 necessarily traverses the link between A and B: 1114 Destination/Mask Next hop 1115 10.1.0.0/255.255.0.0 10.0.1.2 1117 9.2.2. A Group of Link Failures 1119 If all the uplinks of an aggregation switch fail, this switch 1120 cannot forward packets, which affects the routing of every edge 1121 switch. Suppose that all the uplinks of node A (10.1.0.1) 1122 fail; this affects two types of communications. 1124 o Sources lie in the same pod (pod 1) as A, and destinations lie in 1125 the other pods. In this case, the link failures affect the 1126 routing of the edge switches in A's pod. To avoid node A, 1127 each of these edge switches should remove the route entry "10.0.0.0/255.0.0.0 1128 10.1.0.1", whose next hop is node A. 1130 o Destinations lie in the same pod (pod 1) as A, and sources lie in 1131 other pods. In this case, the link failures affect the routing 1132 of edge switches in other pods. For example, if the edge switch 1133 10.3.1.1 communicates with some node in A's pod, it should avoid 1134 node 10.3.0.1, because any communication through 10.3.0.1 to 1135 A's pod will pass through node A. So a route entry should be 1136 added at 10.3.1.1: 1138 Destination/Mask Next hop 1139 10.1.0.0/255.255.0.0 10.3.0.1 1141 9.2.3.
Node Failures 1143 Finally, we discuss the effect of node failures on an NRT. There are 1144 3 types of node failures: failures of edge, aggregation, and core 1145 switches. 1147 o An edge switch fails. The failure does not affect the routing table 1148 of any switch. 1150 o A core switch fails. Only when all the core switches connected to 1151 the same aggregation switch fail does this affect the routing of 1152 other switches. This case is equivalent to all the uplinks 1153 of an aggregation switch failing, so the handling of link failures 1154 covers it. 1156 o An aggregation switch fails. This case is similar to the case in which 1157 all the uplinks of an aggregation switch fail. It affects the 1158 routing of edge switches in other pods, but does not affect the 1159 routing of edge switches in the pod of the failed switch. The handling 1160 of this failure is the same as in the second case of Section 9.2.2. 1162 9.3. Routing Procedure 1164 FAR decides a route by looking up its BRT and NRT. We illustrate 1165 the routing procedure with an example. In this example, we suppose 1166 that the link between 10.3.1.1 and 10.3.0.2 and the link between 1167 10.1.2.1 and 10.1.0.2 have failed. We then walk through the routing 1168 of a communication from 10.3.1.3 (source) to 10.1.2.2 1169 (destination). 1171 Step 1: The source 10.3.1.3 sends packets to its default router, 1172 10.3.1.1. 1174 Step 2: The routing at 10.3.1.1. 1176 1) Calculate candidate hops 1178 10.3.1.1 looks up its BRT and gets the following matched entries: 1180 Destination/Mask Next hop 1181 10.0.0.0/255.0.0.0 10.3.0.1 1183 So the candidate hops = {10.3.0.1} 1185 2) Calculate avoiding hops 1187 Its NRT is empty, so the set of avoiding hops is empty too. 1189 3) Calculate applicable hops 1191 The applicable hops are the candidate hops minus the avoiding hops, so: 1193 The applicable hops = {10.3.0.1} 1195 4) Forward packets to 10.3.0.1 1197 Step 3: The routing at 10.3.0.1 1199 1) Calculate candidate hops.
1201 10.3.0.1 looks up its BRT and gets the following matched entries: 1203 Destination/Mask Next hop 1204 10.1.0.0/255.255.0.0 10.0.1.1 1205 10.1.0.0/255.255.0.0 10.0.1.2 1207 So the candidate hops = {10.0.1.1, 10.0.1.2} 1209 2) Calculate avoiding hops 1211 Its NRT contains the following matched entry: Destination/Mask Next hop 1212 10.1.0.0/255.255.0.0 10.0.1.2 1214 So the avoiding hops = {10.0.1.2} 1216 3) Calculate applicable hops 1218 The applicable hops are the candidate hops minus the avoiding hops, so: 1220 The applicable hops = {10.0.1.1} 1222 4) Forward packets to 10.0.1.1 1224 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its 1225 routing tables. 1227 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its 1228 routing tables. 1230 Step 6: 10.1.2.1 forwards packets to the destination 10.1.2.2 by 1231 layer-2 switching. 1233 9.4. FAR's Performance in Large-scale Networks 1235 FAR performs well in large-scale networks. In this 1236 section, we take a Fat-tree network composed of 2,880 48-port switches 1237 and 27,648 servers as an example to show FAR's performance. 1239 9.4.1. The Number of Control Messages Required by FAR 1241 FAR exchanges only a few messages between routers and consumes 1242 little network bandwidth. Table 1 shows the required messages in the 1243 example Fat-tree network. 1244 Table 1: Required messages in a Fat-tree network.
1245 _____________________________________________________________________ 1246 Message Type | Scope | Size (bytes) | Rate | Bandwidth 1247 --------------------------------------------------------------------- 1248 Hello | adjacent switches | less than 48 | 10 messages/sec | less than 4 kbps 1249 --------------------------------------------------------------------- 1250 DLR | adjacent switches | less than 48 | (1) | 48 bytes 1251 --------------------------------------------------------------------- 1252 DA | entire network | less than 48 | (2) | 1.106M 1253 --------------------------------------------------------------------- 1254 LFA | entire network | less than 48 | (3) | 48 bytes 1255 _____________________________________________________________________ 1256 (1) One message is produced when a router starts 1257 (2) One message per switch (2,880 switches) in each period 1258 (3) One message is produced when a link fails or recovers 1260 9.4.2. The Calculating Time of Routing Tables 1262 A BRT is calculated from the states of a router's neighbor routers 1263 and attached links. An NRT is calculated from the device and 1264 link failures in the entire network. Thus, FAR does not compute the 1265 network topology and has no network-convergence problem, which 1266 greatly reduces the time needed to calculate routing tables. The 1267 detection and spread time of link failures is very short in FAR. 1268 The detection time depends on the interval at which Hello messages are sent. In 1269 FAR, the interval is set to 100 ms, and a link failure is 1270 detected within 200 ms. The spread time between any pair of routers is 1271 less than 200 ms. If a link fails in a data center network, FAR can 1272 detect it, spread it to all the routers, and recalculate routing tables 1273 in no more than 500 ms. 1275 9.4.3. The Size of Routing Tables 1277 For the test Fat-tree network, the sizes of BRTs and NRTs are shown 1278 in Table 2.
1280 Table 2: The size of routing tables in FAR 1281 _____________________________________________________________________ 1282 Routing Table | Core Switch | Aggregation Switch | Edge Switch 1283 --------------------------------------------------------------------- 1284 BRT | 48 | 48 | 24 1285 --------------------------------------------------------------------- 1286 NRT | 0 | 14 | 333 1287 _____________________________________________________________________ 1289 The size of a BRT at a switch is determined by the number of its 1290 neighbor switches. In the example network, a core switch has 48 1291 neighbor switches (aggregation switches), so it has 48 entries in its 1292 BRT. Only aggregation and edge switches have NRTs. The NRT size at a 1293 switch is related to the number of link failures in the network. 1294 Suppose that there are 1,000 link failures in the example network; the 1295 number of failed links is then 1.2% of all links, which is a very high 1296 failure ratio. We suppose that link failures are uniformly 1297 distributed over the entire network. The NRT size at an edge switch is 1298 then about 333 entries, and the NRT size at an aggregation switch is 1299 about 14 entries on average. 1301 10. Conclusion 1303 This draft introduces the FAR protocol, a generic routing method and 1304 protocol for data centers that have a regular topology. It uses two 1305 routing tables, a BRT and an NRT, to store the normal routing paths 1306 and the routing paths to avoid, respectively, which makes FAR very simple 1307 and efficient. The two tables are very small. Usually, a 1308 BRT has only a few tens of entries, and an NRT has from a few to 1309 about a dozen entries. 1311 11. Reference 1313 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat, "A 1314 Scalable, Commodity Data Center Network Architecture", ACM SIGCOMM 1315 2008. 1317 12. Acknowledgments 1319 This document is supported by the ZTE Enterprise-University-Research 1320 Joint Project. 1322 13.
Security Considerations 1323 Authors' Addresses 1325 Bin Liu 1326 ZTE Inc. 1327 ZTE Plaza, No.19 East Huayuan Road,Haidian District 1328 Beijing 100191 1329 China 1331 Phone: +86 -010-59932039 1332 Email: 13683610386@139.com 1334 Yantao Sun 1335 Beijing Jiaotong University 1336 No.3 Shang Yuan Cun, Hai Dian District 1337 Beijing 100044 1338 China 1340 Email: ytsun@bjtu.edu.cn 1342 Jing Cheng 1343 Beijing Jiaotong University 1344 No.3 Shang Yuan Cun, Hai Dian District 1345 Beijing 100044 1346 China 1348 Email: yourney.j@gmail.com 1350 Yichen Zhang 1351 Beijing Jiaotong University 1352 No.3 Shang Yuan Cun, Hai Dian District 1353 Beijing 100044 1354 China 1356 Email: snowfall_dan@sina.com 1358 Bhumip Khasnabish 1359 ZTE TX Inc. 1360 55 Madison Avenue, Suite 160 1361 Morristown, New Jersey 07960 1362 USA 1364 Phone: +001-781-752-8003 1365 Email: bhumip.khasnabish@ztetx.com