Routing Area Working Group                                        Bin Liu
Internet-Draft                                                   ZTE Inc.
Intended status: Informational                                 Yantao Sun
Expires: December 23, 2014                                     Jing Cheng
                                                             Yichen Zhang
                                              Beijing Jiaotong University
                                                            June 21, 2014

    Generic Fault-avoidance Routing Protocol for Data Center Networks
                        draft-sl-rtgwg-far-dcn-01

Abstract

   This draft proposes a generic routing method and protocol for
   regular data center networks, named the fault-avoidance routing
   (FAR) protocol.  The FAR protocol provides a generic routing method
   for all types of network architectures that have been proposed for
   large-scale cloud-based data centers over the past few years.  FAR
   is designed to fully leverage the regularity in such topologies and
   to compute its routing tables in a simple manner.  Fat-tree is taken
   as an example architecture to illustrate how the FAR protocol can be
   applied in real operational scenarios.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 23, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Acronyms & Definitions
   2.  Conventions used in this document
   3.  Problem Statement
     3.1.  The Impact of Large-scale Networks on Route Calculation
     3.2.  Dilemma of conventional routing methods in a large-scale
           network with a giant number of router nodes
     3.3.  Network Addressing Issues
     3.4.  Big Routing Table Issues
     3.5.  Adaptivity Issues for Routing Algorithms
     3.6.  Virtual Machine Migration Issues
   4.  The FAR Framework
   5.  Data Format
     5.1.  Data Tables
     5.2.  Messages
   6.  FAR Modules
     6.1.  Neighbor and Link Detection Module (M1)
     6.2.  Device Learning Module (M2)
     6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)
     6.4.  Link Failure Learning Module (M4)
     6.5.  BRT Building Module (M5)
     6.6.  NRT Building Module (M6)
     6.7.  Routing Table Lookup (M7)
   7.  Application Example
     7.1.  BRT Building Procedure
     7.2.  NRT Building Procedure
       7.2.1.  Single Link Failure
       7.2.2.  A Group of Link Failures
       7.2.3.  Node Failures
     7.3.  Routing Procedure
   8.  Conclusion
   9.  Reference
   10. Acknowledgments
   11. Security Considerations
   Authors' Addresses

1.  Introduction

   In recent years, with the rapid development of cloud computing
   technologies, widely deployed cloud services such as Amazon EC2 and
   Google search have brought huge challenges to data center networking
   (DCN).  Today's cloud-based data centers (DCs) require large-scale
   networks with larger internal bandwidth and smaller transfer delay;
   however, conventional networks cannot meet these requirements due to
   limitations in their network architecture.  In order to satisfy the
   requirements of cloud computing services, many new network
   architectures have been proposed for data centers, such as Fat-tree,
   MatrixDCN, and BCube.  These new architectures can support non-
   blocking large-scale data center networks with more than tens of
   thousands of physical servers.

   Generic routing protocols such as OSPF and IS-IS cannot scale to
   support a very large data center network because of the problems of
   network convergence and PDU overhead.
   For each type of architecture, researchers have designed a routing
   algorithm according to the features of its topology.  Because these
   routing algorithms differ and are incompatible with one another, it
   is very difficult to develop a routing protocol for network routers
   that supports multiple routing algorithms.  Furthermore, the fault-
   tolerance mechanisms in those routing algorithms are complicated and
   inefficient.

   This draft proposes a generic routing method and protocol, the
   fault-avoidance routing (FAR) protocol, for DCNs.  This method
   leverages the regularity in the topologies of data center networks
   to simplify route learning and accelerate routing table queries.
   The method has good fault-tolerance properties and can be applied to
   any DCN with a regular topology.

1.1.  Acronyms & Definitions

   DCN - Data Center Network

   FAR - Fault-avoidance Routing

   BRT - Basic Routing Table

   NRT - Negative Routing Table

   NDT - Neighbor Devices Table

   ADT - All Devices Table

   LFT - Link Failure Table

   DA - Device Announcement

   LFA - Link Failure Announcement

   DLR - Device and Link Request

   IP - Internet Protocol

   UDP - User Datagram Protocol

   VM - Virtual Machine

2.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].
   In this document, these words will appear with that interpretation
   only when in ALL CAPS.  Lower-case uses of these words are not to be
   interpreted as carrying RFC 2119 significance.

3.  Problem Statement

   This section describes the problem to be addressed by FAR as
   proposed in this draft.  The expansion of cloud data center networks
   has brought significant challenges to existing routing technologies.
   FAR mainly solves a series of routing problems faced by large-scale
   data center networks.

3.1.  The Impact of Large-scale Networks on Route Calculation

   In a large-scale cloud data center network, there may be thousands
   of routers.  Running OSPF or other routing protocols in such a
   network encounters two challenges: a) the network convergence time
   would be too long, so creating and updating routes takes longer and
   the response to network failures may be excessively slow; b) a large
   number of routing protocol packets need to be sent, and this routing
   information consumes a lot of network bandwidth and CPU resources,
   which easily leads to packet loss and makes problem (a) more
   prominent.

   In order to solve these problems, a common practice is to partition
   the large network into small areas, where route calculation runs
   independently within each area.  However, today's cloud data centers
   typically require very large internal bandwidth.  To meet this
   requirement, a large number of parallel equivalent links are
   deployed in the network, as in the Fat-tree network architecture.
   Partitioning the network prevents the routing algorithm from fully
   utilizing these equivalent multi-paths and thus reduces the internal
   network bandwidth that can actually be delivered.

   In the FAR route calculation process, a Basic Routing Table (BRT) is
   built from the local network topology by leveraging the regularity
   of the network topology.  In addition to the BRT, FAR also builds a
   Negative Routing Table (NRT).  FAR builds the NRT gradually in the
   process of learning network link failure information, and it does
   not require learning the complete network fault information.  FAR
   does not need to wait for network convergence to complete while
   building these two tables.
   Therefore, it avoids the problem of excessive network convergence
   overheads in the route calculation process.  In addition, FAR only
   needs to exchange a small amount of link change information between
   routers, and hence consumes less network bandwidth.

3.2.  Dilemma of conventional routing methods in a large-scale network
      with a giant number of router nodes

   There are many real-world scenarios where tens of thousands of
   nodes (or many more) need to be deployed in a flat area, such as
   InfiniBand routing and switching systems, high-performance computer
   networks, and many IDC networks in China.  Similar problems have
   existed for a long time and have been solved through similar
   solutions, such as the traditional regular-topology-based RFC 3619
   protocol, the routing protocols of InfiniBand routing and switching
   systems, and high-performance computer network routing protocols.

   Why do OSPF and other conventional routing methods not work well in
   a large-scale network with a giant number of router nodes?

   As is well known, the OSPF protocol uses multiple databases, more
   topological exchange information (as seen in the following example),
   and a complicated algorithm.  It requires routers to consume more
   memory and CPU processing capability, but the rate at which a CPU
   can process protocol messages per second is very limited.  When the
   network expands, the CPU quickly approaches its processing limits,
   and at that point OSPF cannot continue to expand the scale it
   manages.  The SPF algorithm itself does not thoroughly solve these
   problems.

   On the contrary, FAR does not have the convergence delay and the
   additional CPU overheads that SPF requires, because in the initial
   stage FAR already knows the regular structure of the whole network
   topology and does not need to run the SPF computation periodically.
   One example of "more topological exchange information":

   In the OSPF protocol, LSAs are flooded every 1800 seconds.
   Especially in larger networks, the occupation of CPU and bandwidth
   will soon reach the router's performance bottleneck.  In order to
   reduce these adverse effects, OSPF introduced the concept of Area,
   which still does not solve the problem thoroughly.  By dividing an
   OSPF domain into several areas, the routers in the same area do not
   need to know the topological details outside their area.  (In
   comparison with FAR, after OSPF introduces Areas, equivalent paths
   can no longer be selected across the whole network scope.)

   OSPF achieves the following results through Areas: 1) routers only
   need to maintain the same link-state databases as other routers
   within the same Area, without the necessity of maintaining the same
   link-state database as all routers in the whole OSPF domain; 2) the
   reduction of the link-state databases means dealing with relatively
   fewer LSAs, which reduces the CPU consumption of routers; 3) the
   large number of LSAs flood only within the same Area.  However, the
   negative effect is that only a smaller number of routers can be
   managed in each OSPF area.  On the contrary, because FAR does not
   have the above disadvantages, FAR can manage a large-scale network
   even without dividing it into Areas.

   The aging time of OSPF is set in order to adapt to the routing
   changes and frequent protocol message exchanges of irregular
   topologies.  Its negative effect is that even when the network does
   not change, every LSA must be refreshed every 1800 seconds to reset
   its aging time.  In a regular topology, since the routes are fixed,
   such complex protocol message exchanges and aging rules are not
   needed to reflect routing changes; the LFA mechanism in FAR is
   sufficient.
   Therefore, FAR can omit much unnecessary processing and packet
   exchange.  The benefits are fast convergence and a much larger
   network scale than other dynamic routing protocols.  There are
   already some successful implementations of simplified routing over
   regular topologies in HPC environments.  Conclusion: since FAR needs
   few routing entries and the topology is regular, the database does
   not need to be updated regularly.  Without the need for aging, there
   is no CPU and bandwidth overhead from LSA floods every 30 minutes,
   so expanding the network has no obvious effect on the performance of
   FAR, contrary to OSPF.

   Comparison of convergence time: the settings of OSPF spf_delay and
   spf_hold_time affect the convergence time.  The convergence time of
   a network with 2480 nodes is about 15-20 seconds (as seen in the
   following pages), while FAR does not need to calculate the SPF, so
   there is no such convergence time.

   These issues still exist in the rapid convergence technologies of
   OSPF and IS-IS (such as i-SPF); convergence speed and network scale
   constrain each other.  FAR does not have the above problems, and its
   convergence time is almost negligible.

   Can FRR solve these problems?  IP FRR has some limitations.
   Establishing an IP FRR backup scheme does not affect the original
   topology and traffic forwarding established by the protocol;
   however, we cannot obtain the whereabouts and status of the traffic
   once it is switched to an alternate next hop.

3.3.  Network Addressing Issues

   Routers typically have multiple network interfaces, each connected
   to a subnet.  OSPF and other routing algorithms require that each
   interface of a router be configured with an IP address.  A large-
   scale data center network may contain thousands of routers.
   With dozens of network interfaces per router, tens of thousands of
   IP addresses may need to be configured.  Configuring and managing
   such a large number of network interfaces is a very complex issue.
   Network maintenance is usually costly and error-prone, and
   troubleshooting becomes difficult.

   In FAR, the device's position information is encoded in the IP
   address of the router.  Each router only needs to be assigned one
   unique IP address according to its location, which greatly
   simplifies the complex network addressing issues in large-scale
   networks.

3.4.  Big Routing Table Issues

   There are a large number of subnets in a large-scale data center
   network.  Routers may build a routing entry for each subnet, and
   therefore the routing tables on each router may be very large.
   This increases equipment cost and reduces the speed of routing
   table lookups.

   FAR uses two measures to reduce the size of the routing tables:
   a) it builds a BRT based on the regularity of the network topology;
   b) it introduces a new routing table, the NRT.  In this way FAR can
   reduce the size of routing tables to only a few dozen routing
   entries.

3.5.  Adaptivity Issues for Routing Algorithms

   To implement efficient routing in large-scale data centers, besides
   FAR, other routing methods have been proposed for specific network
   architectures, such as Fat-tree and BCube.  These routing methods
   are different (from both design and implementation viewpoints) and
   not compatible with conventional routing methods, which creates
   significant difficulty for network equipment providers that must
   develop new routers supporting various new routing methods.

   FAR is a generic routing method.  With slight modification, the FAR
   method can be applied to most regular data center networks.
   Furthermore, the structure of routing tables and the procedure for
   querying them in FAR are the same as in conventional routing
   methods.  If FAR is adopted, the workload of developing a new type
   of router will be significantly decreased.

3.6.  Virtual Machine Migration Issues

   Supporting VM migration is very important for cloud-based data
   center networks.  However, in order to support layer-3 routing,
   routing methods including OSPF and FAR must limit VM migration to
   within a subnet.  To resolve this paradox, the mainstream methods
   still utilize layer-3 routing on routers or switches, transmit
   packets encapsulated with IP-in-IP or MAC-in-IP between hosts
   through tunnels passing through the network to the destination
   access switch, and then extract the original packet and send it to
   the destination host.

   By utilizing the aforementioned methods, FAR can be applied to Fat-
   tree, MatrixDCN, or BCube networks to support VM migration across
   the entire network.

4.  The FAR Framework

   FAR requires that a DCN have a regular topology and that network
   devices, including routers, switches, and servers, be assigned IP
   addresses according to their locations in the network.  In other
   words, we can locate a device in the network according to its IP
   address.

   FAR is a distributed routing method.  In order to support FAR, each
   router needs a routing module that implements the FAR algorithm.
   The FAR algorithm is composed of three parts, i.e., link-state
   learning, routing table building, and routing table querying, as
   shown in Fig. 1.
   Link-state Learning            |Routing Table|  Routing Table
                                  |  Building   |  Querying
   +--------+  /---------------\  | +--------+  |  Packets
   |2 Device|<-| 1 Neighbor &  |--->| 5 BRT  |\ |     |
   |Learning|  | Link Detection|  | |Building| \|    \|/
   +--------+  \---------------/  | +--------+  +--------------+
       |              |           |             | 7 Querying   |
       |              |           |             | Routing Table|
      \|/            \|/          |             +--------------+
   +-----------------------+      |            /|
   |3 Invisible Neighbor & |      | +--------+/ |
   |Link Failure Inferring |      | | 6 NRT  |  |
   +-----------------------+      | |Building|  |
       |                          | +--------+  |
      \|/                         |             |
   +--------------+               |             |
   |4 Link Failure|               |             |
   |  Learning    |               |             |
   +--------------+               |             |

                    Figure 1: The FAR framework

   Link-state learning is responsible for a router detecting the
   states of its directly connected links and learning the states of
   all the other links in the entire network.  The second part builds
   two routing tables, a basic routing table (BRT) and a negative
   routing table (NRT), according to the link states learned in the
   first part.  The third part queries the BRT and the NRT to decide
   the next forwarding hop for received (ingress) packets.

5.  Data Format

5.1.  Data Tables

   In FAR, several data tables are maintained on each router:

   Neighbor Devices Table (NDT): stores neighbor routers and related
   links.

   All Devices Table (ADT): stores all routers in the entire network.

   Link Failures Table (LFT): stores all link failures in the entire
   network.

   Basic Routing Table (BRT): stores the candidate routes.

   Negative Routing Table (NRT): stores the routes to avoid.

   The format of the NDT:

   ----------------------------------------------------------
   Device ID | Device IP | Port ID | Link State | Update Time
   ----------------------------------------------------------

   Device ID: The ID of a neighbor router.

   Device IP: The IP address of a neighbor router.
   Port ID: The local port that the neighbor router is attached to.

   Link State: The state of the link between this router and the
   neighbor router.  There are two states: Up and Down.

   Update Time: The time the entry was last updated.

   The format of the ADT:

   --------------------------------------------------
   Device ID | Device IP | Type | State | Update Time
   --------------------------------------------------

   Device ID: The ID of a router.

   Device IP: The IP address of a router.

   Type: The type of a router.

   State: The state of a router.  There are two states: Up and Down.

   Update Time: The time the entry was last updated.

   The format of the LFT:

   --------------------------------------------
   No | Router 1 IP | Router 2 IP | Update Time
   --------------------------------------------

   No: The entry number.

   Router 1 IP: The IP address of one router that a failed link
   connects to.

   Router 2 IP: The IP address of the other router that the failed
   link connects to.

   Update Time: The time the entry was last updated.

   The format of the BRT:

   -------------------------------------------------------
   Destination | Mask | Next Hop | Interface | Update Time
   -------------------------------------------------------

   Destination: A destination network.

   Mask: The subnet mask of a destination network.

   Next Hop: The IP address of a next hop for a destination.

   Interface: The interface related to a next hop.

   Update Time: The time the entry was last updated.

   The format of the NRT:

   ------------------------------------------------------------------
   Destination | Mask | Next Hop | Interface | Failed Link No |
   Timestamp
   ------------------------------------------------------------------

   Destination: A destination network.

   Mask: The subnet mask of a destination network.

   Next Hop: The IP address of a next hop that should be avoided for a
   destination.
   Interface: The interface related to a next hop that should be
   avoided.

   Failed Link No: A group of failed link numbers separated by '/',
   for example 1/2/3.

   Timestamp: The time the entry was last updated.

5.2.  Messages

   Several protocol messages are exchanged between routers in FAR:

   Hello Message: exchanged between neighbor routers to learn
   adjacency.

   Device Announcement (DA): synchronizes the knowledge of routers
   between routers.

   Link Failure Announcement (LFA): synchronizes link failures between
   routers.

   Device and Link Request (DLR): when a router starts, it requests
   the knowledge of routers and links from its neighbors with a DLR
   message.

   A FAR message is encapsulated directly in an IP packet; the
   protocol field of the IP header indicates that an IP packet is a
   FAR message.  The IP protocol number for FAR should be assigned by
   IANA.

   The four types of FAR messages have the same packet header format,
   called the FAR header (shown in Figure 2).

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+

              Figure 2: The format of the FAR header

   Version: The FAR version.

   Message Type: The type of FAR message.

   Message Length: The total length of the FAR message.

   Checksum: The checksum of the entire FAR message.

   AuType: Authentication type.  0: no authentication, 1: plaintext
   authentication, 2: MD5 authentication.

   Authentication: Authentication information.
   0: undefined; 1: key; 2: key ID, MD5 data length, and packet
   number.  The MD5 data is appended to the end of the packet.

   AuType and Authentication follow the definitions of the OSPF
   packet.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                   Router IP                    |
   +-----------------------+------------------------+
   |    HelloInterval      |   HelloDeadInterval    |
   +-----------------------+------------------------+
   |              Neighbor Router IP                |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

             Figure 3: The format of Hello messages

   For Hello messages, the Message Type in the FAR header is set to 1.
   Besides the FAR header, a Hello message (Fig. 3) requires the
   following fields:

   Router IP: The router's IP address.

   HelloInterval: The interval at which Hello messages are sent to
   neighbor routers.

   HelloDeadInterval: The interval after which a neighbor router is
   declared dead (out of service).  If a router does not receive a
   Hello message from a neighbor router within this interval, the
   neighbor router is treated as dead.

   Neighbor Router IP: The IP address of a neighbor router.  All the
   neighbor routers' addresses should be included in a Hello message.
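   As an illustration of the field widths in Figures 2 and 3, the
   following Python sketch encodes and decodes a Hello message.  It is
   a minimal sketch, not a normative implementation: it assumes
   network (big-endian) byte order, leaves the Checksum zeroed, and
   treats the 8-byte Authentication field as opaque; the function
   names are illustrative only.

```python
import socket
import struct

# Version(1), Message Type(1), Message Length(2), Checksum(2),
# AuType(2), Authentication(8), Timestamp(4) -- per Figure 2.
FAR_HEADER = "!BBHHH8sI"
HEADER_LEN = struct.calcsize(FAR_HEADER)   # 20 bytes

def build_hello(router_ip, hello_interval, dead_interval,
                neighbor_ips, timestamp=0):
    """Encode a Hello message (Message Type 1) after the FAR header."""
    body = socket.inet_aton(router_ip)
    body += struct.pack("!HH", hello_interval, dead_interval)
    for ip in neighbor_ips:
        body += socket.inet_aton(ip)
    length = HEADER_LEN + len(body)
    # Checksum left as 0 here; a real implementation would compute it
    # over the entire message, as OSPF does for its packets.
    header = struct.pack(FAR_HEADER, 1, 1, length, 0, 0,
                         b"\x00" * 8, timestamp)
    return header + body

def parse_hello(packet):
    """Decode the fields of a Hello message built as above."""
    (ver, mtype, length, cksum, autype,
     auth, ts) = struct.unpack(FAR_HEADER, packet[:HEADER_LEN])
    body = packet[HEADER_LEN:length]
    router_ip = socket.inet_ntoa(body[:4])
    hello, dead = struct.unpack("!HH", body[4:8])
    neighbors = [socket.inet_ntoa(body[i:i + 4])
                 for i in range(8, len(body), 4)]
    return {"type": mtype, "router_ip": router_ip,
            "hello_interval": hello, "dead_interval": dead,
            "neighbors": neighbors}
```

   For example, a Hello from a router with two neighbors occupies 20
   header bytes plus a 16-byte body (4 + 2 + 2 + 2 x 4).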
   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                   Router1 IP                   |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

              Figure 4: The format of DA messages

   For DA messages (Fig. 4), the Message Type in the FAR header is set
   to 2.  Besides the FAR header, a DA message includes the IP
   addresses of all the announced routers.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+
   |                    Left IP                     |
   +------------------------------------------------+
   |                    Right IP                    |
   +------------------------------------------------+
   |                     State                      |
   +------------------------------------------------+
   |                      ...                       |
   +------------------------------------------------+

              Figure 5: The format of LFA messages

   For LFA messages (Fig. 5), the Message Type in the FAR header is
   set to 3.  Besides the FAR header, an LFA message includes all the
   announced link failures.

   Left IP: The IP address of the left endpoint router of a link.
   Right IP: The IP address of the right endpoint router of a link.

   State: The link state.  0: Up, 1: Down.

   |<--- 1 --->|<--- 1 --->|<--------- 2 ---------->|
   +-----------+-----------+------------------------+
   |  Version  |Msg Type   |     Message Length     |
   +-----------+-----------+------------------------+
   |       Checksum        |         AuType         |
   +-----------------------+------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                Authentication                  |
   +------------------------------------------------+
   |                   Timestamp                    |
   +------------------------------------------------+

              Figure 6: The format of DLR messages

   For DLR messages (Fig. 6), the Message Type in the FAR header is
   set to 4.  Besides the FAR header, a DLR message has no additional
   fields.

6.  FAR Modules

6.1.  Neighbor and Link Detection Module (M1)

   M1 is responsible for sending and receiving Hello messages and for
   detecting directly connected links and neighbor routers.  Each
   Hello message is encapsulated in a UDP packet.  M1 sends Hello
   messages periodically out of all the active router ports and
   receives Hello messages from its neighbor routers.  M1 detects
   neighbor routers and directly connected links according to the
   received Hello messages and stores these neighbors and links in a
   Neighbor Devices Table (NDT).  Additionally, M1 also stores the
   neighbor routers in an All Devices Table (ADT).

6.2.  Device Learning Module (M2)

   M2 is responsible for sending, receiving, and forwarding device
   announcement (DA) messages, learning all the routers in the whole
   network, and deducing failed routers.  When a router starts, it
   sends a DA message announcing itself to its neighbors and a DLR
   message requesting the knowledge of routers and links from its
   neighbors.  When the M2 module of a router receives a DA message,
   it checks whether the router encapsulated in the message is in the
   ADT.
If the router is not in the ADT, M2 puts this router into the ADT and
702  forwards this DA message out of all the active ports except the
703  incoming one; otherwise, M2 discards the message directly.  If the M2
704  module of a router receives a DLR message, it replies with a DA
705  message that encapsulates all of the learned routers.

707  6.3.  Invisible Neighbor and Link Failure Inferring Module (M3)

709  M3 is responsible for inferring invisible neighbors of the current
710  router by means of the ADT.  If the link between router A and its
711  neighbor B breaks, so that the M1 module of A cannot detect
712  the existence of B, then B is an invisible neighbor of A.  Since a
713  device's location is encoded in its IP address, whether two routers
714  are adjacent can be judged from their IP addresses.
715  Based on this idea, M3 infers all of the invisible neighbors of the
716  current router and the related link failures.  The results are stored
717  in the NDT.  Moreover, link failures are also added to a link-
718  failure table (LFT).  The LFT stores all of the failed links in the
719  entire network.

721  6.4.  Link Failure Learning Module (M4)

723  M4 is responsible for sending, receiving, and forwarding link failure
724  announcement (LFA) messages and learning all the link failures in the
725  whole network.  M4 broadcasts each newly inferred link failure to all
726  the routers in the network.  Each link failure is encapsulated in an
727  LFA message, and one link failure is broadcast only once.  If a router
728  receives a DLR request from its neighbor, it replies with an LFA
729  message that encapsulates all the learned link failures through the
730  M4 module.  If M4 receives an LFA message, it checks whether the link
731  failure encapsulated in the message is already in its LFT.
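The duplicate-suppressing check-and-flood behavior shared by M2 (DA messages against the ADT) and M4 (LFA messages against the LFT) can be sketched as follows.  This is an illustrative sketch only; the function and parameter names are assumptions, not part of the protocol:

```python
def handle_announcement(item, known_set, active_ports, in_port, send):
    """Generic check-and-flood used by M2 (DA/ADT) and M4 (LFA/LFT).

    item         -- the announced item (a router IP, or a failed link)
    known_set    -- the ADT or LFT, modeled as a set of learned items
    active_ports -- all active ports of this router
    in_port      -- the port the announcement arrived on
    send         -- callable(port, item) that forwards the announcement
    Returns True if the item was new and was re-flooded.
    """
    if item in known_set:          # already learned: discard silently
        return False
    known_set.add(item)            # learn the new router / link failure
    for port in active_ports:      # re-flood on every active port
        if port != in_port:        #   except the incoming one
            send(port, item)
    return True
```

Because each item is flooded at most once per router, an announcement traverses every link a bounded number of times and the flood terminates.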
If the link failure is not in the LFT, M4 puts the link failure into
733  the LFT and forwards the LFA message out of all the active ports
734  except the incoming one; otherwise, M4 discards the message directly.

736  6.5.  BRT Building Module (M5)

738  M5 is responsible for building a BRT for the current router.  By
739  leveraging the regularity in the topology, M5 can calculate the
740  routing paths for any destination without knowledge of the whole
741  network's topology, and then build the BRT based on the NDT.  Since
742  the IP addresses of network devices are contiguous, M5 creates only
743  one route entry for a group of destination addresses that share the
744  same network prefix, by means of route aggregation.  Usually,
745  the size of a BRT is very small.  The details of how to build a BRT
746  are described in Section 5.

748  6.6.  NRT Building Module (M6)

750  M6 is responsible for building an NRT for the current router.  Because
751  M5 builds a BRT without considering link failures in the network, the
752  routing paths calculated from the BRT cannot avoid failed links.  To
753  solve this problem, an NRT is used to exclude the routing paths that
754  traverse failed links from the paths calculated from the BRT.  M6
755  calculates the routing paths that traverse failed links and stores
756  them in the NRT.  The details of how to build an NRT are described in
757  Section 5.

759  6.7.  Routing Table Lookup (M7)

761  M7 is responsible for querying the routing tables and selecting the
762  next hop for forwarding packets.  Firstly, M7 takes the destination
763  address of the packet being forwarded as the criterion to look up
764  route entries in the BRT based on longest prefix match.  All of the
765  matched entries compose a candidate-hop list.  Secondly, M7 looks up
766  negative route entries in the NRT, again taking the packet's
767  destination address as the criterion.
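The whole selection procedure, whose remaining steps are described in the text that follows, can be sketched end to end as below.  The list-of-pairs table representation and the function name are illustrative assumptions, not the protocol's actual data structures:

```python
import ipaddress
import random

def select_next_hop(dst, brt, nrt):
    """Sketch of M7's selection: BRT longest-prefix matches minus NRT matches.

    brt, nrt -- lists of (prefix, next_hop) pairs, e.g.
                ("10.1.0.0/255.255.0.0", "10.0.1.1")
    Returns one applicable next hop, or None (the packet is dropped).
    """
    addr = ipaddress.ip_address(dst)
    # Step 1: candidate hops = BRT entries matching dst with the longest prefix.
    matches = [(ipaddress.ip_network(p), nh) for p, nh in brt
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    longest = max(net.prefixlen for net, _ in matches)
    candidates = {nh for net, nh in matches if net.prefixlen == longest}
    # Step 2: avoiding hops = every matching NRT entry, not only the longest.
    avoiding = {nh for p, nh in nrt if addr in ipaddress.ip_network(p)}
    # Step 3: applicable hops = candidates minus avoiding; Step 4: pick any one.
    applicable = sorted(candidates - avoiding)
    return random.choice(applicable) if applicable else None
```

For example, with the BRT {10.1.0.0/255.255.0.0 via 10.0.1.1 and via 10.0.1.2} and the NRT {10.1.0.0/255.255.0.0 via 10.0.1.2}, a packet to 10.1.2.2 is sent to 10.0.1.1.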
This lookup is not limited to the longest prefix match: every entry
769  that matches the criterion is selected, and together they compose an
770  avoiding-hop list.  Thirdly, the candidate hops minus the avoiding
771  hops compose an applicable-hop list.  Finally, M7 sends the packet to
772  any one of the applicable hops.  If the applicable-hop list is empty,
773  the packet is dropped.

775  7.  Application Example

777  In this section, we take a Fat-tree network (Fig. 7) as an example to
778  describe how to apply FAR routing.  Since M1 to M4 are very simple,
779  we only introduce how the modules M5, M6, and M7 work in a Fat-tree
780  network.

782  A Fat-tree network is composed of 4 layers.  The top layer is the
783  core layer; the layers below it are the aggregation layer, the edge
784  layer, and the server layer.  There are k pods, each one containing
785  two layers of k/2 switches.  Each k-port switch in the edge layer is
786  directly connected to k/2 hosts.  The remaining k/2 ports are
787  connected to k/2 of the k-port switches in the aggregation layer.
788  There are (k/2)^2 k-port core switches.  Each core switch has one
789  port connected to each of the k pods.

791      10.0.1.1     10.0.1.2        10.0.2.1     10.0.2.2
792       +--+         +--+            +--+         +--+
793       |  |         |  |            |  |         |  |
794       +--+,_       +--+',          +--+       ,,+--+
795       |`,`',`-., / | \ `. .'` - .-'``.` /|
796       | . `', `'., / | ' ' ,-`,' |`. .` ' |
797       | \ `', `-., .` / | `, .` ,' |
798       | `, `'. `'-,_ .'` ,' | ', / |
799       | . `'. `-.,-` / | \
800       | \ `'., .` `'., ` | `.
801       | `, .'`, ,`'., | ',
802       | . ,-` '., - `'-,_ | `.
803       | \ .` ,'., `|., .
804       | .'` / `-, | `'., `.
805   10.1.0.1 ,-` . .'          10.3.0.1   10.3.0.2`'., ',
806   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
807   |  |  |  |    |  |  |  |    |  |  |  |    |  |  |  |
808   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
809   | \/ |        | \/ |        | \/ |        | \/ |
810   +--+/\+--+    +--+/\+--+    +--+/\+--+    +--+/\+--+
811   |  |  |  |    |  |  |  |    |  |  |  |    |  |  |  |
812   +--+  +--+    +--+  +--+    +--+  +--+    +--+  +--+
813   /| 10.1.2.1   /|    |\      10.3.1.3 |\   /|    |  |
814   / |  | \      / |    | \    / |    | \    / |    |  |
815   / |  | \      / |    | \    / |    | \    / |    |  |
816   / |  | \      / |    | \    / |    | \    / |    |  |
817   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++
818   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++   ++ ++ ++ ++
819   10.1.2.2                    10.3.1.3

821  Figure 7: Fat-tree Network

823  Aggregation switches are given addresses of the form 10.pod.0.switch,
824  where pod denotes the pod number, and switch denotes the position of
825  that switch in the upper pod (in [1, k/2]).  Edge switches are given
826  addresses of the form 10.pod.switch.1, where pod denotes the pod
827  number, and switch denotes the position of that switch in the lower
828  pod (in [1, k/2]).  The core switches are given addresses of the form
829  10.0.j.i, where j and i denote that switch's coordinates in the
830  (k/2)^2 core switch grid (each in [1, k/2], starting from top-left).
831  The address of a host follows that of the edge switch to which it is
832  connected; hosts have addresses of the form 10.pod.switch.ID, where
833  ID is the host's position in that subnet (in [2, k/2+1], starting
834  from left to right).

836  7.1.  BRT Building Procedure

838  By leveraging the topology's regularity, every switch knows exactly
839  how to forward a packet.  When a packet arrives at an edge switch,
840  if the destination of the packet lies in the same subnet as the
841  switch, the switch directly forwards the packet to the
842  destination server through layer-2 switching.  Otherwise, the switch
843  forwards the packet to any of the aggregation switches in the same pod.
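For an edge switch, this rule produces a very small BRT that can be derived purely from the addressing regularity; a minimal sketch, assuming the 10.pod.0.switch aggregation addressing above (the function name is illustrative):

```python
def build_edge_brt(pod, k=4):
    """Sketch: BRT of an edge switch in pod `pod` of a k-ary Fat-tree.

    Any traffic leaving the edge switch's own subnet may go to any of
    the k/2 aggregation switches 10.pod.0.1 .. 10.pod.0.(k/2) in the
    same pod, so one aggregate route per uplink suffices.
    """
    return [("10.0.0.0/255.0.0.0", "10.%d.0.%d" % (pod, a))
            for a in range(1, k // 2 + 1)]
```

For pod 1 with k=4 this yields the next hops 10.1.0.1 and 10.1.0.2, matching the edge-switch BRT shown in the following text.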
844  When a packet arrives at an aggregation switch, if the destination of
845  the packet lies in the same pod, the switch forwards the packet to
846  the corresponding edge switch.  Otherwise, the switch forwards the
847  packet to any of the core switches that it is connected to.  When a
848  core switch receives a packet, it forwards the packet to the
849  aggregation switch that lies in the destination pod.

851  The forwarding policy discussed above is easily expressed through a
852  BRT.  The BRT of an edge switch, such as 10.1.1.1, is composed of the
853  following entries:

855  Destination/Mask        Next hop
856  10.0.0.0/255.0.0.0      10.1.0.1
857  10.0.0.0/255.0.0.0      10.1.0.2

859  The BRT of an aggregation switch, such as 10.1.0.1, is composed of
860  the following entries:

862  Destination/Mask          Next hop
863  10.1.1.0/255.255.255.0    10.1.1.1
864  10.1.2.0/255.255.255.0    10.1.2.1
865  10.0.0.0/255.0.0.0        10.0.1.1
866  10.0.0.0/255.0.0.0        10.0.1.2

868  The BRT of a core switch, such as 10.0.1.1, is composed of the
869  following entries:

871  Destination/Mask        Next hop
872  10.1.0.0/255.255.0.0    10.1.0.1
873  10.2.0.0/255.255.0.0    10.2.0.1
874  10.3.0.0/255.255.0.0    10.3.0.1
875  10.4.0.0/255.255.0.0    10.4.0.1

877  7.2.  NRT Building Procedure

879  The route entries in an NRT are related to link and node failures.
880  We summarize all of the cases into three categories.

882  7.2.1.  Single Link Failure

884  In a Fat-tree, links can be classified into 3 types by their
885  locations: 1) server to edge switch; 2) edge to aggregation switch;
886  3) aggregation to core switch.  Failures of links between servers and
887  edge switches only affect the communication of the corresponding
888  servers and don't affect the routing table of any switch, so we only
889  discuss the second and third types of link failures.

891  Edge to Aggregation Switches

893  Suppose that the link between an edge switch, such as 10.1.2.1 (A),
894  and an aggregation switch, such as 10.1.0.1 (B), fails.
This link failure may affect 3 types of communications.

897  o Sources lie in the same subnet as A, and destinations do not.  In
898  this case, the link failure will only affect the routing tables of A.
899  As this link is attached to A directly, A only needs to delete the
900  route entries whose next hop is B from its BRT, and adds no entries
901  to its NRT, when A's M6 module detects the link failure.

903  o Destinations lie in the same subnet as A, and sources lie in
904  another subnet of the same pod.  In this case, the link failure will
905  affect the routing tables of all the edge switches in the same pod
906  except A.  When an edge switch, such as 10.1.1.1, learns of the link
907  failure, it adds a route entry to its NRT:

909  Destination/Mask          Next hop
910  10.1.2.0/255.255.255.0    10.1.0.1

912  o Destinations lie in the same subnet as A, and sources lie in
913  another pod.  In this case, the link failure will affect the routing
914  tables of all the edge switches in the other pods.  When an edge
915  switch in another pod, such as 10.3.1.1, learns of the link failure,
916  it needs to add a route entry to its NRT, because every path that
917  passes through 10.3.0.1 to A certainly passes through the link
918  between A and B:

920  Destination/Mask          Next hop
921  10.1.2.0/255.255.255.0    10.3.0.1

922  Aggregation to Core Switches

924  Suppose that the link between an aggregation switch, such as 10.1.0.1
925  (A), and a core switch, such as 10.0.1.2 (B), fails.  This link
926  failure may affect 2 types of communications.

928  o Sources lie in the same pod (pod 1) as A, and destinations lie in
929  the other pods.  In this case, the link failure will only affect the
930  routing tables of A.  As this link is attached to A directly, A only
931  needs to delete the route entries whose next hop is B from its BRT,
932  and adds no entries to its NRT, when A's M6 module detects the link
933  failure.

935  o Destinations lie in the same pod (pod 1) as A, and sources lie in
936  another pod.
In this case, the link failure will affect the routing
937  tables of all the aggregation switches in the other pods.  When an
938  aggregation switch in another pod, such as 10.3.0.1, learns of the
939  link failure, it needs to add a route entry to its NRT, because every
940  path that passes through 10.0.1.2 to pod 1, where A lies, certainly
941  passes through the link between A and B:

943  Destination/Mask        Next hop
944  10.1.0.0/255.255.0.0    10.0.1.2

946  7.2.2.  A Group of Link Failures

948  If all the uplinks of an aggregation switch fail, this switch
949  cannot forward packets, which affects the routing of every edge
950  switch.  Suppose that all the uplinks of node A (10.1.0.1)
951  fail; this affects two types of communications.

953  o Sources lie in the same pod (pod 1) as A, and destinations lie in
954  the other pods.  In this case, the link failures will affect the
955  routing of the edge switches in the pod of A.  To avoid node A,
956  each edge switch should remove the route entry
957  "10.0.0.0/255.0.0.0 -> 10.1.0.1", in which the next hop is
958  node A.

960  o Destinations lie in the same pod (pod 1) as A, and sources lie in
961  other pods.  In this case, the link failures will affect the routing
962  of the edge switches in other pods.  For example, if the edge switch
963  10.3.1.1 communicates with some node in the pod of A, it should avoid
964  the node 10.3.0.1, because any communication through 10.3.0.1 to the
965  pod of A will pass through node A.  So a route entry should be
966  added to 10.3.1.1:

968  Destination/Mask        Next hop
969  10.1.0.0/255.255.0.0    10.3.0.1

971  7.2.3.  Node Failures

973  Finally, we discuss the effect of node failures on an NRT.  There are
974  3 types of node failures: failures of edge, aggregation, and core
975  switches.

977  o An edge switch fails.  The failure doesn't affect the routing table
978  of any switch.

980  o A core switch fails.
Such a failure affects the routing of
981  other switches only when all the core switches connected to the
982  same aggregation switch fail.  This case is equivalent to the case
983  in which all the uplinks of an aggregation switch fail, so the
984  processing of link failures covers it.

986  o An aggregation switch fails.  This case is similar to the case in
987  which all the uplinks of an aggregation switch fail.  It affects the
988  routing of edge switches in other pods, but doesn't affect the
989  routing of edge switches in the pod of the failed switch.  The
990  processing of this failure is the same as for the second case in
991  Section 7.2.2.

993  7.3.  Routing Procedure

994  FAR decides a route by looking up its BRT and NRT.  We illustrate
995  the routing procedure with an example.  In this example, we suppose
996  that the link between 10.3.1.1 and 10.3.0.2 and the link between
997  10.1.2.1 and 10.1.0.2 have failed.  Then we look into the routing
998  procedure of a communication from 10.3.1.3 (source) to 10.1.2.2
999  (destination).

1001 Step 1: The source 10.3.1.3 sends packets to its default router
1002 10.3.1.1.

1004 Step 2: The routing of 10.3.1.1.

1006 1) Calculate candidate hops

1008 10.3.1.1 looks up its BRT and gets the following matched entries:

1010 Destination/Mask      Next hop
1011 10.0.0.0/255.0.0.0    10.3.0.1

1013 So the candidate hops = {10.3.0.1}

1014 2) Calculate avoiding hops

1016 Its NRT is empty, so the set of avoiding hops is empty too.

1018 3) Calculate applicable hops

1020 The applicable hops are the candidate hops minus the avoiding hops,
     so:

1022 The applicable hops = {10.3.0.1}

1024 4) Forward packets to 10.3.0.1

1026 Step 3: The routing of 10.3.0.1

1028 1) Calculate candidate hops

1030 10.3.0.1 looks up its BRT and gets the following matched entries:

1032 Destination/Mask        Next hop
1033 10.1.0.0/255.255.0.0    10.0.1.1
1034 10.1.0.0/255.255.0.0    10.0.1.2

1036 So the candidate hops = {10.0.1.1, 10.0.1.2}

1038 2) Calculate avoiding hops

1039 10.3.0.1 looks up its NRT and gets the following matched entry:

1040 Destination/Mask        Next hop
1041 10.1.0.0/255.255.0.0    10.0.1.2

1043 So the avoiding hops = {10.0.1.2}

1045 3) Calculate applicable hops

1047 The applicable hops are the candidate hops minus the avoiding hops,
     so:

1049 The applicable hops = {10.0.1.1}

1051 4) Forward packets to 10.0.1.1

1053 Step 4: 10.0.1.1 forwards packets to 10.1.0.1 by looking up its
1054 routing tables.

1056 Step 5: 10.1.0.1 forwards packets to 10.1.2.1 by looking up its
1057 routing tables.

1059 Step 6: 10.1.2.1 forwards packets to the destination 10.1.2.2 by
1060 layer-2 switching.

1062 8.  Conclusion

1064 This draft introduces the FAR protocol, a generic routing method and
1065 protocol for data centers that have a regular topology.  FAR uses two
1066 routing tables, a BRT and an NRT, to store the normal routing paths
1067 and the routing paths to avoid, respectively, which makes FAR very
1068 simple and efficient.  The sizes of the two tables are very small:
1069 usually, a BRT has only a few tens of entries, and an NRT has only a
1070 few to about a dozen entries.

1072 9.  References

1074 [FAT-TREE] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable,
1075 Commodity Data Center Network Architecture", ACM SIGCOMM,
1076 2008.

1078 10.  Acknowledgments

1080 This document is supported by the ZTE Enterprise-University-Research
1081 Joint Project.

1083 11.  Security Considerations

1085 Authors' Addresses

1087 Bin Liu
1088 ZTE Inc.
1089 ZTE Plaza, No.19 East Huayuan Road, Haidian District
1090 Beijing 100191
1091 China

1093 Email: liu.bin21@zte.com.cn

1095 Yantao Sun
1096 Beijing Jiaotong University
1097 No.3 Shang Yuan Cun, Hai Dian District
1098 Beijing 100044
1099 China

1101 Email: ytsun@bjtu.edu.cn

1102 Jing Cheng
1103 Beijing Jiaotong University
1104 No.3 Shang Yuan Cun, Hai Dian District
1105 Beijing 100044
1106 China

1108 Email: yourney.j@gmail.com

1110 Yichen Zhang
1111 Beijing Jiaotong University
1112 No.3 Shang Yuan Cun, Hai Dian District
1113 Beijing 100044
1114 China

1116 Email: snowfall_dan@sina.com