FORCES W. Lu Internet-Draft P. Venugopal Intended status: Informational I. Chen Expires: April 21, 2011 Ericsson October 18, 2010 Control Forwarding Data Reduction draft-lu-control-forwarding-data-reduction-00 Abstract Data exchange between the control plane and the forwarding plane becomes the bottleneck of modern routers in terms of scalability and convergence. This paper describes a method which reduces the volume of information transferred from the control plane to the forwarding plane. The reduction has no impact on routing table precision, but it brings up the convergence speed significantly. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on April 21, 2011. Copyright Notice Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Lu, et al. Expires April 21, 2011 [Page 1] Internet-Draft Control Forwarding Data Reduction October 2010 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 1.2. Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Existing Method . . . . . . . . . . . . . . . . . . . . . . . 4 2.1. Disadvantages . . . . . . . . . . . . . . . . . . . . . . 5 3. Proposed Method . . . . . . . . . . . . . . . . . . . . . . . 5 3.1. Advantages . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2. Procedure . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3. Pseudo Code . . . . . . . . . . . . . . . . . . . . . . . 11 4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.1. Analogy . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.2. Trade-off . . . . . . . . . . . . . . . . . . . . . . . . 12 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 7. Security Considerations . . . . . . . . . . . . . . . . . . . 13 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 13 8.1. Normative References . . . . . . . . . . . . . . . . . . . 13 8.2. Informative References . . . . . . . . . . . . . . . . . . 13 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 14 Lu, et al. Expires April 21, 2011 [Page 2] Internet-Draft Control Forwarding Data Reduction October 2010 1. Introduction One link breakage, or merely a metric change, can cause vast amount of routing table updates and RIB, FIB downloading. Nobody likes that. But everybody lives with that. Due to the de facto format of the routing table, defined in TCP/IP Illustrated [Wright94] for example, every route entry has to be resolved to map to a local outgoing interface, commonly called nexthop. The advantage of this format is that when it comes to forwarding, traffic can readily be dispatched through the nexthop interfaces. x A---B---C--- y\ / D Figure 1: Sample Topology The trade off however is that a single route change can affect many destinations that are linked through the same route. For example, as shown in Figure 1, router A can reach thousands of routes (rt1, rt2, ..., rtn) behind router C through its outgoing interface x. The routing table looks like the one shown in Table 1. +-----+---+ | B | x | | D | y | | C | x | | rt1 | x | | rt2 | x | | .. | | | rtn | x | +-----+---+ Table 1: Sample Routing Table When the link off x breaks, router A will have to use interface y to reach B, and subsequently C, rt1, rt2, etc. The new routing table would look like Table 2. Lu, et al. Expires April 21, 2011 [Page 3] Internet-Draft Control Forwarding Data Reduction October 2010 +-----+---+ | B | y | | D | y | | C | y | | rt1 | y | | rt2 | y | | .. | | | rtn | y | +-----+---+ Table 2: Sample Routing Table After Change Compared to Table 1, many entries are changed. They all have to be downloaded to RIB, which in turn downloads them to FIB. This is very costly. This memo proposes a method that keeps the routing table change constrained around the area of failure, hence greatly reduce the impact of the topology change. 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. 1.2. Acronyms RIB - Routing Information Base FIB - Forwarding Information Base IGP - Interior Gateway Protocol SPF - Shortest Path First RIP - Routing Information Protocol OSPF - Open Shortest Path First IS-IS - Intermediate System to Intermediate System 2. Existing Method Upon topology change, the existing method marks changes on all affected routes that switch to use different outgoing interfaces. The number of these routes can be very high such as hundreds of Lu, et al. Expires April 21, 2011 [Page 4] Internet-Draft Control Forwarding Data Reduction October 2010 thousands. All these routes will have to be downloaded to RIB which in turn downloads them to the forwarding data plane. 2.1. Disadvantages The existing method incurs burden on routing table management. It also demands lots of downloading effort to RIB and FIB. These efforts are proportional to the size of the network. When the network size becomes large, all four processes, i.e. routing table updates, RIB downloading, FIB downloading, and FIB updates, become bottlenecks, resulting in very slow overall system convergence time, if it converges at all. 3. Proposed Method This method solves the following three problems. a. Routing Table localization; b. RIB/FIB fast downloading; c. FIB fast lookup. None of these three problems is trivial to deal with. The solution will not be practical until all these issues are addressed. The routing table localization is not very much pursued because people tend to link this to the incremental SPF (iSPF). There is never ceasing argument about the iSPF on whether it is necessary or worthy of the overhead. Numerous algorithms were proposed to improve or to justify the iSPF, but none of them is ideal. However this is not the focus of the routing table localization. Regardless of the method used, the final routing table has to be the same. The proposed method separates the routing table optimization, or more precisely routing table localization, from the SPF method which can be improved independently and is beyond the scope of this memo. The RIB/FIB downloading is the key bottleneck in modern routers. This is also the triggering point that leads to this proposal. The savings on the application (routing process) and the downloading could take a toll on the forwarding process, especially the FIB lookup. Reducing or removing such impact is also an important consideration of this proposal. Lu, et al. Expires April 21, 2011 [Page 5] Internet-Draft Control Forwarding Data Reduction October 2010 3.1. Advantages The main advantage is scalability. With the traditional method, the impact of a single network change is often domain wide and is variable depending on the location of the failure. The proposed method limits the impact to a local scope which is centered on the point of failure and is usually one or two hops in size. In most cases, the number of affected routes is near or below the average number of routers' interfaces. And this number will not go up proportionally with the network size like in the case of traditional handling. The scalability issue will transfer to the convergence time and traffic outage time. For a large size network, the difference can be very significant. Table 3 and Table 4 lists some comparison data for a set of typical n-cube topology. The n-cube is a half meshed network with most available links which are orthogonal to each other. It is often used to simulate real production networks. The number n in n-cube denotes the dimension. So the total number of nodes (or routers) is 2 to the power of n which is 2**n. And the total number of edges is n*2**(n-1). We create three network scenarios from small to large size. We assume each node has 100 additional routes attached. These routes can be of any kind such as the ones redistributed from BGP. The cost on the edge is the same for all edges and for both directions. The total number of routes is the sum of links and leaf routes. Table 3 and Table 4 show the side by side performance comparison between the convention method and the proposed method. The two tables share the same topology matrix. The three columns to the right show the possible impact of a single link change. These three columns' names indicate the locality of the point of failure. Near means it is near the computing router. Far means far away in terms of hop counts and distance. Mid means in between. +----------+------+------+--------+--------+-------+------+-----+ | topology | node | link | leaf | total | near | mid | far | +----------+------+------+--------+--------+-------+------+-----+ | 3-cube | 8 | 12 | 800 | 812 | 270 | 150 | 33 | | 4-cube | 16 | 32 | 1600 | 1632 | 544 | 300 | 25 | | 10-cube | 1024 | 5120 | 102400 | 107520 | 10752 | 5400 | 10 | +----------+------+------+--------+--------+-------+------+-----+ Lu, et al. Expires April 21, 2011 [Page 6] Internet-Draft Control Forwarding Data Reduction October 2010 Table 3: Change Impacts: Existing Method +----------+------+------+--------+--------+------+-----+-----+ | topology | node | link | leaf | total | near | mid | far | +----------+------+------+--------+--------+------+-----+-----+ | 3-cube | 8 | 12 | 800 | 812 | 2 | 2 | 1 | | 4-cube | 16 | 32 | 1600 | 1632 | 3 | 3 | 1 | | 10-cube | 1024 | 5120 | 102400 | 107520 | 9 | 9 | 1 | +----------+------+------+--------+--------+------+-----+-----+ Table 4: Change Impacts: Proposed Method In these exemplary networks, the proposed method works very well with different network size. If implemented, the router will respond to network changes swiftly. The reduction of these numbers not just alleviates the work on the routing process, but helps more significantly when it comes to RIB/ FIB downloading. It reduces significantly the amount of messaging between the clients and RIB, and between RIB and FIB on the data plane. The latter, which are real inter-machine messages, are particularly significant. The proposed method moves the work around specifically to save that messaging. 3.2. Procedure The proposed method makes a few modifications in routing/forwarding areas which are detailed below. As can be seen, the changes are simple and light in both implementation and execution. 1. In SPF computation, directly use the parent node as the next hop, skipping inheriting the parent node's next hop. This actually simplifies SPF a bit; 2. The RIB needs to add logic to understand and allow the non-direct next-hop, also called recursive next-hop. FIB also needs to understand and allow the recursive next-hop. The next-hop entry structure is generally unchanged. An additional flag will have to be added to indicate whether a next-hop is recursive; 3. The data plane needs some new code: A. Two FIB tables are needed in the forwarding plane, one is called the update-table, the other the packet-table. B. The updating processor (different from the ones for forwarding packets), once receives updates from RIB, will update the entries (which are mostly recursive) onto the Lu, et al. Expires April 21, 2011 [Page 7] Internet-Draft Control Forwarding Data Reduction October 2010 update-table, and resolve the recursive next-hops for all affected routes. C. It will also record the parent-children relationship for efficient updating in the future. D. When all affected routes are updated, the updating processor copies these entries over to the packet-table based on which packets are forwarded. E. The packet-table requires no change per this method. The forwarding throughput is intact; Continue using Figure 1 as an example. Table 5 shows the FIB update- table for the same topology using the proposed method. +-----+--------+----+-------+ | Dst | parent | nh | child | +-----+--------+----+-------+ | B | x | | | | D | y | | | | C | B | | | | rt1 | C | | | | rt2 | C | | | | ... | | | | | rtn | C | | | +-----+--------+----+-------+ Table 5: Sample FIB update-table In Table 5, "parent" is the immediate parent node of the Destination "Dst". The other two columns "nh" and "child" are empty and will be resolved as soon as the FIB downloading completes. The resolution process walks every newly updated "Dst" entry, adds the node to the "child" column of its parent and inherits the next-hop from the parent to its "nh" column. This resolution process literally is the same as the IGP routing table next-hop updating process. The proposed method postpones this to the last stage of the route change handling, so that the volume of updating messages is minimized. Note that this last stage is at the FIB update-table which is one step ahead of the FIB packet-table. This insures that the FIB forwarding is intact as mentioned in Section 3 bullet c. Table 6 shows the update-table after next-hop resolution. Lu, et al. Expires April 21, 2011 [Page 8] Internet-Draft Control Forwarding Data Reduction October 2010 +-----+--------+----+--------+ | Dst | parent | nh | child | +-----+--------+----+--------+ | B | x | x | C | | D | y | y | | | C | B | x | rt1,.. | | rt1 | C | x | | | rt2 | C | x | | | ... | | | | | rtn | C | x | | +-----+--------+----+--------+ Table 6: Sample FIB update-table: nh resolved Table 6 then is used to update the packet-table, which is shown in Table 7. +-----+----+ | Dst | nh | +-----+----+ | B | x | | D | y | | C | x | | rt1 | x | | rt2 | x | | ... | | | rtn | x | +-----+----+ Table 7: Sample FIB packet-table When there is topology change, say link x breaks, only the entry of B is updated and downloaded. This updated message to FIB looks like Table 8. +-----+----+ | Dst | nh | +-----+----+ | B | D | +-----+----+ Table 8: Update to FIB The updating processor will receive this message and update the corresponding rows in Table 6 in following steps: a. B is removed from its parent's child list, if any; Lu, et al. Expires April 21, 2011 [Page 9] Internet-Draft Control Forwarding Data Reduction October 2010 b. B's parent is changed to D; c. B's nh is changed to y which is inherited from D; d. B is added to D's child list; e. B's child C is updated to have y as nh; f. C's children rt1, rt2, ... are updated to also have y as nh; g. All modified (nh field) entries are copied to overwrite their counterparts in packet-table; Table 9 shows the update-table after the change. +-----+--------+----+---------+ | Dst | parent | nh | child | +-----+--------+----+---------+ | B | D | y | C | | D | y | y | B | | C | B | y | rt1,... | | rt1 | C | y | | | rt2 | C | y | | | ... | | | | | rtn | C | y | | +-----+--------+----+---------+ Table 9: Sample FIB update-table: after topology change Table 10 shows the packet-table after the change. +-----+----+ | Dst | nh | +-----+----+ | B | y | | D | y | | C | y | | rt1 | y | | rt2 | y | | ... | | | rtn | y | +-----+----+ Table 10: Sample FIB packet-table: after topology change Note that the total touched entries are B, C, rt1, rt2, ..., rtn. The number of changes is identical to that of the traditional approach. Although there is some extra cost in processing Table 9 Lu, et al. Expires April 21, 2011 [Page 10] Internet-Draft Control Forwarding Data Reduction October 2010 which is relocated from the IGP process space, the tradeoff is the route change aggregation. The scope of the routing table changes is contained until the last mile of the control path. 3.3. Pseudo Code The conventional routing updates are illustrated in Figure 2. 1 while (the candidate node tree (or cdt-tree) is not empty) { 2 pick the node with lowest cost from the computing router; 3 move this node from the cdt-tree to the final path tree; 4 inherit the parent node's next-hop as this node's next-hop; 5 for-each-child-node { 6 set child-node's next-hop to this node; 7 add the child-node to cdt-tree; 8 } 9 } Figure 2: Routing Table Operation The proposed method will omit line 4. The related FIB modification only applies to the update-table and is shown in Figure 3. Lu, et al. Expires April 21, 2011 [Page 11] Internet-Draft Control Forwarding Data Reduction October 2010 1 while (fib_queue_not_empty) { 2 if (entry == (new || changed)) { 3 copy the entry onto update_table; 3 mark the entry CHANGED; 4 } 5 } 1 for each entry CHANGED { 2 if (entry.parent != local_intf) { 3 entry.nh = parent_next_hop(entry.parent); 4 add_to_child_list(entry.parent, entry); 5 } else { 6 entry.nh = entry.parent; 7 } 8 9 for each child node (entry) { 10 update_child_nh(child, entry.nh); 11 } 12 } Figure 3: Update-Table Operation 4. Summary The proposed method purposes to move the burden of updating topology changes from the inter-plane communication to the intra-plane processing. 4.1. Analogy The concept is analogous to the evolution of the IGP routing protocols. Earlier the distance vector protocol such as RIP [RFC2453] prevails. Routers can directly conclude best routes from the protocol messages. Before long, when the topology scales up, the distance vector protocol is replaced by the link-state protocols such as OSPF [RFC2328] and IS-IS [RFC1195][ISO.10589.1992]. The key advantage of the link-state protocols is that the protocol messages only convey the local change information. And it is the responsibility of every computing router to compute its best routes based on the information it receives. The reduced protocol messages greatly improves the overall protocol response time, convergence speed and ability to scale. 4.2. Trade-off If we perceive the control plane and the forwarding plane as two individual routing units, what this memo recommends is to convey the Lu, et al. Expires April 21, 2011 [Page 12] Internet-Draft Control Forwarding Data Reduction October 2010 "link-state" based routing updates from one unit to the other. This greatly reduces the control traffic volume between the two units and all aforementioned downloading bottlenecks become a non-issue. The trade-off is that the data plane will have to perform extra nexthop updates. We believe that this trade-off is worthy and advantageous especially with the nowadays' CPU power. 5. Acknowledgements TBD 6. IANA Considerations This memo includes no request to IANA. 7. Security Considerations There are no specific security considerations within the scope of this document. 8. References 8.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 8.2. Informative References [ISO.10589.1992] International Organization for Standardization, "Intermediate system to intermediate system intra-domain- routing routine information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode Network Service (ISO 8473)", ISO Standard 10589, 1992. [RFC1195] Callon, R., "Use of OSI IS-IS for routing in TCP/IP and dual environments", RFC 1195, December 1990. [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998. [RFC2453] Malkin, G., "RIP Version 2", STD 56, RFC 2453, Lu, et al. Expires April 21, 2011 [Page 13] Internet-Draft Control Forwarding Data Reduction October 2010 November 1998. [Wright94] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2: The Implementation", 1994. Authors' Addresses Wenhu Lu Ericsson 300 Holger Way San Jose, California 95134 USA Phone: 408 750-5436 Email: wenhu.lu@ericsson.com Prashanth Venugopal Ericsson 300 Holger Way San Jose, California 95134 USA Phone: 408 750-5510 Email: Prashanth.Venugopal@ericsson.com Ing-Wher Chen Ericsson 300 Holger Way San Jose, California 95134 USA Phone: 408 750-5648 Email: ing-wher.chen@ericsson.com Lu, et al. Expires April 21, 2011 [Page 14]