idnits 2.17.1 

draft-ashwood-sdnrg-state-reduction-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The document has examples using IPv4 documentation addresses according
     to RFC6890, but does not use any IPv6 documentation addresses.  Maybe
     there should be IPv6 examples, too?


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 03, 2013) is 3949 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1	Internet Research Task Force                         P. Ashwood-Smith
2	Internet Draft                                                 Huawei
3	Intended status: Informational                             M. Soliman
4	Expires: February 03, 2014                        Carleton University
5	                                                               T. Wan
6	                                                               Huawei
7	                                                        July 03, 2013

9	                           SDN State Reduction

11	               draft-ashwood-sdnrg-state-reduction-00.txt

13	Abstract

15	   This document makes the argument that to support the centralized
16	   control of a substantial number of forwarding devices (as Software
17	   Defined Networking (SDN) proposes) that the scale, speed, cost and
18	   general quality of such a solution will be improved by reducing
19	   the state needed to be distributed into the network of devices by
20	   the controller(s). To this end we re-visit forms of Source Routing
21	   (SR), in particular Strict Link Source Routing (SLSR) and suggest
22	   that light weight SLSR could allow substantial reduction in
23	   controller burden while potentially reducing the costs/complexity
24	   on forwarding devices. We discuss some simulation results that
25	   demonstrate these advantages and how the advantages grow
26	   substantially as the network diameter grows. We also look at
27	   various implementation possibilities including existing IPV4, V6,
28	   MPLS, new/modified MPLS vs. something brand new that could
29	   possibly be implemented with new SDN technology like Protocol
30	   Oblivious Forwarding-POF.

32	Status of this Memo

34	   This Internet-Draft is submitted in full conformance with the
35	   provisions of BCP 78 and BCP 79.

37	   Internet-Drafts are working documents of the Internet Engineering
38	   Task Force (IETF), its areas, and its working groups.  Note that
39	   other groups may also distribute working documents as Internet-
40	   Drafts.

42	   Internet-Drafts are draft documents valid for a maximum of six
43	   months and may be updated, replaced, or obsoleted by other
44	   documents at any time.  It is inappropriate to use Internet-Drafts
45	   as reference material or to cite them other than as "work in
46	   progress."

48	   The list of current Internet-Drafts can be accessed at
49	   http://www.ietf.org/ietf/1id-abstracts.txt
50	   The list of Internet-Draft Shadow Directories can be accessed at
51	   http://www.ietf.org/shadow.html

53	   This Internet-Draft will expire on December 3, 2012.

55	Copyright Notice

57	   Copyright (c) 2013 IETF Trust and the persons identified as the
58	   document authors. All rights reserved.

60	   This document is subject to BCP 78 and the IETF Trust's Legal
61	   Provisions Relating to IETF Documents
62	   (http://trustee.ietf.org/license-info) in effect on the date of
63	   publication of this document. Please review these documents
64	   carefully, as they describe your rights and restrictions with
65	   respect to this document.

67	Table of Contents

69	   1. Terminology                                                   3
70	   2. Introduction                                                  3
71	   3. Logical Example                                               5
72	   4. Expressing a Path                                             6
73	   5. Computing a Path                                              7
74	   6. Downloading Forwarding State                                  8
75	   7. Logically Forwarding SLSR                                     10
76	      7.1. Ingress Logical Unicast Forwarding                       10
77	      7.2. Tandem Logical Unicast Forwarding                        11
78	      7.3. Egress Logical Unicast Forwarding                        12
79	   8. Logical Multicast Forwarding SLSR Packets                     13
80	   9. Failure Recovery                                              14
81	   10. Comparison of Logical Model to Existing Source Routing       15
82	      10.1. MPLS as a SLSR                                          15
83	      10.2. IPV4/6 Options as SLSR                                  18
84	      10.3. Protocol Oblivious Forwarding as SLSR mechanism         19
85	   11. Security Considerations                                      20
86	   12. Conclusions and Future work                                  21
87	   13. IANA Considerations                                          21
88	   14. References                                                   21
89	      14.1. Informative References                                  21
90	   15. Authors' Addresses                                           23
91	   16. Contributors                                                 23
92	   17. Acknowledgements                                             23

94	1. Terminology

96	   ATM         Asynchronous Transfer Mode (a cell based network)
97	   BGP         Boarder Gateway Protocol
98	   CSPF        Constrained Shortest Path First
99	   DOS         Denial of Service (attack)
100	   ECMP        Equal Cost Multi Path
101	   flow        Logically related packets following the same path
102	   IS-IS       Intermediate System to Intermediate System
103	   LACP        Link Aggregation Control Protocol
104	   LAG         Link Aggregation
105	   Loose       A source route that enumerates only some of all hops
106	   MPLS        Multi Protocol Label Switching
107	   MPLS-TE     MPLS Traffic Engineering.
108	   NPU         Network Processor Unit (programmable forwarding)
109	   OpenFlow    Open data path programming protocol
110	   OSPF        Open Shortest Path First
111	   PCE         Path Computation Element (used with MPLS-TE)
112	   PNNI        Private Network to Network Interface (link state ATM)
113	   POF         Protocol Oblivious Forwarding - more generic OpenFlow)
114	   RSVP-TE     Resource Reservation Protocol - Traffic Engineering
115	   SDN         Software Defined Networking (as per [OPENFLOW])
116	   SDN-domain  Set of forwarding devices controlled as a unit.
117	   SR          Source Routing - enumerating hops to traverse
118	   SLSR        Strict Link Source Routing - enumerating links [SLSR]
119	   SPF         Shortest Path First - (Dijkstra etc.)
120	   SSRR        Strict Source Record Route - IPV4 header option 9
121	   Strict      Source route that enumerates every hop(unlike Loose)
122	   TE          Traffic Engineering
123	   VFI         Virtual Forwarding Instance (layer 2)
124	   VPLS        Virtual Private Lan Service
125	   VRF         Virtual Routing and Forwarding (layer 3)

127	2. Introduction

129	   The centralized control of a network is not a new idea. Indeed
130	   centralized control was widely deployed in voice networks and some
131	   early data networks but of course gave way to distributed control
132	   for IP.

134	   Centralized computation is however still widely used for traffic
135	   engineered networks, like MPLS-TE and GMPLS where a Path
136	   Computation Engine (PCE) makes use of a global view of a sub-
137	   network and its resource usage for the planning of new paths and
138	   their resources. The data path state distribution with these
139	   models is however not initiated centrally and relies on protocols
140	   like RSVP-TE to install the hop by hop state. In fact this form
141	   of distributed control with centralized traffic engineering
142	   computations is the norm today.

144	   Notwithstanding the massive deployment of this kind of hybrid
145	   distributed/central control, we have in the last several years
146	   seen a huge resurgence of interest in fully centralized control of
147	   at least a set of forwarding devices [ONF] [OPENFLOW] with
148	   Software Defined Networking (SDN). This SDN proposes a central
149	   controller (or controllers) using IP protocols such as TCP to talk
150	   to a set of arbitrarily interconnected (and cheap/dumb) forwarding
151	   devices (SDN-domain) and which is responsible for the
152	   configuration of the majority of forwarding state on those
153	   devices. This state may be produced either as a result of pro-
154	   active configuration, or based on re-active responses to packet
155	   flow indications from the forwarding devices themselves.

157	   Since this central controller has knowledge of the entire sub-
158	   network of devices, and potentially of the traffic demands
159	   into/out of the sub-network, it can perform a variety of path
160	   optimization computations similar to CSPF/MPLS-TE/PCE/GMPLS, or
161	   even more elaborate forms of optimization (trading flows against
162	   each other rather than individually optimizing them, exploiting
163	   quiet areas of the network to offload busy areas etc), the output
164	   of which is forwarding state for all meta flows in the entire sub
165	   network of devices and a sub network which more optimally meets
166	   the desired local constraints. One such deployment reports a
167	   substantial increase in network utilization from 30% to 70%-90%
168	   [SDNGOOG].

170	   A central controller can also more effectively solve problems such
171	   as bin-packing and path blocking [SDNGOOG], which occur when flows
172	   are optimized individually with greedy type algorithms rather than
173	   considering other orderings of the flows. The finer grained
174	   ability to place traffic can also permit much more detailed
175	   placement of traffic after a failure, including traffic not
176	   directly affected by the failure but the replacement of which is
177	   critical to achieving fair/efficient use of the remaining
178	   bandwidth subsequent to the failure.

180	   Since the output of the controller is much closer to a TE (Traffic
181	   Engineered) type solution from a PCE (Path Control Element) than
182	   an SPF (Shortest Path First) solution the controller cannot simply
183	   install destination based forwarding entries. A controller either
184	   needs to install tunnels that follow the explicit routes it wishes
185	   and then map traffic to those tunnels at the edges, or it must
186	   install n-tuple < <source IP> <destination IP> <source port>
187	   <destination port> etc.> state and configure these n-tuple matches
188	   on every hop along the desired path. Packets which fail to match
189	   an n-tuple are either discarded or sent to the controller.

191	   In the normal case of SDN (as given in [OPENFLOW]) the controller
192	   is required to send configuration information to all devices along
193	   the path from ingress of the SDN-domain of this controller to the
194	   egress of that SDN-domain, alternatively a tunnel setup protocol
195	   like RSVP-TE is required to be triggered to distribute the per hop
196	   state between the ingress and egress.

198	   This draft proposes that since the controller knows the exact end-
199	   to-end path (down to the level of the links it wishes the packets
200	   to traverse) and that the diameter of an SDN-domain is likely to
201	   be a reasonable number of hops, that the controller should instead
202	   simply insert into a header the exact links it wishes the packet
203	   to traverse and thereby not have to deal either with per hop n-
204	   tuple state installation (very expensive) or with MPLS tunnel
205	   installation via RSVP-TE(complex). Such a mechanism also
206	   eliminates any concerns about Equal Cost Multi Path (ECMP) and/or
207	   Link Aggregation (LAG) as the controller can place traffic on
208	   exact links.

210	   Operations, Administration and Management (OAM) is also greatly
211	   simplified since data packets will flow on invariant paths that
212	   are known by both ends of the flow and can be the same as any OAM
213	   packets that probe the flow. This OAM "fate sharing" property is
214	   widely valued by network operators and considerable effort has
215	   already been expended to permit similar fate sharing between OAM
216	   and data paths with other carrier scale networking protocols such
217	   as 802.1ag and MPLS-TP. Of course if a controller does not wish to
218	   enforce symmetry and congruence it need not.

220	3. Logical Example

222	   The following is an example of an idealized strict link based
223	   source routing (SLSR) forwarding. We talk about possible
224	   implementations including MPLS methods after looking at the
225	   logical ideal.

227	   Consider the simple 7 node network shown in Figure 1 below. Here
228	   the nodes are named {A, B, C, D, E, F, G} and where each node has
229	   locally numbered interfaces named {1, 2, 3, 4, 5, 6}.

231	   For example node A has interfaces named 1, 2, 3, 4 and where
232	   interfaces 4 and 2 both go to node B. Node B has local interfaces
233	   1, 2, 3, 4 and 5 but the two interfaces going back to node A are
234	   locally named 1 and 3. Clearly node interface names are likely
235	   (but not necessarily) different at both ends of a link.

237	                     +----+           +----+
238	                     |    | 4 ----- 3 |    |
239	                     | A  | 2 ----- 1 | B  |
240	                     +----+           +----+
241	                    1      3         4      2 5
242	                   /        \       /        \ \
243	                  3          4     3          2 3
244	                 +----+        +----+          +----+
245	                 | C  | 2----1 | D  | 2------1 | E  |
246	                 +----+        +----+          +----+
247	                  1          6     5          4
248	                   \        /       \        /
249	                    3      2         1      2
250	                     +----+           +----+
251	                     | F  | 1-------3 | G  |
252	                     +----+           +----+

254	    Figure 1 - simple 7 node network with local link identifiers.

256	4. Expressing a Path

258	   A path through a network labeled as per Figure 1 can clearly be
259	   expressed as a sequence of link names (an SLSR).

261	   For example, between nodes C and E the following are all valid
262	   paths.

264	     C.3 -> A.2 -> B.2
265	     C.2 -> D.2
266	     C.1 -> F.2 -> D.2
267	     C.3 -> A.4 -> B.5

269	   Now since the links lead unambiguously to a known node, the paths
270	   can be more compactly expressed without the node names as follows:

272	     {3,2,2}
273	     {2,2}
274	     {1,2,2}
275	     {3,4,5}

277	   As long as we know the origin of the path (in this case node C),
278	   the list of link names unambiguously identifies a path and an
279	   egress point. In addition it identifies unambiguously which link
280	   from among parallel links between neighbors should be traversed.
281	   Of course it is possible to give a name to the set of links that
282	   all attach to the same neighbor and thereby leave the exact link
283	   in that path deliberately ambiguous and thereby subject to a local
284	   forwarding decision as to exactly which link in the set to follow.

286	   Each path of course also has exactly one perfectly symmetric
287	   reverse. Note that the symmetric reverse path is not simply the
288	   same list of link names in reverse order. A reverse path has to be
289	   specified from the opposite end of the path so in this example the
290	   origin has to be E.

292	   The forward and corresponding reverse paths are therefore.

294	          C->E        E->C

296	        {3,2,2}      {2,1,1}
297	        {2,2}        {1,1}
298	        {1,2,2}      {1,6,3}
299	        {3,4,5}      {3,3,1}

301	   Various very efficient encodings of these kinds of paths in source
302	   routed headers are possible. Even a simple encoding using 8 bits
303	   per hop can encode every path in a large 8 hop network with fewer
304	   bits than an IP in IP tunnel.

306	5. Computing a Path

308	   It should be obvious that the output of any graph based
309	   computation which has as its goal various optimization criteria
310	   for flows can express its results as a series of such paths where
311	   each path is expressed as a Strict Link based Source Route (SLSR).
312	   This includes multiple different metric Dijkstra computations
313	   (i.e. shortest path, multi topology shortest path), CSPF type and
314	   of course more elaborate linear-programming or other convex type
315	   optimizations.

317	   The expression of the path as an SLSR imposes no constraints on
318	   the type of computation being performed except possibly in path
319	   length. However in any real network under the control of a single
320	   controller it is not likely that path length would be a real issue
321	   unless perhaps unreasonably large link names are encoded.

323	   Convex and linear-programming type solutions to traffic placement
324	   are of particular interest because to do a good job they must
325	   exploit a considerable number of paths through a network (many
326	   more than shortest). These algorithms take the matrix of
327	   ingress/egress flows in a network together with all the usable
328	   paths between all sources and destinations and will assign
329	   percentages of the ingress/egress flows to the available paths in
330	   ratios that can optimize a number of simultaneous constraints. For
331	   example they can optimize the network's total throughput, the
332	   average link utilization, the fairness of the bandwidth available
333	   to each flow and can even optimize different linear and non linear
334	   combinations of those goals. What is interesting about all of
335	   these kinds of optimizations however is that they need access to
336	   all of the reasonable paths across the network since it is by
337	   making trade-offs between busy and less busy parts of the network
338	   that they achieve their goals. Unfortunately the number of paths
339	   (shortest or otherwise) in a network grows exponentially with
340	   network size and with it the state distribution problem (or
341	   burden) the controller must deal with.

343	   It is important to make the distinction between a flow and a path.
344	   This draft concerns itself not with the immense numbers of micro
345	   flows but with the very large numbers of paths required to be
346	   supported onto which those micro flows are then aggregated. A set
347	   of micro flows can be treated as a single flow, and a single flow
348	   has a unique path through the network.

350	6. Downloading Forwarding State

352	   A controller likely takes as input the fields that identify the
353	   flow and its various statistical attributes. The controller then
354	   likely computes an end to end path for this flow either based on
355	   the single flow's attributes (in a re-active manner), or on more
356	   global knowledge of multiple flow attributes (in a pro-active
357	   manner). Flows may be meta (many micro flows) or individual micro
358	   flows depending on the implementation and its scale. The output of
359	   course is just a list of links that must be traversed for this
360	   flow together with matching rules to identify the flow at the
361	   ingress.

363	   The controller then delivers the flow matching rules and the
364	   Strict Link Based Source Route to the *single* node where the flow
365	   is to be encapsulated (i.e. where the flow first enters the SDN-
366	   domain).

368	   The fact of only having to communicate with the *single* node at
369	   the head end of the path means that the controller experiences a
370	   reduction in its work load directly proportional to the number of
371	   hops in the path (as compared to traditional SDN which must
372	   program every hop along the path).

374	   Intuitively this translates to the following I/O burden reduction
375	   at the controller based on the number of links that must be
376	   traversed per average path.

378	          +----------------+-------------------------+
379	          | #Avg Path Len  |  % I/O Burden Reduction |
380	          +----------------+-------------------------+
381	          | 1              |  0%                     |
382	          | 2              |  50%                    |
383	          | 3              |  66%                    |
384	          | 4              |  75%                    |
385	          | 5              |  80%                    |
386	          | ..             |  ..                     |
387	          | N              |  (100-100/n) %          |
388	          +----------------+-------------------------+

390	   Since forwarding state download is typically a substantial part of
391	   a "normal" routers' re-convergence time, it seems reasonable that
392	   this will become a similar bottleneck for a central controller and
393	   quite possibly be further aggravated by the increased delays and
394	   larger amount of state that the central device must deal with.

396	   As a result this reduction in state and I/O burden should have a
397	   marked impact on convergence times assuming there are appropriate
398	   forwarding mechanisms that can implement the Strict Link Source
399	   Route (SLSR). Note also that the position of the controller
400	   relative to the ingress/egress nodes is now more important than
401	   its position relative to all nodes. Therefore studies as to
402	   controller optimum placement as defined by the Controller
403	   Placement Problem [PLACEMENT] would require different optimization
404	   goals.

406	   An additional 50% reduction can also be obtained should the
407	   implementation of the forwarding be able to reverse the path on
408	   the fly. Such a reversal permits the implicit communication of the
409	   desired reverse path to the receiver thereby eliminating
410	   communication with the controller to obtain a reverse path. Of
411	   course if symmetry is not desired this further optimization is not
412	   possible.

414	   For example, consider a network with 1000 nodes. It therefore has
415	   O(1,000,000) meta flows and assuming 10 possible paths for each
416	   flow has O(10,000,000) ingress forwarding entries that must be
417	   centrally configured (its burden). If each path on average takes 5
418	   hops then the burden on the controller grows 5 fold to
419	   O(50,000,000) entries but with SLSR the burden remains at
420	   O(10,000,000). If path reversal is supported and symmetric routing
421	   is desired then the burden with SLSR drops further to
422	   O(5,000,000).

424	   Simulations done by one of us in [SRSDN] provide additional weight
425	   to the above arguments. In particular we simulated for various
426	   network sizes and diameters the differences between hop by hop SDN
427	   and SLSR and saw up to 3 x performance improvements in convergence
428	   times with SLSR. There were also a number of other benefits such
429	   as a markedly reduced standard deviation in convergence times for
430	   the different nodes (81% decreases) and a significantly reduced
431	   sensitivity to the placement of the controller (80% reduction in
432	   standard deviation). The performance improvements can perhaps
433	   better be understood by an analogy comparing the work required to
434	   fill in the area of an object (traditional SDN) vs. simply drawing
435	   the circumference of that object (SLSR). Since the circumference
436	   varies as a function of the diameter but the area varies as a
437	   function of the diameter squared the relative burden reduction
438	   with just dealing with the circumference (the edge of the network)
439	   becomes apparent. In fact in this simulation study we varied the
440	   radius and then plotted the relative convergence times of SLSR and
441	   traditional hop by hop forwarded SDN and saw a ratio of
442	   convergence times as a function of radius that indicated a trend
443	   towards 1/R as expected. Simply stated, the bigger the centrally
444	   controlled network the better source routing performs compared to
445	   hop by hop.

447	7. Logically Forwarding SLSR

449	   There are three distinct phases to be performed to logically
450	   forward unicast SLSR. These are similar to any tunnel technology
451	   and consist of 1) Ingress Encapsulation, 2) Tandem Forwarding, and
452	   3) Egress Decapsulation and Forwarding. We address the generic
453	   concepts first before looking at possible existing or new
454	   encapsulations and their applicability.

456	   Multicast SLSR is also possible (but with limitations to keep the
457	   header sizes from growing too large) and is briefly discussed
458	   after unicast.

460	7.1. Ingress Logical Unicast Forwarding

462	   Here the flow information, likely IP header(s) + UDP/TCP header(s)
463	   is looked up and a sequence of link identifiers and a current hop
464	   must be placed on the packet, the packet must then be forwarded to
465	   the first of those links. This operation is identical to almost
466	   every tunnel protocol except that IP ECMP and/or LAG hash would
467	   potentially be unnecessary because the first link name would often
468	   resolve to a physical link not a LAG bundle. For example:

470	      +----------+----------+---------+--------+--------+
471	      |SrcIP     |  DstIP   | SrcPrt  | DstPrt |   SLSR |
472	      +----------+----------+---------+--------+--------+
473	      |192.0.2.4 |192.0.2.9 |  1000   |  98    |{3,2,5} |
474	      |192.0.2.4 |192.0.2.9 |  1001   |  99    |{3,4,5} |
475	      +----------+----------+---------+--------+--------+

477	   Of course nothing precludes the use of LAG and the link identifier
478	   therefore identifying an entire LAG bundle rather than an element
479	   of that LAG. In fact it is possible to simultaneously support both
480	   concepts so that some traffic can be forwarded to the entire LAG
481	   while other traffic could be placed on a particular LAG bundle
482	   member at the discretion of the central computation.

484	 7.2. Tandem Logical Unicast Forwarding

486	   At tandem devices the operation would start by incrementing the
487	   current hop in the packet header (shown with a ^ symbol) and then
488	   forward to the link identified in the new current hop. If we
489	   support reversal, we change the previous link name to the local
490	   link name for that link. For example, referring to Figure 1 (and
491	   disregarding non relevant headers/options) after matching the
492	   first flow tuple at the ingress node C the packet is encapsulated
493	   with the SLSR header {3,2,5} and then leaves node C on interface 3
494	   toward A. Then:

496	     Packet arrives at node A on local interface 1 where it looks
497	     like this:

499	        +--------------------------------------~~~------------+
500	        | 3 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
501	        +-^------------------------------------~~~------------+

503	     Current hop is incremented while previous hop is changed to
504	     local interface name (3 changes to a 1).

506	        +--------------------------------------~~~------------+
507	        | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
508	        +-----^--------------------------------~~~------------+

510	     Packet is forwarded to interface for current hop i.e. 2.

512	     Packet arrives at node B on local interface 1.

514	        +--------------------------------------~~~------------+
515	        | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
516	        +-----^--------------------------------~~~------------+

518	     Current hop is incremented while previous hop is changed to
519	     local interface name 1 (2 changes to a 1).

521	        +--------------------------------------~~~------------+
522	        | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
523	        +---------^----------------------------~~~------------+

525	     Packet is forwarded to interface for current hop i.e. 5.

527	     Packet arrives at node E on local interface 3.

529	        +--------------------------------------~~~------------+
530	        | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
531	        +---------^----------------------------~~~------------+

533	     Current hop is incremented while previous hop is changed to
534	     local interface name (5 changes to 3).

536	        +--------------------------------------~~~------------+
537	        | 1 | 1 | 3 | 192.0.2.9 | 192.0.2.4 |..   <payload>   |
538	        +-----------^--------------------------~~~------------+

540	     We are at the end of the path, so egress processing begins.

542	   One additional step not described above is a reverse path check.
543	   Prior to substituting the reverse link identifier into the SLSR
544	   header, the link identifier from the neighbor can be validated and
545	   the packet discarded if the neighbor link identifier in the packet
546	   is incorrect for the port the packet arrived on. This would reduce
547	   the chances of mis-delivery of the packet should a link identifier
548	   change or a link destination change while a packet is in flight.

550	7.3. Egress Logical Unicast Forwarding

552	   Here the operation consists of normal IP/Ethernet etc. forwarding
553	   based on the IP destination / MAC or other ACL rules. Basically
554	   the SLSR header is stripped and the packet is submitted to the
555	   Virtual Forwarding Instance, or Virtual Forwarding Function (VFI
556	   or VRF) for further processing.

558	   Optionally the link identifier from the neighbor can be validated
559	   against what is expected and the packet discarded in the case of a
560	   mismatch. This reduces the chance of mis-delivery as in the tandem
561	   case.

563	   In addition, if path reversal is supported, the reverse path is
564	   compared against the current reverse path for this reverse flow
565	   and if it has changed the local forwarding state for the reverse
566	   flow would be updated. This would allow the head end to always
567	   dictate the forward and reverse path to be used for all packets in
568	   the flow without involving the controller on the egress side (and
569	   of course not needing to communicate with any tandem device).

571	   Processing the reverse flow/path in this manner means that a flow
572	   is already present for the reverse direction without having to re-
573	   actively or pro-actively consult the controller. This results in a
574	   further 50% reduction in controller load. In the case of asymmetry
575	   this optimization is of course not possible.

577	8. Logical Multicast Forwarding SLSR Packets

579	   Multicasting packets usually involve one of two approaches.

581	   The first approach simply re-uses unicast and sends multiple
582	   copies to a pre-determined list of receivers. There is little to
583	   discuss with this approach as we can replicate SLSR based unicast
584	   packets just as easily as any other tunneling mechanism. Clearly
585	   such a serial unicast approach has nearly identical bandwidth
586	   overhead as other protocols like VPLS which also use this serial
587	   unicast mechanism.

589	   It is therefore interesting to look at more efficient methods that
590	   involve the second multicast mechanism, which uses replication
591	   points in the network. These replication points are chosen so that
592	   copies are more efficiently made thereby eliminating multiple
593	   copies of the packet traversing any given link. Various logical
594	   tree structures are usually involved e.g. STP, SPB, TRILL, PIM,
595	   MOSPF etc.

597	   These tree based mechanisms could in theory be implemented without
598	   requiring tandem state as an SLSR by introducing a branch point
599	   concept into the list of indexes. In this manner a complete tree
600	   as a pre-order traversal could be encoded along with the packet
601	   payload. It is not difficult to define a variety of different
602	   encodings that would accomplish this. The obvious objection to
603	   such a scheme is the sheer size of header required especially
604	   where a large network with many multicast receivers is concerned.
605	   It is therefore unlikely to be practical to encode any large tree
606	   of receivers and the SLSRs between them in any single header.

608	   This leads to a hybrid approach which would encode a subset of the
609	   tree, say a single replication point and 5 or so recipients. This
610	   little tree or 'tree-let', would efficiently get a single packet
611	   to 5 (or some suitably small number) of recipients with an SLSR to
612	   the replication point and then SLSRs to each of the receivers.
613	   Such an encoding is much more reasonable than trying to encode all
614	   receivers and all replication points of a single tree in one
615	   packet. However, since this one packet would not reach all
616	   receivers, the head end would have to generate as many copies of
617	   the data packets as necessary to cover all recipients. As a result
618	   this approach would be a compromise between a full tree, and full
619	   head end replication. Variations in the size of the 'tree-let'
620	   header would allow for space v.s. bandwidth efficiency trade-offs
621	   while meeting the goal of remaining stateless in the core.

623	   In the literature there are also non-exact methods to multicast
624	   without state such as with Bloom filters [BLOOM]. In this approach
625	   the links to be traversed are logically mapped into a field which
626	   is carried in the packet (for example if the links are given
627	   unique 128 bit sparse addresses then a 128 bit union of all the
628	   links to be traversed on the tree is encoded in the header). These
629	   mechanisms guarantee that all receivers will get a copy of the
630	   packet (because they check each link for inclusion in the Bloom
631	   Filter at each hop) however they do so at the expense of sending
632	   false positive copies to unintended receivers which must then
633	   filter the unwanted packets egress. Depending on the size of the
634	   Bloom Filter and the link identifiers various statistical trade-
635	   offs in false positives vs. packet header size can be made.

637	   Other exact methods to encode and methods to compute SLSR
638	   multicast etc. are FFS.

640	9. Failure Recovery

642	   A variety of failure recovery techniques can be employed with
643	   SLSR. The most obvious is to just re-compute all affected paths on
644	   indication of a link failure. This won't be discussed further.

646	   More interesting are the so called fast restoration mechanisms.
647	   These can broadly be broken down into head end and tandem
648	   restoration.

650	   Head end mechanisms that provide 1+1 protection have been around
651	   for a long time with MPLS-TP, PBT and SONET/DWDM. Similar
652	   mechanisms can be used with any tunnel type and of course SLSR is
653	   no exception. Probes can be sent down one source route, reflected
654	   back along the reverse source route and in this manner the forward
655	   and reverse paths can be simultaneously probed for failure. In the
656	   event of a failure a diverse alternate source route can rapidly be
657	   added to the packet and the flow restored. The advantage of course
658	   with SLSR is that no state is required for either the primary or
659	   the backup path. As a result there is little added cost to having
660	   even greater redundancy than 1+1 with SLSR. The mechanisms to
661	   accomplish this are fairly obvious. Having the reverse path
662	   available at the egress means that fate sharing the forward and
663	   reverse probes is easy.

665	   In addition to 1+1 protection it is possible to do hop by hop fast
666	   reroute type detour protection. This can be done by substitution
667	   of a failed link identifier with a set of link identifiers that
668	   merge with the path downstream of the failure. An example is given
669	   further below for MPLS label stacks, however many other
670	   possibilities exist when a history of the packet's path is
671	   available to the detour mechanism. The history would permit the
672	   detour mechanism to spread the failing packets over different
673	   detours and thereby reduce the concentration of additional load
674	   imposed by the failure on the same set of links.

676	10. Comparison of Logical Model to Existing Source Routing

678	   There are a number of existing protocols that support forms of
679	   source routing (or can be used to do something close to source
680	   routing). IPV4 and V6 had strict and loose node-by-node source
681	   routing options (now deprecated) and we'll discuss them briefly.
682	   Likewise MPLS behavior can be used to do strict link source
683	   routing where a label stack represents a list of link names, this
684	   has recently been called segment routing in [SEGMENT].

686	 10.1. MPLS as a SLSR

688	   MPLS is of course not a source routed forwarding protocol, at
689	   least not by design. Rather, packets follow an arbitrary path by
690	   substitution of a previous hop label with a next hop label and
691	   each hop must be pre-configured with the <incoming port, label> to
692	   <outgoing port, label> relationship. This is clearly not source
693	   routing because tandem configuration is required per path and per
694	   hop. However MPLS has a stacking mechanism that can be exploited
695	   to create a consumable list of link names to be traversed as they
696	   are popped.

698	   The MPLS label stack can therefore be used to implement a flavor
699	   of SLSR. This is accomplished by pre-assigning a locally unique
700	   MPLS label to each outgoing link of a node. For example in figure
701	   1, node D's link 3 would be assigned MPLS label 3 (but more likely
702	   a label value which is 1:1 related to link 3, however we stick
703	   with label=link for simplicity of explanation).

705	   The tunnel encapsulation operation would therefore be to push a
706	   set of labels onto the frame where each label indicates which link
707	   to follow at that given exact strict hop. For example:

709	   +----------+----------+--------+--------+-----------------------+
710	   |SrcIP     |  DstIP   | SrcPrt | DstPrt | MPLS SLSR             |
711	   +----------+----------+--------+--------+-----------------------+
712	   |192.0.2.4 |192.0.2.9 | 1000   | 98     |push(3,push(2,push(5)))|
713	   |192.0.2.4 |192.0.2.9 | 1001   | 99     |push(3,push(4,push(5)))|
714	   +----------+----------+--------+--------+-----------------------+

716	   The tunnel tandem operation would then be to pop the label on the
717	   incoming frame (after optionally validating its reverse link
718	   identifier) and forward to the interface specified by the just
719	   popped label value. Every tandem node would be pre-configured
720	   approximately as per below. Note that as with any source routing
721	   mechanism, this tandem pre-configuration is independent of the
722	   actual paths that traverse the node. A table like the one below,
723	   with a few hundred interfaces and hence a few hundred labels,
724	   could support the transit of an infinite number of TE (or SPF)
725	   paths. For clarity we use label N = interface/N but in reality it
726	   would be label N = F(interface/N) since a 1:1 mapping 'F' is
727	   almost certainly required.

729	   +-------------------+-------+--------------------------------+
730	   |Incoming Interface | Label |    Actions                     |
731	   +-------------------+-------+--------------------------------+
732	   | any               |   1   |  Pop, forward to interface/1   |
733	   | any               |   2   |  Pop, forward to interface/2   |
734	   |  :                |   :   |  :                             |
735	   | any               |   N   |  Pop, forward to interface/N   |
736	   +-------------------+-------+--------------------------------+

738	   If reverse validation is required the tables would be a bit
739	   different because they must match the label to the incoming
740	   interface and then pop it and then forward based on the next
741	   label. Reverse validation therefore requires two label lookups per
742	   forwarding operation.

744	   Finally the tunnel egress operation would be normal forwarding to
745	   a VFI or VRF.

747	   MPLS in this manner could be made to do SLSR of unicast frames but
748	   cannot be made to reverse the route because the route is consumed
749	   in transit. This method also uses many more bits than are really
750	   necessary. Each label consumes 32 bits which is rather more than
751	   required to express the number of links/adjacencies on a typical
752	   switch or router. For example, if the average packet size if 512
753	   bytes, a 5 hop MPLS source route imposes a 4% overhead (20/512) on
754	   some links with the largest overhead on the first few links. For
755	   larger packets this is likely not an issue, for smaller packets it
756	   is possibly a concern.

758	   A more realistic number actually required per hop is probably 8 or
759	   12 bits (256-4K links) and if more bits are required two hops can
760	   be consumed by any node with such a large nodal degree. The MPLS
761	   label also has an 8 bit TTL which is of course redundant in any
762	   source routing mechanism. This begs the question of if a smaller
763	   MPLS label would not be more suitable?

765	   There are other issues with the use of MPLS, in particular current
766	   hardware can usually not stack very many labels at a time (3 on
767	   some popular ASICs). This would limit the network diameter to 4
768	   hops. Of course NPUs or new ASICs could be extended to allow
769	   further ingress stacking.

771	   It does not seem possible to do SLSR multicast with MPLS except of
772	   course via head end replication.

774	   The hop(ingress stack size) limit, lack of reverse, consumable
775	   route and lack of efficient multicast still do not invalidate use
776	   of MPLS source routing for many networks and its use would have a
777	   noticeable positive impact on the scale/speed of a central
778	   controller in such environments.

780	   MPLS fast reroute mechanisms can also be implemented locally in a
781	   similar fashion thus further improving controller scale by
782	   alleviating the need for 50ms responses network wide from the
783	   controller and giving the controller more lee-way to recover after
784	   the fast reroutes have detoured traffic around the failed nodes
785	   and/or links.

787	   Consider possible local actions when the link A.2 between nodes A
788	   and B in Figure 1 fails. Since there is still a link A.4
789	   available, the node A can locally change the action associated
790	   with label 2 to instead send to interface 4 when interface 2
791	   fails. If an entire adjacency fails, such as would happen when
792	   both A.2 and A.4 fail, then a link detour can be locally performed
793	   by reprogramming the actions for labels 2 and 4 to now push labels
794	   3,3 and send to interface 3. This will cause a detour via D back
795	   to B. More elaborate kinds of detour are possible by processing
796	   two link names ahead instead of one, including nodal detours.
797	   These can be done locally without end to end path knowledge and
798	   hence scale independently to the number of paths. Eventually the
799	   controller will detect the failure and reconstruct the SLSRs at
800	   the head end and the use of the detour will stop without having to
801	   withdraw any state in the core.

803	   If MPLS is of use in the context of SLSR then it would be worth
804	   considering a number of future extensions to MPLS. Some things to
805	   consider could be a smaller MPLS label option, say 16 bits with no
806	   TTL and the possibility of not popping but rotating the label to
807	   the bottom of the stack to preserve the path history for OAM and
808	   reversibility reasons. While these sorts of things are of course
809	   not possible with existing ASICs they are easy to do on existing
810	   NPU's and new work on Protocol Oblivious Forwarding [POF] allows
811	   near arbitrary bit pattern/action matches to be programmed by an
812	   SDN controller permitting a more optimal data path encoding of
813	   SLSR than can be obtained by simply reusing MPLS.

815	10.2. IPV4/6 Options as SLSR

817	   IP header option 9 [RFC791] defined (but now deprecated) the
818	   Strict Source and Record Route (SSRR) option for IPV4 packets.
819	   This option has(had) a 'length' field, a 'pointer' field and an
820	   array of 'route data' fields. The element in the array of 'route
821	   data' indexed by the 'pointer' field contains the IP address of
822	   the immediate next hop towards which the packet must be forwarded,
823	   the 'pointer' field is incremented, and the previous hop is filled
824	   in with the IP address of the current device prior to actually
825	   forwarding the packet. Up to 9 hops could be specified in this
826	   manner. IPV6 also had a similar option "RH0" which is also now
827	   deprecated [SRBAD].

829	   IPV4 and V6 Strict Source and Record Route methods could be used
830	   to implement Strict Link Source Routing. This would be
831	   accomplished by assigning a 32 bit number to the link and then
832	   using the 32 bit number in place of the IPV4 or V6 address in the
833	   route list.

835	   In both IPV4 and IPV6 the source routing options were found to be
836	   harmful to the Internet at large for a number of reasons. These
837	   reasons are described in [SRBAD] but briefly there were two broad
838	   classes of problem encountered. 1) Harm to intermediate links and
839	   2) harm to end hosts. For example:

841	   - Since it was possible to list a waypoint more than once in the
842	   route data, it was possible to loop traffic around multiple times
843	   (9 times in the case of IPV4 and 90 times in the case of IPV6).
844	   This looping allowed saturation of high speed links by hosts that
845	   had an order (or two) smaller bandwidth access to the Internet. A
846	   congestion style DOS was therefore possible from low speed access
847	   links against higher speed core links.

849	   - Various schemes such as bypassing of firewalls etc. are of
850	   course easy to do when a host can specify waypoints that detour
851	   around a firewall.

853	   - Spoofing using the reverse route. Since the reverse source route
854	   is installed against the IP SA by a host that receives it, it is
855	   possible to use a bogus IP SA in combination with a reverse source
856	   route that detours the packets to the imposter host.

858	10.3. Protocol Oblivious Forwarding as SLSR mechanism

860	   The OpenFlow [OPENFLOW] protocol defines methods for an external
861	   controller to cause the manipulation of known packet headers and
862	   fields within those headers by a forwarding element. As such it is
863	   currently limited to matching on known fields like MPLS, IP,
864	   Ethernet etc. and taking actions on those fields. While flexible
865	   there are still many things at the data path level that OpenFlow
866	   cannot do including generic source routing such as SLSR.

868	   The Protocol Oblivious Forwarding [POF] protocol is a proposed
869	   extension to OpenFlow which permits arbitrary bit pattern
870	   matching/actions and is therefore much more flexible. The goal of
871	   POF is to allow a controller to define a new data path in addition
872	   to a new control plane and to then program the data path on the
873	   forwarding elements to its specifications. POF is therefore not
874	   limited to existing IP, MPLS, Ethernet fields.

876	   It would therefore be possible with POF to implement a highly
877	   flexible SDN tunnel data plane that closely resembles the
878	   idealized SLSR data path. Strictly by way of example POF could
879	   implement a flexible SLSR header along the following lines:

881	      +----------+---------+---------+------+------+-~~------+
882	      | NextHop  |   Hop   |   Hop   |  Hop |  Hop |    Hop  |
883	      | Index:4  | Count:4 |  Size:4 |   0  |   1  |     N   |
884	      +----------+---------+---------+------+------+-~~------+

886	   With only five bytes, this header could represent 3 hops with 256
887	   links per hop, 4 hops with 64 links per hop, or 6 hops with 16
888	   links per hop, etc. With additional bytes of course more/longer
889	   combinations are possible with very reasonable overhead. This is
890	   considerably more compact than the other described options and
891	   without sacrificing reversibility or giving up the OAM benefits of
892	   knowing the exact path the packet has taken.

894	   POF however could also implement other variations of SLSR based on
895	   MPLS. For example POF could implement a smaller MPLS label, say a
896	   16 bit label without a TTL. POF could theoretically also implement
897	   a rotating label list instead of a popping label stack.

899	   POF appears to be ideally suited for SLSR developments beyond what
900	   can currently be done with MPLS.

902	11. Security Considerations

904	   Source Routing security concerns are also discussed in the
905	   previous section related to IPV4 and IPV6 now deprecated nodal
906	   source routing.

908	   This draft is proposing link based source routing and that it be
909	   used as a tunneling mechanism only. This means that only devices
910	   that are at the edge of an SDN sub-network would be allowed to
911	   insert strict link source routes. Note that an MPLS label can only
912	   be inserted by a Label Edge Router (LER) and processed by Label
913	   Switch Routers (LSR) and not by end hosts. Therefore SLSR should
914	   be no more or less secure than MPLS. In fact the absence of
915	   signaling protocols like RSVP-TE removes a point of attack. The
916	   fact that this mechanism is intended for use by a central
917	   controller further mitigates the possible attacks as encrypted
918	   communications are used to the edge devices which are the only
919	   device able to insert the strict link source routes.

921	   There is however the possibility that an attacker could attach to
922	   a core device and inject strict link source routed packets.
923	   Methods to prevent this however are not hard, in particular the
924	   adjacency would have to be reported to the controller and the
925	   controller would have to enable packet forwarding. Unless the
926	   controller recognized both ends of the link as being part of its
927	   controlled domain it should not enable the strict link source
928	   routing capability on that interface thus preventing the threat.

930	   Other interfaces, such as those facing a network of hosts or
931	   devices not in the domain of the controller would, as with current
932	   BCP's, drop any source routed frame in any format (new or old).

934	   As previously mentioned there are ways to spread the link names
935	   into a 32 bit space such that the exact mappings are only known by
936	   the controller and the tandem node in question. This would prevent
937	   any easy form of guessing being used to construct an SLSR. One
938	   such example of this kind of secure source routing is given in
939	   [SANE].

941	   Source Routing also is unique in that the packets themselves give
942	   details about slices/cuts through the topology, therefore with
943	   sufficient interception of packets from diverse sources and
944	   destinations in the network, an attacker could build up a detailed
945	   view of the network topology, this would be a concern for a
946	   carrier SDN network in particular where details of topology are
947	   considered a valuable asset, although exploiting knowledge of the
948	   topology would be more challenging given the secure protocols that
949	   exist between a controller and the forwarding entities.

951	   In the SDN context there appears to be little need for a loose
952	   source route. Loose source routing adds additional security
953	   concerns because it does not require knowledge of the entire path
954	   to construct an attack. If loose source routing is included the
955	   security concerns should be addressed.

957	12. Conclusions and Future work

959	   SDN where a central controller creates either pro-actively or re-
960	   actively the state for a sub-network of forwarding devices will
961	   have performance limitations that are related to network
962	   diameter/size, network recovery requirements and the amount of
963	   state they need to distribute. Strict Link Source Routing
964	   mechanisms can alleviate these problems allowing greater scale and
965	   faster recovery. MPLS can be used to implement this on a small
966	   scale with some of the benefits. IPV4 and IPV6 source routing
967	   options can be used to implement this on a larger scale with more
968	   of the benefits but at much larger packet overhead but are however
969	   perceived as risky and have been deprecated from IP. These risks
970	   however can be mitigated in this specific use. No existing
971	   mechanism however is optimum, and therefore there is room for a
972	   new mechanism that addresses these requirements and includes
973	   multicast methods and more efficient encoding of link names than
974	   is currently possible. One possible solution is to look at a
975	   smaller MPLS label for this purpose and to look at ways to retain
976	   the popped labels for the purposes of end to end path reversal and
977	   OAM. New work in SDN, in particular Protocol Oblivious Forwarding
978	   may make these kinds of things possible in a generic manner.

980	13. IANA Considerations

982	   This memo includes no request to IANA.

984	14. References

986	14.1. Informative References

988	  [BLOOM]           Active Bloom Filters for Multicast Addressing,
989	                    Z. Heszberger et. al. Budapest
990	                    University of Technology and Economics.

992	  [OPENFLOW]        www.openflow.org

994	  [ONF]             www.opennetworking.org

996	  [POF]             Protocol Oblivious Forwarding:
997	                    http://www.poforwarding.org/

999	  [PLACEMENT]       The Controller Placement Problem, Nick McKeon,
1000	                    Brandon Heller, Rob Sherwood. HotSDN'12, August
1001	                    13, 2012, Helsinki, Finland. 2012 ACM 978-1-
1002	                    4503-1477-0/12/08,
1003	                    http://conferences.sigcomm.org/sigcomm/2012/paper
1004	                    /hotsdn/p7.pdf

1006	  [RFC791]          Internet Protocol, Information Sciences
1007	                    Institute, RFC 791, September 1981.

1009	  [SANE]            SANE: A Protection Architecture for Enterprise
1010	                    Networks, Martin Casado, Nick McKeown,
1011	                    http://yuba.stanford.edu/~casado/sane.pdf,
1012	                    Stanford and ICSI 2005.

1014	  [SDNGOOG]         SDN at Google - Opportunities for WAN
1015	                    optimization, E. Crabbe, V. Valancius, 8/1/2012.
1016	                    Presentation at IETF84 SDN BOF.

1018	  [SEGMENT]         Segment Routing with IS-IS, S.Previdi et. Al.
1019	                    http://tools.ietf.org/html/draft-previdi-
1020	                    filsfils-isis-segment-routing-00

1022	  [SLSR]            Software Defined Networking and Centralized
1023	                    Controller State Distribution Reduction,
1024	                    www.ieee802.org/1/files/public/
1025	                    docs2012/new-
1026	                    ashwood-sdn-optimizations-0712-v01.pdf

1028	  [SRBAD]           Deprecation of Source Routing Options in IPV4
1029	                    http://tools.ietf.org/html/draft-reitzel-ipv4-
1030	                    source-routing-is-evil-00

1032	  [SRSDN]           Source Routed Forwarding with SDN, M. Soliman
1033	                    http://conferences.sigcomm.org/co-
1034	                    next/2012/eproceedings/student/p43.pdf

1036	15. Authors' Addresses

1038	   Peter Ashwood-Smith
1039	   Huawei Canada Inc.
1040	   303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1
1041	   Email: Peter.AshwoodSmith@huawei.com

1043	   Mourad Soliman
1044	   Carleton University,
1045	   1125 Colonel By Drive Ottawa, Ontario K1S 5B6 Canada
1046	   Email: MouradSoliman@cmail.carleton.ca

1048	   Tao Wan
1049	   Huawei Canada Inc.
1050	   303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1
1051	   Email: Tao.Wan@huawei.com

1053	16. Contributors

1055	   We invite more contributors.

1057	17. Acknowledgements

1059	   We gratefully appreciate the feedback of Nigel Bragg, Sue Hares,
1060	   Peter Willis, Biswajit Nandy and Linda Dunbar.