idnits 2.17.1 draft-ashwood-sdnrg-state-reduction-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 03, 2013) is 3949 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Research Task Force P. Ashwood-Smith 2 Internet Draft Huawei 3 Intended status: Informational M. Soliman 4 Expires: February 03, 2014 Carleton University 5 T. Wan 6 Huawei 7 July 03, 2013 9 SDN State Reduction 11 draft-ashwood-sdnrg-state-reduction-00.txt 13 Abstract 15 This document makes the argument that to support the centralized 16 control of a substantial number of forwarding devices (as Software 17 Defined Networking (SDN) proposes) that the scale, speed, cost and 18 general quality of such a solution will be improved by reducing 19 the state needed to be distributed into the network of devices by 20 the controller(s). To this end we re-visit forms of Source Routing 21 (SR), in particular Strict Link Source Routing (SLSR) and suggest 22 that light weight SLSR could allow substantial reduction in 23 controller burden while potentially reducing the costs/complexity 24 on forwarding devices. We discuss some simulation results that 25 demonstrate these advantages and how the advantages grow 26 substantially as the network diameter grows. We also look at 27 various implementation possibilities including existing IPV4, V6, 28 MPLS, new/modified MPLS vs. something brand new that could 29 possibly be implemented with new SDN technology like Protocol 30 Oblivious Forwarding-POF. 32 Status of this Memo 34 This Internet-Draft is submitted in full conformance with the 35 provisions of BCP 78 and BCP 79. 37 Internet-Drafts are working documents of the Internet Engineering 38 Task Force (IETF), its areas, and its working groups. Note that 39 other groups may also distribute working documents as Internet- 40 Drafts. 42 Internet-Drafts are draft documents valid for a maximum of six 43 months and may be updated, replaced, or obsoleted by other 44 documents at any time. It is inappropriate to use Internet-Drafts 45 as reference material or to cite them other than as "work in 46 progress." 48 The list of current Internet-Drafts can be accessed at 49 http://www.ietf.org/ietf/1id-abstracts.txt 50 The list of Internet-Draft Shadow Directories can be accessed at 51 http://www.ietf.org/shadow.html 53 This Internet-Draft will expire on December 3, 2012. 55 Copyright Notice 57 Copyright (c) 2013 IETF Trust and the persons identified as the 58 document authors. All rights reserved. 60 This document is subject to BCP 78 and the IETF Trust's Legal 61 Provisions Relating to IETF Documents 62 (http://trustee.ietf.org/license-info) in effect on the date of 63 publication of this document. Please review these documents 64 carefully, as they describe your rights and restrictions with 65 respect to this document. 67 Table of Contents 69 1. Terminology 3 70 2. Introduction 3 71 3. Logical Example 5 72 4. Expressing a Path 6 73 5. Computing a Path 7 74 6. Downloading Forwarding State 8 75 7. Logically Forwarding SLSR 10 76 7.1. Ingress Logical Unicast Forwarding 10 77 7.2. Tandem Logical Unicast Forwarding 11 78 7.3. Egress Logical Unicast Forwarding 12 79 8. Logical Multicast Forwarding SLSR Packets 13 80 9. Failure Recovery 14 81 10. Comparison of Logical Model to Existing Source Routing 15 82 10.1. MPLS as a SLSR 15 83 10.2. IPV4/6 Options as SLSR 18 84 10.3. Protocol Oblivious Forwarding as SLSR mechanism 19 85 11. Security Considerations 20 86 12. Conclusions and Future work 21 87 13. IANA Considerations 21 88 14. References 21 89 14.1. Informative References 21 90 15. Authors' Addresses 23 91 16. Contributors 23 92 17. Acknowledgements 23 94 1. Terminology 96 ATM Asynchronous Transfer Mode (a cell based network) 97 BGP Boarder Gateway Protocol 98 CSPF Constrained Shortest Path First 99 DOS Denial of Service (attack) 100 ECMP Equal Cost Multi Path 101 flow Logically related packets following the same path 102 IS-IS Intermediate System to Intermediate System 103 LACP Link Aggregation Control Protocol 104 LAG Link Aggregation 105 Loose A source route that enumerates only some of all hops 106 MPLS Multi Protocol Label Switching 107 MPLS-TE MPLS Traffic Engineering. 108 NPU Network Processor Unit (programmable forwarding) 109 OpenFlow Open data path programming protocol 110 OSPF Open Shortest Path First 111 PCE Path Computation Element (used with MPLS-TE) 112 PNNI Private Network to Network Interface (link state ATM) 113 POF Protocol Oblivious Forwarding - more generic OpenFlow) 114 RSVP-TE Resource Reservation Protocol - Traffic Engineering 115 SDN Software Defined Networking (as per [OPENFLOW]) 116 SDN-domain Set of forwarding devices controlled as a unit. 117 SR Source Routing - enumerating hops to traverse 118 SLSR Strict Link Source Routing - enumerating links [SLSR] 119 SPF Shortest Path First - (Dijkstra etc.) 120 SSRR Strict Source Record Route - IPV4 header option 9 121 Strict Source route that enumerates every hop(unlike Loose) 122 TE Traffic Engineering 123 VFI Virtual Forwarding Instance (layer 2) 124 VPLS Virtual Private Lan Service 125 VRF Virtual Routing and Forwarding (layer 3) 127 2. Introduction 129 The centralized control of a network is not a new idea. Indeed 130 centralized control was widely deployed in voice networks and some 131 early data networks but of course gave way to distributed control 132 for IP. 134 Centralized computation is however still widely used for traffic 135 engineered networks, like MPLS-TE and GMPLS where a Path 136 Computation Engine (PCE) makes use of a global view of a sub- 137 network and its resource usage for the planning of new paths and 138 their resources. The data path state distribution with these 139 models is however not initiated centrally and relies on protocols 140 like RSVP-TE to install the hop by hop state. In fact this form 141 of distributed control with centralized traffic engineering 142 computations is the norm today. 144 Notwithstanding the massive deployment of this kind of hybrid 145 distributed/central control, we have in the last several years 146 seen a huge resurgence of interest in fully centralized control of 147 at least a set of forwarding devices [ONF] [OPENFLOW] with 148 Software Defined Networking (SDN). This SDN proposes a central 149 controller (or controllers) using IP protocols such as TCP to talk 150 to a set of arbitrarily interconnected (and cheap/dumb) forwarding 151 devices (SDN-domain) and which is responsible for the 152 configuration of the majority of forwarding state on those 153 devices. This state may be produced either as a result of pro- 154 active configuration, or based on re-active responses to packet 155 flow indications from the forwarding devices themselves. 157 Since this central controller has knowledge of the entire sub- 158 network of devices, and potentially of the traffic demands 159 into/out of the sub-network, it can perform a variety of path 160 optimization computations similar to CSPF/MPLS-TE/PCE/GMPLS, or 161 even more elaborate forms of optimization (trading flows against 162 each other rather than individually optimizing them, exploiting 163 quiet areas of the network to offload busy areas etc), the output 164 of which is forwarding state for all meta flows in the entire sub 165 network of devices and a sub network which more optimally meets 166 the desired local constraints. One such deployment reports a 167 substantial increase in network utilization from 30% to 70%-90% 168 [SDNGOOG]. 170 A central controller can also more effectively solve problems such 171 as bin-packing and path blocking [SDNGOOG], which occur when flows 172 are optimized individually with greedy type algorithms rather than 173 considering other orderings of the flows. The finer grained 174 ability to place traffic can also permit much more detailed 175 placement of traffic after a failure, including traffic not 176 directly affected by the failure but the replacement of which is 177 critical to achieving fair/efficient use of the remaining 178 bandwidth subsequent to the failure. 180 Since the output of the controller is much closer to a TE (Traffic 181 Engineered) type solution from a PCE (Path Control Element) than 182 an SPF (Shortest Path First) solution the controller cannot simply 183 install destination based forwarding entries. A controller either 184 needs to install tunnels that follow the explicit routes it wishes 185 and then map traffic to those tunnels at the edges, or it must 186 install n-tuple < 187 etc.> state and configure these n-tuple matches 188 on every hop along the desired path. Packets which fail to match 189 an n-tuple are either discarded or sent to the controller. 191 In the normal case of SDN (as given in [OPENFLOW]) the controller 192 is required to send configuration information to all devices along 193 the path from ingress of the SDN-domain of this controller to the 194 egress of that SDN-domain, alternatively a tunnel setup protocol 195 like RSVP-TE is required to be triggered to distribute the per hop 196 state between the ingress and egress. 198 This draft proposes that since the controller knows the exact end- 199 to-end path (down to the level of the links it wishes the packets 200 to traverse) and that the diameter of an SDN-domain is likely to 201 be a reasonable number of hops, that the controller should instead 202 simply insert into a header the exact links it wishes the packet 203 to traverse and thereby not have to deal either with per hop n- 204 tuple state installation (very expensive) or with MPLS tunnel 205 installation via RSVP-TE(complex). Such a mechanism also 206 eliminates any concerns about Equal Cost Multi Path (ECMP) and/or 207 Link Aggregation (LAG) as the controller can place traffic on 208 exact links. 210 Operations, Administration and Management (OAM) is also greatly 211 simplified since data packets will flow on invariant paths that 212 are known by both ends of the flow and can be the same as any OAM 213 packets that probe the flow. This OAM "fate sharing" property is 214 widely valued by network operators and considerable effort has 215 already been expended to permit similar fate sharing between OAM 216 and data paths with other carrier scale networking protocols such 217 as 802.1ag and MPLS-TP. Of course if a controller does not wish to 218 enforce symmetry and congruence it need not. 220 3. Logical Example 222 The following is an example of an idealized strict link based 223 source routing (SLSR) forwarding. We talk about possible 224 implementations including MPLS methods after looking at the 225 logical ideal. 227 Consider the simple 7 node network shown in Figure 1 below. Here 228 the nodes are named {A, B, C, D, E, F, G} and where each node has 229 locally numbered interfaces named {1, 2, 3, 4, 5, 6}. 231 For example node A has interfaces named 1, 2, 3, 4 and where 232 interfaces 4 and 2 both go to node B. Node B has local interfaces 233 1, 2, 3, 4 and 5 but the two interfaces going back to node A are 234 locally named 1 and 3. Clearly node interface names are likely 235 (but not necessarily) different at both ends of a link. 237 +----+ +----+ 238 | | 4 ----- 3 | | 239 | A | 2 ----- 1 | B | 240 +----+ +----+ 241 1 3 4 2 5 242 / \ / \ \ 243 3 4 3 2 3 244 +----+ +----+ +----+ 245 | C | 2----1 | D | 2------1 | E | 246 +----+ +----+ +----+ 247 1 6 5 4 248 \ / \ / 249 3 2 1 2 250 +----+ +----+ 251 | F | 1-------3 | G | 252 +----+ +----+ 254 Figure 1 - simple 7 node network with local link identifiers. 256 4. Expressing a Path 258 A path through a network labeled as per Figure 1 can clearly be 259 expressed as a sequence of link names (an SLSR). 261 For example, between nodes C and E the following are all valid 262 paths. 264 C.3 -> A.2 -> B.2 265 C.2 -> D.2 266 C.1 -> F.2 -> D.2 267 C.3 -> A.4 -> B.5 269 Now since the links lead unambiguously to a known node, the paths 270 can be more compactly expressed without the node names as follows: 272 {3,2,2} 273 {2,2} 274 {1,2,2} 275 {3,4,5} 277 As long as we know the origin of the path (in this case node C), 278 the list of link names unambiguously identifies a path and an 279 egress point. In addition it identifies unambiguously which link 280 from among parallel links between neighbors should be traversed. 281 Of course it is possible to give a name to the set of links that 282 all attach to the same neighbor and thereby leave the exact link 283 in that path deliberately ambiguous and thereby subject to a local 284 forwarding decision as to exactly which link in the set to follow. 286 Each path of course also has exactly one perfectly symmetric 287 reverse. Note that the symmetric reverse path is not simply the 288 same list of link names in reverse order. A reverse path has to be 289 specified from the opposite end of the path so in this example the 290 origin has to be E. 292 The forward and corresponding reverse paths are therefore. 294 C->E E->C 296 {3,2,2} {2,1,1} 297 {2,2} {1,1} 298 {1,2,2} {1,6,3} 299 {3,4,5} {3,3,1} 301 Various very efficient encodings of these kinds of paths in source 302 routed headers are possible. Even a simple encoding using 8 bits 303 per hop can encode every path in a large 8 hop network with fewer 304 bits than an IP in IP tunnel. 306 5. Computing a Path 308 It should be obvious that the output of any graph based 309 computation which has as its goal various optimization criteria 310 for flows can express its results as a series of such paths where 311 each path is expressed as a Strict Link based Source Route (SLSR). 312 This includes multiple different metric Dijkstra computations 313 (i.e. shortest path, multi topology shortest path), CSPF type and 314 of course more elaborate linear-programming or other convex type 315 optimizations. 317 The expression of the path as an SLSR imposes no constraints on 318 the type of computation being performed except possibly in path 319 length. However in any real network under the control of a single 320 controller it is not likely that path length would be a real issue 321 unless perhaps unreasonably large link names are encoded. 323 Convex and linear-programming type solutions to traffic placement 324 are of particular interest because to do a good job they must 325 exploit a considerable number of paths through a network (many 326 more than shortest). These algorithms take the matrix of 327 ingress/egress flows in a network together with all the usable 328 paths between all sources and destinations and will assign 329 percentages of the ingress/egress flows to the available paths in 330 ratios that can optimize a number of simultaneous constraints. For 331 example they can optimize the network's total throughput, the 332 average link utilization, the fairness of the bandwidth available 333 to each flow and can even optimize different linear and non linear 334 combinations of those goals. What is interesting about all of 335 these kinds of optimizations however is that they need access to 336 all of the reasonable paths across the network since it is by 337 making trade-offs between busy and less busy parts of the network 338 that they achieve their goals. Unfortunately the number of paths 339 (shortest or otherwise) in a network grows exponentially with 340 network size and with it the state distribution problem (or 341 burden) the controller must deal with. 343 It is important to make the distinction between a flow and a path. 344 This draft concerns itself not with the immense numbers of micro 345 flows but with the very large numbers of paths required to be 346 supported onto which those micro flows are then aggregated. A set 347 of micro flows can be treated as a single flow, and a single flow 348 has a unique path through the network. 350 6. Downloading Forwarding State 352 A controller likely takes as input the fields that identify the 353 flow and its various statistical attributes. The controller then 354 likely computes an end to end path for this flow either based on 355 the single flow's attributes (in a re-active manner), or on more 356 global knowledge of multiple flow attributes (in a pro-active 357 manner). Flows may be meta (many micro flows) or individual micro 358 flows depending on the implementation and its scale. The output of 359 course is just a list of links that must be traversed for this 360 flow together with matching rules to identify the flow at the 361 ingress. 363 The controller then delivers the flow matching rules and the 364 Strict Link Based Source Route to the *single* node where the flow 365 is to be encapsulated (i.e. where the flow first enters the SDN- 366 domain). 368 The fact of only having to communicate with the *single* node at 369 the head end of the path means that the controller experiences a 370 reduction in its work load directly proportional to the number of 371 hops in the path (as compared to traditional SDN which must 372 program every hop along the path). 374 Intuitively this translates to the following I/O burden reduction 375 at the controller based on the number of links that must be 376 traversed per average path. 378 +----------------+-------------------------+ 379 | #Avg Path Len | % I/O Burden Reduction | 380 +----------------+-------------------------+ 381 | 1 | 0% | 382 | 2 | 50% | 383 | 3 | 66% | 384 | 4 | 75% | 385 | 5 | 80% | 386 | .. | .. | 387 | N | (100-100/n) % | 388 +----------------+-------------------------+ 390 Since forwarding state download is typically a substantial part of 391 a "normal" routers' re-convergence time, it seems reasonable that 392 this will become a similar bottleneck for a central controller and 393 quite possibly be further aggravated by the increased delays and 394 larger amount of state that the central device must deal with. 396 As a result this reduction in state and I/O burden should have a 397 marked impact on convergence times assuming there are appropriate 398 forwarding mechanisms that can implement the Strict Link Source 399 Route (SLSR). Note also that the position of the controller 400 relative to the ingress/egress nodes is now more important than 401 its position relative to all nodes. Therefore studies as to 402 controller optimum placement as defined by the Controller 403 Placement Problem [PLACEMENT] would require different optimization 404 goals. 406 An additional 50% reduction can also be obtained should the 407 implementation of the forwarding be able to reverse the path on 408 the fly. Such a reversal permits the implicit communication of the 409 desired reverse path to the receiver thereby eliminating 410 communication with the controller to obtain a reverse path. Of 411 course if symmetry is not desired this further optimization is not 412 possible. 414 For example, consider a network with 1000 nodes. It therefore has 415 O(1,000,000) meta flows and assuming 10 possible paths for each 416 flow has O(10,000,000) ingress forwarding entries that must be 417 centrally configured (its burden). If each path on average takes 5 418 hops then the burden on the controller grows 5 fold to 419 O(50,000,000) entries but with SLSR the burden remains at 420 O(10,000,000). If path reversal is supported and symmetric routing 421 is desired then the burden with SLSR drops further to 422 O(5,000,000). 424 Simulations done by one of us in [SRSDN] provide additional weight 425 to the above arguments. In particular we simulated for various 426 network sizes and diameters the differences between hop by hop SDN 427 and SLSR and saw up to 3 x performance improvements in convergence 428 times with SLSR. There were also a number of other benefits such 429 as a markedly reduced standard deviation in convergence times for 430 the different nodes (81% decreases) and a significantly reduced 431 sensitivity to the placement of the controller (80% reduction in 432 standard deviation). The performance improvements can perhaps 433 better be understood by an analogy comparing the work required to 434 fill in the area of an object (traditional SDN) vs. simply drawing 435 the circumference of that object (SLSR). Since the circumference 436 varies as a function of the diameter but the area varies as a 437 function of the diameter squared the relative burden reduction 438 with just dealing with the circumference (the edge of the network) 439 becomes apparent. In fact in this simulation study we varied the 440 radius and then plotted the relative convergence times of SLSR and 441 traditional hop by hop forwarded SDN and saw a ratio of 442 convergence times as a function of radius that indicated a trend 443 towards 1/R as expected. Simply stated, the bigger the centrally 444 controlled network the better source routing performs compared to 445 hop by hop. 447 7. Logically Forwarding SLSR 449 There are three distinct phases to be performed to logically 450 forward unicast SLSR. These are similar to any tunnel technology 451 and consist of 1) Ingress Encapsulation, 2) Tandem Forwarding, and 452 3) Egress Decapsulation and Forwarding. We address the generic 453 concepts first before looking at possible existing or new 454 encapsulations and their applicability. 456 Multicast SLSR is also possible (but with limitations to keep the 457 header sizes from growing too large) and is briefly discussed 458 after unicast. 460 7.1. Ingress Logical Unicast Forwarding 462 Here the flow information, likely IP header(s) + UDP/TCP header(s) 463 is looked up and a sequence of link identifiers and a current hop 464 must be placed on the packet, the packet must then be forwarded to 465 the first of those links. This operation is identical to almost 466 every tunnel protocol except that IP ECMP and/or LAG hash would 467 potentially be unnecessary because the first link name would often 468 resolve to a physical link not a LAG bundle. For example: 470 +----------+----------+---------+--------+--------+ 471 |SrcIP | DstIP | SrcPrt | DstPrt | SLSR | 472 +----------+----------+---------+--------+--------+ 473 |192.0.2.4 |192.0.2.9 | 1000 | 98 |{3,2,5} | 474 |192.0.2.4 |192.0.2.9 | 1001 | 99 |{3,4,5} | 475 +----------+----------+---------+--------+--------+ 477 Of course nothing precludes the use of LAG and the link identifier 478 therefore identifying an entire LAG bundle rather than an element 479 of that LAG. In fact it is possible to simultaneously support both 480 concepts so that some traffic can be forwarded to the entire LAG 481 while other traffic could be placed on a particular LAG bundle 482 member at the discretion of the central computation. 484 7.2. Tandem Logical Unicast Forwarding 486 At tandem devices the operation would start by incrementing the 487 current hop in the packet header (shown with a ^ symbol) and then 488 forward to the link identified in the new current hop. If we 489 support reversal, we change the previous link name to the local 490 link name for that link. For example, referring to Figure 1 (and 491 disregarding non relevant headers/options) after matching the 492 first flow tuple at the ingress node C the packet is encapsulated 493 with the SLSR header {3,2,5} and then leaves node C on interface 3 494 toward A. Then: 496 Packet arrives at node A on local interface 1 where it looks 497 like this: 499 +--------------------------------------~~~------------+ 500 | 3 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | 501 +-^------------------------------------~~~------------+ 503 Current hop is incremented while previous hop is changed to 504 local interface name (3 changes to a 1). 506 +--------------------------------------~~~------------+ 507 | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | 508 +-----^--------------------------------~~~------------+ 510 Packet is forwarded to interface for current hop i.e. 2. 512 Packet arrives at node B on local interface 1. 514 +--------------------------------------~~~------------+ 515 | 1 | 2 | 5 | 192.0.2.9 | 192.0.2.4 |.. | 516 +-----^--------------------------------~~~------------+ 518 Current hop is incremented while previous hop is changed to 519 local interface name 1 (2 changes to a 1). 521 +--------------------------------------~~~------------+ 522 | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |.. | 523 +---------^----------------------------~~~------------+ 525 Packet is forwarded to interface for current hop i.e. 5. 527 Packet arrives at node E on local interface 3. 529 +--------------------------------------~~~------------+ 530 | 1 | 1 | 5 | 192.0.2.9 | 192.0.2.4 |.. | 531 +---------^----------------------------~~~------------+ 533 Current hop is incremented while previous hop is changed to 534 local interface name (5 changes to 3). 536 +--------------------------------------~~~------------+ 537 | 1 | 1 | 3 | 192.0.2.9 | 192.0.2.4 |.. | 538 +-----------^--------------------------~~~------------+ 540 We are at the end of the path, so egress processing begins. 542 One additional step not described above is a reverse path check. 543 Prior to substituting the reverse link identifier into the SLSR 544 header, the link identifier from the neighbor can be validated and 545 the packet discarded if the neighbor link identifier in the packet 546 is incorrect for the port the packet arrived on. This would reduce 547 the chances of mis-delivery of the packet should a link identifier 548 change or a link destination change while a packet is in flight. 550 7.3. Egress Logical Unicast Forwarding 552 Here the operation consists of normal IP/Ethernet etc. forwarding 553 based on the IP destination / MAC or other ACL rules. Basically 554 the SLSR header is stripped and the packet is submitted to the 555 Virtual Forwarding Instance, or Virtual Forwarding Function (VFI 556 or VRF) for further processing. 558 Optionally the link identifier from the neighbor can be validated 559 against what is expected and the packet discarded in the case of a 560 mismatch. This reduces the chance of mis-delivery as in the tandem 561 case. 563 In addition, if path reversal is supported, the reverse path is 564 compared against the current reverse path for this reverse flow 565 and if it has changed the local forwarding state for the reverse 566 flow would be updated. This would allow the head end to always 567 dictate the forward and reverse path to be used for all packets in 568 the flow without involving the controller on the egress side (and 569 of course not needing to communicate with any tandem device). 571 Processing the reverse flow/path in this manner means that a flow 572 is already present for the reverse direction without having to re- 573 actively or pro-actively consult the controller. This results in a 574 further 50% reduction in controller load. In the case of asymmetry 575 this optimization is of course not possible. 577 8. Logical Multicast Forwarding SLSR Packets 579 Multicasting packets usually involve one of two approaches. 581 The first approach simply re-uses unicast and sends multiple 582 copies to a pre-determined list of receivers. There is little to 583 discuss with this approach as we can replicate SLSR based unicast 584 packets just as easily as any other tunneling mechanism. Clearly 585 such a serial unicast approach has nearly identical bandwidth 586 overhead as other protocols like VPLS which also use this serial 587 unicast mechanism. 589 It is therefore interesting to look at more efficient methods that 590 involve the second multicast mechanism, which uses replication 591 points in the network. These replication points are chosen so that 592 copies are more efficiently made thereby eliminating multiple 593 copies of the packet traversing any given link. Various logical 594 tree structures are usually involved e.g. STP, SPB, TRILL, PIM, 595 MOSPF etc. 597 These tree based mechanisms could in theory be implemented without 598 requiring tandem state as an SLSR by introducing a branch point 599 concept into the list of indexes. In this manner a complete tree 600 as a pre-order traversal could be encoded along with the packet 601 payload. It is not difficult to define a variety of different 602 encodings that would accomplish this. The obvious objection to 603 such a scheme is the sheer size of header required especially 604 where a large network with many multicast receivers is concerned. 605 It is therefore unlikely to be practical to encode any large tree 606 of receivers and the SLSRs between them in any single header. 608 This leads to a hybrid approach which would encode a subset of the 609 tree, say a single replication point and 5 or so recipients. This 610 little tree or 'tree-let', would efficiently get a single packet 611 to 5 (or some suitably small number) of recipients with an SLSR to 612 the replication point and then SLSRs to each of the receivers. 613 Such an encoding is much more reasonable than trying to encode all 614 receivers and all replication points of a single tree in one 615 packet. However, since this one packet would not reach all 616 receivers, the head end would have to generate as many copies of 617 the data packets as necessary to cover all recipients. As a result 618 this approach would be a compromise between a full tree, and full 619 head end replication. Variations in the size of the 'tree-let' 620 header would allow for space v.s. bandwidth efficiency trade-offs 621 while meeting the goal of remaining stateless in the core. 623 In the literature there are also non-exact methods to multicast 624 without state such as with Bloom filters [BLOOM]. In this approach 625 the links to be traversed are logically mapped into a field which 626 is carried in the packet (for example if the links are given 627 unique 128 bit sparse addresses then a 128 bit union of all the 628 links to be traversed on the tree is encoded in the header). These 629 mechanisms guarantee that all receivers will get a copy of the 630 packet (because they check each link for inclusion in the Bloom 631 Filter at each hop) however they do so at the expense of sending 632 false positive copies to unintended receivers which must then 633 filter the unwanted packets egress. Depending on the size of the 634 Bloom Filter and the link identifiers various statistical trade- 635 offs in false positives vs. packet header size can be made. 637 Other exact methods to encode and methods to compute SLSR 638 multicast etc. are FFS. 640 9. Failure Recovery 642 A variety of failure recovery techniques can be employed with 643 SLSR. The most obvious is to just re-compute all affected paths on 644 indication of a link failure. This won't be discussed further. 646 More interesting are the so called fast restoration mechanisms. 647 These can broadly be broken down into head end and tandem 648 restoration. 650 Head end mechanisms that provide 1+1 protection have been around 651 for a long time with MPLS-TP, PBT and SONET/DWDM. Similar 652 mechanisms can be used with any tunnel type and of course SLSR is 653 no exception. Probes can be sent down one source route, reflected 654 back along the reverse source route and in this manner the forward 655 and reverse paths can be simultaneously probed for failure. In the 656 event of a failure a diverse alternate source route can rapidly be 657 added to the packet and the flow restored. The advantage of course 658 with SLSR is that no state is required for either the primary or 659 the backup path. As a result there is little added cost to having 660 even greater redundancy than 1+1 with SLSR. The mechanisms to 661 accomplish this are fairly obvious. Having the reverse path 662 available at the egress means that fate sharing the forward and 663 reverse probes is easy. 665 In addition to 1+1 protection it is possible to do hop by hop fast 666 reroute type detour protection. This can be done by substitution 667 of a failed link identifier with a set of link identifiers that 668 merge with the path downstream of the failure. An example is given 669 further below for MPLS label stacks, however many other 670 possibilities exist when a history of the packet's path is 671 available to the detour mechanism. The history would permit the 672 detour mechanism to spread the failing packets over different 673 detours and thereby reduce the concentration of additional load 674 imposed by the failure on the same set of links. 676 10. Comparison of Logical Model to Existing Source Routing 678 There are a number of existing protocols that support forms of 679 source routing (or can be used to do something close to source 680 routing). IPV4 and V6 had strict and loose node-by-node source 681 routing options (now deprecated) and we'll discuss them briefly. 682 Likewise MPLS behavior can be used to do strict link source 683 routing where a label stack represents a list of link names, this 684 has recently been called segment routing in [SEGMENT]. 686 10.1. MPLS as a SLSR 688 MPLS is of course not a source routed forwarding protocol, at 689 least not by design. Rather, packets follow an arbitrary path by 690 substitution of a previous hop label with a next hop label and 691 each hop must be pre-configured with the to 692 relationship. This is clearly not source 693 routing because tandem configuration is required per path and per 694 hop. However MPLS has a stacking mechanism that can be exploited 695 to create a consumable list of link names to be traversed as they 696 are popped. 698 The MPLS label stack can therefore be used to implement a flavor 699 of SLSR. This is accomplished by pre-assigning a locally unique 700 MPLS label to each outgoing link of a node. For example in figure 701 1, node D's link 3 would be assigned MPLS label 3 (but more likely 702 a label value which is 1:1 related to link 3, however we stick 703 with label=link for simplicity of explanation). 705 The tunnel encapsulation operation would therefore be to push a 706 set of labels onto the frame where each label indicates which link 707 to follow at that given exact strict hop. For example: 709 +----------+----------+--------+--------+-----------------------+ 710 |SrcIP | DstIP | SrcPrt | DstPrt | MPLS SLSR | 711 +----------+----------+--------+--------+-----------------------+ 712 |192.0.2.4 |192.0.2.9 | 1000 | 98 |push(3,push(2,push(5)))| 713 |192.0.2.4 |192.0.2.9 | 1001 | 99 |push(3,push(4,push(5)))| 714 +----------+----------+--------+--------+-----------------------+ 716 The tunnel tandem operation would then be to pop the label on the 717 incoming frame (after optionally validating its reverse link 718 identifier) and forward to the interface specified by the just 719 popped label value. Every tandem node would be pre-configured 720 approximately as per below. Note that as with any source routing 721 mechanism, this tandem pre-configuration is independent of the 722 actual paths that traverse the node. A table like the one below, 723 with a few hundred interfaces and hence a few hundred labels, 724 could support the transit of an infinite number of TE (or SPF) 725 paths. For clarity we use label N = interface/N but in reality it 726 would be label N = F(interface/N) since a 1:1 mapping 'F' is 727 almost certainly required. 729 +-------------------+-------+--------------------------------+ 730 |Incoming Interface | Label | Actions | 731 +-------------------+-------+--------------------------------+ 732 | any | 1 | Pop, forward to interface/1 | 733 | any | 2 | Pop, forward to interface/2 | 734 | : | : | : | 735 | any | N | Pop, forward to interface/N | 736 +-------------------+-------+--------------------------------+ 738 If reverse validation is required the tables would be a bit 739 different because they must match the label to the incoming 740 interface and then pop it and then forward based on the next 741 label. Reverse validation therefore requires two label lookups per 742 forwarding operation. 744 Finally the tunnel egress operation would be normal forwarding to 745 a VFI or VRF. 747 MPLS in this manner could be made to do SLSR of unicast frames but 748 cannot be made to reverse the route because the route is consumed 749 in transit. This method also uses many more bits than are really 750 necessary. Each label consumes 32 bits which is rather more than 751 required to express the number of links/adjacencies on a typical 752 switch or router. For example, if the average packet size if 512 753 bytes, a 5 hop MPLS source route imposes a 4% overhead (20/512) on 754 some links with the largest overhead on the first few links. For 755 larger packets this is likely not an issue, for smaller packets it 756 is possibly a concern. 758 A more realistic number actually required per hop is probably 8 or 759 12 bits (256-4K links) and if more bits are required two hops can 760 be consumed by any node with such a large nodal degree. The MPLS 761 label also has an 8 bit TTL which is of course redundant in any 762 source routing mechanism. This begs the question of if a smaller 763 MPLS label would not be more suitable? 765 There are other issues with the use of MPLS, in particular current 766 hardware can usually not stack very many labels at a time (3 on 767 some popular ASICs). This would limit the network diameter to 4 768 hops. Of course NPUs or new ASICs could be extended to allow 769 further ingress stacking. 771 It does not seem possible to do SLSR multicast with MPLS except of 772 course via head end replication. 774 The hop(ingress stack size) limit, lack of reverse, consumable 775 route and lack of efficient multicast still do not invalidate use 776 of MPLS source routing for many networks and its use would have a 777 noticeable positive impact on the scale/speed of a central 778 controller in such environments. 780 MPLS fast reroute mechanisms can also be implemented locally in a 781 similar fashion thus further improving controller scale by 782 alleviating the need for 50ms responses network wide from the 783 controller and giving the controller more lee-way to recover after 784 the fast reroutes have detoured traffic around the failed nodes 785 and/or links. 787 Consider possible local actions when the link A.2 between nodes A 788 and B in Figure 1 fails. Since there is still a link A.4 789 available, the node A can locally change the action associated 790 with label 2 to instead send to interface 4 when interface 2 791 fails. If an entire adjacency fails, such as would happen when 792 both A.2 and A.4 fail, then a link detour can be locally performed 793 by reprogramming the actions for labels 2 and 4 to now push labels 794 3,3 and send to interface 3. This will cause a detour via D back 795 to B. More elaborate kinds of detour are possible by processing 796 two link names ahead instead of one, including nodal detours. 797 These can be done locally without end to end path knowledge and 798 hence scale independently to the number of paths. Eventually the 799 controller will detect the failure and reconstruct the SLSRs at 800 the head end and the use of the detour will stop without having to 801 withdraw any state in the core. 803 If MPLS is of use in the context of SLSR then it would be worth 804 considering a number of future extensions to MPLS. Some things to 805 consider could be a smaller MPLS label option, say 16 bits with no 806 TTL and the possibility of not popping but rotating the label to 807 the bottom of the stack to preserve the path history for OAM and 808 reversibility reasons. While these sorts of things are of course 809 not possible with existing ASICs they are easy to do on existing 810 NPU's and new work on Protocol Oblivious Forwarding [POF] allows 811 near arbitrary bit pattern/action matches to be programmed by an 812 SDN controller permitting a more optimal data path encoding of 813 SLSR than can be obtained by simply reusing MPLS. 815 10.2. IPV4/6 Options as SLSR 817 IP header option 9 [RFC791] defined (but now deprecated) the 818 Strict Source and Record Route (SSRR) option for IPV4 packets. 819 This option has(had) a 'length' field, a 'pointer' field and an 820 array of 'route data' fields. The element in the array of 'route 821 data' indexed by the 'pointer' field contains the IP address of 822 the immediate next hop towards which the packet must be forwarded, 823 the 'pointer' field is incremented, and the previous hop is filled 824 in with the IP address of the current device prior to actually 825 forwarding the packet. Up to 9 hops could be specified in this 826 manner. IPV6 also had a similar option "RH0" which is also now 827 deprecated [SRBAD]. 829 IPV4 and V6 Strict Source and Record Route methods could be used 830 to implement Strict Link Source Routing. This would be 831 accomplished by assigning a 32 bit number to the link and then 832 using the 32 bit number in place of the IPV4 or V6 address in the 833 route list. 835 In both IPV4 and IPV6 the source routing options were found to be 836 harmful to the Internet at large for a number of reasons. These 837 reasons are described in [SRBAD] but briefly there were two broad 838 classes of problem encountered. 1) Harm to intermediate links and 839 2) harm to end hosts. For example: 841 - Since it was possible to list a waypoint more than once in the 842 route data, it was possible to loop traffic around multiple times 843 (9 times in the case of IPV4 and 90 times in the case of IPV6). 844 This looping allowed saturation of high speed links by hosts that 845 had an order (or two) smaller bandwidth access to the Internet. A 846 congestion style DOS was therefore possible from low speed access 847 links against higher speed core links. 849 - Various schemes such as bypassing of firewalls etc. are of 850 course easy to do when a host can specify waypoints that detour 851 around a firewall. 853 - Spoofing using the reverse route. Since the reverse source route 854 is installed against the IP SA by a host that receives it, it is 855 possible to use a bogus IP SA in combination with a reverse source 856 route that detours the packets to the imposter host. 858 10.3. Protocol Oblivious Forwarding as SLSR mechanism 860 The OpenFlow [OPENFLOW] protocol defines methods for an external 861 controller to cause the manipulation of known packet headers and 862 fields within those headers by a forwarding element. As such it is 863 currently limited to matching on known fields like MPLS, IP, 864 Ethernet etc. and taking actions on those fields. While flexible 865 there are still many things at the data path level that OpenFlow 866 cannot do including generic source routing such as SLSR. 868 The Protocol Oblivious Forwarding [POF] protocol is a proposed 869 extension to OpenFlow which permits arbitrary bit pattern 870 matching/actions and is therefore much more flexible. The goal of 871 POF is to allow a controller to define a new data path in addition 872 to a new control plane and to then program the data path on the 873 forwarding elements to its specifications. POF is therefore not 874 limited to existing IP, MPLS, Ethernet fields. 876 It would therefore be possible with POF to implement a highly 877 flexible SDN tunnel data plane that closely resembles the 878 idealized SLSR data path. Strictly by way of example POF could 879 implement a flexible SLSR header along the following lines: 881 +----------+---------+---------+------+------+-~~------+ 882 | NextHop | Hop | Hop | Hop | Hop | Hop | 883 | Index:4 | Count:4 | Size:4 | 0 | 1 | N | 884 +----------+---------+---------+------+------+-~~------+ 886 With only five bytes, this header could represent 3 hops with 256 887 links per hop, 4 hops with 64 links per hop, or 6 hops with 16 888 links per hop, etc. With additional bytes of course more/longer 889 combinations are possible with very reasonable overhead. This is 890 considerably more compact than the other described options and 891 without sacrificing reversibility or giving up the OAM benefits of 892 knowing the exact path the packet has taken. 894 POF however could also implement other variations of SLSR based on 895 MPLS. For example POF could implement a smaller MPLS label, say a 896 16 bit label without a TTL. POF could theoretically also implement 897 a rotating label list instead of a popping label stack. 899 POF appears to be ideally suited for SLSR developments beyond what 900 can currently be done with MPLS. 902 11. Security Considerations 904 Source Routing security concerns are also discussed in the 905 previous section related to IPV4 and IPV6 now deprecated nodal 906 source routing. 908 This draft is proposing link based source routing and that it be 909 used as a tunneling mechanism only. This means that only devices 910 that are at the edge of an SDN sub-network would be allowed to 911 insert strict link source routes. Note that an MPLS label can only 912 be inserted by a Label Edge Router (LER) and processed by Label 913 Switch Routers (LSR) and not by end hosts. Therefore SLSR should 914 be no more or less secure than MPLS. In fact the absence of 915 signaling protocols like RSVP-TE removes a point of attack. The 916 fact that this mechanism is intended for use by a central 917 controller further mitigates the possible attacks as encrypted 918 communications are used to the edge devices which are the only 919 device able to insert the strict link source routes. 921 There is however the possibility that an attacker could attach to 922 a core device and inject strict link source routed packets. 923 Methods to prevent this however are not hard, in particular the 924 adjacency would have to be reported to the controller and the 925 controller would have to enable packet forwarding. Unless the 926 controller recognized both ends of the link as being part of its 927 controlled domain it should not enable the strict link source 928 routing capability on that interface thus preventing the threat. 930 Other interfaces, such as those facing a network of hosts or 931 devices not in the domain of the controller would, as with current 932 BCP's, drop any source routed frame in any format (new or old). 934 As previously mentioned there are ways to spread the link names 935 into a 32 bit space such that the exact mappings are only known by 936 the controller and the tandem node in question. This would prevent 937 any easy form of guessing being used to construct an SLSR. One 938 such example of this kind of secure source routing is given in 939 [SANE]. 941 Source Routing also is unique in that the packets themselves give 942 details about slices/cuts through the topology, therefore with 943 sufficient interception of packets from diverse sources and 944 destinations in the network, an attacker could build up a detailed 945 view of the network topology, this would be a concern for a 946 carrier SDN network in particular where details of topology are 947 considered a valuable asset, although exploiting knowledge of the 948 topology would be more challenging given the secure protocols that 949 exist between a controller and the forwarding entities. 951 In the SDN context there appears to be little need for a loose 952 source route. Loose source routing adds additional security 953 concerns because it does not require knowledge of the entire path 954 to construct an attack. If loose source routing is included the 955 security concerns should be addressed. 957 12. Conclusions and Future work 959 SDN where a central controller creates either pro-actively or re- 960 actively the state for a sub-network of forwarding devices will 961 have performance limitations that are related to network 962 diameter/size, network recovery requirements and the amount of 963 state they need to distribute. Strict Link Source Routing 964 mechanisms can alleviate these problems allowing greater scale and 965 faster recovery. MPLS can be used to implement this on a small 966 scale with some of the benefits. IPV4 and IPV6 source routing 967 options can be used to implement this on a larger scale with more 968 of the benefits but at much larger packet overhead but are however 969 perceived as risky and have been deprecated from IP. These risks 970 however can be mitigated in this specific use. No existing 971 mechanism however is optimum, and therefore there is room for a 972 new mechanism that addresses these requirements and includes 973 multicast methods and more efficient encoding of link names than 974 is currently possible. One possible solution is to look at a 975 smaller MPLS label for this purpose and to look at ways to retain 976 the popped labels for the purposes of end to end path reversal and 977 OAM. New work in SDN, in particular Protocol Oblivious Forwarding 978 may make these kinds of things possible in a generic manner. 980 13. IANA Considerations 982 This memo includes no request to IANA. 984 14. References 986 14.1. Informative References 988 [BLOOM] Active Bloom Filters for Multicast Addressing, 989 Z. Heszberger et. al. Budapest 990 University of Technology and Economics. 992 [OPENFLOW] www.openflow.org 994 [ONF] www.opennetworking.org 996 [POF] Protocol Oblivious Forwarding: 997 http://www.poforwarding.org/ 999 [PLACEMENT] The Controller Placement Problem, Nick McKeon, 1000 Brandon Heller, Rob Sherwood. HotSDN'12, August 1001 13, 2012, Helsinki, Finland. 2012 ACM 978-1- 1002 4503-1477-0/12/08, 1003 http://conferences.sigcomm.org/sigcomm/2012/paper 1004 /hotsdn/p7.pdf 1006 [RFC791] Internet Protocol, Information Sciences 1007 Institute, RFC 791, September 1981. 1009 [SANE] SANE: A Protection Architecture for Enterprise 1010 Networks, Martin Casado, Nick McKeown, 1011 http://yuba.stanford.edu/~casado/sane.pdf, 1012 Stanford and ICSI 2005. 1014 [SDNGOOG] SDN at Google - Opportunities for WAN 1015 optimization, E. Crabbe, V. Valancius, 8/1/2012. 1016 Presentation at IETF84 SDN BOF. 1018 [SEGMENT] Segment Routing with IS-IS, S.Previdi et. Al. 1019 http://tools.ietf.org/html/draft-previdi- 1020 filsfils-isis-segment-routing-00 1022 [SLSR] Software Defined Networking and Centralized 1023 Controller State Distribution Reduction, 1024 www.ieee802.org/1/files/public/ 1025 docs2012/new- 1026 ashwood-sdn-optimizations-0712-v01.pdf 1028 [SRBAD] Deprecation of Source Routing Options in IPV4 1029 http://tools.ietf.org/html/draft-reitzel-ipv4- 1030 source-routing-is-evil-00 1032 [SRSDN] Source Routed Forwarding with SDN, M. Soliman 1033 http://conferences.sigcomm.org/co- 1034 next/2012/eproceedings/student/p43.pdf 1036 15. Authors' Addresses 1038 Peter Ashwood-Smith 1039 Huawei Canada Inc. 1040 303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1 1041 Email: Peter.AshwoodSmith@huawei.com 1043 Mourad Soliman 1044 Carleton University, 1045 1125 Colonel By Drive Ottawa, Ontario K1S 5B6 Canada 1046 Email: MouradSoliman@cmail.carleton.ca 1048 Tao Wan 1049 Huawei Canada Inc. 1050 303 Terry Fox Drive, Suite 400, Kanata, Ontario K2K 3J1 1051 Email: Tao.Wan@huawei.com 1053 16. Contributors 1055 We invite more contributors. 1057 17. Acknowledgements 1059 We gratefully appreciate the feedback of Nigel Bragg, Sue Hares, 1060 Peter Willis, Biswajit Nandy and Linda Dunbar.