Network Working Group                                       M. Behringer
Internet-Draft                                                 A. Retana
Intended status: Informational                             Cisco Systems
Expires: October 27, 2016                                       R. White
                                                                Ericsson
                                                               G. Huston
                                                                   APNIC
                                                          April 25, 2016


              A Framework for Defining Network Complexity
              draft-behringer-ncrg-complexity-framework-02

Abstract

   Complexity is a widely used parameter in network design, yet there is no generally accepted definition of the term.  Complexity metrics exist in a wide range of research papers, but most of these address only a particular aspect of a network, for example the complexity of a graph or software.  While it may be impossible to define a metric for overall network complexity, there is a desire to better understand the complexity of a network as a whole, as deployed today to provide Internet services.  This document provides a framework to guide research on the topic of network complexity, as well as some practical examples for trade-offs in networking.

   This document summarizes the work of the IRTF's Network Complexity Research Group (NCRG) at the time of its closure.  It does not present final results, but a snapshot of an ongoing activity, as a basis for future work.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 27, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents

   1.  Introduction
   2.  General Considerations
     2.1.  The Behavior of a Complex Network
     2.2.  Complex versus Complicated
     2.3.  Robust Yet Fragile
     2.4.  The Complexity Cube
     2.5.  Related Concepts
     2.6.  Technical Debt
     2.7.  Layering considerations
   3.  Tradeoffs
     3.1.  Control Plane State versus Optimal Forwarding Paths (Stretch)
     3.2.  Configuration State versus Failure Domain Separation
     3.3.  Policy Centralization versus Optimal Policy Application
     3.4.  Configuration State versus Per Hop Forwarding Optimization
     3.5.  Reactivity versus Stability
   4.  Parameters
   5.  Elements of Complexity
     5.1.  The Physical Network (Hardware)
     5.2.  Algorithms
     5.3.  State in the Network
     5.4.  Churn
     5.5.  Knowledge
   6.  Location of Complexity
     6.1.  Topological Location
     6.2.  Logical Location
     6.3.  Layering Considerations
   7.  Dependencies
     7.1.  Local Dependencies
     7.2.  Network Wide Dependencies
     7.3.  Network External Dependencies
   8.  Management Interactions
     8.1.  Configuration Complexity
     8.2.  Troubleshooting Complexity
     8.3.  Monitoring Complexity
     8.4.  Complexity of System Integration
   9.  External Interactions
   10. Examples
   11. Security Considerations
   12. Acknowledgements
   13. Informative References
   Authors' Addresses

1.  Introduction

   Network design can be described as the art of finding the simplest solution to solve a given problem.  Complexity is thus assumed in the design process; engineers do not ask, "should there be complexity here," but rather, "how much complexity is required to solve this problem."  This question, "how much complexity," assumes there is some way to characterize the amount of complexity present in a system.  The reality, however, is that this is an area of research and experience rather than a solved problem within the network engineering space.  Today's design decisions are made based on a rough estimation of the network's complexity, rather than a solid understanding.

   The document begins with general considerations, including some foundational definitions and concepts.  It then provides some examples of trade-offs that network engineers regularly make when designing a network.  This section serves to demonstrate that there is no single answer to complexity; rather, it is a managed trade-off between many parameters.  The document then provides a set of parameters engineers should consider when attempting to either measure complexity or build a framework around it.  This list makes no claim to be complete, but it serves as a guide to known existing areas of investigation, as well as a pointer to areas that still need to be investigated.

   Two purposes are served here.  The first is to guide researchers working in the area of complexity in their work.  The more researchers are able to connect their work to the concerns of network designers, the more useful their research will become.  This document may also guide research into areas not considered before.  The second is to help network engineers build a better understanding of where complexity might be "hiding" in their networks, and to be more fully aware of how complexity interacts with design and deployment.

   The goal of the IRTF Network Complexity Research Group (NCRG) [ncrg] was to define a framework for network complexity research, while recognising that it may be impossible to define metrics for overall network complexity.  This document summarizes the work of this group at the time of its closure in 2014.  It does not present final results, but a snapshot of an ongoing activity, as a basis for future work.

   Many references to existing research in the area of network complexity are listed on the Network Complexity Wiki [wiki].  This wiki also contains background information on previous meetings on the subject, previous research, etc.

2.  General Considerations

2.1.  The Behavior of a Complex Network

   While there is no generally accepted definition of network complexity, there is some understanding of the behavior of a complex network.  It has some or all of the following properties:

   o  Self-Organization: A network runs some protocols and processes without external control; for example a routing process, failover mechanisms, etc.  The interaction of those mechanisms can lead to complex behaviour.

   o  Unpredictability: In a complex network, the effect of a local change on the behaviour of the global network may be unpredictable.

   o  Emergence: The behaviour of the system as a whole is not reflected in the behaviour of any individual component of the system.

   o  Non-linearity: An input into the network produces a non-linear result.

   o  Fragility: A small local input can break the entire system.

2.2.  Complex versus Complicated

   The two terms "complex" and "complicated" are often used interchangeably, yet they describe different but overlapping properties.  The RG made the following statements about the two terms, but they would need further refinement to be considered formal definitions:

   o  A "complicated" system is a deterministic system that can be understood by an appropriate level of analysis.  "Complicated" is often an externally applied attribute rather than an intrinsic property of a system, and is typically associated with systems that require deep or significant levels of analysis.

   o  Being "complex", by comparison, is an intrinsic property of a system, and is typically associated with emergent behaviours, such that the behaviour of the system is not fully described by the sum of the behaviour of each of the components of the system.  Complex systems are often associated with systems whose components exhibit high levels of interaction and feedback.

2.3.  Robust Yet Fragile

   Networks typically follow the "robust yet fragile" paradigm: they are designed to be robust against a set of failures, yet they are very vulnerable to other failures.  Doyle [Doyle] explains the concept with an example: the Internet is robust against single component failure, but fragile to targeted attacks.  The "robust yet fragile" property also touches on the fact that all network designs necessarily make trade-offs between different design goals.  The simplest one is articulated in "The Twelve Networking Truths" [RFC1925]: "Good, Fast, Cheap: Pick any two (you can't have all three)."  In real network design, trade-offs between many aspects have to be made, including, for example, issues of scope, time and cost in the network cycle of planning, design, implementation and management of a network platform.  Parameters are discussed in Section 4, and Section 3 gives some examples of tradeoffs.

2.4.  The Complexity Cube

   Complex tasks on a network can be done in different components of the network.  For example, routing can be controlled by central algorithms, and the result distributed (e.g., the OpenFlow model); the routing algorithm can also run completely distributed (e.g., routing protocols such as OSPF or ISIS); or a human operator could calculate routing tables and statically configure routing.  Behringer [Behringer] arranges these options as a "complexity cube" with three axes: network elements, central systems, and human operators.  Any function can be implemented on any of these three axes, and this choice likely has an impact on the overall complexity of the system.

2.5.  Related Concepts

   When discussing network complexity, a large number of influencing factors have to be taken into account to arrive at a full picture, for example:

   o  State in the network: This includes the network elements, such as routers, switches (with their OS, including protocols), lines, central systems, etc., and, for example, the number and algorithmic complexity of the protocols on network devices.

   o  Human operators: Complexity often manifests itself in a network that is not completely understood by human operators.  Human error is a primary source of catastrophic failures and therefore must be taken into account.

   o  Classes / templates: Rather than counting the number of lines in a configuration, or the number of hardware elements, what matters more is the number of classes from which those can be derived.  In other words, it is probably less complex to have 1000 interfaces that are identically configured than 5 that are configured completely differently.

   o  Dependencies and interactions: The number of dependencies between elements, as well as the interactions between them, has an influence on the complexity of the network.

   o  TCO (Total cost of ownership): TCO could be a good metric for network complexity, if the TCO calculation takes into account all influencing factors, for example the training time for staff to be able to maintain a network.

   o  Benchmark Unit Cost (BUC) is a related metric that indicates the cost of operating a certain component.  If calculated well, it reflects at least part of the complexity of this component.  Therefore, the way TCO or BUC is calculated can help to derive a complexity metric.

   o  Churn / rate of change: The rate of change in a network itself can contribute to complexity, especially if a number of components of the overall network interact.

   Networks differ in terms of their intended purpose (such as is found in the differences between enterprise and public carriage network platforms) and in their intended role (such as is found in the differences between so-called "access" networks and "core" transit networks).  The differences in terms of role and purpose can often lead to differences in the tolerance for, and even the metrics of, complexity within such different network scenarios.  This is not necessarily a space where a single methodology for measuring complexity, and defining a single threshold value of acceptability of complexity, is appropriate.

2.6.  Technical Debt

   Many changes in a network are made with a dependency on the existing network.  Often, a suboptimal decision is made because the optimal decision is hard or impossible to realise at the time.  Over time, the accumulation of suboptimal changes itself causes significant complexity, which would not have been there had the optimal solution been implemented.

   The term "technical debt" refers to this accumulated complexity of sub-optimal changes over time.  As with financial debt, the idea is that technical debt, too, must be repaid one day by cleaning up the network or software.

2.7.  Layering considerations

   In considering the larger space of applications, transport services, network services and media services, it is feasible to engineer certain types of desired application responses in many different ways, involving different layers of the so-called network protocol stack.  For example, Quality of Service could be engineered at any of these layers, or even in a number of combinations of different layers.

   Considerations of complexity arise when mutually incompatible measures are used in combination (such as error detection and retransmission at the media layer in conjunction with the use of the TCP transport protocol), or when assumptions used in one layer are violated by another layer.  This can lead to surprising outcomes and complex interactions, for example oscillation caused by different layers using different timers for retransmission.  These issues have led to the perspective that increased layering frequently increases complexity [RFC3439].

   While this research work is focussed on network complexity, the interactions of the network with the end-to-end transport protocols, application layer protocols and media properties are relevant considerations here.

3.  Tradeoffs

   Network complexity is a system level, rather than component level, problem; overall system complexity may be more than the sum of the complexity of the individual pieces.

   There are two basic ways in which system level problems might be addressed: interfaces and continuums.  In addressing a system level problem through interfaces, we seek to treat each piece of the system as a "black box" and develop a complete understanding of the interfaces between these black boxes.  In addressing a system level problem as a continuum, we seek to understand the impact of a single change or element on the entire system as a set of tradeoffs.

   While network complexity can profitably be approached from either of these perspectives, in this document we have chosen to approach the system level impact of network complexity from the perspective of continuums of tradeoffs.  In theory, modifying the network to resolve one particular problem (or class of problems) will add complexity which results in the increased likelihood (or appearance) of another class of problems.  Discovering these continuums of tradeoffs, and then determining how to measure each one, become the key steps in understanding and measuring system level complexity in this view.

   The following sections describe five such continuums; more may be possible.

   o  Control Plane State versus Optimal Forwarding Paths (or its opposite measure, stretch)

   o  Configuration State versus Failure Domain Separation

   o  Policy Centralization versus Optimal Policy Application

   o  Configuration State versus Per Hop Forwarding Optimization

   o  Reactivity versus Stability

3.1.  Control Plane State versus Optimal Forwarding Paths (Stretch)

   Control plane state is the aggregate amount of information carried by the control plane through the network in order to produce the forwarding table at each device.  Each additional piece of information added to the control plane --such as more specific reachability information, policy information, additional control planes for virtualization and tunneling, or more precise topology information-- adds to the complexity of the control plane.  This added complexity, in turn, adds to the burden of monitoring, understanding, troubleshooting, and managing the network.

   Removing control plane state, however, is not always a net positive gain for the network as a system; removing control plane state almost always results in decreased optimality in the forwarding and handling of packets travelling through the network.  This decreased optimality can be termed stretch, which is defined as the difference between the absolute shortest (or best) path traffic could take through the network and the path the traffic actually takes.  Stretch is expressed as the difference between the optimal and the actual path.  The figure below provides an example of this tradeoff.

                              +---R1---+
                              |        |
   (aggregate: 192.0.2/24)    R2       R3    (aggregate: 192.0.2/24)
                              |        |
                              R4-------R5
                              |
   (announce: 192.0.2.1/32)   R6

   Assume each link is of equal cost in this figure, and that R6 is advertising 192.0.2.1/32.

   For R1, the shortest path to 192.0.2.1/32, advertised by R6, is along the path (R1,R2,R4,R6).

   Assume, however, that the network administrator decides to aggregate reachability information at R2 and R3, advertising 192.0.2.0/24 towards R1 from both of these points.  This reduces the overall complexity of the control plane by reducing the amount of information carried past these two routers (at R1 only in this case).

   Aggregating reachability information at R2 and R3, however, may have the impact of making both routes towards 192.0.2.1/32 appear as equal cost paths to R1; there is no particular reason R1 should choose the shortest path through R2 over the longer path through R3.  This, in effect, increases the stretch of the network.  The shortest path from R1 to R6 is 3 hops, a path that will always be chosen before aggregation is configured.  Assuming half of the traffic will be forwarded along the path through R2 (3 hops), and half through R3 (4 hops), the network is stretched by ((3+4)/2 - 3), or 0.5, "half a hop."

   Traffic engineering through various tunneling mechanisms is, at a broad level, adding control plane state to provide more optimal forwarding (or network utilization).  Optimizing network utilization may require detuning stretch (intentionally increasing stretch) to increase overall network utilization and efficiency; this is simply an alternate instance of control plane state (and hence complexity) weighed against optimal forwarding through the network.
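
   The arithmetic behind this half-hop figure can be illustrated with a short sketch (Python; purely illustrative and not part of the framework itself).  The node names follow the figure above, equal link costs are assumed, and traffic is assumed to be split evenly across the two equal-cost routes created by the aggregates:

      # Illustrative sketch: the "half a hop" stretch from the example above.
      from collections import deque

      # Topology from the figure: R1-R2, R1-R3, R2-R4, R3-R5, R4-R5, R4-R6.
      LINKS = [("R1", "R2"), ("R1", "R3"), ("R2", "R4"),
               ("R3", "R5"), ("R4", "R5"), ("R4", "R6")]

      ADJ = {}
      for a, b in LINKS:
          ADJ.setdefault(a, set()).add(b)
          ADJ.setdefault(b, set()).add(a)

      def hop_count(src, dst):
          """Shortest path length in hops (all links are of equal cost)."""
          seen, queue = {src}, deque([(src, 0)])
          while queue:
              node, hops = queue.popleft()
              if node == dst:
                  return hops
              for nbr in ADJ[node]:
                  if nbr not in seen:
                      seen.add(nbr)
                      queue.append((nbr, hops + 1))
          raise ValueError("no path")

      optimal = hop_count("R1", "R6")             # 3 hops, via R2
      via_r3  = 1 + hop_count("R3", "R6")         # 4 hops, via R3
      # With the /24 aggregate advertised by both R2 and R3, R1 sees two
      # equal routes and, in this example, splits traffic between them.
      average_actual = (optimal + via_r3) / 2     # 3.5 hops
      stretch = average_actual - optimal          # 0.5, "half a hop"
      print(optimal, via_r3, average_actual, stretch)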

3.2.  Configuration State versus Failure Domain Separation

   A failure domain, within the context of a network control plane, can be defined as the set of devices impacted by a change in the network topology or configuration.  A network with larger failure domains is more prone to cascading failures, so smaller failure domains are normally preferred over larger ones.

   The primary means used to limit the size of a failure domain within a network's control plane is information hiding; the two primary types of information hidden in a network control plane are reachability information and topology information.  An example of aggregating reachability information is summarizing the routes 192.0.2.1/32, 192.0.2.2/32, and 192.0.2.3/32 into the single route 192.0.2.0/24, along with the aggregation of the metric information associated with each of the component routes.  Note that aggregation is a "natural" part of IP networks, starting with the aggregation of individual hosts into a subnet at the network edge.  An example of topology aggregation is the summarization of routes at a link state flooding domain boundary, or the lack of topology information in a distance-vector protocol.

   While limiting the size of failure domains appears to be an absolute good in terms of network complexity, there is a definite tradeoff in configuration complexity.  The more failure domain edges created in a network, the more complex the configuration will become.  This is particularly true if redistribution of routing information between multiple control plane processes is used to create failure domain boundaries; moving between different types of control planes causes a loss of the consistent metrics most control planes rely on to build loop free paths.  Redistribution, in particular, opens the door to very destructive positive feedback loops within the control plane.  Examples of control plane complexity caused by the creation of failure domain boundaries include route filters, routing aggregation configuration, and metric modifications to engineer traffic across failure domain boundaries.

   Returning to the network described in the previous section, aggregating routing information at R2 and R3 will divide the network into two failure domains: (R1,R2,R3) and (R2,R3,R4,R5).  A failure at R5 should have no impact on the forwarding information at R1.

   A false failure domain separation occurs, however, when the metric of the aggregate route advertised by R2 and R3 is dependent on one of the routes within the aggregate.  For instance, if the metric of the 192.0.2.0/24 aggregate is derived from the metric of the component 192.0.2.1/32, then a failure of this one component will cause changes in the forwarding table at R1 --in this case, the control plane has not truly been separated into two distinct failure domains.  The added complexity in the illustration network would be the management of the configuration required to aggregate the control plane information, and the management of the metrics to ensure the control plane is truly separated into two distinct failure domains.

   Replacing aggregation with redistribution adds the complexity of managing the feedback of routing information redistributed between the failure domains.  For instance, if R1, R2, and R3 were configured to run one routing protocol, while R2, R3, R4, R5, and R6 were configured to run another protocol, R2 and R3 could be configured to redistribute reachability information between these two control planes.  This can split the control plane into multiple failure domains (depending on how, specifically, redistribution is configured), but at the cost of creating and managing the redistribution configuration.  Further, R3 must be configured to block routing information redistributed at R2 towards R1 from being redistributed (again) towards R4 and R5.
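
   The difference between a true and a false failure domain separation can be sketched as follows (purely illustrative Python; the metric values and the two derivation policies are invented for the example):

      # Illustrative sketch: how the metric derivation for an aggregate
      # couples, or decouples, the two failure domains described above.

      components = {"192.0.2.1/32": 10, "192.0.2.2/32": 20, "192.0.2.3/32": 30}

      def aggregate_metric_tracking(comps):
          # Aggregate metric derived from a component (here: the lowest).
          # A change to that component changes what R1 sees, so the
          # failure domains are not truly separate.
          return min(comps.values())

      def aggregate_metric_fixed(comps):
          # Aggregate metric configured as a constant; internal changes
          # stay hidden behind the aggregation boundary.
          return 100

      before = (aggregate_metric_tracking(components),
                aggregate_metric_fixed(components))

      # 192.0.2.1/32 fails and is withdrawn inside the aggregated domain.
      del components["192.0.2.1/32"]

      after = (aggregate_metric_tracking(components),
               aggregate_metric_fixed(components))

      print(before, after)   # (10, 100) -> (20, 100): only the tracking
                             # variant leaks the internal failure to R1.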

3.3.  Policy Centralization versus Optimal Policy Application

   Another broad area where control plane complexity interacts with optimal network utilization is Quality of Service (QoS).  Two specific actions are required to optimize the flow of traffic through a network: marking and Per Hop Behaviors (PHBs).  Rather than examining each packet at each forwarding device in a network, packets are often marked, or classified, in some way (typically through Type of Service bits) so they can be handled consistently at all forwarding devices.

   Packet marking policies must be configured on specific forwarding devices throughout the network.  Distributing marking closer to the edge of the network necessarily means configuring and managing more devices, but produces optimal forwarding at a larger number of network devices.  Moving marking towards the network core means packets are marked for proper handling across a smaller number of devices.  In the same way, each device through which a packet passes with the correct PHBs configured represents an increase in the consistency of packet handling through the network, as well as an increase in the number of devices which must be configured and managed for the correct PHBs.  The network below is used to illustrate this concept.

             +----R1----+
             |          |
          +--R2--+   +--R3--+
          |      |   |      |
          R4     R5  R6     R7

   In this network, marking and PHB configuration may be configured on any device, R1 through R7.

   Assume marking is configured at the network edge; in this case, four devices, (R4,R5,R6,R7), must be configured, including ongoing configuration management, to mark packets.  Moving packet marking to R2 and R3 will halve the number of devices on which packet marking configuration must be managed, but at the cost of inconsistent packet handling at the inbound interfaces of R2 and R3 themselves.

   Thus, reducing the number of devices which must have managed configurations for packet marking reduces the optimality of packet flow through the network.  Assuming packet marking is actually configured along the edge of this network, configuring PHBs on different devices has this same tradeoff of managed configuration versus optimal traffic flow.  If the correct PHBs are configured on R1, R2, and R3, then packets passing through the network will be handled correctly at each hop.  The cost involved will be the management of PHB configuration on three devices.  Configuring a single device for the correct PHBs (R1, for instance) will decrease the amount of configuration management required, at the cost of less than optimal packet handling along the entire path.
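
   The tradeoff between the number of devices that must carry marking configuration and the consistency of packet handling can be made concrete with a small sketch (purely illustrative Python; it assumes traffic enters at the edge routers R4 through R7 and is destined beyond R1, and it simply counts the hops that see already-marked packets):

      # Illustrative sketch: managed-configuration count versus how early
      # in the path packets are marked, for the example topology above.

      PATHS = {                       # ingress router -> path towards R1
          "R4": ["R4", "R2", "R1"],
          "R5": ["R5", "R2", "R1"],
          "R6": ["R6", "R3", "R1"],
          "R7": ["R7", "R3", "R1"],
      }

      def evaluate(marking_devices):
          """Return (devices to manage, hops seeing already-marked packets)."""
          consistent_hops = 0
          for path in PATHS.values():
              marked = False
              for hop in path:
                  if hop in marking_devices:
                      marked = True
                  consistent_hops += 1 if marked else 0
          return len(marking_devices), consistent_hops

      for choice in (("R4", "R5", "R6", "R7"),   # mark at the edge
                     ("R2", "R3"),               # mark one hop in
                     ("R1",)):                   # mark at the core only
          print(choice, evaluate(set(choice)))
      # -> edge marking: (4, 12); R2/R3: (2, 8); R1 only: (1, 4)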

3.4.  Configuration State versus Per Hop Forwarding Optimization

   The number of PHBs configured along a forwarding path exhibits the same complexity versus optimality tradeoff described in the section above.  The more classes (or queues) traffic is divided into, the more fine-grained the management of traffic will be as it passes through the network.  At the same time, each class of service must be managed, both in terms of configuration and in its interaction with other classes of service configured in the network.

3.5.  Reactivity versus Stability

   The speed at which the network's control plane can react to a change in configuration or topology is an area of widespread study.  Control plane convergence can be broken down into four essential parts:

   o  Detecting the change

   o  Propagating information about the change

   o  Determining the best path(s) through the network after the change

   o  Changing the forwarding path at each network element along the modified paths

   Each of these areas can be addressed in an effort to improve network convergence speeds; some of these improvements come at the cost of increased complexity.

   Changes in network topology can be detected much more quickly through faster echo (or hello) mechanisms, lower layer physical detection, and other methods.  Each of these mechanisms, however, can only be used at the cost of evaluating and managing false positives and high rates of topology change.

   If a change in the state of a link can be detected in 10ms, for instance, the link could theoretically change state 50 times in a second --it would be impossible to tune a network control plane to react to topology changes at this rate.  Injecting topology change information into the control plane at this rate can destabilize the control plane, and hence the network itself.  To counter this, most fast down detection techniques include some form of dampening mechanism; these dampening mechanisms add complexity that must be configured and managed.

   Changes in network topology must also be propagated throughout the network, so each device along the path can compute new forwarding tables.  In high speed network environments, propagation of routing information changes can take place in tens of milliseconds, opening the possibility of multiple changes being propagated per second.  Injecting information at this rate into the control plane creates the risk of overloading the processes and devices participating in the control plane, as well as creating destructive positive feedback loops in the network.  To avoid these consequences, most control plane protocols regulate the speed at which information about network changes can be transmitted by any individual device.  A recent innovation in this area is the use of exponential backoff techniques to manage the rate at which information is advertised into the control plane; the first change is transmitted quickly, while subsequent changes are transmitted more slowly.  These techniques all limit the destabilizing effects of rapid information flows through the control plane, at the cost of the added complexity of configuring and managing the rate at which the control plane can propagate information about network changes.
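
   The exponential backoff idea can be sketched as follows (purely illustrative Python; the timer values are invented, and no particular protocol's behaviour is implied):

      # Illustrative sketch: exponential backoff for acting on successive
      # network changes.  Each of these values is a tunable -- which is
      # exactly the added configuration and management complexity noted
      # in the text.

      INITIAL_WAIT = 0.05   # seconds: the first change is acted on quickly
      MULTIPLIER   = 2.0    # each subsequent change waits twice as long
      MAX_WAIT     = 10.0   # ceiling on the delay
      QUIET_PERIOD = 30.0   # no changes for this long resets the backoff

      def backoff_delays(change_times):
          """Map each change timestamp to the delay applied before acting."""
          delays, wait, last = [], INITIAL_WAIT, None
          for t in change_times:
              if last is not None and t - last > QUIET_PERIOD:
                  wait = INITIAL_WAIT                  # network has been quiet
              delays.append(wait)
              wait = min(wait * MULTIPLIER, MAX_WAIT)  # back off further
              last = t
          return delays

      # A burst of flaps, then a single change much later.
      print(backoff_delays([0.0, 0.1, 0.2, 0.3, 0.4, 120.0]))
      # -> [0.05, 0.1, 0.2, 0.4, 0.8, 0.05]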

   All control planes require some form of algorithmic calculation to find the best path through the network to any given destination.  These algorithms are often lightweight, but they still require some amount of memory and computational power to execute.  Rapid changes in the network can overwhelm the devices on which these algorithms run, particularly if changes are presented more quickly than the algorithm can run.  Once the devices running these algorithms become processor or memory bound, they could experience a computational failure altogether, causing a more general network outage.  To prevent computational overloading, control plane protocols are designed with timers limiting how often they can compute the best path through a network; often these timers are exponential in nature, allowing the first computation to run quickly, while delaying subsequent computations.  Configuring and managing these timers is another source of complexity within the network.

   Another option to improve the speed at which the control plane reacts to changes in the network is to precompute alternate paths at each device, and possibly preinstall forwarding information into local forwarding tables.  Additional state is often needed to precompute alternate paths, and additional algorithms and techniques are often configured and deployed.  This additional state, and these additional algorithms, add some amount of complexity to the configuration and management of the network.

   In some situations (for some topologies), a tunnel is required to pass traffic around a network failure or topology change.  These tunnels, while not manually configured, represent additional complexity at the forwarding and control planes.

4.  Parameters

   In Section 3 we describe a set of trade-offs in network design to illustrate the practical choices network operators have to make.  The number of parameters to consider in such tradeoff scenarios is very large, such that a complete listing may not be possible.  Also, the dependencies between the various metrics are themselves very complex and require further study.  This document attempts to define a methodology and an overall high level structure.

   To analyse tradeoffs it is necessary to formalise them; one possible, deliberately minimal formalisation is sketched after the list below.  The list of parameters for such tradeoffs is long, and the parameters can be complex in themselves.  For example, "cost" can be a simple unidimensional metric, but "extensibility" or "optimal forwarding state" are harder to define in detail.

   A list of parameters to trade off contains metrics such as:

   o  State: How much state needs to be held in the control plane, forwarding plane, configuration, etc.?

   o  Cost: How much does the network cost to build (capex) and run (opex)?

   o  Bandwidth / delay / jitter: Traffic characteristics between two points (average, max, ...)

   o  Configuration complexity: How hard is it to configure and maintain the configuration?

   o  Susceptibility to Denial-of-Service: How easy is it to attack the service?

   o  Security (confidentiality / integrity): How easy is it to sniff / modify / insert the data flow?

   o  Scalability: To what size can I grow the network / service?

   o  Stability: How stable is the network under the influence of local change?

   o  Reactivity: How fast does the network converge, or adapt to new situations?

   o  Extensibility: Can I use the network for other services in the future?

   o  Ease of troubleshooting: Are failure domains separated?  How hard is it to find and correct problems?

   o  Optimal per-hop forwarding behavior

   o  Predictability: If I change a parameter, what will happen?

   o  Clean failure: When a problem arises, does the root cause lead to a deterministic failure?
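
   The following sketch is one possible, and deliberately minimal, formalisation of such a list (Python; the parameters, scores, and weights shown are invented for the example and carry no recommendation):

      # Illustrative sketch: scoring two candidate designs against a
      # subset of the parameters listed above.  The scores and weights
      # are invented; the point is the explicit structure, not the numbers.

      from dataclasses import dataclass, field

      PARAMETERS = ("state", "cost", "configuration_complexity",
                    "scalability", "stability", "reactivity",
                    "ease_of_troubleshooting")

      @dataclass
      class DesignProfile:
          name: str
          scores: dict = field(default_factory=dict)  # 1 (poor) .. 5 (good)

          def weighted_total(self, weights):
              return sum(weights.get(p, 1) * self.scores.get(p, 0)
                         for p in PARAMETERS)

      aggregated = DesignProfile("aggregation at R2/R3", {
          "state": 4, "configuration_complexity": 2, "stability": 4,
          "reactivity": 3, "ease_of_troubleshooting": 4})

      flat = DesignProfile("no aggregation", {
          "state": 2, "configuration_complexity": 4, "stability": 3,
          "reactivity": 4, "ease_of_troubleshooting": 3})

      # An operator who values stability and troubleshooting over raw state:
      weights = {"stability": 3, "ease_of_troubleshooting": 2}

      for design in (aggregated, flat):
          print(design.name, design.weighted_total(weights))

   The value of such a formalisation lies in making the set of parameters and their relative weights explicit, not in the particular numbers chosen.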

5.  Elements of Complexity

   Complexity can be found in various elements of a networked system.  For example, the configuration of a network element reflects some of the complexity contained in this system, and an algorithm used by a protocol may be more or less complex.  When classifying complexity, the first question to ask is "WHAT is complex?".  This section offers a method to answer this question.

5.1.  The Physical Network (Hardware)

   The set of network devices and wiring contains a certain complexity.  For example, adding a redundant link between two locations increases the complexity of the network, but provides more redundancy.  Network devices can also be more or less modular, which has an impact on complexity, trading off against ease of maintenance, availability and upgradability.

5.2.  Algorithms

   The behavior of the physical network is not only defined by the hardware, but also by algorithms that run on network elements and in central locations.  Every algorithm has a certain intrinsic complexity, which is the subject of research on software complexity.

5.3.  State in the Network

   The way a network element treats traffic is defined largely by the state in the network, in the form of configuration, routing state, security measures, etc.  Section 3.1 shows an example where more control plane state allows more precise forwarding.

5.4.  Churn

   The rate of change itself is a parameter in complexity, which needs to be weighed against other parameters.  Section 3.5 explains a trade-off between the speed of communicating changes through the network and the stability of the network.

5.5.  Knowledge

   Certain complexity parameters have a strong link to the human aspect of networking.  For example, the more options and parameters a network protocol has, the harder it is to configure and trouble-shoot.  Therefore, there is a trade-off between the knowledge to be maintained by operational staff and the desired functionality.  The required knowledge of network operators is thus an important part of complexity considerations.

6.  Location of Complexity

   The previous section discussed in which form complexity may be perceived.  This section focuses on where this complexity is located in a network.  For example, an algorithm can run centrally, in a distributed fashion, or even in the head of a network administrator.  In classifying the complexity of a network, the location of a component may have an impact on overall complexity.  This section offers a methodology to answer the question "WHERE is the complex component?"

6.1.  Topological Location

   An algorithm can run distributed; for example, a routing protocol like OSPF runs on all routers in a network.  But it can also run in a central location such as the Network Operations Center (NOC).  The physical location has an impact on several other parameters, such as availability (local changes might be faster than going through a remote NOC) and ease of operation, because it might be easier to understand and troubleshoot one central entity rather than many remote ones.

   The example in Section 3.3 shows how the location of state (in this case configuration) impacts the precision of the policy enforcement and the corresponding state required.  Enforcement closer to the edge requires more network wide state, but is more precise.

6.2.  Logical Location

   Independent of its physical location, the logical location may also make a difference to complexity.  A controller function, for example, can reside in a NOC, but also on a network element.  Generally, organising a network into separate logical entities is considered positive, because it eases the understanding of the network, thereby making trouble-shooting and configuration easier.  For example, a BGP route reflector is a separate logical entity from a BGP speaker, but it may reside on the same physical node.

6.3.  Layering Considerations

   The layer of the TCP/IP stack in which a function is implemented can also have an impact on the complexity of the overall network.  Some functions are implemented in several layers in slightly different ways, which may lead to unexpected results.

   As an example, a link failure is detected on various layers: L1, L2, the IGP, BGP, and potentially more.  Since those have dependencies on each other, different link failure detection times can cause undesired effects.  Dependencies are discussed in more detail in the next section.

7.  Dependencies

   Dependencies are generally regarded as related to overall complexity; a system with fewer dependencies is considered less complex.  This section proposes a way to analyse dependencies in a network.

   For example, [Chun] states: "We conjecture that the complexity particular to networked systems arises from the need to ensure state is kept in sync with its distributed dependencies."

   In this document we distinguish three types of dependencies: local dependencies, network wide dependencies, and network external dependencies.

7.1.  Local Dependencies

   Local dependencies are relative to a single node in the network.  For example, an interface on a node may have an IP address; this address may be used in other parts of the configuration.  If the interface address changes, the dependent configuration parts have to change as well.

   Similar dependencies exist for QoS policies, access-control-lists, names and numbers of configuration parts, etc.

7.2.  Network Wide Dependencies

   Routing protocols, failover protocols, and many others have dependencies across the network.  If one node is affected by a problem, this may have a ripple effect through the network.  These protocols are typically designed to deal with unexpected consequences and are thus unlikely to cause an issue on their own.  But occasionally a number of complexity issues come together, for example different timers on different layers, and then unexpected behaviour can occur.

7.3.  Network External Dependencies

   Some dependencies are on elements outside the actual network, for example on an external NTP clock source or a AAA server.  Again, a tradeoff is made: in the example of AAA used for login authentication, we reduce the configuration (state) on each node, specifically user-specific configuration, but we add an external dependency on a AAA server.  In networks with many administrators, a AAA server is clearly the only manageable way to track all administrators.  But it comes at the cost of this external dependency, with the consequence that admin access may be lost for all devices at the same time when the AAA server is unavailable.

   Even with the external dependency on a AAA server, centralizing the user information (and logging) still provides significant value over distributing user information across all devices.  To address the problem of the central dependency not being available, other solutions have been developed, for example a secondary authentication mode with a single root level password in case the AAA server is not available.
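
   The three classes of dependencies described in this section can be represented, for example, as a labelled list of edges, so that they can be counted and queried (purely illustrative Python; the element names are invented):

      # Illustrative sketch: dependencies as (item, depends-on, class)
      # triples, with a per-class count as one crude input to a
      # complexity assessment.

      from collections import Counter

      DEPENDENCIES = [
          ("acl-edge-in",    "Loopback0 address", "local"),
          ("bgp neighbor",   "Loopback0 address", "local"),
          ("ospf area 0",    "ospf area 0 on R2", "network-wide"),
          ("failover group", "peer router R3",    "network-wide"),
          ("device login",   "AAA server",        "external"),
          ("timestamps",     "NTP source",        "external"),
      ]

      def classify(deps):
          return Counter(kind for _, _, kind in deps)

      def impacted(deps, changed_item):
          """Everything to revisit when 'changed_item' changes."""
          return [item for item, target, _ in deps if target == changed_item]

      print(classify(DEPENDENCIES))
      # Counter({'local': 2, 'network-wide': 2, 'external': 2})

      print(impacted(DEPENDENCIES, "Loopback0 address"))
      # ['acl-edge-in', 'bgp neighbor'] -- the Section 7.1 example: changing
      # an interface address forces changes in dependent configuration.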

8.  Management Interactions

   A static network generally is relatively stable; conversely, changes introduce a degree of uncertainty and therefore need to be examined in detail.  Also, troubleshooting a network intuitively exposes the complexity of the network.  This section proposes a methodology to classify management interactions with regard to their relationship to network complexity.

8.1.  Configuration Complexity

   Configuration can be seen as distributed state across network devices, where the administrator has direct influence on the operation of the network.  Modifying the configuration can improve the network behaviour overall, or negatively affect it.  In the worst case, a single misconfiguration could potentially bring down the entire network.  Therefore it is important that a human administrator can manage the complexity of the configuration well.

   The configuration reflects most of the local and global dependencies in the network, as explained in Section 7.  Tracking those dependencies in the configuration helps in understanding the overall network complexity.

8.2.  Troubleshooting Complexity

   Unexpected behaviour can have a number of sources: the configuration may contain errors, the operating system (algorithms) may have bugs, and the hardware may be faulty, which includes anything from broken fibres to faulty line cards.  In serious cases, a combination of causes could result in a single visible condition.  Tracking the root causes of an error condition may be extremely difficult, pointing to the complex nature of a network.

   Being able to find the source of a problem therefore requires a solid understanding of the complexity of a network.  The configuration complexity discussed in the previous section represents only a part of the overall problem space.

8.3.  Monitoring Complexity

   Even in the absence of error conditions, the state of the network should be monitored to detect error conditions, ideally before network services are affected.  For example, a single "link-down" event may not cause a service disruption in a well designed network, but the problem needs to be resolved quickly to restore redundancy.

   Monitoring a network has a certain complexity of its own.  Issues include scale, the variety of devices to be monitored, the variety of methods used to collect information, the inevitable loss of information as reporting is aggregated centrally, and the knowledge required to understand the network, its dependencies, and its interactions with users and other external inputs.

8.4.  Complexity of System Integration

   A network does not just consist of network devices, but includes a vast array of backend and support systems.  It also interfaces with a large variety of user devices, and offers a number of human interfaces, both to the user / customer as well as to administrators of the network.  Making sure the overall network provides the service expected requires a system integration effort.

   All those interactions and systems have to be modelled to understand the inter-dependencies and complexities in the network.  This is a large area of future research.

9.  External Interactions

   A network is not a self-contained entity, but exists to provide connectivity and services to users and other networks, both of which are outside the direct control of a network administrator.  The user experience of a network also illustrates a form of interaction with its own complexity.

   External interactions fall into the following categories:

   User Interactions: Users need a way to request a service, to have their problems resolved, and potentially to get billed for their usage.  There are a number of human interfaces that need to be considered, which depend to some extent on the network, for example for troubleshooting or monitoring usage.

   Interactions with End Systems: The network also interacts with the devices that connect to it.  Typically a device receives an IP address from the network, and information on how to resolve domain names, plus potentially other services.  While those interactions are relatively simple, the vast number of end device types makes this a complicated space to track.

   Inter-Network Interactions: Most networks connect to other networks.  Also in this case there are many interactions between networks, both technical (for example, running a routing protocol) and non-technical (for example, tracing problems across network boundaries).

   For a fully operational network providing services to users, the external interactions and dependencies also form an integral part of the overall complexity of the network service.  A specific example is the root DNS servers, which are critical to the functioning of the Internet.  Practically all Internet users have an implicit dependency on the root DNS servers, which explains why those are frequent targets for attacks.  Understanding the overall complexity of a network includes understanding all those external dependencies.  Of course, in the case of the root DNS servers, there is little a network operator can influence.

10.  Examples

   In the foreseeable future it is unlikely that a single, objective metric can be defined that includes all the relevant aspects of complexity.  In the absence of such a global metric, a comparative approach could be easier.

   For example, it is possible to compare the complexity of a centralised system, where algorithms run centrally and the results are distributed to the network nodes, with that of a distributed algorithm.  The type of algorithm may be similar, but the location is different, and a different dependency graph would result.  The supporting hardware may be the same and could thus be ignored for this exercise.  The layering is also likely to be the same.  The management interactions, though, would differ significantly between the two cases.

   The classification in this document also makes it easier to survey existing research with regard to which area of complexity is covered.  This could help in identifying open areas for research.

11.  Security Considerations

   This document does not discuss any specific security considerations.

12.  Acknowledgements

   The motivation and framework of this overview of studies into network complexity are the result of many meetings and discussions, with too many people to provide a full list here.  However, key contributions have been made by: John Doyle, Dave Meyer, Jon Crowcroft, Mark Handley, Fred Baker, Paul Vixie, Lars Eggert, Bob Briscoe, Keith Jones, Bruno Klauser, Steve Youell, Joel Obstfeld, and Philip Eardley.

   The authors would like to acknowledge the contributions of Rana Sircar, Ken Carlberg and Luca Caviglione in the preparation of this document.

13.  Informative References

   [Behringer]
              Behringer, M., "Classifying Network Complexity", Proceedings of the ACM Re-Arch'09, December 2009.

   [Chun]     Chun, B-G., Ratnasamy, S., and E. Kohler, "NetComplex: A Complexity Metric for Networked System Design", 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2008), April 2008.

   [Doyle]    Doyle, J., "The 'robust yet fragile' nature of the Internet", PNAS vol. 102, no. 41, pp. 14497-14502, October 2005.

   [ncrg]     "Network Complexity Research Group".

   [RFC1925]  Callon, R., "The Twelve Networking Truths", RFC 1925, DOI 10.17487/RFC1925, April 1996.

   [RFC3439]  Bush, R. and D. Meyer, "Some Internet Architectural Guidelines and Philosophy", RFC 3439, DOI 10.17487/RFC3439, December 2002.

   [wiki]     "Network Complexity Wiki".

Authors' Addresses

   Michael H. Behringer
   Cisco Systems
   Building D, 45 Allee des Ormes
   Mougins  06250
   France

   Email: mbehring@cisco.com


   Alvaro Retana
   Cisco Systems
   7025 Kit Creek Rd.
   Research Triangle Park, NC  27709
   USA

   Email: aretana@cisco.com


   Russ White
   Ericsson
   144 Warm Wood Lane
   Apex, NC  27539
   United States

   Email: russw@riw.us
   URI:   http://www.ericsson.com


   Geoff Huston
   Asia Pacific Network Information Centre
   6 Cordelia St
   South Brisbane, QLD  4101
   Australia

   Email: gih@apnic.net
   URI:   http://www.apnic.net