Traffic Engineering Working Group                    Wai Sum Lai, AT&T
Internet Draft                                  Dave McDysan, WorldCom
                                                          (Co-Editors)
Category: Informational
Expiration Date: April 2002                          Jim Boyle, PDNets
                                                         Malin Carlzon
                                                   Rob Coltun, Redback
                                                     Tim Griffin, AT&T
                                                               Ed Kern
                                               Tom Reddington, Lucent

                                                          October 2001

           Network Hierarchy and Multilayer Survivability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.  Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document is the deliverable of the Network Hierarchy and
   Survivability Techniques Design Team established within the Traffic
   Engineering Working Group.
   This team collected and documented current and near-term
   requirements for survivability and hierarchy in service provider
   environments.  For clarity, an expanded set of definitions is
   included.  The team determined that there appears to be a need to
   define a small set of interoperable survivability approaches in
   packet and non-packet networks.  Suggested approaches include
   path-based approaches as well as one that repairs connections in
   proximity to the network fault.  They operate primarily at a single
   network layer.  For hierarchy, there did not appear to be a driving
   near-term need for work on "vertical hierarchy," defined as
   communication between network layers such as TDM/optical and MPLS.
   In particular, instead of direct exchange of signaling and routing
   between vertical layers, some looser form of coordination and
   communication, such as the specification of hold-off timers, is a
   nearer-term need.  For "horizontal hierarchy" in data networks,
   there are several pressing needs.  The requirement is to be able to
   set up many LSPs in a service provider network with a hierarchical
   IGP.  This is necessary to support layer 2 and layer 3 VPN services
   that require edge-to-edge signaling across a core network.

   Please send comments to te-wg@ops.ietf.org

Table of Contents

   1. Abstract
   2. Conventions used in this document
   3. Introduction
   4. Terminology and Concepts
      4.1 Hierarchy
          4.1.1 Vertical Hierarchy
          4.1.2 Horizontal Hierarchy
      4.2 Survivability Terminology
          4.2.1 Survivability
          4.2.2 Generic Operations
          4.2.3 Survivability Techniques
          4.2.4 Survivability Performance
      4.3 Survivability Mechanisms: Comparison
   5. Survivability
      5.1 Scope
      5.2 Required initial set of survivability mechanisms
          5.2.1 1:1 Path Protection with Pre-Established Capacity
          5.2.2 1:1 Path Protection with Pre-Planned Capacity
          5.2.3 Local Restoration
          5.2.4 Path Restoration
      5.3 Applications Supported
      5.4 Timing Bounds for Survivability Mechanisms
      5.5 Coordination Among Layers
      5.6 Evolution Toward IP Over Optical
   6. Hierarchy Requirements
      6.1 Historical Context
      6.2 Applications for Horizontal Hierarchy
      6.3 Horizontal Hierarchy Requirements
   7. Survivability and Hierarchy
   8. Security Considerations
   9. References
   10. Acknowledgments
   11. Author's Addresses
   Appendix A: Questions used to help develop requirements
   Full Copyright Statement

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

3. Introduction

   This document presents a proposal of the tangible requirements for
   network survivability and hierarchy in current service provider
   environments.  With feedback solicited from the working group, the
   objective is to help focus the work being addressed in the TEWG
   (Traffic Engineering Working Group), CCAMP (Common Control and
   Measurement Plane Working Group), and other working groups.  A main
   goal of this work is to expedite the delivery of required
   functionality in multi-vendor service provider networks.  The
   initial focus is primarily on intra-domain operations.
   However, to maintain consistency in the provision of end-to-end
   service in a multi-provider environment, rules governing the
   operation of survivability mechanisms at domain boundaries must
   also be specified.  While such issues are raised and discussed
   where appropriate, they are not treated in depth in this initial
   release of the document.

   The document first develops a set of definitions to be used later
   in this document, and potentially in other documents as well.  It
   then addresses the requirements and issues associated with service
   restoration and hierarchy, and closes with a short discussion of
   survivability in a hierarchical context.

   Here is a summary of the findings:

   A. Survivability Requirements

   @ need to define a small set of interoperable survivability
     approaches in packet and non-packet networks
   @ suggested survivability mechanisms include
     - 1:1 path protection with pre-established backup capacity
       (non-shared)
     - 1:1 path protection with pre-planned backup capacity (shared)
     - local restoration with repairs in proximity to the network
       fault
     - path restoration through source-based rerouting
   @ timing bounds for service restoration to support voice call
     cutoff (140 msec to 2 sec), protocol timer requirements in
     premium data services, and mission critical applications
   @ use of restoration priority for service differentiation

   B. Hierarchy Requirements

   B.1. Horizontally Oriented Hierarchy (Intra-Domain)

   @ ability to set up many LSPs in a service provider network with a
     hierarchical IGP, for the support of layer 2 and layer 3 VPN
     services
   @ requirements for multi-area traffic engineering need to be
     developed to provide guidance for any necessary protocol
     extensions
   B.2. Vertically Oriented Hierarchy

   The following functionality for survivability is common on most
   routing equipment today.

   @ near-term need is some loose form of coordination and
     communication based on the use of nested hold-off timers, instead
     of direct exchange of signaling and routing between vertical
     layers
   @ a means for an upper layer to immediately begin recovery actions
     in the event that a lower layer is not configured to perform
     recovery

   C. Survivability Requirements in Horizontal Hierarchy

   @ protection of an end-to-end connection is based on a concatenated
     set of connections, each protected within its own area
   @ mechanisms for connection routing may include (1) a network
     element that participates on both sides of a boundary (e.g., an
     OSPF ABR) - note that this is a common point of failure; (2) a
     route server
   @ need for inter-area signaling of survivability information (1) to
     enable a "least common denominator" survivability mechanism at
     the boundary; (2) to convey the success or failure of the service
     restoration action; e.g., if a part of a "connection" is down on
     one side of a boundary, there is no need for the other side to
     recover from failures

4. Terminology and Concepts

4.1 Hierarchy

   Hierarchy is a technique for building scalable, complex systems.
   It is based on abstraction: at each level, only what is most
   significant is retained from the details and internal structures of
   the levels further away.  This approach exploits a general property
   of hierarchical systems composed of related subsystems: the
   interaction between subsystems decreases as their separation in the
   hierarchy increases.

   Network hierarchy is an abstraction of part of a network's
   topology, routing, and signaling mechanisms.
   Abstraction may be used as a mechanism to build large networks or
   as a technique for enforcing administrative, topological, or
   geographic boundaries.  For example, network hierarchy might be
   used to separate the metropolitan and long-haul regions of a
   network, to separate the regional and backbone sections of a
   network, or to interconnect service provider networks (with BGP,
   which reduces a network to an Autonomous System).

   In this document, network hierarchy is considered from two
   perspectives:

   (1) Vertically oriented: between two network technology layers
   (2) Horizontally oriented: between two areas or administrative
       subdivisions within the same network technology layer

4.1.1 Vertical Hierarchy

   Vertical hierarchy is the abstraction, or reduction in information,
   that is of benefit when communicating information across network
   technology layers, as in propagating information between optical
   and router networks.

   In the vertical hierarchy, the total network functions are
   partitioned into a series of functional or technological layers,
   with a clear logical, and maybe even physical, separation between
   adjacent layers.  Survivability mechanisms either currently exist
   or are being developed at multiple layers in networks [3].  The
   optical layer is now becoming capable of providing dynamic ring and
   mesh restoration functionality, in addition to traditional 1+1 or
   1:1 protection.  The SDH/SONET layer provides survivability
   capability with automatic protection switching (APS), as well as
   self-healing ring and mesh restoration architectures.  Similar
   functionality has been defined in the ATM layer, with work ongoing
   to also provide such functionality using MPLS [4].
   At the IP layer, rerouting is used to restore service continuity
   following link and node outages.  Rerouting at the IP layer,
   however, occurs only after a period of routing convergence, which
   may require from a few seconds to several minutes to complete.

4.1.2 Horizontal Hierarchy

   Horizontal hierarchy is the abstraction that allows a network at
   one technology layer, for instance a packet network, to scale.
   Examples of horizontal hierarchy include BGP confederations,
   separate Autonomous Systems, and multi-area OSPF.

   In the horizontal hierarchy, a large network is partitioned into
   multiple smaller, non-overlapping sub-networks.  The partitioning
   criteria can be based on topology, network function, administrative
   policy, or service domain demarcation.  Two networks at the *same*
   hierarchical level, e.g., two Autonomous Systems in BGP, may share
   a peer relation with each other through some loose form of
   coupling.  On the other hand, for routing in large networks using
   multi-area OSPF, abstraction is achieved through the aggregation of
   routing information over a hierarchical partitioning of the
   network.
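   The nested hold-off timers mentioned in the summary of findings
   (Section B.2) coordinate the per-layer recovery mechanisms of
   Section 4.1.1: each layer waits long enough for the layers below it
   to attempt recovery first.  The sketch below is illustrative only
   and is not part of the design team's deliverable; the layer names
   and recovery times are hypothetical, and a real deployment would
   derive hold-off values from the measured protection switch times of
   its lower layers.

```python
# Sketch of nested hold-off timers for multilayer recovery: each layer
# holds off until all layers below it have had a chance to recover.
# Layer names and per-layer recovery times (ms) are illustrative.

def recovery_schedule(layers):
    """Given (layer, recovery_time_ms) pairs ordered bottom-up, return
    the hold-off timer (ms) each layer should use before starting its
    own recovery actions."""
    schedule = {}
    hold_off = 0
    for layer, recovery_ms in layers:
        schedule[layer] = hold_off
        hold_off += recovery_ms  # the next layer up also waits for this one
    return schedule

layers = [("optical", 50), ("SONET", 60), ("MPLS", 100)]
print(recovery_schedule(layers))
# the MPLS layer holds off until optical and SONET have both tried
```

   If a lower layer is known not to be configured for recovery, its
   entry would simply be omitted, letting the upper layer begin
   recovery immediately, as the findings require.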
4.2 Survivability Terminology

   In alphabetical order, the following terms are defined in this
   section:

   backup entity, same as protection entity (section 4.2.2)
   extra traffic (section 4.2.2)
   non-revertive mode (section 4.2.2)
   normalization (section 4.2.2)
   preemptable traffic, same as extra traffic (section 4.2.2)
   preemption priority (section 4.2.4)
   protection (section 4.2.3)
   protection entity (section 4.2.2)
   protection switching (section 4.2.3)
   protection switch time (section 4.2.4)
   recovery (section 4.2.2)
   recovery by rerouting, same as restoration (section 4.2.3)
   recovery entity, same as protection entity (section 4.2.2)
   restoration (section 4.2.3)
   restoration priority (section 4.2.4)
   restoration time (section 4.2.4)
   revertive mode (section 4.2.2)
   shared risk group (SRG) (section 4.2.2)
   survivability (section 4.2.1)
   working entity (section 4.2.2)

4.2.1 Survivability

   Survivability is the capability of a network to maintain service
   continuity in the presence of faults within the network [5].
   Survivability mechanisms such as protection and restoration are
   implemented on a per-link basis, on a per-path basis, or throughout
   an entire network, to alleviate service disruption at affordable
   cost.  The degree of survivability is determined by the network's
   capability to survive single failures, multiple failures, and
   equipment failures.

4.2.2 Generic Operations

   This document does not discuss the sequence of events by which
   network failures are monitored, detected, and mitigated.  For more
   detail on this aspect, see [4].  The repair process following a
   failure is also outside the scope of this document.

   A working entity is the entity that is used to carry traffic in
   normal operation mode.
   Depending on the context, an entity can be a channel or a
   transmission link in the physical layer, an LSP in MPLS, or a
   logical bundle of one or more LSPs.

   A protection entity, also called a backup entity or recovery
   entity, is the entity that is used to carry protected traffic in
   recovery operation mode, i.e., when the working entity is in error
   or has failed.

   Extra traffic, also referred to as preemptable traffic, is the
   traffic carried over the protection entity while the working entity
   is active.  Extra traffic is not protected, i.e., when the
   protection entity is required to protect the traffic that is being
   carried over the working entity, the extra traffic is preempted.

   A shared risk group (SRG) is a set of network elements that are
   collectively impacted by a specific fault or fault type.  For
   example, a shared risk link group (SRLG) is the union of all the
   links on those fibers that are routed in the same physical conduit
   in a fiber-span network.  This concept covers, besides shared
   conduit, other types of compromise such as shared fiber cable,
   shared right of way, shared optical ring, and shared office without
   power sharing.  The span of an SRG, such as the length of the
   sharing for compromised outside plant, needs to be considered on a
   per-fault basis.  The concept of an SRG can be extended to
   represent a "risk domain" and its associated capabilities and
   summarization for traffic engineering purposes.  See [6] for
   further discussion.

   Normalization is the sequence of events and actions taken by a
   network that returns the network to the preferred state upon
   completing repair of a failure.  This could include the switching
   or rerouting of affected traffic to the original repaired working
   entities or to new routes.
   Revertive mode refers to the case where traffic is automatically
   returned to a repaired working entity (also called switch back).

   Recovery is the sequence of events and actions taken by a network
   after the detection of a failure to maintain the required
   performance level for existing services (e.g., according to service
   level agreements) and to allow normalization of the network.  The
   actions include notification of the failure, followed by two
   parallel processes: (1) a repair process with fault isolation and
   repair of the failed components, and (2) a reconfiguration process
   using survivability mechanisms to maintain service continuity.  In
   protection, reconfiguration involves switching the affected traffic
   from a working entity to a protection entity.  In restoration,
   reconfiguration involves path selection and rerouting for the
   affected traffic.

   Revertive mode is a procedure in which revertive action, i.e.,
   switch back from the protection entity to the working entity, is
   taken once the failed working entity has been repaired.  In
   non-revertive mode, such action is not taken.  To minimize service
   interruption, switch-back in revertive mode should be performed at
   a time when it has the least impact on the traffic concerned, or by
   using the make-before-break concept.

   Non-revertive mode is used where there is no preferred path, or
   where it is desirable to minimize further disruption of the service
   brought on by a revertive switching operation.  A switch-back to
   the original working path may be undesired or impossible, since the
   original path may no longer exist after the occurrence of a fault
   on that path.
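   The SRG/SRLG definition in this section lends itself to a simple
   programmatic test: a working and a protection path are SRG-disjoint
   only if no shared risk group contains a link of both.  The sketch
   below is illustrative and not part of this document's requirements;
   the link names and conduit assignments are hypothetical.

```python
# Sketch: checking whether a candidate protection path shares any
# shared risk link group (SRLG) with the working path.  The SRLG
# assignments below are purely illustrative.

def srlgs_of(path, link_srlgs):
    """Union of SRLG identifiers over all links of a path."""
    groups = set()
    for link in path:
        groups |= link_srlgs.get(link, set())
    return groups

def srg_disjoint(working, protection, link_srlgs):
    """True if the two paths have no SRLG in common."""
    return not (srlgs_of(working, link_srlgs)
                & srlgs_of(protection, link_srlgs))

link_srlgs = {
    "A-B": {"conduit-1"},
    "B-C": {"conduit-2"},
    "A-D": {"conduit-1"},   # same conduit as link A-B
    "D-C": {"conduit-3"},
    "A-E": {"conduit-4"},
    "E-C": {"conduit-5"},
}

print(srg_disjoint(["A-B", "B-C"], ["A-D", "D-C"], link_srlgs))  # False
print(srg_disjoint(["A-B", "B-C"], ["A-E", "E-C"], link_srlgs))  # True
```

   Note that even though A-B and A-D are distinct links, they fail
   together under a conduit cut, which is exactly the compromise the
   SRLG concept is meant to capture.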
4.2.3 Survivability Techniques

   Protection, also called protection switching, is a survivability
   technique based on predetermined failure recovery: as the working
   entity is established, a protection entity is also established.
   Protection techniques can be implemented by several architectures:
   1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
   referred to as Automatic Protection Switching (APS).

   In the 1+1 protection architecture, a protection entity is
   dedicated to each working entity.  A dual-feed mechanism is used,
   whereby the working entity is permanently bridged onto the
   protection entity at the source of the protected domain.  In normal
   operation mode, identical traffic is transmitted simultaneously on
   both the working and protection entities.  At the other end (sink)
   of the protected domain, both feeds are monitored for alarms and
   maintenance signals.  A selection between the working and
   protection entity is made based on some predetermined criteria,
   such as the transmission performance requirements or defect
   indication.

   In the 1:1 protection architecture, a protection entity is also
   dedicated to each working entity.  The protected traffic is
   normally transmitted by the working entity.  When the working
   entity fails, the protected traffic is switched to the protection
   entity.  The two ends of the protected domain must signal detection
   of the fault and initiate the switchover.

   In the 1:n protection architecture, a dedicated protection entity
   is shared by n working entities.  In this case, not all of the
   affected traffic may be protected.

   The m:n architecture is a generalization of the 1:n architecture:
   m dedicated protection entities, typically with m <= n, are shared
   by n working entities.
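   When several of the n working entities in a 1:n arrangement fail
   at once, only one can use the shared protection entity; as noted in
   the comparison section below, this contention can be resolved by
   per-entity priorities.  The following sketch is illustrative only
   (entity names and priority values are hypothetical, with a lower
   number meaning higher priority):

```python
# Sketch of contention resolution in 1:n protection: one protection
# entity is shared by n working entities; when several fail at once,
# the failed entity with the highest priority (lowest number) wins.

def select_protected(failed, priority):
    """Pick which failed working entity gets the shared protection
    entity; the remaining failed entities stay unprotected."""
    if not failed:
        return None
    return min(failed, key=lambda entity: priority[entity])

priority = {"w1": 2, "w2": 0, "w3": 1}
print(select_protected({"w1", "w3"}, priority))        # w3
print(select_protected({"w1", "w2", "w3"}, priority))  # w2
```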
   Restoration, also referred to as recovery by rerouting [4], is a
   survivability technique that establishes new paths or path segments
   on demand, to restore affected traffic after the occurrence of a
   fault.  The resources in these alternate paths are the currently
   unassigned (unreserved) resources in the same layer.  Preemption of
   extra traffic may also be used if spare resources are not available
   to carry the higher-priority protected traffic.  Initiated by
   detection of a fault on the working path, the selection of a
   recovery path may be based on preplanned configurations, network
   routing policies, or current network status such as network
   topology and fault information.  Signaling is used to establish the
   new paths that bypass the fault.  Thus, restoration involves a path
   selection process followed by rerouting of the affected traffic
   from the working entity to the recovery entity.

4.2.4 Survivability Performance

   Protection switch time is the time interval from the occurrence of
   a network fault until the completion of the protection-switching
   operations.  It includes the detection time necessary to initiate
   the protection switch, any hold-off time to allow for interworking
   of protection schemes, and the switch completion time.

   Restoration time is the time interval from the occurrence of a
   network fault to the instant when the affected traffic is either
   completely restored, or until spare resources are exhausted and/or
   no more extra traffic exists that can be preempted to make room.

   Restoration priority is a method of giving preference to
   higher-priority traffic so that it is protected ahead of
   lower-priority traffic.  It helps determine the order in which
   traffic is restored after a failure has occurred.
   The purpose is to differentiate service restoration time, as well
   as to control access to available spare capacity, for different
   classes of traffic.

   Preemption priority is a method of determining which traffic can be
   disconnected in the event that not all traffic with a higher
   restoration priority is restored after the occurrence of a failure.

4.3 Survivability Mechanisms: Comparison

   In a survivable network design, spare capacity and diversity must
   be built into the network from the beginning to support some degree
   of self-healing whenever failures occur.  A common strategy is to
   associate each working entity with a protection entity having
   either dedicated resources or shared resources that are
   pre-reserved or reserved on demand.  Different approaches to
   providing survivability can be classified according to the method
   of setting up the protection entity.  Generally, protection
   techniques are based on having a dedicated protection entity set up
   prior to failure.  Such is not the case in restoration techniques,
   which mainly rely on the use of spare capacity in the network.
   Hence, in terms of trade-offs, protection techniques usually offer
   fast recovery from failure with enhanced availability, while
   restoration techniques usually achieve better resource utilization.

   A 1+1 protection architecture is rather expensive, since resource
   duplication is required for the working and protection entities.
   It is generally used for specific services that need very high
   availability.

   A 1:1 architecture is inherently slower in recovering from failure
   than a 1+1 architecture, since communication between both ends of
   the protection domain is required to perform the switch-over
   operation.  An advantage is that the protection entity can
   optionally be used to carry low-priority extra traffic in normal
   operation, if traffic preemption is allowed.
   Packet networks can pre-establish a protection path for later use
   with pre-planned but not pre-reserved capacity.  That is, if no
   packets are sent onto a protection path, then no bandwidth is
   consumed.  This is not the case in transmission networks such as
   optical or TDM, where path establishment and resource reservation
   cannot be decoupled.

   In the 1:n protection architecture, traffic is normally sent on the
   working entities.  When multiple working entities have failed
   simultaneously, only one of them can be restored by the common
   protection entity.  This contention can be resolved by assigning a
   different preemptive priority to each working entity.  As in the
   1:1 case, the protection entity can optionally be used to carry
   preemptable traffic in normal operation.

   While the m:n architecture can improve system availability with
   small cost increases, it has rarely been implemented or
   standardized.

   When compared with protection mechanisms, restoration mechanisms
   are generally more frugal, as no resources are committed until
   after the fault occurs and the location of the fault is known.
   However, restoration mechanisms are inherently slower, since more
   must be done following the detection of a fault.  Also, the time it
   takes for the dynamic selection and establishment of alternate
   paths may vary, depending on the amount of traffic and the number
   of connections to be restored, and is influenced by the network
   topology, the technology employed, and the type and severity of the
   fault.  As a result, restoration time tends to be more variable
   than the protection switch time needed with pre-selected protection
   entities.  Hence, in using restoration mechanisms, it is essential
   to use restoration priority to ensure that service objectives are
   met cost-effectively.
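   The role of restoration priority described above can be made
   concrete with a short sketch: after a fault, affected connections
   are restored in priority order until the spare capacity runs out.
   The sketch is illustrative only; the connection names, priority
   values, and bandwidths are hypothetical.

```python
# Sketch of restoration priority: connections affected by a fault are
# restored in priority order (lower number = restored first) until
# spare capacity is exhausted.  All values are illustrative.

def restoration_order(affected, spare_capacity):
    """affected: list of (name, priority, bandwidth) tuples.
    Returns the names of the connections actually restored."""
    restored = []
    for name, _prio, bw in sorted(affected, key=lambda c: c[1]):
        if bw <= spare_capacity:
            spare_capacity -= bw
            restored.append(name)
    return restored

affected = [("best-effort", 3, 40), ("voice", 0, 30),
            ("premium-data", 1, 50), ("bulk", 2, 60)]
print(restoration_order(affected, spare_capacity=100))
# voice and premium-data are restored; bulk and best-effort wait for
# repair or for extra traffic to be preempted
```

   A real implementation would also apply preemption priority, as
   defined in Section 4.2.4, to decide which lower-priority traffic to
   disconnect when higher-restoration-priority traffic cannot fit.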
   Once the network routing algorithms have converged after a fault,
   it may be preferable, in some cases, to reoptimize the network by
   performing a reroute based on the current state of the network and
   network policies.

5. Survivability

5.1 Scope

   Interoperable approaches to network survivability were determined
   to be an immediate requirement in packet networks as well as in
   SDH/SONET framed TDM networks.  Not as pressing at this time were
   techniques that would cover all-optical networks (e.g., where
   framing is unknown), as the control of these networks in a
   multi-vendor environment appears to face other hurdles that must be
   dealt with first.  Also not of immediate interest were approaches
   to coordinate or explicitly communicate survivability mechanisms
   across network layers (such as from a TDM or optical network
   to/from an IP network).  However, a capability should be provided
   for a network operator to perform fault notification and to control
   the operation of survivability mechanisms among different layers.
   This may require the development of corresponding OAM
   functionality.  However, such issues and those related to OAM are
   currently outside the scope of this document.  (For proposed MPLS
   OAM requirements, see [7, 8].)

   The initial scope is to address only "backhoe failures" in the
   inter-office connections of a service provider network.  A link
   connection in the router layer typically comprises multiple spans
   in the lower layers.  Therefore, the types of network failures that
   cause a recovery to be performed include link/span failures.
   However, linecard and node failures may not need to be treated any
   differently than their respective link/span failures, as a router
   failure may be represented as a set of simultaneous link failures.
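   The observation that a router failure may be represented as a set
   of simultaneous link failures can be sketched directly.  The
   topology and router names below are hypothetical and serve only to
   illustrate the idea:

```python
# Sketch: modelling a node (router) failure as the simultaneous
# failure of all of its incident links, per the Scope section.
# The topology is illustrative.

def links_failed_by_node(topology, node):
    """Return the set of links (as sorted endpoint tuples) lost when
    the given node fails."""
    return {tuple(sorted((node, nbr))) for nbr in topology[node]}

topology = {"R1": ["R2", "R3"], "R2": ["R1", "R3"], "R3": ["R1", "R2"]}
print(links_failed_by_node(topology, "R1"))
# a failure of R1 looks to the rest of the network like the loss of
# links R1-R2 and R1-R3 at the same time
```

   A survivability mechanism that handles this set of simultaneous
   link failures therefore also covers the node failure, which is why
   node failures may not need separate treatment.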
   Depending on the actual network configuration, a drop-side
   interface (e.g., between a customer and an access router, or
   between a router and an optical cross-connect) may be considered
   either inter-domain or inter-layer.  Another inter-domain scenario
   is the use of intra-office links for interconnecting a metro
   network and a core network, with both networks being administered
   by the same service provider.  Failures at such interfaces may be
   similarly protected by the mechanisms of this section.

   Other, more complex failure mechanisms, such as systematic
   control-plane failure, configuration error, or breach of security,
   are not within the scope of the survivability mechanisms discussed
   in this document.  Network impairments such as congestion, which
   result in lower throughput, are also not covered.

5.2 Required initial set of survivability mechanisms

5.2.1 1:1 Path Protection with Pre-Established Capacity

   In this protection mode, the head end of a working connection
   establishes a protection connection to the destination.  There
   should be the ability to maintain relative restoration priorities
   between working and protection connections, as well as between
   different classes of protection connections.

   In normal operation, traffic is sent only on the working
   connection, though the ability to signal that traffic will be sent
   on both connections (1+1 Path for signaling purposes) would be
   valuable in non-packet networks.  Some distinction between working
   and protection connections is likely, either through explicit
   objects, or preferably through implicit methods such as general
   classes or priorities.  Head ends need the ability to create
   connections that are as failure-disjoint as possible from each
   other.  This requires SRG information that can be generally
   assigned to either nodes or links and propagated through the
   control or management plane.
In this mechanism, capacity in the protection connection is pre-
established; however, it should be capable of carrying preemptable
extra traffic in non-packet networks. When protection capacity is
called into service during recovery, there should be the ability to
promote the protection connection to working status (for non-
revertive mode operation) with some form of make-before-break
capability.

5.2.2 1:1 Path Protection with Pre-Planned Capacity

As with 1:1 protection with pre-established capacity, the
protection connection in this case is also pre-signaled. The
difference is in the way protection capacity is assigned. With pre-
planned capacity, the mechanism supports the ability for the
protection capacity to be shared, or "double-booked". Operators
need the ability to provision different amounts of protection
capacity according to expected failure modes and service level
agreements. Thus, an operator may wish to provision sufficient
restoration capacity to handle a single failure affecting all
connections in an SRG, or may wish to provision less or more
restoration capacity. Mechanisms should be provided to allow
restoration capacity on each link to be shared by SRG-disjoint
failures. In a sense, this is 1:1 from a path perspective; however,
the protection capacity in the network (on a link-by-link basis) is
shared in a 1:n fashion, e.g., see the proposals in [9, 10]. If
capacity is planned but not allocated, some form of signaling could
be required before traffic may be sent on protection connections,
especially in TDM networks.

The use of this approach improves network resource utilization, but
may require more careful planning.
So, initial deployment might be based on 1:1 path protection with
pre-established capacity and the local restoration mechanism
described next.

5.2.3 Local Restoration

Due to the time impact of signal propagation, dynamic recovery of an
entire path may not meet the service requirements of some networks.
The solution to this is to restore connectivity of the link or span
in immediate proximity to the fault, e.g., see the proposals in [11,
12]. At a minimum, this approach should be able to protect against
connectivity-type SRGs, though protecting against node-based SRGs
might be worthwhile. Also, this approach is applicable to support
restoration in the inter-domain and inter-layer interconnection
scenarios using intra-office links, as described in the Scope
section.

Head end systems must have some control over whether their
connections are candidates for, or excluded from, local restoration.
For example, best-effort and preemptable traffic may be excluded
from local restoration; such traffic is only restored if there is
bandwidth available. This type of control may require the
definition of an object in signaling.

Since local restoration may be suboptimal, a means for head end
systems to later perform path-level re-grooming must be supported
for this approach.

5.2.4 Path Restoration

In this approach, connections that are impacted by a fault are
rerouted by the originating network element upon notification of
connection failure. Such a source-based approach is efficient in
its use of network resources, but typically takes longer to
accomplish restoration. It does not involve any new mechanisms; it
is merely another common approach to protecting against faults in a
network.
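The capacity sharing described in Section 5.2.2 can be sketched as sizing each link's restoration reservation for the worst single-SRG failure rather than for the sum of all backup paths routed over it. This is an illustrative model under a single-failure assumption, not a mechanism from any of the cited proposals; all names are invented for the example.

```python
# Sketch of "double-booked" restoration capacity: backup paths that
# protect SRG-disjoint failures may share capacity on a link, so
# the reservation needed on each link is the maximum over single-
# SRG failure scenarios, not the sum.
from collections import defaultdict

def shared_backup_capacity(backups):
    """backups: list of (srg, bandwidth, backup_links) tuples, one
    per protected connection; returns {link: capacity_to_reserve}."""
    per_link_per_srg = defaultdict(lambda: defaultdict(int))
    for srg, bw, links in backups:
        for link in links:
            per_link_per_srg[link][srg] += bw
    # Reserve enough for the worst single-SRG failure on each link.
    return {link: max(by_srg.values())
            for link, by_srg in per_link_per_srg.items()}

backups = [
    ("srg1", 10, ["L1", "L2"]),   # backup for a connection in SRG 1
    ("srg2", 7,  ["L2", "L3"]),   # backup for a connection in SRG 2
]
# On shared link L2, 10 units suffice (SRG 1 and SRG 2 are assumed
# not to fail together), instead of the 17 that 1:1 pre-established
# reservations would require.
assert shared_backup_capacity(backups)["L2"] == 10
```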
5.3 Applications Supported

With service continuity under failure as a goal, a network is
"survivable" if, in the face of a network failure, connectivity is
interrupted for a "brief" period and then recovered before the
network failure ends. The length of this interrupted period depends
on the application supported. Here are some typical applications
and considerations that drive the requirements for an acceptable
protection switch time or restoration time:

- Best-effort data: recovery of network connectivity by rerouting at
  the IP layer would be sufficient
- Premium data service: need to meet TCP timeout or application
  protocol timer requirements
- Voice: call cutoff is in the range of 140 msec to 2 sec (the time
  that a person waits after interruption of the speech path before
  hanging up, or the time that a telephone switch will disconnect a
  call)
- Other real-time services (e.g., streaming, fax) where an
  interruption would cause the session to terminate
- Mission-critical applications that cannot tolerate even brief
  interruptions, for example, real-time financial transactions

5.4 Timing Bounds for Survivability Mechanisms

The approach to picking the types of survivability mechanisms
recommended was to consider a spectrum of mechanisms that can be
used to protect traffic with varying characteristics of
survivability and speed of protection/restoration, and then attempt
to select a few general points which provide some coverage across
that spectrum. The focus of this work is to provide requirements
from which a small set of detailed proposals may be developed,
allowing the operator some (limited) flexibility in approaches to
meeting their design goals in engineering multi-vendor networks.
Requirements of different applications as listed in the previous
sub-section were discussed in general terms; however, no one on the
team would attest to the scientific rigor with which the timing
bounds below match any specific application's needs. A few
assumptions include:

1. Approaches that perform a protection switch without propagating
   information are likely to be faster than those that require some
   form of fault notification to some or all elements in a network.
2. Approaches that require some form of signaling after a fault will
   also likely suffer some timing impact.

Proposed timing bounds for different survivability mechanisms are as
follows (all bounds are exclusive of signal propagation):

   1:1 path protection with pre-established capacity: 100-500 ms
   1:1 path protection with pre-planned capacity:     100-750 ms
   Local restoration:                                 50 ms
   Path restoration:                                  1-5 seconds

To ensure that the service requirements for different applications
can be met within the above timing bounds, restoration priority must
be implemented to determine the order in which connections are
restored (to minimize service restoration time as well as to gain
access to available spare capacity on the best paths). For example,
mission-critical applications may require high restoration priority.
At the fiber layer, instead of specific applications, it may be
possible that priority be given to certain classifications of
customers, with their traffic types enclosed within the customer
aggregate. Preemption priority should only be used in the event
that not all connections can be restored, in which case connections
with lower preemption priority should be released.
Depending on a service provider's strategy in provisioning network
resources for backup, preemption may or may not be needed in the
network.

5.5 Coordination Among Layers

A common design goal for networks with multiple technological layers
is to provide the desired level of service in the most cost-
effective manner. Multilayer survivability may allow the
optimization of spare resources through the improvement of resource
utilization by sharing spare capacity across different layers,
though further investigation is needed. Coordination during
recovery among different network layers (e.g., IP, SDH/SONET, the
optical layer) might necessitate the development of vertical
hierarchy. The benefits of providing survivability mechanisms at
multiple layers, and the optimization of the overall approach, must
be weighed against the associated cost and service impacts.

A default coordination mechanism for inter-layer interaction could
be the use of nested timers and current SDH/SONET fault monitoring,
as has been done traditionally for backward compatibility. Thus,
when lower-layer recovery takes longer than higher-layer recovery, a
hold-off timer is utilized to avoid contention between the different
single-layer survivability schemes. In other words, multilayer
interaction is addressed by having successively higher multiplexing
levels operate at a protection/restoration time scale greater than
that of the next lowest layer. This can impact the overall time to
recover service. For example, if SDH/SONET protection switching is
used, MPLS recovery timers must wait until SDH/SONET has had time to
switch.
Setting such timers involves a tradeoff between rapid recovery and
the creation of a race condition in which multiple layers respond to
the same fault, potentially allocating resources in an inefficient
manner.

In other configurations, where the lower layer does not have a
restoration capability or is not expected to protect, say an
unprotected SDH/SONET linear circuit, there must be a mechanism for
the lower layer to trigger the higher layer to take recovery actions
immediately. This difference in network configuration means that
implementations must allow for adjustment of hold-off timer values
and/or a means for a lower layer to immediately indicate to a higher
layer that a fault has occurred, so that the higher layer can take
restoration or protection actions.

Furthermore, faults at higher layers should not trigger restoration
or protection actions at lower layers [3, 4].

It was felt that the current approach to coordination of
survivability mechanisms does not have significant operational
shortfalls. These approaches include protecting traffic solely at
one layer (e.g., at the IP layer over linear WDM, or at the
SDH/SONET layer). Where survivability mechanisms might be deployed
at several layers, such as when a routed network rides a SDH/SONET-
protected network, it was felt that current coordination approaches
were sufficient in many cases. One exception is the hold-off of
MPLS recovery until the completion of SDH/SONET protection
switching, as described above, which limits the recovery time of
fast MPLS restoration. Also, by design, the operations and
mechanisms within a given layer tend to be invisible to other
layers.
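The coordination rules above (nested hold-off timer by default, immediate trigger when the lower layer will not protect) can be sketched as a simple decision function. This is a minimal illustration of the stated requirements, not an implementation of any standard; the function name and timer values are invented for the example.

```python
# Sketch of the inter-layer coordination rules of Section 5.5: the
# higher layer holds off for a configurable period so the lower
# layer can attempt recovery first, unless the lower layer signals
# that it will not protect, in which case the higher layer acts
# immediately. Timer values below are illustrative only.

def higher_layer_should_act(elapsed_ms, hold_off_ms,
                            lower_layer_recovered,
                            lower_layer_unprotected):
    """Decide whether the higher layer starts its own recovery."""
    if lower_layer_unprotected:
        return True                   # immediate trigger, no hold-off
    if lower_layer_recovered:
        return False                  # contention avoided
    return elapsed_ms >= hold_off_ms  # nested-timer default

# SDH/SONET protection switching completes within the hold-off:
assert not higher_layer_should_act(60, 100, True, False)
# Lower layer is an unprotected linear circuit: act at once.
assert higher_layer_should_act(0, 100, False, True)
# Hold-off expires without lower-layer recovery: MPLS recovers.
assert higher_layer_should_act(100, 100, False, False)
```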
5.6 Evolution Toward IP Over Optical

As more pressing requirements for survivability and horizontal
hierarchy for edge-to-edge signaling are met with technical
proposals, it is believed that the benefits of merging (in some
manner) the control planes of multiple layers will be outlined.
When these benefits are self-evident, it would then seem to be the
right time to review whether vertical hierarchy mechanisms are
needed, and what the requirements might be. For example, a future
requirement might be to provide a better match between the recovery
requirements of IP networks and the recovery capability of optical
transport. One such proposal is described in [13].

6. Hierarchy Requirements

Efforts in the area of network hierarchy should focus on mechanisms
that would allow more scalable edge-to-edge signaling, or signaling
across networks with existing network hierarchy (such as multi-area
OSPF). This appears to be a more urgent need than mechanisms that
might be needed to interconnect networks at different layers.

6.1 Historical Context

One reason for horizontal hierarchy is functionality (e.g., metro
versus backbone). Geographic "islands" or partitions reduce the
need for interoperability and make administration and operations
less complex. Using a simpler, more interoperable survivability
scheme at metro/backbone boundaries is natural for many provider
network architectures. In transmission networks, creating
geographic islands of different vendor equipment has long been done
because multi-vendor interoperability has been difficult to achieve.
Traditionally, providers have to coordinate the equipment on either
end of a "connection," and making this interoperable reduces
complexity.
A provider should be able to concatenate survivability mechanisms in
order to provide a "protected link" to the next higher level. Think
of SDH/SONET rings connecting to TDM DXCs with 1+1 line-layer
protection between the ADM and the DXC port. The TDM connection,
e.g., a DS3, is protected, but usually all equipment on each
SDH/SONET ring is from a single vendor. The DXC cross-connections
are controlled by the provider, and the ports are physically
protected, resulting in a highly available design. Thus,
concatenation of survivability approaches can be used to cascade
across horizontal hierarchy. While not perfect, it is workable in
the near- to mid-term until multi-vendor interoperability is
achieved.

While the problems associated with multi-vendor interoperability may
necessitate horizontal hierarchy as a practical matter in the near
to mid-term (at least this has been the case in TDM networks), there
should not be a technical reason for it in the standards developed
by the IETF for core networks, or even most access networks.
Establishing interoperability of survivability mechanisms between
multi-vendor equipment in core IP networks is urgently required to
enable adoption of IP as a viable core transport technology and to
facilitate the traffic engineering of future multi-service IP
networks [3].

Some of the largest service provider networks currently run a single
area/level IGP. Some service providers, as well as many large
enterprise networks, run multi-area OSPF to gain increases in
scalability. Often, this was part of the original design, so it is
difficult to say whether the network truly required the hierarchy to
reach its current size.

Some proposals on improved mechanisms to address network hierarchy
have been suggested [14, 15, 16, 17, 18].
This document aims to provide concrete requirements so that these
and other proposals can first aim to meet some limited objectives.

6.2 Applications for Horizontal Hierarchy

A primary driver for intra-domain horizontal hierarchy is signaling
capability in the context of edge-to-edge VPNs, potentially across
traffic-engineered data networks. There are a number of different
approaches to layer 2 and layer 3 VPNs, and they are currently being
addressed by different emerging protocols in the provider-
provisioned VPN (e.g., virtual routers) and Pseudo Wire Edge-to-Edge
Emulation (PWE3) efforts, based on either MPLS and/or IP tunnels.
These may or may not need explicit signaling from edge to edge, but
it is a common perception that, in order to meet SLAs, some form of
edge-to-edge signaling may be required.

With a large number of edges (N), scalability is concerned with
avoiding the O(N^2) properties of edge-to-edge signaling. The main
issue here is not the scalability of large amounts of signaling as
such, e.g., in O(N^2) meshes with a "connection" between every
edge-pair. Rather, even if establishing and maintaining connections
is feasible in a large network, there might be an impact on core
survivability mechanisms which would cause protection/restoration
times to grow with N^2, which would be undesirable. While some
value of N may be inevitable, approaches to reduce N (e.g., to pull
in from the edge to aggregation points) might be of value.

Thus, most service providers feel that O(N^2) meshes are not
necessary for VPNs, and that the number of tunnels to support VPNs
would be within the scalability bounds of current protocols and
implementations.
While that may be the case, there is currently no ability to signal
MPLS tunnels from edge to edge across IGP hierarchy, such as OSPF
areas. This may require the development of signaling standards that
support dynamic establishment, and potentially restoration, of LSPs
across a 2-level IGP hierarchy.

For routing scalability, especially in data applications, a major
concern is the amount of processing/state that is required in the
variety of network elements. If some nodes might not be able to
communicate and process the state of every other node, it might be
preferable to limit the information. One school of thought holds
that the amount of information contained by a horizontal barrier
should be significant, and that the impacts this might have on
optimality in route selection and on the ability to provide global
survivability are accepted tradeoffs.

6.3 Horizontal Hierarchy Requirements

Mechanisms are required to allow for edge-to-edge signaling of
connections through a network. One network scenario includes medium
to large networks that currently have hierarchical interior routing,
such as multi-area OSPF or multi-level IS-IS. The primary context
of this is edge-to-edge signaling, which is thought to be required
to assure the SLAs for the layer 2 and layer 3 VPNs that are being
carried across the network. Another possible context would be edge-
to-edge signaling in TDM SDH/SONET networks with IP control, where
metro and core networks again might be in a hierarchical interior
routing domain.

To support edge-to-edge signaling in the above network scenarios
within the framework of existing horizontal hierarchies, current
traffic engineering (TE) methods [19, 5] may need to be extended.
Requirements for multi-area TE need to be developed to provide
guidance for any necessary protocol extensions.

7. Survivability and Hierarchy

When horizontal hierarchy exists in a network technology layer, a
question arises as to how survivability can be provided along a
connection which crosses hierarchical boundaries.

In designing protocols to meet the requirements of hierarchy, an
approach to consider is that boundaries are either clean, or are of
minimal value. However, the concept of network elements that
participate on both sides of a boundary might be a consideration
(e.g., OSPF ABRs). That would allow devices on either side to take
an intra-area approach within their region of knowledge, and the
ABR to do this in both areas, splicing the two protected connections
together at a common point (granted, it is now a common point of
failure). If the limitations of this approach start to appear in
operational settings, then perhaps it would be time to start
thinking about route servers and signaling propagated directives.
However, one initial approach might be to signal through a common
border router, and to consider the service as protected since it
consists of a concatenated set of connections which are each
protected within their own area. Another approach might be to have
a least-common-denominator mechanism at the boundary, e.g., 1+1 port
protection. There should also be some standardized means for a
survivability scheme on one side of such a boundary to communicate
with the scheme on the other side regarding the success or failure
of a recovery action. For example, if a part of a "connection" is
down on one side of such a boundary, there is no need for the other
side to recover from failures.
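The concatenation approach above can be sketched as treating an edge-to-edge service as a sequence of per-area segments spliced at border routers, with the service considered protected only when every segment is protected within its own area. This is an illustrative model of the idea, not a defined procedure; the area names and function are invented for the example.

```python
# Sketch of concatenated protection across hierarchical boundaries
# (Section 7): an edge-to-edge connection is built from per-area
# segments spliced at common border routers (e.g., OSPF ABRs), and
# is considered protected when every segment is protected within
# its own area. The splicing ABR itself remains a shared point of
# failure, as the text notes.

def service_protected(segments):
    """segments: list of (area, protected) pairs along the path."""
    return all(protected for _, protected in segments)

path = [("area1", True),   # head end to ABR, protected in area 1
        ("area0", True),   # across the backbone area
        ("area2", True)]   # ABR to tail end, protected in area 2
assert service_protected(path)

# If any one area offers no recovery, the concatenated service
# cannot be considered protected end to end.
assert not service_protected([("area1", True), ("area0", False)])
```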
In summary, at this time, approaches as described above that allow
concatenation of survivability schemes across hierarchical
boundaries seem sufficient.

8. Security Considerations

No security issues have been raised in these requirements.

9. References

1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
   9, RFC 2026, October 1996.

2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
   Levels", BCP 14, RFC 2119, March 1997.

3  K. Owens, V. Sharma, and M. Oommen, "Network Survivability
   Considerations for Traffic Engineered IP Networks," Internet-
   Draft, Work in Progress, July 2001.

4  V. Sharma, B. Crane, S. Makam, K. Owens, C. Huang, F. Hellstrand,
   J. Weil, L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A.
   Chiu, "Framework for MPLS-based Recovery," Internet-Draft, Work
   in Progress, July 2001.

5  D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao,
   "Overview and Principles of Internet Traffic Engineering,"
   Internet-Draft, Work in Progress, August 2001.

6  S. Dharanikota, R. Jain, D. Papadimitriou, R. Hartani, G.
   Bernstein, V. Sharma, C. Brownmiller, Y. Xue, and J. Strand,
   "Inter-domain routing with Shared Risk Groups," Internet-Draft,
   Work in Progress, July 2001.

7  N. Harrison, P. Willis, S. Davari, E. Cuevas, B. Mack-Crane, E.
   Franze, H. Ohta, T. So, S. Goldfless, and F. Chen, "Requirements
   for OAM in MPLS Networks," Internet-Draft, Work in Progress, May
   2001.

8  D. Allan and M. Azad, "A Framework for MPLS User Plane OAM,"
   Internet-Draft, Work in Progress, July 2001.

9  S. Kini, M. Kodialam, T.V. Lakshman, S. Sengupta, and C.
   Villamizar, "Shared Backup Label Switched Path Restoration,"
   Internet-Draft, Work in Progress, May 2001.

10 G. Li, C. Kalmanek, J. Yates, G. Bernstein, F. Liaw, and V.
   Sharma, "RSVP-TE Extensions For Shared-Mesh Restoration in
   Transport Networks," Internet-Draft, Work in Progress, July 2001.

11 D.H. Gan, P. Pan, A. Ayyangar, and K. Kompella, "A Method for
   MPLS LSP Fast-Reroute Using RSVP Detours," Internet-Draft, Work
   in Progress, April 2001.

12 A. Atlas, C. Villamizar, and C. Litvanyi, "MPLS RSVP-TE
   Interoperability for Local Protection/Fast Reroute," Internet-
   Draft, Work in Progress, July 2001.

13 A. Chiu and J. Strand, "Joint IP/Optical Layer Restoration after
   a Router Failure," Proc. OFC 2001, Anaheim, CA, March 2001.

14 K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
   Engineering," Internet-Draft, Work in Progress, March 2001.

15 G. Ash, et al., "Requirements for Multi-Area TE," Internet-Draft,
   Work in Progress, September 2001.

16 A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback Routing
   Extensions for MPLS Signaling," Internet-Draft, Work in Progress,
   July 2001.

17 C-Y. Lee, A. Celer, N. Gammage, S. Ghanti, and G. Ash,
   "Distributed Route Exchangers," Internet-Draft, Work in Progress,
   March 2001.

18 C-Y. Lee and S. Ghanti, "Path Request and Path Reply Message,"
   Internet-Draft, Work in Progress, July 2001.

19 D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, and J. McManus,
   "Requirements for Traffic Engineering Over MPLS," RFC 2702,
   September 1999.

10. Acknowledgments

Much of the direction taken in this document, and by the team in its
initial effort, was steered by the insightful questions provided by
Bala Rajagopalan, Greg Bernstein, Yangguang Xu, and Avri Doria. The
set of questions is attached as Appendix A of this document.

After the release of the first draft, a number of comments were
received.
Thanks are due to Jerry Ash, Sudheer Dharanikota, Chuck Kalmanek,
Dan Koller, Lyndon Ong, Steve Plote, and Yong Xue for their inputs.

11. Authors' Addresses

Wai Sum Lai
AT&T
200 Laurel Avenue
Middletown, NJ 07748, USA
Tel: +1 732-420-3712
wlai@att.com

Dave McDysan
WorldCom
22001 Loudoun County Pkwy
Ashburn, VA 20147, USA
dave.mcdysan@wcom.com

Jim Boyle
Protocol Driven Networks
Tel: +1 919-852-5160
jboyle@pdnets.com

Malin Carlzon
malin@sunet.se

Rob Coltun
Redback Networks
300 Ferguson Drive
Mountain View, CA 94043, USA
Tel: +1 650-390-9030
rcoltun@redback.com

Tim Griffin
AT&T
180 Park Avenue
Florham Park, NJ 07932, USA
Tel: +1 973-360-7238
griffin@research.att.com

Ed Kern
ejk@tech.org

Tom Reddington
Lucent Technologies
67 Whippany Rd
Whippany, NJ 07981, USA
Tel: +1 973-386-7291
treddington@bell-labs.com

Appendix A: Questions Used to Help Develop Requirements

A. Definitions

1. In determining the specific requirements, the design team should
precisely define the concepts "survivability", "restoration",
"protection", "protection switching", "recovery", "re-routing",
etc., and their relations. This would enable the requirements
document to describe precisely which of these will be addressed.
In the following, the term "restoration" is used to indicate the
broad set of policies and mechanisms used to ensure survivability.

B. Network Types and Protection Modes

1. What is the scope of the requirements with regard to the types of
networks covered?
Specifically, are the following in scope:

- Restoration of connections in mesh optical networks (opaque or
  transparent)
- Restoration of connections in hybrid mesh-ring networks
- Restoration of LSPs in MPLS networks (composed of LSRs overlaid
  on a transport network, e.g., optical)
- Any other types of networks?
- Is commonality of approach, or optimization of approach, more
  important?

2. What are the requirements with regard to the protection modes to
be supported in each network type covered? (Examples of protection
modes include 1+1, M:N, shared mesh, UPSR, BLSR, and newly defined
modes such as P-cycles.)

3. What are the requirements on local span (i.e., link-by-link)
protection and end-to-end protection, and the interaction between
them? E.g., what should be the granularity of connections for each
type (single connection, bundle of connections, etc.)?

C. Hierarchy

1. Vertical (between two network layers):
What are the requirements for the interaction between restoration
procedures across two network layers, when these features are
offered in both layers? (Example: an MPLS network realized over
point-to-point optical connections.) In such a case:

(a) Are there any criteria to choose which layer should provide
protection?

(b) If both layers provide survivability features, what are the
requirements to coordinate these mechanisms?

(c) How is the current lack of cross-layer coordination
functionality hampering operations?

(d) Would the benefits be worth the additional complexity associated
with routing isolation (e.g., VPNs, areas), security, address
isolation, and policy/authentication processes?

2. Horizontal (between two areas or administrative subdivisions
within the same network layer):

(a) What are the criteria that trigger the creation of protocol or
administrative boundaries pertaining to restoration? (E.g.,
scalability? Multi-vendor interoperability? Multi-provider
operation? What are the practical issues? Should multi-vendor
deployment necessitate hierarchical separation?)

When such boundaries are defined:

(b) What are the requirements on how protection/restoration is
performed end-to-end across such boundaries?

(c) If different restoration mechanisms are implemented on two sides
of a boundary, what are the requirements on their interaction?

What is the primary driver of horizontal hierarchy? (Select one.)
- functionality (e.g., metro versus backbone)
- routing scalability
- signaling scalability
- current network architecture, trying to layer TE on top of an
  already hierarchical network architecture
- routing and signaling

For signaling scalability, is it
- manageability
- processing/state of the network
- the edge-to-edge N^2 type issue

For routing scalability, is it
- processing/state of the network
- are you flat and want to go hierarchical, or already hierarchical?
- a data or TDM application?

D. Policy

1. What are the requirements for policy support during
protection/restoration, e.g., restoration priority, preemption,
etc.?

E. Signaling Mechanisms

1. What are the requirements on the signaling transport mechanism
(e.g., in-band over SONET/SDH overhead bytes, out-of-band over an IP
network, etc.) used to communicate restoration protocol messages
between network elements? What are the bandwidth and other
requirements on the signaling channels?

2. What are the requirements on fault detection/localization
mechanisms (which are the prelude to performing restoration
procedures) in the case of opaque and transparent optical networks?
What are the requirements in the case of MPLS restoration?

3. What are the requirements on signaling protocols to be used in
restoration procedures (e.g., high-priority processing, security,
etc.)?

4. Are there any requirements on the operation of restoration
protocols?

F. Quantitative

1. What are the quantitative requirements (e.g., latency) for
completing restoration under different protection modes (for both
local and end-to-end protection)?

G. Management

1. What information should be measured/maintained by the control
plane at each network element pertaining to restoration events?

2. What are the requirements for the correlation between control
plane and data plane failures from the restoration point of view?

Full Copyright Statement

Copyright (C) The Internet Society (date). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.