Traffic Engineering Working Group                      Wai Sum Lai, AT&T
Internet Draft                                    Dave McDysan, WorldCom
(Co-Editors)
Category: Informational
Expiration Date: January 2002                                  Jim Boyle
                                                           Malin Carlzon
                                                     Rob Coltun, Redback
                                                       Tim Griffin, AT&T
                                                          Ed Kern, Cogent
                                                   Tom Reddington, Lucent

                                July 2001

             Network Hierarchy and Multilayer Survivability

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026 [1].
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.  Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted by
   other documents at any time.  It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document is the deliverable of the Network Hierarchy and
   Survivability Techniques Design Team established within the Traffic
   Engineering Working Group.  The team was asked to determine the
   current and near-term requirements for survivability and hierarchy
   in MPLS networks.  The team determined that there appears to be a
   need for common, interoperable survivability approaches in packet
   and non-packet networks.  Suggested approaches include path-based
   techniques as well as one that repairs connections in proximity to
   the network fault.  For clarity, an expanded set of definitions is
   included.  As for hierarchy, there did not appear to be as much need
   for work on "vertical hierarchy," defined as communication between
   network layers such as TDM/optical and MPLS.  In particular, instead
   of direct exchange of signaling and routing between vertical layers,
   some looser form of coordination and communication is a nearer-term
   need.  For "horizontal hierarchy" in data networks, there does
   appear to be a pressing need.
   This requirement is often presented in the context of layer 2 and
   layer 3 VPN services, where SLAs would appear to necessitate
   signaling from the edges into the core of a network.  Issues include
   potential limitations of current protocols in hierarchical networks
   (e.g., multi-area OSPF) and scalability concerns over potentially
   O(N^2) connection growth in larger networks.

   Please send comments to te-wg@ops.ietf.org

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [2].

3. Introduction

   This document presents a proposal for the tangible requirements for
   network survivability and hierarchy in current service provider
   environments.  With feedback solicited from the working group, the
   objective is to help focus the work being addressed in the traffic
   engineering, ccamp, and other working groups.  A main goal of this
   work is to provide some expedience for required functionality in
   multi-vendor service provider networks.  The initial focus is
   primarily on intra-domain operations.  However, to maintain
   consistency in the provision of end-to-end service in a
   multi-provider environment, rules governing the operation of
   survivability mechanisms at domain boundaries must also be
   specified.  While such issues are raised and discussed where
   appropriate, they will not be treated in depth in the initial
   release of this document.

   The document first develops a set of definitions to be used later in
   this document, and potentially in other documents as well.  It then
   addresses the requirements and issues associated with service
   restoration and hierarchy, and closes with a short discussion of
   survivability in a hierarchical context.

4. Definitions

4.1 Hierarchy Terminology

   Network hierarchy is an abstraction of part of a network's topology
   and the routing and signaling mechanisms needed to support the
   topological abstraction.  Abstraction may be used as a mechanism to
   build large networks or as a technique for enforcing administrative,
   topological, or geographic boundaries.  For example, network
   hierarchy might be used to separate the metropolitan and long-haul
   regions of a network, to separate the regional and backbone sections
   of a network [Bert Wijnen], or to interconnect service provider
   networks (as with BGP, which reduces a network to an Autonomous
   System).  In this document, network hierarchy is considered from two
   perspectives:

   (1) Horizontally oriented: between two areas or administrative
       subdivisions within the same network layer

   (2) Vertically oriented: between two network layers

   Horizontal hierarchy is the abstraction necessary to allow a network
   at one network layer, for instance a packet network, to grow.
   Examples of horizontal hierarchy include BGP and multi-area OSPF.

   Vertical hierarchy is the abstraction, or reduction in information,
   that would be of benefit when communicating information across
   network layers, as in propagating information between optical and
   router networks.

4.2 Survivability Terminology

   Extra traffic is the traffic carried over the protection entity
   while the working entity is active.  Extra traffic is not protected;
   i.e., when the protection entity is required to protect the traffic
   being carried over the working entity (e.g., due to a failure
   affecting the working entity), the extra traffic is preempted.

   Normalization is the return of a network to its normal state upon
   completion of the repair of the network failure.
   This could include the rerouting of affected traffic to the original
   working entities or to new routes.  The term revertive mode is used
   when traffic is returned to the working entity (switch-back).

   Protection, also called protection switching, is a survivability
   technique based on predetermined failure recovery: as the working
   entity is established, resources are reserved for the protection
   entity.  These resources may be used by low-priority traffic
   (referred to as extra traffic) if traffic preemption is allowed.
   Depending on the amount of reserved resources, not all of the
   affected traffic may be protected.  (For further discussion of
   concepts related to protection, see the sub-section below on
   Survivability Concepts.)

   Protection entity (also called back-up entity or recovery entity) is
   the entity used to carry protected traffic in protection operation
   mode, i.e., when the working entity is in error or has failed.

   Recovery is the sequence of actions taken by a network after the
   detection of a failure to maintain the required performance level
   for existing services (e.g., according to service level agreements)
   and to allow normalization of the network.  The actions include
   notification of the failure followed by two parallel processes: (1)
   a repair process, with fault isolation and repair of the failed
   components, and (2) a reconfiguration process, with path selection
   and rerouting for the affected traffic.

   Rerouting is the placement of affected traffic from the working
   entity onto the protection entity, where the path for the protection
   entity is selected after the detection of a fault on the working
   entity.  This is synonymous with switch-over in protection
   techniques.  (In [3], rerouting is synonymous with restoration.)
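   The definitions above interact: a fault triggers switch-over from the
   working entity to the protection entity, and normalization either
   switches traffic back (revertive mode) or leaves it on the protection
   entity (non-revertive mode).  As an illustrative aside, not part of
   this document's requirements, the behavior can be sketched in Python;
   all class and method names are hypothetical.

```python
# Illustrative sketch of switch-over, normalization, and revertive
# vs. non-revertive behavior as defined above.  Hypothetical names;
# not taken from any standard or protocol.

class ProtectedEntity:
    """Tracks whether traffic rides the working or protection entity."""

    def __init__(self, revertive=True):
        self.revertive = revertive
        self.active = "working"      # normal operation mode
        self.working_failed = False

    def fault_detected(self):
        """Switch-over: reroute affected traffic to the protection entity."""
        self.working_failed = True
        self.active = "protection"

    def working_repaired(self):
        """Normalization: revertive mode switches traffic back to the
        working entity; non-revertive mode leaves it on the (promoted)
        protection entity."""
        self.working_failed = False
        if self.revertive:
            self.active = "working"  # switch-back (make-before-break advised)

e = ProtectedEntity(revertive=True)
e.fault_detected()
assert e.active == "protection"
e.working_repaired()
assert e.active == "working"         # revertive: traffic returns

e2 = ProtectedEntity(revertive=False)
e2.fault_detected()
e2.working_repaired()
assert e2.active == "protection"     # non-revertive: no switch-back
```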
   Restoration is a survivability technique that dynamically discovers
   an alternate path from spare resources in the network, or
   establishes new paths on demand, for affected traffic once the
   failure is detected and the affected traffic is identified for
   rerouting.  The new path may be based on preplanned configurations
   or on current network status.  Thus, restoration involves a path
   selection process followed by traffic rerouting.  (In [3],
   restoration is referred to as recovery by rerouting.)

   Restoration, or more specifically service restoration, also refers
   to the actions taken by a network to maintain service continuity
   after the detection of a failure.  In this second usage, restoration
   has a meaning very similar to recovery, except that restoration
   covers only the reconfiguration process and not the repair process.
   Also, in this usage, it should be clear from the context that it is
   irrelevant whether the survivability technique used to achieve
   service continuity is based on protection or restoration techniques.

   Restoration time is the time interval from the occurrence of a
   network impairment to the instant when the affected traffic is
   either completely rerouted, or when spare resources are exhausted
   and no more preemptable traffic is available to make room.

   Revertive mode is a procedure in which revertive action, i.e.,
   switch-back from the protection entity to the working entity, is
   taken once the failed working entity has been repaired.  In
   non-revertive mode, such action is not taken.  To minimize service
   interruption, switch-back in revertive mode should be performed at a
   time when it has the least impact on the traffic concerned, or by
   using the make-before-break concept.

   Shared risk group (SRG) is a set of network elements that are
   collectively impacted by a specific fault or fault type.
   For example, a shared risk link group (SRLG) is the union of all the
   links on those fibers that are routed in the same physical conduit
   in a fiber-span network.  Besides shared conduit, this concept
   includes other types of compromise, such as shared fiber cable,
   shared right of way, shared optical ring, or shared office without
   power sharing.  The span of an SRG, such as the length of the
   sharing for compromised outside plant, needs to be considered on a
   per-fault basis.

   Survivability is the capability of a network to maintain service
   continuity in the presence of faults within the network [4].
   Survivability techniques such as protection and restoration are
   implemented either on a per-link basis, on a per-path basis, or
   throughout an entire network to alleviate service disruption at
   affordable cost.  The degree of survivability is determined by the
   network's capability to survive single failures, multiple failures,
   and equipment failures.

   Working entity is the entity used to carry traffic in normal
   operation mode.  Depending on the context, an entity can be, e.g.,
   a channel or a transmission link in the physical layer, an LSP in
   MPLS, or a logical bundle of one or more LSPs.

4.3 Survivability Concepts

   In a survivable network design, spare capacity and diversity must be
   built into the network from the beginning to support some degree of
   self-healing whenever failures occur.  A common strategy is to
   associate each working entity with a protection entity having either
   dedicated resources or shared resources that are pre-reserved or
   reserved on demand.  Different approaches to providing survivability
   can be classified according to the method of setting up the
   protection entity.  Generally, protection techniques are based on
   having a dedicated protection entity set up prior to failure.  Such
   is not the case in restoration techniques, which mainly rely on the
   use of spare capacity in the network.  Hence, in terms of
   trade-offs, protection techniques usually offer fast recovery from
   failure with enhanced availability, while restoration techniques
   usually achieve better resource utilization.

   Protection techniques can be implemented by several architectures:
   1+1, 1:1, 1:n, and m:n.  In the context of SDH/SONET, they are
   referred to as Automatic Protection Switching (APS).

   In the 1+1 protection architecture, a protection entity is dedicated
   to each working entity.  A dual-feed mechanism is used, whereby the
   working entity is permanently bridged onto the protection entity at
   the source of the protected domain.  In normal operation mode,
   identical traffic is transmitted simultaneously on both the working
   and protection entities.  At the sink of the protected domain, both
   feeds are monitored for alarms and maintenance signals.  A selection
   between the working and protection entities is made based on some
   predetermined criteria, such as transmission performance
   requirements or defect indication.  This architecture is rather
   expensive, since resource duplication is required.  It is generally
   used for specific services that need very high availability.

   In the 1:1 protection architecture, a protection entity is also
   dedicated to each working entity.  The protected traffic is normally
   transmitted by the working entity.  If the working entity fails,
   the protected traffic is rerouted to the protection entity.
   This architecture is inherently slower in recovering from failure
   than a 1+1 architecture, since communication between both ends of
   the protection domain is required to perform the switch-over
   operation.  An advantage is that the protection entity can
   optionally be used to carry preemptable "extra traffic" in normal
   operation.  Also, in packet networks, a protection path can be
   pre-established for later use with pre-planned but not pre-reserved
   capacity.  (If no packets are sent into a link, no bandwidth is
   consumed.)  This is not the case in channelized transport networks.

   In the 1:n protection architecture, a dedicated protection entity is
   shared by n working entities.  Traffic is normally sent on the
   working entities.  When multiple working entities fail
   simultaneously, only one of them can be restored by the common
   protection entity.  This contention is resolved by assigning a
   different preemptive priority to each working entity.  As in the 1:1
   case, the protection entity can optionally be used to carry
   preemptable "extra traffic" in normal operation.

   The m:n architecture is a generalization of the 1:n architecture.
   Typically, with m <= n, m dedicated protection entities are shared
   by n working entities.  While this architecture can improve system
   availability with small cost increases, it has rarely been
   implemented or standardized.

5. Survivability

5.1 Scope

   Interoperable approaches to network survivability were determined to
   be an immediate requirement in packet networks as well as in
   SDH/SONET-framed TDM networks.  Not as pressing at this time were
   techniques covering all-optical networks (e.g., where framing is
   unknown), as the control of these networks in a multi-vendor
   environment appeared to have some other hurdles to deal with first.
   Also not of immediate interest were approaches to coordinate or
   explicitly communicate survivability mechanisms across network
   layers (such as from a TDM or optical network to/from an IP
   network).  However, a capability should be provided for a network
   operator to control the operation of survivability mechanisms among
   different layers.  Such issues, and those related to OAM, are
   currently outside the scope of this document.  (For proposed MPLS
   OAM requirements, see [5].)

   The types of network failures that cause a restoration to be
   performed include link/span and node failures (which might include
   span failures at lower layers).  Other, more complex failure
   mechanisms, such as systematic control-plane failure or breach of
   security, are not within the scope of the survivability mechanisms
   discussed in this document.

5.2 Required initial set of survivability mechanisms

5.2.1 1:1 Path Protection with Pre-Established Capacity

   In this protection mode, the head end of a working connection
   establishes a protection connection to the destination.  In normal
   operation, traffic is sent only on the working connection, though
   the ability to signal that traffic will be sent on both connections
   (1+1 path, for signaling purposes) would be valuable in non-packet
   networks.  Some distinction between working and protection
   connections is likely, either through explicit objects or,
   preferably, through implicit methods such as general classes or
   priorities.  Head ends need the ability to create connections that
   are as failure-disjoint from each other as possible.  This would
   require SRG information that can be generally assigned to either
   nodes or links and propagated through the control or management
   plane.
   In this mechanism, capacity for the protection connection is
   pre-established; however, it can be used to carry preemptable extra
   traffic.  Protect capacity is first-come, first-served.  When
   protect capacity is called into service during restoration, there
   should be the ability to promote the protection connection to
   working status (for non-revertive mode operation) with some form of
   make-before-break capability.

5.2.2 1:1 Path Protection with Pre-Planned Capacity

   As with 1:1 protection with pre-established capacity, the protection
   connection in this case is also pre-signaled.  The difference is in
   the way protect capacity is assigned.  With pre-planned capacity,
   the mechanism allows the protect capacity to be shared, or
   "double-booked."  The expectation is that, should operator-predicted
   failures occur (prediction potentially relying on enumeration in
   SRGs), only a limited set of protection connections would be put
   into service, and the protect capacity available in the network
   would be able to carry this traffic (given proper sizing and
   planning of the network).  In a sense, this is 1:1 from a path
   perspective; however, the protect capacity in the network (on a
   link-by-link basis) is shared in a 1:n fashion.  Some form of
   information propagation could be required before traffic may be sent
   on protection connections, especially in TDM networks.  In data
   networks, a desirable operating approach for this mechanism might be
   one where the protect capacity is not accurately booked against SRGs
   (e.g., non-predictive).

   The use of this approach improves network resource utilization, but
   may require more careful planning.  Thus, initial deployment might
   be based on 1:1 path protection with pre-established capacity and
   the local restoration mechanism described next.
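   As an illustrative aside (not a requirement of this document), the
   "double-booking" check implied above can be sketched: per-link
   protect capacity is shared, and sizing is validated against each
   single-SRG failure.  All names and figures in this Python sketch are
   hypothetical.

```python
# Sketch of validating shared ("double-booked") protect capacity
# against single-SRG failures.  Hypothetical data model; illustrative
# only, not a normative algorithm.

def protect_capacity_ok(srg_to_conns, conn_bw, conn_protect_links,
                        link_protect_cap):
    """For each SRG, sum the bandwidth its affected connections would
    place on each link of their protection paths; every link must have
    enough protect capacity.  Returns (ok, offending_srg, link)."""
    for srg, conns in srg_to_conns.items():
        needed = {}
        for c in conns:
            for link in conn_protect_links[c]:
                needed[link] = needed.get(link, 0) + conn_bw[c]
        for link, bw in needed.items():
            if bw > link_protect_cap[link]:
                return False, srg, link
    return True, None, None

# Two working connections share protect link "L1".  Each SRG failure
# affects only one of them, so 100 units of shared protect capacity
# suffice even though total booked demand is 200 (1:n-style sharing).
ok, _, _ = protect_capacity_ok(
    srg_to_conns={"SRG-A": ["c1"], "SRG-B": ["c2"]},
    conn_bw={"c1": 100, "c2": 100},
    conn_protect_links={"c1": ["L1"], "c2": ["L1"]},
    link_protect_cap={"L1": 100},
)
assert ok
```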
5.2.3 Local Restoration

   Due to the time impact of signal propagation, path-based approaches
   may not be able to meet the service requirements desired in some
   networks.  The solution to this is to restore connectivity in
   immediate proximity to the fault.  At a minimum, this approach
   should be able to protect against connectivity-type SRGs, though
   protecting against node-based SRGs might be worthwhile.  After local
   restoration is in place, it is likely that head-end systems would
   later perform some path-level re-grooming.  Head-end systems must
   have some control over whether their connections are candidates for,
   or excluded from, local restoration.

5.2.4 Path Restoration

   In this approach, connections that are impacted by a fault are
   rerouted by the originating network element upon notification of
   connection failure.  This approach does not involve any new
   mechanisms; it is merely a mention of another common approach to
   protecting against faults in a network.

5.3 Applications Supported

   With service continuity under failure as a goal, a network is
   "survivable" if, in the face of a network failure, connectivity is
   interrupted for a brief period and then restored before the network
   failure ends.  The length of this interrupted period depends on the
   application supported.  Some typical applications that need to be
   considered are:

   - Best-effort data: restoration of network connectivity by rerouting
     at the IP layer would be sufficient
   - Premium data service: need to meet TCP or application protocol
     timer requirements
   - Voice: call cutoff is in the range of 140 msec to 2 sec
   - Other real-time services (e.g., streaming, fax)
   - Mission-critical applications

5.4 Timing Bounds for Service Restoration

   The approach to picking the types of survivability mechanisms
   recommended was to consider a spectrum of mechanisms that can be
   used to protect traffic with varying characteristics of
   survivability and speed of restoration, and then to select a few
   general points that provide some coverage across that spectrum.  The
   focus of this work is to provide requirements against which a small
   set of detailed proposals may be developed, allowing the operator
   some (limited) flexibility in approaches to meeting their design
   goals in engineering multi-vendor networks.  The requirements of the
   different applications listed in the previous sub-section were
   discussed generally; however, no one on the team would likely attest
   to the scientific merit of the ability of the timing bounds below to
   meet any specific application's needs.  A few assumptions include:

   - Approaches that protection switch without propagation of
     information are likely to be faster than those that do require
     some form of fault notification to some or all elements in a
     network.
   - Approaches that require some form of signaling after a fault will
     also likely suffer some timing impact.
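   Beyond raw speed, the order in which connections are restored also
   matters when spare capacity is scarce; this sub-section ties that to
   restoration priority.  As an illustrative aside (not a requirement of
   this document), priority-ordered restoration can be sketched in
   Python; all names and numbers are hypothetical.

```python
# Sketch of restoration ordered by restoration priority, with
# connections that do not fit becoming candidates for preemption or
# release.  Hypothetical names and values; illustrative only.

def restore_order(conns, spare_capacity):
    """conns: list of (name, restoration_priority, bandwidth); a lower
    priority value restores first.  Returns (restored, deferred)."""
    restored, deferred = [], []
    for name, _, bw in sorted(conns, key=lambda c: c[1]):
        if bw <= spare_capacity:
            spare_capacity -= bw
            restored.append(name)
        else:
            deferred.append(name)  # last resort: preemption/release
    return restored, deferred

conns = [("best-effort", 3, 40), ("mission-critical", 0, 60), ("voice", 1, 30)]
restored, deferred = restore_order(conns, spare_capacity=100)
assert restored == ["mission-critical", "voice"]
assert deferred == ["best-effort"]
```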
   Proposed timing bounds for service restoration for the different
   mechanisms are as follows (all bounds are exclusive of signal
   propagation):

   - 1:1 path protection with pre-established capacity: 100-500 ms
   - 1:1 path protection with pre-planned capacity: 100-750 ms
   - Local restoration: 50 ms
   - Path restoration: 1-5 seconds

   To ensure that the service requirements for different applications
   can be met within the above timing bounds, restoration priority is
   used to determine the order in which connections are restored (to
   minimize service restoration time as well as to gain access to
   available spare capacity).  For example, mission-critical
   applications may require high restoration priority.  Preemption
   priority should be used only in the event that not all connections
   can be restored, in which case connections with lower preemption
   priority should be released.  Depending on a service provider's
   strategy in provisioning network resources for backup, preemption
   may or may not be needed in the network.

5.5 Coordination Among Layers

   A common design goal for multi-layered networks is to provide the
   desired level of service in the most cost-effective manner.  The use
   of multilayer survivability might allow the optimization of spare
   resources through improved resource utilization by sharing spare
   capacity across different layers, though further investigation is
   needed.  Coordination during service restoration among different
   network layers (e.g., IP, SDH/SONET, the optical layer) might
   necessitate development of vertical hierarchy.  The benefits of
   providing survivability mechanisms at multiple layers, and the
   optimization of the overall approach, must be weighed against the
   associated cost and service impacts.
   A default coordination mechanism for inter-layer interaction could
   be the use of nested timers and current SDH/SONET fault monitoring,
   as has been done traditionally for backward compatibility.  Thus,
   when lower-layer restoration takes longer than higher-layer
   restoration, a hold-off timer is utilized to avoid contention
   between the different single-layer recovery schemes.  In other
   words, multilayer interaction is addressed by having successively
   higher multiplexing levels operate at a restoration time scale
   greater than that of the next lowest layer.  Currently, if SDH/SONET
   protection switching is used, MPLS recovery timers must wait until
   SDH/SONET has had time to switch.

   It was felt that the current approaches to coordination of
   survivability mechanisms do not have significant operational
   shortfalls.  These approaches include protecting traffic solely at
   one layer (e.g., at the IP layer over linear WDM, or at the
   SDH/SONET layer).  Where survivability mechanisms might be deployed
   at several layers, such as when a routed network rides a SDH/SONET
   protected network, it was felt that current coordination approaches
   are sufficient in many cases.  One exception is the hold-off of MPLS
   recovery until the completion of SDH/SONET protection switching, as
   described above, which limits the recovery time of fast MPLS
   restoration.  Also, note that failures within a layer can be guarded
   against by techniques either in that layer or at a higher layer, but
   not the reverse.  Thus, the optical layer cannot guard against
   failures in the IP layer, such as router system failures or line
   card failures.
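   As an illustrative aside (not a requirement of this document), the
   nested hold-off timer scheme described above can be sketched: each
   higher layer waits at least as long as the layer below it needs to
   attempt recovery.  The timer values and layer names in this Python
   sketch are hypothetical, not normative.

```python
# Sketch of nested hold-off timers for multilayer coordination: each
# layer's hold-off covers the restoration attempt of the layer below,
# so single-layer recovery schemes do not contend.  Illustrative only.

def holdoff_ms(layers_bottom_up, margin_ms=10):
    """layers_bottom_up: list of (layer_name, restoration_time_ms),
    ordered lowest layer first.  Each layer's hold-off equals the time
    by which all lower layers have finished (plus a margin per layer)."""
    holdoffs, below = {}, 0
    for layer, restore_ms in layers_bottom_up:
        holdoffs[layer] = below
        below = holdoffs[layer] + restore_ms + margin_ms
    return holdoffs

h = holdoff_ms([("SDH/SONET", 50), ("MPLS", 100), ("IP rerouting", 5000)])
assert h["SDH/SONET"] == 0       # lowest layer acts immediately
assert h["MPLS"] == 60           # waits out SDH/SONET switching + margin
assert h["IP rerouting"] == 170  # waits out MPLS restoration as well
```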
5.6 Evolution Toward IP Over Optical

   As the more pressing requirements for survivability and for
   horizontal hierarchy with edge-to-edge signaling are met with
   technical proposals, it is believed that the benefits of merging (in
   some manner) the control planes of multiple layers will be outlined.
   When these benefits are self-evident, that would then seem to be the
   right time to review whether vertical hierarchy mechanisms are
   needed, and what the requirements might be.

6. Hierarchy Requirements

   Efforts in the area of network hierarchy should focus on mechanisms
   that would allow more scalable edge-to-edge signaling, or signaling
   across networks with existing network hierarchy (such as multi-area
   OSPF).  This would appear to be a more immediate need than
   mechanisms that might be needed to interconnect networks at
   different layers.

6.1 Historical Context

   One reason for horizontal hierarchy is functionality (e.g., metro
   versus backbone).  Geographic "islands" reduce the need for
   interoperability and make administration and operations less
   complex.  Using a simpler, more interoperable survivability scheme
   at metro/backbone boundaries is natural for many provider network
   architectures.  In transmission networks, creating geographic
   islands of different vendor equipment has long been done because
   multi-vendor interoperability has been difficult to achieve.
   Traditionally, providers have had to coordinate the equipment on
   either end of a "connection," and making this interoperable reduces
   complexity.  A provider should be able to concatenate survivability
   mechanisms in order to provide a "protected link" to the next higher
   level.  Think of SDH/SONET rings connecting to TDM DXCs with 1+1
   line-layer protection between the ADM and the DXC port.  The TDM
   connection, e.g., a DS3, is protected, but usually all equipment on
   each SDH/SONET ring is from a single vendor.  The DXC
   cross-connections are controlled by the provider, and the ports are
   physically protected, resulting in a highly available design.  Thus,
   concatenation of survivability approaches can be used to cascade
   across horizontal hierarchy.  While not perfect, this is workable in
   the near- to mid-term, until multi-vendor interoperability is
   achieved.

   While the problems associated with multi-vendor interoperability may
   necessitate horizontal hierarchy as a practical matter (at least
   this has been the case in TDM networks), there may be no technical
   reason for it.  Members of the team with more experience with IP
   networks felt there should be no need for this in core networks, or
   even in most access networks.

   Some of the largest service provider networks currently run a
   single-area/level IGP.  Some service providers, as well as many
   large enterprise networks, run multi-area OSPF to gain increases in
   scalability.  Often this was part of the original design, so it is
   difficult to say whether the network truly required the hierarchy to
   reach its current size.

   Some proposals for improved mechanisms to address network hierarchy
   have been suggested [6, 7, 8].  This document aims to provide
   concrete requirements so that these and other proposals can first
   aim to meet some limited objectives.

6.2 Applications for Horizontal Hierarchy

   A primary driver for intra-domain horizontal hierarchy is signaling
   scalability in the context of edge-to-edge VPNs, potentially across
   traffic-engineered data networks.  There are a number of different
   approaches to VPNs, and they are currently being addressed by
   different emerging protocols: RFC 2547bis BGP/MPLS VPNs,
   provider-provisioned VPNs based upon MPLS tunnels (e.g., virtual
   routers), Pseudo Wire Emulation Edge-to-Edge (PWE3), etc.
   These may or may not need explicit signaling from edge to edge, but
   it is a common perception that, in order to meet SLAs, some form of
   edge-to-edge signaling is required.

   For signaling scalability, there are probably two types of network
   scenarios to consider:

   - Large SP networks with flat routing domains, where edge-to-edge
     (MPLS) signaling as implemented today would probably not scale.

   - Networks that would like to signal edge-to-edge, and might even
     scale in a limited application.  However, they are hierarchically
     routed (e.g., OSPF areas), and current implementations, and
     potentially standards, prevent signaling across areas.  This
     requires the development of signaling standards that support
     dynamic establishment and potentially restoration of LSPs across
     a 2-level IGP hierarchy.

   Scalability is concerned with the O(N^2) properties of edge-to-edge
   signaling.  For a large network, maintaining a "connection" between
   every pair of edges is simply not scalable.  Even if establishing
   and maintaining connections is feasible, there might be an impact
   on core survivability mechanisms that would cause restoration times
   to grow with N^2, which would be undesirable.  While some value of
   N may be inevitable, approaches to reduce N (e.g., to pull in from
   the edge to aggregation points) might be of value.

   For routing scalability, especially in data applications, a major
   concern is the amount of processing/state that is required in the
   variety of network elements.  If some nodes might not be able to
   communicate and process the state of every other node, it might be
   preferable to limit the information.
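   As a rough illustration of the O(N^2) signaling concern above, the
   following sketch compares full-mesh edge-to-edge LSP counts against
   a design that pulls connections in from the edge to aggregation
   points.  The function names and the assumption that each edge homes
   to a single aggregation point are invented for illustration; they
   are not part of this document's requirements.

```python
# Hypothetical illustration (not from this draft): full-mesh
# edge-to-edge signaling state grows as O(N^2), while pulling
# connections in to aggregation points reduces it sharply.

def full_mesh_lsps(n_edges: int) -> int:
    """Unidirectional LSPs needed to fully mesh n_edges edge devices."""
    return n_edges * (n_edges - 1)

def aggregated_lsps(n_edges: int, n_agg: int) -> int:
    """LSPs when each edge homes to one aggregation point and only
    the aggregation points are fully meshed."""
    core = n_agg * (n_agg - 1)   # full mesh among aggregation points
    access = 2 * n_edges         # one LSP in each direction per edge
    return core + access

# 1000 edges flat: 999,000 LSPs; with 20 aggregation points: 2,380.
print(full_mesh_lsps(1000), aggregated_lsps(1000, 20))
```

   The point of the sketch is only that reducing the number of fully
   meshed endpoints dominates every other term in the state count.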
   One school of thought holds that the amount of information
   contained by a horizontal barrier should be significant, and that
   the impacts this might have on optimality in route selection and on
   the ability to provide global survivability are accepted tradeoffs.

6.3 Horizontal Hierarchy Requirements

   Mechanisms are required to allow for edge-to-edge signaling of
   connections through a network.  The types of network scenarios
   include large networks with a large number of edge devices and flat
   interior routing, as well as medium to large networks that
   currently have hierarchical interior routing, such as multi-area
   OSPF or multi-level IS-IS.  The primary context of this is
   edge-to-edge signaling, which is thought to be required to assure
   the SLAs for the layer 2 and layer 3 VPNs being carried across the
   network.  Another possible context would be edge-to-edge signaling
   in TDM SDH/SONET networks, where metro and core networks again
   might be in either a flat or a hierarchical interior routing
   domain.

7. Survivability and Hierarchy

   When horizontal hierarchy exists in a network layer, a question
   arises as to how survivability can be provided along a connection
   that crosses hierarchical boundaries.

   In designing protocols to meet the requirements of hierarchy, an
   approach to consider is that boundaries are either clean, or are of
   minimal value.  However, the concept of network elements that
   participate on both sides of a boundary might be a consideration
   (e.g., OSPF ABRs).  That would allow devices on either side to take
   an intra-area approach within their region of knowledge, and the
   ABR to do this in both areas, splicing the two protected
   connections together at a common point (granted, it is now a common
   point of failure).
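   A back-of-the-envelope availability model makes the tradeoff of
   such splicing concrete.  The figures and function names below are
   assumed for illustration only: each segment is modeled as
   1+1-protected (it fails only if both diverse paths fail), and the
   splicing ABR appears as a series element shared by both segments,
   i.e., the common point of failure noted above.

```python
# Hypothetical availability model (figures and names assumed, not
# taken from this draft): two protected segments spliced in series
# at one ABR.  Path failures are assumed independent.

def protected_segment(a_path: float) -> float:
    """Availability of a 1+1-style protected segment: it fails only
    when both diverse paths fail."""
    return 1.0 - (1.0 - a_path) ** 2

def spliced_availability(a_path: float, a_abr: float) -> float:
    """Two protected segments in series through a single ABR."""
    return protected_segment(a_path) * a_abr * protected_segment(a_path)

# With each unprotected path 99.9% available, a segment reaches
# roughly 99.9999%, but the end-to-end figure is capped by the ABR.
print(protected_segment(0.999), spliced_availability(0.999, 0.99999))
```

   The model suggests why the ABR's own protection (e.g., physically
   protected ports) matters: it, not the per-area schemes, bounds the
   end-to-end availability.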
   If the limitations of this approach start to appear in operational
   settings, then perhaps it would be time to start thinking about
   route servers and signaling propagated directives.  However, one
   initial approach might be to signal through a common border router,
   and to consider the service protected, as it consists of a
   concatenated set of connections that are each protected within
   their area.  Another approach might be to have a least-common-
   denominator mechanism at the boundary, e.g., 1+1 port protection.
   There should also be some standardized means for a survivability
   scheme on one side of such a boundary to communicate with the
   scheme on the other side regarding the success or failure of the
   service restoration action.  For example, if a part of a
   "connection" is down on one side of such a boundary, there is no
   need for the other side to recover from failures.

   In summary, at this time, approaches that allow concatenation of
   survivability schemes across hierarchical boundaries should prove
   sufficient.

8. Security Considerations

   Security is not considered in this initial version.

9. References

   [1] Bradner, S., "The Internet Standards Process -- Revision 3",
       BCP 9, RFC 2026, October 1996.

   [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

   [3] V. Sharma, B. Crane, K. Owens, C. Huang, F. Hellstrand, J.
       Weil, L. Andersson, B. Jamoussi, B. Cain, S. Civanlar, and A.
       Chiu, "Framework for MPLS-based Recovery", Internet-Draft, Work
       in Progress, March 2001.

   [4] D.O. Awduche, A. Chiu, A. Elwalid, I. Widjaja, and X. Xiao, "A
       Framework for Internet Traffic Engineering", Internet-Draft,
       Work in Progress, May 2001.

   [5] N. Harrison, et al, "Requirements for OAM in MPLS Networks",
       Internet-Draft, Work in Progress, May 2001.

   [6] K. Kompella and Y. Rekhter, "Multi-area MPLS Traffic
       Engineering", Internet-Draft, Work in Progress, March 2001.

   [7] G. Ash, et al, "Requirements for Multi-Area TE", Internet-
       Draft, Work in Progress, March 2001.

   [8] A. Iwata, N. Fujita, G.R. Ash, and A. Farrel, "Crankback
       Routing Extensions for MPLS Signaling", Internet-Draft, Work in
       Progress, July 2001.

10. Acknowledgments

   A lot of the direction taken in this document, and by the team, was
   steered by the insightful questions provided by Bala Rajagopalan,
   Greg Bernstein, Yangguang Xu, and Avri Doria.  The set of questions
   is attached as Appendix A of this document.

11. Authors' Addresses

   Wai Sum Lai
   AT&T
   200 Laurel Avenue
   Middletown, NJ 07748, USA
   Tel: +1 732-420-3712
   wlai@att.com

   Dave McDysan
   WorldCom
   22001 Loudoun County Pkwy
   Ashburn, VA 20147, USA
   dave.mcdysan@wcom.com

   Jim Boyle
   jimpb@nc.rr.com

   Malin Carlzon
   malin@sunet.se

   Rob Coltun
   rcoltun@redback.com

   Tim Griffin
   AT&T
   180 Park Avenue
   Florham Park, NJ 07932, USA
   Tel: +1 973-360-7238
   griffin@research.att.com

   Ed Kern
   Cogent Communications
   3413 Metzerott Rd
   College Park, MD 20740, USA
   Tel: +1 703-852-0522
   ejk@tech.org

   Tom Reddington
   Lucent Technologies
   67 Whippany Rd
   Whippany, NJ 07981, USA
   Tel: +1 973-386-7291
   treddington@bell-labs.com

Appendix A: Questions used to help develop requirements

A. Definitions

   1. In determining the specific requirements, the design team should
   precisely define the concepts "survivability", "restoration",
   "protection", "protection switching", "recovery", "re-routing",
   etc., and their relationships.  This would enable the requirements
   document to describe precisely which of these will be addressed.
   In the following, the term "restoration" is used to indicate the
   broad set of policies and mechanisms used to ensure survivability.

B. Network types and protection modes

   1. What is the scope of the requirements with regard to the types
   of networks covered?  Specifically, are the following in scope:

   - Restoration of connections in mesh optical networks (opaque or
     transparent)
   - Restoration of connections in hybrid mesh-ring networks
   - Restoration of LSPs in MPLS networks (composed of LSRs overlaid
     on a transport network, e.g., optical)
   - Any other types of networks?

   Is commonality of approach, or optimization of approach, more
   important?

   2. What are the requirements with regard to the protection modes to
   be supported in each network type covered?  (Examples of protection
   modes include 1+1, M:N, shared mesh, UPSR, BLSR, newly defined
   modes such as P-cycles, etc.)

   3. What are the requirements on local span (i.e., link-by-link)
   protection and end-to-end protection, and the interaction between
   them?  E.g., what should be the granularity of connections for each
   type (single connection, bundle of connections, etc.)?

C. Hierarchy

   1. Vertical (between two network layers):

   What are the requirements for the interaction between restoration
   procedures across two network layers, when these features are
   offered in both layers?  (Example: an MPLS network realized over
   point-to-point optical connections.)  In such a case,

   (a) Are there any criteria to choose which layer should provide
   protection?
   (b) If both layers provide survivability features, what are the
   requirements to coordinate these mechanisms?

   (c) How is the current lack of cross-layer coordination
   functionality hampering operations?

   (d) Would the benefits be worth the additional complexity
   associated with routing isolation (e.g., VPNs, areas), security,
   address isolation, and policy/authentication processes?

   2. Horizontal (between two areas or administrative subdivisions
   within the same network layer):

   (a) What are the criteria that trigger the creation of protocol or
   administrative boundaries pertaining to restoration?  (E.g.,
   scalability?  Multi-vendor interoperability?  Multi-provider?  What
   are the practical issues?)  Should multi-vendor operation
   necessitate hierarchical separation?

   When such boundaries are defined:

   (b) What are the requirements on how protection/restoration is
   performed end-to-end across such boundaries?

   (c) If different restoration mechanisms are implemented on the two
   sides of a boundary, what are the requirements on their
   interaction?

   What is the primary driver of horizontal hierarchy?  (Select one.)
   - functionality (e.g., metro vs. backbone)
   - routing scalability
   - signaling scalability
   - current network architecture, trying to layer TE on top of an
     already hierarchical network architecture
   - routing and signaling

   For signaling scalability, is it
   - manageability
   - processing/state of the network
   - the edge-to-edge N^2 type issue

   For routing scalability, is it
   - processing/state of the network
   - are you flat and want to go hierarchical
   - or already hierarchical?
   - a data or TDM application?

D. Policy

   1. What are the requirements for policy support during
   protection/restoration, e.g., restoration priority, preemption,
   etc.?

E. Signaling Mechanisms

   1. What are the requirements on the signaling transport mechanism
   (e.g., in-band over SONET/SDH overhead bytes, out-of-band over an
   IP network, etc.) used to communicate restoration protocol messages
   between network elements?  What are the bandwidth and other
   requirements on the signaling channels?

   2. What are the requirements on fault detection/localization
   mechanisms (the prelude to performing restoration procedures) in
   the case of opaque and transparent optical networks?  What are the
   requirements in the case of MPLS restoration?

   3. What are the requirements on signaling protocols to be used in
   restoration procedures (e.g., high-priority processing, security,
   etc.)?

   4. Are there any requirements on the operation of restoration
   protocols?

F. Quantitative

   1. What are the quantitative requirements (e.g., latency) for
   completing restoration under different protection modes (for both
   local and end-to-end protection)?

G. Management

   1. What information should be measured/maintained by the control
   plane at each network element pertaining to restoration events?

   2. What are the requirements for the correlation between control
   plane and data plane failures from the restoration point of view?

Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.