Network Working Group                      CCAMP GMPLS P&R Design Team
Internet Draft
Category: Informational                 Dimitri Papadimitriou (Editor)
Expiration Date: October 2005                     Eric Mannie (Editor)

                                                             April 2005

 Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based
     Recovery Mechanisms (including Protection and Restoration)

            draft-ietf-ccamp-gmpls-recovery-analysis-05.txt

Status of this Memo

This document is an Internet-Draft and is subject to all provisions of
Section 3 of RFC 3667. By submitting this Internet-Draft, each author
represents that any applicable patent or other IPR claims of which he
or she is aware have been or will be disclosed, and any of which he or
she becomes aware will be disclosed, in accordance with RFC 3668.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Copyright Notice

Copyright (C) The Internet Society (2005). All Rights Reserved.

Abstract

This document provides an analysis grid to evaluate, compare and
contrast the Generalized Multi-Protocol Label Switching (GMPLS)
protocol suite capabilities with respect to the recovery mechanisms
currently proposed at the IETF CCAMP Working Group. A detailed
analysis of each of the recovery phases is provided using the
terminology defined in a companion document. This document focuses on
transport plane survivability and recovery issues and not on control
plane resilience and related aspects.
Table of Contents

Status of this Memo
Abstract
Table of Contents
1. Contributors
2. Conventions used in this Document
3. Introduction
4. Fault Management
4.1 Failure Detection
4.2 Failure Localization and Isolation
4.3 Failure Notification
4.4 Failure Correlation
5. Recovery Mechanisms
5.1 Transport vs. Control Plane Responsibilities
5.2 Technology In/dependent Mechanisms
5.2.1 OTN Recovery
5.2.2 Pre-OTN Recovery
5.2.3 SONET/SDH Recovery
5.3 Specific Aspects of Control Plane-based Recovery Mechanisms
5.3.1 In-band vs. Out-of-band Signaling
5.3.2 Uni- vs. Bi-directional Failures
5.3.3 Partial vs. Full Span Recovery
5.3.4 Difference between LSP, LSP Segment and Span Recovery
5.4 Difference between Recovery Type and Scheme
5.5 LSP Recovery Mechanisms
5.5.1 Classification
5.5.2 LSP Restoration
5.5.3 Pre-planned LSP Restoration
5.5.4 LSP Segment Restoration
6. Reversion
6.1 Wait-To-Restore (WTR)
6.2 Revertive Mode Operation
6.3 Orphans
7. Hierarchies
7.1 Horizontal Hierarchy (Partitions)
7.2 Vertical Hierarchy (Layers)
7.2.1 Recovery Granularity
7.3 Escalation Strategies
7.4 Disjointness
7.4.1 SRLG Disjointness
8. Recovery Mechanisms Analysis
8.1 Fast Convergence (Detection/Correlation and Hold-off Time)
8.2 Efficiency (Recovery Switching Time)
8.3 Robustness
8.4 Resource Optimization
8.4.1 Recovery Resource Sharing
8.4.2 Recovery Resource Sharing and SRLG Recovery
8.4.3 Recovery Resource Sharing, SRLG Disjointness and Admission Control
9. Summary and Conclusions
10. Security Considerations
11. IANA Considerations
12. Acknowledgments
13. References
13.1 Normative References
13.2 Informative References
14. Editor's Address
Intellectual Property Statement
Disclaimer of Validity
Copyright Statement
1. Contributors

This document is the result of the CCAMP Working Group Protection and
Restoration design team joint effort. Besides the editors, the
following authors contributed to the present memo:

Deborah Brungard (AT&T)
200 S. Laurel Ave.
Middletown, NJ 07748, USA
EMail: dbrungard@att.com

Sudheer Dharanikota
EMail: sudheer@ieee.org

Jonathan P. Lang (Sonos)
506 Chapala Street
Santa Barbara, CA 93101, USA
EMail: jplang@ieee.org

Guangzhi Li (AT&T)
180 Park Avenue,
Florham Park, NJ 07932, USA
EMail: gli@research.att.com

Eric Mannie
EMail: eric_mannie@hotmail.com

Dimitri Papadimitriou (Alcatel)
Francis Wellesplein, 1
B-2018 Antwerpen, Belgium
EMail: dimitri.papadimitriou@alcatel.be

Bala Rajagopalan (Intel Broadband Wireless Division)
2111 NE 25th Ave.
Hillsboro, OR 97124, USA
EMail: bala.rajagopalan@intel.com

Yakov Rekhter (Juniper)
1194 N. Mathilda Avenue
Sunnyvale, CA 94089, USA
EMail: yakov@juniper.net

2. Conventions used in this Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

Any other recovery-related terminology used in this document conforms
to that defined in [TERM]. The reader is also assumed to be familiar
with the terminology developed in [RFC3945], [RFC3471], [RFC3473],
[GMPLS-RTG] and [LMP].

3. Introduction

This document provides an analysis grid to evaluate, compare and
contrast the Generalized MPLS (GMPLS) protocol suite capabilities with
respect to the recovery mechanisms proposed at the IETF CCAMP Working
Group. The focus is on transport plane survivability and recovery
issues and not on control plane resilience and related aspects.
Although the recovery mechanisms described in this document impose
different requirements on GMPLS-based recovery protocols, the protocol
specifications themselves are not covered in this document. Though the
concepts discussed are technology independent, this document
implicitly focuses on SONET [T1.105]/SDH [G.707], Optical Transport
Networks (OTN) [G.709] and pre-OTN technologies, except when specific
details need to be considered (for instance, in the case of failure
detection).

A detailed analysis is provided for each of the recovery phases as
identified in [TERM]. These phases define the sequence of generic
operations that need to be performed when an LSP/span failure (or any
other event generating such failures) occurs:

- Phase 1: Failure detection
- Phase 2: Failure localization and isolation
- Phase 3: Failure notification
- Phase 4: Recovery (protection/restoration)
- Phase 5: Reversion (normalization)
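As a purely illustrative summary of this sequencing (hypothetical
names; not part of any GMPLS specification), the five phases can be
modeled as an ordered enumeration:

   from enum import IntEnum

   class RecoveryPhase(IntEnum):
       """Generic sequence of operations after an LSP/span failure."""
       FAILURE_DETECTION = 1   # transport plane detects the fault
       LOCALIZATION = 2        # locate/isolate the failed entity
       NOTIFICATION = 3        # inform the deciding entities
       RECOVERY = 4            # protection or restoration switching
       REVERSION = 5           # normalization back to the working LSP

   def next_phase(phase: RecoveryPhase) -> RecoveryPhase | None:
       """Return the phase following 'phase', or None after reversion."""
       return RecoveryPhase(phase + 1) if phase < RecoveryPhase.REVERSION else None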
Failure detection, localization and notification phases together are
referred to as fault management. Within a recovery domain, the
entities involved during the recovery operations are defined in
[TERM]; these entities include ingress, egress and intermediate nodes.
The term "recovery mechanism" is used to cover both protection and
restoration mechanisms. Specific terms such as protection and
restoration are only used when differentiation is required. Likewise,
the term "failure" is used to represent both signal failure and signal
degradation.

In addition, a clear distinction is made between partitioning
(horizontal hierarchy) and layering (vertical hierarchy) when
analyzing the different hierarchical recovery mechanisms, including
disjointness related issues. The dimensions along which each of the
recovery mechanisms detailed in this document can be analyzed are
introduced to assess the current GMPLS protocol capabilities and the
potential need for further extensions. This document concludes by
detailing the applicability of the current GMPLS protocol building
blocks for recovery purposes.

4. Fault Management

4.1 Failure Detection

Transport failure detection is the only phase that cannot be achieved
by the control plane alone, since the latter needs a hook to the
transport plane to collect the related information. It has to be
emphasized that even if failure events themselves are detected by the
transport plane, the latter, upon a failure condition, must trigger
the control plane for subsequent actions through the use of GMPLS
signalling capabilities (see [RFC3471] and [RFC3473]) or Link
Management Protocol capabilities (see [LMP], Section 6).

Therefore, by definition, transport failure detection is transport
technology dependent (and so, exceptionally, we keep the "transport
plane" terminology here). In transport fault management, a distinction
is made between a defect and a failure. Here, the discussion addresses
failure detection (persistent fault cause). In the technology-
dependent descriptions, a more precise specification will be provided.

As an example, SONET/SDH (see [G.707], [G.783] and [G.806]) provides
supervision capabilities covering:

- Continuity: monitors the integrity of the continuity of a trail
  (i.e. section or path). This operation is performed by monitoring
  the presence/absence of the signal. Examples are Loss of Signal
  (LOS) detection for the physical layer, Unequipped (UNEQ) signal
  detection for the path layer, and Server Signal Fail (e.g. AIS)
  detection at the client layer.

- Connectivity: monitors the integrity of the routing of the signal
  between end-points. Connectivity monitoring is needed if the layer
  provides flexible connectivity, either automatically (e.g. cross-
  connects) or manually (e.g. fiber distribution frame). An example is
  the Trail (i.e. section or path) Trace Identifier used at the
  different layers and the corresponding Trail Trace Identifier
  Mismatch detection.

- Alignment: checks that the client and server layer frame start can
  be correctly recovered from the detection of loss of alignment. The
  specific processes depend on the signal/frame structure and may
  include (multi-)frame alignment, pointer processing and alignment of
  several independent frames to a common frame start in case of
  inverse multiplexing. Loss of alignment is a generic term; examples
  are loss of frame, loss of multi-frame, or loss of pointer.

- Payload type: checks that compatible adaptation functions are used
  at the source and the destination. This is normally done by adding a
  payload type identifier (referred to as the "signal label") at the
  source adaptation function and comparing it with the expected
  identifier at the destination. An example is the payload type
  identifier and the corresponding mismatch detection.

- Signal quality: monitors the performance of a signal. For instance,
  if the performance falls below a certain threshold a defect -
  excessive errors (EXC) or degraded signal (DEG) - is detected.

The most important point is that the supervision processes and the
corresponding failure detection (used to initiate the recovery
phase(s)) result in either:
- Signal Degrade (SD): a signal indicating that the associated data
  has degraded, in the sense that a degraded defect condition is
  active (for instance, a dDEG declared when the Bit Error Rate
  exceeds a preset threshold).

- Signal Fail (SF): a signal indicating that the associated data has
  failed, in the sense that a signal-interrupting near-end defect
  condition is active (as opposed to the degraded defect).
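To make the mapping from supervision outcomes to these two signals
concrete, the following minimal Python sketch (illustrative only; the
defect names and the BER threshold are assumptions, not values taken
from the ITU-T Recommendations) classifies a detected defect as SF or
SD:

   # Illustrative defect-to-signal classification (assumed values).
   SD_BER_THRESHOLD = 1e-6   # dDEG declared above this BER (example)

   def classify(defect: str, ber: float | None = None) -> str | None:
       """Map a detected defect to Signal Fail (SF), Signal Degrade
       (SD), or None (no recovery trigger).

       'defect' is one of the supervision outcomes discussed above,
       e.g. "LOS", "UNEQ", "LOF", "AIS" or "DEG".
       """
       signal_interrupting = {"LOS", "UNEQ", "LOF", "LOM", "LOP", "AIS"}
       if defect in signal_interrupting:
           return "SF"       # near-end signal-interrupting defect
       if defect == "DEG" and ber is not None and ber > SD_BER_THRESHOLD:
           return "SD"       # degraded defect condition is active
       return None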
In Optical Transport Networks (OTN), equivalent supervision
capabilities are provided at the optical/digital section layers (i.e.
Optical Transmission Section (OTS), Optical Multiplex Section (OMS)
and Optical channel Transport Unit (OTU)) and at the optical/digital
path layers (i.e. Optical Channel (OCh) and Optical channel Data Unit
(ODU)). Interested readers are referred to the ITU-T Recommendations
[G.798] and [G.709] for more details.

The above are examples that illustrate cases where the failure
detecting and reporting entities (see [TERM]) are co-located. The
following example illustrates the scenario where the failure detecting
and reporting entities (see [TERM]) are not co-located.

In pre-OTN networks, a failure may be masked by intermediate O-E-O
based Optical Line Systems (OLS), preventing a Photonic Cross-Connect
(PXC) from detecting upstream failures. In such cases, failure
detection may be assisted by an out-of-band communication channel, and
failure conditions reported to the PXC control plane. This can be
provided by using [LMP-WDM] extensions that deliver IP message-based
communication between the PXC and the OLS control plane. Also, since
PXCs are independent of the framing format, failure conditions can
only be triggered either by detecting the absence of the optical
signal or by measuring its quality. These mechanisms are generally
less reliable than electrical (digital) ones. Both types of detection
mechanisms are outside the scope of this document. If the intermediate
OLS supports electrical (digital) mechanisms, these failure conditions
are reported to the PXC using the LMP communication channel, and
subsequent recovery actions are performed as described in Section 5.
As such, from the control plane viewpoint, this mechanism turns the
OLS-PXC composed system into a single logical entity, allowing the
same failure management mechanisms to be considered for this entity as
for any other O-E-O capable device.

More generally, the following are typical failure conditions in
SONET/SDH and pre-OTN networks:

- Loss of Light (LOL)/Loss of Signal (LOS): Signal Fail (SF) condition
  where the optical signal is no longer detected on the receiver of a
  given interface.

- Signal Degrade (SD): detection of signal degradation over a specific
  period of time.

- For SONET/SDH payloads, all of the above-mentioned supervision
  capabilities can be used, resulting in an SD or SF condition.

In summary, the following cases apply when considering the
communication between the detecting and reporting entities:

- Co-located detecting and reporting entities: both the detecting and
  reporting entities are on the same node (e.g., SONET/SDH equipment,
  opaque cross-connects, and, with some limitations, transparent
  cross-connects, etc.)

- Non co-located detecting and reporting entities:
  o with in-band communication between entities: the entities are
    physically separated but the transport plane provides in-band
    communication between them (e.g., Server Signal Failures such as
    Alarm Indication Signal (AIS), etc.)
  o with out-of-band communication between entities: the entities are
    physically separated but an out-of-band communication channel is
    provided between them (e.g., using [LMP]).

4.2 Failure Localization and Isolation

Failure localization provides, to the deciding entity, information
about the location (and so the identity) of the transport plane entity
that detects the LSP(s)/span(s) failure. The deciding entity can then
make an accurate decision to achieve finer grained recovery switching
action(s). Note that this information can also be included as part of
the failure notification (see Section 4.3).

In some cases, accurate failure localization information may be less
urgent to determine if it requires performing more time-consuming
failure isolation (see also Section 4.4). This is particularly the
case when edge-to-edge LSP recovery (edge referring to a sub-network
end-node, for instance) is performed based on a simple failure
notification (including the identification of the working LSPs under
failure condition). In this case, more accurate localization and
isolation can be performed after recovery of these LSPs.

Failure localization should be triggered immediately after the fault
detection phase. This operation can be performed at the transport
plane and/or, if unavailable via the transport plane, at the control
plane level, where dedicated signaling messages can be used. When
performed at the control plane level, a protocol such as LMP (see
[LMP], Section 6) can be used for failure localization purposes.
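As an illustration of the localization step (a minimal sketch with
hypothetical data structures; actual LMP message exchanges are not
modeled), a deciding entity that collects per-node receiver status
along the LSP route can localize a failure to the span between the
last node seeing a valid signal and the first node seeing a failed
one:

   def localize_failure(path: list[str],
                        rx_ok: dict[str, bool]) -> tuple[str, str] | None:
       """Return the (upstream, downstream) end-points of the failed
       span, or None if localization is inconclusive.

       'path' lists nodes in transmission order; 'rx_ok[n]' is True if
       node n still sees a valid signal on its receiver. Assumes a
       single failure whose defect propagates downstream.
       """
       for upstream, downstream in zip(path, path[1:]):
           if rx_ok.get(upstream, False) and not rx_ok.get(downstream, False):
               return (upstream, downstream)
       return None

   # Example: with A-B-C-D and a break between B and C, nodes C and D
   # both lose the signal; the failure is localized to span (B, C).
   assert localize_failure(["A", "B", "C", "D"],
                           {"A": True, "B": True,
                            "C": False, "D": False}) == ("B", "C")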
4.3 Failure Notification

Failure notification is used 1) to inform intermediate nodes that an
LSP/span failure has occurred and has been detected, and 2) to inform
the deciding entities (which can correspond to any intermediate or
end-point of the failed LSP/span) that the corresponding service is
not available. In general, these deciding entities will be the ones
taking the appropriate recovery decision. When co-located with the
recovering entity, these entities will also perform the corresponding
recovery action(s).

Failure notification can be provided either by the transport or by the
control plane. As an example, let us first briefly describe the
failure notification mechanism defined at the SONET/SDH transport
plane level (also referred to as maintenance signal supervision):

- AIS (Alarm Indication Signal) occurs as a result of a failure
  condition such as Loss of Signal and is used to notify downstream
  nodes (of the appropriate layer processing) that a failure has
  occurred. AIS performs two functions: 1) inform the intermediate
  nodes (with the appropriate layer monitoring capability) that a
  failure has been detected, and 2) notify the connection end-point
  that the service is no longer available.

For a distributed control plane supporting one (or more) failure
notification mechanism(s), regardless of the mechanism's actual
implementation, the same capabilities are needed, with more (or less)
information provided about the LSPs/spans under failure condition,
their detailed status, etc.

The most important difference between these mechanisms is that
transport plane notifications (as defined today) would directly
initiate either a certain type of protection switching (such as those
described in [TERM]) via the transport plane or restoration actions
via the management plane.

On the other hand, using a failure notification mechanism through the
control plane provides the possibility to trigger either a protection
or a restoration action via the control plane. This has the advantage
that a control plane recovery responsible entity does not necessarily
have to be co-located with a transport maintenance/recovery domain. A
control plane recovery domain can be defined at entities not
supporting transport plane recovery.

Moreover, as specified in [RFC3473], notification message exchanges
through a GMPLS control plane may not follow the same path as the
LSP/spans for which these messages carry the status. In turn, this
ensures a fast, reliable (through acknowledgement and the use of
either a dedicated control plane network or disjoint control channels)
and efficient (through the aggregation of several LSP/span statuses
within the same message) failure notification mechanism.

The other important properties to be met by the failure notification
mechanism are mainly the following:

- Notification messages must provide enough information such that the
  most efficient subsequent recovery action will be taken at the
  recovering entities (in most of the recovery types and schemes this
  action is even deterministic). Remember here that these entities can
  be either intermediate or end-points through which normal traffic
  flows. Based on local policy, intermediate nodes may not use this
  information for subsequent recovery actions (see for instance the
  APS protocol phases as described in [TERM]). In addition, fast
  notification is a mechanism that runs in collaboration with the
  existing GMPLS signalling (see [RFC3473]) and also allows
  intermediate nodes to stay informed about the status of the working
  LSP/spans under failure condition.

  The trade-off here is to define what information the LSP/span
  end-points (more precisely, the deciding entity) need in order for
  the recovering entity to take the best recovery action: if not
  enough information is provided, the decision cannot be optimal (note
  that in this eventuality, the important issue is to quantify the
  level of sub-optimality), while if too much information is provided
  the control plane may be overloaded with unnecessary information,
  and the aggregation/correlation of this notification information
  will be more complex and time consuming to achieve. Note that a more
  detailed quantification of the amount of information to be exchanged
  and processed is strongly dependent on the failure notification
  protocol.

- If failure localization and isolation are not performed by one of
  the LSP/span end-points or some intermediate points, they should
  receive enough information from the notification message to locate
  the failure; otherwise they would need to (re-)initiate a failure
  localization and isolation action.

- Avoiding so-called notification storms implies that 1) the failure
  detection output is correlated (i.e. alarm correlation) and
  aggregated at the node detecting the failure(s), 2) the failure
  notifications are directed to a restricted set of destinations (in
  general the end-points), and 3) failure notification suppression
  (i.e. alarm suppression) is provided in order to limit flooding in
  case of multiple and/or correlated failures appearing at several
  locations in the network.

- Alarm correlation and aggregation (at the failure detecting node)
  imply a consistent decision based on the conditions for which a
  trade-off between fast convergence (at the detecting node) and fast
  notification (implying that correlation and aggregation occur at
  receiving end-points) can be found.
4.4 Failure Correlation

A single failure event (such as a span failure) can result in multiple
failure conditions (such as individual LSP failures) being reported.
These conditions can be grouped (i.e. correlated) to reduce the number
of failure conditions communicated on the reporting channel, for both
in-band and out-of-band failure reporting.

In such a scenario, it can be important to wait for a certain period
of time, typically called the failure correlation time, and gather all
the failures to report them as a group of failures (or simply group
failure). For instance, this approach can be provided using LMP-WDM
for pre-OTN networks (see [LMP-WDM]) or when using Signal Failure/
Degrade Group in the SONET/SDH context.

Note that a default average time interval during which the failure
correlation operation can be performed is difficult to provide, since
it is strongly dependent on the underlying network topology.
Therefore, it can be advisable to provide a per-node configurable
failure correlation time. The detailed selection criteria for this
time interval are outside the scope of this document.
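The following sketch shows one way such a configurable correlation
window could be implemented (hypothetical interface; no relation to
any specific LMP or GMPLS message format): failures arriving within
the window are batched into a single group report:

   import threading

   class FailureCorrelator:
       """Batch failure conditions detected within a per-node
       configurable correlation time into one group report."""

       def __init__(self, correlation_time_s: float, report):
           self._window = correlation_time_s
           self._report = report      # callback taking a list of failures
           self._pending: list[str] = []
           self._timer: threading.Timer | None = None
           self._lock = threading.Lock()

       def on_failure(self, failed_entity: str) -> None:
           with self._lock:
               self._pending.append(failed_entity)
               if self._timer is None:   # first failure opens the window
                   self._timer = threading.Timer(self._window, self._flush)
                   self._timer.start()

       def _flush(self) -> None:
           with self._lock:
               group, self._pending, self._timer = self._pending, [], None
           self._report(group)           # one notification for the group

   # Example: a fiber cut failing 3 lambda LSPs yields a single report.
   correlator = FailureCorrelator(0.05,
                                  lambda group: print("group failure:", group))
   for lsp in ("lsp-1", "lsp-2", "lsp-3"):
       correlator.on_failure(lsp)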
When failure correlation is not provided, multiple failure
notification messages may be sent out in response to a single failure
(for instance, a fiber cut), each one containing a set of information
on the failed working resources (for instance, the individual lambda
LSPs flowing through this fiber). This allows for a more prompt
response but can potentially overload the control plane due to a large
amount of failure notifications.

5. Recovery Mechanisms

5.1 Transport vs. Control Plane Responsibilities

For both protection and restoration, and when applicable, recovery
resources are provisioned using GMPLS signalling capabilities. Thus,
these are control plane-driven actions (topological and resource-
constrained) that are always performed in this context.

The following tables give an overview of the responsibilities taken by
the control plane in case of LSP/span recovery:

1. LSP/span Protection

- Phase 1: Failure detection              Transport plane
- Phase 2: Failure localization/isolation Transport/Control plane
- Phase 3: Failure notification           Transport/Control plane
- Phase 4: Protection switching           Transport/Control plane
- Phase 5: Reversion (normalization)      Transport/Control plane

Note: in the context of LSP/span protection, control plane actions can
be performed for operational purposes and/or synchronization purposes
(vertical synchronization between transport and control plane) and/or
notification purposes (horizontal synchronization between end-nodes at
the control plane level). This suggests selecting the responsible
plane (in particular for protection switching) during the provisioning
phase of the protected/protection LSP.

2. LSP/span Restoration

- Phase 1: Failure detection              Transport plane
- Phase 2: Failure localization/isolation Transport/Control plane
- Phase 3: Failure notification           Control plane
- Phase 4: Recovery switching             Control plane
- Phase 5: Reversion (normalization)      Control plane

Therefore, this document primarily focuses on the provisioning of LSP
recovery resources, failure notification mechanisms, recovery
switching, and reversion operations. Moreover, some additional
considerations are dedicated to the mechanisms associated with the
failure localization/isolation phase.

5.2 Technology In/dependent Mechanisms

The present analysis of recovery mechanisms applies to any circuit-
oriented data plane technology with discrete bandwidth increments
(such as SONET/SDH, G.709 OTN, etc.) controlled by a GMPLS-based
distributed control plane.

The following sub-sections are not intended to favor one technology
over another. They list the pros and cons of each, in order to
determine the mechanisms that GMPLS-based recovery must deliver to
overcome the cons and take benefit of the pros in their respective
applicability contexts.

5.2.1 OTN Recovery

OTN recovery specifics are left for further consideration.

5.2.2 Pre-OTN Recovery

Pre-OTN recovery (also referred to as "lambda switching") presents
mainly the following advantages:

- it benefits from a simpler architecture, making it more suitable for
  mesh-based recovery types and schemes (on a per-channel basis).
- suppression of intermediate node transponders (vs. the use of
  non-standard masking of upstream failures), e.g. the use of
  squelching, implies that failures (such as LoL) will propagate to
  edge nodes, giving the possibility to initiate recovery actions
  driven by upper layers.

The main disadvantage comes from the lack of interworking due to the
large number of failure management mechanisms (in particular failure
notification protocols) and recovery mechanisms currently available.

Note also that, for all-optical networks, the combination of recovery
with optical physical impairments is left for a future release of this
document, since the corresponding detection technologies are under
specification.

5.2.3 SONET/SDH Recovery

Some of the advantages of SONET [T1.105]/SDH [G.707], and more
generically of any TDM transport plane recovery, are that they
provide:

- Protection types operating at the data plane level that are
  standardized (see [G.841]) and can operate across protected domains
  and interwork (see [G.842]).

- Failure detection, notification and path/section Automatic
  Protection Switching (APS) mechanisms.

- Greater control over the granularity of the TDM LSPs/links that can
  be recovered, with respect to coarser optical channel (or whole
  fiber content) recovery switching.

Some of the limitations of SONET/SDH recovery are:

- Limited topological scope: inherently, the use of ring topologies -
  typically dedicated Sub-Network Connection Protection (SNCP) or
  shared protection rings - has reduced flexibility and resource
  efficiency with respect to the (somewhat more complex) meshed
  recovery.

- Inefficient use of spare capacity: SONET/SDH protection is largely
  applied to ring topologies, where spare capacity often remains idle,
  making the efficiency of bandwidth usage a real issue.

- Support of meshed recovery requires intensive network management
  development, and the functionality is limited by both the network
  elements and the capabilities of the element management systems
  (thus justifying the development of GMPLS-based distributed recovery
  mechanisms).

5.3 Specific Aspects of Control Plane-based Recovery Mechanisms

5.3.1 In-band vs. Out-of-band Signalling

The nodes communicate through the use of IP-terminating control
channels defining the control plane (transport) topology. In this
context, two classes of transport mechanisms can be considered: in-
fiber or out-of-fiber (through a dedicated, physically diverse control
network, referred to as the Data Communication Network or DCN). The
potential impact of using an in-fiber (signalling) transport mechanism
is briefly considered here.

The in-fiber transport mechanism can be further subdivided into
in-band and out-of-band. As such, the distinction between in-fiber
in-band and in-fiber out-of-band signalling reduces to the
consideration of a logically versus physically embedded control plane
topology with respect to the transport plane topology. In the scope of
this document, it is assumed that at least one IP control channel
between each pair of adjacent nodes is continuously available to
enable the exchange of recovery-related information and messages.
Thus, in either case (i.e. in-band or out-of-band), at least one
logical or physical control channel between each pair of nodes is
always expected to be available.
Therefore, the key issue when using in-fiber signalling is whether one
can assume independence between the fault-tolerance capabilities of
the control plane and the failures affecting the transport plane
(including the nodes). Note also that existing specifications such as
the OTN provide a limited form of independence for in-fiber signalling
by dedicating a separate optical supervisory channel (OSC, see [G.709]
and [G.874]) to transport the overhead and other control traffic. For
OTNs, failure of the OSC does not result in failing the optical
channels. Similarly, loss of the control channel must not result in
failing the data channels (transport plane).

5.3.2 Uni- versus Bi-directional Failures

The failure detection, correlation and notification mechanisms
(described in Section 4) can be triggered when either a unidirectional
or a bi-directional LSP/span failure occurs (or a combination of
both). As illustrated in Figures 1 and 2, two alternatives can be
considered:

1. Unidirectional failure detection: the failure is detected on the
   receiver side, i.e. it is only detected by the node downstream of
   the failure (or by the upstream node, depending on the failure
   propagation direction).

2. Bi-directional failure detection: the failure is detected on the
   receiver side of both the downstream node AND the upstream node to
   the failure.

Note that after the failure detection time, if only control plane
based failure management is provided, the peering node is unaware of
the failure detection status of its neighbor.

 -------             -------           -------             -------
|       |           |       |Tx     Rx|       |           |       |
| NodeA |----...----| NodeB |xxxxxxxxx| NodeC |----...----| NodeD |
|       |----...----|       |---------|       |----...----|       |
 -------             -------           -------             -------

   t0                >>>>>>> F

   t1                      x <---------------x
                               Notification

   t2  <--------...--------x                 x--------...-------->
         Up Notification                      Down Notification

Figure 1: Unidirectional failure detection and notification

 -------             -------           -------             -------
|       |           |       |Tx     Rx|       |           |       |
| NodeA |----...----| NodeB |xxxxxxxxx| NodeC |----...----| NodeD |
|       |----...----|       |xxxxxxxxx|       |----...----|       |
 -------             -------           -------             -------

   t0              F <<<<<<<           >>>>>>> F

   t1                      x <-------------> x
                               Notification

   t2  <--------...--------x                 x--------...-------->
         Up Notification                      Down Notification

Figure 2: Bi-directional failure detection and notification

After failure detection, the following failure management operations
can subsequently be considered:

- Each detecting entity sends a notification message to the
  corresponding transmitting entity. For instance, in Figure 1 (Figure
  2), node C sends a notification message to node B (while node B
  sends a notification message to node C). To ensure reliable failure
  notification, a dedicated acknowledgment message can be returned to
  the sender node.

- Next, within a certain (pre-determined) time window, the nodes
  impacted by the failure occurrences may perform their correlation.
  In case of a unidirectional failure, node B only receives the
  notification message from node C, and thus the time for this
  operation is negligible.
  In case of a bi-directional failure, node B (and node C) has to
  correlate the received notification message from node C (node B,
  respectively) with the corresponding locally detected information.

- After some (pre-determined) period of time, referred to as the
  hold-off time, during which the local recovery actions (see Section
  5.3.4) were not successful, the following occurs. In case of a
  unidirectional failure and depending on the directionality of the
  LSP, node B should send an upstream notification message (see
  [RFC3473]) to the ingress node A, and node C may send a downstream
  notification message (see [RFC3473]) to the egress node D. However,
  in such a case, only node A, referred to as the "master" (node D
  being then referred to as the "slave", per [TERM]), would initiate
  an edge-to-edge recovery action. Note that the other LSP end-node
  (i.e. node D in this case) may optionally be notified using a
  downstream notification message (see [RFC3473]).

  In case of a bi-directional failure, node B should send an upstream
  notification message (see [RFC3473]) to the ingress node A, and node
  C may send a downstream notification message (see [RFC3473]) to the
  egress node D. However, due to the dependence on the LSP
  directionality, only the ingress node A would initiate an
  edge-to-edge recovery action. Note that the other LSP end-node (i.e.
  node D in this case) should also be notified of this event using a
  downstream notification message (see [RFC3473]). For instance, if an
  LSP directed from D to A is under failure condition, only the
  notification message sent from node C to D would initiate a recovery
  action and, in this case, per [TERM], the deciding and recovering
  node D is referred to as the "master" while node A is referred to as
  the "slave" (i.e. recovering-only entity).

Note: the determination of the master and the slave may be based
either on configured information or on a dedicated protocol
capability.

In the above scenarios, the path followed by the upstream and
downstream notification messages does not have to be the same as the
one followed by the failed LSP (see [RFC3473] for more details on the
notification message exchange). The important point concerning this
mechanism is that either the detecting/reporting entity (i.e. node B
or C) is also the deciding/recovering entity, or the detecting/
reporting entity is simply an intermediate node in the subsequent
recovery process. One refers to local recovery in the former case, and
to edge-to-edge recovery in the latter (see also Section 5.3.4).
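A minimal sketch of the master/slave decision logic described above
(assuming configuration-based determination; the [RFC3473] Notify
message itself is not modeled):

   from dataclasses import dataclass

   @dataclass
   class LSP:
       ingress: str    # deciding/recovering "master" end-point
       egress: str     # recovering-only "slave" end-point

   def deciding_entities(lsp: LSP) -> tuple[str, str]:
       """Return (master, slave) for edge-to-edge recovery of 'lsp'.

       Assumes configured information: per [TERM], the master is the
       ingress of the failed LSP; the egress acts as the slave.
       """
       return lsp.ingress, lsp.egress

   def notifications(path: list[str],
                     failed_span: tuple[str, str]) -> dict[str, str]:
       """Direction of the post-hold-off notification sent by each
       node adjacent to the failed span."""
       up_node, down_node = failed_span
       return {up_node: "upstream notification to ingress " + path[0],
               down_node: "downstream notification to egress " + path[-1]}

   # For an LSP A-B-C-D failing on span (B, C): B notifies ingress A,
   # which initiates recovery as master; C may notify egress D.
   assert deciding_entities(LSP("A", "D")) == ("A", "D")
   print(notifications(["A", "B", "C", "D"], ("B", "C")))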
5.3.3 Partial versus Full Span Recovery

When a given span carries more than one LSP or LSP segment, an
additional aspect must be considered. In case of span failure, the
LSPs it carries can be recovered individually, as a group (a.k.a. bulk
LSP recovery), or as independent sub-groups. The selection of the
mechanism is triggered independently of the failure notification
granularity when correlation time windows are used, and simultaneous
recovery of several LSPs can be performed using a single request.
Moreover, the criteria by which such sub-groups can be formed are
outside the scope of this document.

Additional complexity arises in the case of (sub-)group LSP recovery.
Between a given pair of nodes, the LSPs that a given (sub-)group
contains may have been created from different source nodes (i.e.
initiators) and directed toward different destination nodes.
Consequently, the failure notification messages subsequent to a
bi-directional span failure affecting several LSPs (or the whole group
of LSPs the span carries) are not necessarily directed toward the same
initiator nodes. In particular, these messages may be directed to both
the upstream and the downstream node to the failure. Therefore, such a
span failure may trigger recovery actions to be performed from both
sides (i.e. both from the upstream and from the downstream node to the
failure). In order to facilitate the definition of the corresponding
recovery mechanisms (and their sequence), one assumes here as well
that, per [TERM], the deciding (and recovering) entity, referred to as
the "master", is the only initiator of the recovery of the whole LSP
(sub-)group.

5.3.4 Difference between LSP, LSP Segment and Span Recovery

The recovery definitions given in [TERM] are quite generic and apply
to link (or local span) and LSP recovery. The major difference between
LSP, LSP segment and span recovery is related to the number of
intermediate nodes that the signalling messages have to traverse.
Since nodes are not necessarily adjacent in the case of LSP (or LSP
segment) recovery, signalling message exchanges from the reporting to
the deciding/recovering entity may have to cross several intermediate
nodes. In particular, this applies to the notification messages, due
to the number of hops separating the location of a failure occurrence
from its destination. This results in an additional propagation and
forwarding delay. Note that the former delay may in certain
circumstances be non-negligible; e.g. in the case of a copper
out-of-band network, approximately 1 ms per 200 km.
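As a worked example of this delay, using the 1 ms per 200 km figure
quoted above together with an assumed (purely hypothetical) per-hop
forwarding delay:

   def notification_delay_ms(distance_km: float, hops: int,
                             per_hop_forwarding_ms: float = 0.1) -> float:
       """Rough one-way notification latency estimate.

       Propagation: ~1 ms per 200 km (the copper out-of-band figure
       quoted above); forwarding: an assumed per-hop processing delay.
       """
       propagation = distance_km / 200.0
       return propagation + hops * per_hop_forwarding_ms

   # Example: a notification crossing 1000 km and 5 intermediate hops
   # takes on the order of 1000/200 + 5*0.1 = 5.5 ms.
   print(round(notification_delay_ms(1000, 5), 1))   # 5.5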
Moreover, the recovery mechanisms applicable to end-to-end LSPs and to
the segments that may compose an end-to-end LSP (i.e. edge-to-edge
recovery) can be exactly the same. However, one expects, in the latter
case, that the destination of the failure notification message will be
the ingress/egress of each of these segments. Therefore, using the
mechanisms described in Section 5.3.2, failure notification messages
can first be exchanged between the terminating points of the LSP
segment and, after expiration of the hold-off time, be directed toward
the terminating points of the end-to-end LSP.

Note: several studies provide quantitative analysis of the relative
performance of LSP/span recovery techniques. [WANG], for instance,
provides an analysis grid for these techniques, showing that dynamic
LSP restoration (see Section 5.5.2) performs well under medium network
loads but suffers performance degradation at higher loads due to
greater contention for recovery resources. LSP restoration upon span
failure, as defined in [WANG], degrades at higher loads because paths
around failed links tend to increase the hop count of the affected
LSPs and thus consume additional network resources. Also, the
performance of LSP restoration can be enhanced by the source node of a
failed working LSP initiating a new recovery attempt if an initial
attempt fails. A single retry attempt is sufficient to produce large
increases in the restoration success rate and in the ability to
initiate successful LSP restoration attempts, especially at high
loads, while not adding significantly to the long-term average
recovery time. Allowing additional attempts produces only small
additional gains in performance. This suggests using additional
(intermediate) crankback signalling when using dynamic LSP restoration
(described in Section 5.5.2, case 2). Details on crankback signalling
are outside the scope of the present document.

5.4 Difference between Recovery Type and Scheme

[TERM] defines the basic LSP/span recovery types. This section
describes the recovery schemes that can be built using these recovery
types. In brief, a recovery scheme is defined as the combination of
several ingress-egress node pairs supporting a given recovery type
(from the set of recovery types they allow). Several examples are
provided here to illustrate the difference between recovery types such
as 1:1 or M:N, and recovery schemes such as (1:1)^n or (M:N)^n,
referred to as shared-mesh recovery.

1. (1:1)^n with recovery resource sharing

The exponent, n, indicates the number of times a 1:1 recovery type is
applied between at most n different ingress-egress node pairs. Here,
at most n pairs of disjoint working and recovery LSPs/spans share a
common resource at most n times. Since the working LSPs/spans are
mutually disjoint, simultaneous requests for use of the shared
(common) resource will only occur in case of simultaneous failures,
which are less likely to happen.

For instance, in the common (1:1)^2 case, if the 2 recovery LSPs in
the group overlap on the same common resource, then the group can
handle only single failures; any multiple working LSP failures will
cause at least one working LSP to be denied automatic recovery.
Consider for instance the following topology, with the working LSPs
A-B-C and F-G-H and their respective recovery LSPs A-D-E-C and
F-D-E-H, which share a common D-E link resource:

   A---------B---------C
    \                 /
     \               /
      D-------------E
     /               \
    /                 \
   F---------G---------H

2. (M:N)^n with recovery resource sharing

The (M:N)^n scheme is documented here for the sake of completeness
only (i.e. it is not mandated that GMPLS capabilities support this
scheme). The exponent, n, indicates the number of times an M:N
recovery type is applied between at most n different ingress-egress
node pairs. The interpretation follows from the previous case, except
that here disjointness applies to the N working LSPs/spans and to the
M recovery LSPs/spans, while M common resources are shared at most n
times.

In both schemes, the result is a "group" of sum_{i=1..n} N_i working
LSPs (i.e. the total number of working LSPs over the n node pairs,
with N_i = 1 per pair in the (1:1)^n case) and a pool of shared
recovery resources, not all of which are available to any given
working LSP. In such conditions, defining a metric that describes the
amount of overlap among the recovery LSPs would give some indication
of the group's ability to handle simultaneous failures of multiple
LSPs.
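One such metric can be sketched as follows (illustrative only): if
each recovery LSP is represented by the set of shared resources it
crosses, the maximum number of recovery LSPs meeting on any single
resource indicates how many simultaneous working LSP failures can lead
to contention:

   from collections import Counter

   def max_overlap(recovery_paths: dict[str, set[str]]) -> int:
       """Largest number of recovery LSPs contending for any single
       shared resource (to be compared against the per-resource
       capacity, e.g. two units in the (2:2)^2 example below)."""
       use_count = Counter(res for resources in recovery_paths.values()
                           for res in resources)
       return max(use_count.values(), default=0)

   # (1:1)^2 example from the figure above: both recovery LSPs cross
   # link D-E, so the group is only guaranteed to survive single
   # failures.
   paths = {"A-C via D-E": {"A-D", "D-E", "E-C"},
            "F-H via D-E": {"F-D", "D-E", "E-H"}}
   assert max_overlap(paths) == 2   # two recovery LSPs overlap on D-E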
For instance, in the simple (1:1)^n case, if n recovery LSPs in a
(1:1)^n group overlap, then the group can handle only single failures;
any simultaneous failure of multiple working LSPs will cause at least
one working LSP to be denied automatic recovery. But if one considers,
for instance, a (2:2)^2 group in which there are two pairs of
overlapping recovery LSPs, then two LSPs (belonging to the same pair)
can be recovered simultaneously. The latter case can be illustrated by
the following topology, with 2 pairs of working LSPs A-B-C and F-G-H
and their respective recovery LSPs A-D-E-C and F-D-E-H, which share
two common D-E link resources:

   A========B========C
    \\             //
     \\           //
      D===========E
     //           \\
    //             \\
   F========G========H

Moreover, in all these schemes, (working) path disjointness can be
enforced by exchanging information related to working LSPs during
recovery LSP signaling. Specific issues related to the combination of
shared (discrete) bandwidth and disjointness for recovery schemes are
described in Section 8.4.2.

5.5 LSP Recovery Mechanisms

5.5.1 Classification

The recovery time and ratio of LSPs/spans depend on proper recovery
LSP provisioning (meaning pre-provisioning when performed before
failure occurrence) and on the level of overbooking of recovery
resources (i.e. over-provisioning). A proper balance of these two
operations will result in the desired LSP/span recovery time and ratio
when single or multiple failure(s) occur(s). Note also that these
operations are mostly performed during the network planning phases.

The different options for LSP (pre-)provisioning and overbooking are
classified below to structure the analysis of the different recovery
mechanisms.

1. Pre-Provisioning

Proper recovery LSP pre-provisioning will help to alleviate the
failure of the working LSPs (due to the failure of the resources that
carry these LSPs). As an example, one may compute and establish the
recovery LSP either end-to-end or segment-per-segment, to protect a
working LSP from multiple failure events affecting link(s), node(s)
and/or SRLG(s). The recovery LSP pre-provisioning options can be
classified as follows (see the figure below):

(1) the recovery path can be either pre-computed or computed
    on-demand;

(2) when the recovery path is pre-computed, it can be either
    pre-signaled (implying recovery resource reservation) or signaled
    on-demand;

(3) when the recovery resources are pre-signaled, they can be either
    pre-selected or selected on-demand.

Recovery LSP provisioning phases:

   (1) Path Computation --> On-demand
            |
            |
            --> Pre-Computed
                   |
                   |
           (2) Signalling --> On-demand
                   |
                   |
                   --> Pre-Signaled
                          |
                          |
                  (3) Resource Selection --> On-demand
                          |
                          |
                          --> Pre-Selected

Note that these different options lead to different LSP/span recovery
times. The following sections will consider the above-mentioned
pre-provisioning options when analyzing the different recovery
mechanisms.
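The dependencies between these three phases (pre-signalling
presupposes a pre-computed path, and pre-selection presupposes
pre-signalled resources) can be captured in a small validity check
(hypothetical representation):

   from dataclasses import dataclass

   @dataclass(frozen=True)
   class RecoveryProvisioning:
       """(1) path computation, (2) signalling, (3) resource
       selection: each either performed before failure ('pre') or
       on-demand after it."""
       pre_computed: bool
       pre_signaled: bool
       pre_selected: bool

       def __post_init__(self):
           if self.pre_signaled and not self.pre_computed:
               raise ValueError("pre-signalling requires a pre-computed path")
           if self.pre_selected and not self.pre_signaled:
               raise ValueError("pre-selection requires pre-signalled resources")

   full_rerouting = RecoveryProvisioning(False, False, False)  # 5.5.2, case 2
   re_provisioning = RecoveryProvisioning(True, False, False)  # 5.5.2, case 1
   pre_planned = RecoveryProvisioning(True, True, True)        # 5.5.3, case 2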
2. Overbooking

Many mechanisms are available that allow the overbooking of the
recovery resources. This overbooking can be done per LSP (as in the
example mentioned above), per link (such as span protection) or even
per domain. In all these cases, the level of overbooking, as shown in
the figure below, can be classified as dedicated (such as 1+1 and
1:1), shared (such as 1:N and M:N) or unprotected (and thus
restorable, if enough recovery resources are available).

Overbooking levels:

                    +----- Dedicated (for instance: 1+1, 1:1, etc.)
                    |
                    |
                    +----- Shared (for instance: 1:N, M:N, etc.)
                    |
   Level of         |
   Overbooking -----+----- Unprotected (for instance: 0:1, 0:N)

Also, when using shared recovery, one may support preemptible
extra-traffic; the recovery mechanism is then expected to allow
preemption of this low-priority traffic in case of recovery resource
contention during recovery operations. The following sections will
consider the above-mentioned overbooking options when analyzing the
different recovery mechanisms.

5.5.2 LSP Restoration

The following times are defined to provide a quantitative estimation
of the time performance of the different LSP restoration mechanisms
(also referred to as LSP re-routing):

- Path Computation Time: Tc
- Path Selection Time: Ts
- End-to-end LSP Resource Reservation Time: Tr (a delta for resource
  selection is also considered; the corresponding total time is then
  referred to as Trs)
- End-to-end LSP Resource Activation Time: Ta (a delta for resource
  selection is also considered; the corresponding total time is then
  referred to as Tas)

The Path Selection Time (Ts) is considered when a pool of recovery LSP
paths between a given pair of source/destination end-points is
pre-computed, and, after a failure occurrence, one of these paths is
selected for the recovery of the LSP under failure condition.

Note: failure management operations such as failure detection,
correlation and notification are considered (for a given failure
event) as equally time-consuming for all the mechanisms described
below.

1. With Route Pre-computation (or LSP re-provisioning)

An end-to-end restoration LSP is established after the failure(s)
occur(s), based on a pre-computed path. As such, one can define this
as an "LSP re-provisioning" mechanism. Here, one or more (disjoint)
paths for the restoration LSP are computed (and optionally
pre-selected) before a failure occurs.

No reservation or selection of resources is performed along the
restoration path before failure occurrence. As a result, there is no
guarantee that a restoration LSP is available when a failure occurs.

The expected total restoration time T is thus equal to Ts + Trs, or to
Trs when a dedicated computation is performed for each working LSP.

2. Without Route Pre-computation (or Full LSP re-routing)

An end-to-end restoration LSP is dynamically established after the
failure(s) occur(s). Here, after failure occurrence, one or more
(disjoint) paths for the restoration LSP are dynamically computed and
one is selected. As such, one can define this as a complete "LSP
re-routing" mechanism.

No reservation or selection of resources is performed along the
restoration path before failure occurrence. As a result, there is no
guarantee that a restoration LSP is available when a failure occurs.

The expected total restoration time T is thus equal to Tc (+ Ts) +
Trs. Therefore, the time performance of these two approaches differs
by the time required for route computation Tc (and its potential
selection time, Ts).
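Using the notation above, this comparison reduces to simple
arithmetic; the component times below are purely hypothetical values
chosen for illustration:

   # Assumed (illustrative) component times in milliseconds.
   Tc, Ts, Trs = 50.0, 5.0, 120.0   # computation, selection,
                                    # reservation + selection

   # 1. With route pre-computation (LSP re-provisioning): Tc is paid
   #    before the failure; Trs alone applies if a dedicated path is
   #    computed per working LSP.
   T_reprovisioning = Ts + Trs

   # 2. Without route pre-computation (full LSP re-routing):
   T_full_rerouting = Tc + Ts + Trs

   # The two differ exactly by the post-failure route computation time.
   assert T_full_rerouting - T_reprovisioning == Tc
   print(T_reprovisioning, T_full_rerouting)   # 125.0 175.0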
1094 Therefore, the time performance of these two approaches differs by
1095 the time required for route computation, Tc (and its potential
1096 selection time, Ts).

1098 5.5.3 Pre-planned LSP Restoration

1100 Pre-planned LSP restoration (also referred to as pre-planned LSP re-
1101 routing) implies that the restoration LSP is pre-signaled. This in
1102 turn implies the reservation of recovery resources along the
1103 restoration path. Two cases can be defined, based on whether the
1104 recovery resources are pre-selected or not.

1106 1. With resource reservation and without resource pre-selection

1108 Before failure occurrence, an end-to-end restoration path is pre-
1109 selected from a set of pre-computed (disjoint) paths. The
1110 restoration LSP is signaled along this pre-selected path to reserve
1111 resources at each node, but these resources are not selected.

1113 In this case, the resources reserved for each restoration LSP may be
1114 dedicated or shared between multiple restoration LSPs whose working
1115 LSPs are not expected to fail simultaneously. Local node policies
1116 can be applied to define the degree to which these resources can be
1117 shared across independent failures. Also, since a restoration scheme
1118 is considered, resource sharing should not be limited to restoration
1119 LSPs starting and ending at the same ingress and egress nodes.
1120 Therefore, each node participating in this scheme is expected to
1121 receive some feedback information on the sharing degree of the
1122 recovery resource(s) that this scheme involves.

1124 Upon failure detection/notification message reception, signaling is
1125 initiated along the restoration path to select the resources, and to
1126 perform the appropriate operation at each node crossed by the
1127 restoration LSP (e.g. cross-connections). If lower priority LSPs
1128 were established using the restoration resources, they must be
1129 preempted when the restoration LSP is activated.

1131 D.Papadimitriou et al. - Expires October 2005                    21

1132 The expected total restoration time T is thus equal to Tas (post-
1133 failure activation), while the operations performed before failure
1134 occurrence take Tc + Ts + Tr.

1136 2. With both resource reservation and resource pre-selection

1138 Before failure occurrence, an end-to-end restoration path is pre-
1139 selected from a set of pre-computed (disjoint) paths. The
1140 restoration LSP is signaled along this pre-selected path to reserve
1141 AND select resources at each node, but these resources are not
1142 committed at the data plane level. Since the selection of the
1143 recovery resources is committed at the control plane level only, no
1144 cross-connections are performed along the restoration path.

1146 In this case, the resources reserved and selected for each
1147 restoration LSP may be dedicated or even shared between multiple
1148 restoration LSPs whose associated working LSPs are not expected to
1149 fail simultaneously. Local node policies can be applied to define
1150 the degree to which these resources can be shared across independent
1151 failures. Also, since a restoration scheme is considered, resource
1152 sharing should not be limited to restoration LSPs starting and
1153 ending at the same ingress and egress nodes. Therefore, each node
1154 participating in this scheme is expected to receive some feedback
1155 information on the sharing degree of the recovery resource(s) that
1156 this scheme involves.

1158 Upon failure detection/notification message reception, signaling is
1159 initiated along the restoration path to activate the reserved and
1160 selected resources, and to perform the appropriate operation at each
1161 node crossed by the restoration LSP (e.g. cross-connections). If
1162 lower priority LSPs were established using the restoration
1163 resources, they must be preempted when the restoration LSP is
1164 activated.

1166 The expected total restoration time T is thus equal to Ta (post-
1167 failure activation), while the operations performed before failure
1168 occurrence take Tc + Ts + Trs. Therefore, the time performance of
1169 these two approaches differs only by the time required for resource
1170 selection during the activation of the recovery LSP (i.e. Tas - Ta).
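The relations above can be summarized by the following Python sketch
(an informal model, not part of any protocol specification): it
returns the expected post-failure restoration time T for each
mechanism from the component times Tc, Ts, Trs, Ta and Tas defined in
Section 5.5.2. The function name and the sample millisecond values
are hypothetical; only the formulas come from the text.

   def restoration_time(mechanism, Tc, Ts, Trs, Ta, Tas):
       # Post-failure time only; the pre-planned schemes additionally
       # spend Tc + Ts + Tr(s) before any failure occurs.
       formulas = {
           "re-provisioning":           Ts + Trs,  # pre-computed path pool
           "re-provisioning-dedicated": Trs,       # one path per working LSP
           "full-re-routing":           Tc + Ts + Trs,
           "pre-planned-no-selection":  Tas,       # select upon activation
           "pre-planned-selection":     Ta,        # pre-selected resources
       }
       return formulas[mechanism]

   # Example with illustrative (made-up) millisecond values:
   times = dict(Tc=80.0, Ts=5.0, Trs=120.0, Ta=60.0, Tas=75.0)
   for m in ("full-re-routing", "pre-planned-no-selection",
             "pre-planned-selection"):
       print(m, restoration_time(m, **times), "ms")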
1172 5.5.4 LSP Segment Restoration

1174 The above approaches can be applied on an edge-to-edge LSP basis
1175 rather than on an end-to-end LSP basis (i.e. to reduce the global
1176 recovery time) by allowing the recovery of the individual LSP
1177 segments constituting the end-to-end LSP.

1179 Also, by using the horizontal hierarchy approach described in
1180 Section 7.1, an end-to-end LSP can be recovered by multiple recovery
1181 mechanisms applied on an LSP segment basis (e.g. 1:1 edge-to-edge
1182 LSP protection in a metro network and M:N edge-to-edge protection in
1183 the core). These mechanisms are ideally independent and may even use
1184 different failure localization and notification mechanisms.

1186 D.Papadimitriou et al. - Expires October 2005                    22

1187 6. Reversion

1189 Reversion (a.k.a. normalization) is defined as the mechanism
1190 allowing switching of normal traffic from the recovery LSP/span to
1191 the working LSP/span previously under failure condition. Use of
1192 normalization is at the discretion of the recovery domain policy.
1193 Normalization may impact the normal traffic (a second hit),
1194 depending on the normalization mechanism used.

1197 If normalization is supported, then 1) the normal traffic must be
1198 returned to the working LSP/span when the failure condition clears,
1199 and 2) the capability to de-activate (turn off) the use of reversion
1200 should be provided. De-activation of reversion should not impact the
1201 normal traffic, regardless of whether it is currently using the
1202 working or the recovery LSP/span.

1204 Note: during the failure, the reuse of any non-failed resources
1205 (e.g. LSPs and/or spans) belonging to the working LSP/span is at
1206 the discretion of the recovery domain policy.

1208 6.1 Wait-To-Restore (WTR)

1210 A specific mechanism (Wait-To-Restore) is used to prevent frequent
1211 recovery switching operations due to an intermittent defect (e.g. a
1212 BER fluctuating around the SD threshold).

1214 First, an LSP/span under failure condition must become fault-free,
1215 e.g. with a BER lower than a certain recovery threshold. After the
1216 recovered LSP/span (i.e. the previously working LSP/span) meets this
1217 criterion, a fixed period of time shall elapse before normal traffic
1218 uses the corresponding resources again. This duration, called the
1219 Wait-To-Restore (WTR) period or timer, is generally on the order of
1220 a few minutes (for instance, 5 minutes) and should be capable of
1221 being set. The WTR timer may be either a fixed period, or provide for
1222 incrementally longer periods before retrying. An SF or SD condition
1223 on the previously working LSP/span will override the WTR timer value
1224 (i.e. the WTR is cancelled and the WTR timer will restart).
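The WTR behavior described above can be sketched as follows (an
informal Python illustration, not part of any protocol
specification): reversion is attempted only once the previously
working LSP/span has remained fault-free for the whole WTR period,
and any new SF or SD condition cancels the timer, which is re-armed
on the next fault-cleared indication. Class and method names are
hypothetical.

   import threading

   class WaitToRestore:
       def __init__(self, wtr_seconds=300.0, revert=lambda: None):
           self.wtr_seconds = wtr_seconds   # e.g. 5 minutes
           self.revert = revert             # switch normal traffic back
           self._timer = None

       def on_fault_cleared(self):
           # Previously working LSP/span is fault-free (e.g. BER back
           # under the recovery threshold): (re)start the WTR timer.
           self.on_sf_or_sd()               # cancel any pending timer
           self._timer = threading.Timer(self.wtr_seconds, self.revert)
           self._timer.start()

       def on_sf_or_sd(self):
           # SF/SD overrides the WTR value: cancel the timer; it will
           # restart on the next fault-cleared indication.
           if self._timer is not None:
               self._timer.cancel()
               self._timer = None

   wtr = WaitToRestore(wtr_seconds=1.0,
                       revert=lambda: print("reverting normal traffic"))
   wtr.on_fault_cleared()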
1226 6.2 Revertive Mode Operation

1228 In the revertive mode of operation, when the recovery LSP/span is no
1229 longer required, i.e. the failed working LSP/span is no longer in SD
1230 or SF condition, a local Wait-to-Restore (WTR) state will be
1231 activated before switching the normal traffic back to the recovered
1232 working LSP/span.

1234 During the reversion operation, since this state becomes the highest
1235 in priority, signalling must maintain the normal traffic on the
1236 recovery LSP/span (i.e. the traffic switched from the previously
1237 failed working LSP/span). Moreover, during this WTR state, any null
1238 traffic or extra-traffic (if applicable) request is rejected.

1240 D.Papadimitriou et al. - Expires October 2005                    23

1241 However, deactivation (cancellation) of the wait-to-restore timer
1242 may occur in case of higher priority request attempts. That is, the
1243 usage of the recovery LSP/span by the normal traffic may be
1244 preempted if a higher priority request for this recovery LSP/span is
     attempted.

1246 6.3 Orphans

1248 When a reversion operation is requested, normal traffic must be
1249 switched from the recovery LSP/span to the recovered working
1250 LSP/span. A particular situation occurs when the previously working
1251 LSP/span cannot be recovered, such that normal traffic cannot be
1252 switched back. In such a case, the LSP/span under failure condition
1253 (also referred to as an "orphan") must be cleared, i.e. removed from
1254 the pool of resources allocated for normal traffic. Otherwise,
1255 potential de-synchronization between the control and transport plane
1256 resource usage can appear. Depending on the signalling protocol
1257 capabilities and behavior, different mechanisms are expected here.

1259 Therefore, any resources reserved or allocated for the LSP/span
1260 under failure condition must be unreserved/de-allocated. Several
1261 methods can be used for that purpose: wait for the clear-out time
1262 interval to elapse, initiate a deletion from the ingress or the
1263 egress node, or trigger the initiation of deletion from an entity
1264 (such as an EMS or NMS) capable of reacting to the reception of an
1265 appropriate notification message.

1267 7. Hierarchies

1269 Recovery mechanisms are being made available at multiple (if not
1270 all) transport layers within so-called "IP/MPLS-over-optical"
1271 networks. However, each layer has certain recovery features, and one
1272 needs to determine the exact impact of the interactions between the
1273 recovery mechanisms provided by these layers.

1275 Hierarchies are used to build scalable complex systems. Abstraction
1276 is used as a mechanism to build large networks or as a technique for
1277 enforcing technology, topological or administrative boundaries by
1278 hiding the internal details. The same hierarchical concept can be
1279 applied to control network survivability. Network survivability
1280 is the set of capabilities that allow a network to restore affected
1281 traffic in the event of a failure. Network survivability is defined
1282 further in [TERM]. In general, it is expected that the recovery
1283 action is taken by the recoverable LSP/span closest to the failure,
1284 in order to avoid the multiplication of recovery actions. Moreover,
1285 recovery hierarchies can also be bound to control plane logical
1286 partitions (e.g. administrative or topological boundaries), each of
1287 which may apply different recovery mechanisms.
1289 In brief, it is commonly accepted that the lower layers can provide
1290 coarse but fast recovery, while the higher layers can provide finer
1291 but slower recovery. Moreover, it is also desirable to avoid
1292 similar layers with functional overlaps in order to

1294 D.Papadimitriou et al. - Expires October 2005                    24

1295 optimize network resource utilization and processing overhead, since
1296 repeating the same capabilities at each layer does not create any
1297 added value for the network as a whole. In addition, enabling a lower
1298 layer recovery mechanism does not prevent the additional provision
1299 of a recovery mechanism at the upper layer. The
1300 inverse statement does not necessarily hold; that is, enabling an
1301 upper layer recovery mechanism may prevent the use of a lower layer
1302 recovery mechanism. In this context, this section intends to analyze
1303 these hierarchical aspects, including the physical (passive)
1304 layer(s).

1306 7.1 Horizontal Hierarchy (Partitioning)

1308 A horizontal hierarchy is defined when partitioning a single-layer
1309 network (and its control plane) into several recovery domains.
1310 Within a domain, the recovery scope may extend over a link (or
1311 span), an LSP segment, or even an end-to-end LSP. Moreover, an
1312 administrative domain may consist of a single recovery domain or can
1313 be partitioned into several smaller recovery domains. The operator
1314 can partition the network into recovery domains based on physical
1315 network topology, control plane capabilities or various traffic
1316 engineering constraints.

1318 An example often addressed in the literature is the metro-core-metro
1319 application (sometimes extended to metro-metro/core-core) within a
1320 single transport layer (see Section 7.2). For such a case, an end-
1321 to-end LSP is defined between the ingress and egress metro nodes,
1322 while LSP segments may be defined within the metro or core sub-
1323 networks. Each of these topological structures determines a so-
1324 called "recovery domain", since each of the LSPs they carry can have
1325 its own recovery type (or even scheme). The support of multiple
1326 recovery types and schemes within a sub-network is referred to as a
1327 multi-recovery capable domain, or simply a multi-recovery domain.

1329 7.2 Vertical Hierarchy (Layers)

1331 It is a very challenging task to combine in a coordinated manner the
1332 different recovery capabilities available across the path (i.e.
1333 switching capable) and section layers to ensure that certain network
1334 survivability objectives are met for the different services
1335 supported by the network.

1337 As a first analysis step, one can draw the following guidelines for
1338 a vertical coordination of the recovery mechanisms:
1339 - The lower the layer, the faster the notification and switching
1340 - The higher the layer, the finer the granularity of the recoverable
1341   entity and therefore the granularity of the recovery resource

1343 Moreover, in the context of this analysis, a vertical hierarchy
1344 consists of multiple layered transport planes providing different:
1345 - Discrete bandwidth granularities for non-packet LSPs such as OCh,
1346   ODUk, STS_SPE/HOVC and VT_SPE/LOVC LSPs, and continuous bandwidth
1347   granularities for packet LSPs

1349 D.Papadimitriou et al. - Expires October 2005                    25
1350 - Potential recovery capabilities with different temporal
1351   granularities, ranging from milliseconds to tens of seconds

1353 Note: based on the bandwidth granularity, we can determine four
1354 classes of vertical hierarchies: (1) packet over packet, (2) packet
1355 over circuit, (3) circuit over packet, and (4) circuit over circuit.
1356 Below we briefly expand on (4) only; (2) is covered in [RFC3386],
1357 (1) is extensively covered by the MPLS Working Group, and (3) by the
1358 PWE3 Working Group.

1360 In SONET/SDH environments, one typically considers the VT_SPE/LOVC
1361 and STS_SPE/HOVC as independent layers, with the VT_SPE/LOVC LSPs
1362 using the underlying STS_SPE/HOVC LSPs as links, for instance. In
1363 OTN, the ODUk path layers lie on the OCh path layer, i.e. the ODUk
1364 LSPs use the underlying OCh LSPs as OTUk links. Note here that lower
1365 layer LSPs may simply be provisioned and not necessarily dynamically
1366 triggered or established (control driven approach). In this context,
1367 an LSP at the path layer (i.e. established using GMPLS signalling),
1368 for instance an optical channel LSP, appears at the OTUk layer as a
1369 link, controlled by a link management protocol such as LMP.

1371 The first key issue with multi-layer recovery is that individual or
1372 bulk LSP recovery can only be as efficient as the
1373 underlying link (local span) recovery. In such a case, the span can
1374 be either protected or unprotected, but the LSP it carries must be
1375 (at least locally) recoverable. Therefore, the span recovery process
1376 can be either independent when protected (or restorable), or
1377 triggered by the upper LSP recovery process. The former case
1378 requires coordination to achieve subsequent LSP recovery. Therefore,
1379 in order to achieve robustness and fast convergence, multi-layer
1380 recovery requires a fine-tuned coordination mechanism.

1382 Moreover, in the absence of adequate recovery mechanism coordination
1383 (for instance, a pre-determined coordination when using a hold-off
1384 timer), a failure notification may propagate from one layer to the
1385 next one within a recovery hierarchy. This can cause "collisions"
1386 and trigger simultaneous recovery actions that may lead to race
1387 conditions and, in turn, reduce the optimization of the resource
1388 utilization and/or generate global instabilities in the network (see
1389 [MANCHESTER]). Therefore, a consistent and efficient escalation
1390 strategy is needed to coordinate recovery across several layers.

1392 One can thus expect the definition of the recovery mechanisms and
1393 protocol(s) to be technology-independent, such that they
1394 can be consistently implemented at different layers; this would in
1395 turn simplify their global coordination. Moreover, as mentioned in
1396 [RFC3386], some looser form of coordination and communication
1397 between (vertical) layers, such as a consistent hold-off timer
1398 configuration (set up through signalling during the working LSP
1399 establishment), can be considered, allowing the synchronization
1400 between recovery actions performed across these layers.

1402 D.Papadimitriou et al. - Expires October 2005                    26

1403 7.2.1 Recovery Granularity

1405 In most environments, the design of the network and the vertical
1406 distribution of the LSP bandwidth are such that the recovery
1407 granularity is finer at higher layers.
1408 The OTN and SONET/SDH layers can only recover the whole section or
1409 the individual connections they transport, whereas the IP/MPLS
1410 control plane can recover individual packet LSPs or groups of packet
1411 LSPs, independently of their granularity. On the other hand,
1412 recovery granularity at the sub-wavelength level (i.e. SONET/SDH)
1413 can be provided only when the network includes devices switching at
1414 the same granularity (and thus not at the optical channel level).
1415 Therefore, the network layer can deliver control-plane driven
1416 recovery mechanisms on a per-LSP basis if and only if these LSPs
1417 have their corresponding switching granularity supported at the
     transport plane level.

1419 7.3 Escalation Strategies

1421 There are two types of escalation strategies (see [DEMEESTER]):
1422 bottom-up and top-down.

1424 The bottom-up approach assumes that lower layer recovery types and
1425 schemes are more expedient and faster than the upper layer ones,
1426 so that higher layer recovery can be inhibited or held off. However,
1427 this assumption is not entirely true. Consider, for instance, a
1428 SONET/SDH-based protection mechanism (with a less than 50 ms
1429 protection switching time) lying on top of an OTN restoration
1430 mechanism (with a less than 200 ms restoration time). Therefore,
1431 this assumption should be (at least) refined as follows: a lower
1432 layer recovery mechanism is expected to be faster than an upper
1433 layer one if the same type of recovery mechanism is used at each layer.

1435 Consequently, taking into account the recovery actions at the
1436 different layers in a bottom-up approach, if lower layer recovery
1437 mechanisms are provided and sequentially activated in conjunction
1438 with higher layer ones, the lower layers must have an opportunity to
1439 recover normal traffic before the higher layers do. However, if
1440 lower layer recovery is slower than higher layer recovery, the lower
1441 layer must either communicate the failure-related information to the
1442 higher layer(s) (and allow them to perform recovery), or use a hold-
1443 off timer in order to temporarily set the higher layer recovery
1444 action in a "standby mode". Note that the a priori information
1445 exchange between layers concerning their efficiency is not within
1446 the current scope of this document. Nevertheless, the coordination
1447 functionality between layers must be configurable and tunable.

1449 One example of coordination between the optical and packet layer
1450 control planes is to let the optical layer perform the
1451 failure management operations (in particular, failure detection and
1452 notification) while giving the packet layer control plane the
1453 authority to decide and perform the recovery actions. Should the
1454 packet layer recovery action be unsuccessful, fallback at the
1455 optical layer can subsequently be performed.

1457 D.Papadimitriou et al. - Expires October 2005                    27

1458 The top-down approach attempts service recovery at the higher layers
1459 before invoking lower layer recovery. Higher layer recovery is
1460 service selective, and permits "per-CoS" or "per-connection" re-
1461 routing. With this approach, the most important aspect is that the
1462 upper layer should provide its own reliable failure detection
1463 mechanism, independent of the lower layer.

1465 The same reference also suggests recovery mechanisms incorporating a
1466 coordinated effort shared by two adjacent layers with periodic
1467 status updates. Moreover, some of these recovery operations can be
1468 pre-assigned (on a per-link basis) to a certain layer, e.g. a given
1469 link will be recovered at the packet layer while another will be
1470 recovered at the optical layer.
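The bottom-up escalation with a hold-off timer described above can
be sketched as follows (an informal Python illustration, not part of
any protocol specification): upon a failure notification, the higher
layer arms a hold-off timer and takes its own recovery action only
if the lower layer has not recovered the normal traffic in the
meantime. All names and the callback-based structure are
hypothetical.

   import threading

   class HigherLayerRecovery:
       def __init__(self, hold_off_seconds, recover):
           self.hold_off_seconds = hold_off_seconds
           self.recover = recover          # higher layer recovery action
           self.lower_layer_recovered = False

       def on_failure_notification(self):
           # Stay in "standby mode" for the hold-off period.
           threading.Timer(self.hold_off_seconds, self._expire).start()

       def on_lower_layer_recovered(self):
           self.lower_layer_recovered = True

       def _expire(self):
           if not self.lower_layer_recovered:
               self.recover()              # escalate to this layer

   h = HigherLayerRecovery(
       0.1, recover=lambda: print("higher layer recovery action"))
   h.on_failure_notification()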
1472 7.4 Disjointness

1474 Having link- and node-diverse working and recovery LSPs/spans does
1475 not guarantee their complete disjointness. Due to the common
1476 physical layer topology (passive), additional hierarchical concepts,
1477 such as the Shared Risk Link Group (SRLG), and mechanisms, such as
1478 SRLG-diverse path computation, must be developed to provide complete
1479 working and recovery LSP/span disjointness (see [IPO-IMP] and
1480 [GMPLS-RTG]). Otherwise, a failure affecting the working LSP/span
1481 could also potentially affect the recovery LSP/span; one refers to
1482 such an event as a "common failure".

1484 7.4.1 SRLG Disjointness

1486 A Shared Risk Link Group (SRLG) is defined as the set of links
1487 sharing a common risk (for instance, a common physical resource such
1488 as a fiber link or a fiber cable). For instance, a set of links L
1489 belongs to the same SRLG s if they are provisioned over the same
1490 fiber link f.

1492 The SRLG properties can be summarized as follows:

1494 1) A link belongs to more than one SRLG if and only if it crosses
1495    one of the resources covered by each of them.

1497 2) Two links belonging to the same SRLG can belong individually to
1498    (one or more) other SRLGs.

1500 3) The SRLG set S of an LSP is defined as the union of the SRLGs
1501    of the individual links composing this LSP.

1503 SRLG disjointness is also applicable to LSPs:

1505    The LSP SRLG disjointness concept is based on the following
1506    postulate: an LSP (i.e. a sequence of links and nodes) covers an
1507    SRLG if and only if it crosses one of the links or nodes
1508    belonging to that SRLG.

1510 D.Papadimitriou et al. - Expires October 2005                    28

1511    Therefore, the SRLG disjointness for LSPs can be defined as
1512    follows: two LSPs are disjoint with respect to an SRLG s if and
1513    only if they do not both cover this SRLG s.

1515    The SRLG disjointness for LSPs with respect to a set S of
1516    SRLGs is defined as follows: two LSPs are disjoint with respect
1517    to a set of SRLGs S if and only if the set of SRLGs that they
1518    both cover is disjoint from the set S.

1520 The impact on recovery is noticeable: SRLG disjointness is a
1521 necessary (but not sufficient) condition to ensure network
1522 survivability. With respect to the physical network resources, a
1523 working-recovery LSP/span pair must be SRLG-disjoint in the case of
1524 the dedicated recovery type. On the other hand, in the case of
1525 shared recovery, a group of working LSPs/spans must be mutually
1526 SRLG-disjoint in order to allow for a (single and common) shared
1527 recovery LSP that is itself SRLG-disjoint from each of the working
     LSPs/spans.
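The SRLG definitions above translate directly into set operations,
as in the following informal Python sketch (not part of any protocol
specification). For brevity, only links are considered, although the
postulate above also covers nodes; the function names and the
link-to-SRLG mapping are hypothetical.

   def srlg_set(lsp_links, srlgs_of_link):
       # Property 3: the SRLG set of an LSP is the union of the SRLGs
       # of the individual links composing this LSP.
       out = set()
       for link in lsp_links:
           out |= srlgs_of_link.get(link, set())
       return out

   def srlg_disjoint(lsp1, lsp2, srlgs_of_link, s=None):
       common = (srlg_set(lsp1, srlgs_of_link)
                 & srlg_set(lsp2, srlgs_of_link))
       if s is None:
           return not common           # disjoint w.r.t. every SRLG
       return common.isdisjoint(s)     # disjoint w.r.t. the set S

   # Both links are provisioned over the same fiber cable "srlg-7".
   srlgs_of_link = {"A-B": {"srlg-7"}, "C-D": {"srlg-7", "srlg-9"}}
   print(srlg_disjoint(["A-B"], ["C-D"], srlgs_of_link))             # False
   print(srlg_disjoint(["A-B"], ["C-D"], srlgs_of_link, {"srlg-9"})) # True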
1529 8. Recovery Mechanisms Analysis

1531 In order to provide a structured analysis of the recovery mechanisms
1532 detailed in the previous sections, the following dimensions can be
1533 considered:

1535 1. Fast convergence (performance): provide a mechanism that
1536    aggregates multiple failures (this implies fast failure
1537    detection and correlation mechanisms) and a fast recovery
1538    decision, independently of the number of failures occurring in
1539    the optical network (implying also a fast failure notification).

1541 2. Efficiency (scalability): minimize the switching time required
1542    for LSP/span recovery, independently of the number of LSPs/spans
1543    being recovered (this implies an efficient failure correlation, a
1544    fast failure notification and time-efficient recovery
1545    mechanism(s)).

1547 3. Robustness (availability): minimize the LSP/span downtime,
1548    independently of the underlying topology of the transport plane
1549    (this implies a highly responsive recovery mechanism).

1551 4. Resource optimization (optimality): minimize the resource
1552    capacity, including LSPs/spans and nodes (switching capacity),
1553    required for recovery purposes; this dimension can also be
1554    referred to as optimizing the sharing degree of the recovery
1555    resources.

1557 5. Cost optimization: provide a cost-effective recovery type/scheme.

1559 However, these dimensions are either outside the scope of this
1560 document (such as cost optimization and recovery path computational
1561 aspects) or mutually conflicting. For instance, it is obvious that
1562 providing 1+1 LSP protection minimizes the LSP downtime (in case

1564 D.Papadimitriou et al. - Expires October 2005                    29

1565 of failure) while being non-scalable and consuming recovery
1566 resources without enabling any extra-traffic.

1568 The following sections provide an analysis of the recovery phases
1569 and mechanisms detailed in the previous sections with respect to the
1570 dimensions described above, in order to assess the GMPLS protocol
1571 suite capabilities and applicability. In turn, this allows the
1572 evaluation of the potential need for further GMPLS signaling and
1573 routing extensions.

1575 8.1 Fast Convergence (Detection/Correlation and Hold-off Time)

1577 Fast convergence is related to the failure management operations. It
1578 refers to the time that elapses between failure detection/
1579 correlation (including any hold-off time) and the point at which the
1580 recovery switching actions are initiated. This point has been
1581 detailed in Section 4.

1583 8.2 Efficiency (Recovery Switching Time)

1585 In general, the more pre-assignment/pre-planning of the recovery
1586 LSP/span, the more rapid the recovery. Since protection implies
1587 pre-assignment (and cross-connection) of the protection resources,
1588 protection generally recovers faster than restoration.

1590 Span restoration is likely to be slower than most span protection
1591 types; however, this greatly depends on the efficiency of the span
1592 restoration signalling. LSP restoration with pre-signaled and pre-
1593 selected recovery resources is likely to be faster than fully
1594 dynamic LSP restoration, especially because of the elimination of
1595 any potential crankback during the recovery LSP establishment.
1596 If one excludes the crankback issue, the difference between dynamic
1597 and pre-planned restoration depends on the restoration path
1598 computation and selection time.
1599 Since computational considerations are outside the scope of this
1600 document, it is up to the vendor to determine the average and
1601 maximum path computation times in different scenarios, and up to the
1602 operator to decide whether or not dynamic restoration is
1603 advantageous over pre-planned schemes, depending on the network
1604 environment. This difference also depends on the flexibility
1605 provided by pre-planned restoration versus dynamic restoration: the
1606 former implies a somewhat limited number of failure scenarios (due,
1607 for instance, to local storage capacity limitations), while the
1608 latter enables on-demand path computation based on the information
1609 received through failure notification messages, and as such is more
     robust with respect to the failure scenario scope.

1611 Moreover, LSP segment restoration, in particular dynamic
1612 restoration (i.e. no path pre-computation, so that none of the
1613 recovery resources are pre-reserved), will generally be faster than
1614 end-to-end LSP restoration. However, local LSP restoration assumes
1615 that each LSP segment end-point has enough computational capacity to
1616 perform this operation, while end-to-end LSP restoration requires
1617 only that the LSP end-points provide this path computation capability.

1619 D.Papadimitriou et al. - Expires October 2005                    30

1620 Recovery time objectives for SONET/SDH protection switching (not
1621 including the time to detect failure) are specified in [G.841] at 50
1622 ms, taking into account constraints on distance, number of
1623 connections involved, and, in the case of ring enhanced protection,
1624 the number of nodes in the ring. Recovery time objectives for
1625 restoration mechanisms have been proposed through a separate effort [RFC3386].

1627 8.3 Robustness

1629 In general, the less pre-assignment (protection)/pre-planning
1630 (restoration) of the recovery LSP/span, the more robust the recovery
1631 type or scheme is to a variety of single failures, provided that
1632 adequate resources are available. Moreover, in the case of multiple
1633 failure scenarios, the pre-selection of the recovery resources gives
1634 less flexibility than no recovery resource pre-selection. For
1635 instance, if failures occur that affect two LSPs sharing a common
1636 link along their restoration paths, then only one of these LSPs can
1637 be recovered, unless the restoration path of at least one of these
1638 LSPs is re-computed or the local resource assignment is modified on
1639 the fly.

1641 In addition, recovery types and schemes with pre-planned recovery
1642 resources, in particular LSPs/spans for protection and LSPs for
1643 restoration purposes, will not be able to recover from failures that
1644 simultaneously affect both the working and recovery LSP/span. Thus,
1645 the recovery resources should ideally be as disjoint as possible
1646 (with respect to link, node and SRLG) from the working ones, so that
1647 any single failure event will not affect both the working and the
1648 recovery LSP/span. In brief, working and recovery resources must be
1649 fully diverse in order to guarantee that a given failure will not
1650 affect both the working and the recovery LSP/span simultaneously.
1651 Also, the risk of simultaneous failure of the working and recovery
1652 LSPs can be reduced by computing a new recovery path whenever a
1653 failure occurs along one of the recovery LSPs, or by computing a new
1654 recovery path and provisioning the corresponding LSP whenever a
1655 failure occurs along a working LSP/span. Both methods enable the
1656 network to keep the number of available recovery paths constant.
1658 The robustness of a recovery scheme is also determined by the amount
1659 of pre-reserved (i.e. signaled) recovery resources within a given
1660 shared resource pool: as the sharing degree of recovery resources
1661 increases, the recovery scheme becomes less robust to multiple
1662 LSP/span failure occurrences. Recovery schemes, in particular
1663 restoration schemes, with pre-signaled resource reservation (with or
1664 without pre-selection) should be capable of reserving an adequate
1665 amount of resources to ensure recovery from any specific set of
1666 failure events, such as any single SRLG failure, any two SRLG failures, etc.

1668 8.4 Resource Optimization

1670 It is commonly accepted that sharing recovery resources provides
1671 network resource optimization. Therefore, from a resource

1673 D.Papadimitriou et al. - Expires October 2005                    31

1674 utilization perspective, protection schemes are often classified
1675 with respect to their degree of sharing of recovery resources with
1676 the working entities. Moreover, non-permanent bridging
1677 protection types allow (under normal conditions) for extra-traffic
1678 over the recovery resources.

1680 From this perspective, 1) 1+1 LSP/span protection is the most
1681 resource-consuming protection type, since it does not allow for any
1682 extra-traffic; 2) 1:1 LSP/span recovery requires a dedicated recovery
1683 LSP/span but allows for extra-traffic; and 3) 1:N and M:N LSP/span
1684 recovery require 1 (respectively M) recovery LSPs/spans (shared
1685 between the N working LSPs/spans) and allow for extra-traffic.
1686 Obviously, neither 1+1 protection nor 1:1 recovery allows for any
1687 recovery LSP/span sharing, whereas 1:N and M:N recovery do allow
1688 sharing of 1 (respectively M) recovery LSPs/spans between N working
1689 LSPs/spans. However, despite the fact that 1:1 LSP recovery precludes
1690 the sharing of the recovery LSP, the recovery schemes (see Section
1691 5.4) that can be built from it (e.g. (1:1)^n) do allow sharing of
1692 their recovery resources. In addition, the flexibility in the usage
1693 of shared recovery resources (in particular, shared links) may be
1694 limited because of network topology restrictions, e.g. a fixed ring
1695 topology for traditional enhanced protection schemes.

1697 On the other hand, when using LSP restoration with pre-signaled
1698 resource reservation, the amount of reserved restoration capacity is
1699 determined by the local bandwidth reservation policies. In LSP
1700 restoration schemes with re-provisioning, a pool of spare resources
1701 can be defined, from which all resources are selected after failure
1702 occurrence for the purpose of restoration path computation. The
1703 degree to which restoration schemes allow sharing amongst multiple
1704 independent failures is then directly inferred from the size of the
1705 resource pool. Moreover, in all restoration schemes, spare resources
1706 can be used to carry preemptible traffic (thus over preemptible
1707 LSPs/spans) when the corresponding resources have not been committed
1708 for LSP/span recovery purposes.

1710 From this, it clearly follows that fewer recovery resources (i.e.
1711 LSP/spans and switching capacity) have to be allocated to a shared
1712 recovery resource pool if a greater sharing degree is allowed. Thus,
1713 the network survivability level is determined by the policy that
1714 defines the amount of shared recovery resources and by the maximum
1715 sharing degree allowed for these recovery resources.

1717 8.4.1 Recovery Resource Sharing

1719 When recovery resources are shared over several LSPs/spans, the use
1720 of the Maximum Reservable Bandwidth, the Unreserved Bandwidth and
1721 the Maximum LSP Bandwidth (see [GMPLS-RTG]) provides the information
1722 needed to optimize the network resources allocated for shared
1723 recovery purposes.

1725 The Maximum Reservable Bandwidth is defined as the Maximum Link
1726 Bandwidth, but it may be greater in the case of link over-subscription.

1728 D.Papadimitriou et al. - Expires October 2005                    32

1729 The Unreserved Bandwidth (at priority p) is defined as the bandwidth
1730 not yet reserved on a given TE link (its initial value for each
1731 priority p corresponds to the Maximum Reservable Bandwidth). Last,
1732 the Maximum LSP Bandwidth (at priority p) is defined as the smaller
1733 of the Unreserved Bandwidth (at priority p) and the Maximum Link Bandwidth.

1735 Here, one generally considers a recovery resource sharing degree (or
1736 ratio) to globally optimize the shared recovery resource usage. The
1737 distribution of the bandwidth utilization per TE link can be
1738 inferred from the per-priority bandwidth pre-allocation. By using
1739 the Maximum LSP Bandwidth and the Maximum Reservable Bandwidth, the
1740 amount of (over-provisioned) resources that can be used for shared
1741 recovery purposes is known from the IGP.

1743 In order to analyze this behavior, we define the difference between
1744 the Maximum Reservable Bandwidth (in the present case, this value is
1745 greater than the Maximum Link Bandwidth) and the Maximum LSP
1746 Bandwidth per TE link i as the Maximum Shareable Bandwidth, or
1747 max_R[i]. Within this quantity, the amount of bandwidth currently
1748 allocated for shared recovery per TE link i is defined as R[i]. Both
1749 quantities are expressed in terms of discrete bandwidth units (and
1750 thus, the Minimum LSP Bandwidth is one bandwidth unit).

1752 The knowledge of this information, available per TE link, can be
1753 exploited in order to optimize the usage of the resources allocated
1754 per TE link for shared recovery. If one refers to r[i] as the actual
1755 bandwidth per TE link i (in terms of discrete bandwidth units)
1756 committed for shared recovery, then the following quantity must be
1757 maximized over the potential TE link candidates:

1759    sum {i=1}^N [(R[i] - r[i])/(t[i] - b[i])]

1761    or equivalently: sum {i=1}^N [(R[i] - r[i])/r[i]]

1763    with R[i] >= 1 and r[i] >= 1 (in terms of per-component
1764    bandwidth units)

1766 In this formula, N is the total number of links traversed by a given
1767 LSP, t[i] the Maximum Link Bandwidth per TE link i, and b[i] the sum
1768 per TE link i of the bandwidth committed for working LSPs and other
1769 recovery LSPs (thus excluding "shared bandwidth" LSPs). The quantity
1770 [(R[i] - r[i])/r[i]] is defined as the Shared (Recovery) Bandwidth
1771 Ratio per TE link i. In addition, TE links for which R[i] reaches
1772 max_R[i], or for which r[i] = 0, are pruned during shared recovery
1773 path computation, as are TE links for which max_R[i] = r[i], which
1774 simply cannot be shared.
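The following informal Python sketch (not part of any protocol
specification) illustrates the maximization above: candidate TE
links are pruned according to the rules just given, and the
remaining paths are scored by the sum of their per-link Shared
(Recovery) Bandwidth Ratios. The TELink record and all names are
hypothetical; only the formula and the pruning rules come from the
text.

   from dataclasses import dataclass

   @dataclass
   class TELink:
       max_R: int   # Maximum Shareable Bandwidth (bandwidth units)
       R: int       # bandwidth allocated for shared recovery
       r: int       # bandwidth committed for shared recovery

   def shareable(link):
       # Prune when R[i] reaches max_R[i], r[i] = 0 or max_R[i] = r[i].
       return link.R < link.max_R and link.r > 0 and link.max_R != link.r

   def path_score(links):
       # sum {i=1}^N [(R[i] - r[i])/r[i]], to be maximized.
       return sum((l.R - l.r) / l.r for l in links)

   candidates = {
       "path-1": [TELink(10, 6, 2), TELink(8, 4, 2)],
       "path-2": [TELink(10, 9, 8), TELink(8, 8, 8)],  # 2nd link pruned
   }
   feasible = {n: p for n, p in candidates.items()
               if all(shareable(l) for l in p)}
   print(max(feasible, key=lambda n: path_score(feasible[n])))  # path-1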
1776 More generally, one can draw the following mapping between the
1777 available bandwidth at the transport and control plane level:

1779   -  ----------  Max Reservable Bandwidth
1780   |    -----                                 ^
1781   |R   -----                                 |

1783 D.Papadimitriou et al. - Expires October 2005                    33

1784   |    -----                                 |
1785   -    -----                                 |max_R
1786        -----                                 |
1787  -------- TE link Capacity  -  ------        |  - Maximum TE Link Bandwidth
1788   -----                     |r  -----        v
1789   -----    <------ b ------>  -  ----------  Maximum LSP Bandwidth
1790   -----                          -----
1791   -----                          -----
1792   -----                          -----
1793   -----                          -----
1794   -----                          -----  <--- Minimum LSP Bandwidth
1795  -------- 0                      ---------- 0

1797 Note that the above approach does not require the flooding of any
1798 per-LSP information or any detailed distribution of the bandwidth
1799 allocation per component link or individual port, or even any per-
1800 priority shareable recovery bandwidth information (using a dedicated
1801 sub-TLV). The latter would provide the same capability as the
1802 already defined per-priority Maximum LSP Bandwidth information. Such
1803 an approach is referred to as Partial (or Aggregated) Information
1804 Routing, as described for instance in [KODIALAM1] and [KODIALAM2].
1805 These show that the difference obtained with a Full (or Complete)
1806 Information Routing approach (where, for the whole set of working and
1807 recovery LSPs, the amount of bandwidth units they use per link is
1808 known at each node and for each link) is clearly negligible. The
1809 latter approach is detailed in [GLI], for instance. Note also that
1810 both approaches rely on the deterministic knowledge (at different
1811 degrees) of the network topology and resource usage status.

1813 Moreover, extending the GMPLS signalling capabilities can enhance
1814 the Partial Information Routing approach, by allowing working
1815 LSP related information, in particular its path (including link
1816 and node identifiers), to be exchanged with the recovery LSP request,
1817 to enable more efficient admission control at upstream nodes of
1818 shared recovery resources, in particular links (see Section 8.4.3).

1820 8.4.2 Recovery Resource Sharing and SRLG Recovery

1822 Resource shareability can also be maximized with respect to the
1823 number of times each SRLG is protected by a recovery resource (in
1824 particular, a shared TE link), and methods can be considered for
1825 avoiding contention of the shared recovery resources in case of
1826 single SRLG failure. These methods enable the sharing of
1827 recovery resources between two (or more) recovery LSPs if their
1828 respective working LSPs are mutually disjoint with respect to link,
1829 node and SRLGs. A single failure then does not simultaneously
1830 disrupt several (or at least two) working LSPs.

1832 For instance, [BOUILLET] shows that the Partial Information Routing
1833 approach can be extended to cover recovery resource shareability
1834 with respect to SRLG recoverability (i.e. the number of times each
1835 SRLG is recoverable). By flooding this aggregated information per TE

1837 D.Papadimitriou et al. - Expires October 2005                    34

1838 link, path computation and selection of SRLG-diverse recovery LSPs
1839 can be optimized with respect to the sharing of recovery resources
1840 reserved on each TE link, giving a performance difference of less
1841 than 5% (and thus negligible) compared to the corresponding Full
1842 Information Flooding approach (see [GLI]).
1844 For this purpose, additional extensions to [GMPLS-RTG] in support of
1845 path computation for shared mesh recovery have often been considered
1846 in the literature. TE link attributes would include, among others,
1847 the current number of recovery LSPs sharing the recovery resources
1848 reserved on the TE link, and the current number of SRLGs recoverable
1849 by this amount of (shared) recovery resources reserved on the TE
1850 link. The latter is equivalent to the current number of SRLGs that
1851 the recovery LSPs sharing the recovery resources reserved on the TE
1852 link shall recover. Then, if explicit SRLG recoverability is
1853 considered, an additional TE link attribute would include the
1854 explicit list of SRLGs recoverable by the shared recovery resources
1855 reserved on the TE link and their respective shareable recovery
1856 bandwidth. The latter information is equivalent to the shareable
1857 recovery bandwidth per SRLG (or per group of SRLGs), which implies
1858 considering a decreasing amount of shareable bandwidth and SRLG list
     over time.

1860 Compared to the case of recovery resource sharing only (regardless
1861 of SRLG recoverability, as described in Section 8.4.1), these
1862 additional TE link attributes would potentially deliver better path
1863 computation and selection (at distinct ingress nodes) for shared
1864 mesh recovery purposes. However, due to the lack of evidence of
1865 better efficiency, and due to the complexity that such extensions
1866 would generate, they are not further considered in the scope of the
1867 present analysis. For instance, a per-SRLG-group minimum/maximum
1868 shareable recovery bandwidth is restricted by the length that the
1869 corresponding (sub-)TLV may take and thus the number of SRLGs that
1870 it can include. Therefore, the corresponding parameter should not be
1871 translated into GMPLS routing (or even signalling) protocol
1872 extensions in the form of a TE link sub-TLV.

1874 8.4.3 Recovery Resource Sharing, SRLG Disjointness and Admission
1875       Control

1877 Admission control is a strict requirement to be fulfilled by nodes
1878 giving access to shared links. This can be illustrated using the
1879 following network topology:

1881                A ------ C ====== D
1882                |        |        |
1883                |        |        |
1884                |        B        |
1885                |        |        |
1886                |        |        |
1887                ------- E ------ F

1889 Node A creates a working LSP to D (A-C-D), B simultaneously creates
1890 a working LSP to D (B-C-D) and a recovery LSP (B-E-F-D) to the same

1892 D.Papadimitriou et al. - Expires October 2005                    35

1893 destination. Then, A decides to create a recovery LSP to D (A-E-F-
1894 D), but since the C-D span carries both working LSPs, node E should
1895 either assign a dedicated resource for this recovery LSP or reject
1896 this request if the C-D span has already reached its maximum
1897 recovery bandwidth sharing ratio. Otherwise, in the latter case, a
1898 C-D span failure would imply that one of the working LSPs would not
1899 be recoverable.

1901 Consequently, node E must have the required information to perform
1902 admission control for the recovery LSP requests it processes
1903 (implying, for instance, that the path followed by the working LSP
1904 is carried with the corresponding recovery LSP request). If node E
1905 can guarantee that the working LSPs (A-C-D and B-C-D) are SRLG-
1906 disjoint over the C-D span, it may securely accept the incoming
1907 recovery LSP request and assign to the recovery LSPs (A-E-F-D and
1908 B-E-F-D) the same resources on the link E-F, provided that the link
1909 E-F has not yet reached its maximum recovery bandwidth sharing
1910 ratio. In this example, one assumes that the node failure
1911 probability is negligible compared to the link failure probability.

1913 To achieve this, the path followed by the working LSP is transported
1914 with the recovery LSP request and examined at each upstream node of
1915 potentially shareable links. Admission control is performed using
1916 the interface identifiers (included in the path) to retrieve from
1917 the TE database the list of SRLG ids associated with each of the
1918 working LSP links. If the working LSPs (A-C-D and B-C-D) have one or
1919 more link or SRLG ids in common (in this example, one or more SRLG
1920 ids in common over the span C-D), node E should not assign the same
1921 resource over link E-F to the recovery LSPs (A-E-F-D and B-E-F-D).
1922 Otherwise, one of these working LSPs would not be recoverable in
1923 case of a C-D span failure.

1925 There are some issues related to this method, the major one being
1926 the number of SRLG ids that a single link can cover (more than 100
1927 in complex environments). Moreover, when using link bundles, this
1928 approach may generate the rejection of some recovery LSP requests.
1929 This occurs when the SRLG sub-TLV corresponding to a link bundle
1930 includes the union of the SRLG id lists of all the component links
1931 belonging to this bundle (see [GMPLS-RTG] and [BUNDLE]).

1933 In order to overcome this specific issue, an additional mechanism
1934 may consist of querying the nodes where such information would be
1935 available (in this case, node E would query C). The main drawback of
1936 this method is that, in addition to the dedicated mechanism(s) it
1937 requires, it may become complex when several common nodes are
1938 traversed by the working LSPs. Therefore, when using link bundles,
1939 solving this issue is tightly related to the sequence of the
1940 recovery operations. Per-component flooding of SRLG identifiers
1941 would deeply impact the scalability of the link state routing
1942 protocol. Therefore, one may rely on the usage of an on-line
1943 accessible network management system.
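The admission control decision performed by node E can be sketched
as follows (an informal Python illustration, not part of any
protocol specification): the node guarding a shared link grants the
shared resource only if the new working path is link-, node- and
SRLG-disjoint from the working paths already protected by that
resource, and only while the sharing ratio permits. The data
structures are hypothetical; the rule follows Sections 8.4.2 and
8.4.3.

   def disjoint(path_a, path_b, srlgs_of_link):
       # Each path is (set of node ids, set of link ids) of a working LSP.
       nodes_a, links_a = path_a
       nodes_b, links_b = path_b
       if nodes_a & nodes_b or links_a & links_b:
           return False
       srlgs_a = set().union(*(srlgs_of_link.get(l, set())
                               for l in links_a))
       srlgs_b = set().union(*(srlgs_of_link.get(l, set())
                               for l in links_b))
       return srlgs_a.isdisjoint(srlgs_b)

   def admit(new_working, protected_workings, srlgs_of_link,
             sharing_ratio, max_sharing_ratio):
       if sharing_ratio >= max_sharing_ratio:
           return False    # link reached its sharing ratio: reject
       return all(disjoint(new_working, w, srlgs_of_link)
                  for w in protected_workings)

   # Node E: A-C-D and B-C-D share span C-D (and node C) -> reject.
   srlgs = {"A-C": {1}, "C-D": {2}, "B-C": {3}}
   a_c_d = ({"A", "C", "D"}, {"A-C", "C-D"})
   b_c_d = ({"B", "C", "D"}, {"B-C", "C-D"})
   print(admit(a_c_d, [b_c_d], srlgs,
               sharing_ratio=1, max_sharing_ratio=4))   # False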
1945 D.Papadimitriou et al. - Expires October 2005                    36

1946 9. Summary and Conclusions

1948 The following table summarizes the different recovery types and
1949 schemes analyzed throughout this document.

1951 --------------------------------------------------------------------
1952 |           Path Search (computation and selection)
1953 --------------------------------------------------------------------
1954 |        Pre-planned (a)             | Dynamic (b)
1955 --------------------------------------------------------------------
1956 |      | faster recovery             | Does not apply
1957 |      | less flexible               |
1958 |  1   | less robust                 |
1959 |      | most resource consuming     |
1960 Path   |                             |
1961 Setup  ------------------------------------------------------------
1962 |      | relatively fast recovery    | Does not apply
1963 |      | relatively flexible         |
1964 |  2   | relatively robust           |
1965 |      | resource consumption        |
1966 |      | depends on sharing degree   |
1967        ------------------------------------------------------------
1968 |      | relatively fast recovery    | slower (path computation)
1969 |      | more flexible               | most flexible
1970 |  3   | relatively robust           | most robust
1971 |      | less resource consuming     | least resource consuming
1972 |      | depends on sharing degree   |
1973 --------------------------------------------------------------------
1975 1a. Recovery LSP setup (before failure occurrence) with resource
1976     reservation (i.e. signalling) and selection is referred to as
1977     LSP protection.

1979 2a. Recovery LSP setup (before failure occurrence) with resource
1980     reservation (i.e. signalling) and with resource pre-selection is
1981     referred to as pre-planned LSP re-routing with resource pre-
1982     selection. This implies only recovery LSP activation after
1983     failure occurrence.

1985 3a. Recovery LSP setup (before failure occurrence) with resource
1986     reservation (i.e. signalling) and without resource selection is
1987     referred to as pre-planned LSP re-routing without resource pre-
1988     selection. This implies recovery LSP activation and resource
1989     (i.e. label) selection after failure occurrence.

1991 3b. Recovery LSP setup after failure occurrence is referred to
1992     as LSP re-routing, which is full when recovery LSP path
1993     computation occurs after failure occurrence.

1995 The term pre-planned thus refers to recovery LSP path pre-
1996 computation, signaling (reservation), and a priori resource
1997 selection (optional), but not cross-connection. Also, the shared-

1999 D.Papadimitriou et al. - Expires October 2005                    37

2000 mesh recovery scheme can be viewed as a particular case of 2a) and
2001 3a), using the additional constraint described in Section 8.4.3.

2003 The implementation of these recovery mechanisms requires only
2004 considering extensions to the GMPLS signalling protocols (i.e.
2005 [RFC3471] and [RFC3473]). These GMPLS signalling extensions should
2006 mainly focus on delivering (1) recovery LSP pre-provisioning for the
2007 cases 1a, 2a and 3a, (2) LSP failure notification, (3) recovery LSP
2008 switching action(s), and (4) reversion mechanisms.

2010 Moreover, the present analysis (see Section 8) shows that no GMPLS
2011 routing extensions are expected to efficiently implement any of
2012 these recovery types and schemes.

2014 10. Security Considerations

2016 This document does not introduce any additional security issue or
2017 imply any specific security consideration from [RFC3945] to the
2018 current RSVP-TE GMPLS signaling, routing protocols (OSPF-TE, IS-IS-
2019 TE) or network management protocols.

2021 However, the authorization of requests for resources by GMPLS-
2022 capable nodes should determine whether a given party, presumably
2023 already authenticated, has a right to access the requested
2024 resources. This determination is typically a matter of local policy
2025 control, for example by setting limits on the total bandwidth made
2026 available to some party in the presence of resource contention. Such
2027 policies may become quite complex as the number of users, types of
2028 resources and sophistication of authorization rules increases. This
2029 is particularly the case for recovery schemes that assume pre-
2030 planned sharing of recovery resources, or contention for resources
2031 in case of dynamic re-routing.

2033 Therefore, control elements should match such requests against the
2034 local authorization policy. These control elements must be capable
2035 of making decisions based on the identity of the requester, as
2036 verified cryptographically and/or topologically.

2038 11. IANA Considerations

2040 This document defines no new code points and requires no action by
2041 IANA.

2043 12. Acknowledgments
2045 The authors would like to thank Fabrice Poppe (Alcatel) and Bart
2046 Rousseau (Alcatel) for their revision effort, and Richard Rabbat
2047 (Fujitsu Labs), David Griffith (NIST) and Lyndon Ong (Ciena) for
2048 their useful comments.

2050 Thanks also to Adrian Farrel for the thorough review of the
2051 document.

2053 D.Papadimitriou et al. - Expires October 2005                    38

2054 13. References

2056 13.1 Normative References

2058 [BUNDLE]    K.Kompella et al., "Link Bundling in MPLS Traffic
2059             Engineering," Work in Progress, draft-ietf-mpls-bundle-
2060             06.txt, December 2004.

2062 [GMPLS-RTG] K.Kompella (Editor) et al., "Routing Extensions in
2063             Support of Generalized Multi-Protocol Label Switching,"
2064             Work in Progress, draft-ietf-ccamp-gmpls-routing-
2065             09.txt, October 2003.

2067 [LMP]       J.P.Lang (Editor) et al., "Link Management Protocol
2068             (LMP)," Work in Progress, draft-ietf-ccamp-lmp-10.txt,
2069             October 2003.

2071 [LMP-WDM]   A.Fredette and J.P.Lang (Editors), "Link Management
2072             Protocol (LMP) for Dense Wavelength Division
2073             Multiplexing (DWDM) Optical Line Systems," Work in
2074             Progress, draft-ietf-ccamp-lmp-wdm-03.txt, October
2075             2003.

2077 [RFC2026]   S.Bradner, "The Internet Standards Process -- Revision
2078             3," BCP 9, RFC 2026, October 1996.

2080 [RFC2119]   S.Bradner, "Key words for use in RFCs to Indicate
2081             Requirement Levels," BCP 14, RFC 2119, March 1997.

2083 [RFC3471]   L.Berger (Editor) et al., "Generalized Multi-Protocol
2084             Label Switching (GMPLS) Signaling Functional
2085             Description," RFC 3471, January 2003.

2087 [RFC3473]   L.Berger (Editor) et al., "Generalized Multi-Protocol
2088             Label Switching (GMPLS) Signaling Resource ReserVation
2089             Protocol-Traffic Engineering (RSVP-TE) Extensions," RFC
2090             3473, January 2003.

2092 [RFC3667]   S.Bradner, "IETF Rights in Contributions," BCP 78,
2093             RFC 3667, February 2004.

2095 [RFC3668]   S.Bradner, Ed., "Intellectual Property Rights in IETF
2096             Technology," BCP 79, RFC 3668, February 2004.

2098 [RFC3945]   E.Mannie (Editor) et al., "Generalized Multi-Protocol
2099             Label Switching Architecture," RFC 3945, October 2004.

2101 [TERM]      E.Mannie and D.Papadimitriou (Editors), "Recovery
2102             (Protection and Restoration) Terminology for
2103             Generalized Multi-Protocol Label Switching (GMPLS),"
2104             Work in Progress, draft-ietf-ccamp-gmpls-recovery-
2105             terminology-06.txt, April 2005.

2107 D.Papadimitriou et al. - Expires October 2005                    39

2108 13.2 Informative References

2110 [BOUILLET]  E.Bouillet et al., "Stochastic Approaches to Compute
2111             Shared Meshed Restored Lightpaths in Optical Network
2112             Architectures," IEEE Infocom 2002, New York City, June
2113             2002.

2115 [DEMEESTER] P.Demeester et al., "Resilience in Multilayer
2116             Networks," IEEE Communications Magazine, Vol. 37, No.
2117             8, pp. 70-76, August 1998.

2119 [GLI]       G.Li et al., "Efficient Distributed Path Selection for
2120             Shared Restoration Connections," IEEE Infocom 2002, New
2121             York City, June 2002.

2123 [IPO-IMP]   J.Strand and A.Chiu, "Impairments and Other Constraints
2124             on Optical Layer Routing," Work in Progress, draft-
2125             ietf-ipo-impairments-05.txt, May 2003.

2127 [KODIALAM1] M.Kodialam and T.V.Lakshman, "Restorable Dynamic
2128             Quality of Service Routing," IEEE Communications
2129             Magazine, pp. 72-81, June 2002.

2131 [KODIALAM2] M.Kodialam and T.V.Lakshman, "Dynamic Routing of
2132             Restorable Bandwidth-Guaranteed Tunnels using
2133             Aggregated Network Resource Usage Information," IEEE/
2134             ACM Transactions on Networking, pp. 399-410, June 2003.
2136 [MANCHESTER] J.Manchester, P.Bonenfant and C.Newton, "The Evolution
2137              of Transport Network Survivability," IEEE
2138              Communications Magazine, August 1999.

2140 [RFC3386]   W.Lai, D.McDysan, J.Boyle, et al., "Network Hierarchy
2141             and Multi-layer Survivability," RFC 3386, November 2002.

2143 [RFC3469]   V.Sharma and F.Hellstrand (Editors), "Framework for
2144             Multi-Protocol Label Switching (MPLS)-based Recovery,"
2145             RFC 3469, February 2003.

2147 [T1.105]    ANSI, "Synchronous Optical Network (SONET): Basic
2148             Description Including Multiplex Structure, Rates, and
2149             Formats," ANSI T1.105, January 2001.

2151 [WANG]      J.Wang, L.Sahasrabuddhe, and B.Mukherjee, "Path vs.
2152             Subpath vs. Link Restoration for Fault Management in
2153             IP-over-WDM Networks: Performance Comparisons Using
2154             GMPLS Control Signaling," IEEE Communications Magazine,
2155             pp. 80-87, November 2002.

2157 For information on the availability of the following documents,
2158 please see http://www.itu.int

2160 D.Papadimitriou et al. - Expires October 2005                    40

2162 [G.707]     ITU-T, "Network Node Interface for the Synchronous
2163             Digital Hierarchy (SDH)," Recommendation G.707, October
2164             2000.

2166 [G.709]     ITU-T, "Network Node Interface for the Optical
2167             Transport Network (OTN)," Recommendation G.709,
2168             February 2001 (and Amendment no.1, October 2001).

2170 [G.783]     ITU-T, "Characteristics of Synchronous Digital
2171             Hierarchy (SDH) Equipment Functional Blocks,"
2172             Recommendation G.783, October 2000.

2174 [G.806]     ITU-T, "Characteristics of Transport Equipment -
2175             Description Methodology and Generic Functionality,"
2176             Recommendation G.806, October 2000.

2178 [G.808.1]   ITU-T, "Generic Protection Switching - Linear Trail and
2179             Subnetwork Protection," Recommendation G.808.1,
2180             December 2003.

2182 [G.841]     ITU-T, "Types and Characteristics of SDH Network
2183             Protection Architectures," Recommendation G.841,
2184             October 1998.

2186 [G.842]     ITU-T, "Interworking of SDH Network Protection
2187             Architectures," Recommendation G.842, October 1998.

2189 14. Editors' Addresses

2191 Eric Mannie
2192 EMail: eric_mannie@hotmail.com

2194 Dimitri Papadimitriou
2195 Alcatel
2196 Francis Wellesplein, 1
2197 B-2018 Antwerpen, Belgium
2198 Phone: +32 3 240-8491
2199 EMail: dimitri.papadimitriou@alcatel.be

2201 D.Papadimitriou et al. - Expires October 2005                    41

2202 Intellectual Property Statement

2204 The IETF takes no position regarding the validity or scope of any
2205 Intellectual Property Rights or other rights that might be claimed
2206 to pertain to the implementation or use of the technology described
2207 in this document or the extent to which any license under such
2208 rights might or might not be available; nor does it represent that
2209 it has made any independent effort to identify any such rights.
2210 Information on the procedures with respect to rights in RFC
2211 documents can be found in BCP 78 and BCP 79.

2213 Copies of IPR disclosures made to the IETF Secretariat and any
2214 assurances of licenses to be made available, or the result of an
2215 attempt made to obtain a general license or permission for the use
2216 of such proprietary rights by implementers or users of this
2217 specification can be obtained from the IETF on-line IPR repository
2218 at http://www.ietf.org/ipr.
2220 The IETF invites any interested party to bring to its attention any
2221 copyrights, patents or patent applications, or other proprietary
2222 rights that may cover technology that may be required to implement
2223 this standard. Please address the information to the IETF at
2224 ietf-ipr@ietf.org.

2226 Disclaimer of Validity

2228 This document and the information contained herein are provided on
2229 an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
2230 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE
2231 INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR
2232 IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2233 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2234 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2236 Copyright Statement

2238 Copyright (C) The Internet Society (2005). This document is subject
2239 to the rights, licenses and restrictions contained in BCP 78, and
2240 except as set forth therein, the authors retain all their rights.

2242 Acknowledgment

2244 Funding for the RFC Editor function is currently provided by the
2245 Internet Society.

2247 D.Papadimitriou et al. - Expires October 2005                    42