idnits 2.17.1 draft-ietf-ccamp-alarm-module-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 399 has weird spacing: '...perator str...' == Line 404 has weird spacing: '...w state ope...' == Line 540 has weird spacing: '...alifier ala...' == Line 589 has weird spacing: '...alifier lea...' == Line 598 has weird spacing: '...everity sev...' == (2 more instances...) -- The document date (December 14, 2017) is 2324 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-06) exists of draft-ietf-netmod-yang-tree-diagrams-02 Summary: 1 error (**), 0 flaws (~~), 8 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group S. 
Vallin 3 Internet-Draft Stefan Vallin AB 4 Intended status: Standards Track M. Bjorklund 5 Expires: June 17, 2018 Cisco 6 December 14, 2017 8 YANG Alarm Module 9 draft-ietf-ccamp-alarm-module-00 11 Abstract 13 This document defines a YANG module for alarm management. It 14 includes functions for alarm list management, alarm shelving and 15 notifications to inform management systems. There are also RPCs to 16 manage the operator state of an alarm and administrative alarm 17 procedures. The module carefully maps to relevant alarm standards. 19 Status of This Memo 21 This Internet-Draft is submitted in full conformance with the 22 provisions of BCP 78 and BCP 79. 24 Internet-Drafts are working documents of the Internet Engineering 25 Task Force (IETF). Note that other groups may also distribute 26 working documents as Internet-Drafts. The list of current Internet- 27 Drafts is at http://datatracker.ietf.org/drafts/current/. 29 Internet-Drafts are draft documents valid for a maximum of six months 30 and may be updated, replaced, or obsoleted by other documents at any 31 time. It is inappropriate to use Internet-Drafts as reference 32 material or to cite them other than as "work in progress." 34 This Internet-Draft will expire on June 17, 2018. 36 Copyright Notice 38 Copyright (c) 2017 IETF Trust and the persons identified as the 39 document authors. All rights reserved. 41 This document is subject to BCP 78 and the IETF Trust's Legal 42 Provisions Relating to IETF Documents 43 (http://trustee.ietf.org/license-info) in effect on the date of 44 publication of this document. Please review these documents 45 carefully, as they describe your rights and restrictions with respect 46 to this document. Code Components extracted from this document must 47 include Simplified BSD License text as described in Section 4.e of 48 the Trust Legal Provisions and are provided without warranty as 49 described in the Simplified BSD License. 51 Table of Contents 53 1. 
Requirements notation . . . . . . . . . . . . . . . . . . . . 3 54 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 55 2.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3 56 3. Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 4 57 4. Alarm Module Concepts . . . . . . . . . . . . . . . . . . . . 5 58 4.1. Alarm Definition . . . . . . . . . . . . . . . . . . . . 5 59 4.2. Alarm Type . . . . . . . . . . . . . . . . . . . . . . . 5 60 4.3. Identifying Resource . . . . . . . . . . . . . . . . . . 7 61 4.4. Identifying Alarm Instances . . . . . . . . . . . . . . . 7 62 4.5. Alarm Life-Cycle . . . . . . . . . . . . . . . . . . . . 8 63 4.5.1. Resource Alarm Life-Cycle . . . . . . . . . . . . . . 8 64 4.5.2. Operator Alarm Life-cycle . . . . . . . . . . . . . . 9 65 4.5.3. Administrative Alarm Life-Cycle . . . . . . . . . . . 9 66 4.6. Root Cause and Impacted Resources . . . . . . . . . . . . 10 67 4.7. Alarm Shelving . . . . . . . . . . . . . . . . . . . . . 10 68 5. Alarm Data Model . . . . . . . . . . . . . . . . . . . . . . 10 69 5.1. Alarm Control . . . . . . . . . . . . . . . . . . . . . . 11 70 5.1.1. Alarm Shelving . . . . . . . . . . . . . . . . . . . 11 71 5.2. Alarm Inventory . . . . . . . . . . . . . . . . . . . . . 12 72 5.3. Alarm Summary . . . . . . . . . . . . . . . . . . . . . . 13 73 5.4. The Alarm List . . . . . . . . . . . . . . . . . . . . . 13 74 5.5. The Shelved Alarms List . . . . . . . . . . . . . . . . . 15 75 5.6. RPCs and Actions . . . . . . . . . . . . . . . . . . . . 15 76 5.7. Notifications . . . . . . . . . . . . . . . . . . . . . . 15 77 6. Alarm YANG Module . . . . . . . . . . . . . . . . . . . . . . 15 78 7. X.733 Alarm Mapping Data Model . . . . . . . . . . . . . . . 40 79 8. X.733 Alarm Mapping YANG Module . . . . . . . . . . . . . . . 41 80 9. Security Considerations . . . . . . . . . . . . . . . . . . . 47 81 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 47 82 11. 
References . . . . . . . . . . . . . . . . . . . . . . . . . 47 83 11.1. Normative References . . . . . . . . . . . . . . . . . . 47 84 11.2. Informative References . . . . . . . . . . . . . . . . . 47 85 Appendix A. Vendor-specific Alarm-Types Example . . . . . . . . 48 86 Appendix B. Alarm Inventory Example . . . . . . . . . . . . . . 49 87 Appendix C. Alarm List Example . . . . . . . . . . . . . . . . . 50 88 Appendix D. Alarm Shelving Example . . . . . . . . . . . . . . . 51 89 Appendix E. X.733 Mapping Example . . . . . . . . . . . . . . . 52 90 Appendix F. Background and Usability Requirements . . . . . . . 52 91 F.1. Alarm Concepts . . . . . . . . . . . . . . . . . . . . . 53 92 F.1.1. Alarm type . . . . . . . . . . . . . . . . . . . . . 53 93 F.2. Usability Requirements . . . . . . . . . . . . . . . . . 54 94 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 57 96 1. Requirements notation 98 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 100 "OPTIONAL" in this document are to be interpreted as described in BCP 101 14 [RFC2119] [RFC8174] when, and only when, they appear in all 102 capitals, as shown here. 104 2. Introduction 106 This document defines a YANG [RFC7950] module for alarm management. 107 The purpose is to define a standardised alarm interface for network 108 devices that can be easily integrated into management applications. 109 The model is also applicable as a northbound alarm interface in the 110 management applications. 112 Alarm monitoring is a fundamental part of monitoring the network. 113 Raw alarms from devices do not always tell the status of the network 114 services or necessarily point to the root cause. However, being able 115 to feed alarms to the network management system in a standardised 116 format is a starting point for performing higher level network 117 assurance tasks. 
119 This document defines a standardised YANG module for alarm 120 management. The design of the module is based on experience from 121 using and implementing available alarm standards. 123 2.1. Terminology 125 The following terms are defined in [RFC7950]: 127 o action 129 o client 131 o data tree 133 o RPC 135 o server 137 The following terms are used within this document: 139 o Alarm (the general concept): An alarm signifies an undesirable 140 state in a resource that requires corrective action. 142 o Alarm Instance: The alarm state for a specific resource and alarm 143 type. For example (GigabitEthernet0/15, link-alarm). An entry in 144 the alarm list. 146 o Alarm Inventory: A list of all possible alarm types on a system. 148 o Alarm Shelving: Blocking alarms according to specific criteria. 150 o Alarm Type: An alarm type identifies a possible unique alarm state 151 for a resource. Alarm types are names to identify the state like 152 "link-alarm", "jitter-violation", "high-disk-utilization". 154 o Management System: The alarm management application that consumes 155 the alarms, i.e., acts as a client. 157 o Resource: A fine-grained identification of the alarming resource, 158 for example: an interface, a process. 160 o System: The system that implements this YANG alarm module, i.e., 161 acts as a server. This corresponds to a network device or a 162 management application that provides a north-bound alarm 163 interface. 165 Tree diagrams used in this document follow the notation defined in 166 [I-D.ietf-netmod-yang-tree-diagrams]. 168 3. Objectives 170 The objectives for the design of the Alarm Module are: 172 o Simple to use. If a system supports this module, it shall be 173 straight-forward to integrate this into a YANG based alarm 174 manager. 176 o View alarms as states on resources and not as discrete 177 notifications. 179 o Clear definition of "alarm" in order to exclude general events 180 that should not be forwarded as alarm notifications. 
182 o Clear and precise identification of alarm types and alarm 183 instances. 185 o A management system should be able to pull all available alarm 186 types from a system, i.e., read the alarm inventory from a system. 187 This makes it possible to prepare alarm operators with 188 corresponding alarm instructions. 190 o Address alarm usability requirements. While the IETF has not really 191 addressed alarm management, telecom standards have addressed it 192 purely from a protocol perspective. The process industry has 193 published several relevant standards addressing requirements for a 194 useful alarm interface: [EEMUA], [ISA182]. This alarm module 195 defines usability requirements as well as a YANG data model. 197 o Mapping to X.733, which is a requirement for many alarm systems. 198 Still, keep some of the X.733 concepts out of the core model in 199 order to make the model small and easy to understand. 201 4. Alarm Module Concepts 203 This section defines the fundamental concepts behind the data model. 204 This section is rooted in the works of Vallin et al. [ALARMSEM]. 206 4.1. Alarm Definition 208 An alarm signifies an undesirable state in a resource that requires 209 corrective action. 211 See Appendix F for more motivation and consequences around this 212 definition. 214 4.2. Alarm Type 216 This document defines an alarm type with an alarm type id and an 217 alarm type qualifier. 219 The alarm type id is modeled as a YANG identity. With YANG 220 identities, new alarm types can be defined in a distributed fashion. 221 YANG identities are hierarchical, which means that a hierarchy of 222 alarm types can be defined. 224 Standards and vendors should define their own alarm type identities 225 based on this definition. 227 The use of YANG identities means that all possible alarms are 228 identified at design time.
This explicit declaration of alarm types 229 makes it easier to allow for alarm qualification reviews and 230 preparation of alarm actions and documentation. 232 There are occasions where the alarm types are not known at design 233 time. One example is a system with digital inputs that allows users to 234 connect detectors (e.g., a smoke detector) to the inputs. In this 235 case, a configuration action determines, for example, that certain connectors 236 are fire alarms. The drawback of this is that there is a 237 big risk that alarm operators will receive alarm types as a surprise: 238 they do not know how to resolve the problem, since a defined alarm 239 procedure does not necessarily exist. 241 In order to allow for dynamic addition of alarm types the alarm 242 module also allows for further qualification of the identity based 243 alarm type using a string. 245 A vendor or standard can then define their own alarm-type hierarchy. 246 The example below shows a hierarchy based on X.733 event types: 248 import ietf-alarms { 249 prefix al; 250 } 251 identity vendor-alarms { 252 base al:alarm-identity; 253 } 254 identity communications-alarm { 255 base vendor-alarms; 256 } 257 identity link-alarm { 258 base communications-alarm; 259 } 261 Alarm types can be abstract. An abstract alarm type is used as a 262 base for defining hierarchical alarm types. Concrete alarm types are 263 used for alarm states and appear in the alarm inventory. There are 264 two kinds of concrete alarm types: 266 1. The last subordinate identity in the "alarm-type-id" hierarchy is 267 concrete, for example: "alarm-identity.environmental- 268 alarm.smoke". In this example "alarm-identity" and 269 "environmental-alarm" are abstract YANG identities, whereas 270 "smoke" is a concrete YANG identity. 272 2.
The YANG identity hierarchy is abstract and the concrete alarm 273 type is defined by the dynamic alarm qualifier string, for 274 example: "alarm-identity.environmental-alarm.external-detector" 275 with alarm-type-qualifier "smoke". 277 For example: 279 // Alternative 1: concrete alarm type identity 280 import ietf-alarms { 281 prefix al; 282 } 283 identity environmental-alarm { 284 base al:alarm-identity; 285 description "Abstract alarm type"; 286 } 287 identity smoke { 288 base environmental-alarm; 289 description "Concrete alarm type"; 290 } 292 // Alternative 2: concrete alarm type qualifier 293 import ietf-alarms { 294 prefix al; 295 } 296 identity environmental-alarm { 297 base al:alarm-identity; 298 description "Abstract alarm type"; 299 } 300 identity external-detector { 301 base environmental-alarm; 302 description 303 "Abstract alarm type; a run-time configuration 304 procedure sets the type of alarm detected. This will 305 be reported in the alarm-type-qualifier."; 306 } 308 4.3. Identifying Resource 310 It is of vital importance to be able to refer to the alarming 311 resource. This reference must be as fine-grained as possible. If 312 the alarming resource exists in the data tree then an instance- 313 identifier MUST be used with the full path to the object. 315 This module also allows for alternate naming of the alarming resource 316 if it is not available in the data tree. 318 4.4. Identifying Alarm Instances 320 A primary goal of this alarm module is to remove any ambiguity in how 321 alarm notifications are mapped to an update of an alarm instance. 322 X.733 and especially 3GPP were not really clear on this point. This 323 YANG alarm module states that the tuple (resource, alarm type 324 identifier, alarm type qualifier) corresponds to a single alarm 325 instance. This means that alarm notifications for the same resource 326 and same alarm type are matched to update the same alarm instance.
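As an illustration of this matching rule, the following minimal Python sketch keys alarm state by the (resource, alarm type identifier, alarm type qualifier) tuple, so that repeated notifications update one instance instead of creating new ones. It is a sketch under stated assumptions, not an implementation of the module: the names AlarmList, Alarm and report are hypothetical, the "indeterminate" default is assumed, and the fields only mirror leafs of the same names in the YANG module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical names throughout; this is not part of the ietf-alarms module.
AlarmKey = Tuple[str, str, str]  # (resource, alarm-type-id, alarm-type-qualifier)


@dataclass
class Alarm:
    is_cleared: bool = False
    perceived_severity: str = "indeterminate"  # assumed default before any report
    alarm_text: str = ""
    last_changed: str = ""
    status_change: List[Tuple[str, str, str]] = field(default_factory=list)


class AlarmList:
    """Maps each (resource, type id, qualifier) tuple to exactly one alarm."""

    def __init__(self) -> None:
        self.alarms: Dict[AlarmKey, Alarm] = {}

    def report(self, resource: str, type_id: str, qualifier: str,
               time: str, severity: str, text: str) -> Alarm:
        """Apply one notification from the instrumentation."""
        key = (resource, type_id, qualifier)
        alarm = self.alarms.setdefault(key, Alarm())
        if severity == "cleared":
            alarm.is_cleared = True  # clearance is a flag, not a severity
        else:
            alarm.is_cleared = False
            alarm.perceived_severity = severity
        alarm.alarm_text = text
        alarm.last_changed = time
        alarm.status_change.append((time, severity, text))
        return alarm


alarm_list = AlarmList()
alarm_list.report("GigabitEthernet0/25", "link-alarm", "", "T1", "major", "link down")
alarm_list.report("GigabitEthernet0/25", "link-alarm", "", "T2", "cleared", "link up")
alarm_list.report("GigabitEthernet0/15", "link-alarm", "", "T3", "minor", "link flapping")
```

In this sketch the two reports for GigabitEthernet0/25 update the same entry, while the report for GigabitEthernet0/15 creates a second one. The dictionary key plays the role of the three leafs resource, alarm-type-id and alarm-type-qualifier.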
327 These three leafs are therefore used as the key in the alarm list: 329 list alarm { 330 key "resource alarm-type-id alarm-type-qualifier"; 331 ... 332 } 334 4.5. Alarm Life-Cycle 336 The alarm model clearly separates the resource alarm life-cycle from 337 the operator and administrative life-cycles of an alarm. 339 o resource alarm life-cycle: the alarm instrumentation that controls 340 alarm raise, clearance, and severity changes. 342 o operator alarm life-cycle: operators acting upon alarms with 343 actions like acknowledgment and closing. Closing an alarm implies 344 that the operator considers the corrective action performed. 345 Operators can also shelve alarms in order to avoid nuisance alarms. 347 o administrative alarm life-cycle: deleting (purging) alarms and 348 compressing the alarm status change list. This module exposes 349 operations to manage the administrative life-cycle. The server 350 may also perform these operations based on other policies, but how 351 that is done is out of scope for this document. 353 4.5.1. Resource Alarm Life-Cycle 355 From a resource perspective, an alarm can have the following life- 356 cycle: raise, change severity, change severity, clear, raise 357 again, etc. All of these status changes can have different alarm 358 texts generated by the instrumentation. Two important things to 359 note: 361 1. Alarms are not deleted when they are cleared. Deleting alarms is 362 an administrative process. The alarm module defines an rpc 363 "purge-alarms" that deletes alarms. 365 2. Alarms are not cleared by operators; only the underlying 366 instrumentation can clear an alarm. Operators can close alarms. 368 The YANG tree representation below illustrates the resource oriented 369 life-cycle: 371 +--ro alarm* [resource alarm-type-id alarm-type-qualifier] 372 ...
373 +--ro is-cleared boolean 374 +--ro last-changed yang:date-and-time 375 +--ro perceived-severity severity 376 +--ro alarm-text alarm-text 377 +--ro status-change* [time] 378 +--ro time yang:date-and-time 379 +--ro perceived-severity severity 380 +--ro alarm-text alarm-text 382 For every status change from the resource perspective a row is added 383 to the "status-change" list. The last status values are also 384 represented as leafs for the alarm. Note well that the alarm 385 severity does not include "cleared"; alarm clearance is a separate flag. 387 An alarm can therefore look like this: ((GigabitEthernet0/25, link- 388 alarm,""), false, T, major, "Interface GigabitEthernet0/25 down") 390 4.5.2. Operator Alarm Life-cycle 392 Operators can also act upon alarms using the set-operator-state 393 action: 395 +--ro alarm* [resource alarm-type-id alarm-type-qualifier] 396 ... 397 +--ro operator-state-change* [time] {operator-actions}? 398 | +--ro time yang:date-and-time 399 | +--ro operator string 400 | +--ro state operator-state 401 | +--ro text? string 402 +---x set-operator-state {operator-actions}? 403 +---w input 404 +---w state operator-state 405 +---w text? string 407 The operator state for an alarm can be: "none", "ack", "shelved", and 408 "closed". Alarm deletion (using the rpc "purge-alarms") can use 409 this state as a criterion. A closed alarm is an alarm where the 410 operator has performed any required corrective actions. Closed 411 alarms are good candidates for being deleted. 413 4.5.3. Administrative Alarm Life-Cycle 415 Deleting alarms from the alarm list is considered an administrative 416 action. This is supported by the "purge-alarms" rpc. The "purge- 417 alarms" rpc takes a filter as input. The filter selects alarms based 418 on the operator and resource life-cycle such as "all closed cleared 419 alarms older than a time specification".
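A filter of this kind could be evaluated roughly as in the following Python sketch. It is an illustration under stated assumptions, not the behaviour of any implementation: the function name purge_alarms, the dictionary layout, and the parameters older_than, cleared and operator_state are hypothetical and do not reproduce the actual input of the "purge-alarms" rpc.

```python
from datetime import datetime, timedelta, timezone


def purge_alarms(alarms, now, older_than=None, cleared=None, operator_state=None):
    """Delete matching alarms in place and return how many were removed.

    A criterion left as None is not applied; all given criteria must match.
    The parameter names are hypothetical, chosen only for this sketch.
    """
    def matches(alarm):
        if older_than is not None and now - alarm["last_changed"] < older_than:
            return False
        if cleared is not None and alarm["is_cleared"] != cleared:
            return False
        if operator_state is not None and alarm["operator_state"] != operator_state:
            return False
        return True

    # Collect keys first so the dictionary is not modified while iterating.
    doomed = [key for key, alarm in alarms.items() if matches(alarm)]
    for key in doomed:
        del alarms[key]
    return len(doomed)


# Example data: keys are (resource, alarm-type-id, alarm-type-qualifier).
now = datetime(2017, 12, 14, tzinfo=timezone.utc)
alarms = {
    ("GigabitEthernet0/25", "link-alarm", ""): {
        "last_changed": now - timedelta(days=30),
        "is_cleared": True, "operator_state": "closed"},
    ("GigabitEthernet0/15", "link-alarm", ""): {
        "last_changed": now - timedelta(days=30),
        "is_cleared": False, "operator_state": "ack"},
    ("GigabitEthernet0/1", "link-alarm", ""): {
        "last_changed": now - timedelta(hours=1),
        "is_cleared": True, "operator_state": "closed"},
}

# "All closed cleared alarms older than seven days".
removed = purge_alarms(alarms, now, older_than=timedelta(days=7),
                       cleared=True, operator_state="closed")
```

Only the first entry satisfies all three criteria here; the second is neither cleared nor closed, and the third is too recent.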
The server may also perform 420 these operations based on other policies, but how that is done is out 421 of scope for this document. 423 Alarms can be compressed. Compressing an alarm deletes all entries 424 in the alarm's "status-change" list except for the last status 425 change. A client can perform this using the "compress-alarms" rpc. 426 The server may also perform these operations based on other policies, 427 but how that is done is out of scope for this document. 429 4.6. Root Cause and Impacted Resources 431 The general principle of this alarm module is to limit the number of 432 alarms. The alarm has two leaf-lists to identify possible impacted 433 resources and possible root-cause resources. The system should not 434 send individual alarms for the possible root-cause resources and 435 impacted resources. These serve as hints only. It is up to the 436 client application to use this information to present the overall 437 status. 439 4.7. Alarm Shelving 441 Alarm shelving is an important function that allows alarm management 442 applications and operators to stop superfluous alarms. Shelving 443 implies that any alarms matching the shelving criteria are ignored. 444 Shelved alarms appear in a dedicated shelved alarm list in order not 445 to disturb the relevant alarms. Shelved alarms do not generate 446 notifications. 448 5. Alarm Data Model 450 Alarm shelving and operator actions are YANG features so that a 451 server can choose not to support these. 453 The data model has the following overall structure: 455 +--rw alarms 456 +--rw control 457 | +--rw max-alarm-status-changes? union 458 | +--rw notify-status-changes? boolean 459 | +--rw alarm-shelving {alarm-shelving}? 460 | ... 461 +--ro alarm-inventory 462 | +--ro alarm-type* [alarm-type-id alarm-type-qualifier] 463 | ... 464 +--ro summary 465 | +--ro alarm-summary* [severity] 466 | | ... 467 | +--ro shelves-active? empty {alarm-shelving}? 468 +--ro alarm-list 469 | +--ro number-of-alarms?
yang:gauge32 470 | +--ro last-changed? yang:date-and-time 471 | +--ro alarm* [resource alarm-type-id alarm-type-qualifier] 472 | ... 473 +--ro shelved-alarms {alarm-shelving}? 474 +--ro number-of-shelved-alarms? yang:gauge32 475 +--ro alarm-shelf-last-changed? yang:date-and-time 476 +--ro shelved-alarm* 477 [resource alarm-type-id alarm-type-qualifier] 478 ... 480 5.1. Alarm Control 482 The "/alarms/control/notify-status-changes" leaf controls whether 483 notifications are sent for all state changes, including severity and 484 alarm text changes, or just for new and cleared alarms. 486 Every alarm has a list of status changes; this is a circular list. 487 The length of this list is controlled by "/alarms/control/max-alarm- 488 status-changes". 490 5.1.1. Alarm Shelving 492 The shelving control tree is shown below: 494 +--rw alarms 495 +--rw control 496 +--rw alarm-shelving {alarm-shelving}? 497 +--rw shelf* [shelf-name] 498 +--rw shelf-name string 499 +--rw resource? resource 500 +--rw alarm-type-id? alarm-type-id 501 +--rw alarm-type-qualifier? alarm-type-qualifier 502 +--rw description? string 504 Shelved alarms are shown in a dedicated shelved alarm list. The 505 instrumentation MUST move shelved alarms from the alarm list 506 (/alarms/alarm-list) to the shelved alarm list (/alarms/shelved- 507 alarms/). Shelved alarms do not generate any notifications. When 508 the shelving criteria are removed or changed, the alarm list MUST be 509 updated to the correct actual state of the alarms. 511 A leaf (/alarms/summary/shelves-active) in the alarm summary indicates 512 whether there are shelved alarms. 514 A system can choose not to support the shelving feature. 516 5.2. Alarm Inventory 518 The alarm inventory represents all possible alarm types that may 519 occur in the system. A management system may use this to build alarm 520 procedures. The alarm inventory is relevant for several reasons: 522 The system might not instrument all alarm type identities.
524 The system has configured dynamic alarm types using the alarm 525 qualifier. The inventory makes it possible for the management 526 system to discover these. 528 Note that the mechanism whereby dynamic alarm types are added using 529 the alarm type qualifier MUST populate this list. 531 The optional leaf-list "resource" in the alarm inventory enables the 532 system to publish for which resources a given alarm type may appear. 534 The alarm inventory tree is shown below: 536 +--rw alarms 537 +--ro alarm-inventory 538 +--ro alarm-type* [alarm-type-id alarm-type-qualifier] 539 +--ro alarm-type-id alarm-type-id 540 +--ro alarm-type-qualifier alarm-type-qualifier 541 +--ro resource* string 542 +--ro has-clear boolean 543 +--ro severity-levels* severity 544 +--ro description string 546 5.3. Alarm Summary 548 The alarm summary list summarises alarms per severity: how many 549 are cleared, cleared and closed, and closed. It also indicates 550 whether there are shelved alarms. 552 The alarm summary tree is shown below: 554 +--rw alarms 555 +--ro summary 556 +--ro alarm-summary* [severity] 557 | +--ro severity severity 558 | +--ro total? yang:gauge32 559 | +--ro cleared? yang:gauge32 560 | +--ro cleared-not-closed? yang:gauge32 561 | | {operator-actions}? 562 | +--ro cleared-closed? yang:gauge32 563 | | {operator-actions}? 564 | +--ro not-cleared-closed? yang:gauge32 565 | | {operator-actions}? 566 | +--ro not-cleared-not-closed? yang:gauge32 567 | {operator-actions}? 568 +--ro shelves-active? empty {alarm-shelving}? 570 5.4. The Alarm List 572 The alarm list (/alarms/alarm-list) is a function from (resource, 573 alarm type, alarm type qualifier) to the current alarm state. 575 +--ro alarm-list 576 +--ro number-of-alarms? yang:gauge32 577 +--ro last-changed?
yang:date-and-time 578 +--ro alarm* [resource alarm-type-id alarm-type-qualifier] 579 +--ro time-created yang:date-and-time 580 +--ro resource resource 581 +--ro alarm-type-id alarm-type-id 582 +--ro alarm-type-qualifier alarm-type-qualifier 583 +--ro alt-resource* resource 584 +--ro related-alarm* 585 | [resource alarm-type-id alarm-type-qualifier] 586 | +--ro resource 587 | | -> /alarms/alarm-list/alarm/resource 588 | +--ro alarm-type-id leafref 589 | +--ro alarm-type-qualifier leafref 590 +--ro impacted-resource* resource 591 +--ro root-cause-resource* resource 592 +--ro is-cleared boolean 593 +--ro last-changed yang:date-and-time 594 +--ro perceived-severity severity 595 +--ro alarm-text alarm-text 596 +--ro status-change* [time] {alarm-history}? 597 | +--ro time yang:date-and-time 598 | +--ro perceived-severity severity-with-clear 599 | +--ro alarm-text alarm-text 600 +--ro operator-state-change* [time] {operator-actions}? 601 | +--ro time yang:date-and-time 602 | +--ro operator string 603 | +--ro state operator-state 604 | +--ro text? string 605 +---x set-operator-state {operator-actions}? 606 +---w input 607 +---w state operator-state 608 +---w text? string 610 Every alarm has three important states: the resource clearance state 611 "is-cleared", the severity "perceived-severity", and the operator 612 state available in the operator state change list. 614 To expose the alarm history, the resource state changes are 615 available in the "status-change" list and the operator history is 616 available in the "operator-state-change" list. 618 5.5. The Shelved Alarms List 620 The shelved alarm list has the same structure as the alarm list 621 above. It shows all the alarms that match the shelving criteria 622 (/alarms/control/alarm-shelving). 624 5.6.
RPCs and Actions 626 The alarm module supports rpcs and actions to manage the alarms: 628 "purge-alarms" (rpc): delete alarms according to specific 629 criteria, for example all cleared alarms older than a specific 630 date. 632 "compress-alarms" (rpc): compress the status-change list for the 633 alarms. 635 "set-operator-state" (action): change the operator state for an 636 alarm, for example to acknowledge it. 638 5.7. Notifications 640 The alarm module supports a general notification to report alarm 641 state changes. It carries all relevant parameters for the alarm 642 management application. 644 There is also a notification to report that an operator changed the 645 operator state on an alarm, for example by acknowledging it. 647 If the alarm inventory is changed, for example because a new card type is 648 inserted, a notification will tell the management application that 649 new alarm types are available. 651 6. Alarm YANG Module 653 file "ietf-alarms@2017-10-30.yang" 654 module ietf-alarms { 655 yang-version 1.1; 656 namespace "urn:ietf:params:xml:ns:yang:ietf-alarms"; 657 prefix al; 659 import ietf-yang-types { 660 prefix yang; 661 } 663 organization 664 "IETF CCAMP Working Group"; 666 contact 667 "WG Web: 668 WG List: 670 Editor: Stefan Vallin 671 673 Editor: Martin Bjorklund 674 "; 676 description 677 "This module defines an interface for managing alarms. Main 678 inputs to the module design are the 3GPP Alarm IRP, ITU-T X.733 679 and ANSI/ISA-18.2 alarm standards. 681 Main features of this module include: 683 * Alarm list: 684 A list of all alarms. Cleared alarms stay in 685 the list until explicitly removed. 687 * Operator actions on alarms: 688 Acknowledging and closing alarms. 690 * Administrative actions on alarms: 691 Purging alarms from the list according to specific 692 criteria. 694 * Alarm inventory: 695 A management application can read all 696 alarm types implemented by the system. 698 * Alarm shelving: 699 Shelving (blocking) alarms according 700 to specific criteria.
702 This module uses a stateful view on alarms. An alarm is a state 703 for a specific resource (note that an alarm is not a 704 notification). An alarm type is a possible alarm state for a 705 resource. For example, the tuple: 707 ('link-alarm', 'GigabitEthernet0/25') 709 is an alarm of type 'link-alarm' on the resource 710 'GigabitEthernet0/25'. 712 Alarm types are identified using YANG identities and an optional 713 string-based qualifier. The string-based qualifier allows for 714 dynamic extension of the statically defined alarm types. Alarm 715 types identify a possible alarm state and not the individual 716 notifications. For example, the traditional 'link-down' and 717 'link-up' notifications are two notifications referring to the 718 same alarm type 'link-alarm'. 720 With this design there is no ambiguity about how alarm and alarm 721 clear correlation should be performed: notifications that report 722 the same resource and alarm type are considered updates of the 723 same alarm, such as clearing an active alarm or changing the 724 severity of an alarm. 726 The instrumentation can update 'severity' and 'alarm-text' on an 727 existing alarm. 
The above alarm example can therefore look 728 like: 730 (('link-alarm', 'GigabitEthernet0/25'), 731 warning, 732 'interface down while interface admin state is up') 734 There is a clear separation between updates on the alarm from 735 the underlying resource, like clear, and updates from an 736 operator, like acknowledging or closing an alarm: 738 (('link-alarm', 'GigabitEthernet0/25'), 739 warning, 740 'interface down while interface admin state is up', 741 cleared, 742 closed) 744 Administrative actions like removing closed alarms older than a 745 given time are supported."; 747 revision 2017-10-30 { 748 description 749 "Initial revision."; 750 reference 751 "RFC XXXX: YANG Alarm Module"; 752 } 754 /* 755 * Features 756 */ 758 feature operator-actions { 759 description 760 "This feature means that the system supports operator states 761 on alarms."; 762 } 764 feature alarm-shelving { 765 description 766 "This feature means that the system supports shelving 767 (blocking) alarms."; 768 } 770 feature alarm-history { 771 description 772 "This feature means that the alarm list also maintains a 773 history of state changes for each alarm. For example, if an 774 alarm toggles between cleared and active 10 times, a list for 775 that alarm will show those state changes with time-stamps."; 776 } 777 /* 778 * Identities 779 */ 781 identity alarm-identity { 782 description 783 "Base identity for alarm types. A unique identification of the 784 alarm, not including the resource. Different resources can 785 share alarm types. If the resource reports the same alarm 786 type, it is to be considered to be the same alarm. The alarm 787 type is a simplification of the different X.733 and 3GPP alarm 788 IRP alarm correlation mechanisms and it allows for 789 hierarchical extensions.
791 A string-based qualifier can be used in addition to the 792 identity in order to have different alarm types based on 793 information not known at design-time, such as values in 794 textual SNMP Notification var-binds. 796 Standards and vendors can define sub-identities to clearly 797 identify specific alarm types. 799 This identity is abstract and shall not be used for alarms."; 800 } 802 /* 803 * Common types 804 */ 806 typedef resource { 807 type union { 808 type instance-identifier { 809 require-instance false; 810 } 811 type yang:object-identifier; 812 type string; 813 } 814 description 815 "This is an identification of the alarming resource, such as an 816 interface. It should be as fine-grained as possible both to 817 guide the operator and to guarantee uniqueness of the 818 alarms. If a resource has both a config and a state tree 819 normally this should identify the state tree, 820 (e.g., /interfaces-state/interface/name). 821 But if the instrumentation can detect a broken config, this 822 should be identified as the resource. 823 If the alarming resource is modelled in YANG, this 824 type will be an instance-identifier. If the resource is an 825 SNMP object, the type will be an object-identifier. If the 826 resource is anything else, for example a distinguished name or 827 a CIM path, this type will be a string."; 828 } 830 typedef alarm-text { 831 type string; 832 description 833 "The string used to inform operators about the alarm. This 834 MUST contain enough information for an operator to be able 835 to understand the problem and how to resolve it. If this 836 string contains structure, this format should be clearly 837 documented for programs to be able to parse that 838 information."; 839 } 841 typedef severity { 842 type enumeration { 843 enum indeterminate { 844 value 2; 845 description 846 "Indicates that the severity level could not be 847 determined. 
This level SHOULD be avoided."; 848 } 849 enum minor { 850 value 3; 851 description 852 "The 'minor' severity level indicates the existence of a 853 non-service affecting fault condition and that corrective 854 action should be taken in order to prevent a more serious 855 (for example, service affecting) fault. Such a severity 856 can be reported, for example, when the detected alarm 857 condition is not currently degrading the capacity of the 858 resource."; 859 } 860 enum warning { 861 value 4; 862 description 863 "The 'warning' severity level indicates the detection of 864 a potential or impending service affecting fault, before 865 any significant effects have been felt. Action should be 866 taken to further diagnose (if necessary) and correct the 867 problem in order to prevent it from becoming a more 868 serious service affecting fault."; 869 } 870 enum major { 871 value 5; 872 description 873 "The 'major' severity level indicates that a service 874 affecting condition has developed and an urgent 875 corrective action is required. Such a severity can be 876 reported, for example, when there is a severe 877 degradation in the capability of the resource 878 and its full capability must be restored."; 879 } 880 enum critical { 881 value 6; 882 description 883 "The 'critical' severity level indicates that a service 884 affecting condition has occurred and an immediate 885 corrective action is required. Such a severity can be 886 reported, for example, when a resource becomes totally 887 out of service and its capability must be restored."; 888 } 889 } 890 description 891 "The severity level of the alarm. Note well that value 'clear' 892 is not included. 
If an alarm is cleared or not is a separate 893 boolean flag."; 894 reference 895 "ITU Recommendation X.733: Information Technology 896 - Open Systems Interconnection 897 - System Management: Alarm Reporting Function"; 898 } 900 typedef severity-with-clear { 901 type union { 902 type enumeration { 903 enum cleared { 904 value 1; 905 description 906 "The alarm is cleared by the instrumentation."; 907 } 908 } 909 type severity; 910 } 911 description 912 "The severity level of the alarm including clear. 913 This is used *only* in notifications reporting state changes 914 for an alarm."; 915 } 917 typedef operator-state { 918 type enumeration { 919 enum none { 920 value 1; 921 description 922 "The alarm is not being taken care of."; 923 } 924 enum ack { 925 value 2; 926 description 927 "The alarm is being taken care of. Corrective action not 928 taken yet, or failed"; 929 } 930 enum closed { 931 value 3; 932 description 933 "Corrective action taken successfully."; 934 } 935 enum shelved { 936 value 4; 937 description 938 "Alarm shelved. Alarms in alarms/shelved-alarms/ 939 MUST be assigned this operator state by the server as 940 the last entry in the operator-state-change list."; 941 } 942 enum un-shelved { 943 value 5; 944 description 945 "Alarm moved back to alarm-list from shelf. 946 Alarms 'moved' from /alarms/shelved-alarms/ 947 to /alarms/alarm-list MUST be assigned this 948 state by the server as the last entry in the 949 operator-state-change list."; 950 } 952 } 953 description 954 "Operator states on an alarm. The 'closed' state indicates 955 that an operator considers the alarm being resolved. This 956 is separate from the resource alarm clear flag."; 957 } 959 /* Alarm type */ 961 typedef alarm-type-id { 962 type identityref { 963 base alarm-identity; 964 } 965 description 966 "Identifies an alarm type. The description of the alarm type 967 id MUST indicate if the alarm type is abstract or not. 
An 968 abstract alarm type is used as a base for other alarm type ids 969 and will not be used as a value for an alarm or be present in 970 the alarm inventory."; 971 } 973 typedef alarm-type-qualifier { 974 type string; 975 description 976 "If an alarm type cannot be fully specified at design time by 977 alarm-type-id, this string qualifier is used in addition to 978 fully define a unique alarm type. 980 The definition of alarm qualifiers is considered to be part 981 of the instrumentation and out of scope for this module. 982 An empty string is used when this is part of a key."; 983 } 985 /* 986 * Groupings 987 */ 989 grouping common-alarm-parameters { 990 description 991 "Common parameters for an alarm. 993 This grouping is used both in the alarm list and in the 994 notification representing an alarm state change."; 996 leaf resource { 997 type resource; 998 mandatory true; 999 description 1000 "The alarming resource. See also 'alt-resource'. 1001 This could for example be a reference to the alarming 1002 interface."; 1003 } 1005 leaf alarm-type-id { 1006 type alarm-type-id; 1007 mandatory true; 1008 description 1009 "This leaf and the leaf 'alarm-type-qualifier' together 1010 provide a unique identification of the alarm type."; 1011 } 1013 leaf alarm-type-qualifier { 1014 type alarm-type-qualifier; 1015 description 1016 "This leaf is used when the 'alarm-type-id' leaf cannot 1017 uniquely identify the alarm type. Normally, this is not 1018 the case, and this leaf is the empty string."; 1019 } 1021 leaf-list alt-resource { 1022 type resource; 1023 description 1024 "Used if the alarming resource is available over other 1025 interfaces. This field can contain SNMP OIDs, CIM paths or 1026 3GPP Distinguished names, for example."; 1027 } 1029 list related-alarm { 1030 key "resource alarm-type-id alarm-type-qualifier"; 1032 description 1033 "References to related alarms.
Note that the related alarm 1034 might have been removed from the alarm list."; 1036 leaf resource { 1037 type leafref { 1038 path "/alarms/alarm-list/alarm/resource"; 1039 require-instance false; 1040 } 1041 description 1042 "The alarming resource for the related alarm."; 1043 } 1044 leaf alarm-type-id { 1045 type leafref { 1046 path "/alarms/alarm-list/alarm" 1047 + "[resource=current()/../resource]" 1048 + "/alarm-type-id"; 1049 require-instance false; 1050 } 1051 description 1052 "The alarm type identifier for the related alarm."; 1053 } 1054 leaf alarm-type-qualifier { 1055 type leafref { 1056 path "/alarms/alarm-list/alarm" 1057 + "[resource=current()/../resource]" 1058 + "[alarm-type-id=current()/../alarm-type-id]" 1059 + "/alarm-type-qualifier"; 1060 require-instance false; 1061 } 1062 description 1063 "The alarm qualifier for the related alarm."; 1064 } 1065 } 1066 leaf-list impacted-resource { 1067 type resource; 1068 description 1069 "Resources that might be affected by this alarm. If the 1070 system creates an alarm on a resource and also has a mapping 1071 to other resources that might be impacted, these resources 1072 can be listed in this leaf-list. In this way the system can 1073 create one alarm instead of several. For example, if an 1074 interface has an alarm, the 'impacted-resource' can 1075 reference the aggregated port channels."; 1076 } 1077 leaf-list root-cause-resource { 1078 type resource; 1079 description 1080 "Resources that are candidates for causing the alarm. If the 1081 system has a mechanism to understand the candidate root 1082 causes of an alarm, this leaf-list can be used to list the 1083 root cause candidate resources. In this way the system can 1084 create one alarm instead of several. An example might be a 1085 logging system (alarm resource) that fails, the alarm can 1086 reference the file-system in the 'root-cause-resource' 1087 leaf-list. 
Note that the intended use is not to also send an 1088 alarm with the root-cause-resource as alarming resource. 1089 The root-cause-resource leaf-list is a hint and should not 1090 also generate an alarm for the same problem."; 1091 } 1092 } 1094 grouping alarm-state-change-parameters { 1095 description 1096 "Parameters for an alarm state change. 1098 This grouping is used both in the alarm list's 1099 status-change list and in the notification representing an 1100 alarm state change."; 1102 leaf time { 1103 type yang:date-and-time; 1104 mandatory true; 1105 description 1106 "The time the status of the alarm changed. The value 1107 represents the time the real alarm state change appeared 1108 in the resource and not when it was added to the 1109 alarm list. The /alarm-list/alarm/last-changed MUST be 1110 set to the same value."; 1111 } 1112 leaf perceived-severity { 1113 type severity-with-clear; 1114 mandatory true; 1115 description 1116 "The severity of the alarm as defined by X.733.
Note 1117 that this may not be the original severity since the alarm 1118 may have changed severity."; 1119 reference 1120 "ITU Recommendation X.733: Information Technology 1121 - Open Systems Interconnection 1122 - System Management: Alarm Reporting Function"; 1123 } 1124 leaf alarm-text { 1125 type alarm-text; 1126 mandatory true; 1127 description 1128 "A user-friendly text describing the alarm state change."; 1129 reference 1130 "ITU Recommendation X.733: Information Technology 1131 - Open Systems Interconnection 1132 - System Management: Alarm Reporting Function"; 1133 } 1134 } 1136 grouping operator-parameters { 1137 description 1138 "This grouping defines parameters that can 1139 be changed by an operator."; 1140 leaf time { 1141 type yang:date-and-time; 1142 mandatory true; 1143 description 1144 "Timestamp for operator action on alarm."; 1145 } 1146 leaf operator { 1147 type string; 1148 mandatory true; 1149 description 1150 "The name of the operator that has acted on this 1151 alarm."; 1152 } 1153 leaf state { 1154 type operator-state; 1155 mandatory true; 1156 description 1157 "The operator's view of the alarm state."; 1158 } 1159 leaf text { 1160 type string; 1161 description 1162 "Additional optional textual information provided by 1163 the operator."; 1164 } 1165 } 1167 grouping resource-alarm-parameters { 1168 description 1169 "Alarm parameters that originate from the resource view."; 1170 leaf is-cleared { 1171 type boolean; 1172 mandatory true; 1173 description 1174 "Indicates the current clearance state of the alarm. An 1175 alarm might toggle from active alarm to cleared alarm and 1176 back to active again."; 1177 } 1179 leaf last-changed { 1180 type yang:date-and-time; 1181 mandatory true; 1182 description 1183 "A timestamp when the alarm status was last changed.
Status 1184 changes are changes to 'is-cleared', 'perceived-severity', 1185 and 'alarm-text'."; 1186 } 1188 leaf perceived-severity { 1189 type severity; 1190 mandatory true; 1191 description 1192 "The last severity of the alarm. 1194 If an alarm was raised with severity 'warning', but later 1195 changed to 'major', this leaf will show 'major'."; 1196 } 1198 leaf alarm-text { 1199 type alarm-text; 1200 mandatory true; 1201 description 1202 "The last reported alarm text. This text should contain 1203 information for an operator to be able to understand 1204 the problem and how to resolve it."; 1205 } 1207 list status-change { 1208 if-feature alarm-history; 1209 key time; 1210 min-elements 1; 1211 description 1212 "A list of status change events for this alarm. 1214 The entry with the latest time-stamp in this list MUST 1215 correspond to the leafs 'is-cleared', 'perceived-severity' 1216 and 'alarm-text' for the alarm. The time-stamp for that 1217 entry MUST be equal to the 'last-changed' leaf. 1219 This list is ordered according to the timestamps of 1220 alarm state changes. The last item corresponds to the 1221 latest state change. 1223 The following state changes create an entry in this 1224 list: 1225 - changed severity (warning, minor, major, critical) 1226 - clearance status, this also updates the 'is-cleared' 1227 leaf 1228 - alarm text update"; 1230 uses alarm-state-change-parameters; 1231 } 1232 } 1234 /* 1235 * The /alarms data tree 1236 */ 1238 container alarms { 1239 description 1240 "The top container for this module."; 1241 container control { 1242 description 1243 "Configuration to control the alarm behaviour."; 1244 leaf max-alarm-status-changes { 1245 type union { 1246 type uint16; 1247 type enumeration { 1248 enum infinite { 1249 description 1250 "The status change entries are accumulated 1251 infinitely."; 1252 } 1253 } 1254 } 1255 default 32; 1256 description 1257 "The status-change entries are kept in a circular list 1258 per alarm.
When this number is exceeded, the oldest 1259 status change entry is automatically removed. If the 1260 value is 'infinite', the status change entries are 1261 accumulated infinitely."; 1262 } 1264 leaf notify-status-changes { 1265 type boolean; 1266 default false; 1267 description 1268 "This leaf controls whether notifications are sent on all 1269 alarm status updates, e.g., updated perceived-severity or 1270 alarm-text. By default, notifications are only sent 1271 when a new alarm is raised, when it is re-raised after being cleared, 1272 and when it is cleared."; 1273 } 1274 container alarm-shelving { 1275 if-feature alarm-shelving; 1276 description 1277 "This list is used to shelve alarms. The server will move 1278 any alarms corresponding to the shelving criteria from the 1279 alarms/alarm-list/alarm list to the 1280 alarms/shelved-alarms/shelved-alarm list. It will also 1281 stop sending notifications for the shelved alarms. The 1282 conditions in the shelf criteria are logically ANDed. 1283 When the shelving criteria are deleted or changed, the 1284 non-matching alarms MUST appear in the 1285 alarms/alarm-list/alarm list according to the real state. 1286 This means that the instrumentation MUST maintain states 1287 for the shelved alarms. Alarms that match the criteria 1288 shall have an operator-state 'shelved'."; 1289 list shelf { 1290 key shelf-name; 1291 leaf shelf-name { 1292 type string; 1293 description 1294 "An arbitrary name for the alarm shelf."; 1295 } 1296 description 1297 "Each entry defines the criteria for shelving alarms.
1298 Criteria are ANDed."; 1300 leaf resource { 1301 type resource; 1302 description 1303 "Shelve alarms for this resource."; 1304 } 1305 leaf alarm-type-id { 1306 type alarm-type-id; 1307 description 1308 "Shelve alarms for this alarm type identifier."; 1309 } 1310 leaf alarm-type-qualifier { 1311 type alarm-type-qualifier; 1312 description 1313 "Shelve alarms for this alarm type qualifier."; 1314 } 1315 leaf description { 1316 type string; 1317 description 1318 "An optional textual description of the shelf. This 1319 description should include the reason for shelving 1320 these alarms."; 1321 } 1322 } 1323 } 1324 } 1326 container alarm-inventory { 1327 config false; 1328 description 1329 "This list contains all possible alarm types for the system. 1330 If the system knows for which resources a specific alarm 1331 type can appear, this is also identified in the inventory. 1332 The list also tells if each alarm type has a corresponding 1333 clear state. The inventory shall only contain concrete 1334 alarm types. 1336 The alarm inventory MUST be updated by the system when new 1337 alarms can appear. This can be the case when installing new 1338 software modules or inserting new card types. A 1339 notification 'alarm-inventory-changed' is sent when the 1340 inventory is changed."; 1342 list alarm-type { 1343 key "alarm-type-id alarm-type-qualifier"; 1344 description 1345 "An entry in this list defines a possible alarm."; 1346 leaf alarm-type-id { 1347 type alarm-type-id; 1348 mandatory true; 1349 description 1350 "The statically defined alarm type identifier for this 1351 possible alarm."; 1352 } 1353 leaf alarm-type-qualifier { 1354 type alarm-type-qualifier; 1355 description 1356 "The optionally dynamically defined alarm type identifier 1357 for this possible alarm."; 1358 } 1359 leaf-list resource { 1360 type string; 1361 description 1362 "Optionally, specifies for which resources the alarm type 1363 is valid.
This string is for human consumption but 1364 SHOULD refer to paths in the model."; 1365 } 1366 leaf has-clear { 1367 type boolean; 1368 mandatory true; 1369 description 1370 "This leaf tells the operator if the alarm will be 1371 cleared when the correct corrective action has been 1372 taken. Implementations SHOULD strive for detecting the 1373 cleared state for all alarm types. If this leaf is 1374 true, the operator can monitor the alarm until it 1375 becomes cleared after the corrective action has been 1376 taken. If this leaf is false, the operator needs to 1377 validate that the alarm is no longer active using other 1378 mechanisms. Alarms can lack a corresponding clear due 1379 to missing instrumentation or because there is no logical 1380 corresponding clear state."; 1381 } 1382 leaf-list severity-levels { 1383 type severity; 1384 description 1385 "This leaf-list indicates the possible severity levels of 1386 this alarm type. Note well that 'clear' is not part of 1387 the severity type. In general, the severity level should 1388 be defined by the instrumentation based on dynamic state 1389 and not defined statically by the alarm type in order to 1390 provide a relevant severity level based on dynamic state 1391 and context. However, most alarm types have a defined set 1392 of possible severity levels and this should be provided 1393 here."; 1394 } 1395 leaf description { 1396 type string; 1397 mandatory true; 1398 description 1399 "A description of the possible alarm.
It SHOULD include 1400 information on possible underlying root causes and 1401 corrective actions."; 1402 } 1403 } 1404 } 1406 container summary { 1407 config false; 1408 description 1409 "This container gives a summary of number of alarms 1410 and shelved alarms"; 1411 list alarm-summary { 1412 key severity; 1413 description 1414 "A global summary of all alarms in the system."; 1415 leaf severity { 1416 type severity; 1417 description 1418 "Alarm summary for this severity level."; 1419 } 1420 leaf total { 1421 type yang:gauge32; 1422 description 1423 "Total number of alarms of this severity level."; 1424 } 1425 leaf cleared { 1426 type yang:gauge32; 1427 description 1428 "For this severity level, the number of alarms that are 1429 cleared."; 1430 } 1431 leaf cleared-not-closed { 1432 if-feature operator-actions; 1433 type yang:gauge32; 1434 description 1435 "For this severity level, the number of alarms that are 1436 cleared but not closed."; 1437 } 1438 leaf cleared-closed { 1439 if-feature operator-actions; 1440 type yang:gauge32; 1441 description 1442 "For this severity level, the number of alarms that are 1443 cleared and closed."; 1444 } 1445 leaf not-cleared-closed { 1446 if-feature operator-actions; 1447 type yang:gauge32; 1448 description 1449 "For this severity level, the number of alarms that are 1450 not cleared but closed."; 1451 } 1452 leaf not-cleared-not-closed { 1453 if-feature operator-actions; 1454 type yang:gauge32; 1455 description 1456 "For this severity level, the number of alarms that are 1457 not cleared and not closed."; 1458 } 1459 } 1460 leaf shelves-active { 1461 if-feature alarm-shelving; 1462 type empty; 1463 description 1464 "This is a hint to the operator that there are active 1465 alarm shelves. 
This leaf MUST exist if the 1466 alarms/shelved-alarms/number-of-shelved-alarms is > 0."; 1467 } 1468 } 1470 container alarm-list { 1471 config false; 1472 description 1473 "The alarms in the system."; 1474 leaf number-of-alarms { 1475 type yang:gauge32; 1476 description 1477 "This object shows the total number of 1478 alarms in the system, i.e., the total number 1479 of entries in the alarm list."; 1481 } 1483 leaf last-changed { 1484 type yang:date-and-time; 1485 description 1486 "A timestamp when the alarm list was last 1487 changed. The value can be used by a manager to 1488 initiate an alarm resynchronization procedure."; 1489 } 1491 list alarm { 1492 key "resource alarm-type-id alarm-type-qualifier"; 1494 description 1495 "The list of alarms. Each entry in the list holds one 1496 alarm for a given alarm type and resource. 1497 An alarm can be updated from the underlying resource or 1498 by the user. The following leafs are maintained by the 1499 resource: is-cleared, last-changed, perceived-severity, 1500 and alarm-text. An operator can change: operator-state 1501 and operator-text. 1503 Entries appear in the alarm list the first time an 1504 alarm becomes active for a given alarm-type and resource. 1505 Entries do not get deleted when the alarm is cleared; this 1506 is a boolean state in the alarm. 1508 Alarm entries are removed (purged) from the list by an 1509 explicit purge action. For example, delete all alarms 1510 that are cleared and in closed operator-state that are 1511 older than 24 hours. Systems may also remove alarms based 1512 on locally configured policies, which is out of scope for 1513 this module."; 1514 leaf time-created { 1515 type yang:date-and-time; 1516 mandatory true; 1517 description 1518 "The time-stamp when this alarm entry was created. This 1519 represents the first time the alarm appeared; it can 1520 also represent that the alarm re-appeared after a purge.
1521 Further state-changes of the same alarm do not change 1522 this leaf; these changes will update the 'last-changed' 1523 leaf."; 1524 } 1526 uses common-alarm-parameters; 1527 uses resource-alarm-parameters; 1528 list operator-state-change { 1529 if-feature operator-actions; 1530 key time; 1531 description 1532 "This list is used by operators to indicate 1533 the state of human intervention on an alarm. 1534 For example, if an operator has seen an alarm, 1535 the operator can add a new item to this list indicating 1536 that the alarm is acknowledged."; 1537 uses operator-parameters; 1538 } 1540 action set-operator-state { 1541 if-feature operator-actions; 1542 description 1543 "This is a means for the operator to indicate 1544 the level of human intervention on an alarm."; 1545 input { 1546 leaf state { 1547 type operator-state; 1548 mandatory true; 1549 description 1550 "Set this operator state."; 1551 } 1552 leaf text { 1553 type string; 1554 description 1555 "Additional optional textual information."; 1556 } 1557 } 1558 } 1559 } 1560 } 1562 container shelved-alarms { 1563 if-feature alarm-shelving; 1564 config false; 1565 description 1566 "The shelved alarms. Alarms appear here if they match the 1567 criteria in /alarms/control/alarm-shelving. This list does 1568 not generate any notifications. The list represents alarms 1569 that are considered not relevant by the operator. Alarms in 1570 this list have an operator-state of 'shelved'. This cannot 1571 be changed."; 1572 leaf number-of-shelved-alarms { 1573 type yang:gauge32; 1574 description 1575 "This object shows the total number of currently 1576 shelved alarms, i.e., the total number of entries 1577 in the shelved alarm list."; 1578 } 1580 leaf alarm-shelf-last-changed { 1581 type yang:date-and-time; 1582 description 1583 "A timestamp when the shelved alarm list was last 1584 changed.
The value can be used by a manager to 1585 initiate an alarm resynchronization procedure."; 1586 } 1588 list shelved-alarm { 1589 key "resource alarm-type-id alarm-type-qualifier"; 1591 description 1592 "The list of shelved alarms. Each entry in the list holds 1593 one alarm for a given alarm type and resource. An alarm 1594 can be updated from the underlying resource or by the 1595 user. These changes are reflected in different lists 1596 below the corresponding alarm."; 1598 uses common-alarm-parameters; 1599 uses resource-alarm-parameters; 1601 list operator-state-change { 1602 if-feature operator-actions; 1603 key time; 1604 description 1605 "This list is used by operators to indicate 1606 the state of human intervention on an alarm. 1607 For example, if an operator has seen an alarm, 1608 the operator can add a new item to this list indicating 1609 that the alarm is acknowledged."; 1610 uses operator-parameters; 1611 } 1612 } 1613 } 1614 } 1616 /* 1617 * Operations 1618 */ 1620 rpc compress-alarms { 1621 if-feature alarm-history; 1622 description 1623 "This operation requests the server to compress entries in the 1624 alarm list by removing all but the latest state change for all 1625 alarms. Conditions in the input are logically ANDed. 
If no 1626 input condition is given, all alarms are compressed."; 1627 input { 1628 leaf resource { 1629 type leafref { 1630 path "/alarms/alarm-list/alarm/resource"; 1631 require-instance false; 1632 } 1633 description 1634 "Compress the alarms with this resource."; 1635 } 1636 leaf alarm-type-id { 1637 type leafref { 1638 path "/alarms/alarm-list/alarm/alarm-type-id"; 1639 } 1640 description 1641 "Compress alarms with this alarm-type-id."; 1642 } 1643 leaf alarm-type-qualifier { 1644 type leafref { 1645 path "/alarms/alarm-list/alarm/alarm-type-qualifier"; 1646 } 1647 description 1648 "Compress the alarms with this alarm-type-qualifier."; 1649 } 1650 } 1651 output { 1652 leaf compressed-alarms { 1653 type uint32; 1654 description 1655 "Number of compressed alarm entries."; 1656 } 1657 } 1658 } 1660 grouping filter-input { 1661 description 1662 "Grouping to specify a filter construct on alarm information."; 1663 leaf alarm-status { 1664 type enumeration { 1665 enum any { 1666 description 1667 "Ignore alarm clearance status."; 1668 } 1669 enum cleared { 1670 description 1671 "Filter cleared alarms."; 1672 } 1673 enum not-cleared { 1674 description 1675 "Filter not cleared alarms."; 1676 } 1677 } 1678 mandatory true; 1679 description 1680 "The clearance status of the alarm."; 1681 } 1683 container older-than { 1684 presence "Age specification"; 1685 description 1686 "Matches the 'last-status-change' leaf in the alarm."; 1687 choice age-spec { 1688 description 1689 "Filter using date and time age."; 1690 case seconds { 1691 leaf seconds { 1692 type uint16; 1693 description 1694 "Seconds part"; 1695 } 1696 } 1697 case minutes { 1698 leaf minutes { 1699 type uint16; 1700 description 1701 "Minute part"; 1702 } 1703 } 1704 case hours { 1705 leaf hours { 1706 type uint16; 1707 description 1708 "Hours part."; 1709 } 1710 } 1711 case days { 1712 leaf days { 1713 type uint16; 1714 description 1715 "Day part"; 1716 } 1717 } 1718 case weeks { 1719 leaf weeks { 1720 type 
uint16; 1721 description 1722 "Week part"; 1723 } 1724 } 1725 } 1726 } 1727 container severity { 1728 presence "Severity filter"; 1729 choice sev-spec { 1730 description 1731 "Filter based on severity level."; 1732 leaf below { 1733 type severity; 1734 description 1735 "Severity level lower than this leaf."; 1736 } 1737 leaf is { 1738 type severity; 1739 description 1740 "Severity level equal to this leaf."; 1741 } 1742 leaf above { 1743 type severity; 1744 description 1745 "Severity level higher than this leaf."; 1746 } 1747 } 1748 description 1749 "Filter based on severity."; 1750 } 1751 container operator-state-filter { 1752 if-feature operator-actions; 1753 presence "Operator state filter"; 1754 leaf state { 1755 type operator-state; 1756 description 1757 "Filter on operator state."; 1758 } 1759 leaf user { 1760 type string; 1761 description 1762 "Filter based on which operator."; 1763 } 1764 description 1765 "Filter based on operator state."; 1766 } 1767 } 1768 rpc purge-alarms { 1769 description 1770 "This operation requests the server to delete entries from the 1771 alarm list according to the supplied criteria. Typically, it 1772 can be used to delete alarms that are in closed operator state 1773 and older than a specified time. The number of purged alarms 1774 is returned as an output parameter."; 1775 input { 1776 uses filter-input; 1777 } 1778 output { 1779 leaf purged-alarms { 1780 type uint32; 1781 description 1782 "Number of purged alarms."; 1783 } 1784 } 1785 } 1787 /* 1788 * Notifications 1789 */ 1791 notification alarm-notification { 1792 description 1793 "This notification is used to report a state change for an 1794 alarm.
The same notification is used for reporting a newly 1795 raised alarm, a cleared alarm or changing the text and/or 1796 severity of an existing alarm."; 1798 uses common-alarm-parameters; 1799 uses alarm-state-change-parameters; 1800 } 1802 notification alarm-inventory-changed { 1803 description 1804 "This notification is used to report that the list of possible 1805 alarms has changed. This can happen, for example, when a new 1806 software module is installed or a new physical card is 1807 inserted."; 1808 } 1810 notification operator-action { 1811 if-feature operator-actions; 1812 description 1813 "This notification is used to report that an operator 1814 acted upon an alarm."; 1816 leaf resource { 1817 type leafref { 1818 path "/alarms/alarm-list/alarm/resource"; 1819 require-instance false; 1820 } 1821 description 1822 "The alarming resource."; 1823 } 1824 leaf alarm-type-id { 1825 type leafref { 1826 path "/alarms/alarm-list/alarm" 1827 + "[resource=current()/../resource]" 1828 + "/alarm-type-id"; 1829 require-instance false; 1830 } 1831 description 1832 "The alarm type identifier for the alarm."; 1833 } 1834 leaf alarm-type-qualifier { 1835 type leafref { 1836 path "/alarms/alarm-list/alarm" 1837 + "[resource=current()/../resource]" 1838 + "[alarm-type-id=current()/../alarm-type-id]" 1839 + "/alarm-type-qualifier"; 1840 require-instance false; 1841 } 1842 description 1843 "The alarm qualifier for the alarm."; 1844 } 1845 uses operator-parameters; 1846 } 1847 } 1849 1851 7. X.733 Alarm Mapping Data Model 1853 Many alarm management systems are based on the X.733 alarm standard. 1854 This YANG module allows a mapping from alarm types to X.733 event- 1855 type and probable-cause. 1857 The module augments the alarm inventory, the alarm list and the alarm 1858 notification with X.733 parameters. 1860 The module also supports a feature whereby the alarm manager can 1861 configure the mapping.
This might be needed when the default mapping 1862 provided by the system is in conflict with other systems or not 1863 considered good. 1865 8. X.733 Alarm Mapping YANG Module 1867 This YANG module references [X.733]. 1869 file "ietf-alarms-x733@2017-10-30.yang" 1870 module ietf-alarms-x733 { 1871 yang-version 1.1; 1872 namespace "urn:ietf:params:xml:ns:yang:ietf-alarms-x733"; 1873 prefix x733; 1875 import ietf-alarms { 1876 prefix al; 1877 } 1879 organization 1880 "IETF CCAMP Working Group"; 1882 contact 1883 "WG Web: 1884 WG List: 1886 Editor: Stefan Vallin 1887 1889 Editor: Martin Bjorklund 1890 "; 1892 description 1893 "This module augments the ietf-alarms module with X.733 mapping 1894 information. The following structures are augmented with 1895 event type and probable cause: 1897 1) alarm inventory: all possible alarms. 1898 2) alarm: every alarm in the system. 1899 3) alarm notification: notifications indicating alarm state 1900 changes. 1902 The module also optionally allows the alarm management system 1903 to configure the mapping. The mapping does not include a 1904 corresponding specific problem value.
The recommendation is
        to use alarm-type-qualifier which serves the same purpose.";
     reference
       "ITU Recommendation X.733: Information Technology
          - Open Systems Interconnection
          - System Management: Alarm Reporting Function";

     revision 2017-10-30 {
       description
         "Initial revision.";
       reference
         "RFC XXXX: YANG Alarm Module";
     }

     /*
      * Features
      */

     feature configure-x733-mapping {
       description
         "The system supports configurable X733 mapping from
          alarm type to event type and probable cause.";
     }

     /*
      * Typedefs
      */

     typedef event-type {
       type enumeration {
         enum other {
           value 1;
           description
             "None of the below.";
         }
         enum communications-alarm {
           value 2;
           description
             "An alarm of this type is principally associated with the
              procedures and/or processes required to convey
              information from one point to another.";
           reference
             "ITU Recommendation X.733: Information Technology
                - Open Systems Interconnection
                - System Management: Alarm Reporting Function";
         }
         enum quality-of-service-alarm {
           value 3;
           description
             "An alarm of this type is principally associated with a
              degradation in the quality of a service.";
           reference
             "ITU Recommendation X.733: Information Technology
                - Open Systems Interconnection
                - System Management: Alarm Reporting Function";
         }
         enum processing-error-alarm {
           value 4;
           description
             "An alarm of this type is principally associated with a
              software or processing fault.";
           reference
             "ITU Recommendation X.733: Information Technology
                - Open Systems Interconnection
                - System Management: Alarm Reporting Function";
         }
         enum equipment-alarm {
           value 5;
           description
             "An alarm of this type is principally associated with an
              equipment fault.";
           reference
             "ITU Recommendation X.733: Information Technology
                - Open Systems Interconnection
                - System Management: Alarm Reporting Function";
         }
         enum environmental-alarm {
           value 6;
           description
             "An alarm of this type is principally associated with a
              condition relating to an enclosure in which the equipment
              resides.";
           reference
             "ITU Recommendation X.733: Information Technology
                - Open Systems Interconnection
                - System Management: Alarm Reporting Function";
         }
         enum integrity-violation {
           value 7;
           description
             "An indication that information may have been illegally
              modified, inserted or deleted.";
           reference
             "ITU Recommendation X.736: Information Technology
                - Open Systems Interconnection
                - System Management: Security Alarm Reporting Function";
         }
         enum operational-violation {
           value 8;
           description
             "An indication that the provision of the requested service
              was not possible due to the unavailability, malfunction or
              incorrect invocation of the service.";
           reference
             "ITU Recommendation X.736: Information Technology
                - Open Systems Interconnection
                - System Management: Security Alarm Reporting Function";
         }
         enum physical-violation {
           value 9;
           description
             "An indication that a physical resource has been violated
              in a way that suggests a security attack.";
           reference
             "ITU Recommendation X.736: Information Technology
                - Open Systems Interconnection
                - System Management: Security Alarm Reporting Function";
         }
         enum security-service-or-mechanism-violation {
           value 10;
           description
             "An indication that a security attack has been detected by
              a security service or mechanism.";
           reference
             "ITU Recommendation X.736: Information Technology
                - Open Systems Interconnection
                - System Management: Security Alarm Reporting Function";
         }
         enum time-domain-violation {
           value 11;
           description
             "An indication that an event has occurred at an unexpected
              or prohibited time.";
           reference
             "ITU Recommendation X.736: Information Technology
                - Open Systems Interconnection
                - System Management: Security Alarm Reporting Function";
         }
       }
       description
         "The event types as defined by X.733 and X.736. The use of the
          term 'event' is a bit confusing. In an alarm context these
          are top level alarm types.";
     }

     /*
      * Groupings
      */

     grouping x733-alarm-parameters {
       description
         "Common X.733 parameters for alarms.";

       leaf event-type {
         type event-type;
         description
           "The X.733/X.736 event type for this alarm.";
       }
       leaf probable-cause {
         type uint32;
         description
           "The X.733 probable cause for this alarm.";
       }
     }

     grouping x733-alarm-definition-parameters {
       description
         "Common X.733 parameters for alarm definitions.";

       leaf event-type {
         type event-type;
         description
           "The alarm type has this X.733/X.736 event type.";
       }
       leaf probable-cause {
         type uint32;
         description
           "The alarm type has this X.733 probable cause value.
            This module defines probable cause as an integer
            and not as an enumeration. The reason is that the
            primary use of probable cause is in the management
            application if it is based on the X.733 standard.
            However, most management applications have their own
            defined enum definitions and merging enums from
            different systems might create conflicts. By using
            a configurable uint32 the system can be configured
            to match the enum values in the manager.";
       }
     }

     /*
      * Add X.733 parameters to the alarm definitions, alarms,
      * and notification.
      */

     augment "/al:alarms/al:alarm-inventory/al:alarm-type" {
       description
         "Augment X.733 mapping information to the alarm inventory.";

       uses x733-alarm-definition-parameters;
     }
     augment "/al:alarms/al:control" {
       description
         "Add X.733 mapping capabilities.";
       list x733-mapping {
         if-feature configure-x733-mapping;
         key "alarm-type-id alarm-type-qualifier-match";
         description
           "This list allows a management application to control the
            X.733 mapping for all alarm types in the system. Any entry
            in this list will allow the alarm manager to override the
            default X.733 mapping in the system, and the final mapping
            will be shown in the alarm-inventory.";

         leaf alarm-type-id {
           type al:alarm-type-id;
           description
             "Map the alarm type with this alarm type identifier.";
         }
         leaf alarm-type-qualifier-match {
           type string;
           description
             "A W3C regular expression that is used when mapping an
              alarm type and alarm-type-qualifier to X.733 parameters.";
         }

         uses x733-alarm-definition-parameters;
       }
     }

     augment "/al:alarms/al:alarm-list/al:alarm" {
       description
         "Augment X.733 information to the alarm.";

       uses x733-alarm-parameters;
     }

     augment "/al:alarms/al:shelved-alarms/al:shelved-alarm" {
       description
         "Augment X.733 information to the alarm.";

       uses x733-alarm-parameters;
     }

     augment "/al:alarm-notification" {
       description
         "Augment X.733 information to the alarm notification.";

       uses x733-alarm-parameters;
     }
   }

9.  Security Considerations

   None.

10.  Acknowledgements

   The authors wish to thank Viktor Leijon and Johan Nordlander for
   their valuable input on forming the alarm model.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [X.733]    International Telecommunications Union, "Information
              Technology - Open Systems Interconnection - Systems
              Management: Alarm Reporting Function",
              ITU-T Recommendation X.733, 1992.

11.2.  Informative References

   [ALARMIRP]
              3GPP, "Telecommunication management; Fault Management;
              Part 2: Alarm Integration Reference Point (IRP):
              Information Service (IS)", 3GPP TS 32.111-2 3.4.0, March
              2005.

   [ALARMSEM]
              Wallin, S., Leijon, V., Nordlander, J., and N. Bystedt,
              "The semantics of alarm definitions: enabling systematic
              reasoning about alarms", International Journal of Network
              Management, Volume 22, Issue 3, John Wiley and Sons, Ltd,
              http://dx.doi.org/10.1002/nem.800, March 2012.

   [EEMUA]    Engineering Equipment and Materials Users Association,
              "Alarm Systems: A Guide to Design, Management and
              Procurement", EEMUA Publication No. 191, 2nd edition,
              London, 2007.

   [I-D.ietf-netmod-yang-tree-diagrams]
              Bjorklund, M. and L. Berger, "YANG Tree Diagrams",
              draft-ietf-netmod-yang-tree-diagrams-02 (work in
              progress), October 2017.

   [ISA182]   International Society of Automation (ISA),
              "ANSI/ISA-18.2-2009 Management of Alarm Systems for the
              Process Industries", 2009.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004.

   [X.736]    International Telecommunications Union, "Information
              Technology - Open Systems Interconnection - Systems
              Management: Security alarm reporting function",
              ITU-T Recommendation X.736, 1992.

Appendix A.  Vendor-specific Alarm-Types Example

   This example shows how to define alarm-types in a vendor-specific
   module. In this case the vendor "xyz" has chosen to define top level
   identities according to X.733 event types.

     module example-xyz-alarms {
       namespace "urn:example:xyz-alarms";
       prefix xyz-al;

       import ietf-alarms {
         prefix al;
       }

       identity xyz-alarms {
         base al:alarm-identity;
       }

       identity communications-alarm {
         base xyz-alarms;
       }
       identity quality-of-service-alarm {
         base xyz-alarms;
       }
       identity processing-error-alarm {
         base xyz-alarms;
       }
       identity equipment-alarm {
         base xyz-alarms;
       }
       identity environmental-alarm {
         base xyz-alarms;
       }

       // communications alarms
       identity link-alarm {
         base communications-alarm;
       }

       // QoS alarms
       identity high-jitter-alarm {
         base quality-of-service-alarm;
       }
     }

Appendix B.  Alarm Inventory Example

   This example shows an alarm inventory with two alarm types: one
   defined only by its identifier, and another that is dynamically
   configured. In the latter case a digital input has been connected
   to a smoke detector; therefore, the 'alarm-type-qualifier' is set to
   "smoke-alarm" and the 'alarm-type-id' to "environmental-alarm".

     xyz-al:link-alarm
     true
     Link failure, operational state down but admin state up

     xyz-al:environmental-alarm
     smoke-alarm
     true
     Connected smoke detector to digital input

Appendix C.  Alarm List Example

   In this example we show an alarm that has toggled [major, clear,
   major].
   An operator has acknowledged the alarm.

     1
     2015-04-08T08:39:50.00Z

     /dev:interfaces/dev:interface[name='FastEthernet1/0']
     xyz-al:link-alarm

     2015-04-08T08:39:50.00Z
     false
     1.3.6.1.2.1.2.2.1.1.17
     2015-04-08T08:39:40.00Z
     major
     Link operationally down but administratively up

     major
     Link operationally down but administratively up

     cleared
     Link operationally up and administratively up

     major
     Link operationally down but administratively up

     ack
     joe
     Will investigate, ticket TR764999

Appendix D.  Alarm Shelving Example

   This example shows how to shelve alarms. We shelve alarms related
   to the smoke detectors since they are being installed and tested.
   We also shelve all alarms from FastEthernet1/0.

     FE10
     /dev:interfaces/dev:interface[name='FastEthernet1/0']

     detectortest
     xyz-al:environmental-alarm
     smoke-alarm

Appendix E.  X.733 Mapping Example

   This example shows how to map a dynamic alarm type
   (alarm-type-id=environmental-alarm, alarm-type-qualifier=smoke-alarm)
   to the corresponding X.733 event-type and probable cause parameters.

     xyz-al:environmental-alarm
     smoke-alarm
     quality-of-service-alarm
     777

Appendix F.  Background and Usability Requirements

   This section gives background information regarding design choices
   in the alarm module. It also defines usability requirements for
   alarms. Alarm usability is important for an alarm interface. A
   data model helps in defining the format, but if the actual alarms
   are of low value, the goal of alarm management has not been
   achieved.

   The telecommunication domain has standardised an alarm interface in
   ITU-T X.733 [X.733]. This continued in mobile networks within the
   3GPP organisation [ALARMIRP]. Although SNMP is the dominant
   mechanism for monitoring devices, the IETF did not standardise an
   alarm MIB early on. Instead, management systems interpreted the
   enterprise-specific traps per MIB and device to build an alarm
   list. When the Alarm MIB [RFC3877] was finally published, it had to
   address the existence of enterprise traps and map these into
   alarms. This requirement led to a MIB that is not always easy to
   use.

F.1.  Alarm Concepts

   There are two misconceptions regarding alarms and alarm interfaces
   that are important to sort out. The first is that alarms are
   conflated with events in general. Alarms MUST correspond to an
   undesirable state that needs corrective action. Many
   implementations of alarm interfaces do not adhere to this principle
   and just send events in general. In order to qualify as an alarm,
   there must exist a corrective action. If that is not true, it is an
   event that can go into logs.

   The other misconception is that the term "alarm" refers to the
   notification itself. Rather, an alarm is a state of a resource in
   the system. The alarm notifications report state changes of the
   alarm, such as alarm raise and alarm clear.

   "One of the most important principles of alarm management is that an
   alarm requires an action. This means that if the operator does not
   need to respond to an alarm (because unacceptable consequences do
   not occur), then it is not an alarm. Following this cardinal rule
   will help eliminate many potential alarm management issues."
   [ISA182]

F.1.1.  Alarm type

   Since every alarm has a corresponding corrective action, a vendor
   can prepare a list of available alarms and their corrective actions.
   We use the term "alarm type" to refer to every possible alarm that
   could be active in the system.

   Alarm types are also fundamental in order to provide a state-based
   alarm list. The alarm list correlates alarm state changes for the
   same alarm type and the same resource into one alarm.

   Different alarm interfaces use different mechanisms to define alarm
   types, ranging from simple error numbers to more advanced mechanisms
   like the X.733 triplet of event type, probable cause and specific
   problem.

   A common misunderstanding is that individual alarm notifications are
   alarm types. This is not correct; e.g., "link-up" and "link-down"
   are two notifications reporting different states for the same alarm
   type, "link-alarm".

F.2.  Usability Requirements

   Common alarm problems and their causes are summarised in Table 1.
   This summary is adapted to networking from the ISA [ISA182] and
   EEMUA [EEMUA] standards.

   +------------------+--------------------------------+---------------+
   | Problem          | Cause                          | How this      |
   |                  |                                | module        |
   |                  |                                | addresses the |
   |                  |                                | cause         |
   +------------------+--------------------------------+---------------+
   | Alarms are       | "Nuisance" alarms (chattering  | Strict        |
   | generated but    | alarms and fleeting alarms),   | definition of |
   | they are ignored | faulty hardware, redundant     | alarms        |
   | by the operator. | alarms, cascading alarms,      | requiring     |
   |                  | incorrect alarm settings,      | corrective    |
   |                  | alarms have not been           | response.     |
   |                  | rationalised, the alarms       | Alarm         |
   |                  | represent log information      | requirements  |
   |                  | rather than true alarms.       | in Table 2.   |
   |                  |                                |               |
   | When alarms      | Insufficient alarm response    | The alarm     |
   | occur, operators | procedures and not well        | inventory     |
   | do not know how  | defined alarm types.           | lists all     |
   | to respond.      |                                | alarm types   |
   |                  |                                | and           |
   |                  |                                | corrective    |
   |                  |                                | actions.      |
   |                  |                                | Alarm         |
   |                  |                                | requirements  |
   |                  |                                | in Table 2.   |
   |                  |                                |               |
   | The alarm        | Nuisance alarms, stale alarms, | The alarm     |
   | display is full  | alarms from equipment not in   | definition    |
   | of alarms, even  | service.                       | and alarm     |
   | when there is    |                                | shelving.     |
   | nothing wrong.   |                                |               |
   |                  |                                |               |
   | During a         | Incorrect prioritization of    | State-based   |
   | failure,         | alarms. Not using advanced     | alarm model,  |
   | operators are    | alarm techniques (e.g. state-  | alarm rate    |
   | flooded with so  | based alarming).               | requirements  |
   | many alarms that |                                | in Table 3    |
   | they do not know |                                | and Table 4   |
   | which ones are   |                                |               |
   | the most         |                                |               |
   | important.       |                                |               |
   +------------------+--------------------------------+---------------+

                    Table 1: Alarm Problems and Causes

   Based upon the above problems, EEMUA gives the following definition
   of a good alarm:

   +----------------+--------------------------------------------------+
   | Characteristic | Explanation                                      |
   +----------------+--------------------------------------------------+
   | Relevant       | Not spurious or of low operational value.        |
   |                |                                                  |
   | Unique         | Not duplicating another alarm.                   |
   |                |                                                  |
   | Timely         | Not long before any response is needed or too    |
   |                | late to do anything.                             |
   |                |                                                  |
   | Prioritised    | Indicating the importance that the operator      |
   |                | deals with the problem.                          |
   |                |                                                  |
   | Understandable | Having a message which is clear and easy to      |
   |                | understand.                                      |
   |                |                                                  |
   | Diagnostic     | Identifying the problem that has occurred.       |
   |                |                                                  |
   | Advisory       | Indicative of the action to be taken.            |
   |                |                                                  |
   | Focusing       | Drawing attention to the most important issues.  |
   +----------------+--------------------------------------------------+

                    Table 2: Definition of a Good Alarm

   Vendors SHOULD rationalise all alarms according to Table 2. Another
   crucial requirement is acceptable alarm rates. Vendors SHOULD make
   sure that they do not exceed the recommendations from EEMUA below:

   +-----------------------------------+-------------------------------+
   | Long Term Alarm Rate in Steady    | Acceptability                 |
   | Operation                         |                               |
   +-----------------------------------+-------------------------------+
   | More than one per minute          | Very likely to be             |
   |                                   | unacceptable.                 |
   |                                   |                               |
   | One per 2 minutes                 | Likely to be over-demanding.  |
   |                                   |                               |
   | One per 5 minutes                 | Manageable.                   |
   |                                   |                               |
   | Less than one per 10 minutes      | Very likely to be acceptable. |
   +-----------------------------------+-------------------------------+

               Table 3: Acceptable Alarm Rates, Steady State

   +----------------------------+--------------------------------------+
   | Number of alarms displayed | Acceptability                        |
   | in 10 minutes following a  |                                      |
   | major network problem      |                                      |
   +----------------------------+--------------------------------------+
   | More than 100              | Definitely excessive and very likely |
   |                            | to lead the operator to abandon      |
   |                            | the use of the alarm system.         |
   |                            |                                      |
   | 20-100                     | Hard to cope with.                   |
   |                            |                                      |
   | Under 10                   | Should be manageable - but may be    |
   |                            | difficult if several of the alarms   |
   |                            | require a complex operator response. |
   +----------------------------+--------------------------------------+

                  Table 4: Acceptable Alarm Rates, Burst

   The numbers in Table 3 and Table 4 are the sum of all alarms for a
   network being managed from one alarm console. So every individual
   system or NMS contributes to these numbers.
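
   As an illustration only, the following Python sketch shows how the
   steady-state rates of Table 3 could be checked for one alarm
   console. It sums the alarms raised by every contributing system
   during an observation window and maps the resulting rate to an
   acceptability label. The function name, the input dictionary, and
   in particular the cut-off values between the four sample rates
   listed in the table are assumptions made for this example; neither
   EEMUA nor this document defines them.

```python
# Hypothetical helper, not part of the alarm module. Table 3 lists only
# four sample rates; the cut-offs used between them (1, 5 and 10 minutes
# per alarm) are assumptions made for this sketch.

def classify_steady_state(alarms_per_system, window_minutes):
    """Return the Table 3 acceptability label for one alarm console.

    alarms_per_system maps a system or NMS name to the number of alarms
    it raised during the observation window of window_minutes minutes.
    """
    # Every system or NMS behind the console contributes to the total.
    total = sum(alarms_per_system.values())
    if total == 0:
        return "Very likely to be acceptable."

    minutes_per_alarm = window_minutes / total
    if minutes_per_alarm < 1:      # more than one alarm per minute
        return "Very likely to be unacceptable."
    if minutes_per_alarm < 5:      # around one per 2 minutes (assumed band)
        return "Likely to be over-demanding."
    if minutes_per_alarm <= 10:    # around one per 5 minutes (assumed band)
        return "Manageable."
    return "Very likely to be acceptable."  # less than one per 10 minutes
```

   For example, a console that receives 40 alarms from one system and
   50 from another within one hour averages more than one alarm per
   minute and therefore falls in the first row of Table 3.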

   Vendors SHOULD make sure that the following rules are used in
   designing the alarm interface:

   1.  Rationalize the alarms in the system to ensure that every alarm
       is necessary, has a purpose, and follows the cardinal rule -
       that it requires an operator response. Each alarm should also
       adhere to the rules of Table 2.

   2.  Audit the quality of the alarms. Talk with the operators about
       how well the alarm information supports them. Do they know what
       to do in the event of an alarm? Are they able to quickly
       diagnose the problem and determine the corrective action? Does
       the alarm text adhere to the requirements in Table 2?

   3.  Analyze and benchmark the performance of the system and compare
       it to the recommended metrics in Table 3 and Table 4. Start by
       identifying nuisance alarms, standing alarms at normal state and
       startup.

Authors' Addresses

   Stefan Vallin
   Stefan Vallin AB

   Email: stefan@wallan.se


   Martin Bjorklund
   Cisco

   Email: mbj@tail-f.com