ALTO                                                           S. Kiesel
Internet-Draft                                   University of Stuttgart
Intended status: Standards Track                               M. Scharf
Expires: August 17, 2014                        Alcatel-Lucent Bell Labs
                                                       February 13, 2014


          ALTO metrics for expressing availability information
               draft-kiesel-alto-availability-metrics-00

Abstract

   This document specifies new metrics to be used with the ALTO
   protocol.  The goal is to provide information about the availability
   of physical network, host, and storage infrastructures to management
   systems that orchestrate virtual infrastructures on top of them.

Terminology and Requirements Language

   This document makes use of the ALTO terminology defined in [RFC5693]
   and [RFC6708].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 17, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the Trust
   Legal Provisions and are provided without warranty as described in
   the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Classification of availability-related parameters
     2.1.  Identification of physical resources
     2.2.  Classification of cost types and properties
       2.2.1.  Static vs. dynamic facts vs. probabilities
       2.2.2.  Causality and Correlation
   3.  Specification of new Endpoint Address types
   4.  Specification of new Cost and Property types
   5.  Obtaining Availability Information
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   Various virtualization technologies make it possible to instantiate
   virtual hosts, virtual storage, and virtual networks on top of their
   physical counterparts.  They can be combined to build complex virtual
   infrastructures.  Management systems automate the task of mapping
   virtual to physical resources, considering various optimization
   goals.  Mechanisms such as live migration of virtual machines or
   reshaping the topology of overlay networks make it possible to react
   dynamically to changing conditions, both in the virtual
   infrastructure (e.g., a change in demand) and in the underlying
   physical infrastructures (e.g., a change in available resources).

   A typical example is that, in a cluster of several physical servers,
   all virtual machines could be migrated away from one node in times of
   low demand.  This node could then be powered down in order to save
   energy.  If resource utilization is the only optimization goal, the
   input for the placement/scheduling manager can be gathered by
   measurements.

   If, however, other optimization goals have to be considered, the
   management system needs external information.  For example, if all
   but two nodes of the cluster are to be shut down, the remaining two
   nodes should be selected in a way that minimizes the risk of both
   nodes failing at the same time due to a single root cause.  This
   optimization problem becomes more difficult if not only hosts and
   storage but also network resources are considered.

   This document shows that the ALTO protocol [I-D.ietf-alto-protocol]
   offers the required base mechanisms for providing a standardized
   interface to virtual infrastructure orchestration managers, for
   conveying information about the availability and reliability of the
   underlying physical infrastructure.  This document further defines
   appropriate metrics for this use case.

2.  Classification of availability-related parameters

   Important concepts of the ALTO protocol are Endpoints, which are
   identified by Endpoint Addresses.  Endpoints can be grouped into
   PIDs.  Endpoints (and, by means of protocol extensions
   [I-D.roome-alto-pid-properties], also PIDs) may have properties that
   can be queried using the ALTO protocol.  Paths between PIDs may have
   one or more path costs according to some cost metric.  These path
   costs can be queried for individual pairs of PIDs, or a whole cost
   map (i.e., a "PID x PID -> cost" matrix) can be downloaded.  The path
   cost concept can easily be generalized to a path property concept.
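
   As a non-normative illustration of the "PID x PID -> cost" matrix
   described above, the following Python sketch models a cost map as a
   nested dictionary and shows both query styles: a lookup for one pair
   of PIDs and the extraction of a sub-matrix.  The PID names, the cost
   values, and the helper functions path_cost and filtered_cost_map are
   assumptions made for this example; they are not taken from the ALTO
   protocol specification.

```python
from typing import Dict, Iterable, Optional

# Hypothetical cost map: source PID -> destination PID -> cost.
# PID names and cost values are invented for illustration only.
CostMap = Dict[str, Dict[str, float]]

cost_map: CostMap = {
    "pid-rack-a": {"pid-rack-a": 0.0, "pid-rack-b": 5.0, "pid-site-2": 20.0},
    "pid-rack-b": {"pid-rack-a": 5.0, "pid-rack-b": 0.0, "pid-site-2": 20.0},
    "pid-site-2": {"pid-rack-a": 20.0, "pid-rack-b": 20.0, "pid-site-2": 0.0},
}


def path_cost(cmap: CostMap, src: str, dst: str) -> Optional[float]:
    """Return the cost for one (src, dst) pair, or None if it is absent."""
    return cmap.get(src, {}).get(dst)


def filtered_cost_map(
    cmap: CostMap, srcs: Iterable[str], dsts: Iterable[str]
) -> CostMap:
    """Return the sub-matrix restricted to the given source and destination PIDs."""
    dst_list = list(dsts)
    return {
        s: {d: cmap[s][d] for d in dst_list if d in cmap[s]}
        for s in srcs
        if s in cmap
    }


if __name__ == "__main__":
    print(path_cost(cost_map, "pid-rack-a", "pid-site-2"))
    print(filtered_cost_map(cost_map, ["pid-rack-a"], ["pid-rack-b", "pid-site-2"]))
```

   Generalizing from path costs to path properties would, in this
   sketch, only change the type of the innermost values.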

   This section discusses how these base mechanisms can be used to
   convey information related to the availability of physical
   infrastructures to systems that manage virtual infrastructures on top
   of them.

2.1.  Identification of physical resources

   In order to identify physical resources within the ALTO protocol, an
   appropriate endpoint address type has to be used.  The ALTO base
   protocol specification [I-D.ietf-alto-protocol] only defines IPv4 and
   IPv6 addresses, and establishes a process to register further types.
   In many cases, IP addresses can be used to identify physical
   resources, e.g., the loopback address of a router or the management
   address of a physical server.  For a discussion of VPNs and ALTO, see
   [I-D.scharf-alto-vpn-service].

   TBD: discussion of further options for endpoint addresses.

2.2.  Classification of cost types and properties

   Information related to the availability of physical resources may be
   of fundamentally different natures, requiring different encodings and
   different update intervals.  This section itemizes several criteria.

2.2.1.  Static vs. dynamic facts vs. probabilities

   Information may consist of static facts that never change or change
   very infrequently.  For example: "Electrical power outlets A and B
   are both connected to circuit breaker F1".

   Information may also change more frequently, e.g., "Uninterruptible
   power supply UPS1 is now running on battery power, 82% capacity
   left".  TBD: further investigation and guidance is needed on the
   maximum update frequency that can reasonably be supported using ALTO.

   Another type of information is statistical measures such as the
   average relative availability of a subsystem, e.g., the famous "five
   nines".

2.2.2.  Causality and Correlation

   Many initial incidents can cause a series of events, according to
   some kind of "failure propagation topology", which is independent of
   the IP network topology.  There may even be hierarchies.

   For example, "Servers S1 and S2 are connected via circuit breaker F1
   to uninterruptible power supply UPS1 while S3 and S4 are connected
   via F2 to UPS1" implies that a failure in S1 that trips F1 will also
   interrupt operation of S2.  Furthermore, shutting down S2, S3, and S4
   in case of a power grid failure could stretch UPS1's battery lifetime
   and thereby prolong S1's survivability time.  Similar considerations
   can be made for different kinds of problems, e.g., the impact of a
   fire.

   Modeling different risk types (e.g., power outage, fire, flooding,
   physical intruders, etc.) in their respective terminology would
   require the definition of many new data types.

   A more generic approach is to use an ALTO cost map as a matrix that
   indicates the level of isolation against "fate sharing" of any two
   PIDs with respect to a given (physical) risk.  In other words, for
   every specific risk R, the coefficients of that matrix could be
   calculated as

      C_R(x,y) = 1 - P( y fails due to R | x fails due to R )

   For example, if the risk type is "fire", then a coefficient of 0
   could mean "these two physical resources are in the same rack.  If
   one is on fire for any reason, the other one will almost inevitably
   fail within seconds, too", a value of 0.3 could mean "the resources
   are in adjacent buildings", and 0.99999 could mean "these two
   resources are on different continents and only a natural disaster
   causing global destruction could disable both of them in one single
   event".

   Note that these conditional probabilities only indicate how likely it
   is that the second resource will become unavailable due to the same
   event that disabled the first resource.
   They do not indicate how likely it is that the event will actually
   occur.

   TBD: discuss to which extent a single "endpoint address to PID"
   network map is useful when considering different risk types.  The
   idea behind PIDs is to reduce map size by grouping topologically
   close endpoints, but the "failure propagation topologies" may be very
   poorly aligned across different risk types.  We will probably end up
   with many very small PIDs.

3.  Specification of new Endpoint Address types

   TBD.

4.  Specification of new Cost and Property types

   TBD.

   We need the "isolation level against fate sharing" matrix and a list
   of risk types, so that the absolute probability of each risk can be
   given for a given resource.

5.  Obtaining Availability Information

   For any ALTO information, it is important to consider whether the
   ALTO service can realistically discover that information, whether the
   distribution of that information is allowed, whether the data is
   useful, whether a client can get that information without excessive
   privacy concerns, and whether the information cannot easily be
   obtained in some other way.

   Availability-related parameters can refer both to properties of the
   network infrastructure (e.g., network resiliency mechanisms) and to
   non-networking effects (e.g., redundancy of power supply).  In both
   cases, an application typically cannot measure that information,
   neither by passive monitoring nor by active probing.  Yet,
   availability information and insight into the impact of incidents
   matter to many applications and can be an important criterion for
   resource selection decisions.  Since typical use cases would be
   limited to one administrative domain, privacy is not a major concern;
   in addition, the suggested correlation metrics provide an abstraction
   over the actual physical infrastructure.
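
   To illustrate how such correlation metrics could feed a resource
   selection decision, the following non-normative Python sketch picks
   the two nodes that should remain powered on, using the isolation
   coefficients C_R(x,y) of Section 2.2.2.  Several points are
   assumptions of this example and not part of this document: the node
   names and coefficient values are invented, the matrices are built
   symmetrically, and the score of a pair is taken to be the minimum
   coefficient over both directions and over all risk types (a
   worst-case rule).  Other aggregation rules are equally conceivable.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Matrix = Dict[str, Dict[str, float]]
# isolation[risk][x][y] = C_R(x, y) = 1 - P(y fails due to R | x fails due to R)
Isolation = Dict[str, Matrix]


def symmetric(pairs: Dict[Tuple[str, str], float]) -> Matrix:
    """Build a symmetric matrix from unordered pairs (an assumption of this sketch)."""
    m: Matrix = {}
    for (x, y), c in pairs.items():
        m.setdefault(x, {})[y] = c
        m.setdefault(y, {})[x] = c
    return m


def pair_isolation(iso: Isolation, x: str, y: str) -> float:
    """Worst-case isolation of x and y over all risk types and both directions."""
    return min(min(m[x][y], m[y][x]) for m in iso.values())


def best_pair(iso: Isolation, nodes: List[str]) -> Tuple[str, str]:
    """Return the pair of nodes with the highest worst-case isolation."""
    return max(combinations(nodes, 2), key=lambda p: pair_isolation(iso, *p))


# Invented example values: S1 and S2 share a rack and a circuit breaker.
isolation: Isolation = {
    "power": symmetric({("S1", "S2"): 0.0, ("S1", "S3"): 0.6, ("S2", "S3"): 0.6}),
    "fire": symmetric({("S1", "S2"): 0.0, ("S1", "S3"): 0.3, ("S2", "S3"): 0.9}),
}

if __name__ == "__main__":
    print(best_pair(isolation, ["S1", "S2", "S3"]))
```

   Because the coefficients only express conditional fate sharing, a
   complete selection procedure would also have to weigh them against
   the absolute probability of each risk, which this sketch omits.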

   Gathering availability information may be more challenging than
   gathering, for instance, IP routing topology information.  For
   example, it may require access to inventory databases.  Yet, within
   one domain, the organization that is responsible for the physical
   network topology may also take care of other parts of the physical
   infrastructure, such as the power supply or hardware installation.
   An organization that operates an ALTO server for exposing network
   topology information could therefore also have access to other
   inventory data.  Providing availability information to an ALTO server
   as described in this document is thus realistic.

6.  IANA Considerations

   TBD.

7.  Security Considerations

   TBD.

8.  References

8.1.  Normative References

   [I-D.ietf-alto-protocol]
              Alimi, R., Penno, R., and Y. Yang, "ALTO Protocol",
              draft-ietf-alto-protocol-25 (work in progress),
              January 2014.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [I-D.roome-alto-pid-properties]
              Roome, B. and Y. Yang, "PID Property Extension for ALTO
              Protocol", draft-roome-alto-pid-properties-00 (work in
              progress), October 2013.

   [I-D.scharf-alto-vpn-service]
              Scharf, M., Gurbani, V., Soprovich, G., and V. Hilt, "The
              Virtual Private Network (VPN) Service in ALTO: Use Cases,
              Requirements and Extensions",
              draft-scharf-alto-vpn-service-01 (work in progress),
              July 2013.

   [RFC5693]  Seedorf, J. and E. Burger, "Application-Layer Traffic
              Optimization (ALTO) Problem Statement", RFC 5693,
              October 2009.

   [RFC6708]  Kiesel, S., Previdi, S., Stiemerling, M., Woundy, R., and
              Y. Yang, "Application-Layer Traffic Optimization (ALTO)
              Requirements", RFC 6708, September 2012.

Authors' Addresses

   Sebastian Kiesel
   University of Stuttgart Information Center
   Networks and Communication Systems Department
   Allmandring 30
   Stuttgart  70550
   Germany

   Email: ietf-alto@skiesel.de
   URI:   http://www.rus.uni-stuttgart.de/nks/


   Michael Scharf
   Alcatel-Lucent Bell Labs
   Lorenzstrasse 10
   Stuttgart  70435
   Germany

   Email: michael.scharf@alcatel-lucent.com
   URI:   www.alcatel-lucent.com/bell-labs