idnits 2.17.1 

draft-kiesel-alto-availability-metrics-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC2119], [RFC6708],
     [RFC5693]), which it shouldn't.  Please replace those with straight
     textual mentions of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (February 13, 2014) is 3726 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-27) exists of
     draft-ietf-alto-protocol-25

  == Outdated reference: A later version (-03) exists of
     draft-roome-alto-pid-properties-00

  == Outdated reference: A later version (-02) exists of
     draft-scharf-alto-vpn-service-01


     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	ALTO                                                           S. Kiesel
3	Internet-Draft                                   University of Stuttgart
4	Intended status: Standards Track                               M. Scharf
5	Expires: August 17, 2014                        Alcatel-Lucent Bell Labs
6	                                                       February 13, 2014

8	          ALTO metrics for expressing availability information
9	               draft-kiesel-alto-availability-metrics-00

11	Abstract

13	   This document specifies new metrics to be used with the ALTO
14	   protocol.  The goal is to provide information about the availability
15	   of physical network, host, and storage infrastructures to management
16	   systems that orchestrate virtual infrastructures on top of them.

18	Terminology and Requirements Language

20	   This document makes use of the ALTO terminology defined in [RFC5693]
21	   and [RFC6708].

23	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
24	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
25	   document are to be interpreted as described in RFC 2119 [RFC2119].

27	Status of this Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on August 17, 2014.

44	Copyright Notice

46	   Copyright (c) 2014 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	   2.  Classification of availability-related parameters
63	     2.1.  Identification of physical resources
64	     2.2.  Classification of cost types and properties
65	       2.2.1.  Static vs. dynamic facts vs. probabilities
66	       2.2.2.  Causality and Correlation
67	   3.  Specification of new Endpoint Address types
68	   4.  Specification of new Cost and Property types
69	   5.  Obtaining Availability Information
70	   6.  IANA Considerations
71	   7.  Security Considerations
72	   8.  References
73	     8.1.  Normative References
74	     8.2.  Informative References
75	   Authors' Addresses

77	1.  Introduction

79	   Various virtualization technologies make it possible to instantiate
80	   virtual hosts, virtual storage, and virtual networks on top of their
81	   physical counterparts.  They can be combined to build complex virtual
82	   infrastructures.  Management systems automate the task of mapping
83	   virtual to physical resources, considering various optimization
84	   goals.  Mechanisms like live migration of virtual machines or
85	   reshaping the topology of overlay networks allow a dynamic reaction
86	   to changing conditions, both in the virtual infrastructure (e.g.,
87	   change in demand) and in the underlying physical infrastructures
88	   (e.g., change in available resources).

90	   A typical example is a cluster of several physical servers in which,
91	   in times of low demand, all virtual machines are migrated away from
92	   one node.  This node can then be powered down in order to save
93	   energy.  If resource utilization is the only optimization goal, the
94	   input for the placement/scheduling manager can be gathered by
95	   measurements.

97	   If, however, other optimization goals have to be considered, the
98	   management system needs external information.  For example, if all
99	   but two nodes of the cluster are to be shut down, the remaining two
100	   nodes should be selected in a way that minimizes the risk of both
101	   nodes failing at the same time due to a single root cause.  This
102	   optimization problem becomes more difficult if not only hosts and
103	   storage but also network resources are considered.

105	   This document shows that the ALTO protocol [I-D.ietf-alto-protocol]
106	   offers the required base mechanisms for providing a standardized
107	   interface to virtual infrastructure orchestration managers, for
108	   conveying information about the availability / reliability of the
109	   underlying physical infrastructure.  This document further defines
110	   appropriate metrics for this use case.

112	2.  Classification of availability-related parameters

114	   Important concepts of the ALTO protocol are Endpoints, which are
115	   identified by Endpoint Addresses.  Endpoints can be grouped in PIDs.
116	   Endpoints (and by means of protocol extensions
117	   [I-D.roome-alto-pid-properties] also PIDs) may have properties that
118	   can be queried using the ALTO protocol.  Paths between PIDs may have
119	   one or more path costs according to some cost metric.  These path
120	   costs can be queried for individual pairs of PIDs, or a whole cost
121	   map (i.e., a "PID x PID -> cost" matrix) can be downloaded.  The path
122	   cost concept can easily be generalized to a path property concept.
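
   The concepts above can be illustrated with a minimal in-memory
   sketch.  It is not the ALTO wire format: the PID names, endpoint
   addresses, and cost values are invented for illustration, and the
   helper functions pid_of() and path_cost() are hypothetical names
   that do not appear in the ALTO protocol specification.

```python
from typing import Dict, List

# Network map: each PID groups a set of endpoint addresses.
# (Hypothetical PIDs and documentation-range addresses.)
network_map: Dict[str, List[str]] = {
    "pid-rack1": ["192.0.2.1", "192.0.2.2"],
    "pid-rack2": ["192.0.2.17"],
}

# Cost map: a "PID x PID -> cost" matrix for one cost metric.
cost_map: Dict[str, Dict[str, float]] = {
    "pid-rack1": {"pid-rack1": 0.0, "pid-rack2": 5.0},
    "pid-rack2": {"pid-rack1": 5.0, "pid-rack2": 0.0},
}


def pid_of(endpoint: str) -> str:
    """Return the PID that contains the given endpoint address."""
    for pid, endpoints in network_map.items():
        if endpoint in endpoints:
            return pid
    raise KeyError(f"endpoint {endpoint} is not in the network map")


def path_cost(src_endpoint: str, dst_endpoint: str) -> float:
    """Look up the path cost between the PIDs of two endpoints."""
    return cost_map[pid_of(src_endpoint)][pid_of(dst_endpoint)]
```

   Replacing the numeric cost with an arbitrary value type would turn
   the same lookup into the "path property" generalization mentioned
   above.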

124	   This section discusses how these base mechanisms can be used to
125	   convey information related to availability of physical
126	   infrastructures to systems that manage virtual infrastructures on top
127	   of them.

129	2.1.  Identification of physical resources

131	   In order to identify physical resources within the ALTO protocol, an
132	   appropriate endpoint address type has to be used.  The ALTO base
133	   protocol specification [I-D.ietf-alto-protocol] only defines IPv4 and
134	   IPv6 addresses, and establishes a process to register further types.
135	   In fact, IP addresses may be used to identify physical resources in
136	   many cases, e.g., the loopback address of a router, or the management
137	   address of a physical server.  For a discussion of VPNs and ALTO
138	   see [I-D.scharf-alto-vpn-service].

140	   TBD: discussion of further options for endpoint addresses.

142	2.2.  Classification of cost types and properties

144	   Information related to availability of physical resources may be of
145	   different fundamental natures, requiring different encodings and
146	   different update intervals.  This section itemizes several criteria.

148	2.2.1.  Static vs. dynamic facts vs. probabilities

150	   Information may be static facts that never change or change very
151	   infrequently.  For example: "Electrical power outlets A and B are
152	   both connected to circuit breaker F1".

154	   Information may also be more frequently changing, e.g.,
155	   "Uninterruptible power supply UPS1 is now running on battery power,
156	   82% capacity left".  TBD: further investigation and guidance is
157	   needed on the maximum update frequency that can reasonably be done
158	   using ALTO.

160	   Another type of information is statistical measures such as the
161	   average relative availability of a subsystem, e.g., the famous "five
162	   nines" (99.999%).

164	2.2.2.  Causality and Correlation

166	   Many initial incidents can cause a series of events, according to
167	   some kind of "failure propagation topology", which is independent of
168	   the IP network topology.  There may even be hierarchies.

170	   For example, "Servers S1 and S2 are connected via circuit breaker F1
171	   to uninterruptible power supply UPS1 while S3 and S4 are connected
172	   via F2 to UPS1" implies that a failure in S1 triggering F1 will also
173	   interrupt operation of S2.  Furthermore, shutting down S2, S3, and S4
174	   in case of a power grid failure could stretch UPS1's battery lifetime
175	   and thereby prolong S1's survivability time.  Similar considerations
176	   can be made for different kinds of problems, e.g., the impact of a
177	   fire.
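
   The power-supply example above can be sketched as a small
   dependency tree.  This is only an illustration under the assumption
   that each resource has a single upstream supplier; the names
   feeds_from, upstream(), affected_by(), and shared_root_causes() are
   hypothetical and not part of any ALTO specification.

```python
from typing import Dict, List, Optional, Set

# Child -> parent in the "failure propagation topology" of the example:
# S1 and S2 hang off circuit breaker F1, S3 and S4 off F2, both on UPS1.
feeds_from: Dict[str, Optional[str]] = {
    "S1": "F1", "S2": "F1",
    "S3": "F2", "S4": "F2",
    "F1": "UPS1", "F2": "UPS1",
    "UPS1": None,
}


def upstream(resource: str) -> List[str]:
    """Return the chain of suppliers above a resource, nearest first."""
    chain: List[str] = []
    parent = feeds_from[resource]
    while parent is not None:
        chain.append(parent)
        parent = feeds_from[parent]
    return chain


def affected_by(failed: str) -> Set[str]:
    """Return every resource that depends on the failed component."""
    return {r for r in feeds_from if failed in upstream(r)}


def shared_root_causes(a: str, b: str) -> Set[str]:
    """Return the components whose failure would hit both a and b."""
    return set(upstream(a)) & set(upstream(b))
```

   In this sketch, affected_by("F1") contains S1 and S2, matching the
   statement that a failure triggering F1 also interrupts S2, while S1
   and S3 share only UPS1 as a common root cause.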

179	   Modeling different risk types (e.g., power outage, fire, flooding,
180	   physical intruders) in their respective terminology would
181	   require the definition of many new data types.

183	   A more generic approach is to use an ALTO cost map as a matrix, which
184	   indicates the level of isolation against "fate sharing" of any two
185	   PIDs with respect to a given (physical) risk.  In other words, for
186	   every specific risk R the coefficients of that matrix could be
187	   calculated as

189	   C_R(x,y) = 1 - P( y fails due to R | x fails due to R )

191	   For example, if the risk type is "fire", then a coefficient of 0
192	   could mean "these two physical resources are in the same rack.  If
193	   one is on fire for any reason, the other one will almost inevitably
194	   fail within seconds, too.", a value of 0.3 could mean "the resources
195	   are in adjacent buildings" and 0.99999 could mean "these two
196	   resources are on different continents and only a natural disaster
197	   causing global destruction could disable both of them in one single
198	   event".
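
   As a sketch of how an orchestration manager might use such a matrix,
   the following code picks the two nodes with the highest mutual
   isolation, as in the "all but two nodes are shut down" scenario from
   the Introduction.  The node names and the values 0.9999 and 0.999
   are invented, and taking the minimum of C_R(x,y) and C_R(y,x) is an
   assumption made here to handle a possibly asymmetric matrix; the
   draft does not prescribe a selection rule.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

Matrix = Dict[str, Dict[str, float]]

# Hypothetical isolation coefficients C_R(x, y) for the risk "fire":
# n1 and n2 share a rack, n3 is in an adjacent building, n4 is far away.
isolation_fire: Matrix = {
    "n1": {"n2": 0.0, "n3": 0.3, "n4": 0.99999},
    "n2": {"n1": 0.0, "n3": 0.3, "n4": 0.9999},
    "n3": {"n1": 0.3, "n2": 0.3, "n4": 0.999},
    "n4": {"n1": 0.99999, "n2": 0.9999, "n3": 0.999},
}


def pair_isolation(c: Matrix, x: str, y: str) -> float:
    """Worst-case isolation of a pair, whichever node fails first."""
    return min(c[x][y], c[y][x])


def best_pair(c: Matrix, nodes: Iterable[str]) -> Tuple[str, str]:
    """Return the pair of nodes least likely to share the same fate."""
    return max(
        combinations(sorted(nodes), 2),
        key=lambda pair: pair_isolation(c, pair[0], pair[1]),
    )
```

   With the values above, best_pair(isolation_fire, isolation_fire)
   selects n1 and n4.  A real manager would likely have to combine
   several risk types, which this sketch does not attempt.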

200	   Note that these conditional properties only indicate how likely it is
201	   that the second resource will become unavailable due to the same
202	   event that disabled the first resource.  They do not indicate how
203	   likely it is that the event will actually occur.
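
   To make the distinction concrete: by the definition of conditional
   probability, the probability that both resources fail due to R is
   P(x fails due to R) * (1 - C_R(x,y)).  The sketch below computes
   this product.  The function name and the sample probability 0.001
   are assumptions made for illustration only.

```python
def joint_failure_probability(p_x_fails: float, isolation_xy: float) -> float:
    """Probability that x fails due to R and y then fails as well.

    p_x_fails    -- absolute probability that x fails due to risk R
    isolation_xy -- isolation coefficient C_R(x, y) from the matrix
    """
    if not 0.0 <= p_x_fails <= 1.0:
        raise ValueError("p_x_fails must be a probability")
    if not 0.0 <= isolation_xy <= 1.0:
        raise ValueError("isolation_xy must lie between 0 and 1")
    # P(x and y fail) = P(x fails) * P(y fails | x fails)
    return p_x_fails * (1.0 - isolation_xy)
```

   For instance, with an assumed P(x fails due to fire) of 0.001 and a
   coefficient of 0.3 ("adjacent buildings"), the joint probability is
   about 0.0007, whereas a coefficient of 0 leaves it at 0.001.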

205	   TBD: discuss to which extent a single "endpoint address to PID"
206	   network map is useful when considering different risk types.  The
207	   idea behind PIDs is to reduce map size by grouping topologically
208	   close endpoints, but the "failure propagation topologies" may be very
209	   unaligned for different risk types.  We will probably end up with
210	   many very small PIDs.
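
   The fragmentation effect can be illustrated with a small sketch.
   It assumes, purely for illustration, that each endpoint belongs to
   exactly one failure domain per risk type and that endpoints may
   share a PID only if they share the failure domain for every risk
   type considered.  The rack assignment and the name group_endpoints()
   are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Risk type -> endpoint -> failure domain (hypothetical assignment).
failure_domain: Dict[str, Dict[str, str]] = {
    "power": {"S1": "F1", "S2": "F1", "S3": "F2", "S4": "F2"},
    "fire": {"S1": "rack-A", "S2": "rack-B", "S3": "rack-A", "S4": "rack-B"},
}


def group_endpoints(
    domains: Dict[str, Dict[str, str]], risks: Sequence[str]
) -> List[List[str]]:
    """Group endpoints that share a failure domain for all given risks."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for endpoint in sorted(domains[risks[0]]):
        key = tuple(domains[risk][endpoint] for risk in risks)
        groups[key].append(endpoint)
    return sorted(groups.values())
```

   With this data, grouping by "power" alone or by "fire" alone yields
   two PIDs of two endpoints each, but a single network map that has to
   serve both risk types ends up with four PIDs of one endpoint each.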

212	3.  Specification of new Endpoint Address types

214	   TBD.

216	4.  Specification of new Cost and Property types

218	   TBD.

220	   We need: the "isolation level against fate sharing" matrix, and a list
221	   of risk types, in order to give the absolute probability of that risk
222	   for a given resource.

224	5.  Obtaining Availability Information

226	   For any ALTO information, it is important to consider whether the
227	   ALTO service can realistically discover that information, whether
228	   the distribution of that information is allowed, whether the data is
229	   useful, whether a client can get that information without excessive
230	   privacy concerns, and whether the information cannot easily be
231	   gathered in some other way.

233	   Availability-related parameters can refer both to properties of the
234	   network infrastructure (e.g., network resiliency mechanisms) and to
235	   non-networking effects (e.g., redundancy of power supply).  In both
236	   cases, an application typically cannot measure that information,
237	   either by passive monitoring or by active probing.  Yet, availability
238	   information and insight into the impact of incidents matter to many
239	   applications and can be an important criterion for resource
240	   selection decisions.  Since typical use cases would be limited to one
241	   administrative domain, privacy is not a major concern; in addition,
242	   the suggested correlation metrics provide an abstraction over the
243	   actual physical infrastructure.

245	   Gathering availability information may be more challenging than
246	   gathering, for instance, IP routing topologies, because it may
247	   require access to inventory databases.  Yet, within one domain, the
248	   organization that is responsible for the physical network topology
249	   may also take care of other parts of the physical infrastructure,
250	   such as the power supply or hardware installation.  An organization
251	   that operates an ALTO server for exposing network topology
252	   information could therefore also have access to other inventory
253	   data.  Hence, providing availability information to an ALTO server
254	   as described in this document is realistic.

256	6.  IANA Considerations

258	   TBD.

260	7.  Security Considerations

262	   TBD.

264	8.  References

266	8.1.  Normative References

268	   [I-D.ietf-alto-protocol]
269	              Alimi, R., Penno, R., and Y. Yang, "ALTO Protocol",
270	              draft-ietf-alto-protocol-25 (work in progress),
271	              January 2014.

273	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
274	              Requirement Levels", BCP 14, RFC 2119, March 1997.

276	8.2.  Informative References

278	   [I-D.roome-alto-pid-properties]
279	              Roome, B. and Y. Yang, "PID Property Extension for ALTO
280	              Protocol", draft-roome-alto-pid-properties-00 (work in
281	              progress), October 2013.

283	   [I-D.scharf-alto-vpn-service]
284	              Scharf, M., Gurbani, V., Soprovich, G., and V. Hilt, "The
285	              Virtual Private Network (VPN) Service in ALTO: Use Cases,
286	              Requirements and Extensions",
287	              draft-scharf-alto-vpn-service-01 (work in progress),
288	              July 2013.

290	   [RFC5693]  Seedorf, J. and E. Burger, "Application-Layer Traffic
291	              Optimization (ALTO) Problem Statement", RFC 5693,
292	              October 2009.

294	   [RFC6708]  Kiesel, S., Previdi, S., Stiemerling, M., Woundy, R., and
295	              Y. Yang, "Application-Layer Traffic Optimization (ALTO)
296	              Requirements", RFC 6708, September 2012.

298	Authors' Addresses

300	   Sebastian Kiesel
301	   University of Stuttgart Information Center
302	   Networks and Communication Systems Department
303	   Allmandring 30
304	   Stuttgart  70550
305	   Germany

307	   Email: ietf-alto@skiesel.de
308	   URI:   http://www.rus.uni-stuttgart.de/nks/

310	   Michael Scharf
311	   Alcatel-Lucent Bell Labs
312	   Lorenzstrasse 10
313	   Stuttgart  70435
314	   Germany

316	   Email: michael.scharf@alcatel-lucent.com
317	   URI:   www.alcatel-lucent.com/bell-labs