ALTO WG                                                         Q. Xiang
Internet-Draft                                    Tongji/Yale University
Intended status: Informational                                 H. Newman
Expires: January 4, 2018               California Institute of Technology
                                                             G. Bernstein
                                                         Grotto Networking
                                                                    H. Du
                                                     Tongji/Yale University
                                                                   K. Gao
                                                       Tsinghua University
                                                                 A. Mughal
                                                                 J. Balcas
                                        California Institute of Technology
                                                                  J. Zhang
                                                         Tongji University
                                                                   Y. Yang
                                                     Tongji/Yale University
                                                             July 3, 2017


         Resource Orchestration for Multi-Domain Data Analytics
          draft-xiang-alto-exascale-network-optimization-03.txt

Abstract

   Data-intensive analytics is entering the era of multi-domain,
   geographically distributed, collaborative computing, where different
   organizations contribute various resources to collaboratively
   collect, share and analyze extremely large amounts of data.
   Examples of this paradigm include the Compact Muon Solenoid (CMS)
   and A Toroidal LHC ApparatuS (ATLAS) experiments of the Large
   Hadron Collider (LHC) program.  Massive datasets continue to be
   acquired, simulated, processed and analyzed by globally distributed
   science networks in these collaborations.  Applications that manage
   and analyze such massive data volumes can benefit substantially
   from information about the networking, computing and storage
   resources at each member's site, and more directly from network-
   resident services that optimize and load-balance resource usage
   among multiple data transfers and analytics requests, achieving
   better utilization of multiple resources in clusters.

   The Application-Layer Traffic Optimization (ALTO) protocol can
   provide, via extensions, network information about different
   clusters/sites to both users and proactive network management
   services where applicable, with the goal of improving both
   application performance and network resource utilization.  In this
   document, we propose that it is feasible to use existing ALTO
   services to provide not only network information, but also
   information about computation and storage resources in data
   analytics networks.
   We introduce a unified resource orchestration framework (Unicorn),
   which achieves efficient multi-resource allocation to support low-
   latency dataset transfers and data-intensive analytics for
   collaborative computing.  It collects cluster information from
   multiple ALTO services utilizing topology extensions, and leverages
   emerging SDN control capabilities to orchestrate the resource
   allocation for dataset transfers and analytics tasks, leading to
   improved transfer and analytics latency as well as more efficient
   utilization of multiple resources at each site.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on January 4, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Changes Since Version -02
   4.  Problem Settings
     4.1.  Motivation
     4.2.  Challenges
   5.  Basic Idea
     5.1.  Use ALTO services to provide multi-resource information
     5.2.  Example 1: network information impacts data analytics
           scheduling
     5.3.  Example 2: encode multi-resources in abstract network
           elements
   6.  Key Issues
   7.  Unified Resource Orchestration Framework
     7.1.  Architecture
     7.2.  Workflow Converter
       7.2.1.  User API
     7.3.  Resource Demand Estimator
     7.4.  Entity Locator
     7.5.  ALTO Client
       7.5.1.  Query Mode
     7.6.  ALTO Server
     7.7.  Resource View Extractor
     7.8.  Execution Agents
     7.9.  Multi-Resource Orchestrator
       7.9.1.  Orchestration Algorithms
       7.9.2.  Online, Dynamic Orchestration
       7.9.3.  Example: A Max-Min Fairness Resource Allocation
               Algorithm
   8.  Discussion
     8.1.  Deployment
     8.2.  Benefiting From ALTO Extension Services
     8.3.  Limitations of the MFRA Algorithm
   9.  Security Considerations
   10. IANA Considerations
   11. Acknowledgments
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   As the data volume increases exponentially over time, data-
   intensive analytics is transitioning from single-domain computing
   to multi-organizational, geographically distributed, collaborative
   computing, where different organizations contribute various
   resources, e.g., computation, storage and networking resources, to
   collaboratively collect, share and analyze extremely large amounts
   of data.  One leading example is the Large Hadron Collider (LHC)
   high energy physics (HEP) program, which aims to find new particles
   and interactions in a previously inaccessible range of energies.
   The scientific collaborations that have built and operate large HEP
   experimental facilities at the LHC, such as the Compact Muon
   Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have
   more than 300 petabytes of data under management at hundreds of
   sites around the world, and this volume is expected to grow to one
   exabyte by approximately 2018.

   With such an increasing data volume, how to manage the storage and
   analysis of these data in a globally distributed infrastructure has
   become an increasingly challenging issue.  Applications such as the
   Production ANd Distributed Analysis system (PanDA) in ATLAS and the
   Physics Experiment Data Export system (PhEDEx) in CMS have been
   developed to manage the data transfers among different cluster
   sites.  Given a data transfer request, these applications make data
   transfer decisions based on the availability of dataset replicas at
   different sites and initiate retransmission from a different
   replica if the original transmission fails or is excessively
   delayed.  HTCondor [HTCondor] is deployed to achieve coarse-grained
   data analytics parallelization across these sites.  When a data
   analytics task is submitted, HTCondor adopts a match-making process
   to assign the task to a certain set of servers in one site, based
   on a coarse-grained description of resource availability, such as
   the number of cores, the size of memory, the size of hard disk,
   etc.
   However, neither dataset transfers nor data analytics task
   parallelization takes fine-grained information about cluster
   resources, such as data locality, memory speed, network delay,
   network bandwidth, etc., into account, leading to high data
   transfer and analytics latency and underutilization of cluster
   resources.

   The Application-Layer Traffic Optimization (ALTO) services defined
   in [RFC7285] provide network information with the goal of improving
   network resource utilization while maintaining or improving
   application performance.  Though ALTO is not designed to provide
   information about other resources, such as the computing and
   storage resources in cluster networks, in this document we propose
   that exascale science networks can leverage the existing ALTO
   services defined in [RFC7285] and the ALTO topology extension
   services defined in network graph [DRAFT-NETGRAPH], path vector
   [DRAFT-PV], routing state abstraction [DRAFT-RSA], multi-cost
   [DRAFT-MC], cost calendar [DRAFT-CC], etc., to encode information
   about multiple types of resources in science networks, such as
   memory I/O speed, CPU utilization and network bandwidth, and
   provide such information to orchestration applications to improve
   the performance, including throughput and latency, of dataset
   transfer and data analytics tasks.

   This document introduces a unified resource orchestration framework
   (Unicorn), which provides efficient multi-resource allocation to
   support low-latency, multi-domain, geo-distributed data analytics.
   Unicorn provides a set of simple APIs for authorized users to
   submit, update and delete dataset transfer requests and data-
   intensive analytics requests.  One important proposal we make in
   this document is that it is feasible to use ALTO services to
   provide not only network information, but also information about
   other resources, including computing and storage, in multi-domain,
   geo-distributed analytics networks.

   A prototype of Unicorn with the dataset transfer scheduling
   component has been implemented on a single-domain Caltech SDN
   development testbed, where the ALTO OpenDaylight controller is used
   to collect topology information.  We are currently designing the
   resource orchestration components to achieve low-latency data-
   intensive analytics.

   This document is organized as follows: Section 3 summarizes the
   changes in this document since version -02.  Section 4 elaborates
   on the motivation and challenges of coordinating storage, computing
   and network resources in a globally distributed science network
   infrastructure.  Section 5 discusses the basic idea of encoding
   multi-resource information into the ALTO path vector and
   abstraction services and gives examples.  Section 6 lists several
   key issues to address in order to realize the proposal of providing
   multi-resource information through ALTO topology services.
   Section 7 gives the details of the Unicorn architecture for multi-
   domain, geo-distributed data analytics.  Section 8 discusses the
   current development progress of Unicorn and next steps.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].
3.  Changes Since Version -02

   o  Add an example in Section 5 to show the importance of network
      information in resource allocation for data analytics.

   o  Update the architecture of Unicorn in Section 7, i.e., add the
      entity locator and rename the ANE aggregator to the resource
      view extractor.

   o  Add a detailed description of how the entity locator and the
      resource view extractor work.

   o  Minor changes in the abstract and discussion sections.

4.  Problem Settings

4.1.  Motivation

   Multi-domain, geo-distributed data analytics usually involves
   participating sites from countries all over the world.  Science
   programs such as the CMS and ATLAS experiments at CERN are typical
   examples.  The site located at the LHC laboratory is called the
   Tier-0 site.  It processes the data selected and recorded in real
   time by the online systems as the data comes off the particle
   detector, archives it, and transfers it to more than 10 Tier-1
   sites around the globe.  Raw datasets and processed datasets from
   Tier-1 sites are then transferred to over 160 Tier-2 sites around
   the world based on users' requests.  Different sites have different
   resources and belong to different administrative domains.  With the
   exponentially increasing data volume in the CMS experiment, the
   management of large data transfers and data-intensive analytics in
   such a global multi-domain science network has become an
   increasingly challenging issue.  Allocating resources in different
   clusters to fulfill different users' dataset transfer and data
   analytics requests requires careful orchestration, as different
   requests compete for limited storage, computation and network
   resources.

4.2.  Challenges

   Orchestrating exascale dataset transfers and analytics in a
   globally distributed science network is non-trivial, as it needs to
   cope with two challenges.

   o  Different sites in this network belong to different
      administrative domains, and sharing raw site/cluster information
      would violate sites' privacy constraints.  At the same time,
      orchestrating data transfer and analytics requests based on
      highly abstracted, non-real-time network information may lead to
      suboptimal scheduling decisions.  Hence the orchestration
      framework must be able to collect sufficient resource
      information about different clusters/sites, in real time as well
      as over the longer term, to allow reasonably optimized network
      resource utilization without violating sites' privacy
      requirements.

   o  Different science programs tend to adopt different software
      infrastructures for managing dataset transfers and analytics,
      and may impose different requirements.  Hence the orchestration
      framework must be modular, so that it can support different
      dataset management systems and different orchestration
      algorithms.

   The orchestration framework must support the interaction between
   the multi-resource orchestration module, the dataset transfer
   module and the data analytics execution module.  The key
   information to be exchanged between modules includes dataset
   information, the resource state of different clusters and sites,
   the transfer and analytics requests in progress, as well as trends
   and network-segment and site performance from the network point of
   view.
   Such interaction ensures that (1) the various programs can adapt
   their own data transfer and analytics systems to be multi-resource-
   aware, and more efficient in achieving their goals; and (2) the
   various orchestration algorithms can achieve reasonably optimized
   utilization of not only network resources but also computing and
   storage resources.

5.  Basic Idea

5.1.  Use ALTO services to provide multi-resource information

   The ALTO protocol is designed to provide network information to
   applications, so that applications can achieve better performance
   and the network can achieve more efficient resource utilization.
   Different ALTO topology services, including path vector, routing
   state abstraction, multi-cost and cost calendar, have been proposed
   to provide fine-grained network information to applications.  In
   this document, we propose that not only can ALTO provide network
   information about different cluster sites, it can also provide
   information about multiple resources, including computing and
   storage resources.  To this end, the basic "one-big-switch"
   abstraction provided by the base ALTO protocol is not sufficient;
   several examples demonstrating this have already been given in
   [DRAFT-PV] and [DRAFT-RSA].  There has been a similar earlier
   proposal to use ALTO to provide resource information for data
   centers [DRAFT-DC].  However, that proposal requires a new
   information model for clusters or data centers, which may affect
   the compatibility of ALTO.  The solution in this document is
   simpler.  Its basic idea is that each computing node and storage
   node can be seen as an "abstract network element" as defined in the
   ALTO path-vector extension [DRAFT-PV].  In this way, Unicorn can
   fully reuse all existing ALTO services by introducing only one cost
   mode (pv) and two cost metrics (ne and ane), instead of introducing
   a new information model.

5.2.  Example 1: network information impacts data analytics scheduling

          .-----------.                .-----------.
          |  .-----.  |                |  .-----.  |
          |  | eh1 |  |                |  | eh3 |  |
          |  '-----'  |                |  '-----'  |
          |     |     |                |     |     |
          |     |     |                |     |     |
          |  .-----.  |                |  .-----.  |
          |  | eh2 |  |                |  | eh4 |  |
          |  '-----'  |                |  '-----'  |
          '-----------'                '-----------'
             Site 1                       Site 2
      distance(eh1, eh2) = 2      distance(eh3, eh4) = 2

          Figure 1: An Example of Data Locality Information

   We first use the example in Figure 1 to show that network
   information alone has a huge impact on resource allocation for data
   analytics.  In this scenario, a MapReduce task needs to be
   executed.  The input data has two copies, at end hosts eh1 and eh3,
   respectively, and end hosts eh2 and eh4 are the corresponding
   candidate computation nodes, with the same computation power.  From
   the view of a common data analytics resource management framework
   such as Hadoop, both allocation options, i.e., eh1->eh2 and
   eh3->eh4, have a storage-computation distance of 2, i.e., they have
   the same data locality.  As a result, it appears that both options
   would provide the same performance for this task.

   However, assume that the bandwidth between eh1 and eh2 is 100 Mb/s
   while that between eh3 and eh4 is 1 Gb/s.  These significantly
   different data access speeds mean that the two options yield very
   different performance for the same task, and the only optimal
   option is to allocate the task to eh3->eh4.  This example
   demonstrates that network information is imperative for making
   efficient resource allocation decisions.  Such information is not
   provided by Hadoop or by similar and related projects such as
   Mesos.  In contrast, if ALTO servers are deployed at these sites,
   applications can retrieve such information via ALTO queries.
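   The gap is easy to quantify.  The following back-of-the-envelope
   sketch (in Python; the 100 GB input size is a hypothetical figure
   of ours, not taken from any experiment) computes the time to stage
   the input over each candidate path:

      # Time to stage a hypothetical 100 GB input over each
      # candidate path of Figure 1 (rates in Gb/s).
      size_gbit = 100 * 8          # 100 GB = 800 gigabits
      print(size_gbit / 0.1)       # eh1 -> eh2 at 100 Mb/s: 8000 s
      print(size_gbit / 1.0)       # eh3 -> eh4 at 1 Gb/s:    800 s

   A tenfold difference in staging time, invisible to a scheduler
   that considers data locality alone.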
5.3.  Example 2: encode multi-resources in abstract network elements

   We use the same dumbbell topology as in [DRAFT-RSA] to show the
   feasibility of using ALTO topology services to provide multi-
   resource information.  In this topology, we assume the bandwidth of
   each network cable is 1 Gbps, including the cables connecting end
   hosts to switches.  Consider a dataset transfer request that needs
   to schedule traffic among a set of end-host source-destination
   pairs, say eh1 -> eh2 and eh3 -> eh4.  Assume that the transfer
   application learns from the ALTO Cost Map service that both
   eh1 -> eh2 and eh3 -> eh4 have bandwidth 1 Gbps.  As shown in
   [DRAFT-RSA], whether each of the two traffic flows can actually
   receive 1 Gbps depends on whether the routes of the two flows share
   a bottleneck link.  The path vector and routing state abstraction
   services provide additional information about network state,
   encoded in abstract network elements: if the returned state is
   ane1 + ane2 <= 1 Gbps, the two flows cannot each get 1 Gbps at the
   same time; if the returned state is ane1 <= 1 Gbps and
   ane2 <= 1 Gbps, each flow can get 1 Gbps.

                                +------+
                                |      |
                              --+ sw6  +--
                             /  |      |  \
      PID1  +-----+         /   +------+   \         +-----+  PID2
      eh1___|     |_       /                 \      _|     |___eh2
            | sw1 | \   +------+          +------+ / | sw2 |
            +-----+  \__|      |          |      |/  +-----+
                      __| sw5  +----------+ sw7  |__
      PID3  +-----+  /  |      |          |      |  \ +-----+  PID4
      eh3___|     |_/   +------+          +------+   \|     |___eh4
            | sw3 |                                   | sw4 |
            +-----+                                   +-----+

                 Figure 2: A Dumbbell Network Topology

   Beyond network resources, assume that in this topology eh1 and eh3
   are equipped with commodity hard disk drives (HDDs) while eh2 and
   eh4 are equipped with SSDs, and that the bandwidth of an HDD is
   typically 0.8 Gbps while that of an SSD is typically 3 Gbps.  Even
   if the returned routing state is ane1 <= 1 Gbps and ane2 <= 1 Gbps,
   the actual bottleneck of each traffic flow is the storage I/O
   bandwidth at its source host.  As a result, the total bandwidth of
   the two traffic flows can only reach 1.6 Gbps.

   It has been verified in the CMS experiment, as well as in several
   studies of commercial data centers, that network resources are not
   always the bottleneck of large dataset transfers and data
   analytics.  Many have reported that storage and computing resources
   become the bottleneck in a fairly large fraction of dataset
   transfers and data analytics tasks in science networks and
   commercial data centers.

   In this example, if we treat the end hosts as abstract network
   elements, the storage I/O bandwidth of each host can also be
   encoded as an abstract element in the path vector.  Under the
   storage and route settings above, the returned cluster state would
   be ane1 <= 0.8 Gbps and ane2 <= 0.8 Gbps, which provides a more
   accurate capacity region for the requested traffic flows.
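   To make the capacity-region interpretation concrete, the following
   sketch (Python with numpy and scipy assumed available; the
   constraint matrix is hand-written from this example, not produced
   by any ALTO library) maximizes the total throughput of the two
   flows under the returned state ane1 <= 0.8 Gbps and
   ane2 <= 0.8 Gbps:

      import numpy as np
      from scipy.optimize import linprog

      # Flows: x[0] = eh1 -> eh2, x[1] = eh3 -> eh4 (rates in Gbps).
      A = np.array([[1.0, 0.0],    # flow 1 crosses ane1 (eh1's HDD)
                    [0.0, 1.0]])   # flow 2 crosses ane2 (eh3's HDD)
      b = np.array([0.8, 0.8])     # returned ANE capacities

      # linprog minimizes, so negate to maximize x[0] + x[1].
      res = linprog(c=[-1.0, -1.0], A_ub=A, b_ub=b, bounds=(0, None))
      print(-res.fun)              # 1.6 Gbps, matching the analysis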
6.  Key Issues

   The previous section described the basic idea of using ALTO
   topology services to provide multi-resource information and gave
   examples to demonstrate its feasibility.  Next we list and discuss
   several key issues to address in this proposal.

   o  Can ALTO topology services provide data locality information?
      Existing ALTO topology services do not provide such information,
      yet many studies have pointed out that it plays a vital role in
      reducing the latency of data-intensive analytics.  If ALTO
      topology services can encode such information together with
      information about other resources, data-intensive applications
      can benefit a great deal in terms of information aggregation and
      communication overhead.

   o  How can applications' resource allocation decisions, made on the
      abstract multi-resource view, be quickly mapped back to the
      physical multi-resource view of clusters/sites (see the sketch
      at the end of this section)?  Fine-grained resource information
      can be encoded into abstract network elements to reduce overhead
      and provide a certain level of privacy protection for clusters.
      Such information can be highly compressed (see the dumbbell
      example used in this document as well as in [DRAFT-PV] and
      [DRAFT-RSA]); in preliminary evaluations of RSA, the network
      element compression ratio can be as high as 80 percent, and is
      expected to be even higher in large-scale data center or cluster
      settings, e.g., a fat-tree topology with k=48.  Therefore a fast
      mapping from resource orchestration decisions on the abstract
      view back to the physical view is needed to satisfy the
      stringent latency requirements of large dataset transfers and
      data-intensive analytics.

   o  How much private information, including key resource
      configurations, raw topology, intra-cluster scheduling policies,
      etc., will be exposed?  Compared with the "one-big-switch"
      abstraction, ALTO topology services such as path vector
      [DRAFT-PV] and routing state abstraction [DRAFT-RSA] provide
      fine-grained resource information to applications.  Even if such
      information is encoded into abstract network elements, it still
      risks exposing private information of different clusters/sites.
      The current Internet-Drafts of these services do not provide any
      formal privacy analysis or performance measurement; this is one
      of the key issues we plan to investigate in the future.

   o  How do current ALTO services such as path vector and RSA scale
      when used to provide abstract information about multiple
      resources in clusters?  A related issue is how to balance the
      freshness of fine-grained resource information against the
      corresponding information delivery overhead.  Although encoding
      network elements into abstract network elements can achieve a
      very competitive compression ratio, large dataset transfer and
      analytics applications always involve many network elements in
      multiple clusters/sites, and the absolute number of involved
      network elements keeps increasing as clusters grow.  In
      addition, when resource information in a cluster changes, the
      ALTO services need to inform all related applications.  In
      either case, delivering fine-grained resource information would
      cause high communication overhead.  There is still no analytical
      or experimental understanding of the scalability of the path-
      vector and RSA services.
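   For the second issue, one plausible shape of an answer is a
   private, per-site translation table from each abstract network
   element back to the physical elements it summarizes.  The sketch
   below (in Python; all identifiers are illustrative and not part of
   any ALTO message) shows the direction of such a mapping:

      # Kept private at the site; never exposed to applications.
      ane_to_physical = {
          "ane1": ["link:sw1-sw5", "disk:eh1-hdd"],
          "ane2": ["link:sw3-sw5", "disk:eh3-hdd"],
      }

      def to_physical(allocation):
          """Translate an abstract allocation {ane: amount} from the
          orchestrator into per-physical-element reservations."""
          return {elem: amount
                  for ane, amount in allocation.items()
                  for elem in ane_to_physical[ane]}

      print(to_physical({"ane1": 0.8}))
      # {'link:sw1-sw5': 0.8, 'disk:eh1-hdd': 0.8}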
7.  Unified Resource Orchestration Framework

7.1.  Architecture

   This section describes the design details of the key components of
   the Unicorn framework: the workflow converter, the resource demand
   estimator, the ALTO clients, the ALTO servers, the resource view
   extractor, the multi-resource orchestrator and the execution
   agents.  Figure 3 shows the architecture of Unicorn.  The overall
   process is as follows.

                             .---------.
                             |  Users  |
                             '---------'
                                  | 1
    .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - -.
    |                  .--------------------.                   |
    | Unicorn          | Workflow Converter |                   |
    |                  '--------------------'                   |
    |                             | 2                           |
    |              .-----------------------------.              |
    |              |  Resource Demand Estimator  |              |
    |              '-----------------------------'              |
    |                             | 3                           |
    |              .-----------------------------.              |
    |              | Multi-Resource Orchestrator |              |
    |              '-----------------------------'              |
    |          /         |      | 10     |          \           |
    |       9 /          | 4    |        | 4         \ 9        |
    | .-------------. .-------. |  .-------. .-------------.    |
    | |Resource View| |Entity | |  |Entity | |Resource View|    |
    | |  Extractor  | |Locator| |  |Locator| |  Extractor  |    |
    | '-------------' '-------' |  '-------' '-------------'    |
    |        | 8         | 5    |      | 5          | 8         |
    | .----------------.   .-----------.   .----------------.   |
    | | ALTO Client(s) |.--| Execution |   | ALTO Client(s) |   |
    | '----------------'|  |  Agents   |   '----------------'   |
    |        | 6        |  '-----------'          | 6           |
    | .----------------.|   /        \    .----------------.    |
    | | ALTO Server(s) ||  / 11    11 \   | ALTO Server(s) |    |
    | '----------------'| /            \  '----------------'    |
    |        | 7        |/              \         | 7           |
    | .----------------./               .----------------.      |
    | |     Site 1     |     . . .      |     Site N     |      |
    | '----------------'                '----------------'      |
    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

                  Figure 3: Architecture of Unicorn

   o  STEP 1: Authorized users submit high-level data analytics
      workflows to Unicorn through a set of simple APIs.

   o  STEP 2: The workflow converter transforms the high-level data
      analytics workflows into low-level task workflows, i.e., a set
      of analytics tasks with precedence encoded in a directed acyclic
      graph (DAG); a minimal sketch of such a DAG is given after this
      list.

   o  STEP 3: The resource demand estimator automatically finds the
      optimal configuration (resource demand) of each task, i.e., the
      number of CPUs, the size of memory and disk, I/O bandwidth, etc.

   o  STEP 4: The multi-resource orchestrator receives the resource
      demands of a set of tasks and sends them to the entity locator
      at each site.

   o  STEP 5: The entity locator at each site retrieves the entity
      addresses of the end hosts that would be allocated to the tasks
      to be scheduled, passes these addresses to the ALTO clients, and
      asks the ALTO clients to collect information about these end
      hosts, i.e., the properties of the corresponding computing,
      storage and networking resources.

   o  STEP 6: The ALTO clients issue ALTO queries defined in the base
      ALTO protocol [RFC7285], e.g., EPS, ECS, Network Map, etc., and
      in ALTO extension services, e.g., routing state abstraction
      (RSA) [DRAFT-RSA], path vector [DRAFT-PV], network graph
      [DRAFT-NETGRAPH], multi-cost [DRAFT-MC], cost calendar
      [DRAFT-CC] and property map [DRAFT-PM], to collect resource
      information.

   o  STEP 7: The ALTO servers at each site accept the queries from
      the ALTO clients, collect resource information from their
      residing site and send it back to the ALTO clients.

   o  STEP 8: The ALTO clients send the responses from the ALTO
      servers to the resource view extractor.

   o  STEP 9: The extractor uses a lightweight, optimal algorithm to
      compress the raw resource information provided by the ALTO
      servers into a minimal, equivalent view of resources and sends
      it back to the multi-resource orchestrator.

   o  STEP 10: The orchestrator makes resource allocation decisions,
      e.g., dataset transfer scheduling and analytics task placement,
      based on the resource demands of analytics tasks and the
      resource supply sent back by the resource view extractor.
      Decisions are then sent to the execution agents deployed at the
      corresponding sites.

   o  STEP 11: The execution agents receive and execute instructions
      from the multi-resource orchestrator.  They also monitor the
      status of different tasks and send the updated status to the
      multi-resource orchestrator.
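   As an illustration of the output of STEPs 2 and 3, the sketch below
   shows one possible in-memory form of a low-level task workflow: a
   DAG of tasks with per-task resource demands.  The task names and
   demand fields are ours for illustration; Unicorn does not mandate
   this schema.

      workflow = {
          "tasks": {
              # per-task resource demand estimated in STEP 3
              "ingest": {"cpus": 2, "mem_gb": 8,  "io_gbps": 0.8},
              "map_1":  {"cpus": 8, "mem_gb": 32, "io_gbps": 1.0},
              "map_2":  {"cpus": 8, "mem_gb": 32, "io_gbps": 1.0},
              "reduce": {"cpus": 4, "mem_gb": 64, "io_gbps": 1.0},
          },
          # precedence: task -> tasks that must complete first
          "deps": {"map_1": ["ingest"], "map_2": ["ingest"],
                   "reduce": ["map_1", "map_2"]},
      }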
   Unicorn provides a unified, automatic solution for multi-domain,
   geo-distributed data analytics.  In particular, its benefits
   include:

   o  On the resource demand side, it provides a set of simple APIs
      for authorized users to submit and manage data analytics
      requests and enables real-time monitoring of request status.  It
      automatically converts high-level analytics workflows into low-
      level task workflows and finds the optimal configuration for
      each task.

   o  On the resource supply side, it collects the resource
      information of different sites through a common, REST-based
      interface specified by the ALTO protocol, encodes such
      information into different entity abstractions and computes a
      minimal yet accurate view of the resource supply dynamics.

   o  It provides a scalable multi-resource orchestrator that makes
      efficient resource allocation decisions to achieve high resource
      utilization and low-latency data analytics.

   o  The architecture of Unicorn is modular, supporting different
      resource orchestration algorithms and the deployment of
      different ALTO servers.

7.2.  Workflow Converter

   The workflow converter is the front end of Unicorn.  It is
   responsible for collecting high-level data analytics workflows from
   users and transforming them into low-level task workflows, e.g.,
   HTCondor ClassAds.  It provides a set of simple APIs for users to
   submit and manage requests, and to track the status of requests in
   real time.

7.2.1.  User API

   o  submitReq(request, [options])

      This API allows users to submit a request and specify
      corresponding options.  The request can be a data transfer
      request or a data analytics request.  Request options include
      priority, delay, etc.  It returns a request identifier reqID
      that allows users to update or delete this request, or to track
      its status.  The requested options may or may not be approved,
      and the relative priorities may be modified by the resource
      orchestrator depending on the role of the user (regular user or
      administrator at some level), resource availability and the
      status of other ongoing requests.

   o  updateReq(requestID, [options])

      This API allows users to update the options of a request.  It
      returns SUCCESS if the new options are received by the request
      parser, but the new options may or may not be approved, and may
      be modified by the resource orchestrator depending on the role
      of the user (regular user or administrator), resource
      availability and the status of other ongoing requests.

   o  deleteReq(requestID)

      This API allows users to delete a request by passing the
      corresponding requestID.  A completed request cannot be deleted.
      An ongoing request will be stopped and its output data will be
      deleted.

   o  getReqStatus(requestID)

      This API allows users to query the status of a request by
      specifying the corresponding requestID.  The returned status
      information includes whether the request has started, the
      assigned priority, the percentage of finished sub-requests,
      transmission statistics, the expected remaining time to finish,
      etc.
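   This document defines the API abstractly and does not fix a
   concrete binding.  As one possible shape, the sketch below renders
   the four calls as a thin REST client in Python; the requests
   library, the base URL and the resource paths are all our
   assumptions, not part of Unicorn.

      import requests

      BASE = "https://unicorn.example/api"   # hypothetical endpoint

      def submit_req(request, options=None):
          """submitReq: returns the reqID used by the other calls."""
          r = requests.post(f"{BASE}/requests",
                            json={"request": request,
                                  "options": options or {}})
          return r.json()["reqID"]

      def update_req(req_id, options):
          """updateReq: accepted options may still be modified later
          by the resource orchestrator."""
          return requests.put(f"{BASE}/requests/{req_id}",
                              json=options).json()

      def delete_req(req_id):
          """deleteReq: completed requests cannot be deleted."""
          return requests.delete(f"{BASE}/requests/{req_id}").ok

      def get_req_status(req_id):
          """getReqStatus: progress, priority, remaining time, etc."""
          return requests.get(f"{BASE}/requests/{req_id}/status").json()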
7.3.  Resource Demand Estimator

   The estimator leverages the fact that low-level tasks are typically
   repetitive or highly similar.  It uses reinforcement learning to
   predict the optimal configuration for each task and passes the
   resource demand to the multi-resource orchestrator for further
   processing.

7.4.  Entity Locator

   The task configurations computed by the demand estimator carry no
   knowledge of the underlying topology of each site, i.e., the
   addresses of end hosts and network devices.  Hence they cannot be
   directly used by the ALTO clients for querying resource
   information.  The entity locator retrieves the entity addresses of
   the end hosts that would be allocated to the tasks to be scheduled,
   passes these addresses to the ALTO clients, and asks the ALTO
   clients to collect information about these end hosts, i.e., the
   properties of the corresponding computing, storage and networking
   resources.

7.5.  ALTO Client

   The ALTO client is the back end of Unicorn and is responsible for
   retrieving resource information by querying the ALTO servers
   deployed at different sites.  The resource information needed by
   Unicorn includes the topology, link bandwidth, computing node
   memory I/O speed, computing node CPU utilization, etc.  The base
   ALTO protocol [RFC7285] provides an extreme single-node abstraction
   of this information, which only allows the multi-resource
   orchestrator to make coarse-grained resource allocation decisions.
   To enable fine-grained multi-resource orchestration for dataset
   transfer and data analytics in cluster networks, ALTO topology
   extension services such as routing state abstraction (RSA)
   [DRAFT-RSA], path vector [DRAFT-PV], network graph
   [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost calendar
   [DRAFT-CC] are needed to provide fine-grained information about the
   different types of resources in clusters.

7.5.1.  Query Mode

   The ALTO client should operate in different query modes depending
   on the implementation of the ALTO servers.  If an ALTO server does
   not support incremental updates using server-sent events (SSE)
   [DRAFT-SSE], the ALTO client sends queries to this server
   periodically to get the latest resource information; if the cluster
   state changes between two queries, the ALTO client will not be
   aware of the change until the next query.  If an ALTO server
   supports SSE, the ALTO client sends only one query to get the
   initial cluster information; when the resource state changes, the
   ALTO client is notified by the ALTO server through SSE.
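   The two modes can be sketched as follows (Python; the URLs and
   resource paths are placeholders, and the SSE parsing is
   deliberately minimal rather than a full implementation of
   [DRAFT-SSE]):

      import json, time, requests

      ALTO = "https://alto.site1.example"   # hypothetical ALTO server

      def poll(handle, interval=30):
          """Periodic mode, used when the server lacks SSE support.
          State changes between polls go unnoticed until the next."""
          while True:
              handle(requests.get(ALTO + "/resourceview").json())
              time.sleep(interval)

      def subscribe(handle):
          """SSE mode: one initial query, then server-pushed updates."""
          with requests.get(ALTO + "/updates", stream=True) as resp:
              for line in resp.iter_lines(decode_unicode=True):
                  if line and line.startswith("data:"):
                      handle(json.loads(line[len("data:"):]))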
7.6.  ALTO Server

   ALTO servers are deployed at different sites around the world, and
   at strategic locations in the network itself, to provide
   information about the different types of resources in the cluster
   networks in response to queries from the ALTO clients.  Such
   information includes topology, link bandwidth, memory I/O speed and
   CPU utilization at computing nodes, storage constraints at storage
   nodes, etc.  Each ALTO server must provide the basic information
   services specified in [RFC7285], such as the network map, cost map
   and endpoint cost service (ECS).  To support fine-grained multi-
   resource allocation in Unicorn, each ALTO server should also
   provide more fine-grained information about the different resources
   in clusters through ALTO extension services such as the routing
   state abstraction [DRAFT-RSA], path vector [DRAFT-PV], network
   graph [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost calendar
   [DRAFT-CC] services.

7.7.  Resource View Extractor

   At each site, the resource information collected by the ALTO
   clients is not sent directly back to the orchestrator.  Instead, a
   resource view extractor compresses the resource information
   provided by the ALTO servers into a minimal, equivalent view of all
   the resources, i.e., the computing, storage and networking
   resources, that would be allocated to a set of tasks.  The
   extractor works in the following steps.

   o  Resource-Task Matrix

      Depending on the specific services provided by the ALTO servers,
      the responses collected by the ALTO clients may include
      information about different entities, e.g., endpoints, PIDs,
      ANEs, etc.  Each entity possesses a set of resources available
      to the tasks to be scheduled.  For each entity property p in the
      responses, such as bandwidth, delay, etc., the extractor
      composes an entity-task matrix M(p).  Each row of M(p)
      represents an entity in the responses that provides information
      about property p, and each column represents a task to be
      scheduled.  The element m(i, j) of M(p) is a variable
      representing the amount of property p of entity i that task j
      can get.

   o  Resource-Task Constraints

      For each property p, the extractor composes a series of
      resource-task constraints.  The first set of constraints is
      sum_i m(i, j) = r(p, j) for each task j.  These constraints
      calculate r(p, j), the total amount of property p provided to
      task j by all the entities.  The exact rule of "summation"
      depends on the property p: if p is latency, the summation is
      ordinary addition, but if p is bandwidth, the summation is a
      minimum function.

      The second set of constraints is sum_j m(i, j) <= r(p, i) for
      each entity i.  These constraints state that the usage of entity
      i on property p by all the tasks cannot exceed the capacity of
      entity i on this property.  The exact rule of summation again
      depends on the property p: if p is bandwidth, the summation is
      ordinary addition, but if p is latency, the constraints become
      m(i, j) = r(p, i) for each entity i and each task j, meaning
      that entity i provides the same delay to each task.

   o  Resource View Compression

      The whole set of resource-task constraints is linear and
      expresses the view of the resources available to the tasks to be
      scheduled.  The extractor uses a lightweight, optimal algorithm
      to compress them into a minimal, equivalent view of resources,
      i.e., a minimal set of linear constraints that represents the
      same feasible region as the original constraints.  The basic
      idea of this algorithm is described in [DRAFT-RSA].
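   As a concrete rendering of the matrix and constraint steps for an
   additive property such as bandwidth, the sketch below (Python with
   numpy; the variable encoding and example capacities are our
   choices) builds the per-entity capacity constraints
   sum_j m(i, j) <= r(p, i) in matrix form, ready for the compression
   step or an LP solver:

      import numpy as np

      def entity_capacity_constraints(cap, uses):
          """Build A, b such that A @ m <= b encodes, for each entity
          i, sum_j m(i, j) <= r(p, i) for an additive property p
          (e.g., bandwidth).  cap: length-E capacities r(p, i);
          uses: E x T 0/1 matrix (entity i can serve task j).  The
          variables m(i, j) are flattened row-major.  Latency-like
          properties would instead need the equality constraints
          described above."""
          E, T = uses.shape
          A = np.zeros((E, E * T))
          for i in range(E):
              for j in range(T):
                  if uses[i, j]:
                      A[i, i * T + j] = 1.0
          return A, np.asarray(cap, dtype=float)

      # Two ANEs from the dumbbell example, each serving two tasks:
      A, b = entity_capacity_constraints(
          cap=[0.8, 0.8], uses=np.ones((2, 2), dtype=int))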
7.8.  Execution Agents

   Execution agents are deployed at each site and are responsible for
   the following functions:

   o  Receive and process instructions from the multi-resource
      orchestrator, e.g., dataset transfer scheduling, data analytics
      task placement and execution, task updates and aborts, etc.

   o  Monitor the status of data analytics tasks and send the updated
      status to the multi-resource orchestrator.

   Depending on the supported data analytics frameworks, different
   execution agents may be deployed at each site.  For instance, in
   the CMS experiment at CERN, both MPI and Hadoop execution agents
   are deployed.

7.9.  Multi-Resource Orchestrator

   The multi-resource orchestrator receives the resource demand
   information, i.e., a set of low-level task workflows and their
   configurations, from the resource demand estimator.  It then asks
   the entity locator at each site for the addresses of the end hosts
   that would be allocated to the tasks to be scheduled, and asks the
   ALTO clients to query the properties related to these end hosts.
   When the resource view extractor sends the response back, the
   orchestrator makes resource allocation decisions, e.g., dataset
   transfer scheduling and analytics task execution, based on both the
   resource demand dynamics and the resource supply dynamics.  The
   dataset transfer scheduling decisions include dataset replica
   selection, path selection, bandwidth allocation, etc.  The
   analytics task execution decisions include which cluster should
   allocate how many resources to execute which tasks.  These
   decisions are sent to the execution agents at the different sites
   for execution.

7.9.1.  Orchestration Algorithms

   The modular design of Unicorn allows the adoption of different
   orchestration algorithms and methodologies, depending on the
   specific performance requirements.  In Section 7.9.3, a max-min
   fairness resource allocation algorithm for dataset transfer is
   described as an example.

7.9.2.  Online, Dynamic Orchestration

   The multi-resource orchestrator should adjust the resource
   allocation decisions based on the progress of ongoing requests and
   the utilization and dynamics of cluster resources.  In the normal
   case, the multi-resource orchestrator periodically collects such
   information and executes the orchestration algorithm.  When it is
   notified of events such as request status updates or cluster state
   updates, the orchestrator also executes the orchestration algorithm
   to adjust the resource allocations, as sketched below.
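   A minimal control-loop sketch of this behavior follows (Python; the
   collect_demand, collect_supply, orchestrate and dispatch hooks
   stand for the surrounding components and are not defined by this
   document):

      import queue

      events = queue.Queue()   # fed by agents and ALTO update streams

      def orchestration_loop(collect_demand, collect_supply,
                             orchestrate, dispatch, period=60):
          """Re-run the orchestration algorithm every `period`
          seconds, and immediately whenever an event arrives."""
          while True:
              try:
                  events.get(timeout=period)   # event-driven run
              except queue.Empty:
                  pass                         # periodic run
              demand = collect_demand()   # tasks + configurations
              supply = collect_supply()   # ALTO clients + extractor
              dispatch(orchestrate(demand, supply))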
7.9.3.  Example: A Max-Min Fairness Resource Allocation Algorithm

   In this section, we describe a max-min fair resource allocation
   (MFRA) scheduling algorithm, which aims to minimize the maximal
   time to complete a dataset transfer subject to a set of
   constraints.  To make resource allocation decisions, MFRA requires
   sufficient network information, including topology, link bandwidth
   and, in some cases, recent historical information.  In a small-
   scale single-domain network, an SDN controller can provide the raw,
   complete topology information for the MFRA algorithm.  However, in
   a large-scale multi-domain science network such as CMS, providing
   the raw network topology is infeasible because (1) it would incur
   significant communication overhead; and (2) it would violate the
   privacy constraints of some sites.  Several ALTO topology extension
   services, including abstract path vector [DRAFT-PV], network graphs
   [DRAFT-NETGRAPH] and RSA [DRAFT-RSA], can provide the fine-grained
   yet aggregated/abstract topology information that MFRA needs to
   efficiently utilize bandwidth resources in the network.

   Ongoing pre-production deployment efforts of Unicorn in the CMS
   network involve the implementation of the RSA service.  Besides
   topology information, the additional input to the MFRA algorithm is
   the priority of each class of flows, expressed in terms of upper
   and lower limits on the allocated bandwidth between the source and
   the destination of each data transfer request (DTR).

   The basic idea of the MFRA algorithm is to iteratively maximize the
   volume of data that can be transferred subject to the constraints.
   It works in quantized time intervals: it schedules the network
   paths and the data volumes to be transferred in each time slot.
   When the DTR scheduler is notified of events such as the
   cancellation of a DTR, the completion of a DTR or a network state
   change, the MFRA algorithm is also invoked to make updated network
   path and bandwidth allocation decisions.

   In each execution cycle, MFRA first marks all transfers as
   unsaturated.  Then it solves a linear programming model to find the
   common minimum transfer satisfaction rate (i.e., the ratio of the
   data volume transferred in a time interval to the whole data volume
   of the request) that can be satisfied by all transfer requests.
   With this common rate found, MFRA then randomly selects an
   unsaturated request in each iteration and increases its transfer
   rate as much as possible, by finding residual paths available in
   the network or by increasing the allocated bandwidth along an
   existing path, until the request reaches its upper limit or can
   otherwise not be increased further, at which point it is saturated.
   At each iteration, newly saturated requests are removed from the
   subsequent process by fixing their corresponding rate values, and
   completed transfers are removed from further consideration.  After
   all data transfer rates are saturated in the given time slot, a
   feasible set of data volumes scheduled to be transferred across
   each link in the network during the slot can be derived.

   The MFRA algorithm yields full utilization of limited network
   resources such as bandwidth, so that all DTRs can be completed in a
   timely manner.  It allocates network resources fairly, so that no
   DTR suffers starvation.  It also achieves load balance among the
   sites and the network paths crossing a complex network topology, so
   that no site and no network link is oversubscribed.  Moreover, MFRA
   can handle cases where particular routing constraints are
   specified, e.g., where all routes are fixed ahead of time, or where
   each transfer request may only use a single path in each time slot,
   by introducing an additional set of linear constraints.
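   For concreteness, the sketch below implements a simplified, single-
   path variant of this cycle in Python with numpy: rates grow in
   proportion to request volume, so all unsaturated requests keep a
   common satisfaction rate, until a resource constraint or an upper
   limit saturates them.  It omits path selection, lower limits and
   the explicit LP of the full algorithm; the water-filling form is
   our simplification, not the deployed code.

      import numpy as np

      def mfra_rates(A, b, volume, upper):
          """Max-min fair rates x for one time slot.  A @ x <= b are
          the linear resource constraints from the abstract view
          (path vector / RSA); volume weights the common satisfaction
          rate (assumed positive); upper gives per-request limits
          (assumed finite)."""
          n = A.shape[1]
          x = np.zeros(n)
          free = np.ones(n, dtype=bool)     # unsaturated requests
          while free.any():
              # Largest common increment t with x_i += t * volume_i
              # for every unsaturated request i.
              coef = A @ (free * volume)    # growth rate per row
              resid = b - A @ x             # residual capacities
              with np.errstate(divide="ignore", invalid="ignore"):
                  t_rows = np.where(coef > 1e-12, resid / coef, np.inf)
              t_up = ((upper - x)[free] / volume[free]).min()
              t = min(t_rows.min(), t_up)
              x[free] += t * volume[free]
              # Saturate requests at their upper limit or touching a
              # now-tight constraint row.
              tight = np.isclose(A @ x, b, atol=1e-9)
              in_tight = ((A[tight] > 0).any(axis=0) if tight.any()
                          else np.zeros(n, dtype=bool))
              newly = free & (np.isclose(x, upper) | in_tight)
              if not newly.any():
                  break                     # safety: no progress
              free &= ~newly
          return x

   Under the constraint set of Example 2 (Section 5.3), calling
   mfra_rates(np.eye(2), np.array([0.8, 0.8]), np.ones(2),
   np.ones(2)) returns 0.8 Gbps for each flow, the max-min fair point
   of that capacity region.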
8.  Discussion

8.1.  Deployment

   The Unicorn framework is a first step towards a new class of
   intelligent, SDN-driven global systems for multi-domain, geo-
   distributed data analytics involving a worldwide ensemble of sites
   and networks, such as CMS and ATLAS.  Unicorn relies heavily on
   ALTO services for collecting and expressing abstract, real-time
   resource information from the different sites, and on centralized
   SDN control capabilities to orchestrate data analytics workflows.
   It aims to provide a new operational paradigm in which science
   programs can use complex network and computing infrastructures with
   high throughput, while allowing for coexistence with other network
   traffic.

   A prototype case-study implementation of Unicorn has been
   demonstrated on the Caltech/StarLight/Michigan/Fermilab SDN
   development testbed.  Because this testbed is a single-domain
   network, the current Unicorn prototype leverages the ALTO
   OpenDaylight controller to collect topology information.  The CMS
   experiment is currently exploring pre-production deployments of
   Unicorn, looking towards future widespread production use.  To
   achieve this goal, it is imperative to collect sufficient resource
   information from the various sites in the multi-domain CMS network
   without causing any privacy leaks.  To this end, the ALTO RSA
   service [DRAFT-RSA] is under development.  Furthermore, as
   discussed next, other ALTO topology extension services can also
   substantially improve the performance of Unicorn.

8.2.  Benefiting From ALTO Extension Services

   The current ALTO base protocol [RFC7285] exposes network topology
   and endpoint properties using the extreme "my-Internet-view"
   representation, which abstracts a whole network as a single node
   with a set of access ports, where each port connects to a set of
   end hosts called endpoints.  Such an extreme abstraction leads to
   significant information loss on network topology [DRAFT-PV], which
   is the key information Unicorn needs to make dynamic scheduling and
   resource allocation decisions.  Though Unicorn can still allocate
   resources for data transfer and analytics requests on this abstract
   view, the resulting resource allocation decisions are suboptimal.
   Feeding the raw, complete network topology of each site to Unicorn
   is not desirable either: first, it would violate the privacy
   constraints of different sites; second, a raw network topology
   would significantly enlarge the problem space and the solution
   space of the orchestration algorithm, leading to long computation
   times.  Hence, Unicorn needs an ALTO topology service that provides
   just enough fine-grained topology information.

   Several ALTO topology extension services, including path vector
   [DRAFT-PV], network graphs [DRAFT-NETGRAPH], RSA [DRAFT-RSA] and
   property maps [DRAFT-PM], are potential candidates for providing
   fine-grained abstract network information to Unicorn.  In addition,
   we propose that these services can also be used to provide
   information about the computing and storage resources of different
   clusters/sites, by viewing each computing node and storage node as
   a network element or an abstract network element.  For instance,
   the path vector service supports the capacity region query, which
   accepts multiple concurrent data flows as input and returns
   information about the bottleneck resources, which could be a set of
   links, computing devices or storage devices, for the given set of
   concurrent flows.
   This information can be interpreted as a set of linear constraints
   by the multi-resource orchestrator, which can help data transfer
   and analytics requests better utilize the multiple types of
   resources in different clusters.

8.3.  Limitations of the MFRA Algorithm

   The first limitation of the MFRA algorithm is its computation
   overhead.  Executing MFRA involves solving linear programming
   problems repeatedly in every time slot.  The computation time is
   acceptable for small sets of dataset transfer requests, but may
   increase significantly when handling large sets of requests, e.g.,
   hundreds of transfer requests.  Current efforts towards addressing
   this issue include exploring the feasibility of incremental
   computation of scheduling policies, and reducing the problem scale
   by finding the minimal equivalent set of constraints of the linear
   programming model.  The latter approach can benefit substantially
   from the ALTO RSA service [DRAFT-RSA].

   The second limitation is that the current version of MFRA does not
   cover dataset replica selection.  Simply encoding replica selection
   as a set of binary constraints would significantly increase the
   computational complexity of the scheduling process.  Current
   efforts focus on finding efficient algorithms for dataset replica
   selection.

9.  Security Considerations

   This document does not introduce any privacy or security issues not
   already present in the ALTO protocol.

10.  IANA Considerations

   This document does not define any new media type or introduce any
   new IANA considerations.

11.  Acknowledgments

   The authors thank Shenshen Chen, Linghe Kong, Xiao Lin and Xin Wang
   for helpful discussions.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

12.2.  Informative References

   [DRAFT-CC] Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N.
              Schwan, "ALTO Cost Calendar", 2017.

   [DRAFT-DC] Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO
              Extensions for Collecting Data Center Resource
              Information", 2014.

   [DRAFT-MC] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
              ALTO", 2017.

   [DRAFT-NETGRAPH]
              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015.

   [DRAFT-PM] Roome, W. and Y. Yang, "Extensible Property Maps for the
              ALTO Protocol", 2015.

   [DRAFT-PV] Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Extension: Abstract Path Vector as a Cost
              Mode", 2015.

   [DRAFT-RSA]
              Gao, K., Wang, X., Yang, Y., and G. Chen, "ALTO
              Extension: A Routing State Abstraction Service Using
              Declarative Equivalence", 2015.

   [DRAFT-SSE]
              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
              Server-Sent Events (SSE)", 2015.

   [HTCondor] Thain, D., Tannenbaum, T., and M. Livny, "Distributed
              Computing in Practice: the Condor Experience", 2005.

   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel,
              S., Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
              "Application-Layer Traffic Optimization (ALTO) Protocol",
              RFC 7285, DOI 10.17487/RFC7285, September 2014,
              <https://www.rfc-editor.org/info/rfc7285>.
Authors' Addresses

   Qiao Xiang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: qiao.xiang@cs.yale.edu

   Harvey Newman
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: newman@hep.caltech.edu

   Greg Bernstein
   Grotto Networking
   Fremont, CA
   USA

   Email: gregb@grotto-networking.com

   Haizhou Du
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: duhaizhou@gmail.com

   Kai Gao
   Tsinghua University
   30 Shuangqinglu Street
   Beijing
   China

   Email: gaok12@mails.tsinghua.edu.cn

   Azher Mughal
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: azher@hep.caltech.edu

   Justas Balcas
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: justas.balcas@cern.ch

   Jingxuan Jensen Zhang
   Tongji University
   4800 Cao'an Hwy
   Shanghai  201804
   China

   Email: jingxuan.n.zhang@gmail.com

   Y. Richard Yang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: yry@cs.yale.edu