idnits 2.17.1 

draft-xiang-alto-multidomain-analytics-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 604 has weird spacing: '...Queries  v |...'

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (March 5, 2018) is 2243 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '11' on line 780

  -- Looks like a reference, but probably isn't: '49' on line 780

  -- Looks like a reference, but probably isn't: '95' on line 782

  -- Looks like a reference, but probably isn't: '34' on line 780

  -- Looks like a reference, but probably isn't: '58' on line 781

  -- Looks like a reference, but probably isn't: '22' on line 781

  -- Looks like a reference, but probably isn't: '75' on line 781

  -- Looks like a reference, but probably isn't: '25' on line 781

  -- Looks like a reference, but probably isn't: '50' on line 782

  -- Looks like a reference, but probably isn't: '69' on line 782

  -- Looks like a reference, but probably isn't: '89' on line 782


     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	ALTO WG                                                         Q. Xiang
3	Internet-Draft                                    Tongji/Yale University
4	Intended status: Informational                                     F. Le
5	Expires: September 6, 2018                                           IBM
6	                                                                 Y. Yang
7	                                                  Tongji/Yale University
8	                                                               H. Newman
9	                                      California Institute of Technology
10	                                                                   H. Du
11	                                                       Tongji University
12	                                                           March 5, 2018

14	 Unicorn: Resource Orchestration for Multi-Domain, Geo-Distributed Data
15	                               Analytics
16	             draft-xiang-alto-multidomain-analytics-01.txt

18	Abstract

20	   As the data volume increases exponentially over time, data analytics
21	   is transiting from a single-domain network to a multi-domain, geo-
22	   distributed network, where different member networks contribute
23	   various resources, e.g., computation, storage and networking
24	   resources, to collaboratively collect, share and analyze extremely
25	   large amounts of data.  Such a network calls for a resource
26	   orchestration framework that emphasizes the performance
27	   predictability of data analytics jobs, the high utilization of
28	   resources, and the autonomy and privacy of member networks.

30	   This document presents the design of Unicorn, a unified resource
31	   orchestration framework for multi-domain, geo-distributed data
32	   analytics, which uses the Application-Layer Traffic Optimization
33	   (ALTO) protocol as the key component for (1) allows member networks
34	   to provide accurate information on different types of resources; (2)
35	   keeps the private information of member networks; and (3) allows data
36	   analytics jobs to accurately describe their requirements of different
37	   types of resources.  As a part of Unicorn, an ALTO extension for
38	   privacy-preserving interdomain information aggregation is also
39	   presented.

41	Status of This Memo

43	   This Internet-Draft is submitted in full conformance with the
44	   provisions of BCP 78 and BCP 79.

46	   Internet-Drafts are working documents of the Internet Engineering
47	   Task Force (IETF).  Note that other groups may also distribute
48	   working documents as Internet-Drafts.  The list of current Internet-
49	   Drafts is at http://datatracker.ietf.org/drafts/current/.

51	   Internet-Drafts are draft documents valid for a maximum of six months
52	   and may be updated, replaced, or obsoleted by other documents at any
53	   time.  It is inappropriate to use Internet-Drafts as reference
54	   material or to cite them other than as "work in progress."

56	   This Internet-Draft will expire on September 6, 2018.

58	Copyright Notice

60	   Copyright (c) 2018 IETF Trust and the persons identified as the
61	   document authors.  All rights reserved.

63	   This document is subject to BCP 78 and the IETF Trust's Legal
64	   Provisions Relating to IETF Documents
65	   (http://trustee.ietf.org/license-info) in effect on the date of
66	   publication of this document.  Please review these documents
67	   carefully, as they describe your rights and restrictions with respect
68	   to this document.  Code Components extracted from this document must
69	   include Simplified BSD License text as described in Section 4.e of
70	   the Trust Legal Provisions and are provided without warranty as
71	   described in the Simplified BSD License.

73	Table of Contents

75	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
76	   2.  Requirements Language . . . . . . . . . . . . . . . . . . . .   4
77	   3.  Changes Since Version -00 . . . . . . . . . . . . . . . . . .   4
78	   4.  Characteristics of Multi-Domain, Geo-Distributed Data
79	       Analytics . . . . . . . . . . . . . . . . . . . . . . . . . .   4
80	     4.1.  Dynamic Data Analytics Workload . . . . . . . . . . . . .   5
81	     4.2.  Dynamic Resource Availability . . . . . . . . . . . . . .   5
82	   5.  Design Requirements . . . . . . . . . . . . . . . . . . . . .   6
83	   6.  Review of Resource Orchestration Designs for Data Analytics .   7
84	     6.1.  Centralized resource-graph-based orchestration  . . . . .   7
85	     6.2.  Centralized ClassAds-based orchestration  . . . . . . . .   7
86	     6.3.  Distributed opportunistic orchestration . . . . . . . . .   7
87	     6.4.  Inadequacy of Existing Designs for Multi-Domain, Geo-
88	           Distributed Data Analytics  . . . . . . . . . . . . . . .   8
89	   7.  Unicorn Design  . . . . . . . . . . . . . . . . . . . . . . .   8
90	     7.1.  Choosing ALTO as the Resource Information Model . . . . .   8
91	     7.2.  Architecture of Unicorn . . . . . . . . . . . . . . . . .   9
92	       7.2.1.  Three-Phase Resource Discovery  . . . . . . . . . . .  11
93	     7.3.  Example . . . . . . . . . . . . . . . . . . . . . . . . .  15
94	   8.  ALTO Extension: Privacy-Preserving Interdomain Information
95	       Aggregation for Resource Discovery  . . . . . . . . . . . . .  16

97	     8.1.  Extension Specification . . . . . . . . . . . . . . . . .  16
98	     8.2.  Example . . . . . . . . . . . . . . . . . . . . . . . . .  17
99	   9.  Discussion  . . . . . . . . . . . . . . . . . . . . . . . . .  18
100	     9.1.  Discovering the Domain-Paths Using a New Interdomain
101	           Routing Protocol  . . . . . . . . . . . . . . . . . . . .  18
102	   10. Security Considerations . . . . . . . . . . . . . . . . . . .  18
103	   11. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  19
104	   12. References  . . . . . . . . . . . . . . . . . . . . . . . . .  19
105	     12.1.  Normative References . . . . . . . . . . . . . . . . . .  19
106	     12.2.  Informative References . . . . . . . . . . . . . . . . .  19
107	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  20

109	1.  Introduction

111	   This document describes the design of Unicorn, a unified resource
112	   orchestration framework for large-scale data analytics in multi-
113	   domain, geo-distributed networks.  An important use case for such
114	   settings is the Large Hadron Collider (LHC) network, which consists
115	   of over 180 member networks all over the world, to support scientists
116	   to access multiple resources, e.g., computing, storage and networking
117	   resources, distributed in the member networks to conduct large-scale
118	   data analytics.  With more and more data being generated and stored
119	   in different geo-distributed member networks, network architects and
120	   administrators are exploring different designs for efficient resource
121	   orchestration in multi-domain, geo-distributed networks.

123	   The design presented in this document is based on the development and
124	   deployment experience of Unicorn in the CMS network, one of the
125	   largest scientific experiments in the LHC network.  The primary
126	   requirements of resource orchestration in such a multi-domain, geo-
127	   distributed environment are the performance predictability of various
128	   data analytics jobs, the high utilization of different types of
129	   resources, and the autonomy and privacy of resource owners, i.e.,
130	   member networks.

132	   Pre-production development and extensive testing have shown that the
133	   Application-Layer Traffic Optimization Protocol [RFC7285] is well
134	   suited as a fundamental component in Unicorn for providing a generic
135	   representation that (1) allows different types of data analytics jobs
136	   to accurately describe their resource requirements and (2) allows
137	   member networks to provide accurate information on different types of
138	   resources they own and at the same time maintain their privacies.
139	   This is in contrast with the state-of-the-art resource orchestration
140	   frameworks, such as HTCondor and Mesos, which either do not provide
141	   accurate networking information or expose all the private details of
142	   member networks.  This document elaborates on the design requirements
143	   of resource orchestration in multi-domain, geo-distributed networks
144	   that lead to this design choice and presents the details of Unicorn,
145	   including an ALTO extension for privacy-preserving, interdomain
146	   information aggregation.

148	   This document first gives an overview of the characteristics of
149	   multi-domain, geo-distributed data analytics.  Then, the design
150	   requirements for resource orchestration under such settings are
151	   summarized.  After reviewing existing designs and their limitations,
152	   this document gives the arguments for using ALTO as the generic
153	   representation for describing both resource requirements and the
154	   resource information and describes the design details of Unicorn.
155	   Finally, a privacy-preserving, interdomain extension of ALTO is
156	   presented.

158	2.  Requirements Language

160	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
161	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
162	   document are to be interpreted as described in [RFC2119].

164	3.  Changes Since Version -00

166	   o  Add an overview of the characteristics of multi-domain, geo-
167	      distributed data analytics.

169	   o  Based on the above characteristics, add the design requirements of
170	      the resource orchestration for multi-domain, geo-distributed data
171	      analytics.

173	   o  Add a review of existing resource orchestration system designs for
174	      data analytics systems.

176	   o  Update the architecture of the Unicorn system.

178	   o  Update the motivation of using ALTO as the key information model
179	      in Unicorn.

181	   o  Give the design details of the three-phase resource discovery in
182	      Unicorn.

184	   o  Design an ALTO extension for privacy-preserving interdomain
185	      resource information aggregation.

187	4.  Characteristics of Multi-Domain, Geo-Distributed Data Analytics

189	   This section describes the characteristics of multi-domain, geo-
190	   distributed data analytics.

192	4.1.  Dynamic Data Analytics Workload

194	   In multi-domain, geo-distributed data analytics, extremely large
195	   amounts of data are generated and stored across different member
196	   networks.  Authorized users from different organizations can access
197	   data and resources in member networks to conduct various data
198	   analytics jobs using various data analytics applications.

200	   An data analytics application usually provides an automated process
201	   that decomposes a large data analytics job into a set of smaller
202	   tasks, whose dependencies are expressed as a directed acyclic graph
203	   (DAG).  Tasks without any dependency can be executed in parallel to
204	   improve the efficiency of the data analytics job they belong to.
205	   This decomposition is highly user- and application-dependent.

207	   Each task may have different requirements on different resources.
208	   For instance, task T1 may require dataset A in storage node X as
209	   input and 1 CPU as the computing resource, while task T2 may require
210	   dataset B in storage node Y as input and 2 CPUs as the computing
211	   resource.  Furthermore, each task may require resources from
212	   different member networks.  In the previous example, T1 may require
213	   its output to be stored in a storage node in another member network
214	   for the purpose of secure storage.  The resource requirements of
215	   tasks are highly user- and application-dependent.

217	   From the above description, it is observed that the workload of
218	   multi-domain, geo-distributed data analytics is highly dynamic, in
219	   terms of the number of users, the types of applications, the number
220	   of jobs, the decomposition of jobs and the resource requirements of
221	   tasks.

223	   Though with such dynamism, it is the general consensus of users to
224	   expect performance predictability of their analytics jobs (TODO: add
225	   Mogul citation).  Hence the resource orchestration for multi-domain,
226	   geo-distributed data analytics must be able to achieve efficient
227	   resource sharing among different data analytics jobs of different
228	   applications from different users.  To this end, a generic
229	   representation of resource requirements for different tasks from
230	   different analytics applications must be chosen.  Furthermore, to
231	   ensure maximal deployment, the resource orchestration framework must
232	   be independent of and compatible with data analytics applications.

234	4.2.  Dynamic Resource Availability

236	   In the multi-domain, geo-distributed data analytics network,
237	   different member networks belong to different administrative domains.
238	   Each member network has its own resource management policies and can
239	   choose to use different management software, such as HTCondor and
240	   Mesos.

242	   Each member network provides different types of resources with
243	   different amounts.  For example, transit networks such as ESNet and
244	   Internet2 provide high-bandwidth networking resources.  In contrast,
245	   campus science networks provide abundant computation and storage
246	   resources, but may provide limited networking bandwidths.  And some
247	   smaller science networks only provide limited computation and storage
248	   resources.  The availability of the resources in each member network
249	   is subject to the autonomous control of the member network.

251	   Furthermore, member networks are interconnected with high bandwidth-
252	   delay-product links, where state-of-the-art networking resource
253	   allocation mechanisms, such as TCP, become inefficient [XCP].

255	   From the above description, it is observed that the resource
256	   availability of the multi-domain, geo-distributed data analytics
257	   network is also highly dynamic, subject to the types of member
258	   networks, the resources provided by member networks and the resource
259	   management policies and management software used by member networks.

261	   Though with such dynamism, it is the general consensus of member
262	   networks that the resource orchestration for multi-domain, geo-
263	   distributed data analytics must achieve high utilization of different
264	   types of resources, following the autonomy and privacy of each member
265	   network.  To this end, a generic representation of resource
266	   availabilities for different types of resources must be chosen.  Such
267	   a representation must be accurate and at the same time maintain the
268	   privacy of member networks.  Furthermore, to ensure maximal
269	   deployment, the resource orchestration framework must be independent
270	   of and compatible with the resource management systems used by member
271	   networks.

273	5.  Design Requirements

275	   This section summarizes the design requirements for resource
276	   orchestration for multi-domain, geo-distributed data analytics from
277	   the previous section.

279	   o  REQ1: Provide performance predictability for data analytics jobs.

281	   o  REQ2: Achieve the efficient resource sharing among data analytics
282	      jobs.

284	   o  REQ3: Achieve the high utilization of different types of resources
285	      in member networks.

287	   o  REQ4: Maintain the autonomy and privacy of member networks.

289	   o  REQ5: Provide compatibility with different data analytics
290	      applications and resource management systems to maximize the
291	      deployment.

293	6.  Review of Resource Orchestration Designs for Data Analytics

295	   This section provides an overview of three general types of resource
296	   orchestration designs for data analytics -- the centralized resource-
297	   graph-based orchestration, the centralized ClassAds-based
298	   orchestration and the distributed opportunistic orchestration.  Then,
299	   the key reason why these designs are inadequate for multi-domain,
300	   geo-distributed data analytics is provided.

302	6.1.  Centralized resource-graph-based orchestration

304	   Systems such as Mesos [Mesos] and Borg [Borg] adopt a graph-based
305	   abstraction to represent the resource availability of computing
306	   clusters.  Each node in the graph is a physical node representing
307	   computation or storage resources and each edge between a pair of
308	   nodes denotes the networking resource connecting two physical nodes.
309	   This design is inadequate for multi-domain, geo-distributed data
310	   analytics system because (1) it compromises the privacy of different
311	   member networks by revealing all the details of resources; and (2)
312	   the overhead to keep the resource availability graph up to date is
313	   too expensive due to the heterogeneity and dynamicity of resources
314	   from different member networks.

316	6.2.  Centralized ClassAds-based orchestration

318	   HTCondor [HTCondor] proposes a ClassAds programming model, which
319	   allows different resource owners to advertise their resource supply
320	   and the job owners to advertise the resource demand.  However, this
321	   programming model does not support the accurate discovery of
322	   networking resources, but leave the orchestration of networking
323	   resources completely to TCP, which has been known to behave poorly in
324	   networks with high bandwidth-delay products [XCP].

326	6.3.  Distributed opportunistic orchestration

328	   Some systems, such as Apollo [Apollo] and Sparrow [Sparrow], use a
329	   distributed design.  In this design, given a data analytics job, a
330	   small number of computing and storage nodes are randomly selected as
331	   candidates.  Then a scheduling algorithm makes the decision to select
332	   the best pair of computing and storage nodes within this small set of
333	   candidates.  Though it is shown in production that this design
334	   achieves a performance very close to the theoretical optimal resource
335	   allocation scheme, this design cannot be applied to multi-domain,
336	   geo-distributed data analytics because (1) the pool of computing and
337	   storage resources is much larger, and is distributed across the
338	   world, and (2) it is hard to distributively orchestrate networking
339	   resources in such a high bandwidth-delay product scenario.

341	6.4.  Inadequacy of Existing Designs for Multi-Domain, Geo-Distributed
342	      Data Analytics

344	   Applying the designs reviewed in the preceding subsections for multi-
345	   domain, geo-distributed data analytics only satisfies the design
346	   requirement of compatibility (REQ5), but leaves all the other
347	   requirements unfulfilled.  The key reason is that they do not have an
348	   information model that simultaneously

350	   o  allows member networks to provide accurate information on
351	      different types of resources, e.g., the computing, storage and
352	      networking resources, they own;

354	   o  keeps the private information of member networks, such as physical
355	      topologies and policies, from the data analytics applications; and

357	   o  allows data analytics jobs to accurately describe their
358	      requirements of different types of resources.

360	7.  Unicorn Design

362	   This section presents the design of the Unicorn framework.  First,
363	   the motivations of using ALTO as the information model of resource
364	   orchestration for multi-domain, geo-distributed data analytics are
365	   reviewed.  Then the architecture of Unicorn is provided.

367	7.1.  Choosing ALTO as the Resource Information Model

369	   As reviewed in the preceding section, the commonly used resource-
370	   graph-based information model and the ClassAds information model do
371	   not support the accurate, yet privacy-preserving resource discovery
372	   across different member networks.  In contrast, the ALTO protocol
373	   uses abstract maps of networks to provide network information with
374	   the goal of modifying network resource consumption patterns while
375	   maintaining or improving application performance [RFC7285].  This
376	   document proposes the use of ALTO for providing information of
377	   different types of resources, e.g., computing, storage and networking
378	   resources.  This design has the following advantages:

380	   o  ALTO provides the network information based on abstract maps of a
381	      network.  Additional services are built on top of the ALTO
382	      abstract maps to provide information of other types of resources,
383	      e.g., the computing and storage resources.  These maps provide
384	      accurate information of different types of resources for the
385	      resource orchestration system to effectively utilize them for data
386	      analytics applications.  For example, the ALTO Endpoint Property
387	      Service can provide information of computing nodes and storage
388	      nodes.

390	   o  The ALTO abstract maps provide a simplified view of resources of
391	      member networks, instead of the full details of their resource
392	      availability.  Thus ALTO allows member networks to keep their
393	      private information, such as physical topologies and policies,
394	      from the applications.  For example, the ALTO Network Map service
395	      provides a "one-big-switch" view that defines a grouping of
396	      network endpoints.  This view hides the details of the underlying
397	      physical topology of the network and a network deploying the ALTO
398	      server has the autonomy to adopt any endpoint grouping algorithm.

400	   o  ALTO uses a client-server model, in which applications can use
401	      ALTO clients to accurately describe their requirements of
402	      different types of resources and send these requirements to the
403	      ALTO servers to retrieve the accurate information of resources
404	      that suit their requirements.  For example, the ALTO Multi-Cost
405	      service [RFC8189] allows an ALTO client to specify a logic set of
406	      tests in a query.  Such tests are used by ALTO servers to filter
407	      out the information of unqualified resources from the response
408	      sent back to the ALTO client.

410	7.2.  Architecture of Unicorn

412	   This section describes the design details of Unicorn.  Figure 1
413	   presents the architecture of Unicorn for a multi-domain, geo-
414	   distributed data analytics system with N member networks.  In
415	   particular, Unicorn consists of the following key components:

417	          .-------------.           .-------------.
418	          |Application 1|    ...    |Application N|
419	          '-------------'           '-------------'
420	                \                         /
421	 .- - - - - - - -\- - - - - - - - - - - -/- - - - - - - - - - - - - - -.
422	 |  Unicorn       \                     /                              |
423	 |               .-----------------------.                             |
424	 |               | Resource Orchestrator |     .----------------------.|
425	 |               |                       |     |Distributed Hash Table||
426	 |               |     .-----------.     |---- |   of Computing and   ||
427	 |               |     |ALTO Client|     |     |   Storage Resources  ||
428	 |               |     '-----------'     |     '----------------------'|
429	 |               '-----------------------'                             |
430	 |             /            |            \                             |
431	 |            /             |             \                            |
432	 |   .-------------.  .-----------.      .-------------.               |
433	 |   |ALTO Server 1|  | Execution |      |ALTO Server M|               |
434	 |   '-------------'  |  Agents   |      '-------------'               |
435	 |          |         '-----------'            |                       |
436	 |          |          /           \           |                       |
437	 |  .----------------./             \ .----------------.               |
438	 |  |     Site 1     |       . .      |     Site N     |               |
439	 |  '----------------'                '----------------'               |
440	 '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

442	                    Figure 1: Architecture of Unicorn.

444	   o  ALTO Server: for each member network, one or more ALTO servers are
445	      deployed to provide accurate, yet privacy-preserving information
446	      of different types of resources owned by the corresponding
447	      network.  Examples of such information include the link bandwidth
448	      between endpoints, the memory I/O bandwidth and the CPU
449	      utilization at computing endpoints and the storage space at
450	      storage endpoints.  In addition to the basic ALTO services defined
451	      in [RFC7285], The ALTO servers in Unicorn also provide ALTO
452	      extension services such as the ALTO Multi-Cost Service [RFC8189],
453	      the ALTO Server-Sent Event Service [DRAFT-SSE] and the ALTO
454	      Multipart Cost Property Service [DRAFT-PV] to provide fine-grained
455	      resource information.

457	   o  Distributed Hash Table (DHT) of Computing and Storage Resources: A
458	      DHT system is deployed across member networks to lookup the
459	      location of computing and storage resources.  Compared with the
460	      current centralized lookup services in the CMS network, i.e.,
461	      PhEDEx and HTCondor, a DHT system provides a significant
462	      performance improvement for discovering the locations of computing
463	      and storage resources in multi-domain, geo-distributed data
464	      analytics systems.

466	   o  Resource Orchestrator: The orchestrator is a shim layer between
467	      the data analytics jobs from different applications and the member
468	      networks.  It contains an ALTO client that communicates with the
469	      ALTO servers at member networks to retrieve resource information.
470	      Given a set of data analytics jobs, the orchestrator adopts a
471	      three-phase discovery process, which will be elaborated in the
472	      next section, to find the accurate information of all the
473	      resources that can be used to execute these jobs.  Then the
474	      orchestrator runs a customized resource allocation algorithm to
475	      compute the resource allocation decisions for these jobs, and send
476	      the decisions to the execution agents at corresponding member
477	      networks.

479	   o  Execution Agent: One or more execution agents are deployed at each
480	      member network.  They take the resource allocation decisions from
481	      the resource orchestrator, and communicate with the underlying
482	      resource management system deployed at the corresponding member
483	      network to reserve the resources for the data analytics jobs and
484	      execute them.

486	7.2.1.  Three-Phase Resource Discovery

488	   The preceding subsection describes the architecture and the key
489	   components of Unicorn.  One missing component is how to accurately
490	   discover the information of different types of resources for a set of
491	   data analytics jobs with the assistance of ALTO.  This section
492	   presents the three-phase resource discovery design in Unicorn.

494	7.2.1.1.  Phase 1: Endpoint Property Discovery

496	   Figure 2 shows the procedure of the endpoint property discovery
497	   phase.  Given a set of data analytics jobs, the resource orchestrator
498	   communicates with the DHT lookup system to find the locations, i.e.,
499	   the endpoint addresses, of all candidate computing and storage
500	   resources.  With such information, the ALTO client then issues
501	   Endpoint Property Service (EPS) queries to the ALTO servers deployed
502	   at member networks to discover the information of all candidate
503	   endpoints.

505	      .-----------------------.
506	      | Resource Orchestrator |  Jobs' Resource
507	      |                       |   Requirements   .------------.
508	      |     .-----------.     | ---------------> | DHT Lookup |
509	      |     |ALTO Client|     | <--------------- |   System   |
510	      |     '-----------'     |     Endpoint     '------------'
511	      '---------| ^-----------'     Locations
512	         EPS    | | Endpoint
513	       Queries  | | Properties
514	                v |
515	          .-------------.
516	          |ALTO Servers |
517	          '-------------'

519	             Figure 2: The Endpoint Property Discovery Phase.

521	7.2.1.2.  Phase 2: Endpoint Path Discovery

523	   Candidate computing and storage endpoints need to move data between
524	   them before, during and after the execution of a data analytics job.
525	   In multi-domain, geo-distributed data analytics, a pair of candidate
526	   endpoints may not be in the same member network.  In this case, the
527	   orchestrator needs to find out the connectivity information between
528	   such a pair of candidate endpoints.

530	   Figure 3 shows the procedure of the endpoint path discovery phase.
531	   Given a pair of candidate endpoints that are not in the same member
532	   network, the ALTO client in the orchestrator adopts an iterative
533	   process to find the interdomain connectivity information for this
534	   pair.  It starts by issuing an ALTO Endpoint Cost Service query or an
535	   ALTO Flow-based Endpoint Cost Service [DRAFT-FCS] to the ALTO server
536	   of the member network where the source endpoint locates.  The cost
537	   type of this query is a customized type called next-hop, with a
538	   customized cost mode tuple and a customized cost metric next-network.

540	                           .-----------------------.
541	                           | Resource Orchestrator |
542	                           |                       |
543	                           |     .-----------.     |
544	                           |     |ALTO Client|     |
545	                           |     '-----------'     |
546	                           '---------| ^-----------'
547	                           Customized| | Endpoint
548	                              ECS    | | Path
549	                            Queries  | | Segments
550	                                     v |
551	                               .-------------.
552	                               |ALTO Servers |
553	                               '-------------'

555	               Figure 3: The Endpoint Path Discovery Phase.

557	   The ALTO server returns a 2-tuple, where the first element is the
558	   autonomous number (AS) of the next member network along the AS-path
559	   from the source endpoint to the destination endpoint, and the second
560	   element is the ingress of this next member network.  In a member
561	   network, the ALTO server can get such information from the underlying
562	   interdomain routing protocol, e.g., BGP.  Based on the received
563	   response, the ALTO client then issues a similar query to the ALTO
564	   server of the next member network.  The process stops when the ALTO
565	   server of the member network where the destination endpoint locates
566	   receives such a query, who will return a null 2-tuple in response to
567	   notify the ALTO client.  By the end of this process, the ALTO client
568	   can assemble a domain-path, in the form of a path vector of (ingress,
569	   AS), of this pair of candidate endpoints.

571	7.2.1.3.  Phase 3: Resource State Abstraction Discovery

573	   After the second phase, the resource orchestrator has the
574	   connectivity information of each candidate endpoint pair, i.e., the
575	   domain-path.  Equivalently, for each member network, it knows the set
576	   of all candidate endpoint pairs that will enter this network.  With
577	   this information, the resource orchestrator can communicate with the
578	   ALTO servers at member networks to discover the resource sharing
579	   between all the candidate endpoint pairs.  In particular, Unicorn
580	   extends the routing state abstraction [DRAFT-RSA] to the more generic
581	   resource state abstraction to represent such resource sharing.

583	   Figure 4 shows the procedure of the resource state abstraction
584	   discovery phase.  For each member network, the ALTO client in the
585	   orchestrator sends an ALTO Multipart Cost Property Service query
586	   defined in [DRAFT-PV] by providing the set of candidate endpoint
587	   pairs as input.  The cost type of this query is path vector.  Upon
588	   receiving the query, the ALTO server in each member network computes
589	   an ALTO cost map and an ATLO property map to the ALTO client.  These
590	   two maps represent a set of linear inequalities revealing the
591	   resource sharing among the set of candidate endpoint pairs in the
592	   member network.

594	                           .-----------------------.
595	                           | Resource Orchestrator |
596	                           |                       |
597	                           |     .-----------.     |
598	                           |     |ALTO Client|     |
599	                           |     '-----------'     |
600	                           '---------| ^-----------'
601	                            Multipart| | Resource
602	                            Cost     | | State
603	                            Property | | Abstraction
604	                            Queries  v |
605	                               .-------------.
606	                               |ALTO Servers |
607	                               '-------------'

609	         Figure 4: The Resource State Abstraction Discovery Phase.

611	   Unicorn provides two mechanisms for the ALTO servers to return the
612	   computed cost maps and property maps to the ALTO client.  The first
613	   mechanism is to let each ALTO server independently sends its response
614	   to the ALTO client.  The second mechanism is a privacy-preserving
615	   interdomain information aggregation process, in which the ALTO
616	   servers in all member networks use a secure multi-party computation
617	   (SMPC) protocol to collectively send the responses to the ALTO client
618	   without revealing the source of any entry, i.e., the linear in
619	   equality, in the cost maps and property maps.

621	   The first mechanism has a higher security risk in that it exposes the
622	   bottleneck resource information of each member network.  In contrast,
623	   the second mechanism provides a better protection of the private
624	   information of each member network.  The details of the privacy-
625	   preserving interdomain information aggregation process will be
626	   presented in the next section.

628	   After receiving the responses sent back from the ALTO servers from
629	   all the member networks, the orchestrator finishes the whole resource
630	   discovery process and collects the accurate information of different
631	   types of resources for data analytics jobs.

633	7.3.  Example

635	   This subsection gives an example to illustrate the workflow of
636	   Unicorn.  Figure 5 gives a topology of three member networks, where
637	   s1 and s2 are storage endpoints and d1 and d2 are computation
638	   endpoints.  Assume a data analytics job is composed of two parallel
639	   tasks T1 and T2.  T1 needs dataset X as input and T2 needs dataset Y
640	   as input.

642	                                     .------------.
643	                                     | Network B  |
644	               .------------.    ingB|            |
645	               | Network A  |--------|     d1     |
646	               |            |        '------------'
647	               |    s1      |
648	               |            |        .------------.
649	               |    s2      |--------| Network C  |
650	               '------------'    ingC|            |
651	                                     |     d2     |
652	                                     '------------'

654	              Figure 5: An Illustrating Example for Unicorn.

656	   In the endpoint property discovery phase, the Unicorn resource
657	   orchestrator finds that s1 stores X and s2 stores Y, and that the
658	   locations of s1, s2, d1 and d2, from the DHT lookup system.  It then
659	   issues EPS queries to network A, B and C, respectively, to discover
660	   that d1 satisfies the computing requirements of T1 and d2 satisfies
661	   the computing requirements of T2.  Hence there are only two candidate
662	   endpoint pairs: (s1, d1) and (s2, d2).

664	   In the endpoint path discovery phase, the ALTO client in the
665	   orchestrator iteratively issues Endpoint Cost Service (ECS) query to
666	   the ALTO servers in member networks, and finds that the domain-path
667	   for pair (s1, d2) is [(null, A), (ingB, B)] and the domain-path for
668	   pair (s2, d2) is [(null, A), (ingB, B)].  Hence both pairs will use
669	   the networking resources of network A, while only (s1, d1) will use
670	   network B and only (s2, d2) will use network C.

672	   In the resource state abstraction discovery phase, the ALTO client in
673	   the orchestrator issues Multipart Cost Property Service queries to
674	   network A, B and C, respectively.  Denote the available bandwidth
675	   that can be assigned to T1 as x1 and that to T2 as x2.  Assume the
676	   linear inequalities computed by the three networks are:

678	                         A: x1 + x2 <= 10Mbps
679	                         B: x1 <= 3Mbps
680	                         C: x2 <= 3Mbps

682	   If the ALTO servers use the first mechanism to directly return their
683	   resource information to ALTO client, respectively, each of them will
684	   send a cost map and a property map response encoding its own linear
685	   inequality to the ALTO client.  In this way, the orchestrator gets
686	   the accurate information about networking resource sharing between
687	   (s1, d1) and (s2, d2).  It then can invoke a resource allocation
688	   algorithm to allocate the resources to tasks T1 and T2.  For example,
689	   if the goal is to maximize the minimal bandwidth of two tasks, the
690	   allocation decision will be to assign endpoints s1 and d1 to T1, with
691	   a bandwidth of 3Mbps, and assign endpoints s2 and d2 to T2, with a
692	   bandwidth of 3Mbps as well.

694	8.  ALTO Extension: Privacy-Preserving Interdomain Information
695	    Aggregation for Resource Discovery

697	   This section describes a customized ALTO extension in Unicorn that
698	   supports the privacy-preserving discovery of networking resource
699	   sharing among a set of candidate endpoint pairs.

701	8.1.  Extension Specification

703	   Figure 6 presents the workflow of the proposed ALTO extension.
704	   Assume a set of N member networks denoted as AS_1, AS_2, ...  AS_N
705	   and the number of all candidate endpoint pairs is F.  The interdomain
706	   information aggregation process works as follows:

708	                             .-----------------------.
709	                             | Resource Orchestrator |
710	                             |                       |
711	                             |     .-----------.     |
712	                             |     |ALTO Client|     |
713	                             |     '-----------'     |
714	                             '-----------------------'<
715	          Encryption Key and /                         \ Ciphered MCP
716	          Multipart Cost    /                           \ Response
717	          Property Queries v  Key, Ciphertext            \
718	                  .--------.  and MCP .--------.          .--------.
719	                  |ALTO    |  Queries |ALTO    |          |ALTO    |
720	                  |Server 1| -------> |Server 2|-- ... -->|Server N|
721	                  '--------'          '--------'          '--------'

723	     Figure 6: The Privacy-Preserving Interdomain Resource Information
724	                               Aggregation.

726	   o  Step 1: The ALTO client sends the Multipart Cost Property Service
727	      request to and a homomorphic public key k_p to each member
728	      network.

730	   o  Step 2: The ALTO server of each network AS_i computes its own set
731	      of linear inequalities A_i x <= b_i.  Denote the size of this set
732	      as m_i.

734	   o  Step 3: The ALTO server of each network AS_i introduces m_i non-
735	      negative slack variables to transform its set of linear
736	      inequalities into a set of linear equations.

738	   o  Step 4: The ALTO servers of all member networks use an SMPC
739	      summation protocol to collectively compute k=m_1 + m_2 + ... + m_N
740	      + 1.  The value k is known to all the member networks.

742	   o  Step 5: The ALTO servers of each network AS_i selects a random k-
743	      by-m_i matrix P_i, and compute the matrix P_iA_i and P_ib_i.

745	   o  Step 6: The ALTO servers of all member networks use an SMPC
746	      summation protocol to collectively compute the summation of P_iA_i
747	      and the summation of P_ib_i using the public key k_p.  Then the
748	      final result is returned to the ALTO client.

750	   o  Step 7: the ALTO client uses its own private key to decrypt the
751	      result and get a set of linear equations sum P_iA_i x = sum
752	      P_ib_i.

754	   This process ensures that the networking resource capacity region
755	   derived from sum P_iA_i x = sum P_ib_i is the same as that derived
756	   from A_1 x <= b_1, A_2 x <= b_2, ... A_N x <= b_N.  More importantly,
757	   the ALTO client has no knowledge on the information of network
758	   resource sharing of a single member network.

760	8.2.  Example

762	   This subsection uses the same example in Figure 5 to illustrate the
763	   privacy-preserving information aggregation process.  The set of
764	   linear inequalities computed by each network is as follows:

766	                         A: x1 + x2 <= 10
767	                         B: x1 <= 3
768	                         C: x2 <= 3

770	   Then the networks collectively compute k=1+1+1+1=4.  And then
771	   introduces slack variables to transform the linear inequalities into
772	   linear equations:

774	                         A: x1 + x2 + x3           <= 10
775	                         B: x1           + x4      <= 3
776	                         C:      x2           + x5 <= 3

778	   For each network, the random matrix it chooses as follows:

780	                         P_A: [11, 49, 95, 34]
781	                         P_B: [58, 22, 75, 25]
782	                         P_C: [50, 69, 89, 95]

784	   After the SMPC summation process, the decrypted set of linear
785	   equations the ALTO client gets is

787	                   69  x1 + 61  x2 + 11 x3 + 58 x4_ + 50 x5 = 434
788	                   71  x1 + 118 x2 + 49 x3 + 22 x4_ + 69 x5 = 763
789	                   170 x1 + 184 x2 + 95 x3 + 75 x4_ + 89 x5 = 1442
790	                   59  x1 + 129 x2 + 34 x3 + 25 x4_ + 95 x5 = 700

792	   Assume the goal is still to maximize the minimal bandwidth of two
793	   tasks, the allocation decision made using this set of linear
794	   equations will still be x1=3 and x2=3, i.e., assigning endpoints s1
795	   and d1 to T1, with a bandwidth of 3 and assigning endpoints s2 and d2
796	   to T2, with a bandwidth of 3 as well.

798	9.  Discussion

800	9.1.  Discovering the Domain-Paths Using a New Interdomain Routing
801	      Protocol

803	   The current design of the endpoint path discovery process in Unicorn
804	   assumes that the underlying interdomain routing protocol is the
805	   standard BGP, which only provides the path vector of ASes instead of
806	   the path vector of (ingress, AS) tuples needed by Unicorn.  If a
807	   multi-domain, geo-distributed data analytics system uses an
808	   interdomain routing protocol that provides the path vector of
809	   (ingress, AS) pairs, the endpoint path discovery process in Unicorn
810	   can be simplified to only send queries to the ALTO server of the
811	   network where the source candidate endpoint locates.

813	10.  Security Considerations

815	   This document does not introduce any privacy or security issue not
816	   already present in the ALTO protocol.

818	11.  IANA Considerations

820	   This document does not define any new media type or introduce any new
821	   IANA consideration.

823	12.  References

825	12.1.  Normative References

827	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
828	              Requirement Levels", BCP 14, RFC 2119,
829	              DOI 10.17487/RFC2119, March 1997, <https://www.rfc-
830	              editor.org/info/rfc2119>.

832	12.2.  Informative References

834	   [Apollo]   Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J.,
835	              Qian, Z., Wu, M., and L. Zhou, "Apollo: Scalable and
836	              Coordinated Scheduling for Cloud-Scale Computing", 2014,
837	              <https://www.usenix.org/system/files/conference/osdi14/
838	              osdi14-paper-boutin_0.pdf>.

840	   [Borg]     Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D.,
841	              Tune, E., and J. Wilkes, "Large-scale cluster management
842	              at Google with Borg", 2015, <https://dl.acm.org/
843	              citation.cfm?id=2741964>.

845	   [DRAFT-FCS]
846	              Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang,
847	              "ALTO Extension: Flow-based Cost Query", 2017,
848	              <https://datatracker.ietf.org/doc/draft-gao-alto-fcs/>.

850	   [DRAFT-PV]
851	              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
852	              Yang, "ALTO Extension: Abstract Path Vector as a Cost
853	              Mode", 2015, <https://tools.ietf.org/html/draft-yang-alto-
854	              path-vector-01>.

856	   [DRAFT-RSA]
857	              Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G.
858	              Chen, "A Recommendation for Compressing ALTO Path
859	              Vectors", 2017, <https://datatracker.ietf.org/doc/draft-
860	              gao-alto-routing-state-abstraction/>.

862	   [DRAFT-SSE]
863	              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
864	              Server-Sent Events (SSE)", 2015,
865	              <https://datatracker.ietf.org/doc/draft-ietf-alto-incr-
866	              update-sse/>.

868	   [HTCondor]
869	              Thain, D., Tannenbaum, T., and M. Livny, "Distributed
870	              computing in practice: the Condor experience", 2005,
871	              <http://dl.acm.org/citation.cfm?id=1064336>.

873	   [Mesos]    Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
874	              Joseph, A., Katz, R., Shenker, S., and I. Stoica, "Mesos:
875	              A Platform for Fine-Grained Resource Sharing in the Data
876	              Center", 2011,
877	              <http://static.usenix.org/events/nsdi11/tech/full_papers/
878	              Hindman_new.pdf>.

880	   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
881	              Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
882	              "Application-Layer Traffic Optimization (ALTO) Protocol",
883	              RFC 7285, DOI 10.17487/RFC7285, September 2014,
884	              <https://www.rfc-editor.org/info/rfc7285>.

886	   [RFC8189]  Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
887	              Application-Layer Traffic Optimization (ALTO)", RFC 8189,
888	              DOI 10.17487/RFC8189, October 2017, <https://www.rfc-
889	              editor.org/info/rfc8189>.

891	   [Sparrow]  Ousterhout, K., Wendell, P., Zaharia, M., and I. Stoica,
892	              "Sparrow: Distributed, Low Latency Scheduling", 2013,
893	              <https://dl.acm.org/citation.cfm?id=2522716>.

895	   [XCP]      Katabi, D., Handley, M., and C. Rohrs, "Internet
896	              Congestion Control for Future High Bandwidth-Delay Product
897	              Environments", 2002,
898	              <http://citeseerx.ist.psu.edu/viewdoc/
899	              summary?doi=10.1.1.58.4783>.

901	Authors' Addresses

903	   Qiao Xiang
904	   Tongji/Yale University
905	   51 Prospect Street
906	   New Haven, CT
907	   USA

909	   Email: qiao.xiang@cs.yale.edu
910	   Franck Le
911	   IBM
912	   Thomas J. Watson Research Center
913	   Yorktown Heights, NY
914	   USA

916	   Email: fle@us.ibm.com

918	   Y. Richard Yang
919	   Tongji/Yale University
920	   51 Prospect Street
921	   New Haven, CT
922	   USA

924	   Email: yry@cs.yale.edu

926	   Harvey Newman
927	   California Institute of Technology
928	   1200 California Blvd.
929	   Pasadena, CA
930	   USA

932	   Email: newman@hep.caltech.edu

934	   Haizhou Du
935	   Tongji University
936	   4800 Cao'an Hwy
937	   Shanghai  201804
938	   China

940	   Email: duhaizhou@gmail.com