idnits 2.17.1 

draft-xiang-alto-multidomain-analytics-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 6 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (December 17, 2017) is 2321 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'DRAFT-CC' is defined on line 394, but no explicit
     reference was found in the text

  == Unused Reference: 'DRAFT-DC' is defined on line 400, but no explicit
     reference was found in the text

  == Unused Reference: 'DRAFT-MC' is defined on line 411, but no explicit
     reference was found in the text

  == Unused Reference: 'DRAFT-NETGRAPH' is defined on line 416, but no
     explicit reference was found in the text

  == Unused Reference: 'DRAFT-PM' is defined on line 421, but no explicit
     reference was found in the text

  == Unused Reference: 'DRAFT-SSE' is defined on line 438, but no explicit
     reference was found in the text

  == Unused Reference: 'HTCondor' is defined on line 452, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-04) exists of
     draft-yang-alto-path-vector-01


     Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	ALTO WG                                                         Q. Xiang
3	Internet-Draft                                                   Y. Yang
4	Intended status: Standards Track                  Tongji/Yale University
5	Expires: June 20, 2018                                 December 17, 2017

7	   Unicorn: Resource Orchestration for Large-Scale, Multi-Domain Data
8	                               Analytics
9	             draft-xiang-alto-multidomain-analytics-00.txt

11	Abstract

13	   This document presents the design of Unicorn, a multi-domain,
14	   geographically-distributed, data-intensive analytics system.  The
15	   setting of such a system includes edge science networks, which
16	   provide storage and computation resources for collecting, sharing and
17	   analyzing extremely large amounts of data, and transit networks,
18	   which provide networking resources to connects edge science networks
19	   for transmitting large science datasets.

21	   The key design challenge is to accurately discover and represent
22	   resource information from different domains.  Unicorn leverages
23	   multiple ALTO services, including ALTO-Path Vector, ALTO-Routing
24	   State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service
25	   to address this challenge.  In particular, Unicorn decomposes the
26	   resource discovery into three phases.  The first phase is to identify
27	   endpoint resource, e.g., dataset storage location, computation
28	   resource location and output storage resource location.  The second
29	   phase is to identify the reachability information between the
30	   locations of storage and computation resources.  The third phase is
31	   to identify the available networking resource connecting different
32	   storage and computation resources.  All information collected through
33	   these three phases can be used by a logically centralized scheduling
34	   system to orchestrate the resources usage.

36	Status of This Memo

38	   This Internet-Draft is submitted in full conformance with the
39	   provisions of BCP 78 and BCP 79.

41	   Internet-Drafts are working documents of the Internet Engineering
42	   Task Force (IETF).  Note that other groups may also distribute
43	   working documents as Internet-Drafts.  The list of current Internet-
44	   Drafts is at https://datatracker.ietf.org/drafts/current/.

46	   Internet-Drafts are draft documents valid for a maximum of six months
47	   and may be updated, replaced, or obsoleted by other documents at any
48	   time.  It is inappropriate to use Internet-Drafts as reference
49	   material or to cite them other than as "work in progress."

51	   This Internet-Draft will expire on June 20, 2018.

53	Copyright Notice

55	   Copyright (c) 2017 IETF Trust and the persons identified as the
56	   document authors.  All rights reserved.

58	   This document is subject to BCP 78 and the IETF Trust's Legal
59	   Provisions Relating to IETF Documents
60	   (https://trustee.ietf.org/license-info) in effect on the date of
61	   publication of this document.  Please review these documents
62	   carefully, as they describe your rights and restrictions with respect
63	   to this document.  Code Components extracted from this document must
64	   include Simplified BSD License text as described in Section 4.e of
65	   the Trust Legal Provisions and are provided without warranty as
66	   described in the Simplified BSD License.

68	Table of Contents

70	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
71	     1.1.  Settings  . . . . . . . . . . . . . . . . . . . . . . . .   3
72	   2.  Requirements Language . . . . . . . . . . . . . . . . . . . .   4
73	   3.  Overview  . . . . . . . . . . . . . . . . . . . . . . . . . .   4
74	   4.  Storage and Computation Resource Discovery  . . . . . . . . .   6
75	   5.  Path Discovery  . . . . . . . . . . . . . . . . . . . . . . .   6
76	     5.1.  Using SDN to get flow-based site-path . . . . . . . . . .   7
77	     5.2.  Path Discovery Example  . . . . . . . . . . . . . . . . .   7
78	   6.  Networking Resource Discovery . . . . . . . . . . . . . . . .   8
79	     6.1.  Networking Resource Discovery Example . . . . . . . . . .   8
80	     6.2.  A Secure Multiparty Computation Protocol to Compute
81	           Minimal, Cross-Domain RSA . . . . . . . . . . . . . . . .   8
82	   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   9
83	     7.1.  Normative References  . . . . . . . . . . . . . . . . . .   9
84	     7.2.  Informative References  . . . . . . . . . . . . . . . . .   9
85	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

87	1.  Introduction

89	   As the data volume increases exponentially over time, data intensive
90	   analytics is transiting from single-domain computing to multi-
91	   organizational, geographically-distributed, collaborative computing,
92	   where different organizations contribute various resources, e.g.,
93	   computation, storage and networking resources, to collaboratively
94	   collect, share and analyze extremely large amounts of data.  One
95	   leading example is the Large Hadron Collider (LHC) high energy
96	   physics (HEP) program, which aims to find new particles and
97	   interactions in a previously inaccessible range of energies.  The
98	   scientific collaborations that have built and operate large HEP
99	   experimental facilities at the LHC, such as the Compact Muon Solenoid
100	   (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than
101	   300 petabytes of data under management at hundreds of sites around
102	   the world, and this volume is expected to grow to one exabyte by
103	   approximately 2018.

105	   This document presents Unicorn, a generic design for resource
106	   orchestration for large-scale, multi-domain data analytics.  The key
107	   design challenge for such a resource orchestration system is to
108	   accurately discover and represent the resource information from
109	   different domains.  Our design resorts to the Application-Layer
110	   Traffic Optimization Protocol (ALTO) [RFC7285] to address this
111	   challenge.  In particular, several ALTO extension services, including
112	   ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side
113	   Event and ALTO-Flow Cost Service, are integrated in the proposed
114	   design.

116	   This document focuses on the design details of Unicorn.  We present
117	   the implementation and deployment experience of Unicorn in another
118	   document [DRAFT-UNICORN-INFO].

120	1.1.  Settings

122	   The targeting scenario is as follows.  There are two types of
123	   networks in the whole system.  The first type is the edge science
124	   network.  An edge science networks is usually a cluster residing in a
125	   campus network.  It provides storage resources to store large
126	   scientific datasets and computation resources to analyze these
127	   datasets.  The second type is the transit network.  A transit network
128	   does not provide any storage or computation resources.  It only
129	   provides networking resources to inter-connect different edge science
130	   networks so that datasets can be moved and shared between different
131	   edge science networks.  Edge science networks do not directly connect
132	   to each other, but are connected through transit networks.

134	   Without loss of generality, a data analytics task is defined as a
135	   3-tuple: (input dataset, program, output site).  A task can be
136	   further decomposed into a set of jobs, who have a precedence relation
137	   defined by a directed acyclic graph (DAG).  And each job can also be
138	   defined as a 3-tuple: (input dataset, program, output site).

140	2.  Requirements Language

142	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
143	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
144	   document are to be interpreted as described in [RFC2119].

146	3.  Overview

148	   The key design challenge for multi-domain data analytics system is to
149	   accurately discover the resource information from different sites
150	   while preserving the autonomy and privacy of each site.  In order to
151	   address this challenge, the design needs to strike a balance between
152	   the information accuracy, the efficiency of resource discovery and
153	   the privacy of each site.  In particular, we propose the following
154	   architecture in the Figure 1.

156	                             .---------.
157	                             |  Users  |
158	                             '---------'
159	                                  | Tasks
160	    .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - - - - - .
161	    |                             |                                    |
162	    |         .-----------------------.   1  .------------------------.|
163	    |         | Resource Orchestrator | -----|Storage/Computation Pool||
164	    |         '-----------------------' \    '------------------------'|
165	    |             /   |        | 4     \ \                             |
166	    |          2 /  3 |        |       3\ \ 2                          |
167	    |    .-------------.  .-----------. .-------------.                |
168	    |    | ALTO Server |  | Execution | | ALTO Server |                |
169	    |    '-------------'  |  Agents   | '-------------'                |
170	    |          |          '-----------'        |                       |
171	    |          |          /             \      |                       |
172	    |  .----------------./               \ .----------------.          |
173	    |  |     Site 1     |       . . .      |     Site N     |          |
174	    |  '----------------'                  '----------------'          |
175	    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - '

177	    Figure 1: Resource Orchestration for Large-Scale, Multi-Domain Data
178	                         Analytics: Architecture.

180	   In the proposed design, each site deploys an ALTO server.  Within the
181	   site, the ALTO server collects resource information, such as dataset
182	   locations, storage resources, computation resources, networking
183	   resources and so on, and announces its capability, i.e., the type of
184	   information it is willing to share with other ALTO servers or the
185	   data analytics resource orchestrator.

187	   In this system, users submit analytics tasks to the resource
188	   orchestrator.  For a data analytics task, a user needs to at least
189	   provide the analytics program.  If the input dataset is not specified
190	   by the user, it means this task does not require a dataset as input.
191	   Similarly, if the user does not specify the computation resource or
192	   the output dataset storage site, the system will try to allocate
193	   default computation resources to this task, and will return the
194	   output dataset directly to the user.

196	   After getting a (set of) task(s) from the user(s), the orchestrator
197	   discover the available resources for executing the submitted task(s)
198	   in three steps, labeled in the figure.  The first step is called
199	   storage/computation resource discovery.  In this step, the
200	   orchestrator sends requests to a centralized storage and computation
201	   resource pool to find the location of candidate input storage
202	   resources, the computation resources and the output storage
203	   resources.

205	   The second step is called path discovery, in which the resource
206	   orchestrator sends endpoint or flow cost service queries to the ALTO
207	   servers at the site holding the candidate input storage resources and
208	   the site holding the candidate computation resources to ask about the
209	   connectivity from input dataset site to the computation site and that
210	   from the computation site to the output site.  The cost type of such
211	   queries is path vector defined in [DRAFT-PV].  The response sent back
212	   from the ALTO server to the orchestrator is a vector.  Each element
213	   in this vector is the IP address of the ingress gateway switch/router
214	   that the candidate flow will pass along the AS-path.  This vector is
215	   called the site-path in this document.

217	   After collecting the site-path of all the candidate (storage,
218	   computation) flows, for each site X, the orchestrator derives F_X,
219	   the set of candidate flows that will consume networking resources in
220	   site X.  Then the orchestrator will send endpoint/flow cost service
221	   queries to the ALTO server at each site X to ask about the networking
222	   resource sharing of the flow set F_X in site X.  The returned
223	   response is a set of linear inequalities called resource state
224	   abstraction.

226	   Using the resource information collected from the three-phase
227	   resource discovery process, the resource orchestrator can run an
228	   scheduling algorithm to make the resource allocation decisions to
229	   execute the submitted tasks.  The decisions include job decomposition
230	   (DAG construction), task concatenation, job placement, network
231	   resource allocation for input dataset movement and output movement.
232	   These decisions will be sent to the corresponding execution agents at
233	   different sites, which will practice these decisions and send
234	   feedback to the orchestrator.

236	   When resource state changes, e.g., a network link is broken, the ALTO
237	   server at the scene will check whether the results of existing path
238	   discovery and networking resource discovery are affected by this
239	   event, and sends updated resource information using the ALTO-SSE
240	   service.

242	   In the next few sections, we present the detailed design of the
243	   three-phase resource discovery.

245	4.  Storage and Computation Resource Discovery

247	   In order to allocation resources for a (set of) data analytic tasks,
248	   the scheduling system must first know the availability of the
249	   resources explicitly specified in the task, i.e., the storage
250	   resource storing the input dataset, the computation resources to run
251	   the analytics program and the storage resource that will be used to
252	   store the output dataset.  Such resources are only provided by the
253	   edge science networks.  Therefore, a strawman design is for the
254	   scheduling system to send requests to the resource information
255	   servers of all the edge science networks and to get such information.
256	   However, this solution is inefficient in that the scheduling system
257	   needs to query all the edge science networks to get the complete
258	   information.

260	   This document adopts an alternative design, in which all the resource
261	   information servers proactively send all their information about the
262	   storage and computation resources to a centralized resource pool.
263	   This resource pool can be a DNS server or a traditional database.
264	   Different techniques are under investigation to improve the
265	   scalability of this design, including sharding and distributed
266	   hashing table (DHT).

268	5.  Path Discovery

270	   Having identified the locations of input dataset storage nodes, the
271	   locations of candidate computation nodes and the locations of
272	   candidate output dataset storage nodes, the scheduling system next
273	   needs to find out the connectivity information between storage nodes
274	   and computation nodes.  The first connectivity information is the
275	   reachability between storage nodes and the computation nodes.  A
276	   input storage node, a computation node and a output storage node can
277	   be allocated to execute a job only if data movement is allowed
278	   between the input storage node and the computation node, and between
279	   the computation node and the output storage node.

281	   Because edge science networks are connected through transit networks,
282	   the data movement between candidate storage nodes and computation
283	   nodes need to consume networking resources of multiple networks if
284	   these nodes are located at different edge science networks.  In order
285	   to find the networking resource sharing between different (storage,
286	   computation) pair, the scheduling system also needs to know which
287	   networks are involved in the data movement of each (storage,
288	   computation) node pair.

290	   To retrieve both the types of information, the scheduling system
291	   issues endpoint cost service queries to the ALTO servers at edge
292	   science networks.  For the ALTO server at an edge science network X,
293	   the scheduling system issues endpoint cost service defined in
294	   [RFC7285] or the extension flow cost service defined in [DRAFT-FCS]
295	   queries for all the (input storage node, computation node) pairs
296	   where the input storage node is located in X, and all the
297	   (computation node, output storage node) pairs where the computation
298	   node is located in X.  The cost type of such queries is the new path
299	   vector cost type introduced in [DRAFT-PV].

301	   For each (storage, computation) pair, the response sent by the ALTO
302	   servers at edge science networks is a path vector providing the
303	   information about the AS-level path for the data movement of this
304	   pair.  Different from the traditional path vector where each element
305	   is an AS name/number, each element in the path vector sent by the
306	   ALTO servers also includes the ingress IP address of the gateway
307	   switch/router of the corresponding network.  We call this path vector
308	   the "site-path", to differentiate it from the traditional AS-path.

310	5.1.  Using SDN to get flow-based site-path

312	   ALTO servers can compute the site-path for a given (storage,
313	   computation) pair using the information provided by BGP and
314	   traceroute.  However, BGP only supports destination-IP based routing
315	   and limits each network's ability to make fine-grained flow-based
316	   routing decisions.  We are investigating the usage of SDN technique
317	   to allow different networks in the multi-domain data analytics system
318	   to exchange and make fine-grained flow-based inter-domain routing
319	   decisions.  To avoid the route advertisement explosion brought by
320	   flow-based routing, we design use a sub/pub system that allows an
321	   ALTO server to send routing information queries of a set of flows,
322	   instead of the whole flow space, to other ALTO servers at other
323	   domains.

325	5.2.  Path Discovery Example

327	   The following is an example of path discovery query made by the
328	   orchestrator.

330	   { "cost-type":
331	       { "cost-mode": "array",
332	         "cost-metric": "ane-path" },
333	     "endpoint-flows":
334	        { "srcs": [ "ipv4:172.0.0.1", "ipv4:172.0.1.1"],
335	          "dsts": [ "ipv4:172.0.2.1", "ipv4:172.0.3.1"]}
336	   }

338	   And the following is the response sent from the ALTO server.

340	   {"endpoint-cost-map":
341	        "ipv4: 172.0.0.1 ": {
342	           "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
343	           "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]},
344	        "ipv4: 172.0.1.1 ": {
345	           "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"],
346	           "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]}
347	   }

349	6.  Networking Resource Discovery

351	   The responses from ALTO servers during the path discovery provides
352	   the connectivity information for every pair of candidate input
353	   dataset storage node and computation node, and that of every pair of
354	   candidate computation node to output storage node in the form of
355	   site-path.  With such information, the scheduling system can further
356	   discover the networking resource sharing between candidate (storage,
357	   computation) data movement flows.  In particular, for each network X,
358	   both edge science networks and transit network, we can easily derive
359	   the whole set of candidate data movement flows F_X that will enter
360	   network X from the site-path information of all candidate (storage,
361	   computation) data movement flows.  After deriving F_X for each
362	   network X, the scheduling system will send endpoint cost services or
363	   flow cost services to retrieve the resource state abstraction
364	   [DRAFT-RSA] for the flow set F_X.

366	6.1.  Networking Resource Discovery Example

368	   TBA.

370	6.2.  A Secure Multiparty Computation Protocol to Compute Minimal,
371	      Cross-Domain RSA

373	   The current design of ALTO-RSA can only compute the minimal resource
374	   state abstraction for a single network.  In Unicorn, we design a
375	   secure multiparty computation protocol to support the computation of
376	   minimal, cross-domain routing state abstraction.  This protocol
377	   contains each network's exposure of its redundant linear inequalities
378	   to a small number of other networks, and ensures that the
379	   orchestrator only gets the minimal, cross-domain resource state
380	   abstraction.  The overhead of this SMPC process is reasonable due to
381	   the adoption of state-of-the-art secure scalar product protocol.

383	7.  References

385	7.1.  Normative References

387	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
388	              Requirement Levels", BCP 14, RFC 2119,
389	              DOI 10.17487/RFC2119, March 1997,
390	              <https://www.rfc-editor.org/info/rfc2119>.

392	7.2.  Informative References

394	   [DRAFT-CC]
395	              Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N.
396	              Schwan, "ALTO Cost Calendar", 2017,
397	              <https://datatracker.ietf.org/doc/
398	              draft-ietf-alto-cost-calendar>.

400	   [DRAFT-DC]
401	              Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO
402	              Extensions for Collecting Data Center Resource
403	              Information", 2014, <https://datatracker.ietf.org/doc/
404	              draft-lee-alto-ext-dc-resource/>.

406	   [DRAFT-FCS]
407	              Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang,
408	              "ALTO Extension: Flow-based Cost Query", 2017,
409	              <https://datatracker.ietf.org/doc/draft-gao-alto-fcs/>.

411	   [DRAFT-MC]
412	              Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
413	              ALTO", 2017, <https://datatracker.ietf.org/doc/
414	              draft-ietf-alto-multi-cost/>.

416	   [DRAFT-NETGRAPH]
417	              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
418	              Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015,
419	              <https://tools.ietf.org/html/draft-yang-alto-topology-06>.

421	   [DRAFT-PM]
422	              Roome, W. and Y. Yang, "Extensible Property Maps for the
423	              ALTO Protocol", 2015, <https://datatracker.ietf.org/doc/
424	              draft-roome-alto-unified-props-new/>.

426	   [DRAFT-PV]
427	              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
428	              Yang, "ALTO Extension: Abstract Path Vector as a Cost
429	              Mode", 2015, <https://tools.ietf.org/html/
430	              draft-yang-alto-path-vector-01>.

432	   [DRAFT-RSA]
433	              Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G.
434	              Chen, "A Recommendation for Compressing ALTO Path
435	              Vectors", 2017, <https://datatracker.ietf.org/doc/
436	              draft-gao-alto-routing-state-abstraction/>.

438	   [DRAFT-SSE]
439	              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
440	              Server-Sent Events (SSE)", 2015,
441	              <https://datatracker.ietf.org/doc/
442	              draft-ietf-alto-incr-update-sse/>.

444	   [DRAFT-UNICORN-INFO]
445	              Xiang, Q., Newman, H., Bernstein, G., Du, H., Gao, K.,
446	              Mughal, A., Balcas, J., Zhang, J., and Y. Yang,
447	              "Implementation and Deployment of A Resource Orchestration
448	              System for Multi-Domain Data Analytics", 2017,
449	              <https://datatracker.ietf.org/doc/
450	              draft-xiang-alto-exascale-network-optimization/>.

452	   [HTCondor]
453	              Thain, D., Tannenbaum, T., and M. Livny, "Distributed
454	              computing in practice: the Condor experience", 2005,
455	              <http://dl.acm.org/citation.cfm?id=1064336>.

457	   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
458	              Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
459	              "Application-Layer Traffic Optimization (ALTO) Protocol",
460	              RFC 7285, DOI 10.17487/RFC7285, September 2014,
461	              <https://www.rfc-editor.org/info/rfc7285>.

463	Authors' Addresses

465	   Qiao Xiang
466	   Tongji/Yale University
467	   51 Prospect Street
468	   New Haven, CT
469	   USA

471	   Email: qiao.xiang@cs.yale.edu
472	   Y. Richard Yang
473	   Tongji/Yale University
474	   51 Prospect Street
475	   New Haven, CT
476	   USA

478	   Email: yry@cs.yale.edu