idnits 2.17.1 draft-xiang-alto-multidomain-analytics-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) == There are 6 instances of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (December 17, 2017) is 2321 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'DRAFT-CC' is defined on line 394, but no explicit reference was found in the text == Unused Reference: 'DRAFT-DC' is defined on line 400, but no explicit reference was found in the text == Unused Reference: 'DRAFT-MC' is defined on line 411, but no explicit reference was found in the text == Unused Reference: 'DRAFT-NETGRAPH' is defined on line 416, but no explicit reference was found in the text == Unused Reference: 'DRAFT-PM' is defined on line 421, but no explicit reference was found in the text == Unused Reference: 'DRAFT-SSE' is defined on line 438, but no explicit reference was found in the text == Unused Reference: 'HTCondor' is defined on line 452, but no explicit reference was found in the text == Outdated reference: A later version (-04) exists of draft-yang-alto-path-vector-01 Summary: 2 errors (**), 0 flaws (~~), 11 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 ALTO WG Q. Xiang 3 Internet-Draft Y. Yang 4 Intended status: Standards Track Tongji/Yale University 5 Expires: June 20, 2018 December 17, 2017 7 Unicorn: Resource Orchestration for Large-Scale, Multi-Domain Data 8 Analytics 9 draft-xiang-alto-multidomain-analytics-00.txt 11 Abstract 13 This document presents the design of Unicorn, a multi-domain, 14 geographically-distributed, data-intensive analytics system. The 15 setting of such a system includes edge science networks, which 16 provide storage and computation resources for collecting, sharing and 17 analyzing extremely large amounts of data, and transit networks, 18 which provide networking resources to connects edge science networks 19 for transmitting large science datasets. 21 The key design challenge is to accurately discover and represent 22 resource information from different domains. Unicorn leverages 23 multiple ALTO services, including ALTO-Path Vector, ALTO-Routing 24 State Abstraction, ALTO-Server-Side Event and ALTO-Flow Cost Service 25 to address this challenge. In particular, Unicorn decomposes the 26 resource discovery into three phases. The first phase is to identify 27 endpoint resource, e.g., dataset storage location, computation 28 resource location and output storage resource location. The second 29 phase is to identify the reachability information between the 30 locations of storage and computation resources. The third phase is 31 to identify the available networking resource connecting different 32 storage and computation resources. All information collected through 33 these three phases can be used by a logically centralized scheduling 34 system to orchestrate the resources usage. 36 Status of This Memo 38 This Internet-Draft is submitted in full conformance with the 39 provisions of BCP 78 and BCP 79. 41 Internet-Drafts are working documents of the Internet Engineering 42 Task Force (IETF). Note that other groups may also distribute 43 working documents as Internet-Drafts. The list of current Internet- 44 Drafts is at https://datatracker.ietf.org/drafts/current/. 46 Internet-Drafts are draft documents valid for a maximum of six months 47 and may be updated, replaced, or obsoleted by other documents at any 48 time. It is inappropriate to use Internet-Drafts as reference 49 material or to cite them other than as "work in progress." 51 This Internet-Draft will expire on June 20, 2018. 53 Copyright Notice 55 Copyright (c) 2017 IETF Trust and the persons identified as the 56 document authors. All rights reserved. 58 This document is subject to BCP 78 and the IETF Trust's Legal 59 Provisions Relating to IETF Documents 60 (https://trustee.ietf.org/license-info) in effect on the date of 61 publication of this document. Please review these documents 62 carefully, as they describe your rights and restrictions with respect 63 to this document. Code Components extracted from this document must 64 include Simplified BSD License text as described in Section 4.e of 65 the Trust Legal Provisions and are provided without warranty as 66 described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 71 1.1. Settings . . . . . . . . . . . . . . . . . . . . . . . . 3 72 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 73 3. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 4 74 4. Storage and Computation Resource Discovery . . . . . . . . . 6 75 5. Path Discovery . . . . . . . . . . . . . . . . . . . . . . . 6 76 5.1. Using SDN to get flow-based site-path . . . . . . . . . . 7 77 5.2. Path Discovery Example . . . . . . . . . . . . . . . . . 7 78 6. Networking Resource Discovery . . . . . . . . . . . . . . . . 8 79 6.1. Networking Resource Discovery Example . . . . . . . . . . 8 80 6.2. A Secure Multiparty Computation Protocol to Compute 81 Minimal, Cross-Domain RSA . . . . . . . . . . . . . . . . 8 82 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 9 83 7.1. Normative References . . . . . . . . . . . . . . . . . . 9 84 7.2. Informative References . . . . . . . . . . . . . . . . . 9 85 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 87 1. Introduction 89 As the data volume increases exponentially over time, data intensive 90 analytics is transiting from single-domain computing to multi- 91 organizational, geographically-distributed, collaborative computing, 92 where different organizations contribute various resources, e.g., 93 computation, storage and networking resources, to collaboratively 94 collect, share and analyze extremely large amounts of data. One 95 leading example is the Large Hadron Collider (LHC) high energy 96 physics (HEP) program, which aims to find new particles and 97 interactions in a previously inaccessible range of energies. The 98 scientific collaborations that have built and operate large HEP 99 experimental facilities at the LHC, such as the Compact Muon Solenoid 100 (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have more than 101 300 petabytes of data under management at hundreds of sites around 102 the world, and this volume is expected to grow to one exabyte by 103 approximately 2018. 105 This document presents Unicorn, a generic design for resource 106 orchestration for large-scale, multi-domain data analytics. The key 107 design challenge for such a resource orchestration system is to 108 accurately discover and represent the resource information from 109 different domains. Our design resorts to the Application-Layer 110 Traffic Optimization Protocol (ALTO) [RFC7285] to address this 111 challenge. In particular, several ALTO extension services, including 112 ALTO-Path Vector, ALTO-Routing State Abstraction, ALTO-Server-Side 113 Event and ALTO-Flow Cost Service, are integrated in the proposed 114 design. 116 This document focuses on the design details of Unicorn. We present 117 the implementation and deployment experience of Unicorn in another 118 document [DRAFT-UNICORN-INFO]. 120 1.1. Settings 122 The targeting scenario is as follows. There are two types of 123 networks in the whole system. The first type is the edge science 124 network. An edge science networks is usually a cluster residing in a 125 campus network. It provides storage resources to store large 126 scientific datasets and computation resources to analyze these 127 datasets. The second type is the transit network. A transit network 128 does not provide any storage or computation resources. It only 129 provides networking resources to inter-connect different edge science 130 networks so that datasets can be moved and shared between different 131 edge science networks. Edge science networks do not directly connect 132 to each other, but are connected through transit networks. 134 Without loss of generality, a data analytics task is defined as a 135 3-tuple: (input dataset, program, output site). A task can be 136 further decomposed into a set of jobs, who have a precedence relation 137 defined by a directed acyclic graph (DAG). And each job can also be 138 defined as a 3-tuple: (input dataset, program, output site). 140 2. Requirements Language 142 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 143 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 144 document are to be interpreted as described in [RFC2119]. 146 3. Overview 148 The key design challenge for multi-domain data analytics system is to 149 accurately discover the resource information from different sites 150 while preserving the autonomy and privacy of each site. In order to 151 address this challenge, the design needs to strike a balance between 152 the information accuracy, the efficiency of resource discovery and 153 the privacy of each site. In particular, we propose the following 154 architecture in the Figure 1. 156 .---------. 157 | Users | 158 '---------' 159 | Tasks 160 .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - - - - - . 161 | | | 162 | .-----------------------. 1 .------------------------.| 163 | | Resource Orchestrator | -----|Storage/Computation Pool|| 164 | '-----------------------' \ '------------------------'| 165 | / | | 4 \ \ | 166 | 2 / 3 | | 3\ \ 2 | 167 | .-------------. .-----------. .-------------. | 168 | | ALTO Server | | Execution | | ALTO Server | | 169 | '-------------' | Agents | '-------------' | 170 | | '-----------' | | 171 | | / \ | | 172 | .----------------./ \ .----------------. | 173 | | Site 1 | . . . | Site N | | 174 | '----------------' '----------------' | 175 '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' 177 Figure 1: Resource Orchestration for Large-Scale, Multi-Domain Data 178 Analytics: Architecture. 180 In the proposed design, each site deploys an ALTO server. Within the 181 site, the ALTO server collects resource information, such as dataset 182 locations, storage resources, computation resources, networking 183 resources and so on, and announces its capability, i.e., the type of 184 information it is willing to share with other ALTO servers or the 185 data analytics resource orchestrator. 187 In this system, users submit analytics tasks to the resource 188 orchestrator. For a data analytics task, a user needs to at least 189 provide the analytics program. If the input dataset is not specified 190 by the user, it means this task does not require a dataset as input. 191 Similarly, if the user does not specify the computation resource or 192 the output dataset storage site, the system will try to allocate 193 default computation resources to this task, and will return the 194 output dataset directly to the user. 196 After getting a (set of) task(s) from the user(s), the orchestrator 197 discover the available resources for executing the submitted task(s) 198 in three steps, labeled in the figure. The first step is called 199 storage/computation resource discovery. In this step, the 200 orchestrator sends requests to a centralized storage and computation 201 resource pool to find the location of candidate input storage 202 resources, the computation resources and the output storage 203 resources. 205 The second step is called path discovery, in which the resource 206 orchestrator sends endpoint or flow cost service queries to the ALTO 207 servers at the site holding the candidate input storage resources and 208 the site holding the candidate computation resources to ask about the 209 connectivity from input dataset site to the computation site and that 210 from the computation site to the output site. The cost type of such 211 queries is path vector defined in [DRAFT-PV]. The response sent back 212 from the ALTO server to the orchestrator is a vector. Each element 213 in this vector is the IP address of the ingress gateway switch/router 214 that the candidate flow will pass along the AS-path. This vector is 215 called the site-path in this document. 217 After collecting the site-path of all the candidate (storage, 218 computation) flows, for each site X, the orchestrator derives F_X, 219 the set of candidate flows that will consume networking resources in 220 site X. Then the orchestrator will send endpoint/flow cost service 221 queries to the ALTO server at each site X to ask about the networking 222 resource sharing of the flow set F_X in site X. The returned 223 response is a set of linear inequalities called resource state 224 abstraction. 226 Using the resource information collected from the three-phase 227 resource discovery process, the resource orchestrator can run an 228 scheduling algorithm to make the resource allocation decisions to 229 execute the submitted tasks. The decisions include job decomposition 230 (DAG construction), task concatenation, job placement, network 231 resource allocation for input dataset movement and output movement. 232 These decisions will be sent to the corresponding execution agents at 233 different sites, which will practice these decisions and send 234 feedback to the orchestrator. 236 When resource state changes, e.g., a network link is broken, the ALTO 237 server at the scene will check whether the results of existing path 238 discovery and networking resource discovery are affected by this 239 event, and sends updated resource information using the ALTO-SSE 240 service. 242 In the next few sections, we present the detailed design of the 243 three-phase resource discovery. 245 4. Storage and Computation Resource Discovery 247 In order to allocation resources for a (set of) data analytic tasks, 248 the scheduling system must first know the availability of the 249 resources explicitly specified in the task, i.e., the storage 250 resource storing the input dataset, the computation resources to run 251 the analytics program and the storage resource that will be used to 252 store the output dataset. Such resources are only provided by the 253 edge science networks. Therefore, a strawman design is for the 254 scheduling system to send requests to the resource information 255 servers of all the edge science networks and to get such information. 256 However, this solution is inefficient in that the scheduling system 257 needs to query all the edge science networks to get the complete 258 information. 260 This document adopts an alternative design, in which all the resource 261 information servers proactively send all their information about the 262 storage and computation resources to a centralized resource pool. 263 This resource pool can be a DNS server or a traditional database. 264 Different techniques are under investigation to improve the 265 scalability of this design, including sharding and distributed 266 hashing table (DHT). 268 5. Path Discovery 270 Having identified the locations of input dataset storage nodes, the 271 locations of candidate computation nodes and the locations of 272 candidate output dataset storage nodes, the scheduling system next 273 needs to find out the connectivity information between storage nodes 274 and computation nodes. The first connectivity information is the 275 reachability between storage nodes and the computation nodes. A 276 input storage node, a computation node and a output storage node can 277 be allocated to execute a job only if data movement is allowed 278 between the input storage node and the computation node, and between 279 the computation node and the output storage node. 281 Because edge science networks are connected through transit networks, 282 the data movement between candidate storage nodes and computation 283 nodes need to consume networking resources of multiple networks if 284 these nodes are located at different edge science networks. In order 285 to find the networking resource sharing between different (storage, 286 computation) pair, the scheduling system also needs to know which 287 networks are involved in the data movement of each (storage, 288 computation) node pair. 290 To retrieve both the types of information, the scheduling system 291 issues endpoint cost service queries to the ALTO servers at edge 292 science networks. For the ALTO server at an edge science network X, 293 the scheduling system issues endpoint cost service defined in 294 [RFC7285] or the extension flow cost service defined in [DRAFT-FCS] 295 queries for all the (input storage node, computation node) pairs 296 where the input storage node is located in X, and all the 297 (computation node, output storage node) pairs where the computation 298 node is located in X. The cost type of such queries is the new path 299 vector cost type introduced in [DRAFT-PV]. 301 For each (storage, computation) pair, the response sent by the ALTO 302 servers at edge science networks is a path vector providing the 303 information about the AS-level path for the data movement of this 304 pair. Different from the traditional path vector where each element 305 is an AS name/number, each element in the path vector sent by the 306 ALTO servers also includes the ingress IP address of the gateway 307 switch/router of the corresponding network. We call this path vector 308 the "site-path", to differentiate it from the traditional AS-path. 310 5.1. Using SDN to get flow-based site-path 312 ALTO servers can compute the site-path for a given (storage, 313 computation) pair using the information provided by BGP and 314 traceroute. However, BGP only supports destination-IP based routing 315 and limits each network's ability to make fine-grained flow-based 316 routing decisions. We are investigating the usage of SDN technique 317 to allow different networks in the multi-domain data analytics system 318 to exchange and make fine-grained flow-based inter-domain routing 319 decisions. To avoid the route advertisement explosion brought by 320 flow-based routing, we design use a sub/pub system that allows an 321 ALTO server to send routing information queries of a set of flows, 322 instead of the whole flow space, to other ALTO servers at other 323 domains. 325 5.2. Path Discovery Example 327 The following is an example of path discovery query made by the 328 orchestrator. 330 { "cost-type": 331 { "cost-mode": "array", 332 "cost-metric": "ane-path" }, 333 "endpoint-flows": 334 { "srcs": [ "ipv4:172.0.0.1", "ipv4:172.0.1.1"], 335 "dsts": [ "ipv4:172.0.2.1", "ipv4:172.0.3.1"]} 336 } 338 And the following is the response sent from the ALTO server. 340 {"endpoint-cost-map": 341 "ipv4: 172.0.0.1 ": { 342 "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"], 343 "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]}, 344 "ipv4: 172.0.1.1 ": { 345 "ipv4: 172.0.2.1 ": ["ane:172.1.0.1", "ane:172.0.2.0"], 346 "ipv4: 172.0.3.1 ": ["ane:172.1.0.1", "ane:172.0.3.0"]} 347 } 349 6. Networking Resource Discovery 351 The responses from ALTO servers during the path discovery provides 352 the connectivity information for every pair of candidate input 353 dataset storage node and computation node, and that of every pair of 354 candidate computation node to output storage node in the form of 355 site-path. With such information, the scheduling system can further 356 discover the networking resource sharing between candidate (storage, 357 computation) data movement flows. In particular, for each network X, 358 both edge science networks and transit network, we can easily derive 359 the whole set of candidate data movement flows F_X that will enter 360 network X from the site-path information of all candidate (storage, 361 computation) data movement flows. After deriving F_X for each 362 network X, the scheduling system will send endpoint cost services or 363 flow cost services to retrieve the resource state abstraction 364 [DRAFT-RSA] for the flow set F_X. 366 6.1. Networking Resource Discovery Example 368 TBA. 370 6.2. A Secure Multiparty Computation Protocol to Compute Minimal, 371 Cross-Domain RSA 373 The current design of ALTO-RSA can only compute the minimal resource 374 state abstraction for a single network. In Unicorn, we design a 375 secure multiparty computation protocol to support the computation of 376 minimal, cross-domain routing state abstraction. This protocol 377 contains each network's exposure of its redundant linear inequalities 378 to a small number of other networks, and ensures that the 379 orchestrator only gets the minimal, cross-domain resource state 380 abstraction. The overhead of this SMPC process is reasonable due to 381 the adoption of state-of-the-art secure scalar product protocol. 383 7. References 385 7.1. Normative References 387 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 388 Requirement Levels", BCP 14, RFC 2119, 389 DOI 10.17487/RFC2119, March 1997, 390 . 392 7.2. Informative References 394 [DRAFT-CC] 395 Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N. 396 Schwan, "ALTO Cost Calendar", 2017, 397 . 400 [DRAFT-DC] 401 Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO 402 Extensions for Collecting Data Center Resource 403 Information", 2014, . 406 [DRAFT-FCS] 407 Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang, 408 "ALTO Extension: Flow-based Cost Query", 2017, 409 . 411 [DRAFT-MC] 412 Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost 413 ALTO", 2017, . 416 [DRAFT-NETGRAPH] 417 Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. 418 Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015, 419 . 421 [DRAFT-PM] 422 Roome, W. and Y. Yang, "Extensible Property Maps for the 423 ALTO Protocol", 2015, . 426 [DRAFT-PV] 427 Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. 428 Yang, "ALTO Extension: Abstract Path Vector as a Cost 429 Mode", 2015, . 432 [DRAFT-RSA] 433 Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G. 434 Chen, "A Recommendation for Compressing ALTO Path 435 Vectors", 2017, . 438 [DRAFT-SSE] 439 Roome, W. and Y. Yang, "ALTO Incremental Updates Using 440 Server-Sent Events (SSE)", 2015, 441 . 444 [DRAFT-UNICORN-INFO] 445 Xiang, Q., Newman, H., Bernstein, G., Du, H., Gao, K., 446 Mughal, A., Balcas, J., Zhang, J., and Y. Yang, 447 "Implementation and Deployment of A Resource Orchestration 448 System for Multi-Domain Data Analytics", 2017, 449 . 452 [HTCondor] 453 Thain, D., Tannenbaum, T., and M. Livny, "Distributed 454 computing in practice: the Condor experience", 2005, 455 . 457 [RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., 458 Previdi, S., Roome, W., Shalunov, S., and R. Woundy, 459 "Application-Layer Traffic Optimization (ALTO) Protocol", 460 RFC 7285, DOI 10.17487/RFC7285, September 2014, 461 . 463 Authors' Addresses 465 Qiao Xiang 466 Tongji/Yale University 467 51 Prospect Street 468 New Haven, CT 469 USA 471 Email: qiao.xiang@cs.yale.edu 472 Y. Richard Yang 473 Tongji/Yale University 474 51 Prospect Street 475 New Haven, CT 476 USA 478 Email: yry@cs.yale.edu