ALTO WG                                                         Q. Xiang
Internet-Draft                                    Tongji/Yale University
Intended status: Informational                                     F. Le
Expires: January 3, 2019                                             IBM
                                                                 Y. Yang
                                                  Tongji/Yale University
                                                               H. Newman
                                      California Institute of Technology
                                                                   H. Du
                                                       Tongji University
                                                            July 2, 2018

 Unicorn: Resource Orchestration for Multi-Domain, Geo-Distributed Data
                               Analytics
             draft-xiang-alto-multidomain-analytics-02.txt

Abstract

   As the data volume increases exponentially over time, data analytics
   is transitioning from a single-domain network to a multi-domain,
   geo-distributed network, where different member networks contribute
   various resources, e.g., computation, storage and networking
   resources, to collaboratively collect, share and analyze extremely
   large amounts of data. Such a network calls for a resource
   orchestration framework that emphasizes the performance
   predictability of data analytics jobs, the high utilization of
   resources, and the autonomy and privacy of member networks.

   This document presents the design of Unicorn, a unified resource
   orchestration framework for multi-domain, geo-distributed data
   analytics, which uses the Application-Layer Traffic Optimization
   (ALTO) protocol as the key component to (1) allow member networks
   to provide accurate information on different types of resources; (2)
   protect the private information of member networks; and (3) allow
   data analytics jobs to accurately describe their requirements for
   different types of resources. As a part of Unicorn, an ALTO
   extension for privacy-preserving interdomain information aggregation
   is also presented.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Requirements Language
   3. Changes Since Version -01
   4. Characteristics of Multi-Domain, Geo-Distributed Data
      Analytics
     4.1. Dynamic Data Analytics Workload
     4.2. Dynamic Resource Availability
   5. Design Requirements
   6. Review of Resource Orchestration Designs for Data Analytics
     6.1. Centralized resource-graph-based orchestration
     6.2. Centralized ClassAds-based orchestration
     6.3. Distributed opportunistic orchestration
     6.4. Inadequacy of Existing Designs for Multi-Domain, Geo-
          Distributed Data Analytics
   7. Unicorn Design
     7.1. Choosing ALTO as the Resource Information Model
     7.2. Architecture of Unicorn
       7.2.1. Three-Phase Resource Discovery
       7.2.2. Proactive Full-Mesh Resource Discovery
     7.3. Example
   8. ALTO Extension: Privacy-Preserving Interdomain Information
      Aggregation for Resource Discovery
     8.1. Extension Specification
     8.2. Example
   9. Discussion
     9.1. Discovering the Domain-Paths Using a New Interdomain
          Routing Protocol
   10. Security Considerations
   11. IANA Considerations
   12. References
     12.1. Normative References
     12.2. Informative References
   Authors' Addresses

1. Introduction

   This document describes the design of Unicorn, a unified resource
   orchestration framework for large-scale data analytics in multi-
   domain, geo-distributed networks.
   An important use case for such settings is the Large Hadron Collider
   (LHC) network, which consists of over 180 member networks around the
   world and enables scientists to access multiple resources, e.g.,
   computing, storage and networking resources, distributed across the
   member networks to conduct large-scale data analytics. With more and
   more data being generated and stored in different geo-distributed
   member networks, network architects and administrators are exploring
   different designs for efficient resource orchestration in
   multi-domain, geo-distributed networks.

   The design presented in this document is based on the development and
   deployment experience of Unicorn in the CMS network, one of the
   largest scientific experiments in the LHC network. The primary
   requirements of resource orchestration in such a multi-domain,
   geo-distributed environment are the performance predictability of
   various data analytics jobs, the high utilization of different types
   of resources, and the autonomy and privacy of resource owners, i.e.,
   member networks.

   Pre-production development and extensive testing have shown that the
   Application-Layer Traffic Optimization (ALTO) protocol [RFC7285] is
   well suited as a fundamental component in Unicorn for providing a
   generic representation that (1) allows different types of data
   analytics jobs to accurately describe their resource requirements and
   (2) allows member networks to provide accurate information on the
   different types of resources they own while preserving their privacy.
   This is in contrast with state-of-the-art resource orchestration
   frameworks, such as HTCondor and Mesos, which either do not provide
   accurate networking information or expose all the private details of
   member networks.
   This document elaborates on the design requirements
   of resource orchestration in multi-domain, geo-distributed networks
   that lead to this design choice and presents the details of Unicorn,
   including an ALTO extension for privacy-preserving, interdomain
   information aggregation.

   This document first gives an overview of the characteristics of
   multi-domain, geo-distributed data analytics. Then, the design
   requirements for resource orchestration under such settings are
   summarized. After reviewing existing designs and their limitations,
   this document gives the arguments for using ALTO as the generic
   representation for describing both resource requirements and
   resource information and describes the design details of Unicorn.
   Finally, a privacy-preserving, interdomain extension of ALTO is
   presented.

2. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3. Changes Since Version -01

   o  Update the design of the Unicorn system by introducing a proactive
      full-mesh resource discovery mechanism.

   o  Update the design of the privacy-preserving interdomain resource
      information aggregation protocol.

4. Characteristics of Multi-Domain, Geo-Distributed Data Analytics

   This section describes the characteristics of multi-domain, geo-
   distributed data analytics.

4.1. Dynamic Data Analytics Workload

   In multi-domain, geo-distributed data analytics, extremely large
   amounts of data are generated and stored across different member
   networks. Authorized users from different organizations can access
   data and resources in member networks to conduct various data
   analytics jobs using various data analytics applications.

   A data analytics application usually provides an automated process
   that decomposes a large data analytics job into a set of smaller
   tasks, whose dependencies are expressed as a directed acyclic graph
   (DAG). Tasks that do not depend on one another can be executed in
   parallel to improve the efficiency of the data analytics job they
   belong to. This decomposition is highly user- and
   application-dependent.

   Each task may have different requirements on different resources.
   For instance, task T1 may require dataset A in storage node X as
   input and 1 CPU as the computing resource, while task T2 may require
   dataset B in storage node Y as input and 2 CPUs as the computing
   resource. Furthermore, each task may require resources from
   different member networks. In the previous example, T1 may require
   its output to be stored in a storage node in another member network
   for the purpose of secure storage. The resource requirements of
   tasks are highly user- and application-dependent.

   From the above description, it is observed that the workload of
   multi-domain, geo-distributed data analytics is highly dynamic, in
   terms of the number of users, the types of applications, the number
   of jobs, the decomposition of jobs and the resource requirements of
   tasks.

   Despite such dynamism, users generally expect predictable
   performance for their analytics jobs (TODO: add Mogul citation).
   Hence the resource orchestration for multi-domain,
   geo-distributed data analytics must be able to achieve efficient
   resource sharing among different data analytics jobs of different
   applications from different users. To this end, a generic
   representation of resource requirements for different tasks from
   different analytics applications must be chosen.
   Furthermore, to maximize deployment, the resource orchestration
   framework must be independent of and compatible with data analytics
   applications.

4.2. Dynamic Resource Availability

   In the multi-domain, geo-distributed data analytics network,
   different member networks belong to different administrative domains.
   Each member network has its own resource management policies and can
   choose to use different management software, such as HTCondor and
   Mesos.

   Each member network provides different types of resources in
   different amounts. For example, transit networks such as ESnet and
   Internet2 provide high-bandwidth networking resources. In contrast,
   campus science networks provide abundant computation and storage
   resources, but may provide limited networking bandwidth. Some
   smaller science networks provide only limited computation and storage
   resources. The availability of the resources in each member network
   is subject to the autonomous control of the member network.

   Furthermore, member networks are interconnected with high bandwidth-
   delay-product links, where state-of-the-art networking resource
   allocation mechanisms, such as TCP, become inefficient [XCP].

   From the above description, it is observed that the resource
   availability of the multi-domain, geo-distributed data analytics
   network is also highly dynamic, subject to the types of member
   networks, the resources provided by member networks and the resource
   management policies and management software used by member networks.

   Despite such dynamism, member networks generally agree that the
   resource orchestration for multi-domain, geo-distributed data
   analytics must achieve high utilization of different types of
   resources while respecting the autonomy and privacy of each member
   network.
   To this end, a generic representation of resource
   availabilities for different types of resources must be chosen. Such
   a representation must be accurate and at the same time maintain the
   privacy of member networks. Furthermore, to maximize
   deployment, the resource orchestration framework must be independent
   of and compatible with the resource management systems used by member
   networks.

5. Design Requirements

   This section summarizes the design requirements for resource
   orchestration for multi-domain, geo-distributed data analytics from
   the previous section.

   o  REQ1: Provide performance predictability for data analytics jobs.

   o  REQ2: Achieve efficient resource sharing among data analytics
      jobs.

   o  REQ3: Achieve high utilization of different types of resources
      in member networks.

   o  REQ4: Maintain the autonomy and privacy of member networks.

   o  REQ5: Provide compatibility with different data analytics
      applications and resource management systems to maximize
      deployment.

6. Review of Resource Orchestration Designs for Data Analytics

   This section provides an overview of three general types of resource
   orchestration designs for data analytics: the centralized resource-
   graph-based orchestration, the centralized ClassAds-based
   orchestration and the distributed opportunistic orchestration. Then,
   the key reason why these designs are inadequate for multi-domain,
   geo-distributed data analytics is provided.

6.1. Centralized resource-graph-based orchestration

   Systems such as Mesos [Mesos] and Borg [Borg] adopt a graph-based
   abstraction to represent the resource availability of computing
   clusters. Each node in the graph is a physical node representing
   computation or storage resources and each edge between a pair of
   nodes denotes the networking resource connecting two physical nodes.
   This design is inadequate for multi-domain, geo-distributed data
   analytics systems because (1) it compromises the privacy of different
   member networks by revealing all the details of resources; and (2)
   the overhead of keeping the resource availability graph up to date is
   too high, given the heterogeneity and dynamics of resources
   from different member networks.

6.2. Centralized ClassAds-based orchestration

   HTCondor [HTCondor] proposes a ClassAds programming model, which
   allows different resource owners to advertise their resource supply
   and the job owners to advertise the resource demand. However, this
   programming model does not support the accurate discovery of
   networking resources and instead leaves the orchestration of
   networking resources completely to TCP, which is known to behave
   poorly in networks with high bandwidth-delay products [XCP].

6.3. Distributed opportunistic orchestration

   Some systems, such as Apollo [Apollo] and Sparrow [Sparrow], use a
   distributed design. In this design, given a data analytics job, a
   small number of computing and storage nodes are randomly selected as
   candidates. Then a scheduling algorithm selects the best pair of
   computing and storage nodes within this small set of candidates.
   Although this design has been shown in production to achieve
   performance very close to that of the theoretically optimal resource
   allocation scheme, it cannot be applied to multi-domain,
   geo-distributed data analytics because (1) the pool of computing and
   storage resources is much larger, and is distributed across the
   world, and (2) it is hard to orchestrate networking resources in a
   distributed manner in such a high bandwidth-delay product scenario.

6.4. Inadequacy of Existing Designs for Multi-Domain, Geo-Distributed
     Data Analytics

   Applying the designs reviewed in the preceding subsections to multi-
   domain, geo-distributed data analytics satisfies only the design
   requirement of compatibility (REQ5) and leaves all the other
   requirements unfulfilled. The key reason is that they do not have an
   information model that simultaneously

   o  allows member networks to provide accurate information on
      different types of resources, e.g., the computing, storage and
      networking resources, they own;

   o  keeps the private information of member networks, such as physical
      topologies and policies, from the data analytics applications; and

   o  allows data analytics jobs to accurately describe their
      requirements for different types of resources.

7. Unicorn Design

   This section presents the design of the Unicorn framework. First,
   the motivation for using ALTO as the information model of resource
   orchestration for multi-domain, geo-distributed data analytics is
   reviewed. Then the architecture of Unicorn is provided.

7.1. Choosing ALTO as the Resource Information Model

   As reviewed in the preceding section, the commonly used resource-
   graph-based information model and the ClassAds information model do
   not support accurate yet privacy-preserving resource discovery
   across different member networks. In contrast, the ALTO protocol
   uses abstract maps of networks to provide network information with
   the goal of modifying network resource consumption patterns while
   maintaining or improving application performance [RFC7285]. This
   document proposes the use of ALTO for providing information about
   different types of resources, e.g., computing, storage and networking
   resources. This design has the following advantages:

   o  ALTO provides the network information based on abstract maps of a
      network.
      Additional services are built on top of the ALTO
      abstract maps to provide information about other types of
      resources, e.g., the computing and storage resources. These maps
      provide accurate information about different types of resources
      for the resource orchestration system to effectively utilize them
      for data analytics applications. For example, the ALTO Endpoint
      Property Service can provide information about computing nodes and
      storage nodes.

   o  The ALTO abstract maps provide a simplified view of resources of
      member networks, instead of the full details of their resource
      availability. Thus ALTO allows member networks to keep their
      private information, such as physical topologies and policies,
      from the applications. For example, the ALTO Network Map service
      provides a "one-big-switch" view that defines a grouping of
      network endpoints. This view hides the details of the underlying
      physical topology of the network, and a network deploying the ALTO
      server has the autonomy to adopt any endpoint grouping algorithm.

   o  ALTO uses a client-server model, in which applications can use
      ALTO clients to accurately describe their requirements for
      different types of resources and send these requirements to the
      ALTO servers to retrieve accurate information about the resources
      that suit their requirements. For example, the ALTO Multi-Cost
      service [RFC8189] allows an ALTO client to specify a logical set
      of tests in a query. Such tests are used by ALTO servers to filter
      out the information of unqualified resources from the response
      sent back to the ALTO client.

7.2. Architecture of Unicorn

   This section describes the design details of Unicorn. Figure 1
   presents the architecture of Unicorn for a multi-domain, geo-
   distributed data analytics system with N member networks. In
   particular, Unicorn consists of the following key components:

       .-------------.             .-------------.
       |Application 1|     ...     |Application N|
       '-------------'             '-------------'
               \                         /
.- - - - - - - -\- - - - - - - - - - - -/- - - - - - - - - - - - - - -.
| Unicorn        \                     /                              |
|               .-----------------------.                             |
|               | Resource Orchestrator |     .----------------------.|
|               |                       |     |Distributed Hash Table||
|               |     .-----------.     |---- |   of Computing and   ||
|               |     |ALTO Client|     |     |  Storage Resources   ||
|               |     '-----------'     |     '----------------------'|
|               '-----------------------'                             |
|              /            |            \                            |
|             /             |             \                           |
|  .-------------.    .-----------.    .-------------.                |
|  |ALTO Server 1|    | Execution |    |ALTO Server M|                |
|  '-------------'    |  Agents   |    '-------------'                |
|         |           '-----------'           |                       |
|         |          /             \          |                       |
| .----------------./               \ .----------------.              |
| |     Site 1     |       . .        |     Site N     |              |
| '----------------'                  '----------------'              |
'- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

                  Figure 1: Architecture of Unicorn.

   o  ALTO Server: for each member network, one or more ALTO servers are
      deployed to provide accurate, yet privacy-preserving information
      about different types of resources owned by the corresponding
      network. Examples of such information include the link bandwidth
      between endpoints, the memory I/O bandwidth and the CPU
      utilization at computing endpoints and the storage space at
      storage endpoints. In addition to the basic ALTO services defined
      in [RFC7285], the ALTO servers in Unicorn also provide ALTO
      extension services such as the ALTO Multi-Cost Service [RFC8189],
      the ALTO Server-Sent Event Service [DRAFT-SSE] and the ALTO
      Multipart Cost Property Service [DRAFT-PV] to provide fine-grained
      resource information.

   o  Distributed Hash Table (DHT) of Computing and Storage Resources: A
      DHT system is deployed across member networks to look up the
      location of computing and storage resources.
      Compared with the
      current centralized lookup services in the CMS network, i.e.,
      PhEDEx and HTCondor, a DHT system provides a significant
      performance improvement for discovering the locations of computing
      and storage resources in multi-domain, geo-distributed data
      analytics systems.

   o  Resource Orchestrator: The orchestrator is a shim layer between
      the data analytics jobs from different applications and the member
      networks. It contains an ALTO client that communicates with the
      ALTO servers at member networks to retrieve resource information.
      Given a set of data analytics jobs, the orchestrator adopts a
      three-phase discovery process, described in Section 7.2.1, to
      find accurate information about all the resources that can be
      used to execute these jobs. Then the orchestrator runs a
      customized resource allocation algorithm to compute the resource
      allocation decisions for these jobs, and sends the decisions to
      the execution agents at the corresponding member networks.

   o  Execution Agent: One or more execution agents are deployed at each
      member network. They take the resource allocation decisions from
      the resource orchestrator, and communicate with the underlying
      resource management system deployed at the corresponding member
      network to reserve the resources for the data analytics jobs and
      execute them.

7.2.1. Three-Phase Resource Discovery

   The preceding subsection describes the architecture and the key
   components of Unicorn. What remains to be described is how to
   accurately discover information about different types of resources
   for a set of data analytics jobs with the assistance of ALTO. This
   section presents the three-phase resource discovery design in
   Unicorn.

7.2.1.1. Phase 1: Endpoint Property Discovery

   Figure 2 shows the procedure of the endpoint property discovery
   phase.
   Given a set of data analytics jobs, the resource orchestrator
   communicates with the DHT lookup system to find the locations, i.e.,
   the endpoint addresses, of all candidate computing and storage
   resources. With such information, the ALTO client then issues
   Endpoint Property Service (EPS) queries to the ALTO servers deployed
   at member networks to discover the properties of all candidate
   endpoints.

      .-----------------------.
      | Resource Orchestrator |  Jobs' Resource
      |                       |   Requirements   .------------.
      |     .-----------.     | ---------------> | DHT Lookup |
      |     |ALTO Client|     | <--------------- |   System   |
      |     '-----------'     |     Endpoint     '------------'
      '---------| ^-----------'     Locations
            EPS | | Endpoint
        Queries | | Properties
                v |
          .-------------.
          |ALTO Servers |
          '-------------'

           Figure 2: The Endpoint Property Discovery Phase.

7.2.1.2. Phase 2: Endpoint Path Discovery

   Candidate computing and storage endpoints need to move data between
   them before, during and after the execution of a data analytics job.
   In multi-domain, geo-distributed data analytics, a pair of candidate
   endpoints may not be in the same member network. In this case, the
   orchestrator needs to find out the connectivity information between
   such a pair of candidate endpoints.

   Figure 3 shows the procedure of the endpoint path discovery phase.
   Given a pair of candidate endpoints that are not in the same member
   network, the ALTO client in the orchestrator adopts an iterative
   process to find the interdomain connectivity information for this
   pair. It starts by issuing an ALTO Endpoint Cost Service query or an
   ALTO Flow-based Endpoint Cost Service [DRAFT-FCS] query to the ALTO
   server of the member network where the source endpoint is located.
   The cost type of this query is a customized type called next-hop,
   with a customized cost mode tuple and a customized cost metric
   next-network.

      .-----------------------.
      | Resource Orchestrator |
      |                       |
      |     .-----------.     |
      |     |ALTO Client|     |
      |     '-----------'     |
      '---------| ^-----------'
      Customized| | Endpoint
            ECS | | Path
        Queries | | Segments
                v |
          .-------------.
          |ALTO Servers |
          '-------------'

             Figure 3: The Endpoint Path Discovery Phase.

   The ALTO server returns a 2-tuple, where the first element is the
   autonomous system (AS) number of the next member network along the
   AS-path from the source endpoint to the destination endpoint, and
   the second element is the ingress of this next member network. In a
   member network, the ALTO server can get such information from the
   underlying interdomain routing protocol, e.g., BGP. Based on the
   received response, the ALTO client then issues a similar query to
   the ALTO server of the next member network. The process stops when
   the query reaches the ALTO server of the member network where the
   destination endpoint is located, which returns a null 2-tuple to
   notify the ALTO client. By the end of this process, the ALTO client
   can assemble a domain-path, in the form of a path vector of
   (ingress, AS) tuples, for this pair of candidate endpoints.

7.2.1.3. Phase 3: Resource State Abstraction Discovery

   After the second phase, the resource orchestrator has the
   connectivity information of each candidate endpoint pair, i.e., the
   domain-path. Equivalently, for each member network, it knows the set
   of all candidate endpoint pairs that will enter this network. With
   this information, the resource orchestrator can communicate with the
   ALTO servers at member networks to discover the resource sharing
   between all the candidate endpoint pairs. In particular, Unicorn
   extends the routing state abstraction [DRAFT-RSA] to the more generic
   resource state abstraction to represent such resource sharing.
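
   The iterative next-hop queries of the second phase, and the
   per-network view that the third phase starts from, can be sketched
   as follows. This is a minimal illustration and not part of the
   Unicorn specification: the three-network topology, the NEXT_HOP
   table and every function name are hypothetical, and an in-memory
   lookup stands in for the customized ECS query that would be sent to
   a real ALTO server.

```python
from collections import defaultdict

# Hypothetical next-hop tables standing in for each member network's ALTO
# server. Keys are (source, destination) endpoint pairs; values are the
# (next AS, ingress) 2-tuple, or None for the null 2-tuple returned by the
# network that hosts the destination endpoint.
NEXT_HOP = {
    "A": {("s1", "d1"): ("B", "ingB"), ("s2", "d2"): ("C", "ingC")},
    "B": {("s1", "d1"): None},
    "C": {("s2", "d2"): None},
}


def query_next_hop(network, pair):
    """Stand-in for one customized ECS query to the ALTO server of `network`."""
    return NEXT_HOP[network][pair]


def discover_domain_path(source_network, pair, max_hops=16):
    """Iteratively assemble the domain-path as a list of (ingress, AS)."""
    path = [(None, source_network)]  # the source network has no ingress
    network = source_network
    for _ in range(max_hops):
        hop = query_next_hop(network, pair)
        if hop is None:  # null 2-tuple: the destination network was reached
            return path
        network, ingress = hop
        path.append((ingress, network))
    raise RuntimeError("domain-path discovery did not terminate")


def pairs_per_network(domain_paths):
    """Invert the domain-paths: for each network, the pairs that enter it."""
    result = defaultdict(set)
    for pair, path in domain_paths.items():
        for _ingress, network in path:
            result[network].add(pair)
    return dict(result)


CANDIDATE_PAIRS = [("s1", "d1"), ("s2", "d2")]
DOMAIN_PATHS = {p: discover_domain_path("A", p) for p in CANDIDATE_PAIRS}
PAIRS_BY_NETWORK = pairs_per_network(DOMAIN_PATHS)
```

   The max_hops bound is an assumption added so that the sketch
   terminates even if the hypothetical tables contain a loop; the
   process described above does not specify such a limit.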

   Figure 4 shows the procedure of the resource state abstraction
   discovery phase. For each member network, the ALTO client in the
   orchestrator sends an ALTO Multipart Cost Property Service query
   defined in [DRAFT-PV] by providing the set of candidate endpoint
   pairs as input. The cost type of this query is path vector. Upon
   receiving the query, the ALTO server in each member network computes
   an ALTO cost map and an ALTO property map for the ALTO client. These
   two maps represent a set of linear inequalities revealing the
   resource sharing among the set of candidate endpoint pairs in the
   member network.

      .-----------------------.
      | Resource Orchestrator |
      |                       |
      |     .-----------.     |
      |     |ALTO Client|     |
      |     '-----------'     |
      '---------| ^-----------'
       Multipart| | Resource
           Cost | | State
       Property | | Abstraction
        Queries v |
          .-------------.
          |ALTO Servers |
          '-------------'

       Figure 4: The Resource State Abstraction Discovery Phase.

   Unicorn provides two mechanisms for the ALTO servers to return the
   computed cost maps and property maps to the ALTO client. The first
   mechanism is to let each ALTO server independently send its response
   to the ALTO client. The second mechanism is a privacy-preserving
   interdomain information aggregation process, in which the ALTO
   servers in all member networks use a secure multi-party computation
   (SMPC) protocol to collectively send the responses to the ALTO client
   without revealing the source of any entry, i.e., any linear
   inequality, in the cost maps and property maps.

   The first mechanism has a higher security risk in that it exposes the
   bottleneck resource information of each member network. In contrast,
   the second mechanism provides better protection of the private
   information of each member network.
   The details of the privacy-preserving interdomain information
   aggregation process are presented in the next section.

   After receiving the responses from the ALTO servers of all the
   member networks, the orchestrator completes the resource discovery
   process, having collected accurate information on the different
   types of resources available to data analytics jobs.

7.2.2. Proactive Full-Mesh Resource Discovery

   To ensure that the resource discovery process scales, a proactive
   full-mesh resource discovery component is developed. The main idea
   is to have the ALTO client periodically query the ALTO servers at
   all sites to discover the resource state abstraction between every
   pair of source and destination sites. As such, when an application
   submits a resource discovery request, the ALTO client does not need
   to send any query to the ALTO servers. Instead, using the site-level
   bandwidth sharing information, the ALTO client can immediately
   perform projection operations to get the resource information for
   the request. This mechanism improves the scalability of Unicorn by
   removing per-request queries to the ALTO servers.

7.3. Example

   This subsection gives an example to illustrate the workflow of
   Unicorn. Figure 5 gives a topology of three member networks, where
   s1 and s2 are storage endpoints and d1 and d2 are computation
   endpoints. Assume a data analytics job is composed of two parallel
   tasks, T1 and T2. T1 needs dataset X as input and T2 needs dataset Y
   as input.

                                  .------------.
                                  | Network B  |
            .------------.    ingB|            |
            | Network A  |--------|     d1     |
            |            |        '------------'
            |     s1     |
            |            |        .------------.
            |     s2     |--------| Network C  |
            '------------'    ingC|            |
                                  |     d2     |
                                  '------------'

            Figure 5: An Illustrating Example for Unicorn.

   In the endpoint property discovery phase, the Unicorn resource
   orchestrator learns from the DHT lookup system that s1 stores X and
   s2 stores Y, together with the locations of s1, s2, d1 and d2. It
   then issues EPS queries to networks A, B and C, respectively, to
   discover that d1 satisfies the computing requirements of T1 and d2
   satisfies the computing requirements of T2. Hence there are only two
   candidate endpoint pairs: (s1, d1) and (s2, d2).

   In the endpoint path discovery phase, the ALTO client in the
   orchestrator iteratively issues Endpoint Cost Service (ECS) queries
   to the ALTO servers in member networks, and finds that the
   domain-path for pair (s1, d1) is [(null, A), (ingB, B)] and the
   domain-path for pair (s2, d2) is [(null, A), (ingC, C)]. Hence both
   pairs will use the networking resources of network A, while only
   (s1, d1) will use network B and only (s2, d2) will use network C.

   In the resource state abstraction discovery phase, the ALTO client in
   the orchestrator issues Multipart Cost Property Service queries to
   networks A, B and C, respectively. Denote the available bandwidth
   that can be assigned to T1 as x1 and that to T2 as x2. Assume the
   linear inequalities computed by the three networks are:

      A: x1 + x2 <= 10 Mbps
      B: x1      <=  3 Mbps
      C:      x2 <=  3 Mbps

   If the ALTO servers use the first mechanism and return their
   resource information directly to the ALTO client, each of them sends
   a cost map and a property map response encoding its own linear
   inequality. In this way, the orchestrator gets accurate information
   about the networking resource sharing between (s1, d1) and (s2, d2).
   It can then invoke a resource allocation algorithm to allocate the
   resources to tasks T1 and T2.
   For example, if the goal is to maximize the minimum bandwidth of the
   two tasks, the allocation decision will be to assign endpoints s1
   and d1 to T1, with a bandwidth of 3 Mbps, and to assign endpoints s2
   and d2 to T2, with a bandwidth of 3 Mbps as well.

8. ALTO Extension: Privacy-Preserving Interdomain Information
   Aggregation for Resource Discovery

   This section describes a customized ALTO extension in Unicorn that
   supports the privacy-preserving discovery of networking resource
   sharing among a set of candidate endpoint pairs.

8.1. Extension Specification

   Figure 6 presents the workflow of the proposed ALTO extension.
   Assume a set of N member networks denoted as AS_1, AS_2, ..., AS_N,
   and let F be the number of all candidate endpoint pairs. The
   interdomain information aggregation process works as follows:

                       .-----------------------.
                       | Resource Orchestrator |
                       |                       |
                       |     .-----------.     |
                       |     |ALTO Client|     |
                       |     '-----------'     |
                      >'-----------------------'<
                     /Disguised   /|\            \
     Multipart Cost / Response     |              \
  Property Queries v               |               \
            .--------.         .--------.         .--------.
            |ALTO    |         |ALTO    |         |ALTO    |
            |Server 1|<------->|Server 2|<--...-->|Server N|
            '--------'Limitedly'--------'         '--------'
                    Shared Secret

     Figure 6: The Privacy-Preserving Interdomain Resource Information
                               Aggregation.

   o  Step 1: The ALTO client sends the Multipart Cost Property Service
      request and a homomorphic public key k_p to each member network.

   o  Step 2: The ALTO server of each network AS_i computes its own set
      of linear inequalities A_i x <= b_i. Denote the size of this set
      as m_i.

   o  Step 3: The ALTO server of each network AS_i introduces m_i
      non-negative slack variables to transform its set of linear
      inequalities into a set of linear equations.

   o  Step 4: The ALTO servers of all member networks use a private
      matrix SMPC summation protocol to collectively compute
      k = m_1 + m_2 + ... + m_N + 1. The value k is known to all the
      member networks.

   o  Step 5: The ALTO server of each network AS_i selects a random
      k-by-m_i matrix P_i, and computes the matrices P_i A_i and
      P_i b_i.

   o  Step 6: The ALTO server of each network then uses a few matrices,
      which are only shared with a couple of other networks, to further
      obfuscate P_i A_i and P_i b_i, and sends the obfuscated matrices
      to the ALTO client via symmetric encryption.

   o  Step 7: The ALTO client decrypts the responses received from all
      ALTO servers, and sums up the decrypted responses to get a set of
      linear equations sum P_i A_i x = sum P_i b_i.

   This process ensures that the networking resource capacity region
   derived from sum P_i A_i x = sum P_i b_i is the same as that derived
   from A_1 x <= b_1, A_2 x <= b_2, ..., A_N x <= b_N. More importantly,
   the ALTO client has no knowledge of the network resource sharing
   information of any single member network.

8.2. Example

   This subsection reuses the example in Figure 5 to illustrate the
   privacy-preserving information aggregation process. The set of
   linear inequalities computed by each network is as follows:

      A: x1 + x2 <= 10
      B: x1      <=  3
      C:      x2 <=  3

   Then the networks collectively compute k = 1 + 1 + 1 + 1 = 4.
   Each network then introduces a non-negative slack variable (x3, x4
   and x5, respectively) to transform its linear inequality into a
   linear equation:

      A: x1 + x2 + x3 = 10
      B: x1 + x4      =  3
      C: x2 + x5      =  3

   The random 4-by-1 matrix chosen by each network is as follows:

      P_A: [11, 49, 95, 34]
      P_B: [58, 22, 75, 25]
      P_C: [50, 69, 89, 95]

   After the obfuscating process in Step 5 and Step 6 in the previous
   subsection, the decrypted set of linear equations the ALTO client
   gets is

       69 x1 +  61 x2 + 11 x3 + 58 x4 + 50 x5 =  434
       71 x1 + 118 x2 + 49 x3 + 22 x4 + 69 x5 =  763
      170 x1 + 184 x2 + 95 x3 + 75 x4 + 89 x5 = 1442
       59 x1 + 129 x2 + 34 x3 + 25 x4 + 95 x5 =  700

   Assuming the goal is still to maximize the minimum bandwidth of the
   two tasks, the allocation decision made using this set of linear
   equations will still be x1=3 and x2=3, i.e., assigning endpoints s1
   and d1 to T1, with a bandwidth of 3, and assigning endpoints s2 and
   d2 to T2, with a bandwidth of 3 as well.

9. Discussion

9.1. Discovering the Domain-Paths Using a New Interdomain Routing
     Protocol

   The current design of the endpoint path discovery process in Unicorn
   assumes that the underlying interdomain routing protocol is standard
   BGP, which only provides the path vector of ASes instead of the path
   vector of (ingress, AS) tuples needed by Unicorn. If a multi-domain,
   geo-distributed data analytics system uses an interdomain routing
   protocol that provides the path vector of (ingress, AS) tuples, the
   endpoint path discovery process in Unicorn can be simplified to only
   send queries to the ALTO server of the network where the source
   candidate endpoint is located.

10. Security Considerations

   This document does not introduce any privacy or security issue not
   already present in the ALTO protocol.

11. IANA Considerations

   This document does not define any new media type or introduce any new
   IANA consideration.

12. References

12.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

12.2. Informative References

   [Apollo]   Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J.,
              Qian, Z., Wu, M., and L. Zhou, "Apollo: Scalable and
              Coordinated Scheduling for Cloud-Scale Computing", 2014.

   [Borg]     Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D.,
              Tune, E., and J. Wilkes, "Large-scale cluster management
              at Google with Borg", 2015.

   [DRAFT-FCS]
              Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang,
              "ALTO Extension: Flow-based Cost Query", 2017.

   [DRAFT-PV]
              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Extension: Abstract Path Vector as a Cost
              Mode", 2015.

   [DRAFT-RSA]
              Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G.
              Chen, "A Recommendation for Compressing ALTO Path
              Vectors", 2017.

   [DRAFT-SSE]
              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
              Server-Sent Events (SSE)", 2015.

   [HTCondor]
              Thain, D., Tannenbaum, T., and M. Livny, "Distributed
              computing in practice: the Condor experience", 2005.

   [Mesos]    Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
              Joseph, A., Katz, R., Shenker, S., and I. Stoica, "Mesos:
              A Platform for Fine-Grained Resource Sharing in the Data
              Center", 2011.

   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
              Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
              "Application-Layer Traffic Optimization (ALTO) Protocol",
              RFC 7285, DOI 10.17487/RFC7285, September 2014.

   [RFC8189]  Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
              Application-Layer Traffic Optimization (ALTO)", RFC 8189,
              DOI 10.17487/RFC8189, October 2017.

   [Sparrow]  Ousterhout, K., Wendell, P., Zaharia, M., and I.
              Stoica, "Sparrow: Distributed, Low Latency Scheduling",
              2013.

   [XCP]      Katabi, D., Handley, M., and C. Rohrs, "Internet
              Congestion Control for Future High Bandwidth-Delay Product
              Environments", 2002.

Authors' Addresses

   Qiao Xiang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: qiao.xiang@cs.yale.edu

   Franck Le
   IBM
   Thomas J. Watson Research Center
   Yorktown Heights, NY
   USA

   Email: fle@us.ibm.com

   Y. Richard Yang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: yry@cs.yale.edu

   Harvey Newman
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: newman@hep.caltech.edu

   Haizhou Du
   Tongji University
   4800 Cao'an Hwy
   Shanghai 201804
   China

   Email: duhaizhou@gmail.com