idnits 2.17.1 draft-xiang-alto-multidomain-analytics-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 604 has weird spacing: '...Queries v |...' == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (March 5, 2018) is 2243 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '11' on line 780 -- Looks like a reference, but probably isn't: '49' on line 780 -- Looks like a reference, but probably isn't: '95' on line 782 -- Looks like a reference, but probably isn't: '34' on line 780 -- Looks like a reference, but probably isn't: '58' on line 781 -- Looks like a reference, but probably isn't: '22' on line 781 -- Looks like a reference, but probably isn't: '75' on line 781 -- Looks like a reference, but probably isn't: '25' on line 781 -- Looks like a reference, but probably isn't: '50' on line 782 -- Looks like a reference, but probably isn't: '69' on line 782 -- Looks like a reference, but probably isn't: '89' on line 782 Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 ALTO WG Q. Xiang 3 Internet-Draft Tongji/Yale University 4 Intended status: Informational F. Le 5 Expires: September 6, 2018 IBM 6 Y. Yang 7 Tongji/Yale University 8 H. Newman 9 California Institute of Technology 10 H. Du 11 Tongji University 12 March 5, 2018 14 Unicorn: Resource Orchestration for Multi-Domain, Geo-Distributed Data 15 Analytics 16 draft-xiang-alto-multidomain-analytics-01.txt 18 Abstract 20 As the data volume increases exponentially over time, data analytics 21 is transiting from a single-domain network to a multi-domain, geo- 22 distributed network, where different member networks contribute 23 various resources, e.g., computation, storage and networking 24 resources, to collaboratively collect, share and analyze extremely 25 large amounts of data. Such a network calls for a resource 26 orchestration framework that emphasizes the performance 27 predictability of data analytics jobs, the high utilization of 28 resources, and the autonomy and privacy of member networks. 30 This document presents the design of Unicorn, a unified resource 31 orchestration framework for multi-domain, geo-distributed data 32 analytics, which uses the Application-Layer Traffic Optimization 33 (ALTO) protocol as the key component for (1) allows member networks 34 to provide accurate information on different types of resources; (2) 35 keeps the private information of member networks; and (3) allows data 36 analytics jobs to accurately describe their requirements of different 37 types of resources. As a part of Unicorn, an ALTO extension for 38 privacy-preserving interdomain information aggregation is also 39 presented. 41 Status of This Memo 43 This Internet-Draft is submitted in full conformance with the 44 provisions of BCP 78 and BCP 79. 46 Internet-Drafts are working documents of the Internet Engineering 47 Task Force (IETF). Note that other groups may also distribute 48 working documents as Internet-Drafts. The list of current Internet- 49 Drafts is at http://datatracker.ietf.org/drafts/current/. 51 Internet-Drafts are draft documents valid for a maximum of six months 52 and may be updated, replaced, or obsoleted by other documents at any 53 time. It is inappropriate to use Internet-Drafts as reference 54 material or to cite them other than as "work in progress." 56 This Internet-Draft will expire on September 6, 2018. 58 Copyright Notice 60 Copyright (c) 2018 IETF Trust and the persons identified as the 61 document authors. All rights reserved. 63 This document is subject to BCP 78 and the IETF Trust's Legal 64 Provisions Relating to IETF Documents 65 (http://trustee.ietf.org/license-info) in effect on the date of 66 publication of this document. Please review these documents 67 carefully, as they describe your rights and restrictions with respect 68 to this document. Code Components extracted from this document must 69 include Simplified BSD License text as described in Section 4.e of 70 the Trust Legal Provisions and are provided without warranty as 71 described in the Simplified BSD License. 73 Table of Contents 75 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 76 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 77 3. Changes Since Version -00 . . . . . . . . . . . . . . . . . . 4 78 4. Characteristics of Multi-Domain, Geo-Distributed Data 79 Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . 4 80 4.1. Dynamic Data Analytics Workload . . . . . . . . . . . . . 5 81 4.2. Dynamic Resource Availability . . . . . . . . . . . . . . 5 82 5. Design Requirements . . . . . . . . . . . . . . . . . . . . . 6 83 6. Review of Resource Orchestration Designs for Data Analytics . 7 84 6.1. Centralized resource-graph-based orchestration . . . . . 7 85 6.2. Centralized ClassAds-based orchestration . . . . . . . . 7 86 6.3. Distributed opportunistic orchestration . . . . . . . . . 7 87 6.4. Inadequacy of Existing Designs for Multi-Domain, Geo- 88 Distributed Data Analytics . . . . . . . . . . . . . . . 8 89 7. Unicorn Design . . . . . . . . . . . . . . . . . . . . . . . 8 90 7.1. Choosing ALTO as the Resource Information Model . . . . . 8 91 7.2. Architecture of Unicorn . . . . . . . . . . . . . . . . . 9 92 7.2.1. Three-Phase Resource Discovery . . . . . . . . . . . 11 93 7.3. Example . . . . . . . . . . . . . . . . . . . . . . . . . 15 94 8. ALTO Extension: Privacy-Preserving Interdomain Information 95 Aggregation for Resource Discovery . . . . . . . . . . . . . 16 97 8.1. Extension Specification . . . . . . . . . . . . . . . . . 16 98 8.2. Example . . . . . . . . . . . . . . . . . . . . . . . . . 17 99 9. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 18 100 9.1. Discovering the Domain-Paths Using a New Interdomain 101 Routing Protocol . . . . . . . . . . . . . . . . . . . . 18 102 10. Security Considerations . . . . . . . . . . . . . . . . . . . 18 103 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 104 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 19 105 12.1. Normative References . . . . . . . . . . . . . . . . . . 19 106 12.2. Informative References . . . . . . . . . . . . . . . . . 19 107 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 20 109 1. Introduction 111 This document describes the design of Unicorn, a unified resource 112 orchestration framework for large-scale data analytics in multi- 113 domain, geo-distributed networks. An important use case for such 114 settings is the Large Hadron Collider (LHC) network, which consists 115 of over 180 member networks all over the world, to support scientists 116 to access multiple resources, e.g., computing, storage and networking 117 resources, distributed in the member networks to conduct large-scale 118 data analytics. With more and more data being generated and stored 119 in different geo-distributed member networks, network architects and 120 administrators are exploring different designs for efficient resource 121 orchestration in multi-domain, geo-distributed networks. 123 The design presented in this document is based on the development and 124 deployment experience of Unicorn in the CMS network, one of the 125 largest scientific experiments in the LHC network. The primary 126 requirements of resource orchestration in such a multi-domain, geo- 127 distributed environment are the performance predictability of various 128 data analytics jobs, the high utilization of different types of 129 resources, and the autonomy and privacy of resource owners, i.e., 130 member networks. 132 Pre-production development and extensive testing have shown that the 133 Application-Layer Traffic Optimization Protocol [RFC7285] is well 134 suited as a fundamental component in Unicorn for providing a generic 135 representation that (1) allows different types of data analytics jobs 136 to accurately describe their resource requirements and (2) allows 137 member networks to provide accurate information on different types of 138 resources they own and at the same time maintain their privacies. 139 This is in contrast with the state-of-the-art resource orchestration 140 frameworks, such as HTCondor and Mesos, which either do not provide 141 accurate networking information or expose all the private details of 142 member networks. This document elaborates on the design requirements 143 of resource orchestration in multi-domain, geo-distributed networks 144 that lead to this design choice and presents the details of Unicorn, 145 including an ALTO extension for privacy-preserving, interdomain 146 information aggregation. 148 This document first gives an overview of the characteristics of 149 multi-domain, geo-distributed data analytics. Then, the design 150 requirements for resource orchestration under such settings are 151 summarized. After reviewing existing designs and their limitations, 152 this document gives the arguments for using ALTO as the generic 153 representation for describing both resource requirements and the 154 resource information and describes the design details of Unicorn. 155 Finally, a privacy-preserving, interdomain extension of ALTO is 156 presented. 158 2. Requirements Language 160 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 161 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 162 document are to be interpreted as described in [RFC2119]. 164 3. Changes Since Version -00 166 o Add an overview of the characteristics of multi-domain, geo- 167 distributed data analytics. 169 o Based on the above characteristics, add the design requirements of 170 the resource orchestration for multi-domain, geo-distributed data 171 analytics. 173 o Add a review of existing resource orchestration system designs for 174 data analytics systems. 176 o Update the architecture of the Unicorn system. 178 o Update the motivation of using ALTO as the key information model 179 in Unicorn. 181 o Give the design details of the three-phase resource discovery in 182 Unicorn. 184 o Design an ALTO extension for privacy-preserving interdomain 185 resource information aggregation. 187 4. Characteristics of Multi-Domain, Geo-Distributed Data Analytics 189 This section describes the characteristics of multi-domain, geo- 190 distributed data analytics. 192 4.1. Dynamic Data Analytics Workload 194 In multi-domain, geo-distributed data analytics, extremely large 195 amounts of data are generated and stored across different member 196 networks. Authorized users from different organizations can access 197 data and resources in member networks to conduct various data 198 analytics jobs using various data analytics applications. 200 An data analytics application usually provides an automated process 201 that decomposes a large data analytics job into a set of smaller 202 tasks, whose dependencies are expressed as a directed acyclic graph 203 (DAG). Tasks without any dependency can be executed in parallel to 204 improve the efficiency of the data analytics job they belong to. 205 This decomposition is highly user- and application-dependent. 207 Each task may have different requirements on different resources. 208 For instance, task T1 may require dataset A in storage node X as 209 input and 1 CPU as the computing resource, while task T2 may require 210 dataset B in storage node Y as input and 2 CPUs as the computing 211 resource. Furthermore, each task may require resources from 212 different member networks. In the previous example, T1 may require 213 its output to be stored in a storage node in another member network 214 for the purpose of secure storage. The resource requirements of 215 tasks are highly user- and application-dependent. 217 From the above description, it is observed that the workload of 218 multi-domain, geo-distributed data analytics is highly dynamic, in 219 terms of the number of users, the types of applications, the number 220 of jobs, the decomposition of jobs and the resource requirements of 221 tasks. 223 Though with such dynamism, it is the general consensus of users to 224 expect performance predictability of their analytics jobs (TODO: add 225 Mogul citation). Hence the resource orchestration for multi-domain, 226 geo-distributed data analytics must be able to achieve efficient 227 resource sharing among different data analytics jobs of different 228 applications from different users. To this end, a generic 229 representation of resource requirements for different tasks from 230 different analytics applications must be chosen. Furthermore, to 231 ensure maximal deployment, the resource orchestration framework must 232 be independent of and compatible with data analytics applications. 234 4.2. Dynamic Resource Availability 236 In the multi-domain, geo-distributed data analytics network, 237 different member networks belong to different administrative domains. 238 Each member network has its own resource management policies and can 239 choose to use different management software, such as HTCondor and 240 Mesos. 242 Each member network provides different types of resources with 243 different amounts. For example, transit networks such as ESNet and 244 Internet2 provide high-bandwidth networking resources. In contrast, 245 campus science networks provide abundant computation and storage 246 resources, but may provide limited networking bandwidths. And some 247 smaller science networks only provide limited computation and storage 248 resources. The availability of the resources in each member network 249 is subject to the autonomous control of the member network. 251 Furthermore, member networks are interconnected with high bandwidth- 252 delay-product links, where state-of-the-art networking resource 253 allocation mechanisms, such as TCP, become inefficient [XCP]. 255 From the above description, it is observed that the resource 256 availability of the multi-domain, geo-distributed data analytics 257 network is also highly dynamic, subject to the types of member 258 networks, the resources provided by member networks and the resource 259 management policies and management software used by member networks. 261 Though with such dynamism, it is the general consensus of member 262 networks that the resource orchestration for multi-domain, geo- 263 distributed data analytics must achieve high utilization of different 264 types of resources, following the autonomy and privacy of each member 265 network. To this end, a generic representation of resource 266 availabilities for different types of resources must be chosen. Such 267 a representation must be accurate and at the same time maintain the 268 privacy of member networks. Furthermore, to ensure maximal 269 deployment, the resource orchestration framework must be independent 270 of and compatible with the resource management systems used by member 271 networks. 273 5. Design Requirements 275 This section summarizes the design requirements for resource 276 orchestration for multi-domain, geo-distributed data analytics from 277 the previous section. 279 o REQ1: Provide performance predictability for data analytics jobs. 281 o REQ2: Achieve the efficient resource sharing among data analytics 282 jobs. 284 o REQ3: Achieve the high utilization of different types of resources 285 in member networks. 287 o REQ4: Maintain the autonomy and privacy of member networks. 289 o REQ5: Provide compatibility with different data analytics 290 applications and resource management systems to maximize the 291 deployment. 293 6. Review of Resource Orchestration Designs for Data Analytics 295 This section provides an overview of three general types of resource 296 orchestration designs for data analytics -- the centralized resource- 297 graph-based orchestration, the centralized ClassAds-based 298 orchestration and the distributed opportunistic orchestration. Then, 299 the key reason why these designs are inadequate for multi-domain, 300 geo-distributed data analytics is provided. 302 6.1. Centralized resource-graph-based orchestration 304 Systems such as Mesos [Mesos] and Borg [Borg] adopt a graph-based 305 abstraction to represent the resource availability of computing 306 clusters. Each node in the graph is a physical node representing 307 computation or storage resources and each edge between a pair of 308 nodes denotes the networking resource connecting two physical nodes. 309 This design is inadequate for multi-domain, geo-distributed data 310 analytics system because (1) it compromises the privacy of different 311 member networks by revealing all the details of resources; and (2) 312 the overhead to keep the resource availability graph up to date is 313 too expensive due to the heterogeneity and dynamicity of resources 314 from different member networks. 316 6.2. Centralized ClassAds-based orchestration 318 HTCondor [HTCondor] proposes a ClassAds programming model, which 319 allows different resource owners to advertise their resource supply 320 and the job owners to advertise the resource demand. However, this 321 programming model does not support the accurate discovery of 322 networking resources, but leave the orchestration of networking 323 resources completely to TCP, which has been known to behave poorly in 324 networks with high bandwidth-delay products [XCP]. 326 6.3. Distributed opportunistic orchestration 328 Some systems, such as Apollo [Apollo] and Sparrow [Sparrow], use a 329 distributed design. In this design, given a data analytics job, a 330 small number of computing and storage nodes are randomly selected as 331 candidates. Then a scheduling algorithm makes the decision to select 332 the best pair of computing and storage nodes within this small set of 333 candidates. Though it is shown in production that this design 334 achieves a performance very close to the theoretical optimal resource 335 allocation scheme, this design cannot be applied to multi-domain, 336 geo-distributed data analytics because (1) the pool of computing and 337 storage resources is much larger, and is distributed across the 338 world, and (2) it is hard to distributively orchestrate networking 339 resources in such a high bandwidth-delay product scenario. 341 6.4. Inadequacy of Existing Designs for Multi-Domain, Geo-Distributed 342 Data Analytics 344 Applying the designs reviewed in the preceding subsections for multi- 345 domain, geo-distributed data analytics only satisfies the design 346 requirement of compatibility (REQ5), but leaves all the other 347 requirements unfulfilled. The key reason is that they do not have an 348 information model that simultaneously 350 o allows member networks to provide accurate information on 351 different types of resources, e.g., the computing, storage and 352 networking resources, they own; 354 o keeps the private information of member networks, such as physical 355 topologies and policies, from the data analytics applications; and 357 o allows data analytics jobs to accurately describe their 358 requirements of different types of resources. 360 7. Unicorn Design 362 This section presents the design of the Unicorn framework. First, 363 the motivations of using ALTO as the information model of resource 364 orchestration for multi-domain, geo-distributed data analytics are 365 reviewed. Then the architecture of Unicorn is provided. 367 7.1. Choosing ALTO as the Resource Information Model 369 As reviewed in the preceding section, the commonly used resource- 370 graph-based information model and the ClassAds information model do 371 not support the accurate, yet privacy-preserving resource discovery 372 across different member networks. In contrast, the ALTO protocol 373 uses abstract maps of networks to provide network information with 374 the goal of modifying network resource consumption patterns while 375 maintaining or improving application performance [RFC7285]. This 376 document proposes the use of ALTO for providing information of 377 different types of resources, e.g., computing, storage and networking 378 resources. This design has the following advantages: 380 o ALTO provides the network information based on abstract maps of a 381 network. Additional services are built on top of the ALTO 382 abstract maps to provide information of other types of resources, 383 e.g., the computing and storage resources. These maps provide 384 accurate information of different types of resources for the 385 resource orchestration system to effectively utilize them for data 386 analytics applications. For example, the ALTO Endpoint Property 387 Service can provide information of computing nodes and storage 388 nodes. 390 o The ALTO abstract maps provide a simplified view of resources of 391 member networks, instead of the full details of their resource 392 availability. Thus ALTO allows member networks to keep their 393 private information, such as physical topologies and policies, 394 from the applications. For example, the ALTO Network Map service 395 provides a "one-big-switch" view that defines a grouping of 396 network endpoints. This view hides the details of the underlying 397 physical topology of the network and a network deploying the ALTO 398 server has the autonomy to adopt any endpoint grouping algorithm. 400 o ALTO uses a client-server model, in which applications can use 401 ALTO clients to accurately describe their requirements of 402 different types of resources and send these requirements to the 403 ALTO servers to retrieve the accurate information of resources 404 that suit their requirements. For example, the ALTO Multi-Cost 405 service [RFC8189] allows an ALTO client to specify a logic set of 406 tests in a query. Such tests are used by ALTO servers to filter 407 out the information of unqualified resources from the response 408 sent back to the ALTO client. 410 7.2. Architecture of Unicorn 412 This section describes the design details of Unicorn. Figure 1 413 presents the architecture of Unicorn for a multi-domain, geo- 414 distributed data analytics system with N member networks. In 415 particular, Unicorn consists of the following key components: 417 .-------------. .-------------. 418 |Application 1| ... |Application N| 419 '-------------' '-------------' 420 \ / 421 .- - - - - - - -\- - - - - - - - - - - -/- - - - - - - - - - - - - - -. 422 | Unicorn \ / | 423 | .-----------------------. | 424 | | Resource Orchestrator | .----------------------.| 425 | | | |Distributed Hash Table|| 426 | | .-----------. |---- | of Computing and || 427 | | |ALTO Client| | | Storage Resources || 428 | | '-----------' | '----------------------'| 429 | '-----------------------' | 430 | / | \ | 431 | / | \ | 432 | .-------------. .-----------. .-------------. | 433 | |ALTO Server 1| | Execution | |ALTO Server M| | 434 | '-------------' | Agents | '-------------' | 435 | | '-----------' | | 436 | | / \ | | 437 | .----------------./ \ .----------------. | 438 | | Site 1 | . . | Site N | | 439 | '----------------' '----------------' | 440 '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -' 442 Figure 1: Architecture of Unicorn. 444 o ALTO Server: for each member network, one or more ALTO servers are 445 deployed to provide accurate, yet privacy-preserving information 446 of different types of resources owned by the corresponding 447 network. Examples of such information include the link bandwidth 448 between endpoints, the memory I/O bandwidth and the CPU 449 utilization at computing endpoints and the storage space at 450 storage endpoints. In addition to the basic ALTO services defined 451 in [RFC7285], The ALTO servers in Unicorn also provide ALTO 452 extension services such as the ALTO Multi-Cost Service [RFC8189], 453 the ALTO Server-Sent Event Service [DRAFT-SSE] and the ALTO 454 Multipart Cost Property Service [DRAFT-PV] to provide fine-grained 455 resource information. 457 o Distributed Hash Table (DHT) of Computing and Storage Resources: A 458 DHT system is deployed across member networks to lookup the 459 location of computing and storage resources. Compared with the 460 current centralized lookup services in the CMS network, i.e., 461 PhEDEx and HTCondor, a DHT system provides a significant 462 performance improvement for discovering the locations of computing 463 and storage resources in multi-domain, geo-distributed data 464 analytics systems. 466 o Resource Orchestrator: The orchestrator is a shim layer between 467 the data analytics jobs from different applications and the member 468 networks. It contains an ALTO client that communicates with the 469 ALTO servers at member networks to retrieve resource information. 470 Given a set of data analytics jobs, the orchestrator adopts a 471 three-phase discovery process, which will be elaborated in the 472 next section, to find the accurate information of all the 473 resources that can be used to execute these jobs. Then the 474 orchestrator runs a customized resource allocation algorithm to 475 compute the resource allocation decisions for these jobs, and send 476 the decisions to the execution agents at corresponding member 477 networks. 479 o Execution Agent: One or more execution agents are deployed at each 480 member network. They take the resource allocation decisions from 481 the resource orchestrator, and communicate with the underlying 482 resource management system deployed at the corresponding member 483 network to reserve the resources for the data analytics jobs and 484 execute them. 486 7.2.1. Three-Phase Resource Discovery 488 The preceding subsection describes the architecture and the key 489 components of Unicorn. One missing component is how to accurately 490 discover the information of different types of resources for a set of 491 data analytics jobs with the assistance of ALTO. This section 492 presents the three-phase resource discovery design in Unicorn. 494 7.2.1.1. Phase 1: Endpoint Property Discovery 496 Figure 2 shows the procedure of the endpoint property discovery 497 phase. Given a set of data analytics jobs, the resource orchestrator 498 communicates with the DHT lookup system to find the locations, i.e., 499 the endpoint addresses, of all candidate computing and storage 500 resources. With such information, the ALTO client then issues 501 Endpoint Property Service (EPS) queries to the ALTO servers deployed 502 at member networks to discover the information of all candidate 503 endpoints. 505 .-----------------------. 506 | Resource Orchestrator | Jobs' Resource 507 | | Requirements .------------. 508 | .-----------. | ---------------> | DHT Lookup | 509 | |ALTO Client| | <--------------- | System | 510 | '-----------' | Endpoint '------------' 511 '---------| ^-----------' Locations 512 EPS | | Endpoint 513 Queries | | Properties 514 v | 515 .-------------. 516 |ALTO Servers | 517 '-------------' 519 Figure 2: The Endpoint Property Discovery Phase. 521 7.2.1.2. Phase 2: Endpoint Path Discovery 523 Candidate computing and storage endpoints need to move data between 524 them before, during and after the execution of a data analytics job. 525 In multi-domain, geo-distributed data analytics, a pair of candidate 526 endpoints may not be in the same member network. In this case, the 527 orchestrator needs to find out the connectivity information between 528 such a pair of candidate endpoints. 530 Figure 3 shows the procedure of the endpoint path discovery phase. 531 Given a pair of candidate endpoints that are not in the same member 532 network, the ALTO client in the orchestrator adopts an iterative 533 process to find the interdomain connectivity information for this 534 pair. It starts by issuing an ALTO Endpoint Cost Service query or an 535 ALTO Flow-based Endpoint Cost Service [DRAFT-FCS] to the ALTO server 536 of the member network where the source endpoint locates. The cost 537 type of this query is a customized type called next-hop, with a 538 customized cost mode tuple and a customized cost metric next-network. 540 .-----------------------. 541 | Resource Orchestrator | 542 | | 543 | .-----------. | 544 | |ALTO Client| | 545 | '-----------' | 546 '---------| ^-----------' 547 Customized| | Endpoint 548 ECS | | Path 549 Queries | | Segments 550 v | 551 .-------------. 552 |ALTO Servers | 553 '-------------' 555 Figure 3: The Endpoint Path Discovery Phase. 557 The ALTO server returns a 2-tuple, where the first element is the 558 autonomous number (AS) of the next member network along the AS-path 559 from the source endpoint to the destination endpoint, and the second 560 element is the ingress of this next member network. In a member 561 network, the ALTO server can get such information from the underlying 562 interdomain routing protocol, e.g., BGP. Based on the received 563 response, the ALTO client then issues a similar query to the ALTO 564 server of the next member network. The process stops when the ALTO 565 server of the member network where the destination endpoint locates 566 receives such a query, who will return a null 2-tuple in response to 567 notify the ALTO client. By the end of this process, the ALTO client 568 can assemble a domain-path, in the form of a path vector of (ingress, 569 AS), of this pair of candidate endpoints. 571 7.2.1.3. Phase 3: Resource State Abstraction Discovery 573 After the second phase, the resource orchestrator has the 574 connectivity information of each candidate endpoint pair, i.e., the 575 domain-path. Equivalently, for each member network, it knows the set 576 of all candidate endpoint pairs that will enter this network. With 577 this information, the resource orchestrator can communicate with the 578 ALTO servers at member networks to discover the resource sharing 579 between all the candidate endpoint pairs. In particular, Unicorn 580 extends the routing state abstraction [DRAFT-RSA] to the more generic 581 resource state abstraction to represent such resource sharing. 583 Figure 4 shows the procedure of the resource state abstraction 584 discovery phase. For each member network, the ALTO client in the 585 orchestrator sends an ALTO Multipart Cost Property Service query 586 defined in [DRAFT-PV] by providing the set of candidate endpoint 587 pairs as input. The cost type of this query is path vector. Upon 588 receiving the query, the ALTO server in each member network computes 589 an ALTO cost map and an ATLO property map to the ALTO client. These 590 two maps represent a set of linear inequalities revealing the 591 resource sharing among the set of candidate endpoint pairs in the 592 member network. 594 .-----------------------. 595 | Resource Orchestrator | 596 | | 597 | .-----------. | 598 | |ALTO Client| | 599 | '-----------' | 600 '---------| ^-----------' 601 Multipart| | Resource 602 Cost | | State 603 Property | | Abstraction 604 Queries v | 605 .-------------. 606 |ALTO Servers | 607 '-------------' 609 Figure 4: The Resource State Abstraction Discovery Phase. 611 Unicorn provides two mechanisms for the ALTO servers to return the 612 computed cost maps and property maps to the ALTO client. The first 613 mechanism is to let each ALTO server independently sends its response 614 to the ALTO client. The second mechanism is a privacy-preserving 615 interdomain information aggregation process, in which the ALTO 616 servers in all member networks use a secure multi-party computation 617 (SMPC) protocol to collectively send the responses to the ALTO client 618 without revealing the source of any entry, i.e., the linear in 619 equality, in the cost maps and property maps. 621 The first mechanism has a higher security risk in that it exposes the 622 bottleneck resource information of each member network. In contrast, 623 the second mechanism provides a better protection of the private 624 information of each member network. The details of the privacy- 625 preserving interdomain information aggregation process will be 626 presented in the next section. 628 After receiving the responses sent back from the ALTO servers from 629 all the member networks, the orchestrator finishes the whole resource 630 discovery process and collects the accurate information of different 631 types of resources for data analytics jobs. 633 7.3. Example 635 This subsection gives an example to illustrate the workflow of 636 Unicorn. Figure 5 gives a topology of three member networks, where 637 s1 and s2 are storage endpoints and d1 and d2 are computation 638 endpoints. Assume a data analytics job is composed of two parallel 639 tasks T1 and T2. T1 needs dataset X as input and T2 needs dataset Y 640 as input. 642 .------------. 643 | Network B | 644 .------------. ingB| | 645 | Network A |--------| d1 | 646 | | '------------' 647 | s1 | 648 | | .------------. 649 | s2 |--------| Network C | 650 '------------' ingC| | 651 | d2 | 652 '------------' 654 Figure 5: An Illustrating Example for Unicorn. 656 In the endpoint property discovery phase, the Unicorn resource 657 orchestrator finds that s1 stores X and s2 stores Y, and that the 658 locations of s1, s2, d1 and d2, from the DHT lookup system. It then 659 issues EPS queries to network A, B and C, respectively, to discover 660 that d1 satisfies the computing requirements of T1 and d2 satisfies 661 the computing requirements of T2. Hence there are only two candidate 662 endpoint pairs: (s1, d1) and (s2, d2). 664 In the endpoint path discovery phase, the ALTO client in the 665 orchestrator iteratively issues Endpoint Cost Service (ECS) query to 666 the ALTO servers in member networks, and finds that the domain-path 667 for pair (s1, d2) is [(null, A), (ingB, B)] and the domain-path for 668 pair (s2, d2) is [(null, A), (ingB, B)]. Hence both pairs will use 669 the networking resources of network A, while only (s1, d1) will use 670 network B and only (s2, d2) will use network C. 672 In the resource state abstraction discovery phase, the ALTO client in 673 the orchestrator issues Multipart Cost Property Service queries to 674 network A, B and C, respectively. Denote the available bandwidth 675 that can be assigned to T1 as x1 and that to T2 as x2. Assume the 676 linear inequalities computed by the three networks are: 678 A: x1 + x2 <= 10Mbps 679 B: x1 <= 3Mbps 680 C: x2 <= 3Mbps 682 If the ALTO servers use the first mechanism to directly return their 683 resource information to ALTO client, respectively, each of them will 684 send a cost map and a property map response encoding its own linear 685 inequality to the ALTO client. In this way, the orchestrator gets 686 the accurate information about networking resource sharing between 687 (s1, d1) and (s2, d2). It then can invoke a resource allocation 688 algorithm to allocate the resources to tasks T1 and T2. For example, 689 if the goal is to maximize the minimal bandwidth of two tasks, the 690 allocation decision will be to assign endpoints s1 and d1 to T1, with 691 a bandwidth of 3Mbps, and assign endpoints s2 and d2 to T2, with a 692 bandwidth of 3Mbps as well. 694 8. ALTO Extension: Privacy-Preserving Interdomain Information 695 Aggregation for Resource Discovery 697 This section describes a customized ALTO extension in Unicorn that 698 supports the privacy-preserving discovery of networking resource 699 sharing among a set of candidate endpoint pairs. 701 8.1. Extension Specification 703 Figure 6 presents the workflow of the proposed ALTO extension. 704 Assume a set of N member networks denoted as AS_1, AS_2, ... AS_N 705 and the number of all candidate endpoint pairs is F. The interdomain 706 information aggregation process works as follows: 708 .-----------------------. 709 | Resource Orchestrator | 710 | | 711 | .-----------. | 712 | |ALTO Client| | 713 | '-----------' | 714 '-----------------------'< 715 Encryption Key and / \ Ciphered MCP 716 Multipart Cost / \ Response 717 Property Queries v Key, Ciphertext \ 718 .--------. and MCP .--------. .--------. 719 |ALTO | Queries |ALTO | |ALTO | 720 |Server 1| -------> |Server 2|-- ... -->|Server N| 721 '--------' '--------' '--------' 723 Figure 6: The Privacy-Preserving Interdomain Resource Information 724 Aggregation. 726 o Step 1: The ALTO client sends the Multipart Cost Property Service 727 request to and a homomorphic public key k_p to each member 728 network. 730 o Step 2: The ALTO server of each network AS_i computes its own set 731 of linear inequalities A_i x <= b_i. Denote the size of this set 732 as m_i. 734 o Step 3: The ALTO server of each network AS_i introduces m_i non- 735 negative slack variables to transform its set of linear 736 inequalities into a set of linear equations. 738 o Step 4: The ALTO servers of all member networks use an SMPC 739 summation protocol to collectively compute k=m_1 + m_2 + ... + m_N 740 + 1. The value k is known to all the member networks. 742 o Step 5: The ALTO servers of each network AS_i selects a random k- 743 by-m_i matrix P_i, and compute the matrix P_iA_i and P_ib_i. 745 o Step 6: The ALTO servers of all member networks use an SMPC 746 summation protocol to collectively compute the summation of P_iA_i 747 and the summation of P_ib_i using the public key k_p. Then the 748 final result is returned to the ALTO client. 750 o Step 7: the ALTO client uses its own private key to decrypt the 751 result and get a set of linear equations sum P_iA_i x = sum 752 P_ib_i. 754 This process ensures that the networking resource capacity region 755 derived from sum P_iA_i x = sum P_ib_i is the same as that derived 756 from A_1 x <= b_1, A_2 x <= b_2, ... A_N x <= b_N. More importantly, 757 the ALTO client has no knowledge on the information of network 758 resource sharing of a single member network. 760 8.2. Example 762 This subsection uses the same example in Figure 5 to illustrate the 763 privacy-preserving information aggregation process. The set of 764 linear inequalities computed by each network is as follows: 766 A: x1 + x2 <= 10 767 B: x1 <= 3 768 C: x2 <= 3 770 Then the networks collectively compute k=1+1+1+1=4. And then 771 introduces slack variables to transform the linear inequalities into 772 linear equations: 774 A: x1 + x2 + x3 <= 10 775 B: x1 + x4 <= 3 776 C: x2 + x5 <= 3 778 For each network, the random matrix it chooses as follows: 780 P_A: [11, 49, 95, 34] 781 P_B: [58, 22, 75, 25] 782 P_C: [50, 69, 89, 95] 784 After the SMPC summation process, the decrypted set of linear 785 equations the ALTO client gets is 787 69 x1 + 61 x2 + 11 x3 + 58 x4_ + 50 x5 = 434 788 71 x1 + 118 x2 + 49 x3 + 22 x4_ + 69 x5 = 763 789 170 x1 + 184 x2 + 95 x3 + 75 x4_ + 89 x5 = 1442 790 59 x1 + 129 x2 + 34 x3 + 25 x4_ + 95 x5 = 700 792 Assume the goal is still to maximize the minimal bandwidth of two 793 tasks, the allocation decision made using this set of linear 794 equations will still be x1=3 and x2=3, i.e., assigning endpoints s1 795 and d1 to T1, with a bandwidth of 3 and assigning endpoints s2 and d2 796 to T2, with a bandwidth of 3 as well. 798 9. Discussion 800 9.1. Discovering the Domain-Paths Using a New Interdomain Routing 801 Protocol 803 The current design of the endpoint path discovery process in Unicorn 804 assumes that the underlying interdomain routing protocol is the 805 standard BGP, which only provides the path vector of ASes instead of 806 the path vector of (ingress, AS) tuples needed by Unicorn. If a 807 multi-domain, geo-distributed data analytics system uses an 808 interdomain routing protocol that provides the path vector of 809 (ingress, AS) pairs, the endpoint path discovery process in Unicorn 810 can be simplified to only send queries to the ALTO server of the 811 network where the source candidate endpoint locates. 813 10. Security Considerations 815 This document does not introduce any privacy or security issue not 816 already present in the ALTO protocol. 818 11. IANA Considerations 820 This document does not define any new media type or introduce any new 821 IANA consideration. 823 12. References 825 12.1. Normative References 827 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 828 Requirement Levels", BCP 14, RFC 2119, 829 DOI 10.17487/RFC2119, March 1997, . 832 12.2. Informative References 834 [Apollo] Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J., 835 Qian, Z., Wu, M., and L. Zhou, "Apollo: Scalable and 836 Coordinated Scheduling for Cloud-Scale Computing", 2014, 837 . 840 [Borg] Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., 841 Tune, E., and J. Wilkes, "Large-scale cluster management 842 at Google with Borg", 2015, . 845 [DRAFT-FCS] 846 Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang, 847 "ALTO Extension: Flow-based Cost Query", 2017, 848 . 850 [DRAFT-PV] 851 Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. 852 Yang, "ALTO Extension: Abstract Path Vector as a Cost 853 Mode", 2015, . 856 [DRAFT-RSA] 857 Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G. 858 Chen, "A Recommendation for Compressing ALTO Path 859 Vectors", 2017, . 862 [DRAFT-SSE] 863 Roome, W. and Y. Yang, "ALTO Incremental Updates Using 864 Server-Sent Events (SSE)", 2015, 865 . 868 [HTCondor] 869 Thain, D., Tannenbaum, T., and M. Livny, "Distributed 870 computing in practice: the Condor experience", 2005, 871 . 873 [Mesos] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., 874 Joseph, A., Katz, R., Shenker, S., and I. Stoica, "Mesos: 875 A Platform for Fine-Grained Resource Sharing in the Data 876 Center", 2011, 877 . 880 [RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., 881 Previdi, S., Roome, W., Shalunov, S., and R. Woundy, 882 "Application-Layer Traffic Optimization (ALTO) Protocol", 883 RFC 7285, DOI 10.17487/RFC7285, September 2014, 884 . 886 [RFC8189] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost 887 Application-Layer Traffic Optimization (ALTO)", RFC 8189, 888 DOI 10.17487/RFC8189, October 2017, . 891 [Sparrow] Ousterhout, K., Wendell, P., Zaharia, M., and I. Stoica, 892 "Sparrow: Distributed, Low Latency Scheduling", 2013, 893 . 895 [XCP] Katabi, D., Handley, M., and C. Rohrs, "Internet 896 Congestion Control for Future High Bandwidth-Delay Product 897 Environments", 2002, 898 . 901 Authors' Addresses 903 Qiao Xiang 904 Tongji/Yale University 905 51 Prospect Street 906 New Haven, CT 907 USA 909 Email: qiao.xiang@cs.yale.edu 910 Franck Le 911 IBM 912 Thomas J. Watson Research Center 913 Yorktown Heights, NY 914 USA 916 Email: fle@us.ibm.com 918 Y. Richard Yang 919 Tongji/Yale University 920 51 Prospect Street 921 New Haven, CT 922 USA 924 Email: yry@cs.yale.edu 926 Harvey Newman 927 California Institute of Technology 928 1200 California Blvd. 929 Pasadena, CA 930 USA 932 Email: newman@hep.caltech.edu 934 Haizhou Du 935 Tongji University 936 4800 Cao'an Hwy 937 Shanghai 201804 938 China 940 Email: duhaizhou@gmail.com