ALTO WG                                                         Q. Xiang
Internet-Draft                                         Xiamen University
Intended status: Informational                                  J. Zhang
Expires: 13 January 2022                          Tongji/Yale University
                                                                   F. Le
                                                                     IBM
                                                                 Y. Yang
                                                         Yale University
                                                               H. Newman
                                      California Institute of Technology
                                                            12 July 2021

 Resource Orchestration for Multi-Domain, Exascale, Geo-Distributed Data
                               Analytics
              draft-xiang-alto-multidomain-analytics-06.txt

Abstract

   As data volume increases exponentially over time, data analytics is
   transitioning from a single-domain network to a multi-domain,
   geo-distributed network, where different member networks contribute
   various resources, e.g., computation, storage and networking
   resources, to collaboratively collect, share and analyze extremely
   large amounts of data. Such a network calls for a resource
   orchestration framework that emphasizes the performance
   predictability of data analytics jobs, the high utilization of
   resources, and the autonomy and privacy of member networks.

   This document presents the design of Unicorn, a unified resource
   orchestration framework for multi-domain, geo-distributed data
   analytics, which uses the Application-Layer Traffic Optimization
   (ALTO) protocol as the key component to (1) allow member networks
   to provide accurate information on different types of resources; (2)
   protect the private information of member networks; and (3) allow
   data analytics jobs to accurately describe their requirements for
   different types of resources. As a part of Unicorn, an ALTO
   extension for privacy-preserving interdomain information aggregation
   is also presented.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 13 January 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Changes Since Version -05
   4.  Characteristics of Multi-Domain, Geo-Distributed Data Analytics
     4.1.  Dynamic Data Analytics Workload
     4.2.  Dynamic Resource Availability
   5.  Design Requirements
   6.  Review of Resource Orchestration Designs for Data Analytics
     6.1.  Centralized resource-graph-based orchestration
     6.2.  Centralized ClassAds-based orchestration
     6.3.  Distributed opportunistic orchestration
     6.4.  Inadequacy of Existing Designs for Multi-Domain,
           Geo-Distributed Data Analytics
   7.  Unicorn Design
     7.1.  Choosing ALTO as the Resource Information Model
     7.2.  Architecture of Unicorn
       7.2.1.  Three-Phase Resource Discovery
       7.2.2.  Proactive Full-Mesh Resource Discovery
     7.3.  Example
   8.  ALTO Extension: Privacy-Preserving Interdomain Information
       Aggregation for Resource Discovery
     8.1.  Extension Specification
     8.2.  Example
   9.  Implementation and Demonstration Experience
   10. Discussion
     10.1.  Discovering the Domain-Paths Using a New Interdomain
            Routing Protocol
     10.2.  Comparison of the Efficiency to Achieve Optimal Resource
            Orchestration using ALTO and without using ALTO
     10.3.  Future Work 1: Unified Resource Representation as an ALTO
            extension
     10.4.  Future Work 2: Integrating Unicorn and ALTO into Rucio
   11. Security Considerations
   12. IANA Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document describes the design of Unicorn, a unified resource
   orchestration framework for large-scale data analytics in
   multi-domain, geo-distributed networks. An important use case for
   such settings is the Large Hadron Collider (LHC) network, which
   consists of over 180 member networks all over the world and enables
   scientists to access multiple resources, e.g., computing, storage and
   networking resources, distributed across the member networks to
   conduct large-scale data analytics. With more and more data being
   generated and stored in different geo-distributed member networks,
   network architects and administrators are exploring different designs
   for efficient resource orchestration in multi-domain,
   geo-distributed networks.

   The design presented in this document is based on the development and
   deployment experience of Unicorn in the CMS network, one of the
   largest scientific experiments in the LHC network. The primary
   requirements of resource orchestration in such a multi-domain,
   geo-distributed environment are the performance predictability of
   various data analytics jobs, the high utilization of different types
   of resources, and the autonomy and privacy of resource owners, i.e.,
   member networks.

   Pre-production development and extensive testing have shown that the
   Application-Layer Traffic Optimization (ALTO) protocol [RFC7285] is
   well suited as a fundamental component in Unicorn for providing a
   generic representation that (1) allows different types of data
   analytics jobs to accurately describe their resource requirements and
   (2) allows member networks to provide accurate information on the
   different types of resources they own while preserving their privacy.
   This is in contrast with state-of-the-art resource orchestration
   frameworks, such as HTCondor and Mesos, which either do not provide
   accurate networking information or expose all the private details of
   member networks. This document elaborates on the design requirements
   of resource orchestration in multi-domain, geo-distributed networks
   that lead to this design choice and presents the details of Unicorn,
   including an ALTO extension for privacy-preserving, interdomain
   information aggregation.

   This document first gives an overview of the characteristics of
   multi-domain, geo-distributed data analytics. Then, the design
   requirements for resource orchestration under such settings are
   summarized. After reviewing existing designs and their limitations,
   this document gives the arguments for using ALTO as the generic
   representation for describing both resource requirements and
   resource information, and describes the design details of Unicorn.
   Finally, a privacy-preserving, interdomain extension of ALTO is
   presented.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Changes Since Version -05

   *  Added Section 10.4 to discuss how Rucio, the latest scientific
      data management framework for the ATLAS experiment at LHC, can
      benefit from integrating with ALTO and Unicorn.

4.  Characteristics of Multi-Domain, Geo-Distributed Data Analytics

   This section describes the characteristics of multi-domain,
   geo-distributed data analytics.

4.1.  Dynamic Data Analytics Workload

   In multi-domain, geo-distributed data analytics, extremely large
   amounts of data are generated and stored across different member
   networks.
   Authorized users from different organizations can access data and
   resources in member networks to conduct various data analytics jobs
   using various data analytics applications.

   A data analytics application usually provides an automated process
   that decomposes a large data analytics job into a set of smaller
   tasks, whose dependencies are expressed as a directed acyclic graph
   (DAG). Tasks that do not depend on each other can be executed in
   parallel to improve the efficiency of the data analytics job they
   belong to. This decomposition is highly user- and
   application-dependent.

   Each task may have different requirements on different resources.
   For instance, task T1 may require dataset A in storage node X as
   input and 1 CPU as the computing resource, while task T2 may require
   dataset B in storage node Y as input and 2 CPUs as the computing
   resource. Furthermore, each task may require resources from
   different member networks. In the previous example, T1 may require
   its output to be stored in a storage node in another member network
   for the purpose of secure storage. The resource requirements of
   tasks are highly user- and application-dependent.

   From the above description, it is observed that the workload of
   multi-domain, geo-distributed data analytics is highly dynamic, in
   terms of the number of users, the types of applications, the number
   of jobs, the decomposition of jobs and the resource requirements of
   tasks.

   Despite such dynamism, users generally expect performance
   predictability for their analytics jobs (TODO: add Mogul citation).
   Hence the resource orchestration for multi-domain, geo-distributed
   data analytics must be able to achieve efficient resource sharing
   among different data analytics jobs of different applications from
   different users.
   To this end, a generic representation of resource requirements for
   different tasks from different analytics applications must be
   chosen. Furthermore, to ensure maximal deployment, the resource
   orchestration framework must be independent of and compatible with
   data analytics applications.

4.2.  Dynamic Resource Availability

   In the multi-domain, geo-distributed data analytics network,
   different member networks belong to different administrative domains.
   Each member network has its own resource management policies and can
   choose to use different management software, such as HTCondor and
   Mesos.

   Each member network provides different types of resources in
   different amounts. For example, transit networks such as ESnet and
   Internet2 provide high-bandwidth networking resources. In contrast,
   campus science networks provide abundant computation and storage
   resources, but may provide limited networking bandwidth. Some
   smaller science networks provide only limited computation and storage
   resources. The availability of the resources in each member network
   is subject to the autonomous control of the member network.

   Furthermore, member networks are interconnected with links that have
   high bandwidth-delay products, where state-of-the-art networking
   resource allocation mechanisms, such as TCP, become inefficient
   [XCP].

   From the above description, it is observed that the resource
   availability of the multi-domain, geo-distributed data analytics
   network is also highly dynamic, subject to the types of member
   networks, the resources provided by member networks and the resource
   management policies and management software used by member networks.

   Despite such dynamism, member networks generally agree that the
   resource orchestration for multi-domain, geo-distributed data
   analytics must achieve high utilization of different types of
   resources while respecting the autonomy and privacy of each member
   network. To this end, a generic representation of resource
   availability for different types of resources must be chosen. Such
   a representation must be accurate and at the same time maintain the
   privacy of member networks. Furthermore, to ensure maximal
   deployment, the resource orchestration framework must be independent
   of and compatible with the resource management systems used by member
   networks.

5.  Design Requirements

   This section summarizes the design requirements for resource
   orchestration for multi-domain, geo-distributed data analytics from
   the previous section.

   *  REQ1: Provide performance predictability for data analytics jobs.

   *  REQ2: Achieve efficient resource sharing among data analytics
      jobs.

   *  REQ3: Achieve high utilization of different types of resources in
      member networks.

   *  REQ4: Maintain the autonomy and privacy of member networks.

   *  REQ5: Provide compatibility with different data analytics
      applications and resource management systems to maximize
      deployment.

6.  Review of Resource Orchestration Designs for Data Analytics

   This section provides an overview of three general types of resource
   orchestration designs for data analytics: the centralized
   resource-graph-based orchestration, the centralized ClassAds-based
   orchestration and the distributed opportunistic orchestration. Then,
   the key reason why these designs are inadequate for multi-domain,
   geo-distributed data analytics is provided.

6.1.  Centralized resource-graph-based orchestration

   Systems such as Mesos [Mesos] and Borg [Borg] adopt a graph-based
   abstraction to represent the resource availability of computing
   clusters. Each node in the graph is a physical node representing
   computation or storage resources and each edge between a pair of
   nodes denotes the networking resource connecting two physical nodes.
   This design is inadequate for multi-domain, geo-distributed data
   analytics systems because (1) it compromises the privacy of
   different member networks by revealing all the details of resources;
   and (2) the overhead of keeping the resource availability graph up to
   date is too high due to the heterogeneity and dynamicity of
   resources from different member networks.

6.2.  Centralized ClassAds-based orchestration

   HTCondor [HTCondor] proposes a ClassAds programming model, which
   allows different resource owners to advertise their resource supply
   and job owners to advertise their resource demand. However, this
   programming model does not support the accurate discovery of
   networking resources, and leaves the orchestration of networking
   resources completely to TCP, which is known to behave poorly in
   networks with high bandwidth-delay products [XCP].

6.3.  Distributed opportunistic orchestration

   Some systems, such as Apollo [Apollo] and Sparrow [Sparrow], use a
   distributed design. In this design, given a data analytics job, a
   small number of computing and storage nodes are randomly selected as
   candidates. Then a scheduling algorithm selects the best pair of
   computing and storage nodes within this small set of candidates.
   Though this design has been shown in production to achieve
   performance very close to that of the theoretically optimal resource
   allocation scheme, it cannot be applied to multi-domain,
   geo-distributed data analytics because (1) the pool of computing and
   storage resources is much larger, and is distributed across the
   world, and (2) it is hard to orchestrate networking resources in a
   distributed manner in such a high bandwidth-delay product scenario.

6.4.  Inadequacy of Existing Designs for Multi-Domain, Geo-Distributed
      Data Analytics

   Applying the designs reviewed in the preceding subsections to
   multi-domain, geo-distributed data analytics only satisfies the
   design requirement of compatibility (REQ5), but leaves all the other
   requirements unfulfilled. The key reason is that they do not have an
   information model that simultaneously

   *  allows member networks to provide accurate information on
      different types of resources, e.g., the computing, storage and
      networking resources, they own;

   *  keeps the private information of member networks, such as physical
      topologies and policies, from the data analytics applications; and

   *  allows data analytics jobs to accurately describe their
      requirements for different types of resources.

7.  Unicorn Design

   This section presents the design of the Unicorn framework. First,
   the motivations for using ALTO as the information model of resource
   orchestration for multi-domain, geo-distributed data analytics are
   reviewed. Then the architecture of Unicorn is provided.

7.1.  Choosing ALTO as the Resource Information Model

   As reviewed in the preceding section, the commonly used
   resource-graph-based information model and the ClassAds information
   model do not support accurate, yet privacy-preserving resource
   discovery across different member networks.
   In contrast, the ALTO protocol uses abstract maps of networks to
   provide network information with the goal of modifying network
   resource consumption patterns while maintaining or improving
   application performance [RFC7285]. This document proposes the use of
   ALTO for providing information on different types of resources, e.g.,
   computing, storage and networking resources. This design has the
   following advantages:

   *  ALTO provides network information based on abstract maps of a
      network. Additional services are built on top of the ALTO
      abstract maps to provide information on other types of resources,
      e.g., the computing and storage resources. These maps provide
      accurate information on different types of resources so that the
      resource orchestration system can effectively utilize them for
      data analytics applications. For example, the ALTO Endpoint
      Property Service can provide information on computing nodes and
      storage nodes.

   *  The ALTO abstract maps provide a simplified view of the resources
      of member networks, instead of the full details of their resource
      availability. Thus ALTO allows member networks to keep their
      private information, such as physical topologies and policies,
      from the applications. For example, the ALTO Network Map service
      provides a "one-big-switch" view that defines a grouping of
      network endpoints. This view hides the details of the underlying
      physical topology of the network, and a network deploying the
      ALTO server has the autonomy to adopt any endpoint grouping
      algorithm.

   *  ALTO uses a client-server model, in which applications can use
      ALTO clients to accurately describe their requirements for
      different types of resources and send these requirements to the
      ALTO servers to retrieve accurate information on the resources
      that suit their requirements.
      For example, the ALTO Multi-Cost service [RFC8189] allows an ALTO
      client to specify a logical set of tests in a query. Such tests
      are used by ALTO servers to filter out the information of
      unqualified resources from the response sent back to the ALTO
      client.

7.2.  Architecture of Unicorn

   This section describes the design details of Unicorn. Figure 1
   presents the architecture of Unicorn for a multi-domain,
   geo-distributed data analytics system with N member networks. In
   particular, Unicorn consists of the following key components:

   .-------------.                     .-------------.
   |Application 1|         ...         |Application N|
   '-------------'                     '-------------'
               \                         /
.- - - - - - - -\- - - - - - - - - - - -/- - - - - - - - - - - - - - -.
| Unicorn        \                     /                              |
|               .-----------------------.                             |
|               | Resource Orchestrator |     .----------------------.|
|               |                       |     |Distributed Hash Table||
|               |     .-----------.     |---- |   of Computing and   ||
|               |     |ALTO Client|     |     |  Storage Resources   ||
|               |     '-----------'     |     '----------------------'|
|               '-----------------------'                             |
|                    /      |      \                                  |
|                 /         |         \                               |
|  .-------------.    .-----------.    .-------------.                |
|  |ALTO Server 1|    | Execution |    |ALTO Server M|                |
|  '-------------'    |  Agents   |    '-------------'                |
|         |           '-----------'           |                       |
|         |          /             \          |                       |
| .----------------./               \ .----------------.              |
| |     Site 1     |       . .        |     Site N     |              |
| '----------------'                  '----------------'              |
'- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

                  Figure 1: Architecture of Unicorn.

   *  ALTO Server: for each member network, one or more ALTO servers are
      deployed to provide accurate, yet privacy-preserving information
      on different types of resources owned by the corresponding
      network. Examples of such information include the link bandwidth
      between endpoints, the memory I/O bandwidth and the CPU
      utilization at computing endpoints and the storage space at
      storage endpoints.
      In addition to the basic ALTO services defined in [RFC7285], the
      ALTO servers in Unicorn also provide ALTO extension services such
      as the ALTO Multi-Cost Service [RFC8189], the ALTO Server-Sent
      Event Service [DRAFT-SSE] and the ALTO Multipart Cost Property
      Service [DRAFT-PV] to provide fine-grained resource information.

   *  Distributed Hash Table (DHT) of Computing and Storage Resources: A
      DHT system is deployed across member networks to look up the
      locations of computing and storage resources. Compared with the
      current centralized lookup services in the CMS network, i.e.,
      PhEDEx and HTCondor, a DHT system provides a significant
      performance improvement for discovering the locations of computing
      and storage resources in multi-domain, geo-distributed data
      analytics systems.

   *  Resource Orchestrator: The orchestrator is a shim layer between
      the data analytics jobs from different applications and the member
      networks. It contains an ALTO client that communicates with the
      ALTO servers at member networks to retrieve resource information.
      Given a set of data analytics jobs, the orchestrator adopts a
      three-phase discovery process, which is elaborated in
      Section 7.2.1, to find accurate information on all the resources
      that can be used to execute these jobs. Then the orchestrator
      runs a customized resource allocation algorithm to compute the
      resource allocation decisions for these jobs, and sends the
      decisions to the execution agents at the corresponding member
      networks.

   *  Execution Agent: One or more execution agents are deployed at each
      member network. They take the resource allocation decisions from
      the resource orchestrator, and communicate with the underlying
      resource management system deployed at the corresponding member
      network to reserve the resources for the data analytics jobs and
      execute them.

7.2.1.  Three-Phase Resource Discovery

   The preceding subsection describes the architecture and the key
   components of Unicorn. What remains is how to accurately discover
   the information on different types of resources for a set of data
   analytics jobs with the assistance of ALTO. This section presents
   the three-phase resource discovery design in Unicorn.

7.2.1.1.  Phase 1: Endpoint Property Discovery

   Figure 2 shows the procedure of the endpoint property discovery
   phase. Given a set of data analytics jobs, the resource orchestrator
   communicates with the DHT lookup system to find the locations, i.e.,
   the endpoint addresses, of all candidate computing and storage
   resources. With such information, the ALTO client then issues
   Endpoint Property Service (EPS) queries to the ALTO servers deployed
   at member networks to discover the properties of all candidate
   endpoints.

   .-----------------------.
   | Resource Orchestrator |  Jobs' Resource
   |                       |  Requirements    .------------.
   |     .-----------.     | ---------------> | DHT Lookup |
   |     |ALTO Client|     | <--------------- |   System   |
   |     '-----------'     |  Endpoint        '------------'
   '---------| ^-----------'  Locations
        EPS  | |  Endpoint
     Queries | |  Properties
             v |
       .-------------.
       |ALTO Servers |
       '-------------'

           Figure 2: The Endpoint Property Discovery Phase.

7.2.1.2.  Phase 2: Endpoint Path Discovery

   Candidate computing and storage endpoints need to move data between
   them before, during and after the execution of a data analytics job.
   In multi-domain, geo-distributed data analytics, a pair of candidate
   endpoints may not be in the same member network. In this case, the
   orchestrator needs to find out the connectivity information between
   such a pair of candidate endpoints.

   Figure 3 shows the procedure of the endpoint path discovery phase.
   Given a pair of candidate endpoints that are not in the same member
   network, the ALTO client in the orchestrator adopts an iterative
   process to find the interdomain connectivity information for this
   pair. It starts by issuing an ALTO Endpoint Cost Service query or an
   ALTO Flow-based Endpoint Cost Service [DRAFT-FCS] query to the ALTO
   server of the member network where the source endpoint is located.
   The cost type of this query is a customized type called next-hop,
   with a customized cost mode tuple and a customized cost metric
   next-network.

   .-----------------------.
   | Resource Orchestrator |
   |                       |
   |     .-----------.     |
   |     |ALTO Client|     |
   |     '-----------'     |
   '---------| ^-----------'
   Customized| |  Endpoint
         ECS | |  Path
     Queries | |  Segments
             v |
       .-------------.
       |ALTO Servers |
       '-------------'

             Figure 3: The Endpoint Path Discovery Phase.

   The ALTO server returns a 2-tuple, where the first element is the
   autonomous system (AS) number of the next member network along the
   AS-path from the source endpoint to the destination endpoint, and the
   second element is the ingress of this next member network. In a
   member network, the ALTO server can get such information from the
   underlying interdomain routing protocol, e.g., BGP. Based on the
   received response, the ALTO client then issues a similar query to the
   ALTO server of the next member network. The process stops when the
   query reaches the ALTO server of the member network where the
   destination endpoint is located; that server returns a null 2-tuple
   to notify the ALTO client. By the end of this process, the ALTO
   client can assemble a domain-path, in the form of a path vector of
   (ingress, AS), for this pair of candidate endpoints.

7.2.1.3.  Phase 3: Resource State Abstraction Discovery

   After the second phase, the resource orchestrator has the
   connectivity information of each candidate endpoint pair, i.e., the
   domain-path. Equivalently, for each member network, it knows the set
   of all candidate endpoint pairs whose paths enter this network. With
   this information, the resource orchestrator can communicate with the
   ALTO servers at member networks to discover the resource sharing
   among all the candidate endpoint pairs. In particular, Unicorn
   extends the routing state abstraction [DRAFT-RSA] to the more generic
   resource state abstraction to represent such resource sharing.

   Figure 4 shows the procedure of the resource state abstraction
   discovery phase. For each member network, the ALTO client in the
   orchestrator sends an ALTO Multipart Cost Property Service query
   defined in [DRAFT-PV], providing the set of candidate endpoint pairs
   as input. The cost type of this query is path vector. Upon
   receiving the query, the ALTO server in each member network computes
   an ALTO cost map and an ALTO property map and returns them to the
   ALTO client. These two maps represent a set of linear inequalities
   revealing the resource sharing among the set of candidate endpoint
   pairs in the member network.

   .-----------------------.
   | Resource Orchestrator |
   |                       |
   |     .-----------.     |
   |     |ALTO Client|     |
   |     '-----------'     |
   '---------| ^-----------'
    Multipart| |  Resource
        Cost | |  State
    Property | |  Abstraction
     Queries v |
       .-------------.
       |ALTO Servers |
       '-------------'

       Figure 4: The Resource State Abstraction Discovery Phase.

   Unicorn provides two mechanisms for the ALTO servers to return the
   computed cost maps and property maps to the ALTO client. The first
   mechanism is to let each ALTO server independently send its response
   to the ALTO client.
The second mechanism is a privacy-preserving 606 interdomain information aggregation process, in which the ALTO 607 servers in all member networks use a secure multi-party computation 608 (SMPC) protocol to collectively send the responses to the ALTO client 609 without revealing the source of any entry (i.e., any linear 610 inequality) in the cost maps and property maps. 612 The first mechanism has a higher security risk in that it exposes the 613 bottleneck resource information of each member network. In contrast, 614 the second mechanism provides better protection of the private 615 information of each member network. The details of the privacy- 616 preserving interdomain information aggregation process are 617 presented in the next section. 619 After receiving the responses from the ALTO servers of 620 all the member networks, the orchestrator completes the resource 621 discovery process, having collected accurate information on the different 622 types of resources for data analytics jobs. 624 7.2.2. Proactive Full-Mesh Resource Discovery 626 To ensure that the resource discovery process scales, a proactive full- 627 mesh resource discovery component is developed. The main idea of 628 this component is to have the ALTO client periodically query 629 ALTO servers at all sites to discover the resource state abstraction 630 between every pair of source and destination sites. As such, when an 631 application submits a resource discovery request, the ALTO client 632 does not need to send any query to the ALTO servers. Instead, using 633 the site-level bandwidth sharing information, the ALTO client can 634 immediately perform projection operations to get the resource 635 information for the request. This mechanism substantially improves 636 the scalability of Unicorn. 638 7.3. Example 640 This subsection gives an example to illustrate the workflow of 641 Unicorn.
Figure 5 gives a topology of three member networks, where 642 s1 and s2 are storage endpoints and d1 and d2 are computation 643 endpoints. Assume a data analytics job is composed of two parallel 644 tasks T1 and T2. T1 needs dataset X as input and T2 needs dataset Y 645 as input. 647 .------------. 648 | Network B | 649 .------------. ingB| | 650 | Network A |--------| d1 | 651 | | '------------' 652 | s1 | 653 | | .------------. 654 | s2 |--------| Network C | 655 '------------' ingC| | 656 | d2 | 657 '------------' 659 Figure 5: An Illustrative Example for Unicorn. 661 In the endpoint property discovery phase, the Unicorn resource 662 orchestrator learns from the DHT lookup system that s1 stores X and s2 stores Y, as well as the 663 locations of s1, s2, d1 and d2. It then 664 issues EPS queries to networks A, B and C, respectively, to discover 665 that d1 satisfies the computing requirements of T1 and d2 satisfies 666 the computing requirements of T2. Hence there are only two candidate 667 endpoint pairs: (s1, d1) and (s2, d2). 669 In the endpoint path discovery phase, the ALTO client in the 670 orchestrator iteratively issues Endpoint Cost Service (ECS) queries to 671 the ALTO servers in member networks, and finds that the domain-path 672 for pair (s1, d1) is [(null, A), (ingB, B)] and the domain-path for 673 pair (s2, d2) is [(null, A), (ingC, C)]. Hence both pairs will use 674 the networking resources of network A, while only (s1, d1) will use 675 network B and only (s2, d2) will use network C. 677 In the resource state abstraction discovery phase, the ALTO client in 678 the orchestrator issues Multipart Cost Property Service queries to 679 networks A, B and C, respectively. Denote the available bandwidth 680 that can be assigned to T1 as x1 and that to T2 as x2.
Assume the 681 linear inequalities computed by the three networks are: 683 A: x1 + x2 <= 10Mbps 684 B: x1 <= 3Mbps 685 C: x2 <= 3Mbps 687 If the ALTO servers use the first mechanism to directly return their 688 resource information to the ALTO client, each of them will 689 send a cost map and a property map response encoding its own linear 690 inequality to the ALTO client. In this way, the orchestrator gets 691 the accurate information about networking resource sharing between 692 (s1, d1) and (s2, d2). It can then invoke a resource allocation 693 algorithm to allocate the resources to tasks T1 and T2. For example, 694 if the goal is to maximize the minimal bandwidth of the two tasks, the 695 allocation decision will be to assign endpoints s1 and d1 to T1, with 696 a bandwidth of 3Mbps, and assign endpoints s2 and d2 to T2, with a 697 bandwidth of 3Mbps as well. 699 8. ALTO Extension: Privacy-Preserving Interdomain Information 700 Aggregation for Resource Discovery 702 This section describes a customized ALTO extension in Unicorn that 703 supports the privacy-preserving discovery of networking resource 704 sharing among a set of candidate endpoint pairs. 706 8.1. Extension Specification 708 Figure 6 presents the workflow of the proposed ALTO extension. 709 Assume a set of N member networks, denoted as AS_1, AS_2, ..., AS_N, 710 and let F be the number of all candidate endpoint pairs. The interdomain 711 information aggregation process works as follows: 713 .-----------------------. 714 | Resource Orchestrator | 715 | | 716 | .-----------. | 717 | |ALTO Client| | 718 | '-----------' | 719 >'-----------------------'< 720 /Disguised /|\ \ 721 Multipart Cost / Response | \ 722 Property Queries v | \ 723 .--------. .--------. .--------. 724 |ALTO | |ALTO | |ALTO | 725 |Server 1| <------->|Server 2|<--... -->|Server N| 726 '--------' Limitedly'--------' '--------' 727 Shared Secret 729 Figure 6: The Privacy-Preserving Interdomain Resource Information 730 Aggregation.
732 * Step 1: The ALTO client sends the Multipart Cost Property Service 733 request and a homomorphic public key k_p to each member 734 network. 736 * Step 2: The ALTO server of each network AS_i computes its own set 737 of linear inequalities A_i x <= b_i. Denote the size of this set 738 as m_i. 740 * Step 3: The ALTO server of each network AS_i introduces m_i non- 741 negative slack variables to transform its set of linear 742 inequalities into a set of linear equations. 744 * Step 4: The ALTO servers of all member networks use a private 745 matrix SMPC summation protocol to collectively compute k=m_1 + m_2 746 + ... + m_N + 1. The value k is known to all the member networks. 748 * Step 5: The ALTO server of each network AS_i selects a random k- 749 by-m_i matrix P_i, and computes P_iA_i and P_ib_i. 751 * Step 6: The ALTO server of each network then uses a few matrices, 752 which are only shared with a couple of other networks, to further 753 obfuscate P_iA_i and P_ib_i, and sends the obfuscated matrices to 754 the ALTO client via symmetric encryption. 756 * Step 7: The ALTO client decrypts the received responses from all 757 ALTO servers, and sums up the decrypted responses to get a set of 758 linear equations sum P_iA_i x = sum P_ib_i. 760 This process ensures that the networking resource capacity region 761 derived from sum P_iA_i x = sum P_ib_i is the same as that derived 762 from A_1 x <= b_1, A_2 x <= b_2, ... A_N x <= b_N. More importantly, 763 the ALTO client learns nothing about the 764 resource sharing within any single member network. 766 8.2. Example 768 This subsection reuses the example in Figure 5 to illustrate the 769 privacy-preserving information aggregation process. The set of 770 linear inequalities computed by each network is as follows: 772 A: x1 + x2 <= 10 773 B: x1 <= 3 774 C: x2 <= 3 776 Then the networks collectively compute k=1+1+1+1=4.
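The arithmetic of this example can be checked with a short script. The following is a minimal Python sketch, not part of the Unicorn implementation. It assumes that each network contributes a single inequality (m_i = 1), writes that inequality in slack form over the variables [x1, x2, x3, x4, x5], and treats the example vectors P_A, P_B and P_C given in this subsection as k-by-1 random matrices. It omits the SMPC summation of Step 4, the shared-secret obfuscation of Step 6, and all encryption. The names networks, aggregate and satisfies are illustrative only.

```python
# Variable order: [x1, x2, x3, x4, x5]; x3, x4 and x5 are slack variables.
# Each entry holds one network's equation in slack form (Step 3) and the
# random k-by-1 matrix P_i it selects (Step 5), taken from the example.
networks = {
    "A": {"row": [1, 1, 1, 0, 0], "rhs": 10, "P": [11, 49, 95, 34]},
    "B": {"row": [1, 0, 0, 1, 0], "rhs": 3, "P": [58, 22, 75, 25]},
    "C": {"row": [0, 1, 0, 0, 1], "rhs": 3, "P": [50, 69, 89, 95]},
}


def aggregate(nets):
    """Sum P_i * A_i and P_i * b_i over all networks (Steps 5 and 7)."""
    k = len(nets) + 1  # k = m_A + m_B + m_C + 1, with every m_i = 1
    n_vars = 5
    lhs = [[0] * n_vars for _ in range(k)]
    rhs = [0] * k
    for net in nets.values():
        for r in range(k):
            for c in range(n_vars):
                lhs[r][c] += net["P"][r] * net["row"][c]
            rhs[r] += net["P"][r] * net["rhs"]
    return lhs, rhs


def satisfies(lhs, rhs, x):
    """Return True if the point x satisfies every aggregated equation."""
    return all(
        sum(a * v for a, v in zip(row, x)) == b for row, b in zip(lhs, rhs)
    )


lhs, rhs = aggregate(networks)
for coeffs, b in zip(lhs, rhs):
    print(coeffs, "=", b)

# x1 = x2 = 3 leaves slack x3 = 4 in network A and no slack in B or C.
print(satisfies(lhs, rhs, [3, 3, 4, 0, 0]))
```

Running the sketch prints the four aggregated equations of this example and confirms that x1 = 3, x2 = 3 (with slack values x3 = 4, x4 = 0, x5 = 0) satisfies all of them.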
The networks then 777 introduce slack variables x3, x4 and x5 to transform the linear inequalities into 778 linear equations: 780 A: x1 + x2 + x3 = 10 781 B: x1 + x4 = 3 782 C: x2 + x5 = 3 784 The random matrix chosen by each network is as follows: 786 P_A: [11, 49, 95, 34] 787 P_B: [58, 22, 75, 25] 788 P_C: [50, 69, 89, 95] 790 After the obfuscating process in Step 5 and Step 6 in the previous 791 subsection, the decrypted set of linear equations the ALTO client 792 gets is 794 69 x1 + 61 x2 + 11 x3 + 58 x4 + 50 x5 = 434 795 71 x1 + 118 x2 + 49 x3 + 22 x4 + 69 x5 = 763 796 170 x1 + 184 x2 + 95 x3 + 75 x4 + 89 x5 = 1442 797 59 x1 + 129 x2 + 34 x3 + 25 x4 + 95 x5 = 700 799 Assuming the goal is still to maximize the minimal bandwidth of the two 800 tasks, the allocation decision made using this set of linear 801 equations will still be x1=3 and x2=3, i.e., assigning endpoints s1 802 and d1 to T1, with a bandwidth of 3, and assigning endpoints s2 and d2 803 to T2, with a bandwidth of 3 as well. 805 9. Implementation and Demonstration Experience 807 The authors built an ALTO server on top of the OpenDaylight Software 808 Defined Network controller. The ALTO server collects the network 809 state information from the OpenDaylight controller, e.g., topology, 810 policy and traffic statistics, processes the collected information 811 into resource abstraction, and sends the abstraction back to the ALTO 812 client at the resource orchestrator. 814 In the past few years, the Unicorn framework has been deployed and 815 demonstrated in small federation networks connecting Dallas, Texas, 816 Los Angeles, California, and Denver, Colorado at different 817 conventions. For example, in 2018, the federation in the 818 demonstration was composed of three member networks. Network 1 is in 819 Dallas, Texas, and network 2 and network 3 are in Los Angeles, 820 California.
Network 1 is connected to network 2 through a layer-2 821 WAN circuit with a 100 Gbps bandwidth, provisioned by several 822 providers such as SCinet, CenturyLink and CENIC. Network 1 is a 823 temporary science network in the CMS experiment, while networks 2 and 3 824 are long-running CMS Tier-2 sites. In this federation, users need to 825 reserve network resources to transfer large-scale science datasets 826 (e.g., with a size of hundreds of PB) between networks. 828 The authors evaluate the accuracy and latency of the framework for 829 discovering network resources in this network. During the 830 evaluation, the framework accurately discovers the network resource 831 information for a large number of circuit reservation requests with 832 a very low discovery latency. Specifically, for all the reservation 833 requests, Unicorn always provides accurate information on 834 available bandwidth sharing in the network (i.e., 100% accuracy), 835 with an average discovery latency of 100 milliseconds, and a worst-case 836 latency of less than 1 second. With the discovered network resource 837 information, users can transmit large-scale science datasets at 838 speeds of up to 100 Gbps (i.e., the theoretical maximum throughput). 840 10. Discussion 841 10.1. Discovering the Domain-Paths Using a New Interdomain Routing 842 Protocol 844 The current design of the endpoint path discovery process in Unicorn 845 assumes that the underlying interdomain routing protocol is the 846 standard BGP, which only provides the path vector of ASes instead of 847 the path vector of (ingress, AS) tuples needed by Unicorn. If a 848 multi-domain, geo-distributed data analytics system uses an 849 interdomain routing protocol that provides the path vector of 850 (ingress, AS) tuples, the endpoint path discovery process in Unicorn 851 can be simplified to only send queries to the ALTO server of the 852 network where the source candidate endpoint is located. 854 10.2.
Comparison of the Efficiency to Achieve Optimal Resource 855 Orchestration using ALTO and without using ALTO 857 The authors of this draft conducted a systematic investigation of 858 the difference in efficiency between achieving optimal resource 859 orchestration with ALTO and without ALTO. 860 Specifically, the authors study the problem of computing the optimal 861 resource reservation without using ALTO, but only using the simple 862 reservation interface deployed in PCE-based network resource 863 reservation systems (e.g., OSCARS). Given a client's request, a 864 novel algorithm is developed to compute the optimal resource 865 reservation within O(n^3) queries to the simple reservation 866 interface, where n is the number of flows in the request. To the authors' knowledge, this is 867 the best known algorithm for this problem. Although the asymptotic 868 complexity appears efficient, the authors' experience in testing this 869 design shows that it often takes tens of thousands of queries to the 870 simple reservation interface to compute the optimal resource 871 orchestration, leading to substantial orchestration latencies. This 872 result has been published at AAAI 2019. 874 In contrast, by leveraging the capability of ALTO to expose accurate 875 network information to the client, the Unicorn framework can compute 876 the optimal resource reservation by querying an ALTO server only 877 once, resulting in orders of magnitude improvement in resource 878 orchestration latencies. 880 10.3. Future Work 1: Unified Resource Representation as an ALTO 881 extension 883 The authors of this draft are currently designing an ALTO extension 884 which uses generic mathematical programming constraints as a unified 885 resource representation of heterogeneous resources in the network. 886 In particular, a SQL-style language is designed for an ALTO client to 887 express its requirements on available resources.
An SMT-based 888 compiler is developed for the network, which translates a user's SQL 889 resource query into a constraint programming model, finds feasible 890 resources in the network, and returns such resource information to 891 the user in a compact representation expressed as generic mathematical 892 programming constraints. More details about this extension are 893 reported in a different draft titled "ALTO Extension: Unified 894 Resource Representation". A paper providing preliminary evaluation 895 results of this extension has been accepted by the ACM SIGCOMM NAI 2020 896 Workshop. 898 10.4. Future Work 2: Integrating Unicorn and ALTO into Rucio 900 Rucio [RUCIO] is a scientific data management framework that supports 901 scientific collaborations with the functionality to organize, manage, 902 and access data at exascale. It is currently deployed at the ATLAS 903 experiment of the LHC project, to manage detector data, simulation 904 data and user data. Rucio interacts with the heterogeneous storage 905 systems that are in use in ATLAS, and schedules the dataset transfers 906 in the ATLAS science network using the network infrastructure 907 provided by multiple national research and education networks, 908 including ESnet and Internet2, as well as commercial cloud storage 909 providers, including Google and Amazon. 911 However, like other scientific data management frameworks, 912 e.g., PhEDEx, Rucio does not see the underlying network 913 infrastructure. As such, it may suffer from underutilization of 914 network resources, leading to a low data transfer efficiency. In 915 contrast, ALTO and Unicorn can provide accurate, yet privacy- 916 preserving network resource information.
By integrating ALTO and 917 Unicorn into the Rucio framework, Rucio will possess the capability 918 to retrieve accurate resource information from the underlying network 919 infrastructure, and use such information to schedule the exascale 920 scientific data flows with more predictable performance (e.g., 921 latency and throughput). As such, the authors plan to systematically 922 investigate the integration of ALTO/Unicorn and Rucio in the new 923 charter cycle of the ALTO working group. 925 11. Security Considerations 927 This document does not introduce any privacy or security issue not 928 already present in the ALTO protocol. 930 12. IANA Considerations 932 This document does not define any new media type or introduce any new 933 IANA consideration. 935 13. References 936 13.1. Normative References 938 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 939 Requirement Levels", BCP 14, RFC 2119, 940 DOI 10.17487/RFC2119, March 1997, 941 . 943 13.2. Informative References 945 [Apollo] Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J., 946 Qian, Z., Wu, M., and L. Zhou, "Apollo: Scalable and 947 Coordinated Scheduling for Cloud-Scale Computing", 2014, 948 . 951 [Borg] Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., 952 Tune, E., and J. Wilkes, "Large-scale cluster management 953 at Google with Borg", 2015, 954 . 956 [DRAFT-FCS] 957 Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. R. Yang, 958 "ALTO Extension: Flow-based Cost Query", 2017, 959 . 961 [DRAFT-PV] Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. R. 962 Yang, "ALTO Extension: Abstract Path Vector as a Cost 963 Mode", 2015, . 966 [DRAFT-RSA] 967 Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y. R., and G. 968 Chen, "A Recommendation for Compressing ALTO Path 969 Vectors", 2017, . 972 [DRAFT-SSE] 973 Roome, W. and Y. R. Yang, "ALTO Incremental Updates Using 974 Server-Sent Events (SSE)", 2015, 975 . 978 [HTCondor] Thain, D., Tannenbaum, T., and M.
Livny, "Distributed 979 computing in practice: the Condor experience", 2005, 980 . 982 [Mesos] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., 983 Joseph, A., Katz, R., Shenker, S., and I. Stoica, "Mesos: 984 A Platform for Fine-Grained Resource Sharing in the Data 985 Center", 2011, 986 . 989 [RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., 990 Previdi, S., Roome, W., Shalunov, S., and R. Woundy, 991 "Application-Layer Traffic Optimization (ALTO) Protocol", 992 RFC 7285, DOI 10.17487/RFC7285, September 2014, 993 . 995 [RFC8189] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost 996 Application-Layer Traffic Optimization (ALTO)", RFC 8189, 997 DOI 10.17487/RFC8189, October 2017, 998 . 1000 [RUCIO] Barisits, M., "Rucio: Scientific Data Management", 2019, 1001 . 1003 [Sparrow] Ousterhout, K., Wendell, P., Zaharia, M., and I. Stoica, 1004 "Sparrow: Distributed, Low Latency Scheduling", 2013, 1005 . 1007 [XCP] Katabi, D., Handley, M., and C. Rohrs, "Internet 1008 Congestion Control for Future High Bandwidth-Delay Product 1009 Environments", 2002, 1010 . 1013 Authors' Addresses 1015 Qiao Xiang 1016 Xiamen University 1017 School of Informatics 1018 Xiamen 1019 Fujian Province, 1020 China 1022 Email: xiangq27@gmail.com 1023 J. Jensen Zhang 1024 Tongji/Yale University 1025 51 Prospect Street 1026 New Haven, CT 1027 United States of America 1029 Email: jingxuan.zhang@yale.edu 1031 Franck Le 1032 IBM 1033 Thomas J. Watson Research Center 1034 Yorktown Heights, NY, 1035 United States of America 1037 Email: fle@us.ibm.com 1039 Y. Richard Yang 1040 Yale University 1041 51 Prospect Street 1042 New Haven, CT 1043 United States of America 1045 Email: yry@cs.yale.edu 1047 Harvey Newman 1048 California Institute of Technology 1049 1200 California Blvd. 1050 Pasadena, CA 1051 United States of America 1053 Email: newman@hep.caltech.edu