idnits 2.17.1 draft-xiang-alto-multidomain-analytics-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 593 has weird spacing: '...Queries v |...' == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (July 13, 2020) is 1383 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '11' on line 784 -- Looks like a reference, but probably isn't: '49' on line 784 -- Looks like a reference, but probably isn't: '95' on line 786 -- Looks like a reference, but probably isn't: '34' on line 784 -- Looks like a reference, but probably isn't: '58' on line 785 -- Looks like a reference, but probably isn't: '22' on line 785 -- Looks like a reference, but probably isn't: '75' on line 785 -- Looks like a reference, but probably isn't: '25' on line 785 -- Looks like a reference, but probably isn't: '50' on line 786 -- Looks like a reference, but probably isn't: '69' on line 786 -- Looks like a reference, but probably isn't: '89' on line 786 Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 ALTO WG Q. Xiang 3 Internet-Draft Yale University 4 Intended status: Informational J. Zhang 5 Expires: January 14, 2021 Tongji/Yale University 6 F. Le 7 IBM 8 Y. Yang 9 Yale University 10 H. Newman 11 California Institute of Technology 12 July 13, 2020 14 Resource Orchestration for Multi-Domain, Exascale, Geo-Distributed Data 15 Analytics 16 draft-xiang-alto-multidomain-analytics-05.txt 18 Abstract 20 As the data volume increases exponentially over time, data analytics 21 is transiting from a single-domain network to a multi-domain, geo- 22 distributed network, where different member networks contribute 23 various resources, e.g., computation, storage and networking 24 resources, to collaboratively collect, share and analyze extremely 25 large amounts of data. Such a network calls for a resource 26 orchestration framework that emphasizes the performance 27 predictability of data analytics jobs, the high utilization of 28 resources, and the autonomy and privacy of member networks. 30 This document presents the design of Unicorn, a unified resource 31 orchestration framework for multi-domain, geo-distributed data 32 analytics, which uses the Application-Layer Traffic Optimization 33 (ALTO) protocol as the key component for (1) allows member networks 34 to provide accurate information on different types of resources; (2) 35 keeps the private information of member networks; and (3) allows data 36 analytics jobs to accurately describe their requirements of different 37 types of resources. As a part of Unicorn, an ALTO extension for 38 privacy-preserving interdomain information aggregation is also 39 presented. 41 Status of This Memo 43 This Internet-Draft is submitted in full conformance with the 44 provisions of BCP 78 and BCP 79. 46 Internet-Drafts are working documents of the Internet Engineering 47 Task Force (IETF). Note that other groups may also distribute 48 working documents as Internet-Drafts. The list of current Internet- 49 Drafts is at https://datatracker.ietf.org/drafts/current/. 51 Internet-Drafts are draft documents valid for a maximum of six months 52 and may be updated, replaced, or obsoleted by other documents at any 53 time. It is inappropriate to use Internet-Drafts as reference 54 material or to cite them other than as "work in progress." 56 This Internet-Draft will expire on January 14, 2021. 58 Copyright Notice 60 Copyright (c) 2020 IETF Trust and the persons identified as the 61 document authors. All rights reserved. 63 This document is subject to BCP 78 and the IETF Trust's Legal 64 Provisions Relating to IETF Documents 65 (https://trustee.ietf.org/license-info) in effect on the date of 66 publication of this document. Please review these documents 67 carefully, as they describe your rights and restrictions with respect 68 to this document. Code Components extracted from this document must 69 include Simplified BSD License text as described in Section 4.e of 70 the Trust Legal Provisions and are provided without warranty as 71 described in the Simplified BSD License. 73 Table of Contents 75 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 76 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 4 77 3. Changes Since Version -03 . . . . . . . . . . . . . . . . . . 4 78 4. Characteristics of Multi-Domain, Geo-Distributed Data 79 Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . 4 80 4.1. Dynamic Data Analytics Workload . . . . . . . . . . . . . 4 81 4.2. Dynamic Resource Availability . . . . . . . . . . . . . . 5 82 5. Design Requirements . . . . . . . . . . . . . . . . . . . . . 6 83 6. Review of Resource Orchestration Designs for Data Analytics . 6 84 6.1. Centralized resource-graph-based orchestration . . . . . 7 85 6.2. Centralized ClassAds-based orchestration . . . . . . . . 7 86 6.3. Distributed opportunistic orchestration . . . . . . . . . 7 87 6.4. Inadequacy of Existing Designs for Multi-Domain, Geo- 88 Distributed Data Analytics . . . . . . . . . . . . . . . 7 89 7. Unicorn Design . . . . . . . . . . . . . . . . . . . . . . . 8 90 7.1. Choosing ALTO as the Resource Information Model . . . . . 8 91 7.2. Architecture of Unicorn . . . . . . . . . . . . . . . . . 9 92 7.2.1. Three-Phase Resource Discovery . . . . . . . . . . . 11 93 7.2.2. Proactive Full-Mesh Resource Discovery . . . . . . . 15 94 7.3. Example . . . . . . . . . . . . . . . . . . . . . . . . . 15 95 8. ALTO Extension: Privacy-Preserving Interdomain Information 96 Aggregation for Resource Discovery . . . . . . . . . . . . . 16 97 8.1. Extension Specification . . . . . . . . . . . . . . . . . 16 98 8.2. Example . . . . . . . . . . . . . . . . . . . . . . . . . 18 99 9. Implementation and Demonstration Experience . . . . . . . . . 19 100 10. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 19 101 10.1. Discovering the Domain-Paths Using a New Interdomain 102 Routing Protocol . . . . . . . . . . . . . . . . . . . . 19 103 10.2. Comparison of the Efficiency to Achieve Optimal Resource 104 Orchestration using ALTO and without using ALTO . . . . 20 105 10.3. Current Ongoing Work: Unified Resource Representation as 106 an ALTO extension . . . . . . . . . . . . . . . . . . . 20 107 11. Security Considerations . . . . . . . . . . . . . . . . . . . 21 108 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 109 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 21 110 13.1. Normative References . . . . . . . . . . . . . . . . . . 21 111 13.2. Informative References . . . . . . . . . . . . . . . . . 21 112 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 114 1. Introduction 116 This document describes the design of Unicorn, a unified resource 117 orchestration framework for large-scale data analytics in multi- 118 domain, geo-distributed networks. An important use case for such 119 settings is the Large Hadron Collider (LHC) network, which consists 120 of over 180 member networks all over the world, to support scientists 121 to access multiple resources, e.g., computing, storage and networking 122 resources, distributed in the member networks to conduct large-scale 123 data analytics. With more and more data being generated and stored 124 in different geo-distributed member networks, network architects and 125 administrators are exploring different designs for efficient resource 126 orchestration in multi-domain, geo-distributed networks. 128 The design presented in this document is based on the development and 129 deployment experience of Unicorn in the CMS network, one of the 130 largest scientific experiments in the LHC network. The primary 131 requirements of resource orchestration in such a multi-domain, geo- 132 distributed environment are the performance predictability of various 133 data analytics jobs, the high utilization of different types of 134 resources, and the autonomy and privacy of resource owners, i.e., 135 member networks. 137 Pre-production development and extensive testing have shown that the 138 Application-Layer Traffic Optimization Protocol [RFC7285] is well 139 suited as a fundamental component in Unicorn for providing a generic 140 representation that (1) allows different types of data analytics jobs 141 to accurately describe their resource requirements and (2) allows 142 member networks to provide accurate information on different types of 143 resources they own and at the same time maintain their privacies. 145 This is in contrast with the state-of-the-art resource orchestration 146 frameworks, such as HTCondor and Mesos, which either do not provide 147 accurate networking information or expose all the private details of 148 member networks. This document elaborates on the design requirements 149 of resource orchestration in multi-domain, geo-distributed networks 150 that lead to this design choice and presents the details of Unicorn, 151 including an ALTO extension for privacy-preserving, interdomain 152 information aggregation. 154 This document first gives an overview of the characteristics of 155 multi-domain, geo-distributed data analytics. Then, the design 156 requirements for resource orchestration under such settings are 157 summarized. After reviewing existing designs and their limitations, 158 this document gives the arguments for using ALTO as the generic 159 representation for describing both resource requirements and the 160 resource information and describes the design details of Unicorn. 161 Finally, a privacy-preserving, interdomain extension of ALTO is 162 presented. 164 2. Requirements Language 166 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 167 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 168 document are to be interpreted as described in [RFC2119]. 170 3. Changes Since Version -03 172 o Add a discussion on an ongoing work to develop and deploy unified 173 resource representation as an ALTO extension to expose 174 heterogeneous resource information to application in Section 10.3. 176 4. Characteristics of Multi-Domain, Geo-Distributed Data Analytics 178 This section describes the characteristics of multi-domain, geo- 179 distributed data analytics. 181 4.1. Dynamic Data Analytics Workload 183 In multi-domain, geo-distributed data analytics, extremely large 184 amounts of data are generated and stored across different member 185 networks. Authorized users from different organizations can access 186 data and resources in member networks to conduct various data 187 analytics jobs using various data analytics applications. 189 An data analytics application usually provides an automated process 190 that decomposes a large data analytics job into a set of smaller 191 tasks, whose dependencies are expressed as a directed acyclic graph 192 (DAG). Tasks without any dependency can be executed in parallel to 193 improve the efficiency of the data analytics job they belong to. 194 This decomposition is highly user- and application-dependent. 196 Each task may have different requirements on different resources. 197 For instance, task T1 may require dataset A in storage node X as 198 input and 1 CPU as the computing resource, while task T2 may require 199 dataset B in storage node Y as input and 2 CPUs as the computing 200 resource. Furthermore, each task may require resources from 201 different member networks. In the previous example, T1 may require 202 its output to be stored in a storage node in another member network 203 for the purpose of secure storage. The resource requirements of 204 tasks are highly user- and application-dependent. 206 From the above description, it is observed that the workload of 207 multi-domain, geo-distributed data analytics is highly dynamic, in 208 terms of the number of users, the types of applications, the number 209 of jobs, the decomposition of jobs and the resource requirements of 210 tasks. 212 Though with such dynamism, it is the general consensus of users to 213 expect performance predictability of their analytics jobs (TODO: add 214 Mogul citation). Hence the resource orchestration for multi-domain, 215 geo-distributed data analytics must be able to achieve efficient 216 resource sharing among different data analytics jobs of different 217 applications from different users. To this end, a generic 218 representation of resource requirements for different tasks from 219 different analytics applications must be chosen. Furthermore, to 220 ensure maximal deployment, the resource orchestration framework must 221 be independent of and compatible with data analytics applications. 223 4.2. Dynamic Resource Availability 225 In the multi-domain, geo-distributed data analytics network, 226 different member networks belong to different administrative domains. 227 Each member network has its own resource management policies and can 228 choose to use different management software, such as HTCondor and 229 Mesos. 231 Each member network provides different types of resources with 232 different amounts. For example, transit networks such as ESNet and 233 Internet2 provide high-bandwidth networking resources. In contrast, 234 campus science networks provide abundant computation and storage 235 resources, but may provide limited networking bandwidths. And some 236 smaller science networks only provide limited computation and storage 237 resources. The availability of the resources in each member network 238 is subject to the autonomous control of the member network. 240 Furthermore, member networks are interconnected with high bandwidth- 241 delay-product links, where state-of-the-art networking resource 242 allocation mechanisms, such as TCP, become inefficient [XCP]. 244 From the above description, it is observed that the resource 245 availability of the multi-domain, geo-distributed data analytics 246 network is also highly dynamic, subject to the types of member 247 networks, the resources provided by member networks and the resource 248 management policies and management software used by member networks. 250 Though with such dynamism, it is the general consensus of member 251 networks that the resource orchestration for multi-domain, geo- 252 distributed data analytics must achieve high utilization of different 253 types of resources, following the autonomy and privacy of each member 254 network. To this end, a generic representation of resource 255 availabilities for different types of resources must be chosen. Such 256 a representation must be accurate and at the same time maintain the 257 privacy of member networks. Furthermore, to ensure maximal 258 deployment, the resource orchestration framework must be independent 259 of and compatible with the resource management systems used by member 260 networks. 262 5. Design Requirements 264 This section summarizes the design requirements for resource 265 orchestration for multi-domain, geo-distributed data analytics from 266 the previous section. 268 o REQ1: Provide performance predictability for data analytics jobs. 270 o REQ2: Achieve the efficient resource sharing among data analytics 271 jobs. 273 o REQ3: Achieve the high utilization of different types of resources 274 in member networks. 276 o REQ4: Maintain the autonomy and privacy of member networks. 278 o REQ5: Provide compatibility with different data analytics 279 applications and resource management systems to maximize the 280 deployment. 282 6. Review of Resource Orchestration Designs for Data Analytics 284 This section provides an overview of three general types of resource 285 orchestration designs for data analytics -- the centralized resource- 286 graph-based orchestration, the centralized ClassAds-based 287 orchestration and the distributed opportunistic orchestration. Then, 288 the key reason why these designs are inadequate for multi-domain, 289 geo-distributed data analytics is provided. 291 6.1. Centralized resource-graph-based orchestration 293 Systems such as Mesos [Mesos] and Borg [Borg] adopt a graph-based 294 abstraction to represent the resource availability of computing 295 clusters. Each node in the graph is a physical node representing 296 computation or storage resources and each edge between a pair of 297 nodes denotes the networking resource connecting two physical nodes. 298 This design is inadequate for multi-domain, geo-distributed data 299 analytics system because (1) it compromises the privacy of different 300 member networks by revealing all the details of resources; and (2) 301 the overhead to keep the resource availability graph up to date is 302 too expensive due to the heterogeneity and dynamicity of resources 303 from different member networks. 305 6.2. Centralized ClassAds-based orchestration 307 HTCondor [HTCondor] proposes a ClassAds programming model, which 308 allows different resource owners to advertise their resource supply 309 and the job owners to advertise the resource demand. However, this 310 programming model does not support the accurate discovery of 311 networking resources, but leave the orchestration of networking 312 resources completely to TCP, which has been known to behave poorly in 313 networks with high bandwidth-delay products [XCP]. 315 6.3. Distributed opportunistic orchestration 317 Some systems, such as Apollo [Apollo] and Sparrow [Sparrow], use a 318 distributed design. In this design, given a data analytics job, a 319 small number of computing and storage nodes are randomly selected as 320 candidates. Then a scheduling algorithm makes the decision to select 321 the best pair of computing and storage nodes within this small set of 322 candidates. Though it is shown in production that this design 323 achieves a performance very close to the theoretical optimal resource 324 allocation scheme, this design cannot be applied to multi-domain, 325 geo-distributed data analytics because (1) the pool of computing and 326 storage resources is much larger, and is distributed across the 327 world, and (2) it is hard to distributively orchestrate networking 328 resources in such a high bandwidth-delay product scenario. 330 6.4. Inadequacy of Existing Designs for Multi-Domain, Geo-Distributed 331 Data Analytics 333 Applying the designs reviewed in the preceding subsections for multi- 334 domain, geo-distributed data analytics only satisfies the design 335 requirement of compatibility (REQ5), but leaves all the other 336 requirements unfulfilled. The key reason is that they do not have an 337 information model that simultaneously 339 o allows member networks to provide accurate information on 340 different types of resources, e.g., the computing, storage and 341 networking resources, they own; 343 o keeps the private information of member networks, such as physical 344 topologies and policies, from the data analytics applications; and 346 o allows data analytics jobs to accurately describe their 347 requirements of different types of resources. 349 7. Unicorn Design 351 This section presents the design of the Unicorn framework. First, 352 the motivations of using ALTO as the information model of resource 353 orchestration for multi-domain, geo-distributed data analytics are 354 reviewed. Then the architecture of Unicorn is provided. 356 7.1. Choosing ALTO as the Resource Information Model 358 As reviewed in the preceding section, the commonly used resource- 359 graph-based information model and the ClassAds information model do 360 not support the accurate, yet privacy-preserving resource discovery 361 across different member networks. In contrast, the ALTO protocol 362 uses abstract maps of networks to provide network information with 363 the goal of modifying network resource consumption patterns while 364 maintaining or improving application performance [RFC7285]. This 365 document proposes the use of ALTO for providing information of 366 different types of resources, e.g., computing, storage and networking 367 resources. This design has the following advantages: 369 o ALTO provides the network information based on abstract maps of a 370 network. Additional services are built on top of the ALTO 371 abstract maps to provide information of other types of resources, 372 e.g., the computing and storage resources. These maps provide 373 accurate information of different types of resources for the 374 resource orchestration system to effectively utilize them for data 375 analytics applications. For example, the ALTO Endpoint Property 376 Service can provide information of computing nodes and storage 377 nodes. 379 o The ALTO abstract maps provide a simplified view of resources of 380 member networks, instead of the full details of their resource 381 availability. Thus ALTO allows member networks to keep their 382 private information, such as physical topologies and policies, 383 from the applications. For example, the ALTO Network Map service 384 provides a "one-big-switch" view that defines a grouping of 385 network endpoints. This view hides the details of the underlying 386 physical topology of the network and a network deploying the ALTO 387 server has the autonomy to adopt any endpoint grouping algorithm. 389 o ALTO uses a client-server model, in which applications can use 390 ALTO clients to accurately describe their requirements of 391 different types of resources and send these requirements to the 392 ALTO servers to retrieve the accurate information of resources 393 that suit their requirements. For example, the ALTO Multi-Cost 394 service [RFC8189] allows an ALTO client to specify a logic set of 395 tests in a query. Such tests are used by ALTO servers to filter 396 out the information of unqualified resources from the response 397 sent back to the ALTO client. 399 7.2. Architecture of Unicorn 401 This section describes the design details of Unicorn. Figure 1 402 presents the architecture of Unicorn for a multi-domain, geo- 403 distributed data analytics system with N member networks. In 404 particular, Unicorn consists of the following key components: 406 .-------------. .-------------. 407 |Application 1| ... |Application N| 408 '-------------' '-------------' 409 \ / 410 .- - - - - - - -\- - - - - - - - - - - -/- - - - - - - - - - - - - - -. 411 | Unicorn \ / | 412 | .-----------------------. | 413 | | Resource Orchestrator | .----------------------.| 414 | | | |Distributed Hash Table|| 415 | | .-----------. |---- | of Computing and || 416 | | |ALTO Client| | | Storage Resources || 417 | | '-----------' | '----------------------'| 418 | '-----------------------' | 419 | / | \ | 420 | / | \ | 421 | .-------------. .-----------. .-------------. | 422 | |ALTO Server 1| | Execution | |ALTO Server M| | 423 | '-------------' | Agents | '-------------' | 424 | | '-----------' | | 425 | | / \ | | 426 | .----------------./ \ .----------------. | 427 | | Site 1 | . . | Site N | | 428 | '----------------' '----------------' | 429 '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -' 431 Figure 1: Architecture of Unicorn. 433 o ALTO Server: for each member network, one or more ALTO servers are 434 deployed to provide accurate, yet privacy-preserving information 435 of different types of resources owned by the corresponding 436 network. Examples of such information include the link bandwidth 437 between endpoints, the memory I/O bandwidth and the CPU 438 utilization at computing endpoints and the storage space at 439 storage endpoints. In addition to the basic ALTO services defined 440 in [RFC7285], The ALTO servers in Unicorn also provide ALTO 441 extension services such as the ALTO Multi-Cost Service [RFC8189], 442 the ALTO Server-Sent Event Service [DRAFT-SSE] and the ALTO 443 Multipart Cost Property Service [DRAFT-PV] to provide fine-grained 444 resource information. 446 o Distributed Hash Table (DHT) of Computing and Storage Resources: A 447 DHT system is deployed across member networks to lookup the 448 location of computing and storage resources. Compared with the 449 current centralized lookup services in the CMS network, i.e., 450 PhEDEx and HTCondor, a DHT system provides a significant 451 performance improvement for discovering the locations of computing 452 and storage resources in multi-domain, geo-distributed data 453 analytics systems. 455 o Resource Orchestrator: The orchestrator is a shim layer between 456 the data analytics jobs from different applications and the member 457 networks. It contains an ALTO client that communicates with the 458 ALTO servers at member networks to retrieve resource information. 459 Given a set of data analytics jobs, the orchestrator adopts a 460 three-phase discovery process, which will be elaborated in the 461 next section, to find the accurate information of all the 462 resources that can be used to execute these jobs. Then the 463 orchestrator runs a customized resource allocation algorithm to 464 compute the resource allocation decisions for these jobs, and send 465 the decisions to the execution agents at corresponding member 466 networks. 468 o Execution Agent: One or more execution agents are deployed at each 469 member network. They take the resource allocation decisions from 470 the resource orchestrator, and communicate with the underlying 471 resource management system deployed at the corresponding member 472 network to reserve the resources for the data analytics jobs and 473 execute them. 475 7.2.1. Three-Phase Resource Discovery 477 The preceding subsection describes the architecture and the key 478 components of Unicorn. One missing component is how to accurately 479 discover the information of different types of resources for a set of 480 data analytics jobs with the assistance of ALTO. This section 481 presents the three-phase resource discovery design in Unicorn. 483 7.2.1.1. Phase 1: Endpoint Property Discovery 485 Figure 2 shows the procedure of the endpoint property discovery 486 phase. Given a set of data analytics jobs, the resource orchestrator 487 communicates with the DHT lookup system to find the locations, i.e., 488 the endpoint addresses, of all candidate computing and storage 489 resources. With such information, the ALTO client then issues 490 Endpoint Property Service (EPS) queries to the ALTO servers deployed 491 at member networks to discover the information of all candidate 492 endpoints. 494 .-----------------------. 495 | Resource Orchestrator | Jobs' Resource 496 | | Requirements .------------. 497 | .-----------. | ---------------> | DHT Lookup | 498 | |ALTO Client| | <--------------- | System | 499 | '-----------' | Endpoint '------------' 500 '---------| ^-----------' Locations 501 EPS | | Endpoint 502 Queries | | Properties 503 v | 504 .-------------. 505 |ALTO Servers | 506 '-------------' 508 Figure 2: The Endpoint Property Discovery Phase. 510 7.2.1.2. Phase 2: Endpoint Path Discovery 512 Candidate computing and storage endpoints need to move data between 513 them before, during and after the execution of a data analytics job. 514 In multi-domain, geo-distributed data analytics, a pair of candidate 515 endpoints may not be in the same member network. In this case, the 516 orchestrator needs to find out the connectivity information between 517 such a pair of candidate endpoints. 519 Figure 3 shows the procedure of the endpoint path discovery phase. 520 Given a pair of candidate endpoints that are not in the same member 521 network, the ALTO client in the orchestrator adopts an iterative 522 process to find the interdomain connectivity information for this 523 pair. It starts by issuing an ALTO Endpoint Cost Service query or an 524 ALTO Flow-based Endpoint Cost Service [DRAFT-FCS] to the ALTO server 525 of the member network where the source endpoint locates. The cost 526 type of this query is a customized type called next-hop, with a 527 customized cost mode tuple and a customized cost metric next-network. 529 .-----------------------. 530 | Resource Orchestrator | 531 | | 532 | .-----------. | 533 | |ALTO Client| | 534 | '-----------' | 535 '---------| ^-----------' 536 Customized| | Endpoint 537 ECS | | Path 538 Queries | | Segments 539 v | 540 .-------------. 541 |ALTO Servers | 542 '-------------' 544 Figure 3: The Endpoint Path Discovery Phase. 546 The ALTO server returns a 2-tuple, where the first element is the 547 autonomous number (AS) of the next member network along the AS-path 548 from the source endpoint to the destination endpoint, and the second 549 element is the ingress of this next member network. In a member 550 network, the ALTO server can get such information from the underlying 551 interdomain routing protocol, e.g., BGP. Based on the received 552 response, the ALTO client then issues a similar query to the ALTO 553 server of the next member network. The process stops when the ALTO 554 server of the member network where the destination endpoint locates 555 receives such a query, who will return a null 2-tuple in response to 556 notify the ALTO client. By the end of this process, the ALTO client 557 can assemble a domain-path, in the form of a path vector of (ingress, 558 AS), of this pair of candidate endpoints. 560 7.2.1.3. Phase 3: Resource State Abstraction Discovery 562 After the second phase, the resource orchestrator has the 563 connectivity information of each candidate endpoint pair, i.e., the 564 domain-path. Equivalently, for each member network, it knows the set 565 of all candidate endpoint pairs that will enter this network. With 566 this information, the resource orchestrator can communicate with the 567 ALTO servers at member networks to discover the resource sharing 568 between all the candidate endpoint pairs. In particular, Unicorn 569 extends the routing state abstraction [DRAFT-RSA] to the more generic 570 resource state abstraction to represent such resource sharing. 572 Figure 4 shows the procedure of the resource state abstraction 573 discovery phase. For each member network, the ALTO client in the 574 orchestrator sends an ALTO Multipart Cost Property Service query 575 defined in [DRAFT-PV] by providing the set of candidate endpoint 576 pairs as input. The cost type of this query is path vector. Upon 577 receiving the query, the ALTO server in each member network computes 578 an ALTO cost map and an ATLO property map to the ALTO client. These 579 two maps represent a set of linear inequalities revealing the 580 resource sharing among the set of candidate endpoint pairs in the 581 member network. 583 .-----------------------. 584 | Resource Orchestrator | 585 | | 586 | .-----------. | 587 | |ALTO Client| | 588 | '-----------' | 589 '---------| ^-----------' 590 Multipart| | Resource 591 Cost | | State 592 Property | | Abstraction 593 Queries v | 594 .-------------. 595 |ALTO Servers | 596 '-------------' 598 Figure 4: The Resource State Abstraction Discovery Phase. 600 Unicorn provides two mechanisms for the ALTO servers to return the 601 computed cost maps and property maps to the ALTO client. The first 602 mechanism is to let each ALTO server independently sends its response 603 to the ALTO client. The second mechanism is a privacy-preserving 604 interdomain information aggregation process, in which the ALTO 605 servers in all member networks use a secure multi-party computation 606 (SMPC) protocol to collectively send the responses to the ALTO client 607 without revealing the source of any entry, i.e., the linear in 608 equality, in the cost maps and property maps. 610 The first mechanism has a higher security risk in that it exposes the 611 bottleneck resource information of each member network. In contrast, 612 the second mechanism provides a better protection of the private 613 information of each member network. The details of the privacy- 614 preserving interdomain information aggregation process will be 615 presented in the next section. 617 After receiving the responses sent back from the ALTO servers from 618 all the member networks, the orchestrator finishes the whole resource 619 discovery process and collects the accurate information of different 620 types of resources for data analytics jobs. 622 7.2.2. Proactive Full-Mesh Resource Discovery 624 To ensure the resource discovery process scales, a proactive full- 625 mesh resource discovery component is developed. The main idea of 626 this component consists in having the ALTO client periodically query 627 ALTO servers at all sites to discover the resource state abstraction 628 between every pair of source and destination sites. As such, when an 629 application submits a resource discovery request, the ALTO client 630 does not need to send any query to the ALTO servers. Instead, using 631 the site-level bandwidth sharing information, the ALTO client can 632 immediately perform projection operations to get the resource 633 information for the request. This mechanism substantially improves 634 the scalability of Unicorn. 636 7.3. Example 638 This subsection gives an example to illustrate the workflow of 639 Unicorn. Figure 5 gives a topology of three member networks, where 640 s1 and s2 are storage endpoints and d1 and d2 are computation 641 endpoints. Assume a data analytics job is composed of two parallel 642 tasks T1 and T2. T1 needs dataset X as input and T2 needs dataset Y 643 as input. 645 .------------. 646 | Network B | 647 .------------. ingB| | 648 | Network A |--------| d1 | 649 | | '------------' 650 | s1 | 651 | | .------------. 652 | s2 |--------| Network C | 653 '------------' ingC| | 654 | d2 | 655 '------------' 657 Figure 5: An Illustrating Example for Unicorn. 659 In the endpoint property discovery phase, the Unicorn resource 660 orchestrator finds that s1 stores X and s2 stores Y, and that the 661 locations of s1, s2, d1 and d2, from the DHT lookup system. It then 662 issues EPS queries to network A, B and C, respectively, to discover 663 that d1 satisfies the computing requirements of T1 and d2 satisfies 664 the computing requirements of T2. Hence there are only two candidate 665 endpoint pairs: (s1, d1) and (s2, d2). 667 In the endpoint path discovery phase, the ALTO client in the 668 orchestrator iteratively issues Endpoint Cost Service (ECS) query to 669 the ALTO servers in member networks, and finds that the domain-path 670 for pair (s1, d2) is [(null, A), (ingB, B)] and the domain-path for 671 pair (s2, d2) is [(null, A), (ingB, B)]. Hence both pairs will use 672 the networking resources of network A, while only (s1, d1) will use 673 network B and only (s2, d2) will use network C. 675 In the resource state abstraction discovery phase, the ALTO client in 676 the orchestrator issues Multipart Cost Property Service queries to 677 network A, B and C, respectively. Denote the available bandwidth 678 that can be assigned to T1 as x1 and that to T2 as x2. Assume the 679 linear inequalities computed by the three networks are: 681 A: x1 + x2 <= 10Mbps 682 B: x1 <= 3Mbps 683 C: x2 <= 3Mbps 685 If the ALTO servers use the first mechanism to directly return their 686 resource information to ALTO client, respectively, each of them will 687 send a cost map and a property map response encoding its own linear 688 inequality to the ALTO client. In this way, the orchestrator gets 689 the accurate information about networking resource sharing between 690 (s1, d1) and (s2, d2). It then can invoke a resource allocation 691 algorithm to allocate the resources to tasks T1 and T2. For example, 692 if the goal is to maximize the minimal bandwidth of two tasks, the 693 allocation decision will be to assign endpoints s1 and d1 to T1, with 694 a bandwidth of 3Mbps, and assign endpoints s2 and d2 to T2, with a 695 bandwidth of 3Mbps as well. 697 8. ALTO Extension: Privacy-Preserving Interdomain Information 698 Aggregation for Resource Discovery 700 This section describes a customized ALTO extension in Unicorn that 701 supports the privacy-preserving discovery of networking resource 702 sharing among a set of candidate endpoint pairs. 704 8.1. Extension Specification 706 Figure 6 presents the workflow of the proposed ALTO extension. 707 Assume a set of N member networks denoted as AS_1, AS_2, ... AS_N 708 and the number of all candidate endpoint pairs is F. The interdomain 709 information aggregation process works as follows: 711 .-----------------------. 712 | Resource Orchestrator | 713 | | 714 | .-----------. | 715 | |ALTO Client| | 716 | '-----------' | 717 >'-----------------------'< 718 /Disguised /|\ \ 719 Multipart Cost / Response | \ 720 Property Queries v | \ 721 .--------. .--------. .--------. 722 |ALTO | |ALTO | |ALTO | 723 |Server 1| <------->|Server 2|<--... -->|Server N| 724 '--------' Limitedly'--------' '--------' 725 Shared Secret 727 Figure 6: The Privacy-Preserving Interdomain Resource Information 728 Aggregation. 730 o Step 1: The ALTO client sends the Multipart Cost Property Service 731 request to and a homomorphic public key k_p to each member 732 network. 734 o Step 2: The ALTO server of each network AS_i computes its own set 735 of linear inequalities A_i x <= b_i. Denote the size of this set 736 as m_i. 738 o Step 3: The ALTO server of each network AS_i introduces m_i non- 739 negative slack variables to transform its set of linear 740 inequalities into a set of linear equations. 742 o Step 4: The ALTO servers of all member networks use a private 743 matrix SMPC summation protocol to collectively compute k=m_1 + m_2 744 + ... + m_N + 1. The value k is known to all the member networks. 746 o Step 5: The ALTO servers of each network AS_i selects a random k- 747 by-m_i matrix P_i, and computes the matrix P_iA_i and P_ib_i. 749 o Step 6: The ALTO server of each network then uses a few matrices, 750 which are only shared with a couple of other networks, to further 751 obfuscate P_iA_i and P_ib_i, and sends the obfuscated matrices to 752 the ALTO client via symmetric encryption. 754 o Step 7: the ALTO client decrypts the received responses from all 755 ALTO servers, and sums up the decrypted response to get a set of 756 linear equations sum P_iA_i x = sum P_ib_i. 758 This process ensures that the networking resource capacity region 759 derived from sum P_iA_i x = sum P_ib_i is the same as that derived 760 from A_1 x <= b_1, A_2 x <= b_2, ... A_N x <= b_N. More importantly, 761 the ALTO client has no knowledge on the information of network 762 resource sharing of a single member network. 764 8.2. Example 766 This subsection uses the same example in Figure 5 to illustrate the 767 privacy-preserving information aggregation process. The set of 768 linear inequalities computed by each network is as follows: 770 A: x1 + x2 <= 10 771 B: x1 <= 3 772 C: x2 <= 3 774 Then the networks collectively compute k=1+1+1+1=4. And then 775 introduces slack variables to transform the linear inequalities into 776 linear equations: 778 A: x1 + x2 + x3 <= 10 779 B: x1 + x4 <= 3 780 C: x2 + x5 <= 3 782 For each network, the random matrix it chooses as follows: 784 P_A: [11, 49, 95, 34] 785 P_B: [58, 22, 75, 25] 786 P_C: [50, 69, 89, 95] 788 After the obfuscating process in Step 5 and Step 6 in the previous 789 subsection, the decrypted set of linear equations the ALTO client 790 gets is 792 69 x1 + 61 x2 + 11 x3 + 58 x4_ + 50 x5 = 434 793 71 x1 + 118 x2 + 49 x3 + 22 x4_ + 69 x5 = 763 794 170 x1 + 184 x2 + 95 x3 + 75 x4_ + 89 x5 = 1442 795 59 x1 + 129 x2 + 34 x3 + 25 x4_ + 95 x5 = 700 797 Assume the goal is still to maximize the minimal bandwidth of two 798 tasks, the allocation decision made using this set of linear 799 equations will still be x1=3 and x2=3, i.e., assigning endpoints s1 800 and d1 to T1, with a bandwidth of 3 and assigning endpoints s2 and d2 801 to T2, with a bandwidth of 3 as well. 803 9. Implementation and Demonstration Experience 805 The authors build an ALTO server on top of the OpenDaylight Software 806 Defined Network controller. The ALTO server collects the network 807 state information from the OpenDaylight controller, e.g., topology, 808 policy and traffic statistics, processes the collected information 809 into resource abstraction, and sends the abstraction back to the ALTO 810 client at the resource orchestrator. 812 In the past few years, the Unicorn framework has been deployed and 813 demonstrated in small federation networks connecting Dallas, Texas, 814 Los Angles, California, and Denver, Colorado at different 815 conventions. For example, in 2018, the federation in the 816 demonstration is composed of three member networks. Network 1 is in 817 Dallas, Texas, and Network 2 and network 3 are in Los Angeles, 818 California. Network 1 is connected to network 2 through a layer-2 819 WAN circuit with a 100 Gbps bandwidth, provisioned by several 820 providers such as SCinet, CenturyLink and CENIC. Network 1 is a 821 temporal science network in the CMS experiment, while network 2 and 3 822 are long-running CMS Tier-2 sites. In this federation, users need to 823 reserve network resources to transfer large-scale science datasets 824 (e.g., with a size of hundreds of PB) between networks. 826 The authors evaluate the accuracy and latency of the framework for 827 discovering network resources in this network. During the 828 evaluation, the framework accurately discovers the network resource 829 information for a large amount of circuits reservation requests with 830 a very low discovery latency. Specifically, for all the reservation 831 requests, Unicorn always provides the accurate information of 832 available bandwidth sharing in the network (i.e., a 100% accuracy), 833 with an average discovery latency of 100 milliseconds, and a worst 834 latency of less than 1 second. With the discovered network resource 835 information, users can transmit large-scale science datasets at a 836 speed up to 100 Gbps, (i.e., the theoretical maximal throughput). 838 10. Discussion 840 10.1. Discovering the Domain-Paths Using a New Interdomain Routing 841 Protocol 843 The current design of the endpoint path discovery process in Unicorn 844 assumes that the underlying interdomain routing protocol is the 845 standard BGP, which only provides the path vector of ASes instead of 846 the path vector of (ingress, AS) tuples needed by Unicorn. If a 847 multi-domain, geo-distributed data analytics system uses an 848 interdomain routing protocol that provides the path vector of 849 (ingress, AS) pairs, the endpoint path discovery process in Unicorn 850 can be simplified to only send queries to the ALTO server of the 851 network where the source candidate endpoint locates. 853 10.2. Comparison of the Efficiency to Achieve Optimal Resource 854 Orchestration using ALTO and without using ALTO 856 The authors of this draft conduct a systematic investigation to 857 understand the efficiency differences to achieve optimal resource 858 orchestration between using ALTO and without using ALTO. 859 Specifically, the authors study the problem of computing the optimal 860 resource reservation without using ALTO, but only using the simple 861 reservation interface deployed in PCE-based network resource 862 reservation systems (e.g., OSCARS). Given a client's request, a 863 novel algorithm is developed to compute the optimal resource 864 reservation within O(n^3) times of queries on the simple reservation 865 interface, where n is the number of flows in the request. This is 866 the best known algorithm for this problem. Although the asymptotic 867 complexity appears efficient, the authors' experience to test this 868 design shows that it often takes tens of thousand queries on the 869 simple reservation interface to compute the optimal resource 870 orchestration, leading to substantial orchestration latencies. This 871 result has been published at AAAI 2019. 873 In contrast, by leveraging the capability of ALTO to expose accurate 874 network information to the client, the Unicorn framework can compute 875 the optimal resource reservation by querying an ALTO server only 876 once, resulting in orders of magnitude improvement on resource 877 orchestration latencies. 879 10.3. Current Ongoing Work: Unified Resource Representation as an ALTO 880 extension 882 The authors of this draft are currently designing an ALTO extension 883 which uses generic mathematic programming constraints as a unified 884 resource representation of heterogeneous resources in the network. 885 In particular, a SQL-style language is designed for an ALTO client to 886 express its requirement on available resources. An SMT-based 887 compiler is developed for the network, which translates a user's SQL 888 resource query into a constraint programming model, finds feasible 889 resources in the network, and returns such resource information to 890 user in the compact representation expressed as generic mathematic 891 programming constraints. More details about this extension are 892 reported in a different draft titled "ALTO Extension: Unified 893 Resource Representation". A paper providing preliminary evaluation 894 results of this extension has been accepted by ACM SIGCOMM NAI 2020 895 Workshop. 897 11. Security Considerations 899 This document does not introduce any privacy or security issue not 900 already present in the ALTO protocol. 902 12. IANA Considerations 904 This document does not define any new media type or introduce any new 905 IANA consideration. 907 13. References 909 13.1. Normative References 911 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 912 Requirement Levels", BCP 14, RFC 2119, 913 DOI 10.17487/RFC2119, March 1997, 914 . 916 13.2. Informative References 918 [Apollo] Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J., 919 Qian, Z., Wu, M., and L. Zhou, "Apollo: Scalable and 920 Coordinated Scheduling for Cloud-Scale Computing", 2014, 921 . 924 [Borg] Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., 925 Tune, E., and J. Wilkes, "Large-scale cluster management 926 at Google with Borg", 2015, 927 . 929 [DRAFT-FCS] 930 Zhang, J., Gao, K., Wang, J., Xiang, Q., and Y. Yang, 931 "ALTO Extension: Flow-based Cost Query", 2017, 932 . 934 [DRAFT-PV] 935 Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y. 936 Yang, "ALTO Extension: Abstract Path Vector as a Cost 937 Mode", 2015, . 940 [DRAFT-RSA] 941 Gao, K., Wang, X., Xiang, Q., Gu, C., Yang, Y., and G. 942 Chen, "A Recommendation for Compressing ALTO Path 943 Vectors", 2017, . 946 [DRAFT-SSE] 947 Roome, W. and Y. Yang, "ALTO Incremental Updates Using 948 Server-Sent Events (SSE)", 2015, 949 . 952 [HTCondor] 953 Thain, D., Tannenbaum, T., and M. Livny, "Distributed 954 computing in practice: the Condor experience", 2005, 955 . 957 [Mesos] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., 958 Joseph, A., Katz, R., Shenker, S., and I. Stoica, "Mesos: 959 A Platform for Fine-Grained Resource Sharing in the Data 960 Center", 2011, 961 . 964 [RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S., 965 Previdi, S., Roome, W., Shalunov, S., and R. Woundy, 966 "Application-Layer Traffic Optimization (ALTO) Protocol", 967 RFC 7285, DOI 10.17487/RFC7285, September 2014, 968 . 970 [RFC8189] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost 971 Application-Layer Traffic Optimization (ALTO)", RFC 8189, 972 DOI 10.17487/RFC8189, October 2017, 973 . 975 [Sparrow] Ousterhout, K., Wendell, P., Zaharia, M., and I. Stoica, 976 "Sparrow: Distributed, Low Latency Scheduling", 2013, 977 . 979 [XCP] Katabi, D., Handley, M., and C. Rohrs, "Internet 980 Congestion Control for Future High Bandwidth-Delay Product 981 Environments", 2002, 982 . 985 Authors' Addresses 987 Qiao Xiang 988 Yale University 989 51 Prospect Street 990 New Haven, CT 991 USA 993 Email: qiao.xiang@cs.yale.edu 994 J. Jensen Zhang 995 Tongji/Yale University 996 51 Prospect Street 997 New Haven, CT 998 USA 1000 Email: jingxuan.zhang@yale.edu 1002 Franck Le 1003 IBM 1004 Thomas J. Watson Research Center 1005 Yorktown Heights, NY 1006 USA 1008 Email: fle@us.ibm.com 1010 Y. Richard Yang 1011 Yale University 1012 51 Prospect Street 1013 New Haven, CT 1014 USA 1016 Email: yry@cs.yale.edu 1018 Harvey Newman 1019 California Institute of Technology 1020 1200 California Blvd. 1021 Pasadena, CA 1022 USA 1024 Email: newman@hep.caltech.edu