ALTO WG                                                         Q. Xiang
Internet-Draft                                    Tongji/Yale University
Intended status: Informational                                 H. Newman
Expires: January 4, 2018               California Institute of Technology
                                                             G. Bernstein
                                                         Grotto Networking
                                                                    H. Du
                                                     Tongji/Yale University
                                                                   K. Gao
                                                       Tsinghua University
                                                                 A. Mughal
                                                                 J. Balcas
                                        California Institute of Technology
                                                                  J. Zhang
                                                         Tongji University
                                                                   Y. Yang
                                                     Tongji/Yale University
                                                             July 3, 2017


         Resource Orchestration for Multi-Domain Data Analytics
          draft-xiang-alto-exascale-network-optimization-03.txt

Abstract

   Data-intensive analytics is entering the era of multi-domain,
   geographically distributed, collaborative computing, where different
   organizations contribute various resources to collaboratively
   collect, share and analyze extremely large amounts of data.
   Examples of this paradigm include the Compact Muon Solenoid (CMS)
   and A Toroidal LHC ApparatuS (ATLAS) experiments of the Large
   Hadron Collider (LHC) program.  Massive datasets continue to be
   acquired, simulated, processed and analyzed by globally distributed
   science networks in these collaborations.  Applications that manage
   and analyze such massive data volumes can benefit substantially
   from information about the networking, computing and storage
   resources at each member's site, and more directly from network-
   resident services that optimize and load-balance resource usage
   among multiple data transfers and analytics requests, achieving
   better utilization of multiple resources in clusters.

   The Application-Layer Traffic Optimization (ALTO) protocol can
   provide, via extensions, network information about different
   clusters/sites to both users and proactive network management
   services where applicable, with the goal of improving both
   application performance and network resource utilization.  In this
   document, we propose that it is feasible to use existing ALTO
   services to provide not only network information, but also
   information about computation and storage resources in data
   analytics networks.
   We introduce a unified resource orchestration framework (Unicorn),
   which achieves efficient multi-resource allocation to support low-
   latency dataset transfers and data-intensive analytics for
   collaborative computing.  It collects cluster information from
   multiple ALTO services utilizing topology extensions, and leverages
   emerging SDN control capabilities to orchestrate the resource
   allocation for dataset transfers and analytics tasks, leading to
   improved transfer and analytics latency as well as more efficient
   utilization of multiple resources at each site.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on January 4, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Changes Since Version -02
   4.  Problem Settings
     4.1.  Motivation
     4.2.  Challenges
   5.  Basic Idea
     5.1.  Use ALTO services to provide multi-resource information
     5.2.  Example 1: network information impacts data analytics
           scheduling
     5.3.  Example 2: encode multi-resources in abstract network
           elements
   6.  Key Issues
   7.  Unified Resource Orchestration Framework
     7.1.  Architecture
     7.2.  Workflow Converter
       7.2.1.  User API
     7.3.  Resource Demand Estimator
     7.4.  Entity Locator
     7.5.  ALTO Client
       7.5.1.  Query Mode
     7.6.  ALTO Server
     7.7.  Resource View Extractor
     7.8.  Execution Agents
     7.9.  Multi-Resource Orchestrator
       7.9.1.  Orchestration Algorithms
       7.9.2.  Online, Dynamic Orchestration
       7.9.3.  Example: A Max-Min Fairness Resource Allocation
               Algorithm
   8.  Discussion
     8.1.  Deployment
     8.2.  Benefiting From ALTO Extension Services
     8.3.  Limitations of the MFRA Algorithm
   9.  Security Considerations
   10. IANA Considerations
   11. Acknowledgments
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   As the data volume increases exponentially over time, data-
   intensive analytics is transitioning from single-domain computing
   to multi-organizational, geographically distributed, collaborative
   computing, where different organizations contribute various
   resources, e.g., computation, storage and networking resources, to
   collaboratively collect, share and analyze extremely large amounts
   of data.  One leading example is the Large Hadron Collider (LHC)
   high energy physics (HEP) program, which aims to find new particles
   and interactions in a previously inaccessible range of energies.
   The scientific collaborations that have built and operate large HEP
   experimental facilities at the LHC, such as the Compact Muon
   Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), currently have
   more than 300 petabytes of data under management at hundreds of
   sites around the world, and this volume is expected to grow to one
   exabyte by approximately 2018.

   With such an increasing data volume, how to manage the storage and
   analysis of these data in a globally distributed infrastructure has
   become an increasingly challenging issue.  Applications such as the
   Production ANd Distributed Analysis system (PanDA) in ATLAS and the
   Physics Experiment Data Export system (PhEDEx) in CMS have been
   developed to manage the data transfers among different cluster
   sites.  Given a data transfer request, these applications make data
   transfer decisions based on the availability of dataset replicas at
   different sites and initiate retransmission from a different
   replica if the original transmission fails or is excessively
   delayed.  HTCondor [HTCondor] is deployed to achieve coarse-grained
   data analytics parallelization across these sites.  When a data
   analytics task is submitted, HTCondor adopts a match-making process
   to assign the task to a certain set of servers in one site, based
   on a coarse-grained description of resource availability, such as
   the number of cores, the size of memory, the size of hard disk,
   etc.
   However, neither dataset transfers nor data analytics task
   parallelization takes fine-grained information about cluster
   resources, such as data locality, memory speed, network delay,
   network bandwidth, etc., into account, leading to high data
   transfer and analytics latency and underutilization of cluster
   resources.

   The Application-Layer Traffic Optimization (ALTO) services defined
   in [RFC7285] provide network information with the goal of improving
   network resource utilization while maintaining or improving
   application performance.  Though ALTO is not designed to provide
   information about other resources, such as the computing and
   storage resources in cluster networks, in this document we propose
   that exascale science networks can leverage the existing ALTO
   services defined in [RFC7285] and the ALTO topology extension
   services defined in network graph [DRAFT-NETGRAPH], path vector
   [DRAFT-PV], routing state abstraction [DRAFT-RSA], multi-cost
   [DRAFT-MC], cost calendar [DRAFT-CC], etc., to encode information
   about multiple types of resources in science networks, such as
   memory I/O speed, CPU utilization and network bandwidth, and
   provide such information to orchestration applications to improve
   the performance, including throughput and latency, of dataset
   transfer and data analytics tasks.

   This document introduces a unified resource orchestration framework
   (Unicorn), which provides efficient multi-resource allocation to
   support low-latency, multi-domain, geo-distributed data analytics.
   Unicorn provides a set of simple APIs for authorized users to
   submit, update and delete dataset transfer requests and data-
   intensive analytics requests.  One important proposal we make in
   this document is that it is feasible to use ALTO services to
   provide not only network information, but also information about
   other resources, including computing and storage, in multi-domain,
   geo-distributed analytics networks.

   A prototype of Unicorn with the dataset transfer scheduling
   component has been implemented on a single-domain Caltech SDN
   development testbed, where the ALTO OpenDaylight controller is used
   to collect topology information.  We are currently designing the
   resource orchestration components to achieve low-latency data-
   intensive analytics.

   This document is organized as follows: Section 3 summarizes the
   changes in this document since version -02.  Section 4 elaborates
   on the motivation and challenges of coordinating storage, computing
   and network resources in a globally distributed science network
   infrastructure.  Section 5 discusses the basic idea of encoding
   multi-resource information into the ALTO path vector and
   abstraction services and gives examples.  Section 6 lists several
   key issues to address in order to realize the proposal of providing
   multi-resource information through ALTO topology services.
   Section 7 gives the details of the Unicorn architecture for multi-
   domain, geo-distributed data analytics.  Section 8 discusses the
   current development progress of Unicorn and next steps.

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].
3.  Changes Since Version -02

   o  Add an example in Section 5 to show the importance of network
      information in resource allocation for data analytics.

   o  Update the architecture of Unicorn in Section 7, i.e., add the
      entity locator and rename the ANE aggregator to the resource
      view extractor.

   o  Add a detailed description of how the entity locator and the
      resource view extractor work.

   o  Minor changes in the abstract and discussion sections.

4.  Problem Settings

4.1.  Motivation

   Multi-domain, geo-distributed data analytics usually involves
   participating sites from countries all over the world.  Science
   programs such as the CMS and ATLAS experiments at CERN are typical
   examples.  The site located at the LHC laboratory is called the
   Tier-0 site.  It processes the data selected and recorded in real
   time by the online systems as the data comes off the particle
   detector, archives it, and transfers it to more than 10 Tier-1
   sites around the globe.  Raw datasets and processed datasets from
   Tier-1 sites are then transferred to over 160 Tier-2 sites around
   the world based on users' requests.  Different sites have different
   resources and belong to different administrative domains.  With the
   exponentially increasing data volume in the CMS experiment, the
   management of large data transfers and data-intensive analytics in
   such a global multi-domain science network has become an
   increasingly challenging issue.  Allocating resources in different
   clusters to fulfill different users' dataset transfer and data
   analytics requests requires careful orchestration, as different
   requests compete for limited storage, computation and network
   resources.

4.2.  Challenges

   Orchestrating exascale dataset transfers and analytics in a
   globally distributed science network is non-trivial, as it needs to
   cope with two challenges.

   o  Different sites in this network belong to different
      administrative domains, and sharing raw site/cluster information
      would violate sites' privacy constraints.  At the same time,
      orchestrating data transfer and analytics requests based on
      highly abstracted, non-real-time network information may lead to
      suboptimal scheduling decisions.  Hence the orchestration
      framework must be able to collect sufficient resource
      information about different clusters/sites, in real time as well
      as over the longer term, to allow reasonably optimized network
      resource utilization without violating sites' privacy
      requirements.

   o  Different science programs tend to adopt different software
      infrastructures for managing dataset transfers and analytics,
      and may impose different requirements.  Hence the orchestration
      framework must be modular, so that it can support different
      dataset management systems and different orchestration
      algorithms.

   The orchestration framework must support the interaction between
   the multi-resource orchestration module, the dataset transfer
   module and the data analytics execution module.  The key
   information to be exchanged between modules includes dataset
   information, the resource state of different clusters and sites,
   the transfer and analytics requests in progress, as well as trends
   and network-segment and site performance from the network point of
   view.
   Such interaction ensures that (1) the various programs can adapt
   their own data transfer and analytics systems to be multi-resource-
   aware, and more efficient in achieving their goals; and (2) the
   various orchestration algorithms can achieve reasonably optimized
   utilization of not only network resources but also computing and
   storage resources.

5.  Basic Idea

5.1.  Use ALTO services to provide multi-resource information

   The ALTO protocol is designed to provide network information to
   applications, so that applications can achieve better performance
   and the network can achieve more efficient resource utilization.
   Different ALTO topology services, including path vector, routing
   state abstraction, multi-cost and cost calendar, have been proposed
   to provide fine-grained network information to applications.  In
   this document, we propose that not only can ALTO provide network
   information about different cluster sites, it can also provide
   information about multiple resources, including computing and
   storage resources.  To this end, the basic "one-big-switch"
   abstraction provided by the base ALTO protocol is not sufficient;
   several examples demonstrating this have already been given in
   [DRAFT-PV] and [DRAFT-RSA].  There has been a similar earlier
   proposal to use ALTO to provide resource information for data
   centers [DRAFT-DC].  However, that proposal requires a new
   information model for clusters or data centers, which may affect
   the compatibility of ALTO.  The solution in this document is
   simpler.  Its basic idea is that each computing node and storage
   node can be seen as an "abstract network element" as defined in the
   ALTO path-vector extension [DRAFT-PV].  In this way, Unicorn can
   fully reuse all existing ALTO services by introducing only one cost
   mode (pv) and two cost metrics (ne and ane), instead of introducing
   a new information model.

5.2.  Example 1: network information impacts data analytics scheduling

          .-----------.                .-----------.
          |  .-----.  |                |  .-----.  |
          |  | eh1 |  |                |  | eh3 |  |
          |  '-----'  |                |  '-----'  |
          |     |     |                |     |     |
          |     |     |                |     |     |
          |  .-----.  |                |  .-----.  |
          |  | eh2 |  |                |  | eh4 |  |
          |  '-----'  |                |  '-----'  |
          '-----------'                '-----------'
             Site 1                       Site 2
      distance(eh1, eh2) = 2      distance(eh3, eh4) = 2

          Figure 1: An Example of Data Locality Information

   We first use the example in Figure 1 to show that network
   information alone has a huge impact on resource allocation for data
   analytics.  In this scenario, a MapReduce task needs to be
   executed.  The input data has two copies, at end hosts eh1 and eh3,
   respectively, and end hosts eh2 and eh4 are the corresponding
   candidate computation nodes, with the same computation power.  From
   the view of a common data analytics resource management framework
   such as Hadoop, both allocation options, i.e., eh1->eh2 and
   eh3->eh4, have a storage-computation distance of 2, i.e., they have
   the same data locality.  As a result, it appears that both options
   would provide the same performance for this task.

   However, assume that the bandwidth between eh1 and eh2 is 100 Mb/s
   while that between eh3 and eh4 is 1 Gb/s.  These significantly
   different data access speeds mean that the two options yield very
   different performance for the same task, and the only optimal
   option is to allocate the task to eh3->eh4.  This example
   demonstrates that network information is imperative for making
   efficient resource allocation decisions.  Such information is not
   provided by Hadoop or by similar and related projects such as
   Mesos.  In contrast, if ALTO servers are deployed at these sites,
   applications can retrieve such information via ALTO queries.
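   The gap is easy to quantify.  The following back-of-the-envelope
   sketch (in Python; the 100 GB input size is a hypothetical figure
   of ours, not taken from any experiment) computes the time to stage
   the input over each candidate path:

      # Time to stage a hypothetical 100 GB input over each
      # candidate path of Figure 1 (rates in Gb/s).
      size_gbit = 100 * 8          # 100 GB = 800 gigabits
      print(size_gbit / 0.1)       # eh1 -> eh2 at 100 Mb/s: 8000 s
      print(size_gbit / 1.0)       # eh3 -> eh4 at 1 Gb/s:    800 s

   A tenfold difference in staging time, invisible to a scheduler
   that considers data locality alone.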
5.3.  Example 2: encode multi-resources in abstract network elements

   We use the same dumbbell topology as in [DRAFT-RSA] to show the
   feasibility of using ALTO topology services to provide multi-
   resource information.  In this topology, we assume the bandwidth of
   each network cable is 1 Gbps, including the cables connecting end
   hosts to switches.  Consider a dataset transfer request that needs
   to schedule traffic among a set of end-host source-destination
   pairs, say eh1 -> eh2 and eh3 -> eh4.  Assume that the transfer
   application learns from the ALTO Cost Map service that both
   eh1 -> eh2 and eh3 -> eh4 have bandwidth 1 Gbps.  As shown in
   [DRAFT-RSA], whether each of the two traffic flows can actually
   receive 1 Gbps depends on whether the routes of the two flows share
   a bottleneck link.  The path vector and routing state abstraction
   services provide additional information about network state,
   encoded in abstract network elements: if the returned state is
   ane1 + ane2 <= 1 Gbps, the two flows cannot each get 1 Gbps at the
   same time; if the returned state is ane1 <= 1 Gbps and
   ane2 <= 1 Gbps, each flow can get 1 Gbps.

                                +------+
                                |      |
                              --+ sw6  +--
                             /  |      |  \
      PID1  +-----+         /   +------+   \         +-----+  PID2
      eh1___|     |_       /                 \      _|     |___eh2
            | sw1 | \   +------+          +------+ / | sw2 |
            +-----+  \__|      |          |      |/  +-----+
                      __| sw5  +----------+ sw7  |__
      PID3  +-----+  /  |      |          |      |  \ +-----+  PID4
      eh3___|     |_/   +------+          +------+   \|     |___eh4
            | sw3 |                                   | sw4 |
            +-----+                                   +-----+

                 Figure 2: A Dumbbell Network Topology

   Beyond network resources, assume that in this topology eh1 and eh3
   are equipped with commodity hard disk drives (HDDs) while eh2 and
   eh4 are equipped with SSDs, and that the bandwidth of an HDD is
   typically 0.8 Gbps while that of an SSD is typically 3 Gbps.  Even
   if the returned routing state is ane1 <= 1 Gbps and ane2 <= 1 Gbps,
   the actual bottleneck of each traffic flow is the storage I/O
   bandwidth at its source host.  As a result, the total bandwidth of
   the two traffic flows can only reach 1.6 Gbps.

   It has been verified in the CMS experiment, as well as in several
   studies of commercial data centers, that network resources are not
   always the bottleneck of large dataset transfers and data
   analytics.  Many have reported that storage and computing resources
   become the bottleneck in a fairly large fraction of dataset
   transfers and data analytics tasks in science networks and
   commercial data centers.

   In this example, if we treat the end hosts as abstract network
   elements, the storage I/O bandwidth of each host can also be
   encoded as an abstract element in the path vector.  Under the
   storage and route settings above, the returned cluster state would
   be ane1 <= 0.8 Gbps and ane2 <= 0.8 Gbps, which provides a more
   accurate capacity region for the requested traffic flows.
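   To make the capacity-region interpretation concrete, the following
   sketch (Python with numpy and scipy assumed available; the
   constraint matrix is hand-written from this example, not produced
   by any ALTO library) maximizes the total throughput of the two
   flows under the returned state ane1 <= 0.8 Gbps and
   ane2 <= 0.8 Gbps:

      import numpy as np
      from scipy.optimize import linprog

      # Flows: x[0] = eh1 -> eh2, x[1] = eh3 -> eh4 (rates in Gbps).
      A = np.array([[1.0, 0.0],    # flow 1 crosses ane1 (eh1's HDD)
                    [0.0, 1.0]])   # flow 2 crosses ane2 (eh3's HDD)
      b = np.array([0.8, 0.8])     # returned ANE capacities

      # linprog minimizes, so negate to maximize x[0] + x[1].
      res = linprog(c=[-1.0, -1.0], A_ub=A, b_ub=b, bounds=(0, None))
      print(-res.fun)              # 1.6 Gbps, matching the analysis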
6.  Key Issues

   The previous section described the basic idea of using ALTO
   topology services to provide multi-resource information and gave
   examples to demonstrate its feasibility.  Next we list and discuss
   several key issues to address in this proposal.

   o  Can ALTO topology services provide data locality information?
      Existing ALTO topology services do not provide such information,
      yet many studies have pointed out that it plays a vital role in
      reducing the latency of data-intensive analytics.  If ALTO
      topology services can encode such information together with
      information about other resources, data-intensive applications
      can benefit a great deal in terms of information aggregation and
      communication overhead.

   o  How can applications' resource allocation decisions, made on the
      abstract multi-resource view, be quickly mapped back to the
      physical multi-resource view of clusters/sites (see the sketch
      at the end of this section)?  Fine-grained resource information
      can be encoded into abstract network elements to reduce overhead
      and provide a certain level of privacy protection for clusters.
      Such information can be highly compressed (see the dumbbell
      example used in this document as well as in [DRAFT-PV] and
      [DRAFT-RSA]); in preliminary evaluations of RSA, the network
      element compression ratio can be as high as 80 percent, and is
      expected to be even higher in large-scale data center or cluster
      settings, e.g., a fat-tree topology with k=48.  Therefore a fast
      mapping from resource orchestration decisions on the abstract
      view back to the physical view is needed to satisfy the
      stringent latency requirements of large dataset transfers and
      data-intensive analytics.

   o  How much private information, including key resource
      configurations, raw topology, intra-cluster scheduling policies,
      etc., will be exposed?  Compared with the "one-big-switch"
      abstraction, ALTO topology services such as path vector
      [DRAFT-PV] and routing state abstraction [DRAFT-RSA] provide
      fine-grained resource information to applications.  Even if such
      information is encoded into abstract network elements, it still
      risks exposing private information of different clusters/sites.
      The current Internet-Drafts of these services do not provide any
      formal privacy analysis or performance measurement; this is one
      of the key issues we plan to investigate in the future.

   o  How do current ALTO services such as path vector and RSA scale
      when used to provide abstract information about multiple
      resources in clusters?  A related issue is how to balance the
      freshness of fine-grained resource information against the
      corresponding information delivery overhead.  Although encoding
      network elements into abstract network elements can achieve a
      very competitive compression ratio, large dataset transfer and
      analytics applications always involve many network elements in
      multiple clusters/sites, and the absolute number of involved
      network elements keeps increasing as clusters grow.  In
      addition, when resource information in a cluster changes, the
      ALTO services need to inform all related applications.  In
      either case, delivering fine-grained resource information would
      cause high communication overhead.  There is still no analytical
      or experimental understanding of the scalability of the path-
      vector and RSA services.
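   For the second issue, one plausible shape of an answer is a
   private, per-site translation table from each abstract network
   element back to the physical elements it summarizes.  The sketch
   below (in Python; all identifiers are illustrative and not part of
   any ALTO message) shows the direction of such a mapping:

      # Kept private at the site; never exposed to applications.
      ane_to_physical = {
          "ane1": ["link:sw1-sw5", "disk:eh1-hdd"],
          "ane2": ["link:sw3-sw5", "disk:eh3-hdd"],
      }

      def to_physical(allocation):
          """Translate an abstract allocation {ane: amount} from the
          orchestrator into per-physical-element reservations."""
          return {elem: amount
                  for ane, amount in allocation.items()
                  for elem in ane_to_physical[ane]}

      print(to_physical({"ane1": 0.8}))
      # {'link:sw1-sw5': 0.8, 'disk:eh1-hdd': 0.8}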
7.  Unified Resource Orchestration Framework

7.1.  Architecture

   This section describes the design details of the key components of
   the Unicorn framework: the workflow converter, the resource demand
   estimator, the ALTO clients, the ALTO servers, the resource view
   extractor, the multi-resource orchestrator and the execution
   agents.  Figure 3 shows the architecture of Unicorn.  The overall
   process is as follows.

                             .---------.
                             |  Users  |
                             '---------'
                                  | 1
    .- - - - - - - - - - - - - - -|- - - - - - - - - - - - - - -.
    |                  .--------------------.                   |
    | Unicorn          | Workflow Converter |                   |
    |                  '--------------------'                   |
    |                             | 2                           |
    |              .-----------------------------.              |
    |              |  Resource Demand Estimator  |              |
    |              '-----------------------------'              |
    |                             | 3                           |
    |              .-----------------------------.              |
    |              | Multi-Resource Orchestrator |              |
    |              '-----------------------------'              |
    |          /         |      | 10     |          \           |
    |       9 /          | 4    |        | 4         \ 9        |
    | .-------------. .-------. |  .-------. .-------------.    |
    | |Resource View| |Entity | |  |Entity | |Resource View|    |
    | |  Extractor  | |Locator| |  |Locator| |  Extractor  |    |
    | '-------------' '-------' |  '-------' '-------------'    |
    |        | 8         | 5    |      | 5          | 8         |
    | .----------------.   .-----------.   .----------------.   |
    | | ALTO Client(s) |.--| Execution |   | ALTO Client(s) |   |
    | '----------------'|  |  Agents   |   '----------------'   |
    |        | 6        |  '-----------'          | 6           |
    | .----------------.|   /        \    .----------------.    |
    | | ALTO Server(s) ||  / 11    11 \   | ALTO Server(s) |    |
    | '----------------'| /            \  '----------------'    |
    |        | 7        |/              \         | 7           |
    | .----------------./               .----------------.      |
    | |     Site 1     |     . . .      |     Site N     |      |
    | '----------------'                '----------------'      |
    '- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -'

                  Figure 3: Architecture of Unicorn

   o  STEP 1: Authorized users submit high-level data analytics
      workflows to Unicorn through a set of simple APIs.

   o  STEP 2: The workflow converter transforms the high-level data
      analytics workflows into low-level task workflows, i.e., a set
      of analytics tasks with precedence encoded in a directed acyclic
      graph (DAG); a minimal sketch of such a DAG is given after this
      list.

   o  STEP 3: The resource demand estimator automatically finds the
      optimal configuration (resource demand) of each task, i.e., the
      number of CPUs, the size of memory and disk, I/O bandwidth, etc.

   o  STEP 4: The multi-resource orchestrator receives the resource
      demands of a set of tasks and sends them to the entity locator
      at each site.

   o  STEP 5: The entity locator at each site retrieves the entity
      addresses of the end hosts that would be allocated to the tasks
      to be scheduled, passes these addresses to the ALTO clients, and
      asks the ALTO clients to collect information about these end
      hosts, i.e., the properties of the corresponding computing,
      storage and networking resources.

   o  STEP 6: The ALTO clients issue ALTO queries defined in the base
      ALTO protocol [RFC7285], e.g., EPS, ECS, Network Map, etc., and
      in ALTO extension services, e.g., routing state abstraction
      (RSA) [DRAFT-RSA], path vector [DRAFT-PV], network graph
      [DRAFT-NETGRAPH], multi-cost [DRAFT-MC], cost calendar
      [DRAFT-CC] and property map [DRAFT-PM], to collect resource
      information.

   o  STEP 7: The ALTO servers at each site accept the queries from
      the ALTO clients, collect resource information from their
      residing site and send it back to the ALTO clients.

   o  STEP 8: The ALTO clients send the responses from the ALTO
      servers to the resource view extractor.

   o  STEP 9: The extractor uses a lightweight, optimal algorithm to
      compress the raw resource information provided by the ALTO
      servers into a minimal, equivalent view of resources and sends
      it back to the multi-resource orchestrator.

   o  STEP 10: The orchestrator makes resource allocation decisions,
      e.g., dataset transfer scheduling and analytics task placement,
      based on the resource demands of analytics tasks and the
      resource supply sent back by the resource view extractor.
      Decisions are then sent to the execution agents deployed at the
      corresponding sites.

   o  STEP 11: The execution agents receive and execute instructions
      from the multi-resource orchestrator.  They also monitor the
      status of different tasks and send the updated status to the
      multi-resource orchestrator.
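   As an illustration of the output of STEPs 2 and 3, the sketch below
   shows one possible in-memory form of a low-level task workflow: a
   DAG of tasks with per-task resource demands.  The task names and
   demand fields are ours for illustration; Unicorn does not mandate
   this schema.

      workflow = {
          "tasks": {
              # per-task resource demand estimated in STEP 3
              "ingest": {"cpus": 2, "mem_gb": 8,  "io_gbps": 0.8},
              "map_1":  {"cpus": 8, "mem_gb": 32, "io_gbps": 1.0},
              "map_2":  {"cpus": 8, "mem_gb": 32, "io_gbps": 1.0},
              "reduce": {"cpus": 4, "mem_gb": 64, "io_gbps": 1.0},
          },
          # precedence: task -> tasks that must complete first
          "deps": {"map_1": ["ingest"], "map_2": ["ingest"],
                   "reduce": ["map_1", "map_2"]},
      }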
   Unicorn provides a unified, automatic solution for multi-domain,
   geo-distributed data analytics.  In particular, its benefits
   include:

   o  On the resource demand side, it provides a set of simple APIs
      for authorized users to submit and manage data analytics
      requests and enables real-time monitoring of request status.  It
      automatically converts high-level analytics workflows into low-
      level task workflows and finds the optimal configuration for
      each task.

   o  On the resource supply side, it collects the resource
      information of different sites through a common, REST-based
      interface specified by the ALTO protocol, encodes such
      information into different entity abstractions and computes a
      minimal yet accurate view of the resource supply dynamics.

   o  It provides a scalable multi-resource orchestrator that makes
      efficient resource allocation decisions to achieve high resource
      utilization and low-latency data analytics.

   o  The architecture of Unicorn is modular, supporting different
      resource orchestration algorithms and the deployment of
      different ALTO servers.

7.2.  Workflow Converter

   The workflow converter is the front end of Unicorn.  It is
   responsible for collecting high-level data analytics workflows from
   users and transforming them into low-level task workflows, e.g.,
   HTCondor ClassAds.  It provides a set of simple APIs for users to
   submit and manage requests, and to track the status of requests in
   real time.

7.2.1.  User API

   o  submitReq(request, [options])

      This API allows users to submit a request and specify
      corresponding options.  The request can be a data transfer
      request or a data analytics request.  Request options include
      priority, delay, etc.  It returns a request identifier reqID
      that allows users to update or delete this request, or to track
      its status.  The requested options may or may not be approved,
      and the relative priorities may be modified by the resource
      orchestrator depending on the role of the user (regular user or
      administrator at some level), resource availability and the
      status of other ongoing requests.

   o  updateReq(requestID, [options])

      This API allows users to update the options of a request.  It
      returns SUCCESS if the new options are received by the request
      parser, but the new options may or may not be approved, and may
      be modified by the resource orchestrator depending on the role
      of the user (regular user or administrator), resource
      availability and the status of other ongoing requests.

   o  deleteReq(requestID)

      This API allows users to delete a request by passing the
      corresponding requestID.  A completed request cannot be deleted.
      An ongoing request will be stopped and its output data will be
      deleted.

   o  getReqStatus(requestID)

      This API allows users to query the status of a request by
      specifying the corresponding requestID.  The returned status
      information includes whether the request has started, the
      assigned priority, the percentage of finished sub-requests,
      transmission statistics, the expected remaining time to finish,
      etc.
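   This document defines the API abstractly and does not fix a
   concrete binding.  As one possible shape, the sketch below renders
   the four calls as a thin REST client in Python; the requests
   library, the base URL and the resource paths are all our
   assumptions, not part of Unicorn.

      import requests

      BASE = "https://unicorn.example/api"   # hypothetical endpoint

      def submit_req(request, options=None):
          """submitReq: returns the reqID used by the other calls."""
          r = requests.post(f"{BASE}/requests",
                            json={"request": request,
                                  "options": options or {}})
          return r.json()["reqID"]

      def update_req(req_id, options):
          """updateReq: accepted options may still be modified later
          by the resource orchestrator."""
          return requests.put(f"{BASE}/requests/{req_id}",
                              json=options).json()

      def delete_req(req_id):
          """deleteReq: completed requests cannot be deleted."""
          return requests.delete(f"{BASE}/requests/{req_id}").ok

      def get_req_status(req_id):
          """getReqStatus: progress, priority, remaining time, etc."""
          return requests.get(f"{BASE}/requests/{req_id}/status").json()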
7.3.  Resource Demand Estimator

   The estimator leverages the fact that low-level tasks are typically
   repetitive or highly similar.  It uses reinforcement learning to
   predict the optimal configuration for each task and passes the
   resource demand to the multi-resource orchestrator for further
   processing.

7.4.  Entity Locator

   The task configurations computed by the demand estimator carry no
   knowledge of the underlying topology of each site, i.e., the
   addresses of end hosts and network devices.  Hence they cannot be
   directly used by the ALTO clients for querying resource
   information.  The entity locator retrieves the entity addresses of
   the end hosts that would be allocated to the tasks to be scheduled,
   passes these addresses to the ALTO clients, and asks the ALTO
   clients to collect information about these end hosts, i.e., the
   properties of the corresponding computing, storage and networking
   resources.

7.5.  ALTO Client

   The ALTO client is the back end of Unicorn and is responsible for
   retrieving resource information by querying the ALTO servers
   deployed at different sites.  The resource information needed by
   Unicorn includes the topology, link bandwidth, computing node
   memory I/O speed, computing node CPU utilization, etc.  The base
   ALTO protocol [RFC7285] provides an extreme single-node abstraction
   of this information, which only allows the multi-resource
   orchestrator to make coarse-grained resource allocation decisions.
   To enable fine-grained multi-resource orchestration for dataset
   transfer and data analytics in cluster networks, ALTO topology
   extension services such as routing state abstraction (RSA)
   [DRAFT-RSA], path vector [DRAFT-PV], network graph
   [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost calendar
   [DRAFT-CC] are needed to provide fine-grained information about the
   different types of resources in clusters.

7.5.1.  Query Mode

   The ALTO client should operate in different query modes depending
   on the implementation of the ALTO servers.  If an ALTO server does
   not support incremental updates using server-sent events (SSE)
   [DRAFT-SSE], the ALTO client sends queries to this server
   periodically to get the latest resource information; if the cluster
   state changes between two queries, the ALTO client will not be
   aware of the change until the next query.  If an ALTO server
   supports SSE, the ALTO client sends only one query to get the
   initial cluster information; when the resource state changes, the
   ALTO client is notified by the ALTO server through SSE.
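   The two modes can be sketched as follows (Python; the URLs and
   resource paths are placeholders, and the SSE parsing is
   deliberately minimal rather than a full implementation of
   [DRAFT-SSE]):

      import json, time, requests

      ALTO = "https://alto.site1.example"   # hypothetical ALTO server

      def poll(handle, interval=30):
          """Periodic mode, used when the server lacks SSE support.
          State changes between polls go unnoticed until the next."""
          while True:
              handle(requests.get(ALTO + "/resourceview").json())
              time.sleep(interval)

      def subscribe(handle):
          """SSE mode: one initial query, then server-pushed updates."""
          with requests.get(ALTO + "/updates", stream=True) as resp:
              for line in resp.iter_lines(decode_unicode=True):
                  if line and line.startswith("data:"):
                      handle(json.loads(line[len("data:"):]))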
7.6.  ALTO Server

   ALTO servers are deployed at different sites around the world, and
   at strategic locations in the network itself, to provide
   information about the different types of resources in the cluster
   networks in response to queries from the ALTO clients.  Such
   information includes topology, link bandwidth, memory I/O speed and
   CPU utilization at computing nodes, storage constraints at storage
   nodes, etc.  Each ALTO server must provide the basic information
   services specified in [RFC7285], such as the network map, cost map
   and endpoint cost service (ECS).  To support fine-grained multi-
   resource allocation in Unicorn, each ALTO server should also
   provide more fine-grained information about the different resources
   in clusters through ALTO extension services such as the routing
   state abstraction [DRAFT-RSA], path vector [DRAFT-PV], network
   graph [DRAFT-NETGRAPH], multi-cost [DRAFT-MC] and cost calendar
   [DRAFT-CC] services.

7.7.  Resource View Extractor

   At each site, the resource information collected by the ALTO
   clients is not sent directly back to the orchestrator.  Instead, a
   resource view extractor compresses the resource information
   provided by the ALTO servers into a minimal, equivalent view of all
   the resources, i.e., the computing, storage and networking
   resources, that would be allocated to a set of tasks.  The
   extractor works in the following steps.

   o  Resource-Task Matrix

      Depending on the specific services provided by the ALTO servers,
      the responses collected by the ALTO clients may include
      information about different entities, e.g., endpoints, PIDs,
      ANEs, etc.  Each entity possesses a set of resources available
      to the tasks to be scheduled.  For each entity property p in the
      responses, such as bandwidth, delay, etc., the extractor
      composes an entity-task matrix M(p).  Each row of M(p)
      represents an entity in the responses that provides information
      about property p, and each column represents a task to be
      scheduled.  The element m(i, j) of M(p) is a variable
      representing the amount of property p of entity i that task j
      can get.

   o  Resource-Task Constraints

      For each property p, the extractor composes a series of
      resource-task constraints.  The first set of constraints is
      sum_i m(i, j) = r(p, j) for each task j.  These constraints
      calculate r(p, j), the total amount of property p provided to
      task j by all the entities.  The exact rule of "summation"
      depends on the property p: if p is latency, the summation is
      ordinary addition, but if p is bandwidth, the summation is a
      minimum function.

      The second set of constraints is sum_j m(i, j) <= r(p, i) for
      each entity i.  These constraints state that the usage of entity
      i on property p by all the tasks cannot exceed the capacity of
      entity i on this property.  The exact rule of summation again
      depends on the property p: if p is bandwidth, the summation is
      ordinary addition, but if p is latency, the constraints become
      m(i, j) = r(p, i) for each entity i and each task j, meaning
      that entity i provides the same delay to each task.

   o  Resource View Compression

      The whole set of resource-task constraints is linear and
      expresses the view of the resources available to the tasks to be
      scheduled.  The extractor uses a lightweight, optimal algorithm
      to compress them into a minimal, equivalent view of resources,
      i.e., a minimal set of linear constraints that represents the
      same feasible region as the original constraints.  The basic
      idea of this algorithm is described in [DRAFT-RSA].
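   As a concrete rendering of the matrix and constraint steps for an
   additive property such as bandwidth, the sketch below (Python with
   numpy; the variable encoding and example capacities are our
   choices) builds the per-entity capacity constraints
   sum_j m(i, j) <= r(p, i) in matrix form, ready for the compression
   step or an LP solver:

      import numpy as np

      def entity_capacity_constraints(cap, uses):
          """Build A, b such that A @ m <= b encodes, for each entity
          i, sum_j m(i, j) <= r(p, i) for an additive property p
          (e.g., bandwidth).  cap: length-E capacities r(p, i);
          uses: E x T 0/1 matrix (entity i can serve task j).  The
          variables m(i, j) are flattened row-major.  Latency-like
          properties would instead need the equality constraints
          described above."""
          E, T = uses.shape
          A = np.zeros((E, E * T))
          for i in range(E):
              for j in range(T):
                  if uses[i, j]:
                      A[i, i * T + j] = 1.0
          return A, np.asarray(cap, dtype=float)

      # Two ANEs from the dumbbell example, each serving two tasks:
      A, b = entity_capacity_constraints(
          cap=[0.8, 0.8], uses=np.ones((2, 2), dtype=int))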
7.8.  Execution Agents

   Execution agents are deployed at each site and are responsible for
   the following functions:

   o  Receive and process instructions from the multi-resource
      orchestrator, e.g., dataset transfer scheduling, data analytics
      task placement and execution, task updates and aborts, etc.

   o  Monitor the status of data analytics tasks and send the updated
      status to the multi-resource orchestrator.

   Depending on the supported data analytics frameworks, different
   execution agents may be deployed at each site.  For instance, in
   the CMS experiment at CERN, both MPI and Hadoop execution agents
   are deployed.

7.9.  Multi-Resource Orchestrator

   The multi-resource orchestrator receives the resource demand
   information, i.e., a set of low-level task workflows and their
   configurations, from the resource demand estimator.  It then asks
   the entity locator at each site for the addresses of the end hosts
   that would be allocated to the tasks to be scheduled, and asks the
   ALTO clients to query the properties related to these end hosts.
   When the resource view extractor sends the response back, the
   orchestrator makes resource allocation decisions, e.g., dataset
   transfer scheduling and analytics task execution, based on both the
   resource demand dynamics and the resource supply dynamics.  The
   dataset transfer scheduling decisions include dataset replica
   selection, path selection, bandwidth allocation, etc.  The
   analytics task execution decisions include which cluster should
   allocate how many resources to execute which tasks.  These
   decisions are sent to the execution agents at the different sites
   for execution.

7.9.1.  Orchestration Algorithms

   The modular design of Unicorn allows the adoption of different
   orchestration algorithms and methodologies, depending on the
   specific performance requirements.  In Section 7.9.3, a max-min
   fairness resource allocation algorithm for dataset transfer is
   described as an example.

7.9.2.  Online, Dynamic Orchestration

   The multi-resource orchestrator should adjust the resource
   allocation decisions based on the progress of ongoing requests and
   the utilization and dynamics of cluster resources.  In the normal
   case, the multi-resource orchestrator periodically collects such
   information and executes the orchestration algorithm.  When it is
   notified of events such as request status updates or cluster state
   updates, the orchestrator also executes the orchestration algorithm
   to adjust the resource allocations, as sketched below.
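   A minimal control-loop sketch of this behavior follows (Python; the
   collect_demand, collect_supply, orchestrate and dispatch hooks
   stand for the surrounding components and are not defined by this
   document):

      import queue

      events = queue.Queue()   # fed by agents and ALTO update streams

      def orchestration_loop(collect_demand, collect_supply,
                             orchestrate, dispatch, period=60):
          """Re-run the orchestration algorithm every `period`
          seconds, and immediately whenever an event arrives."""
          while True:
              try:
                  events.get(timeout=period)   # event-driven run
              except queue.Empty:
                  pass                         # periodic run
              demand = collect_demand()   # tasks + configurations
              supply = collect_supply()   # ALTO clients + extractor
              dispatch(orchestrate(demand, supply))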
7.9.3.  Example: A Max-Min Fairness Resource Allocation Algorithm

   In this section, we describe a max-min fair resource allocation
   (MFRA) scheduling algorithm, which aims to minimize the maximal
   time to complete a dataset transfer subject to a set of
   constraints.  To make resource allocation decisions, MFRA requires
   sufficient network information, including topology, link bandwidth
   and, in some cases, recent historical information.  In a small-
   scale single-domain network, an SDN controller can provide the raw,
   complete topology information for the MFRA algorithm.  However, in
   a large-scale multi-domain science network such as CMS, providing
   the raw network topology is infeasible because (1) it would incur
   significant communication overhead; and (2) it would violate the
   privacy constraints of some sites.  Several ALTO topology extension
   services, including abstract path vector [DRAFT-PV], network graphs
   [DRAFT-NETGRAPH] and RSA [DRAFT-RSA], can provide the fine-grained
   yet aggregated/abstract topology information that MFRA needs to
   efficiently utilize bandwidth resources in the network.

   Ongoing pre-production deployment efforts of Unicorn in the CMS
   network involve the implementation of the RSA service.  Besides
   topology information, the additional input to the MFRA algorithm is
   the priority of each class of flows, expressed in terms of upper
   and lower limits on the allocated bandwidth between the source and
   the destination of each data transfer request (DTR).

   The basic idea of the MFRA algorithm is to iteratively maximize the
   volume of data that can be transferred subject to the constraints.
   It works in quantized time intervals: it schedules the network
   paths and the data volumes to be transferred in each time slot.
   When the DTR scheduler is notified of events such as the
   cancellation of a DTR, the completion of a DTR or a network state
   change, the MFRA algorithm is also invoked to make updated network
   path and bandwidth allocation decisions.

   In each execution cycle, MFRA first marks all transfers as
   unsaturated.  Then it solves a linear programming model to find the
   common minimum transfer satisfaction rate (i.e., the ratio of the
   data volume transferred in a time interval to the whole data volume
   of the request) that can be satisfied by all transfer requests.
   With this common rate found, MFRA then randomly selects an
   unsaturated request in each iteration and increases its transfer
   rate as much as possible, by finding residual paths available in
   the network or by increasing the allocated bandwidth along an
   existing path, until the request reaches its upper limit or can
   otherwise not be increased further, at which point it is saturated.
   At each iteration, newly saturated requests are removed from the
   subsequent process by fixing their corresponding rate values, and
   completed transfers are removed from further consideration.  After
   all data transfer rates are saturated in the given time slot, a
   feasible set of data volumes scheduled to be transferred across
   each link in the network during the slot can be derived.

   The MFRA algorithm yields full utilization of limited network
   resources such as bandwidth, so that all DTRs can be completed in a
   timely manner.  It allocates network resources fairly, so that no
   DTR suffers starvation.  It also achieves load balance among the
   sites and the network paths crossing a complex network topology, so
   that no site and no network link is oversubscribed.  Moreover, MFRA
   can handle cases where particular routing constraints are
   specified, e.g., where all routes are fixed ahead of time, or where
   each transfer request may only use a single path in each time slot,
   by introducing an additional set of linear constraints.
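   For concreteness, the sketch below implements a simplified, single-
   path variant of this cycle in Python with numpy: rates grow in
   proportion to request volume, so all unsaturated requests keep a
   common satisfaction rate, until a resource constraint or an upper
   limit saturates them.  It omits path selection, lower limits and
   the explicit LP of the full algorithm; the water-filling form is
   our simplification, not the deployed code.

      import numpy as np

      def mfra_rates(A, b, volume, upper):
          """Max-min fair rates x for one time slot.  A @ x <= b are
          the linear resource constraints from the abstract view
          (path vector / RSA); volume weights the common satisfaction
          rate (assumed positive); upper gives per-request limits
          (assumed finite)."""
          n = A.shape[1]
          x = np.zeros(n)
          free = np.ones(n, dtype=bool)     # unsaturated requests
          while free.any():
              # Largest common increment t with x_i += t * volume_i
              # for every unsaturated request i.
              coef = A @ (free * volume)    # growth rate per row
              resid = b - A @ x             # residual capacities
              with np.errstate(divide="ignore", invalid="ignore"):
                  t_rows = np.where(coef > 1e-12, resid / coef, np.inf)
              t_up = ((upper - x)[free] / volume[free]).min()
              t = min(t_rows.min(), t_up)
              x[free] += t * volume[free]
              # Saturate requests at their upper limit or touching a
              # now-tight constraint row.
              tight = np.isclose(A @ x, b, atol=1e-9)
              in_tight = ((A[tight] > 0).any(axis=0) if tight.any()
                          else np.zeros(n, dtype=bool))
              newly = free & (np.isclose(x, upper) | in_tight)
              if not newly.any():
                  break                     # safety: no progress
              free &= ~newly
          return x

   Under the constraint set of Example 2 (Section 5.3), calling
   mfra_rates(np.eye(2), np.array([0.8, 0.8]), np.ones(2),
   np.ones(2)) returns 0.8 Gbps for each flow, the max-min fair point
   of that capacity region.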
8.  Discussion

8.1.  Deployment

   The Unicorn framework is a first step towards a new class of
   intelligent, SDN-driven global systems for multi-domain, geo-
   distributed data analytics involving a worldwide ensemble of sites
   and networks, such as CMS and ATLAS.  Unicorn relies heavily on
   ALTO services for collecting and expressing abstract, real-time
   resource information from the different sites, and on centralized
   SDN control capabilities to orchestrate data analytics workflows.
   It aims to provide a new operational paradigm in which science
   programs can use complex network and computing infrastructures with
   high throughput, while allowing for coexistence with other network
   traffic.

   A prototype case-study implementation of Unicorn has been
   demonstrated on the Caltech/StarLight/Michigan/Fermilab SDN
   development testbed.  Because this testbed is a single-domain
   network, the current Unicorn prototype leverages the ALTO
   OpenDaylight controller to collect topology information.  The CMS
   experiment is currently exploring pre-production deployments of
   Unicorn, looking towards future widespread production use.  To
   achieve this goal, it is imperative to collect sufficient resource
   information from the various sites in the multi-domain CMS network
   without causing any privacy leaks.  To this end, the ALTO RSA
   service [DRAFT-RSA] is under development.  Furthermore, as
   discussed next, other ALTO topology extension services can also
   substantially improve the performance of Unicorn.

8.2.  Benefiting From ALTO Extension Services

   The current ALTO base protocol [RFC7285] exposes network topology
   and endpoint properties using the extreme "my-Internet-view"
   representation, which abstracts a whole network as a single node
   with a set of access ports, where each port connects to a set of
   end hosts called endpoints.  Such an extreme abstraction leads to
   significant information loss on network topology [DRAFT-PV], which
   is the key information Unicorn needs to make dynamic scheduling and
   resource allocation decisions.  Though Unicorn can still allocate
   resources for data transfer and analytics requests on this abstract
   view, the resulting resource allocation decisions are suboptimal.
   Feeding the raw, complete network topology of each site to Unicorn
   is not desirable either: first, it would violate the privacy
   constraints of different sites; second, a raw network topology
   would significantly enlarge the problem space and the solution
   space of the orchestration algorithm, leading to long computation
   times.  Hence, Unicorn needs an ALTO topology service that provides
   just enough fine-grained topology information.

   Several ALTO topology extension services, including path vector
   [DRAFT-PV], network graphs [DRAFT-NETGRAPH], RSA [DRAFT-RSA] and
   property maps [DRAFT-PM], are potential candidates for providing
   fine-grained abstract network information to Unicorn.  In addition,
   we propose that these services can also be used to provide
   information about the computing and storage resources of different
   clusters/sites, by viewing each computing node and storage node as
   a network element or an abstract network element.  For instance,
   the path vector service supports the capacity region query, which
   accepts multiple concurrent data flows as input and returns
   information about the bottleneck resources, which could be a set of
   links, computing devices or storage devices, for the given set of
   concurrent flows.
   This information can be interpreted as a set of linear constraints
   by the multi-resource orchestrator, which can help data transfer
   and analytics requests better utilize the multiple types of
   resources in different clusters.

8.3.  Limitations of the MFRA Algorithm

   The first limitation of the MFRA algorithm is its computation
   overhead.  Executing MFRA involves solving linear programming
   problems repeatedly in every time slot.  The computation time is
   acceptable for small sets of dataset transfer requests, but may
   increase significantly when handling large sets of requests, e.g.,
   hundreds of transfer requests.  Current efforts towards addressing
   this issue include exploring the feasibility of incremental
   computation of scheduling policies, and reducing the problem scale
   by finding the minimal equivalent set of constraints of the linear
   programming model.  The latter approach can benefit substantially
   from the ALTO RSA service [DRAFT-RSA].

   The second limitation is that the current version of MFRA does not
   cover dataset replica selection.  Simply encoding replica selection
   as a set of binary constraints would significantly increase the
   computational complexity of the scheduling process.  Current
   efforts focus on finding efficient algorithms for dataset replica
   selection.

9.  Security Considerations

   This document does not introduce any privacy or security issues not
   already present in the ALTO protocol.

10.  IANA Considerations

   This document does not define any new media type or introduce any
   new IANA considerations.

11.  Acknowledgments

   The authors thank Shenshen Chen, Linghe Kong, Xiao Lin and Xin Wang
   for helpful discussions.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

12.2.  Informative References

   [DRAFT-CC] Randriamasy, S., Yang, R., Wu, Q., Deng, L., and N.
              Schwan, "ALTO Cost Calendar", 2017.

   [DRAFT-DC] Lee, Y., Bernstein, G., Dhody, D., and T. Choi, "ALTO
              Extensions for Collecting Data Center Resource
              Information", 2014.

   [DRAFT-MC] Randriamasy, S., Roome, W., and N. Schwan, "Multi-Cost
              ALTO", 2017.

   [DRAFT-NETGRAPH]
              Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Topology Extensions: Node-Link Graphs", 2015.

   [DRAFT-PM] Roome, W. and Y. Yang, "Extensible Property Maps for the
              ALTO Protocol", 2015.

   [DRAFT-PV] Bernstein, G., Lee, Y., Roome, W., Scharf, M., and Y.
              Yang, "ALTO Extension: Abstract Path Vector as a Cost
              Mode", 2015.

   [DRAFT-RSA]
              Gao, K., Wang, X., Yang, Y., and G. Chen, "ALTO
              Extension: A Routing State Abstraction Service Using
              Declarative Equivalence", 2015.

   [DRAFT-SSE]
              Roome, W. and Y. Yang, "ALTO Incremental Updates Using
              Server-Sent Events (SSE)", 2015.

   [HTCondor] Thain, D., Tannenbaum, T., and M. Livny, "Distributed
              Computing in Practice: the Condor Experience", 2005.

   [RFC7285]  Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel,
              S., Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
              "Application-Layer Traffic Optimization (ALTO) Protocol",
              RFC 7285, DOI 10.17487/RFC7285, September 2014,
              <https://www.rfc-editor.org/info/rfc7285>.
Authors' Addresses

   Qiao Xiang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: qiao.xiang@cs.yale.edu

   Harvey Newman
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: newman@hep.caltech.edu

   Greg Bernstein
   Grotto Networking
   Fremont, CA
   USA

   Email: gregb@grotto-networking.com

   Haizhou Du
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: duhaizhou@gmail.com

   Kai Gao
   Tsinghua University
   30 Shuangqinglu Street
   Beijing
   China

   Email: gaok12@mails.tsinghua.edu.cn

   Azher Mughal
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: azher@hep.caltech.edu

   Justas Balcas
   California Institute of Technology
   1200 California Blvd.
   Pasadena, CA
   USA

   Email: justas.balcas@cern.ch

   Jingxuan Jensen Zhang
   Tongji University
   4800 Cao'an Hwy
   Shanghai  201804
   China

   Email: jingxuan.n.zhang@gmail.com

   Y. Richard Yang
   Tongji/Yale University
   51 Prospect Street
   New Haven, CT
   USA

   Email: yry@cs.yale.edu