COIN                                                              J. He
Internet-Draft                                                  R. Chen
Intended status: Informational                                   Huawei
Expires: April 14, 2019                               M. Montpetit, Ed.
                                                         Triangle Video
                                                        October 11, 2018

                    In-Network Data-Center Computing
                      draft-he-coin-datacenter-00

Abstract

This draft reviews the existing research and the open issues that relate to the addition of data plane programmability in data centers. While some of the research hypotheses at the center of in-network computing have been investigated since the time of active networking, recent developments in software-defined networking, virtualization, programmable switches, and new network programming languages such as P4 have generated renewed enthusiasm in the research community and a flurry of new projects in systems and applications alike. This is what this draft addresses.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.
It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 14, 2019.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  In-Network Computing and Data Centers
   3.  State of the Art in DC Programmability
     3.1.  In-Network Computing
     3.2.  In-Network Caching
     3.3.  In-Network Consensus
     3.4.  Research Topics in Next Generation Data Centers
     3.5.  Conclusion
   4.  References
     4.1.  Normative References
     4.2.  Informative References
   Authors' Addresses

1. Introduction

It is now a given in the computing and networking world that traditional approaches to cloud and client-server architectures lead to complexity and scalability issues. New solutions are necessary to address the growth of next-generation network operation (in data centers and edge devices alike), including automation, self-management, orchestration across components, and federation across network nodes, to enable emerging services and applications.

Mobility, social networking, big data, and AI/ML, as well as emerging content applications in XR (virtual, augmented, and mixed reality), require solutions that are more scalable, available, and reliable, delivered in real time, anywhere, and over a wide variety of end devices. While these solutions involve edge resources for computing, rendering, and distribution, this draft focuses on the data center and on the current research approaches to creating more flexible solutions. We must first define what we mean by data centers. In this draft, we do not limit them to single-location cloud resources, but include multiple locations as well as interworking with edge resources, to enable the network programmability that is central to next-generation DCs in terms of supported services and dynamic resilience. This leads to innovative research opportunities, including but not limited to:

- Software-defined networking (SDN) in distributed environments.

- Security and trust models.

- Data plane programmability for consensus and key-value operations.
- High-level abstractions: in-network computing should focus on primitives that can be widely re-used across a class of applications and workloads, and identify those high-level abstractions to promote deployment.

- Machine Learning (ML) and Artificial Intelligence (AI) to detect faults and failures and allow rapid responses, as well as to implement network control and analytics.

- New services for mixed reality (XR) deployment with in-network optimization and advanced data structures and rendering for interactivity, security, and resiliency.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. In-Network Computing and Data Centers

As DC hardware components become interchangeable, the advent of software-defined technologies suggests that a change is underway. In the next-generation data center, an increasing percentage of critical business and management functions will be activated in the software layer rather than in the underlying hardware. This will allow organizations to move away from today's manual configurations toward more dynamic, rules-based configurations. Hence, virtualization and cloud computing have redefined the data center (DC) boundaries beyond the traditional hardware-centric view [SAPIO]. Servers, storage, monitoring, and connectivity are becoming one. The network is, more and more, the computer.

Hence, there are now a number of distributed networking and computing systems that form the basis of big-data and AI-related applications, in DCs in particular. They include distributed file systems (e.g., the Hadoop Distributed File System, HDFS [HADOOP]), distributed in-memory databases (e.g., Memcached [MEM]), and distributed computing systems (e.g., MapReduce from Hadoop [HADOOP], TensorFlow [TENSOR], and Spark GraphX [SPARK]), as well as distributed trust systems on the blockchain, such as hyperledgers and smart contracts.

In parallel, the emergence of the P4 language [P4] and programmable switches facilitates innovation and triggers new research. For example, the latest programmable switches bring the concept of a totally programmable and dynamically reconfigurable network closer to reality. And, as distributed systems are increasingly based on memory instead of hard disks, the performance of applications built on distributed systems is increasingly constrained by network resources rather than by computing.

However, there are some challenges when introducing in-network computing and caching:

- Limited memory size: for example, the SRAM size [TOFINO] can be as small as tens of MBs.

- Limited instruction sets: the operations are mainly simple arithmetic, data (packet) manipulation, and hash operations. Some switches can provide limited floating-point operations. This enables network performance tools like forward error correction, but limits more advanced applications such as machine learning for congestion control.

- Limited speed/CPU processing capabilities: only a few operations can be performed on each packet to ensure line speed (tens of nanoseconds on fast hardware).
Looping (recirculation) could allow a processed packet to re-enter the ingress queue, but at the cost of increased latency and reduced forwarding capacity.

- Performance: for devices located on the links of the distributed network, it remains to be evaluated how on-path processing can reduce the Flow Completion Time (FCT) in data center networks, and hence reduce network traffic and congestion and increase throughput.

The next sections of this draft review how some of these questions are currently being addressed in the research community.

3. State of the Art in DC Programmability

Recent research has shown that in-network computing can greatly improve DC network performance in three typical scenarios: on-path aggregate computing, key-value (K-V) caching, and strong consistency. Some of these research results are summarized below.

3.1. In-Network Computing

The goals of on-path computing in the DC are (1) to reduce delay and/or increase throughput for improved performance by allowing advanced packet processing, and (2) to help reduce network traffic and alleviate congestion by implementing better traffic and congestion management [REXFORD][SOULE][SAPIO].

However, in terms of research and implementation, there are still open issues that need to be addressed in order to fulfill these promises, beyond what was mentioned in the previous section. In particular, the end-to-end principle, which has driven most of the networking paradigms of the last 20 years, is challenged when in-network computing devices are inserted on the ingress-egress path. This is still an open discussion topic.

The type of computing that can be performed to improve DC performance is another of the open topics. Computing should improve performance, but not at the expense of degrading existing applications. Computing should also enable new applications to be developed. At the time of this writing, those include data-intensive applications in workload mode with partition and aggregation functionality.

Data-intensive applications include big data analysis (e.g., data reduction, deduplication, and machine learning), graph processing, and stream processing. They support scalability by distributing data and computing to many worker servers. Each worker performs computing on a part of the data, and there is a communication phase to update the shared state or complete the final calculation. This process can be executed iteratively. Communication cost and the availability of bottleneck resources will clearly be among the main challenges for such applications to perform well, as a large amount of data needs to be transmitted frequently in a many-to-many pattern. Already, there are several distributed frameworks with user-defined aggregation functions, such as MapReduce from Hadoop [HADOOP], Pregel from Google [PREGEL], and DryadLINQ from Microsoft [DRYAD]. These functions enable application developers to reduce the network load used for messaging by aggregating single messages together and consequently reduce task execution time. Currently, these aggregation functions are used only at the worker level. If they are used at the network level, a higher traffic reduction ratio can be reached.
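To make the aggregation idea concrete, the following Python sketch (purely illustrative and not tied to any of the frameworks or switch targets cited above; all names are invented for the example) combines sparse partial results from several workers element-wise at an on-path aggregation point, so that a single combined update is forwarded instead of one message per worker:

   # Illustrative sketch only: element-wise aggregation of sparse
   # worker updates at an on-path aggregation point.  All names are
   # hypothetical.

   from typing import Dict, List

   def aggregate(updates: List[Dict[int, float]]) -> Dict[int, float]:
       # Combine per-worker partial results with a commutative,
       # associative function (here: addition), as an on-path
       # aggregator would before forwarding a single message.
       combined: Dict[int, float] = {}
       for update in updates:
           for key, value in update.items():
               combined[key] = combined.get(key, 0.0) + value
       return combined

   # Three workers each produce a sparse partial result (e.g. word
   # counts or gradient deltas); only one aggregated message needs to
   # leave the aggregation point.
   worker_updates = [
       {0: 1.0, 3: 2.0},
       {0: 0.5, 7: 4.0},
       {3: 1.5, 7: 1.0},
   ]
   print(aggregate(worker_updates))   # {0: 1.5, 3: 3.5, 7: 5.0}

Because the combining function is commutative and associative, the result does not depend on the order in which worker messages arrive, which is what makes this kind of operation safe to execute on path.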
The aggregation functions needed by data-intensive applications have features that make them suitable to be executed, at least in part, in a programmable network device. They usually reduce the total amount of data through arithmetic (addition) or logical functions (minimum/maximum detection) that can be parallelized. Performing these functions in the DC at the ingress of the network can be beneficial in reducing the total network traffic, leading to reduced congestion. The challenge is of course not to lose important data in the process, especially when the functions are applied to different parts of the input data without considering ordering, which can affect the accuracy of the final result.

In-network computing can also improve the performance of multipath routing by aggregating path capacity for individual flows and providing dynamic path selection, improving scalability and multitenancy.

Other data-intensive applications that can be improved in terms of network load by in-network computing include machine learning, graph analysis, data analytics, and map-reduce. For all of these, aggregation functions in the computing hardware provide a reduction of potential network congestion; in addition, because of the reduced load, the overall application performance is improved. The traffic reduction was shown to range from 48% up to 93% [SAPIO].

Machine learning is a very active research area for in-network computing because of the large datasets it both requires and generates. For example, in TensorFlow [TENSOR], parameter updates are small deltas that only change a subset of the overall tensor and can be aggregated by a vector addition operation. The overlap of the tensor updates, i.e., the portion of tensor elements that are updated by multiple workers at the same time, is representative of the possible data reduction achievable when the updates are aggregated inside the network.

In graph analysis, three algorithms with various characteristics have been considered in [SAPIO]: PageRank, Single Source Shortest Path (SSSP), and Weakly Connected Components (WCC), all with a commutative and associative aggregation function. Experiments show that the potential traffic reduction ratio in the three applications is significant.

Finally, in map-reduce, experiments in the same paper show that after aggregation computing, the number of packets received by the reducer decreases by 88%-90% compared to using UDP, and by 40% compared to using TCP. There is thus great promise for MapReduce-like workloads to take advantage of computing and storage optimization.

3.2. In-Network Caching

Key-value stores are ubiquitous, and one of their major challenges is processing their skewed workloads in a dynamic fashion. As in any cache, popular items receive more queries, and the set of popular items can change rapidly with the occurrence of well-liked posts, limited-time offers, and trending events. The skew generated by the dynamic nature of the K-V workload can lead to severe load imbalance and significant performance deterioration. Servers are overloaded in some areas and underused in others, throughput can decrease rapidly, and response latency degrades significantly. When the storage server uses per-core sharding/partitioning to handle high concurrency, this degradation is further amplified. The problem of unbalanced load is especially acute for high-performance in-memory K-V stores.
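As a rough illustration of why a small cache of hot items helps under skew (a simplified simulation, not the design of any of the systems discussed below; the workload, constants, and names are hypothetical), the following Python sketch sends a Zipf-like skewed stream of requests to N back-end partitions, with and without a tiny front-end cache that absorbs the hottest keys:

   # Simplified simulation of skewed key-value load on N partitions,
   # with and without a small front-end cache of hot keys.  Workload,
   # constants, and names are hypothetical.

   import random
   from collections import Counter

   random.seed(1)
   N_PARTITIONS = 8
   N_KEYS = 10_000
   CACHE_SIZE = 64                    # a "small" cache of hot entries

   # Zipf-like popularity: low key ids are much hotter than the rest.
   requests = [int(random.paretovariate(1.2)) % N_KEYS
               for _ in range(100_000)]

   hot_keys = {k for k, _ in Counter(requests).most_common(CACHE_SIZE)}

   def partition(key: int) -> int:
       return hash(key) % N_PARTITIONS

   # Back-end load per partition, without and with the front-end
   # cache (cache hits never reach the back end).
   no_cache = Counter(partition(k) for k in requests)
   with_cache = Counter(partition(k) for k in requests
                        if k not in hot_keys)

   def imbalance(load: Counter) -> float:
       return max(load.values()) / (sum(load.values()) / N_PARTITIONS)

   print("max/mean load without cache:", round(imbalance(no_cache), 2))
   print("max/mean load with cache:   ", round(imbalance(with_cache), 2))

In this toy model, the most loaded partition carries several times the average load without the cache, while with the cache the residual (tail) traffic is spread far more evenly across the partitions.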
Selective replication, i.e., copying popular items, is often used to keep performance high. However, in addition to consuming more hardware resources, selective replication requires a complex mechanism to implement data mobility, data consistency, and query routing. As a result, system design becomes complex and overhead is increased.

This is where in-network caching can help. Recent research experiments show that K-V cache throughput can be improved by 3 to 10 times by introducing an in-network cache. Analytical results in [FAN] show that a small front-end cache can provide load balancing for N back-end nodes by caching only O(N log N) entries, even under worst-case request patterns. Hence, caching O(N log N) items is sufficient to balance the load for N storage servers (or CPU cores).

In the NetCache system [JIN], a new rack-scale key-value store design guarantees billions of queries per second (QPS) with bounded latencies, even under highly skewed and rapidly changing workloads. A programmable switch is used to detect, sort, cache, and serve hot K-V pairs, balancing the load across the storage nodes.

3.3. In-Network Consensus

Strong consistency and consensus in distributed networks are important, and significant efforts in the in-network computing community have been directed towards them. Coordination is needed to maintain system consistency, and it requires a large amount of communication between network nodes and instances, taking processing capability away from other, more essential tasks. The associated performance overhead and extra resource consumption often lead to a decrease in consistency, and the resulting potential inconsistencies then need to be addressed.

Maintaining consistency requires multiple communication rounds in order to reach agreement, hence the danger of creating messaging bottlenecks in large systems. Even without congestion, failures, or lost messages, a decision can only be reached as fast as the network round trip time (RTT) permits. Thus, it is essential to find efficient mechanisms for the agreement protocols. One idea is to use the network devices themselves.

Hence, consensus mechanisms for ensuring consistency are some of the most expensive operations in managing large amounts of data [ZSOLT]. Often, there is a tradeoff that involves reducing the coordination overhead at the price of accepting possible data loss or inconsistencies. As the demand for more efficient data centers increases, it is important to provide better ways of ensuring consistency without affecting performance. In [ZSOLT], consensus (atomic broadcast) is removed from the critical path by moving it to hardware. A proof of concept of ZooKeeper atomic broadcast (also used in Hadoop) is implemented at the network level on an FPGA, using both TCP and an application-specific network protocol. This design can be used to push more value into the network, e.g., by extending the functionality of middleboxes or adding inexpensive consensus to in-network processing nodes.
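The following Python sketch is a toy model of the ordering primitive that such designs move into the network; it is not the [ZSOLT] FPGA design nor the switch-based sequencing of [LIJ] and [PORTS], and all class and function names are invented for illustration. A sequencer stamps each request with a monotonically increasing number, and every replica applies requests in stamp order, so all replicas converge on the same log without exchanging rounds of ordering messages among themselves:

   # Toy model of the ordering primitive that in-network consensus
   # designs move into network devices.  Names are invented; failure
   # handling and message loss are ignored.

   import heapq
   import itertools

   class Sequencer:
       # Stands in for a switch or FPGA that stamps messages on path.
       def __init__(self):
           self._counter = itertools.count(1)

       def stamp(self, request):
           return (next(self._counter), request)

   class Replica:
       # Applies stamped requests strictly in sequence-number order,
       # buffering any that arrive early.
       def __init__(self):
           self.log = []
           self._next = 1
           self._pending = []

       def deliver(self, stamped):
           heapq.heappush(self._pending, stamped)
           while self._pending and self._pending[0][0] == self._next:
               _, request = heapq.heappop(self._pending)
               self.log.append(request)
               self._next += 1

   seq = Sequencer()
   stamped = [seq.stamp(op) for op in ["put x=1", "put y=2", "del x"]]

   r1, r2 = Replica(), Replica()
   for msg in stamped:                # replica 1 sees messages in order
       r1.deliver(msg)
   for msg in reversed(stamped):      # replica 2 sees them reordered
       r2.deliver(msg)

   assert r1.log == r2.log            # both apply the same order
   print(r1.log)

With ordering provided on path, replicas no longer need multiple message rounds just to agree on delivery order, which is where much of the RTT cost of software consensus comes from; the cited systems still need mechanisms to handle sequencer failure and message loss, which this sketch ignores.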
A widely used consensus protocol is Paxos, a fundamental protocol for fault-tolerant systems that is widely used by data center applications. In summary, Paxos serializes transaction requests from different clients at the leader, ensuring that each learner (message replicator) in the distributed system applies them in the same order. Each proposal can be an atomic operation (an inseparable operation set); Paxos does not care about the specific content of the proposal. Recently, some research evaluations suggest that moving Paxos logic into the network would yield significant performance benefits for distributed applications [DANG15]. In this scheme, network switches can play the role of coordinators (request managers) and acceptors (managed storage nodes). Messages travel fewer hops in the network, reducing the latency for the replicated system to reach consensus; coordinators and acceptors typically act as bottlenecks in Paxos implementations because they must aggregate or multiplex multiple messages. Experiments suggest that moving consensus logic into network devices could dramatically improve the performance of replicated systems. In [DANG15], NetPaxos achieves a maximum throughput of 57,457 messages/s, while basic Paxos, with the coordinator being CPU-bound, is only able to send 6,369 messages/s. In [DANG16], an implementation of Paxos on programmable data planes, written in P4, is presented.

Other papers have shown the use of in-network processing and SDN for Paxos performance improvements using multi-ordered multicast and multi-sequencing [LIJ] [PORTS].

3.4. Research Topics in Next Generation Data Centers

While the previous section introduced the state of the art in data center in-network computing, there are still open issues that need to be addressed. In this section, some of these questions are listed, together with the impacts that adding in-network computing will have on existing systems.

Adding computing and caching to the network violates the end-to-end principle central to the Internet, and interaction with encrypted traffic can limit the scope of what in-network computing can do to individual packets. In addition, even when programmable, every switch is still designed for (line-speed) forwarding, with the resulting limitations, such as the lack of floating-point support for advanced algorithms and buffer size limitations. Especially in high-performance data centers, for in-network computing to be successful a balance between functionality, performance, and cost must be found.

Hence the research areas include but are not limited to:

- Protocol design and network architecture: a lot of the current in-network work has targeted network layer optimization; however, transport protocols will influence the performance of any in-network solution. How can in-network optimization interact (or not) with transport layer optimizations?

- Can the end-to-end assumptions of existing transports like TCP still be applicable in the in-network compute era? There is prior experience from middlebox interactions with existing flows.

- Fixed-point calculation for current applications vs. floating-point calculation for more complex operations and services: network switches typically do not support floating-point calculation. Is it necessary to introduce this capability and scale it to the demand? For example, AI and ML algorithms currently mainly use floating-point arithmetic. If an AI algorithm is changed to fixed-point calculation, will the training stop early and will the training results deteriorate (this needs to be investigated jointly with the AI community)? See the sketch after this list.

- What are the gains brought by aggregation in distributed and decentralized networks? The built-in buffers of network devices are often limited, while, for example, AI and XR application parameters and caches can reach hundreds of megabytes. There is a trade-off, in terms of performance, between aggregating the data on a single network device and distributing it across multiple nodes.

- What is the relationship between the depth of packet inspection and not only performance but also security and privacy? There is a need to determine what the application layer cryptography is ready to expose to other layers, and even to collaborating nodes; this is also related to trust in distributed networks.

- What is the relationship between the speed of creating tables on the data plane and performance?
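The following Python sketch illustrates the fixed-point trade-off raised in the list above (a numerical illustration only; the scale factor and values are hypothetical and not taken from any cited system): floating-point worker updates are scaled to integers, which a switch without floating-point support could add at line rate, aggregated, and scaled back, making the rounding error visible.

   # Numerical illustration of the fixed-point trade-off: worker
   # updates are scaled to integers (which a switch without floating-
   # point support could add), aggregated, then scaled back.  The
   # scale factor and values are hypothetical.

   SCALE = 1 << 16                    # 16 fractional bits

   def to_fixed(x: float) -> int:
       return int(round(x * SCALE))

   def to_float(n: int) -> float:
       return n / SCALE

   # Per-worker contributions for one tensor element.
   updates = [0.1234567, -0.0456789, 0.00098765]

   exact = sum(updates)
   approx = to_float(sum(to_fixed(u) for u in updates))

   print(f"floating-point aggregate: {exact:.9f}")
   print(f"fixed-point aggregate:    {approx:.9f}")
   print(f"rounding error:           {abs(exact - approx):.2e}")

Whether errors of this magnitude, accumulated over many aggregation steps, actually slow or degrade training is precisely the open question that requires joint work with the AI community.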
3.5. Conclusion

In-network computing as it applies to data centers is a very current and promising research area. Thus, the proposed Research Group creates an opportunity to bring the community together to establish common goals, identify hurdles and difficulties, and provide paths to new research, especially in applications and in linkage to other new networking research areas at the edge. More information is available in [COIN].

4. References

4.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

4.2. Informative References

[CHAN] Chan, M., et al., "Network Support for DNN Training", 2018.

[COIN] He, J., Chen, R., and M. Montpetit, "Computing in the Network, COIN, proposed IRTF group", 2018.

[DANG15] Dang, T., et al., "NetPaxos: Consensus at Network Speed", 2015.

[DANG16] Dang, T., et al., "Paxos Made Switch-y", 2016.

[DRYAD] Microsoft, "DryadLINQ", 2018.

[FAN] Fan, B., et al., "Small Cache, Big Effect: Provable Load Balancing for Randomly Partitioned Cluster Services", 2011.

[FORSTER] Forster, N., "To be included", 2018.

[GRAHAM] Graham, R., "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction", 2016.

[HADOOP] Hadoop, "Hadoop Distributed Filesystem", 2016.

[JIN] Jin, X., et al., "NetCache: Balancing Key-Value Stores with Fast In-Network Caching", 2017.

[LIJ] Li, J., et al., "Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering", 2016.

[LIX] Li, X., et al., "Be Fast, Cheap and in Control with SwitchKV", 2016.

[MEM] memcached.org, "Memcached", 2018.

[P4] p4.org, "P4 Language", 2018.

[PORTS] Ports, D., "Designing Distributed Systems Using Approximate Synchrony in Data Center Networks", 2015.

[PREGEL] github.com/igrigorik/pregel, "Pregel", 2018.

[REXFORD] Rexford, J., "SIGCOMM 2018 Keynote Address", 2018.

[SAPIO] Sapio, A., et al., "In-Network Computation is a Dumb Idea Whose Time Has Come", 2017.

[SOULE] Soule, R., "SIGCOMM 2018 NetCompute Workshop Keynote Address", 2018.

[SPARK] Apache, "Spark GraphX", 2018.

[SUBEDI] Subedi, T., et al., "OpenFlow-based In-network Layer-2 Adaptive Multipath Aggregation in Data Centers", 2015.

[TENSOR] tensorflow.org, "TensorFlow", 2018.

[TOFINO] Barefoot Networks, "Tofino", 2018.
[ZSOLT] Istvan, Z., et al., "Consensus in a Box", 2018.

Authors' Addresses

Jeffrey He
Huawei

Email: jeffrey.he@huawei.com

Rachel Chen
Huawei

Email: chenlijuan5@huawei.com

Marie-Jose Montpetit (editor)
Triangle Video
Boston, MA
US

Email: marie@mjmontpetit.com