COIN                                                              J. He
Internet-Draft                                                     A. Li
Intended status: Informational                                   Huawei
Expires: September 9, 2019                             M. Montpetit, Ed.
                                                          Triangle Video
                                                           March 8, 2019

    In-Network Computing for Managed Networks: Use Cases and Research
                                 Challenges
                     draft-he-coin-managed-networks-00

Abstract

   This draft reviews existing research and open issues related to the
   addition of data plane programmability in managed networks.  While
   some of the research hypotheses at the center of in-network
   computing have been investigated since the time of active
   networking, recent developments in software-defined networking,
   virtualization, programmable switches, and new network programming
   languages such as P4 have generated renewed enthusiasm in the
   research community and a flurry of new projects in systems and
   applications alike.  These developments are the focus of this draft.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on September 9, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  In-Network Computing and Data Centers
   3.  State of the Art in DC Programmability
     3.1.  In-Network Computing
     3.2.  In-Network Caching
     3.3.  In-Network Consensus
   4.  Industrial Networks
   5.  Research Topics
     5.1.  Data Plane Issues in Managed Networks
     5.2.  Interaction with Transport Protocols
     5.3.  Interaction with Security Mechanisms
     5.4.  Privacy Aspects
   6.  Conclusion
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Introduction

   It is now a given in the computing and networking world that
   traditional approaches to cloud and client-server architectures lead
   to complexity and scalability issues.  New solutions are necessary
   to address the growth of next-generation managed network operation
   (in data centers and edge devices alike), including automation,
   self-management, orchestration across components, and federation
   across network nodes, to enable emerging services and applications.

   Mobility, social networking, big data, and AI/ML, as well as
   emerging content applications in XR (virtual, augmented, and mixed
   reality) and emerging industrial networking applications, require
   more scalable, available, and reliable solutions that operate in
   real time, anywhere, and over a wide variety of end devices.  While
   these solutions involve edge resources for computing, rendering, and
   distribution, this draft focuses on the data center and on the
   current research approaches to create more flexible solutions there.
   We must first define what we understand by data centers.  In this
   draft, we do not limit them to single-location cloud resources;
   instead we include multiple locations, as well as interworking with
   edge resources, to enable the network programmability that is
   central to next-generation DCs in terms of supported services and
   dynamic resilience.
   This leads to innovative research opportunities, including but not
   limited to:

   -  Software-defined networking (SDN) in distributed environments.

   -  Security and trust models.

   -  Data plane programmability for consensus and key-value
      operations.

   -  High-level abstractions: in-network computing should focus on
      primitives that can be widely reused across a class of
      applications and workloads, and identify the high-level
      abstractions that promote deployment.

   -  Machine Learning (ML) and Artificial Intelligence (AI) to detect
      faults and failures, allow rapid responses, and implement network
      control and analytics.

   -  New services for mixed reality (XR) deployment with in-network
      optimization, advanced data structures, and rendering for
      interactivity, security, and resiliency.

   -  New applications in industrial networking.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  In-Network Computing and Data Centers

   As DC hardware components become interchangeable, the advent of
   software-defined technologies suggests that a change is underway.
   In the next-generation data center, an increasing percentage of
   critical business and management functions will be activated in the
   software layer rather than in the underlying hardware.  This will
   allow organizations to move away from today's manual configurations
   toward more dynamic, rules-based configurations.  Hence,
   virtualization and cloud computing have redefined the data center
   (DC) boundaries beyond the traditional hardware-centric view
   [SAPIO].  Servers, storage, monitoring, and connectivity are
   becoming one.  The network is more and more the computer.

   There is now a number of distributed networking and computing
   systems that form the basis of big-data and AI-related applications,
   in DCs in particular.  They include distributed file systems (e.g.,
   the Hadoop Distributed File System or HDFS [HADOOP]), distributed
   in-memory databases (e.g., Memcached [MEM]), distributed computing
   systems (e.g., MapReduce from Hadoop [HADOOP], TensorFlow [TENSOR],
   and Spark GraphX [SPARK]), as well as distributed trust systems on
   the blockchain, such as hyperledgers and smart contracts.

   In parallel, the emergence of the P4 language [P4] and programmable
   switches facilitates innovation and triggers new research.  For
   example, the latest programmable switches bring the concept of a
   totally programmable and dynamically reconfigurable network closer
   to reality.  And, as distributed systems increasingly keep their
   data in memory instead of on hard disks, the performance of
   distributed applications is increasingly constrained by network
   resources rather than by computing.
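   As a purely illustrative sketch of this programming model (this is
   Python, not P4 code, and is not tied to any particular switch
   target; all table, field, and action names are hypothetical), the
   fragment below mimics the match-action abstraction that languages
   such as P4 expose: forwarding behavior is defined by tables whose
   entries the control plane can install and change at runtime rather
   than being fixed in hardware.

      # Illustrative match-action abstraction, loosely inspired by the
      # P4 programming model; all names are hypothetical.

      class MatchActionTable:
          def __init__(self, default_action):
              self.entries = {}        # key -> (action, params)
              self.default_action = default_action

          def install(self, key, action, **params):
              """Control plane installs or updates an entry."""
              self.entries[key] = (action, params)

          def apply(self, packet):
              action, params = self.entries.get(
                  packet["dst"], (self.default_action, {}))
              return action(packet, **params)

      # Two toy actions: forward to a port, or drop.
      def forward(packet, port):
          packet["egress_port"] = port
          return packet

      def drop(packet):
          return None

      # The "program" is just data: the control plane can repopulate
      # the table without touching the forwarding loop.
      ipv4_lpm = MatchActionTable(default_action=drop)
      ipv4_lpm.install("10.0.0.1", forward, port=1)
      ipv4_lpm.install("10.0.0.2", forward, port=2)

      print(ipv4_lpm.apply({"dst": "10.0.0.1"}))   # forwarded, port 1
      print(ipv4_lpm.apply({"dst": "192.0.2.9"}))  # dropped (default)

   Real switch pipelines differ in essential ways (fixed stage budgets,
   no loops, restricted arithmetic), which is precisely where the
   challenges listed below come from.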
   However, there are some challenges when introducing in-network
   computing and caching:

   -  Limited memory size: for example, the SRAM size [TOFINO] can be
      as small as tens of MBs.

   -  Limited instruction sets: the operations are mainly simple
      arithmetic, data (packet) manipulation, and hash operations.
      Some switches can provide limited floating-point operations.
      This enables network performance tools like forward error
      correction, but limits more advanced applications such as
      machine learning for congestion control.

   -  Limited speed/CPU processing capabilities: only a few operations
      can be performed on each packet in order to maintain line speed
      (tens of nanoseconds on fast hardware).  Looping could allow a
      processed packet to re-enter the ingress queue, but at the cost
      of increased latency and reduced forwarding capability.

   -  Performance: for devices located on the links of the distributed
      network, it remains to be evaluated how much on-path processing
      can reduce the FCT (Flow Completion Time) in a data center
      network and hence reduce network traffic/congestion and increase
      throughput.

   The next sections of this draft review how some of these questions
   are currently being addressed in the research community.

3.  State of the Art in DC Programmability

   Recent research has shown that in-network computing can greatly
   improve DC network performance in three typical scenarios: on-path
   aggregate computing, key-value (K-V) caching, and strong
   consistency.  Some of these research results are summarized below.

3.1.  In-Network Computing

   The goals of on-path computing in the DC are (1) to reduce delay
   and/or increase throughput for improved performance by allowing
   advanced packet processing, and (2) to help reduce network traffic
   and alleviate congestion by implementing better traffic and
   congestion management [REXFORD][SOULE][SAPIO].

   However, in terms of research and implementation, there are still
   open issues that need to be addressed in order to fulfill these
   promises, beyond what was mentioned in the previous section.  In
   particular, the end-to-end principle, which has driven most of the
   networking paradigms of the last 20 years, is challenged when in-
   network computing devices are inserted on the ingress-egress path.
   This is still an open discussion topic.

   The type of computing that can be performed to improve DC
   performance is another open topic.  Computing should improve
   performance, but not at the expense of degrading existing
   applications.  Computing should also enable new applications to be
   developed.  At the time of this writing, those include data-
   intensive applications in workload mode with partition and
   aggregation functionality.

   Data-intensive applications include big data analysis (e.g., data
   reduction, deduplication, and machine learning), graph processing,
   and stream processing.  They support scalability by distributing
   data and computing to many worker servers.  Each worker performs
   computing on a part of the data, and there is a communication phase
   to update the shared state or complete the final calculation.  This
   process can be executed iteratively.  It is obvious that
   communication cost and the availability of bottleneck resources
   will be among the main challenges for such applications to perform
   well, as a large amount of data needs to be transmitted frequently
   in a many-to-many mode.  But already, there are several distributed
   frameworks with user-defined aggregation functions, such as
   MapReduce from Hadoop [HADOOP], Pregel from Google [PREGEL], and
   DryadLINQ from Microsoft [DRYAD].  These functions enable
   application developers to reduce the network load used for
   messaging by aggregating single messages together and consequently
   reduce the task execution time.  Currently, these aggregation
   functions are used only at the worker level.  If they are used at
   the network level, a higher traffic reduction ratio can be reached.
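   As a minimal sketch of the idea (a Python illustration assuming a
   hypothetical aggregation point between the workers and the reducer;
   it is not taken from any of the cited systems), the fragment below
   shows how an on-path aggregation function with a commutative and
   associative operation can collapse the per-worker messages destined
   for the same key into a single message, which is the source of the
   traffic reduction discussed above.

      # Hypothetical on-path aggregation: partial values from many
      # workers are combined before being forwarded, so the reducer
      # receives one message per key instead of one per worker.

      from collections import defaultdict

      def aggregate_on_path(worker_messages):
          """worker_messages: iterable of (key, value) pairs emitted
          by workers.  Returns the messages actually forwarded."""
          partial = defaultdict(int)
          for key, value in worker_messages:
              partial[key] += value      # commutative/associative op
          return list(partial.items())

      # Four workers each emit counts for the same two keys:
      # 8 messages in, 2 messages out toward the reducer.
      messages = [("cat", 3), ("dog", 1), ("cat", 2), ("dog", 5),
                  ("cat", 1), ("dog", 2), ("cat", 4), ("dog", 1)]
      forwarded = aggregate_on_path(messages)
      print(forwarded)            # [('cat', 10), ('dog', 9)]
      print(len(forwarded), "of", len(messages), "messages forwarded")

   The same pattern applies to the aggregation of model updates in
   distributed training, where the values are tensors and the
   operation is a vector addition.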
   The aggregation functions needed by data-intensive applications
   have some features that make them suitable to be at least partially
   executed in a programmable network device.  They usually reduce the
   total amount of data by arithmetic (addition) or logical functions
   (minima/maxima detection) that can be parallelized.  Performing
   these functions in the DC at the ingress of the network can be
   beneficial to reduce the total network traffic and lead to reduced
   congestion.  The challenge is of course not to lose important data
   in the process, especially when the functions are applied to
   different parts of the input data without considering their order,
   which can affect the accuracy of the final result.

   In-network computing can also improve the performance of multipath
   routing by aggregating path capacity for individual flows and
   providing dynamic path selection, improving scalability and
   multitenancy.

   Other data-intensive applications whose network load can be
   improved by in-network computing include machine learning, graph
   analysis, data analytics, and map-reduce.  For all of those,
   aggregation functions in the computing hardware provide a reduction
   of potential network congestion; in addition, because of the
   reduced load, the overall application performance is improved.  The
   traffic reduction was shown to range from 48% up to 93% [SAPIO].

   Machine learning is a very active research area for in-network
   computing because of the large datasets it both requires and
   generates.  For example, in TensorFlow [TENSOR], parameter updates
   are small deltas that only change a subset of the overall tensor
   and can be aggregated by a vector addition operation.  The overlap
   of the tensor updates, i.e., the portion of tensor elements that
   are updated by multiple workers at the same time, is representative
   of the possible data reduction achievable when the updates are
   aggregated inside the network.

   In graph analysis, three algorithms with various characteristics
   have been considered in [SAPIO]: PageRank, Single Source Shortest
   Path (SSSP), and Weakly Connected Components (WCC), with a
   commutative and associative aggregation function.  Experiments show
   that the potential traffic reduction ratio in the three
   applications is significant.

   Finally, in map-reduce, experiments in the same paper show that,
   after aggregation computing, the number of packets received by the
   reducer decreases by 88%-90% compared to using UDP, and by 40%
   compared to using TCP.  There is thus great promise for MapReduce-
   like applications to take advantage of computing and storage
   optimization.

3.2.  In-Network Caching

   Key-value stores are ubiquitous, and one of their major challenges
   is to process their associated, skewed workloads in a dynamic
   fashion.  As in any cache, popular items receive more queries, and
   the set of popular items can change rapidly with the occurrence of
   well-liked posts, limited-time offers, and trending events.  The
   skew generated by the dynamic nature of the K-V workload can lead
   to severe load imbalance and significant performance deterioration.
   A server is either overused in one area or underused in another,
   the throughput can decrease rapidly, and the response time latency
   degrades significantly.  When the storage server uses per-core
   sharding/partitioning to process high concurrency, this degradation
   is further amplified.  The problem of unbalanced load is especially
   acute for high-performance in-memory K-V stores.
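   The following Python sketch (with made-up numbers, purely to
   illustrate the effect described above) shows how a Zipf-like key
   popularity combined with static hash partitioning concentrates the
   query load on whichever servers happen to own the hottest keys.

      # Illustrative only: Zipf-skewed key popularity plus static hash
      # partitioning concentrates queries on the servers that own the
      # few hot keys.  All sizes are hypothetical.

      import random

      NUM_SERVERS = 8
      NUM_KEYS = 1_000
      NUM_QUERIES = 100_000

      # Zipf-like popularity: key i queried with weight 1/(i+1)^1.2.
      keys = list(range(NUM_KEYS))
      weights = [1.0 / (i + 1) ** 1.2 for i in keys]

      load = [0] * NUM_SERVERS
      for key in random.choices(keys, weights=weights, k=NUM_QUERIES):
          load[hash(key) % NUM_SERVERS] += 1   # static partitioning

      print("per-server load:", load)
      print("max/avg imbalance: %.1fx" %
            (max(load) / (NUM_QUERIES / NUM_SERVERS)))

   The servers owning the head of the popularity distribution receive
   a load well above the average, which is the imbalance that the
   caching techniques discussed below are designed to absorb.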
   The selective replication (copying) of popular items is often used
   to keep performance high.  However, in addition to consuming more
   hardware resources, selective replication requires a complex
   mechanism to implement data mobility, data consistency, and query
   routing.  As a result, system design becomes complex and overhead
   is increased.

   This is where in-network caching can help.  Recent research
   experiments show that K-V cache throughput can be improved by 3-10
   times by introducing an in-network cache.  Analytical results in
   [FAN] show that a small frontend cache can provide load balancing
   for N back-end nodes by caching only O(N log N) entries, even under
   worst-case request patterns.  Hence, caching O(N log N) items is
   sufficient to balance the load for N storage servers (or CPU
   cores).

   In the NetCache system [JIN], a new rack-scale key-value store
   design guarantees billions of queries per second (QPS) with bounded
   latencies even under highly skewed and rapidly changing workloads.
   A programmable switch is used to detect, sort, and cache the hot
   K-V pairs, balancing the load across the storage nodes behind the
   switch.
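   Continuing the illustrative sketch above (again with hypothetical
   sizes, and not the NetCache implementation itself), caching only a
   handful of the hottest keys in a small front-end cache, in the
   spirit of [FAN] and [JIN], absorbs most of the skewed traffic and
   leaves the back-end servers with a far more uniform load.

      # Illustrative front-end cache: a cache that is tiny relative to
      # the key space absorbs the hot keys.  Sizes are hypothetical.

      import random
      from collections import Counter

      NUM_SERVERS = 8
      NUM_KEYS = 1_000
      NUM_QUERIES = 100_000
      CACHE_SIZE = 64              # small compared to NUM_KEYS

      keys = list(range(NUM_KEYS))
      weights = [1.0 / (i + 1) ** 1.2 for i in keys]
      queries = random.choices(keys, weights=weights, k=NUM_QUERIES)

      # Assume the hottest CACHE_SIZE keys have been identified (e.g.,
      # from query statistics) and are served by the front-end cache.
      hot = Counter(queries).most_common(CACHE_SIZE)
      hot_keys = {k for k, _ in hot}

      backend_load = [0] * NUM_SERVERS
      cache_hits = 0
      for key in queries:
          if key in hot_keys:
              cache_hits += 1          # answered by the cache
          else:
              backend_load[hash(key) % NUM_SERVERS] += 1

      print("cache hit rate: %.0f%%"
            % (100.0 * cache_hits / NUM_QUERIES))
      print("back-end load:", backend_load)

   In NetCache itself, the hot-key detection, the cached values, and
   the query handling live in the switch data plane; the sketch only
   conveys why a cache that is small relative to the key space is
   sufficient.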
3.3.  In-Network Consensus

   Strong consistency and consensus in distributed networks are
   important, and significant efforts in the in-network computing
   community have been directed towards them.  Coordination is needed
   to maintain system consistency, and it requires a large amount of
   communication between network nodes and instances, taking
   processing capabilities away from other, more essential tasks.  To
   avoid this performance overhead and extra resource consumption,
   consistency is often weakened, and the resulting potential
   inconsistencies then need to be addressed.

   Maintaining consistency requires multiple communication rounds in
   order to reach agreement, hence the danger of creating messaging
   bottlenecks in large systems.  Even without congestion, failures,
   or lost messages, a decision can only be reached as fast as the
   network round-trip time (RTT) permits.  Thus, it is essential to
   find efficient mechanisms for the agreement protocols.  One idea is
   to use the network devices themselves.

   Consensus mechanisms for ensuring consistency are some of the most
   expensive operations in managing large amounts of data [ZSOLT].
   Often, there is a tradeoff that involves reducing the coordination
   overhead at the price of accepting possible data loss or
   inconsistencies.  As the demand for more efficient data centers
   increases, it is important to provide better ways of ensuring
   consistency without affecting performance.  In [ZSOLT], consensus
   (atomic broadcast) is removed from the critical path by moving it
   to hardware.  The ZooKeeper atomic broadcast (also used in Hadoop)
   proof of concept is implemented at the network level on an FPGA,
   using both TCP and an application-specific network protocol.  This
   design can be used to push more value into the network, e.g., by
   extending the functionality of middleboxes or adding inexpensive
   consensus to in-network processing nodes.

   A widely used protocol for consensus is Paxos.  Paxos is a
   fundamental protocol used by fault-tolerant systems and is widely
   used by data center applications.  In summary, Paxos serializes
   transaction requests from different clients at the leader, ensuring
   that every learner (message replicator) in the distributed system
   applies them in the same order.  Each proposal can be an atomic
   operation (an inseparable operation set); Paxos does not care about
   the specific content of the proposal.  Recently, research
   evaluations have suggested that moving Paxos logic into the network
   would yield significant performance benefits for distributed
   applications [DANG15].  In this scheme, network switches can play
   the role of coordinators (request managers) and acceptors (managed
   storage nodes).  Messages travel fewer hops in the network,
   reducing the latency for the replicated system to reach consensus;
   this matters because coordinators and acceptors typically act as
   bottlenecks in Paxos implementations, since they must aggregate or
   multiplex multiple messages.  Experiments suggest that moving
   consensus logic into network devices could dramatically improve the
   performance of replicated systems.  In [DANG15], NetPaxos achieves
   a maximum throughput of 57,457 messages/s, while basic Paxos, with
   the coordinator being CPU bound, is only able to send 6,369
   messages/s.  In [DANG16], a P4 implementation of Paxos is
   presented, showing that the protocol logic can be expressed with
   programmable data planes.

   Other papers have shown the use of in-network processing and SDN
   for Paxos performance improvements, using multi-ordered multicast
   and multi-sequencing [LIJ] [PORTS].
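   To make the coordinator and acceptor roles concrete, the following
   highly simplified Python sketch (an illustration only, not NetPaxos
   or any of the cited implementations; it omits leader election,
   failure handling, and the Paxos prepare phase) shows the
   serialization step that [DANG15] proposes to move into the switch:
   a coordinator assigns consecutive instance numbers to client
   proposals, and a value is chosen for an instance once a majority of
   acceptors have accepted it.

      # Simplified consensus sketch: a coordinator serializes
      # proposals and counts acceptor votes; a value is chosen once a
      # majority accepts it.  Failure handling is deliberately
      # omitted.

      class Acceptor:
          def __init__(self):
              self.accepted = {}           # instance -> value

          def accept(self, instance, value):
              self.accepted[instance] = value
              return True                  # vote (always yes here)

      class Coordinator:
          def __init__(self, acceptors):
              self.acceptors = acceptors
              self.next_instance = 0
              self.chosen = {}             # instance -> chosen value

          def propose(self, value):
              instance = self.next_instance    # serialization point
              self.next_instance += 1
              votes = sum(a.accept(instance, value)
                          for a in self.acceptors)
              if votes > len(self.acceptors) // 2:
                  self.chosen[instance] = value
              return instance

      coordinator = Coordinator([Acceptor() for _ in range(3)])
      for request in ["PUT x=1", "PUT y=2", "DEL x"]:
          coordinator.propose(request)
      print(coordinator.chosen)
      # {0: 'PUT x=1', 1: 'PUT y=2', 2: 'DEL x'}

   The argument in [DANG15] is that this serialization and vote-
   counting path, which is the throughput bottleneck of a software
   coordinator, consists of exactly the kind of simple per-message
   operations that a programmable switch can perform at line rate.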
4.  Industrial Networks

   For industrial networks, in-network processing can enable a new
   flexibility for automation and control by combining control and
   communication [RUETH][VESTIN].

   Note: this section will be expanded in the next version of the
   draft.

5.  Research Topics

   While the previous sections introduced the state of the art in data
   center and industrial in-network computing, there are still open
   issues that need to be addressed.  This section lists some of these
   questions, as well as the impacts that adding in-network computing
   will have on existing systems.

   Adding computing and caching to the network violates the end-to-end
   principle central to the Internet, and the interaction with
   encrypted traffic can limit the scope of what in-network processing
   can do to individual packets.  In addition, even when programmable,
   every switch is still designed for (line-speed) forwarding, with
   the resulting limitations, such as the lack of floating-point
   support for advanced algorithms and limited buffer sizes.
   Especially in high-performance data centers, for in-network
   computing to be successful, a balance between functionality,
   performance, and cost must be found.

   Hence, the research areas in managed networks include but are not
   limited to:

      Protocol design and network architecture: a lot of the current
      in-network work has targeted network layer optimization;
      however, transport protocols will influence the performance of
      any in-network solution.  How can in-network optimization
      interact (or not) with transport layer optimizations?

      Can the end-to-end assumptions of existing transports like TCP
      still apply in the in-network compute era?  There is heritage in
      middlebox interactions with existing flows.

      Fixed-point calculation for current applications vs. floating-
      point calculation for more complex operations and services:
      network switches typically do not support floating-point
      calculation.  Is it necessary to introduce this capability, and
      can it scale to the demand?  For example, AI and ML algorithms
      currently rely mainly on floating-point calculation.  If an AI
      algorithm is changed to fixed-point calculation, will the
      training stop early and the training result deteriorate (this
      needs to be investigated jointly with the AI community)?

      What are the gains brought by aggregation in distributed and
      decentralized networks?  The built-in buffers of network devices
      are often limited, while, for example, AI and XR application
      parameters and caches can reach hundreds of megabytes.  There is
      a performance trade-off between aggregating the data on a single
      network device and distributing it across multiple nodes.

      What is the relationship between the depth of packet inspection
      and not only performance but also security and privacy?  There
      is a need to determine what the application layer cryptography
      is ready to expose to other layers and even to collaborating
      nodes; this is also related to trust in distributed networks.

      What is the relationship between the speed of creating tables on
      the data plane and the resulting performance?

5.1.  Data Plane Issues in Managed Networks

   Note: To be added

5.2.  Interaction with Transport Protocols

   Note: To be added

5.3.  Interaction with Security Mechanisms

   Note: To be added

5.4.  Privacy Aspects

   Note: To be added

6.  Conclusion

   In-network computing as it applies to data centers is a very active
   and promising research area.  The proposed Research Group creates
   an opportunity to bring the community together to establish common
   goals, identify hurdles and difficulties, and provide paths to new
   research, especially in applications and in linkage to other new
   networking research areas at the edge.  More information is
   available in [COIN].

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

7.2.  Informative References

   [CHAN]     Chan et al., M., "Network Support for DNN Training",
              2018.

   [COIN]     He, J., Chen, R., and M. Montpetit, "Computing in the
              Network, COIN, proposed IRTF group", 2018.

   [DANG15]   Dang et al., T., "NetPaxos: Consensus at Network Speed",
              2015.

   [DANG16]   Dang et al., T., "Paxos Made Switch-y", 2016.

   [DRYAD]    Microsoft, "DryadLINQ", 2018.

   [FAN]      Fan et al., B., "Small Cache, Big Effect: Provable Load
              Balancing for Randomly Partitioned Cluster Services",
              2011.

   [FORSTER]  Forster, N., "To be included", 2018.

   [GRAHAM]   Graham, R., "Scalable Hierarchical Aggregation Protocol
              (SHArP): A Hardware Architecture for Efficient Data
              Reduction", 2016.

   [HADOOP]   Hadoop, "Hadoop Distributed Filesystem", 2016.

   [JIN]      Jin et al., X., "NetCache: Balancing Key-Value Stores
              with Fast In-Network Caching", 2017.

   [LIJ]      Li et al., J., "Just Say No to Paxos Overhead: Replacing
              Consensus with Network Ordering", 2016.

   [LIX]      Li et al., X., "Be fast, cheap and in control with
              SwitchKV", 2016.

   [MEM]      memcached.org, "Memcached", 2018.

   [P4]       p4.org, "P4 Language", 2018.
   [PORTS]    Ports, D., "Designing Distributed Systems Using
              Approximate Synchrony in Data Center Networks", 2015.

   [PREGEL]   github.com/igrigorik/pregel, "Pregel", 2018.

   [REXFORD]  Rexford, J., "Sigcomm 2018 Keynote Address", 2018.

   [RUETH]    Rueth et al., J., "Towards In-Network Industrial
              Feedback Control", 2018.

   [SAPIO]    Sapio et al., A., "In-net computing is a dumb idea whose
              time has come", 2017.

   [SOULE]    Soule, R., "Sigcomm 2018 Netcompute Workshop Keynote
              Address", 2018.

   [SPARK]    Apache, "Spark GraphX", 2018.

   [SUBEDI]   Subedi et al., T., "OpenFlow-based in-network Layer-2
              adaptive multipath aggregation in data centers", 2015.

   [TENSOR]   tensorflow.org, "TensorFlow", 2018.

   [TOFINO]   Barefoot Networks, "Tofino", 2018.

   [VESTIN]   Vestin et al., J., "FastReact: In-Network Control and
              Caching for Industrial Control Networks using
              Programmable Data Planes", 2018.

   [ZSOLT]    Istvan et al., Z., "Consensus in a Box", 2018.

Authors' Addresses

   Jeffrey He
   Huawei

   Email: jeffrey.he@huawei.com

   Aini Li
   Huawei

   Email: liaini@huawei.com

   Marie-Jose Montpetit (editor)
   Triangle Video
   Boston, MA
   US

   Email: marie@mjmontpetit.com