A Generic COIN framework in controlled environments

China Mobile, Beijing 100053, China, yaokehan@chinamobile.com
China Mobile, Beijing 100053, China, xushiping@chinamobile.com
China Mobile, Beijing 100053, China, lizhiqiangyjy@chinamobile.com
Peking University, Beijing 100871, China, wenfeiwu@pku.edu.cn

IRTF
Computing in the Network Research Group
Keywords: framework; COIN

There has been a lot of academic research and industrial practice in
the area of COIN, but most designs are case-by-case and currently rely
heavily on programmable network devices. This lacks generality and
scalability and will therefore impede the development of COIN.
This document summarizes the computing primitives/operations/semantics
that can be implemented inside the network, through analysis of
different COIN use cases, and proposes a generic framework of COIN in
controlled environments. Enabling technologies related to the
framework and the standardization landscape are also analyzed in the
document.

Programmable network devices (PNDs), including programmable switches
and SmartNICs, have inspired a lot of research work in the area of COIN,
such as In-band Network Telemetry (INT) and network function offloading
(load balancers, firewalls). However, technically, we argue that these
use cases are not strictly “computing” in the network, since they are
hardware implementations of network functions that were traditionally
implemented in servers, intended to accelerate or enhance those network
functions. The “network” in COIN is also ambiguous.
Narrowly, it refers to network devices like PNDs, but broadly, it refers
to network elements in different contexts. In edge computing or fog
computing, these network elements are ubiquitous heterogeneous edge
devices, but in controlled environments like data centers, network
elements are normal network devices. In this draft, we limit the scope
of the discussion to the controlled environment, which is consistent
with most of the existing work.

To move the work in COIN further, there is a need to reach a
consensus on the definition of COIN. Although there is an ongoing draft
about the terminology of COIN in the group, we want to share our
thoughts. Computing in the network is “to offload
application-specific functions to network elements, so as to accelerate
applications”. These application-specific functions are described
by a series of computing primitives/operations/semantics that could be
supported by network elements, and they explain what to
“compute” in the network. A very illustrative example is
In-network Aggregation (INA) for distributed machine learning model
training. The aggregation operation is implemented in network devices,
which could accelerate the entire model training process. A lot of
research has investigated what kinds of computing primitives can be
offloaded to network devices, but there is still no systematic
summary of these application-specific primitives. We think that
application-specific functions can be generalized into several types of
computing primitives which could be further standardized. COIN would
then not depend on PNDs for implementation; normal network devices that
support these general primitives could do the work.

Further, current research on how COIN could accelerate applications
usually depends on a case-by-case hardware-software co-design scheme,
which lacks generality and scalability for the development of COIN.
There is a need to design a generic framework of COIN, on the one hand
to make COIN a common capability of the network, and on the other to
lower the barriers to application development.

Based on the analysis above, this document classifies several kinds
of computing primitives which could be standardized, and proposes a
generic framework of COIN, which can be scaled and promoted in the
controlled environment.

PND: Programmable Network Device

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 when, and only when, they appear in all capitals, as shown here.

The generic COIN framework contains three logical layers: the
scheduling layer (S), the control layer (C), and the infrastructure
layer (I).

The scheduling layer (S) decomposes a job into host tasks and COIN
tasks according to the host and COIN resources and the scheduling policy.
These tasks are then distributed to the control layer.

The control layer (C) is divided into a host controller and a COIN
controller, both of which can be centralized or distributed. The host
controller is optional and is deployed on demand according to the
application scenario. A host controller is mainly responsible for host
task deployment and control. The COIN controller is mainly responsible
for network management, COIN task deployment and control, and routing.
The host controller and the COIN controller work together to realize
end-network cooperation.

The infrastructure layer (I) includes the host and network equipment,
together with the relevant routing protocols and reliability protocols
needed to realize COIN.

Task decomposition is the first step toward end-network
collaborative in-network computing. With an appropriate scheduling
policy, reasonable resource allocation and better task performance can
be achieved. With the addition of in-network computing technology, it
is necessary to consider not only the host resources but also the
in-network computing resources.

End-network collaborative control is realized by the host controller
and the COIN controller.

Network side:

* Network equipment management, including network equipment status,
  load condition, network equipment computing capacity and resource,
  etc.
* Network topology management, including network topology update,
  link status monitoring, etc.
* Routing, selecting an optimal path for in-network computing and
  forwarding.

Host side:

* Cooperate with the host application to do the COIN processing,
  including completing the overall calculation task with the network
  side, and reliability control.

Network equipment implements the standard COIN primitives.

A set of unified COIN primitives makes COIN easier to integrate and
promote. Some research work summarizes common COIN primitives and data
structures. We refer to this research and choose some major COIN
primitives from it. ValStr_Agg is used in applications like
distributed machine learning training, Asyn_Val_Agg is used in big
data analysis applications where map-reduce is needed, K-V is used for
caching, and consensus is used for synchronization within distributed
systems. Heterogeneous network devices can have different internal
implementations of the same COIN primitives, but the services provided
externally need to be unified. There is a need to standardize these
COIN primitives for generic use cases. Of course, due to equipment
differences, there may be differences in calculation accuracy for some
primitives. These differences need to be considered in task
decomposition and routing.

COIN transformation of the application program on the host side.

The network cannot guarantee that the computing task will be completed
during each transmission, so host-side applications need to be
COIN-aware and able to flexibly handle data whether or not it has been
processed in the network.

* End and network collaboration. Due to the limited resources within
network devices, there is a need to design fallback mechanisms for when
tasks cannot be fully accomplished within the network and must
be finished at the end devices. Related algorithms and protocols should
be considered for implementation.
* COIN reliability and correctness. On the premise that tasks can be
offloaded to network devices for computing, the correctness and
reliability of the work should be considered. There should be
mechanisms to ensure that the COIN results are consistent with
those obtained when tasks are fully accomplished at end devices.
Besides, reliable data transmission in COIN should be carefully
designed, since many
applications have very strict QoS requirements.

TBD.

TBD.

NetRPC: Enabling In-Network Computation in Remote Procedure Calls.
Zhao, B., Wu, W., and Xu, W.

When Should The Network Be The Computer? Dan R. K. Ports, Jacob Nelson.