IoT Operations Working Group                                F. Foukalas
Internet-Draft                                            A. Tziouvaras
Intended status: Standards Track                         March 30, 2021
Expires: September 22, 2021


          Distributed Machine Learning for IoT Edge Computing
              draft-distributed-ml-iot-edge-cmp-foukalas-00

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 22, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   The next generation of the Internet requires decentralized and
   distributed intelligence in order to offer new types of services
   that serve users' interests.  Such services will be enabled by
   deploying intelligence over a high volume of IoT devices in the
   form of a distributed protocol.  This protocol orchestrates a
   machine learning (ML) application that trains on the data
   available at the IoT devices.  Training is not an easy task in
   such a distributed environment, where the number of connected IoT
   devices scales up and the needs for both interoperability and
   computing are high.  This draft addresses both issues by combining
   two emerging technologies known as edge AI and fog computing.  The
   protocol procedures aggregate the models trained on data collected
   by the IoT devices at a fog node and apply edge AI for data
   analysis at the edge of the infrastructure.  The analysis of the
   IoT requirements resulted in an end-to-end ML protocol
   specification, which is presented throughout this draft.

Table of Contents

   1. Introduction
   2. Background and terminology
   3. Edge computing architecture
   4. Protocol stages
      4.1. Initial configuration
      4.2. FL training
      4.3. Cloud update
   5. Security Considerations
   6. IANA Considerations
   7. Conclusions
   8. References
      8.1. Normative References
      8.2. Non-normative References
   9. Acknowledgments

1. Introduction

   There is an evident need to address several challenges in offering
   robust IoT services by integrating edge computing with IoT, known
   as IoT edge computing.  The concept of IoT edge computing has not
   yet been specified in detail, although two recent drafts already
   describe some aspects of such an Internet architecture.  Such an
   architecture becomes considerably more useful when distributed
   machine learning is deployed in the future Internet, where edge
   artificial intelligence will play an important role.  Towards this
   end, this draft first provides the IoT edge computing
   architecture, which includes the elements necessary to deploy
   distributed machine learning.  Second, the three stages of such
   distributed intelligence are described as protocol procedures
   covering initialization, learning and cloud updates.
   Details are given for all the protocol procedures of distributed
   machine learning for IoT edge computing.

2. Background and terminology

   Below we list a number of terms related to the distributed machine
   learning solution:

   End devices:  End devices [1] are IoT devices that collect data
      while also having computing and networking capabilities.  An
      end device can be any type of device that can connect to the
      Edge gateway and carries sensors for data collection.

   Edge gateway:  The Edge gateway is a server located at the edge of
      the network [1].  It provides large computational and
      networking capabilities and coordinates the FL process.  The
      Edge gateway relieves traffic on the network backhaul, as the
      end devices connect to the edge instead of the cloud.

   Cloud:  The cloud supports very large computational capabilities
      [1] and is geographically located far from the end devices.  It
      is accessible to the Edge gateway and remains agnostic to the
      number and type of participating end devices.  As a result, the
      cloud does not have an active role in the FL training process.

   Federated learning (FL):  FL is a distributed ML technique that
      utilizes a large number of end devices which train their ML
      models locally, without communicating with each other.  The
      locally trained models are dispatched to the Edge gateway,
      which aggregates the collected models into one global model.
      Subsequently, the global model is broadcast to the end devices
      so that the next training round can begin.  During the FL
      process, the end devices do not share data or any other
      information.

   Constrained Application Protocol (CoAP):  CoAP is a UDP-based
      communication protocol which supports lightweight communication
      between two entities [RFC 7252].  CoAP is well suited to
      devices with limited computational capabilities, as it does not
      require a full protocol stack to operate.  CoAP supports the
      following message formats: confirmable (CON) messages,
      non-confirmable (NON) messages, acknowledgement (ACK) reply
      messages and reset (RST) reply messages.  CON messages are
      reliable message requests, provided by marking a message as
      confirmable.  A confirmable message is retransmitted using a
      default timeout and exponential back-off between
      retransmissions, until the recipient sends an acknowledgement
      message (ACK) with the same Message ID.  When a recipient is
      not able to process a confirmable message, it replies with a
      reset message (RST) instead of an acknowledgement.  NON
      messages are message requests that do not require reliable
      transmission.  They are not acknowledged, but still carry a
      Message ID for duplicate detection.  When a recipient is not
      able to process a non-confirmable message, it may reply with a
      reset message (RST).
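   The CON retransmission behaviour described above can be summarized
   in a short sketch.  The following Python fragment is illustrative
   only: it uses the default transmission parameters of RFC 7252,
   Section 4.8 (ACK_TIMEOUT, ACK_RANDOM_FACTOR, MAX_RETRANSMIT), but
   omits CoAP header encoding and Message ID matching, which a real
   implementation would have to provide.

      import random
      import socket

      # Default transmission parameters from RFC 7252, Section 4.8.
      ACK_TIMEOUT = 2.0        # seconds
      ACK_RANDOM_FACTOR = 1.5
      MAX_RETRANSMIT = 4

      def send_confirmable(sock, datagram, addr):
          """Send a CON message and retransmit with exponential
          back-off until a reply (ACK or RST) arrives."""
          timeout = random.uniform(ACK_TIMEOUT,
                                   ACK_TIMEOUT * ACK_RANDOM_FACTOR)
          for _ in range(MAX_RETRANSMIT + 1):
              sock.sendto(datagram, addr)
              sock.settimeout(timeout)
              try:
                  reply, _ = sock.recvfrom(1152)
                  return reply      # ACK or RST from the recipient
              except socket.timeout:
                  timeout *= 2      # exponential back-off
          raise TimeoutError("no ACK/RST after MAX_RETRANSMIT tries")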
3. Edge computing architecture

   Figure 1 below depicts the IoT architecture we employ, in which
   the three main entities are the end devices, the edge gateway and
   the cloud server.  Below we describe the functionalities of each
   module and how each module interacts with the rest of the
   architecture:

   End devices:  End devices can be classified as constrained or
      non-constrained according to their processing capabilities.
      Previous work in [2] classifies end devices into the following
      categories:

      Class 0 (C0):  This class contains sensor-like devices.
         Although they may answer keep-alive signals and send basic
         indications, they most likely do not have the resources to
         securely communicate with the Internet directly (larger
         devices act as proxies, gateways or servers) and cannot be
         secured or managed comprehensively in the traditional sense.

      Class 1 (C1):  Such devices are quite constrained in code space
         and processing capabilities; they cannot easily talk to
         other Internet nodes or employ a full protocol stack.  They
         are therefore considered ideal for the Constrained
         Application Protocol (CoAP) over UDP.

      Class 2 (C2):  C2 devices are less constrained and capable of
         supporting most of the same protocol stacks as servers and
         laptop computers.

      Other (C3):  Devices with capabilities significantly beyond
         Class 2 are left uncategorized (Others).  They may still be
         constrained by a limited energy supply, but can largely use
         existing protocols unchanged.

      In this IoT architecture, cameras are treated as C1 devices and
      mobile phones as C2/Other devices.  Each device stores a local
      dataset independently of the others and has no access to the
      datasets of the other devices.  End devices are also
      responsible for training their local ML model and for reporting
      the trained model to the edge gateway for the aggregation
      process.

   Edge gateway:  The edge gateway is responsible for collecting the
      locally trained models from the end devices and for aggregating
      them into a global model.  Furthermore, the edge gateway
      dispatches the trained model to the cloud in order to make it
      available to developers.  To support these services, the edge
      gateway employs the following controller interfaces:

      Southbound interface:  The southbound interface handles the
         communication between the edge gateway and the end devices
         [5].  The southbound controller also performs the resource
         discovery, resource authentication, device configuration and
         global model dispatch tasks.  The resource discovery process
         detects and identifies the devices that participate in the
         FL training and establishes a communication link between the
         edge and each device.  The resource authentication process
         authenticates the end devices by matching each device's
         unique ID against a trusted ID list stored at the edge.  The
         device configuration task broadcasts the ML model
         hyperparameters to the participating end devices.  Finally,
         the global model dispatch operation broadcasts the
         aggregated global model to the trusted connected devices.

      Central controller:  The central controller is the core
         component of network artificial intelligence, sometimes
         called the "network brain" [4].  It carries out the FL
         aggregation process and is responsible for stopping the FL
         process when the model converges.  It also performs the data
         sharing, global model training, global model aggregation and
         device scheduling functionalities (an illustrative sketch of
         the aggregation step is given after Figure 1).

      Northbound interface:  The northbound interface is provided by
         a gateway component to a remote network [5], e.g. a cloud,
         home or enterprise network.  The northbound interface is a
         data plane interface which manages the communication of the
         edge gateway with the cloud.  Under this premise, the
         northbound interface is responsible for the model sharing
         and model publish functionalities.  Model sharing is the
         function through which the edge is authenticated by the
         cloud as a trusted party and thus gains the right to upload
         the trained FL model to the cloud.  Model publish is the
         process of uploading the trained model to the cloud so as to
         make it available to developers.

   Cloud server:  The cloud server may provide virtually unlimited
      storage and processing power [3].  The reliance of IoT on
      back-end cloud computing brings additional advantages such as
      flexibility and efficiency.  The cloud hosts the trained FL
      model, which can then be used by developers, e.g. for AR
      applications.

   FL model:  The FL model should operate separately from the dataset
      used for the training process.  In this sense, the ML model
      architecture and the dataset type may change without affecting
      the overall FL training process.  This interoperability is
      ensured because we design the FL process independently of the
      web protocol; the end device-edge communication is therefore
      not affected by changes in the IoT architecture.  Furthermore,
      the datasets of each device are stored locally and interact
      only with the local FL model; the edge has no access to them.
      As a result, the functionality of the FL training is affected
      neither by the dataset type or size nor by the FL model
      architecture.

   +------------------------------------------------------------------+
   |                                                                  |
   |   +------------------------+                                     |
   |   |      End devices       |                                     |
   |   | * Data collection      |                                     |
   |   | * Reporting            |                                     |
   |   | * Local model training |                                     |
   |   +------------------------+                                     |
   |               ^                                                  |
   |               | FL training                                      |
   |               v                                                  |
   |  +--------------------------------------------------------------+|
   |  |                         Edge gateway                         ||
   |  |                                                              ||
   |  | +------------------+ +----------------+ +-----------------+  ||
   |  | |    Southbound    | |    Central     | |   Northbound    |  ||
   |  | |    interface     | |   controller   | |   interface     |  ||
   |  | |                  | |                | |                 |  ||
   |  | | * Resource       | | * Device       | | * Model sharing |  ||
   |  | |   discovery      | |   scheduling   | | * Model publish |  ||
   |  | | * Resource       | | * Global model | +-----------------+  ||
   |  | |   authentication | |   aggregation  |                      ||
   |  | | * Device         | +----------------+                      ||
   |  | |   configuration  |                                         ||
   |  | | * Global model   |                                         ||
   |  | |   dispatch       |                                         ||
   |  | +------------------+                                         ||
   |  |                                                              ||
   |  +--------------------------------------------------------------+|
   |               |                                                  |
   |               | Model to cloud                                   |
   |               v                                                  |
   |   +---------------+                                              |
   |   | Cloud server  |                                              |
   |   |               |                                              |
   |   | * Store model |                                              |
   |   +---------------+                                              |
   |                                                                  |
   +------------------------------------------------------------------+

                      Figure 1: Protocol architecture
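   The draft does not mandate a particular aggregation rule for the
   central controller.  As a concrete illustration, the following
   Python sketch implements a FedAvg-style sample-weighted average, a
   common choice for the global model aggregation step; the function
   name and the list-of-arrays model representation are assumptions
   made for this example only.

      import numpy as np

      def federated_average(local_models, num_samples):
          """Aggregate locally trained models into one global model.

          local_models: one list of layer tensors (numpy arrays) per
              device; num_samples[i]: local dataset size of device i.
          """
          total = float(sum(num_samples))
          global_model = []
          for layers in zip(*local_models):
              # Weight each device's layer by its share of the data.
              avg = sum((n / total) * w
                        for n, w in zip(num_samples, layers))
              global_model.append(avg)
          return global_model

      # Example: two devices, each reporting a two-layer model.
      dev_a = [np.ones((4, 4)), np.zeros(4)]
      dev_b = [np.zeros((4, 4)), np.ones(4)]
      global_model = federated_average([dev_a, dev_b], [100, 300])

4. Protocol stages

   In this section we describe the stages which the edge computing
   protocol uses to perform the FL process.

   The three stages run strictly in sequence, with the FL training
   stage repeating for a predefined number of rounds.  The following
   Python sketch captures this sequencing; the enum and function
   names are illustrative, not part of the protocol.

      from enum import Enum, auto

      class Stage(Enum):
          INITIAL_CONFIGURATION = auto()
          FL_TRAINING = auto()
          CLOUD_UPDATE = auto()
          DONE = auto()

      def next_stage(stage, rounds_left=0, converged=False):
          """Advance the edge gateway through the protocol stages."""
          if stage is Stage.INITIAL_CONFIGURATION:
              return Stage.FL_TRAINING
          if stage is Stage.FL_TRAINING:
              # Training repeats until the round budget is exhausted;
              # the central controller may also stop on convergence.
              if rounds_left > 0 and not converged:
                  return Stage.FL_TRAINING
              return Stage.CLOUD_UPDATE
          return Stage.DONE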
4.1. Initial configuration

   Figure 2 below depicts the initial configuration stage of the edge
   IoT protocol using CoAP.  The initial configuration stage provides
   the functionalities needed to establish the IoT-edge gateway
   communication link and to identify the end devices that will
   participate in the training process.  These functionalities are
   the following:

   1. Resource discovery:  The end devices are discovered by the edge
      and employ CoAP to inform the edge gateway about their
      computational capabilities.  More specifically, each end device
      sends a NON message to the edge containing the resource type of
      the device, i.e. C0, C1, C2 or C3.  The NON message type is not
      confirmable; the edge therefore informs a device with an RST
      message only in case of a transmission error.  Subsequently,
      the edge decides which device types may participate in the
      training process and sends back a NON message containing the
      resource discovery decision to the corresponding devices.

   2. Resource authentication:  The end devices are authenticated by
      the edge as trusted parties and are then allowed to participate
      in the training process; unauthenticated devices cannot
      participate.  To this end, the previously discovered end
      devices send a NON message to the edge containing the ID of the
      transmitting device.  The edge informs a device, by dispatching
      an RST message, if it failed to receive the corresponding ID.
      Once the edge has collected the IDs of all devices, it performs
      the device authentication process, which designates the end
      devices that will participate in the FL process.  Finally, each
      device is informed about the edge's decision through a NON
      message that contains the authentication outcome.  Only
      authenticated end devices are eligible to participate in the FL
      training.

   3. Device scheduling:  The edge gateway selects the subset of
      authenticated end devices that will participate in the training
      and dispatches the necessary messages to inform them about its
      decision.  Under this premise, it dispatches a NON message
      containing this information to each of the authenticated
      devices.  A device sends back an RST response in case of
      transmission failure, causing the edge to retransmit the
      message.  Upon successful transmission of the original NON
      message, the eligible devices proceed to the device
      configuration phase.

   4. Device configuration:  The edge gateway employs CoAP to
      broadcast the FL model hyperparameters to the end devices in
      order to properly configure their local models.  To this end,
      the end devices dispatch a NON message informing the edge about
      their computational capabilities.  The edge sends back an RST
      response in case of a transmission error, and no message upon
      successful delivery.  Subsequently, the edge processes the
      obtained information and designates the model architecture and
      ML parameters that will be used for the FL process.  It then
      broadcasts the related decisions back to the end devices
      through a NON message, and all eligible devices enter the
      training phase.

   After the initial configuration process completes, the edge IoT
   protocol continues to the FL training stage; a device-side sketch
   of the four steps above is given after Figure 2.

   +-------------+                              +--------------+
   | End devices |                              | Edge gateway |
   +-------------+                              +--------------+
          |  NON message {Resource type}                |
          |-------------------------------------------->|
          |                                             |
          |                          +------------------+
          |                          |Resource discovery|
          |                          +------------------+
          |                                             |
          |  NON message {discovery}                    |
          |<--------------------------------------------|
          |                                             |
          |  NON message {Device ID}                    |
          |-------------------------------------------->|
          |                                             |
          |                     +-----------------------+
          |                     |Resource authentication|
          |                     +-----------------------+
          |                                             |
          |  NON message {Authentication}               |
          |<--------------------------------------------|
          |                                             |
          |                           +-----------------+
          |                           |Device scheduling|
          |                           +-----------------+
          |                                             |
          |  NON message {Scheduling info.}             |
          |<--------------------------------------------|
          |                                             |
          |  NON message {Avl. resources}               |
          |-------------------------------------------->|
          |                                             |
          |                            +----------------+
          |                            |FL configuration|
          |                            +----------------+
          |                                             |
          |  NON message {Hyperparameters}              |
          |<--------------------------------------------|
          |                                             |

             Figure 2: Protocol initial configuration stage
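   The following Python sketch walks through the four steps above
   from the device side.  The send_non()/recv_non() callables stand
   in for a CoAP NON exchange, and all payload field names are
   assumptions made for illustration; the draft does not define a
   payload encoding.

      def initial_configuration(device, send_non, recv_non):
          """Device-side view of the initial configuration stage.

          Returns the received hyperparameters if the device is
          discovered, authenticated and scheduled, otherwise None.
          """
          # 1. Resource discovery: advertise the device class (C0..C3).
          send_non({"resource_type": device["device_class"]})
          if not recv_non().get("discovered"):
              return None
          # 2. Resource authentication: present the unique device ID.
          send_non({"device_id": device["device_id"]})
          if not recv_non().get("authenticated"):
              return None
          # 3. Device scheduling: learn whether this device was
          #    selected for the upcoming FL training.
          if not recv_non().get("scheduled"):
              return None
          # 4. Device configuration: report available resources and
          #    receive the model architecture and hyperparameters.
          send_non({"available_resources": device["resources"]})
          return recv_non().get("hyperparameters")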
4.2. FL training

   The FL training stage is the stage in which the actual FL takes
   place.  Figure 3 depicts the functionalities we employ to support
   the FL process.  These functionalities are the following:

   1. Local model training:  The end devices that are eligible to
      participate in the FL training send a NON message to request
      the ML model from the edge.  The edge responds with an RST
      message if necessary, to trigger a retransmission of the
      original NON message.  Subsequently, the edge dispatches the
      global model to the end devices, again using the NON message
      format.  A device responds with an RST message if the
      transmission resulted in errors, and the edge then retransmits
      the NON message to the corresponding device.  Afterwards, each
      device proceeds to locally train the model using its local
      dataset.

   2. Device reporting:  Once a device completes the local model
      training, it dispatches its model to the edge gateway through
      the device reporting process.  Due to the constrained nature of
      the participating devices, the end device-edge communication is
      implemented using the NON message format.  To this end, the
      devices dispatch their IDs and their locally trained models to
      the edge via NON messages, which are not followed by an ACK
      from the server side.  Instead, if the edge fails to obtain a
      local model, it notifies the corresponding end device with an
      RST reply, triggering a retransmission of the original NON
      message to the edge.  After the edge has obtained every local
      model, it conducts the global model aggregation process and
      produces one global model, which is broadcast back to the
      devices.

   The FL training process is repeated until the predefined number of
   FL rounds is reached.  After the FL training completes, the edge
   computing protocol enters the cloud update stage.  An edge-side
   sketch of the round loop is given after Figure 3.

   +-------------+                              +--------------+
   | End devices |                              | Edge gateway |
   +-------------+                              +--------------+
          |  NON message {Model request}                |
          |-------------------------------------------->|
          |                                             |
          |  NON message {Global model}                 |
          |<--------------------------------------------|
   +--------------+                                     |
   | Local model  |                                     |
   |   training   |                                     |
   +--------------+                                     |
          |  NON message {Local model}                  |
          |-------------------------------------------->|
          |                                             |
          |                    +------------------------+
          |                    |Global model aggregation|
          |                    +------------------------+
          |                                             |
          |  NON message {Model request}                |
          |-------------------------------------------->|
          |                                             |
          |  NON message {Global model}                 |
          |<--------------------------------------------|
   +--------------+                                     |
   | Local model  |                                     |
   |   training   |                                     |
   +--------------+                                     |
          |                                             |

                   Figure 3: Protocol training stage
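   Viewed from the edge gateway, the stage above reduces to a simple
   round loop.  The Python sketch below assumes the
   federated_average() aggregation sketched in Section 3, along with
   hypothetical broadcast()/collect() helpers wrapping the NON
   exchanges of Figure 3.

      def fl_training(devices, global_model, num_rounds,
                      broadcast, collect):
          """Edge-side FL training loop (Section 4.2)."""
          for _ in range(num_rounds):
              # Dispatch the current global model to every eligible
              # device; each device trains on its private dataset.
              broadcast(global_model, devices)
              # Collect one locally trained model (plus local dataset
              # size) per device, then aggregate into a new global
              # model, e.g. with federated_average() from Section 3.
              local_models, num_samples = collect(devices)
              global_model = federated_average(local_models,
                                               num_samples)
          return global_model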
4.3. Cloud update

   Figure 4 below depicts the cloud update stage of the edge
   computing protocol, which is invoked after the FL training
   completes.  The cloud update consists of the following
   functionalities:

   1. Model sharing:  The edge gateway informs the cloud of its
      intention to upload the trained FL model.  Subsequently, the
      cloud authenticates the edge and decides whether it can be
      considered a trusted party.  When the model sharing process
      completes successfully, the edge is authenticated and can
      proceed to the model publish functionality.  Since no IoT
      devices participate in this communication, we use the more
      reliable CON message format instead of relying on NON messages.
      To this end, the edge dispatches a CON message to the cloud
      that contains its ID, informing the cloud that the FL process
      has been completed.  The cloud in turn responds with an ACK or
      RST reply, indicating whether the initial request was
      successfully delivered.  Subsequently, the cloud performs the
      edge authorization procedure based on the received ID and sends
      a CON message to the edge containing the authorization result.

   2. Model publish:  The edge sends the trained model and the model
      version to the cloud through a CON message, and then waits for
      an ACK or RST reply depending on the outcome of the
      transmission.  If the model is transmitted without errors, the
      cloud responds with an ACK message.  Otherwise, transmission
      errors result in an RST reply from the cloud, which triggers a
      retransmission from the edge.  When the cloud successfully
      obtains the trained ML model, it stores the model and makes it
      available to the users.

   +--------------+                                 +-------+
   | Edge gateway |                                 | Cloud |
   +--------------+                                 +-------+
          |  CON message {Edge ID}                      |
          |-------------------------------------------->|
          |                                             |
          |  ACK/RST reply                              |
          |<--------------------------------------------|
          |                            +--------------+ |
          |                            |Authentication| |
          |                            +--------------+ |
          |  CON message {authorization}                |
          |<--------------------------------------------|
          |                                             |
          |  ACK/RST reply                              |
          |-------------------------------------------->|
          |                                             |
          |  CON message {Model, version}               |
          |-------------------------------------------->|
          |                                             |
          |  ACK/RST reply                              |
          |<--------------------------------------------|
          |                             +-------------+ |
          |                             | Model store | |
          |                             +-------------+ |
          |                                             |

                 Figure 4: Protocol cloud update stage

5. Security Considerations

   The FL training process is considered a difficult task, as the
   achievable accuracy of the model is affected by the
   characteristics of the local datasets.  Local datasets consist of
   the data collected by the end devices, stored locally on each
   device.  To ensure data privacy, we make sure that no data
   exchange takes place between the end devices, or between the end
   devices and the Edge gateway.  In this sense, the Edge gateway
   aggregates the local models without using any local dataset
   information, and the data privacy of each end device is preserved.

   Regarding data security, the end device-Edge gateway communication
   can be encrypted using any existing encryption technique, such as
   AES; an illustrative sketch is given at the end of this section.
   Such an encryption mechanism can be applied either for data
   sharing between the end devices and the edge or for encrypting the
   messages exchanged between those entities, similarly to [6].  The
   encryption mechanism can be applied directly to the transmitted
   CoAP messages, provided that a decryption process is deployed on
   the receiver side.  Nonetheless, the implementation and deployment
   of such a technique is outside the scope of this work.
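   As an illustration of the message encryption option mentioned
   above, the following Python sketch encrypts a CoAP payload with
   AES in GCM mode using the "cryptography" package.  The pre-shared
   key and the helper names are assumptions made for the example; key
   distribution and cipher negotiation are deployment decisions
   outside the scope of this draft.

      import os
      from cryptography.hazmat.primitives.ciphers.aead import AESGCM

      def encrypt_payload(key, plaintext):
          """Encrypt a CoAP payload before transmission."""
          nonce = os.urandom(12)  # 96-bit nonce; never reuse per key
          return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

      def decrypt_payload(key, blob):
          """Decrypt a received payload (raises on tampering)."""
          nonce, ciphertext = blob[:12], blob[12:]
          return AESGCM(key).decrypt(nonce, ciphertext, None)

      key = AESGCM.generate_key(bit_length=128)  # pre-shared key
      blob = encrypt_payload(key, b"serialized local model")
      assert decrypt_payload(key, blob) == b"serialized local model"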
6. IANA Considerations

   There are no IANA considerations related to this document.

7. Conclusions

   In this draft we present an FL protocol suitable for distributed
   ML in an IoT network.  We provide a functional architecture that
   consists of a number of end devices, an edge gateway and a cloud
   server.  To support the FL training process, we specify three
   distinct protocol stages that coordinate the distributed learning
   process: the initial configuration, FL training and cloud update
   stages, each of which provides the necessary functionalities to
   the FL process.  The FL training process is conducted by
   leveraging the CoAP communication protocol and takes place between
   the end devices and the edge server.  After the training finishes,
   the trained FL model is stored in the cloud and made accessible to
   the users.

8. References

8.1. Normative References

   [1]  "IoT Edge Computing Challenges and Functions", IETF draft,
        https://tools.ietf.org/html/draft-hong-t2trg-iot-edge-computing-01,
        Jul. 2020.

   [2]  F. Pisani, F. M. C. de Oliveira, E. S. Gama, R. Immich,
        L. F. Bittencourt and E. Borin, "Fog Computing on Constrained
        Devices: Paving the Way for the Future IoT", arXiv:
        https://arxiv.org/abs/2002.05300, Feb. 2020.

   [3]  "Distributed Fault Management for IoT Networks", IETF draft,
        https://tools.ietf.org/html/draft-hongcs-t2trg-dfm-00,
        Dec. 2018.

   [4]  "IoT Edge Computing: Initiatives, Projects and Products",
        IETF draft,
        https://tools.ietf.org/html/draft-defoy-t2trg-iot-edge-computing-background-00,
        May 2020.

   [5]  IETF IoT edge computing draft,
        https://www.potaroo.net/ietf/idref/draft-hong-t2trg-iot-edge-computing/#ref-RFC6291

   [6]  M. A. Rahman, M. S. Hossain, M. S. Islam, N. A. Alrajeh and
        G. Muhammad, "Secure and Provenance Enhanced Internet of
        Health Things Framework: A Blockchain Managed Federated
        Learning Approach", IEEE Access, vol. 8, pp. 205071-205087,
        Nov. 2020.

8.2. Non-normative References

   [RFC 7252]  "The Constrained Application Protocol (CoAP)",
        https://tools.ietf.org/html/rfc7252, Jun. 2014.

9. Acknowledgments

   Copyright (c) 2021 IETF Trust and the persons identified as
   authors of the code.  All rights reserved.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions
   are met:

   o  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

   o  Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.

   o  Neither the name of Internet Society, IETF or IETF Trust, nor
      the names of specific contributors, may be used to endorse or
      promote products derived from this software without specific
      prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
   CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
   INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
   MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
   BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
   TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
   ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
   TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
   THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
   SUCH DAMAGE.

Authors' Addresses

   Fotis Foukalas
   Cognitive Innovations
   Kifisias 125-127, 11524, Athens, Greece
   Email: fotis@cogninn.com

   Athanasios Tziouvaras
   Cognitive Innovations
   Kifisias 125-127, 11524, Athens, Greece
   Email: thanasis@cogninn.com