Network Working Group                                         M. McBride
Internet-Draft                                                 Futurewei
Intended status: Standards Track                             D. Kutscher
Expires: January 11, 2021                               Emden University
                                                             E. Schooler
                                                                   Intel
                                                           CJ. Bernardos
                                                                    UC3M
                                                                D. Lopez
                                                          Telefonica I+D
                                                           July 10, 2020


                    Data Discovery Problem Statement
           draft-mcbride-data-discovery-problem-statement-00

Abstract

   If data is the new oil of the 21st century, then we need a
   standardized way of locating, capturing, classifying and transforming
   this raw data to generate insights and recommendations. Data, like
   oil, needs to be discovered and captured in order to be refined and
   valuable. While the topic of data discovery can be far reaching,
   this document focuses on the problem of actually locating data,
   throughout a network of data servers, in a standardized way.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 11, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the Trust
   Legal Provisions and are provided without warranty as described in
   the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Requirements Language
   2. Problem Scope
   3. Existing Solutions
     3.1. Proprietary
     3.2. Opensource
   4. Use Cases
   5. IANA Considerations
   6. Security Considerations
   7. Acknowledgement
   8. Normative References
   Authors' Addresses

1. Introduction

   There are myriad proprietary and standardized ways of discovering
   networking devices and hosts. There are many solutions for
   discovering data within a database. However, the ways of discovering
   data that may be stored throughout an environment of networking
   devices are proprietary and non-standardized. We can discover
   information about the devices but can't locate and capture stored
   data in a standard way. With more networking devices storing
   collected data, there needs to be a standard way of discovering the
   specific data needed amongst a potentially huge lake of databases.

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2. Problem Scope

   Data may be cached, copied and/or stored at multiple locations in the
   network en route to its final destination. With an increasing
   percentage of devices connecting to the Internet being mobile,
   support for in-the-network caching and replication is critical for
   continuous data availability. There are data repositories throughout
   a modern network, and there needs to be a standardized way of
   locating the repositories and discovering the desired data within.

   There are many types of relational (SQL) and non-relational (NoSQL)
   data classification solutions. Existing database classification
   engines allow for scanning of a database. The problem we define here
   is different: the lack of a standards based solution to discover
   first where the databases exist throughout a network and then where
   specific data objects are located.

   Data discovery is likely to look different depending on whether we
   are seeking global or local discovery. Data discovery may be
   location-driven. A standard to find data may want to search for it
   in a more proximal fashion, i.e., find the data that matches the
   search and is nearest to a location.

   So much data is being created, processed, and migrated that only
   some of it is stored permanently in a database. Less permanent data
   will reside for a time in memory so that it can be discovered and
   accessed quickly; such data may be more dynamic and short lived.
   Although we refer to the data store as a database, it may reside
   entirely in memory, and/or it may be stored in some other non-SQL
   indexing technology.

   Each database essentially provides a directory service for the data
   within it, and that directory service can be viewed as metadata.
   There is a need to understand where the databases, data lakes, and
   pockets of data reside.
   The location of each data store is the first level discovery
   problem, and the details of the database's directory are the second
   level discovery problem.

   Publish and subscribe approaches allow nodes to express their
   interest in specific pieces of data without knowing the location of
   the data. Some discoverable sources of data might not produce the
   specific data desired by the subscribers, or might not produce it
   with the desired format or frequency. The subscriber will want to
   find the publishers that send data with the desired characteristics.

3. Existing Solutions

3.1. Proprietary

   There are many existing proprietary database discovery solutions we
   can evaluate in order to understand what aspects we need to
   standardize. Examples include IBM Cognos, Wipro Data Discovery
   Platform (DDP), and Amazon Macie, among many others. Macie, for
   instance, is a data security and data privacy service that uses
   machine learning and pattern matching to discover and protect data in
   AWS. The service allows you to define data types in order to
   discover and protect the data that may be unique to a use case.

3.2. Opensource

   There are open source data solutions such as ScienceBase
   (https://sciencebase.usgs.gov/). The U.S. Geological Survey (USGS)
   is developing ScienceBase, an open source, collaborative, scientific
   data and information management platform. It provides current
   documentation about its structure, information model, services,
   directory and repository. sbtools is an R package that provides a
   command line driven interface for finding data within ScienceBase.

   Another solution is the InterPlanetary File System (ipfs.io). IPFS
   is a distributed system for storing and accessing files, websites,
   applications, and data. IPFS is a peer-to-peer (p2p) storage
   network.
   Content is accessible through peers, located anywhere in the world,
   that might relay information, store it, or do both. IPFS finds what
   you ask for via its content address rather than its location. There
   are three fundamental principles to understanding IPFS:

   o  Unique identification via content addressing

   o  Content linking via directed acyclic graphs (DAGs)

   o  Content discovery via distributed hash tables (DHTs)

4. Use Cases

   Here are some of the use cases which will benefit from standards
   based data discovery solutions:

   o  We need a standards based solution to discover the increasing
      amount of data being stored in various locations throughout a
      network, including at the edge. We need a standard protocol set
      for doing this data discovery, on the device or infrastructure
      edge, in order to meet the requirements of many use cases. We
      will have terabytes of data on the edge and need a way to identify
      its existence and find the desired data.
      [I-D.mcbride-edge-data-discovery-overview] focuses on this
      aspect of data discovery.

   o  We need a secure standards based solution for data discovery.
      Several of the proprietary secure data discovery solutions use
      machine learning and pattern matching to discover and protect the
      data. We need to incorporate existing, or new, IETF security
      solutions when discovering data.

   o  We need a standards based solution for name-based data discovery.
      An Information Centric Networking (ICN) enabled network routes
      data by name (vs address), caches content natively in the network,
      and employs data-centric security. Data discovery may require
      that data be associated with a name or names, a series of
      descriptive attributes, and/or a unique identifier. NDN (Named
      Data Networking) can be applied to edge data discovery to make it
      much easier to extract data and metadata by naming it.
      If data were named, we would be able to discover the appropriate
      data simply by its name.

   o  We need a standards based way of discovering data in mobile
      wireless networks. Data could reside on the eNodeB or other
      wireless access infrastructure equipment in addition to residing
      on servers in the packet core.

5. IANA Considerations

   N/A

6. Security Considerations

   Data and metadata discovery are both a function of who asks for the
   data and in what context. The policies attached to the database and
   the metadata are going to dictate which view of the data the system
   returns to the requester.

7. Acknowledgement

8. Normative References

   [I-D.mcbride-edge-data-discovery-overview]
              McBride, M., Kutscher, D., Schooler, E., and C. Bernardos,
              "Edge Data Discovery for COIN",
              draft-mcbride-edge-data-discovery-overview-03 (work in
              progress), January 2020.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

Authors' Addresses

   Mike McBride
   Futurewei

   Email: michael.mcbride@futurewei.com


   Dirk Kutscher
   Emden University

   Email: ietf@dkutscher.net


   Eve Schooler
   Intel

   Email: eve.m.schooler@intel.com
   URI:   http://www.eveschooler.com


   Carlos J. Bernardos
   Universidad Carlos III de Madrid
   Av. Universidad, 30
   Leganes, Madrid 28911
   Spain

   Phone: +34 91624 6236
   Email: cjbc@it.uc3m.es
   URI:   http://www.it.uc3m.es/cjbc/


   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid 28006
   Spain

   Email: diego.r.lopez@telefonica.com