idnits 2.17.1 

draft-mcbride-data-discovery-problem-statement-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (July 10, 2020) is 1386 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-05) exists of
     draft-mcbride-edge-data-discovery-overview-03


     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Working Group                                         M. McBride
3	Internet-Draft                                                 Futurewei
4	Intended status: Standards Track                             D. Kutscher
5	Expires: January 11, 2021                               Emden University
6	                                                             E. Schooler
7	                                                                   Intel
8	                                                           CJ. Bernardos
9	                                                                    UC3M
10	                                                                D. Lopez
11	                                                          Telefonica I+D
12	                                                           July 10, 2020

14	                    Data Discovery Problem Statement
15	           draft-mcbride-data-discovery-problem-statement-00

17	Abstract

19	   If data is the new oil of the 21st century, then we need a
20	   standardized way of locating, capturing, classifying and transforming
21	   this raw data to generate insights and recommendations.  Data, like
22	   oil, needs to be discovered and captured before it can be refined and
23	   made valuable.  While the topic of data discovery can be far-reaching,
24	   this document focuses on the problem of actually locating data,
25	   throughout a network of data servers, in a standardized way.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at https://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on January 11, 2021.

44	Copyright Notice

46	   Copyright (c) 2020 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (https://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	     1.1.  Requirements Language
63	   2.  Problem Scope
64	   3.  Existing Solutions
65	     3.1.  Proprietary
66	     3.2.  Opensource
67	   4.  Use Cases
68	   5.  IANA Considerations
69	   6.  Security Considerations
70	   7.  Acknowledgement
71	   8.  Normative References
72	   Authors' Addresses

74	1.  Introduction

76	   There are myriad proprietary and standardized ways of discovering
77	   networking devices and hosts, and many solutions for discovering
78	   data within a database.  There are, however, only proprietary,
79	   non-standardized ways of discovering the data that may be stored
80	   throughout an environment of networking devices.  We can discover
81	   information about the devices but cannot locate and capture stored
82	   data in a standard way.  As more networking devices store
83	   collected data, there needs to be a standard way of discovering the
84	   specific data needed among a potentially huge lake of databases.

86	1.1.  Requirements Language

88	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
89	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
90	   document are to be interpreted as described in RFC 2119 [RFC2119].

92	2.  Problem Scope

94	   Data may be cached, copied and/or stored at multiple locations in the
95	   network en route to its final destination.  With an increasing
96	   percentage of devices connecting to the Internet being mobile,
97	   support for in-the-network caching and replication is critical for
98	   continuous data availability.  There are data repositories throughout
99	   a modern network and there needs to be a standardized way of locating
100	   the repositories and discovering the desired data within them.

102	   There are many types of relational (SQL) and non-relational (NoSQL)
103	   data classification solutions.  Existing database classification
104	   engines allow for scanning of a database.  The problem we are
105	   defining, however, is the lack of a standards-based solution to
106	   discover first where the databases exist throughout a network and
107	   then where specific data objects are located.

109	   Data discovery is likely to look different depending on whether we
110	   are seeking global or local discovery.  Data discovery may be
111	   location-driven: a standard way of finding data may need to search
112	   in a more proximal fashion, i.e., find the data that matches the
113	   search and is nearest to a given location.
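
   As a minimal sketch of location-driven discovery, assume each
   candidate data store advertises a planar coordinate and a set of
   keywords.  The DataStore type, its fields, and find_nearest() are
   hypothetical names used only for illustration; nothing here is
   defined by this document.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    """Hypothetical record advertised by a data store (illustrative only)."""
    name: str
    x: float              # assumed planar coordinates, e.g. kilometres
    y: float
    keywords: frozenset   # assumed searchable description of the data


def find_nearest(stores, keyword, x, y):
    """Return the matching store closest to (x, y), or None if none match."""
    matches = [s for s in stores if keyword in s.keywords]
    if not matches:
        return None
    return min(matches, key=lambda s: math.hypot(s.x - x, s.y - y))


stores = [
    DataStore("edge-a", 0.0, 1.0, frozenset({"temperature"})),
    DataStore("edge-b", 5.0, 5.0, frozenset({"temperature", "humidity"})),
    DataStore("core-c", 0.5, 0.5, frozenset({"humidity"})),
]

nearest = find_nearest(stores, "temperature", 0.0, 0.0)
print(nearest.name)  # prints: edge-a
```

   A real mechanism would likely replace the Euclidean distance with a
   network metric such as hop count or latency; the selection logic
   would stay the same.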

115	   So much data is being created, processed, and migrated that only
116	   some of it may be stored more permanently in a database.  Other,
117	   less permanent data resides for a time in memory so that it may be
118	   discovered and accessed quickly; it may be more dynamic and
119	   short-lived.  Although we refer to the data store as a database, it
120	   may reside entirely in memory and/or be stored using some other
121	   non-SQL indexing technology.

123	   Each database essentially provides a directory service for the data
124	   within it, and that directory service can be viewed as metadata.
125	   There is a need to understand where the databases, data lakes, and
126	   pockets of data reside.  The location of each data store is the
127	   first-level discovery problem, and the details of each database's
128	   directory are the second-level discovery problem.
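
   The two discovery levels can be sketched as follows.  This is only
   an illustration under stated assumptions: REGISTRY, the host names,
   the object names, and all three functions are hypothetical, and an
   in-memory dictionary stands in for whatever protocol would actually
   carry the queries.

```python
# Hypothetical registry: data store location -> that store's directory.
# Each directory maps an object name to metadata describing the object.
REGISTRY = {
    "store1.example.net": {"sensor/42/temp": {"format": "json"}},
    "store2.example.net": {
        "video/cam7": {"format": "h264"},
        "sensor/42/temp": {"format": "csv"},
    },
}


def discover_stores(registry):
    """First-level discovery: where do data stores exist?"""
    return sorted(registry)


def discover_object(registry, store, name):
    """Second-level discovery: consult one store's directory for `name`."""
    return registry[store].get(name)


def locate(registry, name):
    """Combine both levels: list (store, metadata) pairs that hold `name`."""
    results = []
    for store in discover_stores(registry):
        meta = discover_object(registry, store, name)
        if meta is not None:
            results.append((store, meta))
    return results


print(locate(REGISTRY, "video/cam7"))
```

   Keeping the two levels as separate functions mirrors the split
   described above: the first could be answered by the network, the
   second by each data store.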

130	   Publish and subscribe approaches allow nodes to express their
131	   interest in specific pieces of data without knowing the location of
132	   the data.  Some of the data sources to be discovered might not
133	   produce the specific data desired by the subscribers (or might not
134	   produce data with a specific format or frequency).  The subscriber
135	   will want to find the publishers whose data has the desired
136	   characteristics.
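
   A minimal sketch of matching subscribers to publishers by data
   characteristics is shown here.  The Publisher type, its fields
   (topic, data_format, rate_hz), and matching_publishers() are
   assumptions made for illustration and do not correspond to any
   existing publish and subscribe API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publisher:
    """Hypothetical advertisement of what a publisher produces."""
    name: str
    topic: str
    data_format: str
    rate_hz: float   # how often the publisher emits data


def matching_publishers(publishers, topic, data_format=None, min_rate_hz=0.0):
    """Return publishers on `topic` that satisfy the optional constraints."""
    return [
        p for p in publishers
        if p.topic == topic
        and (data_format is None or p.data_format == data_format)
        and p.rate_hz >= min_rate_hz
    ]


publishers = [
    Publisher("cam-1", "video", "h264", 30.0),
    Publisher("therm-1", "temperature", "json", 1.0),
    Publisher("therm-2", "temperature", "csv", 0.1),
]

for p in matching_publishers(publishers, "temperature", data_format="json"):
    print(p.name)  # prints: therm-1
```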

138	3.  Existing Solutions

140	3.1.  Proprietary

142	   There are many existing proprietary database discovery solutions we
143	   can evaluate in order to understand what aspects we need to
144	   standardize.  Examples include IBM Cognos, Wipro Data Discovery
145	   Platform (DDP), and Amazon Macie, among many others.  Macie, for
146	   instance, is a data security and data privacy service that uses
147	   machine learning and pattern matching to discover and protect data in
148	   AWS.  The service allows you to define data types in order to
149	   discover and protect the data that may be unique to a use case.

151	3.2.  Opensource

153	   There are open source data solutions such as ScienceBase
154	   (https://sciencebase.usgs.gov/).  The U.S. Geological Survey (USGS)
155	   is developing ScienceBase, an open source, collaborative, scientific
156	   data and information management platform.  It provides current
157	   documentation about its structure, information model, services,
158	   directory and repository.  sbtools provides an R interface to
159	   ScienceBase: a command-line-driven program used to find data within
160	   the platform.

162	   Another solution is the InterPlanetary File System (ipfs.io).  IPFS
163	   is a distributed system for storing and accessing files, websites,
164	   applications, and data.  IPFS is a peer-to-peer (p2p) storage
165	   network.  Content is accessible through peers, located anywhere in
166	   the world, that might relay information, store it, or do both.  IPFS
167	   finds what you ask for by its content address rather than by its
168	   location.  There are three fundamental principles to understanding
169	   IPFS:

171	   o  Unique identification via content addressing

173	   o  Content linking via directed acyclic graphs (DAGs)

175	   o  Content discovery via distributed hash tables (DHTs)
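
   The first principle, content addressing, can be illustrated with a
   simplified sketch.  It assumes a plain SHA-256 hex digest as the
   address and an in-memory dictionary as the store.  Real IPFS uses
   content identifiers (CIDs) and locates content through a DHT, so
   the ContentStore class below is a hypothetical stand-in, not the
   IPFS API.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the address is derived from the bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        """Store `data` and return its address (SHA-256 hex digest)."""
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Fetch by address and verify the content still matches it."""
        data = self._blocks[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
addr = store.put(b"temperature=21.5")
# Identical content always maps to the same address, wherever it is stored.
assert store.put(b"temperature=21.5") == addr
print(store.get(addr))
```

   Because the address depends only on the content, a requester can
   verify what it receives without trusting the peer that served it.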

177	4.  Use Cases

179	   Here are some of the use cases that will benefit from
180	   standards-based data discovery solutions:

182	   o  We need a standards-based solution to discover the increasing
183	      amount of data being stored in various locations throughout a
184	      network, including at the edge.  We need a standard protocol set
185	      for doing this data discovery, on the device or infrastructure
186	      edge, in order to meet the requirements of many use cases.  We
187	      will have terabytes of data on the edge and need a way to identify
188	      its existence and find the desired data.
189	      [I-D.mcbride-edge-data-discovery-overview] focuses on this
190	      aspect of data discovery.

192	   o  We need a secure standards-based solution for data discovery.
193	      Several of the proprietary secure data discovery solutions use
194	      machine learning and pattern matching to discover and protect the
195	      data.  We need to incorporate existing, or new, IETF security
196	      solutions when discovering data.

198	   o  We need a standards-based way of using name-based solutions
199	      for data discovery.  An Information Centric Networking (ICN)
200	      enabled network routes data by name (vs address), caches content
201	      natively in the network, and employs data-centric security.  Data
202	      discovery may require that data be associated with a name or
203	      names, a series of descriptive attributes, and/or a unique
204	      identifier.  NDN (Named Data Networking) can be applied to edge
205	      data discovery to make it much easier to extract data and
206	      metadata by naming it.  If data were named, we would be able to
207	      discover the appropriate data simply by its name.

209	   o  We need a standards-based way of discovering data in mobile
210	      wireless networks.  Data could reside on the eNodeB or other
211	      wireless access infrastructure equipment in addition to residing
212	      on servers in the packet core.
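
   To illustrate the name-based use case in the list above, the
   following sketch discovers data by hierarchical name prefix.  It is
   loosely inspired by the prefix matching used in ICN and NDN but is
   not an implementation of either; the catalog contents, the "/"
   separated naming scheme, and discover_by_prefix() are assumptions
   made for illustration.

```python
def discover_by_prefix(catalog, prefix):
    """Return the names in `catalog` that fall under `prefix`.

    Matching is done per name component, so "/site1/sensor" does not
    match "/site1/sensors-old/temp".
    """
    want = prefix.strip("/").split("/")
    found = []
    for name in catalog:
        components = name.strip("/").split("/")
        if components[:len(want)] == want:
            found.append(name)
    return sorted(found)


catalog = [
    "/site1/sensor/temp",
    "/site1/sensor/humidity",
    "/site1/sensors-old/temp",
    "/site2/sensor/temp",
]

print(discover_by_prefix(catalog, "/site1/sensor"))
# prints: ['/site1/sensor/humidity', '/site1/sensor/temp']
```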

214	5.  IANA Considerations

216	   N/A

218	6.  Security Considerations

220	   Data and metadata discovery are both a function of who asks for the
221	   data and in what context.  The policies attached to the database and
222	   the metadata are going to dictate which view of the data the system
223	   returns to the requester.
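
   A minimal sketch of a policy-dependent view follows.  It assumes
   that each record carries a set of roles allowed to discover it; the
   record layout, the role names, and visible_view() are hypothetical.
   Only the "who asks" part is modeled here, and the request context
   is left out for brevity.

```python
# Hypothetical catalog: each entry names a data set and the roles
# that its policy allows to discover it.
RECORDS = [
    {"name": "traffic-counts", "allowed_roles": {"operator", "analyst"}},
    {"name": "subscriber-locations", "allowed_roles": {"operator"}},
    {"name": "public-weather", "allowed_roles": {"operator", "analyst", "guest"}},
]


def visible_view(records, requester_roles):
    """Return the names of the records the requester may discover."""
    roles = set(requester_roles)
    return [r["name"] for r in records if r["allowed_roles"] & roles]


print(visible_view(RECORDS, {"analyst"}))
# prints: ['traffic-counts', 'public-weather']
```

   The same query therefore yields different discovery results for
   different requesters, which is the behavior described above.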

225	7.  Acknowledgement

227	8.  Normative References

229	   [I-D.mcbride-edge-data-discovery-overview]
230	              McBride, M., Kutscher, D., Schooler, E., and C. Bernardos,
231	              "Edge Data Discovery for COIN", draft-mcbride-edge-data-
232	              discovery-overview-03 (work in progress), January 2020.

234	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
235	              Requirement Levels", BCP 14, RFC 2119,
236	              DOI 10.17487/RFC2119, March 1997,
237	              <https://www.rfc-editor.org/info/rfc2119>.

239	Authors' Addresses

241	   Mike McBride
242	   Futurewei

244	   Email: michael.mcbride@futurewei.com

246	   Dirk Kutscher
247	   Emden University

249	   Email: ietf@dkutscher.net

251	   Eve Schooler
252	   Intel

254	   Email: eve.m.schooler@intel.com
255	   URI:    http://www.eveschooler.com

257	   Carlos J. Bernardos
258	   Universidad Carlos III de Madrid
259	   Av. Universidad, 30
260	   Leganes, Madrid  28911
261	   Spain

263	   Phone: +34 91624 6236
264	   Email: cjbc@it.uc3m.es
265	   URI:   http://www.it.uc3m.es/cjbc/

267	   Diego R. Lopez
268	   Telefonica I+D
269	   Don Ramon de la Cruz, 82
270	   Madrid  28006
271	   Spain

273	   Email: diego.r.lopez@telefonica.com