idnits 2.17.1 

draft-irtf-pearg-pitfol-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (September 10, 2020) is 1322 days in the past.  Is
     this intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Unused Reference: 'RFC3164' is defined on line 432, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 3164 (Obsoleted by RFC 5424)


     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                             S. Rao
3	Internet-Draft                                                S. Nagaraj
4	Intended status: Experimental                                       Grab
5	Expires: March 14, 2021                                         S. Sahib
6	                                                                R. Guest
7	                                                              Salesforce
8	                                                      September 10, 2020

10	                 Personal Information Tagging for Logs
11	                       draft-irtf-pearg-pitfol-00

13	Abstract

15	   Software systems typically generate log messages in the course of
16	   their operation.  These log messages (or 'logs') record events as
17	   they happen, thus providing a trail that can be used to understand
18	   the state of the system and help with troubleshooting issues.  Given
19	   that logs try to capture state that is useful for monitoring and
20	   debugging, they can contain information that can be used to identify
21	   users.  Personal data identification and anonymization in logs is
22	   crucial to ensure that no personal data is being inadvertently logged
23	   and retained, which would make the logging system run afoul of laws
24	   around storing private information.  This document focuses on
25	   exploring mechanisms that can be used by a generating or intermediary
26	   logging service to specify personal or sensitive data in log
27	   message(s), thus allowing a downstream logging server to potentially
28	   enforce any redaction or transformation.

30	Requirements Language

32	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
33	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
34	   document are to be interpreted as described in RFC 2119 [RFC2119].

36	Status of This Memo

38	   This Internet-Draft is submitted in full conformance with the
39	   provisions of BCP 78 and BCP 79.

41	   Internet-Drafts are working documents of the Internet Engineering
42	   Task Force (IETF).  Note that other groups may also distribute
43	   working documents as Internet-Drafts.  The list of current Internet-
44	   Drafts is at https://datatracker.ietf.org/drafts/current/.

46	   Internet-Drafts are draft documents valid for a maximum of six months
47	   and may be updated, replaced, or obsoleted by other documents at any
48	   time.  It is inappropriate to use Internet-Drafts as reference
49	   material or to cite them other than as "work in progress."

51	   This Internet-Draft will expire on March 14, 2021.

53	Copyright Notice

55	   Copyright (c) 2020 IETF Trust and the persons identified as the
56	   document authors.  All rights reserved.

58	   This document is subject to BCP 78 and the IETF Trust's Legal
59	   Provisions Relating to IETF Documents
60	   (https://trustee.ietf.org/license-info) in effect on the date of
61	   publication of this document.  Please review these documents
62	   carefully, as they describe your rights and restrictions with respect
63	   to this document.  Code Components extracted from this document must
64	   include Simplified BSD License text as described in Section 4.e of
65	   the Trust Legal Provisions and are provided without warranty as
66	   described in the Simplified BSD License.

68	Table of Contents

70	   1.  Introduction
71	   2.  Terminology
72	   3.  Motivation and Use Cases
73	   4.  Challenges with Existing Approaches
74	   5.  Proposed Model
75	     5.1.  Defining the log privacy schema
76	     5.2.  Typical Workflow
77	     5.3.  Log Processing and Access Control
78	   6.  Examples
79	   7.  IANA Considerations
80	   8.  Security Considerations
81	   9.  Acknowledgements
82	   10. Normative References
83	   Authors' Addresses

85	1.  Introduction

87	   Logs capture the state of a software system in operation, thus
88	   providing observability.  However, because of the amount of state
89	   they capture, they can often contain sensitive user information
90	   [link: twitter storing passwords].  Personal data identification and
91	   redaction is crucial to make sure that a logging application is not
92	   storing and potentially leaking users' private information.  There
93	   are known techniques for discovering and extracting sensitive data:
94	   for example, regular expressions or lookup rules can be defined to
95	   match a person's name, credit card number, email address and so
96	   on.  In addition, models trained on data dictionaries can
97	   analyze logs, predict the presence of sensitive data and
98	   subsequently redact it.  This document proposes an approach and
99	   framework for creating logs with personal information tagged, thus
100	   marking a step towards privacy aware logging.  Once personal
101	   information is identified in a log, it has to be appropriately tagged
102	   at source.  Personal data tagging is especially important in cases
103	   where log data is flowing in from disparate sources.  In cases where
104	   tagging at source is not possible (e.g. log data generated by a
105	   legacy application, IoT device, Web server or a Firewall), a
106	   centralized logging server can be tasked with making sure the log
107	   data is tagged before passing on downstream.  Once the logs are
108	   tagged, the logging application can use anonymization techniques to
109	   redact the fields appropriately.  While the proposal described here
110	   can be applied to any data deemed sensitive in a log, this
111	   document specifically discusses and illustrates tagging of personal
112	   information in logs.

114	2.  Terminology

116	   *Personal data:* RFC 6973 [RFC6973] defines personal data as "any
117	   information relating to an individual who can be identified, directly
118	   or indirectly."  This typically includes information such as IP
119	   addresses, username, email address, financial data, passwords and so
120	   on.  However, the definition of personal data varies heavily by what
121	   other information is available, the jurisdiction of operation and
122	   other such factors.  Hence, this document does not focus on
123	   prescriptively listing what log fields contain personal data but
124	   rather on what a tagging mechanism would look like once a logging
125	   application has determined which fields it considers to hold personal
126	   data.

128	   *Structured logging:* Most applications generate logs in a
129	   unidimensional format that twines together logic status and input
130	   data.  This makes log output largely free-flowing and unstructured,
131	   without specific delimiters, making it hard to segregate personal
132	   information from other text in the log.  Structured logging refers to
133	   a formal arrangement of logs with specific identifiers of personal
134	   information and semantic information to enable easy parsing and
135	   identification of specific information in the log.

137	   *Privacy Sensitivity Level:* Sensitivity level defines the degree of
138	   sensitivity of a data item in a log template or schema.  Levels are
139	   enumerated on a scale of 1 to 5, consistent with the table in
140	   Section 5.1: 1 - Very high risk of leaking private information and
141	   5 - Low risk of leaking private information.

143	3.  Motivation and Use Cases

145	   Most systems like network devices, web servers and application
146	   services record information about user activity, transactions,
147	   network flows, etc., as log data.  Logs are incredibly useful for
148	   various purposes such as security monitoring, application debugging,
149	   investigations and operational maintenance.  In addition, there are
150	   use cases of organizations exporting or sharing logs with third party
151	   log analyzers for purposes of security incident response, monitoring,
152	   business analytics, where logs can be a valuable source of
153	   information.  In such cases, there are concerns about potential
154	   exposure of personal data to unintended systems or recipients.

156	4.  Challenges with Existing Approaches

158	   While methods of detecting personally identifiable information (PII)
159	   are continuously evolving, most approaches rely on regular
160	   expressions, models trained on datasets, pattern
161	   recognition, checksum matching or custom logic.

163	   *Inconsistent Representation:* When applications, services or
164	   devices log personal information, there is no consistency in the
165	   representation of the information.  For example, the name of a user is
166	   often logged either as "fullname" (e.g., John Doe) or as
167	   "firstname" (John) and "lastname" (Doe).

169	   *Context:* In most cases, what data is considered personal and
170	   sensitive is subjective, provisional and contextual to the data
171	   source or the application processing the data, which makes it hard to
172	   use automated techniques to identify personal data.  Even for a
173	   specific domain, it's controversial whether it is possible to
174	   definitively say that a piece of data is NOT identifying.

176	   *Disparate Types of Personal Data:* There are many disparate types of
177	   personal data, which often require multiple detection approaches.

179	   *Lack of standards:* There are no standards that govern formats of
180	   sensitive data, making automation difficult for most common use cases.

182	   *Detection Accuracy:* Most current PII detection tools employ
183	   regular expression based techniques or other pattern recognition
184	   techniques to identify PII.  Due to the very nature of logs,
185	   most current implementations let administrators add
186	   redaction policies based on the 'likelihood' of detection,
187	   categorized as low, medium or high.  Redacting at low likelihood
188	   causes many false positives, while redacting only at high likelihood
189	   lets PII leak, making a trade-off inevitable for organizations.
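
   As a hedged illustration of this trade-off, the Python sketch below
   compares a loose pattern match with a stricter one.  The 16-digit
   pattern, the Luhn checksum step, the key names and the sample line
   are assumptions made for this example only; they are not taken from
   any particular detection tool.

```python
import re

# Loose rule (assumption): any standalone run of 16 digits "looks like" a card number.
CARD_LIKE = re.compile(r"\b\d{16}\b")

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def loose(line):
    """Flag every 16-digit run: catches more, but also flags unrelated IDs."""
    return CARD_LIKE.findall(line)

def strict(line):
    """Flag only 16-digit runs that also pass the checksum."""
    return [m for m in CARD_LIKE.findall(line) if luhn_ok(m)]

# Hypothetical log line: an order ID that merely resembles a card number,
# next to a well-known test card number.
line = 'order_id="1234567890123456" card="4111111111111111"'
```

   Here loose(line) flags both values, including the order ID, while
   strict(line) flags only the card value.  A stricter rule reduces
   false positives but can miss values that do not fit the rule.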

191	5.  Proposed Model

193	   This section describes a reference model to enable tagging of
194	   personal information at source and extends it to include an approach
195	   of role or policy based redaction based on personal information
196	   annotated at source.  The figure below illustrates the proposed
197	   model.

199	     Log Template/Schema with personal data identifiers
200	                         |
201	                         V
202	                   Log library
203	                         |
204	                         V
205	                     Application
206	                         |
207	                         V
208	                 Generate annotated log
209	                         |
210	                         V
211	                     Log redaction +---  consumer / role based
212	                                   +--   sensitivity based

214	                              Figure 1: Flow

216	5.1.  Defining the log privacy schema

218	   We propose using structured logging, where a log schema or template
219	   defines a standardized identifier for each kind of personal
220	   information and associates each log field with a sensitivity level
221	   customized to a use case or log intent.

223	   Note that this is not to be confused with a log severity level (WARN,
224	   INFO...) - those are typically defined "dynamically" by the developer
225	   while defining the severity of a certain scenario.  A privacy
226	   sensitivity level is defined statically and is part of a log schema,
227	   associated with the log name and data type.

229	   +------------------+-----------------+----------------+-------------+
230	   | Name             | Abstract Data   | Description    | Sensitivity |
231	   |                  | Type            |                | [1-High     |
232	   |                  |                 |                | 5-Normal]   |
233	   +------------------+-----------------+----------------+-------------+
234	   | nationalIdentity | String          | National IDs   | 1           |
235	   |                  |                 | issued by      |             |
236	   |                  |                 | sovereign      |             |
237	   |                  |                 | governments.   |             |
238	   |                  |                 | E.g., SSN      |             |
239	   | drivingLicense   | String          | Driving        | 1           |
240	   |                  |                 | License number |             |
241	   | taxIdentity      | String          | Tax            | 1           |
242	   |                  |                 | identification |             |
243	   |                  |                 | numbers        |             |
244	   | creditCardNumber | String          | Credit cards   | 1           |
245	   | bankAccount      | String          | Bank account   | 1           |
246	   |                  |                 | number         |             |
247	   | dateOfBirth      | Date            | Date of Birth  | 2           |
248	   | personName       | String          | Person name    | 1           |
249	   | emailAddress     | String          | Email          | 2           |
250	   | phoneNumber      | Number          | Phone          | 1           |
251	   | zipCode          | Integer         | Zip codes      | 5           |
252	   | ipAddress        | ipv4Address     | IPv4 or IPv6   | 4           |
253	   |                  |                 | Address        |             |
254	   | dateTimeSeconds  | dateTimeSeconds | seconds        | 5           |
255	   | age              | Integer         | Age            | 2           |
256	   | ethnicGroup      | String          | Ethnic group   | 1           |
257	   | genderIdentity   | String          | Gender         | 1           |
258	   |                  |                 | identity       |             |
259	   | macAddress       | macAddress      | MAC Address    | 4           |
260	   +------------------+-----------------+----------------+-------------+

262	                 Personal Information Identifiers Registry

264	   If an organization already uses structured logging with a log schema,
265	   then a privacy sensitivity level can be an additional attribute for
266	   the schema.
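
   One possible in-code representation of the registry is sketched
   below in Python.  The class name PiiField, the helper
   sensitivity_of and the choice to treat unregistered fields as most
   sensitive are assumptions of this sketch, not part of the proposal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PiiField:
    name: str         # identifier from the registry table
    data_type: str    # abstract data type from the registry table
    sensitivity: int  # 1 (high) to 5 (normal), as in the table header

# A few rows copied from the registry table; the rest are omitted for brevity.
REGISTRY = {
    f.name: f
    for f in (
        PiiField("nationalIdentity", "String", 1),
        PiiField("dateOfBirth", "Date", 2),
        PiiField("emailAddress", "String", 2),
        PiiField("ipAddress", "ipv4Address", 4),
        PiiField("zipCode", "Integer", 5),
    )
}

def sensitivity_of(field_name):
    """Return the registered level; unknown fields default to 1 (assumption)."""
    entry = REGISTRY.get(field_name)
    return entry.sensitivity if entry is not None else 1
```

   An organization with an existing log schema could instead add the
   sensitivity value as one more attribute of its own schema entries.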

268	   The privacy sensitivity level for log types is intended to be defined
269	   by a centralized effort around privacy preservation in logs.  In
270	   other words, this mapping might be done by an organization's privacy
271	   team (which can include lawyers, engineers and privacy
272	   professionals).  The intention is that all logs generated by an
273	   organization conform to this structured format, which eases
274	   downstream processing of logs for access control and removal of
275	   sensitive information.

277	   If the log is being generated by a web server, then two approaches
278	   can be taken:

280	   1.  Modify log-format for the service: identify the log data type of
281	   each piece of log data generated, and tag it at generation (examples
282	   are provided in Section 6).

284	   2.  Add automated tagging in a centralized log aggregator: collect
285	   all the logs generated by different services and apply the annotation
286	   using the log schema at the aggregator
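
   The second approach could look roughly like the Python sketch below.
   It assumes that incoming log lines carry key="value" pairs and that
   tagged fields are wrapped in braces with a pii_sensitivity_level
   attribute, as done for personName in Section 6.  The SCHEMA mapping
   from a service's own key names to registry names is hypothetical.

```python
import re

# Hypothetical mapping: service key name -> (registry name, sensitivity level).
SCHEMA = {
    "user_name": ("personName", 1),
    "ip_addr": ("ipAddress", 4),
}

# Matches one key="value" pair.
KV = re.compile(r'(\w+)="([^"]*)"')

def tag_line(line):
    """Wrap every key="value" pair listed in SCHEMA in a tagged group."""
    def repl(match):
        key, value = match.group(1), match.group(2)
        if key not in SCHEMA:
            return match.group(0)  # leave untagged fields unchanged
        name, level = SCHEMA[key]
        return f'{{{name}="{value}" pii_sensitivity_level={level}}}'
    return KV.sub(repl, line)
```

   For instance, tag_line('user_name="XYZZY" port="55875"') leaves the
   port field untouched and rewrites the user name as a tagged
   personName group.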

288	5.2.  Typical Workflow

290	   1.  The log privacy schema can be parsed into a structured logging
291	       library that is used by individual developer teams.  The
292	       intention is for developers to not log arbitrary data, i.e., they
293	       are asked to identify the data type of the state they
294	       want to preserve.

296	   2.  Any addition to the log schema would have to go through review of
297	       the privacy team that came up with the log schema.

299	   3.  Once a log is generated, tagged and stored, various kinds of
300	       access control techniques can be applied to govern who can
301	       access the logs.
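
   The first step of this workflow can be sketched as a small logging
   wrapper that only accepts fields named in the schema.  The class
   SchemaLogger, the JSON record layout and the three-entry SCHEMA are
   assumptions of this Python sketch.

```python
import json

# Hypothetical subset of the log privacy schema: field name -> sensitivity level.
SCHEMA = {"personName": 1, "emailAddress": 2, "ipAddress": 4}

class SchemaLogger:
    """Accepts only fields that appear in the schema and tags each one."""

    def __init__(self, schema):
        self._schema = dict(schema)
        self.records = []

    def log(self, event, **fields):
        unknown = sorted(set(fields) - set(self._schema))
        if unknown:
            # Arbitrary data is rejected; new types go through schema review.
            raise ValueError(f"fields not in log schema: {unknown}")
        record = {
            "event": event,
            "fields": [
                {"name": k, "value": v, "pii_sensitivity_level": self._schema[k]}
                for k, v in fields.items()
            ],
        }
        self.records.append(record)
        return json.dumps(record)

logger = SchemaLogger(SCHEMA)
```

   A call such as logger.log("login", personName="XYZZY") produces a
   tagged record, while a field that is not in the schema raises an
   error instead of being logged untagged.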

303	5.3.  Log Processing and Access Control

305	   1.  Consumer Role Based Access

307	       A.  Once the log is tagged, access to it can be based on a
308	           consumer's role and privilege level.

310	       B.  A consumer role based policy can define what level of
311	           sensitivity that role can access.

313	   2.  Case-based access

315	       A.  If there is a genuine case for which access to sensitive
316	           information is needed and granted by the legal department, a
317	           cryptographically-signed token (e.g., JWT) can be generated
318	           that will allow a developer/user access to logs of an
319	           increased sensitivity level.  This access can be temporal in nature,
320	           i.e., the token will only be valid for a certain amount of
321	           time.

323	       B.  A transaction ID can also be propagated automatically
324	           throughout the request processing, to correlate different
325	           logs related to a single request.  Note that the notion of a
326	           "request" can vary based on what the application is doing.
327	           The idea is to have a single unifying ID that ties together
328	           the logs of one action.  If this is done, then the temporary
329	           token can be restricted to a particular request ID.

331	   3.  Redaction Techniques

333	       A.  Given that the log is tagged, an organization might choose to
334	           redact the more sensitive logs, e.g., ones more sensitive than a
335	           chosen threshold or ones of a certain log type.

337	       B.  More sophisticated approaches can be developed, e.g.,
338	           completely redact log types username and email, but obfuscate
339	           IP address so that a rough location can be garnered from the
340	           log record.  In this way, techniques such as differential
341	           privacy can be used in tandem to have privacy guarantees for
342	           logs while still providing usefulness to developers.
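
   Role based access and selective obfuscation can be combined as in
   the Python sketch below.  It assumes the numbering of the table in
   Section 5.1, where 1 is the most sensitive level.  The role names,
   the ROLE_CLEARANCE policy and the /24 and /48 prefix lengths are
   hypothetical.  The obfuscation shown is plain prefix truncation; it
   is not differential privacy.

```python
import ipaddress

# Hypothetical policy: the most sensitive level (lowest number) a role may read.
ROLE_CLEARANCE = {"security": 1, "developer": 5}

def obfuscate_ip(value):
    """Zero the host bits so that only a coarse network prefix remains."""
    addr = ipaddress.ip_address(value)
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{value}/{prefix}", strict=False))

def view_for(role, fields):
    """fields maps a name to (value, level); returns what the role may see."""
    clearance = ROLE_CLEARANCE[role]
    visible = {}
    for name, (value, level) in fields.items():
        if level >= clearance:
            visible[name] = value                # within the role's clearance
        elif name == "ipAddress":
            visible[name] = obfuscate_ip(value)  # keep a rough location only
        else:
            visible[name] = "[REDACTED]"
    return visible
```

   With this policy a "developer" would see zip codes in full, IP
   addresses reduced to a network prefix and person names redacted,
   while a "security" consumer would see every field unchanged.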

344	6.  Examples

346	   An example based on RFC 3164 Log format

348	   Normal Log Output

350	   <120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0 event="AF-Authority
351	   failure" violation="A-Not authorized to object" actual_type="AF-A"
352	   jrn_seq="1001363" timestamp="20120418163258988000"
353	   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
354	   err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875"
355	   action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY"
356	   val_jobno="256937" object="TEST" object_library="CUS9242"
357	   object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]

359	   Log Output with Personal Information Tagging

361	   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
362	   failure" violation="A-Not authorized to object" actual_type="AF-A"
363	   jrn_seq="1001363" timestamp="20120418163258988000"
364	   job_name="QPADEV000B" {personName="XYZZY" pii_sensitivity_level=1}
365	   job_number="256937" {emailAddress="xyz@foo.com"
366	   pii_sensitivity_level=2} {ip_addr="10.0.1.21"
367	   pii_sensitivity_level=4} port="55875" action="Undefined(x00)"
368	   val_job="QPADEV000B" val_jobno="256937" object="TEST"
369	   object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr=""
370	   workstation=""]
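
   A downstream consumer could act on such tags as in the Python
   sketch below.  It assumes every tagged field is a brace-delimited
   group of the form {name="value" pii_sensitivity_level=N}, as used
   for personName above, and that 1 is the most sensitive level.  The
   function name redact, the "***" mask and the shortened sample line
   are illustrative choices.

```python
import re

# One tagged group: {name="value" pii_sensitivity_level=N}
TAGGED = re.compile(r'\{(\w+)="([^"]*)"\s+pii_sensitivity_level=(\d)\}')

def redact(line, threshold):
    """Mask tagged values whose level is at or below threshold (1 = most sensitive)."""
    def repl(match):
        name, value, level = match.group(1), match.group(2), int(match.group(3))
        shown = "***" if level <= threshold else value
        return f'{{{name}="{shown}" pii_sensitivity_level={level}}}'
    return TAGGED.sub(repl, line)

# Shortened version of the tagged example, for illustration.
sample = ('job_name="QPADEV000B" {personName="XYZZY" pii_sensitivity_level=1} '
          '{ip_addr="10.0.1.21" pii_sensitivity_level=4} port="55875"')
```

   Calling redact(sample, 2) masks the person name and keeps the IP
   address, since only levels 1 and 2 fall within the threshold.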

372	7.  IANA Considerations

374	   IANA can consider defining a new central repository for Personal
375	   Information name and identifier registries to be used in logging
376	   personal information.  The personal identifier registry would
377	   enumerate names and identifiers as described in Section 5.1.

379	8.  Security Considerations

381	   It is anticipated that developers will want additional log data types
382	   for capturing application logic, and might abuse an existing log type
383	   instead of going through the process of adding a new one.  In such a
384	   case, the log would be incorrectly tagged.  This can be mitigated by
385	   having stronger typing for the log data types, e.g., restricting an
386	   address to a certain string length instead of storing arbitrary
387	   length.
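
   Stronger typing could be enforced with per-type validators, as in
   the Python sketch below.  The VALIDATORS table, the five-digit zip
   code rule and the colon-separated MAC address format are
   assumptions chosen for illustration; real deployments would need
   rules that fit their own data.

```python
import ipaddress
import re

def _is_ip(value):
    """True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

# Hypothetical validators keyed by log data type.
VALIDATORS = {
    "ipAddress": _is_ip,
    "zipCode": lambda v: re.fullmatch(r"\d{5}", v) is not None,
    "macAddress": lambda v: re.fullmatch(
        r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}", v) is not None,
}

def check(field_type, value):
    """Return True if value is well formed for the given log data type."""
    validator = VALIDATORS.get(field_type)
    if validator is None:
        raise KeyError(f"unknown log data type: {field_type}")
    return validator(value)
```

   A developer who tries to store free-form text in an ipAddress field
   would fail this check, which makes misuse of an existing type
   visible at logging time.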

389	   Encouraging developers to think carefully about what kind of data
390	   they're logging is a good practice and will lead to fewer incidents
391	   of private data being inadvertently logged.  An organization might
392	   choose to have an unstructured log type for letting developers log
393	   data that truly does not fit anywhere else.  This is still better than
394	   not having structured privacy-aware logging, because the potential
395	   privacy leakage is isolated to one particular field and its use can
396	   be monitored.

398	   Having a mapping from log data type to privacy sensitivity will need
399	   continuous effort by a privacy team, which might be expensive for an
400	   organization.

402	   Log data is often collated, propagated, transformed, loaded into
403	   different formats or data models for purposes of analytics,
404	   troubleshooting and visualization.  In such cases, it is necessary
405	   and critical to ensure that personal information tagging and
406	   annotations are preserved and forwarded across format transformations.
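
   As a minimal sketch of tag preservation, the Python code below
   converts a tagged key="value" line into JSON-ready records while
   carrying the sensitivity level along.  The brace-delimited input
   format follows the personName field in Section 6; the record
   layout and the function name to_records are assumptions.

```python
import json
import re

# Either a tagged group {name="value" pii_sensitivity_level=N} or a plain name="value".
FIELD = re.compile(
    r'\{(\w+)="([^"]*)"\s+pii_sensitivity_level=(\d)\}'
    r'|(\w+)="([^"]*)"'
)

def to_records(line):
    """Return one dict per field, keeping the tag when the field has one."""
    records = []
    for m in FIELD.finditer(line):
        if m.group(1) is not None:
            records.append({
                "name": m.group(1),
                "value": m.group(2),
                "pii_sensitivity_level": int(m.group(3)),
            })
        else:
            records.append({"name": m.group(4), "value": m.group(5)})
    return records

line = '{personName="XYZZY" pii_sensitivity_level=1} port="55875"'
as_json = json.dumps(to_records(line))
```

   After the conversion, the personName record still carries level 1,
   so a consumer of the JSON form can apply the same policies as a
   consumer of the original line.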

408	   If the privacy marking or classification of a log changes, the new
409	   classification is applied to historical logs on their subsequent
410	   access.

412	   *TODO*: In case of logs that are not tagged or marked with personal
413	   information, an out-of-band mechanism to communicate log template or
414	   schema with personal data identifiers can be considered.  Such a
415	   mechanism can also be used to notify changes to privacy tagging or
416	   classification.

418	9.  Acknowledgements

420	   The authors would like to thank everyone who provided helpful
421	   comments at the mic at IETF 106 during the PEARG session.  Thanks
422	   also to Joe Salowey for thoughts on aspects of log transformations,
423	   change of privacy classifications, and models for privacy marking.

425	10.  Normative References

427	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
428	              Requirement Levels", BCP 14, RFC 2119,
429	              DOI 10.17487/RFC2119, March 1997,
430	              <https://www.rfc-editor.org/info/rfc2119>.

432	   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
433	              DOI 10.17487/RFC3164, August 2001,
434	              <https://www.rfc-editor.org/info/rfc3164>.

436	   [RFC6973]  Cooper, A., Tschofenig, H., Aboba, B., Peterson, J.,
437	              Morris, J., Hansen, M., and R. Smith, "Privacy
438	              Considerations for Internet Protocols", RFC 6973,
439	              DOI 10.17487/RFC6973, July 2013,
440	              <https://www.rfc-editor.org/info/rfc6973>.

442	Authors' Addresses

444	   Sandeep Rao
445	   Grab
446	   Bangalore
447	   India

449	   Email: sandeeprao.ietf@gmail.com

451	   Santhosh C N
452	   Grab

454	   Email: santoshcn1@gmail.com

456	   Shivan Sahib
457	   Salesforce

459	   Email: shivankaulsahib@gmail.com

460	   Ryan Guest
461	   Salesforce

463	   Email: rguest@salesforce.com