idnits 2.17.1

draft-irtf-pearg-pitfol-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with private range IPv4 addresses in the
     document. If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (September 10, 2020) is 1322 days in the past. Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Unused Reference: 'RFC3164' is defined on line 432, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 3164 (Obsoleted by RFC 5424)

     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

  2  Network Working Group                                             S. Rao
  3  Internet-Draft                                                S. Nagaraj
  4  Intended status: Experimental                                       Grab
  5  Expires: March 14, 2021                                         S. Sahib
  6                                                                  R. Guest
  7                                                                Salesforce
  8                                                        September 10, 2020

 10                   Personal Information Tagging for Logs
 11                        draft-irtf-pearg-pitfol-00

 13  Abstract

 15     Software systems typically generate log messages in the course of
 16     their operation. These log messages (or 'logs') record events as
 17     they happen, thus providing a trail that can be used to understand
 18     the state of the system and help with troubleshooting issues. Given
 19     that logs try to capture state that is useful for monitoring and
 20     debugging, they can contain information that can be used to identify
 21     users. Personal data identification and anonymization in logs is
 22     crucial to ensure that no personal data is being inadvertently logged
 23     and retained which would make the logging system run afoul of laws
 24     around storing private information. This document focuses on
 25     exploring mechanisms that can be used by a generating or intermediary
 26     logging service to specify personal or sensitive data in log
 27     message(s), thus allowing a downstream logging server to potentially
 28     enforce any redaction or transformation.

 30  Requirements Language

 32     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
 33     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 34     document are to be interpreted as described in RFC 2119 [RFC2119].

 36  Status of This Memo

 38     This Internet-Draft is submitted in full conformance with the
 39     provisions of BCP 78 and BCP 79.

 41     Internet-Drafts are working documents of the Internet Engineering
 42     Task Force (IETF). Note that other groups may also distribute
 43     working documents as Internet-Drafts. The list of current Internet-
 44     Drafts is at https://datatracker.ietf.org/drafts/current/.

 46     Internet-Drafts are draft documents valid for a maximum of six months
 47     and may be updated, replaced, or obsoleted by other documents at any
 48     time. It is inappropriate to use Internet-Drafts as reference
 49     material or to cite them other than as "work in progress."

 51     This Internet-Draft will expire on March 14, 2021.

 53  Copyright Notice

 55     Copyright (c) 2020 IETF Trust and the persons identified as the
 56     document authors. All rights reserved.

 58     This document is subject to BCP 78 and the IETF Trust's Legal
 59     Provisions Relating to IETF Documents
 60     (https://trustee.ietf.org/license-info) in effect on the date of
 61     publication of this document. Please review these documents
 62     carefully, as they describe your rights and restrictions with respect
 63     to this document. Code Components extracted from this document must
 64     include Simplified BSD License text as described in Section 4.e of
 65     the Trust Legal Provisions and are provided without warranty as
 66     described in the Simplified BSD License.

 68  Table of Contents

 70     1.  Introduction
 71     2.  Terminology
 72     3.  Motivation and Use Cases
 73     4.  Challenges with Existing Approaches
 74     5.  Proposed Model
 75       5.1.  Defining the log privacy schema
 76       5.2.  Typical Workflow
 77       5.3.  Log Processing and Access Control
 78     6.  Examples
 79     7.  IANA Considerations
 80     8.  Security Considerations
 81     9.  Acknowledgements
 82     10. Normative References
 83     Authors' Addresses

 85  1.  Introduction

 87     Logs capture the state of a software system in operation, thus
 88     providing observability. However, because of the amount of state
 89     they capture, they can often contain sensitive user information
 90     [link: twitter storing passwords].
        Personal data identification and
 91     redaction is crucial to make sure that a logging application is not
 92     storing and potentially leaking users' private information. There
 93     are known precedents that help discover and extract sensitive data;
 94     for example, we can define a regular expression or lookup rules that
 95     will match a person's name, credit card number, email address and so
 96     on. Besides, there are data dictionary based training models that
 97     can analyze logs and predict the presence of sensitive data and
 98     subsequently redact it. This document proposes an approach and
 99     framework for creating logs with personal information tagged, thus
100     marking a step towards privacy aware logging. Once personal
101     information is identified in a log, it has to be appropriately tagged
102     at source. Personal data tagging is especially important in cases
103     where log data is flowing in from disparate sources. In cases where
104     tagging at source is not possible (e.g. log data generated by a
105     legacy application, IoT device, Web server or a Firewall), a
106     centralized logging server can be tasked with making sure the log
107     data is tagged before passing it on downstream. Once the logs are
108     tagged, the logging application can use anonymization techniques to
109     redact the fields appropriately. While the proposal described here
110     can be applied to any data deemed sensitive in a log, this
111     document specifically discusses and illustrates tagging of personal
112     information in logs.

114  2.  Terminology

116     *Personal data:* RFC 6973 [RFC6973] defines personal data as "any
117     information relating to an individual who can be identified, directly
118     or indirectly." This typically includes information such as IP
119     addresses, username, email address, financial data, passwords and so
120     on. However, the definition of personal data varies heavily by what
121     other information is available, the jurisdiction of operation and
122     other such factors.
        Hence, this document does not focus on
123     prescriptively listing what log fields contain personal data but
124     rather on what a tagging mechanism would look like once a logging
125     application has determined which fields it considers to hold personal
126     data.

128     *Structured logging:* Most applications generate logs in a
129     unidimensional format that intertwines logic status and input
130     data. This makes log output largely free flowing and unstructured,
131     without specific delimiters, making it hard to segregate personal
132     information from other text in the log. Structured logging refers to
133     a formal arrangement of logs with specific identifiers of personal
134     information and semantic information to enable easy parsing and
135     identification of specific information in the log.

137     *Privacy Sensitivity Level:* Sensitivity level defines the degree of
138     sensitivity of a data item in a log template or schema. Level can be
139     enumerated on a scale of 1 to 5 and defined as follows: 1 - Very high
140     risk of leaking private information and 5 - Low risk of leaking
141     private information, consistent with the registry in Section 5.1.

143  3.  Motivation and Use Cases

145     Most systems like network devices, web servers and application
146     services record information about user activity, transactions,
147     network flows, etc., as log data. Logs are incredibly useful for
148     various purposes such as security monitoring, application debugging,
149     investigations and operational maintenance. In addition, there are
150     use cases of organizations exporting or sharing logs with third party
151     log analyzers for purposes of security incident response, monitoring,
152     business analytics, where logs can be a valuable source of
153     information. In such cases, there are concerns about potential
154     exposure of personal data to unintended systems or recipients.

156  4.  Challenges with Existing Approaches

158     While methods of detecting personally identifiable information are
159     continuously evolving, most approaches rely on regular
160     expressions, data or dataset based training models, pattern
161     recognition, checksum matching, or custom logic.

163     *Inconsistent Representation:* When applications, services or
164     devices log personal information, there is no consistency in the
165     representation of the information. For example, the name of a user is
166     often logged as either "fullname" (e.g. John Doe) or with
167     "firstname" (John) and "lastname" (Doe).

169     *Context:* In most cases, what data is considered personal and
170     sensitive is subjective, provisional and contextual to the data
171     source or the application processing the data, which makes it hard to
172     use automated techniques to identify personal data. Even for a
173     specific domain, it's controversial whether it is possible to
174     definitively say that a piece of data is NOT identifying.

176     *Disparate Types of Personal Data:* There are many disparate types of
177     personal data, and they often require a multitude of approaches for
        detection.

179     *Lack of standards:* There are no standards that govern formats of
180     sensitive data, making automation difficult for most common use cases.

182     *Detection Accuracy:* Most of the current PII detection tools employ
183     regular expression based techniques or other pattern recognition
184     techniques to identify the PII data. Due to the very nature of logs,
185     most of the current implementations let administrators add
186     redaction policies based on 'likelihood' of detection probability
187     categorized as low, medium or high. Defining a low detection scheme
188     causes high false positives and a high detection scheme would cause
189     PII leakage, thereby making a trade-off inevitable for organizations.

191  5.  Proposed Model

193     This section describes a reference model to enable tagging of
194     personal information at source and extends it to include an approach
195     of role or policy based redaction based on personal information
196     annotated at source. The figure below illustrates the proposed
197     model.

199        Log Template/Schema with personal data identifiers
200                          |
201                          V
202                     Log library
203                          |
204                          V
205                     Application
206                          |
207                          V
208                Generate annotated log
209                          |
210                          V
211                   Log redaction +--- consumer / role based
212                                 +--- sensitivity based

214                          Figure 1: Flow

216  5.1.  Defining the log privacy schema

218     We propose using structured logging where a log schema or a template
219     defines standardized identifiers for every piece of personal
220     information and each log field is associated with a sensitivity level
221     customized to a use case or log intent.

223     Note that this is not to be confused with a log severity level (WARN,
224     INFO...) - those are typically defined "dynamically" by the developer
225     while defining the severity of a certain scenario. A privacy
226     sensitivity level is defined statically and is part of a log schema,
227     associated with the log name and data type.

229     +------------------+-----------------+----------------+-------------+
230     | Name             | Abstract Data   | Description    | Sensitivity |
231     |                  | Type            |                | [1-High     |
232     |                  |                 |                | 5-Normal]   |
233     +------------------+-----------------+----------------+-------------+
234     | nationalIdentity | String          | National IDs   | 1           |
235     |                  |                 | issued by      |             |
236     |                  |                 | sovereign      |             |
237     |                  |                 | governments.   |             |
238     |                  |                 | E.g., SSN      |             |
239     | drivingLicense   | String          | Driving        | 1           |
240     |                  |                 | License number |             |
241     | taxIdentity      | String          | Tax            | 1           |
242     |                  |                 | identification |             |
243     |                  |                 | numbers        |             |
244     | creditCardNumber | String          | Credit cards   | 1           |
245     | bankAccount      | String          | Bank account   | 1           |
246     |                  |                 | number         |             |
247     | dateOfBirth      | Date            | Date of Birth  | 2           |
248     | personName       | String          | Person name    | 1           |
249     | emailAddress     | String          | Email          | 2           |
250     | phoneNumber      | Number          | Phone          | 1           |
251     | zipCode          | Integer         | Zip codes      | 5           |
252     | ipAddress        | ipv4Address     | IPv4 or IPv6   | 4           |
253     |                  |                 | Address        |             |
254     | dateTimeSeconds  | dateTimeSeconds | seconds        | 5           |
255     | age              | Integer         | Age            | 2           |
256     | ethnicGroup      | String          | Ethnic group   | 1           |
257     | genderIdentity   | String          | Gender         | 1           |
258     |                  |                 | identity       |             |
259     | macAddress       | macAddress      | MAC Address    | 4           |
260     +------------------+-----------------+----------------+-------------+

262                Personal Information Identifiers Registry

264     If an organization already uses structured logging with a log schema,
265     then a privacy sensitivity level can be an additional attribute for
266     the schema.

268     The privacy sensitivity level for log types is intended to be defined
269     by a centralized effort around privacy preservation in logs. In
270     other words, this mapping might be done by an organization's privacy
271     team (which can include lawyers, engineers and privacy
272     professionals). The intention is that all logs generated by an
273     organization should conform to this structured format, which would
274     ease downstream processing of logs for access control and removal of
275     sensitive information.

277     If the log is being generated by a web server, then two approaches
278     can be taken:

280     1.  Modify log-format for the service: identify the log data type of
281         each piece of log data generated, and tag in generation (examples
282         provided in later section)

284     2.  Add automated tagging in a centralized log aggregator: collect
285         all the logs generated by different services and apply the
286         annotation using the log schema at the aggregator

288  5.2.  Typical Workflow

290     1.  The log privacy schema can be parsed into a structured logging
291         library that is used by individual developer teams. The
292         intention is for developers to not log arbitrary data, i.e. they
293         are asked to identify the data type of the state they
294         want to preserve.

296     2.  Any addition to the log schema would have to go through review by
297         the privacy team that came up with the log schema.

299     3.  Once a log is generated, tagged and stored, various kinds of
300         access control techniques can be applied to who can access the
301         logs.

303  5.3.  Log Processing and Access Control

305     1.  Consumer Role Based Access

307         A.  Once the log is tagged, access to it can be based on a
308             consumer's role and privilege level.

310         B.  A consumer role based policy can define what level of
311             sensitivity they can access.

313     2.  Case-based access

315         A.  If there is a genuine case for which access to sensitive
316             information is needed and granted by the legal department, a
317             cryptographically-signed token (e.g., JWT) can be generated
318             that will allow a developer/user access to logs of a
319             higher sensitivity. This access can be temporal in nature,
320             i.e. the token will only be valid for a certain amount of
321             time.

323         B.  A transaction ID can also be propagated automatically
324             throughout the request processing, to correlate different
325             logs related to a single request. Note that the notion of a
326             "request" can vary based on what the application is doing.
327             The idea is to have a single unifying ID to tie a particular
328             action. If this is done, then the temporary token can be
329             restricted to a particular request ID.

331     3.  Redaction Techniques

333         A.  Given that the log is tagged, an organization might choose to
334             redact the more sensitive logs, i.e. ones more sensitive than
335             a chosen sensitivity level or ones of a certain log type.

337         B.  More sophisticated approaches can be developed, e.g.
338             completely redact log types username and email, but obfuscate
339             IP address so that a rough location can be garnered from the
340             log record. In this way, techniques such as differential
341             privacy can be used in tandem to have privacy guarantees for
342             logs while still providing usefulness to developers.

344  6.  Examples

346     An example based on RFC 3164 Log format

348     Normal Log Output

350     <120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0 event="AF-Authority
351     failure" violation="A-Not authorized to object" actual_type="AF-A"
352     jrn_seq="1001363" timestamp="20120418163258988000"
353     job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
354     err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875"
355     action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY"
356     val_jobno="256937" object="TEST" object_library="CUS9242"
357     object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]

359     Log Output with Personal Information Tagging

361     <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
362     failure" violation="A-Not authorized to object" actual_type="AF-A"
363     jrn_seq="1001363" timestamp="20120418163258988000"
364     job_name="QPADEV000B" {personName="XYZZY" pii_sensitivity_level=1}
365     job_number="256937" {emailAddress="xyz@foo.com"
366     pii_sensitivity_level=2} {ip_addr="10.0.1.21"
367     pii_sensitivity_level=4} port="55875" action="Undefined(x00)"
368     val_job="QPADEV000B" val_jobno="256937" object="TEST"
369     object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr=""
370     workstation=""]

372  7.  IANA Considerations

374     IANA can consider defining a new central repository for Personal
375     Information name and identifier registries to be used in logging
376     personal information.
        The personal identifier registry would
377     enumerate names and identifiers as described in Section 5.1.

379  8.  Security Considerations

381     It is anticipated that developers will want additional log data types
382     for capturing application logic, and might abuse an existing log type
383     instead of going through the process of adding a new one. In such a
384     case, the log would be incorrectly tagged. This can be mitigated by
385     having stronger typing for the log data types, e.g. restricting an
386     address to a certain string length instead of storing arbitrary
387     length.

389     Encouraging developers to think carefully about what kind of data
390     they're logging is a good practice and will lead to fewer incidents
391     of private data being inadvertently logged. An organization might
392     choose to have an unstructured log type for letting developers log
393     data that truly does not fit anywhere else. This is still better than
394     not having structured privacy-aware logging, because the potential
395     privacy leakage is isolated to one particular field and its use can
396     be monitored.

398     Having a mapping from log data type to privacy sensitivity will need
399     continuous effort by a privacy team, which might be expensive for an
400     organization.

402     Log data is often collated, propagated, transformed, loaded into
403     different formats or data models for purposes of analytics,
404     troubleshooting and visualization. In such cases, it is necessary
405     and critical to ensure that personal information tagging and
406     annotations are preserved and forwarded across format transformations.

408     If the privacy marking or classification changes for a log, then for
409     historical logs the change of privacy classification is applied on
410     subsequent access of the log.

412     *TODO*: In case of logs that are not tagged or marked with personal
413     information, an out-of-band mechanism to communicate log template or
414     schema with personal data identifiers can be considered.
        Such a
415     mechanism can also be used to notify changes to privacy tagging or
416     classification.

418  9.  Acknowledgements

420     The authors would like to thank everyone who provided helpful
421     comments at the mic at IETF 106 during the PEARG session. Thanks
422     also to Joe Salowey for thoughts on aspects of log transformations,
423     change of privacy classifications, models for privacy marking.

425  10.  Normative References

427     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
428                Requirement Levels", BCP 14, RFC 2119,
429                DOI 10.17487/RFC2119, March 1997.

432     [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
433                DOI 10.17487/RFC3164, August 2001.

436     [RFC6973]  Cooper, A., Tschofenig, H., Aboba, B., Peterson, J.,
437                Morris, J., Hansen, M., and R. Smith, "Privacy
438                Considerations for Internet Protocols", RFC 6973,
439                DOI 10.17487/RFC6973, July 2013.

442  Authors' Addresses

444     Sandeep Rao
445     Grab
446     Bangalore
447     India

449     Email: sandeeprao.ietf@gmail.com

451     Santhosh C N
452     Grab

454     Email: santoshcn1@gmail.com

456     Shivan Sahib
457     Salesforce

459     Email: shivankaulsahib@gmail.com

460     Ryan Guest
461     Salesforce

463     Email: rguest@salesforce.com
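
Illustrative sketch (informal, not part of the proposal)

   As an informal illustration of the schema in Section 5.1, the
   redaction ideas in Section 5.3 and the tagged example in Section 6,
   the following Python sketch tags fields that appear in a small schema
   and then masks tagged values for a consumer with a given clearance.
   It is a minimal sketch under stated assumptions: the names SCHEMA,
   tag, redact and the clearance parameter are hypothetical, the schema
   is only a subset of the registry, level 1 is assumed to be the most
   sensitive (as in the registry table), and the
   {name="value" pii_sensitivity_level=N} wrapper simply mirrors the
   tagged example.

```python
import re

# Hypothetical subset of the Section 5.1 registry: identifier -> sensitivity.
# Assumption: 1 is the most sensitive level and 5 the least, as in the table.
SCHEMA = {"personName": 1, "emailAddress": 2, "ipAddress": 4, "zipCode": 5}


def tag(name, value):
    """Render one log field; fields found in SCHEMA get the {...} wrapper."""
    level = SCHEMA.get(name)
    if level is None:
        # Not a registered personal-data identifier: emit it untagged.
        return f'{name}="{value}"'
    return f'{{{name}="{value}" pii_sensitivity_level={level}}}'


# Matches a tagged field such as {personName="XYZZY" pii_sensitivity_level=1}.
TAGGED = re.compile(r'\{(\w+)="([^"]*)" pii_sensitivity_level=(\d)\}')


def redact(line, clearance):
    """Mask tagged values that are more sensitive than the consumer's clearance.

    A field is masked when its level is numerically lower than `clearance`
    (assumed policy: a consumer cleared for level N may read levels N..5).
    """
    def mask(match):
        name, value, level = match.group(1), match.group(2), int(match.group(3))
        if level < clearance:
            value = "REDACTED"
        return f'{{{name}="{value}" pii_sensitivity_level={level}}}'

    return TAGGED.sub(mask, line)


# Build a tagged line from a few of the fields used in the Section 6 example.
line = " ".join([
    tag("job_name", "QPADEV000B"),
    tag("personName", "XYZZY"),
    tag("emailAddress", "xyz@foo.com"),
    tag("ipAddress", "10.0.1.21"),
])

print(redact(line, clearance=4))
```

   With a clearance of 4, the person name and email address are masked
   while the IP address (level 4) and the untagged job name remain
   visible; a clearance of 1 leaves the line unchanged. Masking is only
   one possible transformation; an implementation could equally hash or
   generalize a value instead.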