TOC 
IPFIX Working GroupE. Boschi
Internet-DraftB. Trammell
Intended status: ExperimentalHitachi Europe
Expires: January 15, 2009July 14, 2008


IP Flow Anonymisation Support
draft-boschi-ipfix-anon-01.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on January 15, 2009.

Abstract

This document describes anonymisation techniques for IP flow data. It provides a categorization of common anonymisation schemes and defines the parameters needed to describe them. It describes support for anonymization within the IPFIX protocol, providing the basis for the definition of information models for configuring anonymisation techniques within an IPFIX Metering or Exporting Process, and for reporting the technique in use to an IPFIX Collecting Process.



Table of Contents

1.  Introduction
    1.1.  IPFIX Protocol Overview
    1.2.  IPFIX Documents Overview
2.  Terminology
3.  Categorisation of Anonymisation Techniques
4.  Anonymisation of IP Flow Data
    4.1.  IP Address Anonymisation
        4.1.1.  Truncation
        4.1.2.  Random Permutations
        4.1.3.  Prefix-preserving Pseudonymisation
    4.2.  Timestamp Anonymisation
        4.2.1.  Precision Degradation
        4.2.2.  Enumeration
        4.2.3.  Random Time Shifts
    4.3.  Counter Anonymisation
        4.3.1.  Precision Degradation
        4.3.2.  Binning
        4.3.3.  Random Noise Addition
    4.4.  Anonymisation of Other Flow Fields
5.  Parameters for the Description of Anonymisation Techniques
6.  Anonymisation Support in IPFIX
7.  Security Considerations
8.  IANA Considerations
9.  References
    9.1.  Normative References
    9.2.  Informative References
§  Authors' Addresses
§  Intellectual Property and Copyright Statements




 TOC 

1.  Introduction

The standardisation of an IP flow information export protocol [RFC5101] (Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” January 2008.) and associated representations removes a technical barrier to the sharing of IP flow data across organizational boundaries and with network operations, security, and research communities for a wide variety of purposes. However, with wider dissemination comes greater risks to the privacy of the users of networks under measurement, and to the security of those networks. While it is not a complete solution to the issues posed by distribution of IP flow information, anonymisation is an important tool for the protection of privacy within network measurement infrastructures.

This document presents a mechanism for representing anonymised data within IPFIX and guidelines for using it. It begins with a categorization of anonymisation techniques. It then describes applicability of each technique to commonly anonymisable fields of IP flow data, organized by information element data type and semantics as in [RFC5102] (Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, “Information Model for IP Flow Information Export,” January 2008.); enumerates the parameters required by each of the applicable anonymisation techniques; and provides guidelines for the use of each of these techniques in accordance with best practices in data protection. Finally, it specifies a mechanism for exporting anonymised data and binding anonymisation metadata to templates using IPFIX Options.



 TOC 

1.1.  IPFIX Protocol Overview

In the IPFIX protocol, { type, length, value } tuples are expressed in templates containing { type, length } pairs, specifying which { value } fields are present in data records conforming to the Template, giving great flexibility as to what data is transmitted. Since Templates are sent very infrequently compared with Data Records, this results in significant bandwidth savings. Various different data formats may be transmitted simply by sending new Templates specifying the { type, length } pairs for the new data format. See [RFC5101] (Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” January 2008.) for more information.

The IPFIX information model (Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, “Information Model for IP Flow Information Export,” January 2008.) [RFC5102] defines a large number of standard Information Elements which provide the necessary { type } information for Templates. The use of standard elements enables interoperability among different vendors' implementations. Additionally, non-standard enterprise-specific elements may be defined for private use.



 TOC 

1.2.  IPFIX Documents Overview

"Specification of the IPFIX Protocol for the Exchange of IP Traffic Flow Information" (Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” January 2008.) [RFC5101] and its associated documents define the IPFIX Protocol, which provides network engineers and administrators with access to IP traffic flow information.

"Architecture for IP Flow Information Export" (Sadasivan, G. and N. Brownlee, “Architecture Model for IP Flow Information Export,” October 2003.) [I‑D.ietf‑ipfix‑arch] defines the architecture for the export of measured IP flow information out of an IPFIX Exporting Process to an IPFIX Collecting Process, and the basic terminology used to describe the elements of this architecture, per the requirements defined in "Requirements for IP Flow Information Export" (Quittek, J., Zseby, T., Claise, B., and S. Zander, “Requirements for IP Flow Information Export (IPFIX),” October 2004.) [RFC3917]. The IPFIX Protocol document [RFC5101] (Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” January 2008.) then covers the details of the method for transporting IPFIX Data Records and Templates via a congestion-aware transport protocol from an IPFIX Exporting Process to an IPFIX Collecting Process.

"Information Model for IP Flow Information Export" (Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, “Information Model for IP Flow Information Export,” January 2008.) [RFC5102] describes the Information Elements used by IPFIX, including details on Information Element naming, numbering, and data type encoding. Finally, "IPFIX Applicability" (Zseby, T., “IPFIX Applicability,” July 2007.) [I‑D.ietf‑ipfix‑as] describes the various applications of the IPFIX protocol and their use of information exported via IPFIX, and relates the IPFIX architecture to other measurement architectures and frameworks.

This document references the Protocol and Architecture documents for terminology and extends the IPFIX Information Model to provide new Information Elements for anonymisation metadata.



 TOC 

2.  Terminology

Terms used in this document that are defined in the Terminology section of the IPFIX Protocol (Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” January 2008.) [RFC5101] document are to be interpreted as defined there.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.) [RFC2119].



 TOC 

3.  Categorisation of Anonymisation Techniques

Anonymisation modifies a data set in order to protect the identity of the people or entities described by the data set from disclosure. With respect to network traffic data, anonymisation generally attempts to preserve some set of properties of the network traffic useful for a given application or applications, while ensuring the data cannot be traced back to the specific networks, hosts, or users generating the traffic.

Anonymisation may be broadly split into three categories: generalisation and reversible or irreversible substitution. When generalisation is used, identifying information is grouped in sets, and one single value is used to identify each set element. In effect, this causes multiple records to become indistinguishable, thereby aggregating them together. Generalisation is an irreversible operation, in that the information needed to identify a single record from its "generalised value" is lost.

Substitution (or pseudonymization) maps the real space of identifiers or values into a separate, replacement space, using some substitution function. If the substitution function is invertible or can otherwise be reversed, then the substitution is reversible, and a real identifier can be recovered from a given replacement identifier. This allows to keep different elements distinguishable from each other: the number of different elements in the real and the replacement space is the same.

Irreversible substitution results when a randomising or one-way function is used to map the value space; real identifiers cannot be recovered in an irreversible substitution. The number of different elements in the real and replacement spaces is not necessarily the same.



 TOC 

4.  Anonymisation of IP Flow Data

Due to the restricted semantics of IP flow data, there are a relatively limited set of specific anonymisation techniques available on flow data, though each falls into the broad categories above. Each type of field that may commonly appear in a flow record may have its own applicable specific techniques.

While anonymisation is generally applied at the resolution of single fields within a flow record, attacks against anonymisation use entire flows and relationships between hosts and flows within a given data set. Therefore, fields which may not necessarily be identifying by themselves may be anonymised in order to increase the anonymity of the data set as a whole.

Of all the fields in an IP flow record, only IP addresses directly identify entities in the real world. Each IP address is associated with an interface on a network host, and can potentially be identified with a single user. Additionally, IP addresses are structured identifiers; that is, partial IP address prefixes may be used to identify networks just as full IP addresses identify hosts. This makes anonymisation of IP addresses particularly important.

Port numbers identify abstract entities (applications) as opposed to real-world entities, but they can be used to classify hosts and user behavior. Passive port fingerprinting, both of well-known and ephemeral ports, can be used to determine the operating system running on a host. Relative data volumes by port can also be used to determine the host's function (workstation, web server, etc.); this information can be used to identify hosts and users.

While not identifiers in and of themselves, timestamps and counters can reveal the behavior of the hosts and users on a network. Any given network activity is recognizable by a pattern of relative time differences and data volumes in the associated sequence of flows, even without host address information. They can therefore be used to identify hosts and users. Timestamps and counters are also vulnerable to traffic injection attacks, where traffic with a known pattern is injected into a network under measurement, and this pattern is later identified in the anonymised data set.

The simplest and most extreme form of anonymisation, which can be applied to any field of a flow record, is black-marker anonymisation, or complete deletion of a given field. While black-marker anonymisation completely protects the data in the deleted fields from the risk of disclosure, it also reduces the utility of the anonymised data set as a whole. Techniques that retain some information while reducing (though not eliminating) the disclosure risk will be extensively discussed in the following sections; note that the techniques specifically applicable to IP addresses, timestamps, and counters will be discussed in separate sections.



 TOC 

4.1.  IP Address Anonymisation

The following table gives an overview of the schemes for IP address anonymization described in this document and their categorization.

SchemeActionReversibility
Truncation Generalisation N
Random Permutation Substitution Y/N
Prefix-preserving Pseudonymisation Substitution Y

Note that random permutations might be either reversible or not, depending on the function used.



 TOC 

4.1.1.  Truncation

Truncation removes "n" of the least significant bits from an IP Address. Note that truncating 8 bits would replace an IP Address with the corresponding class C network address.



 TOC 

4.1.2.  Random Permutations

When random permutations are used, each IP Address is replaced with a random permutation on the set of possible IP Addresses. The permutation function can be implemented using hash tables.



 TOC 

4.1.3.  Prefix-preserving Pseudonymisation

Prefix-preserving pseudonymisation preserves the structure of IP Addresses. If two IP Addresses match on a prefix of "n" bits, their anonymised versions will match on a prefix of "n" bits too.



 TOC 

4.2.  Timestamp Anonymisation

[TODO: introductory text]

SchemeActionReversibility
Precision Degradation Generalisation N
Enumeration Substitution Y
Random Shifts Substitution Y



 TOC 

4.2.1.  Precision Degradation

Precision Degradation removes the most precise components of a timestamp, accounting all events occurring in each given interval (e.g. one millisecond for millisecond level degradation) as simultaneous. This has the effect of potentially collapsing many timestamps into one. With this technique time precision is reduced, and sequencing may be lost, but the information at which time the event happened is kept.



 TOC 

4.2.2.  Enumeration

Enumeration keeps the chronological order in which events occurred while eliminating time information. Timestamps are substituted by equidistant timestamps (or numbers) starting from an rendomly chosen start value.



 TOC 

4.2.3.  Random Time Shifts

Random Time Shifts keep the information on how far apart two events are from each other. This is achieved by shifting all timestamps by the same random number. Note that random time shifts also preserve chronological order.



 TOC 

4.3.  Counter Anonymisation

Counters (such as packet and octet volumes per flow) are subject to fingerprinting and injection attacks against anonymisation, as timestamps are, but relative magnitudes of activity can be useful for certain analysis tasks. [TODO: more intro text]

SchemeActionReversibility
Precision Degradation Generalisation N
Binning Generalisation N
Random noise addition Substitution N



 TOC 

4.3.1.  Precision Degradation

As with precision degradation in timestamps, precision degradation of counters removes lower-order bits of the counters, treating all the counters in a given range as having the same value. Depending on the precision reduction, this loses information about the relationships between sizes of similarly-sized flows, but keeps relative magnitude information.



 TOC 

4.3.2.  Binning

Binning can be seen as a special case of precision degradation; the operation is identical, except for in precision degradation the counter ranges are uniform, and in binning they need not be. For example, a common counter binning scheme for packet counters could be to bin values 1-2 together, and 3-infinity together, thereby separating potentially completely-opened TCP connections from unopened ones. Binning schemes are generally chosen to keep precisely the amount of information required in a counter for a given analysis task



 TOC 

4.3.3.  Random Noise Addition

Random noise addition adds a random amount to a counter in each flow; this is used to keep relative magnitude information and minimize the disruption to size relationship information while avoiding fingerprinting attacks against anonymization.



 TOC 

4.4.  Anonymisation of Other Flow Fields

[TODO: as section 4.1]



 TOC 

5.  Parameters for the Description of Anonymisation Techniques

[TODO: see corresponding section of draft-ietf-psamp-sample-tech for the proposed structure of this section.]



 TOC 

6.  Anonymisation Support in IPFIX

[TODO: Here we'll describe how the information specified above can be transmitted on the wire using an option template. The idea is to scope the option to the Template ID and for each field specify which are anonymised, providing info on the output characteristics of the technique, and which ones aren't.]

[EDITOR'S NOTE: Multiple anon. techniques applied on an IE at the same time is indicated with multiple elements of the same type (in application order as in PSAMP)]

[EDITOR'S NOTE: for blackmarking we'll recommend not to export the information at all following the data protection law principle that only necessary information should be exported.]



 TOC 

7.  Security Considerations

[TODO: write this section.]



 TOC 

8.  IANA Considerations

This document contains no actions for IANA.



 TOC 

9.  References



 TOC 

9.1. Normative References

[RFC5101] Claise, B., “Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information,” RFC 5101, January 2008 (TXT).
[RFC5102] Quittek, J., Bryant, S., Claise, B., Aitken, P., and J. Meyer, “Information Model for IP Flow Information Export,” RFC 5102, January 2008 (TXT).


 TOC 

9.2. Informative References

[I-D.ietf-ipfix-arch] Sadasivan, G. and N. Brownlee, “Architecture Model for IP Flow Information Export,” draft-ietf-ipfix-arch-02 (work in progress), October 2003 (TXT).
[I-D.ietf-ipfix-as] Zseby, T., “IPFIX Applicability,” draft-ietf-ipfix-as-12 (work in progress), July 2007 (TXT).
[I-D.ietf-ipfix-architecture] Sadasivan, G., “Architecture for IP Flow Information Export,” draft-ietf-ipfix-architecture-12 (work in progress), September 2006 (TXT).
[I-D.ietf-ipfix-reducing-redundancy] Boschi, E., “Reducing Redundancy in IP Flow Information Export (IPFIX) and Packet Sampling (PSAMP) Reports,” draft-ietf-ipfix-reducing-redundancy-04 (work in progress), May 2007 (TXT).
[RFC3917] Quittek, J., Zseby, T., Claise, B., and S. Zander, “Requirements for IP Flow Information Export (IPFIX),” RFC 3917, October 2004 (TXT).
[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997 (TXT, HTML, XML).


 TOC 

Authors' Addresses

  Elisa Boschi
  Hitachi Europe
  c/o ETH Zurich
  Gloriastrasse 35
  8092 Zurich
  Switzerland
Phone:  +41 44 632 70 57
Email:  elisa.boschi@hitachi-eu.com
  
  Brian Trammell
  Hitachi Europe
  c/o ETH Zurich
  Gloriastrasse 35
  8092 Zurich
  Switzerland
Phone:  +41 44 632 70 13
Email:  brian.trammell@hitachi-eu.com


 TOC 

Full Copyright Statement

Intellectual Property