Internet Draft                                                          
  Document: <draft-ietf-psamp-sample-tech-04.txt>                T. Zseby 
  Expires: August 2004                                   Fraunhofer FOKUS 
                                                                M. Molina 
                                                          NEC Europe Ltd. 
                                                               F. Raspall 
                                                          NEC Europe Ltd. 
                                                              N. Duffield 
                                                     AT&T Labs - Research 
                                                                          
                                                            February 2004 
   
   
     Sampling and Filtering Techniques for IP Packet Selection 
   
   
  Status of this Memo 
     This document is an Internet-Draft and is in full conformance 
     with all provisions of Section 10 of RFC2026.  
      
     Internet-Drafts are working documents of the Internet 
     Engineering Task Force (IETF), its areas, and its working 
     groups. Note that other groups may also distribute working 
     documents as Internet-Drafts. Internet-Drafts are draft 
     documents valid for a maximum of six months and may be updated, 
     replaced, or obsoleted by other documents at any time. It is 
     inappropriate to use Internet-Drafts as reference material or to 
     cite them other than as "work in progress."  
      
     The list of current Internet-Drafts can be accessed at 
     http://www.ietf.org/ietf/1id-abstracts.txt  
      
     The list of Internet-Draft Shadow Directories can be accessed at 
     http://www.ietf.org/shadow.html. 
      
  Abstract 
      
     This document describes sampling and filtering techniques for IP 
     packet selection. In two information models (one for sampling, 
     one for filtering) it defines what parameters are needed to 
     describe the most common selection schemes and shows how 
     techniques can be combined to build more elaborate packet 
     selectors. The information models are used for configuring the 
     selection technique in measurement processes and for reporting 
     the technique in use to a collector. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004     [Page 1] 

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
  Table of Contents 
   
     1.   Introduction.................................................2 
     2.   Terminology..................................................3 
     2.1  General Terminology..........................................3 
     2.2  Packet Selection Terminology.................................6 
     3.   Scope and Deployment of Packet Selection Techniques..........8 
     3.1  Sampling....................................................10 
     3.1.1 Systematic Sampling........................................11 
     3.1.2 Random Sampling............................................12 
     3.1.2.1 n-out-of-N Sampling......................................12 
     3.1.2.2 Probabilistic Sampling...................................12 
     3.1.2.2.1 Uniform Probabilistic Sampling.........................12 
     3.1.2.2.2 Non-Uniform Probabilistic Sampling.....................12 
     3.1.2.2.3 Non-Uniform Flow State Dependent Sampling..............13 
     3.1.2.2.4 Configuration of non-uniform probabilistic and flow-
     state sampling...................................................13 
     4.   Filtering...................................................14 
     4.1  Mask/Match filtering........................................14 
     4.2  Hash-based Selection........................................15 
     4.2.1 Application Examples for Hash-based Selection..............16 
     4.2.1.1 Approximation of Random Sampling.........................16 
     4.2.1.2 Consistent Packet Selection..............................16 
     4.2.2 Guarding Against Pitfalls and Vulnerabilities..............17 
     4.2.3 Considerations and Recommendations for Hash-functions......18 
     4.2.4 IP Shift-XOR (IPSX) hash function..........................19 
     4.2.5 "Bob" hash function........................................20 
     4.3  Router State filtering......................................23 
     5.   Input Parameters and Information Models.....................24 
     5.1 Information Model for Sampling Techniques....................25 
     5.2 Information Model for Filtering Techniques...................26 
     6.   Composite Techniques........................................29 
     6.1 Cascaded filtering->sampling or sampling->filtering..........29 
     6.2 Stratified Sampling..........................................30 
     7.   Security Considerations.....................................30 
     8.   References..................................................31 
     9.   Author's Addresses..........................................33 
     10.  Intellectual Property Statement.............................34 
     11.  Full Copyright Statement....................................34 
   
  1. Introduction 
      
     Increasing data rates and growing measurement demands increase 
     the requirements for data collection resources. High packet 
     rates in backbone networks load measurement processes. Demands 
     for fine granular results (e.g. per flow analysis) require 
     performant and flexible classification algorithms, which are 
     usually resource extensive. Furthermore, some measurement 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 2] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     methods require the capturing of packet headers or even parts of 
     the payload. All this can lead to an overwhelming amount of 
     measurement data, resulting in high demands regarding resources 
     for measurement, storage, transport and post processing.  
   
     In some cases specialized hardware helps to fulfill these 
     demands but on the other hand increases the costs for providing 
     the measurement. Since measurements are mainly a supporting 
     functionality for the service provisioning, measurement costs 
     should be limited to a small fraction of the costs of the 
     network service provisioning itself. A reduction of the data 
     that is considered and reported by a measurement process is 
     crucial to prevent the depletion of the available (i.e. the 
     affordable) resources. Such a reduction can be achieved by a 
     reasonable deployment of packet selection techniques, that 
     sample a subset of the packets while still allowing an 
     appropriate accuracy, or filter out all packets that are not of 
     interest for the measurement at all. Packet selection helps to 
     prevent an exhaustion of resources and to limit the measurement 
     costs. Examples for applications that benefit from packet 
     selection are given in [Du04]. 
   
  2. Terminology 
      
     The PSAMP terminology resulted from joint discussions of the 
     authors of this document together with the authors of [Du04]. 
     Therefore all terms used throughout this document represent the 
     common understanding of the authors of both documents and are 
     consistent with those defined in [Du04]. Furthermore, it is 
     aimed at consistency with the terms used in [QuZC03]. 
   
  2.1 General Terminology 
      
     * Observation Point: a location in the network where a packet 
        stream is observed. Examples include: 
         
          (i) a line to which a probe is attached; 
           
          (ii) a shared medium, such as an Ethernet-based LAN; 
           
          (iii) a single port of a router, or set of interfaces 
          (physical or logical) of a router; 
           
          (iv) an embedded measurement subsystem within an interface.  
           
     * Observed Packet Stream: the set of all packets observed at the 
        observation point. 
         

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 3] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     * Packet Stream: either the observed packet stream, or a subset 
        of it. 
         
        Note that packets selected from a stream, e.g. by sampling, do 
        not necessarily possess a property by which they can be 
        distinguished from packets that have not been selected. For 
        this reason the term ôstreamö is favored over ôflowö, which is 
        defined as set of packets with common properties [QuZC02]. 
         
     * Selection Process: takes a packet stream as its input and 
        selects a subset of that stream as its output. 
         
     * Packet Content: the union of the packet header (which includes 
        link layer, network layer and other encapsulation headers) and 
        the packet payload. 
         
     * Selection State: a selection process may maintain state 
        information for use by the selection process and/or the 
        reporting process. At a given time, the selection state may 
        depend on packets observed at and before that time, and other 
        variables. Examples include: 
          
          (i) sequence numbers of packets at the input of selectors; 
      
          (ii) a timestamp of observation of the packet at the 
     observation points; 
      
          (iii) iterators for pseudorandom number generators; 
       
          (iv) hash values calculated during selection; 
    
          (v) indicators of whether the packet was selected by a 
          given selector; 
           
        Selection processes may change portions of the selection state 
        as a result of processing a packet. 
         
     * Selector: the component that performs a selection process on a 
        single packet of its input. A selected packet becomes an 
        element of the output packet stream of the selection process. 
         
        The selector can make use of the following information in 
        determining whether a packet is selected: 
         
          (i) the packetÆs content; 
      
          (ii) information derived from the packet's treatment at the 
     observation point; 
      

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 4] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
          (iii) any selection state that may be maintained by the 
          selection process. 
           
     * Composite Selection Process: an ordered composition of 
        selection processes, in which the output stream issuing from 
        one component forms the input stream for the succeeding 
        component.  
         
     * Primitive Selection Process: a selection process that is not a 
        composite selection process. 
   
     * Composite Selector: the selector of a composite selection 
        process. 
      
     * Primitive Selector: the selector of a primitive selection 
        process. 
   
     * Reporting Process: creates a report stream on packets selected 
        by a selection process, in preparation for export. The input 
        to the reporting process comprises that information available 
        to the selection process per selected packet, specifically: 
         
          (i) the selected packetÆs content or parts of it; 
           
          (ii) information derived from the selected packet's 
          treatment at the observation point; 
           
          (iii) any selection state maintained by the inputting 
          selection process, reflecting any modifications to the 
          selection state made during selection of the packet. 
      
     * Report Stream: the output of a reporting process is a report 
        stream, comprising two distinguished types of information: 
        packet reports, and report interpretation. 
      
     * Packet Reports: a configurable subset of the per packet input 
        to the reporting process.  
      
     * Report Interpretation: subsidiary information relating to one 
        or more packets, that is used for interpretation of their 
        packet reports. Examples include configuration parameters of 
        the selection process and of the reporting process. 
      
     * Measurement Process: the composition of a selection process 
        that takes the observed packet stream as its input, followed 
        by a reporting process. 
      
     * Export Process: sends the output of one or more reporting 
        processes to one or more collectors. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 5] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     * Collector: a collector receives a report stream exported by 
        one or more export processes. In some cases, the host of the 
        measurement and/or export processes may also serve as the 
        collector. 
      
     * Export packets: one or more packet reports, and perhaps report 
        interpretation, are bundled by the export process into a 
        export packet for export to a collector. 
    
     Various possibilities for the high level architecture of these 
     elements are as follows. 
      
         MP = Measurement Process, EP = Export Process 
      
        +---------------------+                 +------------------+ 
        |Observation Point(s) |                 | Collector(1)     | 
        |MP(s)--->EP----------+---------------->|                  |     
        |MP(s)--->EP----------+-------+-------->|                  | 
        +---------------------+       |         +------------------+ 
                                      |     
        +---------------------+       |         +------------------+ 
        |Observation Point(s) |       +-------->| Collector(2)     | 
        |MP(s)--->EP----------+---------------->|                  | 
        +---------------------+                 +------------------+ 
                                         
        +---------------------+          
        |Observation Point(s) |          
        |MP(s)--->EP---+      |          
        |              |      |          
        |Collector(3)<-+      | 
        +---------------------+   
      
     The PSAMP measurement process can be viewed as analogous to the 
     IPFIX metering process. The PSAMP measurement process takes an 
     observed packet stream as its input, and produces packet reports 
     as its output. The IPFIX metering process produces flow records 
     as its output. The distinct name ômeasurement processö has been 
     retained in order to avoid potential confusion in settings where 
     IPFIX and PSAMP coexist, and in order to avoid the implicit 
     requirement that the PSAMP version satisfy the requirements of 
     an IPFIX metering process (at least while these are under 
     development). The relation between PSAMP and IPFIX is further 
     discussed in [QC03]. 
   
  2.2 Packet Selection Terminology. 
      
     * Filtering: a filter is a selection operation that selects a 
        packet deterministically based on the packet content, its 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 6] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
        treatment, and functions of these occurring in the selection 
        state. Examples include mask/match filtering, and hash-based 
        selection. 
      
     * Sampling: a selection operation that is not a filter is called 
        a sampling operation. This reflects the intuitive notion that 
        if the selection of a packet cannot be determined from its 
        content alone, there must be some type of sampling taking 
        place.  
      
     * Content-independent Sampling: a sampling operation that does 
        not use packet content (or quantities derived from it) as the 
        basis for selection is called a content-independent sampling 
        operation. Examples include systematic sampling, and uniform 
        pseudorandom sampling driven by a pseudorandom number whose 
        generation is independent of packet content. Note that in 
        content-independent sampling it is not necessary to access the 
        packet content in order to make the selection decision. 
      
     * Content-dependent Sampling: a sampling operation where 
        selection is dependent on packet content is called a content-
        dependent sampling operation. Examples include pseudorandom 
        selection according to a probability that depends on the 
        contents of a packet field. Note that this is not a filter , 
        because the selection is not deterministic.. 
   
     * Hash Domain: a subset of the packet content and the packet 
        treatment, viewed as an N-bit string for some positive integer 
        N. 
      
     * Hash Range: a set of M-bit strings for some positive integer 
        M. 
      
     * Hash Function: a deterministic map from the hash domain into 
        the hash range. 
      
     * Hash Selection Range: a subset of the hash range. The packet 
        is selected if the action of the hash function on the hash 
        domain for the packet yields a result in the hash selection 
        range. 
      
     * Hash-based Selection: filtering specified by a hash domain, a 
        hash function, and hash range and a hash selection range. 
   
     * Approximative Selection: selection operations in any of the 
        above categories may be approximated by operations in the same 
        or another category for the purposes of implementation. For 
        example, uniform pseudorandom sampling may be approximated by 
        hash-based selection, using a suitable hash function and hash 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 7] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
        domain. In this case, the closeness of the approximation 
        depends on the choice of hash function and hash domain. 
      
     * Population size: the number of all packets in the packet 
        stream (or subset) for which a metric should be estimated  
      
     * Sample size: the number of packets selected from the 
        population by a selection operation. 
   
     * Attained Selection Frequency: the actual frequency with which 
        packets are selected by a selection process. The attained 
        sampling frequency is calculated as ratio of the size of a 
        sample size to the size of the population from which it was 
        selected.  
         
     * Target Selection Frequency: the long-term frequency with which 
        packets are expected to be selected, based on selector 
        parameter settings. Depending on the selector, the target 
        selection frequency may be count-based or time-based. 
         
        Due to the inherent statistical variability of sampling 
        decisions, the target and attained selection frequencies can 
        differ (e.g. for probabilistic sampling and hash-based 
        selection). Nevertheless, for large population sizes and 
        properly configured sampling schemes the attained selection 
        frequency usually approaches the target selection frequency. 
        In hash-based selection, the target selection frequency is the 
        quotient of size of the hash selection range by the size of 
        the hash range. 
   
  3. Scope and Deployment of Packet Selection Techniques 
   
     Packet selection techniques generate a subset of packets from an 
     Observed Packet Stream at an observation point. We distinguish 
     between sampling and filtering. 
      
     Sampling is targeted at the selection of a representative subset 
     of packets. The subset is used to infer knowledge about the 
     whole set of observed packets without processing them all. The 
     selection can depend on packet position, and/or on packet 
     content, and/or on (pseudo) random decisions.  
   
     Filtering selects a subset with common properties. This is used 
     if only a subset of packets is of interest. The properties can 
     be directly derived from the packet content, or depend on the 
     treatment given by the router to the packet. Filtering is a 
     deterministic operation. It depends on packet content or router 
     treatment. It never depends on packet position or on (pseudo) 
     random decisions. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 8] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     Note that a common technique to select packets is to compute a 
     Hash Function on some bits of the packet header and/or content 
     and to select it if the Hash Value falls in  the Hash Selection 
     Range. Since hashing is a deterministic operation on the packet 
     content, it is a filtering technique according to our 
     categorization. Nevertheless, hash functions are sometimes used 
     to approximate random sampling. Depending on the chosen input 
     bits, the Hash Function and the Hash Selection Range, this 
     technique can be used to approximate the random selection of 
     packets with a given probability p. It is also a powerful 
     technique to consistently select the same packet subset at 
     multiple observation points [DuGr00] 
      
     The following table gives an overview of the schemes described 
     in this document and their categorization. An X in brackets (X) 
     denotes schemes for which also content-independent variants 
     exist. It easily can be seen that only schemes with both 
     properties, content dependence and deterministic selection, are 
     considered as filters.  
      
            Selection Scheme   | deterministic | content- | Category 
                               |  selection    | dependent|           
       ------------------------+---------------+----------+---------- 
        systematic             |       X       |     _    | Sampling  
        count-based            |               |          | 
       ------------------------+---------------+----------+---------- 
        systematic             |       X       |     -    | Sampling 
        time-based             |               |          | 
       ------------------------+---------------+----------+---------- 
        random                 |       -       |     -    | Sampling 
        n-out-of-N             |               |          | 
       ------------------------+---------------+----------+---------- 
        random                 |       -       |     -    | Sampling 
        uniform probabilistic  |               |          | 
       ------------------------+---------------+----------+---------- 
        random                 |       -       |    (X)   | Sampling 
        non-uniform probabil.  |               |          | 
       ------------------------+---------------+----------+---------- 
        random                 |       -       |    (X)   | Sampling 
        non-uniform flow-state |               |          | 
       ------------------------+---------------+----------+---------- 
        mask/match filter      |       X       |     X    | Filter 
       ------------------------+---------------+----------+---------- 
        hash function          |       X       |     X    | Filter 
       ------------------------+---------------+----------+---------- 
        router state filter    |       X       |    (X)   | Filter  


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 9] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
       ------------------------+---------------+----------+---------- 
      
     The introduced categorization is mainly useful for the 
     definition of an information model describing Primitive 
     Selectors . More complex selection techniques can be described 
     through the composition of cascaded sampling and filtering 
     operations. 
     For example, a packet selection that weights the selection 
     probability on the basis of the packet length can be described 
     as a cascade of a filter and a sampling scheme. However, this 
     descriptive approach is not intended to be rigid: if a common 
     and consolidated selection practice turns out to be too complex 
     to be described as a composition of the mentioned building 
     blocks, an ad hoc description can be specified instead and added 
     as a new scheme to the information model. 
      
     Packet selectors can be part of an IPFIX metering process and 
     can also use the IPFIX exporting process. This is expressed as 
     association to one or more IPFIX processes. 
   
  3.1 Sampling 
   
     The deployment of sampling techniques aims at the provisioning 
     of information about a specific characteristic of the parent 
     population at a lower cost than a full census would demand. In 
     order to plan a suitable sampling strategy it is therefore 
     crucial to determine the needed type of information and the 
     desired degree of accuracy in advance. 
      
     First of all it is important to know the type of metric that 
     should be estimated. The metric of interest can range from 
     simple packet counts [JePP92] up to the estimation of whole 
     distributions of flow characteristics (e.g. packet 
     sizes)[ClPB93].  
   
     Secondly, the required accuracy of the information and with 
     this, the confidence that is aimed at, should be known in 
     advance. For instance for usage-based accounting the required 
     confidence for the estimation of packet counters can depend on 
     the monetary value that corresponds to the transfer of one 
     packet. That means that a higher confidence could be required 
     for expensive packet flows (e.g. premium IP service) than for 
     cheaper flows (e.g. best effort). The accuracy requirements for 
     validating a previously agreed quality can also vary extremely 
     with the customer demands. These requirements are usually 
     determined by the service level agreement (SLA). 
      
   
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 10] 
   
  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     The sampling method and the parameters in use must be clearly 
     communicated to all applications that use the measurement data. 
     Only with this knowledge a correct interpretation of the 
     measurement results can be ensured.  
      
     Sampling methods can be characterized by the sampling algorithm, 
     the trigger type used for starting a sampling interval and the 
     length of the sampling interval. These parameters are described 
     here in detail. The sampling algorithm describes the basic 
     process for selection of samples. In accordance to [AmCa89] and 
     [ClPB93] we define the following basic sampling processes: 
      
      3.1.1       Systematic Sampling 
   
     Systematic sampling describes the process of selecting the start 
     points and the duration of the selection intervals according to 
     a deterministic function. This can be for instance the periodic 
     selection of every k-th element of a trace but also the 
     selection of all packets that arrive at pre-defined points in 
     time. Even if the selection process does not follow a periodic 
     function (e.g. if the time between the sampling intervals varies 
     over time) we consider this as systematic sampling as long as 
     the selection is deterministic.  
     The use of systematic sampling always involves the risk of 
     biasing the results. If the systematics in the sampling process 
     resemble systematics in the observed stochastic process 
     (occurrence of the characteristic of interest in the network), 
     there is a high probability that the estimation will be biased. 
     Systematics in the observed process might not be known in 
     advance.  
     Here only equally spaced schemes are considered, where triggers 
     for sampling are periodic, either in time or in packet count. 
     All packets occurring in a selection interval (either in time or 
     packet count) beyond the trigger are selected.  
   
     Systematic count-based 
     In systematic count-based sampling the start and stop triggers 
     for the sampling interval are defined in accordance to the 
     spatial packet position (packet count). 
   
     Systematic time-based 
     In systematic count-based sampling the start and stop triggers 
     for the sampling interval are defined in accordance to the 
     temporal packet position (arrival time). 
   
     Both schemes are contentûindependent selection schemes. Content 
     dependent deterministic selectors are categorized as filter. 
      

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 11] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
  3.1.2   Random Sampling 
   
     Random sampling selects the starting points of the sampling 
     intervals in accordance to a random process. The selection of 
     elements are independent experiments. With this, unbiased 
     estimations can be achieved. In contrast to systematic sampling, 
     random sampling requires the generation of random numbers. One 
     can differentiate two methods of random sampling: 
      
  3.1.2.1 n-out-of-N Sampling 
   
     In n-out-of-N sampling n elements are selected out of the parent 
     population that consists of N elements. One example would be to 
     generate n different random numbers in the range [1,N] and 
     select all packets which have a packet position equal to one of 
     the random numbers. For this kind of sampling the sample size n 
     is fixed.  
      
  3.1.2.2 Probabilistic Sampling 
      
     In probabilistic sampling the decision whether an element is 
     selected or not is made in accordance to a pre-defined selection 
     probability. An example would be to flip a coin for each packet 
     and select all packets for which the coin showed the head. For 
     this kind of sampling the sample size can vary for different 
     trials. The selection probability does not necessarily has to be 
     the same for each packet. Therefore we distinguish between 
     uniform probabilistic sampling (with the same selection 
     probability for all packets) and non-uniform  probabilistic 
     sampling (where the selection probability can vary for different 
     packets). 
      

  3.1.2.2.1      Uniform Probabilistic Sampling 
      
     For Uniform Probabilistic  Sampling packets are selected 
     independently with a uniform probability p. This sampling can be 
     count-driven, and is sometimes referred to as geometric random 
     sampling, since the difference in count between successive 
     selected packets are independent random variables with a 
     geometric distribution of mean 1/p. A time-driven analog, 
     exponential random  sampling, has the time between triggers 
     exponentially distributed. 
     Both geometric and exponential random sampling are examples of 
     what is known as additive random sampling, defined as sampling 
     where the intervals or counts between successive samples are 
     independent identically distributed random variable. 

  3.1.2.2.2      Non-Uniform Probabilistic Sampling 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 12] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     This is a variant of Probabilistic Sampling in which the 
     sampling probabilities can depend on the selection process 
     input. This can be used to weight sampling probabilities in 
     order e.g. to boost the chance of sampling packets that are rare 
     but are deemed important. Unbiased estimators for quantitative 
     statistics are recovered by renormalization of sample values; 
     see [HT52]. 
      

  3.1.2.2.3      Non-Uniform Flow State Dependent Sampling  
   
     Another type of sampling that can be classified as probabilistic 
     Non-Uniform  is closely related to the flow concept as defined 
     in [QuZC02], and it is only used jointly with a flow monitoring 
     function (IPFIX monitoring function). Packets are selected,  
     dependent on a selection state. The point, here, is that the 
     selection state is determined also by the state of the flow the 
     packet belongs to and/or by the state of the other flows 
     currently being monitored by the associated flow monitoring 
     function. An example for such an algorithm is the ösample and 
     holdö method described in [EsVa01]: 
      
     - If a packet accounts for a flow record that already exists in 
        the IPFIX flow recording process, it is selected (i.e. the 
        flow record is updated) 
     - If a packet doesn't account to any existing flow record, it is 
        selected with probability p. If it has been selected a new 
        flow record has to be created. 
      
     A further algorithm that fits into the category of non-uniform 
     flow state dependent sampling is described in [Moli03]. 
      
     This type of sampling is content dependent because the 
     identification of the flow the packet belongs to requires 
     analyzing part of the packet content. If the packet is selected, 
     then it is passed as an input to the IPFIX monitoring function 
     (this is called öLocal Exportö in [Du04] 
     Selecting the packet depending on the state of a flow cache is 
     useful when memory resources of the flow monitoring function are 
     scarce (i.e. thereÆs no room to keep all the flows that have 
     been scheduled for monitoring). See [MolFl03] for a more 
     detailed description of the motivations for this type of 
     sampling and the impact on the IPFIX metering. 
      

  3.1.2.2.4      Configuration of non-uniform probabilistic and 
         flow-state sampling 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 13] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     Many different specific methods can be grouped under the terms 
     non-uniform probabilistic and flow state sampling. Dependent on 
     the sampling goal and the implemented scheme, a different number 
     and type of input parameters is required to configure such 
     scheme. 
      
     Some concrete proposals for such methods exist from the research 
     community (e.g. [EsVa01],[DuLT01],[Moli03]). All of these 
     proposals are still in an early stage and need further 
     investigations to prove their usefulness and applicability.  
      
     Since we donÆt want give preference to one of the existing early 
     stage methods we here only describe the basic methods and leave 
     the specification of explicit schemes and their parameters up to 
     vendors (e.g. as extension of the information model). 
   
      
  4. Filtering  
      
     Filtering is the deterministic selection of packets based on the 
     packet content, the treatment of the packet at the observation 
     point, or deterministic functions of these occurring in the 
     selection state. The packet is selected if these quantities fall 
     into a specified range. 
     The role of filtering, as the word itself suggest, is to 
     separate all the packets having a certain property from those 
     not having it. A distinguishing characteristic from sampling is 
     that the property never depends on the packet position in time 
     or in the space, or on a random process. 
     We identify and describe in the following three filtering 
     techniques. The first two (Mask/Match filtering and Hashing 
     filtering) are stateless, and therefore can make their decision 
     based on the analysis of portion of the packet only. The other 
     (router state filtering) requires to access state information 
     after the analysis of part of the packet and is therefore more 
     complex: its usage makes sense only in particular circumstances, 
     as described below. 
      
  4.1 Mask/Match filtering 
   
     This type of filtering selects a packet operating as follows:  
      
     A number of bit positions are chosen in accordance to the 
     filtering goal. A mask is applied with a logical AND to the 
     incoming packet to keep only the chosen bits. The result of this 
     operation is then compared to a predefined single value (e.g. a 
     specific source IP address), a set of values or a range. The 
     packet is selected in accordance to the result of this 
     comparison. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 14] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     The selected bits of the packet arenÆt necessarily only those of 
     the IP header. If  bits of the IP header and bits of the payload 
     are considered, the masks and the selection intervals MUST be 
     specified separately for the header and the payload. 
     An implementation is not constrained to apply exactly all the 
     steps or in this sequence, provided that the result is 
     equivalent to a logical function doing it. 
     Examples of filters of this class are filters that select 
     packets based on the matching of some of the IP header fields 
     with a (possibly masked) specific value, filters that select 
     packets having some IP header fields values falling within a 
     range, filters that do the same as above on some of the 
     transport header fields (that are thus considered as part of the 
     payload), or a combination of all of the above mentioned 
     possibilities. 
      
   
  4.2 Hash-based Selection  
      
     A Hash Function h maps the packet content c, or some portion of 
     it, onto a Hash Range R. The packet is selected if h(c) is an 
     element of S, which is a subset of R called the Hash Selection 
     Range. Thus hash-based sampling is indeed a particular case of 
     filtering: the object is selected if c is in inv(h(S)). But for 
     desirable Hash Functions the inverse image inv(h(S)) will be 
     extremely complex, and hence h would not be expressible as, say, 
     a match/mask filter or a simple combination of these. 
     Hash-based selection has mainly two types of usage: it offers a 
     way to approximate random sampling by using packet content to 
     generate pseudorandom variates and a way to consistently select 
     subsets of packets that share a common property (e.g. at 
     different observation points).  
     In the following subsections we give more details about them. 
     However, both usages require that the Hash Functions has two 
     statistical properties. First, the hash function h must have 
     good mixing properties, in the sense that small changes in the 
     input (e.g. the flipping of a single bit) cause large changes in 
     the output (many bits change). Then any local clump of values of 
     c is spread widely over R by h, and so the distribution of h(c) 
     is fairly uniform even if the distribution of c is not. Then the 
     sampling rate is #S/#R, which can be tuned by choice of S. If S 
     and R are sets contiguous integers, h(c), suitably shifted and 
     normalized, can be interpreted as a pseudorandom variate. The 
     second desirable property depends more closely on the statistics 
     of the content c. In applications, the content c comprises a 
     number of distinct fields, c1 ... cm, e.g. source and 
     destination IP Address, IP identification, and TCP/UDP port 
     numbers (if present) for a packet. With a hash function 
     satisfying the first properties above, selection decisions will 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 15] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     appear uncorrelated with the contents of any individual field, 
     if the complementary fields are (i) sufficiently variable 
     themselves, and (ii) sufficiently uncorrelated with cj. 
      
      4.2.1       Application Examples for Hash-based Selection 
      
  4.2.1.1 Approximation of Random Sampling 
      
     Although pseudorandom number generators with well understood 
     properties have been developed, they may not be the method of 
     choice in setting where computational resources are scarce. A 
     convenient alternative is to use hash functions of packet 
     content as a source of randomness. The hash (suitably 
     renormalized) is a pseudorandom variate in the interval [0,1]. 
     Other schemes may use packet fields in iterators for 
     pseudorandom numbers. 
     The point here, is that the statistical properties of the 
     idealized packet selection law (such as independence of sampling 
     decisions for different packets, or independence on packet 
     content) may not be exactly shared by an implementation, but 
     only approximately so. 
     Although the selection decisions for non-uniform probabilistic 
     sampling (see Section 3.1.2.2.2 above) also depend on the packet 
     content, this form of sampling is distinguished from the use of 
     packet content to generate variates. In the former case, the 
     content only determines the selection probabilities: selection 
     could then proceed e.g by use of a variates obtained by an 
     independent pseudorandom number generator. In the latter case, 
     the content determines the pseudorandom variates rather than the 
     probabilities. 
      
      
  4.2.1.2 Consistent Packet Selection 
      
     In Consistent packet selection, all routers in a network hash 
     parts of the packet content using identical hash function and 
     selection range. The domain of the hash is restricted to those 
     parts of the packet that are invariant from hop to hop. Fields 
     such as Time-to-Live, which is decremented per hop, and header 
     CRC, which is recalculated per hop, are thus excluded from the 
     hash domain. Thus, a given packet is selected at either all 
     points on its path through the network, or at none. The domain 
     of the hash function needs to be wider than just a flow key, if 
     packets are to be selected quasi-randomly within flows (and e.g. 
     include portions of the payload), see [DuGr00].  
      
     If a report on each selected packet is exported to a collector, 
     the collector can reconstruct trajectories of the selected 
     packets, provided it can match different reports on the same 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 16] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     packet, and distinguish these from reports on other packets. For 
     this purpose a second distinct hash function may be used to 
     generate a label or digest of each selected packet for inclusion 
     in its packet report. The benefit of using a digest is reduction 
     in reporting bandwidth. Routers need only report the digest 
     provided at least one router (say an edge router) sends a full 
     packet report containing packet fields of interest in addition 
     to the digest. A digest size of 32 bits has been found 
     sufficient to distinguish packets; see [DuGeGr02].  
      
      
     Applications of consistent packet selection  include (i) 
     estimation of the network path matrix, i.e., the traffic 
     intensities according to network path, broken down by flow; (ii) 
     detection of routing loops, as indicated by self-intersecting 
     trajectories; (iii) passive performance measurement: prematurely 
     terminating trajectories indicate packet loss, packet one way 
     delay can be determined if reports include (synchronized) 
     timestamps; (iv) network attack tracing, through determination 
     of the actual paths taken by attack packets with spoofed source 
     addresses. 
      
      4.2.2       Guarding Against Pitfalls and Vulnerabilities 
      
     A concern is whether some large set of related packets could be 
     sampled at a rate that significantly differs from the expected 
     sampling rate, either (i) through unanticipated behavior in the 
     hash function, or (ii) because the packets had been deliberately 
     crafted to have this property. 
     The first point underlines the importance of using a hash 
     function with good mixing properties. Examples of such are CRC32 
     and hash functions based on modular arithmetic, see 6.4 in 
     [Knuth98]. The statistical properties of candidate hash 
     functions need to be evaluated, preferably on packet before 
     adoption for hash-based sampling 
     Hash sampling could be overloaded (or evaded) by an attacker if 
     the hash function and the selection rate are both known. A 
     service provider could build a first defense keeping the Hash 
     Selection Range S private. Then, an attacker could not determine 
     whether a crafted packet is selected, but he would still know 
     that a crafted set of packets all with the same hash is either 
     all selected or all not selected. Moreover, when applications 
     (like multi domain trajectory sampling, or one way delay 
     estimation across multiple domains) may require multiple 
     administrative entities to agree on a common hash function and 
     selection range, mutual trust between the entities cannot be 
     avoided and just keeping S secret may not be feasible. A 
     stronger defense is to employ a parametrizable hash function and 
     keep the parameter private: in this case, the set of hash values 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 17] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     of the packets could not be determined. Examples of parameters 
     are the initial vector in CRC32, and moduli in hashes based on 
     modular arithmetic. 
      
      4.2.3       Considerations and Recommendations for Hash 
           Functions. 
   
     For hash-based selection, the most important requirements for a 
     hash function are simplicity of implementation and speed of 
     operation, followed by uniformity of hash distribution. 
     Simplicity promotes wider implementation and the ability to 
     operate at line speed. Uniformity of hash promotes the 
     approximation of random sampling by hash-based selection.  
      
     For a packet digesting hash, uniformity of hash distribution and 
     a small collision probability are the most important 
     requirements. Since the digest need only be computed over the 
     substream of selected packets, a digesting hash does not have 
     the same speed requirements as a sampling hash. Thus digesting 
     enables and benefits from the use of more complex hash functions 
     than hash-based selection.  
   
     The properties of a number of different hash functions were 
     evaluated and compared. On this basis, the following 
     recommendations are made for PSAMP hash functions.  
      
     When hash-based packet selection is supported, the following 
     hash functions MUST be offered: 
   
     * The IPSX hash function; see Section 4.2.4 below. IPSX is a 
        fast hash function with good uniformity properties, and is 
        intended for hash-based selection. It operates with fixed 
        length input tailored for IPv4 packets, and produces a 16 bit 
        output, enabling sampling down to rates of 1 in 65536. 
         
         
     * The CRC32 hash function; see [ISO3309]. CRC32 has small 
        collision probabilities, its 32 bit output being sufficiently 
        large to function as a digest. Its use as a selection hash is 
        not precluded, however it is roughly an order of magnitude 
        slower than IPSX and so is not expected to function as widely 
        for this purpose. It can accept a variable length input and so 
        provides a potential future path for hash-based selection of  
        packets of protocols other than IPv4.  
      
     Other hash functions MAY be provided. A candidate hash function 
     is the BOB hash function described in Section 4.2.5 below. Its 
     performance in speed, uniformity and collisions are comparable  
     with (and slightly superior to) CRC32. As remarked in Section 9 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 18] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     of [Du04], the MD5 hash function may be precluded for ubiquitous 
     use at full line rate, at least for hash-based selection. 
   
      4.2.4        IP Shift-XOR (IPSX) hash function 
            
     The IPSX hash function is tailored for acting on IP version 4 
     packets. It exploits the structure of IP packet and in 
     particular the variability expected to be exibited within 
     different fields of the IP packet in order to furnish a hash 
     value with little apparent correlation with individual packet 
     fields. Fields from the IPv4 and TCP/UDP headers are used as 
     input. The IPSX hash function uses a small number of simple 
     instructions. 
      
     Input parameters: None 
      
     Built-in parameters: None 
      
     Output: The output of the IPSX is a 16 bit number 
      
     Functioning:  
     The functioning can be divided into two parts: input selection, 
     which forms are composite input from various portions of the IP 
     packet, followed by computation of the hash on the composite. 
      
     Input Selection: 
     The raw input is drawn from the first 20 bytes of the IP packet 
     header and the first 8 bytes of the IP payload. If IP options 
     are not used, the IP header has 20 bytes, and hence the two 
     portions adjoin and comprise the first 28 bytes of the IP 
     packet. We now use the raw input as 4 32-bit subportions of 
     these 28 bytes. We specify the input by bit offsets from the 
     start of IP header or payload. 
      
     f1 = bits 32 to 63 of the IP header, comprising the IP     
          identification field, flags, and fragment offset. 
         
     f2 = bits 96 to 127 of the IP header, the source IP address. 
         
     f3 = bits 128 to 159 of the IP header, the destination IP  
          address. 
      
     f4 = bits 32 to 63 of the IP payload. For a TCP packet, f4  
          comprises the TCP sequence number followed by the message 
          length. For a UDP packet f4 comprises the UDP checksum. 
      
     Hash Computation: 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 19] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     The hash is computed from f1, f2, f3 and f4 by a combination of 
     XOR (^), right shift (>>) and left shift (<<) operations. The 
     intermediate quantities h1, v1, v2 are 32-bit numbers. 
      
            1.    v1 = f1 ^ f2; 
            2.    v2 = f3 ^ f4;   
            3.    h1 = v1 << 8; 
            4.    h1 ^= v1 >> 4; 
            5.    h1 ^= v1 >> 12;  
            6.    h1 ^= v1 >> 16; 
            7.    h1 ^= v2 << 6; 
            8.    h1 ^= v2 << 10; 
            9.    h1 ^= v2 << 14; 
            10.   h1 ^= v2 >> 7; 
      
     The output of the hash is the least significant 16 bits of h1. 
   
      
      4.2.5       "Bob" hash function  
   
    
     "Bob" hash function is a hash function designed for having each 
     bit of the input affecting every bit of the return value and 
     using both 1-bit and 2-bit deltas to achieve the so called 
     avalanche effect [Jenk97]. This function was originally built 
     for hash table lookup with fast software implementation.  
            
     Input Parameters:  
     The input parameters of such a function are:  
     - the length of the input string (key) to be hashed, in    
     bytes. The elementary input blocks of Bob hash are the single 
     bytes, therefore no padding is needed.  
     - an init value (an arbitrary 32-bit number).  
            
     Built in parameters:  
     The Bob Hash uses the following built-in parameter:        
     - the golden ratio (an arbitrary 32-bit number used in the hash  
     function computation: its purpose is to avoid mapping all zeros 
     to all zeros);  
            
     Note: the mix sub-function (see mix (a,b,c) macro in the 
     reference code in 3.2.4) has a number of parameters governing 
     the shifts in the registers. The one presented is not the only 
     possible choice.  
         
     It is an open point whether these may be considered additional  
     built-in parameters to specify at function configuration.  
            
     Output.  


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 20] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     The output of the BOB function is a 32-bit number. It should be 
     specified:  
     - A 32 bit mask to apply to the output  
     - The selection range as a list of non overlapping intervals 
     [start value, end value] where value is in [0,2^32]  
            
     Functioning:  
     The hash value is obtained computing first an initialization of 
     an internal state (composed of 3 32-bit numbers, called a, b, c 
     in the reference code below), then, for each input byte of the 
     key the internal state is combined by addition and mixed using 
     the mix sub-function. Finally, the internal state mixed one last 
     time and the third number of the state (c) is chosen as the 
     return value.  
            
     typedef  unsigned long  int  ub4;   /* unsigned 4-byte 
     quantities */  
     typedef  unsigned       char ub1;   /* unsigned 1-byte 
     quantities */  
         
     #define hashsize(n) ((ub4)1<<(n))  
     #define hashmask(n) (hashsize(n)-1)  
         
     /* ------------------------------------------------------ 
       mix -- mix 3 32-bit values reversibly.  
       For every delta with one or two bits set, and the deltas of 
     all three high bits or all three low bits, whether the original 
     value of a,b,c is almost all zero or is uniformly distributed,  
       * If mix() is run forward or backward, at least 32 bits in 
     a,b,c have at least 1/4 probability of changing.  
       * If mix() is run forward, every bit of c will change between 
     1/3 and 2/3 of the time.  (Well, 22/100 and 78/100 for some 2-
     bit deltas.) mix() was built out of 36 single-cycle latency 
     instructions in a structure that could supported 2x parallelism, 
     like so:  
             a -= b;  
             a -= c; x = (c>>13);  
             b -= c; a ^= x;  
             b -= a; x = (a<<8);  
             c -= a; b ^= x;  
             c -= b; x = (b>>13);  
             ...  
     Unfortunately, superscalar Pentiums and Sparcs can't take 
     advantage of that parallelism.  They've also turned some of 
     those single-cycle latency instructions into multi-cycle latency 
     instructions  
       
     ------------------------------------------------------------*/  
         

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 21] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
       #define mix(a,b,c)  \  
       { \  
         a -= b; a -= c; a ^= (c>>13); \  
         b -= c; b -= a; b ^= (a<<8); \  
         c -= a; c -= b; c ^= (b>>13); \  
         a -= b; a -= c; a ^= (c>>12);  \  
         b -= c; b -= a; b ^= (a<<16); \  
         c -= a; c -= b; c ^= (b>>5); \  
         a -= b; a -= c; a ^= (c>>3);  \  
         b -= c; b -= a; b ^= (a<<10); \  
         c -= a; c -= b; c ^= (b>>15); \  
       }  
         
       /* -----------------------------------------------------------  
     hash() -- hash a variable-length key into a 32-bit value  
     k       : the key (the unaligned variable-length array of bytes)  
     len     : the length of the key, counting by bytes  
     initval : can be any 4-byte value  
     Returns a 32-bit value.  Every bit of the key affects every bit 
     of the return value.  Every 1-bit and 2-bit delta achieves 
     avalanche. About 6*len+35 instructions.  
         
     The best hash table sizes are powers of 2.  There is no need to 
     do mod a prime (mod is sooo slow!).  If you need less than 32 
     bits, use a bitmask.  For example, if you need only 10 bits, do  
     h = (h & hashmask(10));  
     In which case, the hash table should have hashsize(10) elements.  
         
     If you are hashing n strings (ub1 **)k, do it like this:  
     for (i=0, h=0; i<n; ++i) h = hash( k[i], len[i], h);  
         
     By Bob Jenkins, 1996.  bob_jenkins@burtleburtle.net.  You may 
     use this code any way you wish, private, educational, or 
     commercial.  It's free.  
         
    See http://burtleburtle.net/bob/hash/evahash.html  
     Use for hash table lookup, or anything where one collision in 
     2^^32 is acceptable.  Do NOT use for cryptographic purposes.  
      ----------------------------------------------------------- */  
         
       ub4 bob_hash(k, length, initval)  
       register ub1 *k;        /* the key */  
       register ub4  length;   /* the length of the key */  
       register ub4  initval;  /* an arbitrary value */  
       {  
          register ub4 a,b,c,len;  
         
          /* Set up the internal state */  
          len = length;  


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 22] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
          a = b = 0x9e3779b9; /*the golden ratio; an arbitrary value 
     */ 
          c = initval;         /* another arbitrary value */  
         
     /*------------------------------------ handle most of the key */  
         
          while (len >= 12)  
          {  
             a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16) 
     +((ub4)k[3]<<24));  
             b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16) 
     +((ub4)k[7]<<24));  
             c += (k[8] +((ub4)k[9]<<8) 
     +((ub4)k[10]<<16)+((ub4)k[11]<<24));  
             mix(a,b,c);  
             k += 12; len -= 12;  
          }  
         
          /*---------------------------- handle the last 11 bytes */  
          c += length;  
          switch(len)       /* all the case statements fall through*/  
          {  
          case 11: c+=((ub4)k[10]<<24);  
          case 10: c+=((ub4)k[9]<<16);  
          case 9 : c+=((ub4)k[8]<<8);  
             /* the first byte of c is reserved for the length */  
          case 8 : b+=((ub4)k[7]<<24);  
          case 7 : b+=((ub4)k[6]<<16);  
          case 6 : b+=((ub4)k[5]<<8);  
          case 5 : b+=k[4];  
          case 4 : a+=((ub4)k[3]<<24);  
          case 3 : a+=((ub4)k[2]<<16);  
          case 2 : a+=((ub4)k[1]<<8);  
          case 1 : a+=k[0];  
            /* case 0: nothing left to add */  
          }  
          mix(a,b,c);  
          /*-------------------------------- report the result */  
          return c;  
       }  
         
         
  4.3 Router State filtering 
      
     This class of filters select a packet on the basis of router 
     state conditions. The following list gives examples for such 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 23] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     conditions. Conditions can be  combined with AND, OR or NOT 
     operators. 
   
        - Ingress interface at which the packet arrives equals a 
           specified value 
        - Egress interface to which the packet is routed equals a 
           specified value 
        - Packet violated Access Control List (ACL) on the router 
        - Reverse Path Forwarding (RPF) failed for the packet 
        - Resource Reservation is insufficient for the packet 
        - No route found for the packet 
        - Origin AS equals a specified value or lies within a given  
           range 
        - Destination AS equals a specified value or lies within a 
           given range 
         
     Router architectural considerations may preclude some 
     information concerning the packet treatment, e.g. routing state, 
     being available at line rate for selection of packets. However, 
     if selection not based on routing state has reduced down from 
     line rate, subselection based on routing state may be feasible. 
   
  5. Input Parameters and Information Models 
   
     This section gives an overview of different alternative 
     selection schemes and their required parameters. In order to be 
     compliant with PSAMP it is sufficient to implement one of the 
     proposed schemes. 
      
     The decision whether to select a packet or not is based on a 
     function which is performed when the packet arrives at the 
     selection process. Packet selection schemes differ in the input 
     parameters for the selection process and the functions they 
     require to do the packet selection. The following table gives an 
     overview. 
   
          Scheme       |   input parameters     |     functions  
        ---------------+------------------------+------------------- 
         systematic    |    packet position     |  packet counter  
         count-based   |    sampling pattern    |  
        ---------------+------------------------+------------------- 
         systematic    |      arrival time      |  clock or timer 
         time-based    |     sampling pattern   | 
        ---------------+------------------------+------------------- 
           random      |  packet position       |  packet counter, 
         n-out-of-N    |  sampling pattern      |  random numbers 
                       | (random number list)   | 
        ---------------+------------------------+------------------- 
         uniform       |        sampling        |  random function 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 24] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
         probabilistic |      probability       |    
        ---------------+------------------------+------------------- 
         non-uniform   |e.g. packet position,   | selection function, 
         probabilistic |  packet content(parts) |  probability calc. 
        ---------------+------------------------+------------------- 
         non-uniform   |e.g. flow state,        | selection function, 
         flow-state    |  packet content(parts) |  probability calc. 
        ---------------+------------------------+------------------- 
         mask/match    | packet content(parts)  |  filter function 
        ---------------+------------------------+------------------- 
         hash-based    |  packet content(parts) |  hash function 
        ---------------+------------------------+------------------- 
         router state  |    router state        |   router state 
                       |                        |   discovery 
        ---------------+------------------------+------------------- 
   
      
  5.1 Information Model for Sampling Techniques 
      
     In this section we define the information models for most common 
     sampling techniques. Here the selection function is pre-defined 
     and given by the selector ID.  
      
     Sampler Description: 
          SELECTOR_ID 
          SELECTOR_TYPE 
          SELECTOR_PARAMETERS 
          ASSOCIATIONS 
   
     Where: 
      
     SELECTOR_ID: 
     Unique ID for the packet sampler. The ID can be calculated under 
     consideration of the ASSOCIATIONS and a local ID. 
   
      
     SELECTOR_TYPE 
     For sampling processes the SELECTOR TYPE defines what sampling 
     algorithm is used. 
     Values: Systematic Count-based | Systematic Time-based | Random 
     n-out-of-N | Uniform Probabilistic | Non-uniform Probabilistic | 
     Non-uniform Flow-state 
      
   
     SELECTOR_PARAMETERS 
     For sampling processes the SELECTOR PARAMETERS define the input 
     parameters for the process. Interval length in systematic 
     sampling means, that all packets that arrive in this interval 
     are selected. The spacing parameter defines the spacing in time 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 25] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     or number of packets between the end of one sampling interval 
     and the start of the next succeeding interval. 
   
     Case n out of N: 
        - Population size N, Sample size n 
      
     Case Systematic Time Based: 
        - Interval length (in usec), Spacing (in usec) 
      
     Case Systematic Count Based: 
        - Interval length(in packets), Spacing (in packets) 
      
     Case Uniform Probabilistic (with equal probability per packet): 
        - Sampling probability p 
         
     Case Non-uniform Probabilistic: 
        - Calculation function for sampling probability p (see also 
           section 3.1.2.2.4) 
      
     Case flow state: 
        - Information reported for flow state can be found in 
           [MolFl03](see also section 3.1.2.2.4) 
         
     ASSOCIATIONS 
     The ASSOCIATIONS field describes the observation point and 
     (possibly) the IPFIX processes to which the packet selector is 
     associated. The STREAM ID denotes the origin of the data stream 
     that is input to the selection function. It can be the 
     observation point directly or the ID of another selector. With 
     this it is possible to define combined schemes. If the STREAM ID 
     contains IDs from other selectors, one can derive the original 
     observation point from the selector definitions of these 
     specified selectors. 
      
     Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting 
     process ID, IDs of other associated processes> 
     With STREAM ID: Observation point ID AND List of SELECTOR_IDs 
      
         
  5.2 Information Model for Filtering Techniques 
      
     In this section we define the information models for most common 
     filtering techniques. The information model structure closely 
     parallels the one presented for the sampling techniques. 
      
     Filter Description: 
          SELECTOR_ID 
          SELECTOR_TYPE 
          SELECTOR_PARAMETERS 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 26] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
          ASSOCIATIONS 
   
     Where: 
      
     SELECTOR_ID: 
     Unique ID for the packet filter. The ID can be calculated under 
     consideration of the ASSOCIATIONS and a local ID. 
      
     SELECTOR_TYPE 
     For filtering processes the SELECTOR TYPE defines what filtering 
     type is used. 
     Values: Matching | Hashing | Router_state 
      
     SELECTOR_PARAMETERS 
     For filtering processes the SELECTOR PARAMETERS define formally 
     the common property of the packet being filtered. For the 
     filters of type Matching and Hashing the definitions have a lot 
     of points in common. 
     Values: 
     Case Matching 
        - <Header type = ip v4 > 
        - <bit specification, header part> 
        - <Selection interval specification, header part> 
        - <Header type = ipv6> 
        - <bit specification, header part> 
        - <Selection interval specification, header part> 
        - <payload byte number N> 
        - <bit specification, payload part> 
        - <Selection interval specification, payload part> 
      
     Notes to Case Matching: 
      
        - The filter can be defined for the header part only, for the 
           payload part only or for both. In the latter case the 
           matching must be an AND of the two. 
        - The bit specification, for the header part, can be 
           specified for ipv4 or ipv6 only, or both 
        - In case of ipv4, the bit specification is a sequence of 20 
           Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask 
           to be applied to the header 
        - In case of ipv6, it is a sequence of 40 Hexadecimal numbers 
           [00,FF] specifying a 40 bytes bitmask to be applied to the 
           header 
        - The bit specification, for the payload part, is a sequence 
           of Hexadecimal numbers [00,FF] specifying the bitmask to be 
           applied to the first N bytes of the payload, as specified 
           by the previous field. In case the Hexadecimal number 
           sequence is longer then N, only the first N numbers are 
           considered. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 27] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
        - In case the payload is shorter than N, the packet will not 
           match the filter Other options, like padding with zeros, 
           may be considered in the future. 
        - The selection interval specification is a list of non 
           overlapping intervals [intv_begin, intv_end] where 
           intv_begin, intv_end are bit strings of length 20*8 (ipv4 
           case), 40*8 (ipv6 case), N*8 (payload case). 
        - A filter cannot be defined on the options field of the ipv4 
           header, neither on stacked headers of ipv6. 
        - This specification doesnÆt preclude the future definition 
           of a high level syntax for defining in a concise way bit 
           selection and matching rules in a more human readable form 
           (e.g. ôTCP port in [2000,3000]ö). The requirement is that 
           such a syntax can be univoquely compiled into the one 
           defined above 
      
     Case Hashing: 
        - <Header type = ipv4> 
        - <Input bit specification, header part> 
        - <Header type =  ipv6> 
        - <Input bit specification, header part> 
        - <payload byte number N> 
        - <Input bit specification, payload part> 
        - <additional hash input bits (seed)> 
        - Hashing function specification 
             - Hash function name  
             - Length of input key (eliminate 0x bytes) 
             - Output value (length M and bitmask) 
             - Selection interval specification, as a list of non 
               overlapping intervals [start value, end value] where 
               value is in [0,2^M-1] 
             - Additional parameters dependent on specific hash 
               function 
   
     Notes to Case Hashing: 
      
        - On Input bit specifications fields, the same notes on bit 
           specifications of the Matching case reported above apply 
        - Some hash functions, their detailed functioning and their 
           specific parameter list are described in [NiMD03]. 
   
     Case Router State: 
   
        - Ingress interface at which the packet arrives equals a 
           specified value 
        - Egress interface to which the packet is routed equals a 
           specified value 
        - Packet violated Access Control List (ACL) on the router 
        - Reverse Path Forwarding (RPF) failed for the packet 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 28] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
        - Resource Reservation is insufficient for the packet 
        - No route found for the packet 
        - Origin AS equals a specified value or lies within a given  
           range 
        - Destination AS equals a specified value or lies within a 
           given range 
   
     Note to Case Router State: 
        - All Router state entries can be linked by AND, OR, NOT 
           operators 
         
     ASSOCIATIONS 
     The ASSOCIATIONS field describes the observation point and 
     (possibly) the IPFIX processes to which the packet selector is 
     associated. The STREAM ID denotes the origin of the data stream 
     that is input to the selection function. It can be the 
     observation point directly or the ID of another selector. With 
     this it is possible to define combined schemes. If the STREAM ID 
     contains IDs from other selectors, one can derive the original 
     observation point from the selector definitions of these 
     specified selectors. 
      
     Values: <STREAM ID, IPFIX Metering process ID, IPFIX Exporting 
     process ID, IDs of other associated processes> 
     With STREAM ID: Observation point ID AND List of SELECTOR_IDs 
      
      
  6. Composite Techniques  
      
     Composite schemes are realized by using the STREAM ID in the 
     information models. The STREAM ID denotes from which selectors 
     the input stream originates. If multiple stream IDs are given, 
     this means that the selector operates on the packet stream 
     simply resulting from the time superposition of the output of 
     all the listed filters and samplers. Some examples of composite 
     schemes are reported below. 
      
  6.1 Cascaded filtering->sampling or sampling->filtering 
   
     If a filter precedes a sampling process the role of filtering is 
     to create a set of ôparent populationsö from a single stream 
     that can then be fed independently to different sampling 
     functions, with different parameters tuned for the population 
     itself (e.g. if streams of different intensity result from 
     filtering, it may be good to have different sampling rates). If 
     filtering follows a sampling process, the same sampling rate and 
     type is applied to the whole stream, independently of the 
     relative size of the streams resulting from the filtering 
     function. Moreover, also packets not destined to be selected in 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 29] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     the filtering operation will ôloadö the sampling function. So, 
     in principle, filtering before sampling allows a more accurate 
     tuning of the sampling procedure, but if filters are too complex 
     to work at full line rate (e.g. because they have to access 
     router state information), sampling before filtering may be a 
     need. 
      
  6.2 Stratified Sampling  
      
     Stratified sampling is one example for using a composite 
     technique. The basic idea behind stratified sampling is to 
     increase the estimation accuracy by using a-priori information 
     about correlations of the investigated characteristic with some 
     other characteristic that is easier to obtain. The a-priori 
     information is used to perform an intelligent grouping of the 
     elements of the parent population. With this a higher estimation 
     accuracy can be achieved with the same sample size or the sample 
     size can be reduced without reducing the estimation accuracy. 
      
     Stratified sampling divides the sampling process into multiple 
     steps. First, the elements of the parent population are grouped 
     into subsets in accordance to a given characteristic. This 
     grouping can be done in multiple steps. Then samples are taken 
     from each subset.  
      
     The stronger the correlation between the characteristic used to 
     divide the parent population (stratification variable) and the 
     characteristic of interest (for which an estimate is sought 
     after), the easier is the consecutive sampling process and the 
     higher is the stratification gain. For instance if the dividing 
     characteristic were equal to the investigated characteristic, 
     each element of the sub-group would be a perfect representative 
     of that characteristic. In this case it would be sufficient to 
     take one arbitrary element out of each subgroup to get the 
     actual distribution of the characteristic in the parent 
     population. Therefore stratified sampling can reduce the costs 
     for the sampling process (i.e. the number of samples needed to 
     achieve a given level of confidence). 
   
     For stratified sampling one has to specify classification rules 
     for grouping the elements into subgroups and the sampling scheme 
     that is used within the subgroups. The classification rules can 
     be expressed by multiple filters. For the sampling scheme within 
     the subgroups the parameters have to be specified as described 
     above. The use of stratified sampling methods for measurement 
     purposes is described for instance in [ClPB93] and [Zseb03]. 
      
  7. Security Considerations 
   

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 30] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     Malicious users or attackers may be interested to hide packets 
     from service providers or network operators. For instance if 
     packet selectors are used for accounting or intrusion detection 
     applications, users may want to prevent that packets are 
     selected. If a deterministic sampling scheme is used or a 
     selection scheme that takes packet content into account, the 
     user can shape or send packets in a way that they are less 
     likely to be selected (see also section 4.2.2). Even if the 
     selection function is unknown to the user, some insight into the 
     function can be obtained by performing experiments with 
     different packet sequences. This has to be taken into account 
     when choosing an appropriate packet selection technique. 
      
     Further security threats can occur if the configuration of 
     sampling parameters or the communication of sampling parameters 
     to the application is corrupted. This document only describes 
     sampling schemes that can be used for packet selection. It 
     neither describes a mechanism how those parameters are 
     configured nor how these parameters are communicated to the 
     application. Therefore the security threats that originate from 
     this kind of communication cannot be assessed with the 
     information given in this document.  
   
   
  8. References 
      
     [AmCa89]    Paul D. Amer, Lillian N. Cassel: Management of 
                  Sampled Real-Time Network Measurements, 14th 
                  Conference on Local Computer Networks, October 
                  1989, Minneapolis, pages 62-68, IEEE, 1989 
      
     [ClPB93]    K.C. Claffy, George C Polyzos, Hans-Werner Braun: 
                  Application of Sampling Methodologies to Network 
                  Traffic Characterization, Proceedings of ACM 
                  SIGCOMM'93, San Francisco, CA, USA, September 13 - 
                  17, 1993 
      
     [CoGi98]    I. Cozzani, S. Giordano: Traffic Sampling Methods 
                  for end-to-end QoS Evaluation in Large 
                  Heterogeneous Networks. Computer Networks and ISDN 
                  Systems, 30 (16-18), September 1998. 
      
     [Du04]      N.G. Duffield (Ed.), A Framework for Packet 
                  Selection and Reporting, Internet Draft draft-ietf-
                  psamp-framework-05, work in progress, January 2004               
      
     [DuGr00]    N.G. Duffield, M. Grossglauser: Trajectory Sampling 
                  for Direct Traffic Observation, Proceedings of ACM 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 31] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
                  SIGCOMM 2000, Stockholm, Sweden, August 28 - 
                  September 1, 2000. 
      
     [DuGeGr02]   N.G. Duffield, A. Gerber, M. Grossglauser,        
                  Trajectory Engine: A Backend for Trajectory 
                  Sampling, IEEE Network Operations and Management 
                  Symposium 2002, Florence, Italy, April 15-19, 2002. 
      
      
     [DuLT01]    Nick Duffield, Carsten Lund, and Mikkel Thorup: 
                  Charging from Sampled Network Usage, ACM Internet 
                  Measurement Workshop IMW 2001, San Francisco, USA, 
                  November 1-2, 2001 
      
     [EsVa01]    C. Estan and G. Varghese, ôNew Directions in 
                  Traffic Measurement and Accountingö, ACM SIGCOMM 
                  Internet Measurement Workshop 2001, San Francisco 
                  (CA) Nov. 2001 
      
     [HT52]      D.G. Horvitz and D.J. Thompson, A Generalization of 
                  Sampling   without replacement from a Finite 
                  Universe. J. Amer. Statist. Assoc. Vol. 47, pp. 
                  663-685, 1952. 
      
     [ISO3309]    International Organization for Standardization, 
                  "ISO Information Processing Systems - Data 
                  Communication High-Level Data Link Control 
                  Procedure - Frame Structure", IS 3309, October 
                         rd
                  1984, 3  Edition. 
      
      
     [Jenk97]    B. Jenkins: Algorithm Alley, Dr. Dobb's Journal, 
                  September 1997. 
                  http://burtleburtle.net/bob/hash/doobs.html  
      
     [JePP92]    Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic 
                  Estimation for the Largest Sources on a Network, 
                  Using Packet Sampling with Limited Storage, HP 
                  technical report, Managemenr, Mathematics and 
                  Security Department, HP Laboratories, Bristol, 
                  March 1992, 
                  http://www.hpl.hp.com/techreports/92/HPL-92-35.html 
      
     [Knuth98]   Donald E. Knuth: The Art of Computer Programming, 
                  Volume 3: Searching and Sorting, Addison Wesley, 
                  1998  
                    

  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 32] 
   
  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     [MolFl03]   M.Molina: Flow selection support in IPFIX, Internet 
                  Draft <draft-molina-flow-selection-00.txt>, work in 
                  progress, October 2003. 
      
     [Moli03]    M.Molina: A scalable and efficient methodology for 
                  flow monitoring in the internet, International 
                  Teletraffic Congress (ITC-18), Berlin, Sep. 2003 
      
     [NiMD03]    S. Niccolini, M.Molina, N.Duffield: Hash functions 
                  description for packet selection, Internet Draft 
                  <draft-niccolini-hash-descr-00.txt>, work in 
                  progress, October 2003. 
      
     [QuZC03]    J. Quittek, T. Zseby, B. Claise, S. Zander: 
                  Requirements for IP Flow Information Export, 
                  Internet Draft <draft-ietf-ipfix-reqs-13.txt>, work 
                  in progress, December 2003 
      
     [Zseb03]    T. Zseby: Stratification Strategies for Sampling-
                  based Non-intrusive Measurement of One-way Delay. 
                  Passive and Active Measurement Workshop 
                  Proceedings, La Jolla, CA, USA, pp. 171-179, Apr. 
                  2003 
   
      
  9. Author's Addresses 
      
     Tanja Zseby 
     Fraunhofer Institute for Open Communication Systems 
     Kaiserin-Augusta-Allee 31 
     10589 Berlin 
     Germany 
     Phone: +49-30-34 63 7153 
     Fax:   +49-30-34 53 8153 
     Email: zseby@fokus.fhg.de 
      
     Maurizio Molina 
     NEC Europe Ltd., Network Laboratories 
     Adenauerplatz 6 
     69115 Heidelberg 
     Germany 
     Phone: +49 6221 90511-18  
     Email: molina@ccrle.nec.de 
   
     Fredric Raspall 
     NEC Europe Ltd., Network Laboratories 
     Adenauerplatz 6 
     69115 Heidelberg 
     Germany 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 33] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
     Phone: +49 6221 90511-31  
     EMail: raspall@ccrle.nec.de 
      
     Nick Duffield 
     AT&T Labs - Research 
     Room B-139 
     180 Park Ave 
     Florham Park NJ 07932, USA 
     Phone: +1 973-360-8726 
     Email: duffield@research.att.com 
      
  10. 
      Intellectual Property Statement 
   
     AT&T Corporation may own intellectual property applicable to 
     this 
     contribution. The IETF has been notified of AT&T's licensing 
     intent for the specification contained in this document. See 
     http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR 
     statement. 
      
  11. 
      Full Copyright Statement 
   
     Copyright (C) The Internet Society (2002). All Rights Reserved. 
     This document and translations of it may be copied and furnished 
     to others, and derivative works that comment on or otherwise 
     explain it or assist in its implementation may be prepared, 
     copied, published and distributed, in whole or in part, without 
     restriction of any kind, provided that the above copyright 
     notice and this paragraph are included on all such copies and 
     derivative works. However, this document itself may not be 
     modified in any way, such as by removing the copyright notice or 
     references to the Internet Society or other Internet 
     organizations, except as needed for the purpose of developing 
     Internet standards in which case the procedures for copyrights 
     defined in the Internet Standards process must be followed, or 
     as required to translate it into languages other than 
     English.  
      
     The limited permissions granted above are perpetual and will not 
     be revoked by the Internet Society or its successors or assigns. 
      
     This document and the information contained herein is provided 
     on an 
     "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 
     ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 
     IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE 
     OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY 
     IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A 
     PARTICULAR PURPOSE. 


  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 34] 
   

  Internet Draft  Techniques for IP Packet Selection    February 2004 
   
   
  Zseby, Molina, Raspall, Duffield   Expires August 2004  [Page 35]