idnits 2.17.1 draft-ietf-psamp-sample-tech-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 24. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 1734. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1702. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1709. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1715. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 12 instances of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == Line 1844 has weird spacing: '...ong int ub4; ...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 2007) is 6160 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'PSAMP-TECH' is mentioned on line 165, but not defined == Missing Reference: 'IPFIX-PROTO' is mentioned on line 323, but not defined -- Looks like a reference, but probably isn't: '1' on line 1932 == Missing Reference: 'N' is mentioned on line 651, but not defined -- Looks like a reference, but probably isn't: '0' on line 1957 -- Looks like a reference, but probably isn't: '00' on line 1385 == Missing Reference: 'FF' is mentioned on line 1385, but not defined -- Looks like a reference, but probably isn't: '2' on line 1932 -- Looks like a reference, but probably isn't: '4' on line 1953 -- Looks like a reference, but probably isn't: '5' on line 1934 -- Looks like a reference, but probably isn't: '6' on line 1934 -- Looks like a reference, but probably isn't: '8' on line 1936 -- Looks like a reference, but probably isn't: '9' on line 1936 == Outdated reference: A later version (-13) exists of draft-ietf-psamp-framework-11 == Outdated reference: A later version (-11) exists of draft-ietf-psamp-info-06 == Outdated reference: A later version (-09) exists of draft-ietf-psamp-protocol-07 Summary: 1 error (**), 0 flaws (~~), 10 warnings (==), 17 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Draft 3 Document: T. Zseby 4 Intended status: Proposed Standard Fraunhofer FOKUS 5 Expires: December 2007 M. Molina 6 DANTE 7 N. Duffield 8 AT&T Labs-Research 9 S. Niccolini 10 NEC Europe Ltd. 11 F. Raspall 12 EPSC-UPC 14 June 2007 16 Sampling and Filtering Techniques for IP Packet Selection 18 Status of this Memo 20 By submitting this Internet-Draft, each author represents that 21 any applicable patent or other IPR claims of which he or she is 22 aware have been or will be disclosed, and any of which he or she 23 becomes aware will be disclosed, in accordance with Section 6 of 24 BCP 79. 26 Internet-Drafts are working documents of the Internet 27 Engineering Task Force (IETF), its areas, and its working 28 groups. Note that other groups may also distribute working 29 documents as Internet-Drafts. 31 Internet-Drafts are draft documents valid for a maximum of six 32 months and may be updated, replaced, or obsoleted by other 33 documents at any time. It is inappropriate to use Internet- 34 Drafts as reference material or to cite them other than as "work 35 in progress." 37 The list of current Internet-Drafts can be accessed at 38 http://www.ietf.org/ietf/1id-abstracts.txt. 40 The list of Internet-Draft Shadow Directories can be accessed at 41 http://www.ietf.org/shadow.html. 43 This Internet-Draft will expire on December, 2007. 45 Copyright Notice 47 Copyright (C) The IETF Trust (2007). 49 Abstract 51 This document describes Sampling and Filtering techniques for IP 52 packet selection. It provides a categorization of schemes and 53 defines what parameters are needed to describe the most common 54 selection schemes. Furthermore it shows how techniques can be 55 combined to build more elaborate packet Selectors. The document 56 provides the basis for the definition of information models for 57 configuring selection techniques in Metering Processes and for 58 reporting the technique in use to a Collector. 60 Conventions used in this document 62 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 63 NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and 64 "OPTIONAL" in this document are to be interpreted as described 65 in RFC 2119 [RFC2119]. 67 Table of Contents 69 1. Introduction.................................................4 70 2. PSAMP Documents Overview.....................................4 71 3. Terminology..................................................5 72 3.1 Observation Points, Packet Streams and Packet Content.....5 73 3.2 Selection Process.........................................6 74 3.3 Reporting.................................................7 75 3.4 Metering Process..........................................8 76 3.5 Exporting Process.........................................8 77 3.6 PSAMP Device..............................................8 78 3.7 Collector.................................................9 79 3.8 Selection Methods.........................................9 80 4. Categorization of Packet Selection Techniques...............11 81 5. Sampling....................................................13 82 5.1 Systematic Sampling......................................14 83 5.2 Random Sampling..........................................15 84 5.2.1 n-out-of-N Sampling......................................15 85 5.2.2 Probabilistic Sampling...................................15 86 5.2.2.1 Uniform Probabilistic Sampling...........................15 87 5.2.2.2 Non-Uniform Probabilistic Sampling.......................16 88 5.2.2.3 Non-Uniform Flow State Dependent Sampling................16 89 5.2.2.4 Configuration of non-uniform probabilistic and flow- 90 state Sampling..........................................17 91 6. Filtering...................................................17 92 6.1 Property Match Filtering.................................17 93 6.2 Hash-based Filtering.....................................19 94 6.2.1 Application Examples for Hash-based Selection............20 95 6.2.1.1 Approximation of Random Sampling.........................20 96 6.2.1.2 Trajectory Sampling and Consistent Packet Selection......20 97 6.2.2 Security Considerations for Hash Functions...............21 98 6.2.2.1 Vulnerabilities of Hash-based selection without 99 knowledge of selection outcomes.........................22 100 6.2.2.2 Vulnerabilities of Hash-based selection using knowledge 101 of selection outcomes...................................23 102 6.2.2.3 Vulnerabilities to Replay Attacks........................24 103 6.2.3 Choice of Hash-Function..................................25 104 6.2.3.1 Properties of some hash functions........................25 105 6.2.3.2 Hash Functions for Packet Selection......................26 106 6.2.3.3 Hash Functions Suitable for Packet Digesting.............27 107 7. Parameters for the Description of Selection Techniques......27 108 7.1 Description of Sampling Techniques.......................28 109 7.2 Description of Filtering Techniques......................29 110 8. Composite Techniques........................................31 111 8.1 Cascaded Filtering->Sampling or Sampling->Filtering......31 112 8.2 Stratified Sampling......................................32 113 9. Security Considerations.....................................32 114 10. Acknowledgements............................................33 115 11. IANA Considerations.........................................33 116 12. Normative References........................................33 117 13. Informative References......................................33 118 Authors' Addresses...............................................36 119 Intellectual Property Statement..................................37 120 Copyright Statement..............................................37 121 Appendix A: Hash Functions.......................................38 122 A.1 IP Shift-XOR (IPSX) Hash Function............................38 123 A.2 BOB Hash Function............................................39 125 1. Introduction 127 There are two main drivers for the growth in measurement 128 infrastructures and their underlying technology. First, network 129 data rates are increasing, with a concomitant growth in 130 measurement data. Secondly, the growth is compounded by the 131 demand by measurement-based applications for increasingly fine 132 grained traffic measurements. Devices such as routers, which 133 perform the measurements, require increasingly sophisticated and 134 resource intensive measurement capabilities, including the 135 capture of packet headers or even parts of the payload, and 136 classification for flow analysis. All these factors can lead to 137 an overwhelming amount of measurement data, resulting in high 138 demands on resources for measurement, storage, transport and 139 post processing. 141 The sustained capture of network traffic at line rate can be 142 performed by specialized measurement hardware. However, the cost 143 of the hardware and the measurement infrastructure required to 144 accommodate the measurements preclude this as a ubiquitous 145 approach. Instead some form of data reduction at the point of 146 measurement is necessary. This can be achieved by an intelligent 147 packet selection through Sampling, Filtering, or aggregation. 148 The motivation for Sampling is to select a representative subset 149 of packets that allow accurate estimates of properties of the 150 unsampled traffic to be formed. The motivation for Filtering is 151 to remove all packets that are not of interest. Aggregation 152 combines data and allows compact pre-defined views of the 153 traffic. Examples of applications that benefit from packet 154 selection are given in [PSAMP-FW]. Aggregation techniques are 155 out of scope of this document. 157 2. PSAMP Documents Overview 159 [PSAMP-FW]: "A Framework for Packet Selection and Reporting" 160 describes the PSAMP framework for network elements 161 to select subsets of packets by statistical and 162 other methods, and to export a stream of reports 163 on the selected packets to a Collector. 165 [PSAMP-TECH]: "Sampling and Filtering Techniques for IP Packet 166 Selection" (this document) describes the set of 167 packet selection techniques supported by PSAMP. 169 [PSAMP-PROTO]: "Packet Sampling (PSAMP) Protocol Specifications" 170 specifies the export of packet information from a 171 PSAMP Exporting Process to a PSAMP Colleting 172 Process. 174 [PSAMP-INFO]: "Information Model for Packet Sampling Exports" 175 defines an information and data model for PSAMP. 177 3. Terminology 179 The PSAMP terminology defined here is fully consistent with all 180 terms listed in [PSAMP-FW] but includes additional terms 181 required for the description of packet selection methods. An 182 architecture overview and possible configurations of PSAMP 183 elements can be found in [PSAMP-FW]. PSAMP terminology also aims 184 at consistency with terms used in [RFC3917]. The relationship 185 between PSAMP and IPFIX terms is described in [PSAMP-FW]. 187 3.1 Observation Points, Packet Streams and Packet Content 189 * Observation Point 191 An Observation Point is a location in the network where 192 packets can be observed. Examples include: 194 (i) a line to which a probe is attached; 196 (ii) a shared medium, such as an Ethernet-based LAN; 198 (iii) a single port of a router, or set of interfaces 199 (physical or logical) of a router; 201 (iv) an embedded measurement subsystem within an interface. 203 Note that one Observation Point may be a superset of several 204 other Observation Points. For example one Observation Point 205 can be an entire line card. This would be the superset of the 206 individual Observation Points at the line card's interfaces. 208 * Observed Packet Stream 209 The Observed Packet Stream is the set of all packets observed 210 at the Observation Point. 212 * Packet Stream 214 A packet stream denotes a set of packets that flows past some 215 specified point within the metering process. An example of a 216 Packet Stream is the output of the selection process. 217 Note that packets selected from a stream, e.g. by Sampling, do 218 not necessarily possess a property by which they can be 219 distinguished from packets that have not been selected. For 220 this reason the term "stream" is favored over "flow", which is 221 defined as set of packets with common properties [RFC3917]. 223 * Packet Content 225 The packet content denotes the union of the packet header 226 (which includes link layer, network layer and other 227 encapsulation headers) and the packet payload. 229 3.2 Selection Process 231 * Selection Process 233 A Selection Process takes the Observed Packet Stream as its 234 input and selects a subset of that stream as its output. 236 * Selection State 238 A Selection Process may maintain state information for use by 239 the Selection Process. At a given time, the Selection State 240 may depend on packets observed at and before that time, and 241 other variables. Examples include: 243 (i) sequence numbers of packets at the input of Selectors; 245 (ii) a timestamp of observation of the packet at the 246 Observation Point; 248 (iii) iterators for pseudorandom number generators; 250 (iv) hash values calculated during selection; 252 (v) indicators of whether the packet was selected by a 253 given Selector; 255 Selection Processes may change portions of the Selection State 256 as a result of processing a packet. Selection state for a 257 packet is to reflect the state after processing the packet. 259 * Selector 261 A Selector defines the action of a Selection Process on a 262 single packet of its input. If selected, the packet becomes an 263 element of the output Packet Stream. 265 The Selector can make use of the following information in 266 determining whether a packet is selected: 268 (i) the packet's content; 270 (ii) information derived from the packet's treatment at the 271 Observation Point; 273 (iii) any selection state that may be maintained by the 274 Selection Process. 276 * Composite Selector 278 A Composite Selector is an ordered composition of Selectors, 279 in which the output Packet Stream issuing from one Selector 280 forms the input Packet Stream to the succeeding Selector. 282 * Primitive Selector 284 A Selector is primitive if it is not a Composite Selector. 286 * Selection Sequence 288 From all the packets observed at an Observation Point, only a 289 few packets are selected by one or more Selectors. The 290 Selection Sequence is a unique value per Observation Domain 291 describing the Observation Point and the Selector IDs through 292 which the packets are selected. 294 3.3 Reporting 296 * Packet Reports 298 Packet Reports comprise a configurable subset of a packet's 299 input to the Selection Process, including the packet's 300 content, information relating to its treatment (for example, 301 the output interface), and its associated selection state (for 302 example, a hash of the packet's content) 304 * Report Interpretation: 306 Report Interpretation comprises subsidiary information, 307 relating to one or more packets, that is used for 308 interpretation of their packet reports. Examples include 309 configuration parameters of the Selection Process. 311 * Report Stream: 313 The Report Stream is the output of a Metering Process, 314 comprising two distinguished types of information: Packet 315 Reports, and Report Interpretation. 317 3.4 Metering Process 319 A Metering Process selects packets from the Observed Packet 320 Stream using a Selection Process, and produces as output a 321 Report Stream concerning the selected packets. The PSAMP 322 Metering Process can be viewed as analogous to the IPFIX 323 metering process [IPFIX-PROTO], which produces flow records as 324 its output. While the Metering Process definition in this 325 document specifies the PSAMP definition, the PSAMP protocol 326 specifications [PSAMP-PROTO] will use the IPFIX Metering 327 Process definition, which also suits the PSAMP requirements. 328 The relationship between PSAMP and IPFIX is described more in 329 [PSAMP-INFO] and [PSAMP-PROTO]. 331 3.5 Exporting Process 333 * Exporting Process: 335 An Exporting Process sends, in the form of Export Packet, the 336 output of one or more Metering Processes to one or more 337 Collectors. 339 * Export Packet: 341 An Export Packet is a combination of Report Interpretation 342 and/or one or more Packet Reports are bundled by the Exporting 343 Process into an Export Packet for exporting to a Collector. 345 3.6 PSAMP Device 347 * PSAMP Device 349 A PSAMP Device is a device hosting at least an Observation 350 Point, a Metering Process and an Exporting Process. Typically, 351 corresponding Observation Point(s), Metering Process(es) and 352 Exporting Process(es) are co-located at this device, for 353 example at a router. 355 3.7 Collector 357 * Collector 359 A Collector receives a Report Stream exported by one or more 360 Exporting Processes. In some cases, the host of the Metering 361 and/or Exporting Processes may also serve as the Collector. 363 3.8 Selection Methods 365 * Filtering 366 A filter is a Selector that selects a packet deterministically 367 based on the Packet Content, or its treatment, or functions of 368 these occurring in the Selection State. Two examples are: 370 (i) Property match filtering: a packet is selected if a 371 specific field in the packet equals a predefined 372 value. 374 (ii) Hash-based selection: a hash function is applied to 375 the Packet Content, and the packet is selected if the 376 result falls in a specified range. 378 * Sampling 380 A selector that is not a filter is called a sampling 381 operation. This reflects the intuitive notion that if the 382 selection of a packet cannot be determined from its content 383 alone, there must be some type of sampling taking place. 384 Sampling operations can be divided into two subtypes: 386 (i) Content-independent sampling, which does not use 387 Packet Content in reaching sampling decisions. 388 Examples include systematic sampling, and uniform 389 pseudorandom sampling driven by a pseudorandom number 390 whose generation is independent of Packet Content. 391 Note that in Content-independent Sampling it is not 392 necessary to access the Packet Content in order to 393 make the selection decision. 395 (ii) Content-dependent sampling, in which the Packet 396 Content is used in reaching selection decisions. An 397 application is pseudorandom selection according to a 398 probability that depends on the contents of a packet 399 field, e.g., sampling packets with a probability 400 dependent on their TCP/UDP port numbers. Note that 401 this is not a Filter. 403 * Hash Domain 405 A subset of the Packet Content and the packet treatment, 406 viewed as an N-bit string for some positive integer N. 408 * Hash Range 410 A set of M-bit strings for some positive integer M that define 411 the range of values the result of the hash operation can take. 413 * Hash Function 415 A deterministic map from the Hash Domain into the Hash Range. 417 * Hash Selection Range 419 A subset of the Hash Range. The packet is selected if the 420 action of the Hash Function on the Hash Domain for the packet 421 yields a result in the Hash Selection Range. 423 * Hash-based Selection 425 Filtering specified by a Hash Domain, a Hash Function, and 426 Hash Range and a Hash Selection Range. 428 * Approximative Selection 430 Selectors in any of the above categories may be approximated 431 by operations in the same or another category for the purposes 432 of implementation. For example, uniform pseudorandom Sampling 433 may be approximated by Hash-based Selection, using a suitable 434 Hash Function and Hash Domain. In this case, the closeness of 435 the approximation depends on the choice of Hash Function and 436 Hash Domain. 438 * Population 440 A Population is a Packet Stream, or a subset of a Packet 441 Stream. A Population can be considered as a base set from 442 which packets are selected. An example is all packets in the 443 Observed Packet Stream that are observed within some specified 444 time interval. 446 * Population Size 448 The Population Size is the number of all packets in the 449 Population. 451 * Sample Size 453 The number of packets selected from the Population by a 454 Selector. 456 * Configured Selection Fraction 458 The Configured Selection Fraction is the ratio of the number 459 of packets selected by a Selector from an input Population, to 460 the Population Size, as based on the configured selection 461 parameters. 463 * Attained Selection Fraction 465 The Attained Selection Fraction is the actual ratio of the 466 number of packets selected by a Selector from an input 467 Population, to the Population Size. 469 For some sampling methods the Attained Selection Fraction can 470 differ from the Configured Selection Fraction due to, for 471 example, the inherent statistical variability in sampling 472 decisions of probabilistic Sampling and Hash-based Selection. 473 Nevertheless, for large Population Sizes and properly configured 474 Selectors, the Attained Selection Fraction usually approaches 475 the Configured Selection Fraction. 477 4. Categorization of Packet Selection Techniques 479 Packet selection techniques generate a subset of packets from an 480 Observed Packet Stream at an Observation Point. We distinguish 481 between Sampling and Filtering. 483 Sampling is targeted at the selection of a representative subset 484 of packets. The subset is used to infer knowledge about the 485 whole set of observed packets without processing them all. The 486 selection can depend on packet position, and/or on packet 487 content, and/or on (pseudo) random decisions. 489 Filtering selects a subset with common properties. This is used 490 if only a subset of packets is of interest. The properties can 491 be directly derived from the packet content, or depend on the 492 treatment given by the router to the packet. Filtering is a 493 deterministic operation. It depends on packet content or router 494 treatment. It never depends on packet position or on (pseudo) 495 random decisions. 497 Note that a common technique to select packets is to compute a 498 Hash Function on some bits of the packet header and/or content 499 and to select it if the Hash Value falls in the Hash Selection 500 Range. Since hashing is a deterministic operation on the packet 501 content, it is a Filtering technique according to our 502 categorization. Nevertheless, Hash Functions are sometimes used 503 to emulate random Sampling. Depending on the chosen input bits, 504 the Hash Function and the Hash Selection Range, this technique 505 can be used to emulate the random selection of packets with a 506 given probability p. It is also a powerful technique to 507 consistently select the same packet subset at multiple 508 Observation Points [DuGr00] 510 The following table gives an overview of the schemes described 511 in this document and their categorization. An X in brackets (X) 512 denotes schemes for which also content-independent variants 513 exist. It easily can be seen that only schemes with both 514 properties, content dependence and deterministic selection, are 515 considered as filters. 517 Selection Scheme | Deterministic | Content- | Category 518 | Selection | dependent| 519 ------------------------+---------------+----------+---------- 520 Systematic | X | _ | Sampling 521 Count-based | | | 522 ------------------------+---------------+----------+---------- 523 Systematic | X | - | Sampling 524 Time-based | | | 525 ------------------------+---------------+----------+---------- 526 Random | - | - | Sampling 527 n-out-of-N | | | 528 ------------------------+---------------+----------+---------- 529 Random | - | - | Sampling 530 Uniform probabilistic | | | 531 ------------------------+---------------+----------+---------- 532 Random | - | (X) | Sampling 533 Non-uniform probabil. | | | 534 ------------------------+---------------+----------+---------- 535 Random | - | (X) | Sampling 536 Non-uniform flow-state | | | 537 ------------------------+---------------+----------+---------- 538 Property Match | X | (X) | Filtering 539 Filtering | | | 540 ------------------------+---------------+----------+---------- 541 Hash Function | X | X | Filtering 542 ------------------------+---------------+----------+---------- 544 The categorization just introduced is mainly useful for the 545 definition of an information model describing Primitive 546 Selectors. More complex selection techniques can be described 547 through the composition of cascaded Sampling and Filtering 548 operations. For example, a packet selection that weights the 549 selection probability on the basis of the packet length can be 550 described as a cascade of a Filtering and a Sampling scheme. 551 However, this descriptive approach is not intended to be rigid: 552 if a common and consolidated selection practice turns out to be 553 too complex to be described as a composition of the mentioned 554 building blocks, an ad hoc description can be specified instead 555 and added as a new scheme to the information model. 557 5. Sampling 559 The deployment of Sampling techniques aims at the provisioning 560 of information about a specific characteristic of the parent 561 population at a lower cost than a full census would demand. In 562 order to plan a suitable Sampling strategy it is therefore 563 crucial to determine the needed type of information and the 564 desired degree of accuracy in advance. 566 First of all it is important to know the type of metric that 567 should be estimated. The metric of interest can range from 568 simple packet counts [JePP92] up to the estimation of whole 569 distributions of flow characteristics (e.g. packet 570 sizes)[ClPB93]. 572 Secondly, the required accuracy of the information and with 573 this, the confidence that is aimed at, should be known in 574 advance. For instance for usage-based accounting the required 575 confidence for the estimation of packet counters can depend on 576 the monetary value that corresponds to the transfer of one 577 packet. That means that a higher confidence could be required 578 for expensive packet flows (e.g. premium IP service) than for 579 cheaper flows (e.g. best effort). The accuracy requirements for 580 validating a previously agreed quality can also vary extremely 581 with the customer demands. These requirements are usually 582 determined by the service level agreement (SLA). 584 The Sampling method and the parameters in use must be clearly 585 communicated to all applications that use the measurement data. 586 Only with this knowledge a correct interpretation of the 587 measurement results can be ensured. 589 Sampling methods can be characterized by the Sampling algorithm, 590 the trigger type used for starting a Sampling interval and the 591 length of the Sampling interval. These parameters are described 592 here in detail. The Sampling algorithm describes the basic 593 process for selection of samples. In accordance to [AmCa89] and 594 [ClPB93] we define the following basic Sampling processes: 596 5.1 Systematic Sampling 598 Systematic Sampling describes the process of selecting the start 599 points and the duration of the selection intervals according to 600 a deterministic function. This can be for instance the periodic 601 selection of every k-th element of a trace but also the 602 selection of all packets that arrive at pre-defined points in 603 time. Even if the selection process does not follow a periodic 604 function (e.g. if the time between the Sampling intervals varies 605 over time) we consider this as systematic Sampling as long as 606 the selection is deterministic. 608 The use of systematic Sampling always involves the risk of 609 biasing the results. If the systematics in the Sampling process 610 resemble systematics in the observed stochastic process 611 (occurrence of the characteristic of interest in the network), 612 there is a high probability that the estimation will be biased. 613 Systematics in the observed process might not be known in 614 advance. 616 Here only equally spaced schemes are considered, where triggers 617 for Sampling are periodic, either in time or in packet count. 618 All packets occurring in a selection interval (either in time or 619 packet count) beyond the trigger are selected. 621 Systematic count-based 622 In systematic count-based Sampling the start and stop triggers 623 for the Sampling interval are defined in accordance to the 624 spatial packet position (packet count). 626 Systematic time-based 627 In systematic time-based Sampling time-based start and stop 628 triggers are used to define the Sampling intervals. All packets 629 are selected that arrive at the Observation Point within the 630 time-intervals defined by the start and stop triggers (i.e. 632 arrival time of the packet is larger than the start time and 633 smaller than the stop time). 635 Both schemes are content-independent selection schemes. Content 636 dependent deterministic Selectors are categorized as filter. 638 5.2 Random Sampling 640 Random Sampling selects the starting points of the Sampling 641 intervals in accordance to a random process. The selection of 642 elements are independent experiments. With this, unbiased 643 estimations can be achieved. In contrast to systematic Sampling, 644 random Sampling requires the generation of random numbers. One 645 can differentiate two methods of random Sampling: 647 5.2.1 n-out-of-N Sampling 649 In n-out-of-N Sampling n elements are selected out of the parent 650 population that consists of N elements. One example would be to 651 generate n different random numbers in the range [1,N] and 652 select all packets which have a packet position equal to one of 653 the random numbers. For this kind of Sampling the Sample Size n 654 is fixed. 656 5.2.2 Probabilistic Sampling 658 In probabilistic Sampling the decision whether an element is 659 selected or not is made in accordance to a pre-defined selection 660 probability. An example would be to flip a coin for each packet 661 and select all packets for which the coin showed the head. For 662 this kind of Sampling the Sample Size can vary for different 663 trials. The selection probability does not necessarily has to be 664 the same for each packet. Therefore we distinguish between 665 uniform probabilistic Sampling (with the same selection 666 probability for all packets) and non-uniform probabilistic 667 Sampling (where the selection probability can vary for different 668 packets). 670 5.2.2.1 Uniform Probabilistic Sampling 672 For Uniform Probabilistic Sampling packets are selected 673 independently with a uniform probability p. This Sampling can be 674 count-driven, and is sometimes referred to as geometric random 675 Sampling, since the difference in count between successive 676 selected packets are independent random variables with a 677 geometric distribution of mean 1/p. A time-driven analog, 678 exponential random Sampling, has the time between triggers 679 exponentially distributed. 681 Both geometric and exponential random Sampling are examples of 682 what is known as additive random Sampling, defined as Sampling 683 where the intervals or counts between successive samples are 684 independent identically distributed random variable. 686 5.2.2.2 Non-Uniform Probabilistic Sampling 688 This is a variant of Probabilistic Sampling in which the 689 Sampling probabilities can depend on the selection process 690 input. This can be used to weight Sampling probabilities in 691 order e.g. to boost the chance of Sampling packets that are rare 692 but are deemed important. Unbiased estimators for quantitative 693 statistics are recovered by renormalization of sample values; 694 see [HT52]. 696 5.2.2.3 Non-Uniform Flow State Dependent Sampling 698 Another type of Sampling that can be classified as probabilistic 699 Non-Uniform is closely related to the flow concept as defined in 700 [RFC3917], and it is only used jointly with a flow monitoring 701 function (IPFIX metering process). Packets are selected, 702 dependent on a selection state. The point, here, is that the 703 selection state is determined also by the state of the flow the 704 packet belongs to and/or by the state of the other flows 705 currently being monitored by the associated flow monitoring 706 function. An example for such an algorithm is the "sample and 707 hold" method described in [EsVa01]: 709 - If a packet accounts for a flow record that already exists in 710 the IPFIX flow recording process, it is selected (i.e. the 711 flow record is updated) 712 - If a packet doesn't account to any existing flow record, it is 713 selected with probability p. If it has been selected a new 714 flow record has to be created. 716 A further algorithm that fits into the category of non-uniform 717 flow state dependent Sampling is described in [Moli03]. 719 This type of Sampling is content dependent because the 720 identification of the flow the packet belongs to requires 721 analyzing part of the packet content. If the packet is selected, 722 then it is passed as an input to the IPFIX monitoring function 723 (this is called "Local Export" in [PSAMP-FW]. Selecting the 724 packet depending on the state of a flow cache is useful when 725 memory resources of the flow monitoring function are scarce 726 (i.e. there is no room to keep all the flows that have been 727 scheduled for monitoring). 729 5.2.2.4 Configuration of non-uniform probabilistic and flow-state 730 Sampling 732 Many different specific methods can be grouped under the terms 733 non-uniform probabilistic and flow state Sampling. Dependent on 734 the Sampling goal and the implemented scheme, a different number 735 and type of input parameters is required to configure such 736 scheme. 738 Some concrete proposals for such methods exist from the research 739 community (e.g. [EsVa01],[DuLT01],[Moli03]). Some of these 740 proposals are still in an early stage and need further 741 investigations to prove their usefulness and applicability. It 742 is not our aim to indicate preference amongst these methods. 743 Instead, we only describe here the basic methods and leave the 744 specification of explicit schemes and their parameters up to 745 vendors (e.g. as extension of the information model). 747 6. Filtering 749 Filtering is the deterministic selection of packets based on the 750 packet content, the treatment of the packet at the Observation 751 Point, or deterministic functions of these occurring in the 752 selection state. The packet is selected if these quantities fall 753 into a specified range. The role of Filtering, as the word 754 itself suggest, is to separate all the packets having a certain 755 property from those not having it. A distinguishing 756 characteristic from Sampling is that the selection decision does 757 not depend on the packet position in time or in the space, or on 758 a random process. 759 We identify and describe in the following two Filtering 760 techniques. 762 6.1 Property Match Filtering 764 With this Filtering method a packet is selected if specific 765 fields within the packet and/or properties of the router state 766 equal a predefined value. Possible filter fields are all IPFIX 767 flow attributes specified in [IPFIX-INFO]. Further fields can be 768 defined by vendor specific extensions. 770 A packet is selected if Field=Value. Masks and ranges are only 771 supported to the extent to which [IPFIX-INFO] allows them e.g. 772 by providing explicit fields like the netmasks for source and 773 destination addresses. 775 AND operations are possible by concatenating filters, thus 776 producing a composite selection operation. In this case, the 777 ordering in which the filtering happens is implicitly defined 778 (outer filters come after inner filters). However, as long as 779 the concatenation is on filters only, the result of the cascaded 780 filter is independent from the order, but the order may be 781 important for implementation purposes, as the first filter will 782 have to work at a higher rate. In any case, an implementation 783 is not constrained to respect the filter ordering, as long as 784 the result is the same, and it may even implement the composite 785 filtering in filtering in one single step. 787 OR operations are not supported with this basic model. More 788 sophisticated filters (e.g. supporting bitmasks, ranges or OR 789 operations etc.) can be realized as vendor specific schemes. 791 Property match operations should be available for different 792 protocol portions of the packet header: 794 (i) the IP header (excluding options in IPv4, stacked 795 headers in IPv6) 797 (ii) transport header 799 (iii) encapsulation headers (e.g. the MPLS label stack, if 800 present) 802 When the PSAMP Device offers property match filtering, and, in 803 its usual capacity other than in performing PSAMP functions, 804 identifies or processes information from IP, transport or 805 encapsulation protocols, then the information should be made 806 available for filtering. For example, when a PSAMP Device 807 routes based on destination IP address, that field should be 808 made available for filtering. Conversely, a PSAMP Device that 809 does not route is not expected to be able to locate an IP 810 address within a packet, or make it available for Filtering, 811 although it may do so. 813 Since packet encryption conceals the real values of encrypted 814 fields, property match filtering must be configurable to ignore 815 encrypted packets, when detected. 817 The Selection Process may support filtering based on the 818 properties of the router state: 820 (i) Ingress interface at which packet arrives equals a 821 specified value 823 (ii) Egress interface to which packet is routed to equals a 824 specified value 826 (iii) Packet violated Access Control List (ACL) on the 827 router 829 (iv) Failed Reverse Path Forwarding (RPF) 831 (v) Failed Resource Reservation (RSVP) 833 (vi) No route found for the packet 835 (vii) Origin Border Gateway Protocol (BGP) Autonomous System 836 (AS) [RFC4271] equals a specified value or lies within 837 a given range 838 (viii)Destination BGP AS equals a specified value or lies 839 within a given range 841 Router architectural considerations may preclude some 842 information concerning the packet treatment being available at 843 line rate for selection of packets. For example, the Selection 844 Process may not be implemented in the fast path that is able to 845 access routing state at line rate. However, when filtering 846 follows sampling (or some other selection operation) in a 847 Composite Selector, the rate of the Packet Stream output from 848 the sampler and input to the filter may be sufficiently slow 849 that the filter could select based on routing state. 851 6.2 Hash-based Filtering 853 A Hash Function h maps the Packet Content c, or some portion of 854 it, onto a Hash Range R. The packet is selected if h(c) is an 855 element of S, which is a subset of R called the Hash Selection 856 Range. Thus Hash-based Selection is indeed a particular case of 857 Filtering: the object is selected if c is in inv(h(S)). But for 858 desirable Hash Functions the inverse image inv(h(S)) will be 859 extremely complex, and hence h would not be expressible as, say, 860 a Property Match Filter or a simple combination of these. 862 Hash-based selection has mainly two types of usage: it offers a 863 way to approximate random Sampling by using packet content to 864 generate pseudorandom variates, and a way to consistently select 865 subsets of packets that share a common property (e.g. at 866 different Observation Points). 868 In the following subsections we give more details about them. 869 However, both usages require that the Hash Functions has two 870 statistical properties. 872 First, the Hash Function h must have good mixing properties, in 873 the sense that small changes in the input (e.g. the flipping of 874 a single bit) cause large changes in the output (many bits 875 change). Then any local clump of values of c is spread widely 876 over R by h, and so the distribution of h(c) is fairly uniform 877 even if the distribution of c is not. Then the Sampling Fraction 878 is #S/#R, which can be tuned by choice of S. 880 The second desirable property depends more closely on the 881 statistics of the content c. In applications, the content c 882 comprises a number of distinct fields, c1 ... cm, e.g. source 883 and destination IP Address, IP identification, and TCP/UDP port 884 numbers (if present) for a packet. With a Hash Function 885 satisfying the first properties above, selection decisions will 886 appear uncorrelated with the contents of any individual field, 887 if the complementary fields are (i) sufficiently variable 888 themselves, and (ii) sufficiently uncorrelated with cj. 890 6.2.1 Application Examples for Hash-based Selection 892 6.2.1.1 Approximation of Random Sampling 894 Although pseudorandom number generators with well understood 895 properties have been developed, they may not be the method of 896 choice in settings where computational resources are scarce. A 897 convenient alternative is to use Hash Functions of packet 898 content as a source of randomness. The hash (suitably 899 renormalized) is a pseudorandom variate in the interval [0,1]. 900 Other schemes may use packet fields in iterators for 901 pseudorandom numbers. However, the statistical properties of an 902 ideal packet selection law (such as independent Sampling for 903 different packets, or independence on packet content) may not be 904 exactly rendered by an implementation, but only approximately 905 so. 907 Use of packet content to generate pseudorandom variates shares 908 with Non-uniform Probabilistic Sampling (see Section 3.1.2.2.2 909 above) the property that selection decisions depend on Packet 910 Content. However, there is a fundamental difference between the 911 two. In the former case the content determines pseudorandom 912 variates. In the latter case the content only determines the 913 selection probabilities: selection could then proceed e.g by use 914 of random variates obtained by an independent pseudorandom 915 number generator. 917 6.2.1.2 Trajectory Sampling and Consistent Packet Selection. 919 Trajectory Sampling is the consistent selection of a subset of 920 packets at either all of a set of Observation Points or none of 921 them. Trajectory Sampling is realized by Hash-based Selection if 922 all Observation Points in the set use a common Hash Function, 923 Hash Domain and selection range. The Hash Domain comprises all 924 or part of the packet content that is invariant along the packet 925 path. Fields such as Time-to-Live, which is decremented per hop, 926 and header CRC, which is recalculated per hop, are thus excluded 927 from the Hash Domain. The Hash Domain needs to be wider than 928 just a flow key, if packets are to be selected quasirandomly 929 within flows. 931 The trajectory (or path) followed by a packet is reconstructed 932 from PSAMP reports on it that reach a Collector. Reports on a 933 given packet originating from different observations points are 934 associated by matching a label from the reports. The label may 935 comprise that portion invariant packet content that is reported, 936 or possibly some digest of the invariant packet content that is 937 inserted into the packet report at the Observation Point. Such a 938 digest may be constructed by applying a second Hash Function 939 (distinct from that used for selection) to the invariant packet 940 content. The reconstruction of trajectories, and methods for 941 dealing with possible ambiguities due to label collisions 942 (identical labels reported for different packets) and potential 943 loss of reports in transmission, are dealt with in [DuGr00], 944 [DuGG02] and [DuGr04]. 946 Applications of trajectory Sampling include (i) estimation of 947 the network path matrix, i.e., the traffic intensities according 948 to network path, broken down by flow key; (ii) detection of 949 routing loops, as indicated by self-intersecting trajectories; 950 (iii) passive performance measurement: prematurely terminating 951 trajectories indicate packet loss, packet one way delay can be 952 determined if reports include (synchronized) timestamps of 953 packet arrival at the Observation Point; (iv) network attack 954 tracing, of the actual paths taken by attack packets with 955 spoofed source addresses. 957 6.2.2 Security Considerations for Hash Functions 959 A concern for Hash-based Selection is whether some large set of 960 related packets could be disproportionately sampled, i.e., have 961 an Attained Sampling Fraction significantly different from the 962 Configured Sampling Fraction, either (i) through unanticipated 963 behavior in the Hash Function, or (ii) because the packets had 964 been deliberately crafted to have this property. 966 The first point underlines the importance of using a Hash 967 Function with good mixing properties. The statistical properties 968 of candidate Hash Functions need to be evaluated, preferably on 969 packet traces before adoption for hash-based Sampling. However, 970 hash functions which perform well on typical traffic may not be 971 sufficiently strong to withstand attacks specifically targeted 972 against them. As detailed in the following section, only 973 cryptographic hash functions employing a private parameter 974 operating in pseudo-random function mode are sufficiently strong 975 to 976 withstand the range of conceivable attacks. For example, fixed 977 or 978 variable length inputs could be hashed using a block cipher 979 (like AES) in cipher-block-chaining mode. Fixed length inputs 980 could also be hashed using an iterated cryptographic hash 981 function (like MD5 or SHA1), with a private initial vector. For 982 variable length inputs, iterated cryptographic hash function 983 (like MD5 or SHA1) should employ private string post-pended to 984 the data in addition to a private initial vector. For more 985 details, see the "append-cascade" construction of [BeCK96]. 987 The following assumes that the hash function is public and hence 988 known to an attacker. An attacker uses its knowledge of the hash 989 function to craft packets which are then dispatched, either as 990 the attack itself, or to elicit further information which can be 991 used to refine the attack. Thus two scenarios are considered. In 992 the first scenario, the attacker has no knowledge about whether 993 the crafted packets are selected or not. In the second scenario 994 the attacker uses some knowledge of sampling outcomes; the means 995 by which this might be acquired is discussed below. Some attacks 996 that involve tampering with export packets in transit, as 997 opposed to attacking the PSAMP device, are discussed in 998 [GoRe07]. 1000 6.2.2.1 Vulnerabilities of Hash-based selection without knowledge 1001 of selection outcomes. 1003 (i) The hash function does not use a private parameter. 1005 If the selection range is public, an attacker can craft packets 1006 whose selection properties are known in advance. If the 1007 selection range is private, an attacker cannot determine whether 1008 a crafted packet is selected. However by computing the hash on 1009 different trial crafted packets, and selecting those yielding a 1010 given hash value, the attacker can construct an arbitrarily 1011 large set of distinct packets with a common selection 1012 properties, i.e., packets that will be either all selected or 1013 all not selected. This can be done whatever the strength of the 1014 hash function. 1016 (ii) The hash function is not cryptographically strong. 1018 If the hash function is not cryptographically strong, it may 1019 still be possible to construct sequences of distinct packets 1020 with the common selection property. An example is the standard 1021 CRC-32 hash function used with a private modulus (but without a 1022 private string post-pended to the input). It has weak mixing 1023 properties for low order bits. Consequently, simply by 1024 incrementing the hash input, one obtains distinct packets whose 1025 hashes mostly fall in a narrow range, and hence are likely 1026 commonly selected; see [GoRe07] 1028 Suitable parameterization of the hash function can make such 1029 attacks more difficult. For example, post-pending a private 1030 string to the input before hashing with CRC-32 will give 1031 stronger mixing properties over all bits of the input. However, 1032 with a hash function, such as CRC-32, that is not 1033 cryptographically strong, the possibility of discovering a 1034 method to construct packet sets with the common selected 1035 property cannot be ruled out, even when a private modulus or 1036 post-pended string is used. 1038 6.2.2.2 Vulnerabilities of Hash-based selection using knowledge of 1039 selection outcomes. 1041 Knowledge of the selection outcomes of crafted packets can by 1042 used by an attacker to more easily construct sets of packets 1043 which are disproportionately sampled and/or are commonly 1044 selected. There are several ways an attacker might acquire this 1045 knowledge: 1047 (i) Billing Reports: if samples are used for billing purposes, 1048 then the selection outcomes of packets may be able to be 1049 inferred by correlating a crafted packet stream with the billing 1050 reports that it generates. However, the rate at knowledge of 1051 selection outcomes can be acquired depends on the temporal and 1052 spatial granularity of the billing reports, being slower the 1053 more aggregated the reports are. 1055 (ii) Feedback from an Intrusion Detection System: e.g., a 1056 botmaster adversary learns if his packets were detected by the 1057 intrusion detection system by seeing if one of his bots is 1058 blocked by the network. 1060 (iii) Observation of the Report Stream: export packets sent 1061 across a public network may be eavesdropped on by an adversary. 1062 Encryption of the export packets provides only a partial 1063 defense, since it may be possible to infer the selection 1064 outcomes of packets by correlating a crafted packet stream with 1065 the occurrence (not the content) of packets in the export stream 1066 that it generates. The rate at which such knowledge could be 1067 acquired is limited by the temporal resolution at which reports 1068 can be associated with packets, e.g. due to processing and 1069 propagation variability, and difficulty in distinguishing report 1070 on attack packets from those of background traffic, if present. 1071 The association between packets and their reports on which this 1072 depends could be removed by padding export packets to a constant 1073 length and sending them at a constant rate. 1075 We now turn to attacks that can exploit knowledge of selection 1076 outcomes. Firstly, with a non-cryptographic hash function, 1077 knowledge of selection outcomes for a trial stream may be used 1078 to further craft a packet set with the common selection 1079 property. This has been demonstrated for the modular hash f(x) = 1080 a x + b mod k, for private parameters a, b, and k. With sampling 1081 rate p, knowledge of the sampling outcomes of roughly 2/p is 1082 sufficient for the attack to succeed, independent of the values 1083 of a, b and k. With knowledge of the selection outcomes of a 1084 larger number of packets, the parameters a b and k can be 1085 determined; see [GoRe07]. 1087 A cryptographic hash function employing a private parameter and 1088 operating in one of the pseudo-random function modes specified 1089 above is not vulnerable to these attacks, even if the selection 1090 range is known. 1092 6.2.2.3 Vulnerabilities to Replay Attacks 1094 Since hash-based selection is deterministic, any packet or set 1095 of packets with known selection properties can be replayed into 1096 a network and experience the same selection outcomes provide the 1097 hash function and its parameters are not changed. Repetition of 1098 a single packet may be noticeable to other measurement methods 1099 if employed (e.g. collection of flow statistics), whereas a set 1100 of distinct packets that appears statistically similar to 1101 regular traffic may be less noticeable. 1103 Replay attacks may be mitigated by repeated changing of hash 1104 function parameters. This also prevents attacks that exploit 1105 knowledge of sampling outcomes, at least if the parameters are 1106 changed at least as fast as the knowledge can be acquired by an 1107 attacker. In order to preserve the ability to perform Trajectory 1108 Sampling, parameter changed would have to be simultaneous (or 1109 approximately so) across all observation point. 1111 6.2.3 Choice of Hash-Function 1113 The specific choice of hash function represents a trade-off 1114 between complexity and ease of implementation. Ideally, a 1115 cryptographically strong hash function employing a private 1116 parameter and operating in pseudorandom function mode as 1117 specified above would be used, yielding a good emulation a 1118 random packet selection at a target sampling rate, and giving 1119 maximal robustness against the attacks described in the previous 1120 section. However, it is not assumed that all PSAMP devices will 1121 be capable of applying a cryptographically strong hash function 1122 to every packet at line rate. For this reason, the hash 1123 functions listed in this section will be of a weaker variety. 1124 Future protocol extensions that employ stronger hash functions 1125 are not precluded. 1127 6.2.3.1 Properties of some hash functions. 1129 This document recommends 3 hash functions: IPSX, BOB and CRC-32. 1130 Specifications of IPSX and BOB are in the appendix; the CRC-32 1131 function is described in [RFC1071]. None of these hash functions 1132 is recommended for cryptographic purposes. A comparison of hash- 1133 functions with regard to execution speed, collision probability, 1134 uniformity of the distribution of values in the Hash 1135 Range and the speed of the functions is described in [MoND05]. 1137 (i) Speed: IPSX is simple to implement and was correspondingly 1138 about an order of magnitude faster to execute per packet than 1139 BOB or CRC-32. 1141 (ii) Uniformity: All three hash functions evaluated showed 1142 relatively poor uniformity with 16 byte input that was drawn 1143 from only invariant fields in the IP and TCP/UDP headers (i.e. 1144 header fields that do not change from hop to hop). IPSX is 1145 inherently limited to 16 bytes. BOB and 1146 CRC-32 exhibits noticeably better uniformity when 4 or more 1147 bytes from the payload are also included in the input. Although 1148 the uniformity has been checked for different traffic traces, 1149 results cannot be generalized to arbitrary traffic. Since hash- 1150 based selection is a deterministic function on the packet 1151 content, it can always be biased towards packets with specific 1152 attributes. Furthermore, it should be noted that all Hash 1153 Functions were evaluated only for IPv4. 1155 6.2.3.2 Hash Functions for Packet Selection 1157 The BOB function SHOULD be used for packet selection operations. 1158 Both the parameter (the init value) and the selection range 1159 should be kept private. Other functions, such as CRC-32 and IPSX 1160 MAY be used. If CRC-32 is used, the input should first be post- 1161 pended with a private string that acts as a parameter, and the 1162 modulus of the CRC should also be kept private. 1164 Input bytes for the Hash Function need to be invariant along the 1165 path the packet is traveling. Only with this it is ensured that 1166 the same packets are selected at different observation points. 1167 Furthermore they should have a high variability between 1168 different packets to generate a high variation in the Hash 1169 Range. 1171 If a hash-based selection with the BOB function is used with 1172 IPv4 traffic, the following input bytes MUST be used. 1173 - IP identification field 1174 - Flags field 1175 - Fragment offset 1176 - Source IP address 1177 - Destination IP address 1178 - A configurable number of bytes from the IP payload, starting 1179 at a configurable offset. 1181 All investigated Hash Functions were evaluated only for IPv4. 1182 Due to the IPv6 header fields and address structure it is 1183 expected that there is less randomness in IPv6 packet headers 1184 than in IPv4 headers. Nevertheless, the randomness of IPv6 1185 traffic was not evaluated in the tests mentioned above. In 1186 addition to this, IPv6 traffic profiles may change significantly 1187 in future when IPv6 is used by a broader community. If a hash- 1188 based selection with the BOB function is used with IPv6 traffic, 1189 the following input bytes MUST be used. 1190 - Payload length (2 bytes) 1191 - Byte number 10,11,14,15,16 of the IPv6 source address 1192 - Byte number 10,11,14,15,16 of the IPv6 destination address 1193 - A configurable number of bytes from the IP payload, starting 1194 at a configurable offset. It is recommended to use at least 4 1195 bytes from the IP payload. 1197 The payload itself is not changing during the path. Even if some 1198 routers process some extension headers they are not going to 1199 strip them from the packet. Therefore the payload length is 1200 invariant along the path. Furthermore it usually differs for 1201 different packets. The IPv6 address has 16 bytes. The first part 1202 is the network part and it contains low variation. The second 1203 part is the host part and contains higher variation. Therefore 1204 the second part of the address is used. Nevertheless, the 1205 uniformity has not been checked for IPv6 traffic. 1207 6.2.3.3 Hash Functions Suitable for Packet Digesting 1209 For digesting Packet Content for inclusion in a reported label, 1210 the most important property is a low collision frequency. A 1211 secondary requirement is the ability to accept variable length 1212 input, in order to allow inclusion of maximal amount of packet 1213 as input. Execution speed is of secondary importance, since the 1214 digest need only be formed from selected packets. 1216 For this purpose also the BOB function is recommended. Other 1217 functions (such as CRC-32) MAY be used. Among the functions 1218 capable of operating with variable length input BOB and CRC-32 1219 have the fastest execution, BOB being slightly faster. IPSX is 1220 not recommended for digesting because it has a significantly 1221 higher collision rate and takes only a fixed length input. 1223 7. Parameters for the Description of Selection Techniques 1225 This section gives an overview of different alternative 1226 selection schemes and their required parameters. In order to be 1227 compliant with PSAMP at least one of proposed schemes MUST be 1228 implemented. 1230 The decision whether to select a packet or not is based on a 1231 function which is performed when the packet arrives at the 1232 selection process. Packet selection schemes differ in the input 1233 parameters for the selection process and the functions they 1234 require to do the packet selection. The following table gives an 1235 overview. 1237 Scheme | input parameters | functions 1238 ---------------+------------------------+------------------- 1239 systematic | packet position | packet counter 1240 count-based | Sampling pattern | 1241 ---------------+------------------------+------------------- 1242 systematic | arrival time | clock or timer 1243 time-based | Sampling pattern | 1244 ---------------+------------------------+------------------- 1245 random | packet position | packet counter, 1246 n-out-of-N | Sampling pattern | random numbers 1247 | (random number list) | 1248 ---------------+------------------------+------------------- 1249 uniform | Sampling | random function 1250 probabilistic | probability | 1252 ---------------+------------------------+------------------- 1253 non-uniform |e.g. packet position, | selection function, 1254 probabilistic | packet content(parts) | probability calc. 1255 ---------------+------------------------+------------------- 1256 non-uniform |e.g. flow state, | selection function, 1257 flow-state | packet content(parts) | probability calc. 1258 ---------------+------------------------+------------------- 1259 property | packet content(parts) | filter function or 1260 match | or router state | state discovery 1261 ---------------+------------------------+------------------- 1262 hash-based | packet content(parts) | Hash Function 1263 ---------------+------------------------+------------------- 1265 7.1 Description of Sampling Techniques 1267 In this section we define what elements are needed to describe 1268 the most common Sampling techniques. Here the selection function 1269 is pre-defined and given by the Selector ID. 1271 Sampler Description: 1272 SELECTOR_ID 1273 SELECTOR_TYPE 1274 SELECTOR_PARAMETERS 1276 Where: 1278 SELECTOR_ID: 1279 Unique ID for the packet sampler. 1281 SELECTOR_TYPE 1282 For Sampling processes the SELECTOR TYPE defines what Sampling 1283 algorithm is used. 1284 Values: Systematic Count-based | Systematic Time-based | Random 1285 n-out-of-N | Uniform Probabilistic | Non-uniform Probabilistic | 1286 Non-uniform Flow-state 1288 SELECTOR_PARAMETERS 1289 For Sampling processes the SELECTOR PARAMETERS define the input 1290 parameters for the process. Interval length in systematic 1291 Sampling means, that all packets that arrive in this interval 1292 are selected. The spacing parameter defines the spacing in time 1293 or number of packets between the end of one Sampling interval 1294 and the start of the next succeeding interval. 1296 Case n out of N: 1297 - Population size N, Sample size n 1299 Case Systematic Time Based: 1301 - Interval length (in usec), Spacing (in usec) 1303 Case Systematic Count Based: 1304 - Interval length(in packets), Spacing (in packets) 1306 Case Uniform Probabilistic (with equal probability per packet): 1307 - Sampling probability p 1309 Case Non-uniform Probabilistic: 1310 - Calculation function for Sampling probability p (see also 1311 section . 1312 5.2.2.4) 1314 Case flow state: 1315 - Information reported for flow state sampling are not 1316 defined in this document (see also section 5.2.2.4) 1318 7.2 Description of Filtering Techniques 1320 In this section we define what elements are needed to describe 1321 the most common Filtering techniques. The structure closely 1322 parallels the one presented for the Sampling techniques. 1324 Filter Description: 1325 SELECTOR_ID 1326 SELECTOR_TYPE 1327 SELECTOR_PARAMETERS 1329 Where: 1331 SELECTOR_ID: 1332 Unique ID for the packet filter. The ID can be calculated under 1333 consideration of the SELECTION SEQUENCE and a local ID. 1335 SELECTOR_TYPE 1336 For Filtering processes the SELECTOR TYPE defines what Filtering 1337 type is used. 1338 Values: Matching | Hashing | Router_state 1340 SELECTOR_PARAMETERS 1341 For Filtering processes the SELECTOR PARAMETERS define formally 1342 the common property of the packet being filtered. For the 1343 filters of type Matching and Hashing the definitions have a lot 1344 of points in common. 1346 Values: 1348 Case Matching 1349 - Information Element (from [IPFIX-INFO]) 1350 - Value (type in accordance to [IPFIX-INFO]) 1352 In case of multiple match criteria, multiple "case matching" 1353 have to be bound by a logical AND. 1355 Case Hashing: 1356 - Hash Domain (Input bits from packet) 1357 -
1358 - 1359 -
1360 - 1361 - 1362 - 1363 - Hash Function 1364 - Hash function name 1365 - Length of input key (eliminate 0x bytes) 1366 - Output value (length M and bitmask) 1367 - Hash Selection Range, as a list of non overlapping 1368 intervals [start value, end value] where value is in 1369 [0,2^M-1] 1370 - Additional parameters dependent on specific Hash 1371 Function (e.g. hash input bits (seed)) 1373 Notes to input bits for Case Hashing: 1374 - Input bits can be from header part only, from the payload 1375 part only or from both. 1376 - The bit specification, for the header part, can be 1377 specified for ipv4 or ipv6 only, or both 1378 - In case of ipv4, the bit specification is a sequence of 20 1379 Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask 1380 to be applied to the header. 1381 - In case of ipv6, it is a sequence of 40 Hexadecimal numbers 1382 [00,FF] specifying a 40 bytes bitmask to be applied to the 1383 header 1384 - The bit specification, for the payload part, is a sequence 1385 of Hexadecimal numbers [00,FF] specifying the bitmask to be 1386 applied to the first N bytes of the payload, as specified 1387 by the previous field. In case the Hexadecimal number 1388 sequence is longer then N, only the first N numbers are 1389 considered. 1390 - In case the payload is shorter than N, the Hash Function 1391 cannot be applied. Other options, like padding with zeros, 1392 may be considered in the future. 1393 - A Hash Function cannot be defined on the options field of 1394 the ipv4 header, neither on stacked headers of ipv6. 1395 - The Hash Selection Range defines a range of hash-values 1396 (out of all possible results of the Hash-Operation). If the 1397 hash result for a specific packet falls in this range, the 1398 packet is selected. If the value is outside the range, the 1399 packet is not selected. E.g. if the selection interval 1400 specification is [1:3], [6:9] all packets are selected for 1401 which the hash result is 1,2,3,6,7,8, or 9. In all other 1402 cases the packet is not selected. 1404 Case Router State: 1406 - Ingress interface at which the packet arrives equals a 1407 specified value 1408 - Egress interface to which the packet is routed equals a 1409 specified value 1410 - Packet violated Access Control List (ACL) on the router 1411 - Reverse Path Forwarding (RPF) failed for the packet 1412 - Resource Reservation is insufficient for the packet 1413 - No route found for the packet 1414 - Origin AS equals a specified value or lies within a given 1415 range 1416 - Destination AS equals a specified value or lies within a 1417 given range 1419 Note to Case Router State: 1420 - All Router state entries can be linked by AND operators 1422 8. Composite Techniques 1424 Composite schemes are realized by combining the selector IDs 1425 into a Selection Sequence. The Selection Sequence contains all 1426 selector IDs that are applied to the packet stream subsequently. 1427 Some examples of composite schemes are reported below. 1429 8.1 Cascaded Filtering->Sampling or Sampling->Filtering 1431 If a filter precedes a Sampling process the role of Filtering is 1432 to create a set of "parent populations" from a single stream 1433 that can then be fed independently to different Sampling 1434 functions, with different parameters tuned for the population 1435 itself (e.g. if streams of different intensity result from 1436 Filtering, it may be good to have different Sampling rates). If 1437 Filtering follows a Sampling process, the same Sampling Fraction 1438 and type is applied to the whole stream, independently of the 1439 relative size of the streams resulting from the Filtering 1440 function. Moreover, also packets not destined to be selected in 1441 the Filtering operation will "load" the Sampling function. So, 1442 in principle, Filtering before Sampling allows a more accurate 1443 tuning of the Sampling procedure, but if filters are too complex 1444 to work at full line rate (e.g. because they have to access 1445 router state information), Sampling before Filtering may be a 1446 need. 1448 8.2 Stratified Sampling 1450 Stratified Sampling is one example for using a composite 1451 technique. The basic idea behind stratified Sampling is to 1452 increase the estimation accuracy by using a-priori information 1453 about correlations of the investigated characteristic with some 1454 other characteristic that is easier to obtain. The a-priori 1455 information is used to perform an intelligent grouping of the 1456 elements of the parent population. In this manner, a higher 1457 estimation accuracy can be achieved with the same Sample Size or 1458 the Sample Size can be reduced without reducing the estimation 1459 accuracy. 1461 Stratified Sampling divides the Sampling process into multiple 1462 steps. First, the elements of the parent population are grouped 1463 into subsets in accordance to a given characteristic. This 1464 grouping can be done in multiple steps. Then samples are taken 1465 from each subset. 1467 The stronger the correlation between the characteristic used to 1468 divide the parent population (stratification variable) and the 1469 characteristic of interest (for which an estimate is sought 1470 after), the easier is the consecutive Sampling process and the 1471 higher is the stratification gain. For instance, if the dividing 1472 characteristic were equal to the investigated characteristic, 1473 each element of the sub-group would be a perfect representative 1474 of that characteristic. In this case it would be sufficient to 1475 take one arbitrary element out of each subgroup to get the 1476 actual distribution of the characteristic in the parent 1477 population. Therefore stratified Sampling can reduce the costs 1478 for the Sampling process (i.e. the number of samples needed to 1479 achieve a given level of confidence). 1481 For stratified Sampling one has to specify classification rules 1482 for grouping the elements into subgroups and the Sampling scheme 1483 that is used within the subgroups. The classification rules can 1484 be expressed by multiple filters. For the Sampling scheme within 1485 the subgroups the parameters have to be specified as described 1486 above. The use of stratified Sampling methods for measurement 1487 purposes is described for instance in [ClPB93] and [Zseb03]. 1489 9. Security Considerations 1491 Security considerations concerning the choice of sampling hash 1492 function have been discussed in Section 6.2.2. That section 1493 discussed a number of potential attacks to craft packet streams 1494 which are disproportionately detected and/or discover the hash 1495 function parameters, the vulnerabilities of different hash 1496 functions to these attacks, and practices to minimize these 1497 vulnerabilities. In addition to this a user can gains knowledge 1498 about the start and stop triggers in time-based systematic 1499 sampling e.g. by sending test packets. This knowledge might 1500 allow users to modify their send schedule in a way that their 1501 packets are disproportionately selected or not selected 1502 [GoRe07]. 1504 Further security threats can occur if the configuration of 1505 Sampling parameters or the communication of Sampling parameters 1506 to the application is corrupted. This document only describes 1507 Sampling schemes that can be used for packet selection. It 1508 neither describes a mechanism how those parameters are 1509 configured nor how these parameters are communicated to the 1510 application. Therefore the security threats that originate from 1511 this kind of communication cannot be assessed with the 1512 information given in this document. 1513 10. Acknowledgements 1515 We would like to thank the PSAMP group, especially Benoit Claise 1516 and Stewart Bryant, for fruitful discussions and for 1517 proofreading the document. We thank Sharon Goldberg for her 1518 input on security issues concerning hash-based selection. 1520 11. IANA Considerations 1522 This document has no actions for IANA. 1524 12. Normative References 1526 [RFC2119] Bradner, S., Key words for use in RFCs to Indicate 1527 Requirement Levels, BCP 14, RFC 2119, March 1997 1529 13. Informative References 1531 [AmCa89] Paul D. Amer, Lillian N. Cassel, "Management of 1532 Sampled Real-Time Network Measurements", 14th 1533 Conference on Local Computer Networks, October 1534 1989, Minneapolis, pages 62-68, IEEE, 1989. 1536 [BeCK96] M. Bellare, R. Canetti and H. Krawczyk, 1537 "Pseudorandom Functions Revisited: The Cascade 1538 Construction and its Concrete Security", Symposium 1539 on Foundations of Computer Science, 1996. 1541 [ClPB93] K.C. Claffy, George C. Polyzos, Hans-Werner Braun, 1542 "Application of Sampling Methodologies to Network 1543 Traffic Characterization", Proceedings of ACM 1544 SIGCOMM'93, San Francisco, CA, USA, September 13 - 1545 17, 1993. 1547 [DuGG02] N.G. Duffield, A. Gerber, M. Grossglauser, 1548 "Trajectory Engine: A Backend for Trajectory 1549 Sampling", IEEE Network Operations and Management 1550 Symposium 2002, Florence, Italy, April 15-19, 2002. 1552 [DuGr00] N.G. Duffield, M. Grossglauser, "Trajectory 1553 Sampling for Direct Traffic Observation", 1554 Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden, 1555 August 28 - September 1, 2000. 1557 [DuGr04] N. G. Duffield and M. Grossglauser "Trajectory 1558 Sampling with Unreliable Reporting", Proc IEEE 1559 Infocom 2004, Hong Kong, March 2004. 1561 [DuLT01] N.G. Duffield, C. Lund, and M. Thorup, "Charging 1562 from Sampled Network Usage", ACM Internet 1563 Measurement Workshop IMW 2001, San Francisco, USA, 1564 November 1-2, 2001. 1566 [EsVa01] C. Estan and G. Varghese, "New Directions in 1567 Traffic Measurement and Accounting", ACM SIGCOMM 1568 Internet Measurement Workshop 2001, San Francisco 1569 (CA) Nov. 2001. 1571 [GoRe07] S. Goldberg, J. Rexford, "Security Vulnerabilities 1572 and Solutions for Packet Sampling", IEEE Sarnoff 1573 Symposium, Princeton, NJ, May 2007. 1575 [HT52] D.G. Horvitz and D.J. Thompson, "A Generalization 1576 of Sampling without replacement from a Finite 1577 Universe" J. Amer. Statist. Assoc. Vol. 47, pp. 1578 663-685, 1952. 1580 [IPFIX-INFO] J. Meyer, J. Quittek, S. Bryant "Information Model 1581 for IP Flow Information Export", RFC XXXX 1582 [Currently Internet Draft, draft-ietf-ipfix-info- 1583 15, February 2007]. 1585 [IPFIX-PROTO]B. Claise (Editor) "Specification of the IPFIX 1586 Protocol for the Exchange of IP Traffic Flow 1587 Information", RFC XXXX. [Currently Internet Draft, 1588 draft-ietf-ipfix-protocol-24.txt, November 2006]. 1590 [Jenk97] B. Jenkins, "Algorithm Alley", Dr. Dobb's Journal, 1591 September 1997. 1592 http://burtleburtle.net/bob/hash/doobs.html 1594 [JePP92] Jonathan Jedwab, Peter Phaal, Bob Pinna, "Traffic 1595 Estimation for the Largest Sources on a Network, 1596 Using Packet Sampling with Limited Storage", HP 1597 technical report, Managemenr, Mathematics and 1598 Security Department, HP Laboratories, Bristol, 1599 March 1992, 1600 http://www.hpl.hp.com/techreports/92/HPL-92-35.html 1602 [Moli03] M.Molina, "A scalable and efficient methodology for 1603 flow monitoring in the Internet", International 1604 Teletraffic Congress (ITC-18), Berlin, Sep. 2003 1606 [MoND05] M. Molina, S.Niccolini, N.G.Duffield "A Comparative 1607 Experimental Study of Hash Functions Applied to 1608 Packet Sampling" International Teletraffic Congress 1609 (ITC-19), Beijing, August 2005. 1611 [PSAMP-FW] Nick Duffield (Ed.), "A Framework for Packet 1612 Selection and Reporting", RFC XXXX [currently 1613 Internet Draft draft-ietf-psamp-framework-11, work 1614 in progress, May 2007]. 1616 [PSAMP-INFO] T. Dietz, F. Dressler, G. Carle, B. Claise, 1617 "Information Model for Packet Sampling Exports", 1618 RFC XXXX. [Currently Internet Draft, draft-ietf- 1619 psamp-info-06, June 2007] 1621 [PSAMP-PROTO] B. Claise (Ed.), "Packet Sampling (PSAMP) Protocol 1622 Specifications", RFC XXXX. [Currently Internet 1623 Draft draft-ietf-psamp-protocol-07.txt, work in 1624 progress, October 2006]. 1626 [RFC1071] R. Braden, D. Borman, C. Partridge, "Computing the 1627 Internet Checksum", RFC 1071, Sep. 1988 (updated by 1628 RFCs1141 and RFC1624). 1630 [RFC3917] J. Quittek, T. Zseby, B. Claise, S. Zander, 1631 "Requirements for IP Flow Information Export", RFC 1632 3917, October 2004. 1634 [RFC4271] Y. Rekhter, T. Li, S. Hares, "A Border Gateway 1635 Protocol 4 (BGP-4)", RFC 4271, January 2006. 1637 [Zseb03] T. Zseby, "Stratification Strategies for Sampling- 1638 based Non-intrusive Measurement of One-way Delay", 1639 Proceedings of Passive and Active Measurement 1640 Workshop (PAM 20003), La Jolla, CA, USA, pp. 171- 1641 179, April 2003. 1643 Authors' Addresses 1645 Tanja Zseby 1646 Fraunhofer Institute for Open Communication Systems 1647 Kaiserin-Augusta-Allee 31 1648 10589 Berlin 1649 Germany 1650 Phone: +49-30-34 63 7153 1651 Email: tanja.zseby@fokus.fraunhofer.de 1653 Maurizio Molina 1654 DANTE 1655 City House 1656 126-130 Hills Road 1657 Cambridge CB21PQ 1658 United Kingdom 1659 Phone: +44 1223 371 300 1660 Email: maurizio.molina@dante.org.uk 1662 Nick Duffield 1663 AT&T Labs - Research 1664 Room B-139 1665 180 Park Ave 1666 Florham Park NJ 07932, USA 1667 Phone: +1 973-360-8726 1668 Email: duffield@research.att.com 1670 Saverio Niccolini 1671 Network Laboratories, NEC Europe Ltd. 1672 Kurfuerstenanlage 36 1673 69115 Heidelberg 1674 Germany 1675 Phone: +49-6221-9051118 1676 Email: saverio.niccolini@netlab.nec.de 1678 Fredric Raspall 1679 EPSC-UPC 1680 Dept. of Telematics 1681 Av. del Canal Olimpic, s/n 1682 Edifici C4 1683 E-08860 Castelldefels, Barcelona 1684 Spain 1685 Email: fredi@entel.upc.es 1687 Intellectual Property Statement 1689 The IETF has been notified of intellectual property rights 1690 claimed in regard to some or all of the specification contained 1691 in this document. For more information consult the online list 1692 of claimed rights. 1694 The IETF takes no position regarding the validity or scope of 1695 any Intellectual Property Rights or other rights that might be 1696 claimed to pertain to the implementation or use of the 1697 technology described in this document or the extent to which any 1698 license under such rights might or might not be available; nor 1699 does it represent that it has made any independent effort to 1700 identify any such rights. Information on the procedures with 1701 respect to rights in RFC documents can be found in BCP 78 and 1702 BCP 79. 1704 Copies of IPR disclosures made to the IETF Secretariat and any 1705 assurances of licenses to be made available, or the result of an 1706 attempt made to obtain a general license or permission for the 1707 use of such proprietary rights by implementers or users of this 1708 specification can be obtained from the IETF on-line IPR 1709 repository at http://www.ietf.org/ipr. 1711 The IETF invites any interested party to bring to its attention 1712 any copyrights, patents or patent applications, or other 1713 proprietary rights that may cover technology that may be 1714 required to implement this standard. Please address the 1715 information to the IETF at ietf-ipr@ietf.org. 1717 Copyright Statement 1719 Copyright (C) The IETF Trust (2007). 1721 This document is subject to the rights, licenses and 1722 restrictions contained in BCP 78, and except as set forth 1723 therein, the authors retain all their rights. 1725 Disclaimer 1727 This document and the information contained herein are provided 1728 on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 1729 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, 1730 THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM 1731 ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO 1732 ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT 1733 INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY 1734 OR FITNESS FOR A PARTICULAR PURPOSE. 1736 Appendix A: Hash Functions 1738 A.1 IP Shift-XOR (IPSX) Hash Function 1740 The IPSX Hash Function is tailored for acting on IP version 4 1741 packets. It exploits the structure of IP packet and in 1742 particular the variability expected to be exhibited within 1743 different fields of the IP packet in order to furnish a hash 1744 value with little apparent correlation with individual packet 1745 fields. Fields from the IPv4 and TCP/UDP headers are used as 1746 input. The IPSX Hash Function uses a small number of simple 1747 instructions. 1749 Input parameters: None 1751 Built-in parameters: None 1753 Output: The output of the IPSX is a 16 bit number 1755 Functioning: 1756 The functioning can be divided into two parts: input selection, 1757 which forms are composite input from various portions of the IP 1758 packet, followed by computation of the hash on the composite. 1760 Input Selection: 1761 The raw input is drawn from the first 20 bytes of the IP packet 1762 header and the first 8 bytes of the IP payload. If IP options 1763 are not used, the IP header has 20 bytes, and hence the two 1764 portions adjoin and comprise the first 28 bytes of the IP 1765 packet. We now use the raw input as 4 32-bit subportions of 1766 these 28 bytes. We specify the input by bit offsets from the 1767 start of IP header or payload. 1769 f1 = bits 32 to 63 of the IP header, comprising the IP 1770 identification field, flags, and fragment offset. 1772 f2 = bits 96 to 127 of the IP header, the source IP address. 1774 f3 = bits 128 to 159 of the IP header, the destination IP 1775 address. 1777 f4 = bits 32 to 63 of the IP payload. For a TCP packet, f4 1778 comprises the TCP sequence number followed by the message 1779 length. For a UDP packet f4 comprises the UDP checksum. 1781 Hash Computation: 1782 The hash is computed from f1, f2, f3 and f4 by a combination of 1783 XOR (^), right shift (>>) and left shift (<<) operations. The 1784 intermediate quantities h1, v1, v2 are 32-bit numbers. 1786 1. v1 = f1 ^ f2; 1787 2. v2 = f3 ^ f4; 1788 3. h1 = v1 << 8; 1789 4. h1 ^= v1 >> 4; 1790 5. h1 ^= v1 >> 12; 1791 6. h1 ^= v1 >> 16; 1792 7. h1 ^= v2 << 6; 1793 8. h1 ^= v2 << 10; 1794 9. h1 ^= v2 << 14; 1795 10. h1 ^= v2 >> 7; 1797 The output of the hash is the least significant 16 bits of h1. 1799 A.2 BOB Hash Function 1801 The BOB Hash Function is a Hash Function designed for having 1802 each bit of the input affecting every bit of the return value 1803 and using both 1-bit and 2-bit deltas to achieve the so called 1804 avalanche effect [Jenk97]. This function was originally built 1805 for hash table lookup with fast software implementation. 1807 Input Parameters: 1808 The input parameters of such a function are: 1809 - the length of the input string (key) to be hashed, in bytes. 1810 The elementary input blocks of Bob hash are the single bytes, 1811 therefore no padding is needed. 1812 - an init value (an arbitrary 32-bit number). 1814 Built in parameters: 1815 The Bob Hash uses the following built-in parameter: 1816 - the golden ratio (an arbitrary 32-bit number used in the hash 1817 function computation: its purpose is to avoid mapping all zeros 1818 to all zeros); 1820 Note: the mix sub-function (see mix (a,b,c) macro in the 1821 reference code in 3.2.4) has a number of parameters governing 1822 the shifts in the registers. The one presented is not the only 1823 possible choice. 1825 It is an open point whether these may be considered additional 1826 built-in parameters to specify at function configuration. 1828 Output. 1829 The output of the BOB function is a 32-bit number. It should be 1830 specified: 1831 - A 32 bit mask to apply to the output 1832 - The selection range as a list of non overlapping intervals 1833 [start value, end value] where value is in [0,2^32] 1835 Functioning: 1836 The hash value is obtained computing first an initialization of 1837 an internal state (composed of 3 32-bit numbers, called a, b, c 1838 in the reference code below), then, for each input byte of the 1839 key the internal state is combined by addition and mixed using 1840 the mix sub-function. Finally, the internal state mixed one last 1841 time and the third number of the state (c) is chosen as the 1842 return value. 1844 typedef unsigned long int ub4; /* unsigned 4-byte quantities 1845 */ 1846 typedef unsigned char ub1; /* unsigned 1-byte quantities 1847 */ 1849 #define hashsize(n) ((ub4)1<<(n)) 1850 #define hashmask(n) (hashsize(n)-1) 1852 /* ------------------------------------------------------ 1853 mix -- mix 3 32-bit values reversibly. 1854 For every delta with one or two bits set, and the deltas of 1855 all three high bits or all three low bits, whether the original 1856 value of a,b,c is almost all zero or is uniformly distributed, 1857 * If mix() is run forward or backward, at least 32 bits in 1858 a,b,c have at least 1/4 probability of changing. 1859 * If mix() is run forward, every bit of c will change between 1860 1/3 and 2/3 of the time. (Well, 22/100 and 78/100 for some 2- 1861 bit deltas.) mix() was built out of 36 single-cycle latency 1862 instructions in a structure that could supported 2x parallelism, 1863 like so: 1864 a -= b; 1865 a -= c; x = (c>>13); 1866 b -= c; a ^= x; 1867 b -= a; x = (a<<8); 1868 c -= a; b ^= x; 1869 c -= b; x = (b>>13); 1870 ... 1871 Unfortunately, superscalar Pentiums and Sparcs can't take 1872 advantage of that parallelism. They've also turned some of 1873 those single-cycle latency instructions into multi-cycle latency 1874 instructions 1875 ------------------------------------------------------------*/ 1877 #define mix(a,b,c) \ 1878 { \ 1879 a -= b; a -= c; a ^= (c>>13); \ 1880 b -= c; b -= a; b ^= (a<<8); \ 1881 c -= a; c -= b; c ^= (b>>13); \ 1882 a -= b; a -= c; a ^= (c>>12); \ 1883 b -= c; b -= a; b ^= (a<<16); \ 1884 c -= a; c -= b; c ^= (b>>5); \ 1885 a -= b; a -= c; a ^= (c>>3); \ 1886 b -= c; b -= a; b ^= (a<<10); \ 1887 c -= a; c -= b; c ^= (b>>15); \ 1888 } 1890 /* ----------------------------------------------------------- 1891 hash() -- hash a variable-length key into a 32-bit value 1892 k : the key (the unaligned variable-length array of bytes) 1893 len : the length of the key, counting by bytes 1894 initval : can be any 4-byte value 1895 Returns a 32-bit value. Every bit of the key affects every bit 1896 of the return value. Every 1-bit and 2-bit delta achieves 1897 avalanche. About 6*len+35 instructions. 1899 The best hash table sizes are powers of 2. There is no need to 1900 do mod a prime (mod is sooo slow!). If you need less than 32 1901 bits, use a bitmask. For example, if you need only 10 bits, do 1902 h = (h & hashmask(10)); 1903 In which case, the hash table should have hashsize(10) elements. 1905 If you are hashing n strings (ub1 **)k, do it like this: 1906 for (i=0, h=0; i= 12) 1931 { 1932 a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16) 1933 +((ub4)k[3]<<24)); 1934 b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16) 1935 +((ub4)k[7]<<24)); 1936 c += (k[8] +((ub4)k[9]<<8) 1937 +((ub4)k[10]<<16)+((ub4)k[11]<<24)); 1938 mix(a,b,c); 1939 k += 12; len -= 12; 1940 } 1942 /*---------------------------- handle the last 11 bytes */ 1943 c += length; 1944 switch(len) /* all the case statements fall through*/ 1945 { 1946 case 11: c+=((ub4)k[10]<<24); 1947 case 10: c+=((ub4)k[9]<<16); 1948 case 9 : c+=((ub4)k[8]<<8); 1949 /* the first byte of c is reserved for the length */ 1950 case 8 : b+=((ub4)k[7]<<24); 1951 case 7 : b+=((ub4)k[6]<<16); 1952 case 6 : b+=((ub4)k[5]<<8); 1953 case 5 : b+=k[4]; 1954 case 4 : a+=((ub4)k[3]<<24); 1955 case 3 : a+=((ub4)k[2]<<16); 1956 case 2 : a+=((ub4)k[1]<<8); 1957 case 1 : a+=k[0]; 1958 /* case 0: nothing left to add */ 1959 } 1960 mix(a,b,c); 1961 /*-------------------------------- report the result */ 1962 return c; 1963 }