| < draft-geib-ippm-metrictest-00.txt | draft-geib-ippm-metrictest-01.txt > | |||
|---|---|---|---|---|
| Internet Engineering Task Force R. Geib, Ed. | Internet Engineering Task Force R. Geib, Ed. | |||
| Internet-Draft Deutsche Telekom | Internet-Draft Deutsche Telekom | |||
| Intended status: Informational R. Fardid | Intended status: Informational A. Morton | |||
| Expires: January 7, 2010 Covad Communications | Expires: April 29, 2010 AT&T Labs | |||
| July 6, 2009 | R. Fardid | |||
| Covad Communications | ||||
| October 26, 2009 | ||||
| IPPM standard compliance testing | IPPM standard compliance testing | |||
| draft-geib-ippm-metrictest-00 | draft-geib-ippm-metrictest-01 | |||
| Status of this Memo | Status of this Memo | |||
| This Internet-Draft is submitted to IETF in full conformance with the | This Internet-Draft is submitted to IETF in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 32 ¶ | skipping to change at page 1, line 34 ¶ | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on January 7, 2010. | This Internet-Draft will expire on April 29, 2010. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2009 IETF Trust and the persons identified as the | Copyright (c) 2009 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents in effect on the date of | Provisions Relating to IETF Documents in effect on the date of | |||
| publication of this document (http://trustee.ietf.org/license-info). | publication of this document (http://trustee.ietf.org/license-info). | |||
| Please review these documents carefully, as they describe your rights | Please review these documents carefully, as they describe your rights | |||
| skipping to change at page 2, line 15 ¶ | skipping to change at page 2, line 20 ¶ | |||
| Internet standard. Results of different IPPM implementations can be | Internet standard. Results of different IPPM implementations can be | |||
| compared if they measure under the same underlying network | compared if they measure under the same underlying network | |||
| conditions. Results are compared using state of the art statistical | conditions. Results are compared using state of the art statistical | |||
| methods. | methods. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 | 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 | |||
| 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 3. Verification of equivalence by statistic measurements . . . . 5 | 3. Verification of conformance to a metric specification . . . . 6 | |||
| 4. Recommended Metric Verification Measurement Process . . . . . 12 | 3.1. Tests of an individual implementation against a metric | |||
| 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 14 | specification . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 14 | 3.2. Test set up resulting in identical live network | |||
| 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15 | testing conditions . . . . . . . . . . . . . . . . . . . . 7 | |||
| 8. Security Considerations . . . . . . . . . . . . . . . . . . . 15 | 3.3. Tests two or more different implementations against a | |||
| 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | metric specification . . . . . . . . . . . . . . . . . . . 9 | |||
| 9.1. Normative References . . . . . . . . . . . . . . . . . . . 15 | 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 10 | |||
| 9.2. Informative References . . . . . . . . . . . . . . . . . . 15 | 3.5. Recommended Metric Verification Measurement Process . . . 11 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 16 | 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 13 | ||||
| 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13 | ||||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 14 | ||||
| 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | ||||
| 8.1. Normative References . . . . . . . . . . . . . . . . . . . 14 | ||||
| 8.2. Informative References . . . . . . . . . . . . . . . . . . 14 | ||||
| Appendix A. Further ideas on statistical tests . . . . . . . . . 15 | ||||
| Appendix B. Verification of measurement precision by | ||||
| statistical methods . . . . . . . . . . . . . . . . . 17 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 19 | ||||
| 1. Introduction | 1. Introduction | |||
| Draft bradner-metrictest [bradner-metrictest] states: | Draft bradner-metrictest [bradner-metrictest] states: | |||
| The Internet Standards Process RFC2026 [RFC2026] requires that for a | The Internet Standards Process RFC2026 [RFC2026] requires that for a | |||
| IETF specification to advance beyond the Proposed Standard level, at | IETF specification to advance beyond the Proposed Standard level, at | |||
| least two genetically unrelated implementations must be shown to | least two genetically unrelated implementations must be shown to | |||
| interoperate correctly with all features and options. There are two | interoperate correctly with all features and options. There are two | |||
| distinct reasons for this requirement. | distinct reasons for this requirement. | |||
| skipping to change at page 4, line 5 ¶ | skipping to change at page 3, line 52 ¶ | |||
| specified test set up to create the required separate data sets | specified test set up to create the required separate data sets | |||
| (which may be seen as samples taken from the same underlying | (which may be seen as samples taken from the same underlying | |||
| distribution) and then apply state of the art statistical methods to | distribution) and then apply state of the art statistical methods to | |||
| verify equivalence of the results. To illustrate application of the | verify equivalence of the results. To illustrate application of the | |||
| process defined here, validating compliance with RFC2679 [RFC2679] is | process defined here, validating compliance with RFC2679 [RFC2679] is | |||
| picked as an example. While test set ups may vary with the metrics | picked as an example. While test set ups may vary with the metrics | |||
| to be validated, the statistical methods will not. Documents | to be validated, the statistical methods will not. Documents | |||
| defining test setups to validate other metrics should be created by | defining test setups to validate other metrics should be created by | |||
| the IPPM WG, once the process proposed here has been agreed upon. | the IPPM WG, once the process proposed here has been agreed upon. | |||
| This document defines the process of verifying equivalence by using a | ||||
| specified test set up to create the required separate data sets | ||||
| (which may be seen as samples taken from the same underlying | ||||
| distribution) and then apply state of the art statistical methods to | ||||
| verify equivalence of the results. To illustrate application of the | ||||
| process defined here, validating compliance with RFC2679 [RFC2679] is | ||||
| picked as an example. While test set ups may vary with the metrics | ||||
| to be validated, the statistical methods will not. Documents | ||||
| defining test setups to validate other metrics should be created by | ||||
| the IPPM WG, once the process proposed here has been agreed upon. | ||||
| Changes from -00 to -01 version | ||||
| o Addition of a comparison of individual metric implementations | ||||
| against the metric specification (trying to pick up problems and | ||||
| solutions for metric advancement [morton-advance-metrics]). | ||||
| o More emphasis on the requirement to carefully design and document | ||||
| the measurement set up of the metric comparison. | ||||
| o Proposal of testing conditions under identical WAN network | ||||
| conditions using IP in IP tunneling or Pseudo Wires and parallel | ||||
| measurement streams. | ||||
| o Proposing the requirement to document the smallest resolution at | ||||
| which an ADK test was passed by 95%. As no minimum resolution is | ||||
| specified, IPPM metric compliance is not linked to a particular | ||||
| performance of an implementation. | ||||
| o Reference to RFC 2330 and RFC 2679 for the 95% confidence interval | ||||
| as preferred criterion to decide on statistical equivalence | ||||
| o Reducing the proposed statistical test to ADK with 95% confidence. | ||||
| 1.1. Requirements Language | 1.1. Requirements Language | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [RFC2119]. | document are to be interpreted as described in RFC 2119 [RFC2119]. | |||
| 2. Basic idea | 2. Basic idea | |||
| The Framework for IP Performance Metrics (RFC 2330, [RFC2330]) | ||||
| expects that a "methodology for a metric should have the property | ||||
| that it is repeatable: if the methodology is used multiple times | ||||
| under identical conditions, it should result in consistent | ||||
| measurements." This means, an IPPM implementation is expected to | ||||
| measure a metric with high precision. The metric compliance test | ||||
| specified in the following emphasises precision over accuracy. | ||||
| Further the methodology and test methods proposed by RFC 2330 are | ||||
| used by this document too. | ||||
| The implementation of a standard compliant metric is expected to meet | ||||
| the requirements of the related metric specification. So before | ||||
| comparing two metric implementations, each metric implementation is | ||||
| individually compared against the metric specification. As an | ||||
| example, an implementation of the OWD metric must be calibrated. | ||||
| Calibration results of a standard conformant metric implementation | ||||
| must be published then. | ||||
| Most metric specifications leave freedom to implementors on those | ||||
| aspects, which aren't fundamental for an individual metric | ||||
| implementation. Calibration of individual metric implementations and | ||||
| comparing different ones requires a careful design and documentation | ||||
| of the metric implementation and of the testing conditions. | ||||
| The IPPM framework expects repeating measurements to lead to the same | ||||
| results, if the conditions under which these measurements have been | ||||
| collected are identical. Small deviations are expected to lead to | ||||
| small deviations in results only. To characterise statistical | ||||
| equivalence in the case of small deviations, RFC 2330 and RFC 2679 | ||||
| suggest to apply a 95% confidence interval. Quoting RFC 2679, "95 | ||||
| percent was chosen because ... a particular confidence level should | ||||
| be specified so that the results of independent implementations can | ||||
| be compared." | ||||
| Two different IPPM implementations are expected to measure | Two different IPPM implementations are expected to measure | |||
| statistically equivalent results, if they both measure a metric under | statistically equivalent results, if they both measure a metric under | |||
| the same networking conditions. Formulating the measurement in | the same networking conditions. Formulating the measurement in | |||
| statistical terms: separate samples are collected (by separate metric | statistical terms: separate samples are collected (by separate metric | |||
| implementations) from the same underlying statistical process (the | implementations) from the same underlying statistical process (the | |||
| same network conditions). The "statistical hypothesis" to be tested | same network conditions). The "statistical hypothesis" to be tested | |||
| is the expectation, that both samples expose statistically equivalent | is the expectation, that both samples do not expose statistically | |||
| properties. This requires careful test design: | different properties. This requires careful test design: | |||
| o The error induced by the sample size must be small enough to | o The error induced by the sample size must be small enough to | |||
| minimize its influence on the test result. This may have to be | minimize its influence on the test result. This may have to be | |||
| respected, especially if two implementations measure with | respected, especially if two implementations measure with | |||
| different average probing rates. | different average probing rates. | |||
| o If time series are compared, the implementation with the lowest | o If statistics of time series are compared, the implementation with | |||
| probing frequency determines the smallest temporal interval for | the lowest probing frequency determines the smallest temporal | |||
| which results can be compared. | interval for which results can be compared. | |||
| o Every comparison must be repeated several times based on different | o Every comparison must be repeated several times based on different | |||
| measurement data to avoid random indications of compatibility (or | measurement data to avoid random indications of compatibility (or | |||
| the lack of it). | the lack of it). | |||
| o The measurement test set up must be self-consistent to the largest | o The measurement test set up must be self-consistent to the largest | |||
| possible extent. This means, network conditions, paths and IPPM | possible extent. This means, network conditions, paths and IPPM | |||
| metric implementations SHOULD be identical for the compared | metric implementations SHOULD be identical for the compared | |||
| implementations to the largest possible degree to minimize the | implementations to the largest possible degree to minimize the | |||
| influence of the test and measurement set up on the result. This | influence of the test and measurement set up on the result. This | |||
| includes e.g. aspects of the stability and non-ambiguity of routes | includes e.g. aspects of the stability and non-ambiguity of routes | |||
| taken by the measurement packets. See RFC 2330 for a discussion | taken by the measurement packets. See RFC 2330 for a discussion | |||
| on self-consistency RFC 2330 [RFC2330]. | on self-consistency RFC 2330 [RFC2330]. | |||
| State of the art statistical methods are proposed for a comparison of | As addressed by "problems and solutions for metric advancement" | |||
| measurement results in the hope that user friendly tools required to | [morton-advance-metrics], documentation of the metric test will | |||
| perform the necessary statistical analysis are easily accessible. | indicate which requirements and options of a metric specification are | |||
| [editor: this sentence may be reworded or deleted, if the expectation | specified clear enough for an implementation or uncover gaps in the | |||
| doesn't hold]. | metric specification. The final step in advancing a metric | |||
| specification to standard is by improving unclear specifications and | ||||
| Let's assume a one way delay measurement comparison between system A, | by cleaning it from not supported options. | |||
| probing with a frequency of 2 probes per second and system B probing | ||||
| at a rate of 2 probes every 3 minutes. To ensure reasonable | ||||
| confidence in results, sample metrics are calculated from at least 5 | ||||
| singletons per compared time interval. This means, sample delay | ||||
| values are calculated for each system for identical 6 minute | ||||
| intervals for the whole test duration. Per 6 minute interval, the | ||||
| sample metric is calculated from 720 singletons for system A and from | ||||
| 6 singletons for system B). Note, that if outliers are not filtered, | ||||
| moving averages are an option for an evaluation too. The minimum | ||||
| move of an averaging interval is three minutes in our example. | ||||
| The test set up for the delay measurement is chosen to minimize | ||||
| errors by locating one system of each implementation at the same end | ||||
| of two separate sites, between which delay is measured for the metric | ||||
| test. Both measurement sites are connected by one IPSEC tunnel, so | ||||
| that all measurement packets cross the Internet with the same IP | ||||
| addresses. Both measurement systems measure simultaneously and the | ||||
| local links are dimensioned to avoid congestion caused by the probing | ||||
| traffic itself. | ||||
| The measured delay values are reported with a resolution above the | ||||
| measurement error and above the synchronisation error. This is done | ||||
| to avoid comparing these errors between two different metric | ||||
| implementations instead of comparing the IPPM metric implementation | ||||
| itself. | ||||
| The overall duration of the test is chosen so that more than 1000 six | ||||
| minute measurement intervals are collected. The amount of data | ||||
| collected allows separate comparisons for e.g. 200 consecutive 6 | ||||
| minute intervals. Intervals during which routes were unstable are | ||||
| discarded prior to evaluation. | ||||
| 3. Verification of equivalence by statistic measurements | ||||
| Following the definition of statistical precision [Precision], a | ||||
| measurement process can be characterised by two properties: | ||||
| o Accuracy, which is the degree of conformity of a measured quantity | ||||
| to its actual (true) value. | ||||
| o Precision, also called reproducibility or repeatability, the | ||||
| degree to which repeated measurements show the same or similar | ||||
| results. | ||||
| Figure 1 further clarifies the difference between accuracy and | ||||
| precision of a measurement. | ||||
| Probability ^ | ||||
| Density | | ||||
| | Reference value Measured Value | ||||
| | | | | ||||
| | |<---Accuracy---->| | ||||
| | | _|_ | ||||
| | | / | \ | ||||
| | | / | \ | ||||
| | | / | \ | ||||
| | | / | \ | ||||
| | | / | \ | ||||
| | | / | \ | ||||
| Measured | | /<- Precision ->\ | ||||
| Value -|---------|-----------------|----------> | ||||
| | | ||||
| Measurement accuracy and precision [Precision]. | 3. Verification of conformance to a metric specification | |||
| Figure 1 | This section specifies how to verify compliance of two or more IPPM | |||
| implementations against a metric specification. This document only | ||||
| proposes a general methodology. Compliance criteria to a specific | ||||
| metric implementation are expected to be drafted for each individual | ||||
| metric specification. The only exception is the statistical test | ||||
| comparing two metric implementations which are simultaneously tested. | ||||
| This test is applicable without metric specific decision criteria. | ||||
| The Framework for IP Performance Metrics (RFC 2330, [RFC2330]) | 3.1. Tests of an individual implementation against a metric | |||
| expects that a "methodology for a metric should have the property | specification | |||
| that it is repeatable: if the methodology is used multiple times | ||||
| under identical conditions, it should result in consistent | ||||
| measurements." This means, an IPPM implementation is expected to | ||||
| measure a metric with high precision. | ||||
| Further, RFC2330 expects that "a methodology for a given metric | A metric implementation MUST support the requirements classified as | |||
| exhibits continuity if, for small variations in conditions, it | "MUST" and "REQUIRED" of the related metric specification to be | |||
| results in small variations in the resulting measurements. Slightly | compliant to the latter. | |||
| more precisely, for every positive epsilon, there exists a positive | ||||
| delta, such that if two sets of conditions are within delta of each | ||||
| other, then the resulting measurements will be within epsilon of each | ||||
| other." A small variation in conditions in the context of a metric | ||||
| comparison can be seen as two implementations measuring the same | ||||
| metric along the same path. | ||||
| Two guidelines for an IPPM conformant metric implementation can be | Further, supported options of a metric implementation SHOULD be | |||
| taken from these principles: | documented in sufficient detail to validate and improve the | |||
| underlying metric specification option or remove options which saw no | ||||
| implementation or which are badly specified from the metric | ||||
| specification to be promoted to a standard. | ||||
| o A single IPPM conformant implementation MUST under otherwise | RFC2330 and RFC2679 emphasise precision as an aim of IPPM metric | |||
| identical network conditions produce highly precise results for | implementations. A single IPPM conformant implementation MUST under | |||
| repeated measurements of the same metric. | otherwise identical network conditions produce precise results for | |||
| repeated measurements of the same metric. | ||||
| o Two different implementations measuring the same IPPM metric MUST | RFC 2330 prefers the "empirical distribution function" EDF to | |||
| produce results with a rather limited difference if measuring | describe collections of measurements. RFC 2330 determines, that | |||
| under to the largest extent possible identical network conditions. | "unless otherwise stated, IPPM goodness-of-fit tests are done using | |||
| 5% significance." The goodness of fit test required to determine the | ||||
| precision of a metric implementation consists of testing, whether | ||||
| two or more samples belong to the same underlying distribution (of | ||||
| measured network performance events). The goodness of fit test to be | ||||
| applied is the Anderson-Darling K sample test (ADK test, K stands for | ||||
| the number of samples to be compared). Please note that RFC 2330 and | ||||
| RFC 2679 apply an Anderson Darling goodness of fit test too. | ||||
| In a metric test, both conditions must hold, meaning that repeated | The results of repeated tests with a single implementation MUST | |||
| tests of two implementations MUST produce precise results for all | pass an ADK sample test with confidence level of 95%. The resolution | |||
| repetition intervals. | for which the ADK test has been passed with the specified confidence | |||
| level MUST be documented. Put differently: the requirement is | ||||
| to document the smallest resolution at which the results of the | ||||
| tested metric implementation pass an ADK test with a confidence level | ||||
| of 95%. | ||||
| A suitable statistical test and a level of confidence to define | As an example, a one way delay measurement may pass an ADK test with | |||
| whether differences are rather limited and whether a measurement is | a timestamp resolution of 1 ms. The same test may fail, if timestamps | |||
| highly precise are specified below. | with a resolution of 100 microseconds are evaluated. The | |||
| implementation is then conforming to the metric specification up | ||||
| to a timestamp resolution of 1 ms. | ||||
| RFC 2330 prefers the "empirical distribution function" EDF to | 3.2. Test set up resulting in identical live network testing conditions | |||
| describe collections of measurements. RFC 2330 uses the EDF to test | ||||
| goodness of fit of an IPPM flow's inter packet spacing to a Poisson | ||||
| process. To do that, RFC 2330 uses the Anderson-Darling test with a | ||||
| 5% significance. RFC 2330 further determines, that "unless otherwise | ||||
| stated, IPPM goodness-of-fit tests are done using 5% significance." | ||||
| The principles suggested by RFC 2330 are applied to compare the | Two major issues complicate tests for metric compliance across live | |||
| implementation of IPPM metrics as follows: | networks under identical testing conditions. One of these is the | |||
| general posit, "metric definition implementations cannot be | ||||
| conveniently examined in field measurement scenarios". The other is | ||||
| more specifically addressing "parallelism in devices and | ||||
| networks", by which mechanisms like load balancing are meant. As a | ||||
| reference for the latter, [RFC 4814] is given. | ||||
| o The empirical distribution function of the singletons or samples | This section proposes two measures how to deal with both. Tunneling | |||
| resulting from the measurement of a particular metric is forming | mechanisms can be used to avoid pallalel processing of different | |||
| the basis of a comparison of two IPPM implementations. Note that | flows in the network. Measuring by separate parallel probe flows | |||
| a parametric description of this distribution is not required. | results in repeated collection of data. In both cases, WAN network | |||
| conditions are identical, no matter what they are in detail. | ||||
| o The hypothesis to be validated by an IPPM metric test is that two | Any measurement set up MUST be made to avoid the probing traffic | |||
| implementations of an IPPM metric draw probes from the same | itself to impede the metric measurement. The created measurement | |||
| underlying distribution. The hypothesis is true, if samples of | load MUST NOT result in congestion at the access link connecting the | |||
| two tested metric implementations follow the same distribution by | measurement implementation to the WAN. The created measurement load | |||
| a significance of 95%. Note that the distribution function from | MUST NOT overload the measurement implementation itself, eg. by | |||
| which the probes are drawn itself is irrelevant. | causing a high CPU load or by creating imprecisions due to internal | |||
| send/receive probe packet collisions. | ||||
| o The samples taken by two implementations to be tested are compared | IP in IP tunnels can be used to avoid ECMP routing of different | |||
| by an Anderson-Darling k sample test. The Anderson-Darling k | measurement streams if they allow to carry inner IP packets from | |||
| sample test is the generalization of the classical Anderson- | different senders in a single tunnel with the same outer origin and | |||
| Darling goodness of fit test, and it is used to test the | destination address as well as the same port numbers. The author is | |||
| hypothesis that k independent samples belong to the same | not an expert on tunneling and appreciates guidance on the | |||
| population without specifying their common distribution function. | applicability of one or more of the following protocols: IP in IP | |||
| [Editor: I couldn't find a complete documentation of that test on | [RFC2003], GRE [RFC2784] or L2TP [RFC2661] or [RFC3931]. RFC 4928 | |||
| the web by a fast search, but a reference to a publication is | [RFC4928] proposes measures how to avoid ECMP treatment in MPLS | |||
| there and code seems to be available too. Other tests which are | networks. Applying Pseudo-Wires for a metric implementation test is | |||
| documented in Wikipedia for that purpose are Kolmogorov-Smirnov | one way to avoid MPLS based ECMP treatment. If tunneling is applied, | |||
| and Chi-Square. It is proposed to make Anderson Darling k sample | a single tunnel MUST carry all test traffic in one direction. If eg. | |||
| obligatory/a MUST if code can be appended to this draft. If not, | Ethernet Pseudo Wires are applied and the measurement streams are | |||
| Anderson Darling k sample is recommended and Kolmogorov-Smirnov or | carried in different VLANs, the Pseudo Wires MUST be set up in | |||
| Chi Square are optional]. | physical port mode to avoid set up of Pseudo Wires per VLAN (which | |||
| may see different paths due to ECMP routing), see RFC 4448 [RFC4448]. | ||||
| Getting back to the chosen example delay measurement, the captured | To have statistical significance, a test MUST be repeated 5 times at | |||
| delays may have been captured singletons ranging from an absolute | least (see below). WAN conditions may change over time. Sequential | |||
| minimum Delay Dmin to values Dmin + 5 ms. To compare distributions, | testing is no useful metric test option. However tests can be | |||
| the set of singletons of a chosen evaluation interval (e.g. the data | carried out by applying 5 or more different parallel measurement | |||
| of one of the five 1800 minute capture sequences, see above) is | flows. The author takes no position, whether such a test is carried | |||
| sorted for the frequency of singletons per Dmin + n * 0.5 ms (n = 1, | out by sending eg a single CBR flow and defining every n-th (n = | |||
| 2, ...). After that, a comparison of the two probe sets with any of | 1..5) packet to belong to a specific measurement flow, or whether | |||
| the mentioned tests may be applied. | multiple network cards are applied to create several distinct flows | |||
| of a single implementation. In the latter case, three different | ||||
| cards of one implementation at a single test site will do, if | ||||
| tunneling set ups like the one proposed by GRE encapsulated multicast | ||||
| probing [GU&Duffield] are applied (note that one or more remote | ||||
| tunnel end points and the same number of routers are required). | ||||
| While constructing the example, some additional rules to calculate | Some additional rules to calculate and compare samples have to be | |||
| and compare samples have been respected. The following two rules are | respected. The following rules are of importance for the IPPM metric | |||
| of importance for the IPPM metric tests: | test: | |||
| o To compare different probes of a common underlying distribution in | o To compare different probes of a common underlying distribution in | |||
| terms of metrics characterising a communication network requires | terms of metrics characterising a communication network requires | |||
| to respect the temporal nature for which the assumption of common | to respect the temporal nature for which the assumption of common | |||
| underlying distribution may hold. Any singletons or samples to be | underlying distribution may hold. Any singletons or samples to be | |||
| compared MUST be captured within the same time interval. | compared MUST be captured within the same time interval. | |||
| o Whenever sample metrics, samples of singletons or rates are used | o Whenever statistical events like singletons or rates are used to | |||
| to characterise measured metrics of a time-interval, at least 5 | characterise measured metrics of a time-interval, at least 5 | |||
| events of a relevant metric MUST be present to ensure a minimum | events of a relevant metric MUST be present to ensure a minimum | |||
| confidence in the reported value (see Wikipedia on confidence | confidence in the reported value (see Wikipedia on confidence | |||
| [Rule of thumb]). Note that this criterion is to be respected | [Rule of thumb]). Note that this criterion also is to be | |||
| e.g. when comparing packet loss metrics. Any packet loss | respected e.g. when comparing packet loss metrics. Any packet | |||
| measurement interval to be compared with the results of another | loss measurement interval to be compared with the results of | |||
| implementation needs to contain at least five lost packets to have | another implementation needs to contain at least five lost packets | |||
| a minimum confidence that these losses didn't happen randomly. | to have a minimum confidence that the observed loss rate wasn't | |||
| caused by a small number of random packet drops. | ||||
| o The minimum number of singletons or samples to be compared by an | o The minimum number of singletons or samples to be compared by an | |||
| Anderson-Darling test is 100 per tested metric implementation. | Anderson-Darling test is 100 per tested metric implementation. | |||
| Note that the Anderson-Darling test detects small differences in | Note that the Anderson-Darling test detects small differences in | |||
| distributions fairly well and will fail for a high number of | distributions fairly well and will fail for a high number of | |||
| compared results (RFC2330 mentions an example with 8192 | compared results (RFC2330 mentions an example with 8192 | |||
| measurements to guarantee a failure of an Anderson-Darling test). | measurements to guarantee a failure of an Anderson-Darling test). | |||
| Comparing "Accuracy" of IPPM implementations based on averages and | o The Anderson-Darling test is sensitive to differing accuracy | |||
| variations may require prior checks for the absence of long range | or bias of different implementations. These differences result in | |||
| dependency within the compared measurements. Large outliers as | differing averages of compared samples. In general, differences | |||
| typically occurring in the case of long range dependency, can have a | in averages of samples may result from differing test conditions. | |||
| serious impact on mean values. The median or percentiles may be more | An example may be different packet sizes, resulting in a constant | |||
| robust measures on which to compare the accuracy of different IPPM | delay difference between compared samples. Therefore samples to | |||
| implementations. An idea may be to consider data up to a certain | be compared by an Anderson-Darling test MAY be calibrated by the | |||
| percentile, calculate the mean for data up to this percentile and | difference of the average values of the samples. | |||
| then compare the means of the two implementations. This could be | ||||
| repeated for different percentiles. If long range dependencies | ||||
| impact is limited to large outliers, the method may work for lower | ||||
| percentiles. Whether this makes sense must be confirmed by a | ||||
| statistician, so this attempt requires further study. | ||||
| IPPM metrics are captured by time series. Time series can be checked | ||||
| for correlation. There are two expectations on statistical time | ||||
| series properties which should be met by separate measurements | ||||
| probing the same underlying network performance distribution: | ||||
| o The Autocorrelation indicates, whether there are any repeating | 3.3. Tests two or more different implementations against a metric | |||
| patterns within a time series. For the purpose of this document, | specification | |||
| it does not matter whether there is autocorrelation in a | ||||
| measurement. It is however expected, that two measurements expose | ||||
| the same autocorrelation on identical "lag" intervals. If | ||||
| calculable, the autocorrelation lies within an interval [-1;1], | ||||
| (see Wikipedia on autocorrelation [Autocorrelation]). | ||||
| o The correlation coefficient "indicates the strength of a linear | RFC2330 expects that "a methodology for a given metric exhibits | |||
| relationship between two random variables." The two random | continuity if, for small variations in conditions, it results in | |||
| variables in the case of this document are the measurement time | small variations in the resulting measurements. Slightly more | |||
| series of the IPPM implementations to be compared. The | precisely, for every positive epsilon, there exists a positive delta, | |||
| expectation is, that both are strongly correlated and the | such that if two sets of conditions are within delta of each other, | |||
| resulting correlation coefficient is close to 1, (see Wikipedia on | then the resulting measurements will be within epsilon of each | |||
| correlation [Correlation]). | other." A small variation in conditions in the context of a metric | |||
| comparison can be seen as different implementations measuring the | ||||
| same metric along the same path. | ||||
| A metric test can derive additional statistics from time series | RFC2679 comments that a "95 percent [confidence level for an | |||
| analysis. Further, formulation of a test hypothesis is possible for | Anderson-Darling goodness of fit test] was chosen because....a | |||
| autocorrelation and the correlation coefficient. It is however not | particular confidence level should be specified so that the results | |||
| clear, whether an appropriate statistical test to validate the | of independent implementations can be compared." While the RFC 2679 | |||
| hypothesis by 95% significance exists. Applicability of time series | statement refers to calibration, it expresses the expectation that | |||
| analysis for a metric test requires further input from statisticians. | the methodology allows for comparisons between different | |||
| implementations. | ||||
| In the absence of any metric test on time series, any test result | IPPM metric specifications however allow for implementor options to | |||
| SHOULD provide the autocorrelation of the compared metrics time | the largest possible degree. It can't be expected that two | |||
| series by lags from 1 to 10. In addition, the value of the | implementors pick identical options for the implementations. | |||
| correlation coefficient SHOULD be provided. Autocorrelation and | Implementors SHOULD to the highest degree possible pick the same | |||
| Correlation coefficient are expected to be rather close to the value | configurations for their systems when comparing their implementations | |||
| 1. | by a metric test. | |||
| As mentioned earlier, the time series analysis requires application | In some cases, a goodness of fit test may not be possible or show | |||
| of identical time intervals to allow a comparison. In our delay | disappointing results. To clarify the difficulties arising from | |||
| example, single sample delay metric values are calculated for 9 | different implementation options, the individual options picked for | |||
| minute intervals. If 200 consecutive sample delay metrics with the | every compared implementation SHOULD be documented in sufficient | |||
| same start and end interval are available for each implementation, | detail. Based on this documentation, the underlying metric | |||
| autocorrelation can be calculated for different n * 9 minute lags. | specification should be improved before it is promoted to a standard. | |||
| The autocorrelation calculated for the time series of each | ||||
| implementation should be very close to the autocorrelation of the | ||||
| other implementation for the same time lag. Further, the correlation | ||||
| coefficient for both time series should be close to 1. | ||||
| The way to prove that two IPPM metric measurements provide compatible | The same statistical test as applicable to quantify precision of a | |||
| results then could be performed stepwise: | single metric implementation MUST be passed to compare metric | |||
| conformance of different implementations. To document compatibility, | ||||
| the smallest measurement resolution at which the compared | ||||
| implementations passed the ADK sample test MUST be documented. | ||||
| o First prove that the two compared implementations have the same | For different implementations of the same metric, "variations in | |||
| precision by comparing statistics of the distribution of | conditions" are reasonably expected. The ADK test comparing samples | |||
| singletons (or samples) of a metric by comparing the EDF of the | of the different implementations may result in a lower precision than | |||
| samples captured by the two implementations. | the test for precision of each implementation individually. | |||
| o Second indicate that two compared implementations produce strongly | 3.4. Clock synchronisation | |||
| correlated time series of which each one individually has the same | ||||
| autocorrelation as the other one. | ||||
| Clock synchronization effects require special attention. Accuracy of | Clock synchronization effects require special attention. Accuracy of | |||
| one-way active delay measurements for any metrics implementation | one-way active delay measurements for any metrics implementation | |||
| depends on clock synchronization between the source and destination | depends on clock synchronization between the source and destination | |||
| of tests. Ideally, one-way active delay measurement (RFC 2679, | of tests. Ideally, one-way active delay measurement (RFC 2679, | |||
| [RFC2679]) test endpoints either have direct access to independent | [RFC2679]) test endpoints either have direct access to independent | |||
| GPS or CDMA-based time sources or indirect access to nearby NTP | GPS or CDMA-based time sources or indirect access to nearby NTP | |||
| primary (stratum 1) time sources, equipped with GPS receivers. | primary (stratum 1) time sources, equipped with GPS receivers. | |||
| Access to these time sources may not be available at all test | Access to these time sources may not be available at all test | |||
| locations associated with different Internet paths, for a variety of | locations associated with different Internet paths, for a variety of | |||
| skipping to change at page 11, line 13 ¶ | skipping to change at page 11, line 23 ¶ | |||
| ms (+/- 500 us) with a confidence of 95% if the metric is captured | ms (+/- 500 us) with a confidence of 95% if the metric is captured | |||
| along an Internet path which is stable and not congested during a | along an Internet path which is stable and not congested during a | |||
| measurement duration of an hour or more. [Editor: this latter | measurement duration of an hour or more. [Editor: this latter | |||
| definition may avoid NTP (stratum 2 or worse) synchronized IPPM | definition may avoid NTP (stratum 2 or worse) synchronized IPPM | |||
| implementations from becoming IPPM compliant. However internal PC | implementations from becoming IPPM compliant. However internal PC | |||
| clock synched implementations can't be rejected that way. Ideas on | clock synched implementations can't be rejected that way. Ideas on | |||
| criteria to deal with the latter are welcome. May drift be one, as | criteria to deal with the latter are welcome. May drift be one, as | |||
| GPS synched implementations shouldn't have one or the same on origin | GPS synched implementations shouldn't have one or the same on origin | |||
| and destination, respectively]. | and destination, respectively]. | |||
| Metric tests should be executed under conditions which are identical | 3.5. Recommended Metric Verification Measurement Process | |||
| to the largest possible or necessary extent. As "identical network | ||||
| conditions" are fundamental to the methodology proposed by this | ||||
| document, more input and a thorough discussion is needed to define | ||||
| these. Some thoughts are: | ||||
| o In a laboratory environment, NTP synchronisation may have a less | ||||
| serious impact. In a real network, improper synchronisation will | ||||
| be harder to conceal. | ||||
| o OWD measurements are of highest precision with well synchronized | ||||
| measurement systems measuring delays along a stable not congested | ||||
| path. Care must be taken to avoid comparing noise and the | ||||
| measurement error respectively instead of the delay. | ||||
| o Packet loss, delay variation and packet reordering require a | ||||
| sufficient number of these events to allow for a metric test with | ||||
| the desired confidence. While one could wait for congestion or | ||||
| execute the test across known bottlenecks, this may incur some | ||||
| effort. A question is, whether to test these metrics under | ||||
| laboratory conditions. To generalise this question: can | ||||
| laboratory metric tests be tolerated for metrics whose precision | ||||
| doesn't depend on synchronized clocks? | ||||
| o Packet loss and delay variation probably allow for a relaxed | ||||
| definition of "identical test conditions", as it may be sufficient | ||||
| for test packets to share the congested interface or paths to test | ||||
| for these metrics. | ||||
| o In a laboratory environment, "stationary" networking conditions | ||||
| can be produced without having to care about parallel resources, | ||||
| applied by carriers to increase capacity. In a commercial | ||||
| network, hashing functions (on addresses and ports) determine | ||||
| which set of resources all the packets in a flow will traverse. | ||||
| Testing in the lab may not remove the parallel resources, but it | ||||
| can provide some time stability that's never assured in live | ||||
| network testing. | ||||
| o Applicability of tunnels to avoid the impact of unknown parallel | ||||
| resources applied by networks traversed by measurement packets | ||||
| during a test should be investigated. | ||||
| o To determine if some aspects of the metric specifications are | ||||
| clear and unambiguous, some specific conditions in the lab may be | ||||
| simulated to determine if implementations measure them as | ||||
| expected. Thus it should be tested whether all implementors read | ||||
| the spec the same way. Further, reducing some sources of | ||||
| variation right at the start, will make the job of statistical | ||||
| comparison simpler. | ||||
| o Getting access to operator information like load and packet loss | ||||
| counters of a network which was used during a metric test is | ||||
| improbable. But testing across a real network still is desirable | ||||
| for a metric test. | ||||
| 4. Recommended Metric Verification Measurement Process | ||||
| The proposal made by the authors of bradner-metrictest | The proposal made by the authors of bradner-metrictest | |||
| [bradner-metrictest] is picked up and slightly enhanced: | [bradner-metrictest] is picked up and slightly enhanced: | |||
| "In order to meet their obligations under the IETF Standards Process | "In order to meet their obligations under the IETF Standards Process | |||
| the IESG must be convinced that each metric specification advanced to | the IESG must be convinced that each metric specification advanced to | |||
| Draft Standard or Internet Standard status is clearly written, that | Draft Standard or Internet Standard status is clearly written, that | |||
| there are the required multiple verifiably equivalent | there are the required multiple verifiably equivalent | |||
| implementations, and that all options have been implemented. | implementations, and that all options have been implemented. | |||
| skipping to change at page 13, line 7 ¶ | skipping to change at page 12, line 7 ¶ | |||
| stable network, or simultaneously on a network that may or may not be | stable network, or simultaneously on a network that may or may not be | |||
| stable should produce essentially the same results." | stable should produce essentially the same results." | |||
| Following these assumptions any recommendation for the advancement of | Following these assumptions any recommendation for the advancement of | |||
| a metric specification needs to be accompanied by an implementation | a metric specification needs to be accompanied by an implementation | |||
| report, as is the case with all requests for the advancement of IETF | report, as is the case with all requests for the advancement of IETF | |||
| specifications. The implementation report needs to include a | specifications. The implementation report needs to include a | |||
| specific plan to test the specific metrics in the RFC in lab or real- | specific plan to test the specific metrics in the RFC in lab or real- | |||
| world networks and reports of the tests performed with two or more | world networks and reports of the tests performed with two or more | |||
| implementations of the software. The test plan should cover key | implementations of the software. The test plan should cover key | |||
| parts of the specification, specify the accuracy required for each | parts of the specification, specify the precision reached for each | |||
| measured metric and thus define the meaning of "statistically | measured metric and thus define the meaning of "statistically | |||
| equivalent" for the specific metrics being tested. Ideally, the test | equivalent" for the specific metrics being tested. Ideally, the test | |||
| plan would co-evolve with the development of the metric, since that's | plan would co-evolve with the development of the metric, since that's | |||
| when people have the most context in their thinking regarding the | when people have the most context in their thinking regarding the | |||
| different subtleties that can arise. | different subtleties that can arise. | |||
| In particular, the implementation report MUST as a minimum document: | In particular, the implementation report MUST as a minimum document: | |||
| o The metric compared and the RFC specifying it, including the | o The metric compared and the RFC specifying it, including the | |||
| chosen options (like e.g. the implemented selection function in | chosen options (like e.g. the implemented selection function in | |||
| skipping to change at page 13, line 33 ¶ | skipping to change at page 12, line 33 ¶ | |||
| stream property which could result in deviating results. | stream property which could result in deviating results. | |||
| Deviations in results can be caused also if chosen IP addresses | Deviations in results can be caused also if chosen IP addresses | |||
| and ports of different implementations can result in different | and ports of different implementations can result in different | |||
| layer 2 or layer 3 paths due to operation of Equal Cost Multi-Path | layer 2 or layer 3 paths due to operation of Equal Cost Multi-Path | |||
| routing in an operational network | routing in an operational network | |||
| o The duration of each measurement to be used for a metric | o The duration of each measurement to be used for a metric | |||
| validation, the number of measurement points collected for each | validation, the number of measurement points collected for each | |||
| metric during each measurement interval (i.e. the probe size) and | metric during each measurement interval (i.e. the probe size) and | |||
| the level of confidence derived from this probe size for each | the level of confidence derived from this probe size for each | |||
| measurement interval | measurement interval. | |||
| o The result of the statistical tests performed for each metric | o The result of the statistical tests performed for each metric | |||
| validation. | validation. | |||
| o The measurement configuration and set up | o The measurement configuration and set up. | |||
| o A parameterization of laboratory conditions and applied traffic | o A parameterization of laboratory conditions and applied traffic | |||
| and network conditions allowing reproduction of these laboratory | and network conditions allowing reproduction of these laboratory | |||
| conditions for readers of the implementation report. | conditions for readers of the implementation report. | |||
| "All of the tests for each set MUST be run in the same direction | All of the tests for each set MUST be run in a test set up as | |||
| between the same two points on the same network. The tests SHOULD be | specified in the section "Test set up resulting in identical live | |||
| run simultaneously unless the network is stable enough to ensure that | network testing conditions." | |||
| the path the data takes through the network will not change between | ||||
| tests." | ||||
| It is RECOMMENDED to avoid effects falsifying results of real data | It is RECOMMENDED to avoid effects falsifying results of real data | |||
| networks, if validation measurements are taken over them. Obviously, | networks, if validation measurements are taken over them. Obviously, | |||
| the conditions met there can't be reproduced. As the measurement | the conditions met there can't be reproduced. As the measurement | |||
| equipment compared is designed to reliably quantify real network | equipment compared is designed to reliably quantify real network | |||
| performance, validating metrics under real network conditions is | performance, validating metrics under real network conditions is | |||
| desirable of course. | desirable of course. | |||
| Data networks may forward packets differently in the case of: | Data networks may forward packets differently in the case of: | |||
| skipping to change at page 14, line 24 ¶ | skipping to change at page 13, line 22 ¶ | |||
| against an original distribution. | against an original distribution. | |||
| o Selection of differing IP addresses and ports used by different | o Selection of differing IP addresses and ports used by different | |||
| metric implementations during metric validation tests. If ECMP is | metric implementations during metric validation tests. If ECMP is | |||
| applied on IP or MPLS level, different paths can result (note that | applied on IP or MPLS level, different paths can result (note that | |||
| it may be impossible to detect an MPLS ECMP path from an IP | it may be impossible to detect an MPLS ECMP path from an IP | |||
| endpoint). A proposed counter measure is to connect the | endpoint). A proposed counter measure is to connect the | |||
| measurement equipment to be compared by a NAT device, or | measurement equipment to be compared by a NAT device, or | |||
| establishing a single tunnel to transport all measurement traffic. | establishing a single tunnel to transport all measurement traffic. | |||
| The aim is to have the same IP addresses and port for all | The aim is to have the same IP addresses and port for all | |||
| measurement packets or to avoid ECMP by a layer 2 tunnel. | measurement packets or to avoid ECMP based local routing diversion | |||
| by using a layer 2 tunnel. | ||||
| o Different IP options. | o Different IP options. | |||
| o Different DSCP. | o Different DSCP. | |||
| The test design may have to be adapted for the purpose of the | 4. Acknowledgements | |||
| measurement. Creation of delay and delay variation probes is simple | ||||
| and straightforward, also if the measurement runs across a real data | ||||
| network. Collecting a large number of packet loss samples on a real | ||||
| data network while being sure that operational conditions are stable | ||||
| may not be feasible. Further discussion on test designs to verify | ||||
| specific metrics may indeed be required. | ||||
| 5. Acknowledgements | ||||
| Gerhard Hasslinger commented on a first version of this document, | Gerhard Hasslinger commented on a first version of this document, | |||
| suggested statistical tests and the evaluation of time series | suggested statistical tests and the evaluation of time series | |||
| information. Henk Uijterwaal pushed this work and Mike Hamilton | information. Henk Uijterwaal pushed this work and Mike Hamilton | |||
| reviewed the document before publication. | reviewed the document before publication. | |||
| 6. Contributors | 5. Contributors | |||
| Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- | Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- | |||
| metrictest [bradner-metrictest], and major parts of it are quoted in | metrictest [bradner-metrictest], and major parts of it are quoted in | |||
| this document. Al Morton and Scott Bradner commented on this draft | this document. Scott Bradner and Emile Stephan commented on this draft | |||
| before publication. | before publication. | |||
| 7. IANA Considerations | 6. IANA Considerations | |||
| This memo includes no request to IANA. | This memo includes no request to IANA. | |||
| 8. Security Considerations | 7. Security Considerations | |||
| This draft does not raise any specific security issues. | This draft does not raise any specific security issues. | |||
| 9. References | 8. References | |||
| 9.1. Normative References | 8.1. Normative References | |||
| [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, | ||||
| October 1996. | ||||
| [RFC2026] Bradner, S., "The Internet Standards Process -- Revision | [RFC2026] Bradner, S., "The Internet Standards Process -- Revision | |||
| 3", BCP 9, RFC 2026, October 1996. | 3", BCP 9, RFC 2026, October 1996. | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, | [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, | |||
| "Framework for IP Performance Metrics", RFC 2330, | "Framework for IP Performance Metrics", RFC 2330, | |||
| May 1998. | May 1998. | |||
| [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, | ||||
| G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", | ||||
| RFC 2661, August 1999. | ||||
| [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way | [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way | |||
| Delay Metric for IPPM", RFC 2679, September 1999. | Delay Metric for IPPM", RFC 2679, September 1999. | |||
| 9.2. Informative References | [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. | |||
| Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, | ||||
| March 2000. | ||||
| [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling | ||||
| Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. | ||||
| [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, | ||||
| "Encapsulation Methods for Transport of Ethernet over MPLS | ||||
| Networks", RFC 4448, April 2006. | ||||
| [RFC4928] Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal | ||||
| Cost Multipath Treatment in MPLS Networks", BCP 128, | ||||
| RFC 4928, June 2007. | ||||
| 8.2. Informative References | ||||
| [Autocorrelation] | [Autocorrelation] | |||
| N., N., "Autocorrelation", December 2008. | N., N., "Autocorrelation", December 2008. | |||
| [Correlation] | [Correlation] | |||
| N., N., "Correlation", June 2009. | N., N., "Correlation", June 2009. | |||
| [GU&Duffield] | ||||
| Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE | ||||
| Encapsulated Multicast Probing: A Scalable Technique for | ||||
| Measuring One-Way Loss", SIGMETRICS'07 San Diego, | ||||
| California, USA, June 2007. | ||||
| [Precision] | [Precision] | |||
| N., N., "Accuracy and precision", June 2009. | N., N., "Accuracy and precision", June 2009. | |||
| [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. | [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. | |||
| Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | |||
| RFC 5357, October 2008. | RFC 5357, October 2008. | |||
| [Rule of thumb] | [Rule of thumb] | |||
| N., N., "Confidence interval", October 2008. | N., N., "Confidence interval", October 2008. | |||
| [bradner-metrictest] | [bradner-metrictest] | |||
| Bradner, S., Mankin, A., and V. Paxson, "Advancement of | Bradner, S., Mankin, A., and V. Paxson, "Advancement of | |||
| metrics specifications on the IETF Standards Track", | metrics specifications on the IETF Standards Track", | |||
| draft -bradner-metricstest-03, (work in progress), | draft -bradner-metricstest-03, (work in progress), | |||
| July 2007. | July 2007. | |||
| [morton-advance-metrics] | ||||
| Morton, A., "Problems and Possible Solutions for Advancing | ||||
| Metrics on the Standards Track", draft -morton-ippm- | ||||
| advance-metrics-00, (work in progress), July 2009. | ||||
| Appendix A. Further ideas on statistical tests | ||||
| IPPM metrics are captured by time series. Time series can be checked | ||||
| for correlation. There are two expectations on statistical time | ||||
| series properties which should be met by separate measurements | ||||
| probing the same underlying network performance distribution: | ||||
| o The Autocorrelation indicates, whether there are any repeating | ||||
| patterns within a time series. For the purpose of this document, | ||||
| it does not matter whether there is autocorrelation in a | ||||
| measurement. It is however expected, that two measurements expose | ||||
| the same autocorrelation on identical "lag" intervals. If | ||||
| calculable, the autocorrelation lies within an interval [-1;1], | ||||
| (see Wikipedia on autocorrelation [Autocorrelation]). | ||||
| o The correlation coefficient "indicates the strength of a linear | ||||
| relationship between two random variables." The two random | ||||
| variables in the case of this document are the measurement time | ||||
| series of the IPPM implementations to be compared. The | ||||
| expectation is, that both are strongly correlated and the | ||||
| resulting correlation coefficient is close to 1, (see Wikipedia on | ||||
| correlation [Correlation]). | ||||
| A metric test can derive additional statistics from time series | ||||
| analysis. Further, formulation of a test hypothesis is possible for | ||||
| autocorrelation and the correlation coefficient. It is however not | ||||
| clear, whether an appropriate statistical test to validate the | ||||
| hypothesis by 95% significance exists. Applicability of time series | ||||
| analysis for a metric test requires further input from statisticians. | ||||
| In the absence of any metric test on time series, any test result | ||||
| SHOULD provide the autocorrelation of the compared metrics time | ||||
| series by lags from 1 to 10. In addition, the value of the | ||||
| correlation coefficient SHOULD be provided. Autocorrelation and | ||||
| Correlation coefficient are expected to be rather close to the value | ||||
| 1. | ||||
| As mentioned earlier, the time series analysis requires application | ||||
| of identical time intervals to allow a comparison. In our delay | ||||
| example, single sample delay metric values are calculated for 9 | ||||
| minute intervals. If 200 consecutive sample delay metrics with the | ||||
| same start and end interval are available for each implementation, | ||||
| autocorrelation can be calculated for different n * 9 minute lags. | ||||
| The autocorrelation calculated for the time series of each | ||||
| implementation should be very close to the autocorrelation of the | ||||
| other implementation for the same time lag. Further, the correlation | ||||
| coefficient for both time series should be close to 1. | ||||
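| The time series statistics described above can be sketched as | ||||
| follows. This is a minimal illustration only, not a method | ||||
| specified by this document. It assumes two equally long series of | ||||
| sample delay values taken over identical intervals. The function | ||||
| names (autocorrelation, correlation, time_series_report) and the | ||||
| choice of the biased autocorrelation estimator are assumptions | ||||
| made for the example. | ||||

```python
from math import sqrt


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (biased estimator).

    Returns None when the value is not calculable (constant series or
    lag not shorter than the series).
    """
    n = len(series)
    if lag >= n:
        return None
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return None
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def correlation(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


def time_series_report(a, b, max_lag=10):
    """Autocorrelation for lags 1..max_lag of both series, plus their correlation."""
    lags = range(1, max_lag + 1)
    return {
        "acf_a": [autocorrelation(a, k) for k in lags],
        "acf_b": [autocorrelation(b, k) for k in lags],
        "correlation": correlation(a, b),
    }
```

| With 200 consecutive sample delay values per implementation, | ||||
| time_series_report(a, b) yields the autocorrelation of each series | ||||
| for lags 1 to 10 and the correlation coefficient, which can then be | ||||
| compared lag by lag. | ||||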
| A stepwise approach could then be used to show that two IPPM metric | ||||
| measurements provide compatible results: | ||||
| o First, show that the two compared implementations have the same | ||||
| precision by comparing the EDFs of the singletons (or samples) of | ||||
| a metric captured by the two implementations. | ||||
| o Second, show that the two compared implementations produce | ||||
| strongly correlated time series, each of which has the same | ||||
| autocorrelation as the other. | ||||
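| The first step, comparing the EDFs of two captures, can be | ||||
| illustrated with the sketch below. It computes the largest vertical | ||||
| distance between two empirical distribution functions, which is the | ||||
| statistic used by the two-sample Kolmogorov-Smirnov test. Choosing | ||||
| this particular statistic is an assumption made for the example, and | ||||
| the function names are hypothetical. | ||||

```python
from bisect import bisect_right


def edf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_edf_distance(sample_a, sample_b):
    """Largest vertical distance between the EDFs of two samples.

    Both EDFs are step functions that only change at observed values,
    so it is sufficient to evaluate them at every observed value.
    """
    f_a, f_b = edf(sample_a), edf(sample_b)
    points = set(sample_a) | set(sample_b)
    return max(abs(f_a(x) - f_b(x)) for x in points)
```

| A distance of 0 means that both captures have identical EDFs. A | ||||
| distance of 1 means that the two value ranges do not overlap at | ||||
| all. Turning the distance into an accept or reject decision | ||||
| requires a statistical test with a chosen level of confidence. | ||||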
| Comparing the "Accuracy" of IPPM implementations based on averages and | ||||
| variations may require prior checks for the absence of long range | ||||
| dependency within the compared measurements. Large outliers, as | ||||
| typically occur in the case of long range dependency, can have a | ||||
| serious impact on mean values. The median or percentiles may be more | ||||
| robust measures on which to compare the accuracy of different IPPM | ||||
| implementations. One idea is to consider data up to a certain | ||||
| percentile, calculate the mean of the data up to this percentile, and | ||||
| then compare the means of the two implementations. This could be | ||||
| repeated for different percentiles. If the impact of long range | ||||
| dependency is limited to large outliers, the method may work for | ||||
| lower percentiles. Whether this makes sense must be confirmed by a | ||||
| statistician, so this approach requires further study. | ||||
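| The percentile-limited mean described above could be computed as in | ||||
| the following sketch. It is an illustration under assumptions: the | ||||
| nearest-rank rule for locating the percentile, the default list of | ||||
| percentiles, and the function names are not taken from this | ||||
| document. | ||||

```python
from math import ceil


def mean_up_to_percentile(values, percentile):
    """Mean of the smallest values up to the given percentile (nearest rank)."""
    if not values or not 0 < percentile <= 100:
        raise ValueError("need a non-empty sample and 0 < percentile <= 100")
    ordered = sorted(values)
    # Nearest-rank rule: keep the ceil(n * p / 100) smallest values.
    count = ceil(len(ordered) * percentile / 100)
    kept = ordered[:count]
    return sum(kept) / len(kept)


def compare_truncated_means(sample_a, sample_b, percentiles=(50, 75, 90, 95, 99)):
    """Percentile-limited means of both samples, keyed by percentile."""
    return {
        p: (mean_up_to_percentile(sample_a, p), mean_up_to_percentile(sample_b, p))
        for p in percentiles
    }
```

| For a sample of ten delays in which a single value is a large | ||||
| outlier, the mean up to the 90th percentile ignores that outlier, | ||||
| while the mean up to the 100th percentile is dominated by it. | ||||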
| Appendix B. Verification of measurement precision by statistical | ||||
| methods | ||||
| Following the definition of statistical precision [Precision], a | ||||
| measurement process can be characterised by two properties: | ||||
| o Accuracy, which is the degree of conformity of a measured quantity | ||||
| to its actual (true) value. | ||||
| o Precision, also called reproducibility or repeatability, the | ||||
| degree to which repeated measurements show the same or similar | ||||
| results. | ||||
| Figure 1 further clarifies the difference between accuracy and | ||||
| precision of a measurement. | ||||
| Probability ^ | ||||
| Density     | | ||||
|             |  Reference value   Measured Value | ||||
|             |         |                 | | ||||
|             |         |<---Accuracy---->| | ||||
|             |         |                _|_ | ||||
|             |         |               / | \ | ||||
|             |         |              /  |  \ | ||||
|             |         |             /   |   \ | ||||
|             |         |            /    |    \ | ||||
|             |         |           /     |     \ | ||||
|             |         |          /      |      \ | ||||
| Measured    |         |         /<- Precision ->\ | ||||
| Value      -|---------|-----------------|----------> | ||||
|             | | ||||
| Measurement accuracy and precision [Precision]. | ||||
| Figure 1 | ||||
| The Framework for IP Performance Metrics (RFC 2330, [RFC2330]) | ||||
| expects that a "methodology for a metric should have the property | ||||
| that it is repeatable: if the methodology is used multiple times | ||||
| under identical conditions, it should result in consistent | ||||
| measurements." This means that an IPPM implementation is expected to | ||||
| measure a metric with high precision. | ||||
| A guideline for an IPPM conformant metric implementation can be taken | ||||
| from these principles: | ||||
| Two different implementations measuring the same IPPM metric must | ||||
| produce results that differ only within limits when measuring under | ||||
| network conditions that are identical to the largest extent possible. | ||||
| In a metric test, both conditions are expected to hold, meaning that | ||||
| repeated tests of two implementations MUST produce precise results | ||||
| for all repetition intervals. | ||||
| A suitable statistical test and a level of confidence to determine | ||||
| whether differences are limited and whether a measurement is | ||||
| highly precise are specified below. | ||||
| Let's assume a one way delay measurement comparison between system A, | ||||
| probing with a frequency of 2 probes per second, and system B, probing | ||||
| at a rate of 2 probes every 3 minutes. To ensure reasonable | ||||
| confidence in results, sample metrics are calculated from at least 5 | ||||
| singletons per compared time interval. This means that sample delay | ||||
| values are calculated for each system for identical 6 minute | ||||
| intervals for the whole test duration. Per 6 minute interval, the | ||||
| sample metric is calculated from 720 singletons for system A and from | ||||
| 6 singletons for system B. Note that if outliers are not filtered, | ||||
| moving averages are an option for an evaluation too. The minimum | ||||
| move of an averaging interval is three minutes in our example. | ||||
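| Grouping singletons into identical evaluation intervals can be | ||||
| sketched as shown below. The sketch assumes that each singleton is | ||||
| a pair of send time in seconds and delay in milliseconds, and that | ||||
| the sample metric is the mean delay. Both the data layout and the | ||||
| choice of the mean are assumptions for illustration, as are the | ||||
| function and parameter names. | ||||

```python
from collections import defaultdict
from statistics import mean


def sample_delays(singletons, interval_s=360, min_singletons=5):
    """Mean delay per interval, keyed by interval index.

    `singletons` is an iterable of (send_time_s, delay_ms) pairs.
    Intervals holding fewer than `min_singletons` values are skipped.
    """
    buckets = defaultdict(list)
    for send_time_s, delay_ms in singletons:
        buckets[int(send_time_s // interval_s)].append(delay_ms)
    return {
        index: mean(delays)
        for index, delays in sorted(buckets.items())
        if len(delays) >= min_singletons
    }
```

| Applying the same interval length to the singletons of both systems | ||||
| yields two series of sample values with matching interval indices. | ||||
| Only indices present in both results can be compared. | ||||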
| The test set up for the delay measurement is chosen to minimize | ||||
| errors by locating one system of each implementation at the same end | ||||
| of two separate sites, between which delay is measured for the metric | ||||
| test. Both measurement sites are connected by one IPSEC tunnel, so | ||||
| that all measurement packets cross the Internet with the same IP | ||||
| addresses. Both measurement systems measure simultaneously and the | ||||
| local links are dimensioned to avoid congestion caused by the probing | ||||
| traffic itself. | ||||
| The measured delay values are reported with a resolution above the | ||||
| measurement error and above the synchronisation error. This is done | ||||
| to avoid comparing these errors between two different metric | ||||
| implementations instead of comparing the IPPM metric implementation | ||||
| itself. | ||||
| The overall duration of the test is chosen so that more than 1000 six | ||||
| minute measurement intervals are collected. The amount of data | ||||
| collected allows separate comparisons for e.g. 200 consecutive 6 | ||||
| minute intervals. Intervals during which routes were unstable are | ||||
| discarded prior to evaluation. | ||||
| The captured singleton delays may range from an absolute minimum | ||||
| delay Dmin to values of Dmin + 5 ms. To compare | ||||
| distributions, the set of singletons of a chosen evaluation interval | ||||
| (e.g. the data of one of the five 1200 minute capture sequences, see | ||||
| above) is sorted by the frequency of singletons per Dmin + n * 0.5 | ||||
| ms (n = 1, 2, ...). After that, a comparison of the two probe sets | ||||
| with any of the mentioned tests may be applied. | ||||
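| Sorting singletons by frequency per 0.5 ms step above Dmin amounts | ||||
| to building a histogram. The sketch below is one possible reading. | ||||
| It assumes that bin n covers delays from Dmin + (n - 1) * 0.5 ms up | ||||
| to, but not including, Dmin + n * 0.5 ms, so that Dmin itself falls | ||||
| into bin 1. This bin boundary convention and the function name are | ||||
| assumptions. | ||||

```python
from collections import Counter


def delay_histogram(delays_ms, bin_width_ms=0.5):
    """Count singletons per bin n, where bin n starts at Dmin + (n - 1) * width."""
    d_min = min(delays_ms)
    counts = Counter(
        int((delay - d_min) // bin_width_ms) + 1 for delay in delays_ms
    )
    # Return a plain dict ordered by bin number.
    return dict(sorted(counts.items()))
```

| When two probe sets are compared, the same Dmin reference should be | ||||
| used for both so that the bins line up. The sketch derives Dmin | ||||
| from the sample it is given and would need a shared reference value | ||||
| for that purpose. | ||||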
| Authors' Addresses | Authors' Addresses | |||
| Ruediger Geib (editor) | Ruediger Geib (editor) | |||
| Deutsche Telekom | Deutsche Telekom | |||
| Heinrich Hertz Str. 3-7 | Heinrich Hertz Str. 3-7 | |||
| Darmstadt, 64295 | Darmstadt, 64295 | |||
| Germany | Germany | |||
| Phone: +49 6151 628 2747 | Phone: +49 6151 628 2747 | |||
| Email: Ruediger.Geib@telekom.de | Email: Ruediger.Geib@telekom.de | |||
| Al Morton | ||||
| AT&T Labs | ||||
| 200 Laurel Avenue South | ||||
| Middletown, NJ 07748 | ||||
| USA | ||||
| Phone: +1 732 420 1571 | ||||
| Fax: +1 732 368 1192 | ||||
| Email: acmorton@att.com | ||||
| URI: http://home.comcast.net/~acmacm/ | ||||
| Reza Fardid | Reza Fardid | |||
| Covad Communications | Covad Communications | |||
| 2510 Zanker Road | 2510 Zanker Road | |||
| San Jose, CA 95131 | San Jose, CA 95131 | |||
| USA | USA | |||
| Phone: +1 408 434-2042 | Phone: +1 408 434-2042 | |||
| Email: RFardid@covad.com | Email: RFardid@covad.com | |||
| End of changes. 55 change blocks. | ||||
| 315 lines changed or deleted | 493 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||