idnits 2.17.1 draft-ietf-ippm-metrictest-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (November 29, 2011) is 4522 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1346 == Unused Reference: 'RFC3931' is defined on line 996, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 2330 ** Obsolete normative reference: RFC 2679 (Obsoleted by RFC 7679) Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force R. Geib, Ed. 3 Internet-Draft Deutsche Telekom 4 Intended status: BCP A. Morton 5 Expires: June 1, 2012 AT&T Labs 6 R. Fardid 7 Cariden Technologies 8 A. Steinmitz 9 Deutsche Telekom 10 November 29, 2011 12 IPPM standard advancement testing 13 draft-ietf-ippm-metrictest-05 15 Abstract 17 This document specifies tests to determine if multiple independent 18 instantiations of a performance metric RFC have implemented the 19 specifications in the same way. This is the performance metric 20 equivalent of interoperability, required to advance RFCs along the 21 standards track. Results from different implementations of metric 22 RFCs will be collected under the same underlying network conditions 23 and compared using statistical methods. The goal is an evaluation of 24 the metric RFC itself; whether its definitions are clear and 25 unambiguous to implementors and therefore a candidate for advancement 26 on the IETF standards track. This document is an Internet Best 27 Current Practice. 29 Status of this Memo 31 This Internet-Draft is submitted in full conformance with the 32 provisions of BCP 78 and BCP 79. 34 Internet-Drafts are working documents of the Internet Engineering 35 Task Force (IETF). Note that other groups may also distribute 36 working documents as Internet-Drafts. The list of current Internet- 37 Drafts is at http://datatracker.ietf.org/drafts/current/. 39 Internet-Drafts are draft documents valid for a maximum of six months 40 and may be updated, replaced, or obsoleted by other documents at any 41 time. It is inappropriate to use Internet-Drafts as reference 42 material or to cite them other than as "work in progress." 
44 This Internet-Draft will expire on June 1, 2012. 46 Copyright Notice 48 Copyright (c) 2011 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents 53 (http://trustee.ietf.org/license-info) in effect on the date of 54 publication of this document. Please review these documents 55 carefully, as they describe your rights and restrictions with respect 56 to this document. Code Components extracted from this document must 57 include Simplified BSD License text as described in Section 4.e of 58 the Trust Legal Provisions and are provided without warranty as 59 described in the Simplified BSD License. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 64 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 5 65 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 5 66 3. Verification of conformance to a metric specification . . . . 7 67 3.1. Tests of an individual implementation against a metric 68 specification . . . . . . . . . . . . . . . . . . . . . . 8 69 3.2. Test setup resulting in identical live network testing 70 conditions . . . . . . . . . . . . . . . . . . . . . . . . 9 71 3.3. Tests of two or more different implementations against 72 a metric specification . . . . . . . . . . . . . . . . . . 15 73 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 16 74 3.5. Recommended Metric Verification Measurement Process . . . 17 75 3.6. Proposal to determine an "equivalence" threshold for 76 each metric evaluated . . . . . . . . . . . . . . . . . . 20 77 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 21 78 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 21 79 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 80 7. Security Considerations . . . . . . . . . . . . . . . . . . . 21 81 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 21 82 8.1. Normative References . . . . . . . . . . . . . . . . . . . 21 83 8.2. Informative References . . . . . . . . . . . . . . . . . . 22 84 Appendix A. An example on a One-way Delay metric validation . . . 23 85 A.1. Compliance to Metric specification requirements . . . . . 23 86 A.2. Examples related to statistical tests for One-way Delay . 25 87 Appendix B. Anderson-Darling K-sample Reference and 2 sample 88 C++ code . . . . . . . . . . . . . . . . . . . . . . 27 89 Appendix C. Glossary . . . . . . . . . . . . . . . . . . . . . . 36 90 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 36 92 1. Introduction 94 The Internet Standards Process as updated by RFC6410 [RFC6410] 95 specifies that widespread deployment and use is sufficient to show 96 interoperability as a condition for advancement to Internet Standard. 97 The previous requirement of interoperability tests prior to advancing 98 an RFC to the Standard maturity level specified in RFC 2026 [RFC2026] 99 and RFC 5657 [RFC5657] has been removed. While the modified 100 requirement is applicable to protocols, wide deployment of different 101 measurement systems does not prove that the implementations measure 102 metrics in a standard way. Section 5.3 of RFC 5657 [RFC5657] 103 explicitly mentions the special case of Standards that are not "on- 104 the-wire" protocols. 
While this special case is not explicitly 105 mentioned by RFC6410 [RFC6410], the four criteria in Section 2.2 of 106 RFC6410 [RFC6410] are augmented by this document for RFCs that 107 specify performance metrics. This document takes the position that 108 flexible metric definitions can be proven to be clear and unambiguous 109 through tests that compare the results from independent 110 implementations. It describes tests which infer whether metric 111 specifications are sufficient using a definition of metric 112 "interoperability": measuring equivalent results (in a statistical 113 sense) under the same network conditions. The document expands on 114 this problem and its solution below. 116 In the case of a protocol specification, the notion of 117 "interoperability" is reasonably intuitive - the implementations must 118 successfully "talk to each other", while exercising all features and 119 options. To achieve interoperability, two implementors need to 120 interpret the protocol specifications in equivalent ways. In the 121 case of IP Performance Metrics (IPPM), this definition of 122 interoperability is only useful for test and control protocols like 123 the One-Way Active Measurement Protocol, OWAMP [RFC4656], and the 124 Two-Way Active Measurement Protocol, TWAMP [RFC5357]. 126 A metric specification RFC describes one or more metric definitions, 127 methods of measurement and a way to report the results of 128 measurement. One example would be a way to test and report the One- 129 way Delay that data packets incur while being sent from one network 130 location to another, One-way Delay Metric. 132 In the case of metric specifications, the conditions that satisfy the 133 "interoperability" requirement are less obvious, and there was a need 134 for IETF agreement on practices to judge metric specification 135 "interoperability" in the context of the IETF Standards Process. 136 This memo provides methods which should be suitable to evaluate 137 metric specifications for standards track advancement. The methods 138 proposed here MAY be generally applicable to metric specification 139 RFCs beyond those developed under the IPPM Framework [RFC2330]. 141 Since many implementations of IP metrics are embedded in measurement 142 systems that do not interact with one another (they were built before 143 OWAMP and TWAMP), the interoperability evaluation called for in the 144 IETF standards process cannot be determined by observing that 145 independent implementations interact properly for various protocol 146 exchanges. Instead, verifying that different implementations give 147 statistically equivalent results under controlled measurement 148 conditions takes the place of interoperability observations. Even 149 when evaluating OWAMP and TWAMP RFCs for standards track advancement, 150 the methods described here are useful to evaluate the measurement 151 results because their validity would not be ascertained in protocol 152 interoperability testing. 154 The standards advancement process aims at producing confidence that 155 the metric definitions and supporting material are clearly worded and 156 unambiguous, or reveals ways in which the metric definitions can be 157 revised to achieve clarity. The process also permits identification 158 of options that were not implemented, so that they can be removed 159 from the advancing specification. 
Thus, the product of this process 160 is information about the metric specification RFC itself: 161 determination of the specifications or definitions that are clear and 162 unambiguous and those that are not (as opposed to an evaluation of 163 the implementations which assist in the process). 165 This document defines a process to verify that implementations (or 166 practically, measurement systems) have interpreted the metric 167 specifications in equivalent ways, and produce equivalent results. 169 Testing for statistical equivalence requires ensuring identical test 170 setups (or awareness of differences) to the best possible extent. 171 Thus, producing identical test conditions is a core goal of the memo. 172 Another important aspect of this process is to test individual 173 implementations against specific requirements in the metric 174 specifications using customized tests for each requirement. These 175 tests can distinguish equivalent interpretations of each specific 176 requirement. 178 Conclusions on equivalence are reached by two measures. 180 First, implementations are compared against individual metric 181 specifications to make sure that differences in implementation are 182 minimised or at least known. 184 Second, a test setup is proposed ensuring identical networking 185 conditions so that unknowns are minimized and comparisons are 186 simplified. The resulting separate data sets may be seen as samples 187 taken from the same underlying distribution. Using statistical 188 methods, the equivalence of the results is verified. To illustrate 189 application of the process and methods defined here, evaluation of 190 the One-way Delay Metric [RFC2679] is provided in an Appendix. While 191 test setups will vary with the metrics to be validated, the general 192 methodology of determining equivalent results will not. Documents 193 defining test setups to evaluate other metrics should be developed 194 once the process proposed here has been agreed and approved. 196 The metric RFC advancement process begins with a request for protocol 197 action accompanied by a memo that documents the supporting tests and 198 results. The procedures of [RFC2026] are expanded in[RFC5657], 199 including sample implementation and interoperability reports. 200 [morton-testplan-rfc2679] can serve as a template for a metric RFC 201 report which accompanies the protocol action request to the Area 202 Director, including description of the test set-up, procedures, 203 results for each implementation and conclusions. 205 1.1. Requirements Language 207 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 208 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 209 document are to be interpreted as described in RFC 2119 [RFC2119]. 211 2. Basic idea 213 The implementation of a standard compliant metric is expected to meet 214 the requirements of the related metric specification. So before 215 comparing two metric implementations, each metric implementation is 216 individually compared against the metric specification. 218 Most metric specifications leave freedom to implementors on non- 219 fundamental aspects of an individual metric (or options). Comparing 220 different measurement results using a statistical test with the 221 assumption of identical test path and testing conditions requires 222 knowledge of all differences in the overall test setup. Metric 223 specification options chosen by implementors have to be documented. 
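To make the statistical comparison described above concrete: the pass/fail decision used in this memo can be expressed as comparing the standardized 2-sample Anderson-Darling statistic against the critical value for 5% significance (95% confidence). The following non-normative C++ sketch only illustrates that final decision step; the critical value 1.993 is the one used by the example code in Appendix B, and computation of the statistic itself is shown there.

   /* Non-normative sketch of the equivalence decision: accept the
    * hypothesis of a common underlying distribution if the standardized
    * 2-sample Anderson-Darling statistic does not exceed the criterium
    * for 5% significance (value taken from the Appendix B example). */
   #include <iostream>

   static bool equivalent_at_95_percent_confidence(double adk_statistic) {
       const double adk_criterium = 1.993; /* 5% significance, 2 samples */
       return adk_statistic <= adk_criterium;
   }

   int main() {
       /* e.g., a computed statistic of 1.2 passes and 2.5 fails */
       std::cout << (equivalent_at_95_percent_confidence(1.2) ? "pass" : "fail")
                 << ", "
                 << (equivalent_at_95_percent_confidence(2.5) ? "pass" : "fail")
                 << std::endl;
       return 0;
   }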
224 It is RECOMMENDED to use identical metric options for any test 225 proposed here (an exception would be if a variable parameter of the 226 metric definition is not configurable in one or more 227 implementations). Calibrations specified by metric standards SHOULD 228 be performed to further identify (and possibly reduce) potential 229 sources of error in the test setup. 231 The Framework for IP Performance Metrics [RFC2330] expects that a 232 "methodology for a metric should have the property that it is 233 repeatable: if the methodology is used multiple times under identical 234 conditions, it should result in consistent measurements." This means 235 an implementation is expected to repeatedly measure a metric with 236 consistent results (repeatability with the same result). Small 237 deviations in the test setup are expected to lead to small deviations 238 in results only. To characterise statistical equivalence in the case 239 of small deviations, RFC 2330 and [RFC2679] suggest to apply a 95% 240 confidence interval. Quoting RFC 2679, "95 percent was chosen 241 because ... a particular confidence level should be specified so that 242 the results of independent implementations can be compared." 244 Two different implementations are expected to produce statistically 245 equivalent results if they both measure a metric under the same 246 networking conditions. Formulating in statistical terms: separate 247 metric implementations collect separate samples from the same 248 underlying statistical process (the same network conditions). The 249 statistical hypothesis to be tested is the expectation that both 250 samples do not expose statistically different properties. This 251 requires careful test design: 253 o The measurement test setup must be self-consistent to the largest 254 possible extent. To minimize the influence of the test and 255 measurement setup on the result, network conditions and paths MUST 256 be identical for the compared implementations to the largest 257 possible degree. This includes both the stability and non- 258 ambiguity of routes taken by the measurement packets. See RFC 259 2330 for a discussion on self-consistency. 261 o To minimize the influence of implementation options on the result, 262 metric implementations SHOULD use identical options and parameters 263 for the metric under evaluation. 265 o The sample size must be large enough to minimize its influence on 266 the consistency of the test results. This consideration may be 267 especially important if two implementations measure with different 268 average packet transmission rates. 270 o The implementation with the lowest average packet transmission 271 rate determines the smallest temporal interval for which samples 272 can be compared. 274 o Repeat comparisons with several independent metric samples to 275 avoid random indications of compatibility (or the lack of it). 277 The metric specifications themselves are the primary focus of 278 evaluation, rather than the implementations of metrics. The 279 documentation produced by the advancement process should identify 280 which metric definitions and supporting material were found to be 281 clearly worded and unambiguous, OR, it should identify ways in which 282 the metric specification text should be revised to achieve clarity 283 and unified interpretation. 
285 The process should also permit identification of options that were 286 not implemented, so that they can be removed from the advancing 287 specification (this is an aspect more typical of protocol advancement 288 along the standards track). 290 Note that this document does not propose to base interoperability 291 indications of performance metric implementations on comparisons of 292 individual singletons. Individual singletons may be impacted by many 293 statistical effects while they are measured. Comparing two 294 singletons of different implementations may result in failures with 295 higher probability than comparing samples. 297 3. Verification of conformance to a metric specification 299 This section specifies how to verify compliance of two or more IPPM 300 implementations against a metric specification. This document only 301 proposes a general methodology. Compliance criteria to a specific 302 metric implementation need to be defined for each individual metric 303 specification. The only exception is the statistical test comparing 304 two metric implementations which are simultaneously tested. This 305 test is applicable without metric specific decision criteria. 307 Several testing options exist to compare two or more implementations: 309 o Use a single test lab to compare the implementations and emulate 310 the Internet with an impairment generator. 312 o Use a single test lab to compare the implementations and measure 313 across the Internet. 315 o Use remotely separated test labs to compare the implementations 316 and emulate the Internet with two "identically" configured 317 impairment generators. 319 o Use remotely separated test labs to compare the implementations 320 and measure across the Internet. 322 o Use remotely separated test labs to compare the implementations 323 and measure across the Internet and include a single impairment 324 generator to impact all measurement flows in non discriminatory 325 way. 327 The first two approaches work, but involve higher expenses than the 328 others (due to travel and/or shipping plus installation). For the 329 third option, ensuring two identically configured impairment 330 generators requires well defined test cases and possibly identical 331 hardware and software. 333 As documented in a test report [morton-testplan-rfc2679], the last 334 option was required to prove compatibility of two delay metric 335 implementations. An impairment generator is probably required when 336 testing compatibility of most other metrics and it is therefore 337 RECOMMENDED to include an impairment generator in metric test setups. 339 3.1. Tests of an individual implementation against a metric 340 specification 342 A metric implementation is compliant with a metric specification if 343 it supports the requirements classified as "MUST" and "REQUIRED" of 344 the related metric specification. An implementation that implements 345 all requirements is fully compliant with the specification, and the 346 degree of compliance SHOULD be noted in the conclusions of the 347 report. 349 Further, supported options of a metric implementation SHOULD be 350 documented in sufficient detail to evaluate whether the specification 351 was correctly interpreted. The documentation of chosen options 352 should minimise (and recognise) differences in the test setup if two 353 metric implementations are compared. 
Further, this documentation is 354 used to validate or clarify the wording of the metric specification 355 option, to remove options which saw no implementation or which are 356 badly specified from the metric specification. This documentation 357 SHOULD be included for all implementation-relevant specifications of 358 a metric picked for a comparison, even those that are not explicitly 359 marked as "MUST" or "REQUIRED" in the RFC text. This applies for the 360 following sections of all metric specifications: 362 o Singleton Definition of the Metric. 364 o Sample Definition of the Metric. 366 o Statistics Definition of the Metric. As statistics are compared 367 by the test specified here, this documentation is required even in 368 the case, that the metric specification does not contain a 369 Statistics Definition. 371 o Timing and Synchronisation related specification (if relevant for 372 the Metric). 374 o Any other technical part present or missing in the metric 375 specification, which is relevant for the implementation of the 376 Metric. 378 RFC2330 and RFC2679 emphasise precision as an aim of IPPM metric 379 implementations. A single IPPM conforming implementation should 380 under otherwise identical network conditions produce precise results 381 for repeated measurements of the same metric. 383 RFC 2330 prefers the "empirical distribution function" EDF to 384 describe collections of measurements. RFC 2330 determines, that 385 "unless otherwise stated, IPPM goodness-of-fit tests are done using 386 5% significance." The goodness of fit test determines by which 387 precision two or more samples of a metric implementation belong to 388 the same underlying distribution (of measured network performance 389 events). The goodness of fit test suggested for the metric test is 390 the Anderson-Darling K sample test (ADK sample test, K stands for the 391 number of samples to be compared) [ADK]. Please note that RFC 2330 392 and RFC 2679 apply an Anderson Darling goodness of fit test too. 394 The results of a repeated test with a single implementation MUST pass 395 an ADK sample test with confidence level of 95%. The conditions for 396 which the ADK test has been passed with the specified confidence 397 level MUST be documented. To formulate this differently: The 398 requirement is to document the set of parameters with the smallest 399 deviation, at which the results of the tested metric implementation 400 pass an ADK test with a confidence level of 95%. The minimum 401 resolution available in the reported results from each implementation 402 MUST be taken into account in the ADK test. 404 The test conditions to be documented for a passed metric test 405 include: 407 o The metric resolution at which a test was passed (e.g. the 408 resolution of timestamps) 410 o The parameters modified by an impairment generator. 412 o The impairment generator parameter settings. 414 3.2. Test setup resulting in identical live network testing conditions 416 Two major issues complicate tests for metric compliance across live 417 networks under identical testing conditions. One is the general 418 point that metric definition implementations cannot be conveniently 419 examined in field measurement scenarios. The other one is more 420 broadly described as "parallelism in devices and networks", including 421 mechanisms like those that achieve load balancing (see [RFC4928]). 423 This section proposes two measures to deal with both issues. 
424 Tunneling mechanisms can be used to avoid parallel processing of 425 different flows in the network. Measuring by separate parallel probe 426 flows results in repeated collection of data. If both measures are 427 combined, WAN network conditions are identical for a number of 428 independent measurement flows, no matter what the network conditions 429 are in detail. 431 Any measurement setup must be made to avoid the probing traffic 432 itself to impede the metric measurement. The created measurement 433 load must not result in congestion at the access link connecting the 434 measurement implementation to the WAN. The created measurement load 435 must not overload the measurement implementation itself, e.g., by 436 causing a high CPU load or by creating imprecisions due to internal 437 transmit (receive respectively) probe packet collisions. 439 Tunneling multiple flows reaching a network element on a single 440 physical port may allow to transmit all packets of the tunnel via the 441 same path. Applying tunnels to avoid undesired influence of standard 442 routing for measurement purposes is a concept known from literature, 443 see e.g. GRE encapsulated multicast probing [GU-Duffield]. An 444 existing IP in IP tunnel protocol can be applied to avoid Equal-Cost 445 Multi-Path (ECMP) routing of different measurement streams if it 446 meets the following criteria: 448 o Inner IP packets from different measurement implementations are 449 mapped into a single tunnel with single outer IP origin and 450 destination address as well as origin and destination port numbers 451 which are identical for all packets. 453 o An easily accessible commodity tunneling protocol allows to carry 454 out a metric test from more test sites. 456 o A low operational overhead may enable a broader audience to set up 457 a metric test with the desired properties. 459 o The tunneling protocol should be reliable and stable in set up and 460 operation to avoid disturbances or influence on the test results. 462 o The tunneling protocol should not incur any extra cost for those 463 interested in setting up a metric test. 465 An illustration of a test setup with two layer 2 tunnels and two 466 flows between two linecards of one implementation is given in 467 Figure 1. 469 Implementation ,---. +--------+ 470 +~~~~~~~~~~~/ \~~~~~~| Remote | 471 +------->-----F2->-| / \ |->---+ | 472 | +---------+ | Tunnel 1( ) | | | 473 | | transmit|-F1->-| ( ) |->+ | | 474 | | LC1 | +~~~~~~~~~| |~~~~| | | | 475 | | receive |-<--+ ( ) | F1 F2 | 476 | +---------+ | |Internet | | | | | 477 *-------<-----+ F2 | | | | | | 478 +---------+ | | +~~~~~~~~~| |~~~~| | | | 479 | transmit|-* *-| | | |--+<-* | 480 | LC2 | | Tunnel 2( ) | | | 481 | receive |-<-F1-| \ / |<-* | 482 +---------+ +~~~~~~~~~~~\ /~~~~~~| Router | 483 `-+-' +--------+ 485 Illustration of a test setup with two layer 2 tunnels. For 486 simplicity, only two linecards of one implementation and two flows F 487 between them are shown. 489 Figure 1 491 Figure 2 shows the network elements required to set up layer 2 492 tunnels as shown by figure 1. 494 Implementation 496 +-----+ ,---. 
497 | LC1 | / \ 498 +-----+ / \ +------+ 499 | +-------+ ( ) +-------+ |Remote| 500 +--------+ | | | | | | | | 501 |Ethernet| | Tunnel| |Internet | | Tunnel| | | 502 |Switch |--| Head |--| |--| Head |--| | 503 +--------+ | Router| | | | Router| | | 504 | | | ( ) | | |Router| 505 +-----+ +-------+ \ / +-------+ +------+ 506 | LC2 | \ / 507 +-----+ `-+-' 508 Illustration of a hardware setup to realise the test setup 509 illustrated by figure 1 with layer 2 tunnels or Pseudowires. 511 Figure 2 513 The test set up successfully used during a delay metric test 514 [morton-testplan-rfc2679] is given as an example in figure 3. Note 515 that the shown set up allows a metric test between two remote sites. 517 +----+ +----+ +----+ +----+ 518 |LC10| |LC11| ,---. |LC20| |LC21| 519 +----+ +----+ / \ +-------+ +----+ +----+ 520 | V10 | V11 / \ | Tunnel| | V20 | V21 521 | | ( ) | Head | | | 522 +--------+ +------+ | | | Router|__+----------+ 523 |Ethernet| |Tunnel| |Internet | +---B---+ |Ethernet | 524 |Switch |--|Head |-| | | |Switch | 525 +-+--+---+ |Router| | | +---+---+ +--+--+----+ 526 |__| +--A---+ ( )--|Option.| |__| 527 \ / |Impair.| 528 Bridge \ / |Gener. | Bridge 529 V20 to V21 `-+-? +-------+ V10 to V11 531 Figure 3 533 In figure 3, LC10 identify measurement clients /line cards. V10 and 534 the others denote VLANs. All VLANs are using the same tunnel from A 535 to B and in the reverse direction. The remote site VLANs are 536 U-bridged at the local site Ethernet switch. The measurement packets 537 of site 1 travel tunnel A->B first, are U-bridged at site 2 and 538 travel tunnel B->A second. Measurement packets of site 2 travel 539 tunnel B->A first, are U-bridged at site 1 and travel tunnel A->B 540 second. So all measurement packets pass the same tunnel segments, 541 but in different segment order. 543 If tunneling is applied, two tunnels MUST carry all test traffic in 544 between the test site and the remote site. For example, if 802.1Q 545 Virtual LANs (VLAN) are applied and the measurement streams are 546 carried in different VLANs, the IP tunnel or Pseudo Wires 547 respectively are set up in physical port mode to avoid set up of 548 Pseudo Wires per VLAN (which may see different paths due to ECMP 549 routing), see RFC 4448. The remote router and the Ethernet switch 550 shown in figure 3 has to support 802.1Q in this set up. 552 The IP packet size of the metric implementation SHOULD be chosen 553 small enough to avoid fragmentation due to the added Ethernet and 554 tunnel headers. Otherwise, the impact of tunnel overhead on 555 fragmentation and interface MTU size must be understood and taken 556 into account (see [RFC4459]). 558 An Ethernet port mode IP tunnel carrying several 802.1Q VLANs each 559 containing measurement traffic of a single measurement system was 560 successfully applied when testing compatibility of two metric 561 implementations [morton-testplan-rfc2679]. Ethernet over L2TPv3 562 [RFC4719] was picked for this test. 564 The following headers may have to be accounted for when calculating 565 total packet length, if VLANs and Ethernet over L2TPv3 tunnels are 566 applied: 568 o Ethernet 802.1Q: 22 Byte. 570 o L2TPv3 Header: 4-16 Byte for L2TPv3 data messages over IP; 16-28 571 Byte for L2TPv3 data messages over UDP. 573 o IPv4 Header (outer IP header): 20 Byte. 575 o MPLS Labels may be added by a carrier. Each MPLS Label has a 576 length of 4 Bytes. By the time of writing, between 1 and 4 Labels 577 seems to be a fair guess of what's expectable. 
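As a worked example of the overhead budget above (non-normative; the 1500 byte path MTU and the assumption of 4 MPLS labels are illustrative choices, not requirements of this memo), the following C++ sketch adds up the listed header sizes and derives the largest probe packet that still avoids fragmentation:

   /* Non-normative sketch: worst-case encapsulation overhead for a
    * probe packet carried in an 802.1Q VLAN over an L2TPv3-over-UDP
    * tunnel.  The path MTU and the number of MPLS labels are
    * assumptions for illustration only. */
   #include <iostream>

   int main() {
       const int dot1q_ethernet  = 22;    /* Ethernet with 802.1Q tag       */
       const int l2tpv3_over_udp = 28;    /* worst-case L2TPv3 over UDP     */
       const int outer_ipv4      = 20;    /* outer IPv4 header of tunnel    */
       const int mpls_labels     = 4 * 4; /* assumed 4 labels of 4 bytes    */
       const int assumed_mtu     = 1500;  /* assumed path MTU in bytes      */

       int overhead  = dot1q_ethernet + l2tpv3_over_udp + outer_ipv4
                       + mpls_labels;
       int max_probe = assumed_mtu - overhead;  /* largest inner IP packet  */

       std::cout << "total tunnel overhead: " << overhead << " bytes\n"
                 << "largest probe IP packet without fragmentation: "
                 << max_probe << " bytes" << std::endl;
       return 0;
   }

With these assumptions, the overhead sums to 86 bytes, leaving 1414 bytes for the inner IP probe packet.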
579 The applicability of one or more of the following tunneling protocols 580 may be investigated by interested parties if Ethernet over L2TPv3 is 581 felt to be not suitable: IP in IP [RFC2003] or Generic Routing 582 Encapsulation (GRE) [RFC2784]. RFC 4928 [RFC4928] proposes measures 583 to avoid ECMP treatment in MPLS networks. 585 L2TP is a commodity tunneling protocol [RFC2661]. At the time of 586 writing, L2TPv3 [RFC3931] is the latest version of L2TP. If L2TPv3 is 587 applied, software-based implementations of this protocol are not 588 suitable for the test setup, as such implementations may cause 589 incalculable delay shifts. 591 Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448]. 592 While there is no technical issue with this solution, MPLS interfaces 593 are mostly found in the network provider domain. Hence not all of 594 the above criteria to select a tunneling protocol are met. 596 Note that setting up a metric test environment is not a plug-and-play 597 task. Skilled networking engineers should be consulted and 598 involved if a setup between remote sites is preferred. 600 Passing or failing an ADK test with 2 samples could be a random 601 result (note that [RFC2330] defines a sample as a set of singleton 602 metric values produced by a measurement stream, and we continue to 603 use this terminology here). The error margin of a statistical test 604 is higher if the number of samples it is based on is low (the number 605 of samples taken influences the so-called "degrees of freedom" of a 606 statistical test, and more degrees of freedom produce more 607 reliable results). To pass ADK with higher probability, the number 608 of samples collected per implementation under identical networking 609 conditions SHOULD be greater than 2. Hardware and load constraints 610 may enforce an upper limit on the number of simultaneous measurement 611 streams. The ADK test allows one to combine different samples (see 612 section 9 of [ADK]) and then to run a two-sample test between combined 613 samples. At least 4 samples per implementation captured under 614 identical networking conditions are RECOMMENDED when comparing 615 different metric implementations by a statistical test. 617 It is RECOMMENDED that tests be carried out by establishing N 618 different parallel measurement flows. Two or three linecards per 619 implementation serving to send or receive measurement flows should be 620 sufficient to create 4 or more parallel measurement flows. Other 621 options are to separate flows by DiffServ marks (without deploying 622 any QoS in the inner or outer tunnel) or using a single CBR flow and 623 evaluating every n-th singleton to belong to a specific measurement 624 flow. Note that a practical test indeed showed that ADK was passed 625 with 4 samples even though a 2-sample test 626 failed [morton-testplan-rfc2679]. 628 Some additional guidelines to calculate and compare samples to 629 perform a metric test are: 631 o Comparing different probes of a common underlying distribution in 632 terms of metrics characterising a communication network requires 633 respecting the temporal nature for which the assumption of a common 634 underlying distribution may hold. Any singletons or samples to be 635 compared must be captured within the same time interval. 637 o If statistical events like rates are used to characterise measured 638 metrics of a time-interval, a minimum of 5 singletons of a relevant 639 metric should be picked to ensure a minimum confidence in the 640 reported value.
The error margin of the determined rate depends 641 on the number of singletons (refer to statistical textbooks on 642 Student's t-test). As an example, any packet loss measurement 643 interval to be compared with the results of another implementation 644 should contain at least five lost packets to give some confidence that 645 the observed loss rate wasn't caused by a small number of random 646 packet drops. 648 o The minimum number of singletons or samples to be compared by an 649 Anderson-Darling test should be 100 per tested metric 650 implementation. Note that the Anderson-Darling test detects small 651 differences in distributions fairly well and will fail for a high 652 number of compared results (RFC2330 mentions an example with 8192 653 measurements where an Anderson-Darling test always failed). 655 o Generally, the Anderson-Darling test is sensitive to differences 656 in the accuracy or bias associated with varying implementations or 657 test conditions. These dissimilarities may result in differing 658 averages of samples to be compared. An example may be different 659 packet sizes, resulting in a constant delay difference between 660 compared samples. Therefore samples to be compared by an Anderson- 661 Darling test MAY be calibrated by the difference of the average 662 values of the samples. Any calibration of this kind MUST be 663 documented in the test result. 665 3.3. Tests of two or more different implementations against a metric 666 specification 668 RFC2330 expects "a methodology for a given metric [to] exhibit 669 continuity if, for small variations in conditions, it results in 670 small variations in the resulting measurements. Slightly more 671 precisely, for every positive epsilon, there exists a positive delta, 672 such that if two sets of conditions are within delta of each other, 673 then the resulting measurements will be within epsilon of each 674 other." A small variation in conditions in the context of the metric 675 test proposed here can be seen as different implementations measuring 676 the same metric along the same path. 678 IPPM metric specifications, however, allow for implementor options to 679 the largest possible degree. It cannot be expected that two 680 implementors allow 100% identical options in their implementations. 681 Testers SHOULD pick the same metric measurement configurations for 682 their systems when comparing their implementations by a metric test. 684 In some cases, a goodness of fit test may not be possible or show 685 disappointing results. To clarify the difficulties arising from 686 different metric implementation options, the individual options 687 picked for every compared metric implementation should be documented 688 as specified in section 3.5. If the cause of the failure is a lack 689 of specification clarity or multiple legitimate interpretations of 690 the definition text, the text should be modified and the resulting 691 memo proposed for consensus and (possible) advancement to Internet 692 Standard. 694 The same statistical test that is applicable to quantify the precision of a 695 single metric implementation must be used to compare metric result 696 equivalence for different implementations. To document 697 compatibility, the smallest measurement resolution at which the 698 compared implementations passed the ADK sample test must be 699 documented. 701 For different implementations of the same metric, "variations in 702 conditions" are reasonably expected.
The ADK test comparing samples 703 of the different implementations may result in a lower precision than 704 the test for precision in the same-implementation comparison. 706 3.4. Clock synchronisation 708 Clock synchronization effects require special attention. Accuracy of 709 one-way active delay measurements for any metrics implementation 710 depends on clock synchronization between the source and destination 711 of tests. Ideally, one-way active delay measurement (RFC 2679, 712 [RFC2679]) test endpoints either have direct access to independent 713 GPS or CDMA-based time sources or indirect access to nearby NTP 714 primary (stratum 1) time sources, equipped with GPS receivers. 715 Access to these time sources may not be available at all test 716 locations associated with different Internet paths, for a variety of 717 reasons out of scope of this document. 719 When secondary (stratum 2 and above) time sources are used with NTP 720 running across the same network, whose metrics are subject to 721 comparative implementation tests, network impairments can affect 722 clock synchronization, distort sample one-way values and their 723 interval statistics. It is recommended to discard sample one-way 724 delay values for any implementation when one of the following 725 reliability conditions is met: 727 o Delay is measured and is finite in one direction, but not the 728 other. 730 o Absolute value of the difference between the sum of one-way 731 measurements in both directions and round-trip measurement is 732 greater than X% of the latter value. 734 Examination of the second condition requires RTT measurement for 735 reference, e.g., based on TWAMP (RFC5357, RFC 5357 [RFC5357]), in 736 conjunction with one-way delay measurement. 738 Specification of X% to strike a balance between identification of 739 unreliable one-way delay samples and misidentification of reliable 740 samples under a wide range of Internet path RTTs probably requires 741 further study. 743 An IPPM compliant metric implementation of an RFC that requires 744 synchronized clocks is expected to provide precise measurement 745 results. 747 IF an implementation publishes a specification of its precision, such 748 as "a precision of 1 ms (+/- 500 us) with a confidence of 95%", then 749 the specification should be met over a useful measurement duration. 750 For example, if the metric is measured along an Internet path which 751 is stable and not congested, then the precision specification should 752 be met over durations of an hour or more. 754 3.5. Recommended Metric Verification Measurement Process 756 In order to meet their obligations under the IETF Standards Process 757 the IESG must be convinced that each metric specification advanced to 758 Internet Standard status is clearly written, that there are a 759 sufficient number of verified equivalent implementations, and that 760 options that have been implemented are documented. 762 In the context of this document, metrics are designed to measure some 763 characteristic of a data network. An aim of any metric definition 764 should be that it should be specified in a way that can reliably 765 measure the specific characteristic in a repeatable way across 766 multiple independent implementations. 768 Each metric, statistic or option of those to be validated MUST be 769 compared against a reference measurement or another implementation as 770 specified in this document. 
772 Finally, the metric definitions, embodied in the text of the RFCs, 773 are the objects that require evaluation and possible revision in 774 order to advance to Internet Standard. 776 IF two (or more) implementations do not measure an equivalent metric 777 as specified by this document, 779 AND sources of measurement error do not adequately explain the lack 780 of agreement, 782 THEN the details of each implementation should be audited along with 783 the exact definition text, to determine if there is a lack of clarity 784 that has caused the implementations to vary in a way that affects the 785 correspondence of the results. 787 IF there was a lack of clarity or multiple legitimate interpretations 788 of the definition text, 790 THEN the text should be modified and the resulting memo proposed for 791 consensus and (possible) advancement along the standards track. 793 Finally, all the findings MUST be documented in a report that can 794 support advancement to Internet Standard, as described here (similar 795 to those described in [RFC5657]). The list of measurement devices 796 used in testing satisfies the implementation requirement, while the 797 test results provide information on the quality of each specification 798 in the metric RFC (the surrogate for feature interoperability). 800 The complete process of advancing a metric specification to a 801 standard as defined by this document is illustrated in Figure 4. 803 ,---. 804 / \ 805 ( Start ) 806 \ / Implementations 807 `-+-' +-------+ 808 | /| 1 `. 809 +---+----+ / +-------+ `.-----------+ ,-------. 810 | RFC | / |Check for | ,' was RFC `. YES 811 | | / |Equivalence.... clause x ------+ 812 | |/ +-------+ |under | `. clear? ,' | 813 | Metric \.....| 2 ....relevant | `---+---' +----+-----+ 814 | Metric |\ +-------+ |identical | No | |Report | 815 | Metric | \ |network | +--+----+ |results + | 816 | ... | \ |conditions | |Modify | |Advance | 817 | | \ +-------+ | | |Spec +--+RFC | 818 +--------+ \| n |.'+-----------+ +-------+ |request | 819 +-------+ +----------+ 821 Illustration of the metric standardisation process 823 Figure 4 825 Any recommendation for the advancement of a metric specification MUST 826 be accompanied by an implementation report. The implementation 827 report needs to include the tests performed, the applied test setup, 828 the specific metrics in the RFC and reports of the tests performed 829 with two or more implementations. The test plan needs to specify the 830 precision reached for each measured metric and thus define the 831 meaning of "statistically equivalent" for the specific metrics being 832 tested. 834 Ideally, the test plan would co-evolve with the development of the 835 metric, since that's when participants have the clearest context in 836 their minds regarding the different subtleties that can arise. 838 In particular, the implementation report MUST as a minimum document: 840 o The metric compared and the RFC specifying it. This includes 841 statements as required by the section "Tests of an individual 842 implementation against a metric specification" of this document. 844 o The measurement configuration and setup. 846 o A complete specification of the measurement stream (mean rate, 847 statistical distribution of packets, packet size or mean packet 848 size and their distribution), DSCP and any other measurement 849 stream properties which could result in deviating results. 
850 Deviations in results can also be caused if the IP addresses 851 and ports chosen by different implementations result in different 852 layer 2 or layer 3 paths due to operation of Equal-Cost Multi-Path 853 routing in an operational network. 855 o The duration of each measurement to be used for a metric 856 validation, the number of measurement points collected for each 857 metric during each measurement interval (i.e., the probe size) and 858 the level of confidence derived from this probe size for each 859 measurement interval. 861 o The result of the statistical tests performed for each metric 862 validation as required by the section "Tests of two or more 863 different implementations against a metric specification" of this 864 document. 866 o A parameterization of laboratory conditions and applied traffic 867 and network conditions allowing reproduction of these laboratory 868 conditions for readers of the implementation report. 870 o The documentation helping to improve metric specifications defined 871 by this section. 873 All of the tests for each set SHOULD be run in a test setup as 874 specified in the section "Test setup resulting in identical live 875 network testing conditions." 877 If a different test setup is chosen, it is recommended to avoid 878 effects falsifying results of validation measurements caused by real 879 data networks (like parallelism in devices and networks). Data 880 networks may forward packets differently in the case of: 882 o Different packet sizes chosen for different metric 883 implementations. A proposed countermeasure is selecting the same 884 packet size when validating results of two samples or a sample 885 against an original distribution. 887 o Selection of differing IP addresses and ports used by different 888 metric implementations during metric validation tests. If ECMP is 889 applied on the IP or MPLS level, different paths can result (note that 890 it may be impossible to detect an MPLS ECMP path from an IP 891 endpoint). A proposed countermeasure is to connect the 892 measurement equipment to be compared by a NAT device, or to 893 establish a single tunnel to transport all measurement traffic. 894 The aim is to have the same IP addresses and ports for all 895 measurement packets or to avoid ECMP-based local routing diversion 896 by using a layer 2 tunnel. 898 o Different IP options. 900 o Different DSCP. 902 o If the N measurements are captured using sequential measurements 903 instead of simultaneous ones, then the following factors come into 904 play: time-varying paths and load conditions. 906 3.6. Proposal to determine an "equivalence" threshold for each metric 907 evaluated 909 This section describes a proposal for maximum error of "equivalence", 910 based on performance comparison of identical implementations. This 911 comparison may be useful for both ADK and non-ADK comparisons. 913 Each metric is tested by two or more implementations (cross- 914 implementation testing). 916 Each metric is also tested twice simultaneously by the *same* 917 implementation, using different Src/Dst Address pairs and other 918 differences such that the connectivity differences of the cross- 919 implementation tests are also experienced and measured by the same 920 implementation. 922 Comparative results for the same implementation represent a bound on 923 cross-implementation equivalence.
This should be particularly useful 924 when the metric does *not* produce a continuous distribution of 925 singleton values, such as with a loss metric or a duplication 926 metric. Appendix A indicates how the ADK will work for One-way 927 delay, and should be likewise applicable to distributions of delay 928 variation. Appendix B discusses two possible ways to perform the ADK 929 analysis, the R statistical language [Rtool] with the ADK package [Radk] 930 and C++ code. 932 Proposal: the implementation with the largest difference in 933 homogeneous comparison results is the lower bound on the equivalence 934 threshold, noting that there may be other systematic errors to 935 account for when comparing between implementations. 937 Thus, when evaluating equivalence in cross-implementation results: 939 Maximum_Error = Same_Implementation_Error + Systematic_Error 940 and only the systematic error need be decided beforehand. 942 In the case of ADK comparison, the largest same-implementation 943 resolution of distribution equivalence can be used as a limit on 944 cross-implementation resolutions (at the same confidence level). 946 4. Acknowledgements 948 Gerhard Hasslinger commented on a first version of this document, 949 suggested statistical tests and the evaluation of time series 950 information. Matthias Wieser's thesis on a metric test resulted in 951 new input for this draft. Henk Uijterwaal and Lars Eggert have 952 encouraged and helped to organize this work. Mike Hamilton, Scott 953 Bradner, David McDysan and Emile Stephan commented on this draft. 954 Carol Davids reviewed the 01 version of the ID before it was promoted 955 to a WG draft. 957 5. Contributors 959 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- 960 metrictest [bradner-metrictest], and major parts of it are included 961 in this document. 963 6. IANA Considerations 965 This memo includes no request to IANA. 967 7. Security Considerations 969 This memo does not raise any specific security issues. 971 8. References 973 8.1. Normative References 975 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 976 October 1996. 978 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 979 Requirement Levels", BCP 14, RFC 2119, March 1997. 981 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, 982 "Framework for IP Performance Metrics", RFC 2330, 983 May 1998. 985 [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, 986 G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", 987 RFC 2661, August 1999. 989 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 990 Delay Metric for IPPM", RFC 2679, September 1999. 992 [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. 993 Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, 994 March 2000. 996 [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling 997 Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. 999 [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, 1000 "Encapsulation Methods for Transport of Ethernet over MPLS 1001 Networks", RFC 4448, April 2006. 1003 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1004 Zekauskas, "A One-way Active Measurement Protocol 1005 (OWAMP)", RFC 4656, September 2006. 1007 [RFC4719] Aggarwal, R., Townsley, M., and M. Dos Santos, "Transport 1008 of Ethernet Frames over Layer 2 Tunneling Protocol Version 1009 3 (L2TPv3)", RFC 4719, November 2006. 1011 [RFC4928] Swallow, G., Bryant, S., and L.
Andersson, "Avoiding Equal 1012 Cost Multipath Treatment in MPLS Networks", BCP 128, 1013 RFC 4928, June 2007. 1015 [RFC5657] Dusseault, L. and R. Sparks, "Guidance on Interoperation 1016 and Implementation Reports for Advancement to Draft 1017 Standard", BCP 9, RFC 5657, September 2009. 1019 [RFC6410] Housley, R., Crocker, D., and E. Burger, "Reducing the 1020 Standards Track to Two Maturity Levels", BCP 9, RFC 6410, 1021 October 2011. 1023 8.2. Informative References 1025 [ADK] Scholz, F. and M. Stephens, "K-sample Anderson-Darling 1026 Tests of fit, for continuous and discrete cases", 1027 University of Washington, Technical Report No. 81, 1028 May 1986. 1030 [GU-Duffield] 1031 Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE 1032 Encapsulated Multicast Probing: A Scalable Technique for 1033 Measuring One-Way Loss", SIGMETRICS'07, San Diego, 1034 California, USA, June 2007. 1036 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 1037 3", BCP 9, RFC 2026, October 1996. 1039 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 1040 Network Tunneling", RFC 4459, April 2006. 1042 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1043 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1044 RFC 5357, October 2008. 1046 [Radk] Scholz, F., "adk: Anderson-Darling K-Sample Test and 1047 Combinations of Such Tests. R package version 1.0.", 1048 2008. 1050 [Rtool] R Development Core Team, "R: A language and environment 1051 for statistical computing. R Foundation for Statistical 1052 Computing, Vienna, Austria. ISBN 3-900051-07-0, URL 1053 http://www.R-project.org/", 2011. 1055 [bradner-metrictest] 1056 Bradner, S., Mankin, A., and V. Paxson, "Advancement of 1057 metrics specifications on the IETF Standards Track", 1058 draft-bradner-metricstest-03 (work in progress), 1059 July 2007. 1061 [morton-testplan-rfc2679] 1062 Ciavattone, L., Geib, R., Morton, A., and M. Wieser, "Test 1063 Plan and Results for Advancing RFC 2679 on the Standards 1064 Track", draft-morton-ippm-testplan-rfc2679-01 (work in 1065 progress), June 2011. 1067 Appendix A. An example on a One-way Delay metric validation 1069 The text of this appendix is not binding. It is an example of how 1070 parts of a One-way Delay metric test could look. 1072 A.1. Compliance to Metric specification requirements 1074 One-way Delay, Loss threshold, RFC 2679 1076 This test determines if implementations use the same configured 1077 maximum waiting time delay from one measurement to another under 1078 different delay conditions, and correctly declare packets arriving in 1079 excess of the waiting time threshold as lost. See Section 3.5 of 1080 RFC2679, 3rd bullet point, and also Section 3.8.2 of RFC2679. 1082 (1) Configure a path with 1 sec one-way constant delay. 1084 (2) Measure one-way delay with 2 or more implementations, using 1085 identical waiting time thresholds for loss set at 2 seconds. 1087 (3) Configure the path with 3 sec one-way delay. 1089 (4) Repeat measurements. 1091 (5) Observe that the increase measured in step 4 caused all packets 1092 to be declared lost, and that all packets that arrive 1093 successfully in step 2 are assigned a valid one-way delay. 1095 One-way Delay, First-bit to Last-bit, RFC 2679 1097 This test determines if implementations register the same relative 1098 increase in delay from one measurement to another under different 1099 delay conditions. This test tends to cancel the sources of error 1100 which may be present in an implementation.
See Section 3.7.2 of 1101 RFC2679, and Section 10.2 of RFC2330. 1103 (1) Configure a path with X ms one-way constant delay, ideally 1104 including a low-speed link. 1106 (2) Measure one-way delay with 2 or more implementations, using 1107 identical options and equal size small packets (e.g., 100 octet 1108 IP payload). 1110 (3) Maintain the same path with X ms one-way delay. 1112 (4) Measure one-way delay with 2 or more implementations, using 1113 identical options and equal size large packets (e.g., 1500 octet 1114 IP payload). 1116 (5) Observe that the increase measured in steps 2 and 4 is 1117 equivalent to the increase in ms expected due to the larger 1118 serialization time for each implementation. Most of the 1119 measurement errors in each system should cancel if they are 1120 stationary. 1122 One-way Delay, RFC 2679 1124 This test determines if implementations register the same relative 1125 increase in delay from one measurement to another under different 1126 delay conditions. This test tends to cancel the sources of error 1127 which may be present in an implementation. This test is intended to 1128 evaluate measurements in sections 3 and 4 of RFC2679. 1130 (1) Configure a path with X ms one-way constant delay. 1132 (2) Measure one-way delay with 2 or more implementations, using 1133 identical options. 1135 (3) Configure the path with X+Y ms one-way delay. 1137 (4) Repeat measurements. 1139 (5) Observe that the increase measured in steps 2 and 4 is ~Y ms for 1140 each implementation. Most of the measurement errors in each 1141 system should cancel if they are stationary. 1143 Error Calibration, RFC 2679 1145 This is a simple check to determine if an implementation reports the 1146 error calibration as required in Section 4.8 of RFC2679. Note that 1147 the context (Type-P) must also be reported. 1149 A.2. Examples related to statistical tests for One-way Delay 1151 A one-way delay measurement may pass an ADK test with a timestamp 1152 resolution of 1 ms. The same test may fail if timestamps with a 1153 resolution of 100 microseconds are evaluated. The implementation 1154 is then conforming to the metric specification up to a timestamp 1155 resolution of 1 ms. 1157 Let's assume another one-way delay measurement comparison between 1158 implementation 1, probing with a frequency of 2 probes per second, and 1159 implementation 2, probing at a rate of 2 probes every 3 minutes. To 1160 ensure reasonable confidence in results, sample metrics are 1161 calculated from at least 5 singletons per compared time interval. 1162 This means that sample delay values are calculated for each system for 1163 identical 6-minute intervals for the whole test duration. Per 6-minute 1164 interval, the sample metric is calculated from 720 singletons 1165 for implementation 1 and from 6 singletons for implementation 2. 1166 Note that if outliers are not filtered, moving averages are an 1167 option for an evaluation too. The minimum move of an averaging 1168 interval is three minutes in this example. 1170 The data in table 1 may result from measuring One-Way Delay with 1171 implementation 1 (see column Implemnt_1) and implementation 2 (see 1172 column Implemnt_2). Each data point in the table represents a 1173 (rounded) average of the sampled delay values per interval. The 1174 resolution of the clock is one microsecond. The difference in the 1175 delay values may result, e.g., from different probe packet sizes.
1177 +------------+------------+-----------------------------+
1178 | Implemnt_1 | Implemnt_2 | Implemnt_2 - Delta_Averages |
1179 +------------+------------+-----------------------------+
1180 | 5000 | 6549 | 4997 |
1181 | 5008 | 6555 | 5003 |
1182 | 5012 | 6564 | 5012 |
1183 | 5015 | 6565 | 5013 |
1184 | 5019 | 6568 | 5016 |
1185 | 5022 | 6570 | 5018 |
1186 | 5024 | 6573 | 5021 |
1187 | 5026 | 6575 | 5023 |
1188 | 5027 | 6577 | 5025 |
1189 | 5029 | 6580 | 5028 |
1190 | 5030 | 6585 | 5033 |
1191 | 5032 | 6586 | 5034 |
1192 | 5034 | 6587 | 5035 |
1193 | 5036 | 6588 | 5036 |
1194 | 5038 | 6589 | 5037 |
1195 | 5039 | 6591 | 5039 |
1196 | 5041 | 6592 | 5040 |
1197 | 5043 | 6599 | 5047 |
1198 | 5046 | 6606 | 5054 |
1199 | 5054 | 6612 | 5060 |
1200 +------------+------------+-----------------------------+

1202 Table 1

1204 Average values of sample metrics captured during identical time
1205 intervals are compared. This excludes random differences caused by
1206 differing probing intervals or differing temporal distance of
1207 singletons resulting from their Poisson distributed sending times.

1209 In the example, 20 values have been picked (note that at least 100
1210 values are recommended for a single run of a real test). Data must
1211 be ordered by ascending rank. The data of Implemnt_1 and Implemnt_2
1212 as shown in the first two columns of Table 1 clearly fails an ADK
1213 test with 95% confidence.

1215 The results of Implemnt_2 are now reduced by the difference of the
1216 averages of column 2 (rounded to 6581 us) and column 1 (rounded to
1217 5029 us), which is 1552 us. The result may be found in column 3 of
1218 Table 1. Comparing column 1 and column 3 of the table by an ADK test
1219 shows that the data contained in these columns passes an ADK test
1220 with 95% confidence.

1222 Comment: Extensive averaging was used in this example, because of
1223 the vastly different sampling frequencies. As a result, the
1224 distributions compared do not exactly align with a metric in
1225 [RFC2679], but illustrate the ADK process adequately.

1227 Appendix B. Anderson-Darling K-sample Reference and 2 sample C++ code

1229 There are many statistical tools available, and this Appendix
1230 describes two that are familiar to the authors.

1232 The "R tool" is a language and command-line environment for
1233 statistical computing and plotting [Rtool]. With the optional "adk"
1234 package installed [Radk], it can perform individual and combined
1235 sample ADK computations. The user must consult the package
1236 documentation and the original paper [ADK] to interpret the results,
1237 but this is as it should be.

1239 The C++ code below will perform a 2-sample AD comparison when
1240 compiled and presented with two column vectors in a file (using white
1241 space as separation). This version contains modifications to use the
1242 vectors and run as a stand-alone module by Wes Eddy, Sept 2011. The
1243 status of the comparison can be checked on the command line with "$
1244 echo $?" or the last line can be replaced with a printf statement for
1245 adk_result instead.
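For example, assuming the module below were saved in a file named
adk2sample.cc (an illustrative name, not prescribed by this memo) and
the two ascending, white-space separated columns in a file named
columns.txt, the comparison could be run as follows:

   $ g++ -o adk2sample adk2sample.cc
   $ ./adk2sample < columns.txt
   $ echo $?

As coded, an exit status of 1 indicates that adk_result did not
exceed adk_criterium, i.e., the two samples passed the comparison,
while an exit status of 0 indicates that they did not.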
1247 /*

1249 Copyright (c) 2011 IETF Trust and the persons identified
1250 as authors of the code. All rights reserved.

1252 Redistribution and use in source and binary forms, with
1253 or without modification, is permitted pursuant to, and subject
1254 to the license terms contained in, the Simplified BSD License
1255 set forth in Section 4.c of the IETF Trust's Legal Provisions
1256 Relating to IETF Documents (http://trustee.ietf.org/license-info).

1258 */

1260 /* Routines for computing the Anderson-Darling 2 sample
1261 * test statistic.
1262 *
1263 * Implemented based on the description in
1264 * "Anderson-Darling K Sample Test" Heckert, Alan and
1265 * Filliben, James, editors, Dataplot Reference Manual,
1266 * Chapter 15 Auxiliary, NIST, 2004.
1267 * Official Reference by 2010
1268 * Heckert, N. A. (2001). Dataplot website at the
1269 * National Institute of Standards and Technology:
1270 * http://www.itl.nist.gov/div898/software/dataplot.html/
1271 * June 2001.
1272 */

1274 #include <iostream>
1275 #include <vector>

1279 using namespace std;

1281 int main() {
1282 vector<double> vec1, vec2;
1283 double adk_result;
1284 static int k, val_st_z_samp1, val_st_z_samp2,
1285 val_eq_z_samp1, val_eq_z_samp2,
1286 j, n_total, n_sample1, n_sample2, L,
1287 max_number_samples, line, maxnumber_z;
1288 static int column_1, column_2;
1289 static double adk, n_value, z, sum_adk_samp1,
1290 sum_adk_samp2, z_aux;
1291 static double H_j, F1j, hj, F2j, denom_1_aux, denom_2_aux;
1292 static bool next_z_sample2, equal_z_both_samples;
1293 static int stop_loop1, stop_loop2, stop_loop3, old_eq_line2,
1294 old_eq_line1;

1296 static double adk_criterium = 1.993;

1298 /* vec1 and vec2 to be initialised with sample 1 and
1299 * sample 2 values in ascending order */
1300 while (!cin.eof()) {
1301 double f1, f2;
1302 cin >> f1;
1303 cin >> f2;
1304 vec1.push_back(f1);
1305 vec2.push_back(f2);
1306 }

1308 k = 2;
1309 n_sample1 = vec1.size() - 1;
1310 n_sample2 = vec2.size() - 1;

1312 // -1 because vec[0] is a dummy value
1313 n_total = n_sample1 + n_sample2;
1314 /* value equal to the line with a value = zj in sample 1.
1315 * Here j=1, so the line is 1.
1316 */
1317 val_eq_z_samp1 = 1;

1319 /* value equal to the line with a value = zj in sample 2.
1320 * Here j=1, so the line is 1.
1321 */
1322 val_eq_z_samp2 = 1;

1324 /* value equal to the last line with a value < zj
1325 * in sample 1. Here j=1, so the line is 0.
1326 */
1327 val_st_z_samp1 = 0;

1329 /* value equal to the last line with a value < zj
1330 * in sample 2. Here j=1, so the line is 0.
1331 */
1332 val_st_z_samp2 = 0;

1334 sum_adk_samp1 = 0;
1335 sum_adk_samp2 = 0;
1336 j = 1;

1338 // as mentioned above, j=1
1339 equal_z_both_samples = false;

1341 next_z_sample2 = false;

1343 //assuming the next z to be of sample 1
1344 stop_loop1 = n_sample1 + 1;

1346 // + 1 because vec[0] is a dummy, see n_sample1 declaration
1347 stop_loop2 = n_sample2 + 1;
1348 stop_loop3 = n_total + 1;

1350 /* The required z values are calculated until all values
1351 * of both samples have been taken into account. See the
1352 * lines above for the stoploop values. Construct required
1353 * to avoid a mathematical operation in the While condition
1354 */
1355 while (((stop_loop1 > val_eq_z_samp1)
1356 || (stop_loop2 > val_eq_z_samp2)) && stop_loop3 > j)
1357 {
1358 if(val_eq_z_samp1 < n_sample1+1)
1359 {
1360 /* here, a preliminary zj value is set.
1361 * See below how to calculate the actual zj.
1363 */
1364 z = vec1[val_eq_z_samp1];

1366 /* this while sequence calculates the number of values
1367 * equal to z.
1368 */ 1369 while ((val_eq_z_samp1+1 < n_sample1) 1370 && z == vec1[val_eq_z_samp1+1] ) 1371 { 1372 val_eq_z_samp1++; 1373 } 1374 } 1375 else 1376 { 1377 val_eq_z_samp1 = 0; 1378 val_st_z_samp1 = n_sample1; 1380 // this should be val_eq_z_samp1 - 1 = n_sample1 1381 } 1383 if(val_eq_z_samp2 < n_sample2+1) 1384 { 1385 z_aux = vec2[val_eq_z_samp2];; 1387 /* this while sequence calculates the number of values 1388 * equal to z_aux 1389 */ 1391 while ((val_eq_z_samp2+1 < n_sample2) 1392 && z_aux == vec2[val_eq_z_samp2+1] ) 1393 { 1394 val_eq_z_samp2++; 1395 } 1397 /* the smaller of the two actual data values is picked 1398 * as the next zj. 1399 */ 1401 if(z > z_aux) 1402 { 1403 z = z_aux; 1404 next_z_sample2 = true; 1405 } 1406 else 1407 { 1408 if (z == z_aux) 1409 { 1410 equal_z_both_samples = true; 1411 } 1413 /* This is the case, if the last value of column1 is 1414 * smaller than the remaining values of column2. 1415 */ 1416 if (val_eq_z_samp1 == 0) 1417 { 1418 z = z_aux; 1419 next_z_sample2 = true; 1420 } 1421 } 1422 } 1423 else 1424 { 1425 val_eq_z_samp2 = 0; 1426 val_st_z_samp2 = n_sample2; 1428 // this should be val_eq_z_samp2 - 1 = n_sample2 1430 } 1432 /* in the following, sum j = 1 to L is calculated for 1433 * sample 1 and sample 2. 1434 */ 1435 if (equal_z_both_samples) 1436 { 1438 /* hj is the number of values in the combined sample 1439 * equal to zj 1440 */ 1441 hj = val_eq_z_samp1 - val_st_z_samp1 1442 + val_eq_z_samp2 - val_st_z_samp2; 1444 /* H_j is the number of values in the combined sample 1445 * smaller than zj plus one half the the number of 1446 * values in the combined sample equal to zj 1447 * (that's hj/2). 1448 */ 1449 H_j = val_st_z_samp1 + val_st_z_samp2 1450 + hj / 2; 1452 /* F1j is the number of values in the 1st sample 1453 * which are less than zj plus one half the number 1454 * of values in this sample which are equal to zj. 1455 */ 1457 F1j = val_st_z_samp1 + (double) 1458 (val_eq_z_samp1 - val_st_z_samp1) / 2; 1460 /* F2j is the number of values in the 1st sample 1461 * which are less than zj plus one half the number 1462 * of values in this sample which are equal to zj. 1463 */ 1464 F2j = val_st_z_samp2 + (double) 1465 (val_eq_z_samp2 - val_st_z_samp2) / 2; 1467 /* set the line of values equal to zj to the 1468 * actual line of the last value picked for zj. 1469 */ 1470 val_st_z_samp1 = val_eq_z_samp1; 1472 /* Set the line of values equal to zj to the actual 1473 * line of the last value picked for zjof each 1474 * sample. This is required as data smaller than zj 1475 * is accounted differently than values equal to zj. 1476 */ 1477 val_st_z_samp2 = val_eq_z_samp2; 1479 /* next the lines of the next values z, ie. zj+1 1480 * are addressed. 1481 */ 1482 val_eq_z_samp1++; 1484 /* next the lines of the next values z, ie. 1485 * zj+1 are addressed 1486 */ 1487 val_eq_z_samp2++; 1488 } 1489 else 1490 { 1492 /* the smaller z value was contained in sample 2, 1493 * hence this value is the zj to base the following 1494 * calculations on. 1495 */ 1496 if (next_z_sample2) 1497 { 1499 /* hj is the number of values in the combined 1500 * sample equal to zj, in this case these are 1501 * within sample 2 only. 1502 */ 1503 hj = val_eq_z_samp2 - val_st_z_samp2; 1505 /* H_j is the number of values in the combined sample 1506 * smaller than zj plus one half the the number of 1507 * values in the combined sample equal to zj 1508 * (that's hj/2). 
1509 */ 1511 H_j = val_st_z_samp1 + val_st_z_samp2 1512 + hj / 2; 1514 /* F1j is the number of values in the 1st sample which 1515 * are less than zj plus one half the number of values in 1516 * this sample which are equal to zj. 1517 * As val_eq_z_samp2 < val_eq_z_samp1, these are the 1518 * val_st_z_samp1 only. 1519 */ 1520 F1j = val_st_z_samp1; 1522 /* F2j is the number of values in the 1st sample which 1523 * are less than zj plus one half the number of values in 1524 * this sample which are equal to zj. The latter are from 1525 * sample 2 only in this case. 1526 */ 1528 F2j = val_st_z_samp2 + (double) 1529 (val_eq_z_samp2 - val_st_z_samp2) / 2; 1531 /* Set the line of values equal to zj to the actual line 1532 * of the last value picked for zj of sample 2 only in 1533 * this case. 1534 */ 1535 val_st_z_samp2 = val_eq_z_samp2; 1537 /* next the line of the next value z, ie. zj+1 is 1538 * addressed. Here, only sample 2 must be addressed. 1539 */ 1541 val_eq_z_samp2++; 1542 if (val_eq_z_samp1 == 0) 1543 { 1544 val_eq_z_samp1 = stop_loop1; 1545 } 1546 } 1548 /* the smaller z value was contained in sample 2, 1549 * hence this value is the zj to base the following 1550 * calculations on. 1551 */ 1553 else 1554 { 1556 /* hj is the number of values in the combined 1557 * sample equal to zj, in this case these are 1558 * within sample 1 only. 1559 */ 1560 hj = val_eq_z_samp1 - val_st_z_samp1; 1562 /* H_j is the number of values in the combined 1563 * sample smaller than zj plus one half the the number 1564 * of values in the combined sample equal to zj 1565 * (that's hj/2). 1566 */ 1568 H_j = val_st_z_samp1 + val_st_z_samp2 1569 + hj / 2; 1571 /* F1j is the number of values in the 1st sample which 1572 * are less than zj plus, in this case these are within 1573 * sample 1 only one half the number of values in this 1574 * sample which are equal to zj. The latter are from 1575 * sample 1 only in this case. 1576 */ 1578 F1j = val_st_z_samp1 + (double) 1579 (val_eq_z_samp1 - val_st_z_samp1) / 2; 1581 /* F2j is the number of values in the 1st sample which 1582 * are less than zj plus one half the number of values 1583 * in this sample which are equal to zj. As 1584 * val_eq_z_samp1 < val_eq_z_samp2, these are the 1585 * val_st_z_samp2 only. 1586 */ 1588 F2j = val_st_z_samp2; 1590 /* Set the line of values equal to zj to the actual line 1591 * of the last value picked for zj of sample 1 only in 1592 * this case 1593 */ 1595 val_st_z_samp1 = val_eq_z_samp1; 1597 /* next the line of the next value z, ie. zj+1 is 1598 * addressed. Here, only sample 1 must be addressed. 1599 */ 1600 val_eq_z_samp1++; 1602 if (val_eq_z_samp2 == 0) 1603 { 1604 val_eq_z_samp2 = stop_loop2; 1605 } 1606 } 1607 } 1609 denom_1_aux = n_total * F1j - n_sample1 * H_j; 1610 denom_2_aux = n_total * F2j - n_sample2 * H_j; 1612 sum_adk_samp1 = sum_adk_samp1 + hj 1613 * (denom_1_aux * denom_1_aux) / 1614 (H_j * (n_total - H_j) 1615 - n_total * hj / 4); 1616 sum_adk_samp2 = sum_adk_samp2 + hj 1617 * (denom_2_aux * denom_2_aux) / 1618 (H_j * (n_total - H_j) 1619 - n_total * hj / 4); 1621 next_z_sample2 = false; 1622 equal_z_both_samples = false; 1624 /* index to count the z. It is only required to prevent 1625 * the while slope to execute endless 1626 */ 1627 j++; 1628 } 1630 // calculating the adk value is the final step. 
1631 adk_result = (double) (n_total - 1) / (n_total
1632 * n_total * (k - 1))
1633 * (sum_adk_samp1 / n_sample1
1634 + sum_adk_samp2 / n_sample2);

1636 /* if(adk_result <= adk_criterium)
1637 * adk_2_sample test is passed
1638 */
1639 return adk_result <= adk_criterium;
1640 }

1642 Figure 5

1644 Appendix C. Glossary

1646 +-------------+-----------------------------------------------------+
1647 | ADK | Anderson-Darling K-Sample test, a test used to |
1648 | | check whether two samples have the same statistical |
1649 | | distribution. |
1650 | ECMP | Equal Cost Multipath, a load balancing mechanism |
1651 | | evaluating MPLS label stacks, IP addresses and |
1652 | | ports. |
1653 | EDF | The "Empirical Distribution Function" of a set of |
1654 | | scalar measurements is a function F(x) which for |
1655 | | any x gives the fractional proportion of the total |
1656 | | measurements that were smaller than or equal to x. |
1657 | Metric | A measured quantity related to the performance and |
1658 | | reliability of the Internet, expressed by a value. |
1659 | | This could be a singleton (single value), a sample |
1660 | | of single values or a statistic based on a sample |
1661 | | of singletons. |
1662 | OWAMP | One-way Active Measurement Protocol, a protocol for |
1663 | | communication between IPPM measurement systems |
1664 | | specified by IPPM. |
1665 | OWD | One-Way Delay, a performance metric specified by |
1666 | | IPPM. |
1667 | Sample | A sample metric is derived from a given singleton |
1668 | metric | metric by evaluating a number of distinct instances |
1669 | | together. |
1670 | Singleton | A singleton metric is, in a sense, one atomic |
1671 | metric | measurement of this metric. |
1672 | Statistical | A 'statistical' metric is derived from a given |
1673 | metric | sample metric by computing some statistic of the |
1674 | | values defined by the singleton metric on the |
1675 | | sample. |
1676 | TWAMP | Two-way Active Measurement Protocol, a protocol for |
1677 | | communication between IPPM measurement systems |
1678 | | specified by IPPM. |
1679 +-------------+-----------------------------------------------------+

1681 Table 2

1683 Authors' Addresses

1685 Ruediger Geib (editor)
1686 Deutsche Telekom
1687 Heinrich Hertz Str. 3-7
1688 Darmstadt, 64295
1689 Germany

1691 Phone: +49 6151 58 12747
1692 Email: Ruediger.Geib@telekom.de

1694 Al Morton
1695 AT&T Labs
1696 200 Laurel Avenue South
1697 Middletown, NJ 07748
1698 USA

1700 Phone: +1 732 420 1571
1701 Fax: +1 732 368 1192
1702 Email: acmorton@att.com
1703 URI: http://home.comcast.net/~acmacm/

1705 Reza Fardid
1706 Cariden Technologies
1707 888 Villa Street, Suite 500
1708 Mountain View, CA 94041
1709 USA

1711 Phone:
1712 Email: rfardid@cariden.com

1714 Alexander Steinmitz
1715 Deutsche Telekom
1716 Memmelsdorfer Str. 209b
1717 Bamberg, 96052
1718 Germany

1720 Phone:
1721 Email: Alexander.Steinmitz@telekom.de