idnits 2.17.1 draft-ietf-ippm-metrictest-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 24, 2011) is 4561 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1426 == Unused Reference: 'RFC2680' is defined on line 1070, but no explicit reference was found in the text == Unused Reference: 'RFC2681' is defined on line 1073, but no explicit reference was found in the text == Unused Reference: 'RFC3931' is defined on line 1080, but no explicit reference was found in the text == Unused Reference: 'RFC4719' is defined on line 1094, but no explicit reference was found in the text ** Downref: Normative reference to an Informational RFC: RFC 2330 ** Obsolete normative reference: RFC 2679 (Obsoleted by RFC 7679) ** Obsolete normative reference: RFC 2680 (Obsoleted by RFC 7680) ** Downref: Normative reference to an Informational RFC: RFC 4459 Summary: 4 errors (**), 0 flaws (~~), 5 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force R. Geib, Ed. 3 Internet-Draft Deutsche Telekom 4 Intended status: Standards Track A. Morton 5 Expires: April 26, 2012 AT&T Labs 6 R. Fardid 7 Cariden Technologies 8 A. Steinmitz 9 Deutsche Telekom 10 October 24, 2011 12 IPPM standard advancement testing 13 draft-ietf-ippm-metrictest-04 15 Abstract 17 This document specifies tests to determine if multiple independent 18 instantiations of a performance metric RFC have implemented the 19 specifications in the same way. This is the performance metric 20 equivalent of interoperability, required to advance RFCs along the 21 standards track. Results from different implementations of metric 22 RFCs will be collected under the same underlying network conditions 23 and compared using state of the art statistical methods. The goal is 24 an evaluation of the metric RFC itself, whether its definitions are 25 clear and unambiguous to implementors and therefore a candidate for 26 advancement on the IETF standards track. 28 Status of this Memo 30 This Internet-Draft is submitted in full conformance with the 31 provisions of BCP 78 and BCP 79. 33 Internet-Drafts are working documents of the Internet Engineering 34 Task Force (IETF). 
Note that other groups may also distribute 35 working documents as Internet-Drafts. The list of current Internet- 36 Drafts is at http://datatracker.ietf.org/drafts/current/. 38 Internet-Drafts are draft documents valid for a maximum of six months 39 and may be updated, replaced, or obsoleted by other documents at any 40 time. It is inappropriate to use Internet-Drafts as reference 41 material or to cite them other than as "work in progress." 43 This Internet-Draft will expire on April 26, 2012. 45 Copyright Notice 47 Copyright (c) 2011 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the Simplified BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 7 64 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 7 65 3. Verification of conformance to a metric specification . . . . 8 66 3.1. Tests of an individual implementation against a metric 67 specification . . . . . . . . . . . . . . . . . . . . . . 9 68 3.2. Test setup resulting in identical live network testing 69 conditions . . . . . . . . . . . . . . . . . . . . . . . . 11 70 3.3. Tests of two or more different implementations against 71 a metric specification . . . . . . . . . . . . . . . . . . 16 72 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 17 73 3.5. Recommended Metric Verification Measurement Process . . . 18 74 3.6. Proposal to determine an "equivalence" threshold for 75 each metric evaluated . . . . . . . . . . . . . . . . . . 21 76 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 77 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 22 78 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 79 7. Security Considerations . . . . . . . . . . . . . . . . . . . 23 80 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 23 81 8.1. Normative References . . . . . . . . . . . . . . . . . . . 23 82 8.2. Informative References . . . . . . . . . . . . . . . . . . 24 83 Appendix A. An example on a One-way Delay metric validation . . . 25 84 A.1. Compliance to Metric specification requirements . . . . . 25 85 A.2. Examples related to statistical tests for One-way Delay . 27 86 Appendix B. Anderson-Darling K-sample Reference and 2 sample 87 C++ code . . . . . . . . . . . . . . . . . . . . . . 29 88 Appendix C. Glossary . . . . . . . . . . . . . . . . . . . . . . 37 89 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38 91 1. Introduction 93 The Internet Standards Process RFC2026 [RFC2026] requires that for a 94 IETF specification to advance beyond the Proposed Standard level, at 95 least two genetically unrelated implementations must be shown to 96 interoperate correctly with all features and options. 
This 97 requirement can be met by supplying: 99 o evidence that (at least a sub-set of) the specification has been 100 implemented by multiple parties, thus indicating adoption by the 101 IETF community and the extent of feature coverage. 103 o evidence that each feature of the specification is sufficiently 104 well-described to support interoperability, as demonstrated 105 through testing and/or user experience with deployment. 107 In the case of a protocol specification, the notion of 108 "interoperability" is reasonably intuitive - the implementations must 109 successfully "talk to each other", while exercising all features and 110 options. To achieve interoperability, two implementors need to 111 interpret the protocol specifications in equivalent ways. In the 112 case of IP Performance Metrics (IPPM), this definition of 113 interoperability is only useful for test and control protocols like 114 the One-Way Active Measurement Protocol, OWAMP [RFC4656], and the 115 Two-Way Active Measurement Protocol, TWAMP [RFC5357]. 117 A metric specification RFC describes one or more metric definitions, 118 methods of measurement and a way to report the results of 119 measurement. One example would be a way to test and report the One- 120 way Delay that data packets incur while being sent from one network 121 location to another, One-way Delay Metric. 123 In the case of metric specifications, the conditions that satisfy the 124 "interoperability" requirement are less obvious, and there was a need 125 for IETF agreement on practices to judge metric specification 126 "interoperability" in the context of the IETF Standards Process. 127 This memo provides methods which should be suitable to evaluate 128 metric specifications for standards track advancement. The methods 129 proposed here MAY be generally applicable to metric specification 130 RFCs beyond those developed under the IPPM Framework [RFC2330]. 132 Since many implementations of IP metrics are embedded in measurement 133 systems that do not interact with one another (they were built before 134 OWAMP and TWAMP), the interoperability evaluation called for in the 135 IETF standards process cannot be determined by observing that 136 independent implementations interact properly for various protocol 137 exchanges. Instead, verifying that different implementations give 138 statistically equivalent results under controlled measurement 139 conditions takes the place of interoperability observations. Even 140 when evaluating OWAMP and TWAMP RFCs for standards track advancement, 141 the methods described here are useful to evaluate the measurement 142 results because their validity would not be ascertained in typical 143 interoperability testing. 145 The standards advancement process aims at producing confidence that 146 the metric definitions and supporting material are clearly worded and 147 unambiguous, or reveals ways in which the metric definitions can be 148 revised to achieve clarity. The process also permits identification 149 of options that were not implemented, so that they can be removed 150 from the advancing specification. Thus, the product of this process 151 is information about the metric specification RFC itself: 152 determination of the specifications or definitions that are clear and 153 unambiguous and those that are not (as opposed to an evaluation of 154 the implementations which assist in the process). 
156 This document defines a process to verify that implementations (or 157 practically, measurement systems) have interpreted the metric 158 specifications in equivalent ways, and produce equivalent results. 160 Testing for statistical equivalence requires ensuring identical test 161 setups (or awareness of differences) to the best possible extent. 162 Thus, producing identical test conditions is a core goal of the memo. 163 Another important aspect of this process is to test individual 164 implementations against specific requirements in the metric 165 specifications using customized tests for each requirement. These 166 tests can distinguish equivalent interpretations of each specific 167 requirement. 169 Conclusions on equivalence are reached by two measures. 171 First, implementations are compared against individual metric 172 specifications to make sure that differences in implementation are 173 minimised or at least known. 175 Second, a test setup is proposed ensuring identical networking 176 conditions so that unknowns are minimized and comparisons are 177 simplified. The resulting separate data sets may be seen as samples 178 taken from the same underlying distribution. Using state of the art 179 statistical methods, the equivalence of the results is verified. To 180 illustrate application of the process and methods defined here, 181 evaluation of the One-way Delay Metric [RFC2679] is provided in an 182 Appendix. While test setups will vary with the metrics to be 183 validated, the general methodology of determining equivalent results 184 will not. Documents defining test setups to evaluate other metrics 185 should be developed once the process proposed here has been agreed 186 and approved. 188 The metric RFC advancement process begins with a request for protocol 189 action accompanied by a memo that documents the supporting tests and 190 results. The procedures of [RFC2026] are expanded in[RFC5657], 191 including sample implementation and interoperability reports. 192 Section 3 of [morton-advance-metrics-01] can serve as a template for 193 a metric RFC report which accompanies the protocol action request to 194 the Area Director, including description of the test set-up, 195 procedures, results for each implementation and conclusions. 197 Changes from WG-03 to WG-04: 199 o Revisions to Appendix B code and add reference to "R" in the 200 Appendix and the text of section 3.6. 202 Changes from WG-02 to WG-03: 204 o Changes stemming from experiments that implemented this plan, in 205 general. 207 o Adoption of the VLAN loopback figure in the main body of the memo 208 (section 3.2). 210 Changes from WG-01 to WG-02: 212 o Clarification of the number of test streams recommended in section 213 3.2. 215 o Clarifications on testing details in sections 3.3 and 3.4. 217 o Spelling corrections throughout. 219 Changes from WG -00 to WG -01 draft 221 o Discussion on merits and requirements of a distributed lab test 222 using only local load generators. 224 o Proposal of metrics suitable for tests using the proposed 225 measurement configuration. 227 o Hint on delay caused by software based L2TPv3 implementation. 229 o Added an appendix with a test configuration allowing remote tests 230 comparing different implementations across the network. 232 o Proposal for maximum error of "equivalence", based on performance 233 comparison of identical implementations. This may be useful for 234 both ADK and non-ADK comparisons. 
236 Changes from prior ID -02 to WG -00 draft 238 o Incorporation of aspects of reporting to support the protocol 239 action request in the Introduction and section 3.5 241 o Overhaul of section 3.2 regarding tunneling: Added generic 242 tunneling requirements and L2TPv3 as an example tunneling 243 mechanism fulfilling the tunneling requirements. Removed and 244 adapted some of the prior references to other tunneling protocols 246 o Softened a requirement within section 3.4 (MUST to SHOULD on 247 precision) and removed some comments of the authors. 249 o Updated contact information of one author and added a new author. 251 o Added example C++ code of an Anderson-Darling two sample test 252 implementation. 254 Changes from ID -01 to ID -02 version 256 o Major editorial review, rewording and clarifications on all 257 contents. 259 o Additional text on parallel testing using VLANs and GRE or 260 Pseudowire tunnels. 262 o Additional examples and a glossary. 264 Changes from ID -00 to ID -01 version 266 o Addition of a comparison of individual metric implementations 267 against the metric specification (trying to pick up problems and 268 solutions for metric advancement [morton-advance-metrics]). 270 o More emphasis on the requirement to carefully design and document 271 the measurement setup of the metric comparison. 273 o Proposal of testing conditions under identical WAN network 274 conditions using IP in IP tunneling or Pseudo Wires and parallel 275 measurement streams. 277 o Proposing the requirement to document the smallest resolution at 278 which an ADK test was passed by 95%. As no minimum resolution is 279 specified, IPPM metric compliance is not linked to a particular 280 performance of an implementation. 282 o Reference to RFC 2330 and RFC 2679 for the 95% confidence interval 283 as preferred criterion to decide on statistical equivalence 285 o Reducing the proposed statistical test to ADK with 95% confidence. 287 1.1. Requirements Language 289 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 290 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 291 document are to be interpreted as described in RFC 2119 [RFC2119]. 293 2. Basic idea 295 The implementation of a standard compliant metric is expected to meet 296 the requirements of the related metric specification. So before 297 comparing two metric implementations, each metric implementation is 298 individually compared against the metric specification. 300 Most metric specifications leave freedom to implementors on non- 301 fundamental aspects of an individual metric (or options). Comparing 302 different measurement results using a statistical test with the 303 assumption of identical test path and testing conditions requires 304 knowledge of all differences in the overall test setup. Metric 305 specification options chosen by implementors have to be documented. 306 It is REQUIRED to use identical implementation options wherever 307 possible for any test proposed here. Calibrations proposed by metric 308 standards should be performed to further identify (and possibly 309 reduce) potential sources of errors in the test setup. 311 The Framework for IP Performance Metrics [RFC2330] expects that a 312 "methodology for a metric should have the property that it is 313 repeatable: if the methodology is used multiple times under identical 314 conditions, it should result in consistent measurements." 
This means 315 an implementation is expected to repeatedly measure a metric with 316 consistent results (repeatability with the same result). Small 317 deviations in the test setup are expected to lead to small deviations 318 in results only. To characterise statistical equivalence in the case 319 of small deviations, RFC 2330 and [RFC2679] suggest to apply a 95% 320 confidence interval. Quoting RFC 2679, "95 percent was chosen 321 because ... a particular confidence level should be specified so that 322 the results of independent implementations can be compared." 324 Two different implementations are expected to produce statistically 325 equivalent results if they both measure a metric under the same 326 networking conditions. Formulating in statistical terms: separate 327 metric implementations collect separate samples from the same 328 underlying statistical process (the same network conditions). The 329 statistical hypothesis to be tested is the expectation that both 330 samples do not expose statistically different properties. This 331 requires careful test design: 333 o The measurement test setup must be self-consistent to the largest 334 possible extent. To minimize the influence of the test and 335 measurement setup on the result, network conditions and paths MUST 336 be identical for the compared implementations to the largest 337 possible degree. This includes both the stability and non- 338 ambiguity of routes taken by the measurement packets. See RFC 339 2330 for a discussion on self-consistency. 341 o To minimize the influence of implementation options on the result, 342 metric implementations SHOULD use identical options and parameters 343 for the metric under evaluation. 345 o The error induced by the sample size must be small enough to 346 minimize its influence on the test result. This may have to be 347 respected, especially if two implementations measure with 348 different average probing rates. 350 o The implementation with the lowest probing frequency determines 351 the smallest temporal interval for which samples can be compared. 353 o Repeat comparisons with several independent metric samples to 354 avoid random indications of compatibility (or the lack of it). 356 The metric specifications themselves are the primary focus of 357 evaluation, rather than the implementations of metrics. The 358 documentation produced by the advancement process should identify 359 which metric definitions and supporting material were found to be 360 clearly worded and unambiguous, OR, it should identify ways in which 361 the metric specification text should be revised to achieve clarity 362 and unified interpretation. 364 The process should also permit identification of options that were 365 not implemented, so that they can be removed from the advancing 366 specification (this is an aspect more typical of protocol advancement 367 along the standards track). 369 Note that this document does not propose to base interoperability 370 indications of performance metric implementations on comparisons of 371 individual singletons. Individual singletons may be impacted by many 372 statistical effects while they are measured. Comparing two 373 singletons of different implementations may result in failures with 374 higher probability than comparing samples. 376 3. Verification of conformance to a metric specification 378 This section specifies how to verify compliance of two or more IPPM 379 implementations against a metric specification. This document only 380 proposes a general methodology. 
Compliance criteria for a specific metric implementation need to be defined for each individual metric specification. The only exception is the statistical test comparing two metric implementations which are simultaneously tested. This test is applicable without metric-specific decision criteria.

Several testing options exist to compare two or more implementations:

o  Use a single test lab to compare the implementations and emulate the Internet with an impairment generator.

o  Use a single test lab to compare the implementations and measure across the Internet.

o  Use remotely separated test labs to compare the implementations and emulate the Internet with two "identically" configured impairment generators.

o  Use remotely separated test labs to compare the implementations and measure across the Internet.

o  Use remotely separated test labs to compare the implementations, measure across the Internet, and include a single impairment generator to impact all measurement flows in a non-discriminatory way.

The first two approaches work, but cause higher expenses than the other ones (due to travel and/or shipping and installation). For the third option, ensuring two identically configured impairment generators requires well-defined test cases and possibly identical hardware and software.

As documented in a test report [morton-testplan-rfc2679], the last option was required to prove compatibility of two delay metric implementations. An impairment generator is probably required when testing compatibility of most other metrics, and it is therefore RECOMMENDED to include an impairment generator in metric test setups.

3.1.  Tests of an individual implementation against a metric specification

A metric implementation MUST support the requirements classified as "MUST" and "REQUIRED" of the related metric specification to be compliant with the latter.

Further, supported options of a metric implementation SHOULD be documented in sufficient detail. The documentation of chosen options is RECOMMENDED to minimise (and recognise) differences in the test setup if two metric implementations are compared. Further, this documentation is used to validate and improve the underlying metric specification options and to remove options which saw no implementation, or which are badly specified, from the metric specification to be promoted to a standard. This documentation SHOULD be made for all implementation-relevant specifications of a metric picked for a comparison that are not explicitly marked as "MUST" or "REQUIRED" in the RFC text. This applies to the following sections of all metric specifications:

o  Singleton Definition of the Metric.

o  Sample Definition of the Metric.

o  Statistics Definition of the Metric. As statistics are compared by the test specified here, this documentation is required even if the metric specification does not contain a Statistics Definition.

o  Timing and Synchronisation related specification (if relevant for the Metric).

o  Any other technical part present or missing in the metric specification which is relevant for the implementation of the Metric.

RFC2330 and RFC2679 emphasise precision as an aim of IPPM metric implementations.
A single IPPM-conformant implementation MUST, under otherwise identical network conditions, produce precise results for repeated measurements of the same metric.

RFC 2330 prefers the "empirical distribution function" (EDF) to describe collections of measurements. RFC 2330 states that "unless otherwise stated, IPPM goodness-of-fit tests are done using 5% significance." The goodness-of-fit test determines the precision with which two or more samples of a metric implementation belong to the same underlying distribution (of measured network performance events). The goodness-of-fit test suggested for the metric test is the Anderson-Darling K-sample test (ADK sample test, where K stands for the number of samples to be compared) [ADK]. Please note that RFC 2330 and RFC 2679 apply an Anderson-Darling goodness-of-fit test too.

The results of a repeated test with a single implementation MUST pass an ADK sample test with a confidence level of 95%. The conditions for which the ADK test has been passed with the specified confidence level MUST be documented. To formulate this differently: the requirement is to document the set of parameters with the smallest deviation at which the results of the tested metric implementation pass an ADK test with a confidence level of 95%. The minimum resolution available in the reported results from each implementation MUST be taken into account in the ADK test.

The test conditions which MUST be documented for a passed metric test include:

o  The metric resolution at which a test was passed (e.g., the resolution of timestamps).

o  The parameters modified by an impairment generator.

o  The impairment generator parameter settings.

3.2.  Test setup resulting in identical live network testing conditions

Two major issues complicate tests for metric compliance across live networks under identical testing conditions. One is the general point that metric definition implementations cannot be conveniently examined in field measurement scenarios. The other one is more broadly described as "parallelism in devices and networks", including mechanisms like those that achieve load balancing (see [RFC4928]).

This section proposes two measures to deal with both issues. Tunneling mechanisms can be used to avoid parallel processing of different flows in the network. Measuring by separate parallel probe flows results in repeated collection of data. If both measures are combined, WAN network conditions are identical for a number of independent measurement flows, no matter what the network conditions are in detail.

Any measurement setup MUST be designed so that the probing traffic itself does not impede the metric measurement. The created measurement load MUST NOT result in congestion at the access link connecting the measurement implementation to the WAN. The created measurement load MUST NOT overload the measurement implementation itself, e.g., by causing a high CPU load or by creating imprecision due to internal transmit (or receive, respectively) probe packet collisions.

Tunneling multiple flows reaching a network element on a single physical port may allow all packets of the tunnel to be transmitted via the same path. Applying tunnels to avoid undesired influence of standard routing for measurement purposes is a concept known from the literature; see, e.g., GRE-encapsulated multicast probing [GU+Duffield].
An 521 existing IP in IP tunnel protocol can be applied to avoid Equal-Cost 522 Multi-Path (ECMP) routing of different measurement streams if it 523 meets the following criteria: 525 o Inner IP packets from different measurement implementations are 526 mapped into a single tunnel with single outer IP origin and 527 destination address as well as origin and destination port numbers 528 which are identical for all packets. 530 o An easily accessible commodity tunneling protocol allows to carry 531 out a metric test from more test sites. 533 o A low operational overhead may enable a broader audience to set up 534 a metric test with the desired properties. 536 o The tunneling protocol should be reliable and stable in set up and 537 operation to avoid disturbances or influence on the test results. 539 o The tunneling protocol should not incur any extra cost for those 540 interested in setting up a metric test. 542 An illustration of a test setup with two layer 2 tunnels and two 543 flows between two linecards of one implementation is given in 544 Figure 1. 546 Implementation ,---. +--------+ 547 +~~~~~~~~~~~/ \~~~~~~| Remote | 548 +------->-----F2->-| / \ |->---+ | 549 | +---------+ | Tunnel 1( ) | | | 550 | | transmit|-F1->-| ( ) |->+ | | 551 | | LC1 | +~~~~~~~~~| |~~~~| | | | 552 | | receive |-<--+ ( ) | F1 F2 | 553 | +---------+ | |Internet | | | | | 554 *-------<-----+ F2 | | | | | | 555 +---------+ | | +~~~~~~~~~| |~~~~| | | | 556 | transmit|-* *-| | | |--+<-* | 557 | LC2 | | Tunnel 2( ) | | | 558 | receive |-<-F1-| \ / |<-* | 559 +---------+ +~~~~~~~~~~~\ /~~~~~~| Router | 560 `-+-' +--------+ 562 Illustration of a test setup with two layer 2 tunnels. For 563 simplicity, only two linecards of one implementation and two flows F 564 between them are shown. 566 Figure 1 568 Figure 2 shows the network elements required to set up layer 2 569 tunnels as shown by figure 1. 571 Implementation 573 +-----+ ,---. 574 | LC1 | / \ 575 +-----+ / \ +------+ 576 | +-------+ ( ) +-------+ |Remote| 577 +--------+ | | | | | | | | 578 |Ethernet| | Tunnel| |Internet | | Tunnel| | | 579 |Switch |--| Head |--| |--| Head |--| | 580 +--------+ | Router| | | | Router| | | 581 | | | ( ) | | |Router| 582 +-----+ +-------+ \ / +-------+ +------+ 583 | LC2 | \ / 584 +-----+ `-+-' 585 Illustration of a hardware setup to realise the test setup 586 illustrated by figure 1 with layer 2 tunnels or Pseudowires. 588 Figure 2 590 The test set up successfully used during a delay metric test 591 [morton-testplan-rfc2679] is given as an example in figure 3. Note 592 that the shown set up allows a metric test between two remote sites. 594 +----+ +----+ +----+ +----+ 595 |LC10| |LC11| ,---. |LC20| |LC21| 596 +----+ +----+ / \ +-------+ +----+ +----+ 597 | V10 | V11 / \ | Tunnel| | V20 | V21 598 | | ( ) | Head | | | 599 +--------+ +------+ | | | Router|__+----------+ 600 |Ethernet| |Tunnel| |Internet | +---B---+ |Ethernet | 601 |Switch |--|Head |-| | | |Switch | 602 +-+--+---+ |Router| | | +---+---+ +--+--+----+ 603 |__| +--A---+ ( )--|Option.| |__| 604 \ / |Impair.| 605 Bridge \ / |Gener. | Bridge 606 V20 to V21 `-+-? +-------+ V10 to V11 608 Figure 3 610 In figure 3, LC10 identify measurement clients /line cards. V10 and 611 the others denote VLANs. All VLANs are using the same tunnel from A 612 to B and in the reverse direction. The remote site VLANs are 613 U-bridged at the local site Ethernet switch. The measurement packets 614 of site 1 travel tunnel A->B first, are U-bridged at site 2 and 615 travel tunnel B->A second. 
Measurement packets of site 2 travel tunnel B->A first, are U-bridged at site 1 and travel tunnel A->B second. So all measurement packets pass the same tunnel segments, but in different segment order.

If tunneling is applied, two tunnels MUST carry all test traffic between the test site and the remote site. For example, if 802.1Q Ethernet Virtual LANs (VLANs) are applied and the measurement streams are carried in different VLANs, the IP tunnel or Pseudo Wires, respectively, MUST be set up in physical port mode to avoid setting up Pseudo Wires per VLAN (which may see different paths due to ECMP routing); see RFC 4448. The remote router and the Ethernet switch shown in figure 3 have to support 802.1Q in this setup.

The IP packet size of the metric implementation SHOULD be chosen small enough to avoid fragmentation due to the added Ethernet and tunnel headers. Otherwise, the impact of tunnel overhead on fragmentation and interface MTU size MUST be understood and taken into account (see [RFC4459]).

An Ethernet port mode IP tunnel carrying several 802.1Q VLANs, each containing measurement traffic of a single measurement system, was successfully applied when testing compatibility of two metric implementations [morton-testplan-rfc2679].

The following headers may have to be accounted for when calculating the total packet length, if VLANs and Ethernet over L2TPv3 tunnels are applied:

o  Ethernet 802.1Q: 22 Byte.

o  L2TPv3 Header: 4-16 Byte for L2TPv3 data messages over IP; 16-28 Byte for L2TPv3 data messages over UDP.

o  IPv4 Header (outer IP header): 20 Byte.

o  MPLS Labels may be added by a carrier. Each MPLS Label has a length of 4 Bytes. At the time of writing, between 1 and 4 Labels seems to be a fair guess of what can be expected.

The applicability of one or more of the following tunneling protocols may be investigated by interested parties if Ethernet over L2TPv3 is felt to be unsuitable: IP in IP [RFC2003] or Generic Routing Encapsulation (GRE) [RFC2784]. RFC 4928 [RFC4928] proposes measures to avoid ECMP treatment in MPLS networks.

L2TP is a commodity tunneling protocol [RFC2661]. At the time of writing, L2TPv3 [RFC3931] is the latest version of L2TP. If L2TPv3 is applied, software-based implementations of this protocol are not suitable for the test setup, as such implementations may cause incalculable delay shifts.

Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448]. While there is no technical issue with this solution, MPLS interfaces are mostly found in the network provider domain. Hence, not all of the above criteria to select a tunneling protocol are met.

Note that setting up a metric test environment is not a plug-and-play exercise. Skilled networking engineers should be consulted and involved if a setup between remote sites is preferred.

Passing or failing an ADK test with 2 samples could be a random result (note that [RFC2330] defines a sample as a set of singleton metric values produced by a measurement stream, and we continue to use this terminology here). The error margin of a statistical test is higher if the number of samples it is based on is low (the number of samples taken influences the so-called "degree of freedom" of a statistical test, and a higher degree of freedom produces more reliable results).
To pass ADK with higher probability, the number of samples collected per implementation under identical networking conditions SHOULD be greater than 2. Hardware and load constraints may enforce an upper limit on the number of simultaneous measurement streams. The ADK test allows one to combine different samples (see section 9 of [ADK]) and then to run a two-sample test between combined samples. At least 4 samples per implementation, captured under identical networking conditions, are RECOMMENDED when comparing different metric implementations by a statistical test.

It is RECOMMENDED that tests be carried out by establishing N different parallel measurement flows. Two or three linecards per implementation serving to send or receive measurement flows should be sufficient to create 4 or more parallel measurement flows. Other options are to separate flows by DiffServ marks (without deploying any QoS in the inner or outer tunnel) or to use a single CBR flow and evaluate every n-th singleton as belonging to a specific measurement flow. Note that a practical test indeed showed that ADK was passed with 4 samples even if a 2-sample test failed [morton-testplan-rfc2679].

Some additional guidelines to calculate and compare samples to perform a metric test are:

o  Comparing different probes of a common underlying distribution in terms of metrics characterising a communication network requires respecting the temporal interval for which the assumption of a common underlying distribution may hold. Any singletons or samples to be compared MUST be captured within the same time interval.

o  If statistical events like rates are used to characterise measured metrics of a time interval, it is RECOMMENDED to pick a minimum of 5 singletons of a relevant metric to ensure a minimum confidence in the reported value. The error margin of the determined rate depends on the number of singletons (refer to statistical textbooks on Student's t-test). As an example, any packet loss measurement interval to be compared with the results of another implementation should contain at least five lost packets to give some confidence that the observed loss rate was not caused by a small number of random packet drops.

o  The minimum number of singletons or samples to be compared by an Anderson-Darling test SHOULD be 100 per tested metric implementation. Note that the Anderson-Darling test detects small differences in distributions fairly well and will fail for a high number of compared results (RFC2330 mentions an example with 8192 measurements where an Anderson-Darling test always failed).

o  Generally, the Anderson-Darling test is sensitive to differences in the accuracy or bias associated with varying implementations or test conditions. These dissimilarities may result in differing averages of the samples to be compared. An example may be different packet sizes, resulting in a constant delay difference between compared samples. Therefore, samples to be compared by an Anderson-Darling test MAY be calibrated by the difference of the average values of the samples (a sketch of this calibration step follows this list). Any calibration of this kind MUST be documented in the test result.
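To illustrate the calibration described in the last guideline above, the following sketch shifts one sample by the difference of the sample averages and rounds both samples to a common resolution, the smallest resolution at which the ADK test still passes being the value to report (Sections 3.1 and 3.3). This is a minimal illustration under stated assumptions, not part of any referenced implementation: the helper names are invented, the example values are the first averages of Table 1 in Appendix A.2, and the prepared vectors would still have to be handed to a two-sample Anderson-Darling routine such as the C++ code in Appendix B or the R "adk" package [Radk].

   /* Illustrative sketch (hypothetical helpers, not from Appendix B):
    * mean-shift calibration of one sample against another and rounding
    * to a common resolution before a two-sample AD comparison.
    * Values are one-way delays in microseconds. */
   #include <vector>
   #include <numeric>
   #include <cmath>
   #include <cstdio>

   static double mean(const std::vector<double>& v)
   {
       return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
   }

   /* Subtract the difference of the sample averages from every value of b,
    * so that a constant offset (e.g., caused by different packet sizes)
    * does not dominate the distribution comparison. */
   static std::vector<double> calibrate(const std::vector<double>& a,
                                        const std::vector<double>& b)
   {
       const double shift = mean(b) - mean(a);
       std::vector<double> out;
       out.reserve(b.size());
       for (double x : b)
           out.push_back(x - shift);
       return out;
   }

   /* Round every value to the given resolution (e.g., 1000 us = 1 ms) to
    * find the smallest resolution at which the ADK test is still passed. */
   static std::vector<double> quantize(const std::vector<double>& v,
                                       double resolution)
   {
       std::vector<double> out;
       out.reserve(v.size());
       for (double x : v)
           out.push_back(std::round(x / resolution) * resolution);
       return out;
   }

   int main()
   {
       /* First five averages of Table 1; a real test uses >= 100 values. */
       std::vector<double> implA = { 5000, 5008, 5012, 5015, 5019 };
       std::vector<double> implB = { 6549, 6555, 6564, 6565, 6568 };

       std::vector<double> shifted = calibrate(implA, implB);
       std::vector<double> a = quantize(implA, 1000.0);   /* 1 ms */
       std::vector<double> b = quantize(shifted, 1000.0); /* 1 ms */

       /* a and b would now be passed to the two-sample AD routine of
        * Appendix B (or to the R "adk" package) for the actual test. */
       printf("applied shift: %.1f us\n", mean(implB) - mean(implA));
       return 0;
   }

Shifting by the average difference only removes a constant offset; as required above, any such calibration still has to be documented in the test result.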
3.3.  Tests of two or more different implementations against a metric specification

RFC2330 expects "a methodology for a given metric [to] exhibit continuity if, for small variations in conditions, it results in small variations in the resulting measurements. Slightly more precisely, for every positive epsilon, there exists a positive delta, such that if two sets of conditions are within delta of each other, then the resulting measurements will be within epsilon of each other." A small variation in conditions in the context of the metric test proposed here can be seen as different implementations measuring the same metric along the same path.

IPPM metric specifications, however, allow for implementor options to the largest possible degree. It cannot be expected that two implementors allow 100% identical options in their implementations. Testers SHOULD, to the highest degree possible, pick the same configurations for their systems when comparing their implementations by a metric test.

In some cases, a goodness-of-fit test may not be possible or may show disappointing results. To clarify the difficulties arising from different implementation options, the individual options picked for every compared implementation SHOULD be documented in sufficient detail. Based on this documentation, the underlying metric specification should be improved before it is promoted to a standard.

The same statistical test that is applicable to quantify the precision of a single metric implementation MUST be used to compare metric result equivalence for different implementations. To document compatibility, the smallest measurement resolution at which the compared implementations passed the ADK sample test MUST be documented.

For different implementations of the same metric, "variations in conditions" are reasonably expected. The ADK test comparing samples of the different implementations MAY result in a lower precision than the test for precision in the same-implementation comparison.

3.4.  Clock synchronisation

Clock synchronization effects require special attention. The accuracy of one-way active delay measurements for any metric implementation depends on clock synchronization between the source and destination of tests. Ideally, one-way active delay measurement (RFC 2679 [RFC2679]) test endpoints either have direct access to independent GPS- or CDMA-based time sources or indirect access to nearby NTP primary (stratum 1) time sources equipped with GPS receivers. Access to these time sources may not be available at all test locations associated with different Internet paths, for a variety of reasons out of scope of this document.

When secondary (stratum 2 and above) time sources are used with NTP running across the same network whose metrics are subject to comparative implementation tests, network impairments can affect clock synchronization and distort sample one-way values and their interval statistics. It is RECOMMENDED to discard sample one-way delay values for any implementation when one of the following reliability conditions is met (a sketch of the corresponding check follows the list):

o  Delay is measured and is finite in one direction, but not the other.

o  The absolute value of the difference between the sum of one-way measurements in both directions and the round-trip measurement is greater than X% of the latter value.
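A minimal sketch of this reliability check follows. It is an illustration only and not taken from any referenced implementation; the function name, the use of seconds as the unit, and the example threshold of X = 5% are assumptions made here, since the value of X is left open by the text that follows.

   /* Hypothetical check combining the two reliability conditions above:
    * discard a one-way delay sample pair when one direction is not finite,
    * or when |(forward + reverse) - RTT| exceeds X percent of the RTT
    * measured with an independent method, e.g., TWAMP. */
   #include <cmath>
   #include <cstdio>

   static bool owd_pair_reliable(double owd_forward, double owd_reverse,
                                 double reference_rtt,
                                 double max_error_percent)
   {
       /* First condition: delay must be finite in both directions. */
       if (!std::isfinite(owd_forward) || !std::isfinite(owd_reverse))
           return false;

       /* Second condition: the sum of the one-way delays must not deviate
        * from the reference round-trip time by more than X percent. */
       const double deviation =
           std::fabs((owd_forward + owd_reverse) - reference_rtt);
       return deviation <= (max_error_percent / 100.0) * reference_rtt;
   }

   int main()
   {
       /* Example: 52 ms + 49 ms against a 100 ms reference RTT, X = 5%. */
       bool keep = owd_pair_reliable(0.052, 0.049, 0.100, 5.0);
       printf("sample pair is %s\n", keep ? "kept" : "discarded");
       return 0;
   }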
808 Examination of the second condition requires RTT measurement for 809 reference, e.g., based on TWAMP (RFC5357, RFC 5357 [RFC5357]), in 810 conjunction with one-way delay measurement. 812 Specification of X% to strike a balance between identification of 813 unreliable one-way delay samples and misidentification of reliable 814 samples under a wide range of Internet path RTTs probably requires 815 further study. 817 An IPPM compliant metric implementation of an RFC that requires 818 synchronized clocks is expected to provide precise measurement 819 results. 821 IF an implementation publishes a specification of its precision, such 822 as "a precision of 1 ms (+/- 500 us) with a confidence of 95%", then 823 the specification SHOULD be met over a useful measurement duration. 824 For example, if the metric is measured along an Internet path which 825 is stable and not congested, then the precision specification SHOULD 826 be met over durations of an hour or more. 828 3.5. Recommended Metric Verification Measurement Process 830 In order to meet their obligations under the IETF Standards Process 831 the IESG must be convinced that each metric specification advanced to 832 Draft Standard or Internet Standard status is clearly written, that 833 there are a sufficient number of verified equivalent implementations, 834 and that options that have been implemented are documented. 836 In the context of this document, metrics are designed to measure some 837 characteristic of a data network. An aim of any metric definition 838 should be that it should be specified in a way that can reliably 839 measure the specific characteristic in a repeatable way across 840 multiple independent implementations. 842 Each metric, statistic or option of those to be validated MUST be 843 compared against a reference measurement or another implementation by 844 as specified by this document. 846 Finally, the metric definitions, embodied in the text of the RFCs, 847 are the objects that require evaluation and possible revision in 848 order to advance to the next step on the standards track. 850 IF two (or more) implementations do not measure an equivalent metric 851 as specified by this document, 853 AND sources of measurement error do not adequately explain the lack 854 of agreement, 855 THEN the details of each implementation should be audited along with 856 the exact definition text, to determine if there is a lack of clarity 857 that has caused the implementations to vary in a way that affects the 858 correspondence of the results. 860 IF there was a lack of clarity or multiple legitimate interpretations 861 of the definition text, 863 THEN the text should be modified and the resulting memo proposed for 864 consensus and (possible) advancement along the standards track. 866 Finally, all the findings MUST be documented in a report that can 867 support advancement on the standards track, similar to those 868 described in [RFC5657]. The list of measurement devices used in 869 testing satisfies the implementation requirement, while the test 870 results provide information on the quality of each specification in 871 the metric RFC (the surrogate for feature interoperability). 873 The complete process of advancing a metric specification to a 874 standard as defined by this document is illustrated in Figure 4. 876 ,---. 877 / \ 878 ( Start ) 879 \ / Implementations 880 `-+-' +-------+ 881 | /| 1 `. 882 +---+----+ / +-------+ `.-----------+ ,-------. 883 | RFC | / |Check for | ,' was RFC `. YES 884 | | / |Equivalence.... 
clause x ------+ 885 | |/ +-------+ |under | `. clear? ,' | 886 | Metric \.....| 2 ....relevant | `---+---' +----+-----+ 887 | Metric |\ +-------+ |identical | No | |Report | 888 | Metric | \ |network | +--+----+ |results + | 889 | ... | \ |conditions | |Modify | |Advance | 890 | | \ +-------+ | | |Spec +--+RFC | 891 +--------+ \| n |.'+-----------+ +-------+ |request(?)| 892 +-------+ +----------+ 894 Illustration of the metric standardisation process 896 Figure 4 898 Any recommendation for the advancement of a metric specification MUST 899 be accompanied by an implementation report, as is the case with all 900 requests for the advancement of IETF specifications. The 901 implementation report needs to include the tests performed, the 902 applied test setup, the specific metrics in the RFC and reports of 903 the tests performed with two or more implementations. The test plan 904 needs to specify the precision reached for each measured metric and 905 thus define the meaning of "statistically equivalent" for the 906 specific metrics being tested. 908 Ideally, the test plan would co-evolve with the development of the 909 metric, since that's when people have the most context in their 910 thinking regarding the different subtleties that can arise. 912 In particular, the implementation report MUST as a minimum document: 914 o The metric compared and the RFC specifying it. This includes 915 statements as required by the section "Tests of an individual 916 implementation against a metric specification" of this document. 918 o The measurement configuration and setup. 920 o A complete specification of the measurement stream (mean rate, 921 statistical distribution of packets, packet size or mean packet 922 size and their distribution), DSCP and any other measurement 923 stream properties which could result in deviating results. 924 Deviations in results can be caused also if chosen IP addresses 925 and ports of different implementations can result in different 926 layer 2 or layer 3 paths due to operation of Equal Cost Multi-Path 927 routing in an operational network. 929 o The duration of each measurement to be used for a metric 930 validation, the number of measurement points collected for each 931 metric during each measurement interval (i.e. the probe size) and 932 the level of confidence derived from this probe size for each 933 measurement interval. 935 o The result of the statistical tests performed for each metric 936 validation as required by the section "Tests of two or more 937 different implementations against a metric specification" of this 938 document. 940 o A parameterization of laboratory conditions and applied traffic 941 and network conditions allowing reproduction of these laboratory 942 conditions for readers of the implementation report. 944 o The documentation helping to improve metric specifications defined 945 by this section. 947 All of the tests for each set SHOULD be run in a test setup as 948 specified in the section "Test setup resulting in identical live 949 network testing conditions." 951 If a different test set up is chosen, it is RECOMMENDED to avoid 952 effects falsifying results of validation measurements caused by real 953 data networks (like parallelism in devices and networks). Data 954 networks may forward packets differently in the case of: 956 o Different packet sizes chosen for different metric 957 implementations. 
A proposed countermeasure is selecting the same packet size when validating results of two samples or a sample against an original distribution.

o  Selection of differing IP addresses and ports used by different metric implementations during metric validation tests. If ECMP is applied on the IP or MPLS level, different paths can result (note that it may be impossible to detect an MPLS ECMP path from an IP endpoint). A proposed countermeasure is to connect the measurement equipment to be compared via a NAT device, or to establish a single tunnel to transport all measurement traffic. The aim is to have the same IP addresses and ports for all measurement packets, or to avoid ECMP-based local routing diversion by using a layer 2 tunnel.

o  Different IP options.

o  Different DSCP.

o  If the N measurements are captured using sequential measurements instead of simultaneous ones, then the following factors come into play: time-varying paths and load conditions.

3.6.  Proposal to determine an "equivalence" threshold for each metric evaluated

This section describes a proposal for the maximum error of "equivalence", based on a performance comparison of identical implementations. This comparison may be useful for both ADK and non-ADK comparisons.

Each metric is tested by two or more implementations (cross-implementation testing).

Each metric is also tested twice simultaneously by the *same* implementation, using different Src/Dst Address pairs and other differences, such that the connectivity differences of the cross-implementation tests are also experienced and measured by the same implementation.

Comparative results for the same implementation represent a bound on cross-implementation equivalence. This should be particularly useful when the metric does *not* produce a continuous distribution of singleton values, such as with a loss metric or a duplication metric. Appendix A indicates how the ADK will work for One-way delay, and should be likewise applicable to distributions of delay variation. Appendix B discusses two possible ways to perform the ADK analysis: the R statistical language [Rtool] with the ADK package [Radk], and C++ code.

Proposal: the largest difference observed in the homogeneous (same-implementation) comparison results is the lower bound on the equivalence threshold, noting that there may be other systematic errors to account for when comparing between implementations.

Thus, when evaluating equivalence in cross-implementation results:

   Maximum_Error = Same_Implementation_Error + Systematic_Error

and only the systematic error need be decided beforehand.

In the case of ADK comparison, the largest same-implementation resolution of distribution equivalence can be used as a limit on cross-implementation resolutions (at the same confidence level).

4.  Acknowledgements

Gerhard Hasslinger commented on a first version of this document and suggested statistical tests and the evaluation of time series information. Matthias Wieser's thesis on a metric test resulted in new input for this draft. Henk Uijterwaal and Lars Eggert have encouraged and helped to organize this work. Mike Hamilton, Scott Bradner, David McDysan, and Emile Stephan commented on this draft. Carol Davids reviewed the 01 version of the ID before it was promoted to a WG draft.

5.
Contributors 1034 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- 1035 metrictest [bradner-metrictest], and major parts of it are included 1036 in this document. 1038 6. IANA Considerations 1040 This memo includes no request to IANA. 1042 7. Security Considerations 1044 This memo does not raise any specific security issues. 1046 8. References 1048 8.1. Normative References 1050 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 1051 October 1996. 1053 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 1054 3", BCP 9, RFC 2026, October 1996. 1056 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1057 Requirement Levels", BCP 14, RFC 2119, March 1997. 1059 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, 1060 "Framework for IP Performance Metrics", RFC 2330, 1061 May 1998. 1063 [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, 1064 G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", 1065 RFC 2661, August 1999. 1067 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1068 Delay Metric for IPPM", RFC 2679, September 1999. 1070 [RFC2680] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1071 Packet Loss Metric for IPPM", RFC 2680, September 1999. 1073 [RFC2681] Almes, G., Kalidindi, S., and M. Zekauskas, "A Round-trip 1074 Delay Metric for IPPM", RFC 2681, September 1999. 1076 [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. 1077 Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, 1078 March 2000. 1080 [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling 1081 Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. 1083 [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, 1084 "Encapsulation Methods for Transport of Ethernet over MPLS 1085 Networks", RFC 4448, April 2006. 1087 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 1088 Network Tunneling", RFC 4459, April 2006. 1090 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1091 Zekauskas, "A One-way Active Measurement Protocol 1092 (OWAMP)", RFC 4656, September 2006. 1094 [RFC4719] Aggarwal, R., Townsley, M., and M. Dos Santos, "Transport 1095 of Ethernet Frames over Layer 2 Tunneling Protocol Version 1096 3 (L2TPv3)", RFC 4719, November 2006. 1098 [RFC4928] Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal 1099 Cost Multipath Treatment in MPLS Networks", BCP 128, 1100 RFC 4928, June 2007. 1102 [RFC5657] Dusseault, L. and R. Sparks, "Guidance on Interoperation 1103 and Implementation Reports for Advancement to Draft 1104 Standard", BCP 9, RFC 5657, September 2009. 1106 8.2. Informative References 1108 [ADK] Scholz, F. and M. Stephens, "K-sample Anderson-Darling 1109 Tests of fit, for continuous and discrete cases", 1110 University of Washington, Technical Report No. 81, 1111 May 1986. 1113 [GU+Duffield] 1114 Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE 1115 Encapsulated Multicast Probing: A Scalable Technique for 1116 Measuring One-Way Loss", SIGMETRICS'07 San Diego, 1117 California, USA, June 2007. 1119 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1120 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1121 RFC 5357, October 2008. 1123 [Radk] Scholz, F., "adk: Anderson-Darling K-Sample Test and 1124 Combinations of Such Tests. R package version 1.0.", , 1125 2008. 1127 [Rtool] R Development Core Team, "R: A language and environment 1128 for statistical computing. 
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/", 2011.

[Rule of thumb]  Hardy, M., "Confidence interval", March 2010.

[bradner-metrictest]  Bradner, S., Mankin, A., and V. Paxson, "Advancement of metrics specifications on the IETF Standards Track", draft-bradner-metricstest-03 (work in progress), July 2007.

[morton-advance-metrics]  Morton, A., "Problems and Possible Solutions for Advancing Metrics on the Standards Track", draft-morton-ippm-advance-metrics-00 (work in progress), July 2009.

[morton-advance-metrics-01]  Morton, A., "Lab Test Results for Advancing Metrics on the Standards Track", draft-morton-ippm-advance-metrics-01 (work in progress), June 2010.

[morton-testplan-rfc2679]  Ciavattone, L., Geib, R., Morton, A., and M. Wieser, "Test Plan and Results for Advancing RFC 2679 on the Standards Track", draft-morton-ippm-testplan-rfc2679-01 (work in progress), June 2011.

Appendix A.  An example on a One-way Delay metric validation

The text of this appendix is not binding. It is an example of how parts of a One-way Delay metric test could look.

A.1.  Compliance to Metric specification requirements

One-way Delay, Loss threshold, RFC 2679

This test determines if implementations use the same configured maximum waiting time delay from one measurement to another under different delay conditions, and correctly declare packets arriving in excess of the waiting time threshold as lost. See Section 3.5 of RFC2679, 3rd bullet point, and also Section 3.8.2 of RFC2679.

(1)  Configure a path with 1 sec one-way constant delay.

(2)  Measure one-way delay with 2 or more implementations, using identical waiting time thresholds for loss set at 2 seconds.

(3)  Configure the path with 3 sec one-way delay.

(4)  Repeat measurements.

(5)  Observe that the increase measured in step 4 caused all packets to be declared lost, and that all packets that arrive successfully in step 2 are assigned a valid one-way delay.

One-way Delay, First-bit to Last-bit, RFC 2679

This test determines if implementations register the same relative increase in delay from one measurement to another under different delay conditions. This test tends to cancel the sources of error which may be present in an implementation. See Section 3.7.2 of RFC2679, and Section 10.2 of RFC2330.

(1)  Configure a path with X ms one-way constant delay, ideally including a low-speed link.

(2)  Measure one-way delay with 2 or more implementations, using identical options and equal-size small packets (e.g., 100 octet IP payload).

(3)  Maintain the same path with X ms one-way delay.

(4)  Measure one-way delay with 2 or more implementations, using identical options and equal-size large packets (e.g., 1500 octet IP payload).

(5)  Observe that the increase measured in steps 2 and 4 is equivalent to the increase in ms expected due to the larger serialization time for each implementation. Most of the measurement errors in each system should cancel, if they are stationary.

One-way Delay, RFC 2679

This test determines if implementations register the same relative increase in delay from one measurement to another under different delay conditions.
   One-way Delay, RFC 2679

   This test determines if implementations register the same relative
   increase in delay from one measurement to another under different
   delay conditions.  This test tends to cancel the sources of error
   which may be present in an implementation.  This test is intended
   to evaluate measurements in Sections 3 and 4 of RFC2679.

   (1)  Configure a path with X ms one-way constant delay.

   (2)  Measure one-way delay with 2 or more implementations, using
        identical options.

   (3)  Configure the path with X+Y ms one-way delay.

   (4)  Repeat measurements.

   (5)  Observe that the increase measured in steps 2 and 4 is ~Y ms
        for each implementation.  Most of the measurement errors in
        each system should cancel, if they are stationary.

   Error Calibration, RFC 2679

   This is a simple check to determine if an implementation reports
   the error calibration as required in Section 4.8 of RFC2679.  Note
   that the context (Type-P) must also be reported.

A.2.  Examples related to statistical tests for One-way Delay

   A one-way delay measurement may pass an ADK test with a timestamp
   resolution of 1 ms.  The same test may fail if timestamps with a
   resolution of 100 microseconds are evaluated.  The implementation
   is then conforming to the metric specification up to a timestamp
   resolution of 1 ms.

   Let's assume another one-way delay measurement comparison between
   implementation 1, probing with a frequency of 2 probes per second,
   and implementation 2, probing at a rate of 2 probes every 3
   minutes.  To ensure reasonable confidence in results, sample
   metrics are calculated from at least 5 singletons per compared time
   interval.  This means that sample delay values are calculated for
   each system for identical 6-minute intervals for the whole test
   duration.  Per 6-minute interval, the sample metric is calculated
   from 720 singletons for implementation 1 and from 6 singletons for
   implementation 2.  Note that if outliers are not filtered, moving
   averages are an option for an evaluation too.  The minimum move of
   an averaging interval is three minutes in this example.

   The data in Table 1 may result from measuring One-Way Delay with
   implementation 1 (see column Implemnt_1) and implementation 2 (see
   column Implemnt_2).  Each data point in the table represents a
   (rounded) average of the sampled delay values per interval.  The
   resolution of the clock is one microsecond.  The difference in the
   delay values may result, e.g., from different probe packet sizes.

      +------------+------------+-----------------------------+
      | Implemnt_1 | Implemnt_2 | Implemnt_2 - Delta_Averages |
      +------------+------------+-----------------------------+
      |       5000 |       6549 |                        4997 |
      |       5008 |       6555 |                        5003 |
      |       5012 |       6564 |                        5012 |
      |       5015 |       6565 |                        5013 |
      |       5019 |       6568 |                        5016 |
      |       5022 |       6570 |                        5018 |
      |       5024 |       6573 |                        5021 |
      |       5026 |       6575 |                        5023 |
      |       5027 |       6577 |                        5025 |
      |       5029 |       6580 |                        5028 |
      |       5030 |       6585 |                        5033 |
      |       5032 |       6586 |                        5034 |
      |       5034 |       6587 |                        5035 |
      |       5036 |       6588 |                        5036 |
      |       5038 |       6589 |                        5037 |
      |       5039 |       6591 |                        5039 |
      |       5041 |       6592 |                        5040 |
      |       5043 |       6599 |                        5047 |
      |       5046 |       6606 |                        5054 |
      |       5054 |       6612 |                        5060 |
      +------------+------------+-----------------------------+

                                Table 1

   Average values of sample metrics captured during identical time
   intervals are compared.
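   The sketch below shows one way in which such per-interval sample
   averages could be produced from raw singletons.  It is an
   illustration only; the function name, the data layout, and the
   360-second interval length are assumptions and not part of any
   metric specification.

   #include <map>
   #include <utility>
   #include <vector>

   /* Collapse singleton one-way delay values into one sample average
    * per measurement interval, so that implementations probing at
    * different rates can be compared over identical time intervals.
    * Input: (timestamp in seconds, delay in microseconds) pairs.
    */
   std::vector<double>
   interval_averages(const std::vector<std::pair<double,double> >& s,
                     double interval_s)        /* e.g., 360 s */
   {
      std::map<long, std::pair<double,long> > acc;  /* idx->(sum,n) */
      for (size_t i = 0; i < s.size(); ++i) {
         long idx = (long)(s[i].first / interval_s);
         acc[idx].first  += s[i].second;
         acc[idx].second += 1;
      }
      std::vector<double> avg;
      for (std::map<long, std::pair<double,long> >::const_iterator
              it = acc.begin(); it != acc.end(); ++it)
         avg.push_back(it->second.first / it->second.second);
      return avg;    /* one average per interval, in time order */
   }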
   This comparison of averages over identical time intervals excludes
   random differences caused by differing probing intervals or by the
   differing temporal distance of singletons resulting from their
   Poisson distributed sending times.

   In the example, 20 values have been picked (note that at least 100
   values are recommended for a single run of a real test).  Data must
   be ordered by ascending rank.  The data of Implemnt_1 and
   Implemnt_2 as shown in the first two columns of Table 1 clearly
   fails an ADK test with 95% confidence.

   The results of Implemnt_2 are now reduced by the difference of the
   averages of column 2 (rounded to 6581 us) and column 1 (rounded to
   5029 us), which is 1552 us.  The result may be found in column 3 of
   Table 1.  Comparing column 1 and column 3 of the table by an ADK
   test shows that the data contained in these columns passes an ADK
   test with 95% confidence.

   Note: Extensive averaging was used in this example because of the
   vastly different sampling frequencies.  As a result, the
   distributions compared do not exactly align with a metric in
   [RFC2679], but they illustrate the ADK process adequately.

Appendix B.  Anderson-Darling K-sample Reference and 2 sample C++ code

   There are many statistical tools available, and this Appendix
   describes two that are familiar to the authors.

   The "R tool" is a language and command-line environment for
   statistical computing and plotting [Rtool].  With the optional
   "adk" package installed [Radk], it can perform individual and
   combined sample ADK computations.  The user must consult the
   package documentation and the original paper [ADK] to interpret the
   results, but this is as it should be.

   The C++ code below will perform a 2-sample AD comparison when
   compiled and presented with two column vectors in a file (using
   white space as separation).  This version contains modifications by
   Wes Eddy (September 2011) to use the vectors and run as a
   stand-alone module.  The status of the comparison can be checked on
   the command line with "$ echo $?", or the last line can be replaced
   with a printf statement for adk_result instead.

   /* Routines for computing the Anderson-Darling 2 sample
    * test statistic.
    *
    * Implemented based on the description in
    * "Anderson-Darling K Sample Test", Heckert, Alan and
    * Filliben, James, editors, Dataplot Reference Manual,
    * Chapter 15 Auxiliary, NIST, 2004.
    * Official reference as of 2010:
    * Heckert, N. A. (2001).  Dataplot website at the
    * National Institute of Standards and Technology:
    * http://www.itl.nist.gov/div898/software/dataplot.html/
    * June 2001.
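    *
    * Usage sketch (file and program names are illustrative only):
    * the program reads whitespace-separated value pairs from
    * standard input, e.g.
    *
    *    g++ -o adk_2sample adk_2sample.cc
    *    ./adk_2sample < two_columns.txt ; echo $?
    *
    * where two_columns.txt holds sample 1 and sample 2 as two
    * columns, each in ascending order.  main() returns 1 if
    * adk_result is less than or equal to adk_criterium (the
    * 2-sample test is passed) and 0 otherwise; this is the value
    * reported by "echo $?".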
    */

   #include <iostream>
   #include <vector>

   using namespace std;

   int main() {
      vector<double> vec1, vec2;
      double adk_result;
      static int k, val_st_z_samp1, val_st_z_samp2,
                 val_eq_z_samp1, val_eq_z_samp2,
                 j, n_total, n_sample1, n_sample2, L,
                 max_number_samples, line, maxnumber_z;
      static int column_1, column_2;
      static double adk, n_value, z, sum_adk_samp1,
                    sum_adk_samp2, z_aux;
      static double H_j, F1j, hj, F2j, denom_1_aux, denom_2_aux;
      static bool next_z_sample2, equal_z_both_samples;
      static int stop_loop1, stop_loop2, stop_loop3, old_eq_line2,
                 old_eq_line1;

      static double adk_criterium = 1.993;

      /* vec1 and vec2 are initialised with sample 1 and
       * sample 2 values in ascending order; index 0 holds a
       * dummy entry because the algorithm below uses 1-based
       * indexing.
       */
      vec1.push_back(0.0);
      vec2.push_back(0.0);
      double f1, f2;
      while (cin >> f1 >> f2) {
         vec1.push_back(f1);
         vec2.push_back(f2);
      }

      k = 2;
      n_sample1 = vec1.size() - 1;
      n_sample2 = vec2.size() - 1;

      // -1 because vec[0] is a dummy value
      n_total = n_sample1 + n_sample2;

      /* line with a value equal to zj in sample 1.
       * Here j=1, so the line is 1.
       */
      val_eq_z_samp1 = 1;

      /* line with a value equal to zj in sample 2.
       * Here j=1, so the line is 1.
       */
      val_eq_z_samp2 = 1;

      /* last line with a value < zj in sample 1.
       * Here j=1, so the line is 0.
       */
      val_st_z_samp1 = 0;

      /* last line with a value < zj in sample 2.
       * Here j=1, so the line is 0.
       */
      val_st_z_samp2 = 0;

      sum_adk_samp1 = 0;
      sum_adk_samp2 = 0;
      j = 1;

      // as mentioned above, j=1
      equal_z_both_samples = false;

      next_z_sample2 = false;

      // assuming the next z to be of sample 1
      stop_loop1 = n_sample1 + 1;

      // + 1 because vec[0] is a dummy, see n_sample1 declaration
      stop_loop2 = n_sample2 + 1;
      stop_loop3 = n_total + 1;

      /* The required z values are calculated until all values
       * of both samples have been taken into account.  See the
       * lines above for the stop_loop values.  Construct required
       * to avoid a mathematical operation in the while condition.
       */
      while (((stop_loop1 > val_eq_z_samp1)
              || (stop_loop2 > val_eq_z_samp2)) && stop_loop3 > j)
      {
         if (val_eq_z_samp1 < n_sample1 + 1)
         {
            /* here, a preliminary zj value is set.
             * See below how to calculate the actual zj.
             */
            z = vec1[val_eq_z_samp1];

            /* this while sequence calculates the number of values
             * equal to z.
             */
            while ((val_eq_z_samp1 + 1 < n_sample1)
                   && z == vec1[val_eq_z_samp1 + 1])
            {
               val_eq_z_samp1++;
            }
         }
         else
         {
            val_eq_z_samp1 = 0;
            val_st_z_samp1 = n_sample1;

            // this should be val_eq_z_samp1 - 1 = n_sample1
         }

         if (val_eq_z_samp2 < n_sample2 + 1)
         {
            z_aux = vec2[val_eq_z_samp2];

            /* this while sequence calculates the number of values
             * equal to z_aux
             */
            while ((val_eq_z_samp2 + 1 < n_sample2)
                   && z_aux == vec2[val_eq_z_samp2 + 1])
            {
               val_eq_z_samp2++;
            }

            /* the smaller of the two actual data values is picked
             * as the next zj.
             */
            if (z > z_aux)
            {
               z = z_aux;
               next_z_sample2 = true;
            }
            else
            {
               if (z == z_aux)
               {
                  equal_z_both_samples = true;
               }

               /* This is the case if the last value of column 1 is
                * smaller than the remaining values of column 2.
                */
               if (val_eq_z_samp1 == 0)
               {
                  z = z_aux;
                  next_z_sample2 = true;
               }
            }
         }
         else
         {
            val_eq_z_samp2 = 0;
            val_st_z_samp2 = n_sample2;

            // this should be val_eq_z_samp2 - 1 = n_sample2
         }

         /* in the following, the sum over j = 1 to L is calculated
          * for sample 1 and sample 2.
          */
         if (equal_z_both_samples)
         {
            /* hj is the number of values in the combined sample
             * equal to zj
             */
            hj = val_eq_z_samp1 - val_st_z_samp1
                 + val_eq_z_samp2 - val_st_z_samp2;

            /* H_j is the number of values in the combined sample
             * smaller than zj plus one half the number of values
             * in the combined sample equal to zj (that's hj/2).
             */
            H_j = val_st_z_samp1 + val_st_z_samp2
                  + hj / 2;

            /* F1j is the number of values in the 1st sample
             * which are less than zj plus one half the number
             * of values in this sample which are equal to zj.
             */
            F1j = val_st_z_samp1 + (double)
                  (val_eq_z_samp1 - val_st_z_samp1) / 2;

            /* F2j is the number of values in the 2nd sample
             * which are less than zj plus one half the number
             * of values in this sample which are equal to zj.
             */
            F2j = val_st_z_samp2 + (double)
                  (val_eq_z_samp2 - val_st_z_samp2) / 2;

            /* set the line of values equal to zj to the
             * actual line of the last value picked for zj.
             */
            val_st_z_samp1 = val_eq_z_samp1;

            /* Set the line of values equal to zj to the actual
             * line of the last value picked for zj of each
             * sample.  This is required as data smaller than zj
             * is accounted differently than values equal to zj.
             */
            val_st_z_samp2 = val_eq_z_samp2;

            /* next the lines of the next values z, i.e. zj+1,
             * are addressed.
             */
            val_eq_z_samp1++;

            /* next the lines of the next values z, i.e. zj+1,
             * are addressed.
             */
            val_eq_z_samp2++;
         }
         else
         {
            /* the smaller z value was contained in sample 2,
             * hence this value is the zj to base the following
             * calculations on.
             */
            if (next_z_sample2)
            {
               /* hj is the number of values in the combined
                * sample equal to zj, in this case these are
                * within sample 2 only.
                */
               hj = val_eq_z_samp2 - val_st_z_samp2;

               /* H_j is the number of values in the combined
                * sample smaller than zj plus one half the number
                * of values in the combined sample equal to zj
                * (that's hj/2).
                */
               H_j = val_st_z_samp1 + val_st_z_samp2
                     + hj / 2;

               /* F1j is the number of values in the 1st sample
                * which are less than zj plus one half the number
                * of values in this sample which are equal to zj.
                * As val_eq_z_samp2 < val_eq_z_samp1, these are
                * the val_st_z_samp1 only.
                */
               F1j = val_st_z_samp1;

               /* F2j is the number of values in the 2nd sample
                * which are less than zj plus one half the number
                * of values in this sample which are equal to zj.
                * The latter are from sample 2 only in this case.
                */
               F2j = val_st_z_samp2 + (double)
                     (val_eq_z_samp2 - val_st_z_samp2) / 2;

               /* Set the line of values equal to zj to the actual
                * line of the last value picked for zj of sample 2
                * only in this case.
                */
               val_st_z_samp2 = val_eq_z_samp2;

               /* next the line of the next value z, i.e. zj+1, is
                * addressed.  Here, only sample 2 must be addressed.
                */
               val_eq_z_samp2++;
               if (val_eq_z_samp1 == 0)
               {
                  val_eq_z_samp1 = stop_loop1;
               }
            }

            /* the smaller z value was contained in sample 1,
             * hence this value is the zj to base the following
             * calculations on.
             */
            else
            {
               /* hj is the number of values in the combined
                * sample equal to zj, in this case these are
                * within sample 1 only.
                */
               hj = val_eq_z_samp1 - val_st_z_samp1;

               /* H_j is the number of values in the combined
                * sample smaller than zj plus one half the number
                * of values in the combined sample equal to zj
                * (that's hj/2).
                */
               H_j = val_st_z_samp1 + val_st_z_samp2
                     + hj / 2;

               /* F1j is the number of values in the 1st sample
                * which are less than zj plus one half the number
                * of values in this sample which are equal to zj.
                * The latter are from sample 1 only in this case.
                */
               F1j = val_st_z_samp1 + (double)
                     (val_eq_z_samp1 - val_st_z_samp1) / 2;

               /* F2j is the number of values in the 2nd sample
                * which are less than zj plus one half the number
                * of values in this sample which are equal to zj.
                * As val_eq_z_samp1 < val_eq_z_samp2, these are
                * the val_st_z_samp2 only.
                */
               F2j = val_st_z_samp2;

               /* Set the line of values equal to zj to the actual
                * line of the last value picked for zj of sample 1
                * only in this case.
                */
               val_st_z_samp1 = val_eq_z_samp1;

               /* next the line of the next value z, i.e. zj+1, is
                * addressed.  Here, only sample 1 must be addressed.
                */
               val_eq_z_samp1++;

               if (val_eq_z_samp2 == 0)
               {
                  val_eq_z_samp2 = stop_loop2;
               }
            }
         }

         denom_1_aux = n_total * F1j - n_sample1 * H_j;
         denom_2_aux = n_total * F2j - n_sample2 * H_j;

         sum_adk_samp1 = sum_adk_samp1 + hj
                         * (denom_1_aux * denom_1_aux) /
                         (H_j * (n_total - H_j)
                         - n_total * hj / 4);
         sum_adk_samp2 = sum_adk_samp2 + hj
                         * (denom_2_aux * denom_2_aux) /
                         (H_j * (n_total - H_j)
                         - n_total * hj / 4);
         next_z_sample2 = false;
         equal_z_both_samples = false;

         /* index counting the z values.  It is only required to
          * prevent the while loop from executing endlessly.
          */
         j++;
      }

      // calculating the adk value is the final step.
      adk_result = (double) (n_total - 1) / (n_total
                   * n_total * (k - 1))
                   * (sum_adk_samp1 / n_sample1
                   + sum_adk_samp2 / n_sample2);

      /* if (adk_result <= adk_criterium)
       * the adk_2_sample test is passed
       */
      return adk_result <= adk_criterium;
   }

                                Figure 5

Appendix C.  Glossary

   +-------------+-----------------------------------------------------+
   | ADK         | Anderson-Darling K-Sample test, a test used to      |
   |             | check whether two samples have the same statistical |
   |             | distribution.                                       |
   | ECMP        | Equal Cost Multipath, a load balancing mechanism    |
   |             | evaluating MPLS label stacks, IP addresses and      |
   |             | ports.                                              |
   | EDF         | The "Empirical Distribution Function" of a set of   |
   |             | scalar measurements is a function F(x) which for    |
   |             | any x gives the fractional proportion of the total  |
   |             | measurements that were smaller than or equal to x.  |
   | Metric      | A measured quantity related to the performance and  |
   |             | reliability of the Internet, expressed by a value.  |
   |             | This could be a singleton (single value), a sample  |
   |             | of single values or a statistic based on a sample   |
   |             | of singletons.                                      |
   | OWAMP       | One-way Active Measurement Protocol, a protocol for |
   |             | communication between IPPM measurement systems      |
   |             | specified by IPPM.                                  |
   | OWD         | One-Way Delay, a performance metric specified by    |
   |             | IPPM.                                               |
   | Sample      | A sample metric is derived from a given singleton   |
   | metric      | metric by evaluating a number of distinct instances |
   |             | together.                                           |
   | Singleton   | A singleton metric is, in a sense, one atomic       |
   | metric      | measurement of this metric.                         |
   | Statistical | A 'statistical' metric is derived from a given      |
   | metric      | sample metric by computing some statistic of the    |
   |             | values defined by the singleton metric on the       |
   |             | sample.                                             |
   | TWAMP       | Two-way Active Measurement Protocol, a protocol for |
   |             | communication between IPPM measurement systems      |
   |             | specified by IPPM.                                  |
   +-------------+-----------------------------------------------------+

                                Table 2

Authors' Addresses

   Ruediger Geib (editor)
   Deutsche Telekom
   Heinrich Hertz Str. 3-7
   Darmstadt, 64295
   Germany

   Phone: +49 6151 58 12747
   Email: Ruediger.Geib@telekom.de


   Al Morton
   AT&T Labs
   200 Laurel Avenue South
   Middletown, NJ 07748
   USA

   Phone: +1 732 420 1571
   Fax:   +1 732 368 1192
   Email: acmorton@att.com
   URI:   http://home.comcast.net/~acmacm/


   Reza Fardid
   Cariden Technologies
   888 Villa Street, Suite 500
   Mountain View, CA 94041
   USA

   Phone:
   Email: rfardid@cariden.com


   Alexander Steinmitz
   Deutsche Telekom
   Memmelsdorfer Str. 209b
   Bamberg, 96052
   Germany

   Phone:
   Email: Alexander.Steinmitz@telekom.de