2 Internet Engineering Task Force R. Geib, Ed. 3 Internet-Draft Deutsche Telekom 4 Intended status: Standards Track A. Morton 5 Expires: September 15, 2011 AT&T Labs 6 R. Fardid 7 Cariden Technologies 8 A. Steinmitz 9 HS Fulda 10 March 14, 2011 12 IPPM standard advancement testing 13 draft-ietf-ippm-metrictest-02 15 Abstract 17 This document specifies tests to determine if multiple independent 18 instantiations of a performance metric RFC have implemented the 19 specifications in the same way. This is the performance metric 20 equivalent of interoperability, required to advance RFCs along the 21 standards track. Results from different implementations of metric 22 RFCs will be collected under the same underlying network conditions 23 and compared using state of the art statistical methods. The goal is 24 an evaluation of the metric RFC itself, whether its definitions are 25 clear and unambiguous to implementors and therefore a candidate for 26 advancement on the IETF standards track. 28 Status of this Memo 30 This Internet-Draft is submitted in full conformance with the 31 provisions of BCP 78 and BCP 79. 33 Internet-Drafts are working documents of the Internet Engineering 34 Task Force (IETF). Note that other groups may also distribute 35 working documents as Internet-Drafts. The list of current Internet- 36 Drafts is at http://datatracker.ietf.org/drafts/current/. 
38 Internet-Drafts are draft documents valid for a maximum of six months 39 and may be updated, replaced, or obsoleted by other documents at any 40 time. It is inappropriate to use Internet-Drafts as reference 41 material or to cite them other than as "work in progress." 43 This Internet-Draft will expire on September 15, 2011. 45 Copyright Notice 47 Copyright (c) 2011 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the Simplified BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 6 64 2. Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . 6 65 3. Verification of conformance to a metric specification . . . . 8 66 3.1. Tests of an individual implementation against a metric 67 specification . . . . . . . . . . . . . . . . . . . . . . 9 68 3.2. Test setup resulting in identical live network testing 69 conditions . . . . . . . . . . . . . . . . . . . . . . . . 11 70 3.3. Tests of two or more different implementations against 71 a metric specification . . . . . . . . . . . . . . . . . . 15 72 3.4. Clock synchronisation . . . . . . . . . . . . . . . . . . 16 73 3.5. Recommended Metric Verification Measurement Process . . . 17 74 3.6. Miscellaneous . . . . . . . . . . . . . . . . . . . . . . 20 75 3.7. Proposal to determine an "equivalence" threshold for 76 each metric evaluated . . . . . . . . . . . . . . . . . . 21 77 4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 78 5. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 22 79 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 80 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22 81 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 23 82 8.1. Normative References . . . . . . . . . . . . . . . . . . . 23 83 8.2. Informative References . . . . . . . . . . . . . . . . . . 24 84 Appendix A. An example of a One-way Delay metric validation . . . 25 85 A.1. Compliance to Metric specification requirements . . . . . 25 86 A.2. Examples related to statistical tests for One-way Delay . 26 87 Appendix B. Anderson-Darling 2 sample C++ code . . . . . . . . . 28 88 Appendix C. A tunneling set up for remote metric 89 implementation testing . . . . . . . . . . . . . . . 36 90 Appendix D. Glossary . . . . . . . . . . . . . . . . . . . . . . 38 91 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38 93 1. Introduction 95 The Internet Standards Process RFC2026 [RFC2026] requires that for an 96 IETF specification to advance beyond the Proposed Standard level, at 97 least two genetically unrelated implementations must be shown to 98 interoperate correctly with all features and options. 
This 99 requirement can be met by supplying: 101 o evidence that (at least a sub-set of) the specification has been 102 implemented by multiple parties, thus indicating adoption by the 103 IETF community and the extent of feature coverage. 105 o evidence that each feature of the specification is sufficiently 106 well-described to support interoperability, as demonstrated 107 through testing and/or user experience with deployment. 109 In the case of a protocol specification, the notion of 110 "interoperability" is reasonably intuitive - the implementations must 111 successfully "talk to each other", while exercising all features and 112 options. To achieve interoperability, two implementors need to 113 interpret the protocol specifications in equivalent ways. In the 114 case of IP Performance Metrics (IPPM), this definition of 115 interoperability is only useful for test and control protocols like 116 the One-Way Active Measurement Protocol, OWAMP [RFC4656], and the 117 Two-Way Active Measurement Protocol, TWAMP [RFC5357]. 119 A metric specification RFC describes one or more metric definitions, 120 methods of measurement and a way to report the results of 121 measurement. One example would be a way to test and report the One- 122 way Delay that data packets incur while being sent from one network 123 location to another: the One-way Delay Metric. 125 In the case of metric specifications, the conditions that satisfy the 126 "interoperability" requirement are less obvious, and there was a need 127 for IETF agreement on practices to judge metric specification 128 "interoperability" in the context of the IETF Standards Process. 129 This memo provides methods which should be suitable to evaluate 130 metric specifications for standards track advancement. The methods 131 proposed here MAY be generally applicable to metric specification 132 RFCs beyond those developed under the IPPM Framework [RFC2330]. 134 Since many implementations of IP metrics are embedded in measurement 135 systems that do not interact with one another (they were built before 136 OWAMP and TWAMP), the interoperability evaluation called for in the 137 IETF standards process cannot be determined by observing that 138 independent implementations interact properly for various protocol 139 exchanges. Instead, verifying that different implementations give 140 statistically equivalent results under controlled measurement 141 conditions takes the place of interoperability observations. Even 142 when evaluating OWAMP and TWAMP RFCs for standards track advancement, 143 the methods described here are useful to evaluate the measurement 144 results because their validity would not be ascertained in typical 145 interoperability testing. 147 The standards advancement process aims at producing confidence that 148 the metric definitions and supporting material are clearly worded and 149 unambiguous, or at revealing ways in which the metric definitions can be 150 revised to achieve clarity. The process also permits identification 151 of options that were not implemented, so that they can be removed 152 from the advancing specification. Thus, the product of this process 153 is information about the metric specification RFC itself: 154 determination of the specifications or definitions that are clear and 155 unambiguous and those that are not (as opposed to an evaluation of 156 the implementations which assist in the process). 
158 This document defines a process to verify that implementations (or 159 practically, measurement systems) have interpreted the metric 160 specifications in equivalent ways, and produce equivalent results. 162 Testing for statistical equivalence requires ensuring identical test 163 setups (or awareness of differences) to the best possible extent. 164 Thus, producing identical test conditions is a core goal of the memo. 165 Another important aspect of this process is to test individual 166 implementations against specific requirements in the metric 167 specifications using customized tests for each requirement. These 168 tests can distinguish equivalent interpretations of each specific 169 requirement. 171 Conclusions on equivalence are reached by two measures. 173 First, implementations are compared against individual metric 174 specifications to make sure that differences in implementation are 175 minimised or at least known. 177 Second, a test setup is proposed ensuring identical networking 178 conditions so that unknowns are minimized and comparisons are 179 simplified. The resulting separate data sets may be seen as samples 180 taken from the same underlying distribution. Using state of the art 181 statistical methods, the equivalence of the results is verified. To 182 illustrate application of the process and methods defined here, 183 evaluation of the One-way Delay Metric [RFC2679] is provided in an 184 Appendix. While test setups will vary with the metrics to be 185 validated, the general methodology of determining equivalent results 186 will not. Documents defining test setups to evaluate other metrics 187 should be developed once the process proposed here has been agreed 188 and approved. 190 The metric RFC advancement process begins with a request for protocol 191 action accompanied by a memo that documents the supporting tests and 192 results. The procedures of [RFC2026] are expanded in [RFC5657], 193 including sample implementation and interoperability reports. 194 Section 3 of [morton-advance-metrics-01] can serve as a template for 195 a metric RFC report which accompanies the protocol action request to 196 the Area Director, including description of the test set-up, 197 procedures, results for each implementation and conclusions. 199 Changes from WG-01 to WG-02: 201 o Clarification of the number of test streams recommended in section 202 3.2. 204 o Clarifications on testing details in sections 3.3 and 3.4. 206 o Spelling corrections throughout. 208 Changes from WG -00 to WG -01 draft 210 o Discussion on merits and requirements of a distributed lab test 211 using only local load generators. 213 o Proposal of metrics suitable for tests using the proposed 214 measurement configuration. 216 o Hint on delay caused by software based L2TPv3 implementation. 218 o Added an appendix with a test configuration allowing remote tests 219 comparing different implementations across the network. 221 o Proposal for maximum error of "equivalence", based on performance 222 comparison of identical implementations. This may be useful for 223 both ADK and non-ADK comparisons. 225 Changes from prior ID -02 to WG -00 draft 227 o Incorporation of aspects of reporting to support the protocol 228 action request in the Introduction and section 3.5 230 o Overhaul of section 3.2 regarding tunneling: Added generic 231 tunneling requirements and L2TPv3 as an example tunneling 232 mechanism fulfilling the tunneling requirements. 
Removed and 233 adapted some of the prior references to other tunneling protocols 235 o Softened a requirement within section 3.4 (MUST to SHOULD on 236 precision) and removed some comments of the authors. 238 o Updated contact information of one author and added a new author. 240 o Added example C++ code of an Anderson-Darling two sample test 241 implementation. 243 Changes from ID -01 to ID -02 version 245 o Major editorial review, rewording and clarifications on all 246 contents. 248 o Additional text on parallel testing using VLANs and GRE or 249 Pseudowire tunnels. 251 o Additional examples and a glossary. 253 Changes from ID -00 to ID -01 version 255 o Addition of a comparison of individual metric implementations 256 against the metric specification (trying to pick up problems and 257 solutions for metric advancement [morton-advance-metrics]). 259 o More emphasis on the requirement to carefully design and document 260 the measurement setup of the metric comparison. 262 o Proposal of testing conditions under identical WAN network 263 conditions using IP in IP tunneling or Pseudo Wires and parallel 264 measurement streams. 266 o Proposing the requirement to document the smallest resolution at 267 which an ADK test was passed by 95%. As no minimum resolution is 268 specified, IPPM metric compliance is not linked to a particular 269 performance of an implementation. 271 o Reference to RFC 2330 and RFC 2679 for the 95% confidence interval 272 as preferred criterion to decide on statistical equivalence 274 o Reducing the proposed statistical test to ADK with 95% confidence. 276 1.1. Requirements Language 278 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 279 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 280 document are to be interpreted as described in RFC 2119 [RFC2119]. 282 2. Basic idea 284 The implementation of a standard compliant metric is expected to meet 285 the requirements of the related metric specification. So before 286 comparing two metric implementations, each metric implementation is 287 individually compared against the metric specification. 289 Most metric specifications leave freedom to implementors on non- 290 fundamental aspects of an individual metric (or options). Comparing 291 different measurement results using a statistical test with the 292 assumption of identical test path and testing conditions requires 293 knowledge of all differences in the overall test setup. Metric 294 specification options chosen by implementors have to be documented. 295 It is REQUIRED to use identical implementation options wherever 296 possible for any test proposed here. Calibrations proposed by metric 297 standards should be performed to further identify (and possibly 298 reduce) potential sources of errors in the test setup. 300 The Framework for IP Performance Metrics [RFC2330] expects that a 301 "methodology for a metric should have the property that it is 302 repeatable: if the methodology is used multiple times under identical 303 conditions, it should result in consistent measurements." This means 304 an implementation is expected to repeatedly measure a metric with 305 consistent results (repeatability with the same result). Small 306 deviations in the test setup are expected to lead to small deviations 307 in results only. To characterise statistical equivalence in the case 308 of small deviations, RFC 2330 and [RFC2679] suggest to apply a 95% 309 confidence interval. Quoting RFC 2679, "95 percent was chosen 310 because ... 
a particular confidence level should be specified so that 311 the results of independent implementations can be compared." 313 Two different implementations are expected to produce statistically 314 equivalent results if they both measure a metric under the same 315 networking conditions. Formulating in statistical terms: separate 316 metric implementations collect separate samples from the same 317 underlying statistical process (the same network conditions). The 318 statistical hypothesis to be tested is the expectation that both 319 samples do not expose statistically different properties. This 320 requires careful test design: 322 o The measurement test setup must be self-consistent to the largest 323 possible extent. To minimize the influence of the test and 324 measurement setup on the result, network conditions and paths MUST 325 be identical for the compared implementations to the largest 326 possible degree. This includes both the stability and non- 327 ambiguity of routes taken by the measurement packets. See RFC 328 2330 for a discussion on self-consistency. 330 o The error induced by the sample size must be small enough to 331 minimize its influence on the test result. This may have to be 332 respected, especially if two implementations measure with 333 different average probing rates. 335 o Every comparison must be repeated several times based on different 336 measurement data to avoid random indications of compatibility (or 337 the lack of it). 339 o To minimize the influence of implementation options on the result, 340 metric implementations SHOULD use identical options and parameters 341 for the metric under evaluation. 343 o The implementation with the lowest probing frequency determines 344 the smallest temporal interval for which samples can be compared. 346 The metric specifications themselves are the primary focus of 347 evaluation, rather than the implementations of metrics. The 348 documentation produced by the advancement process should identify 349 which metric definitions and supporting material were found to be 350 clearly worded and unambiguous, OR, it should identify ways in which 351 the metric specification text should be revised to achieve clarity 352 and unified interpretation. 354 The process should also permit identification of options that were 355 not implemented, so that they can be removed from the advancing 356 specification (this is an aspect more typical of protocol advancement 357 along the standards track). 359 Note that this document does not propose to base interoperability 360 indications of performance metric implementations on comparisons of 361 individual singletons. Individual singletons may be impacted by many 362 statistical effects while they are measured. Comparing two 363 singletons of different implementations may result in failures with 364 higher probability than comparing samples. 366 3. Verification of conformance to a metric specification 368 This section specifies how to verify compliance of two or more IPPM 369 implementations against a metric specification. This document only 370 proposes a general methodology. Compliance criteria to a specific 371 metric implementation need to be defined for each individual metric 372 specification. The only exception is the statistical test comparing 373 two metric implementations which are simultaneously tested. This 374 test is applicable without metric specific decision criteria. 
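As a non-normative illustration of the comparison just described, the following minimal C++ sketch aligns singletons collected by two implementations on identical time intervals and keeps only intervals containing at least five singletons (the rule of thumb given in Section 3.2), so that the per-interval values can afterwards be handed to the Anderson-Darling two-sample routine of Appendix B. The type and function names (Singleton, bucketMeans) and the 360-second interval length (taken from the Appendix A.2 example) are illustrative assumptions, not requirements of any metric specification.

   #include <cmath>
   #include <cstddef>
   #include <iostream>
   #include <map>
   #include <vector>

   struct Singleton {
       double seconds;  // capture time relative to the test start
       double value;    // measured result, e.g., one-way delay in microseconds
   };

   /* Group singletons into consecutive intervals of fixed length and
    * return the mean value per interval.  Intervals holding fewer than
    * minCount singletons are skipped (at least five singletons per
    * compared interval, see Section 3.2). */
   std::map<long, double> bucketMeans(const std::vector<Singleton>& s,
                                      double intervalSeconds,
                                      std::size_t minCount)
   {
       std::map<long, std::vector<double> > buckets;
       for (std::size_t i = 0; i < s.size(); ++i) {
           long idx = static_cast<long>(std::floor(s[i].seconds / intervalSeconds));
           buckets[idx].push_back(s[i].value);
       }
       std::map<long, double> means;
       for (std::map<long, std::vector<double> >::const_iterator it = buckets.begin();
            it != buckets.end(); ++it) {
           if (it->second.size() < minCount)
               continue;
           double sum = 0.0;
           for (std::size_t i = 0; i < it->second.size(); ++i)
               sum += it->second[i];
           means[it->first] = sum / it->second.size();
       }
       return means;
   }

   int main()
   {
       std::vector<Singleton> implA, implB;  // filled by the two measurement systems
       std::map<long, double> meansA = bucketMeans(implA, 360.0, 5);
       std::map<long, double> meansB = bucketMeans(implB, 360.0, 5);
       /* Only intervals present in both maps are comparable; the two
        * resulting value lists, sorted in ascending order, would then be
        * handed to the Anderson-Darling two-sample routine of Appendix B. */
       std::cout << meansA.size() << " / " << meansB.size()
                 << " candidate intervals" << std::endl;
       return 0;
   }

The sketch only prepares comparable samples; the pass/fail decision itself remains the 95% confidence criterion of the ADK test described in Section 3.1.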
376 Several testing options exist to compare two or more implementations: 378 o Use a single test lab to compare the implementations and emulate 379 the Internet with an impairment generator. 381 o Use a single test lab to compare the implementations and measure 382 across the Internet. 384 o Use remotely separated test labs to compare the implementations 385 and emulate the Internet with two "identically" configured 386 impairment generators. 388 o Use remotely separated test labs to compare the implementations 389 and measure across the Internet. 391 o Use remotely separated test labs to compare the implementations 392 and measure across the Internet and include a single impairment 393 generator to impact all measurement flows in a non-discriminatory 394 way. 396 The first two approaches work, but cause higher expenses than the 397 other ones (due to travel and/or shipping+installation). For the 398 third option, ensuring two identically configured impairment 399 generators requires well-defined test cases and possibly identical 400 hard- and software. >>>Comment: for some specific tests, impairment 401 generator accuracy requirements are less-demanding than others, and 402 in such cases there is more flexibility in impairment generator 403 configuration. <<< 405 It is a fair question whether the last two options can result in an 406 applicable test setup at all. While an experimental approach is 407 given in Appendix C, the trade-off that measurement packets of 408 different sites pass the same path segments, but in a different 409 order, probably can't be avoided. 411 The question of which option above results in identical networking 412 conditions and is broadly accepted can't be answered without more 413 practical experience in comparing implementations. The last proposal 414 has the advantage that, while the measurement equipment is remotely 415 distributed, a single network impairment generator and the Internet 416 can be used in combination to impact all measurement flows. 418 3.1. Tests of an individual implementation against a metric 419 specification 421 A metric implementation MUST support the requirements classified as 422 "MUST" and "REQUIRED" of the related metric specification to be 423 compliant with the latter. 425 Further, supported options of a metric implementation SHOULD be 426 documented in sufficient detail. The documentation of chosen options 427 is RECOMMENDED to minimise (and recognise) differences in the test 428 setup if two metric implementations are compared. Further, this 429 documentation is used to validate and improve the underlying metric 430 specification options, and to remove options which saw no implementation 431 or which are badly specified from the metric specification before it is 432 promoted to a standard. This documentation SHOULD be made for all 433 implementation-relevant specifications of a metric picked for a 434 comparison that are not explicitly marked as "MUST" or "REQUIRED" in 435 the RFC text. This applies for the following sections of all metric 436 specifications: 438 o Singleton Definition of the Metric. 440 o Sample Definition of the Metric. 442 o Statistics Definition of the Metric. As statistics are compared 443 by the test specified here, this documentation is required even in 444 the case that the metric specification does not contain a 445 Statistics Definition. 447 o Timing and Synchronisation related specification (if relevant for 448 the Metric). 
450 o Any other technical part present or missing in the metric 451 specification which is relevant for the implementation of the 452 Metric. 454 RFC2330 and RFC2679 emphasise precision as an aim of IPPM metric 455 implementations. A single IPPM conformant implementation MUST under 456 otherwise identical network conditions produce precise results for 457 repeated measurements of the same metric. 459 RFC 2330 prefers the "empirical distribution function" (EDF) to 460 describe collections of measurements. RFC 2330 states that 461 "unless otherwise stated, IPPM goodness-of-fit tests are done using 462 5% significance." The goodness of fit test determines the precision 463 with which two or more samples of a metric implementation belong to 464 the same underlying distribution (of measured network performance 465 events). The goodness of fit test to be applied is the Anderson- 466 Darling K sample test (ADK sample test, where K stands for the number of 467 samples to be compared) [ADK]. Please note that RFC 2330 and RFC 468 2679 apply an Anderson-Darling goodness of fit test too. 470 The results of a repeated test with a single implementation MUST pass 471 an ADK sample test with a confidence level of 95%. The resolution for 472 which the ADK test has been passed with the specified confidence 473 level MUST be documented. To formulate this differently: The 474 requirement is to document the smallest resolution at which the 475 results of the tested metric implementation pass an ADK test with a 476 confidence level of 95%. The minimum resolution available in the 477 reported results from each implementation MUST be taken into account 478 in the ADK test. 480 3.2. Test setup resulting in identical live network testing conditions 482 Two major issues complicate tests for metric compliance across live 483 networks under identical testing conditions. One is the general 484 point that metric definition implementations cannot be conveniently 485 examined in field measurement scenarios. The other one is more 486 broadly described as "parallelism in devices and networks", including 487 mechanisms like those that achieve load balancing (see [RFC4928]). 489 This section proposes two measures to deal with both issues. 490 Tunneling mechanisms can be used to avoid parallel processing of 491 different flows in the network. Measuring by separate parallel probe 492 flows results in repeated collection of data. If both measures are 493 combined, WAN network conditions are identical for a number of 494 independent measurement flows, no matter what the network conditions 495 are in detail. 497 Any measurement setup MUST be designed so that the probing traffic 498 itself does not impede the metric measurement. The created measurement 499 load MUST NOT result in congestion at the access link connecting the 500 measurement implementation to the WAN. The created measurement load 501 MUST NOT overload the measurement implementation itself, e.g., by 502 causing a high CPU load or by creating imprecisions due to internal 503 transmit (or receive, respectively) probe packet collisions. 505 Tunneling multiple flows reaching a network element on a single 506 physical port may allow all packets of the tunnel to be transmitted via the 507 same path. Applying tunnels to avoid undesired influence of standard 508 routing for measurement purposes is a concept known from the literature; 509 see, e.g., GRE encapsulated multicast probing [GU+Duffield]. 
An 510 existing IP in IP tunnel protocol can be applied to avoid Equal-Cost 511 Multi-Path (ECMP) routing of different measurement streams if it 512 meets the following criteria: 514 o Inner IP packets from different measurement implementations are 515 mapped into a single tunnel with single outer IP origin and 516 destination address as well as origin and destination port numbers 517 which are identical for all packets. 519 o An easily accessible commodity tunneling protocol allows to carry 520 out a metric test from more test sites. 522 o A low operational overhead may enable a broader audience to set up 523 a metric test with the desired properties. 525 o The tunneling protocol should be reliable and stable in set up and 526 operation to avoid disturbances or influence on the test results. 528 o The tunneling protocol should not incur any extra cost for those 529 interested in setting up a metric test. 531 An illustration of a test setup with two tunnels and two flows 532 between two linecards of one implementation is given in Figure 1. 534 Implementation ,---. +--------+ 535 +~~~~~~~~~~~/ \~~~~~~| Remote | 536 +------->-----F2->-| / \ |->---+ | 537 | +---------+ | Tunnel 1( ) | | | 538 | | transmit|-F1->-| ( ) |->+ | | 539 | | LC1 | +~~~~~~~~~| |~~~~| | | | 540 | | receive |-<--+ ( ) | F1 F2 | 541 | +---------+ | |Internet | | | | | 542 *-------<-----+ F2 | | | | | | 543 +---------+ | | +~~~~~~~~~| |~~~~| | | | 544 | transmit|-* *-| | | |--+<-* | 545 | LC2 | | Tunnel 2( ) | | | 546 | receive |-<-F1-| \ / |<-* | 547 +---------+ +~~~~~~~~~~~\ /~~~~~~| Router | 548 `-+-' +--------+ 550 Illustration of a test setup with two tunnels. For simplicity, only 551 two linecards of one implementation and two flows F between them are 552 shown. 554 Figure 1 556 Figure 2 shows the network elements required to set up GRE tunnels or 557 as shown by figure 1. 559 Implementation 561 +-----+ ,---. 562 | LC1 | / \ 563 +-----+ / \ +------+ 564 | +-------+ ( ) +-------+ |Remote| 565 +--------+ | | | | | | | | 566 |Ethernet| | Tunnel| |Internet | | Tunnel| | | 567 |Switch |--| Head |--| |--| Head |--| | 568 +--------+ | Router| | | | Router| | | 569 | | | ( ) | | |Router| 570 +-----+ +-------+ \ / +-------+ +------+ 571 | LC2 | \ / 572 +-----+ `-+-' 573 Illustration of a hardware setup to realise the test setup 574 illustrated by figure 1 with GRE tunnels or Pseudowires. 576 Figure 2 578 If tunneling is applied, two tunnels MUST carry all test traffic in 579 between the test site and the remote site. For example, if 802.1Q 580 Ethernet Virtual LANs (VLAN) are applied and the measurement streams 581 are carried in different VLANs, the IP tunnel or Pseudo Wires 582 respectively MUST be set up in physical port mode to avoid set up of 583 Pseudo Wires per VLAN (which may see different paths due to ECMP 584 routing), see RFC 4448. The remote router and the Ethernet switch 585 shown in figure 2 must support 802.1Q in this set up. 587 The IP packet size of the metric implementation SHOULD be chosen 588 small enough to avoid fragmentation due to the added Ethernet and 589 tunnel headers. Otherwise, the impact of tunnel overhead on 590 fragmentation and interface MTU size MUST be understood and taken 591 into account (see [RFC4459]). 593 An Ethernet port mode IP tunnel carrying several 802.1Q VLANs each 594 containing measurement traffic of a single measurement system was set 595 up as a proof of concept using RFC4719 [RFC4719], Transport of 596 Ethernet Frames over L2TPv3. 
Ethernet over L2TPv3 seems to fulfill 597 most of the desired tunneling protocol criteria mentioned above. 599 The following headers may have to be accounted for when calculating 600 total packet length, if VLANs and Ethernet over L2TPv3 tunnels are 601 applied: 603 o Ethernet 802.1Q: 22 Bytes. 605 o L2TPv3 Header: 4-16 Bytes for L2TPv3 data messages over IP; 16-28 606 Bytes for L2TPv3 data messages over UDP. 608 o IPv4 Header (outer IP header): 20 Bytes. 610 o MPLS Labels may be added by a carrier. Each MPLS Label has a 611 length of 4 Bytes. At the time of writing, between 1 and 4 Labels 612 seems to be a fair guess of what can be expected. 614 The applicability of one or more of the following tunneling protocols 615 may be investigated by interested parties if Ethernet over L2TPv3 is 616 felt to be unsuitable: IP in IP [RFC2003] or Generic Routing 617 Encapsulation (GRE) [RFC2784]. RFC 4928 [RFC4928] proposes measures 618 to avoid ECMP treatment in MPLS networks. 620 L2TP is a commodity tunneling protocol [RFC2661]. At the time of 621 writing, L2TPv3 [RFC3931] is the latest version of L2TP. If L2TPv3 is 622 applied, software based implementations of this protocol are not 623 suitable for the test setup, as such implementations may cause 624 incalculable delay shifts. 626 Ethernet Pseudo Wires may also be set up on MPLS networks [RFC4448]. 627 While there's no technical issue with this solution, MPLS interfaces 628 are mostly found in the network provider domain. Hence not all of 629 the above tunneling criteria are met. 631 Appendix C provides an experimental tunneling set up for metric 632 implementation testing between two (or more) remote sites. 634 Each test SHOULD be conducted multiple times. Sequential testing is 635 possible, but may not be a useful metric test option because WAN 636 conditions are likely to change over time. It is RECOMMENDED that 637 tests be carried out by establishing at least 2 different parallel 638 measurement flows. Two linecards per implementation that send and 639 receive measurement flows should be sufficient to create 4 parallel 640 measurement flows (when each card sends and receives 2 flows). Other 641 options are to separate flows by DiffServ marks (without deploying 642 any QoS in the inner or outer tunnel) or to use a single CBR flow and 643 evaluate every n-th singleton as belonging to a specific measurement 644 flow. 646 Some additional rules to calculate and compare samples have to be 647 observed to perform a metric test: 649 o Comparing different probes of a common underlying distribution in 650 terms of metrics characterising a communication network requires 651 respecting the temporal nature for which the assumption of a common 652 underlying distribution may hold. Any singletons or samples to be 653 compared MUST be captured within the same time interval. 655 o Whenever statistical events like singletons or rates are used to 656 characterise measured metrics of a time-interval, at least 5 657 singletons of a relevant metric SHOULD be present to ensure a 658 minimum confidence in the reported value (see Wikipedia on 659 confidence [Rule of thumb]). Note that this criterion is also to 660 be observed, e.g., when comparing packet loss metrics. Any packet 661 loss measurement interval to be compared with the results of 662 another implementation SHOULD contain at least five lost packets 663 to have a minimum confidence that the observed loss rate wasn't 664 caused by a small number of random packet drops. 
666 o The minimum number of singletons or samples to be compared by an 667 Anderson-Darling test SHOULD be 100 per tested metric 668 implementation. Note that the Anderson-Darling test detects small 669 differences in distributions fairly well and will fail for a high 670 number of compared results (RFC2330 mentions an example with 8192 671 measurements where an Anderson-Darling test always failed). 673 o Generally, the Anderson-Darling test is sensitive to differences 674 in the accuracy or bias associated with varying implementations or 675 test conditions. These dissimilarities may result in differing 676 averages of samples to be compared. An example may be different 677 packet sizes, resulting in a constant delay difference between 678 compared samples. Therefore samples to be compared by an 679 Anderson-Darling test MAY be calibrated by the difference of the 680 average values of the samples. Any calibration of this kind MUST 681 be documented in the test result. 683 3.3. Tests of two or more different implementations against a metric 684 specification 686 RFC2330 expects "a methodology for a given metric [to] exhibit 687 continuity if, for small variations in conditions, it results in 688 small variations in the resulting measurements. Slightly more 689 precisely, for every positive epsilon, there exists a positive delta, 690 such that if two sets of conditions are within delta of each other, 691 then the resulting measurements will be within epsilon of each 692 other." A small variation in conditions in the context of the metric 693 test proposed here can be seen as different implementations measuring 694 the same metric along the same path. 696 IPPM metric specifications, however, allow for implementor options to 697 the largest possible degree. It cannot be expected that two 698 implementors pick identical value ranges for the options of their 699 implementations. Implementors SHOULD, to the highest degree possible, 700 pick the same configurations for their systems when comparing their 701 implementations by a metric test. 703 In some cases, a goodness of fit test may not be possible or show 704 disappointing results. To clarify the difficulties arising from 705 different implementation options, the individual options picked for 706 every compared implementation SHOULD be documented in sufficient 707 detail. Based on this documentation, the underlying metric 708 specification should be improved before it is promoted to a standard. 710 The same statistical test that is applicable to quantify the precision of a 711 single metric implementation MUST be used to compare metric result 712 equivalence for different implementations. To document 713 compatibility, the smallest measurement resolution at which the 714 compared implementations passed the ADK sample test MUST be 715 documented. 717 For different implementations of the same metric, "variations in 718 conditions" are reasonably expected. The ADK test comparing samples 719 of the different implementations MAY result in a lower precision than 720 the test for precision in the same-implementation comparison. 722 3.4. Clock synchronisation 724 Clock synchronization effects require special attention. Accuracy of 725 one-way active delay measurements for any metric implementation 726 depends on clock synchronization between the source and destination 727 of tests. 
Ideally, one-way active delay measurement (RFC 2679 728 [RFC2679]) test endpoints either have direct access to independent 729 GPS or CDMA-based time sources or indirect access to nearby NTP 730 primary (stratum 1) time sources, equipped with GPS receivers. 731 Access to these time sources may not be available at all test 732 locations associated with different Internet paths, for a variety of 733 reasons out of scope of this document. 735 When secondary (stratum 2 and above) time sources are used with NTP 736 running across the same network, whose metrics are subject to 737 comparative implementation tests, network impairments can affect 738 clock synchronization, distort sample one-way values and their 739 interval statistics. It is RECOMMENDED to discard sample one-way 740 delay values for any implementation when one of the following 741 reliability conditions is met: 743 o Delay is measured and is finite in one direction, but not the 744 other. 746 o Absolute value of the difference between the sum of one-way 747 measurements in both directions and round-trip measurement is 748 greater than X% of the latter value. 750 Examination of the second condition requires RTT measurement for 751 reference, e.g., based on TWAMP [RFC5357], in 752 conjunction with one-way delay measurement. 754 Specification of X% to strike a balance between identification of 755 unreliable one-way delay samples and misidentification of reliable 756 samples under a wide range of Internet path RTTs probably requires 757 further study. 759 An implementation of an RFC that requires synchronized clocks is 760 expected to provide precise measurement results in order to claim 761 that the metric measured is compliant. 763 If an implementation publishes a specification of its precision, such 764 as "a precision of 1 ms (+/- 500 us) with a confidence of 95%", then 765 the specification SHOULD be met over a useful measurement duration. 766 For example, if the metric is measured along an Internet path which 767 is stable and not congested, then the precision specification SHOULD 768 be met over durations of an hour or more. 770 3.5. Recommended Metric Verification Measurement Process 772 In order to meet its obligations under the IETF Standards Process, 773 the IESG must be convinced that each metric specification advanced to 774 Draft Standard or Internet Standard status is clearly written, that 775 there are a sufficient number of verified equivalent 776 implementations, and that all options have been implemented. 778 In the context of this document, metrics are designed to measure some 779 characteristic of a data network. An aim of any metric definition 780 should be that it is specified in a way that allows the specific 781 characteristic to be measured reliably and repeatably across 782 multiple independent implementations. 784 Each metric, statistic or option of those to be validated MUST be 785 compared against a reference measurement or another implementation by 786 at least 5 different basic data sets, each one with sufficient size 787 to reach the specified level of confidence, as specified by this 788 document. 790 Finally, the metric definitions, embodied in the text of the RFCs, 791 are the objects that require evaluation and possible revision in 792 order to advance to the next step on the standards track. 
794 IF two (or more) implementations do not measure an equivalent metric 795 as specified by this document, 797 AND sources of measurement error do not adequately explain the lack 798 of agreement, 799 THEN the details of each implementation should be audited along with 800 the exact definition text, to determine if there is a lack of clarity 801 that has caused the implementations to vary in a way that affects the 802 correspondence of the results. 804 IF there was a lack of clarity or multiple legitimate interpretations 805 of the definition text, 807 THEN the text should be modified and the resulting memo proposed for 808 consensus and (possible) advancement along the standards track. 810 Finally, all the findings MUST be documented in a report that can 811 support advancement on the standards track, similar to those 812 described in [RFC5657]. The list of measurement devices used in 813 testing satisfies the implementation requirement, while the test 814 results provide information on the quality of each specification in 815 the metric RFC (the surrogate for feature interoperability). 817 The complete process of advancing a metric specification to a 818 standard as defined by this document is illustrated in Figure 3. 820 ,---. 821 / \ 822 ( Start ) 823 \ / Implementations 824 `-+-' +-------+ 825 | /| 1 `. 826 +---+----+ / +-------+ `.-----------+ ,-------. 827 | RFC | / |Check for | ,' was RFC `. YES 828 | | / |Equivalence.... clause x ------+ 829 | |/ +-------+ |under | `. clear? ,' | 830 | Metric \.....| 2 ....relevant | `---+---' +----+-----+ 831 | Metric |\ +-------+ |identical | No | |Report | 832 | Metric | \ |network | +--+----+ |results + | 833 | ... | \ |conditions | |Modify | |Advance | 834 | | \ +-------+ | | |Spec +--+RFC | 835 +--------+ \| n |.'+-----------+ +-------+ |request(?)| 836 +-------+ +----------+ 838 Illustration of the metric standardisation process 840 Figure 3 842 Any recommendation for the advancement of a metric specification MUST 843 be accompanied by an implementation report, as is the case with all 844 requests for the advancement of IETF specifications. The 845 implementation report needs to include the tests performed, the 846 applied test setup, the specific metrics in the RFC and reports of 847 the tests performed with two or more implementations. The test plan 848 needs to specify the precision reached for each measured metric and 849 thus define the meaning of "statistically equivalent" for the 850 specific metrics being tested. 852 Ideally, the test plan would co-evolve with the development of the 853 metric, since that's when people have the most context in their 854 thinking regarding the different subtleties that can arise. 856 In particular, the implementation report MUST as a minimum document: 858 o The metric compared and the RFC specifying it. This includes 859 statements as required by the section "Tests of an individual 860 implementation against a metric specification" of this document. 862 o The measurement configuration and setup. 864 o A complete specification of the measurement stream (mean rate, 865 statistical distribution of packets, packet size or mean packet 866 size and their distribution), DSCP and any other measurement 867 stream properties which could result in deviating results. 868 Deviations in results can be caused also if chosen IP addresses 869 and ports of different implementations can result in different 870 layer 2 or layer 3 paths due to operation of Equal Cost Multi-Path 871 routing in an operational network. 
873 o The duration of each measurement to be used for a metric 874 validation, the number of measurement points collected for each 875 metric during each measurement interval (i.e., the probe size) and 876 the level of confidence derived from this probe size for each 877 measurement interval. 879 o The result of the statistical tests performed for each metric 880 validation as required by the section "Tests of two or more 881 different implementations against a metric specification" of this 882 document. 884 o A parameterization of laboratory conditions and applied traffic 885 and network conditions allowing reproduction of these laboratory 886 conditions for readers of the implementation report. 888 o The documentation helping to improve metric specifications, as defined 889 by this section. 891 All of the tests for each set SHOULD be run in a test setup as 892 specified in the section "Test setup resulting in identical live 893 network testing conditions." 895 If a different test setup is chosen, it is RECOMMENDED to avoid 896 effects of real data networks (like parallelism in devices and networks) 897 which falsify the results of validation measurements. Data 898 networks may forward packets differently in the case of: 900 o Different packet sizes chosen for different metric 901 implementations. A proposed countermeasure is selecting the same 902 packet size when validating results of two samples or a sample 903 against an original distribution. 905 o Selection of differing IP addresses and ports used by different 906 metric implementations during metric validation tests. If ECMP is 907 applied on IP or MPLS level, different paths can result (note that 908 it may be impossible to detect an MPLS ECMP path from an IP 909 endpoint). A proposed countermeasure is to connect the 910 measurement equipment to be compared via a NAT device, or 911 to establish a single tunnel to transport all measurement traffic. 912 The aim is to have the same IP addresses and ports for all 913 measurement packets, or to avoid ECMP-based local routing diversion 914 by using a layer 2 tunnel. 916 o Different IP options. 918 o Different DSCP. 920 o If the N measurements are captured using sequential measurements 921 instead of simultaneous ones, then the following factors come into 922 play: time-varying paths and load conditions. 924 3.6. Miscellaneous 926 A minimum number of singletons per metric is required if results are 927 to be compared. To prevent accidental singletons from impacting a 928 metric comparison, a minimum number of 5 singletons per compared 929 interval was proposed above. Commercial Internet service is not 930 operated to reliably create enough rare events of singletons to 931 characterize bad measurement engineering or bad implementations. In 932 the case that a metric validation requires capturing rare events, an 933 impairment generator may have to be added to the test setup. 934 Inclusion of an impairment generator and the parameterisation of the 935 impairments generated MUST be documented. 937 A metric characterising a common impairment condition would be one 938 which, by expectation, creates a singleton result for each measured 939 packet. Delay or Delay Variation are examples of this type, and in 940 such cases, the Internet may be used to compare metric 941 implementations. 943 Rare events are those where, by expectation, no or only a rather low number 944 of "event is present" singletons are captured during a measurement 945 interval. 
Packet duplications, packet loss rates above one digit 946 percentages, loss patterns and packet reordering are examples. Note 947 especially that a packet reordering or loss pattern metric 948 implementation comparison may require a more sophisticated test setup 949 than described here. Spatial and temporal effects combine in the 950 case of packet re-ordering, and measurements with different packet 951 rates may always lead to different results. 953 As specified above, 5 singletons are the recommended basis to 954 minimise interference of random events with the statistical test 955 proposed by this document. In the case of ratio measurements (like 956 packet loss), the underlying sum of basic events, against which 957 the metric's monitored singletons are "rated", determines the 958 resolution of the test. A packet loss statistic with a resolution of 959 1% requires one packet loss statistic data point to consist of 500 960 delay singletons (of which at least 5 were lost). To compare EDFs on 961 packet loss requires one hundred such statistics per flow. That 962 means that, all in all, at least 50,000 delay singletons are required per 963 single measurement flow. Live network packet loss is assumed to be 964 present during main traffic hours only. Let this interval be 5 965 hours. The required minimum rate of a single measurement flow in 966 that case is 2.8 packets/sec (assuming a loss of 1% during 5 hours). 967 If this measurement is too demanding under live network conditions, 968 an impairment generator should be used. 970 3.7. Proposal to determine an "equivalence" threshold for each metric 971 evaluated 973 This section describes a proposal for maximum error of "equivalence", 974 based on performance comparison of identical implementations. This 975 comparison may be useful for both ADK and non-ADK comparisons. 977 Each metric is tested by two or more implementations (cross- 978 implementation testing). 980 Each metric is also tested twice simultaneously by the *same* 981 implementation, using different Src/Dst Address pairs and other 982 differences such that the connectivity differences of the cross- 983 implementation tests are also experienced and measured by the same 984 implementation. 986 Comparative results for the same implementation represent a bound on 987 cross-implementation equivalence. This should be particularly useful 988 when the metric does *not* produce a continuous distribution of 989 singleton values, such as with a loss metric or a duplication 990 metric. Appendix A indicates how the ADK will work for One-way 991 delay, and should be likewise applicable to distributions of delay 992 variation. 994 Proposal: the largest difference in 995 homogeneous comparison results among the implementations is the lower bound on the equivalence 996 threshold, noting that there may be other systematic errors to 997 account for when comparing between implementations. 999 Thus, when evaluating equivalence in cross-implementation results: 1001 Maximum_Error = Same_Implementation_Error + Systematic_Error 1003 and only the systematic error need be decided beforehand. 1005 In the case of ADK comparison, the largest same-implementation 1006 resolution of distribution equivalence can be used as a limit on 1007 cross-implementation resolutions (at the same confidence level). 1009 4. Acknowledgements 1011 Gerhard Hasslinger commented on a first version of this document and 1012 suggested statistical tests and the evaluation of time series 1013 information. 
Henk Uijterwaal and Lars Eggert have encouraged and 1014 helped to organize this work. Mike Hamilton, Scott Bradner, David 1015 McDysan and Emile Stephan commented on this draft. Carol Davids 1016 reviewed the 01 version of the ID before it was promoted to WG draft. 1018 5. Contributors 1020 Scott Bradner, Vern Paxson and Allison Mankin drafted bradner- 1021 metrictest [bradner-metrictest], and major parts of it are included 1022 in this document. 1024 6. IANA Considerations 1026 This memo includes no request to IANA. 1028 7. Security Considerations 1030 This draft does not raise any specific security issues. 1032 8. References 1034 8.1. Normative References 1036 [RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, 1037 October 1996. 1039 [RFC2026] Bradner, S., "The Internet Standards Process -- Revision 1040 3", BCP 9, RFC 2026, October 1996. 1042 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1043 Requirement Levels", BCP 14, RFC 2119, March 1997. 1045 [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, 1046 "Framework for IP Performance Metrics", RFC 2330, 1047 May 1998. 1049 [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, 1050 G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", 1051 RFC 2661, August 1999. 1053 [RFC2679] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1054 Delay Metric for IPPM", RFC 2679, September 1999. 1056 [RFC2680] Almes, G., Kalidindi, S., and M. Zekauskas, "A One-way 1057 Packet Loss Metric for IPPM", RFC 2680, September 1999. 1059 [RFC2681] Almes, G., Kalidindi, S., and M. Zekauskas, "A Round-trip 1060 Delay Metric for IPPM", RFC 2681, September 1999. 1062 [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. 1063 Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, 1064 March 2000. 1066 [RFC3931] Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling 1067 Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005. 1069 [RFC4448] Martini, L., Rosen, E., El-Aawar, N., and G. Heron, 1070 "Encapsulation Methods for Transport of Ethernet over MPLS 1071 Networks", RFC 4448, April 2006. 1073 [RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the- 1074 Network Tunneling", RFC 4459, April 2006. 1076 [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. 1077 Zekauskas, "A One-way Active Measurement Protocol 1078 (OWAMP)", RFC 4656, September 2006. 1080 [RFC4719] Aggarwal, R., Townsley, M., and M. Dos Santos, "Transport 1081 of Ethernet Frames over Layer 2 Tunneling Protocol Version 1082 3 (L2TPv3)", RFC 4719, November 2006. 1084 [RFC4928] Swallow, G., Bryant, S., and L. Andersson, "Avoiding Equal 1085 Cost Multipath Treatment in MPLS Networks", BCP 128, 1086 RFC 4928, June 2007. 1088 [RFC5657] Dusseault, L. and R. Sparks, "Guidance on Interoperation 1089 and Implementation Reports for Advancement to Draft 1090 Standard", BCP 9, RFC 5657, September 2009. 1092 8.2. Informative References 1094 [ADK] Scholz, F. and M. Stephens, "K-sample Anderson-Darling 1095 Tests of fit, for continuous and discrete cases", 1096 University of Washington, Technical Report No. 81, 1097 May 1986. 1099 [GU+Duffield] 1100 Gu, Y., Duffield, N., Breslau, L., and S. Sen, "GRE 1101 Encapsulated Multicast Probing: A Scalable Technique for 1102 Measuring One-Way Loss", SIGMETRICS'07 San Diego, 1103 California, USA, June 2007. 1105 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1106 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1107 RFC 5357, October 2008. 
1109 [Rule of thumb] 1110 Hardy, M., "Confidence interval", March 2010. 1112 [bradner-metrictest] 1113 Bradner, S., Mankin, A., and V. Paxson, "Advancement of 1114 metrics specifications on the IETF Standards Track", 1115 draft-bradner-metricstest-03, (work in progress), 1116 July 2007. 1118 [morton-advance-metrics] 1119 Morton, A., "Problems and Possible Solutions for Advancing 1120 Metrics on the Standards Track", draft-morton-ippm- 1121 advance-metrics-00, (work in progress), July 2009. 1123 [morton-advance-metrics-01] 1124 Morton, A., "Lab Test Results for Advancing Metrics on the 1125 Standards Track", draft-morton-ippm-advance-metrics-01, 1126 (work in progress), June 2010. 1128 Appendix A. An example of a One-way Delay metric validation 1130 The text of this appendix is not binding. It is an example of how parts 1131 of a One-way Delay metric test could look. 1134 A.1. Compliance to Metric specification requirements 1136 One-way Delay, Loss threshold, RFC 2679 1138 This test determines if implementations use the same configured 1139 maximum waiting time delay from one measurement to another under 1140 different delay conditions, and correctly declare packets arriving in 1141 excess of the waiting time threshold as lost. See Section 3.5 of 1142 RFC2679, 3rd bullet point and also Section 3.8.2 of RFC2679. 1144 (1) Configure a path with 1 sec one-way constant delay. 1146 (2) Measure one-way delay with 2 or more implementations, using 1147 identical waiting time thresholds for loss set at 2 seconds. 1149 (3) Configure the path with 3 sec one-way delay. 1151 (4) Repeat measurements. 1153 (5) Observe that the increase measured in step 4 caused all packets 1154 to be declared lost, and that all packets that arrive 1155 successfully in step 2 are assigned a valid one-way delay. 1157 One-way Delay, First-bit to Last-bit, RFC 2679 1159 This test determines if implementations register the same relative 1160 increase in delay from one measurement to another under different 1161 delay conditions. This test tends to cancel the sources of error 1162 which may be present in an implementation. See Section 3.7.2 of 1163 RFC2679, and Section 10.2 of RFC2330. 1165 (1) Configure a path with X ms one-way constant delay, and ideally 1166 including a low-speed link. 1168 (2) Measure one-way delay with 2 or more implementations, using 1169 identical options and equal size small packets (e.g., 100 octet 1170 IP payload). 1172 (3) Maintain the same path with X ms one-way delay. 1174 (4) Measure one-way delay with 2 or more implementations, using 1175 identical options and equal size large packets (e.g., 1500 octet 1176 IP payload). 1178 (5) Observe that the increase measured in steps 2 and 4 is 1179 equivalent to the increase in ms expected due to the larger 1180 serialization time for each implementation. Most of the 1181 measurement errors in each system should cancel, if they are 1182 stationary. 1184 One-way Delay, RFC 2679 1186 This test determines if implementations register the same relative 1187 increase in delay from one measurement to another under different 1188 delay conditions. This test tends to cancel the sources of error 1189 which may be present in an implementation. This test is intended to 1190 evaluate measurements in sections 3 and 4 of RFC2679. 1192 (1) Configure a path with X ms one-way constant delay. 1194 (2) Measure one-way delay with 2 or more implementations, using 1195 identical options. 
   (3)  Configure the path with X+Y ms one-way delay.

   (4)  Repeat measurements.

   (5)  Observe that the increase measured in steps 2 and 4 is ~Y ms
        for each implementation.  Most of the measurement errors in
        each system should cancel, if they are stationary.

   Error Calibration, RFC 2679

   This is a simple check to determine if an implementation reports the
   error calibration as required in Section 4.8 of RFC2679.  Note that
   the context (Type-P) must also be reported.

A.2.  Examples related to statistical tests for One-way Delay

   A one-way delay measurement may pass an ADK test with a timestamp
   resolution of 1 ms.  The same test may fail if timestamps with a
   resolution of 100 microseconds are evaluated.  The implementation is
   then conforming to the metric specification up to a timestamp
   resolution of 1 ms.

   Assume another one-way delay measurement comparison between
   implementation 1, probing with a frequency of 2 probes per second,
   and implementation 2, probing at a rate of 2 probes every 3 minutes.
   To ensure reasonable confidence in the results, sample metrics are
   calculated from at least 5 singletons per compared time interval.
   This means that sample delay values are calculated for each system
   over identical 6-minute intervals for the whole test duration.  Per
   6-minute interval, the sample metric is calculated from 720
   singletons for implementation 1 and from 6 singletons for
   implementation 2.  Note that if outliers are not filtered, moving
   averages are an option for the evaluation too.  The minimum shift of
   an averaging interval is three minutes in this example.

   The data in Table 1 may result from measuring One-Way Delay with
   implementation 1 (see column Implemnt_1) and implementation 2 (see
   column Implemnt_2).  Each data point in the table represents a
   (rounded) average of the sampled delay values per interval.  The
   resolution of the clock is one microsecond.  The difference in the
   delay values may result, e.g., from different probe packet sizes.

           +------------+------------+-----------------------------+
           | Implemnt_1 | Implemnt_2 | Implemnt_2 - Delta_Averages |
           +------------+------------+-----------------------------+
           |    5000    |    6549    |            4997             |
           |    5008    |    6555    |            5003             |
           |    5012    |    6564    |            5012             |
           |    5015    |    6565    |            5013             |
           |    5019    |    6568    |            5016             |
           |    5022    |    6570    |            5018             |
           |    5024    |    6573    |            5021             |
           |    5026    |    6575    |            5023             |
           |    5027    |    6577    |            5025             |
           |    5029    |    6580    |            5028             |
           |    5030    |    6585    |            5033             |
           |    5032    |    6586    |            5034             |
           |    5034    |    6587    |            5035             |
           |    5036    |    6588    |            5036             |
           |    5038    |    6589    |            5037             |
           |    5039    |    6591    |            5039             |
           |    5041    |    6592    |            5040             |
           |    5043    |    6599    |            5047             |
           |    5046    |    6606    |            5054             |
           |    5054    |    6612    |            5060             |
           +------------+------------+-----------------------------+

                                  Table 1

   Average values of sample metrics captured during identical time
   intervals are compared.  This excludes random differences caused by
   differing probing intervals or by differing temporal distances of
   singletons resulting from their Poisson-distributed sending times.

   In the example, 20 values have been picked (note that at least 100
   values are recommended for a single run of a real test).  The data
   must be ordered by ascending rank.  The data of Implemnt_1 and
   Implemnt_2 as shown in the first two columns of Table 1 clearly
   fails an ADK test with 95% confidence.
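   The comparison just described can also be carried out directly in
   code.  The following sketch is illustrative and not part of this
   specification: the function adk_2_sample() and the variable names
   are assumptions chosen here for clarity, but the statistic is
   computed with the same formula as the Appendix B fragment, and the
   acceptance criterion 1.993 is the 95% confidence threshold used in
   this example.  Consistent with the discussion above, the statistic
   for columns 1 and 2 of Table 1 is expected to exceed the criterion,
   i.e. the unshifted samples fail the test.

   #include <iostream>
   #include <set>
   #include <vector>

   /* Two-sample ADK statistic for samples s1 and s2; ties are handled
    * as in the discrete-case formula of [ADK] and Appendix B. */
   static double adk_2_sample(const std::vector<double>& s1,
                              const std::vector<double>& s2)
   {
       const double n1 = s1.size(), n2 = s2.size(), N = n1 + n2;
       std::set<double> z(s1.begin(), s1.end());  /* distinct values  */
       z.insert(s2.begin(), s2.end());            /* of pooled sample */

       double sum1 = 0.0, sum2 = 0.0;
       for (double zj : z) {
           double lt1 = 0, eq1 = 0, lt2 = 0, eq2 = 0;
           for (double v : s1) { if (v < zj) lt1++; else if (v == zj) eq1++; }
           for (double v : s2) { if (v < zj) lt2++; else if (v == zj) eq2++; }
           double hj  = eq1 + eq2;             /* pooled values equal to zj */
           double Hj  = lt1 + lt2 + hj / 2.0;  /* pooled values < zj + hj/2 */
           double F1j = lt1 + eq1 / 2.0;
           double F2j = lt2 + eq2 / 2.0;
           double denom = Hj * (N - Hj) - N * hj / 4.0;
           sum1 += hj * (N * F1j - n1 * Hj) * (N * F1j - n1 * Hj) / denom;
           sum2 += hj * (N * F2j - n2 * Hj) * (N * F2j - n2 * Hj) / denom;
       }
       const double k = 2.0;                   /* number of samples */
       return (N - 1.0) / (N * N * (k - 1.0)) * (sum1 / n1 + sum2 / n2);
   }

   int main()
   {
       /* Columns 1 and 2 of Table 1 (per-interval averages in us). */
       std::vector<double> impl1 = {5000, 5008, 5012, 5015, 5019, 5022,
                                    5024, 5026, 5027, 5029, 5030, 5032,
                                    5034, 5036, 5038, 5039, 5041, 5043,
                                    5046, 5054};
       std::vector<double> impl2 = {6549, 6555, 6564, 6565, 6568, 6570,
                                    6573, 6575, 6577, 6580, 6585, 6586,
                                    6587, 6588, 6589, 6591, 6592, 6599,
                                    6606, 6612};

       const double adk_criterium = 1.993;     /* 95% confidence level */
       double a = adk_2_sample(impl1, impl2);
       std::cout << "ADK statistic: " << a << " -> "
                 << (a <= adk_criterium ? "pass" : "fail") << std::endl;
       return 0;
   }

   The same program can be used for the mean-shifted comparison
   described below by replacing the values of impl2 with those of
   column 3 of Table 1.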
   The results of Implemnt_2 are now reduced by the difference of the
   averages of column 2 (rounded to 6581 us) and column 1 (rounded to
   5029 us), which is 1552 us.  The result may be found in column 3 of
   Table 1.  Comparing column 1 and column 3 of the table by an ADK
   test shows that the data contained in these columns passes an ADK
   test with 95% confidence.

   >>> Comment: Extensive averaging was used in this example because of
   the vastly different sampling frequencies.  As a result, the
   distributions compared do not exactly align with a metric in
   [RFC2679], but they illustrate the ADK process adequately.

Appendix B.  Anderson-Darling 2 sample C++ code

   /* Routines for computing the Anderson-Darling 2 sample
    * test statistic.
    *
    * Implemented based on the description in
    * "Anderson-Darling K Sample Test", Heckert, Alan and
    * Filliben, James, editors, Dataplot Reference Manual,
    * Chapter 15 Auxiliary, NIST, 2004.
    * Official reference by 2010:
    * Heckert, N. A. (2001). Dataplot website at the
    * National Institute of Standards and Technology:
    * http://www.itl.nist.gov/div898/software/dataplot.html/
    * June 2001.
    */

   /* Standard headers used by this code. */
   #include <iostream>
   #include <vector>

   using namespace std;

   /* vec1 and vec2 point to sample 1 and sample 2; they are to be
    * initialised with the sample values in ascending order, with a
    * dummy value at index 0. */
   vector<double> *vec1, *vec2;
   double adk_result;
   double adk_criterium = 1.993;

   /* example for iterating the vectors
    * for (vector<double>::iterator it = vec1->begin();
    *      it != vec1->end(); it++)
    * {
    *     cout << *it << endl;
    * }
    */

   static int k, val_st_z_samp1, val_st_z_samp2,
              val_eq_z_samp1, val_eq_z_samp2,
              j, n_total, n_sample1, n_sample2, L,
              max_number_samples, line, maxnumber_z;
   static int column_1, column_2;
   static double adk, n_value, z, sum_adk_samp1,
                 sum_adk_samp2, z_aux;
   static double H_j, F1j, hj, F2j, denom_1_aux, denom_2_aux;
   static bool next_z_sample2, equal_z_both_samples;
   static int stop_loop1, stop_loop2, stop_loop3, old_eq_line2,
              old_eq_line1;

   /* The statements below are wrapped in a function so that the
    * fragment compiles; the function name is illustrative and not
    * part of the original routine.  It is to be called after vec1
    * and vec2 have been set up. */
   static void compute_adk_2_sample(void)
   {
     k = 2;
     n_sample1 = vec1->size() - 1;
     n_sample2 = vec2->size() - 1;

     // -1 because vec[0] is a dummy value

     n_total = n_sample1 + n_sample2;

     /* index of the line with a value equal to zj in sample 1.
      * Here j=1, so the line is 1.
      */

     val_eq_z_samp1 = 1;

     /* index of the line with a value equal to zj in sample 2.
      * Here j=1, so the line is 1.
      */

     val_eq_z_samp2 = 1;

     /* index of the last line with a value < zj in sample 1.
      * Here j=1, so the line is 0.
      */

     val_st_z_samp1 = 0;

     /* index of the last line with a value < zj in sample 2.
      * Here j=1, so the line is 0.
      */

     val_st_z_samp2 = 0;

     sum_adk_samp1 = 0;
     sum_adk_samp2 = 0;
     j = 1;

     // as mentioned above, j=1

     equal_z_both_samples = false;
     next_z_sample2 = false;

     // assuming the next z to be of sample 1

     stop_loop1 = n_sample1 + 1;

     // + 1 because vec[0] is a dummy, see n_sample1 declaration

     stop_loop2 = n_sample2 + 1;
     stop_loop3 = n_total + 1;

     /* The required z values are calculated until all values
      * of both samples have been taken into account; see the
      * stop_loop values assigned above.  This construct is required
      * to avoid a mathematical operation in the while condition.
      */

     while (((stop_loop1 > val_eq_z_samp1)
             || (stop_loop2 > val_eq_z_samp2)) && stop_loop3 > j)
     {
       if (val_eq_z_samp1 < n_sample1 + 1)
       {

         /* here, a preliminary zj value is set.
          * See below how the actual zj is calculated.
          */

         z = (*vec1)[val_eq_z_samp1];

         /* this while sequence counts the number of values
          * equal to z.
          */
         while ((val_eq_z_samp1 + 1 < n_sample1)
                && z == (*vec1)[val_eq_z_samp1 + 1])
         {
           val_eq_z_samp1++;
         }
       }
       else
       {
         val_eq_z_samp1 = 0;
         val_st_z_samp1 = n_sample1;

         // this should be val_eq_z_samp1 - 1 = n_sample1
       }

       if (val_eq_z_samp2 < n_sample2 + 1)
       {
         z_aux = (*vec2)[val_eq_z_samp2];

         /* this while sequence counts the number of values
          * equal to z_aux.
          */

         while ((val_eq_z_samp2 + 1 < n_sample2)
                && z_aux == (*vec2)[val_eq_z_samp2 + 1])
         {
           val_eq_z_samp2++;
         }

         /* the smaller of the two actual data values is picked
          * as the next zj.
          */

         if (z > z_aux)
         {
           z = z_aux;
           next_z_sample2 = true;
         }
         else
         {
           if (z == z_aux)
           {
             equal_z_both_samples = true;
           }

           /* This is the case if the last value of column 1 is
            * smaller than the remaining values of column 2.
            */
           if (val_eq_z_samp1 == 0)
           {
             z = z_aux;
             next_z_sample2 = true;
           }
         }
       }
       else
       {
         val_eq_z_samp2 = 0;
         val_st_z_samp2 = n_sample2;

         // this should be val_eq_z_samp2 - 1 = n_sample2
       }

       /* in the following, the sum over j = 1 to L is calculated
        * for sample 1 and sample 2.
        */

       if (equal_z_both_samples)
       {

         /* hj is the number of values in the combined sample
          * equal to zj.
          */
         hj = val_eq_z_samp1 - val_st_z_samp1
              + val_eq_z_samp2 - val_st_z_samp2;

         /* H_j is the number of values in the combined sample
          * smaller than zj plus one half the number of values
          * in the combined sample equal to zj (that's hj/2).
          */

         H_j = val_st_z_samp1 + val_st_z_samp2
               + hj / 2;

         /* F1j is the number of values in the 1st sample which
          * are less than zj plus one half the number of values
          * in this sample which are equal to zj.
          */

         F1j = val_st_z_samp1 + (double)
               (val_eq_z_samp1 - val_st_z_samp1) / 2;

         /* F2j is the number of values in the 2nd sample which
          * are less than zj plus one half the number of values
          * in this sample which are equal to zj.
          */
         F2j = val_st_z_samp2 + (double)
               (val_eq_z_samp2 - val_st_z_samp2) / 2;

         /* Set the line of values equal to zj to the actual line
          * of the last value picked for zj of each sample.  This
          * is required as data smaller than zj is accounted
          * differently than values equal to zj.
          */
         val_st_z_samp1 = val_eq_z_samp1;
         val_st_z_samp2 = val_eq_z_samp2;

         /* next the lines of the next value z, i.e. zj+1,
          * are addressed.
          */

         val_eq_z_samp1++;
         val_eq_z_samp2++;
       }
       else
       {

         /* the smaller z value was contained in sample 2,
          * hence this value is the zj to base the following
          * calculations on.
          */
         if (next_z_sample2)
         {

           /* hj is the number of values in the combined sample
            * equal to zj; in this case these are within sample 2
            * only.
            */
           hj = val_eq_z_samp2 - val_st_z_samp2;

           /* H_j is the number of values in the combined sample
            * smaller than zj plus one half the number of values
            * in the combined sample equal to zj (that's hj/2).
            */

           H_j = val_st_z_samp1 + val_st_z_samp2
                 + hj / 2;

           /* F1j is the number of values in the 1st sample which
            * are less than zj plus one half the number of values
            * in this sample which are equal to zj.  As no value of
            * sample 1 equals zj here, this is val_st_z_samp1 only.
            */
           F1j = val_st_z_samp1;

           /* F2j is the number of values in the 2nd sample which
            * are less than zj plus one half the number of values
            * in this sample which are equal to zj.  The latter are
            * from sample 2 only in this case.
            */

           F2j = val_st_z_samp2 + (double)
                 (val_eq_z_samp2 - val_st_z_samp2) / 2;

           /* Set the line of values equal to zj to the actual line
            * of the last value picked for zj; of sample 2 only in
            * this case.
            */
           val_st_z_samp2 = val_eq_z_samp2;

           /* next the line of the next value z, i.e. zj+1, is
            * addressed.  Here, only sample 2 must be addressed.
            */

           val_eq_z_samp2++;
           if (val_eq_z_samp1 == 0)
           {
             val_eq_z_samp1 = stop_loop1;
           }
         }

         /* the smaller z value was contained in sample 1,
          * hence this value is the zj to base the following
          * calculations on.
          */

         else
         {

           /* hj is the number of values in the combined sample
            * equal to zj; in this case these are within sample 1
            * only.
            */
           hj = val_eq_z_samp1 - val_st_z_samp1;

           /* H_j is the number of values in the combined sample
            * smaller than zj plus one half the number of values
            * in the combined sample equal to zj (that's hj/2).
            */

           H_j = val_st_z_samp1 + val_st_z_samp2
                 + hj / 2;

           /* F1j is the number of values in the 1st sample which
            * are less than zj plus one half the number of values
            * in this sample which are equal to zj.  The latter are
            * from sample 1 only in this case.
            */

           F1j = val_st_z_samp1 + (double)
                 (val_eq_z_samp1 - val_st_z_samp1) / 2;

           /* F2j is the number of values in the 2nd sample which
            * are less than zj plus one half the number of values
            * in this sample which are equal to zj.  As no value of
            * sample 2 equals zj here, this is val_st_z_samp2 only.
            */

           F2j = val_st_z_samp2;

           /* Set the line of values equal to zj to the actual line
            * of the last value picked for zj; of sample 1 only in
            * this case.
            */

           val_st_z_samp1 = val_eq_z_samp1;

           /* next the line of the next value z, i.e. zj+1, is
            * addressed.  Here, only sample 1 must be addressed.
            */
           val_eq_z_samp1++;

           if (val_eq_z_samp2 == 0)
           {
             val_eq_z_samp2 = stop_loop2;
           }
         }
       }

       denom_1_aux = n_total * F1j - n_sample1 * H_j;
       denom_2_aux = n_total * F2j - n_sample2 * H_j;

       sum_adk_samp1 = sum_adk_samp1 + hj
                       * (denom_1_aux * denom_1_aux) /
                       (H_j * (n_total - H_j)
                        - n_total * hj / 4);
       sum_adk_samp2 = sum_adk_samp2 + hj
                       * (denom_2_aux * denom_2_aux) /
                       (H_j * (n_total - H_j)
                        - n_total * hj / 4);

       next_z_sample2 = false;
       equal_z_both_samples = false;

       /* index counting the z values; it is only required to
        * prevent the while loop from executing endlessly.
        */
       j++;
     }

     // calculating the adk value is the final step.

     adk_result = (double) (n_total - 1) / (n_total
                  * n_total * (k - 1))
                  * (sum_adk_samp1 / n_sample1
                     + sum_adk_samp2 / n_sample2);

     /* if (adk_result <= adk_criterium)
      *     the adk_2_sample test is passed
      */
   }

                                 Figure 4

Appendix C.  A tunneling set up for remote metric implementation
             testing

   For parties interested in testing metric compliance, it is most
   convenient if all involved parties can stay in their local test
   laboratories.  Figure 5 shows a test configuration which may enable
   remote metric compliance testing.

    +----+ +----+                             +----+ +----+
    |LC10| |LC11|          ,---.              |LC20| |LC21|
    +----+ +----+         /     \  +-------+  +----+ +----+
      | V10  | V11       /       \ | Tunnel|    | V20  | V21
      |      |          (         )| Head  |    |      |
    +--------+ +------+ |         || Router|__+----------+
    |Ethernet| |Tunnel| | Internet| +---B---+ |Ethernet  |
    |Switch  |-|Head  |-|         |           |Switch    |
    +-+--+---+ |Router| |         | +-------+ +--+--+----+
      |__|     +--A---+ (         )-|Option.|   |__|
                         \       /  |Impair.|
     Bridge               \     /   |Gener. |  Bridge
    V20 to V21             `-+-'    +-------+ V10 to V11

                                 Figure 5

   LC10, LC11, LC20 and LC21 identify measurement clients / line
   cards.  V10 and the other Vxx denote VLANs.  All VLANs use the same
   tunnel from A to B and in the reverse direction.  The remote-site
   VLANs are U-bridged at the local-site Ethernet switch.  The
   measurement packets of site 1 travel tunnel A->B first, are
   U-bridged at site 2 and travel tunnel B->A second.  Measurement
   packets of site 2 travel tunnel B->A first, are U-bridged at site 1
   and travel tunnel A->B second.  So all measurement packets pass the
   same tunnel segments, but in a different segment order.  An
   experiment to prove or reject the test setup shown in Figure 5 has
   been agreed between Deutsche Telekom and RIPE, but has not yet been
   scheduled.

   Figure 5 includes an optional impairment generator.  If this
   impairment generator is inserted in the IP path between the tunnel
   head-end routers, it impacts all measurement packets and flows
   equally.  This avoids the difficulty of ensuring an identical test
   setup by configuring two separate impairment generators identically
   (which was another proposal to allow remote metric compliance
   testing).

Appendix D.  Glossary

   +-------------+-----------------------------------------------------+
   | ADK         | Anderson-Darling K-Sample test, a test used to      |
   |             | check whether two samples have the same statistical |
   |             | distribution.                                        |
   | ECMP        | Equal Cost Multipath, a load balancing mechanism    |
   |             | evaluating MPLS label stacks, IP addresses and      |
   |             | ports.                                               |
   | EDF         | The "Empirical Distribution Function" of a set of   |
   |             | scalar measurements is a function F(x) which for    |
   |             | any x gives the fractional proportion of the total  |
   |             | measurements that were smaller than or equal to x.  |
   | Metric      | A measured quantity related to the performance and  |
   |             | reliability of the Internet, expressed by a value.  |
   |             | This could be a singleton (single value), a sample  |
   |             | of single values or a statistic based on a sample   |
   |             | of singletons.                                      |
   | OWAMP       | One-way Active Measurement Protocol, a protocol for |
   |             | communication between IPPM measurement systems      |
   |             | specified by IPPM.                                  |
   | OWD         | One-Way Delay, a performance metric specified by    |
   |             | IPPM.                                               |
   | Sample      | A sample metric is derived from a given singleton   |
   | metric      | metric by evaluating a number of distinct instances |
   |             | together.                                           |
   | Singleton   | A singleton metric is, in a sense, one atomic       |
   | metric      | measurement of this metric.                         |
   | Statistical | A 'statistical' metric is derived from a given      |
   | metric      | sample metric by computing some statistic of the    |
   |             | values defined by the singleton metric on the       |
   |             | sample.                                             |
   | TWAMP       | Two-way Active Measurement Protocol, a protocol for |
   |             | communication between IPPM measurement systems      |
   |             | specified by IPPM.                                  |
   +-------------+-----------------------------------------------------+

                                  Table 2

Authors' Addresses

   Ruediger Geib (editor)
   Deutsche Telekom
   Heinrich Hertz Str. 3-7
   Darmstadt, 64295
   Germany

   Phone: +49 6151 628 2747
   Email: Ruediger.Geib@telekom.de

   Al Morton
   AT&T Labs
   200 Laurel Avenue South
   Middletown, NJ 07748
   USA

   Phone: +1 732 420 1571
   Fax:   +1 732 368 1192
   Email: acmorton@att.com
   URI:   http://home.comcast.net/~acmacm/

   Reza Fardid
   Cariden Technologies
   888 Villa Street, Suite 500
   Mountain View, CA 94041
   USA

   Phone:
   Email: rfardid@cariden.com

   Alexander Steinmitz
   HS Fulda
   Marquardstr. 35
   Fulda, 36039
   Germany

   Phone:
   Email: steinionline@gmx.de