CODEC                                                          C. Hoene
Internet Draft                                   Universitaet Tuebingen
Intended status: Informational                             June 3, 2011
Expires: December 2011

     Measuring the Quality of an Internet Interactive Audio Codec
                    draft-hoene-codec-quality-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 3, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   The quality of a codec has to be measured by multiple parameters such as audio quality, speech quality, algorithmic efficiency, latency, coding rates, and their respective tradeoffs. During standardization, codecs are tested and evaluated multiple times to ensure a high quality outcome.

   As the upcoming Internet codec is likely to have unique features, there is a need to develop new quality testing procedures to measure these features.
Thus, this draft reviews existing methods on how to measure a codec's qualities, proposes a couple of new methods, and gives suggestions that may be used for testing the Internet Interactive Audio Codec (IIAC).

   This document is work in progress.

Conventions used in this document

   In this document, equations are written in LaTeX syntax. An equation starts with a dollar sign and ends with a dollar sign. The text in between is an equation following the notation of LaTeX2e. In the PDF version of this document, as a courtesy to its readers, all LaTeX equations are already rendered.

Table of Contents

   Conventions used in this document
   1. Introduction
   2. Optimization Goal
   3. Measuring Speech and Audio Quality
      3.1. Formal Subjective Tests
         3.1.1. ITU-R Recommendation BS.1116-1
         3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)
         3.1.3. ITU-T Recommendation P.800
         3.1.4. ITU-T Recommendation P.805
         3.1.5. ITU-T Recommendation P.880
         3.1.6. Formal Methods Used for Codec Testing at the ITU
      3.2. Informal Subjective Tests
      3.3. Interview and Survey Tests
      3.4. Web-based Testing
      3.5. Call Length and Conversational Quality
      3.6. Field Studies
      3.7. Objective Tests
         3.7.1. ITU-R Recommendation BS.1387-1
         3.7.2. ITU-T Recommendation P.862
         3.7.3. ITU-T Draft P.OLQA
   4. Measuring Complexity
      4.1. ITU-T Approaches to Measuring Algorithmic Efficiency
      4.2. Software Profiling
      4.3. Cycle Accurate Simulation
      4.4. Typical run time environments
   5. Measuring Latency
      5.1. ITU-T Recommendation G.114
      5.2. Discussion
   6. Measuring Bit and Frame Rates
   7. Codec Testing Procedures Used by Other SDOs
      7.1. ITU-T Recommendation P.830
      7.2. Testing procedure for the ITU-T G.719
   8. Transmission Channel
      8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission Performance over IP (11/2007)
      8.2. Draft G.1050 / TIA-921B
      8.3. Delay and Throughput Distributions on the Global Internet
      8.4. Transmission Variability on the Internet
      8.5. The Effects of Transport Protocols
      8.6. The Effect of Jitter Buffers and FEC
      8.7. Discussion
   9. Usage Scenarios
      9.1. Point-to-point Calls (VoIP)
      9.2. High Quality Interactive Audio Transmissions (AoIP)
      9.3. High Quality Teleconferencing
      9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)
      9.5. Music streaming
      9.6. Ensemble Performances over a Network
      9.7. Push-to-talk like Services (PTT)
      9.8. Discussion
   10. Recommendations for Testing the IIAC
      10.1. During Codec Development
      10.2. Characterization Phase
         10.2.1. Methodology
         10.2.2. Material
         10.2.3. Listening Laboratory
         10.2.4. Degradation Factors
      10.3. Application Developers
      10.4. Codec Implementers
      10.5. End Users
   11. Security Considerations
   12. IANA Considerations
   13. References
      13.1. Normative References
      13.2. Informative References
   14. Acknowledgments

1. Introduction

   The IETF Working Group CODEC is standardizing an Internet Interactive Audio and Speech Codec (IIAC). If the codec shall be of high quality, it is important to measure the codec's quality throughout the entire process of development, standardization, and usage. Thus, this document supports the standardization process by providing an overview of quality metrics, quality assessment procedures, and other quality control issues, and gives suggestions on how to test the IIAC.

   Quality must be measured by the following stakeholders and in the following phases of the codec's development:

   o Codec developers must decide on different algorithms or parameter sets during the development and enhancement of a codec. This might also include the selection among multiple codec candidates that implement different algorithms; however, the CODEC WG bases its work on a common consensus, not on a competitive selection of one of multiple codec contributions. Thus, measuring the quality of codecs in order to select one might not be required.
     Besides selection, one is obliged to debug the codec software. To find errors and bugs - and programming mistakes are present in any complex software - the developer has to test this software by conducting quality measurements.

   o Typically, codec standardization includes a qualification phase that measures the performance of a codec and verifies whether it conforms to predefined quality requirements. In the qualification phase, it becomes obvious whether the codec development and standardization have been successful. Again, in the process of rigorous testing during the qualification phase, algorithmic weaknesses and bugs in the implementation may be found.
     Still, in complex software such as the IIAC, correctness cannot be proved or guaranteed.

   o Users of the codec need to know how well the codec performs, while manufacturers need to decide whether to include the IIAC in their products. Quality measures play an important role in this decision process. Also, the numerous quality measurement results help developers of VoIP systems to dimension or tune their systems to take optimal advantage of a codec. For example, during network planning, operators can predict the amount of bandwidth needed for high quality voice calls.
     An adaptive VoIP application needs to know which quality is achieved with a given codec parameter set in order to make an optimal selection of the codec parameters under varying network conditions.
     As suggested in [50], an RTP payload specification for an IIAC codec should include a rate control. Similar to the performance of the codec, the rate control unit has a big impact on the overall quality of experience. Thus, it should be tested well, too.

   o Software implementers need to verify whether their particular codec implementation, which might be optimized for a specific platform, conforms to the standard's reference implementation. This is particularly important as some intellectual property rights might only be granted if the codec conforms to the standard.
     As the IIAC is not required to be bit-exact, which would allow simple comparisons of correctness, other means of conformance testing must be applied.
     In addition, the standard conformance and interoperability of multiple implementations must be checked.
     Last but not least, implementers may implement optimized concealment algorithms, jitter buffers, or other algorithms. Those algorithms have to be tested, too.

   o Since the success of MP3, end users acknowledge the existence of high quality codecs. It would make sense to use the IIAC in a brand marketing campaign (such as "Intel inside"). A quality comparison between the IIAC and other codecs might be part of the marketing. Online testing with user participation might also raise the awareness level.

   All those stakeholders might have different requirements regarding the codec's quality testing procedures. Thus, this document tries to identify those requirements and shows which of the existing quality measurement procedures can be applied to fulfill those specific demands efficiently.

   In the following section, we describe a primary optimization goal: Quality of Experience (QoE). Next, we briefly list the most common methods of performing subjective evaluations of speech and audio quality. In Sections 4, 5, and 6, we discuss how to measure complexity, latency, and bit and frame rates. Section 7 describes how other SDOs have measured the quality of their codecs. Compared to previously standardized codecs, the IIAC is likely to have different, unique requirements and thus needs newly developed quality testing procedures. To support this, we describe in Section 8 the properties of Internet transmission paths. Section 9 summarizes the usage scenarios for which the codec is going to be used, and finally, in Section 10, we recommend procedures on how to test the IIAC.

2. Optimization Goal

   The aim of the CODEC WG is to produce a codec of high quality.
However, how can quality be measured? The measurement of the features of a codec can be based on many different criteria, including complexity, memory consumption, audio quality, speech quality, and others. But in the end, it is the users' opinions that really count, since they are the customers. Thus, one important - if not the most important - quality measure of the IIAC shall be the Quality of Experience (QoE).

   The ITU-T Recommendation P.10/G.100 [22] defines the term "Quality of Experience" as "the overall acceptability of an application or service, as perceived subjectively by the end-user." The ITU-T document G.RQAM [21] extends this definition by noting that "quality of experience includes the complete end-to-end system effects (client, terminal, network, services infrastructure, etc.)" and that the "overall acceptability may be influenced by user expectations and context".

   These definitions already give guidelines on how to judge the quality of the IIAC:

   o The acceptability and the subjective quality impression of end users have to be measured (Section 3).

   o The IIAC codec has to be tested as part of an entire telecommunication system. It must be carefully considered whether to measure the codec's performance just in a stand-alone setup or to evaluate it as part of the overall system (Section 8).

   o The environments and contexts of particular communication scenarios have to be considered and controlled because they have an impact on human rating behavior and on quality expectations and requirements (Section 9).

3. Measuring Speech and Audio Quality

   The perceived quality of a service can be measured by various means. If humans are asked, those quality tests are called subjective. If the tests are conducted by instrumental means (such as an algorithm), they are called objective. Subjective tests are divided into formal and informal tests. Formal tests follow strictly defined procedures and methods and typically include a large number of subjects. Informal tests are less precise because they are conducted in an uncontrolled manner.

3.1. Formal Subjective Tests

   Formal subjective tests must follow a well-defined procedure. Otherwise, the results of multiple tests cannot be mutually compared and are not repeatable. Most subjective testing procedures have been standardized by the ITU. If applied to codec testing, the testing procedures follow the same pattern [26]:

      "Performing subjective evaluations of digital codecs proceeds via a number of steps:

      o Preparation of source speech materials, including recording of talkers;

      o Selection of experimental parameters to exercise the features of the codec that are of interest;

      o Design of the experiment;

      o Selection of a test procedure and conduct of the experiment;

      o Analysis of results."

   The ITU has standardized different formal subjective tests to measure the quality of speech and audio transmission, which are described in the following.

3.1.1. ITU-R Recommendation BS.1116-1

   The ITU-R BS.1116-1 standard [14] is suited for audio items with small degradations (stimuli) and uses a continuous scale from imperceptible (5.0) to very annoying (1.0). It is a double-blind triple-stimulus method with a hidden reference; in each trial, both the degraded sample and the hidden reference must be rated.
In a 30-minute session, 10-15 sample items can be judged. Overall, about 20 subjects shall rate the items. Testing shall take place with loudspeakers in a controlled environment or with headphones in a quiet room.

3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)

   The ITU-R BS.1534-1 standard [16] defines a method for the subjective assessment of intermediate quality levels. Multiple audio stimuli are compared at the same time. At most 12, but preferably only 8, stimuli plus a hidden reference and an anchor are compared and judged. MUSHRA uses a continuous quality scale (CQS) ranging from 0 to 100, divided into five equal intervals ("bad" to "excellent"). In 30 minutes, about 42 stimuli can be tested. Again, 20 test subjects shall rate the items with either headphones or loudspeakers.

   The standard recommends using as lower anchor a low-pass filtered version with a bandwidth limit of 3.5 kHz. Additional anchors are recommended, especially if specific distortions are to be tested.

3.1.3. ITU-T Recommendation P.800

   The ITU-T P.800 defines multiple testing procedures to assess the speech quality of telephone connections. The most important procedure measures the listening-only speech quality of telephone connections. Listeners rate short groups of unrelated sentences. The listeners are taken from the normal telephone-using population (no experts). They use a typical sending system (e.g., a local telephone) that may follow "modified IRS" frequency characteristics. The result is the listening-quality scale, which is an absolute category scale (ACS) ranging from excellent=5 to bad=1. Listeners can judge about 54 stimuli within 30 minutes.

   Other tests described in P.800 measure listening effort, loudness preference, conversation opinion and difficulty, detectability, degradation, or minimal differences.

3.1.4. ITU-T Recommendation P.805

   The P.805 standard [24] extends P.800 and defines precisely how to measure conversational quality. Subjects have to carry out conversation tests to evaluate the communication quality of a connection. Expert, experienced, or untrained (naive) subjects have to do these tests collaboratively in soundproof cabinets. Typically, 6 transmission conditions can be tested within 30 minutes. Depending on the required precision, these tests have to be repeated 20 to 40 times.

3.1.5. ITU-T Recommendation P.880

   To measure time-variable distortion, a continuous evaluation of speech quality has been defined in P.880 [31]. Subjects have to assess transmitted speech quality consisting of long speech sequences with quality/time fluctuations. The rating, on a continuous scale ranging from Excellent=5 to Bad=1, is adjusted dynamically over time while the stimuli are played. Stimuli have a length of between 45 seconds and 3 minutes.

3.1.6. Formal Methods Used for Codec Testing at the ITU

   In recent years, new narrowband and wideband codecs have been tested using ITU-T P.800 (and ITU-T P.830). For the ITU-T G.719 standard, which supports audio content in addition to speech, the ITU-R BS.1116-1 testing method was applied during the selection of potential codec candidates. During the qualification phase, the method used was ITU-R BS.1534-1. For the ITU-T G.718 codec, the Absolute Category Rating (ACR) following ITU-T P.800 was applied.
3.2. Informal Subjective Tests

   Besides formal tests, informal subjective tests following less stringent conditions might be conducted to judge the quality of stimuli. However, informal tests cannot be easily verified and lack the reliability, accuracy, and precision of formal tests. Informal tests are needed if the available number of subjects who are able to conduct the tests is low, or if time or money is limited.

3.3. Interview and Survey Tests

   In ITU-T P.800 [23] and [9], interview and survey tests are described. In P.800, it says that "if the rather large amount of effort needed is available and the importance of the study warrants it, transmission quality can be determined by 'service observations'."

   These service observations are based on statistical surveys common in social science and marketing research. Typically, the questions asked in a survey are structured.

   In addition, according to [23]: "To maintain a high degree of precision a total of at least 100 interviews per condition is required. A disadvantage of the service-observation method for many purposes is that little control is possible over the detailed characteristics of the telephone connections being tested."

3.4. Web-based Testing

   Given the large-scale proliferation of the Internet, researchers have suggested testing speech or audio quality on web sites via web site visitors [43]. A current web site that compares multiple audio codecs has been set up at SoundExpert.org [42]. On this web site, a user can download an audio item that consists of a reference item and a degraded item. Then, the user must identify the reference and rate the ODG (difference grade) of the degraded item. The tests are single-blind, as the user does not know which codec he is currently rating.

   One can anticipate that the visitors of web sites will use similar equipment for testing audio samples and for conducting VoIP calls. Thus, web site testing can be made realistic in a way that considers the impact of (typically used) loudspeakers and headphones.

   However, currently used web sites lack a proper identification of outliers. Thus, all ratings of all users are considered, despite the fact that they might be (deliberately) faked or that subjects might not be able to hear the acoustic difference well. Thus, one can expect that web-based ratings will show a high degree of variation and that many more tests are needed to achieve the same confidence that is gained in formal tests. A thorough scientific study on the quality of web-based audio rating has not yet been published. Thus, any statements on the validity of web-based rating are premature.

3.5. Call Length and Conversational Quality

   In the ETSI technical report ETR-250 [6], a model is presented that discusses various impairments caused in narrowband telephone systems. The ETSI model describes the combinatorial effect of all those impairments. The ETSI model later became the famous E-Model described in ITU-T G.107. Both the ETSI model and the E-Model calculate the R factor, which ranges from 0 (bad) to 100 (excellent conversational quality).

   Based on the R factor, the users' reaction to the voice transmission quality of a connection can be predicted. For example, Section 8.3 of [6] describes the effect that users terminate the call if the quality is bad.
More precisely, it summarizes this as users who "(i) terminate their calls unusually early, (ii) re-dial or even (iii) actually complain to the network operator".

   In the ETSI model, the percentage of users "terminating calls early", TME, is given as

   $TME=100\cdot erf\left(\frac{36-R}{16}\right)\%$

   with $erf(X)$ being the sigmoid-shaped Gaussian error function and $R$ the R factor of the E-Model (Figure 1). This relation is based on results from "AT&T Long toll" interviews as cited in [2].

   100 -+TME.                                  +- 5
        |..iii.                                |
      T |    .ii                               |
      e |      ii                           MOS|
      r |       i.                        .iiii|
      m  80 -+  .i.                   .ii.     |
      i |       .i                 .ii.        +- 4
      n |        i.              .i.           |   M
      a |         .i           .ii.            |   O
      t |          i.         .i.              |   S
      e  60 -+     .i        .i.               |   |
        |           i.      ii.                |   C
      E |            .i    .ii                 +- 3 Q
      a |             i.  .i.                  |   E
      r  40 -+         .i .i.                  |
      l |              i..i.                   |
      y |              .ii.                    |
        |              .il.                    |
      ( |             .i..i                    +- 2
      T  20 -+       .i.   i.                  |
      M |          .ii.     .i.                |
      E |        .ii.        .i.               |
      ) |     .ii.            .ii.             |
        |MOSlii.                 .iiiiiiiiiiiiiTME
      0 -+-----------------+-----------------+- 1
         |                 |                 |
         0                50               100

                        R Factor

   Figure 1 - Relation between calls terminating early (TME), the R
              Factor, and the speech quality given in MOS-CQE

   These findings have been confirmed by Holub et al. [12], who have studied the correlation between call length and narrowband speech quality. Birke et al. [1] have also studied the duration of phone calls, which varies with the time of day and the day of the week and may also be affected by pricing schemes.

   Whereas bad quality is related to short calls, it remains unproven whether better quality (>4 MOS) results in longer phone calls. There are two factors that might have opposite effects on the call length. On the one hand, if the quality is superb, the talkers might be more willing to talk because of the pleasure of talking; on the other hand, they might fulfill their conversational tasks faster because of the great quality. Thus, depending on the context, good speech quality might result either in longer or shorter calls.
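   For readers who want to experiment with this relation, the following C sketch evaluates the TME formula using erf() from the C99 math library. It is an illustration only; the function name and the clamping of negative values to zero are ours, not part of ETR-250.

      #include <math.h>    /* erf(), C99; link with -lm */
      #include <stdio.h>

      /* Percentage of users terminating their calls early:
       *    TME = 100 * erf((36 - R) / 16) %
       * For R > 36 the error function becomes negative; we clamp
       * the result to 0 (our choice, not part of ETR-250). */
      static double tme_percent(double r_factor)
      {
          double tme = 100.0 * erf((36.0 - r_factor) / 16.0);
          return tme > 0.0 ? tme : 0.0;
      }

      int main(void)
      {
          for (double r = 0.0; r <= 100.0; r += 20.0)
              printf("R = %5.1f  ->  TME = %6.2f %%\n",
                     r, tme_percent(r));
          return 0;
      }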
3.6. Field Studies

   Field studies can be conducted if usage data on calls is collected. Field studies are useful for monitoring real user behavior and for collecting data about the actual conversational context.

   Because of highly varying conditions, the precision of those measurements is low, and many tests have to be done to obtain significantly different measurement values. Also, the tests are not repeatable because the conditions change with time.

   For example, Skype has done quality tests in a deployed VoIP system in the field with its users as testers [47]. The subjective tests are done in the following manner:

   o Download of test vectors to VoIP clients. Typically, this can be done with an automated software update.

   o Delivery of changing VoIP configurations (such as the used codecs) so that different calls are subjected to different configurations. The selection of configurations can be done randomly, alternating in time, or based on other criteria.

   o Collecting feedback from the users. For example, the following parameters can be monitored or recorded:

      o The call length and other call-specific parameters

      o A user's quality voting (e.g., MOS-ACR) after the call

      o Other feedback from the user (e.g., via support channels)

   Field tests have the benefit of being conducted under real conditions with real users. However, they have some drawbacks. First, the experimental conditions cannot be controlled well. Second, the tests are only valid for the current situations and do not allow predictions for other use cases. Third, the statistical significance might be largely questionable if confidence intervals are overlapping.

   The costs for running the tests are low because the users are doing the tests for free. However, the operator might lose users after a user has experienced a test case causing bad quality.

3.7. Objective Tests

   Objective tests, also called instrumental tests, try to predict human rating behavior with mathematical models and algorithms. They calculate quality ratings for a given set of audio items. Naturally, they do not rate as precisely as the human counterparts whom they try to simulate. However, the results are repeatable and less costly to obtain than formal subjective testing campaigns. Instrumental methods have a limited precision. That means that their quality ratings do not perfectly match the results of formal listening-only tests. Typically, the formal results and the instrumental calculations are compared using a correlation function. The resulting metric is given as R, ranging from 0 (no correlation) to 1 (perfect match).

   Over the last years, several objective evaluation algorithms have been developed and standardized. We describe them briefly in the following.

3.7.1. ITU-R Recommendation BS.1387-1

   The ITU developed an algorithm called Perceptual Evaluation of Audio Quality (PEAQ). It was published in 1998 in ITU-R BS.1387, "Method for objective measurements of perceived audio quality" [15]. PEAQ is intended to predict the quality rating of low-bit-rate coded audio signals. Two different versions of PEAQ are provided: a basic version with lower computational complexity and an advanced version with higher computational complexity.

   PEAQ calculates a quality grading called "Objective Difference Grade" (ODG) ranging from 0 to -4. Typically, it shows a prediction quality of between R=0.85 and 0.97 when compared to subjective testing results. The ITU-T Study Group 12 assumes that PEAQ can detect audible differences between two implementations of the same codec [5].

3.7.2. ITU-T Recommendation P.862

   The ITU-T PESQ algorithm [27] is intended to judge distortions caused by narrowband speech codecs and other kinds of channel and transmission errors. These also include variable delays, filtering, and short localized distortions such as those caused by frame loss concealment. For a large number of conditions, the validity and precision of PESQ have been proven. For untested distortions, prior subjective tests must be conducted to verify whether PESQ judges these kinds of distortions precisely. Also, it is recommended to use PESQ for 3.1 kHz (narrow-band) handset telephony and narrow-band speech codecs only. For wide-band operations, a modified filter has to be applied prior to the tests.

   Furthermore, the ITU-T Recommendation P.862.1 [28] describes how to map PESQ's raw scores, which range from -0.5 to 4.5, to MOS-LQO values similar to those gathered from ACR ratings.
With this mapping, a large corpus of test samples shows a correlation of R=0.879 (instead of R=0.876) between subjective ratings and MOS-LQO (respectively PESQ raw) ratings. The ITU-T Recommendation P.862.2 [29] modifies the PESQ algorithm slightly to support wideband operations. And finally, the ITU-T Recommendation P.862.3 [30] gives detailed hints and recommendations on how and when to use the PESQ algorithms.

3.7.3. ITU-T Draft P.OLQA

   The soon-to-be standardized algorithm P.OLQA [40] extends PESQ and will be able to rate narrowband to super-wideband speech and the effect of time-varying speech playout. The latter distortions are common in modern VoIP systems, which stretch and shrink the speech playout during voice activity to adapt it to the delay process of the network.

4. Measuring Complexity

   Besides audio and speech quality, the complexity of a codec is of prime importance. Knowing the algorithmic efficiency is important because:

   . the complexity has an impact on power consumption and system costs,

   . the hardware can be selected to fit pre-known complexity requirements, and

   . different codec proposals can be compared if they show similar performances in other aspects.

   Before any complexity comparisons can be made, one has to agree on an objective, precise, reliable, and repeatable metric for measuring algorithmic efficiency. In the following, we list three different approaches.

4.1. ITU-T Approaches to Measuring Algorithmic Efficiency

   Over the last 17 years, the ITU-T Study Group 16 has measured the complexity of codecs using a library called ITU-T Basic Operators, described in ITU-T G.191 [19], which counts the kind and number of operations and the amount of memory used. The latest version of the standard supports both fixed-point operations of different widths and floating-point operations. Each operation can be counted automatically and weighted accordingly. The following source code is an [edited] excerpt from the source file baseop32.h:

      /* Prototypes for basic arithmetic operators */

      /* Short add,          1 */
      Word16 add (Word16 var1, Word16 var2);

      /* Short sub,          1 */
      Word16 sub (Word16 var1, Word16 var2);

      /* Short abs,          1 */
      Word16 abs_s (Word16 var1);

      /* Short shift left,   1 */
      Word16 shl (Word16 var1, Word16 var2);

      /* Short shift right,  1 */
      Word16 shr (Word16 var1, Word16 var2);

      ...

      /* Short division,    18 */
      Word16 div_s (Word16 var1, Word16 var2);

      /* Long norm,          1 */
      Word16 norm_l (Word32 L_var1);

   In the upcoming ITU-T G.GSAD standard, another approach has been used, as shown in the following code example. For each operation, WMOPS counting functions have been added, which count the number of operations. If the efficiency of an algorithm has to be measured, the program is started and the operations are counted for a known input length.

      for (i = 0; i < NB_BANDS; i++)  /* NB_BANDS: placeholder name
                                         for the original loop bound */
      {
          state_fx->band_enrg_long_fx[i] = 30;
          state_fx->band_enrg_fx[i] = 30;
          state_fx->band_enrg_bgd_fx[i] = 30;
          state_fx->min_band_enrg_fx[i] = 30;
      }
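   The counting idea behind these libraries can be illustrated in a few lines of C. The following sketch is not the G.191 code; it merely mimics its style: each helper increments a global counter by the operator's weight, so that running the program yields a weighted operation count (saturation and memory accounting are omitted).

      #include <stdio.h>

      typedef short Word16;

      /* Global weighted-operation counter, in the spirit of the
       * ITU-T G.191 basic operators (simplified illustration). */
      static unsigned long op_count = 0;

      /* Weight 1, like the basic operator "add" (no saturation). */
      static Word16 add(Word16 a, Word16 b)
      {
          op_count += 1;
          return (Word16)(a + b);
      }

      /* Weight 18, like the basic operator "div_s" (integer
       * division here, unlike the real fractional operator). */
      static Word16 div_s(Word16 a, Word16 b)
      {
          op_count += 18;
          return (Word16)(a / b);
      }

      int main(void)
      {
          Word16 acc = 0;
          int i;
          for (i = 1; i <= 100; i++)
              acc = add(acc, (Word16)i);  /* 100 weighted operations */
          acc = div_s(acc, 100);          /* 18 more */
          printf("mean = %d, weighted ops = %lu\n", acc, op_count);
          return 0;
      }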
4.2. Software Profiling

   The previously described methods are well-established procedures for measuring computational complexity. Still, they have some drawbacks:

   o Existing algorithms must be modified manually to include instructions that count arithmetic operations. In complex codecs, this may take substantial time.

   o The CPU model is simple, as it does not consider memory access (e.g., caches), parallel execution, or other kinds of optimization that are done in modern microprocessors and compilers. Thus, the number of instructions might not correlate with the actual execution time on modern CPUs.

   Thus, instead of counting instructions manually, run times of the codec can be measured on a real system. In software engineering, this is called profiling. The Wikipedia article on profiling [54] explains profiling as follows:

   "In software engineering, program profiling, software profiling or simply profiling, a form of dynamic program analysis (as opposed to static code analysis), is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

   o A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data

   o An instruction set simulator which is also - by necessity - a profiler, can measure the totality of a program's behaviour from invocation to termination."

   Thus, a typical profiler such as GNU gprof can be used to measure and understand the complexity of a codec implementation. This is realistic precisely because the measurement is taken on a modern computer. However, the execution times depend on the CPU architecture, the PC in general, the OS, and programs running in parallel.

   To ensure repeatable results, the execution environment (i.e., the computer) must be standardized. Otherwise, the run time results cannot be verified by other parties, as the results may differ if measured under slightly changed conditions.
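   In the same spirit, coarse run-time figures can be obtained without a full profiler by timing the encode routine directly. The following C sketch is illustrative; encode_frame() is a hypothetical stand-in for the codec under test, and the measured times are subject to the caveats above.

      #define _POSIX_C_SOURCE 199309L
      #include <stdio.h>
      #include <time.h>

      /* Hypothetical stand-in for the encoder under test. */
      static void encode_frame(const short *pcm, int samples)
      {
          (void)pcm; (void)samples;
      }

      int main(void)
      {
          enum { FRAME = 960, RUNS = 10000 };  /* e.g., 20 ms at 48 kHz */
          static short pcm[FRAME];
          struct timespec t0, t1;
          int i;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < RUNS; i++)
              encode_frame(pcm, FRAME);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("average encode time: %.3f us/frame\n",
                 1e6 * secs / RUNS);
          return 0;
      }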
4.3. Cycle Accurate Simulation

   If reliable and repeatable results are needed, another, similar approach can be chosen. Instead of run times, CPU clock cycles on a virtual reference system can be measured. Quoting Wikipedia again [52]:

   "A Cycle Accurate Simulator (CAS) is a computer program that simulates a microarchitecture cycle-accurate. In contrast an instruction set simulator simulates an Instruction Set Architecture usually faster but not cycle-accurate to a specific implementation of this architecture."

   With a cycle accurate simulator, the execution times are precise and repeatable for the system that is being studied. If two parties make measurements using different real computers, they still get the same results if they use the same CAS.

   A cycle accurate simulator is slower than the real CPU by a factor of about 100. Also, it might have a measurement error as compared to the simulated, real CPU because the CPU is typically not perfectly modeled.

   If an x86-64 architecture shall be simulated, the open-source cycle accurate simulator PTLsim can be considered [55]. PTLsim simulates a Pentium IV. On their website, the authors of PTLsim write:

   "PTLsim is a cycle accurate x86 microprocessor simulator and virtual machine for the x86 and x86-64 instruction sets. PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging from full-speed native execution on the host CPU all the way down to RTL level models of all key pipeline structures."

   Another cycle accurate simulator, FaCSIM, simulates the ARM9E-S processor core and the ARM926EJ-S memory subsystem [36]. It is also available as open source. Texas Instruments also provides a CAS for its C64x+ digital signal processor [44].

   To have a metric that is independent of a particular architecture, the results of cycle accurate simulators could be combined.

4.4. Typical run time environments

   The IIAC codec will run on various different platforms with quite diverse properties. After discussions on the WG mailing list, a few typical run time environments have been identified.

   Three of the run time environments are end devices (aka phones). The first one is a PC, either stationary or portable, having a >2 GHz CPU, >2 GByte of RAM, and a hard disk for permanent storage. Typically, a Windows, MacOS, or Linux operating system is running on a PC. The second one is a smartphone, for example with an ARM11 500 MHz CPU, 192 MByte RAM, and 256 MByte Flash ROM. An example is the HTC Dream smartphone equipped with a Qualcomm MSM7201A chip. Various operating systems are found on those devices, such as Symbian, Android, and iOS. The last one is a high end stationary VoIP phone with, for example, a 275-MHz MIPS32 CPU (400 DMIPS) and a 125-MHz (250 MIPS) ZSP DSP with dual MAC. Both cores have more than 1 MByte RAM and Flash ROM. An exemplary chip is the BCM1103 [3].

   Besides phones, VoIP gateways are frequently needed for conferencing or for transcoding to legacy VoIP or the PSTN. In this case, two different platforms have been identified. The first one is based on standard PC server platforms. It consists, for example, of an Intel six core Xeon 54XX or 55XX, two 1 Gbit/s NICs, 12 GByte RAM, hard disks, and a Linux operating system. Such a server can serve from 400 to 10000 calls, depending on the conference mode, the codecs used, and the ability to use pre-encoded audio [46]. On the other hand, high density, highly optimized voice gateways use a special purpose hardware platform, for example TNETV3020 chips consisting of six TI C64x+ DSPs with 5.5 MB internal RAM. If they run a Telogy conference engine, they might serve about 1300 AMR or 3000 G.711 calls per chip [45].

5. Measuring Latency

   Latency is a measure of the time delay experienced in a system. Latency can be measured as one-way delay or as round-trip time. The latter is the one-way latency from source to destination plus the one-way latency back from destination to source. Latency can be measured at multiple positions, at the network layer or at higher layers [53].

   As we aim to increase the Quality of Experience, the mouth-to-ear delay is of importance because it directly correlates with perceptual quality [17]. More precisely, the acoustic round-trip time shall be a means of optimization when studying interactive and conversational application scenarios.

5.1. ITU-T Recommendation G.114

   The G.114 standard [18] gives guidelines on how to estimate one-way transmission delays. It describes how the delay introduced by the codec is generated.
Because most encoders process audio in frames, the duration of a frame (the "frame size") is the foremost contributor to the overall algorithmic delay. Citing [18]:

   "In addition, many coders also look into the succeeding frame to improve compression efficiency. The length of this advance look is known as the look-ahead time of the coder. The time required to process an input frame is assumed to be the same as the frame length since efficient use of processor resources will be accomplished when an encoder/decoder pair (or multiple encoder/decoder pairs operating in parallel on multiple input streams) fully uses the available processing power (evenly distributed in the time domain). Thus, the delay through an encoder/decoder pair is normally assumed to be:"

   $2*frameSize + lookAhead$

   In addition, if the link speeds are low, the serialization delay might contribute significantly to the codec delay.

   Also, if IP transmissions are used and multiple frames are concatenated in one IP packet, further delay is added. Then, "the minimum delay attributable to codec-related processing in IP-based systems with multiple frames per packet is:"

   $(N+1)*frameSize + lookAhead$

   "where N is the number of frames in each packet."
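   The following C sketch illustrates the G.114 reasoning above; the function and parameter names are ours. With one frame per packet, the formula reduces to the classic 2*frameSize + lookAhead.

      #include <stdio.h>

      /* Minimum codec-related delay in ms, following the G.114
       * reasoning quoted above:
       *    N frames per packet: (N+1)*frameSize + lookAhead
       * With N = 1 this reduces to 2*frameSize + lookAhead. */
      static double codec_delay_ms(double frame_ms, double lookahead_ms,
                                   int frames_per_packet)
      {
          return (frames_per_packet + 1) * frame_ms + lookahead_ms;
      }

      int main(void)
      {
          /* Example: 20 ms frames, 5 ms look-ahead (arbitrary values). */
          printf("1 frame/packet:  %.1f ms\n",
                 codec_delay_ms(20.0, 5.0, 1));  /* 45.0 */
          printf("3 frames/packet: %.1f ms\n",
                 codec_delay_ms(20.0, 5.0, 3));  /* 85.0 */
          return 0;
      }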
5.2. Discussion

   Extensive discussion on the WG mailing list led to the insight that the aforementioned ITU delay model overestimates the delay introduced by the codec. In the last decade, two developments have led to slightly different conditions.

   First, the processing power of CPUs has increased significantly (see Section 4.4). Nowadays, even stand-alone VoIP phones have CPUs with a speed of 300 MHz. They are capable of encoding and decoding faster than real time. Thus, the delay introduced by processing is no longer at 100% of the frame length but significantly lower; for example, it might be just 10% or less.

   Second, even if the CPUs are fully loaded, especially if other tasks such as a video conference or other calls need to be processed as well, advanced scheduling algorithms allow for timely encoding and decoding. For example, a staggered processing schedule can be used to reduce processing delays [45].

   Thus, the impact of processing delay is reduced significantly in most cases.

   Moreover, besides the look-ahead time, the decoder might also contribute to the algorithmic delay, e.g., if decoded and concealed periods shall be mixed well.

6. Measuring Bit and Frame Rates

   For decades, there has been a quest to achieve high quality while keeping the coding rate low. The coding rate, sometimes called the multimedia bit rate, is the bit rate that an encoder produces as its output stream. In the case of variable rate encoding, the coding bit rate differs over time. Thus, one has to describe the coding rate statistically. For example, minimal, mean, and maximal coding rates need to be measured.

   A second parameter is the frame rate, as the encoder produces frames at a given rate. Again, in the case of discontinuous transmission (DTX), the frame rate can vary, and a statistical description is required.

   Both coding and frame rate influence network-related bit rates. For example, the physical layer gross bit rate is the total number of physically transferred bits per second over a communication link, including useful data as well as protocol overhead [51]. It depends on the access technology, the packet rate, and the packet sizes. The physical layer net bit rate is measured in a similar way but excludes the physical layer protocol overhead. The network throughput is the maximal throughput of a communication link of an access network. Finally, the goodput or data transfer rate refers to the net bit rate delivered to an application, excluding all protocol headers, data link layer retransmissions, etc. Typically, to avoid packet losses or queuing delay, the goodput shall be as large as the coding rate.

   The relation between goodput and the physical layer gross bit rate is not trivial. First of all, the goodput is measured end-to-end. The end-to-end path can consist of multiple physical links, each having a different overhead. Second, the overhead of physical layers may vary with time and load, depending for example on link utilization and link quality. Third, packets may be tunneled through the network, and additional headers (such as IPsec) might be added. Fourth, IP header compression might be applied (as in LTE networks), and the overhead might be reduced. Overall, much information about the network connection must be collected to predict the relation between the physical layer gross bit rate and a given coding and frame rate. Applications, which have only a limited view of the network, can hardly know the precise relation.

   For example, the DCCP TFRC-SP transport protocol simply estimates a header size on data packets of 36 bytes (20 bytes for the IPv4 header and 16 bytes for the DCCP-Data header with 48-bit sequence numbers) [7][8]. Further, [11] suggested a typical scenario in which one encoded frame is transmitted over the RTP, UDP, IPv4, and IEEE 802.3 protocols, so that each packet contains packet headers of 12 bytes, 8 bytes, 20 bytes, and 18 bytes, respectively. The gross bit rate is then calculated as

   $r_{gross}=r_{coding}+overhead \cdot framerate$

   where $r_{coding}$ is the coding rate of the encoding, $framerate$ is the frame rate of the codec, $overhead$ is the number of bits for protocol headers in each packet (typically 58*8=464), and $r_{gross}$ is the rate used on the physical medium.
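   As an illustration of this formula, the following C sketch computes the gross bit rate for the RTP/UDP/IPv4/IEEE 802.3 scenario cited above; the function and parameter names are ours.

      #include <stdio.h>

      /* r_gross = r_coding + overhead * framerate, with the overhead
       * given in bits per packet. 58 bytes corresponds to RTP(12) +
       * UDP(8) + IPv4(20) + IEEE 802.3(18) headers, as cited above. */
      static double gross_rate_bps(double coding_bps, double frames_per_s,
                                   int overhead_bytes)
      {
          return coding_bps + 8.0 * overhead_bytes * frames_per_s;
      }

      int main(void)
      {
          /* Example: 32 kbit/s coding rate, one 20-ms frame per packet
           * (i.e., 50 packets/s); the values are arbitrary. */
          double gross = gross_rate_bps(32000.0, 50.0, 58);
          printf("gross rate: %.1f kbit/s\n", gross / 1000.0); /* 55.2 */
          return 0;
      }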
7. Codec Testing Procedures Used by Other SDOs

   To ensure quality, each newly standardized codec is rigorously tested. ITU-T Study Groups 12 and 16 have developed very good and mature procedures for testing codecs. The ITU-T Study Group 12 has described the testing procedures for narrowband and wideband codecs in the ITU-T P.830 standard.

7.1. ITU-T Recommendation P.830

   The ITU-T P.830 recommendation describes methods and procedures for conducting subjective performance evaluations of digital speech codecs. For most applications, it recommends the Absolute Category Rating (ACR) method using the Listening Quality scale. The process of judging the quality of a speech codec consists of five steps, which are described in the following.

   Step 1: Preparation of Source Speech Materials Including Recording of Talkers. When testing a narrowband codec, the recommendation suggests using a bandwidth filter before applying sample items to a codec. This bandwidth filter is called the modified Intermediate Reference System (IRS) and limits the frequency band to the range between 300 and 3400 Hz. In addition, the recommendation states that "if a wideband system (100-7000 Hz) is to be used for audio-conferencing, then the sending end should conform to IEC Publication 581.7."

   It also says that "speech material should consist of simple, short, meaningful sentences." The sentences shall be understandable to a broad audience, and sample items should consist of two or three sentences, each of them having a duration of between 2 and 3 seconds. Sample items should not contain noise or reverberations longer than 500 ms. The recommendation also makes suggestions on the loudness of the signal: "A typical nominal value for mean active speech level (measured according to Recommendation P.56) is -20 dBm0, corresponding to approximately -26 dBov".

   Step 2: Selection of Experimental Parameters to Exercise the Features of the Codec That Are of Interest. Various parameters shall be tested. Those include:

   o Codec conditions

      o Speech input levels ("input levels of 14, 26 and 38 dB below the overload point of the codec")

      o Listening levels ("levels should lie 10 dB to either side of the preferred listening level")

   o Talkers

      . Different talkers ("a minimum of two male and two female talkers")

      . Multiple talkers ("multiple simultaneous voice input signals")

   o Errors ("randomly distributed bit errors" or burst errors)

   o Bit rates ("The codec must be tested at all the bit rates")

   o Transcodings ("Asynchronous tandeming", "Synchronous tandeming", and "Interoperability with other speech coding standards")

   o Mismatch (sender and receiver operate in different modes)

   o Environmental noise (sending) ("30 dB for room noise" and "10 dB and 20 dB for vehicular noise")

   o Network information signals ("signaling tones, conforming to Recommendation Q.35, should be tested subjectively, and the minimum should be proceed to dial tone, called subscriber ringing tone, called subscriber engaged tone, equipment engaged tone, [and] number unobtainable tone.")

   o Music ("to ensure that the music is of reasonable quality")

   o Reference conditions ("for making meaningful comparisons")

      o Direct (no coding, only input and output filtering)

      o Modulated Noise Reference Unit (MNRU)

      o Signal-to-Noise Ratio (SNR) (for comparison purposes)

      o Reference codecs

   Step 3: Design of the Experiment. The considerations described in B.3/P.80 apply here. Typically, it is not possible to test each combination of parameters. Thus, Recommendation P.830 states that "it is recommended that a minimum set of experiments be conducted, which, although they would not cover every combination, would result in sufficient data to make sensible decisions. [...] Extreme caution should be used when comparing systems with widely differing degradations, e.g. digital codecs, frequency division multiplex systems, vocoders, etc., even within the same test."

   Step 4: Selection of a Test Procedure and Conduct of the Experiment. Here, the considerations of B.4/P.80 apply. However, a modified IRS at the receiver shall be used (narrowband) or an IEC Publication 581.7 filter (wideband). Also, "Gaussian noise equivalent to -68 dBmp should be added at the input to the receiving system to reduce noise contrast effects at the onset of speech utterances."

   Step 5: Analysis of Results. Again, the considerations detailed in B.4.7/P.80 apply. The arithmetic mean (over subjects) is to be calculated for each condition at each listening level.
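   In its simplest form, this analysis step amounts to averaging the votes per condition. The following C sketch (an illustrative helper, not taken from P.830 or P.80) computes the arithmetic mean and a normal-approximation 95% confidence interval for one condition's ratings.

      #include <math.h>
      #include <stdio.h>

      /* Arithmetic mean and normal-approximation 95% confidence
       * interval (1.96 * standard error) for one condition's votes.
       * Illustrative analysis helper, not taken from P.830/P.80. */
      static void mos_stats(const double *votes, int n,
                            double *mean, double *ci95)
      {
          double sum = 0.0, var = 0.0;
          int i;
          for (i = 0; i < n; i++) sum += votes[i];
          *mean = sum / n;
          for (i = 0; i < n; i++)
              var += (votes[i] - *mean) * (votes[i] - *mean);
          var /= (n - 1);                 /* sample variance */
          *ci95 = 1.96 * sqrt(var / n);
      }

      int main(void)
      {
          const double votes[] = { 4, 3, 4, 5, 3, 4, 4, 2, 4, 3, 5, 4 };
          double mean, ci;
          mos_stats(votes, 12, &mean, &ci);
          printf("MOS = %.2f +/- %.2f\n", mean, ci);
          return 0;
      }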
7.2. Testing procedure for the ITU-T G.719

   Recently, the ITU-T has standardized the audio and speech codec ITU-T G.719. The G.719 has similar properties to the anticipated IIAC; thus, the optimization and characterization of the G.719 is of particular interest.

   In the following, we describe the "Quality Assessment Test Plan" in TD 322 and 323 [33][35]. The ITU Study Group 16 used ITU-R BS.1116 to test sample items. Audio sample items were sampled at 48 kHz and mixed down to mono. Speech sample items contained one sentence with a duration of 4 s, mixed content had a duration of 5-6 s, and music a duration of between 10 and 15 s. The beginnings and endings of the samples were smoothed. Also, a filter was applied to limit the nominal bandwidth of the input signal to the range of 20 to 20000 Hz. As mixed content, advertisements, film trailers, and news (including a jingle) were selected. For music items, classical and modern styles of music were selected. Besides the codec under test, test stimuli degraded with LAME MP3 and G.722 were added to the tests. Some test stimuli were modified to include reverberations or an interfering talker and office noise. Some tests studied the effect of a frame erasure rate of 3% with random loss patterns. All listening labs used different sample items, and attention was paid to not using the same material twice.

   Listening labs were required to provide the results of 24 experienced listeners, excluding those listeners who did not pass a pre- and post-screening. The experienced listeners should "neither have a background in technical implementations of the equipment under test nor do they have detailed knowledge of the influence of these implementations on subjective quality".

   During the tests, "circumaural headphones (open back, for example: STAX Signature SR-404 or Sennheiser HD-600) on both ears (diotic presentation)" were used. The listening levels were -26 dB relative to OVL.

   Some results of the listening tests are given in TD 341 R1 [34]. In those tests, the subjective ratings made following BS.1116 were also compared with the objective ratings of ITU-R BS.1387-1. The correlation between objective and subjective ratings was below R=0.9.

8. Transmission Channel

   Between the speech encoder and decoder lies a transmission channel that affects the transmission. For cellular or wireless phones, the typical transmission channel is assumed to be equal to the wireless link(s). This typically means that a circuit-switched link is assumed (e.g., in GSM, UMTS, DECT). The bandwidth is typically constant in DECT and GSM, or variable in a given range depending on the quality of the wireless transmission (UMTS). Bit errors do occur, but they are not equally distributed if unequal bit error protection is applied (UMTS).

   In the case of the IIAC codec, the transmission channel is the Internet. More precisely, it is the packet transmission over the Internet, plus the transport protocol (e.g., UDP, TCP, DCCP), plus potentially Forward Error Correction, plus dejittering buffers.

   Also, the transmission channel is reactive.
It changes its properties depending on how much data is transmitted. For example, parallel TCP flows reduce their transmission bandwidth in the presence of an unresponsive UDP stream.

   Overall, one can say that the transmission channel "Internet" is difficult to understand. Thus, in this chapter, we try to shed light on the question of what types of transmission channels a codec has to cope with.

8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission Performance over IP (11/2007)

   The current ITU-T G.1050 standard [20] describes layer 3 packet transmission models that can be used to evaluate IP applications. The models are of a statistical nature. They consider network architectures, types of access links, QoS-controlled edge routing, MTU sizes, network faults, link failures, route flapping, reordered packets, packet loss, one-way delay, variable delays, and background traffic.

   G.1050 is a network model consisting of three parts: LAN a, LAN b, and an interconnection core. Both LANs can have different rates and occupancies and can be of different types. LAN and core are connected via access technologies, which might vary in data rate, occupancy, and MTU size.

   The core is characterized by route flapping, link failures, one-way delay, jitter, packet loss, and reordered packets. Route flaps are repeated changes of the transmission path caused by alternating routing tables. These routing updates cause incremental changes in the transmission delays. A link failure is a period of consecutive packet loss. Packet losses can be bursty, having a high loss rate during bursts and a lower loss rate otherwise. Delays are modeled via multiple different jitter models supporting delay spikes, random jitter, and filtered random jitter.

   The standard recommends three profiles, named "Well-managed IP network", "Partially-managed IP network", and "Unmanaged IP Network, Internet", which differ in their connection qualities.

   Limitations of these models are the missing cross-correlation between packet delays and packet loss events, the lack of responsiveness to the test application's flow, and the lack of link qualities that vary with time.
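   Such bursty loss behavior is often approximated with a two-state Gilbert-Elliott model. The following C sketch is such an approximation for illustration only; it is not the G.1050 model itself, and all probabilities are made-up example values.

      #include <stdio.h>
      #include <stdlib.h>

      /* Two-state Gilbert-Elliott loss model: a GOOD state with a low
       * loss probability and a BAD (burst) state with a high one.
       * Illustrative approximation of bursty loss, not G.1050. */
      struct ge_model {
          double p_good_to_bad, p_bad_to_good;  /* transition probs */
          double loss_good, loss_bad;           /* per-state loss probs */
          int bad;                              /* current state */
      };

      static int ge_packet_lost(struct ge_model *m)
      {
          double u = rand() / (RAND_MAX + 1.0);
          if (m->bad) { if (u < m->p_bad_to_good) m->bad = 0; }
          else        { if (u < m->p_good_to_bad) m->bad = 1; }
          return (rand() / (RAND_MAX + 1.0))
                 < (m->bad ? m->loss_bad : m->loss_good);
      }

      int main(void)
      {
          /* Example parameters (made up): rare bursts with 30% loss. */
          struct ge_model m = { 0.01, 0.30, 0.005, 0.30, 0 };
          int lost = 0, n = 100000;
          for (int i = 0; i < n; i++)
              lost += ge_packet_lost(&m);
          printf("overall loss rate: %.2f %%\n", 100.0 * lost / n);
          return 0;
      }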
[Figure 2 is a plot of the probability density of the achieved
per-flow throughput on a logarithmic axis from 0.1 to 10000 kbps,
with one curve for eDonkey flows and one for HTTP flows.]

Figure 2 Achieved throughput of flows measured for eDonkey and HTTP
applications [37]

Figure 2 displays the throughput distribution of TCP connections for
eDonkey peer-to-peer and HTTP applications. It only considers single
flows with a length of more than 50 Kbyte. Typically, however, a web
browser uses two to three TCP connections at the same time and an
eDonkey client about 10. Still, the throughput of a single HTTP flow
is about an order of magnitude higher than that of an eDonkey flow.
In [37], the authors assume this is due to the fact that peer-to-
peer connections fill the uplink and that HTTP is used on the faster
downlink.

[Figure 3 is a plot of the probability density of TCP roundtrip
times on a logarithmic axis from 10 to 10000 ms.]

Figure 3 TCP roundtrip times [36]

Figure 3 displays TCP roundtrip times including both access and
backbone networks. Both graphs can be seen as an indication for the
assumption that an application, even in modern Internet access
networks, might be subjected to a wide variability of throughput
ranging from a few kbit/s up to 10 Gbit/s and to TCP round trip
times from 5 ms up to several seconds.

Although these results are only valid for TCP, similar results
should be expected for RTP over UDP - with a small advantage because
UDP flows are not always responsive.

In summary, a codec for the Internet should be able to work under
these widely varying transmission conditions and should be tested
against a wide distribution of expected throughputs.
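As an illustration of what "a wide distribution of expected
throughputs" might mean for a test harness, the following minimal
Python sketch draws transmission conditions from wide, heavy-tailed
distributions. The log-normal shape and all parameters are
illustrative assumptions only; they are not fitted to the
measurements of [36][37].

<CODE BEGINS>
# Minimal sketch: draw test conditions from heavy-tailed
# distributions spanning several orders of magnitude.  The
# log-normal parameters are illustrative assumptions, NOT values
# fitted to the measurements cited above.
import random

def sample_condition(rng):
    # Throughput in kbps: wide log-normal (median ~150 kbps here).
    throughput_kbps = rng.lognormvariate(mu=5.0, sigma=2.5)
    # Round-trip time in ms: log-normal (median ~55 ms here).
    rtt_ms = rng.lognormvariate(mu=4.0, sigma=1.0)
    return throughput_kbps, rtt_ms

rng = random.Random(42)          # fixed seed for repeatable tests
conditions = [sample_condition(rng) for _ in range(100)]
for tp, rtt in conditions[:3]:
    print("throughput %10.1f kbps, RTT %7.1f ms" % (tp, rtt))
<CODE ENDS>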
8.4. Transmission Variability on the Internet

Besides effects such as route flapping or link failures modeled in
G.1050 [20], the Internet experiences sharp changes in bandwidth
utilization on short time scales. For example, [49] and [38] showed
that the variability of Internet traffic comes in the form of spike-
like traffic increments. Similarly, [32] studied why Internet
traffic is bursty on time scales of between 100 and 1000
milliseconds.

In the light of these results, one can assume that the IIAC's
transmission conditions will vary on similar time scales. More
precisely, it will be subjected to

. variability due to bursty traffic having a duration of between
  100 and 1000 milliseconds,

. interruptions due to temporary link failures every minute to
  every hour that might last from 64 ms to several seconds [20],
  and

. route flap events every minute to every hour that add a delay of
  between 2 and 128 ms [20].

8.5. The Effects of Transport Protocols

Realtime multimedia is not always transported over RTP and UDP.
Sometimes it makes sense to use a different transport protocol or an
additional rate adaptation. The reasons for that are manifold.

. If a scalable codec shall be supported, RTCP-based feedback
  information can be utilized to implement a rate control mechanism
  [41]. However, RTCP-based feedback suffers from the drawback that
  RTCP messages are allowed only every 5 s. Thus, implementing a
  fast responding mechanism is not possible.

. In the presence of restrictive firewalls, VoIP can sometimes only
  be transmitted over TCP. In those cases, the transmission
  scheduling is not given by the codec but by TCP. TCP algorithms
  typically do not have a smooth sending rate but frequently send
  packets in bursts and change the number of packets sent every
  round trip time (Figure 4; a simulation sketch of this behavior
  follows after Figure 5). More precisely, TCP causes the sending
  schedule to behave in the following way:

  . During the Slow Start phase (for example at the beginning of a
    TCP connection), the transmission rate increases exponentially.

  . If a TCP segment is not acknowledged after about four RTTs, the
    TCP sending rate starts at one packet per RTT again.

  . During congestion avoidance, the sending rate increases
    steadily by one segment per RTT.

  . If a congestion event is detected, the sending rate is reduced
    by 50%.

[Figure 4 is a plot of the TCP sending rate in packets per RTT over
time (0 to 60 RTTs): the rate repeatedly climbs from one packet per
RTT in a sawtooth pattern up to about 15 packets per RTT and then
drops.]

Figure 4 Sending rate of a standard TCP over time

. The DCCP transport protocol supports multiple congestion control
  protocols and provides means to support TCP friendliness without
  retransmission. Thus, it is suitable for realtime multimedia
  transmissions. DCCP supports a TCP emulation, which shows a
  similar rate over time as TCP, and the TFRC congestion control,
  which changes its rate in a smoother way (Figure 5).
  Besides TFRC, which is intended for packets of maximal size (aka
  MTU), TFRC-SP is optimized for flows with variable packet sizes
  such as VoIP. With TFRC-SP, smaller packets can be transmitted at
  a faster pace than larger packets because they contribute less to
  the gross bandwidth consumption.
  The TFRC protocol might provide a lower bandwidth and thus a
  lower QoE than UDP or TCP unless proper optimizations are applied
  (see [48]). Also, it is suggested to limit the rate control to
  100 packets per second. This limit might be too low for an IIAC.

[Figure 5 is a plot of the TFRC sending rate in packets per RTT over
time (0 to 60 RTTs): after an initial slow start, the rate
oscillates smoothly around a level of about 10 packets per RTT.]

Figure 5 Sending rate of the TFRC protocol
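To make the sawtooth of Figure 4 concrete, the following minimal
Python sketch simulates the idealized TCP behavior enumerated in the
list above (slow start, congestion avoidance, halving on congestion,
restart after a timeout). It is a toy model for generating test
schedules, not an implementation of any real TCP stack; the
congestion and timeout events in the example are arbitrary
assumptions.

<CODE BEGINS>
# Toy model of the idealized TCP sending rate per RTT, as
# described in the bullet list above.  Purely illustrative; the
# loss/timeout events below are arbitrary assumptions.
def tcp_rate_per_rtt(rtts, losses=(), timeouts=(), ssthresh=8.0):
    rates = []
    cwnd = 1.0
    slow_start = True
    for t in range(rtts):
        rates.append(int(cwnd))
        if t in timeouts:
            # Segment not acknowledged: restart at one packet/RTT.
            ssthresh = max(2.0, cwnd / 2)
            cwnd, slow_start = 1.0, True
        elif t in losses:
            # Congestion event detected: reduce the rate by 50%.
            cwnd = max(1.0, cwnd / 2)
            slow_start = False
        elif slow_start:
            # Slow start: exponential increase up to ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
            if cwnd >= ssthresh:
                slow_start = False
        else:
            # Congestion avoidance: plus one segment per RTT.
            cwnd += 1.0
    return rates

# Example: congestion at RTTs 12 and 22, a timeout at RTT 17.
print(tcp_rate_per_rtt(30, losses={12, 22}, timeouts={17}))
<CODE ENDS>

Such a schedule can drive a channel simulator that tells the codec,
RTT by RTT, how many packets it may send.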
In general, the transport protocol has a clear influence on the
transmission conditions. Coding rates need to be adapted both
sharply and smoothly to changing bandwidth estimates. Changes of the
bandwidth estimate may occur every RTT. Also, in case of a TCP
timeout, the transmission is halted and the decoding must be
stalled.

8.6. The Effect of Jitter Buffers and FEC

Both jitter buffers and FEC trade frame losses against delay. In the
case of a jitter buffer, frames are delayed before playout. This
helps with late-arriving frames that would otherwise be ignored and
would have to be concealed. Jitter buffers are adaptive and adjust
dynamically to the current delay and loss process on the Internet.

Forward Error Correction helps to cope with isolated losses, as
redundant speech frames are transmitted in the following packets. In
the presence of loss, FEC increases the delay because the receiver
has to wait for the following packets. Both delay and packet losses
are important contributors to the overall Quality of Experience [2].

Since the delay process on the Internet often follows a gamma
distribution, a statistical monitor of past delays helps to predict
the size of future jitter. Then, if the playout schedule does not
match the predicted loss process, playout can be accelerated or
slowed down.
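As an illustration of such a statistical monitor, the following
Python sketch estimates the playout delay as a high percentile of
recently observed network delays. The window size, target
percentile, and initial guess are assumptions chosen for
illustration; a real jitter buffer would additionally handle
reordering, clock skew, and the concealment cases discussed below.

<CODE BEGINS>
# Minimal sketch of a percentile-based playout-delay estimator.
# Window size and target percentile are illustrative assumptions,
# not values taken from any standard.
from collections import deque

class PlayoutDelayEstimator:
    def __init__(self, window=500, percentile=0.99):
        self.delays = deque(maxlen=window)  # recent one-way delays, ms
        self.percentile = percentile

    def observe(self, delay_ms):
        self.delays.append(delay_ms)

    def playout_delay(self):
        # Play out late enough that only (1 - percentile) of frames
        # miss their deadline and must be concealed.
        if not self.delays:
            return 60.0                     # arbitrary initial guess
        ordered = sorted(self.delays)
        idx = min(len(ordered) - 1,
                  int(self.percentile * len(ordered)))
        return ordered[idx]

est = PlayoutDelayEstimator()
for d in (42.0, 45.5, 41.0, 95.0, 44.2):    # made-up delay samples
    est.observe(d)
print("buffer target: %.1f ms" % est.playout_delay())
<CODE ENDS>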
However, due to the reasons described in Section 8.4, not all
increments in transmission time might be predictable. This has a
profound effect on the jitter buffer, as it cannot predict well
whether a frame is lost or whether it is merely delayed. If a frame
is scheduled for playout but has not been received, the jitter
buffer has to consider two cases. First, the frame is lost and has
to be concealed. This typically means that the audio signal needs to
be extrapolated or interpolated to conceal the gap due to the lost
frame. Second, the frame is delayed and shall be played out at a
later point in time. Then, the resulting gap in playout must be
concealed by extrapolating the previous audio signal.

These issues have an effect on testing the concealment algorithm of
the codec. The same concealment function must be tested against both
time gap concealment and loss concealment.

8.7. Discussion

Judging a codec's performance using a realistic model of a
transmission channel is difficult. Good models of IP transmission
channels are available. However, before a codec can be tested
against those channels, further building blocks such as the
transport protocol, the jitter buffer, and FEC should be known - at
least roughly.

Alternatively, a codec can be tested against packet loss patterns
only, without considering any rate adaptation or playout
rescheduling. But then again, the codec should additionally be
tested for those impairments which occur due to the dynamics of the
Internet. These include

o slowing down and speeding up the playout in cases of moderate
  rescheduling of playout times,

o stalling and resuming the playout in cases of temporary link
  outages,

o moderately reducing and increasing bit and frame rates during
  contention periods,

o sharply reducing (in case of congestion) and quickly increasing
  (during connection establishment) bit and frame rates, and

o time gap and loss concealment.

9. Usage Scenarios

Quality of Experience is the service quality perceived subjectively
by end-users (refer to Section 2), and as the ITU-T document G.RQAM
[21] states, the "overall acceptability may be influenced by user
expectations and context". Thus, in this section we describe the
usage scenarios in which the IIAC codec will probably be used and
the expectations users have in those communication contexts. We list
seven main scenarios and describe their quality requirements.

9.1. Point-to-point Calls (VoIP)

The classic scenario is that of phone usage, to which we will refer
in this document as Voice over IP (VoIP). Human speech is
transmitted interactively between two Internet hosts. Typically,
besides speech, some background noise is present, too.

The quality of a telephone call is traditionally judged by
subjective tests such as those described in [24]. The ACR scale used
for MOS-LQS sometimes might not be very suitable for high quality
calls; then - for example - the MUSHRA [16] rating can be applied.

A telephone call is considered good if it has a maximal mouth-to-ear
delay of 150 ms [17] and a speech quality of MOS-LQS 4 or above.
However, interhuman communication is still possible if the mouth-to-
ear delay is much larger.

The effect of delay jitter might not be very noticeable in the case
of speech. Thus, playout rescheduling can take place often.

In many cases, phone calls are made between mobile devices such as
cordless and cellular phones. In these cases, energy consumption is
crucial, and both complexity and transmission rate may be reduced to
save resources.

9.2. High Quality Interactive Audio Transmissions (AoIP)

In this scenario we consider a telephone call having a very good
audio quality at modest acoustic one-way latencies ranging from 50
to 150 ms [17], so that music can be listened to over the telephone
while two persons are talking interactively.

While delay expectations might be similar to those of classic
telephony, the audio quality must meet similar standards as those of
consumer Hifi equipment like MP3 and CD players, iPods, etc.

If music is played, playout rescheduling events may easily be heard
as the rhythm changes. Only a few studies such as [10] have been
conducted to examine the effect of time-varying delays on service
quality. In general, it can be assumed that the requirements
regarding the constancy of playout schedules are higher than in the
case of speech because human beings notice rhythmic changes easily.
Thus, in the presence of music, frequent playout rescheduling shall
be avoided.
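The delay budgets cited in Sections 9.1 and 9.2 can be related to a
rating scale via the E-model [17]. As a rough illustration, the
following Python sketch evaluates a widely used piecewise-linear
simplification of the E-model's delay impairment factor Id (the
approximation popularized by Cole and Rosenbluth); it is an
approximation for illustration only, not a replacement for the full
computation of [17].

<CODE BEGINS>
# Sketch of a well-known simplified approximation of the E-model
# delay impairment factor Id (Cole/Rosenbluth approximation of
# ITU-T G.107 [17]).  For illustration only.
def delay_impairment(d_ms):
    # d_ms: one-way mouth-to-ear delay in milliseconds.
    ii = 0.024 * d_ms
    if d_ms > 177.3:
        ii += 0.11 * (d_ms - 177.3)
    return ii

for d in (50, 150, 177, 300, 500):
    print("delay %3d ms -> Id = %5.1f R-factor points"
          % (d, delay_impairment(d)))
<CODE ENDS>

At 150 ms the impairment is still small (under 4 points on the
0-100 R scale), which is consistent with the 150 ms budget cited
above; beyond roughly 177 ms the impairment grows much faster.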
9.3. High Quality Teleconferencing

Also, for today's teleconferencing and videoconferencing systems
there is a strong and increasing demand for audio coding providing
the full human auditory bandwidth of 20 Hz to 20 kHz. This rising
demand for high quality audio is due to the following reasons:

o Conferencing systems are increasingly used for more elaborate
  presentations, often including music and sound effects, which
  occupy a wider audio bandwidth than that of speech. For example,
  Web conferencing services such as WebEx, GoToMeeting, and Adobe
  Acrobat Connect are based on an IP-based transmission.

o The new "Telepresence" videoconferencing systems (such as those
  from Cisco) provide the user with High Definition video and audio
  quality and create the experience of being in the same room by
  introducing high quality media delivery.

o The emerging Digital Living Rooms are to be interconnected and
  might require a constant, high quality acoustic transmission.

o Spatial audio teleconference solutions increase the quality
  because they take advantage of the cocktail-party effect. By
  taking advantage of 3D audio, participants can be identified by
  their location in a virtual acoustic environment, and multiple
  talkers can be distinguished from each other. However, these
  systems require stereo audio if the spatial audio is rendered for
  headphones.

9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)

This scenario includes the use case of using a VoIP-PSTN gateway to
connect to legacy telephone systems. In those cases, the gateway
would make an audio conversion from broadband Internet voice to the
frugal 1930's 3.1 kHz audio bandwidth.

The quality requirements in this scenario are low because the legacy
PSTN typically uses narrow-band voice. Also, in those cases one
might expect that the codec negotiation will decide on a common
codec for both PSTN and VoIP in order to avoid transcoding.

However, the complexity requirements might be stringent because
central media gateways must scale to a high number of users. In this
context, hardware costs are an important criterion and the codec has
to operate efficiently.

9.5. Music streaming

Music streaming typically does not require low delays. However, in
special cases such as live events and in the presence of alternative
transmission technologies, low-delay streaming may be demanded.

Examples are important sport events, which are streamed both on
terrestrial (analogue), low-delay broadcast networks and on IP-based
distribution networks. Users of the latter become aware of events
(such as when a footballer scores) later than their neighbors using
terrestrial technology.

9.6. Ensemble Performances over a Network

In some usage scenarios, users want to act simultaneously and not
just interactively. For example, if persons sing in a chorus, if
musicians jam, or if e-sportsmen play computer games in a team
together, they need to communicate acoustically.

In this scenario, the latency requirements are much harder than for
interactive usages. For example, if two musicians are placed more
than 10 meters apart, they can hardly stay synchronized. Empirical
studies [10] have shown that if ensembles play over networks, the
optimal acoustic latency is at around 11.5 ms, with a targeted range
from 10 to 25 ms.
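The 10 meter rule of thumb matches these numbers: in air, sound
travels at roughly $c \approx 343\,\mathrm{m/s}$, so a separation of
$d = 10\,\mathrm{m}$ corresponds to an acoustic delay of

$t = d / c = 10 / 343 \approx 0.029\,\mathrm{s} = 29\,\mathrm{ms}$,

which is already above the targeted range of 10 to 25 ms.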
Also, the users demand very high audio quality, very low delay, and
very few playout rescheduling events.

9.7. Push-to-talk like Services (PTT)

In spite of the development of broadband access (xDSL), many users
only have service access via PSTN modems or mobile links. Also, on
these links the available bandwidth might be shared among multiple
flows and is subject to congestion. Then, even low coding rates of
about 8 kbps are too high.

If transmission capacity hardly exists, one can still degrade the
quality of a telephone call to something like a push-to-talk (PTT)
service having very high latencies. Technically, this scenario takes
advantage of bandwidth gains due to discontinuous transmission (DTX)
modes and very large packets containing multiple speech frames,
causing a very low packetization overhead.

The quality requirements of a push-to-talk like service have hardly
been studied. The OMA lists as requirements for a Push-to-talk over
Cellular service a transmission delay of 1.6 s and a MOS value of
above 3.0 that typically should be kept [39]. However, as long as an
understandable transmission of speech is possible, the delay can be
even higher. For example, [39] allows a delay of typically up to 4 s
for the first talk-burst. Also, [39] describes a maximum duration of
speaking. If a speaking participant reaches the time limit, the
participant's right to speak shall be automatically revoked.

If the quality of a telephone call is very low, then instead of
listening-only speech quality, the degree of understandability can
be chosen as the performance metric. For example, objective tests of
understandability use automatic speech recognition (ASR) systems and
measure the number of correctly detected words.

In any case, the participant shall be informed about the quality of
the connection, the presence of high delays, the half-duplex style
of communication, and his or her (limited) right to speak. For
example, this can be achieved by a simulated talker echo.

9.8. Discussion

The requirements of the usage scenarios are summarized in the
following table.

              |    Sound Quality   |      Latency       | Complexity
Scenario      | low | avg. | hifi  | 10ms | 150ms| high | low | high
--------------+-----+------+-------+------+------+------+-----+-----
VoIP          |  X  |      |       |      |  X   |      |  X  |  X
AoIP          |     |  X   |   X   |      |  X   |      |     |  X
Conference    |     |  X   |       |      |  X   |      |     |  X
Convergence   |  X  |      |       |      |  X   |      |  X  |  X
Streaming     |     |  X   |   X   |      |      |  X   |     |  X
Performances  |     |      |   X   |  X   |      |      |     |  X
Push-To-Talk  |  X  |      |       |      |      |  X   |  X  |  X

Figure 6 Different requirements for different usage scenarios

10. Recommendations for Testing the IIAC

The IETF IIAC differs substantially from a classic narrowband or
wideband codec. Thus, the previously applied codec testing
procedures such as ITU-T P.830 cannot be adopted in their entirety.
Instead, one must check carefully which of the procedures can be
used without changes, which can be used with minor changes, and
which have to be dropped or replaced.

In Section 1 we listed five groups of stakeholders, which have
different requirements and demands on how to test the quality of an
IIAC.
In the following, we recommend testing procedures for those
stakeholders.

10.1. During Codec Development

Codec development is an innovative process. In general, innovation
and research benefit from openness and discussion between experts.
Thus, formal restrictions on how to test the codec might hinder the
codec development, because innovation may also take place in the
testing procedures. Instead, many experts both in codec development
and codec usage shall be able to participate. If this is the case,
they contribute their expertise, identify weaknesses, and discuss
potential codec enhancements. During innovation, openness in
participation and discussion is very fruitful and leads to good
results.

Based on their ongoing experience, codec developers know best how to
test their codecs. Typically, those tests include informal testing,
semiformal testing, and expert interviews. They are intended to find
weaknesses in the codec, to identify artifacts or distortions, and
to achieve algorithmic progress.

10.2. Characterization Phase

The characterization phase is intended to study the features, the
quality tradeoffs, and the properties of a codec under
standardization. It is intended to be an objective measure of the
codec's quality in order to convince third parties of the quality
properties of the standardized codec. To achieve this aim, a formal
testing procedure has to be established.

In general, we recommend basing the procedure of the
characterization phase on procedures similar to those that were used
for the G.719 standardization (Section 7.2 and especially [35]). In
the following, we describe the suggested testing procedure for the
characterization phase.

10.2.1. Methodology

The testing of sound quality can be done using MUSHRA tests with
eight samples and three anchors. One anchor is the known reference,
the second one is a hidden reference, and the third one is the
hidden anchor, for which it is suggested to use a band-limited
signal with a low-pass filter at 3.5 kHz. However, because a wide
range of qualities is to be tested, ranging from Hifi down to toll
quality, it is beneficial to add a further low quality anchor such
as a 3.5 kHz bandwidth sample distorted by modulated noise (MNRU)
[25], for example with an MNRU strength of Q=25 dB, which
corresponds to a MOS value of 1.79 [4].
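For orientation, the following Python/NumPy sketch shows the core
principle of the MNRU degradation of P.810 [25]: the signal is
multiplied by noise attenuated Q dB below the signal, yielding
signal-correlated noise. This is a simplified illustration; the
normative MNRU additionally prescribes filtering and level
alignment, for which the ITU-T G.191 software tools [19] should be
used when preparing actual test material.

<CODE BEGINS>
# Simplified sketch of the MNRU principle of P.810 [25]:
# multiplicative, signal-correlated noise Q dB below the signal.
# Not the normative MNRU; use the G.191 tools [19] for real tests.
import numpy as np

def mnru(x, q_db, rng=np.random.default_rng(0)):
    # x: input samples (float array); q_db: noise attenuation in dB.
    noise = rng.standard_normal(len(x))      # unit-variance noise
    return x * (1.0 + 10.0 ** (-q_db / 20.0) * noise)

# Example: degrade one second of a 1 kHz tone at 48 kHz with Q=25 dB.
t = np.arange(48000) / 48000.0
degraded = mnru(0.5 * np.sin(2 * np.pi * 1000 * t), q_db=25.0)
<CODE ENDS>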
10.2.2. Material

Reference samples should be 48 kHz sampled, stereo channel material.
The nominal bandwidth of the reference samples shall be limited to
the range of 20 to 20000 Hz. Three different kinds of content shall
be tested: speech, music, and mixed content.

Speech samples shall include different languages, including English
and tonal languages. The speech samples shall be recorded in a quiet
environment without background noise or reverberation. The speech
samples shall contain one meaningful sentence having a length of
about 4 s.

Music samples shall contain a wide variety of music styles including
classical music, pop, jazz, and single instruments. The length of
the samples shall be between 10 and 15 s. A smoothing of 100 ms both
at the beginning and at the end shall be conducted, if required.

Mixed content may contain advertisements, film trailers, news with
jingles, and other mixtures of speech, music, and noises. The length
may be about 5-6 s.

10.2.3. Listening Laboratory

Multiple independent laboratories shall conduct the listening tests.
They are responsible for generating or selecting reference samples
as well as for the pre- and post-screening of subjects. In the end,
the results of about 24 experienced listeners shall be published (in
addition to the samples).

The tests must be conducted in a quiet listening environment at
about NC25 (approximately 35 dBA). For example, an ISOBOOTH room can
be used.

It is recommended to use a high quality D/A converter, such as a
Benchmark DAC, Metric Halo ULN-2, or Apogee MiniDAC. High quality
headphone amplifiers and playback level calibration shall be used.
Playback levels might be measured via Etymotic in-ear microphones.
Also, high quality headphones (e.g., AKG 240DF, Sennheiser HD600)
are advisable.

10.2.4. Degradation Factors

The IIAC is likely to be highly configurable. However, due to time
limits, only a few parameter sets can be tested subjectively. Thus,
we recommend subjective studies with

o different bit rates (from low to high, 5 tests),

o different frame rates (from low to high, 2 tests),

o different loss patterns (G.1050 profiles A, B, and C at a low
  rate with speech content and at a high rate with music content;
  the influence of jitter, delay, and link failures shall be
  ignored; in total, this would be 6 tests),

o different sample contents:

  o speech, speech+reverberation, and speech+noise+reverberation at
    low and medium rates (3 tests),

  o speech samples tested in different languages (English,
    Chinese, ...) and with male/female voices (6 tests),

  o mixed content and music tested at medium and high rates (about
    10 tests),

o a low complexity mode, DTX, and the FEC mode tested at low rates
  because they are typically used on constrained devices (3 tests),

o abrupt changes in bit and frame rates (reduction by half,
  exponential start, 2 tests),

o smooth changes of bit and frame rates (incrementing or decreasing
  the codec's gross rate by 1.5 kbyte every 100 ms, 2 tests),

o stall and continue operations (20, 200, and 1000 ms, 3 tests),

o accelerated and slowed-down playout (+-10% for speech at low
  rates), and

o reference codecs such as LAME MP3, G.719, and AMR, each at two
  coding rates (6 tests).

Already, these are 48 different tests that need to be conducted.

In addition, for intermediate values objective tests shall be run
using PEAQ (for music) and P.OLQA (for speech). These intermediate
results shall be mapped onto the MUSHRA scale with a quadratic
regression because PEAQ and P.OLQA use an ODG and a MOS scale,
respectively.
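A minimal sketch of such a mapping is given below: a quadratic
polynomial is fitted on conditions for which both a subjective
MUSHRA score and an objective score are available and is then
applied to the remaining conditions. All score values in the example
are made up for illustration.

<CODE BEGINS>
# Sketch: map objective scores (e.g., PEAQ ODG) onto the MUSHRA
# scale with a quadratic regression.  All values are made-up
# examples, not measurement data.
import numpy as np

# Conditions rated both subjectively (MUSHRA, 0..100) and
# objectively (here: PEAQ ODG, 0 .. -4).
odg    = np.array([-0.2, -0.8, -1.5, -2.4, -3.3])
mushra = np.array([92.0, 78.0, 60.0, 41.0, 22.0])

coeff = np.polyfit(odg, mushra, deg=2)   # quadratic fit
mapping = np.poly1d(coeff)

# Predict MUSHRA scores for intermediate, unrated conditions.
print(mapping(np.array([-0.5, -1.0, -2.0])))
<CODE ENDS>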
10.3. Application Developers

Application developers can take advantage of the results of the
qualification phase. They may use the results to develop a quality
model, which describes the expected quality of the codec for a given
parameter set (refer to [11] for an example).

In addition, they can test their systems using the draft G.1050
simulation model, which is especially useful for optimizing rate
control, dejittering buffers, and concealment algorithms. Different
systems may be tested with quality models, subjective listening
tests, conversational tests, or with objective measures such as
P.OLQA.

Also, field tests may be conducted to test the effect of a real
network on the VoIP application.

10.4. Codec Implementers

To test the conformance of a codec, codec implementers can use
objective tools like PEAQ or P.OLQA to see whether a newly
implemented codec performs similarly to the reference
implementation. These tests shall be done for many different
parameter sets.

10.5. End Users

End users may be included in the qualification tests. The intentions
of these tests are twofold. First, the awareness of the end-users
shall be increased. Second, querying users may be a cost-effective
way of conducting listening-only tests.

However, before the rating results of end users can be considered
for further usage, one needs to compare formal and web-based testing
results to see to what extent they differ from each other.

11. Security Considerations

The results of the quality tests shall be convincing. Thus, special
care has to be taken to make the tests precise, accurate,
repeatable, and trustworthy.

Some testing houses may have a conflict of interest between accurate
quality ratings and the promotion of their own codecs. Thus, a high
degree of openness shall be enforced, requiring all of the testing
material and results to be published. This way, others may verify
the results of the testing houses. In addition, some stimuli shall
be tested by all testing houses to compare their quality of rating.

Moreover, hidden anchors may help to identify subjects who rate the
quality of samples less precisely.

12. IANA Considerations

This document has no actions for IANA.

13. References

13.1. Normative References

13.2. Informative References

[1]  R. Birke, M. Mellia, M. Petracca, D. Rossi, "Understanding VoIP
     from Backbone Measurements", IEEE INFOCOM 2007, 26th IEEE
     International Conference on Computer Communications,
     pp. 2027-2035, May 2007.

[2]  C. Boutremans, J.-Y. Le Boudec, "Adaptive joint playout buffer
     and FEC adjustment for Internet telephony", IEEE INFOCOM 2003,
     Twenty-Second Annual Joint Conference of the IEEE Computer and
     Communications Societies, vol. 1, pp. 652-662, 30 March -
     3 April 2003.

[3]  Broadcom, "BCM1103: Gigabit IP Phone Chip", Jan. 2005,
     http://www.datasheetcatalog.org/datasheet2/3/07ozspx224dsarq6zu13i2ofyqyy.pdf

[4]  N. Cote, V. Koehl, V. Gautier-Turbin, A. Raake, S. Moeller,
     "Reference Units for the Comparison of Speech Quality Test
     Results", Audio Engineering Society Convention 126, May 2009.

[5]  Ericsson, "Analysis of PEAQ's applicability in predicting the
     quality difference between alternative implementations of the
     G.722.1FB coding algorithm", ITU-T SG 12, received on
     2008-05-09, related to question(s): Q9/12, meeting 2008-05-22.

[6]  ETSI TC-TM, "ETR 250: Transmission and Multiplexing (TM);
     Speech communication quality from mouth to ear for 3,1 kHz
     handset telephony across networks", ETSI Technical Report,
     July 1996.

[7]  S. Floyd, E.
     Kohler, "Profile for Datagram Congestion Control Protocol
     (DCCP) Congestion ID 4: TCP-Friendly Rate Control for Small
     Packets (TFRC-SP)", RFC 5622, August 2009.

[8]  S. Floyd, E. Kohler, "TCP Friendly Rate Control (TFRC): The
     Small-Packet (SP) Variant", RFC 4828, April 2007.

[9]  J. Gruber, G. Williams, "Transmission Performance of Evolving
     Telecommunications Networks", Artech House, 1992.

[10] M. Gurevich, C. Chafe, G. Leslie, S. Tyan, "Simulation of
     Networked Ensemble Performance with Varying Time Delays:
     Characterization of Ensemble Accuracy", Proceedings of the 2004
     International Computer Music Conference, Miami, USA, 2004.

[11] C. Hoene, H. Karl, A. Wolisz, "A perceptual quality model
     intended for adaptive VoIP applications", International Journal
     of Communication Systems, Wiley, August 2005.

[12] J. Holub, J.G. Beerends, R. Smid, "A dependence between average
     call duration and voice transmission quality: measurement and
     applications", Wireless Telecommunications Symposium 2004,
     pp. 75-81, May 2004.

[13] ITU, "Incoming LS: Proposed G.1050/TIA-921B IP Network Model
     Simulation", ITU-T SG 12, Temporary Document 268-GEN,
     May 12, 2010.

[14] ITU, "ITU-R BS.1116-1: Methods for the subjective assessment of
     small impairments in audio systems including multichannel sound
     systems", Recommendation, October 1997.

[15] ITU, "ITU-R BS.1387: Method for objective measurements of
     perceived audio quality", Recommendation, November 2001.

[16] ITU, "ITU-R BS.1534-1: Method for the subjective assessment of
     intermediate quality levels of coding systems", Recommendation,
     January 2003.

[17] ITU, "ITU-T G.107: The E-model: a computational model for use
     in transmission planning", Recommendation, April 2009.

[18] ITU, "ITU-T G.114: One-way transmission time", Recommendation,
     May 2003.

[19] ITU, "ITU-T G.191: Software tools for speech and audio coding
     standardization", Recommendation, March 2010.

[20] ITU, "ITU-T G.1050: Network model for evaluating multimedia
     transmission performance over Internet Protocol",
     Recommendation, November 2007.

[21] ITU, "ITU-T G.RQAM: Reference guide to QoE assessment
     methodologies", standard draft TD 310rev1, May 2010.

[22] ITU, "ITU-T P.10/G.100: Vocabulary and effects of transmission
     parameters on customer opinion of transmission quality",
     Recommendation, July 2006.

[23] ITU, "ITU-T P.800: Methods for objective and subjective
     assessment of quality", Recommendation, August 1996.

[24] ITU, "ITU-T P.805: Subjective evaluation of conversational
     quality", Recommendation, April 2007.

[25] ITU, "ITU-T P.810: Modulated noise reference unit (MNRU)",
     Recommendation, February 1996.

[26] ITU, "ITU-T P.830: Subjective performance assessment of
     telephone-band and wideband digital codecs", Recommendation,
     February 1996.

[27] ITU, "ITU-T P.862: Perceptual evaluation of speech quality
     (PESQ): An objective method for end-to-end speech quality
     assessment of narrow-band telephone networks and speech
     codecs", Recommendation, February 2001.

[28] ITU, "ITU-T P.862.1: Mapping function for transforming P.862
     raw result scores to MOS-LQO", Recommendation, November 2003.
[29] ITU, "ITU-T P.862.2: Wideband extension to Recommendation P.862
     for the assessment of wideband telephone networks and speech
     codecs", Recommendation, November 2007.

[30] ITU, "ITU-T P.862.3: Application guide for objective quality
     measurement based on Recommendations P.862, P.862.1 and
     P.862.2", Recommendation, November 2007.

[31] ITU, "ITU-T P.880: Continuous evaluation of time-varying speech
     quality", Recommendation, May 2004.

[32] H. Jiang, C. Dovrolis, "Why is the Internet Traffic Bursty in
     Short Time Scales?", SIGMETRICS'05, Banff, Alberta, Canada,
     June 2005.

[33] C. Lamblin, R. Even, "Processing Test Plan for the ITU-T
     G.722.1 fullband extension optimization/characterization
     phase", ITU-T Study Group 16, Temporary Document TD 322
     (WP 3/16), 22 April - 2 May 2008.

[34] C. Lamblin, R. Even, "G.722.1 fullband extension
     characterization phase test results: objective (ITU-R
     BS.1387-1) and subjective (ITU-R BS.1116) scores", ITU-T Study
     Group 16, Temporary Document TD 341 R1 (WP 3/16), 22 April -
     2 May 2008.

[35] C. Lamblin, R. Even, "G.722.1 fullband extension
     optimization/characterization Quality Assessment Test Plan",
     ITU-T Study Group 16, Temporary Document TD 323 (WP 3/16),
     22 April - 2 May 2008.

[36] J. Lee, J. Kim, C. Jang, S. Kim, B. Egger, K. Kim, S. Han,
     "FaCSim: A Fast and Cycle-Accurate Architecture Simulator for
     Embedded Systems", Proceedings of the International Conference
     on Languages, Compilers, and Tools for Embedded Systems
     (LCTES'08), Tucson, Arizona, USA, June 2008. Software available
     at http://facsim.snu.ac.kr/.

[37] G. Maier, A. Feldmann, V. Paxson, M. Allman, "On Dominant
     Characteristics of Residential Broadband Internet Traffic",
     IMC'09, Chicago, Illinois, USA, November 4-6, 2009.

[38] T. Mori, S. Naito, R. Kawahara, S. Goto, "On the
     characteristics of Internet traffic variability: Spikes and
     Elephants", SAINT'04, 2004.

[39] Open Mobile Alliance, "Push to talk over Cellular
     Requirements", Approved Version 1.0, OMA-RD-PoC-V1_0-20060609-A,
     9 June 2006.

[40] OPTICOM, SwissQual, TNO, "Announcement of OPTICOM, SwissQual
     and TNO to submit a joint P.OLQA model", ITU-T SG 12,
     Contribution 117, received on 2010-05-07, related to
     question(s): Q9/12.

[41] D. Sisalem, A. Wolisz, "Towards TCP-friendly adaptive
     multimedia applications based on RTP", IEEE International
     Symposium on Computers and Communications, pp. 166-172, 1999.

[42] S. Smirnoff, K. Pupkov, "SoundExpert: How it Works - Audio
     quality measurements in the digital age",
     http://soundexpert.org/, retrieved Nov. 2010.

[43] L. Sun, "Speech Quality Prediction for Voice over Internet",
     PhD thesis, University of Plymouth, January 2004,
     http://www.tech.plymouth.ac.uk/spmc/people/lfsun/mos/.

[44] Texas Instruments, "C64x+ CPU Cycle Accurate Simulator",
     October 2010,
     http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accurate_Simulator.

[45] Texas Instruments, "TNETV3020: Carrier Infrastructure Platform,
     Telogy Software products integrated with TI's DSP-based
     high-density communications processor", 2008,
     http://focus.ti.com/lit/ml/spat174a/spat174a.pdf

[46] TransNexus, "Asterisk V1.4.11 Performance", webpage, accessed
     Nov.
     2010,
     http://www.transnexus.com/White%20Papers/asterisk_V1-4-11_performance.htm

[47] K. Vos, K. Vandborg Sorensen, S. Skak Jensen, J. Spittka,
     "SILK", presentation at the 77th IETF meeting in the Codec WG,
     Anaheim, USA, March 22, 2010,
     http://tools.ietf.org/agenda/77/slides/codec-3.pdf

[48] H. Vlad Balan, L. Eggert, S. Niccolini, M. Brunner, "An
     Experimental Evaluation of Voice Quality Over the Datagram
     Congestion Control Protocol", IEEE INFOCOM 2007, 26th IEEE
     International Conference on Computer Communications,
     pp. 2009-2017, 6-12 May 2007.

[49] J. Wallerich, A. Feldmann, "Capturing the Variability of
     Internet Flows Across Time", Proceedings of INFOCOM 2006, 25th
     IEEE International Conference on Computer Communications,
     23-29 April 2006.

[50] M. Westerlund, "How to Write an RTP Payload Format", work in
     progress, draft-ietf-avt-rtp-howto-06, Internet-Draft,
     March 2, 2009.

[51] Wikipedia contributors, "Bit rate", Wikipedia, The Free
     Encyclopedia, 10 October 2010, 20:00 UTC,
     http://en.wikipedia.org/w/index.php?title=Bit_rate&oldid=389931944

[52] Wikipedia contributors, "Cycle accurate simulator", Wikipedia,
     The Free Encyclopedia, 4 September 2010, 14:27 UTC,
     http://en.wikipedia.org/w/index.php?title=Cycle_accurate_simulator&oldid=382876676

[53] Wikipedia contributors, "Latency (engineering)", Wikipedia, The
     Free Encyclopedia, 15 October 2010, 23:54 UTC,
     http://en.wikipedia.org/w/index.php?title=Latency_(engineering)&oldid=390971153

[54] Wikipedia contributors, "Profiling (computer programming)",
     Wikipedia, The Free Encyclopedia, 15 August 2010, 03:57 UTC,
     http://en.wikipedia.org/w/index.php?title=Profiling_(computer_programming)&oldid=378987422

[55] M. T. Yourst, "PTLsim: A cycle accurate full system x86-64
     microarchitectural simulator", ISPASS'07, 2007, software
     available at http://www.ptlsim.org/.

14. Acknowledgments

This document is based on many discussions with experts in the
fields of codec design, quality of experience, and quality
management. My special thanks go to Michael Knappe, Sebastian
Moeller, Raymond Chen, Jack Douglass, Paul Coverdale, Jean-Marc
Valin, Koen Vos, Bilke Ullrich, and all active participants of the
Codec WG mailing list. Also, I would like to express my appreciation
to the members of the ITU-T Study Groups 12 and 16, with whom I had
many fruitful discussions.

Authors' Addresses

Christian Hoene
Universitaet Tuebingen
WSI-ICS
Sand 13
72076 Tuebingen
Germany

Phone: +49 7071 2970532
Email: hoene@uni-tuebingen.de