CODEC                                                          C. Hoene
Internet Draft                                   Universitaet Tuebingen
Intended status: Informational                             June 3, 2011
Expires: December 2011

     Measuring the Quality of an Internet Interactive Audio Codec
                    draft-hoene-codec-quality-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

   This Internet-Draft will expire on December 3, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

   The quality of a codec has to be measured by multiple parameters such as audio quality, speech quality, algorithmic efficiency, latency, coding rates, and their respective tradeoffs. During standardization, codecs are tested and evaluated multiple times to ensure a high quality outcome.

   As the upcoming Internet codec is likely to have unique features, there is a need to develop new quality testing procedures to measure these features.
Thus, this draft reviews existing methods on how to measure a codec's qualities, proposes a couple of new methods, and gives suggestions that may be used for testing the Internet Interactive Audio Codec (IIAC).

   This document is work in progress.

Conventions used in this document

   In this document, equations are written in LaTeX syntax. An equation starts with a dollar sign and ends with a dollar sign. The text in between is an equation following the notation of LaTeX2e. In the PDF version of this document, as a courtesy to its readers, all LaTeX equations are already rendered.

Table of Contents

   Conventions used in this document
   1. Introduction
   2. Optimization Goal
   3. Measuring Speech and Audio Quality
      3.1. Formal Subjective Tests
         3.1.1. ITU-R Recommendation BS.1116-1
         3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)
         3.1.3. ITU-T Recommendation P.800
         3.1.4. ITU-T Recommendation P.805
         3.1.5. ITU-T Recommendation P.880
         3.1.6. Formal Methods Used for Codec Testing at the ITU
      3.2. Informal Subjective Tests
      3.3. Interview and Survey Tests
      3.4. Web-based Testing
      3.5. Call Length and Conversational Quality
      3.6. Field Studies
      3.7. Objective Tests
         3.7.1. ITU-R Recommendation BS.1387-1
         3.7.2. ITU-T Recommendation P.862
         3.7.3. ITU-T Draft P.OLQA
   4. Measuring Complexity
      4.1. ITU-T Approaches to Measuring Algorithmic Efficiency
      4.2. Software Profiling
      4.3. Cycle Accurate Simulation
      4.4. Typical run time environments
   5. Measuring Latency
      5.1. ITU-T Recommendation G.114
      5.2. Discussion
   6. Measuring Bit and Frame Rates
   7. Codec Testing Procedures Used by Other SDOs
      7.1. ITU-T Recommendation P.830
      7.2. Testing procedure for the ITU-T G.719
   8. Transmission Channel
      8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission Performance over IP (11/2007)
      8.2. Draft G.1050 / TIA-921B
      8.3. Delay and Throughput Distributions on the Global Internet
      8.4. Transmission Variability on the Internet
      8.5. The Effects of Transport Protocols
      8.6. The Effect of Jitter Buffers and FEC
      8.7. Discussion
   9. Usage Scenarios
      9.1. Point-to-point Calls (VoIP)
      9.2. High Quality Interactive Audio Transmissions (AoIP)
      9.3. High Quality Teleconferencing
      9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)
      9.5. Music streaming
      9.6. Ensemble Performances over a Network
      9.7. Push-to-talk like Services (PTT)
      9.8. Discussion
   10. Recommendations for Testing the IIAC
      10.1. During Codec Development
      10.2. Characterization Phase
         10.2.1. Methodology
         10.2.2. Material
         10.2.3. Listening Laboratory
         10.2.4. Degradation Factors
      10.3. Application Developers
      10.4. Codec Implementers
      10.5. End Users
   11. Security Considerations
   12. IANA Considerations
   13. References
      13.1. Normative References
      13.2. Informative References
   14. Acknowledgments

1. Introduction

   The IETF Working Group CODEC is standardizing an Internet Interactive Audio and Speech Codec (IIAC). If the codec shall be of high quality, it is important to measure the codec's quality throughout the entire process of development, standardization, and usage. Thus, this document supports the standardization process by providing an overview of quality metrics, quality assessment procedures, and other quality control issues, and gives suggestions on how to test the IIAC.

   Quality must be measured by the following stakeholders and in the following phases of the codec's development:

   o Codec developers must decide on different algorithms or parameter sets during the development and enhancement of a codec. This might also include the selection among multiple codec candidates that implement different algorithms; however, the CODEC WG bases its work on a common consensus, not on a competitive selection of one of multiple codec contributions. Thus, measuring the quality of codecs in order to select one might not be required.
     Besides selection, one is obliged to debug the codec software. To find errors and bugs - and programming mistakes are present in any complex software - the developer has to test this software by conducting quality measurements.

   o Typically, codec standardization includes a qualification phase that measures the performance of a codec and verifies whether it conforms to predefined quality requirements. In the qualification phase, it becomes obvious whether the codec development and standardization have been successful. Again, in the process of rigorous testing during the qualification phase, algorithmic weaknesses and bugs in the implementation may be found.
     Still, in complex software such as the IIAC, correctness cannot be proved or guaranteed.

   o Users of the codec need to know how well the codec performs, while manufacturers need to decide whether to include the IIAC in their products. Quality measures play an important role in this decision process. Also, the numerous quality measurement results help developers of VoIP systems to dimension or tune their systems to take optimal advantage of a codec. For example, during network planning, operators can predict the amount of bandwidth needed for high quality voice calls.
     An adaptive VoIP application needs to know which quality is achieved with a given codec parameter set in order to make an optimal selection of the codec parameters under varying network conditions.
     As suggested in [50], an RTP payload specification for an IIAC codec should include a rate control. Similar to the performance of the codec, the rate control unit has a big impact on the overall quality of experience. Thus, it should be tested well, too.

   o Software implementers need to verify whether their particular codec implementation, which might be optimized for a specific platform, conforms to the standard's reference implementation. This is particularly important as some intellectual property rights might only be granted if the codec conforms to the standard.
     As the IIAC is not required to be bit-exact, which would allow simple comparisons of correctness, other means of conformance testing must be applied.
     In addition, the standard conformance and interoperability of multiple implementations must be checked.
     Last but not least, implementers may implement optimized concealment algorithms, jitter buffers, or other algorithms. Those algorithms have to be tested, too.

   o Since the success of MP3, end users acknowledge the existence of high quality codecs. It would make sense to use the IIAC in a brand marketing campaign (such as "Intel inside"). A quality comparison between the IIAC and other codecs might be part of the marketing. Online testing with user participation might also raise the awareness level.

   All those stakeholders might have different requirements regarding the codec's quality testing procedures. Thus, this document tries to identify those requirements and shows which of the existing quality measurement procedures can be applied to fulfill those specific demands efficiently.

   In the following section, we describe a primary optimization goal: Quality of Experience (QoE). Next, we briefly list the most common methods of performing subjective evaluations of speech and audio quality. In Sections 4, 5, and 6, we discuss how to measure complexity, latency, and bit and frame rates. Section 7 describes how other SDOs have measured the quality of their codecs. Compared to previously standardized codecs, the IIAC is likely to have different, unique requirements and thus needs newly developed quality testing procedures. To support this, we describe in Section 8 the properties of Internet transmission paths. Section 9 summarizes the usage scenarios for which the codec is going to be used, and finally, in Section 10, we recommend procedures on how to test the IIAC.

2. Optimization Goal

   The aim of the CODEC WG is to produce a codec of high quality.
However, how can quality be measured? The measurement of the features of a codec can be based on many different criteria, including complexity, memory consumption, audio quality, speech quality, and others. But in the end, it is the users' opinions that really count, since they are the customers. Thus, one important - if not the most important - quality measure of the IIAC shall be the Quality of Experience (QoE).

   The ITU-T Recommendation P.10/G.100 [22] defines the term "Quality of Experience" as "the overall acceptability of an application or service, as perceived subjectively by the end-user." The ITU-T document G.RQAM [21] extends this definition by noting that "quality of experience includes the complete end-to-end system effects (client, terminal, network, services infrastructure, etc.)" and that the "overall acceptability may be influenced by user expectations and context".

   These definitions already give guidelines on how to judge the quality of the IIAC:

   o The acceptability and the subjective quality impression of end users have to be measured (Section 3).

   o The IIAC codec has to be tested as part of an entire telecommunication system. It must be carefully considered whether to measure the codec's performance just in a stand-alone setup or to evaluate it as part of the overall system (Section 8).

   o The environments and contexts of particular communication scenarios have to be considered and controlled because they have an impact on human rating behavior and on quality expectations and requirements (Section 9).

3. Measuring Speech and Audio Quality

   The perceived quality of a service can be measured by various means. If humans are asked, those quality tests are called subjective. If the tests are conducted by instrumental means (such as an algorithm), they are called objective. Subjective tests are divided into formal and informal tests. Formal tests follow strictly defined procedures and methods and typically include a large number of subjects. Informal tests are less precise because they are conducted in an uncontrolled manner.

3.1. Formal Subjective Tests

   Formal subjective tests must follow a well-defined procedure. Otherwise, the results of multiple tests cannot be mutually compared and are not repeatable. Most subjective testing procedures have been standardized by the ITU. If applied to codec testing, the testing procedures follow the same pattern [26]:

      "Performing subjective evaluations of digital codecs proceeds via a number of steps:

      o Preparation of source speech materials, including recording of talkers;

      o Selection of experimental parameters to exercise the features of the codec that are of interest;

      o Design of the experiment;

      o Selection of a test procedure and conduct of the experiment;

      o Analysis of results."

   The ITU has standardized different formal subjective tests to measure the quality of speech and audio transmission, which are described in the following.

3.1.1. ITU-R Recommendation BS.1116-1

   The ITU-R BS.1116-1 standard [14] is suited for audio items with small degradations (stimuli) and uses a continuous scale from imperceptible (5.0) to very annoying (1.0). It is a double-blind triple-stimulus method with a hidden reference; in each trial, both the degraded sample and the hidden reference must be rated.
In a 30-minute session, 10-15 sample items can be judged. Overall, about 20 subjects shall rate the items. Testing shall take place with loudspeakers in a controlled environment or with headphones in a quiet room.

3.1.2. ITU-R Recommendation BS.1534-1 (MUSHRA)

   The ITU-R BS.1534-1 standard [16] defines a method for the subjective assessment of intermediate quality levels. Multiple audio stimuli are compared at the same time. At most 12, but preferably only 8, stimuli plus a hidden reference and an anchor are compared and judged. MUSHRA uses a continuous quality scale (CQS) ranging from 0 to 100, divided into five equal intervals ("bad" to "excellent"). In 30 minutes, about 42 stimuli can be tested. Again, 20 test subjects shall rate the items with either headphones or loudspeakers.

   The standard recommends using as lower anchor a low-pass filtered version with a bandwidth limit of 3.5 kHz. Additional anchors are recommended, especially if specific distortions are to be tested.

3.1.3. ITU-T Recommendation P.800

   The ITU-T P.800 defines multiple testing procedures to assess the speech quality of telephone connections. The most important procedure measures the listening-only speech quality of telephone connections. Listeners rate short groups of unrelated sentences. The listeners are taken from the normal telephone-using population (no experts). They use a typical sending system (e.g., a local telephone) that may follow "modified IRS" frequency characteristics. The result is the listening-quality scale, which is an absolute category scale (ACS) ranging from excellent=5 to bad=1. Listeners can judge about 54 stimuli within 30 minutes.

   Other tests described in P.800 measure listening effort, loudness preference, conversation opinion and difficulty, detectability, degradation, or minimal differences.

3.1.4. ITU-T Recommendation P.805

   The P.805 standard [24] extends P.800 and defines precisely how to measure conversational quality. Subjects have to carry out conversation tests to evaluate the communication quality of a connection. Expert, experienced, or untrained (naive) subjects have to do these tests collaboratively in soundproof cabinets. Typically, 6 transmission conditions can be tested within 30 minutes. Depending on the required precision, these tests have to be repeated 20 to 40 times.

3.1.5. ITU-T Recommendation P.880

   To measure time-variable distortion, a continuous evaluation of speech quality has been defined in P.880 [31]. Subjects have to assess transmitted speech quality consisting of long speech sequences with quality/time fluctuations. The rating, on a continuous scale ranging from Excellent=5 to Bad=1, is adjusted dynamically over time while the stimuli are played. Stimuli have a length of between 45 seconds and 3 minutes.

3.1.6. Formal Methods Used for Codec Testing at the ITU

   In recent years, new narrowband and wideband codecs have been tested using ITU-T P.800 (and ITU-T P.830). For the ITU-T G.719 standard, which supports audio content in addition to speech, the ITU-R BS.1116-1 testing method was applied during the selection of potential codec candidates. During the qualification phase, the method used was ITU-R BS.1534-1. For the ITU-T G.718 codec, the Absolute Category Rating (ACR) following ITU-T P.800 was applied.
3.2. Informal Subjective Tests

   Besides formal tests, informal subjective tests following less stringent conditions might be conducted to judge the quality of stimuli. However, informal tests cannot be easily verified and lack the reliability, accuracy, and precision of formal tests. Informal tests are needed if the available number of subjects who are able to conduct the tests is low, or if time or money is limited.

3.3. Interview and Survey Tests

   In ITU-T P.800 [23] and [9], interview and survey tests are described. In P.800, it says that "if the rather large amount of effort needed is available and the importance of the study warrants it, transmission quality can be determined by 'service observations'."

   These service observations are based on statistical surveys common in social science and marketing research. Typically, the questions asked in a survey are structured.

   In addition, according to [23]: "To maintain a high degree of precision a total of at least 100 interviews per condition is required. A disadvantage of the service-observation method for many purposes is that little control is possible over the detailed characteristics of the telephone connections being tested."

3.4. Web-based Testing

   Given the large-scale proliferation of the Internet, researchers have suggested testing speech or audio quality on web sites via web site visitors [43]. A current web site that compares multiple audio codecs has been set up at SoundExpert.org [42]. On this web site, a user can download an audio item that consists of a reference item and a degraded item. Then, the user must identify the reference and rate the ODG (difference grade) of the degraded item. The tests are single-blind, as the user does not know which codec he is currently rating.

   One can anticipate that the visitors of web sites will use similar equipment for testing audio samples and for conducting VoIP calls. Thus, web site testing can be made realistic in a way that considers the impact of (typically used) loudspeakers and headphones.

   However, currently used web sites lack a proper identification of outliers. Thus, all ratings of all users are considered, despite the fact that they might be (deliberately) faked or that subjects might not be able to hear the acoustic difference well. Thus, one can expect that web-based ratings will show a high degree of variation and that many more tests are needed to achieve the same confidence that is gained in formal tests. A thorough scientific study on the quality of web-based audio rating has not yet been published. Thus, any statements on the validity of web-based rating are premature.

3.5. Call Length and Conversational Quality

   In the ETSI technical report ETR-250 [6], a model is presented that discusses various impairments caused in narrowband telephone systems. The ETSI model describes the combinatorial effect of all those impairments. The ETSI model later became the famous E-Model described in ITU-T G.107. Both the ETSI model and the E-Model calculate the R factor, which ranges from 0 (bad) to 100 (excellent conversational quality).

   Based on the R factor, the users' reaction to the voice transmission quality of a connection can be predicted. For example, Section 8.3 of [6] describes the effect that users terminate the call if the quality is bad.
More precisely, it summarizes this as users who "(i) terminate their calls unusually early, (ii) re-dial or even (iii) actually complain to the network operator".

   In the ETSI model, the percentage of users "terminating calls early", TME, is given as

   $TME=100\cdot erf\left(\frac{36-R}{16}\right)\%$

   with $erf(X)$ being the sigmoid-shaped Gaussian error function and $R$ the R factor of the E-Model (Figure 1). This relation is based on results from "AT&T Long toll" interviews as cited in [2].

   100 -+TME.                                  +- 5
        |..iii.                                |
      T |    .ii                               |
      e |      ii                           MOS|
      r |       i.                        .iiii|
      m  80 -+  .i.                   .ii.     |
      i |       .i                 .ii.        +- 4
      n |        i.              .i.           |   M
      a |         .i           .ii.            |   O
      t |          i.         .i.              |   S
      e  60 -+     .i        .i.               |   |
        |           i.      ii.                |   C
      E |            .i    .ii                 +- 3 Q
      a |             i.  .i.                  |   E
      r  40 -+         .i .i.                  |
      l |              i..i.                   |
      y |              .ii.                    |
        |              .il.                    |
      ( |             .i..i                    +- 2
      T  20 -+       .i.   i.                  |
      M |          .ii.     .i.                |
      E |        .ii.        .i.               |
      ) |     .ii.            .ii.             |
        |MOSlii.                 .iiiiiiiiiiiiiTME
      0 -+-----------------+-----------------+- 1
         |                 |                 |
         0                50               100

                        R Factor

   Figure 1 - Relation between calls terminating early (TME), the R
              Factor, and the speech quality given in MOS-CQE

   These findings have been confirmed by Holub et al. [12], who have studied the correlation between call length and narrowband speech quality. Birke et al. [1] have also studied the duration of phone calls, which varies with the time of day and the day of the week and may also be affected by pricing schemes.

   Whereas bad quality is related to short calls, it remains unproven whether better quality (>4 MOS) results in longer phone calls. There are two factors that might have opposite effects on the call length. On the one hand, if the quality is superb, the talkers might be more willing to talk because of the pleasure of talking; on the other hand, they might fulfill their conversational tasks faster because of the great quality. Thus, depending on the context, good speech quality might result either in longer or shorter calls.
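   For readers who want to experiment with this relation, the following C sketch evaluates the TME formula using erf() from the C99 math library. It is an illustration only; the function name and the clamping of negative values to zero are ours, not part of ETR-250.

      #include <math.h>    /* erf(), C99; link with -lm */
      #include <stdio.h>

      /* Percentage of users terminating their calls early:
       *    TME = 100 * erf((36 - R) / 16) %
       * For R > 36 the error function becomes negative; we clamp
       * the result to 0 (our choice, not part of ETR-250). */
      static double tme_percent(double r_factor)
      {
          double tme = 100.0 * erf((36.0 - r_factor) / 16.0);
          return tme > 0.0 ? tme : 0.0;
      }

      int main(void)
      {
          for (double r = 0.0; r <= 100.0; r += 20.0)
              printf("R = %5.1f  ->  TME = %6.2f %%\n",
                     r, tme_percent(r));
          return 0;
      }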
3.6. Field Studies

   Field studies can be conducted if usage data on calls is collected. Field studies are useful for monitoring real user behavior and for collecting data about the actual conversational context.

   Because of highly varying conditions, the precision of those measurements is low, and many tests have to be done to obtain significantly different measurement values. Also, the tests are not repeatable because the conditions change with time.

   For example, Skype has done quality tests in a deployed VoIP system in the field with its users as testers [47]. The subjective tests are done in the following manner:

   o Download of test vectors to VoIP clients. Typically, this can be done with an automated software update.

   o Delivery of changing VoIP configurations (such as the used codecs) so that different calls are subjected to different configurations. The selection of configurations can be done randomly, alternating in time, or based on other criteria.

   o Collecting feedback from the users. For example, the following parameters can be monitored or recorded:

      o The call length and other call-specific parameters

      o A user's quality voting (e.g., MOS-ACR) after the call

      o Other feedback from the user (e.g., via support channels)

   Field tests have the benefit of being conducted under real conditions with real users. However, they have some drawbacks. First, the experimental conditions cannot be controlled well. Second, the tests are only valid for the current situations and do not allow predictions for other use cases. Third, the statistical significance might be largely questionable if confidence intervals are overlapping.

   The costs for running the tests are low because the users are doing the tests for free. However, the operator might lose users after a user has experienced a test case causing bad quality.

3.7. Objective Tests

   Objective tests, also called instrumental tests, try to predict human rating behavior with mathematical models and algorithms. They calculate quality ratings for a given set of audio items. Naturally, they do not rate as precisely as the human counterparts whom they try to simulate. However, the results are repeatable and less costly to obtain than formal subjective testing campaigns. Instrumental methods have a limited precision. That means that their quality ratings do not perfectly match the results of formal listening-only tests. Typically, the formal results and the instrumental calculations are compared using a correlation function. The resulting metric is given as R, ranging from 0 (no correlation) to 1 (perfect match).

   Over the last years, several objective evaluation algorithms have been developed and standardized. We describe them briefly in the following.

3.7.1. ITU-R Recommendation BS.1387-1

   The ITU developed an algorithm called Perceptual Evaluation of Audio Quality (PEAQ). It was published in 1998 in ITU-R BS.1387, "Method for objective measurements of perceived audio quality" [15]. PEAQ is intended to predict the quality rating of low-bit-rate coded audio signals. Two different versions of PEAQ are provided: a basic version with lower computational complexity and an advanced version with higher computational complexity.

   PEAQ calculates a quality grading called "Objective Difference Grade" (ODG) ranging from 0 to -4. Typically, it shows a prediction quality of between R=0.85 and 0.97 when compared to subjective testing results. The ITU-T Study Group 12 assumes that PEAQ can detect audible differences between two implementations of the same codec [5].

3.7.2. ITU-T Recommendation P.862

   The ITU-T PESQ algorithm [27] is intended to judge distortions caused by narrowband speech codecs and other kinds of channel and transmission errors. These also include variable delays, filtering, and short localized distortions such as those caused by frame loss concealment. For a large number of conditions, the validity and precision of PESQ have been proven. For untested distortions, prior subjective tests must be conducted to verify whether PESQ judges these kinds of distortions precisely. Also, it is recommended to use PESQ for 3.1 kHz (narrow-band) handset telephony and narrow-band speech codecs only. For wide-band operations, a modified filter has to be applied prior to the tests.

   Furthermore, the ITU-T Recommendation P.862.1 [28] describes how to map PESQ's raw scores, which range from -0.5 to 4.5, to MOS-LQO values similar to those gathered from ACR ratings.
With this mapping, a large corpus of test samples shows a correlation of R=0.879 (instead of R=0.876) between subjective ratings and MOS-LQO (respectively PESQ raw) ratings. The ITU-T Recommendation P.862.2 [29] modifies the PESQ algorithm slightly to support wideband operations. And finally, the ITU-T Recommendation P.862.3 [30] gives detailed hints and recommendations on how and when to use the PESQ algorithms.

3.7.3. ITU-T Draft P.OLQA

   The soon-to-be standardized algorithm P.OLQA [40] extends PESQ and will be able to rate narrowband to super-wideband speech and the effect of time-varying speech playout. The latter distortions are common in modern VoIP systems, which stretch and shrink the speech playout during voice activity to adapt it to the delay process of the network.

4. Measuring Complexity

   Besides audio and speech quality, the complexity of a codec is of prime importance. Knowing the algorithmic efficiency is important because:

   . the complexity has an impact on power consumption and system costs,

   . the hardware can be selected to fit pre-known complexity requirements, and

   . different codec proposals can be compared if they show similar performances in other aspects.

   Before any complexity comparisons can be made, one has to agree on an objective, precise, reliable, and repeatable metric for measuring algorithmic efficiency. In the following, we list three different approaches.

4.1. ITU-T Approaches to Measuring Algorithmic Efficiency

   Over the last 17 years, the ITU-T Study Group 16 has measured the complexity of codecs using a library called ITU-T Basic Operators, described in ITU-T G.191 [19], which counts the kind and number of operations and the amount of memory used. The latest version of the standard supports both fixed-point operations of different widths and floating-point operations. Each operation can be counted automatically and weighted accordingly. The following source code is an [edited] excerpt from the source file baseop32.h:

      /* Prototypes for basic arithmetic operators */

      /* Short add,          1 */
      Word16 add (Word16 var1, Word16 var2);

      /* Short sub,          1 */
      Word16 sub (Word16 var1, Word16 var2);

      /* Short abs,          1 */
      Word16 abs_s (Word16 var1);

      /* Short shift left,   1 */
      Word16 shl (Word16 var1, Word16 var2);

      /* Short shift right,  1 */
      Word16 shr (Word16 var1, Word16 var2);

      ...

      /* Short division,    18 */
      Word16 div_s (Word16 var1, Word16 var2);

      /* Long norm,          1 */
      Word16 norm_l (Word32 L_var1);

   In the upcoming ITU-T G.GSAD standard, another approach has been used, as shown in the following code example. For each operation, WMOPS counting functions have been added, which count the number of operations. If the efficiency of an algorithm has to be measured, the program is started and the operations are counted for a known input length.

      for (i = 0; i < NB_BANDS; i++)  /* NB_BANDS: placeholder name
                                         for the original loop bound */
      {
          state_fx->band_enrg_long_fx[i] = 30;
          state_fx->band_enrg_fx[i] = 30;
          state_fx->band_enrg_bgd_fx[i] = 30;
          state_fx->min_band_enrg_fx[i] = 30;
      }
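   The counting idea behind these libraries can be illustrated in a few lines of C. The following sketch is not the G.191 code; it merely mimics its style: each helper increments a global counter by the operator's weight, so that running the program yields a weighted operation count (saturation and memory accounting are omitted).

      #include <stdio.h>

      typedef short Word16;

      /* Global weighted-operation counter, in the spirit of the
       * ITU-T G.191 basic operators (simplified illustration). */
      static unsigned long op_count = 0;

      /* Weight 1, like the basic operator "add" (no saturation). */
      static Word16 add(Word16 a, Word16 b)
      {
          op_count += 1;
          return (Word16)(a + b);
      }

      /* Weight 18, like the basic operator "div_s" (integer
       * division here, unlike the real fractional operator). */
      static Word16 div_s(Word16 a, Word16 b)
      {
          op_count += 18;
          return (Word16)(a / b);
      }

      int main(void)
      {
          Word16 acc = 0;
          int i;
          for (i = 1; i <= 100; i++)
              acc = add(acc, (Word16)i);  /* 100 weighted operations */
          acc = div_s(acc, 100);          /* 18 more */
          printf("mean = %d, weighted ops = %lu\n", acc, op_count);
          return 0;
      }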
4.2. Software Profiling

   The previously described methods are well-established procedures for measuring computational complexity. Still, they have some drawbacks:

   o Existing algorithms must be modified manually to include instructions that count arithmetic operations. In complex codecs, this may take substantial time.

   o The CPU model is simple, as it does not consider memory access (e.g., caches), parallel execution, or other kinds of optimization that are done in modern microprocessors and compilers. Thus, the number of instructions might not correlate with the actual execution time on modern CPUs.

   Thus, instead of counting instructions manually, run times of the codec can be measured on a real system. In software engineering, this is called profiling. The Wikipedia article on profiling [54] explains profiling as follows:

   "In software engineering, program profiling, software profiling or simply profiling, a form of dynamic program analysis (as opposed to static code analysis), is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

   o A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data

   o An instruction set simulator which is also - by necessity - a profiler, can measure the totality of a program's behaviour from invocation to termination."

   Thus, a typical profiler such as GNU gprof can be used to measure and understand the complexity of a codec implementation. This is realistic precisely because the measurement is taken on a modern computer. However, the execution times depend on the CPU architecture, the PC in general, the OS, and programs running in parallel.

   To ensure repeatable results, the execution environment (i.e., the computer) must be standardized. Otherwise, the run time results cannot be verified by other parties, as the results may differ if measured under slightly changed conditions.
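   In the same spirit, coarse run-time figures can be obtained without a full profiler by timing the encode routine directly. The following C sketch is illustrative; encode_frame() is a hypothetical stand-in for the codec under test, and the measured times are subject to the caveats above.

      #define _POSIX_C_SOURCE 199309L
      #include <stdio.h>
      #include <time.h>

      /* Hypothetical stand-in for the encoder under test. */
      static void encode_frame(const short *pcm, int samples)
      {
          (void)pcm; (void)samples;
      }

      int main(void)
      {
          enum { FRAME = 960, RUNS = 10000 };  /* e.g., 20 ms at 48 kHz */
          static short pcm[FRAME];
          struct timespec t0, t1;
          int i;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < RUNS; i++)
              encode_frame(pcm, FRAME);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("average encode time: %.3f us/frame\n",
                 1e6 * secs / RUNS);
          return 0;
      }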
4.3. Cycle Accurate Simulation

   If reliable and repeatable results are needed, another, similar approach can be chosen. Instead of run times, CPU clock cycles on a virtual reference system can be measured. Quoting Wikipedia again [52]:

   "A Cycle Accurate Simulator (CAS) is a computer program that simulates a microarchitecture cycle-accurate. In contrast an instruction set simulator simulates an Instruction Set Architecture usually faster but not cycle-accurate to a specific implementation of this architecture."

   With a cycle accurate simulator, the execution times are precise and repeatable for the system that is being studied. If two parties make measurements using different real computers, they still get the same results if they use the same CAS.

   A cycle accurate simulator is slower than the real CPU by a factor of about 100. Also, it might have a measurement error as compared to the simulated, real CPU because the CPU is typically not perfectly modeled.

   If an x86-64 architecture shall be simulated, the open-source cycle accurate simulator PTLsim can be considered [55]. PTLsim simulates a Pentium IV. On their website, the authors of PTLsim write:

   "PTLsim is a cycle accurate x86 microprocessor simulator and virtual machine for the x86 and x86-64 instruction sets. PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging from full-speed native execution on the host CPU all the way down to RTL level models of all key pipeline structures."

   Another cycle accurate simulator, FaCSIM, simulates the ARM9E-S processor core and the ARM926EJ-S memory subsystem [36]. It is also available as open source. Texas Instruments also provides a CAS for its C64x+ digital signal processor [44].

   To have a metric that is independent of a particular architecture, the results of cycle accurate simulators could be combined.

4.4. Typical run time environments

   The IIAC codec will run on various different platforms with quite diverse properties. After discussions on the WG mailing list, a few typical run time environments have been identified.

   Three of the run time environments are end devices (aka phones). The first one is a PC, either stationary or portable, having a >2 GHz CPU, >2 GByte of RAM, and a hard disk for permanent storage. Typically, a Windows, MacOS, or Linux operating system is running on a PC. The second one is a smartphone, for example with an ARM11 500 MHz CPU, 192 MByte RAM, and 256 MByte Flash ROM. An example is the HTC Dream smartphone equipped with a Qualcomm MSM7201A chip. Various operating systems are found on those devices, such as Symbian, Android, and iOS. The last one is a high end stationary VoIP phone with, for example, a 275-MHz MIPS32 CPU (400 DMIPS) and a 125-MHz (250 MIPS) ZSP DSP with dual MAC. Both cores have more than 1 MByte RAM and Flash ROM. An exemplary chip is the BCM1103 [3].

   Besides phones, VoIP gateways are frequently needed for conferencing or for transcoding to legacy VoIP or the PSTN. In this case, two different platforms have been identified. The first one is based on standard PC server platforms. It consists, for example, of an Intel six core Xeon 54XX or 55XX, two 1 Gbit/s NICs, 12 GByte RAM, hard disks, and a Linux operating system. Such a server can serve from 400 to 10000 calls, depending on the conference mode, the codecs used, and the ability to use pre-encoded audio [46]. On the other hand, high density, highly optimized voice gateways use a special purpose hardware platform, for example TNETV3020 chips consisting of six TI C64x+ DSPs with 5.5 MB internal RAM. If they run a Telogy conference engine, they might serve about 1300 AMR or 3000 G.711 calls per chip [45].

5. Measuring Latency

   Latency is a measure of the time delay experienced in a system. Latency can be measured as one-way delay or as round-trip time. The latter is the one-way latency from source to destination plus the one-way latency back from destination to source. Latency can be measured at multiple positions, at the network layer or at higher layers [53].

   As we aim to increase the Quality of Experience, the mouth-to-ear delay is of importance because it directly correlates with perceptual quality [17]. More precisely, the acoustic round-trip time shall be a means of optimization when studying interactive and conversational application scenarios.

5.1. ITU-T Recommendation G.114

   The G.114 standard [18] gives guidelines on how to estimate one-way transmission delays. It describes how the delay introduced by the codec is generated.
Because most encoders process audio in frames, the duration of a frame (the "frame size") is the foremost contributor to the overall algorithmic delay. Citing [18]:

   "In addition, many coders also look into the succeeding frame to improve compression efficiency. The length of this advance look is known as the look-ahead time of the coder. The time required to process an input frame is assumed to be the same as the frame length since efficient use of processor resources will be accomplished when an encoder/decoder pair (or multiple encoder/decoder pairs operating in parallel on multiple input streams) fully uses the available processing power (evenly distributed in the time domain). Thus, the delay through an encoder/decoder pair is normally assumed to be:"

   $2*frameSize + lookAhead$

   In addition, if the link speeds are low, the serialization delay might contribute significantly to the codec delay.

   Also, if IP transmissions are used and multiple frames are concatenated in one IP packet, further delay is added. Then, "the minimum delay attributable to codec-related processing in IP-based systems with multiple frames per packet is:"

   $(N+1)*frameSize + lookAhead$

   "where N is the number of frames in each packet."
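   The following C sketch illustrates the G.114 reasoning above; the function and parameter names are ours. With one frame per packet, the formula reduces to the classic 2*frameSize + lookAhead.

      #include <stdio.h>

      /* Minimum codec-related delay in ms, following the G.114
       * reasoning quoted above:
       *    N frames per packet: (N+1)*frameSize + lookAhead
       * With N = 1 this reduces to 2*frameSize + lookAhead. */
      static double codec_delay_ms(double frame_ms, double lookahead_ms,
                                   int frames_per_packet)
      {
          return (frames_per_packet + 1) * frame_ms + lookahead_ms;
      }

      int main(void)
      {
          /* Example: 20 ms frames, 5 ms look-ahead (arbitrary values). */
          printf("1 frame/packet:  %.1f ms\n",
                 codec_delay_ms(20.0, 5.0, 1));  /* 45.0 */
          printf("3 frames/packet: %.1f ms\n",
                 codec_delay_ms(20.0, 5.0, 3));  /* 85.0 */
          return 0;
      }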
5.2. Discussion

   Extensive discussion on the WG mailing list led to the insight that the aforementioned ITU delay model overestimates the delay introduced by the codec. In the last decade, two developments have led to slightly different conditions.

   First, the processing power of CPUs has increased significantly (see Section 4.4). Nowadays, even stand-alone VoIP phones have CPUs with a speed of 300 MHz. They are capable of encoding and decoding faster than real time. Thus, the delay introduced by processing is no longer at 100% of the frame length but significantly lower; for example, it might be just 10% or less.

   Second, even if the CPUs are fully loaded, especially if other tasks such as a video conference or other calls need to be processed as well, advanced scheduling algorithms allow for timely encoding and decoding. For example, a staggered processing schedule can be used to reduce processing delays [45].

   Thus, the impact of processing delay is reduced significantly in most cases.

   Moreover, besides the look-ahead time, the decoder might also contribute to the algorithmic delay, e.g., if decoded and concealed periods shall be mixed well.

6. Measuring Bit and Frame Rates

   For decades, there has been a quest to achieve high quality while keeping the coding rate low. The coding rate, sometimes called the multimedia bit rate, is the bit rate that an encoder produces as its output stream. In the case of variable rate encoding, the coding bit rate differs over time. Thus, one has to describe the coding rate statistically. For example, minimal, mean, and maximal coding rates need to be measured.

   A second parameter is the frame rate, as the encoder produces frames at a given rate. Again, in the case of discontinuous transmission (DTX), the frame rate can vary, and a statistical description is required.

   Both coding and frame rate influence network-related bit rates. For example, the physical layer gross bit rate is the total number of physically transferred bits per second over a communication link, including useful data as well as protocol overhead [51]. It depends on the access technology, the packet rate, and the packet sizes. The physical layer net bit rate is measured in a similar way but excludes the physical layer protocol overhead. The network throughput is the maximal throughput of a communication link of an access network. Finally, the goodput or data transfer rate refers to the net bit rate delivered to an application, excluding all protocol headers, data link layer retransmissions, etc. Typically, to avoid packet losses or queuing delay, the goodput shall be as large as the coding rate.

   The relation between goodput and the physical layer gross bit rate is not trivial. First of all, the goodput is measured end-to-end. The end-to-end path can consist of multiple physical links, each having a different overhead. Second, the overhead of physical layers may vary with time and load, depending for example on link utilization and link quality. Third, packets may be tunneled through the network, and additional headers (such as IPsec) might be added. Fourth, IP header compression might be applied (as in LTE networks), and the overhead might be reduced. Overall, much information about the network connection must be collected to predict the relation between the physical layer gross bit rate and a given coding and frame rate. Applications, which have only a limited view of the network, can hardly know the precise relation.

   For example, the DCCP TFRC-SP transport protocol simply estimates a header size on data packets of 36 bytes (20 bytes for the IPv4 header and 16 bytes for the DCCP-Data header with 48-bit sequence numbers) [7][8]. Further, [11] suggested a typical scenario in which one encoded frame is transmitted over the RTP, UDP, IPv4, and IEEE 802.3 protocols, so that each packet contains packet headers of 12 bytes, 8 bytes, 20 bytes, and 18 bytes, respectively. The gross bit rate is then calculated as

   $r_{gross}=r_{coding}+overhead \cdot framerate$

   where $r_{coding}$ is the coding rate of the encoding, $framerate$ is the frame rate of the codec, $overhead$ is the number of bits for protocol headers in each packet (typically 58*8=464), and $r_{gross}$ is the rate used on the physical medium.
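   As an illustration of this formula, the following C sketch computes the gross bit rate for the RTP/UDP/IPv4/IEEE 802.3 scenario cited above; the function and parameter names are ours.

      #include <stdio.h>

      /* r_gross = r_coding + overhead * framerate, with the overhead
       * given in bits per packet. 58 bytes corresponds to RTP(12) +
       * UDP(8) + IPv4(20) + IEEE 802.3(18) headers, as cited above. */
      static double gross_rate_bps(double coding_bps, double frames_per_s,
                                   int overhead_bytes)
      {
          return coding_bps + 8.0 * overhead_bytes * frames_per_s;
      }

      int main(void)
      {
          /* Example: 32 kbit/s coding rate, one 20-ms frame per packet
           * (i.e., 50 packets/s); the values are arbitrary. */
          double gross = gross_rate_bps(32000.0, 50.0, 58);
          printf("gross rate: %.1f kbit/s\n", gross / 1000.0); /* 55.2 */
          return 0;
      }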
7. Codec Testing Procedures Used by Other SDOs

   To ensure quality, each newly standardized codec is rigorously tested. ITU-T Study Groups 12 and 16 have developed very good and mature procedures for testing codecs. The ITU-T Study Group 12 has described the testing procedures for narrowband and wideband codecs in the ITU-T P.830 standard.

7.1. ITU-T Recommendation P.830

   The ITU-T P.830 recommendation describes methods and procedures for conducting subjective performance evaluations of digital speech codecs. For most applications, it recommends the Absolute Category Rating (ACR) method using the Listening Quality scale. The process of judging the quality of a speech codec consists of five steps, which are described in the following.

   Step 1: Preparation of Source Speech Materials Including Recording of Talkers. When testing a narrowband codec, the recommendation suggests using a bandwidth filter before applying sample items to a codec. This bandwidth filter is called the modified Intermediate Reference System (IRS) and limits the frequency band to the range between 300 and 3400 Hz. In addition, the recommendation states that "if a wideband system (100-7000 Hz) is to be used for audio-conferencing, then the sending end should conform to IEC Publication 581.7."

   It also says that "speech material should consist of simple, short, meaningful sentences." The sentences shall be understandable to a broad audience, and sample items should consist of two or three sentences, each of them having a duration of between 2 and 3 seconds. Sample items should not contain noise or reverberations longer than 500 ms. The recommendation also makes suggestions on the loudness of the signal: "A typical nominal value for mean active speech level (measured according to Recommendation P.56) is -20 dBm0, corresponding to approximately -26 dBov".

   Step 2: Selection of Experimental Parameters to Exercise the Features of the Codec That Are of Interest. Various parameters shall be tested. Those include:

   o Codec conditions

      o Speech input levels ("input levels of 14, 26 and 38 dB below the overload point of the codec")

      o Listening levels ("levels should lie 10 dB to either side of the preferred listening level")

   o Talkers

      . Different talkers ("a minimum of two male and two female talkers")

      . Multiple talkers ("multiple simultaneous voice input signals")

   o Errors ("randomly distributed bit errors" or burst errors)

   o Bit rates ("The codec must be tested at all the bit rates")

   o Transcodings ("Asynchronous tandeming", "Synchronous tandeming", and "Interoperability with other speech coding standards")

   o Mismatch (sender and receiver operate in different modes)

   o Environmental noise (sending) ("30 dB for room noise" and "10 dB and 20 dB for vehicular noise")

   o Network information signals ("signaling tones, conforming to Recommendation Q.35, should be tested subjectively, and the minimum should be proceed to dial tone, called subscriber ringing tone, called subscriber engaged tone, equipment engaged tone, [and] number unobtainable tone.")

   o Music ("to ensure that the music is of reasonable quality")

   o Reference conditions ("for making meaningful comparisons")

      o Direct (no coding, only input and output filtering)

      o Modulated Noise Reference Unit (MNRU)

      o Signal-to-Noise Ratio (SNR) (for comparison purposes)

      o Reference codecs

   Step 3: Design of the Experiment. The considerations described in B.3/P.80 apply here. Typically, it is not possible to test each combination of parameters. Thus, Recommendation P.830 states that "it is recommended that a minimum set of experiments be conducted, which, although they would not cover every combination, would result in sufficient data to make sensible decisions. [...] Extreme caution should be used when comparing systems with widely differing degradations, e.g. digital codecs, frequency division multiplex systems, vocoders, etc., even within the same test."

   Step 4: Selection of a Test Procedure and Conduct of the Experiment. Here, the considerations of B.4/P.80 apply. However, a modified IRS at the receiver shall be used (narrowband) or an IEC Publication 581.7 filter (wideband). Also, "Gaussian noise equivalent to -68 dBmp should be added at the input to the receiving system to reduce noise contrast effects at the onset of speech utterances."

   Step 5: Analysis of Results. Again, the considerations detailed in B.4.7/P.80 apply. The arithmetic mean (over subjects) is to be calculated for each condition at each listening level.
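   In its simplest form, this analysis step amounts to averaging the votes per condition. The following C sketch (an illustrative helper, not taken from P.830 or P.80) computes the arithmetic mean and a normal-approximation 95% confidence interval for one condition's ratings.

      #include <math.h>
      #include <stdio.h>

      /* Arithmetic mean and normal-approximation 95% confidence
       * interval (1.96 * standard error) for one condition's votes.
       * Illustrative analysis helper, not taken from P.830/P.80. */
      static void mos_stats(const double *votes, int n,
                            double *mean, double *ci95)
      {
          double sum = 0.0, var = 0.0;
          int i;
          for (i = 0; i < n; i++) sum += votes[i];
          *mean = sum / n;
          for (i = 0; i < n; i++)
              var += (votes[i] - *mean) * (votes[i] - *mean);
          var /= (n - 1);                 /* sample variance */
          *ci95 = 1.96 * sqrt(var / n);
      }

      int main(void)
      {
          const double votes[] = { 4, 3, 4, 5, 3, 4, 4, 2, 4, 3, 5, 4 };
          double mean, ci;
          mos_stats(votes, 12, &mean, &ci);
          printf("MOS = %.2f +/- %.2f\n", mean, ci);
          return 0;
      }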
7.2. Testing procedure for the ITU-T G.719

   Recently, the ITU-T has standardized the audio and speech codec ITU-T G.719. The G.719 has similar properties to the anticipated IIAC; thus, the optimization and characterization of the G.719 is of particular interest.

   In the following, we describe the "Quality Assessment Test Plan" in TD 322 and 323 [33][35]. The ITU Study Group 16 used ITU-R BS.1116 to test sample items. Audio sample items were sampled at 48 kHz and mixed down to mono. Speech sample items contained one sentence with a duration of 4 s, mixed content had a duration of 5-6 s, and music a duration of between 10 and 15 s. The beginnings and endings of the samples were smoothed. Also, a filter was applied to limit the nominal bandwidth of the input signal to the range of 20 to 20000 Hz. As mixed content, advertisements, film trailers, and news (including a jingle) were selected. For music items, classical and modern styles of music were selected. Besides the codec under test, test stimuli degraded with LAME MP3 and G.722 were added to the tests. Some test stimuli were modified to include reverberations or an interfering talker and office noise. Some tests studied the effect of a frame erasure rate of 3% with random loss patterns. All listening labs used different sample items, and attention was paid to not using the same material twice.

   Listening labs were required to provide the results of 24 experienced listeners, excluding those listeners who did not pass a pre- and post-screening. The experienced listeners should "neither have a background in technical implementations of the equipment under test nor do they have detailed knowledge of the influence of these implementations on subjective quality".

   During the tests, "circumaural headphones (open back, for example: STAX Signature SR-404 or Sennheiser HD-600) on both ears (diotic presentation)" were used. The listening levels were -26 dB relative to OVL.

   Some results of the listening tests are given in TD 341 R1 [34]. In those tests, the subjective ratings made following BS.1116 were also compared with the objective ratings of ITU-R BS.1387-1. The correlation between objective and subjective ratings was below R=0.9.

8. Transmission Channel

   Between the speech encoder and decoder lies a transmission channel that affects the transmission. For cellular or wireless phones, the typical transmission channel is assumed to be equal to the wireless link(s). This typically means that a circuit-switched link is assumed (e.g., in GSM, UMTS, DECT). The bandwidth is typically constant in DECT and GSM, or variable in a given range depending on the quality of the wireless transmission (UMTS). Bit errors do occur, but they are not equally distributed if unequal bit error protection is applied (UMTS).

   In the case of the IIAC codec, the transmission channel is the Internet. More precisely, it is the packet transmission over the Internet, plus the transport protocol (e.g., UDP, TCP, DCCP), plus potentially Forward Error Correction, plus dejittering buffers.

   Also, the transmission channel is reactive.
It changes its properties depending on how much data is transmitted. For example, parallel TCP flows reduce their transmission bandwidth in the presence of an unresponsive UDP stream.

   Overall, one can say that the transmission channel "Internet" is difficult to understand. Thus, in this chapter, we try to shed light on the question of what types of transmission channels a codec has to cope with.

8.1. ITU-T G.1050: Network Model for Evaluating Multimedia Transmission Performance over IP (11/2007)

   The current ITU-T G.1050 standard [20] describes layer 3 packet transmission models that can be used to evaluate IP applications. The models are of a statistical nature. They consider network architectures, types of access links, QoS-controlled edge routing, MTU sizes, network faults, link failures, route flapping, reordered packets, packet loss, one-way delay, variable delays, and background traffic.

   G.1050 is a network model consisting of three parts: LAN a, LAN b, and an interconnection core. Both LANs can have different rates and occupancies and can be of different types. LAN and core are connected via access technologies, which might vary in data rate, occupancy, and MTU size.

   The core is characterized by route flapping, link failures, one-way delay, jitter, packet loss, and reordered packets. Route flaps are repeated changes of the transmission path caused by alternating routing tables. These routing updates cause incremental changes in the transmission delays. A link failure is a period of consecutive packet loss. Packet losses can be bursty, having a high loss rate during bursts and a lower loss rate otherwise. Delays are modeled via multiple different jitter models supporting delay spikes, random jitter, and filtered random jitter.

   The standard recommends three profiles, named "Well-managed IP network", "Partially-managed IP network", and "Unmanaged IP Network, Internet", which differ in their connection qualities.

   Limitations of these models are the missing cross-correlation between packet delays and packet loss events, the lack of responsiveness to the test application's flow, and the lack of link qualities that vary with time.
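   Such bursty loss behavior is often approximated with a two-state Gilbert-Elliott model. The following C sketch is such an approximation for illustration only; it is not the G.1050 model itself, and all probabilities are made-up example values.

      #include <stdio.h>
      #include <stdlib.h>

      /* Two-state Gilbert-Elliott loss model: a GOOD state with a low
       * loss probability and a BAD (burst) state with a high one.
       * Illustrative approximation of bursty loss, not G.1050. */
      struct ge_model {
          double p_good_to_bad, p_bad_to_good;  /* transition probs */
          double loss_good, loss_bad;           /* per-state loss probs */
          int bad;                              /* current state */
      };

      static int ge_packet_lost(struct ge_model *m)
      {
          double u = rand() / (RAND_MAX + 1.0);
          if (m->bad) { if (u < m->p_bad_to_good) m->bad = 0; }
          else        { if (u < m->p_good_to_bad) m->bad = 1; }
          return (rand() / (RAND_MAX + 1.0))
                 < (m->bad ? m->loss_bad : m->loss_good);
      }

      int main(void)
      {
          /* Example parameters (made up): rare bursts with 30% loss. */
          struct ge_model m = { 0.01, 0.30, 0.005, 0.30, 0 };
          int lost = 0, n = 100000;
          for (int i = 0; i < n; i++)
              lost += ge_packet_lost(&m);
          printf("overall loss rate: %.2f %%\n", 100.0 * lost / n);
          return 0;
      }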
[Figure 2 is a plot of the probability density of the achieved
per-flow throughput on a logarithmic axis from 0.1 to 10000 kbps,
with one curve for eDonkey flows and one for HTTP flows.]

Figure 2 Achieved throughput of flows measured for eDonkey and HTTP
applications [37]

Figure 2 displays the throughput distribution of TCP connections for
eDonkey peer-to-peer and HTTP applications. It only considers single
flows with a length of more than 50 Kbyte. Typically, however, a web
browser uses two to three TCP connections at the same time and an
eDonkey client about 10. Still, the throughput of a single HTTP flow
is about an order of magnitude higher than that of an eDonkey flow.
In [37], the authors assume this is due to the fact that peer-to-
peer connections fill the uplink and that HTTP is used on the faster
downlink.

[Figure 3 is a plot of the probability density of TCP roundtrip
times on a logarithmic axis from 10 to 10000 ms.]

Figure 3 TCP roundtrip times [36]

Figure 3 displays TCP roundtrip times including both access and
backbone networks. Both graphs can be seen as an indication for the
assumption that an application, even in modern Internet access
networks, might be subjected to a wide variability of throughput
ranging from a few kbit/s up to 10 Gbit/s and to TCP round trip
times from 5 ms up to several seconds.

Although these results are only valid for TCP, similar results
should be expected for RTP over UDP - with a small advantage because
UDP flows are not always responsive.

In summary, a codec for the Internet should be able to work under
these widely varying transmission conditions and should be tested
against a wide distribution of expected throughputs.
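As an illustration of what "a wide distribution of expected
throughputs" might mean for a test harness, the following minimal
Python sketch draws transmission conditions from wide, heavy-tailed
distributions. The log-normal shape and all parameters are
illustrative assumptions only; they are not fitted to the
measurements of [36][37].

<CODE BEGINS>
# Minimal sketch: draw test conditions from heavy-tailed
# distributions spanning several orders of magnitude.  The
# log-normal parameters are illustrative assumptions, NOT values
# fitted to the measurements cited above.
import random

def sample_condition(rng):
    # Throughput in kbps: wide log-normal (median ~150 kbps here).
    throughput_kbps = rng.lognormvariate(mu=5.0, sigma=2.5)
    # Round-trip time in ms: log-normal (median ~55 ms here).
    rtt_ms = rng.lognormvariate(mu=4.0, sigma=1.0)
    return throughput_kbps, rtt_ms

rng = random.Random(42)          # fixed seed for repeatable tests
conditions = [sample_condition(rng) for _ in range(100)]
for tp, rtt in conditions[:3]:
    print("throughput %10.1f kbps, RTT %7.1f ms" % (tp, rtt))
<CODE ENDS>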
8.4. Transmission Variability on the Internet

Besides effects such as route flapping or link failures modeled in
G.1050 [20], the Internet experiences sharp changes in bandwidth
utilization on short time scales. For example, [49] and [38] showed
that the variability of Internet traffic comes in the form of spike-
like traffic increments. Similarly, [32] studied why Internet
traffic is bursty on time scales of between 100 and 1000
milliseconds.

In the light of these results, one can assume that the IIAC's
transmission conditions will vary on similar time scales. More
precisely, it will be subjected to

. variability due to bursty traffic having a duration of between
  100 and 1000 milliseconds,

. interruptions due to temporary link failures every minute to
  every hour that might last from 64 ms to several seconds [20],
  and

. route flap events every minute to every hour that add a delay of
  between 2 and 128 ms [20].

8.5. The Effects of Transport Protocols

Realtime multimedia is not always transported over RTP and UDP.
Sometimes it makes sense to use a different transport protocol or an
additional rate adaptation. The reasons for that are manifold.

. If a scalable codec shall be supported, RTCP-based feedback
  information can be utilized to implement a rate control mechanism
  [41]. However, RTCP-based feedback suffers from the drawback that
  RTCP messages are allowed only every 5 s. Thus, implementing a
  fast responding mechanism is not possible.

. In the presence of restrictive firewalls, VoIP can sometimes only
  be transmitted over TCP. In those cases, the transmission
  scheduling is not given by the codec but by TCP. TCP algorithms
  typically do not have a smooth sending rate but frequently send
  packets in bursts and change the number of packets sent every
  round trip time (Figure 4; a simulation sketch of this behavior
  follows after Figure 5). More precisely, TCP causes the sending
  schedule to behave in the following way:

  . During the Slow Start phase (for example at the beginning of a
    TCP connection), the transmission rate increases exponentially.

  . If a TCP segment is not acknowledged after about four RTTs, the
    TCP sending rate starts at one packet per RTT again.

  . During congestion avoidance, the sending rate increases
    steadily by one segment per RTT.

  . If a congestion event is detected, the sending rate is reduced
    by 50%.

[Figure 4 is a plot of the TCP sending rate in packets per RTT over
time (0 to 60 RTTs): the rate repeatedly climbs from one packet per
RTT in a sawtooth pattern up to about 15 packets per RTT and then
drops.]

Figure 4 Sending rate of a standard TCP over time

. The DCCP transport protocol supports multiple congestion control
  protocols and provides means to support TCP friendliness without
  retransmission. Thus, it is suitable for realtime multimedia
  transmissions. DCCP supports a TCP emulation, which shows a
  similar rate over time as TCP, and the TFRC congestion control,
  which changes its rate in a smoother way (Figure 5).
  Besides TFRC, which is intended for packets of maximal size (aka
  MTU), TFRC-SP is optimized for flows with variable packet sizes
  such as VoIP. With TFRC-SP, smaller packets can be transmitted at
  a faster pace than larger packets because they contribute less to
  the gross bandwidth consumption.
  The TFRC protocol might provide a lower bandwidth and thus a
  lower QoE than UDP or TCP unless proper optimizations are applied
  (see [48]). Also, it is suggested to limit the rate control to
  100 packets per second. This limit might be too low for an IIAC.

[Figure 5 is a plot of the TFRC sending rate in packets per RTT over
time (0 to 60 RTTs): after an initial slow start, the rate
oscillates smoothly around a level of about 10 packets per RTT.]

Figure 5 Sending rate of the TFRC protocol
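To make the sawtooth of Figure 4 concrete, the following minimal
Python sketch simulates the idealized TCP behavior enumerated in the
list above (slow start, congestion avoidance, halving on congestion,
restart after a timeout). It is a toy model for generating test
schedules, not an implementation of any real TCP stack; the
congestion and timeout events in the example are arbitrary
assumptions.

<CODE BEGINS>
# Toy model of the idealized TCP sending rate per RTT, as
# described in the bullet list above.  Purely illustrative; the
# loss/timeout events below are arbitrary assumptions.
def tcp_rate_per_rtt(rtts, losses=(), timeouts=(), ssthresh=8.0):
    rates = []
    cwnd = 1.0
    slow_start = True
    for t in range(rtts):
        rates.append(int(cwnd))
        if t in timeouts:
            # Segment not acknowledged: restart at one packet/RTT.
            ssthresh = max(2.0, cwnd / 2)
            cwnd, slow_start = 1.0, True
        elif t in losses:
            # Congestion event detected: reduce the rate by 50%.
            cwnd = max(1.0, cwnd / 2)
            slow_start = False
        elif slow_start:
            # Slow start: exponential increase up to ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
            if cwnd >= ssthresh:
                slow_start = False
        else:
            # Congestion avoidance: plus one segment per RTT.
            cwnd += 1.0
    return rates

# Example: congestion at RTTs 12 and 22, a timeout at RTT 17.
print(tcp_rate_per_rtt(30, losses={12, 22}, timeouts={17}))
<CODE ENDS>

Such a schedule can drive a channel simulator that tells the codec,
RTT by RTT, how many packets it may send.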
In general, the transport protocol has a clear influence on the
transmission conditions. Coding rates need to be adapted both
sharply and smoothly to changing bandwidth estimates. Changes of the
bandwidth estimate may occur every RTT. Also, in case of a TCP
timeout, the transmission is halted and the decoding must be
stalled.

8.6. The Effect of Jitter Buffers and FEC

Both jitter buffers and FEC trade frame losses against delay. In the
case of a jitter buffer, frames are delayed before playout. This
helps with late-arriving frames that would otherwise be ignored and
would have to be concealed. Jitter buffers are adaptive and adjust
dynamically to the current delay and loss process on the Internet.

Forward Error Correction helps to cope with isolated losses, as
redundant speech frames are transmitted in the following packets. In
the presence of loss, FEC increases the delay because the receiver
has to wait for the following packets. Both delay and packet losses
are important contributors to the overall Quality of Experience [2].

Since the delay process on the Internet often follows a gamma
distribution, a statistical monitor of past delays helps to predict
the size of future jitter. Then, if the playout schedule does not
match the predicted loss process, playout can be accelerated or
slowed down.
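As an illustration of such a statistical monitor, the following
Python sketch estimates the playout delay as a high percentile of
recently observed network delays. The window size, target
percentile, and initial guess are assumptions chosen for
illustration; a real jitter buffer would additionally handle
reordering, clock skew, and the concealment cases discussed below.

<CODE BEGINS>
# Minimal sketch of a percentile-based playout-delay estimator.
# Window size and target percentile are illustrative assumptions,
# not values taken from any standard.
from collections import deque

class PlayoutDelayEstimator:
    def __init__(self, window=500, percentile=0.99):
        self.delays = deque(maxlen=window)  # recent one-way delays, ms
        self.percentile = percentile

    def observe(self, delay_ms):
        self.delays.append(delay_ms)

    def playout_delay(self):
        # Play out late enough that only (1 - percentile) of frames
        # miss their deadline and must be concealed.
        if not self.delays:
            return 60.0                     # arbitrary initial guess
        ordered = sorted(self.delays)
        idx = min(len(ordered) - 1,
                  int(self.percentile * len(ordered)))
        return ordered[idx]

est = PlayoutDelayEstimator()
for d in (42.0, 45.5, 41.0, 95.0, 44.2):    # made-up delay samples
    est.observe(d)
print("buffer target: %.1f ms" % est.playout_delay())
<CODE ENDS>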
However, due to the reasons described in Section 8.4, not all
increments in transmission time might be predictable. This has a
profound effect on the jitter buffer, as it cannot predict well
whether a frame is lost or whether it is merely delayed. If a frame
is scheduled for playout but has not been received, the jitter
buffer has to consider two cases. First, the frame is lost and has
to be concealed. This typically means that the audio signal needs to
be extrapolated or interpolated to conceal the gap due to the lost
frame. Second, the frame is delayed and shall be played out at a
later point in time. Then, the resulting gap in playout must be
concealed by extrapolating the previous audio signal.

These issues have an effect on testing the concealment algorithm of
the codec. The same concealment function must be tested against both
time gap concealment and loss concealment.

8.7. Discussion

Judging a codec's performance using a realistic model of a
transmission channel is difficult. Good models of IP transmission
channels are available. However, before a codec can be tested
against those channels, further building blocks such as the
transport protocol, the jitter buffer, and FEC should be known - at
least roughly.

Alternatively, a codec can be tested against packet loss patterns
only, without considering any rate adaptation or playout
rescheduling. But then again, the codec should additionally be
tested for those impairments which occur due to the dynamics of the
Internet. These include

o slowing down and speeding up the playout in cases of moderate
  rescheduling of playout times,

o stalling and resuming the playout in cases of temporary link
  outages,

o moderately reducing and increasing bit and frame rates during
  contention periods,

o sharply reducing (in case of congestion) and quickly increasing
  (during connection establishment) bit and frame rates, and

o time gap and loss concealment.

9. Usage Scenarios

Quality of Experience is the service quality perceived subjectively
by end-users (refer to Section 2), and as the ITU-T document G.RQAM
[21] states, the "overall acceptability may be influenced by user
expectations and context". Thus, in this section we describe the
usage scenarios in which the IIAC codec will probably be used and
the expectations users have in those communication contexts. We list
seven main scenarios and describe their quality requirements.

9.1. Point-to-point Calls (VoIP)

The classic scenario is that of phone usage, to which we will refer
in this document as Voice over IP (VoIP). Human speech is
transmitted interactively between two Internet hosts. Typically,
besides speech, some background noise is present, too.

The quality of a telephone call is traditionally judged by
subjective tests such as those described in [24]. The ACR scale used
for MOS-LQS sometimes might not be very suitable for high quality
calls; then - for example - the MUSHRA [16] rating can be applied.

A telephone call is considered good if it has a maximal mouth-to-ear
delay of 150 ms [17] and a speech quality of MOS-LQS 4 or above.
However, interhuman communication is still possible if the mouth-to-
ear delay is much larger.

The effect of delay jitter might not be very noticeable in the case
of speech. Thus, playout rescheduling can take place often.

In many cases, phone calls are made between mobile devices such as
cordless and cellular phones. In these cases, energy consumption is
crucial, and both complexity and transmission rate may be reduced to
save resources.

9.2. High Quality Interactive Audio Transmissions (AoIP)

In this scenario we consider a telephone call having a very good
audio quality at modest acoustic one-way latencies ranging from 50
to 150 ms [17], so that music can be listened to over the telephone
while two persons are talking interactively.

While delay expectations might be similar to those of classic
telephony, the audio quality must meet similar standards as those of
consumer Hifi equipment like MP3 and CD players, iPods, etc.

If music is played, playout rescheduling events may easily be heard
as the rhythm changes. Only a few studies such as [10] have been
conducted to examine the effect of time-varying delays on service
quality. In general, it can be assumed that the requirements
regarding the constancy of playout schedules are higher than in the
case of speech because human beings notice rhythmic changes easily.
Thus, in the presence of music, frequent playout rescheduling shall
be avoided.
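The delay budgets cited in Sections 9.1 and 9.2 can be related to a
rating scale via the E-model [17]. As a rough illustration, the
following Python sketch evaluates a widely used piecewise-linear
simplification of the E-model's delay impairment factor Id (the
approximation popularized by Cole and Rosenbluth); it is an
approximation for illustration only, not a replacement for the full
computation of [17].

<CODE BEGINS>
# Sketch of a well-known simplified approximation of the E-model
# delay impairment factor Id (Cole/Rosenbluth approximation of
# ITU-T G.107 [17]).  For illustration only.
def delay_impairment(d_ms):
    # d_ms: one-way mouth-to-ear delay in milliseconds.
    ii = 0.024 * d_ms
    if d_ms > 177.3:
        ii += 0.11 * (d_ms - 177.3)
    return ii

for d in (50, 150, 177, 300, 500):
    print("delay %3d ms -> Id = %5.1f R-factor points"
          % (d, delay_impairment(d)))
<CODE ENDS>

At 150 ms the impairment is still small (under 4 points on the
0-100 R scale), which is consistent with the 150 ms budget cited
above; beyond roughly 177 ms the impairment grows much faster.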
9.3. High Quality Teleconferencing

Also, for today's teleconferencing and videoconferencing systems
there is a strong and increasing demand for audio coding providing
the full human auditory bandwidth of 20 Hz to 20 kHz. This rising
demand for high quality audio is due to the following reasons:

o Conferencing systems are increasingly used for more elaborate
  presentations, often including music and sound effects, which
  occupy a wider audio bandwidth than that of speech. For example,
  Web conferencing services such as WebEx, GoToMeeting, and Adobe
  Acrobat Connect are based on an IP-based transmission.

o The new "Telepresence" videoconferencing systems (such as those
  from Cisco) provide the user with High Definition video and audio
  quality and create the experience of being in the same room by
  introducing high quality media delivery.

o The emerging Digital Living Rooms are to be interconnected and
  might require a constant, high quality acoustic transmission.

o Spatial audio teleconference solutions increase the quality
  because they take advantage of the cocktail-party effect. By
  taking advantage of 3D audio, participants can be identified by
  their location in a virtual acoustic environment, and multiple
  talkers can be distinguished from each other. However, these
  systems require stereo audio if the spatial audio is rendered for
  headphones.

9.4. Interconnecting to Legacy PSTN and VoIP (Convergence)

This scenario includes the use case of using a VoIP-PSTN gateway to
connect to legacy telephone systems. In those cases, the gateway
would make an audio conversion from broadband Internet voice to the
frugal 1930's 3.1 kHz audio bandwidth.

The quality requirements in this scenario are low because the legacy
PSTN typically uses narrow-band voice. Also, in those cases one
might expect that the codec negotiation will decide on a common
codec for both PSTN and VoIP in order to avoid transcoding.

However, the complexity requirements might be stringent because
central media gateways must scale to a high number of users. In this
context, hardware costs are an important criterion and the codec has
to operate efficiently.

9.5. Music streaming

Music streaming typically does not require low delays. However, in
special cases such as live events and in the presence of alternative
transmission technologies, low-delay streaming may be demanded.

Examples are important sport events, which are streamed both on
terrestrial (analogue), low-delay broadcast networks and on IP-based
distribution networks. Users of the latter become aware of events
(such as when a footballer scores) later than their neighbors using
terrestrial technology.

9.6. Ensemble Performances over a Network

In some usage scenarios, users want to act simultaneously and not
just interactively. For example, if persons sing in a chorus, if
musicians jam, or if e-sportsmen play computer games in a team
together, they need to communicate acoustically.

In this scenario, the latency requirements are much harder than for
interactive usages. For example, if two musicians are placed more
than 10 meters apart, they can hardly stay synchronized. Empirical
studies [10] have shown that if ensembles play over networks, the
optimal acoustic latency is at around 11.5 ms, with a targeted range
from 10 to 25 ms.
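The 10 meter rule of thumb matches these numbers: in air, sound
travels at roughly $c \approx 343\,\mathrm{m/s}$, so a separation of
$d = 10\,\mathrm{m}$ corresponds to an acoustic delay of

$t = d / c = 10 / 343 \approx 0.029\,\mathrm{s} = 29\,\mathrm{ms}$,

which is already above the targeted range of 10 to 25 ms.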
Also, the users demand very high audio quality, very low delay, and
very few playout rescheduling events.

9.7. Push-to-talk like Services (PTT)

In spite of the development of broadband access (xDSL), many users
only have service access via PSTN modems or mobile links. Also, on
these links the available bandwidth might be shared among multiple
flows and is subject to congestion. Then, even low coding rates of
about 8 kbps are too high.

If transmission capacity hardly exists, one can still degrade the
quality of a telephone call to something like a push-to-talk (PTT)
service having very high latencies. Technically, this scenario takes
advantage of bandwidth gains due to discontinuous transmission (DTX)
modes and very large packets containing multiple speech frames,
causing a very low packetization overhead.

The quality requirements of a push-to-talk like service have hardly
been studied. The OMA lists as requirements for a Push-to-talk over
Cellular service a transmission delay of 1.6 s and a MOS value of
above 3.0 that typically should be kept [39]. However, as long as an
understandable transmission of speech is possible, the delay can be
even higher. For example, [39] allows a delay of typically up to 4 s
for the first talk-burst. Also, [39] describes a maximum duration of
speaking. If a speaking participant reaches the time limit, the
participant's right to speak shall be automatically revoked.

If the quality of a telephone call is very low, then instead of
listening-only speech quality, the degree of understandability can
be chosen as the performance metric. For example, objective tests of
understandability use automatic speech recognition (ASR) systems and
measure the number of correctly detected words.

In any case, the participant shall be informed about the quality of
the connection, the presence of high delays, the half-duplex style
of communication, and his or her (limited) right to speak. For
example, this can be achieved by a simulated talker echo.

9.8. Discussion

The requirements of the usage scenarios are summarized in the
following table.

              |    Sound Quality   |      Latency       | Complexity
Scenario      | low | avg. | hifi  | 10ms | 150ms| high | low | high
--------------+-----+------+-------+------+------+------+-----+-----
VoIP          |  X  |      |       |      |  X   |      |  X  |  X
AoIP          |     |  X   |   X   |      |  X   |      |     |  X
Conference    |     |  X   |       |      |  X   |      |     |  X
Convergence   |  X  |      |       |      |  X   |      |  X  |  X
Streaming     |     |  X   |   X   |      |      |  X   |     |  X
Performances  |     |      |   X   |  X   |      |      |     |  X
Push-To-Talk  |  X  |      |       |      |      |  X   |  X  |  X

Figure 6 Different requirements for different usage scenarios

10. Recommendations for Testing the IIAC

The IETF IIAC differs substantially from a classic narrowband or
wideband codec. Thus, the previously applied codec testing
procedures such as ITU-T P.830 cannot be adopted in their entirety.
Instead, one must check carefully which of the procedures can be
used without changes, which can be used with minor changes, and
which have to be dropped or replaced.

In Section 1 we listed five groups of stakeholders, which have
different requirements and demands on how to test the quality of an
IIAC.
In the following, we recommend testing procedures for those
stakeholders.

10.1. During Codec Development

Codec development is an innovative process. In general, innovation
and research benefit from openness and discussion between experts.
Thus, formal restrictions on how to test the codec might hinder the
codec development, because innovation may also take place in the
testing procedures. Instead, many experts both in codec development
and codec usage shall be able to participate. If this is the case,
they contribute their expertise, identify weaknesses, and discuss
potential codec enhancements. During innovation, openness in
participation and discussion is very fruitful and leads to good
results.

Based on their ongoing experience, codec developers know best how to
test their codecs. Typically, those tests include informal testing,
semiformal testing, and expert interviews. They are intended to find
weaknesses in the codec, to identify artifacts or distortions, and
to achieve algorithmic progress.

10.2. Characterization Phase

The characterization phase is intended to study the features, the
quality tradeoffs, and the properties of a codec under
standardization. It is intended to be an objective measure of the
codec's quality in order to convince third parties of the quality
properties of the standardized codec. To achieve this aim, a formal
testing procedure has to be established.

In general, we recommend basing the procedure of the
characterization phase on procedures similar to those that were used
for the G.719 standardization (Section 7.2 and especially [35]). In
the following, we describe the suggested testing procedure for the
characterization phase.

10.2.1. Methodology

The testing of sound quality can be done using MUSHRA tests with
eight samples and three anchors. One anchor is the known reference,
the second one is a hidden reference, and the third one is the
hidden anchor, for which it is suggested to use a band-limited
signal with a low-pass filter at 3.5 kHz. However, because a wide
range of qualities is to be tested, ranging from Hifi down to toll
quality, it is beneficial to add a further low quality anchor such
as a 3.5 kHz bandwidth sample distorted by modulated noise (MNRU)
[25], for example with an MNRU strength of Q=25 dB, which
corresponds to a MOS value of 1.79 [4].
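For orientation, the following Python/NumPy sketch shows the core
principle of the MNRU degradation of P.810 [25]: the signal is
multiplied by noise attenuated Q dB below the signal, yielding
signal-correlated noise. This is a simplified illustration; the
normative MNRU additionally prescribes filtering and level
alignment, for which the ITU-T G.191 software tools [19] should be
used when preparing actual test material.

<CODE BEGINS>
# Simplified sketch of the MNRU principle of P.810 [25]:
# multiplicative, signal-correlated noise Q dB below the signal.
# Not the normative MNRU; use the G.191 tools [19] for real tests.
import numpy as np

def mnru(x, q_db, rng=np.random.default_rng(0)):
    # x: input samples (float array); q_db: noise attenuation in dB.
    noise = rng.standard_normal(len(x))      # unit-variance noise
    return x * (1.0 + 10.0 ** (-q_db / 20.0) * noise)

# Example: degrade one second of a 1 kHz tone at 48 kHz with Q=25 dB.
t = np.arange(48000) / 48000.0
degraded = mnru(0.5 * np.sin(2 * np.pi * 1000 * t), q_db=25.0)
<CODE ENDS>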
10.2.2. Material

Reference samples should be 48 kHz sampled, stereo channel material.
The nominal bandwidth of the reference samples shall be limited to
the range of 20 to 20000 Hz. Three different kinds of content shall
be tested: speech, music, and mixed content.

Speech samples shall include different languages, including English
and tonal languages. The speech samples shall be recorded in a quiet
environment without background noise or reverberation. The speech
samples shall contain one meaningful sentence having a length of
about 4 s.

Music samples shall contain a wide variety of music styles including
classical music, pop, jazz, and single instruments. The length of
the samples shall be between 10 and 15 s. A smoothing of 100 ms both
at the beginning and at the end shall be conducted, if required.

Mixed content may contain advertisements, film trailers, news with
jingles, and other mixtures of speech, music, and noises. The length
may be about 5-6 s.

10.2.3. Listening Laboratory

Multiple independent laboratories shall conduct the listening tests.
They are responsible for generating or selecting reference samples
as well as for the pre- and post-screening of subjects. In the end,
the results of about 24 experienced listeners shall be published (in
addition to the samples).

The tests must be conducted in a quiet listening environment at
about NC25 (approximately 35 dBA). For example, an ISOBOOTH room can
be used.

It is recommended to use a high quality D/A converter, such as a
Benchmark DAC, Metric Halo ULN-2, or Apogee MiniDAC. High quality
headphone amplifiers and playback level calibration shall be used.
Playback levels might be measured via Etymotic in-ear microphones.
Also, high quality headphones (e.g., AKG 240DF, Sennheiser HD600)
are advisable.

10.2.4. Degradation Factors

The IIAC is likely to be highly configurable. However, due to time
limits, only a few parameter sets can be tested subjectively. Thus,
we recommend subjective studies with

o different bit rates (from low to high, 5 tests),

o different frame rates (from low to high, 2 tests),

o different loss patterns (G.1050 profiles A, B, and C at a low
  rate with speech content and at a high rate with music content;
  the influence of jitter, delay, and link failures shall be
  ignored; in total, this would be 6 tests),

o different sample contents:

  o speech, speech+reverberation, and speech+noise+reverberation at
    low and medium rates (3 tests),

  o speech samples tested in different languages (English,
    Chinese, ...) and with male/female voices (6 tests),

  o mixed content and music tested at medium and high rates (about
    10 tests),

o a low complexity mode, DTX, and the FEC mode tested at low rates
  because they are typically used on constrained devices (3 tests),

o abrupt changes in bit and frame rates (reduction by half,
  exponential start, 2 tests),

o smooth changes of bit and frame rates (incrementing or decreasing
  the codec's gross rate by 1.5 kbyte every 100 ms, 2 tests),

o stall and continue operations (20, 200, and 1000 ms, 3 tests),

o accelerated and slowed-down playout (+-10% for speech at low
  rates), and

o reference codecs such as LAME MP3, G.719, and AMR, each at two
  coding rates (6 tests).

Already, these are 48 different tests that need to be conducted.

In addition, for intermediate values objective tests shall be run
using PEAQ (for music) and P.OLQA (for speech). These intermediate
results shall be mapped onto the MUSHRA scale with a quadratic
regression because PEAQ and P.OLQA use an ODG and a MOS scale,
respectively.
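A minimal sketch of such a mapping is given below: a quadratic
polynomial is fitted on conditions for which both a subjective
MUSHRA score and an objective score are available and is then
applied to the remaining conditions. All score values in the example
are made up for illustration.

<CODE BEGINS>
# Sketch: map objective scores (e.g., PEAQ ODG) onto the MUSHRA
# scale with a quadratic regression.  All values are made-up
# examples, not measurement data.
import numpy as np

# Conditions rated both subjectively (MUSHRA, 0..100) and
# objectively (here: PEAQ ODG, 0 .. -4).
odg    = np.array([-0.2, -0.8, -1.5, -2.4, -3.3])
mushra = np.array([92.0, 78.0, 60.0, 41.0, 22.0])

coeff = np.polyfit(odg, mushra, deg=2)   # quadratic fit
mapping = np.poly1d(coeff)

# Predict MUSHRA scores for intermediate, unrated conditions.
print(mapping(np.array([-0.5, -1.0, -2.0])))
<CODE ENDS>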
10.3. Application Developers

Application developers can take advantage of the results of the
qualification phase. They may use the results to develop a quality
model, which describes the expected quality of the codec for a given
parameter set (refer to [11] for an example).

In addition, they can test their systems using the draft G.1050
simulation model, which is especially useful for optimizing rate
control, dejittering buffers, and concealment algorithms. Different
systems may be tested with quality models, subjective listening
tests, conversational tests, or with objective measures such as
P.OLQA.

Also, field tests may be conducted to test the effect of a real
network on the VoIP application.

10.4. Codec Implementers

To test the conformance of a codec, codec implementers can use
objective tools like PEAQ or P.OLQA to see whether a newly
implemented codec performs similarly to the reference
implementation. These tests shall be done for many different
parameter sets.

10.5. End Users

End users may be included in the qualification tests. The intentions
of these tests are twofold. First, the awareness of the end-users
shall be increased. Second, querying users may be a cost-effective
way of conducting listening-only tests.

However, before the rating results of end users can be considered
for further usage, one needs to compare formal and web-based testing
results to see to what extent they differ from each other.

11. Security Considerations

The results of the quality tests shall be convincing. Thus, special
care has to be taken to make the tests precise, accurate,
repeatable, and trustworthy.

Some testing houses may have a conflict of interest between accurate
quality ratings and the promotion of their own codecs. Thus, a high
degree of openness shall be enforced, requiring all of the testing
material and results to be published. This way, others may verify
the results of the testing houses. In addition, some stimuli shall
be tested by all testing houses to compare their quality of rating.

Moreover, hidden anchors may help to identify subjects who rate the
quality of samples less precisely.

12. IANA Considerations

This document has no actions for IANA.

13. References

13.1. Normative References

13.2. Informative References

[1]  R. Birke, M. Mellia, M. Petracca, D. Rossi, "Understanding VoIP
     from Backbone Measurements", IEEE INFOCOM 2007, 26th IEEE
     International Conference on Computer Communications,
     pp. 2027-2035, May 2007.

[2]  C. Boutremans, J.-Y. Le Boudec, "Adaptive joint playout buffer
     and FEC adjustment for Internet telephony", IEEE INFOCOM 2003,
     Twenty-Second Annual Joint Conference of the IEEE Computer and
     Communications Societies, vol. 1, pp. 652-662, 30 March -
     3 April 2003.

[3]  Broadcom, "BCM1103: Gigabit IP Phone Chip", Jan. 2005,
     http://www.datasheetcatalog.org/datasheet2/3/07ozspx224dsarq6zu13i2ofyqyy.pdf

[4]  N. Cote, V. Koehl, V. Gautier-Turbin, A. Raake, S. Moeller,
     "Reference Units for the Comparison of Speech Quality Test
     Results", Audio Engineering Society Convention 126, May 2009.

[5]  Ericsson, "Analysis of PEAQ's applicability in predicting the
     quality difference between alternative implementations of the
     G.722.1FB coding algorithm", ITU-T SG 12, received on
     2008-05-09, related to question(s): Q9/12, meeting 2008-05-22.

[6]  ETSI TC-TM, "ETR 250: Transmission and Multiplexing (TM);
     Speech communication quality from mouth to ear for 3,1 kHz
     handset telephony across networks", ETSI Technical Report,
     July 1996.

[7]  S. Floyd, E.
     Kohler, "Profile for Datagram Congestion Control Protocol
     (DCCP) Congestion ID 4: TCP-Friendly Rate Control for Small
     Packets (TFRC-SP)", RFC 5622, August 2009.

[8]  S. Floyd, E. Kohler, "TCP Friendly Rate Control (TFRC): The
     Small-Packet (SP) Variant", RFC 4828, April 2007.

[9]  J. Gruber, G. Williams, "Transmission Performance of Evolving
     Telecommunications Networks", Artech House, 1992.

[10] M. Gurevich, C. Chafe, G. Leslie, S. Tyan, "Simulation of
     Networked Ensemble Performance with Varying Time Delays:
     Characterization of Ensemble Accuracy", Proceedings of the 2004
     International Computer Music Conference, Miami, USA, 2004.

[11] C. Hoene, H. Karl, A. Wolisz, "A perceptual quality model
     intended for adaptive VoIP applications", International Journal
     of Communication Systems, Wiley, August 2005.

[12] J. Holub, J.G. Beerends, R. Smid, "A dependence between average
     call duration and voice transmission quality: measurement and
     applications", Wireless Telecommunications Symposium 2004,
     pp. 75-81, May 2004.

[13] ITU, "Incoming LS: Proposed G.1050/TIA-921B IP Network Model
     Simulation", ITU-T SG 12, Temporary Document 268-GEN,
     May 12, 2010.

[14] ITU, "ITU-R BS.1116-1: Methods for the subjective assessment of
     small impairments in audio systems including multichannel sound
     systems", Recommendation, October 1997.

[15] ITU, "ITU-R BS.1387: Method for objective measurements of
     perceived audio quality", Recommendation, November 2001.

[16] ITU, "ITU-R BS.1534-1: Method for the subjective assessment of
     intermediate quality levels of coding systems", Recommendation,
     January 2003.

[17] ITU, "ITU-T G.107: The E-model: a computational model for use
     in transmission planning", Recommendation, April 2009.

[18] ITU, "ITU-T G.114: One-way transmission time", Recommendation,
     May 2003.

[19] ITU, "ITU-T G.191: Software tools for speech and audio coding
     standardization", Recommendation, March 2010.

[20] ITU, "ITU-T G.1050: Network model for evaluating multimedia
     transmission performance over Internet Protocol",
     Recommendation, November 2007.

[21] ITU, "ITU-T G.RQAM: Reference guide to QoE assessment
     methodologies", standard draft TD 310rev1, May 2010.

[22] ITU, "ITU-T P.10/G.100: Vocabulary and effects of transmission
     parameters on customer opinion of transmission quality",
     Recommendation, July 2006.

[23] ITU, "ITU-T P.800: Methods for objective and subjective
     assessment of quality", Recommendation, August 1996.

[24] ITU, "ITU-T P.805: Subjective evaluation of conversational
     quality", Recommendation, April 2007.

[25] ITU, "ITU-T P.810: Modulated noise reference unit (MNRU)",
     Recommendation, February 1996.

[26] ITU, "ITU-T P.830: Subjective performance assessment of
     telephone-band and wideband digital codecs", Recommendation,
     February 1996.

[27] ITU, "ITU-T P.862: Perceptual evaluation of speech quality
     (PESQ): An objective method for end-to-end speech quality
     assessment of narrow-band telephone networks and speech
     codecs", Recommendation, February 2001.

[28] ITU, "ITU-T P.862.1: Mapping function for transforming P.862
     raw result scores to MOS-LQO", Recommendation, November 2003.
[29] ITU, "ITU-T P.862.2: Wideband extension to Recommendation P.862
     for the assessment of wideband telephone networks and speech
     codecs", Recommendation, November 2007.

[30] ITU, "ITU-T P.862.3: Application guide for objective quality
     measurement based on Recommendations P.862, P.862.1 and
     P.862.2", Recommendation, November 2007.

[31] ITU, "ITU-T P.880: Continuous evaluation of time-varying speech
     quality", Recommendation, May 2004.

[32] H. Jiang, C. Dovrolis, "Why is the Internet Traffic Bursty in
     Short Time Scales?", SIGMETRICS'05, Banff, Alberta, Canada,
     June 2005.

[33] C. Lamblin, R. Even, "Processing Test Plan for the ITU-T
     G.722.1 fullband extension optimization/characterization
     phase", ITU-T Study Group 16, Temporary Document TD 322
     (WP 3/16), 22 April - 2 May 2008.

[34] C. Lamblin, R. Even, "G.722.1 fullband extension
     characterization phase test results: objective (ITU-R
     BS.1387-1) and subjective (ITU-R BS.1116) scores", ITU-T Study
     Group 16, Temporary Document TD 341 R1 (WP 3/16), 22 April -
     2 May 2008.

[35] C. Lamblin, R. Even, "G.722.1 fullband extension
     optimization/characterization Quality Assessment Test Plan",
     ITU-T Study Group 16, Temporary Document TD 323 (WP 3/16),
     22 April - 2 May 2008.

[36] J. Lee, J. Kim, C. Jang, S. Kim, B. Egger, K. Kim, S. Han,
     "FaCSim: A Fast and Cycle-Accurate Architecture Simulator for
     Embedded Systems", Proceedings of the International Conference
     on Languages, Compilers, and Tools for Embedded Systems
     (LCTES'08), Tucson, Arizona, USA, June 2008. Software available
     at http://facsim.snu.ac.kr/.

[37] G. Maier, A. Feldmann, V. Paxson, M. Allman, "On Dominant
     Characteristics of Residential Broadband Internet Traffic",
     IMC'09, Chicago, Illinois, USA, November 4-6, 2009.

[38] T. Mori, S. Naito, R. Kawahara, S. Goto, "On the
     characteristics of Internet traffic variability: Spikes and
     Elephants", SAINT'04, 2004.

[39] Open Mobile Alliance, "Push to talk over Cellular
     Requirements", Approved Version 1.0, OMA-RD-PoC-V1_0-20060609-A,
     9 June 2006.

[40] OPTICOM, SwissQual, TNO, "Announcement of OPTICOM, SwissQual
     and TNO to submit a joint P.OLQA model", ITU-T SG 12,
     Contribution 117, received on 2010-05-07, related to
     question(s): Q9/12.

[41] D. Sisalem, A. Wolisz, "Towards TCP-friendly adaptive
     multimedia applications based on RTP", IEEE International
     Symposium on Computers and Communications, pp. 166-172, 1999.

[42] S. Smirnoff, K. Pupkov, "SoundExpert: How it Works - Audio
     quality measurements in the digital age",
     http://soundexpert.org/, retrieved Nov. 2010.

[43] L. Sun, "Speech Quality Prediction for Voice over Internet",
     PhD thesis, University of Plymouth, January 2004,
     http://www.tech.plymouth.ac.uk/spmc/people/lfsun/mos/.

[44] Texas Instruments, "C64x+ CPU Cycle Accurate Simulator",
     October 2010,
     http://processors.wiki.ti.com/index.php/C64x%2B_CPU_Cycle_Accurate_Simulator.

[45] Texas Instruments, "TNETV3020: Carrier Infrastructure Platform,
     Telogy Software products integrated with TI's DSP-based
     high-density communications processor", 2008,
     http://focus.ti.com/lit/ml/spat174a/spat174a.pdf

[46] TransNexus, "Asterisk V1.4.11 Performance", webpage, accessed
     Nov.
     2010,
     http://www.transnexus.com/White%20Papers/asterisk_V1-4-11_performance.htm

[47] K. Vos, K. Vandborg Sorensen, S. Skak Jensen, J. Spittka,
     "SILK", presentation at the 77th IETF meeting in the Codec WG,
     Anaheim, USA, March 22, 2010,
     http://tools.ietf.org/agenda/77/slides/codec-3.pdf

[48] H. Vlad Balan, L. Eggert, S. Niccolini, M. Brunner, "An
     Experimental Evaluation of Voice Quality Over the Datagram
     Congestion Control Protocol", IEEE INFOCOM 2007, 26th IEEE
     International Conference on Computer Communications,
     pp. 2009-2017, 6-12 May 2007.

[49] J. Wallerich, A. Feldmann, "Capturing the Variability of
     Internet Flows Across Time", Proceedings of INFOCOM 2006, 25th
     IEEE International Conference on Computer Communications,
     23-29 April 2006.

[50] M. Westerlund, "How to Write an RTP Payload Format", work in
     progress, draft-ietf-avt-rtp-howto-06, Internet-Draft,
     March 2, 2009.

[51] Wikipedia contributors, "Bit rate", Wikipedia, The Free
     Encyclopedia, 10 October 2010, 20:00 UTC,
     http://en.wikipedia.org/w/index.php?title=Bit_rate&oldid=389931944

[52] Wikipedia contributors, "Cycle accurate simulator", Wikipedia,
     The Free Encyclopedia, 4 September 2010, 14:27 UTC,
     http://en.wikipedia.org/w/index.php?title=Cycle_accurate_simulator&oldid=382876676

[53] Wikipedia contributors, "Latency (engineering)", Wikipedia, The
     Free Encyclopedia, 15 October 2010, 23:54 UTC,
     http://en.wikipedia.org/w/index.php?title=Latency_(engineering)&oldid=390971153

[54] Wikipedia contributors, "Profiling (computer programming)",
     Wikipedia, The Free Encyclopedia, 15 August 2010, 03:57 UTC,
     http://en.wikipedia.org/w/index.php?title=Profiling_(computer_programming)&oldid=378987422

[55] M. T. Yourst, "PTLsim: A cycle accurate full system x86-64
     microarchitectural simulator", ISPASS'07, 2007, software
     available at http://www.ptlsim.org/.

14. Acknowledgments

This document is based on many discussions with experts in the
fields of codec design, quality of experience, and quality
management. My special thanks go to Michael Knappe, Sebastian
Moeller, Raymond Chen, Jack Douglass, Paul Coverdale, Jean-Marc
Valin, Koen Vos, Bilke Ullrich, and all active participants of the
Codec WG mailing list. Also, I would like to express my appreciation
to the members of the ITU-T Study Groups 12 and 16, with whom I had
many fruitful discussions.

Authors' Addresses

Christian Hoene
Universitaet Tuebingen
WSI-ICS
Sand 13
72076 Tuebingen
Germany

Phone: +49 7071 2970532
Email: hoene@uni-tuebingen.de