2	Network Working Group                                          JM. Valin
3	Internet-Draft                                              Octasic Inc.
4	Intended status: Standards Track                              S. Borilin
5	Expires: April 29, 2010                                       SPIRIT DSP
6	                                                                  K. Vos
7	                                                                   Skype
8	                                                           C. Montgomery
9	                                                     Xiph.Org Foundation
10	                                                                 R. Chen
11	                                                    Broadcom Corporation
12	                                                        October 26, 2009

14	                           Codec Requirements
15	                   draft-valin-codec-requirements-02

17	Status of this Memo

19	   This Internet-Draft is submitted to IETF in full conformance with the
20	   provisions of BCP 78 and BCP 79.

22	   Internet-Drafts are working documents of the Internet Engineering
23	   Task Force (IETF), its areas, and its working groups.  Note that
24	   other groups may also distribute working documents as Internet-
25	   Drafts.

27	   Internet-Drafts are draft documents valid for a maximum of six months
28	   and may be updated, replaced, or obsoleted by other documents at any
29	   time.  It is inappropriate to use Internet-Drafts as reference
30	   material or to cite them other than as "work in progress."

32	   The list of current Internet-Drafts can be accessed at
33	   http://www.ietf.org/ietf/1id-abstracts.txt.

35	   The list of Internet-Draft Shadow Directories can be accessed at
36	   http://www.ietf.org/shadow.html.

38	   This Internet-Draft will expire on April 29, 2010.

40	Copyright Notice

42	   Copyright (c) 2009 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents in effect on the date of
47	   publication of this document (http://trustee.ietf.org/license-info).
48	   Please review these documents carefully, as they describe your rights
49	   and restrictions with respect to this document.

51	Abstract

53	   This document provides specific requirements for Internet audio
54	   codecs.  These requirements address quality, sampling rate, bit-rate,
55	   and packet loss robustness, as well as other desirable properties.

57	Table of Contents

59	   1.  Introduction
60	   2.  Applications
61	     2.1.  Point to point calls
62	     2.2.  Conferencing
63	     2.3.  Telepresence
64	     2.4.  Teleoperation
65	     2.5.  In-game voice chat
66	     2.6.  Live distributed music performances / Internet music
67	           lessons
68	     2.7.  Other applications
69	   3.  Constraints Imposed by the Internet on the Codec
70	     3.1.  Security
71	   4.  Detailed Basic Requirements
72	     4.1.  Operating space
73	     4.2.  Quality and bit-rate
74	     4.3.  Packet loss robustness
75	     4.4.  Computational resources
76	   5.  Additional considerations
77	     5.1.  Low-complexity audio mixing
78	     5.2.  Encoder side potential for improvement
79	     5.3.  Layered bit-stream
80	     5.4.  Partial redundancy
81	     5.5.  Bit error robustness
82	     5.6.  Partial redundancy
83	     5.7.  Time stretching and shortening
84	     5.8.  Legacy compatibility
85	   6.  Security Considerations
86	   7.  IANA Considerations
87	   8.  Acknowledgments
88	   9.  Informative References
89	   Authors' Addresses

91	1.  Introduction

93	   This document provides requirements for audio codecs designed
94	   specifically for use over the Internet.  The requirements attempt to
95	   address the needs of the most common Internet interactive audio
96	   transmission applications and to ensure good quality when operating
97	   in conditions that are typical for the Internet.  These requirements
98	   address the quality, sampling rate, delay, bit-rate, and packet loss
99	   robustness.  Other desirable codec properties are considered as well.

101	   Throughout this document, we will use the following conventions when
102	   referring to the sampling rate of a signal:

104	      Narrowband: 8 kHz sampling rate

106	      Wideband: 16 kHz sampling rate

108	      Super-wideband: 32 kHz sampling rate

110	      Full-band: 44.1/48 kHz and above

112	   Codec bit-rates in bits per second (b/s) will be considered without
113	   counting any overhead (IP/UDP/RTP headers, padding, ...).  The codec
114	   delay is the total algorithmic delay when one adds the codec frame
115	   size to the "look-ahead".  It is thus the minimum theoretically
116	   achievable end-to-end delay of a transmission system that uses the
117	   codec.
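
   As an illustration of the two definitions above, the following
   Python sketch computes the codec delay and the payload-only
   bit-rate.  The function names and the example figures (20 ms frames,
   5 ms look-ahead, 40-byte payloads) are assumptions chosen for
   illustration and do not describe any particular codec.

```python
def codec_delay_ms(frame_size_ms: float, look_ahead_ms: float) -> float:
    """Total algorithmic delay: codec frame size plus look-ahead."""
    return frame_size_ms + look_ahead_ms


def payload_bit_rate(payload_bytes: int, frame_size_ms: float) -> float:
    """Codec bit-rate in b/s, counting only the compressed payload.

    IP/UDP/RTP headers and padding are deliberately left out.
    """
    return payload_bytes * 8 * 1000.0 / frame_size_ms


if __name__ == "__main__":
    # Hypothetical codec: 20 ms frames, 5 ms look-ahead, 40-byte payloads.
    print(codec_delay_ms(20, 5), "ms")
    print(payload_bit_rate(40, 20), "b/s")
```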

119	2.  Applications

121	   The following applications should be considered for Internet audio
122	   codecs, along with their requirements:

124	   o  Point to point calls

126	   o  Conferencing

128	   o  Telepresence

130	   o  Teleoperation

132	   o  In-game voice chat

134	   o  Live distributed music performances / Internet music lessons

136	   o  Other applications

138	2.1.  Point to point calls

140	   Point to point calls are voice over IP (VoIP) calls between two
141	   "standard" (fixed or mobile) phones, implemented in hardware or
142	   software.  For these applications, a wideband codec is required,
143	   along with narrowband support for compatibility with legacy telephony
144	   equipment (PSTN).  The range of useful bit-rates is expected to be
145	   12 - 32 kb/s for wideband speech and 8 - 16 kb/s for narrowband
146	   speech.  The codec delay must be less than 40 ms, and a delay of no
147	   more than 25 ms is desirable.  Support for encoding music is not
148	   required, but it is desirable for the codecs not to make background
149	   (on-hold) music excessively unpleasant to hear.  Also, the codec
150	   should be robust to noise (produce intelligible speech and no
151	   annoying artifacts) even at lower bit-rates.

153	2.2.  Conferencing

155	   Conferencing applications (which support multi-party calls) have
156	   additional requirements on top of the requirements for point-to-point
157	   calls.  Conferencing systems often have higher-fidelity audio
158	   equipment and have greater network bandwidth available -- especially
159	   when video transmission is involved.  For that reason, support for
160	   super-wideband audio becomes important, with useful bit-rates in the
161	   32 - 64 kb/s range.  The ability to vary the bit-rate according to
162	   the "difficulty" of the audio signal (VBR) is a desirable feature for
163	   codecs.  This not only saves bandwidth "on average", but it can also
164	   help conference servers make more efficient use of the available
165	   bandwidth by using more bandwidth for important audio streams and
166	   less bandwidth for less important ones (e.g. background noise).

168	   Conferencing end-points often operate in hands-free conditions, which
169	   creates acoustic echo problems.  Lower delay is therefore important,
170	   as it reduces the quality degradation due to any residual echo
171	   after acoustic echo cancellation (AEC).  Consequently, the codec
172	   delay must be less than 30 ms for this application.  An optional
173	   low-delay mode with less than 10 ms delay is desirable, but not
174	   required.

176	   Most conferencing systems operate with a bridge that mixes some (or
177	   all) of the audio streams and sends them back to all the
178	   participants.  In that case, it is important that the codec not
179	   produce annoying artifacts when two voices are present at the same
180	   time.  Also, this mixing operation should be as easy as possible to
181	   perform.  To make it easier to determine which streams have to be
182	   mixed (and which are noise/silence), it must be possible to measure
183	   (or estimate) the voice activity in a packet without having to fully
184	   decode the packet (saving most of the complexity when the packet need
185	   not be decoded).  The ability to save on the computational
186	   complexity when mixing is also desirable, but not required.  For
187	   example, a transform codec may make it possible to mix the streams in
188	   the transform domain, without having to go back to the time domain.
189	   Low-complexity up-sampling and down-sampling within the codec is also
190	   a desirable feature when mixing streams at different sampling rates.

192	2.3.  Telepresence

194	   Most telepresence applications can be considered to be essentially
195	   very high-quality video-conferencing environments, so all of the
196	   conferencing requirements also apply to telepresence.  In addition,
197	   telepresence applications require super-wideband and full-band audio
198	   capability with useful bit-rates in the 32 - 80 kb/s range.  While
199	   voice is still the most important signal to be encoded, it must be
200	   possible to obtain good quality (even if not transparent) music.

202	   Most telepresence applications require more than one audio channel,
203	   so support for stereo and multi-channel is important.  While this can
204	   always be accomplished by encoding multiple single-channel streams,
205	   it is preferable to take advantage of the redundancy that exists
206	   between channels.

208	2.4.  Teleoperation

210	   Teleoperation applications are similar to telepresence, with the
211	   exception that they involve remote physical interactions.  For
212	   example, the user may be controlling a robot while receiving real-
213	   time audio feedback from that robot.  For these applications, the
214	   delay has to be less than 10 ms.  The other requirements of
215	   telepresence (quality, bit-rate, multi-channel) apply to
216	   teleoperation as well.  The only exception is that mixing is not an
217	   important issue for teleoperation.

219	2.5.  In-game voice chat

221	   An increasing number of computer/console games make use of VoIP to
222	   allow players to communicate in real-time.  The requirements for
223	   gaming are similar to those of conferencing, with the main difference
224	   being that narrowband compatibility is not necessary.  While for most
225	   applications a codec delay up to 30 ms is acceptable, a low-delay (<
226	   10 ms) option is highly desirable, especially for games with rapid
227	   interactions.  The ability to use VBR (with a maximum allowed
228	   bit-rate) is also highly desirable because it can significantly
229	   reduce the bandwidth requirement for a game server.

231	2.6.  Live distributed music performances / Internet music lessons

233	   Live music over the Internet requires extremely low end-to-end delay
234	   and is one of the most demanding applications for interactive audio
235	   transmission.  It has been observed that for most scenarios, total
236	   end-to-end delays up to 25 ms could be tolerated by musicians, with
237	   the absolute limit (where none of the scenarios are possible) being
238	   around 50 ms [carot09].  In order to achieve this low delay on the
239	   Internet -- either in the same city or a nearby city -- the network
240	   propagation time must be taken into account.  When also subtracting
241	   the delay of the audio buffer, jitter buffer, and acoustic path, that
242	   leaves around 2 ms to 10 ms for the total delay of the codec.
243	   Considering the speed of light in fiber, every 1 ms reduction in the
244	   codec delay increases the range over which synchronization is
245	   possible by approximately 200 km.
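
   The delay budget described above can be worked through numerically.
   The sketch below is only an illustration: the 200 km per millisecond
   figure is the fiber propagation speed implied by the text, while the
   function names and the 15 ms allowance for audio buffer, jitter
   buffer, and acoustic path are assumptions.  Real network paths are
   longer than the straight-line distance, so the results are optimistic.

```python
# Assumption: signals travel roughly 200 km per millisecond in fiber.
FIBER_KM_PER_MS = 200.0


def codec_budget_ms(total_ms: float, distance_km: float, other_ms: float) -> float:
    """Delay left for the codec after propagation and other delays."""
    return total_ms - distance_km / FIBER_KM_PER_MS - other_ms


def max_range_km(total_ms: float, codec_ms: float, other_ms: float) -> float:
    """Largest one-way fiber distance that still fits in the budget."""
    return max(0.0, total_ms - codec_ms - other_ms) * FIBER_KM_PER_MS


if __name__ == "__main__":
    # 25 ms end-to-end target, 1000 km of fiber, 15 ms of other delays.
    print(codec_budget_ms(25, 1000, 15), "ms left for the codec")
    # Shaving 1 ms off the codec delay extends the reachable distance.
    print(max_range_km(25, 5, 15), "km ->", max_range_km(25, 4, 15), "km")
```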

247	   Acoustic echo is expected to be an even more important issue for
248	   network music than it is in conferencing, especially considering that
249	   the music quality requirements essentially forbid the use of a
250	   "nonlinear processor" (NLP) with the AEC.  This is another reason why
251	   very low delay is essential.

253	   Considering that the application is music, the full audio bandwidth
254	   (44.1 or 48 kHz sampling rate) must be transmitted with a bit-rate
255	   that is sufficient to provide near-transparent to transparent
256	   quality.  With the current audio coding technology, this corresponds
257	   to approximately 64 kb/s to 128 kb/s per channel.  As for
258	   telepresence, support for two or more channels is often desired, so
259	   it would be useful for a codec to be able to take advantage of the
260	   redundancy that is often present between audio channels.

262	2.7.  Other applications

264	   The above list is by no means a complete list of all applications
265	   involving interactive audio transmission on the Internet.  However,
266	   it is believed that meeting the needs of all these different
267	   applications should be sufficient to ensure that the needs of most
268	   applications not listed will also be met.

270	3.  Constraints Imposed by the Internet on the Codec

272	   Packet losses are inevitable on the Internet and dealing with those
273	   is one of the most fundamental requirements for an Internet audio
274	   codec.  While any audio codec can be combined with a good packet loss
275	   concealment (PLC) algorithm, the important aspect is what happens on
276	   the first packets received _after_ the loss.  More specifically, this
277	   means that:

279	   o  it should be possible to interpret the contents of any received
280	      packet, irrespective of previous losses as specified in BCP 36
281	      [PAYLOADS]; and

283	   o  the decoder should re-synchronize as quickly as possible (i.e. the
284	      output should quickly converge to the output that would have been
285	      obtained if no loss had occurred).

287	   The constraint of being able to decode any packet implies the
288	   following considerations for an audio codec:

290	   o  The size of a compressed frame must be kept smaller than the MTU
291	      to avoid fragmentation;

293	   o  The interpretation of any parameter encoded in the bit-stream must
294	      not depend on information contained in other packets.  For
295	      example, it is not acceptable for a codec to allow signaling a
296	      mode change in one packet and assume that subsequent frames will
297	      be decoded according to that mode.
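
   One way to meet the second consideration is to make every packet
   self-describing.  The sketch below assumes a purely hypothetical
   one-byte header that repeats the audio bandwidth and coding mode in
   each packet, so that a packet can be parsed without reference to any
   earlier one.  The layout, names, and field widths are illustrative
   assumptions and not a proposed format.

```python
from dataclasses import dataclass

BANDWIDTHS = ("narrowband", "wideband", "super-wideband", "full-band")


@dataclass(frozen=True)
class Packet:
    bandwidth: str   # one of BANDWIDTHS
    mode: int        # hypothetical coding mode, 0..15
    payload: bytes   # compressed frame data


def pack(packet: Packet) -> bytes:
    """Serialize a packet; the mode travels in every packet."""
    if not 0 <= packet.mode < 16:
        raise ValueError("mode must fit in 4 bits")
    header = (BANDWIDTHS.index(packet.bandwidth) << 4) | packet.mode
    return bytes([header]) + packet.payload


def parse(data: bytes) -> Packet:
    """Parse one packet using only the bytes of that packet."""
    if not data:
        raise ValueError("empty packet")
    bandwidth_id, mode = data[0] >> 4, data[0] & 0x0F
    if bandwidth_id >= len(BANDWIDTHS):
        raise ValueError("unknown bandwidth")
    return Packet(BANDWIDTHS[bandwidth_id], mode, data[1:])
```

   Because no decoder state is consulted in parse(), losing the packet
   that preceded a mode change cannot cause later packets to be
   misinterpreted.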

299	   Although the interpretation of parameters cannot depend on other
300	   packets, it is still reasonable to use some amount of prediction
301	   across frames, provided that the predictors can resynchronize quickly
302	   in case of a lost packet.  In this case, it is important to use the
303	   best compromise between the gain in coding efficiency and the loss in
304	   packet loss robustness due to the use of inter-frame prediction.  It
305	   is a desirable property for the codecs to allow some real-time
306	   control of that trade-off so that it can take advantage of more
307	   prediction when the loss rate is small, while being more robust to
308	   losses when the loss rate is high.

310	   To improve the robustness to packet loss, it would be desirable for
311	   the codec to allow an adaptive (data- and network-dependent) amount
312	   of side information to help improve audio quality when losses occur.
313	   For example, this side information may include the retransmission of
314	   certain parameters encoded in the previous frame(s).

316	   Another important property of the Internet is that it is mostly a
317	   best-effort network, with no guaranteed bandwidth.  This means that
318	   the codecs have to be able to vary their output bit-rate dynamically
319	   (in real-time), without requiring an out-of-band signaling mechanism,
320	   and without causing audible artifacts at the bit-rate change
321	   boundaries.  Additional desirable features are:

323	   o  Having the possibility to use smooth bit-rate changes with one
324	      byte/frame resolution;

326	   o  Making it possible for a codec to adapt its bit-rate based on the
327	      source signal being encoded (source-controlled VBR) to maximize
328	      the quality for a certain _average_ bit-rate.
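
   The granularity offered by a one byte/frame resolution depends on
   the frame duration.  The following sketch shows the arithmetic; the
   function names and the 20 ms frame size are assumptions used only
   for illustration.

```python
def bit_rate_step(frame_ms: float) -> float:
    """Bit-rate change (b/s) caused by adding one byte to each frame."""
    return 8 * 1000 / frame_ms


def bytes_per_frame(target_bps: int, frame_ms: int) -> int:
    """Largest whole number of payload bytes not exceeding the target."""
    return (target_bps * frame_ms) // 8000


if __name__ == "__main__":
    # With 20 ms frames, each extra byte per frame adds 400 b/s.
    print(bit_rate_step(20), "b/s per byte")
    print(bytes_per_frame(24000, 20), "bytes per frame at 24 kb/s")
```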

330	   Because the Internet transmits data in bytes, codecs should produce
331	   compressed data in integer numbers of bytes.  In general, the codec
332	   design should take into consideration explicit congestion
333	   notification (ECN) and may include features that would improve the
334	   quality of an ECN implementation.

336	   The IETF has defined a set of application-layer protocols to be used
337	   for the real-time transport of multimedia data, including voice.  It
338	   is thus important for the resulting codecs to be easy to use with
339	   these protocols.  For example, it must be possible to create
340	   an [RTP] payload format that conforms to BCP 36 [PAYLOADS].  If any
341	   codec parameters need to be negotiated between end-points, the
342	   negotiation should be as easy as possible to carry over SIP/SDP or
343	   alternatively over XMPP/Jingle.

345	3.1.  Security

347	   Just like for any protocol to be used over the Internet, security is
348	   a very important aspect to consider.  This goes beyond the obvious
349	   considerations of preventing buffer overflows and similar attacks
350	   that can lead to denial-of-service or remote code execution.  One
351	   very important security aspect is to make sure that the decoders have
352	   a bounded and reasonable worst-case complexity.  This prevents an
353	   attacker from causing a DoS by sending packets that are specially
354	   crafted to take a very long (or infinite) time to decode.

356	   A more subtle aspect is the information leak that can occur when the
357	   codec is used over an encrypted channel (e.g.  [SRTP]).  For example,
358	   it was suggested [wright08] that use of source-controlled VBR may
359	   reveal some information about a conversation through the size of the
360	   compressed packets.  This would have to be investigated when
361	   standardizing a codec.

363	4.  Detailed Basic Requirements

365	   This section summarizes all the constraints imposed by the target
366	   applications and by the Internet into a set of actual requirements
367	   for codec development.

369	4.1.  Operating space

371	   The operating space for the target applications can be divided in
372	   terms of delay: most applications require a "medium delay" (20-30
373	   ms), while a few require a "very low delay" (< 10 ms).  It makes
374	   sense to divide the space based on delay because lowering the delay
375	   has a cost in terms of quality vs bit-rate.

377	   For medium delay, the resulting codecs must be able to efficiently
378	   operate within the following range of bit-rates (per channel):

380	   o  Narrowband: 8 to 16 kb/s

382	   o  Wideband: 12 to 32 kb/s

384	   o  Super-wideband: 24 to 64 kb/s

386	   o  Full-band: 32 to 80 kb/s

388	   Obviously, a lower-delay codec that can operate in the above range is
389	   also acceptable.

391	   For very low delay, the resulting codecs will need to operate within
392	   the following range of bit-rates (per channel):

394	   o  Super-wideband: 32 to 80 kb/s

396	   o  Full-band: 48 to 128 kb/s

398	   o  (Narrowband and wideband not required)
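
   The two operating ranges above can be captured in a small lookup
   table, for example to check a candidate configuration.  The sketch
   below uses the per-channel figures listed in this section; the table
   layout, key names, and function name are assumptions.

```python
# Per-channel bit-rate ranges in kb/s, taken from the lists above.
OPERATING_SPACE = {
    "medium": {
        "narrowband": (8, 16),
        "wideband": (12, 32),
        "super-wideband": (24, 64),
        "full-band": (32, 80),
    },
    "very-low": {
        "super-wideband": (32, 80),
        "full-band": (48, 128),
    },
}


def in_operating_space(delay_class: str, bandwidth: str, kbps: float) -> bool:
    """True if the bit-rate lies in the range required for this case."""
    limits = OPERATING_SPACE[delay_class].get(bandwidth)
    if limits is None:
        # Narrowband and wideband are not required at very low delay.
        return False
    low, high = limits
    return low <= kbps <= high
```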

400	4.2.  Quality and bit-rate

402	   The quality of a codec is directly linked to the bit-rate, so these
403	   two must be considered jointly.  When comparing the bit-rates of
404	   codecs, the overhead of IP/UDP/RTP headers should not be considered,
405	   but any additional bits required in the RTP payload format after the
406	   header (e.g. required signaling) should be considered.  In terms of
407	   quality vs bit-rate, the codecs to be developed must be better than
408	   the currently available codecs that satisfy the IPR requirements in
409	   the guidelines document, which are:

411	   o  For narrowband: Speex (NB), GSM-FR, and iLBC(*)

413	   o  For wideband: Speex (WB), G.722, G.722.1(*)

415	   o  For super-wideband: Speex (UWB), G.722.1C(*)

417	   The codecs marked with (*) do not meet all the licensing guidelines,
418	   but the codecs to be developed should still not perform significantly
419	   worse.  Quality should be measured for multiple languages, including
420	   tonal languages.  The case of multiple simultaneous voices (as
421	   sometimes happens in conferencing) should be evaluated as well.

423	   The comparison with the above codecs assumes that the codecs being
424	   compared have similar delay characteristics.  The bit-rate required
425	   for a certain level of quality may be higher than the referenced
426	   codecs in cases where a much lower delay is required.  In that case,
427	   the increase in bit-rate must be less than the ratio between the
428	   delays.
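
   The last rule can be read as a simple inequality: the bit-rate ratio
   (new over reference) must stay below the delay ratio (reference over
   new).  This reading is an interpretation of the sentence above, and
   the function name and example figures in the sketch below are
   assumptions.

```python
def rate_increase_acceptable(ref_kbps: float, ref_delay_ms: float,
                             new_kbps: float, new_delay_ms: float) -> bool:
    """Check that the bit-rate grows by less than the delay shrinks."""
    if min(ref_kbps, ref_delay_ms, new_kbps, new_delay_ms) <= 0:
        raise ValueError("all arguments must be positive")
    return new_kbps / ref_kbps < ref_delay_ms / new_delay_ms


if __name__ == "__main__":
    # Reference: 20 kb/s at 25 ms.  Candidate: 64 kb/s at 5 ms.
    # Delay ratio is 5, bit-rate ratio is 3.2, so the rule is met.
    print(rate_increase_acceptable(20, 25, 64, 5))
```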

430	   It is desirable for the codecs to support source-controlled variable
431	   bit-rate (VBR) to take advantage of the fact that different inputs
432	   require a different bit-rate to achieve the same quality.  However,
433	   it should still be possible to use the codecs at a truly constant
434	   bit-rate to ensure that no information leak is possible when using an
435	   encrypted channel.

437	4.3.  Packet loss robustness

439	   Robustness to packet loss is a very important aspect of any codec to
440	   be used on the Internet.  Codecs must maintain acceptable quality at
441	   loss rates up to 5% and maintain good intelligibility up to 15% loss
442	   rate.  At any sampling rate, bit-rate, and packet loss rate, the
443	   quality must be no less than the quality obtained with the Speex
444	   codec or the GSM-FR codec in the same conditions.  The actual packet
445	   loss "patterns" to be used in testing must be obtained from real
446	   packet loss traces collected on the Internet, rather than from loss
447	   models.  These traces should be representative of the typical
448	   environments in which the applications of Section 2 operate.  For
449	   example, traces related to VoIP calls should consider the loss
450	   patterns observed for typical home broadband and corporate
451	   connections.
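
   A test harness could replay a recorded trace against a stream of
   encoded packets along the following lines.  The trace format used
   here (one boolean per packet, True meaning "received") is a
   hypothetical simplification, as are the function names.

```python
from typing import List, Optional, Sequence


def loss_rate(trace: Sequence[bool]) -> float:
    """Fraction of packets marked as lost in the trace."""
    return sum(1 for received in trace if not received) / len(trace)


def apply_trace(packets: Sequence[bytes],
                trace: Sequence[bool]) -> List[Optional[bytes]]:
    """Replace each lost packet with None, following the trace order."""
    if len(trace) < len(packets):
        raise ValueError("trace is shorter than the packet stream")
    return [pkt if received else None
            for pkt, received in zip(packets, trace)]
```

   The None entries mark the positions where the decoder under test
   would have to invoke its packet loss concealment.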

453	4.4.  Computational resources

455	   The resulting codecs should be implementable on a wide range of
456	   devices, so there should be a fixed-point implementation or at least
457	   assurance that a reasonable fixed-point implementation is possible.
458	   The computational resource figures listed below are meant to be upper
459	   bounds.  Even below these bounds, resources should still be
460	   minimized.  Any proposed increase in computational resources
461	   consumption (e.g. to increase quality) should be carefully evaluated
462	   even if the resulting resource consumption is below the upper bound.
463	   Having variable complexity would be useful (but not required) in
464	   achieving that goal as it would allow trading quality/bit-rate for
465	   lower complexity.

467	   The computational requirements for real-time encoding and decoding
468	   are:

470	   o  Narrowband should require few CPU resources and be
471	      implementable on most DSPs with a 16x16 multiplier (e.g. < 40
472	      MIPS).

474	   o  Wideband can have a bit more complexity than narrowband, but
475	      should still be implementable on a cheap DSP (e.g. < 80 MIPS).

477	   o  Super-wideband/full-band may require higher complexity, but should
478	      be implementable on a higher-end DSP (e.g. < 200 MIPS) and, if
479	      possible, on cheaper DSPs as well.

481	   The MIPS values are approximate clock frequencies required for real-
482	   time encoding+decoding on a DSP capable of single-cycle MAC
483	   operations (16x16 multiplication accumulated into 32 bits).  Similar
484	   computational requirements apply to floating-point processors.  For
485	   example, narrowband encoding and decoding should be possible using 40
486	   MHz on a modern x86 CPU (2% of a 2 GHz CPU).  For applications that
487	   require mixing (e.g. conferencing), it must be possible to estimate
488	   the energy of the decoded signal with less than 10% of the complexity
489	   figures listed above.

491	   In terms of memory use, the codec context/state size required should
492	   be no more than 2*R*C bytes in floating-point, where R is the
493	   sampling rate and C is the number of channels.  For fixed-point, that
494	   size should be less than R*C bytes.  The scratch space required
495	   should also be less than 2*R*C bytes for floating-point or less than
496	   R*C bytes for fixed-point.  The combined codec size and data ROM
497	   should be small enough not to cause significant implementation
498	   problems.  Code size is more difficult to evaluate since it is highly
499	   dependent on the architecture, but when implemented on an x86 CPU,
500	   the codec should require no more than 100 kB for instructions and
501	   constant data.
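
   The complexity and memory figures in this section translate into
   simple arithmetic.  The sketch below is illustrative only: the
   function names are assumptions, and the limits are the ones stated
   above (2*R*C bytes for floating-point, R*C bytes for fixed-point,
   and 10% of the complexity figure for energy estimation).

```python
def state_limit_bytes(sample_rate_hz: int, channels: int,
                      fixed_point: bool) -> int:
    """Upper bound on codec state (and on scratch space) in bytes."""
    factor = 1 if fixed_point else 2
    return factor * sample_rate_hz * channels


def cpu_fraction(required_mhz: float, cpu_mhz: float) -> float:
    """Share of a CPU consumed by real-time encoding and decoding."""
    return required_mhz / cpu_mhz


def energy_estimate_budget(mips_limit: float) -> float:
    """Complexity allowed for estimating decoded energy (10% rule)."""
    return mips_limit / 10


if __name__ == "__main__":
    # Full-band stereo at 48 kHz, floating-point implementation.
    print(state_limit_bytes(48000, 2, fixed_point=False), "bytes")
    # Narrowband at 40 MHz on a 2 GHz CPU.
    print(cpu_fraction(40, 2000))
```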

503	   It is the intent to maximize the range of devices on which a codec
504	   can be implemented.  For this reason, the reference implementation
505	   must not depend on "special hardware features" to be present in order
506	   to meet the complexity requirement.  However, it might be desirable
507	   to take advantage of such hardware (e.g., hardware accelerators for
508	   operations like FFTs and convolutions).  A codec should also minimize
509	   the use of saturating arithmetic so as to be implementable on
510	   architectures that do not provide hardware saturation (e.g.  ARMv4).

512	5.  Additional considerations

514	   There are additional features or characteristics that may be
515	   desirable under some circumstances, but should not be part of the
516	   strict requirements.  The benefit of meeting these considerations
517	   should be weighed against the associated cost.

519	5.1.  Low-complexity audio mixing

521	   In many applications that require a mixing server (e.g. conferencing,
522	   games), it is important to minimize the computational cost of the
523	   mixing.  As much as possible, it should be possible to perform the
524	   mixing with fewer computations than it would take to decode all the
525	   streams, mix them, and re-encode the result.  Properties that reduce
526	   the complexity of the mixing process include:

528	   o  the ability to derive sufficient parameters, such as loudness
529	      and/or spectral envelope, for estimating voice activity of a
530	      compressed frame without fully decoding that frame;

532	   o  the ability to mix the streams in an intermediate representation
533	      (e.g. transform domain), rather than having to fully decode the
534	      signals before the mixing;

536	   o  the use of bit-stream layers (Section 5.3) by aggregating a small
537	      number of active streams at lower quality.

539	   For conferencing applications, the total complexity of the decoding,
540	   VAD and mixing should be considered when evaluating proposals.
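
   As a rough illustration of the first property, a mixing server could
   choose which streams to decode and mix from per-packet energy
   estimates alone.  In the sketch below, the energy values are assumed
   to have been obtained without fully decoding the packets; how that
   is done is codec-specific and not shown.  The function name, the
   threshold, and the stream identifiers are hypothetical.

```python
import heapq
from typing import Dict, List


def select_streams_to_mix(energies: Dict[str, float],
                          max_streams: int,
                          threshold: float) -> List[str]:
    """Return the ids of the loudest active streams, loudest first.

    Streams whose estimated energy is below the threshold are treated
    as noise or silence and are never selected.
    """
    active = [(energy, stream_id)
              for stream_id, energy in energies.items()
              if energy >= threshold]
    return [stream_id for _, stream_id in heapq.nlargest(max_streams, active)]
```

   Only the selected streams then need to be fully decoded, which is
   where most of the saving comes from when many participants are
   silent.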

542	5.2.  Encoder side potential for improvement

544	   In many codecs, it is possible to improve the quality by improving
545	   the encoder without breaking compatibility (i.e. without changing the
546	   decoder).  Potential for improvement varies from one codec to
547	   another.  It is generally low for PCM or ADPCM codecs and higher for
548	   perceptual transform codecs.  All things being equal, being able to
549	   improve a codec after the bit-stream is frozen is a desirable
550	   property.  However, this should not be done at the expense of quality
551	   in the reference encoder.

553	5.3.  Layered bit-stream

555	   A layered codec makes it possible to transmit only a certain subset
556	   of the bits and still obtain a valid bit-stream with a quality that
557	   is equivalent to the quality that would be obtained from encoding at
558	   the corresponding rate.  While this is not a necessary feature for
559	   most applications, it can be desirable for cases where a "mixing
560	   server" needs to handle a large number of streams with limited
561	   computational resources.
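
   The idea of transmitting only a subset of the bits can be sketched
   as follows.  The representation of a frame as a list of layers (base
   layer first) and the function name are assumptions made for
   illustration; an actual layered codec would define its own framing.

```python
from typing import List


def truncate_layers(layers: List[bytes], max_bytes: int) -> List[bytes]:
    """Keep the base layer plus as many enhancement layers as fit.

    Layers are dropped from the end, so the result is always a prefix
    of the original list and remains decodable at reduced quality.
    """
    if not layers or len(layers[0]) > max_bytes:
        raise ValueError("base layer does not fit in the byte budget")
    kept: List[bytes] = []
    used = 0
    for layer in layers:
        if used + len(layer) > max_bytes:
            break
        kept.append(layer)
        used += len(layer)
    return kept
```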

563	5.4.  Partial redundancy

565	   One possible way of increasing robustness to packet loss is to
566	   include partial redundancy within packets.  This can be achieved
567	   either by including the base layer of the previous frame (for a
568	   layered codec) or by transmitting other parameters from the previous
569	   frame(s) to assist the PLC algorithm in case of loss.  The ability to
570	   include partial redundancy for high-loss scenarios is desirable,
571	   provided that the feature can be dynamically turned on or off (so
572	   that no bandwidth is wasted in case of loss-free transmission).
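
   A minimal sketch of this mechanism is given below.  For simplicity
   it attaches a full copy of the previous frame to each packet, whereas
   the text above envisages only a base layer or selected parameters;
   the packet representation and function names are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# A packet is (current frame, redundant copy of previous frame or None).
Packet = Tuple[bytes, Optional[bytes]]


def packetize(frames: Sequence[bytes], redundancy: bool) -> List[Packet]:
    """Build packets, adding redundancy only when it is switched on."""
    packets: List[Packet] = []
    previous: Optional[bytes] = None
    for frame in frames:
        packets.append((frame, previous if redundancy else None))
        previous = frame
    return packets


def recover(received: Sequence[Optional[Packet]]) -> List[Optional[bytes]]:
    """Rebuild the frame sequence; None marks a frame that stays lost."""
    frames = [pkt[0] if pkt is not None else None for pkt in received]
    for i in range(1, len(received)):
        pkt = received[i]
        if pkt is not None and frames[i - 1] is None and pkt[1] is not None:
            # The next packet carries a copy of the frame that was lost.
            frames[i - 1] = pkt[1]
    return frames
```

   Note that a frame recovered this way only becomes available once
   the following packet has arrived, so the receiver needs at least one
   packet of buffering to benefit from it.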

574	5.5.  Bit error robustness

576	   The vast majority of Internet-based applications do not need to be
577	   robust to bit errors because packets either arrive unaltered, or do
578	   not arrive at all.  Considering that, the emphasis should be on
579	   packet loss robustness and packet loss concealment.  That being said,
580	   it is often the case that extra robustness to bit errors can be
581	   achieved at no cost at all (i.e. no increase in size, complexity or
582	   bit-rate, no decrease in quality or packet loss robustness, ...).  In
583	   those cases, it is useful to make a change that increases the
584	   robustness to bit errors.  This can be useful for applications that
585	   use UDP Lite transmission (e.g. over a wireless LAN).  Robustness to
586	   packet loss should *never* be sacrificed to achieve higher bit error
587	   robustness.

600	5.7.  Time stretching and shortening

602	   When adaptive jitter buffers are used, it is often necessary to
603	   stretch or shorten the audio signal to allow changes in buffering.
604	   While this operation can be performed directly on the decoder's
605	   output, it is often more computationally efficient to stretch or
606	   shorten the signal directly within the decoder.  It is desirable for
607	   the reference implementation to provide a time stretching/shortening
608	   implementation, although it should not be normative.
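
   As a non-normative illustration, the sketch below shortens or
   stretches a block of decoded samples by one period using a linear
   cross-fade.  It operates on the decoder's output and assumes the
   period (for instance a pitch period) is already known; an in-decoder
   implementation would typically reuse parameters the decoder already
   has.  The function names and the float sample format are assumptions
   made for this example.

```python
from typing import List


def shorten(pcm: List[float], period: int) -> List[float]:
    """Remove `period` samples by cross-fading pcm[i] into pcm[i + period]."""
    if period <= 0 or len(pcm) < 2 * period:
        return list(pcm)  # not enough signal to work with; leave unchanged
    out = []
    for i in range(period):
        w = i / period  # fade weight ramps from 0 towards 1
        out.append((1.0 - w) * pcm[i] + w * pcm[i + period])
    out.extend(pcm[2 * period:])
    return out


def stretch(pcm: List[float], period: int) -> List[float]:
    """Insert `period` extra samples by repeating one period with a cross-fade."""
    if period <= 0 or len(pcm) < 2 * period:
        return list(pcm)
    out = list(pcm[:period])
    for i in range(period):
        w = i / period
        # fade from the natural continuation back to the start of the period
        out.append((1.0 - w) * pcm[period + i] + w * pcm[i])
    out.extend(pcm[period:])
    return out
```

   For a signal that is exactly periodic with the given period, both
   functions return a seamless shorter or longer version of the input;
   for real speech or music the result depends on how well the period
   is estimated.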

610	5.8.  Legacy compatibility

612	   In order to create the best possible codec for the Internet, there is
613	   no requirement for compatibility with legacy Internet codecs.

615	6.  Security Considerations

617	   The codec requirements themselves do not have security
618	   considerations.  However, codec security issues are discussed in
619	   Section 3.1.

621	7.  IANA Considerations

623	   This document has no actions for IANA.

625	8.  Acknowledgments

627	   We would like to thank all the other people who contributed directly
628	   or indirectly to this document, including Jason Fischl, Gregory
629	   Maxwell, Alan Duric, Jonathan Christensen, Julian Spittka, and Henry
630	   Sinnreich.  We would also like to thank Cullen Jennings and Gregory
631	   Lebovitz for their advice.

633	9.  Informative References

635	   [carot09]  Carot, A., Werner, C., and T. Fischinger, "Towards a
636	              Comprehensive Cognitive Analysis of Delay-Influenced
637	              Rhythmical Interaction", 2009.

639	   [PAYLOADS]
640	              Handley, M. and C. Perkins, "Guidelines for Writers of RTP
641	              Payload Format Specifications", RFC 2736, BCP 36.

643	   [RTP]      Schulzrinne, H., Casner, S., Frederick, R., and V.
644	              Jacobson, "RTP: A Transport Protocol for Real-Time
645	              Applications", RFC 3550, July 2003.

647	   [SRTP]     Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
648	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
649	              RFC 3711, March 2004.

651	   [wright08]
652	              Wright, C., Ballard, L., Coull, S., Monrose, F., and G.
653	              Masson, "Spot me if you can: Uncovering spoken phrases in
654	              encrypted VoIP conversations", 2008.

656	Authors' Addresses

658	   Jean-Marc Valin
659	   Octasic Inc.
660	   4101, Molson Street
661	   Montreal, Quebec
662	   Canada

664	   Email: jean-marc.valin@octasic.com

666	   Slava Borilin
667	   SPIRIT DSP

669	   Email: borilin@spiritdsp.net

671	   Koen Vos
672	   Skype

674	   Email: koen.vos@skype.net

676	   Christopher Montgomery
677	   Xiph.Org Foundation

679	   Email: xiphmont@xiph.org

681	   Raymond (Juin-Hwey) Chen
682	   Broadcom Corporation

684	   Email: rchen@broadcom.com