idnits 2.17.1 

draft-ietf-codec-requirements-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 27, 2011) is 4656 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 4566
     (Obsoleted by RFC 8866)


     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	codec                                                          JM. Valin
3	Internet-Draft                                                   Mozilla
4	Intended status: Informational                                    K. Vos
5	Expires: January 28, 2012                        Skype Technologies S.A.
6	                                                           July 27, 2011

8	                Requirements for an Internet Audio Codec
9	                    draft-ietf-codec-requirements-05

11	Abstract

13	   This document provides specific requirements for an Internet audio
14	   codec.  These requirements address quality, sampling rate, bit-rate,
15	   and packet loss robustness, as well as other desirable properties.

17	Status of this Memo

19	   This Internet-Draft is submitted in full conformance with the
20	   provisions of BCP 78 and BCP 79.

22	   Internet-Drafts are working documents of the Internet Engineering
23	   Task Force (IETF).  Note that other groups may also distribute
24	   working documents as Internet-Drafts.  The list of current Internet-
25	   Drafts is at http://datatracker.ietf.org/drafts/current/.

27	   Internet-Drafts are draft documents valid for a maximum of six months
28	   and may be updated, replaced, or obsoleted by other documents at any
29	   time.  It is inappropriate to use Internet-Drafts as reference
30	   material or to cite them other than as "work in progress."

32	   This Internet-Draft will expire on January 28, 2012.

34	Copyright Notice

36	   Copyright (c) 2011 IETF Trust and the persons identified as the
37	   document authors.  All rights reserved.

39	   This document is subject to BCP 78 and the IETF Trust's Legal
40	   Provisions Relating to IETF Documents
41	   (http://trustee.ietf.org/license-info) in effect on the date of
42	   publication of this document.  Please review these documents
43	   carefully, as they describe your rights and restrictions with respect
44	   to this document.  Code Components extracted from this document must
45	   include Simplified BSD License text as described in Section 4.e of
46	   the Trust Legal Provisions and are provided without warranty as
47	   described in the Simplified BSD License.

49	Table of Contents

51	   1.  Introduction
52	   2.  Definitions
53	   3.  Applications
54	     3.1.  Point to point calls
55	     3.2.  Conferencing
56	     3.3.  Telepresence
57	     3.4.  Teleoperation and Remote Software Services
58	     3.5.  In-game voice chat
59	     3.6.  Live distributed music performances / Internet music lessons
61	     3.7.  Delay Tolerant Networking or Push-to-Talk Services
62	     3.8.  Other applications
63	   4.  Constraints Imposed by the Internet on the Codec
64	   5.  Detailed Basic Requirements
65	     5.1.  Operating space
66	     5.2.  Quality and bit-rate
67	     5.3.  Packet loss robustness
68	     5.4.  Computational resources
69	   6.  Additional considerations
70	     6.1.  Low-complexity audio mixing
71	     6.2.  Encoder side potential for improvement
72	     6.3.  Layered bit-stream
73	     6.4.  Partial redundancy
74	     6.5.  Stereo support
75	     6.6.  Bit error robustness
76	     6.7.  Time stretching and shortening
77	     6.8.  Input robustness
78	     6.9.  Support of Audio forensics
79	     6.10. Legacy compatibility
80	   7.  Security Considerations
81	   8.  IANA Considerations
82	   9.  Acknowledgments
83	   10. Informative References
84	   Authors' Addresses

86	1.  Introduction

88	   This document provides requirements for an audio codec designed
89	   specifically for use over the Internet.  The requirements attempt to
90	   address the needs of the most common Internet interactive audio
91	   transmission applications and to ensure good quality when operating
92	   in conditions that are typical for the Internet.  These requirements
93	   address the quality, sampling rate, delay, bit-rate, and packet loss
94	   robustness.  Other desirable codec properties are considered as well.

96	2.  Definitions

98	   Throughout this document, we will use the following conventions when
99	   referring to the sampling rate of a signal:

101	      Narrowband: 8 kHz

103	      Wideband: 16 kHz

105	      Super-wideband: 24/32 kHz

107	      Full-band: 44.1/48 kHz

109	   Codec bit-rates in bits per second (b/s) will be considered without
110	   counting any overhead (IP/UDP/RTP headers, padding, ...).  The codec
111	   delay is the total algorithmic delay when one adds the codec frame
112	   size to the "look-ahead".  It is thus the minimum theoretically
113	   achievable end-to-end delay of a transmission system that uses the
114	   codec.
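
   As an illustration of this definition, the following minimal Python
   sketch adds the frame size to the look-ahead.  The function names and
   the example values (a 20 ms frame with 5 ms of look-ahead) are
   hypothetical and are not taken from any particular codec.

```python
def codec_delay_ms(frame_size_ms: float, look_ahead_ms: float) -> float:
    """Algorithmic delay: codec frame size plus look-ahead, in ms."""
    if frame_size_ms <= 0 or look_ahead_ms < 0:
        raise ValueError("frame size must be positive, look-ahead non-negative")
    return frame_size_ms + look_ahead_ms


def samples_per_frame(sampling_rate_hz: int, frame_size_ms: float) -> int:
    """Number of samples in one frame at the given sampling rate."""
    return round(sampling_rate_hz * frame_size_ms / 1000)


# Hypothetical wideband configuration: 20 ms frames, 5 ms look-ahead.
print(codec_delay_ms(20, 5))          # 25
print(samples_per_frame(16000, 20))   # 320
```

   Network, buffering, and acoustic delays come on top of this figure,
   which is why the text calls it a theoretical minimum.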

116	3.  Applications

118	   The following applications should be considered for Internet audio
119	   codecs, along with their requirements:

121	   o  Point to point calls

123	   o  Conferencing

125	   o  Telepresence

127	   o  Teleoperation

129	   o  In-game voice chat

131	   o  Live distributed music performances / Internet music lessons

133	   o  Delay Tolerant Networking or Push-to-Talk Services

135	   o  Other applications

137	3.1.  Point to point calls

139	   Point to point calls are voice over IP (VoIP) calls between two
140	   "standard" (fixed or mobile) phones, implemented in hardware or
141	   software.  For these applications, a wideband codec is required,
142	   along with narrowband support for compatibility with legacy telephony
143	   equipment (PSTN).  It is expected for the range of useful bit-rates
144	   to be 12 - 32 kb/s for wideband speech and 8 - 16 kb/s for narrowband
145	   speech.  The codec delay must be less than 40 ms, and a delay of no
146	   more than 25 ms is desirable.  Support for encoding music is not
147	   required, but it is desirable for the codec not to make background
148	   (on-hold) music excessively unpleasant to hear.  Also, the codec
149	   should be robust to noise (produce intelligible speech and no
150	   annoying artifacts) even at lower bit-rates.

152	3.2.  Conferencing

154	   Conferencing applications (which support multi-party calls) have
155	   additional requirements on top of the requirements for point-to-point
156	   calls.  Conferencing systems often have higher-fidelity audio
157	   equipment and have greater network bandwidth available -- especially
158	   when video transmission is involved.  For that reason, support for
159	   super-wideband audio becomes important, with useful bit-rates in the
160	   32 - 64 kb/s range.  The ability to vary the bit-rate (VBR) according
161	   to the "difficulty" of the audio signal is a desirable feature for
162	   the codec.  This not only saves bandwidth "on average", but it can
163	   also help conference servers make more efficient use of the available
164	   bandwidth by using more bandwidth for important audio streams and
165	   less bandwidth for less important ones (e.g. background noise).

167	   Conferencing end-points often operate in hands-free conditions, which
168	   creates acoustic echo problems.  Lower delay is therefore important,
169	   as it reduces the quality degradation due to any residual echo after
170	   acoustic echo cancellation (AEC).  Consequently, the codec delay must
171	   be less than 30 ms for this application.  An optional low-delay mode
172	   with less than 10 ms delay is desirable, but not required.

175	   Most conferencing systems operate with a bridge that mixes some (or
176	   all) of the audio streams and sends them back to all the
177	   participants.  In that case, it is important that the codec not
178	   produce annoying artefacts when two voices are present at the same
179	   time.  Also, this mixing operation should be as easy as possible to
180	   perform.  To make it easier to determine which streams have to be
181	   mixed (and which are noise/silence), it must be possible to measure
182	   (or estimate) the voice activity in a packet without having to fully
183	   decode the packet (saving most of the complexity when the packet need
184	   not be decoded).  The ability to save on the computational
185	   complexity when mixing is also desirable, but not required.  For
186	   example, a transform codec may make it possible to mix the streams in
187	   the transform domain, without having to go back to time-domain.  Low-
188	   complexity up-sampling and down-sampling within the codec is also a
189	   desirable feature when mixing streams with different sampling rates.

191	3.3.  Telepresence

193	   Most telepresence applications can be considered to be essentially
194	   very high-quality video-conferencing environments, so all of the
195	   conferencing requirements also apply to telepresence.  In addition,
196	   telepresence applications require super-wideband and full-band audio
197	   capability with useful bit-rates in the 32 - 80 kb/s range.  While
198	   voice is still the most important signal to be encoded, it must be
199	   possible to obtain good quality (even if not transparent) music.

201	   Most telepresence applications require more than one audio channel,
202	   so support for stereo and multi-channel is important.  While this can
203	   always be accomplished by encoding multiple single-channel streams,
204	   it is preferable to take advantage of the redundancy that exists
205	   between channels.

207	3.4.  Teleoperation and Remote Software Services

209	   Teleoperation applications are similar to telepresence, with the
210	   exception that they involve remote physical interactions.  For
211	   example, the user may be controlling a robot while receiving real-
212	   time audio feedback from that robot.  For these applications, the
213	   delay has to be less than 10 ms.  The other requirements of
214	   telepresence (quality, bit-rate, multi-channel) apply to
215	   teleoperation as well.  The only exception is that mixing is not an
216	   important issue for teleoperation.

218	   The requirements for remote software services are similar to those of
219	   teleoperation.  These applications include remote desktop
220	   applications, remote virtualization, and interactive media
221	   applications being rendered remotely (e.g. video games rendered on
222	   central servers).  For all these applications, full-band audio with
223	   an algorithmic delay below 10 ms is important.

225	3.5.  In-game voice chat

227	   An increasing number of computer/console games make use of VoIP to
228	   allow players to communicate in real-time.  The requirements for
229	   gaming are similar to those of conferencing, with the main difference
230	   being that narrowband compatibility is not necessary.  While for most
231	   applications a codec delay up to 30 ms is acceptable, a low-delay (<
232	   10 ms) option is highly desirable, especially for games with rapid
233	   interactions.  The ability to use VBR (with a maximum allowed
234	   bitrate) is also highly desirable because it can significantly reduce
235	   the bandwidth requirement for a game server.

237	3.6.  Live distributed music performances / Internet music lessons

239	   Live music over the Internet requires extremely low end-to-end delay
240	   and is one of the most demanding applications for interactive audio
241	   transmission.  It has been observed that for most scenarios, total
242	   end-to-end delays up to 25 ms could be tolerated by musicians, with
243	   the absolute limit (where none of the scenarios are possible) being
244	   around 50 ms [carot09].  In order to achieve this low delay on the
245	   Internet -- either in the same city or a nearby city -- the network
246	   propagation time must be taken into account.  When also subtracting
247	   the delay of the audio buffer, jitter buffer, and acoustic path, that
248	   leaves around 2 ms to 10 ms for the total delay of the codec.
249	   Considering the speed of light in fiber, every 1 ms reduction in the
250	   codec delay increases the range over which synchronization is
251	   possible by approximately 200 km.
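
   The delay budget described above can be sketched as simple
   arithmetic.  In the Python sketch below, the 200 km per millisecond
   figure is the approximation quoted in the text; the 25 ms budget is
   the tolerance cited from [carot09], while the 10 ms allowance for the
   audio buffer, jitter buffer, and acoustic path is an assumed value
   chosen only for illustration.  The function name is hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # approximate propagation figure quoted in the text


def max_distance_km(budget_ms: float, codec_ms: float, other_ms: float) -> float:
    """Distance reachable with what is left of the end-to-end budget.

    other_ms lumps together the audio buffer, jitter buffer, and
    acoustic path delays (an assumed value supplied by the caller).
    """
    propagation_ms = budget_ms - codec_ms - other_ms
    return max(0.0, propagation_ms) * FIBER_KM_PER_MS


# Assumed example: 25 ms budget, 10 ms of buffering and acoustic delay.
for codec_ms in (10.0, 5.0, 2.0):
    print(codec_ms, max_distance_km(25.0, codec_ms, 10.0))
```

   The sketch ignores routing detours and queuing, so real distances
   would be shorter; it only shows that each millisecond removed from
   the codec delay buys roughly 200 km of range.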

253	   Acoustic echo is expected to be an even more important issue for
254	   network music than it is in conferencing, especially considering that
255	   the music quality requirements essentially forbid the use of a
256	   "nonlinear processor" (NLP) with the AEC.  This is another reason why
257	   very low delay is essential.

259	   Considering that the application is music, the full audio bandwidth
260	   (44.1 or 48 kHz sampling rate) must be transmitted with a bit-rate
261	   that is sufficient to provide near-transparent to transparent
262	   quality.  With the current audio coding technology, this corresponds
263	   to approximately 64 kb/s to 128 kb/s per channel.  As for
264	   telepresence, support for two or more channels is often desired, so
265	   it would be useful for a codec to be able to take advantage of the
266	   redundancy that is often present between audio channels.

268	3.7.  Delay Tolerant Networking or Push-to-Talk Services

270	   Internet transmissions are subjected to interruptions of connectivity
271	   that severely disturb a phone call.  This may happen in cases of
272	   route changes, handovers, slow fading, or device failures.  To
273	   overcome this distortion, the phone call can be halted and resumed
274	   after the connectivity has been reestablished.

276	   Also, if transmission capacity is lower than the minimal coding rate,
277	   switching to a push-to-talk mode still allows for effective
278	   communication.  In that situation, voice is transmitted at slower-
279	   than-real-time bitrate and conversations are interrupted until the
280	   speech has been transmitted.

282	   These modes require interrupting the audio playout and continuing
283	   after a pause of arbitrary duration.

285	3.8.  Other applications

287	   The above list is by no means a complete list of all applications
288	   involving interactive audio transmission on the Internet.  However,
289	   it is believed that meeting the needs of all these different
290	   applications should be sufficient to ensure that the needs of most
291	   applications not listed will also be met.

293	4.  Constraints Imposed by the Internet on the Codec

295	   Packet losses are inevitable on the Internet and dealing with those
296	   is one of the most fundamental requirements for an Internet audio
297	   codec.  While any audio codec can be combined with a good packet loss
298	   concealment (PLC) algorithm, the important aspect is what happens on
299	   the first packets received _after_ the loss.  More specifically, this
300	   means that:

302	   o  it should be possible to interpret the contents of any received
303	      packet, irrespective of previous losses as specified in BCP 36
304	      [PAYLOADS]; and

306	   o  the decoder should re-synchronize as quickly as possible (i.e. the
307	      output should quickly converge to the output that would have been
308	      obtained if no loss had occurred).

310	   The constraint of being able to decode any packet implies the
311	   following considerations for an audio codec:

313	   o  The size of a compressed frame must be kept smaller than the MTU
314	      to avoid fragmentation;

316	   o  The interpretation of any parameter encoded in the bit-stream must
317	      not depend on information contained in other packets.  For
318	      example, it is not acceptable for a codec to allow signaling a
319	      mode change in one packet and assume that subsequent frames will
320	      be decoded according to that mode.
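
   The first consideration above can be checked with a short
   calculation.  The following Python sketch is only an illustration: it
   assumes a 1500-byte MTU and IPv4/UDP/RTP headers without options or
   extensions (40 bytes), and it counts those headers against the MTU,
   which the text does not explicitly require.  The function names are
   hypothetical.

```python
import math

# IPv4 (20) + UDP (8) + RTP (12) header bytes, without options or extensions.
HEADER_BYTES = 20 + 8 + 12


def frame_bytes(bit_rate_bps: int, frame_ms: float) -> int:
    """Size of one compressed frame, rounded up to whole bytes."""
    return math.ceil(bit_rate_bps * frame_ms / 8000)


def fits_in_mtu(bit_rate_bps: int, frame_ms: float, mtu_bytes: int = 1500) -> bool:
    """True if one frame plus the assumed headers fits in a single packet."""
    return frame_bytes(bit_rate_bps, frame_ms) + HEADER_BYTES <= mtu_bytes


print(frame_bytes(128000, 20))   # 320
print(fits_in_mtu(128000, 20))   # True
```

   At the bit-rates and frame sizes discussed in this document, frames
   are far below the assumed MTU; the constraint matters mainly when
   many frames or channels are packed into one packet.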

322	   Although the interpretation of parameters cannot depend on other
323	   packets, it is still reasonable to use some amount of prediction
324	   across frames, provided that the predictors can resynchronize quickly
325	   in case of a lost packet.  In this case, it is important to use the
326	   best compromise between the gain in coding efficiency and the loss in
327	   packet loss robustness due to the use of inter-frame prediction.  It
328	   is a desirable property for the codec to allow some real-time control
329	   of that trade-off so that it can take advantage of more prediction
330	   when the loss rate is small, while being more robust to losses when
331	   the loss rate is high.

333	   To improve the robustness to packet loss, it would be desirable for
334	   the codec to allow an adaptive (data- and network-dependent) amount
335	   of side information to help improve audio quality when losses occur.
336	   For example, this side information may include the retransmission of
337	   certain parameters encoded in the previous frame(s).

339	   To ensure freedom of implementation, decoder-side only error
340	   concealment does not need to be specified, although a functional PLC
341	   algorithm is desirable as part of the codec reference implementation.
342	   Obviously, any information signaled in the bitstream intended to aid
343	   PLC needs to be specified.

345	   Another important property of the Internet is that it is mostly a
346	   best-effort network, with no guaranteed bandwidth.  This means that
347	   the codec has to be able to vary its output bit-rate dynamically (in
348	   real-time), without requiring an out-of-band signaling mechanism, and
349	   without causing audible artifacts at the bit-rate change boundaries.
350	   Additional desirable features are:

352	   o  Having the possibility to use smooth bit-rate changes with one
353	      byte/frame resolution;

355	   o  Making it possible for a codec to adapt its bit-rate based on the
356	      source signal being encoded (source-controlled VBR) to maximize
357	      the quality for a certain _average_ bit-rate.
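
   To make the "one byte/frame resolution" concrete, the following
   Python sketch converts a one-byte change in payload size into a
   bit-rate step.  The frame durations used are examples only and the
   function names are hypothetical.

```python
def bit_rate_step_bps(frame_ms: float) -> float:
    """Bit-rate change caused by adding one byte to every frame."""
    return 8 * 1000 / frame_ms


def bit_rate_for_payload(payload_bytes: int, frame_ms: float) -> float:
    """Codec bit-rate (headers excluded) for a given payload size."""
    return payload_bytes * 8 * 1000 / frame_ms


print(bit_rate_step_bps(20))         # 400.0
print(bit_rate_for_payload(40, 20))  # 16000.0
```

   With 20 ms frames, byte resolution therefore allows steps of
   400 b/s; shorter frames give proportionally coarser steps.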

359	   Because the Internet transmits data in bytes, a codec should produce
360	   compressed data in integer numbers of bytes.  In general, the codec
361	   design should take into consideration explicit congestion
362	   notification (ECN) and may include features that would improve the
363	   quality of an ECN implementation.

365	   The IETF has defined a set of application-layer protocols to be used
366	   for the real-time transport of multimedia data, including
367	   voice.  It is thus important for the resulting codec to be easy to
368	   use with these protocols.  For example, it must be possible to create
369	   an [RTP] payload format that conforms to BCP 36 [PAYLOADS].  If any
370	   codec parameters need to be negotiated between end-points, the
371	   negotiation should be as easy as possible to carry over SIP
372	   [RFC3261]/SDP [RFC4566] or alternatively over XMPP [RFC6120]/Jingle
373	   [XEP-0167].

375	5.  Detailed Basic Requirements

377	   This section summarizes all the constraints imposed by the target
378	   applications and by the Internet into a set of actual requirements
379	   for codec development.

381	5.1.  Operating space

383	   The operating space for the target applications can be divided in
384	   terms of delay: most applications require a "medium delay" (20-30
385	   ms), while a few require a "very low delay" (< 10 ms).  It makes
386	   sense to divide the space based on delay because lowering the delay
387	   has a cost in terms of quality vs bit-rate.

389	   For medium delay, the resulting codec must be able to efficiently
390	   operate within the following range of bit-rates (per channel):

392	   o  Narrowband: 8 kb/s to 16 kb/s

394	   o  Wideband: 12 to 32 kb/s

396	   o  Super-wideband: 24 to 64 kb/s

398	   o  Full-band: 32 to 80 kb/s

400	   Obviously, a lower-delay codec that can operate in the above range is
401	   also acceptable.

403	   For very low delay, the resulting codec will need to operate within
404	   the following range of bit-rates (per channel):

406	   o  Super-wideband: 32 to 80 kb/s

408	   o  Full-band: 48 to 128 kb/s

410	   o  (Narrowband and wideband not required)
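
   The two operating ranges above can be captured in a small lookup
   table.  The Python sketch below simply restates the per-channel
   ranges listed in this section; the data layout and function name are
   hypothetical.

```python
# Per-channel bit-rate ranges in kb/s, copied from the lists above.
MEDIUM_DELAY = {
    "narrowband": (8, 16),
    "wideband": (12, 32),
    "super-wideband": (24, 64),
    "full-band": (32, 80),
}
VERY_LOW_DELAY = {
    "super-wideband": (32, 80),
    "full-band": (48, 128),
}


def in_operating_space(band: str, kbps: float, very_low_delay: bool = False) -> bool:
    """True if (band, bit-rate) lies inside the required operating range."""
    table = VERY_LOW_DELAY if very_low_delay else MEDIUM_DELAY
    if band not in table:
        return False  # e.g. narrowband is not required at very low delay
    low, high = table[band]
    return low <= kbps <= high


print(in_operating_space("wideband", 24))                      # True
print(in_operating_space("wideband", 24, very_low_delay=True)) # False
```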

412	5.2.  Quality and bit-rate

414	   The quality of a codec is directly linked to the bit-rate, so these
415	   two must be considered jointly.  When comparing the bit-rate of
416	   codecs, the overhead of IP/UDP/RTP headers should not be considered,
417	   but any additional bits required in the RTP payload format after the
418	   header (e.g. required signalling) should be considered.  In terms of
419	   quality vs bit-rate, the codec to be developed must be better than
420	   the following codecs, which are generally considered royalty-free:

422	   o  For narrowband: Speex (NB) [Speex], and iLBC(*) [RFC3951]

424	   o  For wideband: Speex (WB) [Speex], G.722.1(*) [ITU.G722.1]

426	   o  For super-wideband/fullband: G.722.1C(*) [ITU.G722.1]

428	   The codecs marked with (*) have additional licensing restrictions,
429	   but the codec to be developed should still not perform significantly
430	   worse.  In addition to the quality targets listed above, a desirable
431	   objective is for the codec quality to be no worse than AMR-NB and
432	   AMR-WB, for narrowband and wideband, respectively.  Quality should be
433	   measured for multiple languages, including tonal languages.  The case
434	   of multiple simultaneous voices (as sometimes happens in
435	   conferencing) should be evaluated as well.

437	   The comparison with the above codecs assumes that the codecs being
438	   compared have similar delay characteristics.  The bit-rate required
439	   for a certain level of quality may be higher than the referenced
440	   codecs in cases where a much lower delay is required.  In that case,
441	   the increase in bit-rate must be less than the ratio between the
442	   delays.
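
   One possible reading of this rule is that the bit-rate may grow by a
   factor smaller than the reference delay divided by the new delay.
   The Python sketch below follows that interpretation, which is an
   assumption; the function name and the example figures are
   hypothetical.

```python
def bit_rate_ceiling_bps(ref_bit_rate_bps: float,
                         ref_delay_ms: float,
                         new_delay_ms: float) -> float:
    """Bit-rate that a lower-delay codec must stay below at equal quality."""
    if ref_delay_ms <= 0 or new_delay_ms <= 0:
        raise ValueError("delays must be positive")
    if new_delay_ms >= ref_delay_ms:
        return ref_bit_rate_bps  # no delay reduction, so no allowance
    return ref_bit_rate_bps * ref_delay_ms / new_delay_ms


# Assumed example: reference codec at 24 kb/s with 20 ms delay,
# candidate codec with 5 ms delay (delay ratio of 4).
print(bit_rate_ceiling_bps(24000, 20, 5))  # 96000.0
```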

444	   It is desirable for the codecs to support source-controlled variable
445	   bit-rate (VBR) to take advantage of the fact that different inputs
446	   require a different bitrate to achieve the same quality.  However, it
447	   should still be possible to use the codec at truly constant bit-rate
448	   to ensure that no information leak is possible when using an
449	   encrypted channel.

451	5.3.  Packet loss robustness

453	   Robustness to packet loss is a very important aspect of any codec to
454	   be used on the Internet.  Codecs must maintain acceptable quality at
455	   loss rates up to 5% and maintain good intelligibility up to 15% loss
456	   rate.  At any sampling rate, bit-rate, and packet loss rate, the
457	   quality must be no less than the quality obtained with the Speex
458	   codec or the GSM-FR codec in the same conditions.  The actual packet
459	   loss "patterns" to be used in testing must be obtained from real
460	   packet loss traces collected on the Internet, rather than from loss
461	   models.  These traces should be representative of the typical
462	   environments in which the applications of Section 3 operate.  For
463	   example, traces related to VoIP calls should consider the loss
464	   patterns observed for typical home broadband and corporate
465	   connections.
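
   A test harness could replay a recorded loss trace along the
   following lines.  This Python sketch assumes a trace is stored as a
   sequence of booleans (True meaning "lost"); that representation, the
   function names, and the cycling of short traces are all hypothetical
   choices made for illustration.

```python
from typing import List, Optional, Sequence


def loss_rate(trace: Sequence[bool]) -> float:
    """Fraction of lost packets; trace[i] is True when packet i was lost."""
    if not trace:
        raise ValueError("empty trace")
    return sum(trace) / len(trace)


def apply_trace(packets: Sequence[bytes],
                trace: Sequence[bool]) -> List[Optional[bytes]]:
    """Replace lost packets with None, cycling the trace if it is shorter."""
    return [None if trace[i % len(trace)] else p
            for i, p in enumerate(packets)]


trace = [False, True, False, False]
print(loss_rate(trace))                            # 0.25
print(apply_trace([b"a", b"b", b"c"], trace))      # [b'a', None, b'c']
```

   The decoder under test would then run its concealment wherever the
   replayed stream contains None.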

467	5.4.  Computational resources

469	   The resulting codec should be implementable on a wide range of
470	   devices, so there should be a fixed-point implementation or at least
471	   assurance that a reasonable fixed-point implementation is possible.
472	   The computational resource figures listed below are meant to be upper
473	   bounds.  Even below these bounds, resources should still be
474	   minimized.  Any proposed increase in computational resources
475	   consumption (e.g. to increase quality) should be carefully evaluated
476	   even if the resulting resource consumption is below the upper bound.
477	   Having variable complexity would be useful (but not required) in
478	   achieving that goal as it would allow trading quality/bit-rate for
479	   lower complexity.

481	   The computational requirements for real-time encoding and decoding of
482	   a mono signal on one core of a recent x86 CPU (as measured with the
483	   unix "time" utility or equivalent) are as follows:

485	   o  Narrowband: 40 MHz (2% of a 2 GHz CPU core)

487	   o  Wideband: 80 MHz (4% of a 2 GHz CPU core)

489	   o  Superwideband/fullband: 200 MHz (10% of a 2 GHz CPU core)
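
   The MHz figures above can be derived from a timing measurement as
   sketched below in Python.  The sketch assumes that the CPU time
   reported by the "time" utility is divided by the duration of the
   audio processed and scaled by the clock rate; the function names and
   the sample measurement are hypothetical.

```python
# Upper bounds in MHz, copied from the list above.
MAX_MHZ = {
    "narrowband": 40,
    "wideband": 80,
    "super-wideband": 200,
    "full-band": 200,
}


def complexity_mhz(cpu_seconds: float, audio_seconds: float,
                   clock_mhz: float) -> float:
    """Equivalent clock rate needed for real-time encode plus decode."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return clock_mhz * cpu_seconds / audio_seconds


def within_bound(band: str, mhz: float) -> bool:
    """True if the measured complexity does not exceed the upper bound."""
    return mhz <= MAX_MHZ[band]


# Assumed measurement: 3 s of CPU time for 60 s of audio on a 2 GHz core.
mhz = complexity_mhz(3.0, 60.0, 2000.0)
print(mhz, within_bound("wideband", mhz), within_bound("full-band", mhz))
```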

491	   It is a desirable objective that the MHz values listed above also be
492	   achievable on fixed-point digital signal processors that are capable
493	   of single-cycle multiply-accumulate operations (16x16 multiplication
494	   accumulated into 32 bits).

496	   For applications that require mixing (e.g. conferencing), it should
497	   be possible to estimate the energy and/or the voice activity status
498	   of the decoded signal with less than 10% of the complexity figures
499	   listed above.

501	   It is the intent to maximize the range of devices on which a codec
502	   can be implemented.  For this reason, the reference implementation
503	   must not depend on special hardware features or instructions to be
504	   present in order to meet the complexity requirement.  However, it may
505	   be desirable to take advantage of such hardware when available
506	   (e.g., hardware accelerators for operations like fast Fourier
507	   transforms and convolutions).  A codec should also minimize the use
508	   of saturating arithmetic so as to be implementable on architectures
509	   that do not provide hardware saturation (e.g. ARMv4).

511	   The combined codec size and data ROM should be small enough not to
512	   cause significant implementation problems on typical embedded
513	   devices.  The codec context/state size required should be no more
514	   than 2*R*C bytes in floating-point, where R is the sampling rate and
515	   C is the number of channels.  For fixed-point, that size should be
516	   less than R*C bytes.  The scratch space required should also be less
517	   than 2*R*C bytes for floating-point or less than R*C bytes for
518	   fixed-point.
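
   As a worked example of these memory bounds, the Python sketch below
   evaluates 2*R*C and R*C.  It assumes that R is expressed in Hz
   (samples per second), which the text does not state explicitly; the
   function name is hypothetical.

```python
def memory_bound_bytes(sampling_rate_hz: int, channels: int,
                       fixed_point: bool) -> int:
    """Bound on state (or scratch) size: R*C fixed-point, 2*R*C float."""
    factor = 1 if fixed_point else 2
    return factor * sampling_rate_hz * channels


# Full-band stereo (48 kHz, 2 channels).
print(memory_bound_bytes(48000, 2, fixed_point=False))  # 192000
print(memory_bound_bytes(48000, 2, fixed_point=True))   # 96000
```

   Note that the text allows the floating-point state size to equal the
   bound ("no more than"), whereas the other three figures are strict
   limits ("less than").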

520	6.  Additional considerations

522	   There are additional features or characteristics that may be
523	   desirable under some circumstances, but should not be part of the
524	   strict requirements.  The benefit of meeting these considerations
525	   should be weighed against the associated cost.

527	6.1.  Low-complexity audio mixing

529	   In many applications that require a mixing server (e.g. conferencing,
530	   games), it is important to minimize the computational cost of the
531	   mixing.  As much as possible, it should be possible to perform the
532	   mixing with fewer computations than it would take to decode all the
533	   streams, mix them, and re-encode the result.  Properties that reduce
534	   the complexity of the mixing process include:

536	   o  the ability to derive sufficient parameters, such as loudness
537	      and/or spectral envelope, for estimating voice activity of a
538	      compressed frame without fully decoding that frame;

540	   o  the ability to mix the streams in an intermediate representation
541	      (e.g. transform domain), rather than having to fully decode the
542	      signals before the mixing;

544	   o  the use of bit-stream layers (Section 6.3) by aggregating a small
545	      number of active streams at lower quality.

547	   For conferencing applications, the total complexity of the decoding,
548	   VAD and mixing should be considered when evaluating proposals.

550	6.2.  Encoder side potential for improvement

552	   In many codecs, it is possible to improve the quality by improving
553	   the encoder without breaking compatibility (i.e. without changing the
554	   decoder).  Potential for improvement varies from one codec to
555	   another.  It is generally low for PCM or ADPCM codecs and higher for
556	   perceptual transform codecs.  All things being equal, being able to
557	   improve a codec after the bit-stream is frozen is a desirable
558	   property.  However, this should not be done at the expense of quality
559	   in the reference encoder.  Other potential improvements include
560	   signal-adaptive frame size selection and improved discontinuous
561	   transmission (DTX) algorithms that take advantage of predicting the
562	   decoder-side packet loss concealment (PLC) algorithms.

564	6.3.  Layered bit-stream

566	   A layered codec makes it possible to transmit only a certain subset
567	   of the bits and still obtain a valid bit-stream with a quality that
568	   is equivalent to the quality that would be obtained from encoding at
569	   the corresponding rate.  While this is not a necessary feature for
570	   most applications, it can be desirable for cases where a "mixing
571	   server" needs to handle a large number of streams with limited
572	   computational resources.

574	6.4.  Partial redundancy

576	   One possible way of increasing robustness to packet loss is to
577	   include partial redundancy within packets.  This can be achieved
578	   either by including the base layer of the previous frame (for a
579	   layered codec) or by transmitting other parameters from the previous
580	   frame(s) to assist the PLC algorithm in case of loss.  The ability to
581	   include partial redundancy for high-loss scenarios is desirable,
582	   provided that the feature can be dynamically turned on or off (so
583	   that no bandwidth is wasted in case of loss-free transmission).

585	6.5.  Stereo support

587	   It is highly desirable for the codec to have stereo support.  At a
588	   minimum, the codec should be able to encode two channels
589	   independently without causing significant stereo image artefacts.  It
590	   is also desirable for the codec to take advantage of the inter-
591	   channel redundancy in stereo audio to reduce the bitrate (for an
592	   equivalent quality) of stereo audio compared to coding channels
593	   independently.
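
   One well-known way to exploit inter-channel redundancy is a mid/side
   transform; it is used here only as an illustration and is not
   mandated by this document.  The sketch assumes integer PCM samples
   and uses sum and difference signals so that reconstruction is exact.

```python
from typing import List, Tuple


def to_mid_side(left: List[int], right: List[int]) -> Tuple[List[int], List[int]]:
    """Replace left/right by their sum (mid) and difference (side)."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid: List[int], side: List[int]) -> Tuple[List[int], List[int]]:
    """Invert to_mid_side; mid + side and mid - side are always even."""
    left = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right


def energy(signal: List[int]) -> int:
    """Sum of squared samples, a crude proxy for coding cost."""
    return sum(v * v for v in signal)
```

   When the two channels are strongly correlated, the side signal
   carries far less energy than either original channel and can be
   coded with fewer bits than a second independent channel would need.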

595	6.6.  Bit error robustness

597	   The vast majority of Internet-based applications do not need to be
598	   robust to bit errors because packets either arrive unaltered, or do
599	   not arrive at all.  Considering that, the emphasis should be on
600	   packet loss robustness and packet loss concealment.  That being said,
601	   it is often the case that extra robustness to bit errors can be
602	   achieved at no cost at all (i.e. no increase in size, complexity or
603	   bit-rate, no decrease in quality or packet loss robustness, ...).  In
604	   those cases, it is useful to make a change that increases the
605	   robustness to bit errors.  This can be useful for applications that
606	   use UDP Lite transmission (e.g. over a wireless LAN).  Robustness to
607	   packet loss should *never* be sacrificed to achieve higher bit error
608	   robustness.

610	6.7.  Time stretching and shortening

612	   When adaptive jitter buffers are used, it is often necessary to
613	   stretch or shorten the audio signal to allow changes in buffering.
614	   While this operation can be performed directly on the decoder's
615	   output, it is often more computationally efficient to stretch or
616	   shorten the signal directly within the decoder.  It is desirable for
617	   the reference implementation to provide a time stretching/shortening
618	   implementation, although it should not be normative.

620	6.8.  Input robustness

622	   The systems providing input to the encoder and receiving output from
623	   the decoder may be far from ideal in actual use.  Input and output
624	   audio streams may be corrupted by compounding non-linear artifacts
625	   from analog hardware and digital processing.  The codecs to be
626	   developed should be tested to ensure that they degrade gracefully
627	   under adverse audio input conditions.  Types of digital corruption
628	   that may be tested include tandeming, transcoding, low-quality
629	   resampling, and digital clipping.  Types of analog corruption that
630	   may be tested include microphones with substantial background noise,
631	   analog clipping, and loudspeaker distortion.  No specific end-to-end
632	   quality requirements are mandated for use with the proposed codec.
633	   It is advisable, however, that several typical in-situ environments/
634	   processing chains be specified for the purpose of benchmarking end-
635	   to-end quality with the proposed codec.

637	6.9.  Support of audio forensics

639	   Emergency calls can be analyzed using audio forensics if the context
640	   and situation of the caller have to be identified.  Thus, it is
641	   important not only to transmit the voice of the caller well but also
642	   to transmit background noise at high quality.  In these situations,
643	   sounds or noises of low volume should also not be compressed or
644	   dropped.  For this reason, the encoder must allow DTX to be disabled
645	   when required (e.g. for emergency calls).

647	6.10.  Legacy compatibility

649	   In order to create the best possible codec for the Internet, there is
650	   no requirement for compatibility with legacy Internet codecs.

652	7.  Security Considerations

654	   Although this document itself does not have security considerations,
655	   this section describes the security requirements for the codec.

657	   Just like for any protocol to be used over the Internet, security is
658	   a very important aspect to consider.  This goes beyond the obvious
659	   considerations of preventing buffer overflows and similar attacks
660	   that can lead to denial-of-service or remote code execution.  One
661	   very important security aspect is to make sure that the decoders have
662	   a bounded and reasonable worst-case complexity.  This prevents an
663	   attacker from causing a DoS by sending packets that are specially
664	   crafted to take a very long (or infinite) time to decode.
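
   A minimal sketch of bounding decoder work is shown below.  The packet
   layout (a frame count byte followed by length-prefixed frames), the
   MAX_FRAMES limit, and all names are hypothetical; the point is only
   that every loop bound is checked against a fixed limit and against
   the packet size before any work is done.

```python
from typing import List

MAX_FRAMES = 8  # hypothetical cap on frames per packet


class MalformedPacket(ValueError):
    """Raised when a packet violates the (hypothetical) layout."""


def split_frames(packet: bytes) -> List[bytes]:
    """Split a packet into frames, rejecting input that would cost too much."""
    if not packet:
        raise MalformedPacket("empty packet")
    count = packet[0]
    if count > MAX_FRAMES:
        raise MalformedPacket("too many frames")
    frames = []
    pos = 1
    for _ in range(count):  # at most MAX_FRAMES iterations
        if pos >= len(packet):
            raise MalformedPacket("truncated length field")
        length = packet[pos]
        pos += 1
        if pos + length > len(packet):
            raise MalformedPacket("frame exceeds packet")
        frames.append(packet[pos:pos + length])
        pos += length
    return frames
```

   The work done is bounded by MAX_FRAMES and by the packet length, so a
   crafted packet cannot make this step run for an unbounded time.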

666	   A more subtle aspect is the information leak that can occur when the
667	   codec is used over an encrypted channel (e.g. [SRTP]).  For example,
668	   it was suggested [wright08] [white11] that use of source-controlled
669	   VBR may reveal some information about a conversation through the size
670	   of the compressed packets.  For that reason, it should be possible to
671	   use the codec at a truly constant bit-rate if needed.
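
   The leak comes from packet sizes that vary with the content.  The
   sketch below is not a constant bit-rate encoder; it only illustrates
   the observable property being asked for, by padding variable-size
   payloads to one fixed size.  The one-byte length prefix and the
   function names are assumptions made for this example.

```python
def pad_to_constant_size(payload: bytes, size: int) -> bytes:
    """Return exactly size bytes: 1 length byte, the payload, zero fill."""
    if size > 256 or len(payload) > size - 1:
        raise ValueError("payload does not fit the fixed packet size")
    return bytes([len(payload)]) + payload + bytes(size - 1 - len(payload))


def unpad(packet: bytes) -> bytes:
    """Recover the original payload from a padded packet."""
    length = packet[0]
    return packet[1:1 + length]
```

   After padding, every packet has the same length, so an observer of
   the encrypted stream learns nothing from packet sizes.  A codec with
   a native constant bit-rate mode achieves the same effect without
   spending bits on padding.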

673	8.  IANA Considerations

675	   This document has no actions for IANA.

677	9.  Acknowledgments

679	   The original authors of this document are: Jean-Marc Valin, Slava
680	   Borilin, Koen Vos, Christopher Montgomery and Raymond (Juin-Hwey)
681	   Chen.  We would like to thank all the other people who contributed
682	   directly or indirectly to this document, including Jason Fischl,
683	   Gregory Maxwell, Alan Duric, Jonathan Christensen, Julian Spittka,
684	   Michael Knappe, Christian Hoene, and Henry Sinnreich.  We would also
685	   to thank Cullen Jennings and Gregory Lebovitz for their advice.

687	10.  Informative References

689	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
690	              A., Peterson, J., Sparks, R., Handley, M., and E.
691	              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
692	              June 2002.

694	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
695	              Description Protocol", RFC 4566, July 2006.

697	   [RFC6120]  Saint-Andre, P., "Extensible Messaging and Presence
698	              Protocol (XMPP): Core", RFC 6120, March 2011.

700	   [XEP-0167]
701	              Ludwig, S., Saint-Andre, P., Egan, S., McQueen, R., and D.
702	              Cionoiu, "Jingle RTP Sessions", XSF XEP 0167,
703	              December 2009.

705	   [RFC3951]  Andersen, S., Duric, A., Astrom, H., Hagen, R., Kleijn,
706	              W., and J. Linden, "Internet Low Bit Rate Codec (iLBC)",
707	              RFC 3951, December 2004.

709	   [ITU.G722.1]
710	              International Telecommunications Union, "Low-complexity
711	              coding at 24 and 32 kbit/s for hands-free operation in
712	              systems with low frame loss", ITU-T Recommendation
713	              G.722.1, May 2005.

715	   [Speex]    Xiph.Org Foundation, "Speex: http://www.speex.org/", 2003.

717	   [carot09]  Carot, A., Werner, C., and T. Fischinger, "Towards a
718	              Comprehensive Cognitive Analysis of Delay-Influenced
719	              Rhythmical Interaction: http://www.carot.de/icmc2009.pdf",
720	              2009.

722	   [PAYLOADS]
723	              Handley, M. and C. Perkins, "Guidelines for Writers of RTP
724	              Payload Format Specifications", RFC 2736, BCP 36.

726	   [RTP]      Schulzrinne, H., Casner, S., Frederick, R., and V.
727	              Jacobson, "RTP: A Transport Protocol for Real-Time
728	              Applications", STD 64, RFC 3550, July 2003.

730	   [SRTP]     Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
731	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
732	              RFC 3711, March 2004.

734	   [wright08]
735	              Wright, C., Ballard, L., Coull, S., Monrose, F., and G.
736	              Masson, "Spot me if you can: Uncovering spoken phrases in
737	              encrypted VoIP conversations:
738	              http://www.cs.jhu.edu/~cwright/oakland08.pdf", 2008.

740	   [white11]  White, A., Matthews, A., Snow, K., and F. Monrose,
741	              "Phonotactic Reconstruction of Encrypted VoIP
742	              Conversations: Hookt on fon-iks
743	              http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf",
744	              2011.

746	Authors' Addresses

748	   Jean-Marc Valin
749	   Mozilla
750	   650 Castro Street
751	   Mountain View, CA  94041
752	   USA

754	   Email: jmvalin@jmvalin.ca

756	   Koen Vos
757	   Skype Technologies S.A.
758	   Stadsgarden 6
759	   Stockholm,   11645
760	   Sweden

762	   Email: koen.vos@skype.net