codec                                                          JM. Valin
Internet-Draft                                                   Mozilla
Intended status: Informational                                    K. Vos
Expires: January 28, 2012                        Skype Technologies S.A.
                                                           July 27, 2011


                Requirements for an Internet Audio Codec
                    draft-ietf-codec-requirements-05

Abstract

   This document provides specific requirements for an Internet audio
   codec. These requirements address quality, sampling rate, bit-rate,
   and packet loss robustness, as well as other desirable properties.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 28, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Definitions
   3.  Applications
     3.1.  Point to point calls
     3.2.  Conferencing
     3.3.  Telepresence
     3.4.  Teleoperation and Remote Software Services
     3.5.  In-game voice chat
     3.6.  Live distributed music performances / Internet music
           lessons
     3.7.
           Delay Tolerant Networking or Push-to-Talk Services
     3.8.  Other applications
   4.  Constraints Imposed by the Internet on the Codec
   5.  Detailed Basic Requirements
     5.1.  Operating space
     5.2.  Quality and bit-rate
     5.3.  Packet loss robustness
     5.4.  Computational resources
   6.  Additional considerations
     6.1.  Low-complexity audio mixing
     6.2.  Encoder side potential for improvement
     6.3.  Layered bit-stream
     6.4.  Partial redundancy
     6.5.  Stereo support
     6.6.  Bit error robustness
     6.7.  Time stretching and shortening
     6.8.  Input robustness
     6.9.  Support of Audio forensics
     6.10. Legacy compatibility
   7.  Security Considerations
   8.  IANA Considerations
   9.  Acknowledgments
   10. Informative References
   Authors' Addresses

1.  Introduction

   This document provides requirements for an audio codec designed
   specifically for use over the Internet.
   The requirements attempt to address the needs of the most common
   Internet interactive audio transmission applications and to ensure
   good quality when operating in conditions that are typical for the
   Internet. These requirements address the quality, sampling rate,
   delay, bit-rate, and packet loss robustness. Other desirable codec
   properties are considered as well.

2.  Definitions

   Throughout this document, we will use the following conventions when
   referring to the sampling rate of a signal:

      Narrowband: 8 kHz

      Wideband: 16 kHz

      Super-wideband: 24/32 kHz

      Full-band: 44.1/48 kHz

   Codec bit-rates in bits per second (b/s) will be considered without
   counting any overhead (IP/UDP/RTP headers, padding, ...). The codec
   delay is the total algorithmic delay when one adds the codec frame
   size to the "look-ahead". It is thus the minimum theoretically
   achievable end-to-end delay of a transmission system that uses the
   codec.

3.  Applications

   The following applications should be considered for Internet audio
   codecs, along with their requirements:

   o  Point to point calls

   o  Conferencing

   o  Telepresence

   o  Teleoperation

   o  In-game voice chat

   o  Live distributed music performances / Internet music lessons

   o  Delay Tolerant Networking or Push-to-Talk Services

   o  Other applications

3.1.  Point to point calls

   Point to point calls are voice over IP (VoIP) calls between two
   "standard" (fixed or mobile) phones, implemented in hardware or
   software. For these applications, a wideband codec is required,
   along with narrowband support for compatibility with legacy telephony
   equipment (PSTN). It is expected for the range of useful bit-rates
   to be 12 - 32 kb/s for wideband speech and 8 - 16 kb/s for narrowband
   speech. The codec delay must be less than 40 ms, and a delay of no
   more than 25 ms is desirable.
   Support for encoding music is not required, but it is desirable for
   the codec not to make background (on-hold) music excessively
   unpleasant to hear. Also, the codec should be robust to noise
   (produce intelligible speech and no annoying artifacts) even at
   lower bit-rates.

3.2.  Conferencing

   Conferencing applications (which support multi-party calls) have
   additional requirements on top of the requirements for point-to-point
   calls. Conferencing systems often have higher-fidelity audio
   equipment and have greater network bandwidth available -- especially
   when video transmission is involved. For that reason, support for
   super-wideband audio becomes important, with useful bit-rates in the
   32 - 64 kb/s range. The ability to vary the bit-rate (VBR) according
   to the "difficulty" of the audio signal is a desirable feature for
   the codec. This not only saves bandwidth "on average", but it can
   also help conference servers make more efficient use of the available
   bandwidth by using more bandwidth for important audio streams and
   less bandwidth for less important ones (e.g. background noise).

   Conferencing end-points often operate in hands-free conditions, which
   creates acoustic echo problems. For this reason, lower delay is
   important, as it reduces the quality degradation due to any residual
   echo after acoustic echo cancellation (AEC). Consequently, the
   codec delay must be less than 30 ms for this application. An
   optional low-delay mode with less than 10 ms delay is desirable, but
   not required.

   Most conferencing systems operate with a bridge that mixes some (or
   all) of the audio streams and sends them back to all the
   participants. In that case, it is important that the codec not
   produce annoying artifacts when two voices are present at the same
   time. Also, this mixing operation should be as easy as possible to
   perform.
   To make it easier to determine which streams have to be
   mixed (and which are noise/silence), it must be possible to measure
   (or estimate) the voice activity in a packet without having to fully
   decode the packet (saving most of the complexity when the packet need
   not be decoded). The ability to save on computational complexity
   when mixing is also desirable, but not required. For example, a
   transform codec may make it possible to mix the streams in the
   transform domain, without having to go back to the time domain. Low-
   complexity up-sampling and down-sampling within the codec is also a
   desirable feature when mixing streams with different sampling rates.

3.3.  Telepresence

   Most telepresence applications can be considered to be essentially
   very high-quality video-conferencing environments, so all of the
   conferencing requirements also apply to telepresence. In addition,
   telepresence applications require super-wideband and full-band audio
   capability with useful bit-rates in the 32 - 80 kb/s range. While
   voice is still the most important signal to be encoded, it must be
   possible to obtain good quality (even if not transparent) music.

   Most telepresence applications require more than one audio channel,
   so support for stereo and multi-channel is important. While this can
   always be accomplished by encoding multiple single-channel streams,
   it is preferable to take advantage of the redundancy that exists
   between channels.

3.4.  Teleoperation and Remote Software Services

   Teleoperation applications are similar to telepresence, with the
   exception that they involve remote physical interactions. For
   example, the user may be controlling a robot while receiving real-
   time audio feedback from that robot. For these applications, the
   delay has to be less than 10 ms.
   The other requirements of telepresence (quality, bit-rate,
   multi-channel) apply to teleoperation as well. The only exception
   is that mixing is not an important issue for teleoperation.

   The requirements for remote software services are similar to those of
   teleoperation. These applications include remote desktop
   applications, remote virtualization, and interactive media
   applications being rendered remotely (e.g. video games rendered on
   central servers). For all these applications, full-band audio with
   an algorithmic delay below 10 ms is important.

3.5.  In-game voice chat

   An increasing number of computer/console games make use of VoIP to
   allow players to communicate in real-time. The requirements for
   gaming are similar to those of conferencing, with the main difference
   being that narrowband compatibility is not necessary. While for most
   applications a codec delay up to 30 ms is acceptable, a low-delay (<
   10 ms) option is highly desirable, especially for games with rapid
   interactions. The ability to use VBR (with a maximum allowed
   bitrate) is also highly desirable because it can significantly reduce
   the bandwidth requirement for a game server.

3.6.  Live distributed music performances / Internet music lessons

   Live music over the Internet requires extremely low end-to-end delay
   and is one of the most demanding applications for interactive audio
   transmission. It has been observed that for most scenarios, total
   end-to-end delays up to 25 ms could be tolerated by musicians, with
   the absolute limit (where none of the scenarios are possible) being
   around 50 ms [carot09]. In order to achieve this low delay on the
   Internet -- either in the same city or a nearby city -- the network
   propagation time must be taken into account.
   When also subtracting the delay of the audio buffer, jitter buffer,
   and acoustic path, that leaves around 2 ms to 10 ms for the total
   delay of the codec. Considering the speed of light in fiber, every
   1 ms reduction in the codec delay increases the range over which
   synchronization is possible by approximately 200 km.

   Acoustic echo is expected to be an even more important issue for
   network music than it is in conferencing, especially considering that
   the music quality requirements essentially forbid the use of a
   "nonlinear processor" (NLP) with the AEC. This is another reason why
   very low delay is essential.

   Considering that the application is music, the full audio bandwidth
   (44.1 or 48 kHz sampling rate) must be transmitted with a bit-rate
   that is sufficient to provide near-transparent to transparent
   quality. With the current audio coding technology, this corresponds
   to approximately 64 kb/s to 128 kb/s per channel. As for
   telepresence, support for two or more channels is often desired, so
   it would be useful for a codec to be able to take advantage of the
   redundancy that is often present between audio channels.

3.7.  Delay Tolerant Networking or Push-to-Talk Services

   Internet transmissions are subjected to interruptions of connectivity
   that severely disturb a phone call. This may happen in cases of
   route changes, handovers, slow fading, or device failures. To
   overcome this distortion, the phone call can be halted and resumed
   after the connectivity has been reestablished.

   Also, if transmission capacity is lower than the minimal coding rate,
   switching to a push-to-talk mode still allows for effective
   communication. In that situation, voice is transmitted at slower-
   than-real-time bitrate and conversations are interrupted until the
   speech has been transmitted.
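
   As a rough illustration of the slower-than-real-time case described
   above, the following Python sketch estimates how long a talk spurt
   takes to deliver when the link capacity is below the codec bit-rate.
   The function name and the example figures are hypothetical. The model
   assumes that transmission starts as soon as the talker starts
   speaking, and it ignores IP/UDP/RTP overhead and propagation delay.

```python
def push_to_talk_timing(speech_s, codec_bps, link_bps):
    """Estimate delivery time for one talk spurt (hypothetical helper).

    speech_s  -- duration of the recorded speech, in seconds
    codec_bps -- codec bit-rate in b/s (payload only, no header overhead)
    link_bps  -- available link capacity in b/s

    Returns (transfer_s, extra_wait_s):
      transfer_s   -- time needed to push all coded bits through the link
      extra_wait_s -- time the listener still waits after the talker has
                      finished, assuming sending began when speech began
    """
    if speech_s < 0 or codec_bps <= 0 or link_bps <= 0:
        raise ValueError("duration must be >= 0 and rates must be > 0")
    payload_bits = speech_s * codec_bps
    transfer_s = payload_bits / link_bps
    # If the link is faster than the codec, delivery keeps up with speech.
    extra_wait_s = max(0.0, transfer_s - speech_s)
    return transfer_s, extra_wait_s


if __name__ == "__main__":
    # Assumed example: 10 s of speech at 8 kb/s over a 4 kb/s link.
    transfer, wait = push_to_talk_timing(10, 8000, 4000)
    print(f"transfer: {transfer:.1f} s, extra wait: {wait:.1f} s")
```

   With these example numbers, 10 s of speech coded at 8 kb/s over a
   4 kb/s link takes 20 s to deliver, so the listener waits about 10 s
   after the talker has finished.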

   These modes require interrupting the audio playout and continuing
   after a pause of arbitrary duration.

3.8.  Other applications

   The above list is by no means a complete list of all applications
   involving interactive audio transmission on the Internet. However,
   it is believed that meeting the needs of all these different
   applications should be sufficient to ensure that the needs of most
   applications not listed will also be met.

4.  Constraints Imposed by the Internet on the Codec

   Packet losses are inevitable on the Internet and dealing with those
   is one of the most fundamental requirements for an Internet audio
   codec. While any audio codec can be combined with a good packet loss
   concealment (PLC) algorithm, the important aspect is what happens on
   the first packets received _after_ the loss. More specifically, this
   means that:

   o  it should be possible to interpret the contents of any received
      packet, irrespective of previous losses as specified in BCP 36
      [PAYLOADS]; and

   o  the decoder should re-synchronize as quickly as possible (i.e. the
      output should quickly converge to the output that would have been
      obtained if no loss had occurred).

   The constraint of being able to decode any packet implies the
   following considerations for an audio codec:

   o  The size of a compressed frame must be kept smaller than the MTU
      to avoid fragmentation;

   o  The interpretation of any parameter encoded in the bit-stream must
      not depend on information contained in other packets. For
      example, it is not acceptable for a codec to allow signaling a
      mode change in one packet and assume that subsequent frames will
      be decoded according to that mode.

   Although the interpretation of parameters cannot depend on other
   packets, it is still reasonable to use some amount of prediction
   across frames, provided that the predictors can resynchronize quickly
   in case of a lost packet.
   In this case, it is important to use the
   best compromise between the gain in coding efficiency and the loss in
   packet loss robustness due to the use of inter-frame prediction. It
   is a desirable property for the codec to allow some real-time control
   of that trade-off so that it can take advantage of more prediction
   when the loss rate is small, while being more robust to losses when
   the loss rate is high.

   To improve the robustness to packet loss, it would be desirable for
   the codec to allow an adaptive (data- and network-dependent) amount
   of side information to help improve audio quality when losses occur.
   For example, this side information may include the retransmission of
   certain parameters encoded in the previous frame(s).

   To ensure freedom of implementation, decoder-side only error
   concealment does not need to be specified, although a functional PLC
   algorithm is desirable as part of the codec reference implementation.
   Obviously, any information signaled in the bitstream intended to aid
   PLC needs to be specified.

   Another important property of the Internet is that it is mostly a
   best-effort network, with no guaranteed bandwidth. This means that
   the codec has to be able to vary its output bit-rate dynamically (in
   real-time), without requiring an out-of-band signaling mechanism, and
   without causing audible artifacts at the bit-rate change boundaries.
   Additional desirable features are:

   o  Having the possibility to use smooth bit-rate changes with one
      byte/frame resolution;

   o  Making it possible for a codec to adapt its bit-rate based on the
      source signal being encoded (source-controlled VBR) to maximize
      the quality for a certain _average_ bit-rate.

   Because the Internet transmits data in bytes, a codec should produce
   compressed data in integer numbers of bytes.
   In general, the codec
   design should take into consideration explicit congestion
   notification (ECN) and may include features that would improve the
   quality of an ECN implementation.

   The IETF has defined a set of application-layer protocols to be used
   for real-time transport of multimedia data, including voice. It is
   thus important for the resulting codec to be easy to
   use with these protocols. For example, it must be possible to create
   an [RTP] payload format that conforms to BCP 36 [PAYLOADS]. If any
   codec parameters need to be negotiated between end-points, the
   negotiation should be as easy as possible to carry over SIP
   [RFC3261]/SDP [RFC4566] or alternatively over XMPP [RFC6120]/Jingle
   [XEP-0167].

5.  Detailed Basic Requirements

   This section summarizes all the constraints imposed by the target
   applications and by the Internet into a set of actual requirements
   for codec development.

5.1.  Operating space

   The operating space for the target applications can be divided in
   terms of delay: most applications require a "medium delay" (20-30
   ms), while a few require a "very low delay" (< 10 ms). It makes
   sense to divide the space based on delay because lowering the delay
   has a cost in terms of quality vs bit-rate.

   For medium delay, the resulting codec must be able to efficiently
   operate within the following range of bit-rates (per channel):

   o  Narrowband: 8 kb/s to 16 kb/s

   o  Wideband: 12 to 32 kb/s

   o  Super-wideband: 24 to 64 kb/s

   o  Full-band: 32 to 80 kb/s

   Obviously, a lower-delay codec that can operate in the above range is
   also acceptable.

   For very low delay, the resulting codec will need to operate within
   the following range of bit-rates (per channel):

   o  Super-wideband: 32 to 80 kb/s

   o  Full-band: 48 to 128 kb/s

   o  (Narrowband and wideband not required)

5.2.
Quality and bit-rate

   The quality of a codec is directly linked to the bit-rate, so these
   two must be considered jointly. When comparing the bit-rate of
   codecs, the overhead of IP/UDP/RTP headers should not be considered,
   but any additional bits required in the RTP payload format after the
   header (e.g. required signalling) should be considered. In terms of
   quality vs bit-rate, the codec to be developed must be better than
   the following codecs, which are generally considered royalty-free:

   o  For narrowband: Speex (NB) [Speex], and iLBC(*) [RFC3951]

   o  For wideband: Speex (WB) [Speex], G.722.1(*) [ITU.G722.1]

   o  For super-wideband/fullband: G.722.1C(*) [ITU.G722.1]

   The codecs marked with (*) have additional licensing restrictions,
   but the codec to be developed should still not perform significantly
   worse. In addition to the quality targets listed above, a desirable
   objective is for the codec quality to be no worse than AMR-NB and
   AMR-WB, for narrowband and wideband, respectively. Quality should be
   measured for multiple languages, including tonal languages. The case
   of multiple simultaneous voices (as sometimes happens in
   conferencing) should be evaluated as well.

   The comparison with the above codecs assumes that the codecs being
   compared have similar delay characteristics. The bit-rate required
   for a certain level of quality may be higher than the referenced
   codecs in cases where a much lower delay is required. In that case,
   the increase in bit-rate must be less than the ratio between the
   delays.

   It is desirable for the codecs to support source-controlled variable
   bit-rate (VBR) to take advantage of the fact that different inputs
   require a different bitrate to achieve the same quality. However, it
   should still be possible to use the codec at truly constant bit-rate
   to ensure that no information leak is possible when using an
   encrypted channel.
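
   The delay trade-off stated earlier in this section can be read as a
   simple bound: if a candidate codec has a much lower delay than a
   reference codec, its bit-rate at equal quality must grow by less than
   the ratio of the two delays. The Python sketch below shows one
   possible reading of that rule. The function names and the example
   figures are hypothetical and are not taken from any codec.

```python
def max_bitrate_for_lower_delay(ref_bitrate_bps, ref_delay_ms, new_delay_ms):
    """Upper bound on bit-rate for a lower-delay codec (hypothetical helper).

    Interprets "the increase in bit-rate must be less than the ratio
    between the delays" as: new_bitrate / ref_bitrate < ref_delay / new_delay.
    """
    if ref_bitrate_bps <= 0 or ref_delay_ms <= 0 or new_delay_ms <= 0:
        raise ValueError("bit-rate and delays must be positive")
    if new_delay_ms >= ref_delay_ms:
        raise ValueError("the rule only applies when the new delay is lower")
    return ref_bitrate_bps * (ref_delay_ms / new_delay_ms)


def satisfies_delay_tradeoff(candidate_bps, ref_bps, ref_delay_ms, new_delay_ms):
    """True if the candidate bit-rate stays strictly below the bound."""
    bound = max_bitrate_for_lower_delay(ref_bps, ref_delay_ms, new_delay_ms)
    return candidate_bps < bound


if __name__ == "__main__":
    # Assumed example: reference codec at 32 kb/s with 20 ms delay,
    # candidate codec with 5 ms delay (delay ratio of 4).
    print(max_bitrate_for_lower_delay(32000, 20, 5))
    print(satisfies_delay_tradeoff(64000, 32000, 20, 5))
```

   In this example the delay ratio is 4, so a 5 ms candidate that needs
   64 kb/s to match a 32 kb/s, 20 ms reference stays within the bound,
   while one that needs 128 kb/s or more does not.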

5.3.  Packet loss robustness

   Robustness to packet loss is a very important aspect of any codec to
   be used on the Internet. Codecs must maintain acceptable quality at
   loss rates up to 5% and maintain good intelligibility up to 15% loss
   rate. At any sampling rate, bit-rate, and packet loss rate, the
   quality must be no less than the quality obtained with the Speex
   codec or the GSM-FR codec in the same conditions. The actual packet
   loss "patterns" to be used in testing must be obtained from real
   packet loss traces collected on the Internet, rather than from loss
   models. These traces should be representative of the typical
   environments in which the applications of Section 3 operate. For
   example, traces related to VoIP calls should consider the loss
   patterns observed for typical home broadband and corporate
   connections.

5.4.  Computational resources

   The resulting codec should be implementable on a wide range of
   devices, so there should be a fixed-point implementation or at least
   assurance that a reasonable fixed-point implementation is possible.
   The computational resource figures listed below are meant to be upper
   bounds. Even below these bounds, resources should still be
   minimized. Any proposed increase in computational resources
   consumption (e.g. to increase quality) should be carefully evaluated
   even if the resulting resource consumption is below the upper bound.
   Having variable complexity would be useful (but not required) in
   achieving that goal as it would allow trading quality/bit-rate for
   lower complexity.

   The computational requirements for real-time encoding and decoding of
   a mono signal on one core of a recent x86 CPU (as measured with the
   unix "time" utility or equivalent) are as follows:

   o  Narrowband: 40 MHz (2% of a 2 GHz CPU core)

   o  Wideband: 80 MHz (4% of a 2 GHz CPU core)

   o  Superwideband/fullband: 200 MHz (10% of a 2 GHz CPU core)

   It is a desirable objective that the MHz values listed above also be
   achievable on fixed-point digital signal processors that are capable
   of single-cycle multiply-accumulate operations (16x16 multiplication
   accumulated into 32 bits).

   For applications that require mixing (e.g. conferencing), it should
   be possible to estimate the energy and/or the voice activity status
   of the decoded signal with less than 10% of the complexity figures
   listed above.

   It is the intent to maximize the range of devices on which a codec
   can be implemented. For this reason, the reference implementation
   must not depend on special hardware features or instructions to be
   present in order to meet the complexity requirement. However, it may
   be desirable to take advantage of such hardware when available
   (e.g., hardware accelerators for operations like fast Fourier
   transforms and convolutions). A codec should also minimize the use
   of saturating arithmetic so as to be implementable on architectures
   that do not provide hardware saturation (e.g. ARMv4).

   The combined codec size and data ROM should be small enough not to
   cause significant implementation problems on typical embedded
   devices. The codec context/state size required should be no more
   than 2*R*C bytes in floating-point, where R is the sampling rate and
   C is the number of channels. For fixed-point, that size should be
   less than R*C bytes. The scratch space required should also be less
   than 2*R*C bytes for floating point or less than R*C bytes for fixed-
   point.

6.
Additional considerations

   There are additional features or characteristics that may be
   desirable under some circumstances, but should not be part of the
   strict requirements. The benefit of meeting these considerations
   should be weighed against the associated cost.

6.1.  Low-complexity audio mixing

   In many applications that require a mixing server (e.g. conferencing,
   games), it is important to minimize the computational cost of the
   mixing. As much as possible, it should be possible to perform the
   mixing with fewer computations than it would take to decode all the
   streams, mix them, and re-encode the result. Properties that reduce
   the complexity of the mixing process include:

   o  the ability to derive sufficient parameters, such as loudness
      and/or spectral envelope, for estimating voice activity of a
      compressed frame without fully decoding that frame;

   o  the ability to mix the streams in an intermediate representation
      (e.g. transform domain), rather than having to fully decode the
      signals before the mixing;

   o  the use of bit-stream layers (Section 6.3) by aggregating a small
      number of active streams at lower quality.

   For conferencing applications, the total complexity of the decoding,
   VAD and mixing should be considered when evaluating proposals.

6.2.  Encoder side potential for improvement

   In many codecs, it is possible to improve the quality by improving
   the encoder without breaking compatibility (i.e. without changing the
   decoder). Potential for improvement varies from one codec to
   another. It is generally low for PCM or ADPCM codecs and higher for
   perceptual transform codecs. All things being equal, being able to
   improve a codec after the bit-stream is frozen is a desirable
   property. However, this should not be done at the expense of quality
   in the reference encoder.
   Other potential improvements include signal-
   adaptive frame size selection and improved discontinuous transmission
   (DTX) algorithms that take advantage of predicting the decoder side's
   packet loss concealment (PLC) algorithms.

6.3.  Layered bit-stream

   A layered codec makes it possible to transmit only a certain subset
   of the bits and still obtain a valid bit-stream with a quality that
   is equivalent to the quality that would be obtained from encoding at
   the corresponding rate. While this is not a necessary feature for
   most applications, it can be desirable for cases where a "mixing
   server" needs to handle a large number of streams with limited
   computational resources.

6.4.  Partial redundancy

   One possible way of increasing robustness to packet loss is to
   include partial redundancy within packets. This can be achieved
   either by including the base layer of the previous frame (for a
   layered codec) or by transmitting other parameters from the previous
   frame(s) to assist the PLC algorithm in case of loss. The ability to
   include partial redundancy for high-loss scenarios is desirable,
   provided that the feature can be dynamically turned on or off (so
   that no bandwidth is wasted in case of loss-free transmission).

6.5.  Stereo support

   It is highly desirable for the codec to have stereo support. At a
   minimum, the codec should be able to encode two channels
   independently without causing significant stereo image artifacts. It
   is also desirable for the codec to take advantage of the inter-
   channel redundancy in stereo audio to reduce the bitrate (for an
   equivalent quality) of stereo audio compared to coding channels
   independently.

6.6.  Bit error robustness

   The vast majority of Internet-based applications do not need to be
   robust to bit errors because packets either arrive unaltered, or do
   not arrive at all.
   Considering that, the emphasis should be on
   packet loss robustness and packet loss concealment. That being said,
   it is often the case that extra robustness to bit errors can be
   achieved at no cost at all (i.e. no increase in size, complexity or
   bit-rate, no decrease in quality or packet loss robustness, ...). In
   those cases, it is useful to make a change that increases the
   robustness to bit errors. This can be useful for applications that
   use UDP Lite transmission (e.g. over a wireless LAN). Robustness to
   packet loss should *never* be sacrificed to achieve higher bit error
   robustness.

6.7.  Time stretching and shortening

   When adaptive jitter buffers are used, it is often necessary to
   stretch or shorten the audio signal to allow changes in buffering.
   While this operation can be performed directly on the decoder's
   output, it is often more computationally efficient to stretch or
   shorten the signal directly within the decoder. It is desirable for
   the reference implementation to provide a time stretching/shortening
   implementation, although it should not be normative.

6.8.  Input robustness

   The systems providing input to the encoder and receiving output from
   the decoder may be far from ideal in actual use. Input and output
   audio streams may be corrupted by compounding non-linear artifacts
   from analog hardware and digital processing. The codecs to be
   developed should be tested to ensure that they degrade gracefully
   under adverse audio input conditions. Types of digital corruption
   that may be tested include tandeming, transcoding, low-quality
   resampling, and digital clipping. Types of analog corruption that
   may be tested include microphones with substantial background noise,
   analog clipping, and loudspeaker distortion. No specific end-to-end
   quality requirements are mandated for use with the proposed codec.
   It is advisable, however, that several typical in-situ environments/
   processing chains be specified for the purpose of benchmarking
   end-to-end quality with the proposed codec.

6.9. Support of Audio forensics

   Emergency calls can be analyzed using audio forensics if the context
   and situation of the caller have to be identified. Thus, it is
   important to transmit not only the caller's voice well but also
   the background noise at high quality. In these situations, sounds or
   noises of low volume should also not be compressed or dropped. For
   this reason, the encoder must allow DTX to be disabled when required
   (e.g., for emergency calls).

6.10. Legacy compatibility

   In order to create the best possible codec for the Internet, there is
   no requirement for compatibility with legacy Internet codecs.

7. Security Considerations

   Although this document itself does not have security considerations,
   this section describes the security requirements for the codec.

   As with any protocol to be used over the Internet, security is a
   very important aspect to consider. This goes beyond the obvious
   considerations of preventing buffer overflows and similar attacks
   that can lead to denial-of-service (DoS) or remote code execution.
   One very important security aspect is to make sure that the decoders
   have a bounded and reasonable worst-case complexity. This prevents
   an attacker from causing a DoS by sending packets that are specially
   crafted to take a very long (or infinite) time to decode.

   A more subtle aspect is the information leak that can occur when the
   codec is used over an encrypted channel (e.g., [SRTP]). For example,
   it has been suggested [wright08] [white11] that use of
   source-controlled VBR may reveal some information about a
   conversation through the size of the compressed packets.
   For that reason, it should be possible to use the codec at a truly
   constant bit-rate if needed.

8. IANA Considerations

   This document has no actions for IANA.

9. Acknowledgments

   The original authors of this document are: Jean-Marc Valin, Slava
   Borilin, Koen Vos, Christopher Montgomery and Raymond (Juin-Hwey)
   Chen. We would like to thank all the other people who contributed
   directly or indirectly to this document, including Jason Fischl,
   Gregory Maxwell, Alan Duric, Jonathan Christensen, Julian Spittka,
   Michael Knappe, Christian Hoene, and Henry Sinnreich. We would also
   like to thank Cullen Jennings and Gregory Lebovitz for their advice.

10. Informative References

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
              Description Protocol", RFC 4566, July 2006.

   [RFC6120]  Saint-Andre, P., "Extensible Messaging and Presence
              Protocol (XMPP): Core", RFC 6120, March 2011.

   [XEP-0167]
              Ludwig, S., Saint-Andre, P., Egan, S., McQueen, R., and D.
              Cionoiu, "Jingle RTP Sessions", XSF XEP 0167,
              December 2009.

   [RFC3951]  Andersen, S., Duric, A., Astrom, H., Hagen, R., Kleijn,
              W., and J. Linden, "Internet Low Bit Rate Codec (iLBC)",
              RFC 3951, December 2004.

   [ITU.G722.1]
              International Telecommunications Union, "Low-complexity
              coding at 24 and 32 kbit/s for hands-free operation in
              systems with low frame loss", ITU-T Recommendation
              G.722.1, May 2005.

   [Speex]    Xiph.Org Foundation, "Speex: http://www.speex.org/", 2003.

   [carot09]  Carot, A., Werner, C., and T. Fischinger, "Towards a
              Comprehensive Cognitive Analysis of Delay-Influenced
              Rhythmical Interaction: http://www.carot.de/icmc2009.pdf",
              2009.

   [PAYLOADS]
              Handley, M. and C.
              Perkins, "Guidelines for Writers of RTP Payload Format
              Specifications", RFC 2736, BCP 36, December 1999.

   [RTP]      Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", RFC 3550, July 2003.

   [SRTP]     Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, March 2004.

   [wright08]
              Wright, C., Ballard, L., Coull, S., Monrose, F., and G.
              Masson, "Spot me if you can: Uncovering spoken phrases in
              encrypted VoIP conversations:
              http://www.cs.jhu.edu/~cwright/oakland08.pdf", 2008.

   [white11]  White, A., Matthews, A., Snow, K., and F. Monrose,
              "Phonotactic Reconstruction of Encrypted VoIP
              Conversations: Hookt on fon-iks:
              http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf",
              2011.

Authors' Addresses

   Jean-Marc Valin
   Mozilla
   650 Castro Street
   Mountain View, CA 94041
   USA

   Email: jmvalin@jmvalin.ca

   Koen Vos
   Skype Technologies S.A.
   Stadsgarden 6
   Stockholm, 11645
   Sweden

   Email: koen.vos@skype.net