Network Working Group                                      V. Sviridenko
Internet-Draft                                                 S. Ikonin
Intended status: Standards Track                                D. Yudin
Expires: February 09, 2012                                    SPIRIT DSP
                                                         August 09, 2011

                           IPMR Speech Codec
                      draft-spiritdsp-ipmr-01.txt

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with
the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on February 09, 2012.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document.

Abstract

This document describes IPMR, a scalable variable adaptive
multi-rate speech and audio codec designed for use in IP-based
networks. This codec is suitable for real-time communications such
as telephony and voice and video conferencing. Three sampling
frequencies (8, 16 and 32 kHz) are supported for encoding the audio
input signal. Adaptation to network characteristics is provided
through control of bitrate, packet rate, packet loss resilience and
use of discontinuous transmission (DTX).
IP-MR supports different profiles for input signal content, which
should be specified during codec initialization. It can be in Speech,
Audio or Auto-detection mode.
In Auto-detection mode the codec recognizes the
type of input content and switches to the appropriate Speech
or Audio mode automatically.

Table of Contents

1. Introduction
2. Technical Requirements
   2.1. Voice/Audio Quality
   2.2. Sampling Rate
   2.3. Adaptive Multi Rate
   2.4. Bitrate Scalability
   2.5. Packet Loss Resilience
   2.6. Delay
   2.7. DTX
3. IP-MR Codec Description
4. Algorithm Overview
   4.1. Coding profiles
   4.2. Mixed CELP/MDCT codec
   4.3. Scalable CELP-based encoder
   4.4. Scalable CELP-based decoder
   4.5. Scalable MDCT-based encoder
   4.6. Scalable MDCT-based decoder
5. Security Considerations
6. Informative References
7. IANA Considerations
Authors' Addresses

1. Introduction

To ensure high-quality audio transmission over IP, a codec has to
overcome a set of problems and obstacles.
The best codec should be able to work
at a wide range of bitrates with relatively small delay, should deliver
high-quality speech even in case of packet losses and a poor network
connection, and should be able to provide wideband quality (which is a
must for today's business-level communication) and ultra-wideband
quality for next-generation applications. This document describes the
IP-MR codec, a scalable variable adaptive multi-rate speech and audio
codec designed for use in IP-based networks.

2. Technical Requirements
We agree with some of the technical requirements described in [SILK]
and include them in this section. The Internet Wideband Speech/Audio
Codec must be optimized towards real-time communications over the
Internet, and must have the flexibility to adjust to the environment it
operates in. Below is a list of the main requirements for the codec.

2.1. Voice/Audio Quality
The codec should provide a quality/bitrate trade-off that is
competitive with other state-of-the-art codecs. At low bitrates it
should deliver good speech quality in any language. At high bitrates
the quality should be excellent for any audio signal, including music,
under standard conditions.

2.2. Sampling Rate
Audio bandwidth is determined by the codec sampling frequency: 8 kHz
for narrowband voice (PSTN) and 16 kHz for wideband. Wideband speech
sounds much more natural and comfortable, and wideband codecs are more
convenient to use in IP communication. However, sometimes there is not
enough bandwidth to allow a 16 kHz sampling frequency, and the codec
must be able to switch to 8 kHz. Moreover, the codec should support
ultra-wideband (20 kHz and more) for next-generation high-end quality.

2.3. Adaptive Multi Rate
The codec should have a set of bitrates with sufficient granularity to
fit channels of different capacities. The bitrates should be
adjustable in real-time.
The codec should be capable of running at
bitrates starting from 6 kbps.

2.4. Bitrate Scalability
The codec should have a bitrate scalability feature (an embedded or
layered bitstream structure) so that voice traffic can be reduced
during transmission without re-encoding. This is necessary for dynamic
congestion control, multicast and conferencing applications. On the
other hand, the price of scalability is lower compression efficiency
and higher computational complexity at the same bitrate. It is
therefore desirable that the scalability feature can be switched off
when it is not needed.

2.5. Packet Loss Resilience
The codec should be capable of running with little error propagation,
meaning that the decoded signal after one or more packet losses is
close to the decoded signal without packet losses after no more than
two additional packets. The codec should have a packet loss resilience
that is adjustable in real-time, where a lower packet loss resilience
setting improves the quality/bitrate trade-off.

2.6. Delay
For comfortable conversation the codec must have an algorithmic delay
of no more than 50 ms.

2.7. DTX
The codec should be capable of using Discontinuous Transmission (DTX),
where packets are sent at a reduced rate when the input signal contains
only background noise.

3. IP-MR Codec Description
The IP-MR codec is a scalable variable adaptive multi-rate speech and
audio codec designed for use in IP-based networks. This codec is
suitable for real-time communications such as telephony and voice and
video conferencing.

Sampling rate
IP-MR supports three sampling rate modes: 8, 16 and 32 kHz.

Speech/Audio modes
IP-MR supports different profiles for input signal content, which
should be specified during codec initialization. It can be in Speech,
Audio or Auto-detection mode.
In Auto-detection mode the codec recognizes the type
of input content and switches to the appropriate Speech or
Audio mode automatically.

Voice Quality
The Mean Opinion Score (MOS) of the codec's speech quality is about
3.7-4.4 (for clean speech), depending on the current mode and the
average bit rate. At higher bitrates the codec achieves FM quality on
generic audio content.

Algorithmic delay
The frame length is 20 ms. Algorithmic delay varies from 35 to 50 ms
depending on the coding profile.

Adaptive Multi Rate
Depending on the sampling rate, IP-MR has 8 or 10 bitrate modes between
6 and 120 kbps, which can be changed in real time in accordance with
the current network conditions.

+--------------------------------------------------------------------+
|Sampling |   Coding    | Frame |Algorith.| Number | Avg. Bit Rates  |
|  Rate   |   profile   | size  |  Delay  |of Rates|for active speech|
+--------------------------------------------------------------------+
|         |   Speech/   |       |         |        |                 |
|         |    Auto-    |       |         |        |                 |
|         |  detection  |       |  35 ms  |        |                 |
|         |    with     |       |         |        |                 |
|         |    short    |  20   |         |        |                 |
|         |    delay    |       |         |        |                 |
|  8 kHz  |-------------|       |---------|   8    |   6 - 50 kbps   |
|         |   Audio/    |  ms   |         |        |                 |
|         |    Auto-    |       |  50 ms  |        |                 |
|         |  detection  |       |         |        |                 |
|         |    with     |       |         |        |                 |
|         | long delay  |       |         |        |                 |
|--------------------------------------------------------------------|
|         |   Speech/   |       |         |        |                 |
|         |    Auto-    |       |         |        |                 |
|         |  detection  |       | 36.875  |        |                 |
|         |    with     |       |   ms    |        |                 |
|         | short delay |  20   |         |        |                 |
| 16 kHz  |-------------|       |---------|   10   |   6 - 70 kbps   |
|         |   Audio/    |  ms   |         |        |                 |
|         |    Auto-    |       |  50 ms  |        |                 |
|         |  detection  |       |         |        |                 |
|         |  with long  |       |         |        |                 |
|         |    delay    |       |         |        |                 |
|--------------------------------------------------------------------|
|         |   Speech/   |       |         |        |                 |
|         |    Auto-    |       |         |        |                 |
|         |  detection  |       | 37.8125 |        |                 |
|         |    with     |       |   ms    |        |                 |
|         | short delay |  20   |         |        |                 |
| 32 kHz  |-------------|       |---------|   10   |  6 - 120 kbps   |
|         |   Audio/    |  ms   |         |        |                 |
|         |    Auto-    |       |  50 ms  |        |                 |
|         |  detection  |       |         |        |                 |
|         |  with long  |       |         |        |                 |
|         |    delay    |       |         |        |                 |
+--------------------------------------------------------------------+

Variable Bit Rate
The encoder's bit rate varies constantly in accordance with the actual
speech content (voiced/unvoiced, pauses, stationary/non-stationary
voiced, etc.). Because the encoding adapts to the actual
characteristics of the speech, the IP-MR codec reduces traffic while
keeping coding efficiency. All average bitrates are specified for
active speech without consideration of inter-speech (silence) regions.

Bitrate Scalability

The coded frame has a layered (embedded) structure. It consists of
multiple coding layers: a base (or core) layer and several enhancement
layers, which are coded independently. Only the core layer is mandatory
to decode understandable speech; the upper layers provide quality
enhancement. The enhancement layers may be omitted, and the remaining
base layer can still be meaningfully decoded without notable artifacts.
This makes the bit stream scalable and allows the bit rate to be
reduced during transmission without re-encoding.

Bitrate scalability provides additional possibilities for congestion
control. An intermediate network node may modify the IP-MR codec's
payload by dropping some of the layers during transmission to meet the
available bandwidth. If the payload is forwarded with modified
content, at least the base layer must be preserved. Delivering the
base layer to the receiving side guarantees meaningful speech decoding
without invoking the packet loss concealment procedure.
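
The layer-dropping behaviour described above can be illustrated with a
minimal sketch. The frame representation (a list of byte strings with
the core layer first) and the function name trim_frame are assumptions
made for illustration only. This document does not define a payload
format, and a real node would parse the actual IP-MR payload instead.

```python
from typing import List


def trim_frame(layers: List[bytes], max_bytes: int) -> List[bytes]:
    """Drop enhancement layers from the top until the frame fits.

    Hypothetical layout: one bytes object per layer, core layer first.
    The core layer (layers[0]) is always kept, because at least the
    base layer must be preserved when a payload is modified.
    """
    if not layers:
        raise ValueError("a frame must contain at least the core layer")
    kept = [layers[0]]
    used = len(layers[0])
    for layer in layers[1:]:
        # Assumption: layers are only dropped from the top, so stop at
        # the first enhancement layer that does not fit the budget.
        if used + len(layer) > max_bytes:
            break
        kept.append(layer)
        used += len(layer)
    return kept


# Illustrative sizes only: a 15-byte core plus three 10-byte layers.
frame = [b"C" * 15, b"1" * 10, b"2" * 10, b"3" * 10]
trimmed = trim_frame(frame, 30)
```

With a 30-byte budget the sketch keeps the core and the first
enhancement layer. With a budget smaller than the core it still
forwards the core, leaving any further action to the caller.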

--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

  <---- p(n-1) ---->
           <----- p(n) ----->
                    <---- p(n+1) ---->
                             <---- p(n+2) ---->
                                      <---- p(n+3) ---->
                                               <---- p(n+4) ---->

But because of the scalable nature of the IP-MR codec there is no need
to duplicate the whole previous frame; only the core layer needs to be
retransmitted. This reduces redundancy overhead while keeping
efficiency.

Moreover, the speech bits encoded in the core layer are divided into
six classes (from A to F) according to their perceptual sensitivity to
errors. Class A contains the most perceptually significant bits. The
bits of this class must be delivered to the decoder to fully prevent
error propagation. Class F contains the least significant bits.
Together, classes A to F contain all encoded parameters of the first
(core) encoding layer. These parameters are sufficient to synthesize
speech with near "toll quality".

Using these classes to choose which bits are sent redundantly makes it
possible to smoothly adjust the trade-off between overhead and
robustness against packet loss.

DTX
The IP-MR codec supports a Discontinuous Transmission mode for silence
compression. During silence intervals the codec bitrate can be reduced
to 0.3 kbps.

4. Algorithm Overview

4.1. Coding profiles
IP-MR supports different profiles for the type of input signal content:
Speech, Audio or Auto-detection mode. In Auto-detection mode the
codec recognizes the type of input content and switches to the
appropriate Speech or Audio mode automatically. At a high level the
encoder consists of three basic modules (see Figure 1).

-Speech/Music detector - automatically classifies the input
content as speech or music to select the appropriate coding model.
-CELP-based speech coder - implements a source-filter model and is
oriented towards speech content.
-MDCT-based audio coder - for general audio coding purposes.

               +-------------------+
               |Predefined Speech/ |
               |       Audio       |
               |      Profile      |
               +----------+--------+
                          |
                         \|/
               +----------+-------+
input signal   |     Speech/      |
---------------+  Music detector  |
               +---+---------+----+
                  S|        M|
                  p|        u|
                  e|        s|
                  e|        i|
                  c|        c|
                  h|         |
                   |         |
    +..............|.........|..........+
    .             \|/       \|/ coder   .
    . +------------+--+   +--+-----+    .
    . |   CELP/MDCT   |   |  MDCT  |    .
    . +--------+------+   +----+---+    .
    +..........|...............|........+
               |               |
              \|/             \|/
        +------+---------------+--+
        |        Bitstream        +--->
        +-------------------------+

Figure 1 High level encoder structure

Depending on the type of input signal (speech/music), different coding
models are used. The type of input signal can be detected automatically
in Auto-detection mode or specified as a predefined setting during
codec initialization. Speech content is coded by the mixed CELP/MDCT
model. General audio content is coded by the pure MDCT-based model.

The decoder performs the reverse operations. First, the compressed
frame goes to the CELP decoder, which extracts the core and extension
layers. Then both the rest of the bitstream and the reconstructed
signal go to the MDCT decoder, which restores the residue and generates
the joint output.

              +----------+  Rest of compressed    +--------+
 Compressed   |          |        data            |        |
   frame      |   CELP   +----------------------->+  MDCT  |
------------->+          |     Reconstructed      |        |
              | decoder  |        signal          |decoder +--OUTPUT->
              |          +----------------------->+        |
              +----------+                        +--------+

Figure 2 High level decoder structure

In fact, CELP and MDCT are two different decoders and can therefore
work simultaneously.
Parallel processing requires only that two modules
be moved out of the decoder structure (see Figure 2): bitstream
demultiplexing and signal mixing.

                          +---------+
                          |  CELP   |      +---------+
                       +->+ decoder +----->+         |
 Compressed           /   +---------+      |  MDCT   |
   frame      +-------+                    |         +--Output-->
------------->| DEMUX |                    | decoder |
              +-+---+-+   +---------+      |         |
                      \   |  MDCT   +----->+         |
                       +->+ decoder |      +---------+
                          +---------+

Figure 2 High level decoder structure (parallel)

Note that demultiplexing is simple to implement because the size of
the CELP portion of the stream can be calculated without decoding.

4.2. Mixed CELP/MDCT codec

The mixed CELP/MDCT codec is composed of two independent codecs, one
CELP-based and one MDCT-based. The first one processes the source
signal and feeds the residue to the second. In order to provide
flexible and transparent coupling between the codecs, corresponding
sampling rate conversion and frame synchronization procedures are
applied.

The resulting bitstream naturally consists of two contiguous regions
belonging to the CELP and MDCT codecs respectively. The CELP codec
bitstream has a layered structure (core + extensions), while the
MDCT codec generates a byte-scalable stream.

The next figure provides an example of encoding 16 kHz source material
when the CELP-based encoder operates at an 8 kHz sampling rate.

                                                   Core layer
                  +------------+   +------------+    params
-Input speech-+-->| Downsample +-->|  Scalable  +--------------+
  FS=16 kHz   |   |  to 8 kHz  |   | CELP-based |              |
              |   +------------+   |  Encoder   +---+          |
              |                    +--+---------+   |          |
              |                       |             |          |
              |                 Synth Speech        |          |
              |                       |        Enhancement     |
              |                       |          layers        |
              |                       |          params        |
              |                      \|/            |         \|/
              |            +----------+---------+   |   +------+-----+
              |            | Upsample to 16 kHz |   |   | Core layer |
              |            +-----+--------------+   |   +------------+
              |                  |                  |   | Ext.layer 1|
              |                 \|/                 |   +------------+
              +---------------->(-)                 +-->+ Ext.layer 2|
                                 |                      +------------+
                                 |                      | Ext.layer 3|
                                 |                      +------------+
                              Residual                  |            |
                                 |                      |            |
                                \|/                     |  Scalable  |
            +--------------------+--+                   | bitstream  |
            |       Scalable        |   Scalable        |            |
            |  MDCT-based Encoder   +---bitstream------>|            |
            +-----------------------+                   +------------+

Figure 3 Structural block diagram of mixed CELP/MDCT encoder
(16kHz mode)

First, the input signal is down-sampled to 8 kHz and encoded by the
Scalable CELP-based encoder, which packs the quantized parameters into
a layered bitstream. The difference between the up-sampled synthesized
signal and the original source goes to the Scalable MDCT-based encoder,
which forms the rest of the bitstream.

The CELP-based and MDCT-based codecs are described in more detail
below.

4.3. Scalable CELP-based encoder

The scalable CELP-based coder applied to speech coding consists of the
core (base layer) encoder and three enhancement encoders. Figure 4
shows the structure of the core encoder.

The Core Encoder codes speech in a "base frequency bandwidth" (up to
4 kHz) with speech quality near "toll quality" and forms a coded bit
stream at the minimum average bit rate (about 6.0 kbps). The current
bit rate is driven by the information content of the input speech and
can vary in the range from 4.3 kbps up to 10.35 kbps.
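
As a worked example of what these rates mean per frame, the sketch
below converts the core-layer bit rates quoted above into bits per
20 ms frame (the frame length given in Section 3). The helper name
bits_per_frame is illustrative and not part of the codec.

```python
FRAME_MS = 20  # frame length stated in Section 3


def bits_per_frame(rate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Number of bits produced per frame at a given bit rate."""
    return rate_bps * frame_ms // 1000


core_min = bits_per_frame(4300)   # lowest core rate, 4.3 kbps
core_avg = bits_per_frame(6000)   # average core rate, about 6.0 kbps
core_max = bits_per_frame(10350)  # highest core rate, 10.35 kbps
```

Under these figures a core-layer frame carries between 86 and 207
bits, with about 120 bits on average.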

The Core Encoder performs LPC analysis and pitch detection, and
estimates the parameters of the pitch predictor and the excitation by
the "analysis-by-synthesis" method on a "subframe-by-subframe" basis.
The subframe length is 5 ms.

Encoded parameters and bits are separated into 6 sensitivity classes,
from Class A to Class F, to make it possible to give them additional
protection against packet losses.

Class A contains the most perceptually significant bits. The bits of
this class must be delivered to the decoder to fully prevent error
propagation.

Class F contains the least significant bits. Together, classes A to F
contain all encoded parameters of the first (core) encoding layer.
These parameters are sufficient to synthesize speech with "toll
quality".

                                                                  |
                                                            Input Speech
                                                              Fs=8 kHz
                                        +--------------+          |
                                        | LPC Analyzer +<---------+
                                        +------+-------+          |
                                               |                  |
          +------Codebook memory--+           LPC                 |
          |     vector update     |           \|/                 |
         \|/                      |    +-------+-------+          |
      +---+------+                |    | LPC Quantizer +-LSFs->   |
      | Adaptive +--Pitch->       |    +------------+--+          |
  +-->| Codebook |                |                 |             |
  |   +------+---+                |               QLPC            |
  |          |                    |                \|/            |
  |          |                    |             +---+--------+    |
  |          +-------------->(+)--+-Excitation->+ LPC-filter |    |
  |                          /|\                +----+-------+    |
  |         +-----------------+                      |            |
  |  +------+---+                                 Synth.          |
  +->|  Fixed   +                                 Speech          |
  |  | Codebook +-Pulse information                  |            |
  |  +----------+                                    |            |
  |                                                 \|/           |
  | +-------------+                                 (-)<----------+
  +-+    Error    |                                  |
    |Minimization |                                  |
    |   Control   |                                  |
    +-------+-----+                                  |
           /|\                                       |
            |                                        |
            |       +------------+                   |
  +---------+---+   | Perceptual |                   |
  |    Error    |   | Weighting  +<------------------+
  | Calculation +-->+   Filter   |                   |
  +------+------+   +------------+                   |
                                                 Residual 1
                                                     |
                                                    \|/

Figure 4 Structural block diagram of CELP-based Core Encoder

                                                          |
Pulse information                                         |
from previous layer                                       | Residual of
      |                                                   | previous
     \|/                                                  | layer
+-----+------------+                                      | (Fs=8 kHz)
| Adaptive Pulse-  |                 QLPC                 |
| Position Control |            from core layer           |
+------+-----------+                   |                  |
       |                               |                  |
      \|/                             \|/                 |
+------+---------+  Enhancement  +-----+------+          \|/
| Fixed Codebook +---- Layer --->+ LPC-filter +--------->(-)
+---+------------+  Excitation   +------------+           |
   /|\                                                    |
    | +--------------+  +-------------+  +------------+   |
    | |    Error     |  |    Error    |  | Perceptual |   |
    +-+ Minimization +<-+ Calculation +<-+ Weighting  +<--+
      |   Control    |  +-------------+  |   Filter   |   |
      +--------------+                   +------------+ Residual of
                                                       current layer
                                                         \|/

Figure 5 Structural block diagram of CELP-based Extension Encoder

The difference between the input speech and the speech synthesized by
the Core Encoder is delivered to extension coding. Each subsequent
Extension Encoder codes the residual delivered from the previous layer
and forms its own additional coded bit stream. The full bit stream
therefore contains the sum of the base and extension bit streams. The
number of layers used for coding, which corresponds to the number of
bit streams in the sum at the encoder's output, can be changed "on the
fly".
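
The layering principle above, in which each extension codes the
residual left by the previous layer, can be sketched as follows. A
uniform scalar quantizer with a halving step size stands in for the
CELP analysis-by-synthesis search purely for illustration. The names
quantize, encode_layers and decode, and the step sizes, are assumptions
and not part of IP-MR.

```python
from typing import List


def quantize(x: List[float], step: float) -> List[float]:
    """Toy stand-in for one coding layer: uniform scalar quantizer."""
    return [step * round(v / step) for v in x]


def encode_layers(signal: List[float], num_layers: int,
                  first_step: float = 0.5) -> List[List[float]]:
    """Return per-layer contributions; layer k codes the residual left
    after layers 0..k-1 (layer 0 plays the role of the core layer)."""
    residual = list(signal)
    layers = []
    step = first_step
    for _ in range(num_layers):
        contribution = quantize(residual, step)
        layers.append(contribution)
        residual = [r - c for r, c in zip(residual, contribution)]
        step /= 2  # assumed: each layer refines with a finer step
    return layers


def decode(layers: List[List[float]], use: int) -> List[float]:
    """Sum the first `use` layers; upper layers may have been dropped."""
    out = [0.0] * len(layers[0])
    for layer in layers[:use]:
        out = [o + c for o, c in zip(out, layer)]
    return out


signal = [0.83, -0.41, 0.12, 0.57]
layers = encode_layers(signal, num_layers=4)
# Worst-case reconstruction error when decoding 1, 2, 3 or 4 layers.
errors = [max(abs(s - d) for s, d in zip(signal, decode(layers, k)))
          for k in range(1, 5)]
```

Decoding more layers never increases the error in this toy model,
which mirrors the way the number of layers can be changed without
re-encoding the lower ones.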

Each CELP Extension Encoder uses the results of the previous layer's
encoding and estimates additional excitation by the
"analysis-by-synthesis" method on a "subframe-by-subframe" basis
(Figure 5). There are 3 CELP Extension Encoders in total.

4.4. Scalable CELP-based decoder
The decoder dequantizes the parameters of each encoding layer,
reconstructs the total excitation as the sum of the adaptive codebook
and fixed codebook (core and enhancement) contributions, and
synthesizes speech using the LPC filter. The reconstructed speech is
post-filtered and output to a 160-sample buffer (20 ms at 8 kHz).
Figure 6 shows the structure of the CELP-based decoder.

                                                             |
                                                        LSF indices
                                                             |
                                                            \|/
-Acbk gain--------------+                             +------+------+
                       \|/                            |     LPC     |
        +----------+   +++                            | Dequantizer |
-Pitch->| Adaptive |-->+X+------------+               +------+------+
        | Codebook |   +-+            |                      |
        +----------+                  |                    QLPC
                                      |                      |
-Fcbk 1 gain--------------------+     |                     \|/
                               \|/    |               +------+------+
---Pulse       +------------+  +++   \|/              |LPC Synthesis|
information--->+   Fixed    |->|X+-->(+)--Excitation->+   Filter    |
               | Codebook 1 |  +-+   /|\              +------+------+
               +------------+         |                      |
                     .                |                      |
                     .                |                     \|/
                     .                |               +------+------+
               +------------+         |               | Post Filter |
-Pulse         |   Fixed    |  +-+    |               +------+------+
Information n->+ Codebook n +->+X+-->-+                      |
               +------------+  +++                      Synthesized
                               /|\                     Speech 8 kHz
                                |                            |
--Fcbk n gain-------------------+                           \|/

Figure 6 Scalable CELP-based Decoder

The decoder is able to conceal lost frames (a PLC-like function)
by partially reconstructing speech from the speech parameters of the
last received frames. However, to provide the highest robustness to
packet loss, only the classes containing the most significant
parameters need to be protected.

4.5. Scalable MDCT-based encoder

The scalable MDCT-based encoder operates on a frame basis in the MDCT
spectrum domain. Quantized spectrum samples are written into the
bitstream.

                +------+   +-----------+  +-----------+
--Input signal->+ MDCT +-->+ Quantizer +->+ Bitstream +--Scalable
                +------+   +-----------+  | formatter |  bitstream-->
                                          +-----------+

Figure 7 Scalable MDCT-based Encoder

This approach is widely used in modern audio coding algorithms. The
main advantage of the developed compression scheme is its bitstream
formatter unit. It constructs the stream in such a way that any initial
part of the compressed data can be decoded and used for reconstruction.
In other words, each initial part of a compressed frame carries
self-sufficient information about a band-limited signal with a given
level of accuracy.

The bitstream formatter unit operates on bands, each eight samples
long. The coding loop iterates over all bands and transmits an update
for a given band. The loop ends when all spectrum bands are fully
transmitted.

+-----------+
/ Spectrum  /
+-----+-----+
      |
     \|/
+-----+------+             +-----------------+
|   Start    +------------>/ numCodedBands=0 /
+-------+----+             +-----------------+
        |
       \|/
   +----+-------------+ no  +------------------+ yes +-----+
+->| chooseCodedBand()+---->+ isAllBandsCoded()+---->+ End |
|  +----+-------------+     +----+-------------+     +-----+
|    yes|                        |no
|      \|/                      \|/
| +-----+-------+   +------------+--+    +-----------------+
| | updateBand()+<--+ startNewBand()+--->+ numCodedBands++ |
| +-----+-------+   +----+----------+    +-----------------+
|       |                .
|       +................+
|       |
|      \|/
| +-----+-------------------+
| | applyCompressionModel() |
| +--------+----------------+
|          |
|         \|/
|  +-------+-----+          +--------------+
+->+ rangeCodec()+--------->+ bits/sample  |
   +-----+-------+          +--------------+
        \|/
   +-----+------------+
   | Compressed frame |
   +------------------+

Figure 8 Spectrum encoding loop

Bandwidth expansion (coding band increment) is based on the actual
bits/sample ratio, which is known to both encoder and decoder. A
coding band increment only occurs if the compression rate exceeds a
fixed threshold or all available bands are already fully encoded.
Practical experiments show that if the compression ratio exceeds
1.7 - 2 bits/sample, then it is reasonable to expand the bandwidth
rather than update existing bands.

The band update procedure is based on a bit-plane data representation.
One bit-plane is issued per band at a time. In terms of binary planes
this means that each update carries one bit of mantissa for each band
sample. The current implementation uses ternary planes instead of
conventional binary planes. This allows the encoder to reduce the
amount of noise introduced if only the top plane is transmitted.

The sign and the sample presence flag together form the top plane of a
band, which is transmitted first, when coding of that band starts.
The encoder keeps track of the transmitted planes for each band and
chooses the highest non-transmitted plane to update.

The encoder applies different statistical models and compression
schemes to different planes and bands. In practice only the few top
planes (those following the sign/flag plane) are well suited for
compression, whereas all others tend to have a random distribution and
in fact cannot be compressed at all. After the compression scheme is
applied, the raw data and the chosen statistical model go to the range
codec(1), which writes them into the bitstream.

4.6. Scalable MDCT-based decoder

The decoder performs the same operations as the encoder, but in
reverse order. First the bitstream reader reconstructs the quantized
spectrum samples from the compressed frame, then the inverse quantizer
reconstructs the MDCT spectrum, and the inverse MDCT transforms the
signal back from the frequency domain to the time domain.

            +-----------+   +-----------+   +---------+
 Scalable   | Bitstream +-->+  Inverse  |   | Inverse +--Reconstructed
-bitstream->+  reader   |   | Quantizer +-->+  MDCT   |    signal -->
            +-----------+   +-----------+   +---------+

Figure 9 Scalable MDCT-based Decoder

(1) A range codec is a kind of arithmetic codec providing byte stream
granularity.

The accuracy and bandwidth of the resulting signal depend on the amount
of available input data. The codec introduces no inter-frame data
dependency except the 50% time domain overlap required by the MDCT
transform. In practice, this means that the signal cannot be correctly
reconstructed from the first successfully received compressed frame,
but the second frame will be reconstructed correctly.

The bitstream reader decompresses the input stream using the inverse
range coder. Because the encoder and decoder operate synchronously,
each time the decoder runs the inverse range codec it uses exactly the
same context as was used by the encoder during compression. Stream
parsing ends when no more data is available for the compressed frame.
The following figure demonstrates the spectrum decoding loop.

    +------------------+
    | Compressed frame |
    +---+--------------+
        |
       \|/
     +--+----+         +-----------------+
     | Start +-------> / numCodedBands=0 /
     +---+---+         +-----------------+
         |
        \|/
     +---+---------------+      no       +-----+
     | isDataAvailable() +-------------->+ End |
     +----+--------------+               +-----+
       yes|
         \|/
     +----+----------------+ no +---------------------+     +-----+
     | chooseDecodedBand() +--->+ isAllBandsDecoded() +---->+ End |
     +---+-----------------+    +-----------+---------+     +-----+
      yes|                                  | no
         +----------------------------------+
         |
        \|/
     +---+----------+               +-------------+
     | rangeCodec() +-------------->/ bits/sample /
     |  (inverse)   |               +-------------+
     +----+---------+
          |
         \|/
     +----+-------------------+
     | applyCompressionMode() |
     |       (inverse)        |
     +-----+------------------+
           |
           +.........................+
          \|/                       \|/
     +-----+--------+     +----------+-----+   +-----------------+
     | updateBand() |     | startNewBand() +-->/ numCodedBands++ /
     |  (inverse)   |     |   (inverse)    |   +-----------------+
     +--------+-----+     +------+---------+
              |                  |
             \|/                \|/
       +------+------------------+--------+
       /            Spectrum              /
       +----------------------------------+

Figure 10 Spectrum decoding loop

Although the codec has no lower bitrate limit, the compression scheme
used produces an artificial-sounding reconstructed signal if the
transmission rate is lower than 16-24 kbps. At low bitrates the audio
codec is therefore used together with the speech codec and processes
the speech codec residue.

5. Security Considerations

To Be Defined.

6. Informative References

[SILK] SILK Speech Codec Draft, https://developer.skype.com/silk?
       action=AttachFile&do=get&target=draft-vos-silk-00.txt

7. IANA Considerations

This document has no actions for IANA.

Authors' Addresses

Vladimir Sviridenko
SPIRIT DSP
Solzhenitsina 27
Moscow 109004
Russia

Phone: +7 495 661 2178
Email: vladimirs@spiritdsp.com

Sergey Ikonin
SPIRIT DSP
Solzhenitsina 27
Moscow 109004
Russia

Phone: +7 495 661 2178
Email: s.ikonin@gmail.com

Dmitry Yudin
SPIRIT DSP
Solzhenitsina 27
Moscow 109004
Russia

Phone: +7 495 661 2178
Email: yudin@spiritdsp.com

Person & email address to contact for further information:
Yury Morzeev
morzeev@spiritdsp.com