idnits 2.17.1 

draft-spiritdsp-ipmr-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (August 09, 2011) is 4644 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

     No issues found here.

     Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     V. Sviridenko
3	Internet-Draft                                                S. Ikonin
4	Intended status: Standards Track                               D. Yudin
5	Expires: February 09, 2012                                   SPIRIT DSP
6	                                                        August 09, 2011

8	                           IPMR Speech Codec
9	                         draft-spiritdsp-ipmr-01.txt

11	Status of this Memo

13	   This Internet-Draft is submitted to IETF in full conformance with
14	   the provisions of BCP 78 and BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six
22	   months and may be updated, replaced, or obsoleted by other documents
23	   at any time.  It is inappropriate to use Internet-Drafts as
24	   reference material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on February 09, 2012.

34	Copyright Notice

36	   Copyright (c) 2011 IETF Trust and the persons identified as the
37	   document authors.  All rights reserved.

39	   This document is subject to BCP 78 and the IETF Trust's Legal
40	   Provisions Relating to IETF Documents
41	   (http://trustee.ietf.org/license-info) in effect on the date of
42	   publication of this document. Please review these documents
43	   carefully, as they describe your rights and restrictions with
44	   respect to this document.

46	Abstract

48	   This document describes IPMR, a scalable variable adaptive multi-
49	   rate speech and audio codec designed for use in IP based networks.
50	   This codec is suitable for real-time communications such as
51	   telephony and voice and video conferencing.  Four different sampling
52	   frequencies are supported for encoding the audio input signal.
53	   Adaptation to network characteristics is provided through control of
54	   bitrate, packet rate, packet loss resilience and use of discontinuous
55	   transmission (DTX).
56	   IP-MR supports different profiles for input signal content, which
57	   should be specified during codec initialization: Speech, Audio or
58	   Auto-detection.  In Auto-detection mode the codec recognizes the
59	   type of input content and switches to the appropriate Speech or
60	   Audio mode automatically.

62	Table of Contents

64	   1. Introduction
65	   2. Technical Requirements
66	     2.1. Voice/Audio Quality
67	     2.2. Sampling Rate
68	     2.3. Adaptive Multi Rate
69	     2.4. Bitrate Scalability
70	     2.5. Packet Loss Resilience
71	     2.6. Delay
72	     2.7. DTX
73	   3. IP-MR Codec Description
74	   4. Algorithm Overview
75	     4.1. Coding profiles
76	     4.2. Mixed CELP/MDCT codec
77	     4.3. Scalable CELP-based encoder
78	     4.4. Scalable CELP-based decoder
79	     4.5. Scalable MDCT-based encoder
80	     4.6. Scalable MDCT-based decoder
81	   5. Security Considerations
82	   6. Informative References
83	   7. IANA Considerations
84	   Authors' Addresses

86	1.  Introduction

88	To ensure high-quality audio transmission over IP, a codec has to
89	overcome a set of problems and obstacles. A good codec should be able
90	to work at a wide range of bitrates with relatively small delay, should
91	deliver high-quality speech even in case of packet losses and poor
92	network connections, and should be able to provide wideband quality
93	(which is a must for today's business-level communication) and ultra
94	wideband quality for next-generation applications. This document
95	describes the IP-MR codec, which is a scalable variable adaptive
96	multi-rate speech and audio codec designed for use in IP-based networks.

98	2. Technical Requirements
99	We agree with some of the technical requirements described in [SILK]
100	and include them in this section. The Internet Wideband Speech/Audio
101	Codec must be optimized for real-time communications over the
102	Internet, and must have the flexibility to adjust to the environment it
103	operates in. Below is a list of the main requirements for the codec.

105	2.1. Voice/Audio Quality
106	The codec should provide a quality/bitrate trade-off that is
107	competitive with other state-of-the-art codecs. At low bitrates it
108	should deliver good quality of speech in any language. At high bitrates
109	the quality should be excellent for any audio signal, including music,
110	at standard conditions.

112	2.2. Sampling Rate
113	Audio bandwidth is determined by the codec sampling frequency: 8 kHz
114	for narrowband voice (PSTN) and 16 kHz for wideband. Wideband speech
115	is much more natural and comfortable, and wideband codecs are more
116	convenient to use in IP communication. However, sometimes there is not
117	enough bandwidth to allow a 16 kHz sampling frequency, and the codec
118	must be able to switch to 8 kHz. Moreover, the codec should support
119	ultra wideband (20 kHz and more) for next-generation high-end
120	quality.

122	2.3. Adaptive Multi Rate
123	The codec should have a set of bitrates with sufficient granularity to
124	fit different channel capacities. The bitrates should be adjustable
125	in real time. The codec should be capable of running at bitrates
126	starting from 6 kbps.

128	2.4. Bitrate Scalability
129	The codec should have a bitrate scalability feature (an embedded or
130	layered bitstream structure) that makes it possible to reduce voice
131	traffic during transmission without re-encoding. This is necessary for
132	dynamic congestion control, multicast and conferencing applications.
133	On the other hand, the price of scalability is lower compression
134	efficiency and higher computational complexity at the same bitrate.
135	It is therefore desirable that the scalability feature can be switched
136	off when it is not needed.

138	2.5. Packet Loss Resilience
139	The codec should be capable of running with little error propagation,
140	meaning that the decoded signal after one or more packet losses is
141	close to the decoded signal without packet losses after no more than
142	two additional packets. The codec should have a packet loss resilience
143	that is adjustable in real-time, where a lower packet loss resilience
144	setting improves the quality/bitrate trade-off.

146	2.6. Delay
147	For comfortable conversation the codec must have an algorithmic delay
148	of no more than 50 ms.

150	2.7. DTX
151	The codec should be capable of using Discontinuous Transmission (DTX)
152	where packets are sent at a reduced rate when the input signal contains
153	only background noise.

155	3.  IP-MR Codec Description
156	The IP-MR codec is a scalable variable adaptive multi-rate speech and
157	audio codec designed for use in IP-based networks. This codec is
158	suitable for real-time communications such as telephony and voice and
159	video conferencing.

161	Sampling rate
162	IP-MR supports three sampling rate modes: 8, 16 and 32 kHz.

164	Speech/Audio modes
165	IP-MR supports different profiles for input signal content, which
166	should be specified during codec initialization: Speech, Audio or
167	Auto-detection. In Auto-detection mode the codec recognizes the type
168	of input content and switches to the appropriate Speech or Audio mode
169	automatically.

171	Voice Quality
172	The Mean Opinion Score (MOS) of the codec's speech quality is about
173	3.7 to 4.4 for clean speech, depending on the current mode and the
174	average bit rate. At higher bitrates the codec achieves FM quality on
175	generic audio content.

177	Algorithmic delay
178	The frame length is 20 ms. Algorithmic delay varies from 35 to 50 ms
179	depending on the coding profile.

181	Adaptive Multi Rate
182	Depending on the sampling rate, IP-MR has 8 or 10 bitrate modes between
183	6 and 120 kbps, which can be changed in real time according to the
184	current network conditions.

186	+--------------------------------------------------------------------+
187	|Sampling |   Coding    | Frame |Algorith.| Number | Avg. Bit Rates  |
188	|  Rate   |   profile   | size  |  Delay  |of Rates|for active speech|
189	+--------------------------------------------------------------------+
190	|         |   Speech/   |       |         |        |                 |
191	|         |     Auto-   |       |         |        |                 |
192	|         |  -detection |       | 35 ms   |        |                 |
193	|         |    with     |       |         |        |                 |
194	|         |     short   |  20   |         |        |                 |
195	|         |     delay   |       |         |        |                 |
196	| 8 kHz   |-------------|       |---------|    8   |   6 - 50 kbps   |
197	|         |    Audio/   |  ms   |         |        |                 |
198	|         |     Auto-   |       | 50 ms   |        |                 |
199	|         | -detection  |       |         |        |                 |
200	|         |    with     |       |         |        |                 |
201	|         | long delay  |       |         |        |                 |
202	|--------------------------------------------------------------------|
203	|         |     Speech/ |       |         |        |                 |
204	|         |     Auto-   |       |         |        |                 |
205	|         |  -detection |       | 36.875  |        |                 |
206	|         |    with     |       |  ms     |        |                 |
207	|         | short delay |  20   |         |        |                 |
208	| 16 kHz  |-------------|       |---------|   10   |   6 - 70 kbps   |
209	|         |    Audio/   |  ms   |         |        |                 |
210	|         |   Auto-     |       |  50 ms  |        |                 |
211	|         | -detection  |       |         |        |                 |
212	|         |  with long  |       |         |        |                 |
213	|         |  delay      |       |         |        |                 |
214	|--------------------------------------------------------------------|
215	|         |    Speech/  |       |         |        |                 |
216	|         |   Auto-     |       |         |        |                 |
217	|         | -detection  |       | 37.8125 |        |                 |
218	|         |    with     |       |   ms    |        |                 |
219	|         | short delay |  20   |         |        |                 |
220	|  32 kHz |-------------|       |---------|  10    |   6 - 120 kbps  |
221	|         |    Audio/   |  ms   |         |        |                 |
222	|         |     Auto-   |       |  50 ms  |        |                 |
223	|         | -detection  |       |         |        |                 |
224	|         |  with long  |       |         |        |                 |
225	|         |    delay    |       |         |        |                 |
226	+--------------------------------------------------------------------+
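
As an illustrative cross-check of the table above, the frame size and
the short-delay figures can be converted to sample counts at each
sampling rate. The sketch below is not part of the codec; the
dictionary and function names are assumptions made for this example.

```python
# Illustrative only: names below are hypothetical, values come from the table.
# Algorithmic delay (ms) of the Speech/Auto-detection short-delay profile,
# keyed by sampling rate in Hz.
SHORT_DELAY_MS = {8000: 35.0, 16000: 36.875, 32000: 37.8125}
LONG_DELAY_MS = 50.0   # Audio/Auto-detection long-delay profile, all rates
FRAME_MS = 20.0        # frame size, all rates


def ms_to_samples(duration_ms, rate_hz):
    """Convert a duration in milliseconds to a sample count."""
    return duration_ms * rate_hz / 1000.0


for rate_hz, delay_ms in SHORT_DELAY_MS.items():
    frame_samples = ms_to_samples(FRAME_MS, rate_hz)
    delay_samples = ms_to_samples(delay_ms, rate_hz)
    print(rate_hz, frame_samples, delay_samples)
```

At 16 kHz, for instance, the 36.875 ms delay corresponds to 590
samples, and at 32 kHz the 37.8125 ms delay corresponds to 1210
samples.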

228	Variable Bit Rate
229	The encoder's bit rate varies constantly in accordance with the actual
230	speech content (voiced/unvoiced, pauses, stationary/non-stationary
231	voiced, etc.). The IP-MR codec reduces traffic while keeping the
232	efficiency, because the encoding adapts to the actual
233	characteristics of speech. All average bitrates are specified for
234	active speech without consideration of inter-speech (silence) regions.

236	Bitrate Scalability

238	The coded frame has a layered (embedded) structure. It consists of
239	multiple coding layers: a base (or core) layer and several enhancement
240	layers which are coded independently. Only the core layer is mandatory
241	to decode understandable speech; the upper layers provide quality
242	enhancement. The enhancement layers may be omitted, and the remaining
243	base layer can still be meaningfully decoded without notable
244	artifacts. This makes the bit stream scalable and allows the bit rate
245	to be reduced during transmission without re-encoding.

247	Bitrate scalability provides additional possibilities for congestion
248	control. An intermediate network node may modify the IP-MR codec's
249	payload by dropping some of the layers during transmission to meet the
250	available bandwidth. If the payload is forwarded with modified
251	content, at least the base layer must be preserved in the payload
252	delivered to the receiving side; this guarantees meaningful speech
253	decoding without the packet loss concealment procedure.
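
The layer-dropping behaviour of an intermediate node can be sketched
as follows. This is a minimal illustration, not the IP-MR payload
format: it assumes a frame is held as a list of byte strings with the
core layer first, and that layers are dropped from the top down. The
function name and the byte budget are hypothetical.

```python
def trim_frame(layers, max_bytes):
    """Keep the core layer plus as many leading enhancement layers as fit.

    `layers` is a hypothetical in-memory view of one frame: a list of
    byte strings, core layer first. The core layer is always preserved,
    even if it alone exceeds `max_bytes`.
    """
    if not layers:
        raise ValueError("a frame needs at least the core layer")
    kept = [layers[0]]              # the core layer is never dropped
    used = len(layers[0])
    for layer in layers[1:]:
        if used + len(layer) > max_bytes:
            break                   # drop this layer and every higher one
        kept.append(layer)
        used += len(layer)
    return kept


frame = [b"c" * 15, b"e" * 10, b"f" * 10, b"g" * 10]
print([len(layer) for layer in trim_frame(frame, 30)])
```

No re-encoding takes place: the node only discards trailing data, and
the receiver decodes whatever layers remain.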

255	--+--------+--------+--------+--------+--------+--------+--------+--
256	  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
257	--+--------+--------+--------+--------+--------+--------+--------+--

259	  <---- p(n-1) ---->
260	           <----- p(n) ----->
261	                     <---- p(n+1) ---->
262	                               <---- p(n+2) ---->
263	                                        <---- p(n+3) ---->
264	                                                 <---- p(n+4) ---->

266	But because of the scalable nature of IP-MR codec there is no need to
267	duplicate the whole previous frame - only the core layer may be
268	retransmitted. This reduces redundancy overhead while keeping
269	efficiency.
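
A minimal sketch of this kind of redundancy is given below. It assumes,
as the figure above appears to show, that packet p(n) carries frame
f(n) together with redundant data for f(n-1), and that the redundant
part is the core layer only. The dictionary layout and function names
are hypothetical and do not describe the actual payload format.

```python
def packetize(frames):
    """Build packets carrying frame n plus the core layer of frame n-1.

    `frames` is a list of frames, each a list of byte strings with the
    core layer first (a hypothetical representation).
    """
    packets = []
    prev_core = None
    for seq, layers in enumerate(frames):
        packets.append({"seq": seq,
                        "primary": layers,
                        "redundant_core": prev_core})
        prev_core = layers[0]
    return packets


def recover(received, n):
    """Return the layers available for frame n, or None if concealment
    is needed. `received` maps sequence number to packet."""
    if n in received:
        return received[n]["primary"]
    following = received.get(n + 1)
    if following is not None and following["redundant_core"] is not None:
        return [following["redundant_core"]]   # core layer only
    return None


frames = [[b"c0", b"e0"], [b"c1", b"e1"], [b"c2", b"e2"]]
received = {p["seq"]: p for p in packetize(frames) if p["seq"] != 1}
print(recover(received, 1))
```

When packet 1 is lost, frame 1 is still decodable at core-layer quality
from the redundant data in packet 2.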

271	Moreover, the speech bits encoded in the core layer are divided into
272	six classes (from A to F) of perceptual sensitivity to errors. Class A
273	contains the most perceptually significant bits. The bits of this
274	class should be delivered to the decoder to fully exclude "error
275	propagation". Class F contains the least significant bits. Together,
276	classes A to F contain all encoded parameters of the first (core)
277	encoding layer. These parameters are sufficient to synthesize speech
278	with near "toll quality".

280	Sending these classes as redundancy makes it possible to smoothly
281	adjust the trade-off between overhead and robustness to packet loss.
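
One way to picture this trade-off is to pick how many of the most
sensitive classes to repeat within a given overhead budget. The sketch
below is an assumption-laden illustration: the per-class bit counts are
invented for the example and are not IP-MR values, and the function
name is hypothetical.

```python
CLASS_NAMES = "ABCDEF"   # A = most sensitive, F = least sensitive


def select_redundant_classes(class_bits, budget_bits):
    """Choose classes to repeat, most sensitive first, within a budget.

    `class_bits` maps class name to its size in bits for one frame.
    Returns the chosen class names and the bits they occupy.
    """
    chosen = []
    used = 0
    for name in CLASS_NAMES:
        size = class_bits[name]
        if used + size > budget_bits:
            break                 # stop at the first class that does not fit
        chosen.append(name)
        used += size
    return chosen, used


# Invented sizes, for illustration only.
example_bits = {"A": 30, "B": 20, "C": 25, "D": 15, "E": 20, "F": 10}
print(select_redundant_classes(example_bits, 60))
```

A larger budget protects more classes at the cost of more overhead; a
budget of zero disables redundancy altogether.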

283	DTX
284	The IP-MR codec supports a Discontinuous Transmission mode for silence
285	compression. During silence intervals the codec bitrate can be reduced
286	to 0.3 kbps.

288	4.  Algorithm overview

290	4.1. Coding profiles
291	IP-MR supports different profiles for the type of input signal
292	content: Speech, Audio or Auto-detection. In Auto-detection mode the
293	codec recognizes the type of input content and switches to the
294	appropriate Speech or Audio mode automatically. At a high level the
295	encoder consists of three basic modules (see Figure 1).

297	   -Speech/Music detector - automatically classifies the input
298	content as speech or music to enable the appropriate coding model.
299	   -CELP-based speech coder - implements a source-filter model and is
300	oriented toward speech content.
301	   -MDCT-based audio coder - for general-purpose audio coding.

303	               +-------------------+
304	               |Predefined Speech/ |
305	               |       Audio       |
306	               |      Profile      |
307	               +----------+--------+
308	                          |
309	                         \|/
310	               +----------+-------+
311	  input signal |       Speech/    |
312	---------------+  Music detector  |
313	               +---+---------+----+
314	                  S|        M|
315	                  P|        u|
316	                  e|        s|
317	                  e|        i|
318	                  c|        c|
319	                  h|         |
320	                   |         |
321	    +..............|.........|..........+
322	    .             \|/       \|/   coder .
323	    . +------------+--+   +--+-----+    .
324	    . |   CELP/MDCT   |   | MDCT   |    .
325	    . +--------+------+   +----+---+    .
326	    +..........|...............|........+
327	               |               |
328	              \|/             \|/
329	        +------+---------------+--+
330	        |        Bitstream        +--->
331	        +-------------------------+

333	      Figure 1 High level encoder structure

335	Depending on the type of input signal (speech/music), different coding
336	models are used. The type of input signal can be detected automatically
337	in Auto-detection mode or specified as a predefined setting during
338	codec initialization. Speech content is coded by the mixed CELP/MDCT
339	model. General audio content is coded by the pure MDCT-based model.

341	The decoder performs the inverse operations. First, the compressed frame
342	goes to the CELP decoder, which extracts the core and extension layers.
343	Then the rest of the bitstream and the reconstructed signal go to the
344	MDCT decoder, which restores the residue and generates the joint output.

346	              +----------+  Rest of compressed   +--------+
347	 Compressed   |          |        data           |        |
348	   frame      |  CELP    +---------------------->+  MDCT  |
349	------------->+          |    Reconstructed      |        |
350	              | decoder  |       signal          |decoder +--OUTPUT->
351	              |          +---------------------->+        |
352	              +----------+                       +--------+

354	                Figure 2 High level decoder structure

356	In fact, CELP and MDCT are two different decoders and can therefore
357	work simultaneously. Parallel processing requires only that two
358	operations, bitstream demultiplexing and signal mixing, be carried out
359	outside the decoders (see the parallel structure below).

361	                           +---------+
362	                           |   CELP  |      +---------+
363	                        +->+ decoder +----->+         |
364	 Compressed            /   +---------+      |   MDCT  |
365	   frame      +-------+                     |         +--Output-->
366	------------->| DEMUX |                     | decoder |
367	              +-+---+-+    +---------+      |         |
368	                       \   |   MDCT  +----->+         |
369	                        +->+ decoder |      +---------+
370	                           +---------+

372	       Figure 2 High level decoder structure (parallel)

374	Note that demultiplexing is simple to implement because the size of
375	the CELP portion of the stream can be calculated without decoding.

377	4.2. Mixed CELP/MDCT codec

379	The mixed CELP/MDCT codec is composed of two independent codecs, one
380	CELP-based and one MDCT-based. The first processes the source signal
381	and feeds the residue to the second. To provide flexible and
382	transparent coupling between the codecs, the corresponding sampling
383	rate conversion and frame synchronization procedures are applied.

385	The resulting bitstream is naturally constructed from two contiguous
386	regions belonging to the CELP and MDCT codecs, respectively. The CELP
387	codec bitstream has a layered structure (core + extensions), while the
388	MDCT codec generates a byte-scalable stream.

390	The next figure provides an example of encoding 16 kHz source material
391	when the CELP-based encoder operates at an 8 kHz sampling rate.

393	                                                   Core layer
394	                  +------------+   +------------+     params
395	-Input speech-+-->| Downsample +-->|   Scalable +--------------+
396	 FS=16 kHz    |   |   to 8 kHz |   | CELP-based |              |
397	              |   +------------+   |  Encoder   +---+          |
398	              |                    +--+---------+   |          |
399	              |                       |             |          |
400	                                 Synth Speech       |          |
401	              |                       |         Enhancement    |
402	              |                       |           layers       |
403	              |                       |           params       |
404	              |                      \|/            |         \|/
405	              |            +----------+---------+   |   +------+-----+
406	              |            | Upsample to 16 kHz |   |   | Core layer |
407	              |            +-----+--------------+   |   +------------+
408	              |                  |                  |   | Ext.layer 1|
409	              |                 \|/                 |   +------------+
410	              +---------------->(-)                 +-->+ Ext.layer 2|
411	                                 |                      +------------+
412	                                 |                      | Ext.layer 3|
413	                                 |                      +------------+
414	                            Residual                    |            |
415	                                 |                      |            |
416	                                \|/                     |  Scalable  |
417	            +--------------------+--+                   |  bitstream |
418	            |      Scalable         |    Scalable       |            |
419	            |  MDCT-based Encoder   +---bitstream------>|            |
420	            +-----------------------+                   +------------+

422	  Figure 3 Structural block diagram of mixed CELP/MDCT encoder
423	                               (16kHz mode)

425	First, the input signal is down-sampled to 8 kHz and encoded by the
426	Scalable CELP-based encoder, which packs the quantized parameters into
427	a layered bitstream. The difference between the up-sampled synthesized
428	signal and the original source goes to the Scalable MDCT-based encoder,
429	which forms the rest of the bitstream.
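
The signal flow of the 16 kHz mode can be sketched with crude stand-ins
for each block. Everything below is an assumption made for
illustration: pair averaging and sample repetition replace the real
resamplers, a coarse uniform quantizer replaces CELP encoding and
synthesis, and the residual is left unquantized where the MDCT-based
encoder would code it. An even-length input frame is assumed.

```python
def downsample2(x):
    """Halve the rate by averaging sample pairs (stand-in resampler)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample2(x):
    """Double the rate by repeating each sample (stand-in resampler)."""
    out = []
    for value in x:
        out.extend((value, value))
    return out


def core_codec(x, step=0.25):
    """Stand-in for CELP encoding plus local synthesis."""
    return [round(value / step) * step for value in x]


def mixed_encode(frame_16k):
    """Return the low-rate synthesized signal and the full-rate residual."""
    synth_8k = core_codec(downsample2(frame_16k))
    synth_16k = upsample2(synth_8k)
    residual = [s - u for s, u in zip(frame_16k, synth_16k)]
    return synth_8k, residual


def mixed_decode(synth_8k, residual):
    """Add the residual back to the up-sampled core signal."""
    return [u + r for u, r in zip(upsample2(synth_8k), residual)]


frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
synth, residual = mixed_encode(frame)
print(mixed_decode(synth, residual))
```

Because the residual is computed against the up-sampled synthesized
signal, whatever the core stage fails to represent is handed to the
second stage rather than lost.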

431	The CELP and MDCT-based codecs are considered in more detail below.

433	4.3. Scalable CELP-based encoder

435	The scalable CELP-based coder applied to speech coding consists of the
436	core (base layer) encoder and three enhancement encoders. The structure
437	of the core encoder is shown in Figure 4.

439	The Core Encoder codes speech in a "base frequency bandwidth" (up to
440	4 kHz) with speech quality close to "toll quality" and forms a coded
441	bit stream at the minimum average bit rate (about 6.0 kbps). The
442	current bit rate is driven by the information content of the input
443	speech and can vary from 4.3 kbps up to 10.35 kbps.

445	The Core Encoder performs LPC analysis and pitch detection, and
446	estimates the parameters of the pitch predictor and the excitation by
447	the "analysis-by-synthesis" method on a subframe-by-subframe basis.
448	The subframe length is 5 ms.
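
For the 8 kHz core, a 20 ms frame holds 160 samples and a 5 ms
subframe holds 40, giving four subframes per frame. A small sketch of
this split follows; the constant and function names are chosen for the
example only.

```python
FS_HZ = 8000        # core encoder sampling rate
FRAME_MS = 20       # frame length
SUBFRAME_MS = 5     # subframe length

FRAME_LEN = FS_HZ * FRAME_MS // 1000        # 160 samples
SUBFRAME_LEN = FS_HZ * SUBFRAME_MS // 1000  # 40 samples


def split_subframes(frame):
    """Split one 160-sample frame into consecutive 40-sample subframes."""
    if len(frame) != FRAME_LEN:
        raise ValueError("expected %d samples" % FRAME_LEN)
    return [frame[i:i + SUBFRAME_LEN]
            for i in range(0, FRAME_LEN, SUBFRAME_LEN)]


subframes = split_subframes(list(range(FRAME_LEN)))
print(len(subframes), len(subframes[0]))
```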

450	Encoded parameters and bits are separated into 6 sensitivity classes,
451	from Class A to Class F, so that they can be given additional
452	protection against packet losses.

454	Class A contains the most perceptually significant bits. These bits
455	should be delivered to the decoder to fully exclude "error propagation".

457	Class F contains the least significant bits. Together, classes A to F
458	contain all encoded parameters of the first (core) encoding layer.
459	These parameters are sufficient to synthesize speech with "toll
460	quality".

462	                                                                |
463	                                                           Input Speech
464	                                                            Fs=8 kHz
465	                                      +--------------+          |
466	                                      | LPC Analyzer +<---------+
467	                                      +------+-------+          |
468	                                             |                  |
469	        +------Codebook memory--+           LPC                 |
470	        |         vector update |           \|/                 |
471	       \|/                      |    +-------+-------+          |
472	    +---+------+                |    | LPC Quantizer +-LSFs->   |
473	    | Adaptive +--Pitch->       |    +------------+--+          |
474	+-->| Codebook |                |                 |             |
475	|   +------+---+                |                QLPC           |
476	|          |                    |                \|/            |
477	|          |                    |             +---+--------+    |
478	|          +-------------->(+)--+-Excitation->+ LPC-filter |    |
479	|                          /|\                +----+-------+    |
480	|         +-----------------+                      |            |
481	|  +------+---+                                  Synth.         |
482	+->|   Fixed  +                                  Speech         |
483	|  | Codebook +-Pulse information                  |            |
484	|  +----------+                                    |            |
485	|                                                 \|/           |
486	| +-------------+                                 (-)<----------+
487	+-+  Error      |                                  |
488	  |Minimization |                                  |
489	  |  Control    |                                  |
490	  +-------+-----+                                  |
491	         /|\                                       |
492	          |                                        |
493	          |       +------------+                   |
494	+---------+---+   | Perceptual |                   |
495	|    Error    |   | Weighing   +<------------------+
496	| Calculation +-->+   Filter   |                   |
497	+------+------+   +------------+                   |
498	                                              Residual 1
499	                                                   |
500	                                                  \|/

502	       Figure 4 Structural block diagram of CELP-based Core Encoder

504	      |
505	Pulse information                                             |
506	from previous layer                       |               Residual
507	      |                                   |                  of
508	     \|/                                  |           previous layer
509	+-----+------------+                      |               (Fs=8 kHz)
510	| Adaptive Pulse-  |                    QLPC                   |
511	| Position Control |                 from core layer           |
512	+------+-----------+                      |                    |
513	       |                                  |                    |
514	      \|/                                \|/                   |
515	+------+---------+     Enhancement  +-----+------+            \|/
516	| Fixed Codebook +----  Layer   --->+ LPC-filter +----------->(-)
517	+---+------------+    Excitation    +------------+             |
518	   /|\                                                         |
519	    | +--------------+  +-------------+  +------------+        |
520	    | |    Error     |  |   Error     |  | Perceptual |        |
521	    +-+ Minimization +<-+ Calculation +<-+ Weighing   +<-------+
522	      |   Control    |  +-------------+  |  Filter    |        |
523	      +--------------+                   +------------+    Residual of
524	                                                         current layer
525	                                                              \|/

527	      Figure 5 Structural block diagram of CELP-based Extension Encoder

529	The difference between the input speech and the speech synthesized by
530	the Core Encoder is delivered to extension coding. Each successive
531	Extension Encoder codes the residual delivered from the previous layer
532	and forms its own additional coded bit stream. The full bit stream is
533	therefore the sum of the base and extension bit streams. The number of
534	layers used for coding, which corresponds to the number of bit streams
535	in the sum at the encoder's output, can be changed "on the fly".

537	Each CELP Extension Encoder uses the results of the previous layer's
538	encoding and estimates additional excitation by the
539	"analysis-by-synthesis" method on a subframe-by-subframe basis
540	(Figure 5). There are 3 CELP Extension Encoders in total.
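
The principle of coding the previous layer's residual can be shown
with a toy multi-stage quantizer. This is not the CELP algorithm: a
uniform scalar quantizer stands in for each layer's codebook search,
and the step sizes are arbitrary values picked for the example.

```python
def quantize(x, step):
    """Uniform scalar quantizer (stand-in for one layer's encoder)."""
    return [round(value / step) * step for value in x]


def encode_layers(signal, steps):
    """Code the signal in stages; each stage codes the residual left
    by the stages before it. Returns one contribution per layer."""
    layers = []
    residual = list(signal)
    for step in steps:
        contribution = quantize(residual, step)
        layers.append(contribution)
        residual = [r - c for r, c in zip(residual, contribution)]
    return layers


def decode_layers(layers, count):
    """Reconstruct from the first `count` layers only."""
    out = [0.0] * len(layers[0])
    for layer in layers[:count]:
        out = [o + c for o, c in zip(out, layer)]
    return out


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


signal = [0.9, -0.33, 0.1, 0.62]
steps = [0.5, 0.25, 0.125, 0.0625]      # core layer + three extensions
layers = encode_layers(signal, steps)
for count in range(1, len(layers) + 1):
    print(count, max_error(signal, decode_layers(layers, count)))
```

Decoding with fewer layers still yields a valid, coarser
reconstruction, which is what allows the number of layers to change
"on the fly".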

542	4.4. Scalable CELP-based decoder
543	The decoder dequantizes the parameters of each encoding layer,
544	reconstructs the total excitation as the sum of the adaptive codebook
545	and fixed codebook (core and enhancement) contributions, and
546	synthesizes speech using the LPC filter. The reconstructed speech is
547	post-filtered and output to a 160-sample buffer (20 ms at 8 kHz). The
548	structure of the CELP-based decoder is presented in Figure 6.

549	                                                            |
550	                                                       LSF indices
551	                                                            |
552	                                                           \|/
553	-Acbk gain--------------+                            +------+------+
554	                       \|/                           |     LPC     |
555	        +----------+   +++                           | Dequantizer |
556	-Pitch->| Adaptive |-->+X+-----------+               +------+------+
557	        | Codebook |   +-+           |                      |
558	        +----------+                 |                    QLPC
559	                                     |                      |
560	-Fcbk 1 gain-------------------+     |                     \|/
561	                              \|/    |               +------+------+
562	---Pulse      +------------+  +++   \|/              |LPC Synthesis|
563	information-->+    Fixed   |->|X+-->(+)--Excitation->+    Filter   |
564	              | Codebook 1 |  +-+   /|\              +------+------+
565	              +------------+         |                      |
566	                     .               |                      |
567	                     .               |                     \|/
568	                     .               |               +------+------+
569	               +------------+        |               | Post Filter |
570	-Pulse         |  Fixed     |  +-+   |               +------+------+
571	Information n->+ Codebook n +->+X+->-+                      |
572	               +------------+  +++                      Synthesized
573	                               /|\                     Speech 8 kHz
574	                                |                           |
575	--Fcbk n gain-------------------+                          \|/

577	     Figure 6 Scalable CELP-based Decoder
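
The excitation summation and LPC synthesis stages of Figure 6 can be
sketched as follows. This is an illustration under stated assumptions,
not the IPMR decoder: the function names, the sign convention
A(z) = 1 + sum(a_k * z^-k) and the toy 4-sample vectors are
hypothetical, and the adaptive codebook lookup, parameter dequantization
and post filter are omitted.

```python
def total_excitation(adaptive, acbk_gain, fixed_codebooks, fcbk_gains):
    """Sum the gain-scaled adaptive and fixed codebook vectors."""
    exc = [acbk_gain * a for a in adaptive]
    for vec, gain in zip(fixed_codebooks, fcbk_gains):
        exc = [e + gain * v for e, v in zip(exc, vec)]
    return exc


def lpc_synthesis(excitation, lpc, state=None):
    """All-pole synthesis filter 1/A(z), A(z) = 1 + sum(lpc[k-1] * z^-k).

    `state` holds the last len(lpc) output samples, most recent first,
    so that filtering can continue across subframe boundaries.
    """
    mem = list(state) if state is not None else [0.0] * len(lpc)
    out = []
    for e in excitation:
        y = e - sum(a * m for a, m in zip(lpc, mem))
        out.append(y)
        mem = [y] + mem[:-1]
    return out, mem


# Toy example: one adaptive vector and two fixed codebook vectors.
exc = total_excitation([1.0, 0.0, 0.0, 0.0], 0.5,
                       [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
                       [2.0, 4.0])

# Impulse response of a first-order filter with a_1 = -0.5.
speech, final_state = lpc_synthesis([1.0, 0.0, 0.0, 0.0], [-0.5])

# Filtering in two halves with carried state gives the same output.
first, carried = lpc_synthesis([1.0, 0.0], [-0.5])
second, _ = lpc_synthesis([0.0, 0.0], [-0.5], carried)
```

Carrying the filter state between calls is what lets such a decoder
process one subframe at a time and still produce a continuous output
frame.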

579	The decoder can conceal lost frames (a PLC-like function) by partially
580	reconstructing speech from the speech parameters of the last received
581	frames. However, to provide the highest robustness to packet loss,
582	only the classes of the most significant parameters should be
583	protected.

585	4.5. Scalable MDCT-based encoder

587	The scalable MDCT-based encoder operates frame by frame in the MDCT
588	spectral domain. Quantized spectrum samples are written into the
589	bitstream.

591	                +------+   +-----------+  +-----------+
592	--Input signal->+ MDCT +-->+ Quantizer +->+ Bitstream +--Scalable
593	                +------+   +-----------+  | formatter |  bitstream-->
594	                                          +-----------+

596	                    Figure 7 Scalable MDCT-based Encoder

598	This approach is widely used in modern audio coding algorithms. The
599	main advantage of the developed compression scheme is its bitstream
600	formatter unit, which constructs the stream so that any initial part of
601	the compressed data can be decoded and used for reconstruction. In other
602	words, each initial part of a compressed frame carries self-sufficient
603	information about a band-limited signal at a given level of accuracy.

605	The bitstream formatter unit operates on bands of eight samples each.
606	The coding loop iterates over the bands and transmits an update for
607	the chosen band on each pass. The loop ends when all spectrum bands
608	are fully transmitted.

610	  +-----------+
611	 / Spectrum  /
612	+-----+-----+
613	      |
614	     \|/
615	+-----+------+              +-----------------+
616	|    Start   +------------>/ numCodedBands=0 /
617	+-------+----+            +-----------------+
618	        |
619	       \|/
620	   +----+-------------+ no  +------------------+ yes +-----+
621	+->| chooseCodedBand()+---->+ isAllBandsCoded()+---->+ End |
622	|  +----+-------------+     +----+-------------+     +-----+
623	|    yes|                        |no
624	|      \|/                      \|/
625	| +-----+-------+   +------------+--+    +-----------------+
626	| | updateBand()+<--+ startNewBand()+--->+ numCodedBands++ |
627	| +-----+-------+   +----+----------+    +-----------------+
628	|       |                .
629	|       +................+
630	|       |
631	|      \|/
632	| +-----+-------------------+
633	| | applyCompressionModel() |
634	| +--------+----------------+
635	|          |
636	|         \|/
637	|  +-------+-----+          +--------------+
638	+->+ rangeCodec()+--------->+  bits/sample |
639	   +-----+-------+          +--------------+
640	        \|/
641	   +-----+------------+
642	   | Compressed frame |
643	   +------------------+

645	        Figure 8 Spectrum encoding loop

647	Bandwidth expansion (coding band increment) is based on the actual
648	bits/sample ratio, which is known to both encoder and decoder. A
649	coding band increment occurs only if the compression rate exceeds a
650	fixed threshold or all available bands are already fully encoded.
651	Practical experiments show that if the compression ratio exceeds
652	1.7 - 2 bits/sample, then it is reasonable to expand the bandwidth
653	rather than update existing bands.
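
The expansion rule can be written as a small decision function. This is
a sketch only: the name should_start_new_band, the way the running
bits/sample ratio is formed (bits spent divided by samples coded so far)
and the threshold value of 1.7 are assumptions; the text states only
that a fixed threshold in the 1.7 - 2 bits/sample range is reasonable.

```python
EXPAND_THRESHOLD = 1.7  # bits/sample; assumed value from the 1.7 - 2 range


def should_start_new_band(bits_spent, samples_coded, open_bands_complete,
                          threshold=EXPAND_THRESHOLD):
    """Return True to start a new band, False to update an existing one.

    The ratio bits_spent / samples_coded uses only data that has already
    been written, so a decoder can form the same ratio from what it has
    read and reach the same decision.
    """
    if samples_coded == 0:
        return True                 # nothing coded yet: open the first band
    if open_bands_complete:
        return True                 # no plane is left to update
    return bits_spent / samples_coded > threshold


decisions = [
    should_start_new_band(0, 0, False),      # empty frame
    should_start_new_band(100, 64, False),   # 1.5625 bits/sample
    should_start_new_band(128, 64, False),   # 2.0 bits/sample
    should_start_new_band(40, 64, True),     # all started bands complete
]
```

Because the decision depends only on quantities both sides already
know, no side information is needed to signal when the bandwidth grows.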

655	The band update procedure is based on a bit-plane data representation.
656	One bit-plane is issued per band at a time. In terms of binary planes,
657	each update carries one bit of mantissa for each sample of the band.
658	The current implementation uses ternary planes instead of conventional
659	binary planes, which allows the encoder to reduce the amount of noise
660	introduced if only the top plane is transmitted.

662	The sign and the sample presence flag together form the top plane of a
663	particular band, which is transmitted first, when coding of the band
664	starts. The encoder keeps track of the planes transmitted for each band
665	and chooses the highest plane not yet transmitted for the next update.
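
The plane ordering and per-band tracking can be sketched as below. For
simplicity the sketch uses conventional binary planes, not the ternary
planes of the actual implementation, and it skips the statistical
modelling and range coding stages. The names split_planes, rebuild and
BandTracker, the 3-bit magnitudes and the sample values are
hypothetical.

```python
BAND_SIZE = 8   # samples per band, as stated in the text
NUM_BITS = 3    # magnitude bits per sample (assumed for this example)


def split_planes(band, num_bits):
    """Return the planes of a band, most significant first.

    planes[0] is the top plane: a (present, negative) pair per sample.
    planes[1:] are binary magnitude planes from MSB down to LSB.
    """
    top = [(v != 0, v < 0) for v in band]
    mags = [[(abs(v) >> b) & 1 for v in band]
            for b in range(num_bits - 1, -1, -1)]
    return [top] + mags


def rebuild(planes_received, num_bits):
    """Rebuild a band from a prefix of its planes; missing bits are zero."""
    if not planes_received:
        return [0] * BAND_SIZE
    top, mags = planes_received[0], planes_received[1:]
    values = [0] * BAND_SIZE
    for k, plane in enumerate(mags):
        shift = num_bits - 1 - k
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return [(-v if negative else v) if present else 0
            for v, (present, negative) in zip(values, top)]


class BandTracker:
    """Remembers how many planes of one band have been transmitted."""

    def __init__(self, planes):
        self.planes = planes
        self.sent = 0

    def next_update(self):
        """Return the highest plane not yet transmitted, or None."""
        if self.sent == len(self.planes):
            return None
        plane = self.planes[self.sent]
        self.sent += 1
        return plane


band = [5, -3, 0, 7, -1, 2, 0, -6]
tracker = BandTracker(split_planes(band, NUM_BITS))
received = [tracker.next_update(), tracker.next_update()]
coarse = rebuild(received, NUM_BITS)        # top plane + MSB plane
received.append(tracker.next_update())
finer = rebuild(received, NUM_BITS)         # one more plane
received.append(tracker.next_update())
exact = rebuild(received, NUM_BITS)         # all planes
```

Each additional plane refines every sample of the band by one bit, so
truncating the updates of a band at any point still leaves a usable
approximation.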

667	The encoder applies different statistical models and compression schemes
668	to different planes and bands. Only the few top planes (those following
669	the sign/flag plane) are well suited for compression; all others tend to
670	have a random distribution and cannot in fact be compressed. After the
671	compression scheme is applied, the raw data and the chosen statistical
672	model go to the range codec(1), which writes the data to the bitstream.

674	4.6. Scalable MDCT-based decoder

676	The decoder performs the same operations as the encoder, but in
677	reverse order. First the bitstream reader reconstructs the quantized
678	spectrum samples from the compressed frame, then the inverse quantizer
679	reconstructs the MDCT spectrum, and finally the inverse MDCT transforms
680	the signal back from the frequency domain to the time domain.

682	            +-----------+   +-----------+   +---------+
683	  Scalable  | Bitstream +-->+  Inverse  |   | Inverse +--Reconstructed
684	-bitstream->+  reader   |   | Quantizer +-->+   MDCT  |     signal  -->
685	            +-----------+   +-----------+   +---------+

687	        Figure 9 Scalable MDCT-based Decoder

689	(1) A range codec is a kind of arithmetic codec that provides
690	    byte-stream granularity.

692	The accuracy and bandwidth of the resulting signal depend on the amount
693	of available input data. The codec introduces no inter-frame data
694	dependency except the 50% time-domain overlap required by the MDCT. In
695	practice, this means that the signal cannot be correctly reconstructed
696	from the first successfully received compressed frame, but the second
697	frame will be reconstructed correctly.
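
The effect of the 50% overlap can be reproduced with a small MDCT
example. The sketch assumes a sine window and a hop of 8 samples, and it
leaves out quantization; none of these choices is taken from the IPMR
specification. It shows that the first frame a decoder receives yields
an aliased block, while the next frame is reconstructed exactly.

```python
import math

N = 8  # new samples per frame (hop); each MDCT frame spans 2 * N samples

# Sine window (assumed); it satisfies w[n]^2 + w[n + N]^2 = 1.
WINDOW = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def _basis(n, k):
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def mdct(block):
    """2 * N time samples -> N spectral coefficients (analysis window applied)."""
    return [sum(WINDOW[n] * block[n] * _basis(n, k) for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs):
    """N coefficients -> 2 * N windowed, time-aliased samples."""
    return [2.0 / N * WINDOW[n] * sum(coeffs[k] * _basis(n, k) for k in range(N))
            for n in range(2 * N)]


def decode_stream(frames):
    """Overlap-add consecutive IMDCT outputs; one N-sample block per frame."""
    tail = [0.0] * N            # second half of the previous IMDCT output
    blocks = []
    for coeffs in frames:
        y = imdct(coeffs)
        blocks.append([t + h for t, h in zip(tail, y[:N])])
        tail = y[N:]
    return blocks


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(4 * N)]
frames = [mdct(signal[i:i + 2 * N]) for i in (0, N, 2 * N)]

received = frames[1:]           # frame 0 is lost; decoding starts at frame 1
out = decode_stream(received)
err_first = max_error(out[0], signal[N:2 * N])        # aliased, not correct
err_second = max_error(out[1], signal[2 * N:3 * N])   # aliasing cancelled
```

The first output block lacks the overlapping half of the lost frame, so
its time-domain aliasing is not cancelled; from the second received
frame onward the overlap-add has both halves and the aliasing terms
cancel.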

699	The bitstream reader decompresses the input stream using the inverse
700	range codec. Because the encoder and decoder operate synchronously,
701	each time the decoder runs the inverse range codec it uses exactly the
702	same context as the encoder used during compression. Stream parsing
703	ends when no more data is available for the compressed frame. The
704	following figure shows the spectrum decoding loop.

706	+------------------+
707	| Compressed frame |
708	+---+--------------+
709	    |
710	   \|/
711	 +--+----+          +-----------------+
712	 | Start +-------> / numCodedBands=0 /
713	 +---+---+        +-----------------+
714	     |
715	    \|/
716	 +---+---------------+  no           +-----+
717	 | isDataAvailable() +-------------->+ End |
718	 +----+--------------+               +-----+
719	   yes|
720	     \|/
721	 +----+----------------+ no +---------------------+     +-----+
722	 | chooseDecodedBand() +--->+ isAllBandsDecoded() +---->+ End |
723	 +---+-----------------+    +-----------+---------+     +-----+
724	  yes|                                  | no
725	     +----------------------------------+
726	     |
727	    \|/
728	 +---+----------+                +-------------+
729	 | rangeCodec() +-------------->/ bits/sample /
730	 |  (inverse)   |              +-------------+
731	 +----+---------+
732	      |
733	     \|/
734	 +----+--------------------+
735	 | applyCompressionModel() |
736	 |        (inverse)        |
737	 +-----+-------------------+
738	       |
739	       +.........................+
740	      \|/                       \|/
741	 +-----+--------+     +----------+-----+    +-----------------+
742	 | updateBand() |     | startNewBand() +-->/ numCodedBands++ /
743	 | (inverse)    |     |   (inverse)    |  +-----------------+
744	 +--------+-----+     +------+---------+
745	          |                  |
746	         \|/                \|/
747	   +------+------------------+--------+
748	  /               Spectrum           /
749	 +----------------------------------+

751	     Figure 10 Spectrum decoding loop

753	Although the codec has no lower bitrate limit, the compression scheme
754	produces an artificial-sounding reconstructed signal if the transmission
755	rate is below 16-24 kbps. At low bitrates the presented audio codec is
756	used together with the speech codec and processes its residual.

758	5.  Security Considerations

760	   To Be Defined.

762	6.  Informative References

764	   [SILK] SILK Speech Codec Draft, https://developer.skype.com/silk?
765	          action=AttachFile&do=get&target=draft-vos-silk-00.txt

767	7. IANA Considerations

769	   This document has no actions for IANA.

771	Authors' Addresses

773	   Vladimir Sviridenko
774	   SPIRIT DSP
775	   Solzhenitsina 27
776	   Moscow  109004
777	   Russia

779	   Phone: +7 495 661 2178
780	   Email: vladimirs@spiritdsp.com

782	   Sergey Ikonin
783	   SPIRIT DSP
784	   Solzhenitsina 27
785	   Moscow  109004
786	   Russia

788	   Phone: +7 495 661 2178
789	   Email: s.ikonin@gmail.com

791	   Dmitry Yudin
792	   SPIRIT DSP
793	   Solzhenitsina 27
794	   Moscow  109004
795	   Russia

797	   Phone: +7 495 661 2178
798	   Email: yudin@spiritdsp.com

800	Person & email address to contact for further information:
801	   Yury Morzeev
802	   morzeev@spiritdsp.com