idnits 2.17.1 

draft-ietf-avt-rtp-dsr-codecs-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3667, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 806.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 783.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 790.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 796.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document. 
     That text should be removed or replaced:

        By submitting this Internet-Draft, I certify that any applicable patent
        or other IPR claims of which I am aware have been disclosed, or
        will be disclosed, and any of which I become aware will be
        disclosed, in accordance with RFC 3668.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 82 instances of too long lines in the document, the longest
     one being 1 character in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 17, 2004) is 7246 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: '4' is defined on line 716, but no explicit reference
     was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  ** Obsolete normative reference: RFC 2327 (ref. '6') (Obsoleted by RFC 4566)

  ** Obsolete normative reference: RFC 3267 (ref. '8') (Obsoleted by RFC 4867)


     Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Audio Video Transport WG                                          Q. Xie
3	Internet-Draft                                                 D. Pearce
4	Expires: December 16, 2004                                      Motorola
5	                                                            June 17, 2004

7	      RTP Payload Formats for European Telecommunications Standards
8	  Institute (ETSI) European Standard ES 202 050, ES 202 211, and ES 202
9	               212 Distributed Speech Recognition Encoding
10	                   draft-ietf-avt-rtp-dsr-codecs-03.txt

12	Status of this Memo

14	    By submitting this Internet-Draft, I certify that any applicable
15	    patent or other IPR claims of which I am aware have been disclosed,
16	    and any of which I become aware will be disclosed, in accordance with
17	    RFC 3668.

19	    Internet-Drafts are working documents of the Internet Engineering
20	    Task Force (IETF), its areas, and its working groups.  Note that
21	    other groups may also distribute working documents as
22	    Internet-Drafts.

24	    Internet-Drafts are draft documents valid for a maximum of six months
25	    and may be updated, replaced, or obsoleted by other documents at any
26	    time.  It is inappropriate to use Internet-Drafts as reference
27	    material or to cite them other than as "work in progress."

29	    The list of current Internet-Drafts can be accessed at
30	    http://www.ietf.org/ietf/1id-abstracts.txt.

32	    The list of Internet-Draft Shadow Directories can be accessed at
33	    http://www.ietf.org/shadow.html.

35	    This Internet-Draft will expire on December 16, 2004.

37	Abstract

39	    This document specifies RTP payload formats for encapsulating ETSI
40	    Standard ES 202 050 DSR Advanced Front-end (AFE), ES 202 211 DSR
41	    Extended Front-end (XFE), and ES 202 212 DSR Extended Advanced
42	    Front-end (XAFE) signal processing feature streams for distributed
43	    speech recognition (DSR) systems.

45	Table of Contents

47	    1.  Conventions  . . . . . . . . . . . . . . . . . . . . . . . . .  3
48	    2.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
49	      2.1   ETSI ES 202 050 Advanced DSR Front-end Codec . . . . . . .  4
50	      2.2   ETSI ES 202 211 Extended DSR Front-end Codec . . . . . . .  4
51	      2.3   ETSI ES 202 212 Extended Advanced DSR Front-end Codec  . .  5
52	    3.  DSR RTP Payload Formats  . . . . . . . . . . . . . . . . . . .  6
53	      3.1   Common Considerations of the Three DSR RTP Payload
54	            Formats  . . . . . . . . . . . . . . . . . . . . . . . . .  6
55	        3.1.1   Number of FPs in Each RTP Packet . . . . . . . . . . .  6
56	        3.1.2   Support for Discontinuous Transmission . . . . . . . .  6
57	        3.1.3   RTP header usage . . . . . . . . . . . . . . . . . . .  6
58	      3.2   Payload Format for ES 202 050 DSR  . . . . . . . . . . . .  7
59	        3.2.1   Frame Pair Formats . . . . . . . . . . . . . . . . . .  7
60	      3.3   Payload Format for ES 202 211 DSR  . . . . . . . . . . . .  9
61	        3.3.1   Frame Pair Formats . . . . . . . . . . . . . . . . . .  9
62	      3.4   Payload Format for ES 202 212 DSR  . . . . . . . . . . . . 11
63	        3.4.1   Frame Pair Formats . . . . . . . . . . . . . . . . . . 11
64	    4.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 14
65	      4.1   Mapping MIME Parameters into SDP . . . . . . . . . . . . . 15
66	      4.2   Usage in Offer/Answer  . . . . . . . . . . . . . . . . . . 16
67	    5.  Security Considerations  . . . . . . . . . . . . . . . . . . . 16
68	    6.  Acknowledgments  . . . . . . . . . . . . . . . . . . . . . . . 16
69	    7.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
70	    7.1   Normative References . . . . . . . . . . . . . . . . . . . . 16
71	    7.2   Informative References . . . . . . . . . . . . . . . . . . . 17
72	        Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 17
73	        Intellectual Property and Copyright Statements . . . . . . . . 19

75	1.  Conventions

77	    The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
78	    SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
79	    they appear in this document, are to be interpreted as described in
80	    RFC 2119 [5].

82	    The following acronyms are used in this document:

84	       DSR  - Distributed Speech Recognition
85	       ETSI - the European Telecommunications Standards Institute
86	       FP   - Frame Pair
87	       DTX  - Discontinuous Transmission
88	       VAD  - Voice Activity Detection

90	2.  Introduction

92	    Distributed speech recognition (DSR) technology is intended for a
93	    remote device acting as a thin client, also known as the front-end,
94	    to communicate with a speech recognition server, also called a speech
95	    engine, over a network connection to obtain speech recognition
96	    services.  More details on DSR over Internet can be found in RFC 3557
97	    [11].

99	    To achieve interoperability with different client devices and speech
100	    engines, the first ETSI standard DSR front-end ES 201 108 was
101	    published in early 2000 [12], and an RTP packetization for ES 201 108
102	    frames is defined in RFC 3557 [11] by IETF.

104	    In ES 202 050 [1], ETSI issued another standard for an Advanced DSR
105	    front-end that provides substantially improved recognition
106	    performance when background noise is present.  The codec in ES 202
107	    050 uses a slightly different frame format from that of ES 201 108
108	    and thus the two do not interoperate.

110	    The RTP packetization for ES 202 050 front-end defined in this
111	    document uses the same RTP packet format layout as that defined in
112	    RFC 3557 [11].  The differences are in the DSR codec frame bit
113	    definition and the payload type MIME registration.

115	    Two further standards, ES 202 211 and ES 202 212, provide
116	    extensions to each of the DSR front-end standards.  The extensions
117	    allow the speech waveform to be reconstructed for human audition and
118	    can also be used to improve recognition performance for tonal
119	    languages.  This is done by sending additional pitch and voicing
120	    information for each frame along with the recognition features.

122	    The RTP packet formats for these extended standards are also defined
123	    in this document.

125	    It is worthwhile to note that the performance of most speech
126	    recognizers is extremely sensitive to consecutive frame losses and
127	    the DSR speech recognizers are no exception.  If a DSR over RTP
128	    session is expected to endure a high packet loss ratio between the
129	    front-end and the speech engine, one should consider limiting the
130	    maximum number of DSR frames allowed in a packet, or employing other
131	    loss management techniques, such as FEC or interleaving, to minimize
132	    the chance of losing consecutive frames.

134	2.1  ETSI ES 202 050 Advanced DSR Front-end Codec

136	    Some relevant characteristics of ES 202 050 Advanced DSR front-end
137	    codec are summarized below.

139	    The front-end calculation is a frame-based scheme that produces an
140	    output vector every 10 ms.  In the front-end feature extraction,
141	    noise reduction by two stages of Wiener filtering is performed first.
142	    Then, waveform processing is applied to the de-noised signal and
143	    mel-cepstral features are calculated.  At the end, blind equalization
144	    is applied to the cepstral features.  The front-end algorithm
145	    produces at its output a mel-cepstral representation in the same
146	    format as ES 201 108, i.e., 12 cepstral coefficients [C1 - C12], C0 and
147	    log Energy.  Voice activity detection (VAD) for the classification of
148	    each frame as speech or non-speech is also implemented in Feature
149	    Extraction.  The VAD information is included in the payload format
150	    for each frame pair to be sent to the remote recognition engine as
151	    part of the payload.  This information may optionally be used by the
152	    receiving recognition engine to drop non-speech frames.  The
153	    front-end supports three raw sampling rates: 8 kHz, 11 kHz, and 16
154	    kHz.  (Note that, unlike some other speech codecs,
155	    the feature frame size of DSR presented to RTP packetization is not
156	    dependent on the number of speech samples used in each 10 ms sample
157	    frame.  This will become more evident in the following sections.)

159	    After calculation of the mel-cepstral representation, the
160	    representation is first quantized via split-vector quantization to
161	    reduce the data rate of the encoded stream.  Then, the quantized
162	    vectors from two consecutive frames are put into a frame pair (FP),
163	    as described in more detail in Section 3.2 below.

165	2.2  ETSI ES 202 211 Extended DSR Front-end Codec

167	    Some relevant characteristics of ES 202 211 Extended DSR front-end
168	    codec are summarized below.

170	    ES 202 211 is an extension of the mel-cepstrum DSR Front-end standard
171	    ES 201 108 [12].  The mel-cepstrum front-end provides the features
172	    for speech recognition but these are not available for human
173	    listening.  The purpose of the extension is to allow reconstruction
174	    of the speech waveform from these features so that it can be
175	    replayed.  The front-end feature extraction part of the processing is
176	    exactly the same as for ES 201 108.  To allow speech reconstruction,
177	    additional fundamental frequency (perceived as pitch) and voicing
178	    class (e.g., non-speech, voiced, unvoiced and mixed) information is
179	    needed.  This is the extra information that is provided by the
180	    extended front-end processing algorithms at the device side that is
181	    compressed and transmitted along with the front-end features to the
182	    server.  This extra information may also be useful for improved
183	    speech recognition performance with tonal languages such as Mandarin,
184	    Cantonese and Thai.

186	    Full information about the client side signal processing algorithms
187	    used in the standard is given in the specification ES 202 211
188	    [2].

190	    The additional fundamental frequency and voicing class information is
191	    compressed for each frame pair.  The pitch for the first frame of the
192	    FP is quantized to 7 bits and the second frame is differentially
193	    quantized with 5 bits.  The voicing class is indicated with one bit
194	    for each frame.  The total for the extension information for a frame
195	    pair therefore consists of 14 bits plus an additional 2 bits of CRC
196	    error protection computed over these extension bits only.

198	    The total information for the frame pair is made up of 92 bits for
199	    the two compressed front-end feature frames (including 4 bits for
200	    their CRC) plus 16 bits for the extension (including 2 bits for their
201	    CRC) and 4 bits of null padding to give a total of 14 octets per
202	    frame pair.  As for ES 201 108, the extended frame pair also
203	    corresponds to 20ms of speech.  The extended front-end supports three
204	    raw sampling rates: 8 kHz, 11 kHz, and 16 kHz.

206	    The quantized vectors from two consecutive frames are put into an FP,
207	    as described in more detail in Section 3.3 below.

209	    The parameters received at the remote server from the RTP extended
210	    DSR payload specified here can be used to synthesize an intelligible
211	    speech waveform for replay.  The algorithms to do this are described
212	    in the specification ES 202 211 [2].

214	2.3  ETSI ES 202 212 Extended Advanced DSR Front-end Codec

216	    ES 202 212 is the extension for the DSR Advanced Front-end ES 202 050
217	    [1].  It provides the same capabilities as the extended mel-cepstrum
218	    front-end described in Section 2.2 but for the DSR Advanced
219	    Front-end.

221	3.  DSR RTP Payload Formats

223	3.1  Common Considerations of the Three DSR RTP Payload Formats

225	    The three DSR RTP payload formats defined in this document share the
226	    following considerations and behaviors.

228	3.1.1  Number of FPs in Each RTP Packet

230	    Any number of FPs MAY be aggregated together in an RTP payload and
231	    they MUST be consecutive in time.  However, one SHOULD always keep
232	    the RTP payload size smaller than the MTU in order to avoid IP
233	    fragmentation and SHOULD follow the recommendations given in Section
234	    3.1 in RFC 3557 [11] when determining the proper number of FPs in an
235	    RTP payload.
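
As a rough illustration of the MTU constraint above, the following Python
sketch computes the largest number of FPs that fits in one packet without
IP fragmentation. The header sizes are assumptions (IPv4 without options,
UDP, and a 12-octet RTP header with no CSRC entries or header extension),
and the function name is hypothetical. The FP sizes are the ones stated in
Sections 3.2 through 3.4.

```python
# Assumed per-packet overhead in octets (not specified by this document).
IPV4_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8
RTP_HEADER = 12    # fixed RTP header, no CSRC list, no extension

# FP sizes in octets, as given in Sections 3.2, 3.3, and 3.4.
FP_OCTETS = {"ES 202 050": 12, "ES 202 211": 14, "ES 202 212": 14}


def max_fps_per_packet(mtu: int, codec: str) -> int:
    """Largest FP count whose packet still fits within `mtu` octets."""
    room = mtu - (IPV4_HEADER + UDP_HEADER + RTP_HEADER)
    return max(room // FP_OCTETS[codec], 0)


print(max_fps_per_packet(1500, "ES 202 050"))  # 121
print(max_fps_per_packet(1500, "ES 202 211"))  # 104
```

This is only the fragmentation bound. The recommendations in Section 3.1
of RFC 3557 [11] may call for a smaller number of FPs per packet.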

237	3.1.2  Support for Discontinuous Transmission

239	    The same considerations described in Section 3.2 of RFC 3557 [11]
240	    apply to all three DSR RTP payloads defined in this document.

242	3.1.3  RTP header usage

244	    The format of the RTP header is specified in RFC 3550 [9].  The three
245	    payload formats defined here use the fields of the header in a manner
246	    consistent with that specification.

248	    The RTP timestamp corresponds to the sampling instant of the first
249	    sample encoded for the first FP in the packet.  The timestamp clock
250	    frequency is the same as the sampling frequency, so the timestamp
251	    unit is in samples.

253	    As defined by all the three front-end codecs, the duration of one FP
254	    is 20 ms, corresponding to 160, 220, or 320 encoded samples with
255	    sampling rate of 8, 11, or 16 kHz being used at the front-end,
256	    respectively.  Thus, the timestamp is increased by 160, 220, or 320
257	    for each consecutive FP, respectively.
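
The timestamp rule above can be sketched in Python as follows. The
function name is hypothetical. The per-FP increments are the ones stated
in the text, and the wrap at 2^32 reflects the 32-bit RTP timestamp field
of RFC 3550 [9].

```python
# Samples covered by one 20 ms FP at each front-end sampling rate (kHz).
SAMPLES_PER_FP = {8: 160, 11: 220, 16: 320}


def fp_timestamps(first_timestamp: int, rate_khz: int, fp_count: int) -> list:
    """RTP timestamps of `fp_count` consecutive FPs, modulo 2**32."""
    step = SAMPLES_PER_FP[rate_khz]
    return [(first_timestamp + i * step) % 2**32 for i in range(fp_count)]


print(fp_timestamps(1000, 8, 3))   # [1000, 1160, 1320]
print(fp_timestamps(1000, 16, 3))  # [1000, 1320, 1640]
```

Only the first value is carried in the RTP header of a packet. The
remaining values are implied by the position of each FP in the payload.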

259	    The DSR payload for all these three front-end codecs is always an
260	    integral number of octets.  If additional padding is required for
261	    some other purpose, then the P bit in the RTP header may be
262	    set and padding appended as specified in RFC 3550 [9].

264	    The RTP header marker bit (M) MUST be set following the general rules
265	    for audio codecs as defined in Section 4.1 in RFC 3551 [10].

267	    The assignment of an RTP payload type for these three new packet
268	    formats is outside the scope of this document, and will not be
269	    specified here.  It is expected that the RTP profile under which any
270	    of these payload formats is being used will assign a payload type for
271	    this encoding or specify that the payload type is to be bound
272	    dynamically.

274	3.2  Payload Format for ES 202 050 DSR

276	    An ES 202 050 DSR RTP payload datagram uses exactly the same layout
277	    as defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
278	    followed by a DSR payload containing a series of DSR FPs.

280	    The size of each ES 202 050 FP is still 96 bits or 12 octets (defined
281	    in the following sections).  This ensures that a DSR RTP payload will
282	    always end on an octet boundary.

284	3.2.1  Frame Pair Formats

286	3.2.1.1  Format of Speech and Non-speech FPs

288	    The following mel-cepstral frame MUST be used, as defined in [1]:

290	    As defined in [1], pairs of the quantized 10ms mel-cepstral frames
291	    MUST be grouped together and protected with a 4-bit CRC, forming a
292	    92-bit long FP.  At the end, each FP MUST be padded with 4 zeros to
293	    the MSB 4 bits of the last octet in order to make the FP aligned to
294	    the octet boundary.

296	    The following diagram shows a complete ES 202 050 FP:

298	      Frame #1 in FP:
299	      ===============
300	         (MSB)                                     (LSB)
301	           0     1     2     3     4     5     6     7
302	        +-----+-----+-----+-----+-----+-----+-----+-----+
303	        :  idx(2,3) |            idx(0,1)               |    Octet 1
304	        +-----+-----+-----+-----+-----+-----+-----+-----+
305	        :       idx(4,5)        |     idx(2,3) (cont)   :    Octet 2
306	        +-----+-----+-----+-----+-----+-----+-----+-----+
307	        |             idx(6,7)              |idx(4,5)(cont)  Octet 3
308	        +-----+-----+-----+-----+-----+-----+-----+-----+
309	    idx(10,11)| VAD |              idx(8,9)             |    Octet 4
310	        +-----+-----+-----+-----+-----+-----+-----+-----+
311	        :       idx(12,13)      |   idx(10,11) (cont)   :    Octet 5
312	        +-----+-----+-----+-----+-----+-----+-----+-----+
313	                                |   idx(12,13) (cont)   :    Octet 6/1
314	                                +-----+-----+-----+-----+

316	     Frame #2 in FP:
317	     ===============
318	         (MSB)                                     (LSB)
319	           0     1     2     3     4     5     6     7
320	        +-----+-----+-----+-----+
321	        :        idx(0,1)       |                            Octet 6/2
322	        +-----+-----+-----+-----+-----+-----+-----+-----+
323	        |              idx(2,3)             |idx(0,1)(cont)  Octet 7
324	        +-----+-----+-----+-----+-----+-----+-----+-----+
325	        :  idx(6,7) |              idx(4,5)             |    Octet 8
326	        +-----+-----+-----+-----+-----+-----+-----+-----+
327	        :        idx(8,9)       |      idx(6,7) (cont)  :    Octet 9
328	        +-----+-----+-----+-----+-----+-----+-----+-----+
329	        |          idx(10,11)         | VAD |idx(8,9)(cont)  Octet 10
330	        +-----+-----+-----+-----+-----+-----+-----+-----+
331	        |                   idx(12,13)                  |    Octet 11
332	        +-----+-----+-----+-----+-----+-----+-----+-----+

334	     CRC for Frame #1 and Frame #2 and padding in FP:
335	     ================================================
336	         (MSB)                                     (LSB)
337	           0     1     2     3     4     5     6     7
338	        +-----+-----+-----+-----+-----+-----+-----+-----+
339	        |  0  |  0  |  0  |  0  |          CRC          |    Octet 12
340	        +-----+-----+-----+-----+-----+-----+-----+-----+

342	    The 4-bit CRC in the FP MUST be calculated using the formula
343	    (including the bit-order rules) defined in 7.2 in [1].

345	    Therefore, each FP represents 20ms of original speech.  Note, as
346	    shown above, each FP MUST be padded with 4 zeros to the MSB 4 bits of
347	    the last octet in order to make the FP aligned to the octet boundary.
348	    This makes the total size of an FP 96 bits, or 12 octets.  Note, this
349	    padding is separate from padding indicated by the P bit in the RTP
350	    header.
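
The octet layout in the diagram above can be sketched as a small Python
packer. This is an illustration only. The field names and the function are
hypothetical, the field widths are read off the diagram, and the code
assumes that the least significant bit of each field is placed first (the
diagram does not label individual bits, so [1] remains authoritative). The
CRC is taken as an input and is not computed here.

```python
# (name, width in bits) in the order the fields appear in each frame.
FRAME_FIELDS = [
    ("idx01", 6), ("idx23", 6), ("idx45", 6), ("idx67", 6), ("idx89", 6),
    ("vad", 1), ("idx1011", 5), ("idx1213", 8),
]


def pack_fp(frame1: dict, frame2: dict, crc: int) -> bytes:
    """Pack two 44-bit frames, a 4-bit CRC and 4 zero bits into 12 octets."""
    acc = 0     # bit accumulator; bit 0 becomes the LSB of octet 1
    nbits = 0

    def put(value: int, width: int) -> None:
        nonlocal acc, nbits
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        acc |= value << nbits
        nbits += width

    for frame in (frame1, frame2):
        for name, width in FRAME_FIELDS:
            put(frame[name], width)
    put(crc, 4)   # low 4 bits of octet 12
    put(0, 4)     # zero padding in the MSB 4 bits of octet 12
    assert nbits == 96
    return acc.to_bytes(12, "little")


zero = {name: 0 for name, _ in FRAME_FIELDS}
fp = pack_fp(dict(zero, vad=1), dict(zero, vad=1), 0xF)
print(fp.hex())  # 000000400000000000040 00f without the space: see note
```

With only the two VAD flags and all four CRC bits set, octet 4 is 0x40,
octet 10 is 0x04, and octet 12 is 0x0F, which matches the positions of
VAD and CRC in the diagram.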

352	    The indices and the 'VAD' flag are defined in [1], and their values
353	    are only set and examined by the codecs in the front-end client and
354	    the recognizer.

356	3.2.1.2  Format of Null FP

358	    Null FPs are sent to mark the end of a transmission segment.  Details
359	    on transmission segment and the use of Null FPs can be found in RFC
360	    3557 [11].

362	    A Null FP for the ES 202 050 front-end codec is defined by setting
363	    the content of the first and second frame in the FP to null (i.e.,
364	    filling the first 88 bits of the FP with 0's).  The 4-bit CRC MUST be
365	    calculated the same way as described in 7.2.4 in [1], and 4 zeros
366	    MUST be padded to the end of the Null FP to make it octet aligned.

368	3.3  Payload Format for ES 202 211 DSR

370	    An ES 202 211 DSR RTP payload datagram is very similar to that
371	    defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
372	    followed by a DSR payload containing a series of DSR FPs.

374	    The size of each ES 202 211 FP is 112 bits or 14 octets (defined in
375	    the following sections).  This ensures that a DSR RTP payload will
376	    always end on an octet boundary.

378	3.3.1  Frame Pair Formats

380	3.3.1.1  Format of Speech and Non-speech FPs

382	    The following mel-cepstral frame MUST be used, as defined in Section
383	    6.2.4 in [2]:

385	    As defined in Section 6.2.4 in [2], after two frames (Frame #1 and
386	    Frame #2) worth of codebook indices, or 88 bits, a 4-bit CRC
387	    calculated on these 88 bits immediately follows it.  The pitch
388	    indices of the first frame (Pidx1: 7 bits) and the second frame
389	    (Pidx2: 5 bits) of the frame pair then follow.  The class indices of
390	    the two frames in the frame pair worth 1 bit each (Cidx1 and Cidx2)
391	    next follow.  Finally, a 2-bit CRC calculated on the pitch and class
392	    bits (total: 14 bits) of the frame pair is included (PC-CRC).  The
393	    total number of bits in a frame pair is therefore 44 + 44 + 4 +
394	    7 + 5 + 1 + 1 + 2 = 108.  At the end, each FP MUST be padded with 4
395	    zeros to the MSB 4 bits of the last octet in order to make the FP
396	    aligned to the octet boundary.

398	    The following diagram shows a complete ES 202 211 FP:

400	      Frame #1 in FP:
401	      ===============
402	        (MSB)                                     (LSB)
403	          0     1     2     3     4     5     6     7
404	       +-----+-----+-----+-----+-----+-----+-----+-----+
405	       :  idx(2,3) |            idx(0,1)               |    Octet 1
406	       +-----+-----+-----+-----+-----+-----+-----+-----+
407	       :       idx(4,5)        |     idx(2,3) (cont)   :    Octet 2
408	       +-----+-----+-----+-----+-----+-----+-----+-----+
409	       |             idx(6,7)              |idx(4,5)(cont)  Octet 3
410	       +-----+-----+-----+-----+-----+-----+-----+-----+
411	        idx(10,11) |              idx(8,9)             |    Octet 4
412	       +-----+-----+-----+-----+-----+-----+-----+-----+
413	       :       idx(12,13)      |   idx(10,11) (cont)   :    Octet 5
414	       +-----+-----+-----+-----+-----+-----+-----+-----+
415	                               |   idx(12,13) (cont)   :    Octet 6/1
416	                               +-----+-----+-----+-----+

418	     Frame #2 in FP:
419	     ===============
420	        (MSB)                                     (LSB)
421	          0     1     2     3     4     5     6     7
422	       +-----+-----+-----+-----+
423	       :        idx(0,1)       |                            Octet 6/2
424	       +-----+-----+-----+-----+-----+-----+-----+-----+
425	       |              idx(2,3)             |idx(0,1)(cont)  Octet 7
426	       +-----+-----+-----+-----+-----+-----+-----+-----+
427	       :  idx(6,7) |              idx(4,5)             |    Octet 8
428	       +-----+-----+-----+-----+-----+-----+-----+-----+
429	       :        idx(8,9)       |      idx(6,7) (cont)  :    Octet 9
430	       +-----+-----+-----+-----+-----+-----+-----+-----+
431	       |          idx(10,11)               |idx(8,9)(cont)  Octet 10
432	       +-----+-----+-----+-----+-----+-----+-----+-----+
433	       |                   idx(12,13)                  |    Octet 11
434	       +-----+-----+-----+-----+-----+-----+-----+-----+

436	     CRC for Frame #1 and Frame #2 in FP:
437	     ====================================
438	        (MSB)                                     (LSB)
439	          0     1     2     3     4     5     6     7
440	                               +-----+-----+-----+-----+
441	                               |          CRC          |    Octet 12/1
442	                               +-----+-----+-----+-----+

444	     Extension information and padding in FP:
445	     ========================================
446	        (MSB)                                     (LSB)
447	          0     1     2     3     4     5     6     7
448	       +-----+-----+-----+-----+
449	       :       Pidx1           |                            Octet 12/2
450	       +-----+-----+-----+-----+-----+-----+-----+-----+
451	       |            Pidx2            |   Pidx1 (cont)  :    Octet 13
452	       +-----+-----+-----+-----+-----+-----+-----+-----+
453	       |  0  |  0  |  0  |  0  |  PC-CRC   |Cidx2|Cidx1|    Octet 14
454	       +-----+-----+-----+-----+-----+-----+-----+-----+

456	    The 4-bit CRC and the 2-bit PC-CRC in the FP MUST be calculated using
457	    the formula (including the bit-order rules) defined in 6.2.4 in [2].

459	    Therefore, each FP represents 20ms of original speech.  Note, as
460	    shown above, each FP MUST be padded with 4 zeros to the MSB 4 bits of
461	    the last octet in order to make the FP aligned to the octet boundary.
462	    This makes the total size of an FP 112 bits, or 14 octets.  Note,
463	    this padding is separate from padding indicated by the P bit in the
464	    RTP header.
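
To illustrate the layout of octets 12 to 14 shown above, the following
Python sketch extracts the CRC and the extension fields from a 14-octet
FP. The function and key names are hypothetical. The bit offsets follow
from the field order and widths given in this section (88 bits of indices,
then CRC, Pidx1, Pidx2, Cidx1, Cidx2, PC-CRC, padding), assuming that the
least significant bit of each field comes first. Neither CRC is verified
here.

```python
def parse_extension(fp: bytes) -> dict:
    """Extract the CRC and extension fields from one ES 202 211 FP."""
    if len(fp) != 14:
        raise ValueError("an ES 202 211 FP is exactly 14 octets")
    bits = int.from_bytes(fp, "little")  # bit 0 is the LSB of octet 1

    def take(offset: int, width: int) -> int:
        return (bits >> offset) & ((1 << width) - 1)

    return {
        "crc": take(88, 4),       # octet 12, low 4 bits
        "pidx1": take(92, 7),     # octet 12 high 4 bits + octet 13 low 3
        "pidx2": take(99, 5),     # octet 13, high 5 bits
        "cidx1": take(104, 1),    # octet 14, LSB
        "cidx2": take(105, 1),
        "pc_crc": take(106, 2),
        "padding": take(108, 4),  # expected to be zero
    }


fields = parse_extension(bytes(11) + bytes([0xF5, 0xAB, 0x0E]))
print(fields["pidx1"], fields["pidx2"], fields["pc_crc"])  # 63 21 3
```

A receiver following this sketch would still need to check both CRCs with
the formula in [2] before using the extracted values.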

466	3.3.1.2  Format of Null FP

468	    A Null FP for the ES 202 211 front-end codec is defined by setting
469	    all 112 bits of the FP to 0.  Null FPs are sent to mark the
470	    end of a transmission segment.  Details on transmission segment and
471	    the use of Null FPs can be found in RFC 3557 [11].
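
Because an ES 202 211 Null FP is simply 14 zero octets, a receiver can
locate the end of a transmission segment with a direct comparison. The
Python sketch below is illustrative. The function name is hypothetical,
and it assumes `payload` is the DSR payload with the RTP header and any
RTP padding already removed.

```python
FP_OCTETS = 14               # size of one ES 202 211 FP (Section 3.3)
NULL_FP = bytes(FP_OCTETS)   # 112 zero bits


def first_null_fp(payload: bytes) -> int:
    """Index of the first Null FP in `payload`, or -1 if there is none."""
    if len(payload) % FP_OCTETS != 0:
        raise ValueError("payload is not a whole number of 14-octet FPs")
    for i in range(len(payload) // FP_OCTETS):
        if payload[i * FP_OCTETS:(i + 1) * FP_OCTETS] == NULL_FP:
            return i
    return -1


speech_fp = bytes([1]) * FP_OCTETS   # stand-in for a non-null FP
print(first_null_fp(speech_fp * 2 + NULL_FP))  # 2
```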

473	3.4  Payload Format for ES 202 212 DSR

475	    Similar to other ETSI DSR front-end encoding schemes, the encoded DSR
476	    feature stream of ES 202 212 is transmitted in a sequence of frame
477	    pairs (FPs), where each FP represents two consecutive original voice
478	    frames.

480	    An ES 202 212 DSR RTP payload datagram is very similar to that
481	    defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
482	    followed by a DSR payload containing a series of DSR FPs.

484	    The size of each ES 202 212 FP is 112 bits or 14 octets (defined in
485	    the following sections).  This ensures that an ES 202 212 DSR RTP
486	    payload will always end on an octet boundary.

488	3.4.1  Frame Pair Formats
489	3.4.1.1  Format of Speech and Non-speech FPs

491	    The following mel-cepstral frame MUST be used, as defined in Section
492	    7.2.4 in [3]:

494	    As defined in Section 7.2.4 in [3], after two frames (Frame #1 and
495	    Frame #2) worth of codebook indices, or 88 bits, a 4-bit CRC
496	    calculated on these 88 bits immediately follows it.  The pitch
497	    indices of the first frame (Pidx1: 7 bits) and the second frame
498	    (Pidx2: 5 bits) of the frame pair then follow.  The class indices of
499	    the two frames in the frame pair worth 1 bit each next follow (Cidx1
500	    and Cidx2).  Finally, a 2-bit CRC (PC-CRC) calculated on the pitch
501	    and class bits (total: 14 bits) of the frame pair is included.  The
502	    total number of bits in a frame pair is therefore 44 + 44 + 4 +
503	    7 + 5 + 1 + 1 + 2 = 108.  At the end, each FP MUST be padded with 4
504	    zeros to the MSB 4 bits of the last octet in order to make the FP
505	    aligned to the octet boundary.  The padding brings the total size of
506	    a FP to 112 bits, or 14 octets.  Note, this padding is separate from
507	    padding indicated by the P bit in the RTP header.

509	    The following diagram shows a complete ES 202 212 FP:

511	      Frame #1 in FP:
512	      ===============
513	         (MSB)                                     (LSB)
514	           0     1     2     3     4     5     6     7
515	        +-----+-----+-----+-----+-----+-----+-----+-----+
516	        :  idx(2,3) |            idx(0,1)               |    Octet 1
517	        +-----+-----+-----+-----+-----+-----+-----+-----+
518	        :       idx(4,5)        |     idx(2,3) (cont)   :    Octet 2
519	        +-----+-----+-----+-----+-----+-----+-----+-----+
520	        |             idx(6,7)              |idx(4,5)(cont)  Octet 3
521	        +-----+-----+-----+-----+-----+-----+-----+-----+
522	    idx(10,11)| VAD |              idx(8,9)             |    Octet 4
523	        +-----+-----+-----+-----+-----+-----+-----+-----+
524	        :       idx(12,13)      |   idx(10,11) (cont)   :    Octet 5
525	        +-----+-----+-----+-----+-----+-----+-----+-----+
526	                                |   idx(12,13) (cont)   :    Octet 6/1
527	                                +-----+-----+-----+-----+

529	     Frame #2 in FP:
530	     ===============
531	         (MSB)                                     (LSB)
532	           0     1     2     3     4     5     6     7
533	        +-----+-----+-----+-----+
534	        :        idx(0,1)       |                            Octet 6/2
535	        +-----+-----+-----+-----+-----+-----+-----+-----+
536	        |              idx(2,3)             |idx(0,1)(cont)  Octet 7
537	        +-----+-----+-----+-----+-----+-----+-----+-----+
538	        :  idx(6,7) |              idx(4,5)             |    Octet 8
539	        +-----+-----+-----+-----+-----+-----+-----+-----+
540	        :        idx(8,9)       |      idx(6,7) (cont)  :    Octet 9
541	        +-----+-----+-----+-----+-----+-----+-----+-----+
542	        |          idx(10,11)         | VAD |idx(8,9)(cont)  Octet 10
543	        +-----+-----+-----+-----+-----+-----+-----+-----+
544	        |                   idx(12,13)                  |    Octet 11
545	        +-----+-----+-----+-----+-----+-----+-----+-----+

547	     CRC for Frame #1 and Frame #2 in FP:
548	     ====================================
549	         (MSB)                                     (LSB)
550	           0     1     2     3     4     5     6     7
551	                                +-----+-----+-----+-----+
552	                                |          CRC          |    Octet 12/1
553	                                +-----+-----+-----+-----+

555	     Extension information and padding in FP:
556	     ========================================
557	         (MSB)                                     (LSB)
558	           0     1     2     3     4     5     6     7
559	        +-----+-----+-----+-----+
560	        :       Pidx1           |                            Octet 12/2
561	        +-----+-----+-----+-----+-----+-----+-----+-----+
562	        |            Pidx2            |   Pidx1 (cont)  :    Octet 13
563	        +-----+-----+-----+-----+-----+-----+-----+-----+
564	        |  0  |  0  |  0  |  0  |  PC-CRC   |Cidx2|Cidx1|    Octet 14
565	        +-----+-----+-----+-----+-----+-----+-----+-----+

567	    The codebook indices, VAD flag, pitch index, and class index are
568	    specified in Section 6 of [3].  The 4-bit CRC and the 2-bit PC-CRC in
569	    the FP MUST be calculated using the formula (including the bit-order
570	    rules) defined in Section 7.2.4 of [3].
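
The octet layout in the diagram above can be sketched as a simple bit
packer.  This is a non-normative illustration under stated assumptions:
fields are written in the order shown in the diagram, starting at the
LSB of each octet, and the bits of each field are assumed to be placed
least-significant bit first (the authoritative bit order is the one
given in [3]).  The CRC and PC-CRC values are taken as inputs rather
than computed.  All names (FRAME_FIELDS, TAIL_FIELDS, pack_fp and the
dictionary keys) are hypothetical.

```python
# Field widths in bits, in packing order, read off the FP diagram.
FRAME_FIELDS = [
    ("idx0_1", 6), ("idx2_3", 6), ("idx4_5", 6), ("idx6_7", 6),
    ("idx8_9", 6), ("vad", 1), ("idx10_11", 5), ("idx12_13", 8),
]  # 44 bits per frame
TAIL_FIELDS = [
    ("crc", 4), ("pidx1", 7), ("pidx2", 5),
    ("cidx1", 1), ("cidx2", 1), ("pc_crc", 2),
]  # 20 bits; the final 4 padding bits stay zero


def pack_fp(frame1: dict, frame2: dict, tail: dict) -> bytes:
    """Pack two frames plus CRC/pitch/class fields into a 14-octet FP."""
    acc = 0  # bit accumulator; bit 0 is the LSB of octet 1
    pos = 0  # number of bits written so far

    def put(value: int, width: int) -> None:
        nonlocal acc, pos
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        acc |= value << pos  # assumed: field LSB goes first
        pos += width

    for frame in (frame1, frame2):
        for name, width in FRAME_FIELDS:
            put(frame[name], width)
    for name, width in TAIL_FIELDS:
        put(tail[name], width)

    assert pos == 108  # 44 + 44 + 4 + 7 + 5 + 1 + 1 + 2
    # Little-endian conversion puts bits 0..7 in octet 1, and leaves the
    # 4 MSBs of octet 14 as zero padding.
    return acc.to_bytes(14, "little")
```

With every field set to its maximum value, the first 13 octets are 0xFF
and the last octet is 0x0F, which shows the 4 padding zeros in the MSBs
of octet 14.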

572	3.4.1.2  Format of Null FP

574	    A Null FP for the ES 202 212 front-end codec is defined by setting
575	    all 112 bits of the FP to 0.  Null FPs are sent to mark the
576	    end of a transmission segment.  Details on transmission segments and
577	    the use of Null FPs can be found in RFC 3557 [11].
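
A receiver can recognize a Null FP by comparing the 14 octets against
all zeros.  The following minimal sketch uses hypothetical names
(NULL_FP, is_null_fp) and assumes the FP has already been extracted
from the payload.

```python
FP_SIZE_OCTETS = 14               # 112 bits per ES 202 212 frame pair
NULL_FP = bytes(FP_SIZE_OCTETS)   # 14 zero octets


def is_null_fp(fp: bytes) -> bool:
    """Return True if fp is a Null FP (all 112 bits set to 0)."""
    if len(fp) != FP_SIZE_OCTETS:
        raise ValueError("an ES 202 212 FP is exactly 14 octets long")
    return fp == NULL_FP
```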

579	4.  IANA Considerations

581	    For each of the three ETSI DSR front-end codecs covered in this
582	    document, a new MIME subtype registration is required for the
583	    corresponding payload type, as described below.

585	    Media Type name: audio

587	    Media subtype names:

589	          dsr-es202050 (for ES 202 050 front-end)

591	          dsr-es202211 (for ES 202 211 front-end)

593	          dsr-es202212 (for ES 202 212 front-end)

595	    Required parameters: none

597	    Optional parameters:

599	    rate: Indicates the sample rate of the speech.  Valid values include:
600	       8000, 11000, and 16000.  If this parameter is not present, a
601	       sample rate of 8000 is assumed.

603	    maxptime: see RFC 3267 [8].  If this parameter is not present,
604	       maxptime is assumed to be 80ms.

606	       Note: since the performance of most speech recognizers is
607	       extremely sensitive to consecutive FP losses, if the user of the
608	       payload format expects a high packet loss ratio for the session,
609	       it MAY explicitly choose a maxptime value for the
610	       session that is shorter than the default value.

612	    ptime: see RFC 2327 [6].

614	    Encoding considerations: These types are defined for transfer via RTP
615	       [9] as described in Section 3 of RFC XXXX.

617	    Security considerations: See Section 5 of RFC XXXX.

619	    Person & email address to contact for further information:
620	       Qiaobing.Xie@motorola.com

622	    Intended usage: COMMON.  It is expected that many VoIP applications
623	       (as well as mobile applications) will use this type.

625	    Author/Change controller:

627	       *  Qiaobing.Xie@motorola.com

629	       *  IETF Audio/Video transport working group

631	4.1  Mapping MIME Parameters into SDP

633	    The information carried in the MIME media type specification has a
634	    specific mapping to fields in the Session Description Protocol (SDP)
635	    [6], which is commonly used to describe RTP sessions.  When SDP is
636	    used to specify sessions employing the ES 202 050, ES 202 211, or
637	    ES 202 212 DSR codec, the mapping is as follows:

639	    o  The MIME type ("audio") goes in SDP "m=" as the media name.

641	    o  The MIME subtype ("dsr-es202050", "dsr-es202211", or
642	       "dsr-es202212") goes in SDP "a=rtpmap" as the encoding name.

644	    o  The optional parameter "rate" also goes in "a=rtpmap" as clock
645	       rate.  If no rate is given, then the default value (i.e., 8000) is
646	       used in SDP.

648	    o  The optional parameters "ptime" and "maxptime" go in the SDP
649	       "a=ptime" and "a=maxptime" attributes, respectively.

651	    Example of usage of ES 202 050 DSR:

653	      m=audio 49120 RTP/AVP 101
654	      a=rtpmap:101 dsr-es202050/8000
655	      a=maxptime:40

657	    Example of usage of ES 202 211 DSR:

659	      m=audio 49120 RTP/AVP 101
660	      a=rtpmap:101 dsr-es202211/8000
661	      a=maxptime:40

663	    Example of usage of ES 202 212 DSR:

665	      m=audio 49120 RTP/AVP 101
666	      a=rtpmap:101 dsr-es202212/8000
667	      a=maxptime:40

669	4.2  Usage in Offer/Answer

671	    All SDP parameters in this payload format are declarative, and all
672	    reasonable values are expected to be supported.  Thus, the standard
673	    usage of Offer/Answer as described in RFC 3264 [7] should be
674	    followed.

676	5.  Security Considerations

678	    Implementations using the payload defined in this specification are
679	    subject to the security considerations discussed in the RTP
680	    specification RFC 3550 [9] and any RTP profile, e.g., RFC 3551 [10].
681	    This payload does not specify any different security services.

683	    Congestion control for RTP MUST be used in accordance with RFC 3550
684	    [9], and any applicable RTP profile, e.g., RFC 3551 [10].

686	6.  Acknowledgments

688	    The design presented here is based on that of RFC 3557 [11].  The
689	    authors wish to thank Magnus Westerlund and others for their review
690	    and comments.

692	7.  References

694	7.1  Normative References

696	    [1]   European Telecommunications Standards Institute (ETSI) Standard
697	          ES 202 050, "Speech Processing, Transmission and Quality
698	          Aspects (STQ); Distributed Speech Recognition; Front-end
699	          Feature Extraction Algorithm; Compression Algorithms",
700	          (http://pda.etsi.org/pda/), October 2002.

702	    [2]   European Telecommunications Standards Institute (ETSI) Standard
703	          ES 202 211, "Speech Processing, Transmission and Quality
704	          Aspects (STQ); Distributed Speech Recognition; Extended
705	          front-end feature extraction algorithm; Compression algorithms;
706	          Back-end speech reconstruction algorithm",
707	          (http://pda.etsi.org/pda/), November 2003.

709	    [3]   European Telecommunications Standards Institute (ETSI) Standard
710	          ES 202 212, "Speech Processing, Transmission and Quality
711	          aspects (STQ); Distributed speech recognition; Extended
712	          advanced front-end feature extraction algorithm; Compression
713	          algorithms; Back-end speech reconstruction algorithm",
714	          (http://pda.etsi.org/pda/), November 2003.

716	    [4]   Bradner, S., "The Internet Standards Process -- Revision 3",
717	          BCP 9, RFC 2026, October 1996.

719	    [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
720	          Levels", BCP 14, RFC 2119, March 1997.

722	    [6]   Handley, M. and V. Jacobson, "SDP: Session Description
723	          Protocol", RFC 2327, April 1998.

725	    [7]   Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with
726	          the Session Description Protocol (SDP)", RFC 3264, June 2002.

728	    [8]   Sjoberg, J., Westerlund, M., Lakaniemi, A. and Q. Xie,
729	          "Real-Time Transport Protocol (RTP) Payload Format and File
730	          Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive
731	          Multi-Rate Wideband (AMR-WB) Audio Codecs", RFC 3267, June
732	          2002.

734	    [9]   Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
735	          "RTP: A Transport Protocol for Real-Time Applications", RFC
736	          3550, July 2003.

738	    [10]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
739	          Conferences with Minimal Control", RFC 3551, July 2003.

741	    [11]  Xie, Q., "RTP Payload Format for European Telecommunications
742	          Standards Institute (ETSI) European Standard ES 201 108
743	          Distributed Speech Recognition Encoding", RFC 3557, July 2003.

745	7.2  Informative References

747	    [12]  European Telecommunications Standards Institute (ETSI) Standard
748	          ES 201 108, "Speech Processing, Transmission and Quality
749	          Aspects (STQ); Distributed Speech Recognition; Front-end
750	          Feature Extraction Algorithm; Compression Algorithms",
751	          (http://webapp.etsi.org/pda/), April 2000.

753	Authors' Addresses

755	    Qiaobing Xie
756	    Motorola, Inc.
757	    1501 W. Shure Drive, 2-F9
758	    Arlington Heights, IL  60004
759	    US

761	    Phone: +1-847-632-3028
762	    EMail: qxie1@email.mot.com
763	    David Pearce
764	    Motorola Labs
765	    UK Research Laboratory
766	    Jays Close
767	    Viables Industrial Estate
768	    Basingstoke, HANTS  RG22 4PD
769	    UK

771	    Phone: +44 (0)1256 484 436
772	    EMail: bdp003@motorola.com

774	Intellectual Property Statement

776	    The IETF takes no position regarding the validity or scope of any
777	    Intellectual Property Rights or other rights that might be claimed to
778	    pertain to the implementation or use of the technology described in
779	    this document or the extent to which any license under such rights
780	    might or might not be available; nor does it represent that it has
781	    made any independent effort to identify any such rights.  Information
782	    on the procedures with respect to rights in RFC documents can be
783	    found in BCP 78 and BCP 79.

785	    Copies of IPR disclosures made to the IETF Secretariat and any
786	    assurances of licenses to be made available, or the result of an
787	    attempt made to obtain a general license or permission for the use of
788	    such proprietary rights by implementers or users of this
789	    specification can be obtained from the IETF on-line IPR repository at
790	    http://www.ietf.org/ipr.

792	    The IETF invites any interested party to bring to its attention any
793	    copyrights, patents or patent applications, or other proprietary
794	    rights that may cover technology that may be required to implement
795	    this standard.  Please address the information to the IETF at
796	    ietf-ipr@ietf.org.

798	Disclaimer of Validity

800	    This document and the information contained herein are provided on an
801	    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
802	    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
803	    ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
804	    INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
805	    INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
806	    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

808	Full Copyright Statement

810	    Copyright (C) The Internet Society (2004).  This document is subject
811	    to the rights, licenses and restrictions contained in BCP 78, and
812	    except as set forth therein, the authors retain all their rights.

814	Acknowledgment

816	    Funding for the RFC Editor function is currently provided by the
817	    Internet Society.