idnits 2.17.1 

draft-ietf-avt-dsr-es202050-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 17, 2003) is 7498 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: '2' is defined on line 369, but no explicit reference
     was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  ** Obsolete normative reference: RFC 2327 (ref. '4') (Obsoleted by RFC 4566)


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Audio Video Transport WG                                          Q. Xie
3	Internet-Draft                                                 D. Pearce
4	Expires: April 16, 2004                                         Motorola
5	                                                        October 17, 2003

7	      RTP Payload Format for European Telecommunications Standards
8	   Institute (ETSI) European Standard ES 202 050 Distributed  Speech
9	                          Recognition Encoding
10	                   draft-ietf-avt-dsr-es202050-01.txt

12	Status of this Memo

14	   This document is an Internet-Draft and is in full conformance with
15	   all provisions of Section 10 of RFC2026.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups. Note that other
19	   groups may also distribute working documents as Internet-Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time. It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on April 16, 2004.

34	Copyright Notice

36	   Copyright (C) The Internet Society (2003). All Rights Reserved.

38	Abstract

40	   This document specifies an RTP payload format for encapsulating ETSI
41	   Standard ES 202 050 advanced front-end signal processing feature
42	   streams for distributed speech recognition (DSR) systems.

44	Table of Contents

46	   1.  Conventions
47	   2.  Introduction
48	   2.1 ETSI ES 202 050 Advanced DSR Front-end Codec
49	   3.  ES 202 050 DSR RTP Payload Format
50	   3.1 Consideration on Number of FPs in Each RTP Packet
51	   3.2 Support for Discontinuous Transmission
52	   4.  Frame Pair Formats
53	   4.1 Format of Speech and Non-speech FPs
54	   4.2 Format of Null FP
55	   4.3 RTP header usage
56	   5.  IANA Considerations
57	   5.1 Mapping MIME Parameters into SDP
58	   6.  Security Considerations
59	   7.  Acknowledgments
60	       Normative References
61	       Informative References
62	       Authors' Addresses
63	       Intellectual Property and Copyright Statements

65	1. Conventions

67	   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
68	   SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
69	   they appear in this document, are to be interpreted as described in
70	   [3].

72	   The following acronyms are used in this document:

74	      DSR  - Distributed Speech Recognition
75	      ETSI - the European Telecommunications Standards Institute
76	      FP   - Frame Pair
77	      DTX  - Discontinuous Transmission
78	      VAD  - Voice Activity Detection

80	2. Introduction

82	   Distributed speech recognition (DSR) technology is intended for a
83	   remote device acting as a thin client, also known as the front-end,
84	   to communicate with a speech recognition server, also called a speech
85	   engine, over a network connection to obtain speech recognition
86	   services. More details on DSR over Internet can be found in [7].

88	   To achieve interoperability with different client devices and speech
89	   engines, the first ETSI standard DSR front-end ES 201 108 was
90	   published in early 2000 [8], and an RTP packetization for ES 201 108
91	   frames is defined in [7] in IETF.

93	   In ES 202 050 [1], ETSI issues another standard for an Advanced DSR
94	   front-end that provides substantially improved recognition
95	   performance when background noise is present. The codec in ES 202
96	   050 uses a slightly different frame format from that of ES 201 108
97	   and thus the two do not inter-operate with each other.

99	   The RTP packetization for ES 202 050 front-end defined in this
100	   document uses the same RTP packet format layout as that defined in
101	   [7]. The differences are in the DSR codec frame bit definition and
102	   the payload type MIME registration.

104	2.1 ETSI ES 202 050 Advanced DSR Front-end Codec

106	   Some relevant characteristics of ES 202 050 Advanced DSR front-end
107	   codec are summarized below.

109	   The front-end calculation is a frame-based scheme that produces an
110	   output vector every 10 ms. In the front-end feature extraction, noise
111	   reduction by two stages of Wiener filtering is performed first. Then,
112	   waveform processing is applied to the de-noised signal and
113	   mel-cepstral features are calculated. At the end, blind equalization
114	   is applied to the cepstral features. The front-end algorithm produces
115	   at its output a mel-cepstral representation in the same format as ES
116	   201 108, i.e., 12 cepstral coefficients [C1 - C12], C0 and log Energy.
117	   Voice activity detection (VAD) for the classification of each frame as
118	   speech or non-speech is also implemented in Feature Extraction. The
119	   VAD information is included in the payload format for each frame pair
120	   to be sent to the remote recognition engine as part of the payload.
121	   This information may optionally be used by the receiving recognition
122	   engine to drop non-speech frames. The front-end supports three raw
123	   sampling rates: 8 kHz, 11 kHz, and 16 kHz. Note that, unlike some
124	   other speech codecs, the feature frame size that DSR presents to RTP
125	   packetization does not depend on the number of speech samples used
126	   in each 10 ms sample frame. This will become more evident in the
127	   following sections.

129	   After calculation of the mel-cepstral representation, the
130	   representation is first quantized via split-vector quantization to
131	   reduce the data rate of the encoded stream. Then, the quantized
132	   vectors from two consecutive frames are put into an FP, as described
133	   in more detail in Section 4.1 below.

135	3. ES 202 050 DSR RTP Payload Format

137	   An ES 202 050 DSR RTP payload datagram uses exactly the same layout
138	   as defined in Section 3 of [7], i.e., a standard RTP header followed
139	   by a DSR payload containing a series of DSR FPs.

141	   The size of each ES 202 050 FP is still 96 bits or 12 octets (see
142	   Section 4 below). This ensures that a DSR RTP payload will always
143	   end on an octet boundary.
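
   Because every FP occupies exactly 12 octets, a receiver can recover
   the individual FPs by slicing the payload at fixed offsets. The
   following Python fragment is a minimal sketch of that step, not part
   of the payload format. The names FP_OCTETS and split_frame_pairs are
   hypothetical, and the input is assumed to be the DSR payload with the
   RTP header and any RTP padding already removed.

```python
from typing import List

FP_OCTETS = 12  # 96 bits per frame pair, as stated in the text above


def split_frame_pairs(payload: bytes) -> List[bytes]:
    """Slice a DSR payload into consecutive 12-octet frame pairs.

    Assumes the RTP header and any RTP padding were stripped by the caller.
    """
    if len(payload) % FP_OCTETS != 0:
        raise ValueError("DSR payload length must be a multiple of 12 octets")
    return [payload[i:i + FP_OCTETS] for i in range(0, len(payload), FP_OCTETS)]
```

   A payload whose length is not a multiple of 12 octets cannot be a
   well-formed series of FPs, so the sketch rejects it rather than
   guessing where the FP boundaries lie.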

145	3.1 Consideration on Number of FPs in Each RTP Packet

147	   The same considerations described in Section 3.1 of [7] apply to the
148	   ES 202 050 RTP payload.

150	3.2 Support for Discontinuous Transmission

152	   The same considerations described in Section 3.2 of [7] apply to the
153	   ES 202 050 RTP payload.

155	4. Frame Pair Formats

157	4.1 Format of Speech and Non-speech FPs

159	   The following mel-cepstral frame MUST be used, as defined in [1]:

161	   As defined in [1], pairs of the quantized 10ms mel-cepstral frames
162	   MUST be grouped together and protected with a 4-bit CRC, forming a
163	   92-bit long FP:

165	    0                   1                   2                   3
166	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
167	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
168	   |                                                               |
169	   +                                                               +
170	   |         Frame Pair (88 bits) = Frame #1 + Frame #2            |
171	   +                                               +-+-+-+-+-+-+-+-+
172	   |                                               | CRC   |0|0|0|0|
173	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

175	   Here Frame #1 and Frame #2 above MUST use the following mel-cepstral
176	   frame formats:

178	     Frame #1 in FP:
179	     ===============
180	        (MSB)                                     (LSB)
181	          0     1     2     3     4     5     6     7
182	       +-----+-----+-----+-----+-----+-----+-----+-----+
183	       :  idx(2,3) |            idx(0,1)               |    Octet 1
184	       +-----+-----+-----+-----+-----+-----+-----+-----+
185	       :       idx(4,5)        |     idx(2,3) (cont)   :    Octet 2
186	       +-----+-----+-----+-----+-----+-----+-----+-----+
187	       |             idx(6,7)              |idx(4,5)(cont)  Octet 3
188	       +-----+-----+-----+-----+-----+-----+-----+-----+
189	   idx(10,11)| VAD |              idx(8,9)             |    Octet 4
190	       +-----+-----+-----+-----+-----+-----+-----+-----+
191	       :       idx(12,13)      |   idx(10,11) (cont)   :    Octet 5
192	       +-----+-----+-----+-----+-----+-----+-----+-----+
193	                               |   idx(12,13) (cont)   :    Octet 6/1
194	                               +-----+-----+-----+-----+

196	    Frame #2 in FP:
197	    ===============
198	        (MSB)                                     (LSB)
199	          0     1     2     3     4     5     6     7
200	       +-----+-----+-----+-----+
201	       :        idx(0,1)       |                            Octet 6/2
202	       +-----+-----+-----+-----+-----+-----+-----+-----+
203	       |              idx(2,3)             |idx(0,1)(cont)  Octet 7
204	       +-----+-----+-----+-----+-----+-----+-----+-----+
205	       :  idx(6,7) |              idx(4,5)             |    Octet 8
206	       +-----+-----+-----+-----+-----+-----+-----+-----+
207	       :        idx(8,9)       |      idx(6,7) (cont)  :    Octet 9
208	       +-----+-----+-----+-----+-----+-----+-----+-----+
209	       |          idx(10,11)         | VAD |idx(8,9)(cont)  Octet 10
210	       +-----+-----+-----+-----+-----+-----+-----+-----+
211	       |                   idx(12,13)                  |    Octet 11
212	       +-----+-----+-----+-----+-----+-----+-----+-----+

214	   The 4-bit CRC in the FP MUST be calculated using the formula
215	   (including the bit-order rules) defined in 7.2 in [1].

217	   Therefore, each FP represents 20ms of original speech. Note, as shown
218	   above, each FP MUST be padded with 4 zeros to the LSB 4 bits of the
219	   last octet in order to make the FP aligned to the 32-bit word
220	   boundary. This makes the total size of an FP 96 bits, or 12 octets.
221	   Note, this padding is separate from padding indicated by the P bit in
222	   the RTP header.
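
   The octet layout in the two frame diagrams can be read as a single
   88-bit stream filled from the least significant bit of each octet
   upward, followed by one octet that carries the CRC in its upper four
   bits and the zero padding in its lower four bits. The Python fragment
   below is a sketch of that reading only. It assumes that the least
   significant bit of each index is written first, which the diagrams
   suggest but do not state; [1] remains authoritative. The names
   FRAME_FIELD_BITS and pack_frame_pair are hypothetical, and the CRC is
   taken as an input because its formula and bit order are defined in
   [1], not here.

```python
# Field widths in bits for one frame, in the order the diagrams fill them:
# idx(0,1), idx(2,3), idx(4,5), idx(6,7), idx(8,9), VAD, idx(10,11), idx(12,13)
FRAME_FIELD_BITS = (6, 6, 6, 6, 6, 1, 5, 8)  # 44 bits per frame


def pack_frame_pair(frame1, frame2, crc):
    """Pack two frames (8 field values each) and a 4-bit CRC into 12 octets.

    Assumption: fields are written LSB first into an LSB-first bit stream.
    The crc argument must already be computed as specified in [1].
    """
    acc = 0    # bit accumulator; bit 0 is the first bit written
    nbits = 0  # number of bits written so far
    for frame in (frame1, frame2):
        if len(frame) != len(FRAME_FIELD_BITS):
            raise ValueError("each frame needs exactly 8 field values")
        for value, width in zip(frame, FRAME_FIELD_BITS):
            if not 0 <= value < (1 << width):
                raise ValueError("field value does not fit its bit width")
            acc |= value << nbits
            nbits += width
    if not 0 <= crc < 16:
        raise ValueError("CRC must fit in 4 bits")
    body = acc.to_bytes(11, "little")  # octets 1..11 of the FP (88 bits)
    last = crc << 4                    # CRC in upper 4 bits, zero padding below
    return body + bytes([last])
```

   Under these assumptions, the VAD flag of Frame #1 lands in bit 6 of
   Octet 4 and the VAD flag of Frame #2 in bit 2 of Octet 10, which
   matches the positions shown in the diagrams.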

224	   The definitions of the indices and the 'VAD' flag are given in [1];
225	   their values are set and examined only by the codecs in the
226	   front-end client and the recognizer.

228	   Any number of FPs MAY be aggregated together in an RTP payload and
229	   they MUST be consecutive in time. However, one SHOULD always keep the
230	   RTP payload size smaller than the MTU in order to avoid IP
231	   fragmentation and SHOULD follow the recommendations given in Section
232	   3.1 in [7] when determining the proper number of FPs in an RTP
233	   payload.

235	4.2 Format of Null FP

237	   A Null FP for the ES 202 050 front-end codec is defined by setting
238	   the content of the first and second frame in the FP to null (i.e.,
239	   filling the first 88 bits of the FP with 0's). The 4-bit CRC MUST be
240	   calculated the same way as described in 7.2.4 in [1], and 4 zeros
241	   MUST be padded to the end of the Null FP to make it 32-bit word
242	   aligned.
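
   The Null FP layout described above can be sketched in Python as
   follows. The function names make_null_fp and is_null_fp are
   hypothetical. The CRC value is passed in by the caller because its
   calculation is defined in [1] and is not reproduced here; the sketch
   only shows where the three parts of the Null FP are placed.

```python
def make_null_fp(crc):
    """Build a Null FP: 88 zero bits, the 4-bit CRC, then 4 zero padding bits.

    The crc argument is assumed to be computed as specified in [1].
    """
    if not 0 <= crc < 16:
        raise ValueError("CRC must fit in 4 bits")
    return bytes(11) + bytes([crc << 4])


def is_null_fp(fp):
    """Return True if the first 88 bits of a 12-octet FP are all zero."""
    return len(fp) == 12 and fp[:11] == bytes(11)
```

   A receiver could use a check like is_null_fp to recognize Null FPs
   before handing the remaining FPs to the recognizer.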

244	4.3 RTP header usage

246	   The format of the RTP header is specified in [5]. This payload format
247	   uses the fields of the header in a manner consistent with that
248	   specification.

250	   The RTP timestamp corresponds to the sampling instant of the first
251	   sample encoded for the first FP in the packet. The timestamp clock
252	   frequency is the same as the sampling frequency, so the timestamp
253	   unit is in samples.

255	   As defined by ES 202 050 front-end codec, the duration of one FP is
256	   20 ms, corresponding to 160, 220, or 320 encoded samples with
257	   sampling rate of 8, 11, or 16 kHz being used at the front-end,
258	   respectively.  Thus, the timestamp is increased by 160, 220, or 320
259	   for each consecutive FP, respectively.
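
   The timestamp arithmetic above can be checked with a short Python
   sketch. The names samples_per_fp and packet_timestamps are
   hypothetical. The sketch assumes continuous transmission with a fixed
   number of FPs per packet, and it wraps the result at 32 bits because
   the RTP timestamp field defined in [5] is 32 bits wide.

```python
FP_DURATION_MS = 20                 # one FP covers 20 ms of speech
VALID_RATES = (8000, 11000, 16000)  # sampling rates named in this document


def samples_per_fp(rate_hz):
    """Timestamp increment for one FP at the given sampling rate."""
    if rate_hz not in VALID_RATES:
        raise ValueError("unsupported sampling rate")
    return rate_hz * FP_DURATION_MS // 1000


def packet_timestamps(first_timestamp, rate_hz, fps_per_packet, packet_count):
    """RTP timestamps of consecutive packets, assuming no transmission gaps."""
    step = samples_per_fp(rate_hz) * fps_per_packet
    return [(first_timestamp + i * step) % 2**32 for i in range(packet_count)]
```

   For example, at 8 kHz with four FPs per packet, consecutive packets
   are 640 timestamp units apart.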

261	   The DSR payload for the ES 202 050 front-end codec is always an
262	   integral number of octets. If additional padding is required for
263	   some other purpose, then the P bit in the RTP header may be set and
264	   padding appended as specified in [5].

266	   The RTP header marker bit (M) should be set following the general
267	   rules for audio codecs as defined in Section 4.1 in [6].

269	   The assignment of an RTP payload type for this new packet format is
270	   outside the scope of this document, and will not be specified here.
271	   It is expected that the RTP profile under which this payload format
272	   is being used will assign a payload type for this encoding or specify
273	   that the payload type is to be bound dynamically.

275	5. IANA Considerations

277	   One new MIME subtype registration is required for this payload type,
278	   as described below.

280	   Media Type name: audio

282	   Media subtype name: dsr-es202050

284	   Required parameters: none

286	   Optional parameters:

288	   rate: Indicates the sample rate of the speech.  Valid values include:
289	      8000, 11000, and 16000.  If this parameter is not present, a sample
290	      rate of 8000 is assumed.

292	   maxptime: The maximum amount of media which can be encapsulated in
293	      each packet, expressed as time in milliseconds. The time shall be
294	      calculated as the sum of the time the media present in the packet
295	      represents.  The time SHOULD be a multiple of the frame pair size
296	      (i.e., one FP => 20ms).

298	      If this parameter is not present, maxptime is assumed to be 80ms.

300	      Note, since the performance of most speech recognizers is
301	      extremely sensitive to consecutive FP losses, if the user of the
302	      payload format expects a high packet loss ratio for the session,
303	      it MAY explicitly choose a maxptime value for the session that
304	      is shorter than the default value.

306	   ptime: see RFC2327 [4].

308	   Encoding considerations: This type is defined for transfer via RTP
309	      [5] as described in Sections 3 and 4 of RFC XXXX.

311	   Security considerations: See Section 6 of RFC XXXX.

313	   Person & email address to contact for further information:
314	      Qiaobing.Xie@motorola.com

316	   Intended usage: COMMON. It is expected that many VoIP applications
317	      (as well as mobile applications) will use this type.

319	   Author/Change controller:

321	      *  Qiaobing.Xie@motorola.com

323	      *  IETF Audio/Video transport working group

325	5.1 Mapping MIME Parameters into SDP

327	   The information carried in the MIME media type specification has a
328	   specific mapping to fields in the Session Description Protocol (SDP)
329	   [4], which is commonly used to describe RTP sessions. When SDP is
330	   used to specify sessions employing ES 202 050 DSR codec, the mapping
331	   is as follows:

333	   o  The MIME type ("audio") goes in SDP "m=" as the media name.

335	   o  The MIME subtype ("dsr-es202050") goes in SDP "a=rtpmap" as the
336	      encoding name.

338	   o  The optional parameter "rate" also goes in "a=rtpmap" as clock
339	      rate.

341	   o  The optional parameters "ptime" and "maxptime" go in the SDP
342	      "a=ptime" and "a=maxptime" attributes, respectively.

344	   Example of usage of ES 202 050 DSR:

346	     m=audio 49120 RTP/AVP 101
347	     a=rtpmap:101 dsr-es202050/8000
348	     a=maxptime:40
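
   The mapping rules above can be sketched in Python as shown below. The
   function names dsr_sdp_lines and max_fps_per_packet are hypothetical,
   and the sketch produces only the media-level lines discussed in this
   section, not a complete session description. The defaults of 8000 for
   the rate and 80 ms for maxptime are the ones given in Section 5.

```python
def dsr_sdp_lines(port, payload_type, rate=8000, ptime=None, maxptime=None):
    """Build the SDP media lines for an ES 202 050 DSR stream."""
    lines = [
        f"m=audio {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} dsr-es202050/{rate}",
    ]
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines


def max_fps_per_packet(maxptime_ms=80):
    """Largest number of 20 ms FPs that fits within maxptime."""
    return maxptime_ms // 20
```

   Calling dsr_sdp_lines(49120, 101, 8000, maxptime=40) reproduces the
   three lines of the example above, and a maxptime of 40 ms allows at
   most two FPs in each packet.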

350	6. Security Considerations

352	   Implementations using the payload defined in this specification are
353	   subject to the security considerations discussed in the RTP
354	   specification [5] and the RTP profile [6]. This payload does not
355	   specify any different security services.

357	7. Acknowledgments

359	   The design presented here is based on that of [7].

361	Normative References

363	   [1]  European Telecommunications Standards Institute (ETSI) Standard
364	        ES 202 050, "Speech Processing, Transmission and Quality Aspects
365	        (STQ); Distributed Speech Recognition; Front-end Feature
366	        Extraction Algorithm; Compression Algorithms",
367	        (http://pda.etsi.org/pda/home.asp?wki_id=6402), October 2002.

369	   [2]  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
370	        9, RFC 2026, October 1996.

372	   [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
373	        Levels", BCP 14, RFC 2119, March 1997.

375	   [4]  Handley, M. and V. Jacobson, "SDP: Session Description
376	        Protocol", RFC 2327, April 1998.

378	   [5]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
379	        "RTP: A Transport Protocol for Real-Time Applications", RFC
380	        3550, July 2003.

382	   [6]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
383	        Conferences with Minimal Control", RFC 3551, July 2003.

385	   [7]  Xie, Q., "RTP Payload Format for European Telecommunications
386	        Standards Institute (ETSI) European Standard ES 201 108
387	        Distributed Speech Recognition Encoding", RFC 3557, July 2003.

389	Informative References

391	   [8]  European Telecommunications Standards Institute (ETSI) Standard
392	        ES 201 108, "Speech Processing, Transmission and Quality Aspects
393	        (STQ); Distributed Speech Recognition; Front-end Feature
394	        Extraction Algorithm; Compression Algorithms",
395	        (http://webapp.etsi.org/pda/home.asp?wki_id=9948), April 2000.

397	Authors' Addresses

399	   Qiaobing Xie
400	   Motorola, Inc.
401	   1501 W. Shure Drive, 2-F9
402	   Arlington Heights, IL  60004
403	   US

405	   Phone: +1-847-632-3028
406	   EMail: qxie1@email.mot.com

408	   David Pearce
409	   Motorola Labs
410	   UK Research Laboratory
411	   Jays Close
412	   Viables Industrial Estate
413	   Basingstoke, HANTS  RG22 4PD
414	   UK

416	   Phone: +44 (0)1256 484 436
417	   EMail: bdp003@motorola.com

419	Intellectual Property Statement

421	   The IETF takes no position regarding the validity or scope of any
422	   intellectual property or other rights that might be claimed to
423	   pertain to the implementation or use of the technology described in
424	   this document or the extent to which any license under such rights
425	   might or might not be available; neither does it represent that it
426	   has made any effort to identify any such rights. Information on the
427	   IETF's procedures with respect to rights in standards-track and
428	   standards-related documentation can be found in BCP-11. Copies of
429	   claims of rights made available for publication and any assurances of
430	   licenses to be made available, or the result of an attempt made to
431	   obtain a general license or permission for the use of such
432	   proprietary rights by implementors or users of this specification can
433	   be obtained from the IETF Secretariat.

435	   The IETF invites any interested party to bring to its attention any
436	   copyrights, patents or patent applications, or other proprietary
437	   rights which may cover technology that may be required to practice
438	   this standard. Please address the information to the IETF Executive
439	   Director.

441	Full Copyright Statement

443	   Copyright (C) The Internet Society (2003). All Rights Reserved.

445	   This document and translations of it may be copied and furnished to
446	   others, and derivative works that comment on or otherwise explain it
447	   or assist in its implementation may be prepared, copied, published
448	   and distributed, in whole or in part, without restriction of any
449	   kind, provided that the above copyright notice and this paragraph are
450	   included on all such copies and derivative works. However, this
451	   document itself may not be modified in any way, such as by removing
452	   the copyright notice or references to the Internet Society or other
453	   Internet organizations, except as needed for the purpose of
454	   developing Internet standards in which case the procedures for
455	   copyrights defined in the Internet Standards process must be
456	   followed, or as required to translate it into languages other than
457	   English.

459	   The limited permissions granted above are perpetual and will not be
460	   revoked by the Internet Society or its successors or assignees.

462	   This document and the information contained herein is provided on an
463	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
464	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
465	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
466	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
467	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

469	Acknowledgement

471	   Funding for the RFC Editor function is currently provided by the
472	   Internet Society.