idnits 2.17.1 

draft-xie-avt-dsr-00.txt:
  ** The Abstract section seems to be numbered


  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-25) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 424 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 13 instances of lines with control characters in the document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 6, 2001) is 8329 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2026' is mentioned on line 18, but not defined

  == Missing Reference: 'RFC-2119' is mentioned on line 45, but not defined

  == Unused Reference: 'RFC2016' is defined on line 342, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2119' is defined on line 345, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ES201108'

  ** Obsolete normative reference: RFC 1889 (Obsoleted by RFC 3550)

  == Outdated reference: A later version (-13) exists of
     draft-ietf-avt-profile-new-08

  -- Possible downref: Normative reference to a draft: ref. 'RFC1890' 


     Summary: 7 errors (**), 0 flaws (~~), 7 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force			       Q. Xie, Motorola
3	Audio Video Transport WG                            D. Pearce, Motorola
4	INTERNET-DRAFT                                  S. Balasuriya, Motorola
5	                                                      Y. Kim, VerbalTek
6	                                                        S. H. Maes, IBM
7	                                               Hari Garudadri, Qualcomm

9	Expires in six months                                      July 6, 2001

11	         RTP Payload Format for Distributed Speech Recognition
12	                    <draft-xie-avt-dsr-00.txt>

14	Status of this Memo

16	This document is an Internet-Draft and is in full conformance with
17	all provisions of Section 10 of [RFC2026].

19	Internet-Drafts are working documents of the Internet Engineering
20	Task Force (IETF), its areas, and its working groups. Note that
21	other groups may also distribute working documents as Internet-
22	Drafts. Internet-Drafts are draft documents valid for a maximum of
23	six months and may be updated, replaced, or obsoleted by other
24	documents at any time. It is inappropriate to use Internet-Drafts
25	as reference material or to cite them other than as "work in
26	progress."

28	The list of current Internet-Drafts can be accessed at
29	http://www.ietf.org/ietf/1id-abstracts.txt
30	The list of Internet-Draft Shadow Directories can be accessed at
31	http://www.ietf.org/shadow.html.

33	1. Abstract

35	This document specifies an RTP payload format for encapsulating
36	front-end signal processing feature streams for distributed speech
37	recognition (DSR) systems, with the ETSI Standard ES 201 108 front-end
38	being the default codec.

40	2. Conventions

42	The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
43	"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
44	this document are to be interpreted as described in [RFC-2119].

46	3. Introduction

48	Motivated by technology advances in the field of speech recognition,
49	voice interfaces to a variety of services (such as airline
50	information systems, unified messaging, and the like) are becoming
51	more and more prevalent. In parallel, the popularity of mobile
52	computing and communications devices has also increased
53	dramatically. However, the voice codecs typically employed in mobile
54	systems were designed to optimize audible voice quality and not
55	speech recognition accuracy, and using these codecs with speech
56	recognizers can result in poor recognition performance. For systems
57	that can be accessed from multiple networks using multiple speech
58	codecs, recognition system designers are further challenged to
59	accommodate the characteristics of these differences in a robust
60	manner. Channel errors and lost data packets in these networks result
61	in further degradation of the speech signal.

63	In traditional systems as described above, the entire speech
64	recognizer lies on the server appliance. It is forced to use
65	incoming speech in whatever condition it arrives in after the
66	network decodes the vocoded speech. A solution that combats this
67	uses a scheme called "distributed speech recognition" (DSR). In this
68	system, the remote device acts as a thin client in communication
69	with a speech recognition server, also called a speech engine (SE). The
70	remote device processes the speech, compresses, and error protects the
71	bitstream in a manner optimal for speech recognition. The speech engine
72	then uses this representation directly, minimizing the signal
73	processing necessary and benefiting from enhanced error concealment.

75	To achieve interoperability with different client devices and speech
76	engines, a common format is needed. Within the "Aurora" DSR working
77	group of the European Telecommunications Standards Institute (ETSI), a
78	payload has been defined and was published as a standard in February
79	2000 [ES201108].

81	For interactive voice user interface dialogues between a caller and a
82	voice service, low latency is also a high priority along with accurate
83	speech recognition. While jitter in the speech recognizer input is not
84	particularly important, many issues related to speech interaction over
85	an IP-based connection are still relevant.  Therefore, it will be
86	desirable to use the DSR payload in an RTP-based session.

88	3.1 Typical Scenarios for Using DSR Payload Format

90	The following diagrams show some typical use scenarios of the DSR RTP
91	payload format.

93	  +--------+                     +----------+
94	  |IP USER |  IP/UDP/RTP/DSR     |IP SPEECH |
95	  |TERMINAL|-------------------->|  ENGINE  |
96	  |        |                     |          |
97	  +--------+                     +----------+

99	  +--------+  DSR over      +-------+                +----------+
100	  | Non-IP |  Circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
101	  |  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
102	  |TERMINAL|  ETSI payload  |       |                |          |
103	  +--------+  format        +-------+                +----------+

105	  +--------+                  +-------+  DSR over       +----------+
106	  |IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
107	  |TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
108	  |        |                  |       |  ETSI payload   |  ENGINE  |
109	  +--------+                  +-------+  format         +----------+

111	    Figure 1: Typical Scenarios for Using DSR Payload Format.

113	For the different scenarios in Figure 1, the speech recognizer resides
114	in the speech engine, while a DSR front-end encoder inside the User
115	Terminal performs front-end speech processing and sends the resultant
116	data to the speech engine in the form of "frame-pairs" (FPs). Each
117	frame-pair normally contains two sets of encoded speech vectors
118	representing 20ms of original speech.

120	4. DSR RTP Payload Format

122	4.1 Payload Header

124	Each DSR payload MUST begin with the following payload header, which is
125	one octet long:

127	    0
128	    0 1 2 3 4 5 6 7
129	   +-+-+-+-+-+-+-+-+
130	   |  FPC  |E|R|R|R|
131	   +-+-+-+-+-+-+-+-+

133	   Figure 2: Payload header.

135	 FPC - Frame-Pair Count, indicating the number of Frame-pairs (FPs)
136	       included in this payload packet.

138	 E - End of speech segment flag. When set to 1, it indicates that the
139	     last frame-pair in this payload packet is the end of the current
140	     speech segment.

142	 R - reserved bits. Must be set to 0 by the sender of the payload
143	     and ignored by the receiver.
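
The header layout above can be illustrated with a short Python
sketch. It is not part of the specification; the function names are
hypothetical, and it assumes the usual convention that bit 0 in the
diagram is the most significant bit of the octet.

```python
from typing import Tuple


def pack_dsr_header(fpc: int, end_of_speech: bool) -> int:
    """Return the one-octet DSR payload header as an integer (0..255)."""
    if not 0 <= fpc <= 15:
        raise ValueError("FPC is a 4-bit field; expected 0..15")
    # FPC fills bits 0-3 (the high nibble), E is bit 4, and the three
    # reserved bits (5-7) are always sent as zero.
    return (fpc << 4) | (int(end_of_speech) << 3)


def parse_dsr_header(octet: int) -> Tuple[int, bool]:
    """Return (fpc, end_of_speech); the reserved bits are ignored."""
    return (octet >> 4) & 0x0F, bool(octet & 0x08)
```

For example, a packet carrying 3 frame-pairs with the E flag set
would have the header octet 0x38 under these assumptions.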

145	4.2 Payload Body

147	The DSR payload is formed by concatenating the above payload header
148	with the number of frame-pairs given by FPC.

150	Each DSR payload MUST be octet-aligned at the end, i.e., if a DSR
151	payload does not end on an octet boundary, it then MUST be padded at
152	the end with zeros to the next octet boundary.

154	The following example shows a DSR payload carrying 3 frame pairs:

156	    0                   1                   2                   3
157	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
158	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
159	   | FPC=3 |E|0|0|0|                                               |
160	   +-+-+-+-+-+-+-+-+                                               +
161	   |                         FP #1                                 |
162	   +                                                       +-+-+-+-+
163	   |                                                       |       |
164	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+       +
165	   |                                                               |
166	   +                                                               +
167	   |                         FP #2                                 |
168	   +                                               +-+-+-+-+-+-+-+-+
169	   |                                               |               |
170	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
171	   |                                                               |
172	   +                         FP #3                                 +
173	   |                                                               |
174	   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
175	   |                       |0|0|0|0|
176	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

178	In this example, the payload is shown with 4 zeros padded at the end
179	to make it octet-aligned.
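
As a rough illustration of the concatenation and padding rules, the
following Python sketch assembles a payload from 92-bit frame-pairs
(the ES 201 108 size given in Section 5.1). The function name and
the choice to represent each frame-pair as an integer are assumptions
made for this example only.

```python
FRAME_PAIR_BITS = 92  # size of one ES 201 108 frame-pair


def build_dsr_payload(frame_pairs, end_of_speech=False):
    """Concatenate the header and frame-pairs, then zero-pad to an octet.

    frame_pairs: list of ints, each holding one 92-bit frame-pair.
    """
    fpc = len(frame_pairs)
    if not 0 <= fpc <= 15:
        raise ValueError("FPC is a 4-bit field; expected 0..15 frame-pairs")
    # Start with the one-octet header: FPC, E flag, three reserved zeros.
    bits = (fpc << 4) | (int(end_of_speech) << 3)
    nbits = 8
    for fp in frame_pairs:
        if fp < 0 or fp >> FRAME_PAIR_BITS:
            raise ValueError("frame-pair does not fit in 92 bits")
        bits = (bits << FRAME_PAIR_BITS) | fp
        nbits += FRAME_PAIR_BITS
    # Pad with zero bits up to the next octet boundary.
    pad = -nbits % 8
    bits <<= pad
    return bits.to_bytes((nbits + pad) // 8, "big")
```

With three frame-pairs the unpadded length is 8 + 3 * 92 = 284 bits,
so 4 zero bits are appended and the payload occupies 36 octets, which
agrees with the example above.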

181	The number of FPs per payload packet should be determined by the
182	latency and bandwidth requirements of the DSR application.

184	Decreasing the number of FPs per payload packet reduces bandwidth
185	efficiency because of the RTP header overhead, while increasing the
186	number of FPs per packet causes longer end-to-end delay and hence
187	greater recognition latency.

189	Furthermore, increasing the number of FPs per packet raises the
190	risk of losing a large number of consecutive frame-pairs, which is
191	a situation most speech recognizers have difficulty dealing
192	with.

194	Therefore, it is RECOMMENDED that the number of FPs per DSR
195	payload packet be minimized, subject to meeting the application's
196	requirements on network bandwidth efficiency.

198	RTP header compression [RFC2508] SHOULD be considered to improve
199	network bandwidth efficiency.
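
The trade-off between bandwidth efficiency and latency can be made
concrete with a small calculation. The Python sketch below is only
illustrative: it assumes uncompressed IPv4 (20 octets), UDP (8
octets), and fixed RTP (12 octets) headers with no options or CSRC
entries, and 92-bit frame-pairs as used by the ES 201 108 front-end.

```python
# Assumed per-packet overhead: IPv4 + UDP + RTP fixed header, in octets.
HEADER_OCTETS = 20 + 8 + 12
FRAME_PAIR_BITS = 92
FRAME_PAIR_MS = 20


def packet_stats(fpc):
    """Return (payload_octets, packet_ms, bitrate_bps, efficiency)."""
    if fpc < 1:
        raise ValueError("need at least one frame-pair per packet")
    # One header octet plus the frame-pairs, rounded up to whole octets.
    payload_octets = (8 + FRAME_PAIR_BITS * fpc + 7) // 8
    total_octets = HEADER_OCTETS + payload_octets
    packet_ms = FRAME_PAIR_MS * fpc  # speech carried, i.e. packetization delay
    bitrate_bps = total_octets * 8 * 1000 / packet_ms
    efficiency = payload_octets / total_octets
    return payload_octets, packet_ms, bitrate_bps, efficiency


if __name__ == "__main__":
    for fpc in (1, 2, 3, 6):
        octets, ms, bps, eff = packet_stats(fpc)
        print(f"FPC={fpc}: {octets} payload octets, {ms} ms, "
              f"{bps:.0f} bit/s, {eff:.0%} payload")
```

Under these assumptions a single frame-pair per packet costs 21200
bit/s on the wire for 13 octets of payload, which shows why header
compression is worth considering when FPC is kept small.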

201	5. Frame-pair Format

203	Depending on the type of the DSR front-end encoder to be used in the
204	present DSR RTP session, the frame-pair format may be different.

206	When setting up a DSR RTP session, the user terminal will inform the
207	speech engine of the type of the front-end encoder, using the
208	front-end-type MIME parameter as defined in Section 6.

210	In this memo, we only define the frame-pair format that MUST be used
211	when the ETSI ES 201 108 Front-end Codec [ES201108] is used.
212	Frame-pair formats for future DSR front-end codecs may be defined in
213	separate IETF documents.

215	5.1. Frame-Pair Format For ETSI ES 201 108 Front-end Codec

217	The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal
218	processing front-end and compression scheme for speech input to a
219	speech recognition system. Some relevant characteristics of this ETSI
220	DSR front-end codec are summarized below.

222	The coding algorithm, a standard mel-cepstral technique common to many
223	speech recognition systems, supports three raw sampling rates: 8 kHz,
224	11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based
225	scheme that produces an output vector every 10 ms.

227	After calculation of the mel-cepstral representation, the
228	representation is quantized via split-vector quantization to reduce
229	the data rate of the encoded stream. This is a lossy compression, with
230	the output being a frame containing an integer representation of the
231	encoded speech.

233	For the ES 201 108 Front-end Codec, the following mel-cepstral frame MUST
234	be used, as defined in [ES201108]:

236	    0                   1                   2                   3
237	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
238	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
239	   |  idx(0,1) |  idx(2,3) |  idx(4,5) |  idx(6,7) |  idx(8,9) |idx
240	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
241	   (10,11) |   idx(12,13)  |
242	   +-+-+-+-+-+-+-+-+-+-+-+-+

244	The length of a frame is 44 bits representing 10ms of voice.
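
The field widths can be read off the diagram above: six 6-bit indices
followed by one 8-bit index, for a total of 6 * 6 + 8 = 44 bits. The
Python sketch below packs and unpacks such a frame. It is an
illustration only; the function names are hypothetical, and it
assumes the fields are laid out most-significant-bit first in the
order drawn. [ES201108] remains the authority on bit ordering.

```python
# Widths in bits of idx(0,1), idx(2,3), ..., idx(10,11), idx(12,13),
# as read from the frame diagram.
FIELD_WIDTHS = (6, 6, 6, 6, 6, 6, 8)


def pack_frame(indices):
    """Pack seven codebook indices into one 44-bit integer."""
    if len(indices) != len(FIELD_WIDTHS):
        raise ValueError("expected 7 codebook indices")
    frame = 0
    for value, width in zip(indices, FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"index {value} does not fit in {width} bits")
        frame = (frame << width) | value
    return frame


def unpack_frame(frame):
    """Split a 44-bit integer back into its seven codebook indices."""
    indices = []
    for width in reversed(FIELD_WIDTHS):
        indices.append(frame & ((1 << width) - 1))
        frame >>= width
    return indices[::-1]
```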

246	As defined in [ES201108], pairs of the quantized 10ms mel-cepstral
247	frames MUST be grouped together and protected with a 4-bit CRC,
248	forming a 92-bit long frame-pair:

250	    0                   1                   2                   3
251	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
252	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
253	   |                      Frame #1  (44 bits)                      |
254	   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
255	   |                       |          Frame #2 (44 bits)           |
256	   +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
257	   |                                               | CRC   |
258	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

260	Therefore, each frame-pair represents 20ms of original speech.
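
A frame-pair can be assembled and split as in the following Python
sketch. It assumes the three fields appear most-significant-bit first
in the order drawn (Frame #1, Frame #2, CRC) and takes the CRC value
as an input; computing the CRC itself is defined in [ES201108] and is
not attempted here. The function names are hypothetical.

```python
FRAME_BITS = 44
CRC_BITS = 4
FRAME_PAIR_BITS = 2 * FRAME_BITS + CRC_BITS  # 92


def join_frame_pair(frame1, frame2, crc):
    """Combine two 44-bit frames and a 4-bit CRC into a 92-bit integer."""
    for name, value, width in (("frame1", frame1, FRAME_BITS),
                               ("frame2", frame2, FRAME_BITS),
                               ("crc", crc, CRC_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return (frame1 << (FRAME_BITS + CRC_BITS)) | (frame2 << CRC_BITS) | crc


def split_frame_pair(frame_pair):
    """Return (frame1, frame2, crc) from a 92-bit frame-pair integer."""
    if not 0 <= frame_pair < (1 << FRAME_PAIR_BITS):
        raise ValueError("frame-pair does not fit in 92 bits")
    crc = frame_pair & ((1 << CRC_BITS) - 1)
    frame2 = (frame_pair >> CRC_BITS) & ((1 << FRAME_BITS) - 1)
    frame1 = frame_pair >> (FRAME_BITS + CRC_BITS)
    return frame1, frame2, crc
```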

262	The 4-bit CRC MUST be calculated using the formula defined in Section
263	6.2.4 of [ES201108].

265	6. DSR MIME Type Registration

267	Media Type name:     audio

269	Media subtype name:  DSR

271	Required parameters: none

273	Optional parameters for RTP mode:

275	 sample-rate: Indicating the sample rate of the speech. Valid values
276		      include: 8k, 11k, and 16k.

278		      If this parameter is not present, 8k sample rate is
279		      assumed.

281	 front-end-type: Indicating the type of the front-end codec to be used
282			 for this DSR session. Valid values are:

284			 etsi_mfcc - indicates that the ETSI ES 201 108 Front-end
285			 Codec as defined in [ES201108] will be used.

287			 unspecified - indicates that another front-end codec
288			 will be used.

290			 If this parameter is absent, ETSI ES 201 108
291			 Front-end will be assumed.

293	 maxptime:  The maximum amount of media which can be encapsulated in
294	            each packet, expressed as time in milliseconds. The time
295	            shall be calculated as the sum of the time the media
296	            present in the packet represents. The time SHOULD be a
297	            multiple of the frame pair size (i.e., one FP <-> 20ms).

299		    If this parameter is not present, maxptime will be assumed
300		    to be 60ms.

302	Encoding considerations : <TBD>

304	Security considerations : <TBD>

306	Interoperability considerations : <TBD>

308	Person & email address to contact for further information: <TBD>

310	Intended usage: COMMON. It is expected that many VoIP applications
311	(as well as mobile applications) will use this type.

313	Author/Change controller:
314	  <TBD>
315	  IETF Audio/Video transport working group
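
The defaults for the optional parameters above can be applied as in
the following Python sketch. This memo does not define how the
parameters are carried in session signaling, so the "name=value;
name=value" string syntax used here is an assumption for
illustration, as are the function names.

```python
FRAME_PAIR_MS = 20  # one frame-pair represents 20ms of speech

# Defaults stated in the parameter descriptions.
DEFAULTS = {
    "sample-rate": "8k",
    "front-end-type": "etsi_mfcc",
    "maxptime": "60",
}
VALID = {
    "sample-rate": {"8k", "11k", "16k"},
    "front-end-type": {"etsi_mfcc", "unspecified"},
}


def parse_dsr_params(text):
    """Parse 'name=value; name=value' and fill in the stated defaults."""
    params = dict(DEFAULTS)
    for item in text.split(";"):
        name, sep, value = item.strip().partition("=")
        if not sep:
            continue  # skip empty or malformed items
        params[name.strip().lower()] = value.strip()
    for name, allowed in VALID.items():
        if params[name] not in allowed:
            raise ValueError(f"invalid {name}: {params[name]!r}")
    return params


def max_frame_pairs(params):
    """Largest number of whole frame-pairs allowed by maxptime."""
    return int(params["maxptime"]) // FRAME_PAIR_MS
```

With no parameters present, this yields an 8k sample rate, the ETSI
front-end, and at most 3 frame-pairs per packet (60ms / 20ms).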

317	7. Security Considerations

319	Implementations using the payload defined in this specification are
320	subject to the security considerations discussed in the RTP
321	specification [RFC1889] and the RTP profile [RFC1890]. This payload
322	does not specify any different security services.

324	8. References

326	[ES201108] European Telecommunications Standards Institute (ETSI)
327	   Standard ES 201 108, "Speech Processing, Transmission and Quality
328	   Aspects (STQ); Distributed Speech Recognition; Front-end Feature
329	   Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April
330	   11, 2000. http://webapp.etsi.org/pda/home.asp?wki_id=9948

332	[RFC1889] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
333	   "RTP: A transport protocol for real-time applications," Internet
334	   Draft, Internet Engineering Task Force, Feb. 1999 Work in progress,
335	   revision to RFC 1889.

337	[RFC1890] H. Schulzrinne and S. Casner, "RTP Profile for Audio and
338	   Video Conferences with Minimal Control," Internet Draft
339	   draft-ietf-avt-profile-new-08.txt, Work in Progress January 14,
340	   2000, revision to RFC 1890.

342	[RFC2016] Bradner, S., "The Internet Standards Process -- Revision 3",
343	   BCP 9, RFC 2026, October 1996.

345	[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
346	   Requirement Levels", BCP 14, RFC 2119, March 1997.

348	[RFC2508] S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers
349	   for Low-Speed Serial Links," RFC 2508, February 1999.

351	9.   Acknowledgments

353	The design presented here benefits greatly from an earlier work on DSR
354	RTP payload design by Jeff Meunier.

356	10. Authors' Addresses

358	Qiaobing Xie                        Tel:   +1-847-632-3028
359	Motorola, Inc.                      EMail: qxie1@email.mot.com
360	1501 W. Shure Drive, 2-F9
361	Arlington Heights, IL 60004, USA

363	David Pearce                        Tel: +44 (0)1256 484 436
364	Motorola Labs                       EMail: bdp003@motorola.com
365	UK Research Laboratory
366	Jays Close
367	Viables Industrial Estate
368	Basingstoke, HANTS, RG22 4PD

370	Senaka Balasuriya                   Tel:   +1-630-353-8347
371	Motorola, Inc.              EMail: Senaka.Balasuriya@motorola.com
372	1411 Opus Place, Suite 350
373	Downers Grove, IL 60515, USA

375	Yoon Kim                            Tel: +1-408-768-4974
376	VerbalTek, Inc.                     EMail: yoonie@verbaltek.com
377	2921 Copper Rd.
378	Santa Clara, CA 95051

380	Stephane H. Maes                    Tel: +1-914-945-2908
381	IBM                                 EMail: smaes@us.ibm.com
382	TJ Watson Research Center
383	P.O. Box 218,
384	Yorktown Heights, NY 10598, USA.

386	Hari Garudadri                      Tel:
387	Qualcomm                            EMail: hgarudad@qualcomm.com

389	      This Internet Draft expires in 6 months from July 2001.