idnits 2.17.1
draft-burger-mrcp-reqts-00.txt:
** The Abstract section seems to be numbered
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** Looks like you're using RFC 2026 boilerplate. This must be updated to
follow RFC 3978/3979, as updated by RFC 4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
** The document seems to lack a 1id_guidelines paragraph about 6 months
document validity -- however, there's a paragraph with a matching
beginning. Boilerplate error?
== There are 8 instances of lines with non-ascii characters in the document.
== The page length should not exceed 58 lines per page, but there was 7
longer pages, the longest (page 2) being 59 lines
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
** The document seems to lack an IANA Considerations section. (See Section
2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
when there are no actions for IANA.)
** The document seems to lack separate sections for Informative/Normative
References. All references will be assumed normative when checking for
downward references.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (February 19, 2002) is 8073 days in the past. Is this
intentional?
Checking references for intended status: Informational
----------------------------------------------------------------------------
-- Missing reference section? '1' on line 14 looks like a reference
-- Missing reference section? '2' on line 50 looks like a reference
-- Missing reference section? '3' on line 69 looks like a reference
-- Missing reference section? '4' on line 69 looks like a reference
-- Missing reference section? '5' on line 70 looks like a reference
-- Missing reference section? '6' on line 70 looks like a reference
-- Missing reference section? '7' on line 221 looks like a reference
-- Missing reference section? '8' on line 81 looks like a reference
-- Missing reference section? '9' on line 127 looks like a reference
-- Missing reference section? '10' on line 161 looks like a reference
-- Missing reference section? '11' on line 177 looks like a reference
-- Missing reference section? '12' on line 213 looks like a reference
-- Missing reference section? '13' on line 261 looks like a reference
Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group E. Burger
3 Internet Draft SnowShore Networks, Inc.
4 Document: draft-burger-mrcp-reqts-00.txt D. Oran
5 Category: Informational Cisco Systems, Inc.
6 Expires August 2002 February 19, 2002
8 Requirements for Distributed Control of ASR and TTS Resources
10 Status of this Memo
12 This document is an Internet-Draft and is in full conformance with
13 all provisions of Section 10 of RFC2026 [1].
15 Internet-Drafts are working documents of the Internet Engineering
16 Task Force (IETF), its areas, and its working groups. Note that
17 other groups may also distribute working documents as Internet-
18 Drafts. Internet-Drafts are draft documents valid for a maximum of
19 six months and may be updated, replaced, or obsoleted by other
20 documents at any time. It is inappropriate to use Internet-Drafts
21 as reference material or to cite them other than as "work in
22 progress."
24 The list of current Internet-Drafts can be accessed at
25 http://www.ietf.org/ietf/1id-abstracts.txt
27 The list of Internet-Draft Shadow Directories can be accessed at
28 http://www.ietf.org/shadow.html.
30 1. Abstract
32 This document outlines the needs and requirements for a protocol to
33 control distributed speech processing of audio streams. By speech
34 processing, this document specifically means automatic speech
35 recognition and text-to-speech. Other IETF protocols, such as SIP
36 and RTSP, address rendezvous and control for generalized media
37 streams. However, speech processing presents additional
38 requirements that none of the extant IETF protocols address.
40 Discussion of this and related documents is on the MRCP list. To
41 subscribe, send the message "subscribe mrcp" to
42 majordomo@snowshore.com. The public archive is at
43 http://flyingfox.snowshore.com/mrcp_archive/maillist.html.
45 2. Conventions used in this document
47 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
48 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
49 this document are to be interpreted as described in RFC-2119 [2].
51 Burger & Oran Informational - Expires August 2002 1
52 Distributed Media Control Requirements February 2002
54 FORMATTING NOTE: Notes, such as this one, provide additional,
55 nonessential information that the reader may skip without missing
56 anything essential. The primary purpose of these non-essential
57 notes is to convey information about the rationale of this document,
58 or to place this document in the proper historical or evolutionary
59 context. Readers whose sole purpose is to construct a conformant
60 implementation may skip such information. However, it may be of use
61 to those who wish to understand why we made certain design choices.
63 OPEN ISSUES: This document highlights questions that are, as yet,
64 undecided as "OPEN ISSUES".
66 3. Introduction
68 There are multiple IETF protocols for establishment and termination
69 of media sessions (SIP[3]), low-level media control (MGCP[4] and
70 megaco[5]), and media record and playback (RTSP[6]). The focus of
71 this document is requirements for one or more protocols to support
72 the control of network elements that perform Automated Speech
73 Recognition (ASR) and rendering text into audio, a.k.a. Text-to-
74 Speech (TTS). Many multimedia applications can benefit from having
75 automatic speech recognition (ASR) and text-to-speech (TTS)
76 processing available as a distributed, network resource. This
77 requirements document limits its focus to the distributed control of
78 ASR and TTS servers.
80 To date, there are a number of proprietary ASR and TTS APIs, as
81 well as two IETF drafts that address this problem [7] [8]. However,
82 there are serious deficiencies in the existing drafts. In
83 particular, they mix the semantics of existing protocols yet are
84 close enough to other protocols as to be confusing to the
85 implementer.
87 This document sets forth requirements for protocols to support
88 distributed speech processing of audio streams.
90 For simplicity, and to remove confusion with existing protocol
91 proposals, this document presents the requirements as being for a
92 "new protocol" that addresses the distributed control of speech
93 resources. It refers to such a protocol as "SRCP", for Speech
94 Resource Control Protocol.
96 4. SRCP Framework
98 The following is the SRCP framework for speech processing.
103 +-------------+
104 | Application |
105 | Server |
106 +-------------+
107 SIP or whatever /
108 /
109 +------------+ / +--------+
110 | Media |/ SRCP | ASR |
111 | Processing |-------------------------| and/or |
112 RTP | Entity | RTP | TTS |
113 =====| |=========================| Server |
114 +------------+ +--------+
116 The "Media Processing Entity" is a network element that processes
117 media. The "Application Server" is a network element that instructs
118 the Media Processing Entity on what transformations to make to the
119 media stream. The "ASR and/or TTS Server" is a network element that
120 either generates an RTP stream based on text input (TTS) or returns
121 speech recognition results in response to an RTP stream as input
122 (ASR). The Media Processing Entity controls the ASR or TTS Server
123 using SRCP as a control protocol.
125 Physical embodiments of the entities can reside in one physical
126 instance per entity, or some combination of entities. For example,
127 a VoiceXML [9] Gateway may combine the ASR and TTS functions on the
128 same platform as the Media Processing Entity. Note that VoiceXML
129 Gateways themselves are outside the scope of this protocol.
131 Likewise, one can combine the Application Server and Media
132 Processing Entity, as would be the case in an interactive voice
133 response (IVR) platform.
135 One can also decompose the Media Processing Entity into an entity
136 that controls media endpoints and entities that process media
137 directly. Such would be the case with a decomposed gateway using
138 MGCP or megaco. However, this decomposition is again orthogonal to
139 the scope of SRCP.
141 5. General Requirements
143 5.1. Reuse Existing Protocols
145 To the extent feasible, the SRCP framework SHOULD use existing
146 protocols.
148 5.2. Maintain Existing Protocol Integrity
150 In meeting requirement 5.1, the SRCP framework MUST NOT redefine the
151 semantics of an existing protocol.
153 Said differently, we will not break existing protocols.
158 5.3. Avoid Duplicating Existing Protocols
160 To the extent feasible, SRCP SHOULD NOT duplicate the functionality
161 of existing protocols. For example, SIP with msuri [10] and RTSP
162 already define how to request playback of audio.
164 The focus of SRCP is new functionality not addressed by existing
165 protocols or extending existing protocols within the strictures of
166 requirement 5.2.
168 6. TTS Requirements
170 The SRCP framework MUST allow a Media Processing Entity, using a
171 control protocol, to request the TTS Server to play back text as
172 voice in an RTP stream.
174 The TTS Server MUST support the reading of plain text. For reading
175 plain text, the language and voicing are local matters.
177 The TTS Server SHOULD support the reading of SSML [11] text.
179 OPEN ISSUE: Should the TTS Server infer the text is SSML by
180 detecting a legal SSML document, or must the protocol tell the TTS
181 Server the document type?
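
To make the first option in the open issue above concrete, the
following is a minimal sketch of inferring the document type from the
text itself. It assumes that "SSML" can be approximated as well-formed
XML whose root element is named "speak"; it does not validate the
document against the SSML specification. The function name
looks_like_ssml is hypothetical and is not part of any SRCP proposal.

```python
import xml.etree.ElementTree as ET


def looks_like_ssml(text: str) -> bool:
    """Guess whether text is an SSML document (hypothetical helper).

    Assumption: well-formed XML with a root element named "speak"
    is treated as SSML; anything else is treated as plain text.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, so treat it as plain text.
        return False
    # Drop any XML namespace prefix, e.g. "{urn:x}speak" -> "speak".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "speak"
```

Under this approach, plain text that happens to be well-formed XML is
ambiguous, which is one argument for having the protocol state the
document type explicitly.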
183 The TTS Server MUST accept text over the SRCP connection for reading
184 over the RTP connection.
186 OPEN ISSUE: Should we allow the TTS Server to retrieve text on its
187 own? That is, have SRCP pass in a URI from which the TTS Server
188 retrieves the text.
190 OPEN ISSUE: Should we allow (or require) the TTS Server to use long-
191 lived control channels?
193 The TTS Server SHOULD support, and the SRCP framework MUST support
194 the specification of, "VCR Controls", such as skip forward, skip
195 backward, play faster, and play slower.
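
As an illustration only, the four controls named above can be modeled
as operations on a playback position and a playback rate. The
following sketch is not a proposed wire format: the names VcrControl,
PlaybackState, and apply_control, the skip size, and the rate step
are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class VcrControl(Enum):
    SKIP_FORWARD = "skip-forward"
    SKIP_BACKWARD = "skip-backward"
    PLAY_FASTER = "play-faster"
    PLAY_SLOWER = "play-slower"


@dataclass
class PlaybackState:
    position_ms: int = 0   # current offset into the rendered audio
    rate: float = 1.0      # 1.0 is normal speed


SKIP_MS = 5000    # assumed skip size; a real protocol might carry it
RATE_STEP = 0.25  # assumed rate increment
MIN_RATE = 0.5    # assumed lower bound on playback rate


def apply_control(state: PlaybackState, control: VcrControl,
                  duration_ms: int) -> PlaybackState:
    """Apply one control, clamping position to [0, duration_ms]."""
    if control is VcrControl.SKIP_FORWARD:
        state.position_ms = min(state.position_ms + SKIP_MS, duration_ms)
    elif control is VcrControl.SKIP_BACKWARD:
        state.position_ms = max(state.position_ms - SKIP_MS, 0)
    elif control is VcrControl.PLAY_FASTER:
        state.rate += RATE_STEP
    else:  # VcrControl.PLAY_SLOWER
        state.rate = max(state.rate - RATE_STEP, MIN_RATE)
    return state
```

Clamping at the ends of the audio is a design choice of this sketch;
a protocol could instead report an error when a skip runs past either
end.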
197 OPEN ISSUE: Should we allow for session parameters, like prosody and
198 voicing, as is specified for MRCP over RTSP [7]?
200 OPEN ISSUE: Should we allow for speech markers, as is specified for
201 MRCP over RTSP [7]?
203 7. ASR Requirements
205 The SRCP framework MUST allow a Media Processing Entity to request
206 the ASR Server to perform automatic speech recognition on an RTP
207 stream, returning the results over SRCP.
212 The ASR Server MUST support the XML specification for speech
213 recognition [12].
215 OPEN ISSUE: Should we allow the ASR Server to support alternative
216 grammar formats? If so, we need mechanisms to specify what format
217 the grammar is in, capability discovery, and handling unsupported
218 grammars.
220 OPEN ISSUE: Is there a need for all of the parameters specified for
221 MRCP over RTSP [7]? Most of them are part of the W3C speech
222 recognition grammar.
224 The ASR Server SHOULD support a method for capturing the input media
225 stream for later analysis and tuning of the ASR engine.
226 The ASR Server SHOULD support sharing grammars across sessions.
227 This supports applications whose grammars are too large to load
228 dynamically for each session. An example is a city-country
229 grammar for a weather service.
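
One way to picture grammar sharing is a server-side store that loads
each grammar once and hands the same loaded object to every session
that asks for it. The following is a minimal sketch under that
assumption. The class name SharedGrammarStore, the string grammar
identifier, and the loader callback are hypothetical; how a grammar is
actually named and compiled is outside the scope of this example.

```python
import threading
from typing import Callable, Dict


class SharedGrammarStore:
    """Loads each grammar at most once and shares it across sessions."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader                  # assumed costly to call
        self._grammars: Dict[str, object] = {}
        self._lock = threading.Lock()          # sessions may be concurrent

    def get(self, grammar_id: str) -> object:
        # Holding the lock while loading keeps the sketch simple and
        # guarantees that a grammar is never loaded twice.
        with self._lock:
            if grammar_id not in self._grammars:
                self._grammars[grammar_id] = self._loader(grammar_id)
            return self._grammars[grammar_id]
```

A second session that requests the same identifier receives the
already-loaded grammar without paying the load cost again.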
231 8. Dual-Mode Requirements
233 One very important requirement for an interactive speech-driven
234 system is that user perception of the quality of the interaction
235 depends strongly on the ability of the user to interrupt a prompt or
236 rendered TTS with speech. Interrupting, or barging, the speech
237 output requires more than energy detection from the user's
238 direction. Many advanced systems halt the media towards the user by
239 employing the ASR engine to decide if an utterance is likely to be
240 real speech, as opposed to a cough, for example.
242 To achieve low latency between utterance detection and halting of
243 playback, many implementations combine the speaking and ASR
244 functions. The SRCP framework MUST support such dual-mode
245 implementations.
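
The barge-in decision described above can be sketched as follows. The
example assumes a combined resource that sees, for each chunk of
input, a normalized energy value and a likelihood from the ASR engine
that the input is real speech. Both thresholds, the class name
DualModeResource, and the method name on_input are assumptions made
for illustration.

```python
from dataclasses import dataclass

ENERGY_THRESHOLD = 0.1   # assumed normalized energy floor
SPEECH_THRESHOLD = 0.8   # assumed ASR "real speech" likelihood floor


@dataclass
class DualModeResource:
    """Combined speaking and ASR resource (illustrative only)."""

    playing: bool = True

    def on_input(self, energy: float, speech_likelihood: float) -> bool:
        """Return True if this input halted playback."""
        # Energy alone is not enough: a cough is loud, but the ASR
        # engine should score it as unlikely to be speech.
        is_barge_in = (energy >= ENERGY_THRESHOLD
                       and speech_likelihood >= SPEECH_THRESHOLD)
        if self.playing and is_barge_in:
            self.playing = False   # halt locally, with no network hop
            return True
        return False
```

Because the decision and the halt happen in the same element, no
control message has to cross the network between utterance detection
and the end of playback.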
247 9. Thoughts to Date (non-normative)
249 The protocol assumes RTP carriage of media. Assuming session-
250 oriented media transport, the protocol will use SDP to describe the
251 session.
253 The working group will not be investigating distributed speech
254 recognition (DSR), as exemplified by the ETSI Aurora project. The
255 working group will not be recreating functionality available in
256 other protocols, such as SIP or SDP.
258 TTS looks very much like playing back a file. Extending RTSP looks
259 promising for when one requires VCR controls or markers in the text
260 to be spoken. When one does not require VCR controls, SIP in a
261 framework such as Network Announcements [13] works directly without
262 modification.
267 ASR has an entirely different set of characteristics. For barge-in
268 support, ASR requires real-time return of intermediate results.
269 Barring the discovery of a good reuse model for an existing
270 protocol, this will most likely become the focus of SRCP.
272 10. Security Considerations
274 Protocols relating to speech processing must take security into
275 account. This is particularly important as popular uses for TTS
276 include reading financial information. Likewise, popular uses for
277 ASR include executing financial transactions and shopping.
279 We envision that rather than providing application-specific security
280 mechanisms in SRCP itself, the resulting protocol will employ
281 security machinery of either containing protocols or the transport
282 on which it runs. For example, we will consider solutions such as
283 using TLS for securing the control channel, and SRTP for securing
284 the media channel.
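
As a rough illustration of relying on the transport for security, the
following sketch opens a control channel over TLS using the Python
standard library. It assumes a TCP-based control channel, which this
document does not mandate. The function names, the timeout, and the
TLS 1.2 floor are choices made for the example. Securing the media
channel with SRTP is not shown.

```python
import socket
import ssl


def make_control_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate checks on."""
    ctx = ssl.create_default_context()
    # Assumed policy for this sketch: refuse anything below TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def open_control_channel(host: str, port: int) -> ssl.SSLSocket:
    """Connect to a (hypothetical) speech server and start TLS."""
    raw = socket.create_connection((host, port), timeout=10)
    try:
        return make_control_context().wrap_socket(
            raw, server_hostname=host)
    except Exception:
        raw.close()   # do not leak the socket if the handshake fails
        raise
```

The default context verifies the server certificate and host name, so
a Media Processing Entity using it would refuse to send text to a
server it cannot authenticate.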
286 11. References
288 1 Bradner, S., "The Internet Standards Process -- Revision 3", BCP
289 9, RFC 2026, October 1996.
291 2 Bradner, S., "Key words for use in RFCs to Indicate Requirement
292 Levels", BCP 14, RFC 2119, March 1997.
294 3 Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J.,
295 "SIP: Session Initiation Protocol", RFC 2543, March 1999.
297 4 Arango, M., Dugan, A., Elliott, I., Huitema, C., and Pickett, S.,
298 "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 2705,
299 October 1999.
301 5 Cuervo, F., Greene, N., Rayhan, A., Huitema, C., Rosen, B., and
302 Segers, J., "Megaco Protocol Version 1.0", RFC 3015, November 2000.
304 6 Schulzrinne, H., Rao, A., and Lanphier, R., "Real Time Streaming
305 Protocol (RTSP)", RFC 2326, April 1998.
307 7 Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource
308 Control Protocol", draft-shanmugham-mrcp-01.txt, November 2001,
309 work in progress.
311 8 Robinson, F., Marquette, B., and R. Hernandez, "Using Media
312 Resource Control Protocol with SIP", draft-robinson-mrcp-sip-
313 00.txt, September 2001, work in progress.
318 9 World Wide Web Consortium, "Voice Extensible Markup Language
319 (VoiceXML) Version 2.0", W3C Working Draft,
320 ,
321 October 2001, work in progress.
323 10 Van Dyke, J. and Burger, E., "SIP URI Conventions for Media
324 Servers", draft-burger-sipping-msuri-01, July 2001, work in
325 progress (expired).
327 11 World Wide Web Consortium, "Speech Synthesis Markup Language
328 Specification for the Speech Interface Framework", W3C Working
329 Draft, , January 2001,
330 work in progress.
332 12 World Wide Web Consortium, "Speech Recognition Grammar
333 Specification for the W3C Speech Interface Framework", W3C
334 Working Draft, , August
335 2001, work in progress.
337 13 O'Connor, W., Burger, E., "Network Announcements with SIP",
338 draft-ietf-sipping-netann-01.txt, November 2001, work in progress.
340 12. Acknowledgments
342 Brian Eberman came up with the new name. It is catchy and describes
343 what we are working on.
345 OPEN ISSUE: Choose a name!
347 13. Authors' Addresses
349 Eric W. Burger
350 SnowShore Networks, Inc.
351 Chelmsford, MA
352 USA
353 Email: eburger@snowshore.com
355 David R. Oran
356 Cisco Systems, Inc.
357 Acton, MA
358 USA
359 Email: oran@cisco.com
364 Full Copyright Statement
366 Copyright (C) The Internet Society (2002). All Rights Reserved.
368 This document and translations of it may be copied and furnished to
369 others, and derivative works that comment on or otherwise explain it
370 or assist in its implementation may be prepared, copied, published
371 and distributed, in whole or in part, without restriction of any
372 kind, provided that the above copyright notice and this paragraph are
373 included on all such copies and derivative works. However, this
374 document itself may not be modified in any way, such as by removing
375 the copyright notice or references to the Internet Society or other
376 Internet organizations, except as needed for the purpose of
377 developing Internet standards in which case the procedures for
378 copyrights defined in the Internet Standards process must be
379 followed, or as required to translate it into languages other than
380 English.
382 The limited permissions granted above are perpetual and will not be
383 revoked by the Internet Society or its successors or assigns. This
384 document and the information contained herein is provided on an "AS
385 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
386 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
387 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
388 NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
389 OR FITNESS FOR A PARTICULAR PURPOSE.
391 Acknowledgement
393 The Internet Society currently provides funding for the RFC Editor
394 function.